Using AI in Data Preparation
Can generative models improve dataset discoverability in open data portals?
In a word, yes. Generative AI can address one of the biggest problems in data discoverability: insufficient metadata. Open data portal search interfaces typically rely on keyword matching and a narrow set of metadata fields, which are often incomplete or inconsistent. The problem is worse when users are unfamiliar with domain-specific terminology (e.g. housing, homes, housing stock, houses...) or when they want to address a broad issue, such as Net Zero, that few datasets cover explicitly but many can contribute to.
Most of these issues of incompleteness, inconsistency, synonyms and semantics can be addressed by using LLMs trained on snippets of the datasets themselves (as few as 10 lines) to generate keywords, tags, themes and descriptions.
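As a rough illustration of the idea, the sketch below takes the first 10 lines of a dataset, asks a model for keywords, themes and a description, and normalises the result so that variants such as "Housing" and "housing" collapse into one tag. It passes the snippet in a prompt rather than training a model, which is an assumption made for brevity. The function names, the prompt wording and the JSON reply format are all hypothetical, and `generate` stands in for whichever LLM client you use; none of this is taken from a specific portal or library.

```python
import json
from typing import Callable, Dict, List


def take_snippet(text: str, max_lines: int = 10) -> str:
    """Return the first max_lines lines of a dataset's raw text."""
    return "\n".join(text.splitlines()[:max_lines])


def build_prompt(title: str, snippet: str) -> str:
    """Build a prompt asking for metadata as JSON (hypothetical wording)."""
    return (
        "You are cataloguing a dataset for an open data portal.\n"
        f"Title: {title}\n"
        "First lines of the file:\n"
        f"{snippet}\n"
        "Reply with a JSON object with the keys "
        '"keywords" (list of strings), "themes" (list of strings) '
        'and "description" (one sentence).'
    )


def normalise(values: List[str]) -> List[str]:
    """Lower-case, trim, de-duplicate and sort a list of tags."""
    return sorted({v.strip().lower() for v in values if v.strip()})


def enrich_metadata(
    title: str, snippet: str, generate: Callable[[str], str]
) -> Dict[str, object]:
    """Ask the model for metadata and return a cleaned-up record.

    generate is an assumed callable: it receives the prompt text and
    returns the model's reply as a JSON string.
    """
    reply = json.loads(generate(build_prompt(title, snippet)))
    return {
        "keywords": normalise(reply.get("keywords", [])),
        "themes": normalise(reply.get("themes", [])),
        "description": str(reply.get("description", "")).strip(),
    }
```

To use it, read a file's text, call `take_snippet` on it, and pass the result to `enrich_metadata` along with a function that wraps your model of choice. Keeping the model call behind a plain callable makes the sketch easy to test with a stub and avoids tying it to any one provider.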
How do we know (and do) this? Read 'Keywords are not always the key'