
Dataset Summarisation

Data summarisation:

  • helps people assess the relevance, usability and quality of a dataset for their own needs

  • improves data discovery, as search algorithms can match the text against keyword queries

Data search often starts with a keyword query on a data portal. Users are presented with a compact representation of the results, which might include metadata (title, publisher, publication date, format, etc.), a short snippet of text, and, in some cases, a data preview or a visualisation. Metadata is often limited and might not provide enough content to decide whether a dataset is useful for a task. From a user's perspective, a textual summary of the data is therefore paramount: text is usually richer in context than metadata and, depending on the context and the quality of the representation, can be easier to digest than raw data or graphs.

This summary could contain four main types of information: (i) basic metadata such as format and descriptive statistics; (ii) dataset content, including major topic categories, as well as geospatial and temporal aspects; (iii) quality statements, including uncertainty; and (iv) analyses and usage ideas, such as trends observed in the data.
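As a rough illustration of type (i), the sketch below derives basic metadata and descriptive statistics from a small CSV file using only the Python standard library. The function name `summarise_csv`, the sample data and the shape of the returned dictionary are all invented for this example; types (ii) to (iv) generally need human judgement or domain knowledge and are not attempted here.

```python
import csv
import io
import statistics


def summarise_csv(text):
    """Return basic metadata and descriptive statistics for CSV text.

    Hypothetical helper: covers only information type (i).
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    summary = {"format": "CSV", "rows": len(rows), "columns": columns, "numeric": {}}
    for col in columns:
        try:
            values = [float(row[col]) for row in rows]
        except (TypeError, ValueError):
            continue  # column has missing or non-numeric values: no statistics
        summary["numeric"][col] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
        }
    return summary


# Made-up sample data, for illustration only.
SAMPLE = """station,date,temp_c
A,2021-01-01,4.5
A,2021-01-02,6.0
B,2021-01-01,3.0
"""

summary = summarise_csv(SAMPLE)
print(summary)
```

The resulting dictionary could be turned into a sentence or two of the textual summary, for example by stating the number of rows and the range of each numeric column.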

AI Update: Early experiments suggest that conversational generative AI tools such as ChatGPT and Le Chat can be very useful for creating dataset summaries and descriptions.

Try telling the chatbot that you want it to summarise the dataset, paying particular attention to the metadata, content, quality and usage ideas. You could also explain that you are thinking about publishing the data and therefore want it to be as useful as possible to potential users.
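One possible way to make such a request repeatable is to assemble the prompt text in code and paste it into the chatbot. The sketch below only builds a string and does not call any chatbot API; the function name `build_summary_prompt`, its parameters and the exact wording of the prompt are assumptions made for this example.

```python
def build_summary_prompt(dataset_title, sample_rows, intend_to_publish=True):
    """Build a prompt asking a chatbot to summarise a dataset.

    Hypothetical helper: the wording is illustrative, not a tested recipe.
    """
    focus = ["metadata", "content", "quality", "usage ideas"]
    lines = [
        f"Please summarise the dataset '{dataset_title}'.",
        "Pay particular attention to: " + ", ".join(focus) + ".",
    ]
    if intend_to_publish:
        lines.append(
            "I am thinking about publishing this data, so the summary "
            "should be as useful as possible to potential users."
        )
    lines.append("Here is a sample of the data:")
    lines.append(sample_rows.strip())
    return "\n".join(lines)


# Made-up title and sample rows, for illustration only.
prompt = build_summary_prompt(
    "Daily temperatures",
    "station,date,temp_c\nA,2021-01-01,4.5\n",
)
print(prompt)
```

Any summary the chatbot returns should still be checked against the data before it is published.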
