What Features of Datasets Make Reuse More Likely?
Throughout this toolkit we'll be highlighting the practices that facilitate reuse. Empirical research has shown that these include:
Publishing data to a well-known repository. Larger repositories are associated with a higher likelihood of reuse, so if possible, publish data to the largest relevant repository.
What's a repository? Learn about some common ones. Go to: How do I decide where to publish my data?
Providing an informative short textual summary of the dataset. Summarising data as text helps people make sense of it. It also improves data discovery, as search algorithms can match this text against keyword queries.
Best practices in data summarisation
Using AI for data descriptions
Providing a comprehensive README file with plenty of detail in a structured form, and links to further information. This should include information on the creation and use of the dataset. One way to do this is with datasheets.
What's a README file? Find out how to make one here
What's a datasheet? Find out what should be included here.
Keeping datasets to a manageable size. The biggest dataset is not always the most useful. Many users will be working on standard laptops and desktop computers, so datasets should remain small enough to download and load into memory on such machines.
Making machine learning datasets easy to access. Similarly, it should be possible to open a dataset with the standard configuration of a common library (such as Pandas), without custom parsing options.
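As a minimal sketch of the last point: a dataset is "easy to access" if it loads with a library's defaults, with no custom separator, encoding, or header arguments. The tiny CSV below is illustrative only, not a real dataset from this toolkit.

```python
import io

import pandas as pd

# Illustrative CSV content standing in for a published dataset file.
csv_text = "species,count\nrobin,12\nwren,7\n"

# Opens with pandas defaults alone: no sep=, encoding=, or header= needed.
# If a dataset requires such options, reusers must first reverse-engineer them.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (2, 2)
```

If your published file only opens with a stack of non-default options, consider converting it to a plainer format before depositing it.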