Data Quality Issues for Machine Learning
With recent advances in AI, organisations with larger datasets and more capacity are increasingly likely to prepare datasets with an eye to their use in machine learning.
Such datasets and their processing require diligent monitoring for issues that might increase bias in products or services, or accidentally reveal personal information through triangulation or similar effects. This calls for additional areas of data quality review. The table below lists these key areas and concerns.
Data Quality Categories: Intrinsic, Contextual, Representational, Accessibility

Challenge: Legal and ethical
Intrinsic: Some intrinsic aspects of datasets, particularly in personal or sociocultural data, now require greater pre-processing to identify and anonymise or remove sensitive and/or protected characteristics (e.g., gender, race, age).
Contextual: The relevance of sociocultural data to specific use cases requires an assessment of the presence and distribution of legally protected characteristics.
Representational: Documentation of the dataset and its development process can help to anticipate and prevent ethical or legal risks.
Accessibility: Compliance with ethical and legal requirements requires controlled access mechanisms that preserve the security of personal and proprietary data (e.g., data trusts).

Challenge: Bias
Small contextually relevant datasets can lead to better and fairer performance than large data.
Documenting the environment in which data were collected helps practitioners to assess contextual relevance and to mitigate bias.

Challenge: Software
Intrinsic: Data collection and management software can be used to improve the intrinsic quality of data (e.g., through runtime verification and alerts).
Contextual: Runtime verification tools can be used to detect contextual drift.
Representational: Visualisations and dashboards can make it easier to inspect the quality of a dataset. Documentation facilitates the handover of information across different stages of ML development. This is especially useful in scenarios where datasets and ML are developed by multiple teams.
Accessibility: Software built on top of ML models needs to be tested to ensure that model training and serving data are protected against adversarial attacks.
Source: A Survey of Data Quality Requirements That Matter in Machine Learning Development Pipelines
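Some of the checks in the table can be sketched in a few lines of code. The example below is a minimal illustration, not a method taken from the survey. It assesses the distribution of a protected characteristic, removes protected columns before training, and raises a simple runtime alert when incoming data drift away from a reference sample. The column names, the sample records, the function names and the threshold of three standard errors are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical column names; a real list would come from legal and ethical review.
PROTECTED = {"gender", "race", "age"}


def protected_distribution(rows, column):
    """Return the share of each value of `column` across `rows`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def drop_protected(rows, protected=PROTECTED):
    """Return copies of `rows` without the protected columns."""
    return [{k: v for k, v in row.items() if k not in protected} for row in rows]


def drift_alert(reference, live, threshold=3.0):
    """Flag drift when the live mean sits more than `threshold`
    standard errors away from the reference mean (an assumed, simple rule)."""
    standard_error = stdev(reference) / len(live) ** 0.5
    return abs(mean(live) - mean(reference)) > threshold * standard_error


# Made-up records, used only to exercise the functions.
records = [
    {"gender": "f", "age": 34, "income": 41000},
    {"gender": "m", "age": 51, "income": 39000},
    {"gender": "f", "age": 29, "income": 45000},
    {"gender": "f", "age": 42, "income": 43000},
]

# Assess the distribution first, then remove the protected columns.
shares = protected_distribution(records, "gender")
training_rows = drop_protected(records)

# Compare two incoming batches against the reference sample.
reference_income = [row["income"] for row in records]
drifted = drift_alert(reference_income, [60000, 62000, 61000, 63000])
stable = drift_alert(reference_income, [40000, 42000, 44000, 42000])
```

Removing columns does not anonymise a dataset on its own, because the remaining fields can still act as proxies or allow triangulation. A check like this would sit alongside the documentation and access controls listed in the table.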