Data Quality Issues for Machine Learning

With recent advances in AI, organisations with larger datasets and more capacity are increasingly likely to prepare datasets with an eye to their use in machine learning.

Such datasets, and the processing applied to them, require diligent monitoring for issues that might increase bias in products or services, or accidentally reveal personal information through triangulation or similar effects. This calls for additional areas of data quality review. The table below lists these key areas and concerns.

Challenges, by data quality category

Legal and ethical

Intrinsic: Some intrinsic aspects of datasets, particularly in personal or sociocultural data, now require greater pre-processing to identify and then anonymise or remove sensitive and/or protected characteristics (e.g., gender, race, age).

Contextual: The relevance of sociocultural data to specific use cases requires an assessment of the presence and distribution of legally protected characteristics.

Representational: Documentation of the dataset and its development process can help to anticipate and prevent ethical or legal risks.

Accessibility: Complying with ethical and legal requirements calls for controlled access mechanisms that preserve the security of personal and proprietary data (e.g., data trusts).

Bias

Small, contextually relevant datasets can lead to better and fairer performance than large datasets.

Documenting the environment in which data were collected helps practitioners to assess contextual relevance and to mitigate bias.

Software

Intrinsic: Data collection and management software can be used to improve the intrinsic quality of data (e.g., through runtime verification and alerts).

Contextual: Runtime verification tools can be used to detect contextual drift.

Representational: Visualisations and dashboards can make it easier to inspect the quality of a dataset. Documentation facilitates the handover of information across different stages of ML development, which is especially useful when datasets and ML models are developed by multiple teams.

Accessibility: Software built on top of ML models needs to be tested to ensure that model training and serving data are protected against adversarial attacks.

Source: A Survey of Data Quality Requirements That Matter in Machine Learning Development Pipelines
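
As an illustration only, two of the checks in the table can be sketched in a few lines of Python: assessing the distribution of a protected characteristic, and raising an alert when that distribution drifts between the data a model was trained on and the data it later receives. The sketch is not taken from the source survey. The function names, the age-band values, the choice of total variation distance as the drift measure, and the 0.1 threshold are all assumptions made for the example, and a real pipeline would need thresholds agreed with domain and legal experts.

```python
from collections import Counter


def category_shares(values):
    """Return the share of each category in `values` as a dict."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(reference, current):
    """Half the sum of absolute differences between two share dicts."""
    categories = set(reference) | set(current)
    return 0.5 * sum(
        abs(reference.get(c, 0.0) - current.get(c, 0.0)) for c in categories
    )


def drift_alert(reference_values, current_values, threshold=0.1):
    """Compare the category mix of two batches.

    Returns (distance, alert), where alert is True when the distance
    exceeds the (assumed, illustrative) threshold.
    """
    distance = total_variation_distance(
        category_shares(reference_values), category_shares(current_values)
    )
    return distance, distance > threshold


# Hypothetical data: an age-band column at training time and in a later batch.
training_batch = ["18-34"] * 50 + ["35-64"] * 40 + ["65+"] * 10
serving_batch = ["18-34"] * 80 + ["35-64"] * 15 + ["65+"] * 5

# Presence and distribution of the characteristic in the training data.
print(category_shares(training_batch))

# Simple runtime check: has the mix shifted beyond the threshold?
distance, alert = drift_alert(training_batch, serving_batch)
print(round(distance, 2), alert)
```

Comparing shares rather than raw counts keeps the check independent of batch size. A check like this only flags that the mix has changed; deciding whether the change matters for a given use case remains a contextual judgement.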