Lecture

Get Data to Clean Itself: Data Quality Audits with SelfClean

  • ICM Saal 4b
  • Type: Lecture

Lecture description

S. Lionetti, Rotkreuz CH, F. Gröger, Basel CH

Data quality is crucial to obtain valid empirical results in many fields of science. Laboratory experiments are no exception, as noisy measurements or workflow imperfections can compromise downstream analysis. With datasets growing in size and complexity, traditional quality control approaches based on manual inspection or handcrafted rules become increasingly impractical. Scalable methods for systematic data quality audits are therefore of central interest for quality assurance in experimental workflows.
SelfClean is one of the few general frameworks that have recently emerged for detecting data quality issues [1]. It reformulates data cleaning as a ranking problem, enabling efficient human review or automated decisions based on score distributions. By combining context-aware self-supervised representation learning with distance-based indicators, SelfClean detects three types of issues: off-topic samples, near duplicates, and label errors.
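The ranking reformulation can be sketched with plain nearest-neighbour statistics on an embedding matrix. The functions below are an illustrative approximation only, not the SelfClean library's actual API: the function names, the choice of Euclidean distance, and the neighbourhood size k are assumptions made for the sketch.

```python
import numpy as np

def pairwise_dist(X):
    # Euclidean distance matrix between all pairs of embeddings
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))

def offtopic_scores(X, k=3):
    # Samples far from their k nearest neighbours rank as off-topic
    D = pairwise_dist(X)
    np.fill_diagonal(D, np.inf)
    knn = np.sort(D, axis=1)[:, :k]
    return knn.mean(axis=1)  # high score = likely off-topic

def duplicate_ranking(X):
    # Pairs with the smallest embedding distance rank as near duplicates
    D = pairwise_dist(X)
    iu = np.triu_indices(len(X), k=1)
    order = np.argsort(D[iu])
    return list(zip(iu[0][order], iu[1][order]))  # first pairs first

def label_error_scores(X, y, k=3):
    # Samples whose neighbours mostly carry a different label rank as label errors
    D = pairwise_dist(X)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]
    return (y[nn] != y[:, None]).mean(axis=1)  # high score = likely label error
```

Rather than hard-classifying each sample, these scores induce a ranking: a reviewer inspects the top of each list, or a threshold on the score distribution automates the decision, which is what makes the approach scale.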

Originally applied to images, where it identifies substantial fractions of problematic samples in vision benchmarks, SelfClean transfers to other data modalities. In the audio domain, leveraging domain-specific models improves results while preserving the overall detection mechanism; evaluations on standard audio benchmarks and an industrial dataset show competitive performance with significant reductions in human inspection effort [2]. Extensive verification on the CleanPatrick benchmark confirms SelfClean's effectiveness for large-scale near-duplicate detection, while highlighting remaining challenges in fine-grained label error identification. Preliminary experiments on text data suggest that the same principles extend to language.
In chemistry and the laboratory sciences, SelfClean could streamline the validation of sensor data such as spectra, chromatograms, and mass spectrometry measurements. This prototype of human-AI collaboration shows that many limitations of manual review and fixed rules in detecting experimental artifacts can be overcome, and indicates that a new level of quality assurance is within reach.

Literature:
[1] Gröger, Fabian, et al. “Intrinsic self-supervision for data quality audits.” Advances in Neural Information Processing Systems 37 (2024): 92273-92316.
[2] Gonzalez-Jimenez, Alvaro, et al. “Representation-Based Data Quality Audits for Audio.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026.

© Messe München GmbH