Handling Data Leakage in Automotive Datasets for Object Detection
Natural Sciences & IT
Abu Ahammed Babu presents his licentiate thesis "Handling Data Leakage in Automotive Datasets for Object Detection".
Background: Object detection is a central component of automotive perception systems that supports safe operation of autonomous driving technologies. The performance of such models is typically evaluated using large-scale image datasets, where choices in dataset construction and splitting strategies strongly influence the ability to detect objects correctly and accurately. Specifically, image similarity between training and test sets can unintentionally create data leakage, a situation where information from the test set is indirectly accessible during training, leading to overly optimistic performance estimates and threatening the reliability of evaluation results. This can have serious consequences in safety-critical domains such as autonomous driving, where a robust and trustworthy object detection model is a prerequisite for deployment.
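To make the leakage mechanism concrete, the following is a minimal, self-contained sketch (not the method proposed in the thesis) of how near-duplicate images across a train-test split can be flagged. It uses a simple average hash on toy 4x4 grayscale grids; real pipelines would apply perceptual hashes or learned embeddings to full-resolution frames, and the threshold and image data below are illustrative assumptions.

```python
# Sketch: flag potential leakage by checking whether any test image is
# nearly identical to a training image, via a simple average hash.

def average_hash(image):
    """Binarise each pixel against the image mean, yielding a bit tuple."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p >= mean else 0 for p in pixels)

def hamming(a, b):
    """Number of bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

def find_leakage(train, test, max_distance=1):
    """Return (test_idx, train_idx) pairs whose hashes are near-identical."""
    train_hashes = [average_hash(img) for img in train]
    leaks = []
    for i, img in enumerate(test):
        h = average_hash(img)
        for j, th in enumerate(train_hashes):
            if hamming(h, th) <= max_distance:
                leaks.append((i, j))
    return leaks

train_images = [
    [[10, 200, 10, 200]] * 4,                          # vertical stripes
    [[0, 0, 0, 0], [255] * 4, [0, 0, 0, 0], [255] * 4] # horizontal stripes
]
# The first test image is a slightly brightened copy of train image 0
# (leakage); the second is a distinct pattern.
test_images = [
    [[12, 202, 12, 202]] * 4,
    [[255, 0, 0, 255], [0, 255, 255, 0], [0, 255, 255, 0], [255, 0, 0, 255]],
]

print(find_leakage(train_images, test_images))  # → [(0, 0)]
```

The brightened copy hashes identically to its training counterpart, so the pair is reported even though the raw pixel values differ, which is exactly why pixel-level deduplication alone is insufficient.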
Objective: The overall aim of this thesis is to investigate the problem of data leakage in automotive object detection research. Specifically, it seeks to understand how different dataset splitting strategies affect the evaluated performance of models, and to establish methods for detecting data leakage in existing train-test splits of frequently used automotive datasets, enabling a more reliable assessment of object detection models.
Method: The research follows an empirical approach and is structured around four studies. Papers A and B adopt a quantitative, experimental design to investigate various data splitting strategies for automotive image datasets and their impact on model performance. Papers C and D follow a design science methodology, introducing the D-LeDe method for detecting data leakage in any existing data split and evaluating its effectiveness on multiple automotive datasets and object detection model architectures.
Findings: The results in Paper A demonstrate that image similarity can have a measurable impact on the reported performance of object detection models. Results in Paper B show that splitting data based on semantic similarity can significantly enhance overall performance. Paper C presents the proposed D-LeDe method for detecting data leakage in any existing data split. Paper D evaluates how effectively the D-LeDe method performs across different datasets and multiple models in identifying which splits contain data leakage.
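The similarity-aware splitting idea can be sketched as follows. This is a hypothetical illustration, not the thesis's actual procedure: images are bucketed by a coarse fingerprint (here, quantised pixel values, a stand-in for richer similarity measures), and each bucket is assigned wholesale to train or test so that look-alike frames never straddle the split.

```python
# Hypothetical sketch: group near-duplicate images, then split by group
# so that similar images cannot end up on both sides of the split.
import random
from collections import defaultdict

def fingerprint(image, step=16):
    """Quantise pixels so small brightness changes map to the same bucket."""
    return tuple(p // step for row in image for p in row)

def similarity_aware_split(images, test_fraction=0.25, seed=0):
    """Assign whole buckets of look-alike images to train or test."""
    buckets = defaultdict(list)
    for idx, img in enumerate(images):
        buckets[fingerprint(img)].append(idx)
    groups = list(buckets.values())
    random.Random(seed).shuffle(groups)
    train, test = [], []
    target = test_fraction * len(images)
    for g in groups:
        (test if len(test) < target else train).extend(g)
    return sorted(train), sorted(test)

images = [
    [[10, 200], [10, 200]],    # 0
    [[12, 201], [11, 202]],    # 1: near-duplicate of image 0
    [[0, 0], [255, 255]],      # 2
    [[128, 128], [128, 128]],  # 3
]
train, test = similarity_aware_split(images)
# Images 0 and 1 share a fingerprint, so they land on the same side.
same_side = ({0, 1} <= set(train)) or ({0, 1} <= set(test))
print(train, test, same_side)
```

Compared with a purely random split, this grouping guarantees that the leaking pair (0, 1) contributes either to training or to evaluation, never to both.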
Conclusions: This thesis demonstrates that data leakage caused by image similarity across dataset partitions is a tangible and non-trivial problem in automotive object detection. Even modest overlap between training and test sets can inflate model performance, leading to overly optimistic conclusions about generalisation. In larger or highly redundant datasets, the effect can be even stronger. The findings confirm that data leakage is a systematic risk that can occur in many renowned datasets and can compromise the reliability of models benchmarked using the default splits of such datasets. Recognising this problem and practising caution to avoid data leakage is therefore essential to ensure the reliability of automotive perception models.
Keywords: Automotive Perception, Object Detection, Data Leakage, Image Similarity, Dataset Splitting
Discussion leader: Yanja Dajsuren, Assistant Professor, Department of Mathematics and Computer Science, Eindhoven University of Technology, the Netherlands
Examiner: Professor Christian Berger
Link to the full-text version of the licentiate thesis