Ameliorate Spurious Correlations in Dataset Condensation
Authors: Justin Cui, Ruochen Wang, Yuanhao Xiong, Cho-Jui Hsieh
Year: 2024
Source:
https://arxiv.org/abs/2406.06609
TLDR:
This paper investigates how biases in the original data affect dataset condensation, a technique that compresses large datasets into smaller synthetic counterparts for efficient model training. The authors find that color and background biases are amplified during condensation, degrading the performance of models trained on the condensed data, while corruption bias is suppressed. To mitigate this, they introduce a sample reweighting scheme based on kernel density estimation that effectively reduces bias in the condensed datasets. Their empirical results show significant performance gains over standard condensation methods, particularly on datasets with color and background biases, underscoring the importance of addressing bias in dataset condensation for fair and accurate machine learning models.
This paper addresses the challenge of bias amplification in dataset condensation by proposing a reweighting scheme using kernel density estimation, which significantly improves the performance of machine learning models trained on condensed datasets with color and background biases.
Abstract
Dataset Condensation has emerged as a technique for compressing large datasets into smaller synthetic counterparts, facilitating downstream training tasks. In this paper, we study the impact of bias inside the original dataset on the performance of dataset condensation. With a comprehensive empirical evaluation on canonical datasets with color, corruption and background biases, we found that color and background biases in the original dataset will be amplified through the condensation process, resulting in a notable decline in the performance of models trained on the condensed dataset, while corruption bias is suppressed through the condensation process. To reduce bias amplification in dataset condensation, we introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation. Empirical results on multiple real-world and synthetic datasets demonstrate the effectiveness of the proposed method. Notably, on CMNIST with 5% bias-conflict ratio and IPC 50, our method achieves 91.5% test accuracy compared to 23.8% from vanilla DM, boosting the performance by 67.7%, whereas applying state-of-the-art debiasing method on the same dataset only achieves 53.7% accuracy. Our findings highlight the importance of addressing biases in dataset condensation and provide a promising avenue to address bias amplification in the process.
Method
The authors conduct a comprehensive empirical evaluation on canonical datasets with color, corruption, and background biases to study how bias in the original data affects dataset condensation. To reduce bias amplification during condensation, they introduce a sample reweighting scheme based on kernel density estimation, which assigns larger weights to samples lying in low-density regions of the data distribution (typically the under-represented, bias-conflicting samples). The approach is then validated empirically on multiple real-world and synthetic datasets.
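The core idea of density-based reweighting can be sketched as follows. This is an illustrative simplification, not the authors' exact implementation: per class, a Gaussian kernel density estimate is fit over sample features, and each sample is weighted by its inverse estimated density, so that rare (bias-conflicting) samples receive larger weights during condensation. The function name `kde_sample_weights` and the normalization choice are assumptions for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_sample_weights(features, labels, eps=1e-8):
    """Illustrative KDE-based reweighting (not the paper's exact code).

    Per class, estimate the density of each sample's features and
    weight samples inversely to that density, so samples in sparse
    regions (often bias-conflicting ones) are upweighted.
    """
    features = np.asarray(features, dtype=float)  # shape (n, d)
    labels = np.asarray(labels)
    weights = np.empty(len(features))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        # gaussian_kde expects shape (d, n_class_samples)
        kde = gaussian_kde(features[idx].T)
        density = kde(features[idx].T)            # estimated p(x) per sample
        w = 1.0 / (density + eps)                 # inverse-density weights
        weights[idx] = w / w.sum() * len(idx)     # normalize to mean 1 per class
    return weights
```

For example, with a tight cluster of points plus one outlier in the same class, the outlier's weight comes out larger than any cluster point's, which is the behavior the reweighting scheme relies on.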
Main Finding
The authors discovered that biases present in the original dataset, particularly color and background biases, are amplified through the dataset condensation process, leading to a notable decline in the performance of models trained on the condensed dataset. Conversely, they found that corruption bias is suppressed during condensation. To counteract this bias amplification, they developed a simple yet effective approach based on a sample reweighting scheme that utilizes kernel density estimation. Their empirical results showed that this method significantly improves the performance of dataset condensation, particularly on datasets with color and background biases.
Conclusion
The conclusion of the paper is that biases in the original dataset, especially color and background biases, are amplified during dataset condensation, which negatively affects the performance of machine learning models trained on the condensed dataset. The authors propose a reweighting scheme using kernel density estimation to mitigate this bias amplification, and their empirical results demonstrate that this method can significantly improve the performance of dataset condensation, making it a promising approach for addressing biases in the condensation process.
Keywords
Dataset Condensation, Bias Amplification, Kernel Density Estimation, Sample Reweighting, Machine Learning, Synthetic Datasets, Empirical Evaluation, Performance Improvement, Bias Mitigation, Color Bias, Background Bias, Corruption Bias, Real-world Datasets, Model Training, Data Storage, Neural Architecture Search, Federated Learning, Continual Learning, Graph Compression, Multimodality, Bias Detection, Debiasing Algorithm