Welcome to DD-Ranking (DD, i.e., Dataset Distillation), an integrated and easy-to-use evaluation benchmark for dataset distillation! It aims to provide a fair evaluation scheme for DD methods that can decouple the impacts from knowledge distillation and data augmentation to reflect the real informativeness of the distilled data.

Motivation

Dataset Distillation (DD) aims to condense a large dataset into a much smaller one, which allows a model to achieve comparable performance after training on it. DD has gained extensive attention since it was proposed. With some foundational methods such as DC, DM, and MTT, various works have further pushed this area to a new standard with their novel designs.

history

Notably, more and more methods are transitioning from "hard labels" to "soft labels" in dataset distillation, especially during evaluation. Hard labels are categorical and have the same format as the labels of the real dataset. Soft labels are the outputs of a pre-trained teacher model. Recently, Deng et al. pointed out that "a label is worth a thousand images". They showed analytically that soft labels are extremely useful for improving accuracy.

However, since the essence of soft labels is knowledge distillation, we find that when applying the same evaluation method to randomly selected data, the test accuracy also improves significantly (see the figure above).

This makes us wonder: Can the test accuracy of the model trained on distilled data reflect the real informativeness of the distilled data?

We summarize the evaluation configurations of existing works in the following table, with different colors highlighting different values for each configuration. The configurations are diverse, which makes it unfair to compare methods by test accuracy alone. Among these inconsistencies, two critical factors significantly undermine the fairness of current evaluation protocols: label representation (including the corresponding loss function) and data augmentation techniques.

Motivated by this, we propose DD-Ranking, a new benchmark for DD evaluation. DD-Ranking provides a fair evaluation scheme for DD methods that can decouple the impacts from knowledge distillation and data augmentation to reflect the real informativeness of the distilled data.

Features

Fair Evaluation: DD-Ranking provides a fair evaluation scheme for DD methods that can decouple the impacts from knowledge distillation and data augmentation to reflect the real informativeness of the distilled data.
Easy-to-use: DD-Ranking provides a unified interface for dataset distillation evaluation.
Extensible: DD-Ranking supports various datasets and models.
Customizable: DD-Ranking supports various data augmentations and soft label strategies.

DD-Ranking Benchmark

Revisit the original goal of dataset distillation:

The idea is to synthesize a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. (Wang et al., 2020)

Label-Robust Score (LRS)

For the label representation, we introduce the Label-Robust Score (LRS) to evaluate the informativeness of the synthesized data using the following two aspects:

The degree to which the real dataset is recovered under hard labels (hard label recovery): \( \text{HLR}=\text{Acc.}_{\text{real-hard}}-\text{Acc.}_{\text{syn-hard}} \). A smaller gap means better recovery.
The improvement over random selection when using personalized evaluation methods (improvement over random): \( \text{IOR}=\text{Acc.}_{\text{syn-any}}-\text{Acc.}_{\text{rdm-any}} \). \(\text{Acc.}\) is the accuracy of models trained on different samples. The subscripts denote the following:

\(\text{real-hard}\): Real dataset with hard labels;
\(\text{syn-hard}\): Synthetic dataset with hard labels;
\(\text{syn-any}\): Synthetic dataset with personalized evaluation methods (hard or soft labels);
\(\text{rdm-any}\): Randomly selected dataset (under the same compression ratio) with the same personalized evaluation methods.

LRS is defined as a weighted sum of \(\text{IOR}\) and \(-\text{HLR}\) to rank different methods: \[ \alpha = w\,\text{IOR}-(1-w)\,\text{HLR}, \quad w \in [0, 1] \] Then, the LRS is normalized to \([0, 1]\) as follows: \[ \text{LRS} = \frac{e^{\alpha}-e^{-1}}{e - e^{-1}} \times 100\% \]

By default, we set \(w = 0.5\) on the leaderboard, meaning that both \(\text{IOR}\) and \(\text{HLR}\) are equally important. Users can adjust the weights to emphasize one aspect on the leaderboard.
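The definitions above can be sketched in a few lines of Python. This is a minimal illustration and not the library's implementation: the function name label_robust_score is hypothetical, accuracies are assumed to be fractions in [0, 1] (so that \(\alpha \in [-1, 1]\)), and the numbers in the example call are made up.

```python
import math


def label_robust_score(acc_real_hard, acc_syn_hard, acc_syn_any, acc_rdm_any, w=0.5):
    """Return (HLR, IOR, LRS); accuracies are fractions in [0, 1] (assumption)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    hlr = acc_real_hard - acc_syn_hard   # hard label recovery: gap to the real dataset
    ior = acc_syn_any - acc_rdm_any      # improvement over random selection
    alpha = w * ior - (1.0 - w) * hlr    # weighted sum, lies in [-1, 1]
    # Exponential normalization to [0, 1]; multiply by 100 for a percentage.
    lrs = (math.exp(alpha) - math.exp(-1.0)) / (math.e - math.exp(-1.0))
    return hlr, ior, lrs


# Made-up accuracies for illustration only (not measured results).
hlr, ior, lrs = label_robust_score(0.56, 0.30, 0.45, 0.40)
print(f"HLR={hlr:.2f}  IOR={ior:.2f}  LRS={lrs * 100:.1f}%")
```

Raising w puts more weight on the improvement over random selection, while lowering it puts more weight on closing the gap to the real dataset under hard labels.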

Augmentation-Robust Score (ARS)

To disentangle data augmentation's impact, we introduce the augmentation-robust score (ARS), which continues to leverage the relative improvement over randomly selected data. Specifically, we first evaluate synthetic data and a randomly selected subset under the same setting to obtain \(\text{Acc.}_{\text{syn-aug}}\) and \(\text{Acc.}_{\text{rdm-aug}}\) (same as IOR). Next, we evaluate both synthetic data and random data again without the data augmentation, and the results are denoted as \(\text{Acc.}_{\text{syn-naug}}\) and \(\text{Acc.}_{\text{rdm-naug}}\). Both differences, \(\text{Acc.}_{\text{syn-aug}} - \text{Acc.}_{\text{rdm-aug}}\) and \(\text{Acc.}_{\text{syn-naug}} - \text{Acc.}_{\text{rdm-naug}}\), are positively correlated with the real informativeness of the distilled dataset.

ARS is a weighted sum of the two differences: \[ \beta = \gamma(\text{Acc.}_{\text{syn-aug}} - \text{Acc.}_{\text{rdm-aug}}) + (1 - \gamma)(\text{Acc.}_{\text{syn-naug}} - \text{Acc.}_{\text{rdm-naug}}) \] and is normalized similarly to LRS.
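As a rough sketch of the ARS computation, the snippet below combines the two differences and applies the same exponential normalization used for LRS. The function name augmentation_robust_score is hypothetical, the default gamma=0.5 is an assumption (the text does not state a default), accuracies are assumed to be fractions in [0, 1], and reading "normalized similarly" as the LRS formula is also an assumption.

```python
import math


def augmentation_robust_score(acc_syn_aug, acc_rdm_aug, acc_syn_naug, acc_rdm_naug, gamma=0.5):
    """Return (beta, ARS); accuracies are fractions in [0, 1] (assumption)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    diff_aug = acc_syn_aug - acc_rdm_aug      # gain over random with augmentation
    diff_naug = acc_syn_naug - acc_rdm_naug   # gain over random without augmentation
    beta = gamma * diff_aug + (1.0 - gamma) * diff_naug
    # Assumed to mirror the LRS normalization, mapping beta in [-1, 1] to [0, 1].
    ars = (math.exp(beta) - math.exp(-1.0)) / (math.e - math.exp(-1.0))
    return beta, ars


# Made-up accuracies for illustration only (not measured results).
beta, ars = augmentation_robust_score(0.50, 0.40, 0.30, 0.25)
print(f"beta={beta:.3f}  ARS={ars * 100:.1f}%")
```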

Contributing

Welcome! We are glad that you are willing to contribute to the field of dataset distillation.

New Baselines: If you would like to report new baselines, please submit them by creating a pull request. The exact format is below: name of the baseline, code link, [paper link and score run using this tool].
New Components: If you would like to integrate new components, such as new model architectures, new data augmentation methods, and new soft label strategies, please submit them by creating a pull request.
Issues: If you want to report a problem, you are encouraged to open an issue directly.
Appeal: If you want to appeal the score of your method, please submit an issue with your code and a detailed readme file describing how to reproduce your results. We tried our best to replicate all methods on the leaderboard based on their papers and open-source code. We apologize if we missed any details and would be grateful for your help in improving the leaderboard.

Installation

From pip

pip install ddranking

From source

python setup.py install

Quick Start

Below is a step-by-step guide to using DD-Ranking. This demo computes the label-robust score (LRS) with soft labels (source code can be found in demo_lrs_soft.py). The demo for LRS with hard labels is in demo_lrs_hard.py, and the demo for the augmentation-robust score (ARS) is in demo_ars.py. DD-Ranking supports multi-GPU distributed evaluation; you can simply use torchrun to launch the evaluation.

Step 1: Initialize a soft-label metric evaluator object. Config files are recommended for users to specify hyper-parameters. Sample config files are provided here.

from ddranking.metrics import LabelRobustScoreSoft
from ddranking.config import Config

>>> config = Config.from_file("./configs/Demo_LRS_Soft_Label.yaml")
>>> lrs_soft_metric = LabelRobustScoreSoft(config)

You can also pass keyword arguments.

device = "cuda"
method_name = "DATM"                    # Specify your method name
ipc = 10                                # Specify your IPC
dataset = "CIFAR100"                     # Specify your dataset name
syn_data_dir = "./data/CIFAR100/IPC10/"  # Specify your synthetic data path
real_data_dir = "./datasets"            # Specify your dataset path
model_name = "ConvNet-3"                # Specify your model name
teacher_dir = "./teacher_models"        # Specify your path to teacher model checkpoints
teacher_model_names = ["ConvNet-3"]      # Specify your teacher model names
im_size = (32, 32)                      # Specify your image size
dsa_params = {                          # Specify your data augmentation parameters
    "prob_flip": 0.5,
    "ratio_rotate": 15.0,
    "saturation": 2.0,
    "brightness": 1.0,
    "contrast": 0.5,
    "ratio_scale": 1.2,
    "ratio_crop_pad": 0.125,
    "ratio_cutout": 0.5
}
random_data_format = "tensor"              # Specify your random data format (tensor or image)
random_data_path = "./random_data"          # Specify your random data path
save_path = f"./results/{dataset}/{model_name}/IPC{ipc}/dm_hard_scores.csv"

""" We only list arguments that usually need specifying"""
lrs_soft_metric = LabelRobustScoreSoft(
    dataset=dataset,
    real_data_path=real_data_dir, 
    ipc=ipc,
    model_name=model_name,
    soft_label_criterion='sce',  # Use Soft Cross Entropy Loss
    soft_label_mode='S',         # Use one-to-one image to soft label mapping
    loss_fn_kwargs={'temperature': 1.0, 'scale_loss': False},
    data_aug_func='dsa',         # Use DSA data augmentation
    aug_params=dsa_params,       # Specify dsa parameters
    im_size=im_size,
    random_data_format=random_data_format,
    random_data_path=random_data_path,
    stu_use_torchvision=False,
    tea_use_torchvision=False,
    teacher_dir=teacher_dir,
    teacher_model_names=teacher_model_names,
    num_eval=5,
    device=device,
    dist=True,
    save_path=save_path
)

For a detailed explanation of the hyper-parameters, please refer to our documentation.

Step 2: Load your synthetic data, labels (if any), and learning rate (if any).

>>> import torch
>>> syn_images = torch.load('/your/path/to/syn/images.pt')
# You must specify your soft labels if your soft label mode is 'S'
>>> soft_labels = torch.load('/your/path/to/syn/labels.pt')
>>> syn_lr = torch.load('/your/path/to/syn/lr.pt')

Step 3: Compute the metric.

>>> lrs_soft_metric.compute_metrics(image_tensor=syn_images, soft_labels=soft_labels, syn_lr=syn_lr)
# alternatively, you can specify the image folder path to compute the metric
>>> lrs_soft_metric.compute_metrics(image_path='./your/path/to/syn/images', soft_labels=soft_labels, syn_lr=syn_lr)

The following results will be printed and saved to save_path:

HLR mean: The mean of hard label recovery over num_eval runs.
HLR std: The standard deviation of hard label recovery over num_eval runs.
IOR mean: The mean of improvement over random over num_eval runs.
IOR std: The standard deviation of improvement over random over num_eval runs.
LRS mean: The mean of Label-Robust Score over num_eval runs.
LRS std: The standard deviation of Label-Robust Score over num_eval runs.
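The mean and standard deviation reported for each metric can be reproduced from per-run values as sketched below. The helper summarize_runs is hypothetical, the per-run numbers are made up, and the use of the population standard deviation is an assumption; the library may use the sample standard deviation instead.

```python
import statistics


def summarize_runs(values):
    """Return (mean, std) of one metric across evaluation runs."""
    mean = statistics.fmean(values)
    # Population std is an assumption; swap in statistics.stdev for the sample std.
    std = statistics.pstdev(values)
    return mean, std


# Hypothetical per-run IOR values for num_eval = 5 (not real results).
ior_runs = [0.050, 0.046, 0.052, 0.049, 0.053]
ior_mean, ior_std = summarize_runs(ior_runs)
print(f"IOR mean: {ior_mean:.4f}, IOR std: {ior_std:.4f}")
```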

DD-Ranking Metrics

DD-Ranking provides a set of metrics to evaluate the real informativeness of datasets distilled by different methods. The unfairness of existing evaluation is mainly caused by two factors: the label representation and the data augmentation. We design the label-robust score (LRS) and the augmentation-robust score (ARS) to disentangle the impact of label representation and data augmentation on the evaluation, respectively.

Evaluation Classes

LabelRobustScoreHard computes HLR, IOR, and LRS for methods using hard labels.
LabelRobustScoreSoft computes HLR, IOR, and LRS for methods using soft labels.
AugmentationRobustScore computes the ARS for methods using soft labels.
GeneralEvaluator computes the traditional test accuracy for existing methods.

LabelRobustScoreHard

CLASS
ddranking.metrics.LabelRobustScoreHard(config=None,
dataset: str = 'CIFAR10',
real_data_path: str = './dataset/',
ipc: int = 10,
model_name: str = 'ConvNet-3',
data_aug_func: str = 'cutmix',
aug_params: dict = {'cutmix_p': 1.0},
optimizer: str = 'sgd',
lr_scheduler: str = 'step',
step_size: int = None,
weight_decay: float = 0.0005,
momentum: float = 0.9,
use_zca: bool = False,
num_eval: int = 5,
im_size: tuple = (32, 32),
num_epochs: int = 300,
real_batch_size: int = 256,
syn_batch_size: int = 256,
use_torchvision: bool = False,
eval_full_data: bool = False,
random_data_format: str = 'tensor',
random_data_path: str = './dataset/',
num_workers: int = 4,
save_path: Optional[str] = None,
custom_train_trans: Optional[Callable] = None,
custom_val_trans: Optional[Callable] = None,
device: str = "cuda",
dist: bool = False
)

A class for evaluating the performance of a dataset distillation method with hard labels. Users can modify the attributes as needed.
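A hard-label evaluation might be set up as in the sketch below, which mirrors the soft-label Quick Start. The keyword arguments come from the constructor signature above; the import path follows the Quick Start import, the save_path value and the helper name evaluate_hard_label are hypothetical, and the compute_metrics keywords are assumed to match those of the soft-label evaluator.

```python
# Keyword arguments taken from the documented constructor signature.
hard_kwargs = dict(
    dataset="CIFAR10",
    real_data_path="./dataset/",
    ipc=10,
    model_name="ConvNet-3",
    data_aug_func="cutmix",
    aug_params={"cutmix_p": 1.0},
    num_eval=5,
    im_size=(32, 32),
    random_data_format="tensor",
    random_data_path="./dataset/",
    save_path="./results/lrs_hard.csv",  # hypothetical output path
    device="cuda",
    dist=False,
)


def evaluate_hard_label(syn_images, syn_lr=None):
    """Build the hard-label evaluator and score one set of synthetic images."""
    # Imported lazily so the snippet loads even where ddranking is not installed.
    from ddranking.metrics import LabelRobustScoreHard

    metric = LabelRobustScoreHard(**hard_kwargs)
    # Assumption: same image_tensor / syn_lr keywords as the soft-label evaluator.
    return metric.compute_metrics(image_tensor=syn_images, syn_lr=syn_lr)
```

Keeping the arguments in one dictionary makes it easy to reuse the same setting across methods, which is what a fair comparison needs.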

