492 MB
6 files
Updated 26 days ago

Dataset Card for Waterbirds

The Waterbirds dataset is constructed by cropping out birds from photos in the Caltech-UCSD Birds-200-2011 (CUB) dataset (Wah et al., 2011) and transferring them onto backgrounds from the Places dataset (Zhou et al., 2017). The original instructions and data from which this dataset and its card have been created can be found here.

Dataset Details

Dataset Description

The dataset uses the official train-test split of the CUB dataset, with 20% of the training data chosen at random to serve as a validation set.
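As a rough illustration of holding out 20% of the training data at random, the sketch below shuffles a list of example indices and slices off a validation portion. The function name `split_train_val`, the seed, and the use of Python's `random` module are assumptions made for this example; the upload script may do this differently.

```python
import random

def split_train_val(indices, val_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the training indices for validation.

    Hypothetical helper: the seed and rounding rule are illustrative choices.
    """
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    # The first n_val shuffled indices become validation, the rest stay in train.
    return shuffled[n_val:], shuffled[:n_val]

train_idx, val_idx = split_train_val(range(100))
print(len(train_idx), len(val_idx))
```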

Train:

  • 95% of all waterbirds against a water background and the remaining 5% against a land background.
  • 95% of all landbirds against a land background with the remaining 5% against water.

Dev/Test:

  • 50% of all waterbirds against a water background and the remaining 50% against a land background.
  • 50% of all landbirds against a land background with the remaining 50% against water.

NOTE: By construction, this creates a distribution shift between train and dev/test.
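The proportions above can be sketched as a background-assignment step. This is a minimal illustration, not the original generation code: it assumes labels are encoded as 0 for landbird and 1 for waterbird, backgrounds as 0 for land and 1 for water, and it samples each background independently, so the matching fraction is only approximately 95% or 50%. The function name `assign_backgrounds` is hypothetical.

```python
import random

def assign_backgrounds(labels, split, seed=0):
    """Pick a background (0 = land, 1 = water) for each bird label.

    Assumed encoding: label 0 = landbird, label 1 = waterbird.
    In "train" the background matches the bird type with probability 0.95;
    in any other split (dev/test) it matches with probability 0.5.
    """
    p_match = 0.95 if split == "train" else 0.5
    rng = random.Random(seed)
    return [y if rng.random() < p_match else 1 - y for y in labels]

labels = [0, 1] * 5000
for split in ("train", "test"):
    backgrounds = assign_backgrounds(labels, split)
    matched = sum(y == b for y, b in zip(labels, backgrounds)) / len(labels)
    print(split, round(matched, 2))
```

Because the bird type predicts the background far better in train than in dev/test, a model that relies on the background will look accurate on training data and fail on the minority groups at evaluation time.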

In a typical application, the validation set might be constructed by randomly dividing up the available training data. We emphasize that this is not the case here: the training set is skewed, whereas the validation set is more balanced. We followed this construction so that we could better compare ERM vs. reweighting vs. group DRO techniques using a stable set of hyperparameters. In practice, if the validation set were also skewed, we might expect hyperparameter tuning based on worst-group accuracy to be more challenging and noisy. Due to the above procedure, when reporting average test accuracy in our experiments, we calculate the average test accuracy over each group and then report a weighted average, with weights corresponding to the relative proportion of each group in the (skewed) training dataset.
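The weighted average described above can be sketched as follows: compute accuracy separately for each (label, background) group on the test set, then combine the per-group accuracies using weights taken from the training group sizes. The function names and the numbers in the usage example are made up for illustration and are not results from the paper or counts from the dataset.

```python
from collections import Counter

def group_accuracies(labels, backgrounds, preds):
    """Accuracy per (label, background) group."""
    correct, total = Counter(), Counter()
    for y, b, p in zip(labels, backgrounds, preds):
        total[(y, b)] += 1
        correct[(y, b)] += int(p == y)
    return {g: correct[g] / total[g] for g in total}

def weighted_average_accuracy(test_group_acc, train_group_counts):
    """Average per-group test accuracy, weighted by training group proportions."""
    n_train = sum(train_group_counts.values())
    return sum(
        (count / n_train) * test_group_acc[group]
        for group, count in train_group_counts.items()
    )

# Hypothetical training group counts and per-group test accuracies.
train_counts = {(0, 0): 70, (0, 1): 5, (1, 0): 5, (1, 1): 20}
test_acc = {(0, 0): 0.9, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.8}

print("weighted average:", round(weighted_average_accuracy(test_acc, train_counts), 3))
print("worst group:", min(test_acc.values()))
```

With these illustrative numbers the weighted average is dominated by the two majority groups, which is why worst-group accuracy is typically reported alongside it.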

  • Curated by: Shiori Sagawa, Pang Wei Koh, Tatsunori Hashimoto, and Percy Liang
  • License: MIT

Dataset Sources

The dataset is formed from the Caltech-UCSD Birds-200-2011 (CUB) dataset and the Places dataset.

Dataset Creation, Bias, Risks, and Limitations

The dataset was uploaded to Hugging Face by Augustin Godinot. For reproducibility, the upload script scripts/waterbirds.py is provided.

For more information on the creation of the dataset, please refer to the original authors and paper.

Citation

BibTeX:

@inproceedings{
  Sagawa2020Distributionally,
  title={Distributionally Robust Neural Networks},
  author={Shiori Sagawa and Pang Wei Koh and Tatsunori B. Hashimoto and Percy Liang},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=ryxGuJrFvS}
}