Buckets:

salman53
/

waterbirds-bucket

492 MB

6 files

Updated 26 days ago

Name	Size	Uploaded	Xet hash
data		26 days ago	3 items
scripts		26 days ago	1 items
.gitattributes	2.46 kB xet	26 days ago	19463de8
README.md	10.7 kB xet	26 days ago	60c85fa1

README.md

Dataset Card for Waterbirds

The Waterbirds dataset is constructed by cropping out birds from photos in the Caltech-UCSD Birds-200-2011 (CUB) dataset (Wah et al., 2011) and transferring them onto backgrounds from the Places dataset (Zhou et al., 2017). The original instructions and data from which this dataset and its card have been created can be found here.

Dataset Details

Dataset Description

The official train-test split of the CUB dataset is used, with 20% of the training data randomly chosen to serve as a validation set.
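The 20% hold-out described above could be sketched as follows. This is only an illustration: the function name `split_train_val`, the seed, and the use of integer example ids are assumptions, and the original authors' exact sampling procedure is not specified in this card.

```python
import random

def split_train_val(train_ids, val_fraction=0.2, seed=0):
    """Randomly hold out `val_fraction` of the training ids for validation.

    Hypothetical helper for illustration; not the authors' script.
    """
    ids = list(train_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_val = round(len(ids) * val_fraction)
    # The first n_val shuffled ids become validation, the rest stay in train.
    return sorted(ids[n_val:]), sorted(ids[:n_val])

# Example with 1000 placeholder ids (not the real dataset size).
train_ids, val_ids = split_train_val(range(1000))
print(len(train_ids), len(val_ids))
```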

Train:

95% of all waterbirds against a water background and the remaining 5% against a land background.
95% of all landbirds against a land background with the remaining 5% against water.

Dev/Test:

50% of all waterbirds against a water background and the remaining 50% against a land background.
50% of all landbirds against a land background with the remaining 50% against water.

NOTE: By construction, this creates a distribution shift between train and dev/test.
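To make the shift concrete, the sketch below enumerates the four (bird, background) groups and the fraction of each bird class placed on each background in train versus dev/test. The 0/1 label encoding and the group numbering are assumptions chosen for illustration; check the repository for the encoding actually used in the data files.

```python
from itertools import product

# Assumed encoding for illustration only: 0 = land, 1 = water.
BIRDS = {0: "landbird", 1: "waterbird"}
BACKGROUNDS = {0: "land", 1: "water"}

def group_index(bird: int, background: int) -> int:
    """Map a (bird, background) pair to a group id in 0..3 (hypothetical numbering)."""
    return 2 * bird + background

def background_given_bird(match_fraction: float) -> dict:
    """Fraction of each bird class on each background, keyed by group id,
    when `match_fraction` of the class sits on its matching background."""
    return {
        group_index(b, g): match_fraction if b == g else 1.0 - match_fraction
        for b, g in product(BIRDS, BACKGROUNDS)
    }

train = background_given_bird(0.95)     # 95% matching, 5% mismatched
dev_test = background_given_bird(0.50)  # 50% matching, 50% mismatched

for b, g in product(BIRDS, BACKGROUNDS):
    k = group_index(b, g)
    print(f"group {k}: {BIRDS[b]} on {BACKGROUNDS[g]}: "
          f"train={train[k]:.2f}, dev/test={dev_test[k]:.2f}")
```

The mismatched groups (landbird on water, waterbird on land) are rare in train but make up half of each bird class in dev/test, which is the source of the distribution shift.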

In a typical application, the validation set might be constructed by randomly dividing up the available training data. We emphasize that this is not the case here: the training set is skewed, whereas the validation set is more balanced. We followed this construction so that we could better compare ERM vs. reweighting vs. group DRO techniques using a stable set of hyperparameters. In practice, if the validation set were also skewed, we might expect hyperparameter tuning based on worst-group accuracy to be more challenging and noisy. Because of this construction, when reporting average test accuracy in our experiments, we compute the test accuracy within each group and then report a weighted average, with weights given by the relative proportion of each group in the (skewed) training dataset.
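The reporting procedure in the paragraph above can be sketched as below. The function names are hypothetical, and the accuracies and group counts are made-up placeholders, not results from the paper or real dataset statistics.

```python
def weighted_average_accuracy(group_acc, train_group_counts):
    """Average per-group test accuracy, weighting each group by its
    share of the (skewed) training set."""
    total = sum(train_group_counts.values())
    return sum(
        group_acc[g] * train_group_counts[g] / total
        for g in group_acc
    )

def worst_group_accuracy(group_acc):
    """Accuracy of the group on which the model performs worst."""
    return min(group_acc.values())

# Placeholder numbers for illustration only.
group_acc = {0: 0.9, 1: 0.5, 2: 0.5, 3: 0.9}       # per-group test accuracy
train_group_counts = {0: 60, 1: 10, 2: 10, 3: 20}  # group sizes in train

print(f"weighted average: {weighted_average_accuracy(group_acc, train_group_counts):.2f}")
print(f"worst group:      {worst_group_accuracy(group_acc):.2f}")
```

Weighting by training proportions keeps the reported average comparable to what a test set with the same skew as the training set would give, while the worst-group number exposes failures on the rare groups.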

Curated by: Shiori Sagawa, Pang Wei Koh, Tatsunori Hashimoto, and Percy Liang
License: MIT

Dataset Sources

The dataset is formed from Caltech-UCSD Birds-200-2011 (CUB) and Places.

Repository: https://github.com/kohpangwei/group_DRO
Paper: Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

Dataset Creation, Bias, Risks, and Limitations

The dataset was uploaded to Hugging Face by Augustin Godinot. For reproducibility, the upload script scripts/waterbirds.py is provided.

For more information on the creation of the dataset, please refer to the original authors and paper.

Citation

BibTeX:

@inproceedings{
  Sagawa2020Distributionally,
  title={Distributionally Robust Neural Networks},
  author={Shiori Sagawa and Pang Wei Koh and Tatsunori B. Hashimoto and Percy Liang},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=ryxGuJrFvS}
}

Last updated: May 27
