This repository contains the raw datasets for the Training Data Extraction Challenge organized at SaTML 2023.
The main repository provides the challenge data as a list of pointers into The Pile.
To spare participants from downloading and decompressing 800 GB of text, this repository provides the raw NumPy files directly:
- train_prefix.npy (1.4 MB)
- train_suffix.npy (1.4 MB)
- train_preprefix.npy (2.9 MB)
- train_dataset.npy (5.7 MB)
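
As a minimal sketch of how the files listed above might be inspected, the snippet below loads whichever of them are present in a directory and prints their shapes and dtypes. It assumes the files have already been downloaded into the current working directory. The helper name `load_arrays` and the `TRAIN_FILES` tuple are illustrative and not part of the challenge code, and the array layout is not documented here, so the sketch only reports what it finds.

```python
from pathlib import Path

import numpy as np

# File names as listed in this README; the directory they live in is an assumption.
TRAIN_FILES = (
    "train_prefix.npy",
    "train_suffix.npy",
    "train_preprefix.npy",
    "train_dataset.npy",
)


def load_arrays(data_dir=".", filenames=TRAIN_FILES):
    """Load every listed .npy file found in data_dir.

    Returns a dict mapping the file stem (e.g. "train_prefix") to its array.
    Files that are missing are skipped rather than raising an error.
    """
    arrays = {}
    for name in filenames:
        path = Path(data_dir) / name
        if path.exists():
            arrays[path.stem] = np.load(path)
    return arrays


if __name__ == "__main__":
    # Print a short summary of each array that was found.
    for name, arr in load_arrays().items():
        print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")
```

Skipping missing files keeps the helper usable when only some of the arrays have been downloaded.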
The corresponding validation files will be added once the validation set is released.