Datasets for the SaTML 2023 challenge on training data extraction

This repository contains the raw datasets for the Training Data Extraction Challenge organized at SaTML 2023.

The main repository provides the challenge data as a list of pointers into The Pile.

To spare participants from downloading and decompressing 800GB of text, the raw numpy files are provided here:

Train

Will be added once the validation set is released.

Will be added once the validation set is released.
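
Once downloaded, the raw numpy files can be inspected with a few lines of Python. The following is a minimal sketch, not the organizers' official loader: the path `datasets/train_dataset.npy` is a hypothetical placeholder (check the `datasets` directory for the actual file names), and the helpers `load_array` and `describe` are illustrative names introduced here. It assumes the files are standard `.npy` arrays readable by `numpy.load`.

```python
from pathlib import Path

import numpy as np


def load_array(path):
    """Load a single .npy file and return it as a NumPy array."""
    return np.load(Path(path))


def describe(array):
    """Summarize an array's shape and dtype for a quick sanity check."""
    return {"shape": tuple(array.shape), "dtype": str(array.dtype)}


if __name__ == "__main__":
    # Hypothetical path: replace with a real file from the datasets directory.
    example_path = Path("datasets/train_dataset.npy")
    if example_path.exists():
        data = load_array(example_path)
        print(describe(data))
    else:
        print(f"{example_path} not found; adjust the path to a downloaded file.")
```

Printing the shape and dtype first is a cheap way to confirm that a download completed correctly before running any extraction code on it.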
