Add Support for Loading a Specific Dataset Revision #1911

thomascleberg · 2024-09-12T21:04:13Z

⚠️ Please check that this feature request hasn't been suggested before.

I searched previous Ideas in Discussions and didn't find any similar feature requests.
I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

This feature would add support for loading from a particular Hugging Face dataset revision, that is, a pull request or commit hash. This is an important feature of datasets on the Hugging Face Hub, and it is illustrated in the second example of load_dataset.

Some dataset management workflows treat these revisions as "candidates": a pull request is merged only after an experiment that uses it completes successfully. In this case, we need to be able to specify the revision of the dataset.

✔️ Solution

There would be an optional revision parameter on each datasets entry that lets you specify the revision to load (a commit hash, tag, or branch name).

datasets:
  # HuggingFace dataset repo | s3://,gs:// path | "json" for local dataset, make sure to fill data_files
  - path: vicgalle/alpaca-gpt4
    # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
    type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
    ds_type: # Optional[str] (json|arrow|parquet|text|csv) defines the datatype when path is a file
    data_files: # Optional[str] path to source data files
    shards: # Optional[int] number of shards to split data into
    name: # Optional[str] name of dataset configuration to load
    train_on_split: train # Optional[str] name of dataset split to load from
    revision: # Optional[str] The specific revision of the dataset to use when loading from the Hugging Face Hub. This can be a commit hash, tag, or branch name. If not specified, the latest version will be used. This parameter is ignored for local datasets.
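
To make the intended behavior concrete, here is a minimal sketch of how one datasets entry might be translated into keyword arguments for load_dataset. The helper name build_load_kwargs is hypothetical and is not taken from the linked PR; the only library fact it relies on is that load_dataset in the Hugging Face datasets library accepts path, name, data_files, and revision keyword arguments.

```python
from typing import Any, Dict, Optional


def build_load_kwargs(dataset_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Translate one `datasets:` entry into keyword arguments for load_dataset.

    Hypothetical helper for illustration only; not the project's actual loader.
    """
    kwargs: Dict[str, Any] = {"path": dataset_cfg["path"]}

    # Copy optional fields only when the user filled them in.
    for key in ("name", "data_files"):
        if dataset_cfg.get(key):
            kwargs[key] = dataset_cfg[key]

    # `revision` may be a commit hash, tag, or branch name.
    # Leaving it out lets the Hub resolve the default branch.
    revision: Optional[str] = dataset_cfg.get("revision")
    if revision:
        kwargs["revision"] = revision

    return kwargs


pinned = build_load_kwargs(
    {"path": "vicgalle/alpaca-gpt4", "type": "alpaca", "revision": "main"}
)
unpinned = build_load_kwargs(
    {"path": "vicgalle/alpaca-gpt4", "type": "alpaca", "revision": None}
)

# With the `datasets` library installed and network access, the call would be:
#   from datasets import load_dataset
#   ds = load_dataset(**pinned)
```

Omitting the key when revision is empty, rather than passing an empty string, keeps existing configs that never set revision behaving exactly as before.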

❓ Alternatives

I considered changing our workflow to have there be one dataset per experiment instead of one revision per experiment, but that's not a realistic solution: it would mean changing a whole experimental and merging setup just to avoid specifying a dataset revision.

📝 Additional Context

I have a PR ready to go on this one:

Add Support for revision Dataset Parameter to specify reading from Huggingface Dataset Revision #1912

Acknowledgements

My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this feature has not been requested yet.
I have provided enough information for the maintainers to understand and evaluate this request.

thomascleberg added the enhancement New feature or request label Sep 12, 2024
