Feature to get the complement of a hash sample #72

Merged (1 commit) on Nov 7, 2023

Conversation

IanMagnusson
Contributor

We need to be able to get both the held-out data and the remaining (not held-out) data when making splits with scripts/hash_sample.py.

I added a --complement flag to this script that writes out the hashes that don't match the calculate_md5_suffix suffixes. I also updated the logging statement to reflect when the script is running in this mode.

I did a rudimentary test by splitting a RedPajama file (pretraining-data/sources/redpajama/v1/documents/split=train/dataset=c4/c4-train.00000-of-01024_00000.jsonl.gz) this way at 5% and checking that the hash sample and its complement add up to the full number of documents (17932 + 338023 = 355955).
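
For anyone reading along, here is a minimal sketch of the idea and of the sanity check above. It is not the code in scripts/hash_sample.py: `keep_document`, `split_counts`, and the `SUFFIXES` value are hypothetical names chosen for illustration, and I am assuming the selection works by matching the tail of an MD5 hex digest against a set of suffixes.

```python
import hashlib

# Hypothetical suffix set for illustration: hex digests ending in "0"
# select roughly 1/16 of the inputs. The real script derives its suffixes
# from the requested sampling rate.
SUFFIXES = ("0",)


def keep_document(doc_id, suffixes=SUFFIXES, complement=False):
    """Return True if the document belongs in the requested split."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    matched = digest.endswith(tuple(suffixes))
    # With complement=True the selection is inverted, so every document
    # lands in exactly one of the two splits.
    return matched != complement


def split_counts(doc_ids, suffixes=SUFFIXES):
    """Count documents in the sample, in its complement, and overall."""
    doc_ids = list(doc_ids)
    sample = sum(keep_document(d, suffixes) for d in doc_ids)
    rest = sum(keep_document(d, suffixes, complement=True) for d in doc_ids)
    return sample, rest, len(doc_ids)


# Same check as in the description: the two splits must add up to the total.
sample, rest, total = split_counts(f"doc-{i}" for i in range(10_000))
assert sample + rest == total
```

Because the decision depends only on the hash of each document, the split is deterministic and the two runs (with and without the flag) never overlap.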

@IanMagnusson IanMagnusson requested a review from soldni November 7, 2023 18:32
Member

@soldni soldni left a comment

LGTM

@soldni soldni merged commit c8d8547 into main Nov 7, 2023
12 checks passed
@soldni soldni deleted the hash-sample-compliment branch November 7, 2023 19:02
2 participants