Reddit processing code #74
Conversation
Hope you don't mind me commenting! This looks great! I'm computing the correlations of our different filters/taggers for each dataset. Am I understanding correctly that the filters for Reddit are:
- Length (min & max length)
- Score / upvotes (has to be >= 3)
- Over 18 (for submission data)

That seems like far fewer than we had for Common Crawl (which is fine with me; just wondering whether a correlation heatmap of them will be very interesting cc @soldni).
Also, on a high level, I think it'd be nice to indicate which of the different variations is the "final" one, i.e. the best one that I assume will be included in Dolma. I assume it is atomic_content_v5, so I mostly looked at that one. Maybe the others could all be put into a separate folder named alternatives or something.
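For reference, here is a minimal sketch of those three filters as predicates. This is a reading of the list above, not the PR's actual code; the function names and the min/max length thresholds are assumptions, and the field names follow the usual Pushshift JSON schema:

```python
# Hedged sketch of the three Reddit filters discussed above.
# Thresholds other than score >= 3 are illustrative, not from the PR.
def keep_comment(comment: dict, min_len: int = 1, max_len: int = 10_000) -> bool:
    body = comment.get("body", "")
    return (
        min_len <= len(body) <= max_len   # length filter (min & max)
        and comment.get("score", 0) >= 3  # score / upvotes filter
    )

def keep_submission(submission: dict, min_len: int = 1, max_len: int = 10_000) -> bool:
    text = submission.get("selftext", "")
    return (
        min_len <= len(text) <= max_len
        and submission.get("score", 0) >= 3
        and not submission.get("over_18", False)  # NSFW filter, submissions only
    )
```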
sources/reddit/README.md (Outdated)
## atomic_content_v5

A refined version of atomic_content_v3, v5 uses different length and selection criteria for comments and submissions.
Is atomic_content_v5 the final one to be included in Dolma?
yes!
'body': trim(normalize_string(comment['body']), max_length),
'body_is_trimmed': len(comment['body']) > max_length,
Why trim it if the trimmed ones seem to be filtered out anyway?
Good catch! I started out trimming comments in earlier versions of the dataset (when removing a long comment would leave a gap in a conversational thread), but removed them in the atomic versions; the trim is just a vestige of that change.
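To make the vestige concrete, here is a self-contained sketch of the interaction being described. normalize_string and trim are stand-ins for the PR's helpers, and max_length is an illustrative value:

```python
def normalize_string(text: str) -> str:
    # Stand-in for the PR's helper: collapse runs of whitespace.
    return " ".join(text.split())

def trim(text: str, max_length: int) -> str:
    # Stand-in for the PR's helper: truncate to max_length characters.
    return text[:max_length]

max_length = 10_000               # illustrative threshold
comment = {"body": "x" * 12_000}  # an over-long comment

record = {
    "body": trim(normalize_string(comment["body"]), max_length),
    "body_is_trimmed": len(comment["body"]) > max_length,
}

# In the atomic versions, flagged records are dropped downstream, so the
# trimmed body never reaches the output and the trim itself is a no-op:
kept = [r for r in [record] if not r["body_is_trimmed"]]
assert kept == []
```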
Co-authored-by: Niklas Muennighoff <[email protected]>
* initial commit of reddit processing scripts
* Minor cleanup and added Readme
* removed programming subreddits from one stray dir
* Update sources/reddit/README.md (Co-authored-by: Niklas Muennighoff <[email protected]>)
* Apply suggestions from code review

Co-authored-by: Luca Soldaini <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
Adding the processing code for building five variations of Reddit pretraining data from the raw Pushshift comment and submission dumps.
The code relies on Apache Beam and GCP's Dataflow service, so it sits apart from the rest of the Dolma codebase. It's included for reproducibility of the Reddit data in v1.5 (atomic_content_v5), as well as of the data used in the ablation experiments that led to that selection.
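For orientation, a minimal sketch of the general shape of such a Beam/Dataflow job over a Pushshift comment dump. All paths, the option handling, and the filter are placeholders, not the PR's actual entry points:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def keep(comment: dict) -> bool:
    # Placeholder filter echoing the thresholds discussed in the review.
    return comment.get("score", 0) >= 3 and len(comment.get("body", "")) > 0

# Running on Dataflow would additionally require the usual GCP
# project/region/runner pipeline options.
with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://<bucket>/pushshift/RC_*.json")
        | "Parse" >> beam.Map(json.loads)
        | "Filter" >> beam.Filter(keep)
        | "Format" >> beam.Map(lambda c: json.dumps({"text": c["body"]}))
        | "Write" >> beam.io.WriteToText("gs://<bucket>/reddit/out/part")
    )
```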