
Reddit processing code #74

Merged
soldni merged 8 commits into main from reddit on Nov 30, 2023

Conversation

drschwenk (Contributor) commented:

Adding the processing code for building five variations of Reddit pretraining data from the raw Pushshift comment and submission dumps.

The code relies on Apache Beam and GCP's Dataflow service, so it sits apart from the rest of the Dolma codebase. It's included for reproducibility of the Reddit data in v1.5 (atomic_content_v5), as well as of the data used in the ablation experiments that led to that selection.
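For orientation, here is a minimal sketch of the overall pipeline shape. The bucket paths, record fields, and score threshold below are illustrative assumptions, not the exact values used by these scripts:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


# Illustrative only: paths, field names, and thresholds are assumptions,
# not the exact values used by the Dolma Reddit scripts.
def build_comment_docs(comment: dict, min_score: int = 3) -> list[dict]:
    """Turn one Pushshift comment record into zero or one pretraining docs."""
    if comment.get("score", 0) < min_score:
        return []
    return [{"id": comment["id"], "text": comment["body"]}]


def run():
    # On GCP, pass --runner=DataflowRunner plus project/region flags.
    opts = PipelineOptions()
    with beam.Pipeline(options=opts) as p:
        (
            p
            | "ReadDumps" >> beam.io.ReadFromText("gs://my-bucket/pushshift/RC_*.json")
            | "Parse" >> beam.Map(json.loads)
            | "FilterAndFormat" >> beam.FlatMap(build_comment_docs)
            | "Serialize" >> beam.Map(json.dumps)
            | "WriteOut" >> beam.io.WriteToText("gs://my-bucket/reddit/comments")
        )


if __name__ == "__main__":
    run()
```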

Muennighoff (Contributor) left a comment:

Hope you don't mind me commenting! This looks great! I'm computing the correlations of our different filters/taggers for each dataset. Am I understanding correctly that the filters for Reddit are:

  • Length (min & max length)
  • Score / upvotes (has to be >= 3)
  • Over 18 flag (for submission data)

It seems like much less than what we had for Common Crawl (which is fine with me; just wondering if a correlation heatmap of them will be very interesting, cc @soldni).

Also, at a high level, I think it'd be nice to indicate which of the different variations is the "final" one, i.e. the best one that I assume will be included in Dolma. I assume it is atomic_content_v5, so I mostly looked at that one. Maybe the others could all be put into a separate folder named alternatives or something.
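For concreteness, a minimal sketch of those three filters as I understand them; the field names follow the Pushshift dump schema, and the length bounds are placeholder assumptions, not the values used in this PR:

```python
# Hypothetical standalone versions of the three filters listed above.
MIN_LENGTH = 100     # assumed minimum body length in characters
MAX_LENGTH = 10_000  # assumed maximum body length in characters
MIN_SCORE = 3        # from the discussion: score has to be >= 3


def passes_length(record: dict) -> bool:
    return MIN_LENGTH <= len(record.get("body", "")) <= MAX_LENGTH


def passes_score(record: dict) -> bool:
    return record.get("score", 0) >= MIN_SCORE


def passes_over_18(record: dict) -> bool:
    # Only submissions carry the over_18 flag; comments pass by default.
    return not record.get("over_18", False)


def keep(record: dict) -> bool:
    return passes_length(record) and passes_score(record) and passes_over_18(record)
```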

sources/reddit/README.md (review comment, outdated, resolved)
Comment on lines 38 to 40
## atomic_content_v5

A refined version of atomic_content_v3, v5 uses different length and selection criteria for comments and submissions.
Contributor:

Is atomic_content_v5 the final one to be included in Dolma?

Member:

yes!

Comment on lines +69 to +70
```python
'body': trim(normalize_string(comment['body']), max_length),
'body_is_trimmed': len(comment['body']) > max_length,
```
Contributor:

Why trim it if the trimmed ones seem to be filtered out anyways?

drschwenk (Contributor, Author):

Good catch. I started out trimming comments in earlier versions of the dataset (when removing a long comment would leave a gap in a conversational thread) but removed them entirely in the atomic versions; this is just a vestige of that change.
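For clarity, a hypothetical before/after sketch of the two behaviors; the helpers here are stand-ins for the real implementations:

```python
# Hypothetical illustration of the two behaviors discussed above.
# trim/normalize_string are stand-ins, not the actual helpers.


def normalize_string(text: str) -> str:
    return " ".join(text.split())


def trim(text: str, max_length: int) -> str:
    return text[:max_length]


def to_doc(comment: dict, max_length: int) -> dict:
    # Threaded versions (earlier): keep the comment but cap its length so a
    # long comment doesn't leave a hole in the conversation thread.
    return {
        "body": trim(normalize_string(comment["body"]), max_length),
        "body_is_trimmed": len(comment["body"]) > max_length,
    }


def atomic_filter(docs: list[dict]) -> list[dict]:
    # Atomic versions (this PR): trimmed docs are dropped outright, which is
    # why the trim above no longer affects the surviving records.
    return [d for d in docs if not d["body_is_trimmed"]]
```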

sources/reddit/README.md (review comment, outdated, resolved)
soldni merged commit afab18c into main on Nov 30, 2023
13 checks passed
soldni deleted the reddit branch on November 30, 2023 at 05:36
soldni added a commit that referenced this pull request on Dec 20, 2023:

* initial commit of reddit processing scripts
* Minor cleanup and added Readme
* removed programming subreddits from one stray dir
* Update sources/reddit/README.md
  Co-authored-by: Niklas Muennighoff <[email protected]>
* Apply suggestions from code review

---------

Co-authored-by: Luca Soldaini <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
soldni added a commit that referenced this pull request on Feb 1, 2024:

* added more runs
* new plots
* tokenizer fix
* squatted
* new lang id
* all fasttext lang id
* plots
* further plots
* wip
* progress!
* style
* fixed format
* added configs
* dts
* configs
* more
* refine
* fix
* fix
* adding new features to deduper
* accidentally removed tests
* added cli options
* big commit
* improvement to tokenizer
* bumping version
* fix error in empty
* new dedupe docs
* names
* configs
* fixed paths
* stack
* switched to v2
* fixed dedupe config
* updated
* middle dedupe
* mix text length
* Reddit processing code (#74)
  * initial commit of reddit processing scripts
  * Minor cleanup and added Readme
  * removed programming subreddits from one stray dir
  * Update sources/reddit/README.md
    Co-authored-by: Niklas Muennighoff <[email protected]>
  * Apply suggestions from code review
  Co-authored-by: Luca Soldaini <[email protected]>
  Co-authored-by: Niklas Muennighoff <[email protected]>
* more plots
* fixed version
* names
* different path
* added support for retries
* wip test
* fixed tests
* fixed
* removing repetitions
* dedupe docs
* reddit stats
* paths
* bugfix
* base
* version of pycld2 that compiles on M macs
* new config middle
* 3 parts
* further s3 tests
* decode
* still write empty docs to attributes when skip_empty is True
* wiki adjusted
* wiki config
* simple counts
* changed path
* added new features
* plots
* added new digits vocab
* added config to sample
* small
* added tokenizer script
* code abl
* cargo
* version bump
* made it stable
* topics
* sampling
* rename
* new config for 1.6
* llama config
* llama config (fix)
* figures
* adding docs dedupe
* added more dedup configs
* style
* added counts
* more cli
* style
* style
* removed autopep8
* resorted
* testing change
* corner cases
* figures
* added current paper
* reverted cli
* documentation

---------

Co-authored-by: Dustin Schwenk <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>