
Reddit processing code #74

Merged
soldni merged 8 commits into main from reddit on Nov 30, 2023

Conversation

drschwenk (Contributor) commented:

Adding the processing code for building five variations of Reddit pretraining data from the raw Pushshift comment and submission dumps.

The code relies on Apache Beam and GCP's Dataflow service, so it sits apart from the rest of the Dolma codebase. It's included for reproducibility of the Reddit data in v1.5 (atomic_content_v5), as well as of the data used in the ablation experiments that led to that selection.
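For orientation, here is a minimal sketch of the overall pipeline shape. The bucket paths, record fields, and score threshold below are illustrative assumptions, not the exact values used by these scripts:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


# Illustrative only: paths, field names, and thresholds are assumptions,
# not the exact values used by the Dolma Reddit scripts.
def build_comment_docs(comment: dict, min_score: int = 3) -> list[dict]:
    """Turn one Pushshift comment record into zero or one pretraining docs."""
    if comment.get("score", 0) < min_score:
        return []
    return [{"id": comment["id"], "text": comment["body"]}]


def run():
    # On GCP, pass --runner=DataflowRunner plus project/region flags.
    opts = PipelineOptions()
    with beam.Pipeline(options=opts) as p:
        (
            p
            | "ReadDumps" >> beam.io.ReadFromText("gs://my-bucket/pushshift/RC_*.json")
            | "Parse" >> beam.Map(json.loads)
            | "FilterAndFormat" >> beam.FlatMap(build_comment_docs)
            | "Serialize" >> beam.Map(json.dumps)
            | "WriteOut" >> beam.io.WriteToText("gs://my-bucket/reddit/comments")
        )


if __name__ == "__main__":
    run()
```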

Muennighoff (Contributor) left a comment:

Hope you don't mind me commenting! This looks great! I'm computing the correlations of our different filters/taggers for each dataset. Am I understanding correctly that the filters for Reddit are:

  • Length (min & max length)
  • Score / upvotes (has to be >= 3)
  • Over 18 flag (for submission data)

It seems like much less than what we had for Common Crawl (which is fine with me; just wondering if a correlation heatmap of them will be very interesting, cc @soldni).

Also, at a high level, I think it'd be nice to indicate which of the different variations is the "final" one, i.e. the best one that I assume will be included in Dolma. I assume it is atomic_content_v5, so I mostly looked at that one. Maybe the others could all be put into a separate folder named alternatives or something.
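For concreteness, a minimal sketch of those three filters as I understand them; the field names follow the Pushshift dump schema, and the length bounds are placeholder assumptions, not the values used in this PR:

```python
# Hypothetical standalone versions of the three filters listed above.
MIN_LENGTH = 100     # assumed minimum body length in characters
MAX_LENGTH = 10_000  # assumed maximum body length in characters
MIN_SCORE = 3        # from the discussion: score has to be >= 3


def passes_length(record: dict) -> bool:
    return MIN_LENGTH <= len(record.get("body", "")) <= MAX_LENGTH


def passes_score(record: dict) -> bool:
    return record.get("score", 0) >= MIN_SCORE


def passes_over_18(record: dict) -> bool:
    # Only submissions carry the over_18 flag; comments pass by default.
    return not record.get("over_18", False)


def keep(record: dict) -> bool:
    return passes_length(record) and passes_score(record) and passes_over_18(record)
```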

sources/reddit/README.md (review comment, outdated, resolved)
Comment on lines 38 to 40
## atomic_content_v5

A refined version of atomic_content_v3, v5 uses different length and selection criteria for comments and submissions.
Contributor:

Is atomic_content_v5 the final one to be included in Dolma?

Member:

yes!

Comment on lines +69 to +70
```python
'body': trim(normalize_string(comment['body']), max_length),
'body_is_trimmed': len(comment['body']) > max_length,
```
Contributor:

Why trim it if the trimmed ones seem to be filtered out anyways?

drschwenk (Contributor, Author):

Good catch. I started out trimming comments in earlier versions of the dataset (when removing a long comment would leave a gap in a conversational thread) but removed them entirely in the atomic versions; this is just a vestige of that change.
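For clarity, a hypothetical before/after sketch of the two behaviors; the helpers here are stand-ins for the real implementations:

```python
# Hypothetical illustration of the two behaviors discussed above.
# trim/normalize_string are stand-ins, not the actual helpers.


def normalize_string(text: str) -> str:
    return " ".join(text.split())


def trim(text: str, max_length: int) -> str:
    return text[:max_length]


def to_doc(comment: dict, max_length: int) -> dict:
    # Threaded versions (earlier): keep the comment but cap its length so a
    # long comment doesn't leave a hole in the conversation thread.
    return {
        "body": trim(normalize_string(comment["body"]), max_length),
        "body_is_trimmed": len(comment["body"]) > max_length,
    }


def atomic_filter(docs: list[dict]) -> list[dict]:
    # Atomic versions (this PR): trimmed docs are dropped outright, which is
    # why the trim above no longer affects the surviving records.
    return [d for d in docs if not d["body_is_trimmed"]]
```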

sources/reddit/README.md (review comment, outdated, resolved)
soldni merged commit afab18c into main on Nov 30, 2023
13 checks passed
soldni deleted the reddit branch on November 30, 2023 at 05:36
soldni added a commit that referenced this pull request on Dec 20, 2023:

* initial commit of reddit processing scripts
* Minor cleanup and added Readme
* removed programming subreddits from one stray dir
* Update sources/reddit/README.md
  Co-authored-by: Niklas Muennighoff <[email protected]>
* Apply suggestions from code review

---------

Co-authored-by: Luca Soldaini <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
soldni added a commit that referenced this pull request on Feb 1, 2024:

* added more runs
* new plots
* tokenizer fix
* squatted
* new lang id
* all fasttext lang id
* plots
* further plots
* wip
* progress!
* style
* fixed format
* added configs
* dts
* configs
* more
* refine
* fix
* fix
* adding new features to deduper
* accidentally removed tests
* added cli options
* big commit
* improvement to tokenizer
* bumping version
* fix error in empty
* new dedupe docs
* names
* configs
* fixed paths
* stack
* switched to v2
* fixed dedupe config
* updated
* middle dedupe
* mix text length
* Reddit processing code (#74)
  * initial commit of reddit processing scripts
  * Minor cleanup and added Readme
  * removed programming subreddits from one stray dir
  * Update sources/reddit/README.md
    Co-authored-by: Niklas Muennighoff <[email protected]>
  * Apply suggestions from code review
  Co-authored-by: Luca Soldaini <[email protected]>
  Co-authored-by: Niklas Muennighoff <[email protected]>
* more plots
* fixed version
* names
* different path
* added support for retries
* wip test
* fixed tests
* fixed
* removing repetitions
* dedupe docs
* reddit stats
* paths
* bugfix
* base
* version of pycld2 that compiles on M macs
* new config middle
* 3 parts
* further s3 tests
* decode
* still write empty docs to attributes when skip_empty is True
* wiki adjusted
* wiki config
* simple counts
* changed path
* added new features
* plots
* added new digits vocab
* added config to sample
* small
* added tokenizer script
* code abl
* cargo
* version bump
* made it stable
* topics
* sampling
* rename
* new config for 1.6
* llama config
* llama config (fix)
* figures
* adding docs dedupe
* added more dedup configs
* style
* added counts
* more cli
* style
* style
* removed autopep8
* resorted
* testing change
* corner cases
* figures
* added current paper
* reverted cli
* documentation

---------

Co-authored-by: Dustin Schwenk <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>