Update sources/reddit/README.md
Co-authored-by: Niklas Muennighoff <[email protected]>
soldni and Muennighoff authored Nov 27, 2023
1 parent a13238c commit 5f87298
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion sources/reddit/README.md
@@ -2,7 +2,7 @@

[Reddit](https://www.reddit.com/) is a social news aggregation and discussion website where users can post links and take part in discussions. Until the spring of 2023, Reddit made its data publicly available through an API that many 3rd party developers built upon. The Pushshift effort was a social media data collection, analysis, and archiving platform that collected Reddit data through the API and made it available to researchers (details of this collection can be found in [The Pushshift Reddit Dataset](https://arxiv.org/abs/2001.08435)). Pushshift released hundreds of collected submission and comment dumps spanning Reddit’s creation in 2006 to the end of the public API in 2023. While these dumps are no longer available from Pushshift, they can still be found in a handful of web archives as of the fall of 2023.

-Reddit content comes in two flavors related to the nature of the platform: **submissions** and **comments**. Submissions are variably links to articles or other external content, images or videos, or “selftext” (posts with only text written by the submitter to initiate a discussion thread). Comments are user-written dialogue that form a nested, hierarchical, conversational thread discussing a submission. The indeterminate nature of the data allows for a fair amount of freedom when constructing a preatraining dataset, and several variations of the dataset were explored for pretraining data.
+Reddit content comes in two flavors related to the nature of the platform: **submissions** and **comments**. Submissions are variably links to articles or other external content, images or videos, or “selftext” (posts with only text written by the submitter to initiate a discussion thread). Comments are user-written dialogue that form a nested, hierarchical, conversational thread discussing a submission. The indeterminate nature of the data allows for a fair amount of freedom when constructing a pretraining dataset, and several variations of the dataset were explored for pretraining data.

# Dataset versions
