diff --git a/sources/reddit/README.md b/sources/reddit/README.md
index 6e99a710..2240d5cf 100644
--- a/sources/reddit/README.md
+++ b/sources/reddit/README.md
@@ -2,7 +2,7 @@
 [Reddit](https://www.reddit.com/) is a social news aggregation and discussion website where users can post links and take part in discussions. Until the spring of 2023, Reddit made its data publicly available through an API that many 3rd party developers built upon. The Pushshift effort was a social media data collection, analysis, and archiving platform that collected Reddit data through the API and made it available to researchers (details of this collection can be found in [The Pushshift Reddit Dataset](https://arxiv.org/abs/2001.08435)). Pushshift released hundreds of collected submission and comment and dumps spanning Reddit’s creation in 2006 to the end of the public API in 2023. While these dumps are no longer available from Pushshift, they can still be found in handful of web archives as of the fall of 2023.
 
-Reddit content comes in two flavors related to the nature of the platform: **submissions** and **comments**. Submissions are variably links to articles or other external content, images or videos, or “selftext” (posts with only text written by the submitter to initiate a discussion thread). Comments are user-written dialogue that form a nested, hierarchical, conversational thread discussing a submission. The indeterminate nature of the data allows for a fair amount of freedom when constructing a preatraining dataset, and several variations of the dataset were explored for pretraining data.
+Reddit content comes in two flavors related to the nature of the platform: **submissions** and **comments**. Submissions are variably links to articles or other external content, images or videos, or “selftext” (posts with only text written by the submitter to initiate a discussion thread). Comments are user-written dialogue that form a nested, hierarchical, conversational thread discussing a submission. The indeterminate nature of the data allows for a fair amount of freedom when constructing a pretraining dataset, and several variations of the dataset were explored for pretraining data.
 
 # Dataset versions