disambiguating that the repo is for the dolma toolkit in various readmes and docs (#104)
arnavic authored Jan 30, 2024
1 parent 50abfb7 commit 43b0314
Showing 5 changed files with 21 additions and 21 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -3,21 +3,21 @@
Dolma is two things:

1. **Dolma Dataset**: an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
2. **Dolma Toolkit**: a high-performance toolkit for curating datasets for language modeling.
2. **Dolma Toolkit**: a high-performance toolkit for curating datasets for language modeling -- this repo contains the source code for the Dolma Toolkit.

## Dolma Dataset

Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
It was created as a training corpus for [OLMo](https://allenai.org/olmo), a language model from the [Allen Institute for AI](https://allenai.org) (AI2).

Dolma is available for download on the HuggingFace 🤗 Hub: [`huggingface.co/datasets/allenai/dolma`](https://huggingface.co/datasets/allenai/dolma). To access Dolma, users must agree to the terms of the terms of [AI2 ImpACT License for Medium Risk Artifacts](https://allenai.org/licenses/impact-mr). Once agreed you can follow the instructions [here](https://huggingface.co/datasets/allenai/dolma#download) to download it.
Dolma is available for download on the HuggingFace 🤗 Hub: [`huggingface.co/datasets/allenai/dolma`](https://huggingface.co/datasets/allenai/dolma). To access Dolma, users must agree to the terms of the [AI2 ImpACT License for Medium Risk Artifacts](https://allenai.org/licenses/impact-mr) on the [HuggingFace 🤗 Hub](https://huggingface.co/datasets/allenai/dolma). Once you have agreed, you can follow the instructions [here](https://huggingface.co/datasets/allenai/dolma#download) to download the dataset.

You can also read more about Dolma in [our announcement](https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64), as well as by consulting its [data sheet](docs/assets/dolma-datasheet-v0.1.pdf).


## Dolma Toolkit

Dolma is a toolkit to curate large datasets for (pre)-training ML models. Its key features are:
This repository houses the Dolma Toolkit, which enables curation of large datasets for (pre)-training ML models. Its key features are:

1. **High Performance** ⚡: Can process billions of documents concurrently thanks to built-in parallelism.
2. **Portability** 🧳: Works on a single machine, a cluster, or a cloud environment.
24 changes: 12 additions & 12 deletions docs/README.md
@@ -1,16 +1,16 @@
# Dolma Toolkit Documentation


Dolma is a toolkit to curate datasets for pretraining AI models. Reasons to use the Dolma toolkit include:
The Dolma toolkit enables dataset curation for pretraining AI models. Reasons to use the Dolma toolkit include:

- **High performance** ⚡️ Dolma is designed to be highly performant, and can be used to process datasets with billions of documents in parallel.
- **Portable** 🧳 Dolma can be run on a single machine, a cluster, or a cloud computing environment.
- **Built-in taggers** 🏷 Dolma comes with a number of built-in taggers, including language detection, toxicity detection, perplexity scoring, and common filtering recipes, such as the ones used to create [Gopher](https://arxiv.org/abs/2112.11446) and [C4](https://arxiv.org/abs/1910.10683).
- **Fast deduplication** 🗑 Dolma can deduplicate documents using a Rust-based Bloom filter, which is significantly faster than other methods.
- **Extensible** 🧩 Dolma is designed to be extensible, and can be extended with custom taggers.
- **Cloud support** ☁️ Dolma supports reading and writing data from local disk and AWS S3-compatible locations.
- **High performance** ⚡️ The Dolma toolkit is designed to be highly performant, and can be used to process datasets with billions of documents in parallel.
- **Portable** 🧳 The Dolma toolkit can be run on a single machine, a cluster, or a cloud computing environment.
- **Built-in taggers** 🏷 The Dolma toolkit comes with a number of built-in taggers, including language detection, toxicity detection, perplexity scoring, and common filtering recipes, such as the ones used to create [Gopher](https://arxiv.org/abs/2112.11446) and [C4](https://arxiv.org/abs/1910.10683).
- **Fast deduplication** 🗑 The Dolma toolkit can deduplicate documents using a Rust-based Bloom filter, which is significantly faster than other methods.
- **Extensible** 🧩 The Dolma toolkit is designed to be extensible, and can be extended with custom taggers.
- **Cloud support** ☁️ The Dolma toolkit supports reading and writing data from local disk and AWS S3-compatible locations.
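
The Bloom-filter deduplication mentioned above can be illustrated with a minimal Python sketch. This is a toy stand-in for the idea only: the toolkit's actual deduplicator is implemented in Rust, and the hash scheme and sizes below are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: constant-memory, probabilistic set membership."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted digests (illustrative scheme).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Deduplicate a stream of documents by exact text content.
seen = BloomFilter()
docs = ["the cat sat", "a dog ran", "the cat sat"]
unique = []
for doc in docs:
    if doc not in seen:
        seen.add(doc)
        unique.append(doc)
```

Because membership tests use constant memory and never touch disk, this approach scales to billions of documents, at the cost of a small false-positive rate (occasionally a document is wrongly flagged as already seen).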

Dataset curation with Dolma usually happens in four steps:
Dataset curation with the Dolma toolkit usually happens in four steps:

1. Using **taggers**, spans of documents in a dataset are tagged with properties (e.g. the language they are in, toxicity, etc.);
2. Documents are optionally **deduplicated** based on their content or metadata;
@@ -19,26 +19,26 @@ Dataset curation with Dolma usually happens in four steps:

![The four steps of dataset curation with Dolma.](assets/diagram.webp)
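
The four steps can be sketched end to end on in-memory documents. This is a hypothetical illustration of the data flow only, not the toolkit's API; the real toolkit runs each step over JSONL files, and the tagger, threshold, and tokenizer below are made up for the example.

```python
# Hypothetical walkthrough of the four curation steps on toy documents.
docs = [
    {"id": "0", "text": "Hello world!"},
    {"id": "1", "text": "Hello world!"},        # exact duplicate of doc 0
    {"id": "2", "text": "spam spam spam spam"},
]

# 1. Tag: attach derived properties as spans (start, end, score).
def repetition_tagger(doc):
    words = doc["text"].split()
    score = 1.0 - len(set(words)) / len(words)  # fraction of repeated words
    return {"repetition": [[0, len(doc["text"]), score]]}

attributes = {doc["id"]: repetition_tagger(doc) for doc in docs}

# 2. Deduplicate: drop documents whose exact text was already seen.
seen, deduped = set(), []
for doc in docs:
    if doc["text"] not in seen:
        seen.add(doc["text"])
        deduped.append(doc)

# 3. Mix: keep documents whose attributes pass a filter threshold.
mixed = [d for d in deduped if attributes[d["id"]]["repetition"][0][2] < 0.5]

# 4. Tokenize: map surviving documents to token ids (whitespace stand-in).
vocab = {}
tokenized = [[vocab.setdefault(w, len(vocab)) for w in d["text"].split()]
             for d in mixed]
```

Each stage consumes the previous stage's output, which is why taggers write their results separately from the documents: steps 2 and 3 can be re-run with different filters without re-tagging.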

Dolma can be installed using `pip`:
The Dolma toolkit can be installed using `pip`:

```shell
pip install dolma
```

Dolma can be used either as a Python library or as a command line tool. The command line tool can be accessed using the `dolma` command. To see the available commands, use the `--help` flag.
The Dolma toolkit can be used either as a Python library or as a command line tool. The command line tool can be accessed using the `dolma` command. To see the available commands, use the `--help` flag.

```shell
dolma --help
```

## Index

To read Dolma's documentation, visit the following pages:
To read the Dolma toolkit's documentation, visit the following pages:

- [Getting Started](getting-started.md)
- [Data Format](data-format.md)
- [Taggers](taggers.md)
- [Deduplication](deduplication.md)
- [Mixer](mixer.md)
- [Tokenization](tokenize.md)
- [Contributing to Dolma](develop.md)
- [Contributing to Dolma Toolkit](develop.md)
4 changes: 2 additions & 2 deletions docs/data-format.md
@@ -1,6 +1,6 @@
# Data Format

In this document, we explain the data format for the datasets processed by Dolma.
In this document, we explain the data format for the datasets processed by the Dolma toolkit.


## Directory Structure
@@ -67,7 +67,7 @@ The `metadata` field will be a free-for-all field that contains any source-specific

It is especially important to preserve source-specific identifiers when possible. For example, in S2 raw data, we have S2 IDs for each document, but we should also persist things like the DOI, arXiv ID, ACL ID, PubMed ID, etc. when they're available to us.
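
For illustration, a document row might carry those identifiers in its `metadata` like this. This is a sketch assuming the JSONL document layout of `id`, `source`, `text`, and `metadata` fields described in this document; the identifier values themselves are hypothetical.

```python
import json

# One document per line in a JSONL file; source-specific IDs live in metadata.
doc = {
    "id": "s2-123456",
    "source": "s2",
    "text": "Abstract. We present ...",
    "metadata": {
        "doi": "10.0000/example",   # hypothetical identifiers, preserved
        "arxiv_id": "0000.00000",   # alongside the primary S2 id
        "acl_id": None,             # keep the key even when unavailable
    },
}
line = json.dumps(doc)        # serialize one JSONL row
restored = json.loads(line)   # round-trips with identifiers intact
```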

### Dolma Attributes Format
### Dolma Toolkit Attributes Format

Let's say our documents are in a good state, but we need to iterate on the toxicity classifier a few times. We don't want to duplicate multiple copies of the dataset just because we updated the toxicity classifier. Hence, we store **documents** separately from **attributes**, where attributes are newly derived/predicted aspects produced by using our tools to analyze the documents.
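
The split can be sketched as follows: attribute rows live in separate files, keyed by document `id`, so re-running an updated classifier rewrites only the attribute file. The `[start, end, score]` span layout follows the scheme above, but the attribute key names here are hypothetical.

```python
# documents/ file: written once, never touched again.
document = {"id": "doc-1", "source": "web", "text": "some page text here"}

# attributes/ file: one row per document, regenerated per tagger version.
attributes_v1 = {"id": "doc-1",
                 "attributes": {"toxicity_v1__score": [[0, 19, 0.12]]}}
attributes_v2 = {"id": "doc-1",
                 "attributes": {"toxicity_v2__score": [[0, 19, 0.07]]}}

# Joining documents with the latest attributes is a lookup by id.
attr_index = {row["id"]: row["attributes"] for row in [attributes_v2]}
joined = {**document, "attributes": attr_index[document["id"]]}
```

Swapping `attributes_v2` for `attributes_v1` changes the join output without modifying the documents file, which is exactly the point of keeping the two apart.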

4 changes: 2 additions & 2 deletions docs/develop.md
@@ -1,6 +1,6 @@
# Contributing to Dolma
# Contributing to the Dolma Toolkit

We welcome contributions to Dolma. Please read this document to learn how to contribute.
We welcome contributions to the Dolma Toolkit. Please read this document to learn how to contribute.

## Development

4 changes: 2 additions & 2 deletions docs/getting-started.md
@@ -1,12 +1,12 @@
# Getting Started

To get started, please install Dolma using `pip`:
To get started, please install the Dolma toolkit using `pip`:

```shell
pip install dolma
```

After installing Dolma, you get access to the `dolma` command line tool. To see the available commands, use the `--help` flag.
After installing the Dolma toolkit, you get access to the `dolma` command line tool. To see the available commands, use the `--help` flag.

```plain-text
$ dolma --help
