disambiguating that the repo is for the dolma toolkit in various readmes and docs (#104)
arnavic authored Jan 30, 2024
1 parent 50abfb7 commit 43b0314
Showing 5 changed files with 21 additions and 21 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -3,21 +3,21 @@
Dolma is two things:

1. **Dolma Dataset**: an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
2. **Dolma Toolkit**: a high-performance toolkit for curating datasets for language modeling.
2. **Dolma Toolkit**: a high-performance toolkit for curating datasets for language modeling -- this repo contains the source code for the Dolma Toolkit.

## Dolma Dataset

Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
It was created as a training corpus for [OLMo](https://allenai.org/olmo), a language model from the [Allen Institute for AI](https://allenai.org) (AI2).

Dolma is available for download on the HuggingFace 🤗 Hub: [`huggingface.co/datasets/allenai/dolma`](https://huggingface.co/datasets/allenai/dolma). To access Dolma, users must agree to the terms of the terms of [AI2 ImpACT License for Medium Risk Artifacts](https://allenai.org/licenses/impact-mr). Once agreed you can follow the instructions [here](https://huggingface.co/datasets/allenai/dolma#download) to download it.
Dolma is available for download on the HuggingFace 🤗 Hub: [`huggingface.co/datasets/allenai/dolma`](https://huggingface.co/datasets/allenai/dolma). To access Dolma, users must agree to the terms of the [AI2 ImpACT License for Medium Risk Artifacts](https://allenai.org/licenses/impact-mr) on the [HuggingFace 🤗 Hub](https://huggingface.co/datasets/allenai/dolma). Once you have agreed, you can follow the instructions [here](https://huggingface.co/datasets/allenai/dolma#download) to download the dataset.

You can also read more about Dolma in [our announcement](https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64), as well as by consulting its [data sheet](docs/assets/dolma-datasheet-v0.1.pdf).


## Dolma Toolkit

Dolma is a toolkit to curate large datasets for (pre)-training ML models. Its key features are:
This repository houses the Dolma Toolkit, which enables curation of large datasets for (pre)-training ML models. Its key features are:

1. **High Performance** ⚡: Can process billions of documents concurrently thanks to built-in parallelism.
2. **Portability** 🧳: Works on a single machine, a cluster, or a cloud environment.
24 changes: 12 additions & 12 deletions docs/README.md
@@ -1,16 +1,16 @@
# Dolma Toolkit Documentation


Dolma is a toolkit to curate datasets for pretraining AI models. Reasons to use the Dolma toolkit include:
The Dolma toolkit enables dataset curation for pretraining AI models. Reasons to use the Dolma toolkit include:

- **High performance** ⚡️ Dolma is designed to be highly performant, and can be used to process datasets with billions of documents in parallel.
- **Portable** 🧳 Dolma can be run on a single machine, a cluster, or a cloud computing environment.
- **Built-in taggers** 🏷 Dolma comes with a number of built-in taggers, including language detection, toxicity detection, perplexity scoring, and common filtering recipes, such as the ones used to create [Gopher](https://arxiv.org/abs/2112.11446) and [C4](https://arxiv.org/abs/1910.10683).
- **Fast deduplication** 🗑 Dolma can deduplicate documents using a Rust-based Bloom filter, which is significantly faster than other methods.
- **Extensible** 🧩 Dolma is designed to be extensible, and can be extended with custom taggers.
- **Cloud support** ☁️ Dolma supports reading and writing data from local disk and AWS S3-compatible locations.
- **High performance** ⚡️ The Dolma toolkit is designed to be highly performant, and can be used to process datasets with billions of documents in parallel.
- **Portable** 🧳 The Dolma toolkit can be run on a single machine, a cluster, or a cloud computing environment.
- **Built-in taggers** 🏷 The Dolma toolkit comes with a number of built-in taggers, including language detection, toxicity detection, perplexity scoring, and common filtering recipes, such as the ones used to create [Gopher](https://arxiv.org/abs/2112.11446) and [C4](https://arxiv.org/abs/1910.10683).
- **Fast deduplication** 🗑 The Dolma toolkit can deduplicate documents using a Rust-based Bloom filter, which is significantly faster than other methods.
- **Extensible** 🧩 The Dolma toolkit is designed to be extensible, and can be extended with custom taggers.
- **Cloud support** ☁️ The Dolma toolkit supports reading and writing data from local disk and AWS S3-compatible locations.
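
The Bloom-filter deduplication mentioned above can be illustrated with a minimal Python sketch. This is a toy stand-in for the idea only: the toolkit's actual deduplicator is implemented in Rust, and the hash scheme and sizes below are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: constant-memory, probabilistic set membership."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted digests (illustrative scheme).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Deduplicate a stream of documents by exact text content.
seen = BloomFilter()
docs = ["the cat sat", "a dog ran", "the cat sat"]
unique = []
for doc in docs:
    if doc not in seen:
        seen.add(doc)
        unique.append(doc)
```

Because membership tests use constant memory and never touch disk, this approach scales to billions of documents, at the cost of a small false-positive rate (occasionally a document is wrongly flagged as already seen).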

Dataset curation with Dolma usually happens in four steps:
Dataset curation with the Dolma toolkit usually happens in four steps:

1. Using **taggers**, spans of documents in a dataset are tagged with properties (e.g. the language they are in, toxicity, etc.);
2. Documents are optionally **deduplicated** based on their content or metadata;
@@ -19,26 +19,26 @@ Dataset curation with Dolma usually happens in four steps:

![The four steps of dataset curation with Dolma.](assets/diagram.webp)
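
The four steps can be sketched end to end on in-memory documents. This is a hypothetical illustration of the data flow only, not the toolkit's API; the real toolkit runs each step over JSONL files, and the tagger, threshold, and tokenizer below are made up for the example.

```python
# Hypothetical walkthrough of the four curation steps on toy documents.
docs = [
    {"id": "0", "text": "Hello world!"},
    {"id": "1", "text": "Hello world!"},        # exact duplicate of doc 0
    {"id": "2", "text": "spam spam spam spam"},
]

# 1. Tag: attach derived properties as spans (start, end, score).
def repetition_tagger(doc):
    words = doc["text"].split()
    score = 1.0 - len(set(words)) / len(words)  # fraction of repeated words
    return {"repetition": [[0, len(doc["text"]), score]]}

attributes = {doc["id"]: repetition_tagger(doc) for doc in docs}

# 2. Deduplicate: drop documents whose exact text was already seen.
seen, deduped = set(), []
for doc in docs:
    if doc["text"] not in seen:
        seen.add(doc["text"])
        deduped.append(doc)

# 3. Mix: keep documents whose attributes pass a filter threshold.
mixed = [d for d in deduped if attributes[d["id"]]["repetition"][0][2] < 0.5]

# 4. Tokenize: map surviving documents to token ids (whitespace stand-in).
vocab = {}
tokenized = [[vocab.setdefault(w, len(vocab)) for w in d["text"].split()]
             for d in mixed]
```

Each stage consumes the previous stage's output, which is why taggers write their results separately from the documents: steps 2 and 3 can be re-run with different filters without re-tagging.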

Dolma can be installed using `pip`:
The Dolma toolkit can be installed using `pip`:

```shell
pip install dolma
```

Dolma can be used either as a Python library or as a command line tool. The command line tool can be accessed using the `dolma` command. To see the available commands, use the `--help` flag.
The Dolma toolkit can be used either as a Python library or as a command line tool. The command line tool can be accessed using the `dolma` command. To see the available commands, use the `--help` flag.

```shell
dolma --help
```

## Index

To read Dolma's documentation, visit the following pages:
To read the Dolma toolkit's documentation, visit the following pages:

- [Getting Started](getting-started.md)
- [Data Format](data-format.md)
- [Taggers](taggers.md)
- [Deduplication](deduplication.md)
- [Mixer](mixer.md)
- [Tokenization](tokenize.md)
- [Contributing to Dolma](develop.md)
- [Contributing to Dolma Toolkit](develop.md)
4 changes: 2 additions & 2 deletions docs/data-format.md
@@ -1,6 +1,6 @@
# Data Format

In this document, we explain the data format for the datasets processed by Dolma.
In this document, we explain the data format for the datasets processed by the Dolma toolkit.


## Directory Structure
@@ -67,7 +67,7 @@ The `metadata` field will be a free-for-all field that contains any source-specific

It is especially important to preserve source-specific identifiers when possible. For example, in S2 raw data, we have S2 IDs for each document, but we should also persist things like the DOI, arXiv ID, ACL ID, PubMed ID, etc. when they're available to us.
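
For illustration, a document row might carry those identifiers in its `metadata` like this. This is a sketch assuming the JSONL document layout of `id`, `source`, `text`, and `metadata` fields described in this document; the identifier values themselves are hypothetical.

```python
import json

# One document per line in a JSONL file; source-specific IDs live in metadata.
doc = {
    "id": "s2-123456",
    "source": "s2",
    "text": "Abstract. We present ...",
    "metadata": {
        "doi": "10.0000/example",   # hypothetical identifiers, preserved
        "arxiv_id": "0000.00000",   # alongside the primary S2 id
        "acl_id": None,             # keep the key even when unavailable
    },
}
line = json.dumps(doc)        # serialize one JSONL row
restored = json.loads(line)   # round-trips with identifiers intact
```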

### Dolma Attributes Format
### Dolma Toolkit Attributes Format

Let's say our documents are in a good state, but we need to iterate on the toxicity classifier a few times. We don't want to duplicate multiple copies of the dataset just because we updated the toxicity classifier. Hence, we store **documents** separately from **attributes**, where attributes are newly derived/predicted aspects produced by using our tools to analyze the documents.
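
The split can be sketched as follows: attribute rows live in separate files, keyed by document `id`, so re-running an updated classifier rewrites only the attribute file. The `[start, end, score]` span layout follows the scheme above, but the attribute key names here are hypothetical.

```python
# documents/ file: written once, never touched again.
document = {"id": "doc-1", "source": "web", "text": "some page text here"}

# attributes/ file: one row per document, regenerated per tagger version.
attributes_v1 = {"id": "doc-1",
                 "attributes": {"toxicity_v1__score": [[0, 19, 0.12]]}}
attributes_v2 = {"id": "doc-1",
                 "attributes": {"toxicity_v2__score": [[0, 19, 0.07]]}}

# Joining documents with the latest attributes is a lookup by id.
attr_index = {row["id"]: row["attributes"] for row in [attributes_v2]}
joined = {**document, "attributes": attr_index[document["id"]]}
```

Swapping `attributes_v2` for `attributes_v1` changes the join output without modifying the documents file, which is exactly the point of keeping the two apart.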

4 changes: 2 additions & 2 deletions docs/develop.md
@@ -1,6 +1,6 @@
# Contributing to Dolma
# Contributing to the Dolma Toolkit

We welcome contributions to Dolma. Please read this document to learn how to contribute.
We welcome contributions to the Dolma Toolkit. Please read this document to learn how to contribute.

## Development

4 changes: 2 additions & 2 deletions docs/getting-started.md
@@ -1,12 +1,12 @@
# Getting Started

To get started, please install Dolma using `pip`:
To get started, please install the Dolma toolkit using `pip`:

```shell
pip install dolma
```

After installing Dolma, you get access to the `dolma` command line tool. To see the available commands, use the `--help` flag.
After installing the Dolma toolkit, you get access to the `dolma` command line tool. To see the available commands, use the `--help` flag.

```plain-text
$ dolma --help
