
Commit

refactored to llm-datasets
malteos committed Mar 26, 2024
1 parent a0c954f commit a7a597c
Showing 174 changed files with 299 additions and 295 deletions.
.github/workflows/publish_pypi.yml (2 changes: 1 addition & 1 deletion)

@@ -20,7 +20,7 @@ jobs:
       id-token: write
     environment:
       name: pypi
-      url: https://pypi.org/p/lm-datasets
+      url: https://pypi.org/p/llm-datasets
     steps:
       - uses: actions/checkout@v3
       - name: Setup
README.md (40 changes: 20 additions & 20 deletions)

@@ -1,51 +1,51 @@
-# lm-datasets
+# llm-datasets
 
-<img align="left" src="https://github.com/malteos/lm-datasets/raw/main/docs/images/A_colorful_parrot_sitting_on_a_pile_of_books__whit-removebg-preview.png" height="200" />
+<img align="left" src="https://github.com/malteos/llm-datasets/raw/main/docs/images/A_colorful_parrot_sitting_on_a_pile_of_books__whit-removebg-preview.png" height="200" />
 
-![](https://img.shields.io/pypi/l/lm-datasets?style=flat-square)
+![](https://img.shields.io/pypi/l/llm-datasets?style=flat-square)
 [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](https://makeapullrequest.com)
 
-**lm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.**
+**llm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.**
 
-The documentation is available [here](https://malteos.github.io/lm-datasets/).
+The documentation is available [here](https://malteos.github.io/llm-datasets/).
 
 ## Quick start
 
 ### Installation
 
-Install the `lm-datasets` package with [pip](https://pypi.org/project/lm-datasets/):
+Install the `llm-datasets` package with [pip](https://pypi.org/project/llm-datasets/):
 
 ```bash
-pip install lm-datasets
+pip install llm-datasets
 ```
 
-To keep the default installation minimal, `lm-datasets` provides optional dependencies for some use cases.
+To keep the default installation minimal, `llm-datasets` provides optional dependencies for some use cases.
 For example, if you want text extraction for all available datasets, run:
 
 ```bash
-pip install lm-datasets[datasets]
+pip install llm-datasets[datasets]
 ```
 
 ### Download and text extraction
 
 To download and extract the plain text of one or more datasets, run the following command:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR
 ```
 
 By default, output is saved as JSONL files. To change the output format, you can use the `--output_format` argument as below:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
 ```
 
 ### Available datasets
 
 A list or table of all available datasets can be printed with the following command:
 
 ```bash
-lm-datasets print_stats --print_output md
+llm-datasets print_stats --print_output md
 ```
 #### Token count by language
 
@@ -160,7 +160,7 @@ To start the app, first clone this repository, install dependencies, and run the
 
 ```bash
 # clone is needed since streamlit does not support apps from modules yet
-git clone https://github.com/malteos/lm-datasets.git
+git clone https://github.com/malteos/llm-datasets.git
 
 streamlit run src/lm_datasets/viewer/app.py -- \
     --raw_datasets_dir=$RAW_DATASETS_DIR \
@@ -176,19 +176,19 @@ To set up your local development environment, we recommend conda and cloning the
 The repository also includes settings and launch scripts for VSCode.
 
 ```bash
-git clone git@github.com:malteos/lm-datasets.git
-cd lm-datasets
+git clone git@github.com:malteos/llm-datasets.git
+cd llm-datasets
 
-conda create -n lm-datasets python=3.10
-conda activate lm-datasets
+conda create -n llm-datasets python=3.10
+conda activate llm-datasets
 
 pip install -r requirements.txt
 ```
 
 Alternatively, you can install the Python package directly from the dev branch:
 
 ```bash
-pip install git+https://github.com/malteos/lm-datasets.git@dev
+pip install git+https://github.com/malteos/llm-datasets.git@dev
 ```
 
 ### Install the pre-commit hooks
@@ -209,12 +209,12 @@ pre-commit run --all-files
 
 The tests can be executed with:
 ```bash
-pytest --doctest-modules --cov-report term --cov=lm_datasets
+pytest --doctest-modules --cov-report term --cov=llm_datasets
 ```
 
 ## Acknowledgements
 
-The work on the lm-datasets software is partially funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html)
+The work on the llm-datasets software is partially funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html)
 through the project [OpenGPT-X](https://opengpt-x.de/en/) (project no. 68GX21007D).
 
 ## License
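For downstream projects, the rename affects both the PyPI package name (`lm-datasets` → `llm-datasets`) and the Python import path (`lm_datasets` → `llm_datasets`). A minimal, hypothetical migration sketch for a downstream source tree (the file-suffix filter and in-place rewrite strategy are assumptions, not part of this commit):

```python
# migrate_imports.py -- hypothetical helper, not part of this commit.
# Rewrites lm_datasets -> llm_datasets and lm-datasets -> llm-datasets
# in a downstream source tree; review the resulting diff before committing.
import re
from pathlib import Path

PATTERNS = [
    (re.compile(r"\blm_datasets"), "llm_datasets"),  # import path / module name
    (re.compile(r"\blm-datasets"), "llm-datasets"),  # package name / CLI name
]

SUFFIXES = {".py", ".md", ".txt", ".toml", ".yaml", ".yml"}

def migrate(root: str = ".") -> None:
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8")
        new_text = text
        for pattern, replacement in PATTERNS:
            new_text = pattern.sub(replacement, new_text)
        if new_text != text:
            path.write_text(new_text, encoding="utf-8")
            print(f"updated {path}")

if __name__ == "__main__":
    migrate()
```

The leading `\b` keeps already-migrated occurrences of `llm_datasets` from being rewritten twice, so the script is safe to re-run.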
docs/add-your-own-data.md (10 changes: 5 additions & 5 deletions)

@@ -12,8 +12,8 @@ For example, Huggingface datasets only needed to specify some metadata like data
 ```python
 # my_datasets/pg19.py
 
-from lm_datasets.datasets.hf_dataset import HFDataset
-from lm_datasets.datasets.base import License, Availability
+from llm_datasets.datasets.hf_dataset import HFDataset
+from llm_datasets.datasets.base import License, Availability
 
 class PG19Dataset(HFDataset):
     DATASET_ID = "pg19"
@@ -48,7 +48,7 @@ Other datasets may require implementing the full text extraction logic. The exam
 import logging
 import pandas as pd
 from pathlib import Path
-from lm_datasets.datasets.base import BaseDataset, Availability, License
+from llm_datasets.datasets.base import BaseDataset, Availability, License
 
 logger = logging.getLogger(__name__)
 
@@ -89,7 +89,7 @@ class CSVExampleDataset(BaseDataset):
 
 ## Register new dataset classes
 
-Each dataset class needs to be registered with `lm-datasets` such that the commands know what classes are available.
+Each dataset class needs to be registered with `llm-datasets` such that the commands know what classes are available.
 This can be done by making a new Python module with a `get_registered_dataset_classes` method that returns a list of dataset classes:
 
 ```python
@@ -107,5 +107,5 @@ def get_registered_dataset_classes():
 To load the registered datasets in the pipeline commands, you need to specify the `--extra_dataset_registries` argument:
 
 ```bash
-lm-datasets compose ... --extra_dataset_registries=my_datasets.dataset_registry
+llm-datasets compose ... --extra_dataset_registries=my_datasets.dataset_registry
 ```
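The body of the registry module is collapsed in this diff. A minimal sketch of what such a module might look like, assuming it registers the two example classes from this same commit (`my_datasets/pg19.py` and `my_datasets/csv_example.py`); the actual file contents are not shown here:

```python
# my_datasets/dataset_registry.py -- illustrative sketch; the real module
# body is collapsed in this diff.
from my_datasets.csv_example import CSVExampleDataset
from my_datasets.pg19 import PG19Dataset


def get_registered_dataset_classes():
    # Expose every custom dataset class to the llm-datasets pipeline commands.
    return [
        PG19Dataset,
        CSVExampleDataset,
    ]
```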
docs/api/base_dataset.md (2 changes: 1 addition & 1 deletion)

@@ -1,3 +1,3 @@
 # BaseDataset
 
-::: lm_datasets.datasets.base.BaseDataset
+::: llm_datasets.datasets.base.BaseDataset
docs/api/config.md (2 changes: 1 addition & 1 deletion)

@@ -1,3 +1,3 @@
 # Config
 
-::: lm_datasets.utils.config.Config
+::: llm_datasets.utils.config.Config
docs/api/hf_dataset.md (2 changes: 1 addition & 1 deletion)

@@ -1,3 +1,3 @@
 # HFDataset
 
-::: lm_datasets.datasets.hf_dataset.HFDataset
+::: llm_datasets.datasets.hf_dataset.HFDataset
docs/api/jsonl_dataset.md (2 changes: 1 addition & 1 deletion)

@@ -1,3 +1,3 @@
 # JSONLDataset
 
-::: lm_datasets.datasets.jsonl_dataset.JSONLDataset
+::: llm_datasets.datasets.jsonl_dataset.JSONLDataset
docs/compose-train-validation-data.md (2 changes: 1 addition & 1 deletion)

@@ -4,7 +4,7 @@ The pipeline step that produces the final training or validation set is the `com
 Before you run this command, you should specify in the [config](config-files.md) files what datasets should be selected and how they should be sampled.
 
 ```bash
-lm-datasets compose --split=train --configs=my_dataset.yaml \
+llm-datasets compose --split=train --configs=my_dataset.yaml \
     --text_data_dir=/data/my_text_data \
     --composed_data_dir=/data/my_composed_data/train/
 ```
docs/config-files.md (12 changes: 6 additions & 6 deletions)

@@ -1,18 +1,18 @@
 # Config Files
 
-`lm-datasets` allows you to specify general settings through config files so you do not need to pass the same command line arguments every time.
+`llm-datasets` allows you to specify general settings through config files so you do not need to pass the same command line arguments every time.
 Several commands support passing the `--configs` argument, which should point to one or more YAML files on your file system. For example, the text extraction command:
 
 ```bash
-lm-datasets extract_text ... --configs $PATH_TO_YAML_CONFIG_FILE
+llm-datasets extract_text ... --configs $PATH_TO_YAML_CONFIG_FILE
 ```
 
 ## Specifying local paths
 
 In the config files, you can store, for example, system-specific settings like the local paths where the raw dataset files are located:
 
 ```yaml
-# ./examples/lm_datasets_configs/my_system.yaml
+# ./examples/llm_datasets_configs/my_system.yaml
 local_dirs_by_source_id:
   redpajama: /my_system_specific_data_directory/redpajama
 ```
@@ -21,7 +21,7 @@ The [RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPaja
 With the above config, we tell the extraction command the path where we downloaded the RedPajama data by providing the config file:
 ```bash
-lm-datasets extract_text redpajama_book --configs ./examples/lm_datasets_configs/my_system.yaml
+llm-datasets extract_text redpajama_book --configs ./examples/llm_datasets_configs/my_system.yaml
 ```
 
 ## Dataset selection and sampling
@@ -30,7 +30,7 @@ The configuration files are also needed for specifying the final dataset composi
 The following example shows a config for an Italian dataset:
 
 ```yaml
-# ./examples/lm_datasets_configs/italian_data.yaml
+# ./examples/llm_datasets_configs/italian_data.yaml
 
 # a fixed random seed for shuffling etc.
 seed: 0
@@ -59,6 +59,6 @@ To use this config, provide the path in the `--configs` argument:
 
 ```bash
 # compose final dataset
-lm-datasets compose ... --configs ./examples/lm_datasets_configs/italian_data.yaml
+llm-datasets compose ... --configs ./examples/llm_datasets_configs/italian_data.yaml
 ```
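How multiple `--configs` files combine is not spelled out in this diff. A plausible sketch of last-wins merging (an assumption for illustration, not the documented behavior of `llm_datasets.utils.config.Config`):

```python
# merge_configs.py -- illustrative sketch only; llm-datasets' actual merge
# semantics may differ (top-level last-wins merging is an assumption).
import yaml  # requires PyYAML

def merge_configs(paths):
    merged = {}
    for path in paths:
        with open(path) as f:
            data = yaml.safe_load(f) or {}
        merged.update(data)  # later files override earlier ones (assumed)
    return merged

# Usage: system-specific paths override the shared dataset config on key clashes.
config = merge_configs([
    "./examples/llm_datasets_configs/italian_data.yaml",
    "./examples/llm_datasets_configs/my_system.yaml",
])
print(config.get("seed"), config.get("local_dirs_by_source_id"))
```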
docs/extract-text-data.md (4 changes: 2 additions & 2 deletions)

@@ -4,11 +4,11 @@
 To download and extract the plain text of one or more datasets, run the following command:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR
 ```
 
 By default, output is saved as JSONL files. To change the output format, you can use the `--output_format` argument as below:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
 ```
docs/getting-started.md (18 changes: 9 additions & 9 deletions)

@@ -2,17 +2,17 @@
 
 ## Installation
 
-Install the `lm-datasets` package with [pip](https://pypi.org/project/lm-datasets/):
+Install the `llm-datasets` package with [pip](https://pypi.org/project/llm-datasets/):
 
 ```bash
-pip install lm-datasets
+pip install llm-datasets
 ```
 
-To keep the default installation minimal, `lm-datasets` provides optional dependencies for some use cases.
+To keep the default installation minimal, `llm-datasets` provides optional dependencies for some use cases.
 For example, if you want text extraction for all available datasets, run:
 
 ```bash
-pip install lm-datasets[datasets]
+pip install llm-datasets[datasets]
 ```
 
 ## Quick start
@@ -22,31 +22,31 @@ pip install lm-datasets[datasets]
 To download and extract the plain text of one or more datasets, run the following command:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR
 ```
 
 By default, output is saved as JSONL files. To change the output format, you can use the `--output_format` argument as below:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
 ```
 
 ### Available datasets
 
 A list or table of all available datasets can be printed with the following command:
 
 ```bash
-lm-datasets print_stats --print_output md
+llm-datasets print_stats --print_output md
 ```
 
 ### Pipeline commands
 
 ```
-usage: lm-datasets <command> [<args>]
+usage: llm-datasets <command> [<args>]
 positional arguments:
   {chunkify,collect_metrics,compose,convert_parquet_to_jsonl,extract_text,hf_upload,print_stats,shuffle,train_tokenizer}
-                        lm-datasets command helpers
+                        llm-datasets command helpers
     chunkify            Split the individual datasets into equally-sized file chunks (based on bytes or rows)
     collect_metrics     Collect metrics (token count etc.) from extracted texts
    compose             Compose the final train/validation set based on the individual datasets
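The subcommands above are typically chained into a pipeline. A hypothetical driver sketch in Python; the step order is an assumption based on the command descriptions, and only arguments documented elsewhere in this commit are used:

```python
# run_pipeline.py -- hypothetical driver; the order of steps is an assumption,
# not a documented recipe from llm-datasets.
import subprocess

DATASET_ID = "pg19"                                   # assumed example dataset
TEXT_DATA_DIR = "/data/my_text_data"                  # assumed paths
COMPOSED_DATA_DIR = "/data/my_composed_data/train/"
CONFIG = "my_dataset.yaml"

steps = [
    # 1. Download the raw data and extract plain text (JSONL by default).
    ["llm-datasets", "extract_text", DATASET_ID, TEXT_DATA_DIR],
    # 2. Compose the final training set from the extracted texts.
    [
        "llm-datasets", "compose",
        "--split=train",
        f"--configs={CONFIG}",
        f"--text_data_dir={TEXT_DATA_DIR}",
        f"--composed_data_dir={COMPOSED_DATA_DIR}",
    ],
]

for cmd in steps:
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # fail fast if a step errors
```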
docs/index.md (4 changes: 2 additions & 2 deletions)

@@ -1,6 +1,6 @@
-# lm-datasets
+# llm-datasets
 
-**lm-datasets** is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.
+**llm-datasets** is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.
 
 - [Getting started](getting-started.md)
 - [Config files](config-files.md)
examples/custom_datasets/README.md (2 changes: 1 addition & 1 deletion)

@@ -8,5 +8,5 @@
 To load the registered datasets in the pipeline commands, you need to specify the `--extra_dataset_registries` argument:
 
 ```bash
-lm-datasets compose ... --extra_dataset_registries=my_datasets.dataset_registry
+llm-datasets compose ... --extra_dataset_registries=my_datasets.dataset_registry
 ```
examples/custom_datasets/my_datasets/csv_example.py (2 changes: 1 addition & 1 deletion)

@@ -1,7 +1,7 @@
 import logging
 import pandas as pd
 from pathlib import Path
-from lm_datasets.datasets.base import BaseDataset, Availability, License
+from llm_datasets.datasets.base import BaseDataset, Availability, License
 
 logger = logging.getLogger(__name__)
 
examples/custom_datasets/my_datasets/pg19.py (4 changes: 2 additions & 2 deletions)

@@ -1,5 +1,5 @@
-from lm_datasets.datasets.hf_dataset import HFDataset
-from lm_datasets.datasets.base import License, Availability
+from llm_datasets.datasets.hf_dataset import HFDataset
+from llm_datasets.datasets.base import License, Availability
 
 
 class PG19Dataset(HFDataset):
mkdocs.yml (10 changes: 5 additions & 5 deletions)

@@ -1,11 +1,11 @@
-site_name: "lm-datasets: Documentation"
-site_url: https://github.com/malteos/lm-datasets/
+site_name: "llm-datasets: Documentation"
+site_url: https://github.com/malteos/llm-datasets/
 
-site_description: "Documentation of the lm-datasets framework."
-site_author: "Malte Ostendorff and lm-datasets contributors"
+site_description: "Documentation of the llm-datasets framework."
+site_author: "Malte Ostendorff and llm-datasets contributors"
 docs_dir: docs/
 repo_name: "GitHub"
-repo_url: "https://github.com/malteos/lm-datasets/"
+repo_url: "https://github.com/malteos/llm-datasets/"
 
 nav:
   - "Home": index.md
