
Commit

refactored to llm-datasets
malteos committed Mar 26, 2024
1 parent a0c954f commit a7a597c
Showing 174 changed files with 299 additions and 295 deletions.
.github/workflows/publish_pypi.yml (2 changes: 1 addition & 1 deletion)

@@ -20,7 +20,7 @@ jobs:
       id-token: write
     environment:
       name: pypi
-      url: https://pypi.org/p/lm-datasets
+      url: https://pypi.org/p/llm-datasets
     steps:
       - uses: actions/checkout@v3
       - name: Setup
README.md (40 changes: 20 additions & 20 deletions)

@@ -1,51 +1,51 @@
-# lm-datasets
+# llm-datasets
 
-<img align="left" src="https://github.com/malteos/lm-datasets/raw/main/docs/images/A_colorful_parrot_sitting_on_a_pile_of_books__whit-removebg-preview.png" height="200" />
+<img align="left" src="https://github.com/malteos/llm-datasets/raw/main/docs/images/A_colorful_parrot_sitting_on_a_pile_of_books__whit-removebg-preview.png" height="200" />
 
-![](https://img.shields.io/pypi/l/lm-datasets?style=flat-square)
+![](https://img.shields.io/pypi/l/llm-datasets?style=flat-square)
 [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](https://makeapullrequest.com)
 
-**lm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.**
+**llm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.**
 
-The documentation is available [here](https://malteos.github.io/lm-datasets/).
+The documentation is available [here](https://malteos.github.io/llm-datasets/).
 
 ## Quick start
 
 ### Installation
 
-Install the `lm-datasets` package with [pip](https://pypi.org/project/lm-datasets/):
+Install the `llm-datasets` package with [pip](https://pypi.org/project/llm-datasets/):
 
 ```bash
-pip install lm-datasets
+pip install llm-datasets
 ```
 
-To keep the default installation minimal, `lm-datasets` provides optional dependencies for some use cases.
+To keep the default installation minimal, `llm-datasets` provides optional dependencies for some use cases.
 For example, if you want text extraction for all available datasets, run:
 
 ```bash
-pip install lm-datasets[datasets]
+pip install llm-datasets[datasets]
 ```
 
 ### Download and text extraction
 
 To download and extract the plain text of one or more datasets, run the following command:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR
 ```
 
 By default, output is saved as JSONL files. To change the output format, you can use the `--output_format` argument as below:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
 ```
 
 ### Available datasets
 
 A list or table of all available datasets can be printed with the following command:
 
 ```bash
-lm-datasets print_stats --print_output md
+llm-datasets print_stats --print_output md
 ```
 #### Token count by language
 
@@ -160,7 +160,7 @@ To start the app, first clone this repository, install dependencies, and run the
 
 ```bash
 # clone is needed since streamlit does not support apps from modules yet
-git clone https://github.com/malteos/lm-datasets.git
+git clone https://github.com/malteos/llm-datasets.git
 
 streamlit run src/lm_datasets/viewer/app.py -- \
     --raw_datasets_dir=$RAW_DATASETS_DIR \
@@ -176,19 +176,19 @@ To set up your local development environment, we recommend conda and cloning the
 The repository also includes settings and launch scripts for VSCode.
 
 ```bash
-git clone git@github.com:malteos/lm-datasets.git
-cd lm-datasets
+git clone git@github.com:malteos/llm-datasets.git
+cd llm-datasets
 
-conda create -n lm-datasets python=3.10
-conda activate lm-datasets
+conda create -n llm-datasets python=3.10
+conda activate llm-datasets
 
 pip install -r requirements.txt
 ```
 
 Alternatively, you can install the Python package directly from the dev branch:
 
 ```bash
-pip install git+https://github.com/malteos/lm-datasets.git@dev
+pip install git+https://github.com/malteos/llm-datasets.git@dev
 ```
 
 ### Install the pre-commit hooks
@@ -209,12 +209,12 @@ pre-commit run --all-files
 
 The tests can be executed with:
 ```bash
-pytest --doctest-modules --cov-report term --cov=lm_datasets
+pytest --doctest-modules --cov-report term --cov=llm_datasets
 ```
 
 ## Acknowledgements
 
-The work on the lm-datasets software is partially funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html)
+The work on the llm-datasets software is partially funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html)
 through the project [OpenGPT-X](https://opengpt-x.de/en/) (project no. 68GX21007D).
 
 ## License
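For downstream projects, the rename affects both the PyPI package name (`lm-datasets` → `llm-datasets`) and the Python import path (`lm_datasets` → `llm_datasets`). A minimal, hypothetical migration sketch for a downstream source tree (the file-suffix filter and in-place rewrite strategy are assumptions, not part of this commit):

```python
# migrate_imports.py -- hypothetical helper, not part of this commit.
# Rewrites lm_datasets -> llm_datasets and lm-datasets -> llm-datasets
# in a downstream source tree; review the resulting diff before committing.
import re
from pathlib import Path

PATTERNS = [
    (re.compile(r"\blm_datasets"), "llm_datasets"),  # import path / module name
    (re.compile(r"\blm-datasets"), "llm-datasets"),  # package name / CLI name
]

SUFFIXES = {".py", ".md", ".txt", ".toml", ".yaml", ".yml"}

def migrate(root: str = ".") -> None:
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8")
        new_text = text
        for pattern, replacement in PATTERNS:
            new_text = pattern.sub(replacement, new_text)
        if new_text != text:
            path.write_text(new_text, encoding="utf-8")
            print(f"updated {path}")

if __name__ == "__main__":
    migrate()
```

The leading `\b` keeps already-migrated occurrences of `llm_datasets` from being rewritten twice, so the script is safe to re-run.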
docs/add-your-own-data.md (10 changes: 5 additions & 5 deletions)

@@ -12,8 +12,8 @@ For example, Huggingface datasets only needed to specify some metadata like data
 ```python
 # my_datasets/pg19.py
 
-from lm_datasets.datasets.hf_dataset import HFDataset
-from lm_datasets.datasets.base import License, Availability
+from llm_datasets.datasets.hf_dataset import HFDataset
+from llm_datasets.datasets.base import License, Availability
 
 class PG19Dataset(HFDataset):
     DATASET_ID = "pg19"
@@ -48,7 +48,7 @@ Other datasets may require implementing the full text extraction logic. The exam
 import logging
 import pandas as pd
 from pathlib import Path
-from lm_datasets.datasets.base import BaseDataset, Availability, License
+from llm_datasets.datasets.base import BaseDataset, Availability, License
 
 logger = logging.getLogger(__name__)
 
@@ -89,7 +89,7 @@ class CSVExampleDataset(BaseDataset):
 
 ## Register new dataset classes
 
-Each dataset class needs to be registered with `lm-datasets` such that the commands know what classes are available.
+Each dataset class needs to be registered with `llm-datasets` such that the commands know what classes are available.
 This can be done by making a new Python module with a `get_registered_dataset_classes` method that returns a list of dataset classes:
 
 ```python
@@ -107,5 +107,5 @@ def get_registered_dataset_classes():
 To load the registered datasets in the pipeline commands, you need to specify the `--extra_dataset_registries` argument:
 
 ```bash
-lm-datasets compose ... --extra_dataset_registries=my_datasets.dataset_registry
+llm-datasets compose ... --extra_dataset_registries=my_datasets.dataset_registry
 ```
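The body of the registry module is collapsed in this diff. A minimal sketch of what such a module might look like, assuming it registers the two example classes from this same commit (`my_datasets/pg19.py` and `my_datasets/csv_example.py`); the actual file contents are not shown here:

```python
# my_datasets/dataset_registry.py -- illustrative sketch; the real module
# body is collapsed in this diff.
from my_datasets.csv_example import CSVExampleDataset
from my_datasets.pg19 import PG19Dataset


def get_registered_dataset_classes():
    # Expose every custom dataset class to the llm-datasets pipeline commands.
    return [
        PG19Dataset,
        CSVExampleDataset,
    ]
```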
docs/api/base_dataset.md (2 changes: 1 addition & 1 deletion)

@@ -1,3 +1,3 @@
 # BaseDataset
 
-::: lm_datasets.datasets.base.BaseDataset
+::: llm_datasets.datasets.base.BaseDataset
docs/api/config.md (2 changes: 1 addition & 1 deletion)

@@ -1,3 +1,3 @@
 # Config
 
-::: lm_datasets.utils.config.Config
+::: llm_datasets.utils.config.Config
docs/api/hf_dataset.md (2 changes: 1 addition & 1 deletion)

@@ -1,3 +1,3 @@
 # HFDataset
 
-::: lm_datasets.datasets.hf_dataset.HFDataset
+::: llm_datasets.datasets.hf_dataset.HFDataset
docs/api/jsonl_dataset.md (2 changes: 1 addition & 1 deletion)

@@ -1,3 +1,3 @@
 # JSONLDataset
 
-::: lm_datasets.datasets.jsonl_dataset.JSONLDataset
+::: llm_datasets.datasets.jsonl_dataset.JSONLDataset
docs/compose-train-validation-data.md (2 changes: 1 addition & 1 deletion)

@@ -4,7 +4,7 @@ The pipeline step that produces the final training or validation set is the `com
 Before you run this command, you should specify in the [config](config-files.md) files what datasets should be selected and how they should be sampled.
 
 ```bash
-lm-datasets compose --split=train --configs=my_dataset.yaml \
+llm-datasets compose --split=train --configs=my_dataset.yaml \
     --text_data_dir=/data/my_text_data \
     --composed_data_dir=/data/my_composed_data/train/
 ```
docs/config-files.md (12 changes: 6 additions & 6 deletions)

@@ -1,18 +1,18 @@
 # Config Files
 
-`lm-datasets` allows you to specify general settings through config files so you do not need to pass the same command line arguments every time.
+`llm-datasets` allows you to specify general settings through config files so you do not need to pass the same command line arguments every time.
 Several commands support passing the `--configs` argument, which should point to one or more YAML files on your file system. For example, the text extraction command:
 
 ```bash
-lm-datasets extract_text ... --configs $PATH_TO_YAML_CONFIG_FILE
+llm-datasets extract_text ... --configs $PATH_TO_YAML_CONFIG_FILE
 ```
 
 ## Specifying local paths
 
 In the config files, you can store, for example, system-specific settings like the local paths where the raw dataset files are located:
 
 ```yaml
-# ./examples/lm_datasets_configs/my_system.yaml
+# ./examples/llm_datasets_configs/my_system.yaml
 local_dirs_by_source_id:
   redpajama: /my_system_specific_data_directory/redpajama
 ```
@@ -21,7 +21,7 @@ The [RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPaja
 With the above config, we tell the extraction command the path where we downloaded the RedPajama data by providing the config file:
 ```bash
-lm-datasets extract_text redpajama_book --configs ./examples/lm_datasets_configs/my_system.yaml
+llm-datasets extract_text redpajama_book --configs ./examples/llm_datasets_configs/my_system.yaml
 ```
 
 ## Dataset selection and sampling
@@ -30,7 +30,7 @@ The configuration files are also needed for specifying the final dataset composi
 The following example shows a config for an Italian dataset:
 
 ```yaml
-# ./examples/lm_datasets_configs/italian_data.yaml
+# ./examples/llm_datasets_configs/italian_data.yaml
 
 # a fixed random seed for shuffling etc.
 seed: 0
@@ -59,6 +59,6 @@ To use this config, provide the path in the `--configs` argument:
 
 ```bash
 # compose final dataset
-lm-datasets compose ... --configs ./examples/lm_datasets_configs/italian_data.yaml
+llm-datasets compose ... --configs ./examples/llm_datasets_configs/italian_data.yaml
 ```
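How multiple `--configs` files combine is not spelled out in this diff. A plausible sketch of last-wins merging (an assumption for illustration, not the documented behavior of `llm_datasets.utils.config.Config`):

```python
# merge_configs.py -- illustrative sketch only; llm-datasets' actual merge
# semantics may differ (top-level last-wins merging is an assumption).
import yaml  # requires PyYAML

def merge_configs(paths):
    merged = {}
    for path in paths:
        with open(path) as f:
            data = yaml.safe_load(f) or {}
        merged.update(data)  # later files override earlier ones (assumed)
    return merged

# Usage: system-specific paths override the shared dataset config on key clashes.
config = merge_configs([
    "./examples/llm_datasets_configs/italian_data.yaml",
    "./examples/llm_datasets_configs/my_system.yaml",
])
print(config.get("seed"), config.get("local_dirs_by_source_id"))
```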
docs/extract-text-data.md (4 changes: 2 additions & 2 deletions)

@@ -4,11 +4,11 @@
 To download and extract the plain text of one or more datasets, run the following command:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR
 ```
 
 By default, output is saved as JSONL files. To change the output format, you can use the `--output_format` argument as below:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
 ```
docs/getting-started.md (18 changes: 9 additions & 9 deletions)

@@ -2,17 +2,17 @@
 
 ## Installation
 
-Install the `lm-datasets` package with [pip](https://pypi.org/project/lm-datasets/):
+Install the `llm-datasets` package with [pip](https://pypi.org/project/llm-datasets/):
 
 ```bash
-pip install lm-datasets
+pip install llm-datasets
 ```
 
-To keep the default installation minimal, `lm-datasets` provides optional dependencies for some use cases.
+To keep the default installation minimal, `llm-datasets` provides optional dependencies for some use cases.
 For example, if you want text extraction for all available datasets, run:
 
 ```bash
-pip install lm-datasets[datasets]
+pip install llm-datasets[datasets]
 ```
 
 ## Quick start
@@ -22,31 +22,31 @@ pip install lm-datasets[datasets]
 To download and extract the plain text of one or more datasets, run the following command:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR
 ```
 
 By default, output is saved as JSONL files. To change the output format, you can use the `--output_format` argument as below:
 
 ```bash
-lm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
+llm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
 ```
 
 ### Available datasets
 
 A list or table of all available datasets can be printed with the following command:
 
 ```bash
-lm-datasets print_stats --print_output md
+llm-datasets print_stats --print_output md
 ```
 
 ### Pipeline commands
 
 ```
-usage: lm-datasets <command> [<args>]
+usage: llm-datasets <command> [<args>]
 positional arguments:
   {chunkify,collect_metrics,compose,convert_parquet_to_jsonl,extract_text,hf_upload,print_stats,shuffle,train_tokenizer}
-                        lm-datasets command helpers
+                        llm-datasets command helpers
     chunkify            Split the individual datasets into equally-sized file chunks (based on bytes or rows)
     collect_metrics     Collect metrics (token count etc.) from extracted texts
    compose             Compose the final train/validation set based on the individual datasets
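The subcommands above are typically chained into a pipeline. A hypothetical driver sketch in Python; the step order is an assumption based on the command descriptions, and only arguments documented elsewhere in this commit are used:

```python
# run_pipeline.py -- hypothetical driver; the order of steps is an assumption,
# not a documented recipe from llm-datasets.
import subprocess

DATASET_ID = "pg19"                                   # assumed example dataset
TEXT_DATA_DIR = "/data/my_text_data"                  # assumed paths
COMPOSED_DATA_DIR = "/data/my_composed_data/train/"
CONFIG = "my_dataset.yaml"

steps = [
    # 1. Download the raw data and extract plain text (JSONL by default).
    ["llm-datasets", "extract_text", DATASET_ID, TEXT_DATA_DIR],
    # 2. Compose the final training set from the extracted texts.
    [
        "llm-datasets", "compose",
        "--split=train",
        f"--configs={CONFIG}",
        f"--text_data_dir={TEXT_DATA_DIR}",
        f"--composed_data_dir={COMPOSED_DATA_DIR}",
    ],
]

for cmd in steps:
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # fail fast if a step errors
```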
docs/index.md (4 changes: 2 additions & 2 deletions)

@@ -1,6 +1,6 @@
-# lm-datasets
+# llm-datasets
 
-**lm-datasets** is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.
+**llm-datasets** is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.
 
 - [Getting started](getting-started.md)
 - [Config files](config-files.md)
examples/custom_datasets/README.md (2 changes: 1 addition & 1 deletion)

@@ -8,5 +8,5 @@
 To load the registered datasets in the pipeline commands, you need to specify the `--extra_dataset_registries` argument:
 
 ```bash
-lm-datasets compose ... --extra_dataset_registries=my_datasets.dataset_registry
+llm-datasets compose ... --extra_dataset_registries=my_datasets.dataset_registry
 ```
examples/custom_datasets/my_datasets/csv_example.py (2 changes: 1 addition & 1 deletion)

@@ -1,7 +1,7 @@
 import logging
 import pandas as pd
 from pathlib import Path
-from lm_datasets.datasets.base import BaseDataset, Availability, License
+from llm_datasets.datasets.base import BaseDataset, Availability, License
 
 logger = logging.getLogger(__name__)
 
examples/custom_datasets/my_datasets/pg19.py (4 changes: 2 additions & 2 deletions)

@@ -1,5 +1,5 @@
-from lm_datasets.datasets.hf_dataset import HFDataset
-from lm_datasets.datasets.base import License, Availability
+from llm_datasets.datasets.hf_dataset import HFDataset
+from llm_datasets.datasets.base import License, Availability
 
 
 class PG19Dataset(HFDataset):
mkdocs.yml (10 changes: 5 additions & 5 deletions)

@@ -1,11 +1,11 @@
-site_name: "lm-datasets: Documentation"
-site_url: https://github.com/malteos/lm-datasets/
+site_name: "llm-datasets: Documentation"
+site_url: https://github.com/malteos/llm-datasets/
 
-site_description: "Documentation of the lm-datasets framework."
-site_author: "Malte Ostendorff and lm-datasets contributors"
+site_description: "Documentation of the llm-datasets framework."
+site_author: "Malte Ostendorff and llm-datasets contributors"
 docs_dir: docs/
 repo_name: "GitHub"
-repo_url: "https://github.com/malteos/lm-datasets/"
+repo_url: "https://github.com/malteos/llm-datasets/"
 
 nav:
   - "Home": index.md
