Merge branch 'dev'

malteos committed Mar 26, 2024
2 parents 4636b43 + 7b9f4e0 commit 7c09588
Showing 251 changed files with 28,913 additions and 495 deletions.
15 changes: 14 additions & 1 deletion .github/workflows/publish_pypi.yml
@@ -1,13 +1,26 @@
# Manual upload via twine:
# 1) Build project
# $ python setup.py sdist bdist_wheel
# 2) Upload via Twine with API token (user: __token__ password: <your API token>)
# $ python -m twine upload dist/*

name: Publish distributions 📦 to PyPI

on:
workflow_dispatch:
branches: [main]
branches:
- master
- main

jobs:
build-n-publish:
name: Build and publish 🐍 distributions 📦 to PyPI
runs-on: ubuntu-latest
permissions:
id-token: write
environment:
name: pypi
url: https://pypi.org/p/lm-datasets
steps:
- uses: actions/checkout@v3
- name: Setup
12 changes: 12 additions & 0 deletions .vscode/launch.json
@@ -59,6 +59,18 @@
"env": {
"_PYTEST_RAISE": "1"
},
},
{
"name": "mkdocs serve",
"type": "debugpy",
"request": "launch",
"module": "mkdocs",
"console": "integratedTerminal",
"justMyCode": false,
"env": {},
"args": [
"serve",
]
}
]
}
1 change: 1 addition & 0 deletions .vscode/settings.json
@@ -58,4 +58,5 @@
"editor.defaultFormatter": "ms-python.black-formatter"
},
"python.formatting.provider": "none",
"python.testing.pytestEnabled": true,
}
202 changes: 101 additions & 101 deletions README.md
@@ -31,126 +31,126 @@ pip install lm-datasets[datasets]
To download and extract the plain-text of one or more datasets, run the following command:

```bash
lm_datasets extract_text $DATASET_ID $OUTPUT_DIR
lm-datasets extract_text $DATASET_ID $OUTPUT_DIR
```

By default, output is saved as JSONL files. To change the output format, you can use the `--output_format` argument as below:

```bash
lm_datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
lm-datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
```
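The extracted output can then be consumed with standard tooling. The sketch below is not part of lm-datasets; the file name and the single-column schema are illustrative assumptions. It writes a tiny JSONL file of the kind the extraction step produces and reads it back with pandas:

```python
# Hypothetical sketch (not part of lm-datasets): reading extracted text
# back from a JSONL output file. The file name and schema are assumptions;
# actual output names depend on the dataset ID.
import json

import pandas as pd

# Create a tiny stand-in for an extracted JSONL file
rows = [{"text": "first document"}, {"text": "second document"}]
with open("example_output.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# JSONL: one JSON object per line, hence lines=True
df = pd.read_json("example_output.jsonl", lines=True)
print(df["text"].tolist())  # -> ['first document', 'second document']
```

For Parquet output, `pd.read_parquet` would play the same role.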

### Available datasets

A list or table of all available datasets can be printed with the following command:

```bash
lm_datasets print_stats --print_output md
lm-datasets print_stats --print_output md
```
#### Token count by language

| Language | Tokens |
|:-----------|:---------|
| bg | 53 B |
| ca | 5 B |
| code | 250 B |
| cs | 128 B |
| da | 34 B |
| de | 795 B |
| el | 108 B |
| en | 6 T |
| es | 674 B |
| et | 15 B |
| eu | 696 M |
| fi | 55 B |
| fr | 655 B |
| ga | 767 M |
| gl | 70 M |
| hr | 8 B |
| hu | 179 B |
| it | 386 B |
| lt | 24 B |
| lv | 14 B |
| mt | 4 B |
| nl | 238 B |
| nn | 307 M |
| no | 9 B |
| pl | 223 B |
| pt | 187 B |
| ro | 77 B |
| sh | 2 M |
| sk | 47 B |
| sl | 11 B |
| sr | 10 B |
| sv | 89 B |
| uk | 47 B |
| bg | 31 B |
| ca | 6 B |
| code | 212 B |
| cs | 42 B |
| da | 13 B |
| de | 160 B |
| el | 63 B |
| en | 1 T |
| es | 101 B |
| et | 9 B |
| eu | 1 B |
| fi | 19 B |
| fr | 84 B |
| ga | 274 M |
| gl | 231 M |
| hr | 11 B |
| hu | 52 B |
| it | 61 B |
| lt | 7 B |
| lv | 5 B |
| mt | 4 B |
| nl | 44 B |
| nn | 76 M |
| no | 13 B |
| pl | 45 B |
| pt | 46 B |
| ro | 18 B |
| sh | 184 M |
| sk | 32 B |
| sl | 13 B |
| sr | 11 B |
| sv | 19 B |
| uk | 56 B |

#### Token count by source

| Source | Tokens |
|:---------------------------------|:---------|
| academic_slovene_kas | 1 B |
| bgnc_admin_eur | 79 M |
| bgnc_news_corpus | 18 M |
| brwac | 3 B |
| bulgarian_news | 283 M |
| bulnc | 567 M |
| cabernet | 712 M |
| cc_gigafida | 127 M |
| colossal_oscar | 208 B |
| croatian_news_engri | 695 M |
| curlicat | 410 M |
| danewsroom | 472 M |
| danish_gigaword | 1 B |
| dewac | 2 B |
| dialogstudio | 0 |
| dk_clarin | 441 M |
| enc2021 | 0 |
| estonian_reference_corpus | 175 M |
| eurlex | 121 B |
| euscrawl | 423 M |
| ga_bilingual_legistation | 4 M |
| ga_universal_dependencies | 3 M |
| greek_legal_code | 45 M |
| greek_web_corpus | 3 B |
| hrwac | 1 B |
| itwac | 2 B |
| korpus_malti | 366 M |
| legal_mc4 | 29 B |
| macocu | 23 B |
| marcell_legislative_subcorpus_v2 | 31 M |
| norwegian_cc | 5 B |
| openlegaldata | 10 B |
| oscar | 9 T |
| oscar_opengptx | 245 B |
| parlamento_pt | 819 M |
| pes2o | 42 B |
| pl_nkjp | 1 M |
| pl_parliamentary_corpus | 671 M |
| proof_pile | 8 B |
| redpajama | 46 B |
| seimas_lt_en | 48 k |
| sk_court_decisions | 11 B |
| sk_laws | 45 M |
| slwac_web | 1 B |
| sonar | 500 M |
| sonar_new_media | 36 M |
| spanish_legal | 3 B |
| srpkor | 0 |
| starcoder | 250 B |
| state_related_latvian_web | 1 M |
| styria_news | 409 M |
| sv_gigaword | 1 B |
| syn_v9 | 5 B |
| uk_laws | 579 M |
| wiki | 12 B |
| wikibooks | 353 M |
| wikihow | 2 M |
| wikinews | 79 M |
| wikiquote | 268 M |
| wikisource | 2 B |
| wikivoyage | 132 M |
| ylenews | 0 |
| curlicat | 963 M |
| macocu | 74 B |
| redpajama | 44 B |
| wura | N/A |
| wikihow | 99 M |
| pes2o | 57 B |
| proof_pile | 12 B |
| pile_of_law | 111 B |
| math_amps | 7 B |
| edgarcorpus | N/A |
| bulgarian_news | 640 M |
| bulnc | 4 B |
| openlegaldata | 7 B |
| dewac | 3 B |
| ga_bilingual_legistation | 4 k |
| ga_universal_dependencies | 40 k |
| hrwac | 2 B |
| styria_news | 432 M |
| croatian_news_engri | 1 B |
| itwac | 3 B |
| korpus_malti | 816 M |
| sonar | 746 M |
| cc_gigafida | 260 M |
| academic_slovene_kas | 3 B |
| slwac_web | 3 B |
| sk_court_decisions | 24 B |
| sk_laws | 105 M |
| syn_v9 | 13 B |
| cs_en_parallel | 473 M |
| danish_gigaword | 2 B |
| danewsroom | 835 M |
| dk_clarin | 80 M |
| cabernet | 599 M |
| norwegian_cc | 11 B |
| pl_nkjp | 3 M |
| pl_parliamentary_corpus | 1 B |
| parlamento_pt | 732 M |
| brwac | 4 B |
| seimas_lt_en | 12 k |
| state_related_latvian_web | 52 k |
| greek_legal_code | 80 M |
| greek_web_corpus | 11 B |
| estonian_reference_corpus | 481 M |
| enc2021 | 3 B |
| ekspress | 723 M |
| euscrawl | 831 M |
| spanish_legal | 1 B |
| ylenews | 286 M |
| sv_gigaword | 528 M |
| srpkor | 866 M |
| marcell_legislative_subcorpus_v2 | 1 B |
| uk_laws | 2 B |
| eurlex | 41 B |
| legal_mc4 | 28 B |
| wiki | 21 B |
| wikibooks | 313 M |
| wikiquote | 247 M |
| wikinews | 90 M |
| wikisource | 2 B |
| wikivoyage | 119 M |
| colossal_oscar | 2 T |
| starcoder | 212 B |


### Dataset viewer
@@ -195,20 +195,20 @@ pip install git+https://github.com/malteos/lm-datasets.git@dev

This repository uses git hooks to validate code quality and formatting.

```
```bash
pre-commit install
git config --bool flake8.strict true # Makes the commit fail if flake8 reports an error
```

To run the hooks:
```
```bash
pre-commit run --all-files
```

### Testing

The tests can be executed with:
```
```bash
pytest --doctest-modules --cov-report term --cov=lm_datasets
```

58 changes: 55 additions & 3 deletions docs/add-your-own-data.md
@@ -4,6 +4,9 @@

The first step for adding a new dataset is to write a new dataset class.
If your data comes from a common source such as Huggingface, you can build upon existing abstractions.

### Huggingface dataset

For example, Huggingface datasets only need to specify some metadata, such as the dataset ID and title, plus the column from which the textual data is extracted (by default, the `text` column):

```python
@@ -17,7 +17,7 @@ class PG19Dataset(HFDataset):
TITLE = "Project Gutenberg books published before 1919"
HOMEPAGE = "https://huggingface.co/datasets/pg19"
LICENSE = License("Apache License Version 2.0 (or public domain?)", url="https://www.apache.org/licenses/LICENSE-2.0.html")
CITATION = """@article{raecompressive2019,
CITATION = r"""@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
@@ -35,9 +38,58 @@ class PG19Dataset(HFDataset):
title_column_name = "short_book_title"
```

### CSV dataset

Other datasets may require implementing the full text extraction logic. The example below reads text data from CSV files while excluding specific subsets:

```python
# my_datasets/csv_example.py

import logging
import pandas as pd
from pathlib import Path
from lm_datasets.datasets.base import BaseDataset, Availability, License

logger = logging.getLogger(__name__)


class CSVExampleDataset(BaseDataset):
DATASET_ID = "csv_example"
TITLE = "An example for a dataset from CSV files"
AVAILIBITY = Availability.ON_REQUEST
LANGUAGES = ["en"]
LICENSE = License("mixed")

def get_texts(self):
"""
Extract texts from CSV files (format: "document_id,text,score,url")
"""
# Iterate over CSV files in raw dataset directory
for file_path in self.get_dataset_file_paths(needed_suffix=".csv"):
file_name = Path(file_path).name

if (
file_name.startswith("mc4_")
or file_name.startswith("colossal-oscar-")
or file_name.startswith("wikimedia")
):
# skip subsets that overlap with other datasets (based on file name)
continue

logger.info("Reading CSV: %s", file_path)
try:
# Use chunks to reduce memory consumption
for df in pd.read_csv(file_path, sep=",", chunksize=10_000):
for text in df.text.values:
# Pass extracted text
yield text
except ValueError as e:
logger.error("Error in file %s; error = %s", file_path, e)
```
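The chunked-reading pattern above can be illustrated standalone. The following sketch is hypothetical and independent of lm-datasets: it writes a small CSV in the same format and streams its `text` column in chunks, mirroring what `get_texts` does:

```python
# Standalone sketch of the chunked-CSV pattern (hypothetical; does not
# depend on lm-datasets). The file name and sample values are made up.
import pandas as pd

# Write a small CSV in the "document_id,text,score" shape used above
pd.DataFrame(
    {"document_id": [1, 2, 3], "text": ["a", "b", "c"], "score": [0.1, 0.2, 0.3]}
).to_csv("example.csv", index=False)


def iter_texts(path, chunksize=2):
    # Chunked reading keeps memory bounded for large files
    for df in pd.read_csv(path, sep=",", chunksize=chunksize):
        for text in df["text"].values:
            yield text


texts = list(iter_texts("example.csv"))
print(texts)  # -> ['a', 'b', 'c']
```

Because `get_texts` is a generator, downstream pipeline steps can consume it lazily without loading a whole corpus into memory.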

## Register new dataset classes

Each dataset class needs to be registered with `lm-dataset` such that the commands know what classes are available.
Each dataset class needs to be registered with `lm-datasets` such that the commands know what classes are available.
This can be done by making a new Python module with a `get_registered_dataset_classes` method that returns a list of dataset classes:

```python
# my_datasets/dataset_registry.py
# Minimal sketch (the collapsed example is not shown in this diff);
# the import assumes the CSVExampleDataset class defined above.
from my_datasets.csv_example import CSVExampleDataset


def get_registered_dataset_classes():
    return [CSVExampleDataset]
```

To load the registered datasets in the pipeline commands, you need to specify the `--extra_dataset_registries` argument:

```bash
lm_datasets compose ... --extra_dataset_registries=my_datasets.dataset_registry
lm-datasets compose ... --extra_dataset_registries=my_datasets.dataset_registry
```
3 changes: 3 additions & 0 deletions docs/api/config.md
@@ -0,0 +1,3 @@
# Config

::: lm_datasets.utils.config.Config