Added debugging tips for OSCAR filtering
TevenLeScao authored Feb 20, 2023
1 parent 6a31b32 commit 9d05884
Showing 1 changed file with 6 additions and 0 deletions.
@@ -29,6 +29,12 @@ The filtering parameters for each language are to be specified in the file [para

Run the filtering with the file [main_filtering.py](https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/main_filtering.py), specifying the dataset used and the links to the downloaded models. The different filters are coded in the file [filtering.py](https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/filtering.py).
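Conceptually, each filter in [filtering.py](https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/filtering.py) is a predicate on a document, and a document is kept only if it passes every filter for its language. A minimal stdlib-only sketch of that shape (function names, thresholds, and the stopword set below are illustrative, not the repository's actual API):

```python
# Sketch of a predicate-based filtering pipeline: a document is kept
# only if it passes every filter. Names and thresholds are illustrative.

def length_filter(doc, min_len=10, max_len=100_000):
    # Drop documents that are suspiciously short or long.
    return min_len <= len(doc) <= max_len

def stopword_ratio_filter(doc, stopwords=frozenset({"the", "and", "of"}),
                          min_ratio=0.01):
    # Natural-language text contains common stopwords; a very low ratio
    # suggests boilerplate, code, or garbled content.
    words = doc.lower().split()
    if not words:
        return False
    ratio = sum(w in stopwords for w in words) / len(words)
    return ratio >= min_ratio

def run_filters(docs, filters):
    """Keep only documents that pass every filter."""
    return [d for d in docs if all(f(d) for f in filters)]
```

The real pipeline reads its per-language thresholds from the parameters file and runs over a `datasets` dataset rather than a Python list, but the keep/drop logic is the same.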

Some common issues:
- OSCAR-v2 metadata can trigger cryptic Arrow errors, and it also takes up significant disk space. Setting the `remove_meta` flag strips the metadata and avoids both problems.
- Documents that are too long can cause hangs. Use `max_len_prefilter` to remove these outliers.
- Out-of-memory conditions can kill a worker process silently, leaving the run hanging with no error message. Reducing the number of processes helps in this case.
- If your dataset is very large, you may run out of disk space during the saving stage. In that case, an equivalent `.arrow` file already exists in your `datasets` cache (typically the last-modified file under `.cache/huggingface/datasets/<dataset_name>/....`). The saving stage mainly exists for clarity and to avoid manipulating the `datasets` cache directly.
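To recover the cached output after a failed save, the last-modified `.arrow` file under the cache directory is usually the one you want. A small stdlib-only helper to locate it (the default cache path below is the usual Hugging Face location, but yours may differ):

```python
from pathlib import Path

def newest_arrow_file(cache_dir="~/.cache/huggingface/datasets"):
    """Return the most recently modified .arrow file under the datasets
    cache, which is usually the output of the last filtering run.
    Returns None if no .arrow file is found."""
    root = Path(cache_dir).expanduser()
    candidates = list(root.rglob("*.arrow"))
    if not candidates:
        return None
    # The last-modified file is the most recently written dataset shard.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Note that large datasets are written as multiple shards, in which case you may want all the recent `.arrow` files in the same directory, not just the newest one.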

#### 5. Do the deduplication

Do the deduplication, which is detailed in the sub folder [deduplicate](https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/01b_oscar_cleaning_and_filtering/deduplicate).
