v1.0.4
What's Changed
- Bump rustls from 0.21.10 to 0.21.11 in the cargo group across 1 directory by @dependabot in #149
- Fix divide-by-zero in gopher tagger by @peterbjorgensen in #148
- Fixing dtype option not being correctly propagated by @soldni in #154
- Add support for parsing WARC by @soldni in #153
- Reducing hash calls by @Whattabatt in #156
- Bump rustls from 0.21.11 to 0.21.12 in the cargo group across 1 directory by @dependabot in #155
- Adding Quality Classifier from Dolma 1.7 by @soldni in #163
- Adds ZST support in Deduper and Mixer by @soldni in #170
- Workaround to fix memory leak in HuggingFace tokenizer by @soldni in #169
- Adding partition logic by @Whattabatt in #161
- Added option for tokenizer to split on special tokens by @soldni in #176
- Version bump for new release (1.0.4) by @soldni in #179
New Contributors
- @Whattabatt made their first contribution in #156
Full Changelog: v1.0.3...v1.0.4