v1.0.2
What's Changed
- Taggers for URL filtering by @soldni in #112
- Updated CFF and BibTeX by @soldni in #118
- Add preliminary Dolma v1.7 configurations, fix corner case in tokens. by @soldni in #120
- Update CITATION.cff by @soldni in #126
- Option to use ngram overlap to dedupe paragraphs by @rodneykinney in #122
- Tagger modules import (fix for #128) by @soldni in #129
- Added support for JQ syntax in include/exclude mixer config by @soldni in #131
- Added JQ syntax for replacements + added minimum score. by @soldni in #133
- Bump the cargo group with 1 update by @dependabot in #132
- Improves tool to compute statistics; adds deduplication options. by @soldni in #135
- Use precompiled regex when loading URL blocklists by @peterbjorgensen in #137
Full Changelog: v1.0.1...v1.0.2