Releases: allenai/dolma
v1.0.5
v1.0.4
What's Changed
- Bump rustls from 0.21.10 to 0.21.11 in the cargo group across 1 directory by @dependabot in #149
- fix divide by 0 in gopher tagger by @peterbjorgensen in #148
- Fixing dtype option not being correctly propagated by @soldni in #154
- Add support for parsing WARC by @soldni in #153
- Reducing hash calls by @Whattabatt in #156
- Bump rustls from 0.21.11 to 0.21.12 in the cargo group across 1 directory by @dependabot in #155
- Adding Quality Classifier from Dolma 1.7 by @soldni in #163
- Adds ZST support in Deduper and Mixer by @soldni in #170
- Workaround to fix memory leak in HuggingFace tokenizer by @soldni in #169
- Adding partition logic by @Whattabatt in #161
- added option for tokenizer to split on special tokens by @soldni in #176
- Version bump for new release (1.0.4) by @soldni in #179
New Contributors
- @Whattabatt made their first contribution in #156
Full Changelog: v1.0.3...v1.0.4
v1.0.3
What's Changed
- Fix local shuffling failure by @soldni in #140
- Fix issue in getting started tutorial using wikipedia data by @RohitRathore1 in #117
- Add an option to improve tokenization shuffling by @soldni in #141
- Optionally add total/sum to output of analyzer by @soldni in #144
- Add extra tests for multi-byte unicode spans in deduper. by @soldni in #145
- Bump s3 client lib and parameterize region in s3 tests + devcontainer by @undfined in #147
New Contributors
- @RohitRathore1 made their first contribution in #117
- @undfined made their first contribution in #147
Full Changelog: v1.0.2...v1.0.3
v1.0.2
What's Changed
- Taggers for URL filtering by @soldni in #112
- Updated CFF and Bibtex by @soldni in #118
- Add preliminary Dolma v1.7 configurations, fix corner case in tokens. by @soldni in #120
- Update CITATION.cff by @soldni in #126
- Option to use ngram overlap to dedupe paragraphs by @rodneykinney in #122
- Tagger modules import (fix for #128) by @soldni in #129
- Added Support for JQ syntax in include/exclude mixer config by @soldni in #131
- Added JQ syntax for replacements + added minimum score. by @soldni in #133
- Bump the cargo group group with 1 update by @dependabot in #132
- Improves tool to compute statistics; adds deduplication options. by @soldni in #135
- use precompiled regex when loading url blocklists by @peterbjorgensen in #137
Full Changelog: v1.0.1...v1.0.2
v1.0.1
What's Changed
- Update README.md by @eltociear in #115
- do not overwrite tagger outputs with the same output path, fixes #113 by @peterbjorgensen in #114
- Fix broken data sheet link in README by @simonw in #107
- Modify CI to build when version is incremented; increment to v1.0.1 by @soldni in #116
New Contributors
- @eltociear made their first contribution in #115
- @simonw made their first contribution in #107
Full Changelog: v1.0.0...v1.0.1
v1.0.0
What's Changed
- Add robust median to gopher filter by @KennethEnevoldsen in #98
- Disambiguating that the repo is for the dolma toolkit in various docs by @arnavic in #104
- V1.0 candidate; new deduper options, new taggers by @soldni in #100
- Fixing Errors in Linux Build by @soldni in #105
New Contributors
- @KennethEnevoldsen made their first contribution in #98
- @arnavic made their first contribution in #104
Full Changelog: v0.9.4...v1.0.0
v0.9.4
What's Changed
- Bump h2 from 0.3.20 to 0.3.24 by @dependabot in #101
- BOS/EOS/PAD options in `tokens` cli; speed up tokenization by segmenting paragraphs. by @soldni in #102
- Fixed Dangling CLI Options; E2E Tokenizer Tests by @soldni in #103
Full Changelog: v0.9.2...v0.9.4
v0.9.2
What's Changed
- Remove unnecessary spawn in tokenizer, fix config with multiple paths by @soldni in #67
- Add tagger_modules option to tagger cli by @peterbjorgensen in #69
- feature to get the complement of a hash sample by @IanMagnusson in #72
- Fix Hardcoded Tokenizer by @soldni in #71
- Fix a few issues of the FixedBucketsValTracker by @peterbjorgensen in #73
- Add attribute correlations by @Muennighoff in #68
- Porting missing code filtering rules to dolma repo by @soldni in #86
- Disable cache in CI to prevent build failures by @soldni in #90
- Reddit processing code by @drschwenk in #74
- update readme by @kyleclo in #95
- code/reasoning evaluation script by @benbogin in #94
- Add The Stack statistics by @Muennighoff in #92
- Fixing Build Config Issues by @soldni in #99
New Contributors
- @peterbjorgensen made their first contribution in #69
- @IanMagnusson made their first contribution in #72
- @drschwenk made their first contribution in #74
- @benbogin made their first contribution in #94
Full Changelog: v0.9.1...v0.9.2
v0.9.1
What's Changed
- Fix Jekyll Docs Build by @soldni in #55
- Adding Citation text back to README by @soldni in #56
- Bump rustix from 0.37.20 to 0.37.25 by @dependabot in #59
- Documentation on BaseParallelProcessor by @soldni in #62
- Add download instruction by @Muennighoff in #63
- Fix spawn method for multiprocessing by @soldni in #64
- Fix hardcoded URL by @soldni in #65
- Fix Accidental Override of Boolean Value by @soldni in #66
New Contributors
- @Muennighoff made their first contribution in #63
Full Changelog: v0.9.0...v0.9.1
v0.9.0
What's Changed
- Skipping AWS checks when aws access key is not available by @soldni in #28
- env variable is not passed to tests by @soldni in #29
- Fix make by @chris-ha458 in #24
- Fix `make` more by @chris-ha458 in #31
- ff by @soldni in #36
- Adding C4 example, dryrun mode, profiling taggers by @soldni in #37
- Only run Python style checks on source and tests by @soldni in #38
- fix rust parts by @chris-ha458 in #23
- Add rust unit tests by @chris-ha458 in #35
- Bump webpki from 0.22.0 to 0.22.2 by @dependabot in #52
- Adding Tokenizer, Writing Documentation, Misc Bugs & CLI improvements by @soldni in #54
New Contributors
- @chris-ha458 made their first contribution in #24
- @dependabot made their first contribution in #52
Full Changelog: v0.8.0...v0.9.0