Commit
Porting missing code taggers, adding repetition tagger (#86)
- adds support for taggers that use metadata
- ports code taggers from `allenai/LLM`
- adds new taggers to count repetitions with regex and tokenizers
- added tagger to count length without whitespaces
- added script to make plots for dolma papers (`scripts/dolma_paper_plots.sh`, `scripts/wandb_to_plot.py`)
- added script to find document from tokenizer offset (`scripts/find_offset.py`)
- added tests for new taggers
- improved GitHub Action to cache state
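Two of the ideas named in the commit message, counting repetitions with a regex and counting length without whitespace, can be illustrated with a minimal sketch. This is not the code added in this commit: the function names `length_no_whitespace` and `find_repetitions`, and the parameters `min_repeats` and `max_unit`, are hypothetical and chosen only for illustration.

```python
import re
from typing import List, Tuple


def length_no_whitespace(text: str) -> int:
    """Count the characters in `text`, skipping every whitespace character.

    Hypothetical helper; not the tagger's actual API.
    """
    return sum(1 for ch in text if not ch.isspace())


def find_repetitions(
    text: str, min_repeats: int = 3, max_unit: int = 20
) -> List[Tuple[int, int, str]]:
    """Find runs where a short unit repeats back to back.

    Returns (start, end, unit) for each run in which a unit of 1 to
    `max_unit` characters occurs at least `min_repeats` times in a row.
    Hypothetical helper; the thresholds are assumptions.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    # Lazy group captures the shortest unit; the backreference then
    # requires at least (min_repeats - 1) further copies of it.
    pattern = re.compile(
        r"(.{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1), re.DOTALL
    )
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]
```

A tagger built on this idea could report each returned span together with its length as a score, but how the taggers in this commit actually score spans is not shown on this page.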
Showing 35 changed files with 3,246 additions and 359 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -67,3 +67,7 @@ target/ |
| | | |
| | | # ignore vscode directory |
| | | .vscode |
| | | |
| | | # ignore temporary directories |
| | | /tmp/ |
| | | /temp/ |