update README.md
cahya-wirawan committed Aug 9, 2024
1 parent 631eb06 commit 91d05ab
5 changes: 4 additions & 1 deletion README.md
@@ -41,7 +41,7 @@ $ cd rwkv-tokenizer
$ pytest
```

- We did a performance comparison on [the simple English Wikipedia dataset 20220301.en](https://huggingface.co/datasets/legacy-datasets/wikipedia) among the following tokenizers:
+ We did a performance comparison on [the simple English Wikipedia dataset 20220301.en](https://huggingface.co/datasets/legacy-datasets/wikipedia)* among the following tokenizers:
- The original RWKV tokenizer (BlinkDL)
- Hugging Face implementation of the RWKV tokenizer
- Hugging Face LLaMA tokenizer
@@ -55,6 +55,9 @@ tokenizer is around 17x faster than the original tokenizer and 9.6x faster than

![performance-comparison](data/performance-comparison.png)
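A throughput comparison like the one in the plot above can be sketched with a small stdlib timing harness. This is a minimal sketch, not the benchmark the repository used; `str.split` is a hypothetical stand-in tokenizer, to be replaced with the real encode functions (e.g. the RWKV or Hugging Face tokenizers):

```python
import time

def throughput(tokenize, texts):
    """Return tokens per second for a tokenize(text) -> list callable."""
    start = time.perf_counter()
    n_tokens = sum(len(tokenize(t)) for t in texts)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stand-in corpus and tokenizer; swap in the Wikipedia articles and the
# tokenizers under comparison to reproduce a plot like the one above.
texts = ["hello tokenizer benchmark"] * 1000
rate = throughput(str.split, texts)  # tokens/second, always positive
```

Running the same harness once per tokenizer over the same `texts` gives directly comparable tokens-per-second numbers.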

+ *The simple English Wikipedia dataset can be downloaded as a JSONL file from
+ https://huggingface.co/datasets/cahya/simple-wikipedia/resolve/main/simple-wikipedia.jsonl?download=true
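A JSONL file holds one JSON object per line, so it can be streamed without loading the whole dataset. A minimal stdlib sketch for fetching and parsing it (the download line is commented out; `load_jsonl` is an illustrative helper, not part of the repository):

```python
import json
import urllib.request

URL = ("https://huggingface.co/datasets/cahya/simple-wikipedia/"
       "resolve/main/simple-wikipedia.jsonl?download=true")

def load_jsonl(lines):
    """Parse JSON Lines input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

# Download once, then parse the articles line by line:
# urllib.request.urlretrieve(URL, "simple-wikipedia.jsonl")
# with open("simple-wikipedia.jsonl", encoding="utf-8") as f:
#     articles = load_jsonl(f)
```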

## Bugs
~~There are still bugs where some characters are not encoded correctly.~~ These bugs were fixed in version 0.3.0.
*This tokenizer is my very first Rust program, so it might still have many bugs and silly code :-)*
