This project introduces a robust and efficient tokenizer using Byte-Pair Encoding (BPE) technology, tailored for processing and analyzing linguistic data. Initially configured for the English language, the tokenizer's flexible architecture is capable of supporting a wide range of languages, making it an ideal solution for global text processing applications.
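At its core, BPE builds a vocabulary by starting from individual characters and repeatedly merging the most frequent adjacent pair of symbols into a new token. The toy sketch below illustrates that general idea only; it is not this project's implementation, and every helper name in it is hypothetical:

```rust
use std::collections::HashMap;

// Find the most frequent adjacent pair of symbols across all words.
fn most_frequent_pair(words: &[Vec<String>]) -> Option<(String, String)> {
    let mut counts: HashMap<(String, String), usize> = HashMap::new();
    for word in words {
        for pair in word.windows(2) {
            *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
        }
    }
    counts.into_iter().max_by_key(|(_, c)| *c).map(|(pair, _)| pair)
}

// Merge every occurrence of the chosen pair into a single new symbol.
fn merge_pair(words: &mut [Vec<String>], pair: &(String, String)) {
    for word in words.iter_mut() {
        let mut merged = Vec::with_capacity(word.len());
        let mut i = 0;
        while i < word.len() {
            if i + 1 < word.len() && word[i] == pair.0 && word[i + 1] == pair.1 {
                merged.push(format!("{}{}", pair.0, pair.1));
                i += 2;
            } else {
                merged.push(word[i].clone());
                i += 1;
            }
        }
        *word = merged;
    }
}

fn main() {
    // Start from characters; each merge creates a new, longer token.
    let mut words: Vec<Vec<String>> = ["lower", "lowest", "low"]
        .iter()
        .map(|w| w.chars().map(|c| c.to_string()).collect())
        .collect();
    for _ in 0..3 {
        if let Some(pair) = most_frequent_pair(&words) {
            merge_pair(&mut words, &pair);
        }
    }
    println!("{:?}", words); // [["lowe", "r"], ["lowe", "s", "t"], ["low"]]
}
```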
- Rich Vocabulary: Comes pre-loaded with over 25,000 tokens, allowing effective handling of diverse and unseen textual inputs.
- High Performance: Optimized for speed and efficiency, the tokenizer processes large volumes of text swiftly, facilitating rapid data throughput without sacrificing accuracy.
- Multi-language Capability: Designed to be language-agnostic, offering customization options for modeling the specific linguistic characteristics of any target language.
- Rust: Employs Rust's performance characteristics and memory safety guarantees to ensure high-speed data processing with minimal overhead. All text processing is handled in Rust, with convenient interfaces for both Python and JavaScript to allow for widespread usage.
- PyO3: Enables seamless integration with Python, allowing the tokenizer to be used as a native Python extension (see the binding sketch after this list). This integration provides the benefits of Rust's performance and safety in Python's flexible and dynamic ecosystem.
- WebAssembly: Compiled to WebAssembly for high-performance use in web applications, ensuring that the tokenizer can run directly in the browser at near-native speed.
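To give a rough sense of how the PyO3 bindings could be wired up, here is a minimal, hypothetical sketch. Only the `rust_bpe` module name and the `TokenizerPy` class come from the usage examples below; the core `Tokenizer` type, its error handling, the `u32` index type, and the exact PyO3 signatures are assumptions and may differ from this repository's actual code:

```rust
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Python-facing wrapper around the core Rust tokenizer (illustrative only).
#[pyclass]
struct TokenizerPy {
    inner: Tokenizer, // assumed core type provided by this crate
}

#[pymethods]
impl TokenizerPy {
    /// Load a trained tokenizer from a JSON file.
    #[new]
    fn new(path: &str) -> PyResult<Self> {
        let inner = Tokenizer::load(path)
            .map_err(|e| PyValueError::new_err(e.to_string()))?; // assumes a Display-able error
        Ok(Self { inner })
    }

    /// Tokenize text into token indices.
    fn tokenize(&self, text: &str) -> Vec<u32> {
        self.inner.tokenize(text)
    }

    /// Convert token indices back into text.
    fn detokenize(&self, tokens: Vec<u32>) -> String {
        self.inner.detokenize(&tokens)
    }
}

/// Module definition so that `import rust_bpe` works from Python.
#[pymodule]
fn rust_bpe(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_class::<TokenizerPy>()?;
    Ok(())
}
```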
For detailed instructions on how to integrate and utilize the tokenizer in both Python environments and web applications, please refer to the Usage section.
Planned future enhancements include:
- GPU Implementation: Using WGPU for processing larger training corpora and achieving faster training and tokenization.
- Expanded Language Support: Adding support for capitalized and multilingual tokenization once the performance improvements above are in place.
To build the project, run the following commands in your terminal:

```sh
git clone https://github.com/krrud/rust-bpe.git
cd rust-bpe
cargo build
```
To use the tokenizer in Python, first install the .whl:

```sh
pip install path/to/tokenizer-build.whl
```
Then use it as follows:
```python
import rust_bpe

tokenizer = rust_bpe.TokenizerPy("/path/to/trained-tokenizer.json")
tokenized = tokenizer.tokenize("some text to tokenize")
detokenized = tokenizer.detokenize(tokenized)
```
To use the tokenizer in the browser via Wasm:
```javascript
async function wasmTokenizer(textInput) {
  try {
    // Import and initialize the module
    const {default: init, TokenizerJs} = await import('/path/to/wasm-pkg');
    await init();

    // Load the trained tokenizer
    const fetchTokenizer = await fetch("/path/to/trained-tokenizer.json");
    if (!fetchTokenizer.ok) {
      throw new Error("Failed to fetch tokenizer");
    }
    const {vocabulary, merge_rules, config} = await fetchTokenizer.json();

    // Instantiate the tokenizer
    const tokenizer = new TokenizerJs(vocabulary, merge_rules, config);

    // Use the tokenizer
    const tokenized = tokenizer.tokenize(textInput);
    const detokenized = tokenizer.detokenize(tokenized);
    return {tokenized, detokenized};
  } catch (error) {
    console.error("Failed to load WASM module or tokenizer:", error);
  }
}
```
To train the tokenizer with Rust directly:
```rust
// Load the dataset from the specified directory
let corpus = Tokenizer::process_dataset("path/to/dataset");

// Set the number of iterations for training
let iterations = 25000;

// Specify the output path for the trained tokenizer model
let output = "./src/tokenizer_train.json";

// Optional: start from a pretrained model if available
// Replace with Some("path/to/pretrained_tokenizer.json") if applicable
let pretrained_model = None;

// Train the tokenizer
let tokenizer = Tokenizer::train_cpu(
    &corpus,
    iterations,
    output,
    pretrained_model,
);
```
To tokenize text to indices, convert to token strings, or detokenize back to the input:
```rust
let tokenizer = Tokenizer::load("path/to/your/trained_tokenizer.json").unwrap();
let tokens = tokenizer.tokenize("text to tokenize");  // Returns token indices
let token_strings = tokenizer.get_tokens(&tokens);    // Converts indices to their associated strings
let detokenized = tokenizer.detokenize(&tokens);      // Converts back to the original text
```
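As a quick sanity check, the round trip should reproduce the original input (assuming the text is representable by the trained vocabulary; how unseen sequences are handled is not covered here):

```rust
let input = "text to tokenize";
let tokens = tokenizer.tokenize(input);
assert_eq!(tokenizer.detokenize(&tokens), input);
```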
Training data was graciously provided by:
This project is licensed under the MIT License - see the LICENSE.md file for details.