Releases: proger/haloop
Training Transformers
This release doubles down on transformers and introduces a training loop program, `hala`. Pretraining bidirectional models with the token denoising objective (aka masked LM) is available via `hala --objective denoise`.
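As background, a denoising (masked LM) step corrupts a fraction of input tokens and trains the model to reconstruct them. Here is a minimal sketch of the standard BERT-style corruption; the masking rate and special-token handling are assumptions and may differ from haloop's exact recipe:

```python
import torch

def denoise_batch(ids: torch.Tensor, mask_id: int, p: float = 0.15):
    """BERT-style masked LM corruption: hide a random subset of tokens
    and compute the loss only at the hidden positions. The 15% rate
    and single [MASK] id are assumptions, not haloop's exact recipe."""
    hidden = torch.rand(ids.shape) < p          # which positions to corrupt
    inputs = ids.masked_fill(hidden, mask_id)   # corrupted model input
    targets = ids.masked_fill(~hidden, -100)    # -100 is ignored by cross_entropy
    return inputs, targets
```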
The first training run on the uk4b dataset is happening here: https://wandb.ai/stud76/ha/runs/tjoqx491?workspace=user-stud76
Existing causal models can now be finetuned with the conditional language modeling objective via `hala --objective cond`.
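Conditional language modeling, as commonly implemented for this kind of finetuning, computes the loss only on the continuation and not on the conditioning prefix. A sketch under that assumption (the prefix-masking convention is not confirmed to be haloop's):

```python
import torch

def cond_targets(ids: torch.Tensor, prefix_len: int) -> torch.Tensor:
    """Next-token targets for a single sequence that skip the
    conditioning prefix: prefix positions get -100, which the
    cross-entropy loss ignores."""
    targets = ids.clone()
    targets[:prefix_len] = -100
    return targets
```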
`hat` is now a REPL for both causal and bidirectional models. The `hat` REPL now supports history thanks to readline.
The RNN training program `hal` now supports training from u16 binary datasets, like `hala` does. This allowed me to train a world model on VQ-VAE-tokenized images.
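A u16 dataset is just a flat binary file of token ids. Here is a sketch of producing and memory-mapping one, assuming raw uint16 ids with no header (the file name and token ids are hypothetical):

```python
# Sketch: write and read a u16 binary token dataset.
# Assumes a flat stream of uint16 token ids with no header;
# the file name and the ids below are hypothetical.
import numpy as np

token_ids = [17, 42, 1001, 7]               # output of some tokenizer
np.array(token_ids, dtype=np.uint16).tofile("train.u16")

# Training code can then memory-map the file instead of loading it all:
data = np.memmap("train.u16", dtype=np.uint16, mode="r")
```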
New randomly initialized checkpoints can be created with the new `hai` program.
Acoustic training with words
This release enables users of `hac` to train word-level (token-level) models from manifests with txt files:

```
hac --train labels:train.tsv --eval labels:eval.tsv --vocab words:words.txt
```
TSV files are expected to be formatted as below. This format is inspired by Kaldi's text format, with paths instead of utterance ids.

```
path/to/utterance.wav word1 word2 word3
```

words.txt is a file with a list of words; repeated words are ignored.
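As an illustration, the following Python sketch writes such a manifest and vocabulary. The wav paths and transcripts are made up, and a tab separator between the path and the words is assumed:

```python
# Sketch: build a TSV manifest and a words.txt vocabulary.
# The paths, transcripts, and tab separator are assumptions.
pairs = [
    ("path/to/utterance.wav", "word1 word2 word3"),
    ("path/to/another.wav", "word2 word4"),
]

vocab = set()
with open("train.tsv", "w") as manifest:
    for wav, transcript in pairs:
        manifest.write(f"{wav}\t{transcript}\n")
        vocab.update(transcript.split())

# The set drops repeated words; hac would ignore them anyway.
with open("words.txt", "w") as f:
    f.write("\n".join(sorted(vocab)) + "\n")
```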
uk4b with LoRA
This release supports running models adapted using LoRA. New modules: `ha.lora`. New APIs: `ha.attention.load_model`.
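For context, LoRA adapts a frozen weight matrix W by adding a trainable low-rank update BA. Below is a generic PyTorch sketch of the idea, not haloop's actual `ha.lora` implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA).
    Generic sketch of the technique, not haloop's ha.lora code."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # only the adapter trains
        # A projects down to the low rank, B projects back up.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # W x + (alpha / r) * B A x; B starts at zero, so training
        # begins exactly from the base model's behavior.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```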
uk4b Transformers
This release introduces a REPL for models trained for the paper by @dchaplinsky and me, GPT-2 Metadata Pretraining Towards Instruction Finetuning for Ukrainian. The REPL is accessible via a new CLI program, `hat`.
To use `hat`, first install some additional dependencies and download the tokenizer and model checkpoint:

```
pip install haloop --upgrade                # make sure you have at least 0.0.7
pip install bitsandbytes sentencepiece     # I opted for not installing these as dependencies for now
wget https://a.wilab.org.ua/gpt/wiki.model  # sentencepiece tokenizer
wget https://a.wilab.org.ua/gpt/ckpt10m.pt  # model checkpoint for GPT-2 Large
```

Now, start the REPL:

```
hat --spm wiki.model ckpt10m.pt
```
v0.0.6: Transducer preparations
- Train acoustic models with byte targets out of the box.
- Complete batched Transducer loss implementation.
- Default to the torch implementation of CTC loss (10x faster for now).
- Add the ResNet-32 encoder as an option.
- When evaluating LMs, report bits per character (BPC).
- License the code under GPLv3.
v0.0.5: Progress update
haloop v0.0.4
Renamed the project to haloop. I finally have a name I like.