10 Jan 12:14

6bc0fbc

v4.48.0: ModernBERT, Aria, TimmWrapper, ColPali, Falcon3, Bamba, VitPose, DinoV2 w/ Registers, Emu3, Cohere v2, TextNet, DiffLlama, PixtralLarge, Moonshine Latest

Latest

New models

ModernBERT

The ModernBert model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Galalgher, Raja Bisas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Grifin Adams, Jeremy Howard and Iacopo Poli.

It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.

It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:

Rotary Positional Embeddings to support sequences of up to 8192 tokens.
Unpadding to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
GeGLU Replacing the original MLP layers with GeGLU layers, shown to improve performance.
Alternating Attention where most attention layers employ a sliding window of 128 tokens, with Global Attention only used every 3 layers.
Flash Attention to speed up processing.
A model designed following recent The Case for Co-Designing Model Architectures with Hardware, ensuring maximum efficiency across inference GPUs.
Modern training data scales (2 trillion tokens) and mixtures (including code ande math data)

Add ModernBERT to Transformers by @warner-benjamin in #35158

Aria

The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.

Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.

Add Aria by @aymeric-roucher in #34157

TimmWrapper

We add a TimmWrapper set of classes such that timm models can be loaded in as transformer models into the library.

Here's a general usage example:

import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoConfig, AutoModelForImageClassification, AutoImageProcessor

checkpoint = "timm/resnet50.a1_in1k"
img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

image_processor = AutoImageProcessor.from_pretrained(checkpoint)
inputs = image_processor(img, return_tensors="pt")
model = AutoModelForImageClassification.from_pretrained(checkpoint)

with torch.no_grad():
    logits = model(**inputs).logits

top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)

Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc:

import torch
from urllib.request import urlopen
from PIL import Image

from transformers import pipeline

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
pipe = pipeline("image-classification", model="timm/resnet18.a1_in1k")
print(pipe(img))

Add TimmWrapper by @qubvel and @amyeroberts in #34564

Pixtral-Large

Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.

Update Pixtral conversion script to support large format! by @ArthurZucker in #34801

ColPali

The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work lead by ILLUIN Technology.

In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.

Add ColPali to 🤗 transformers by @tonywu71 and @yonigozlan in #33736

Falcon3

Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performances while reducing training costs:

One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data. Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B parameters model by duplicating the redundant layers and continuing pre-training with 2TT of high-quality data. This yielded Falcon3-10B-Base which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters. Knowledge distillation for better tiny models: To provide compact and efficient alternatives, we developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.

Add Falcon3 documentation by @mokeddembillel in #35307

Bamba

Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.

Checkout all Bamba-9B model checkpoints here.

Add the Bamba Model by @fabianlim in #34982

VitPose

ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation”.

The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoints locations for each detected person, allowing it to be easily used with any object detection model.

Add VitPose by @SangbumChoi and @NielsRogge in #30530

DINOv2 with registers

The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.

Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.

The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called “register” tokens), which you only use during pre-training (and throw away afterwards). This results in:

no artifacts
interpretable attention maps
and improved performances.

Add DINOv2 with registers by @NielsRogge in #35348

Emu3

The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.

Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on [VQ-VA...

Contributors

winglian, mfarre, and 79 other contributors

Assets 2

17 Dec 15:42

ArthurZucker

v4.47.1

241c04d

v4.47.1

Patch release v4.47.1

We waited a little bit to make sure it was stable, thanks @winglian for double checking and everyone for the fixes!

Fix GA loss bugs and add unit test (#35121)
Contributed by @techkang and @ArthurZucker.
Fix num_items_in_batch not being an integer (#35115)
Contributed by @xspirus.
Fix FSDP no longer working (#35212)
Contributed by @muellerzr.
Don't use no_sync when DeepSpeed doesn't support it for certain ZeRO configurations (#35212)
Contributed by @winglian.
Only import torch.distributed if it is available (#35133)
Contributed by @GaetanLepage.
[Whisper] Patch float type on MPS (#35295)
Contributed by @eustlb. 🔜 we should probably have MPS CIs to avoid repeating this!

Contributors

winglian, muellerzr, and 5 other contributors

Assets 2

05 Dec 17:45

LysandreJik

v4.47.0

5d7739f

v4.47.0: PaliGemma-2, I-JEPA, OLMo-2, LayerSkip, Tensor Parallel

New models

PaliGemma-2

PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.

PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.

I-JEPA

The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.

Add I-JEPA by @jmtzt in #33125

OLMo 2

The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.

The architectural changes from the original OLMo model to this model are:

RMSNorm is used instead of standard layer norm.
Norm is applied to attention queries and keys.
Norm is applied after attention/feedforward layers rather than before.

Commits:

Add OLMo November 2024 by @2015aroras in #34551
Rename OLMo November to OLMo2 by @2015aroras in #34864

Layer-Skip Llama

We add support for Meta's Layer-Skip Llama 3.2 1B model.

The Llama3.2 1B model was continually pretrained with LayerSkip recipe, early exit loss and layer dropout, as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding and is capable of performing self-speculative decoding: decode with earlier layers and verify with remaining layers.

Self-speculation (Layer-Skip Llama) by @ArthurZucker in #34240

Tensor Parallel implementation

This PR uses the torch.distributed.tensor.parallel subpackage to implement Tensor Parallel for Llama (as an example).

The motivation is multi-fold:

to make modeling code simple as single-worker case:
all manual TP implementations under if self.config.pretraining_tp > 1 can be removed.
to make tensor parallelism easily accessible by users:
added a model.tensor_parallel(device_mesh) method that allows users to turn a single-proc model into a parallel model. !- Please guide me to a right place to put this function/method if PreTrainedModel is not a preferred place. -!

This is the first PR of many to simplify and enable Tensor Parallel across models.

Simplify Tensor Parallel implementation with PyTorch TP by @kwen2501 in #34184

Farewell, Python 3.8

Python 3.8 reaches end of life, and, as such, we drop it from our CI.

Drop support for Python 3.8 by @ydshieh in #34314

GGUF improvements

Several improvements have been done to the GGUF support in transformers; notably by adding new architectures to the list of supported architectures.

Add T5 GGUF loading support by @junejae in #33389
Add GGUF for Mamba by @VladOS95-cyber in #34200
Add Nemotron GGUF Loading Support by @farrosalferro in #34725
Improve gguf tensor processing by @VladOS95-cyber in #34515
Fix use_parallel_residual and qkv_bias for StableLM GGUF config extraction by @Isotr0py in #34450

Fast processors

We continue the work to improve the speed of fast processors as detailed in this roadmap.

We contribute a fast processor to RT-DETR.

Add Image Processor Fast RT-DETR by @yonigozlan in #34354

New pipelines

A new pipeline has been added to transformers: image-text-to-text!

the pipeline support the following inputs:

unbatched images and text - images=image, text=text
batched images and text - images = [image, image], text= [text, text]
several images per prompt (only for models supporting the use of an image token) - images = [[image, image], [image]] or images=[image, image, image], text = ["... ......", "......"]
Chat templates (for models supporting them).

Add image text to text pipeline by @yonigozlan in #34170

Notable refactors

Separate chat templates into a single file

We have had several issues with chat templates because they're stored as single lines in the JSON config files:

Impossible to review diffs
Very hard to edit in the web UI (or in general)
Differences between processor templates in chat_template.json and tokenizer templates in tokenizer_config.json causing confusion
Some models use multiple templates, requiring a template dict, but we're trying to discourage that in future and move those models to single templates with conditional behaviour instead

The solution:

Just move chat templates to a single chat_template.jinja file in the repo
If multiple templates are required, then they should still be stored in the JSON file. This is not supported for Processor classes, so processors should always be able to save their template as a raw Jinja file. In general, we'll be gently deprecating multiple templates in future.
If a chat_template.jinja file is present, it overrides the JSON files. If a tokenizer is loaded with both Jinja and JSON chat templates and resaved, it should save only the Jinja file, and not have any chat_template entry in tokenizer_config.json.

For now, we continue saving in the old format by default. I'll probably keep it this way for several versions before making the new format the default, to ensure that most users are able to load the new format before it becomes common. Until then, the new format should mostly be used for testing, to make sure it's ready for deployment when we do the switch.

Separate chat templates into a single file by @Rocketknight1 in #33957

Large modular logic refactor

This PR largely rework the logic we use in the modular converter. It is (hopefully) clearer and maintainable. Instead of going in all directions, adding stuff, then deleting it if not needed, we now do the following:

visit all the modular file (record imports/functions/classes/assignments nodes)
- create function dependency mapping
for each import coming from another model:
- visit the corresponding file
- create function dependency mapping
- update mapping with function/assignment from the modular (updated/new functions)
- create the class dependency graph based on merged dependencies
update dependency graph of the modular with the functions and assignments imported from the other files
for each class recorded in the modular:
- if inherithing from class in another file:
  - replace call to super
  - find the dependencies after the node was replaced
  - follow (updated with modular defs) dependency mapping to add all nodes
- else:
  - only add needed imported functions (and their dependencies)
determine the needed imports and add them

Large modular logic refactoring by @Cyrilvallez in #34487

Community bugfixes and improvements

Remove graph breaks for torch.compile() in flash_attention_forward when Lllama Model is padding free tuned by @Abhishek-TAMU in #33932
Better defaults by @ArthurZucker in #34026
translated gguf.md into chinese by @blueingman in #34163
CI: fix failures by @zucchini-nlp in #34371
Zamba is an LM by @LysandreJik in #34342
add code generation to natural language processing section by @furtnerthomas in #34333
Fix pil_torch_interpolation_mapping import in image_processing_detr_fast by @yonigozlan in #34375
Add code sample docstrings and checkpoint reference for GLM models by @h3110Fr13nd in #34360
refactor: remove redundant if-condition and improve type correctness for convert_tokens_to_ids by @winstxnhdw in #34030
Ignore unsupported kwarg in ProcessorMixin call by @yonigozlan in #34285
[PEFT] Add warning for missing key in LoRA adapter by @BenjaminBossan in #34068
Fix torch.fx issue related to the new loss_kwargs keyword argument by @michaelbenayoun in #34380
Correct the new defaults by @Cyrilvallez in #34377
[auto. ping] Avoid sending empty info + add more team members by @ydshieh in #34383
Fix glm by @Cyrilvallez in #34388
Use non nested images and batched text Idefics2/3 by @yonigozlan in #34222
Fix onnx non-expotable ...

Contributors

winglian, pcuenca, and 125 other contributors

Assets 2

18 Nov 22:13

ArthurZucker

v4.46.3

052e652

Patch release v4.46.3

One small fix for FSDP + gradient accumulation loss issue!

FSDP grad accum fix, #34645 by @winglian

Contributors

winglian

Assets 2

05 Nov 18:21

ArthurZucker

v4.46.2

ccbd57a

Patch release v4.46.2

Mostly had to finish the gradient accumulation !
Thanks to @techkang and @Ryukijano 🤗

VLMs: fix number of image tokens (#34332) by @zucchini-nlp
fix pixtral processor (#34486) by @@molbap
enable average tokens across devices (#34373) by @techkang and @muellerzr
Update trainer for easier handling of accumulate, compile fixes, and … by @muellerzr and @Ryukijano
MPS: isin_mps_friendly can support 0D tensors (#34538) by @gante

Contributors

muellerzr, gante, and 4 other contributors

Assets 2

29 Oct 15:50

ArthurZucker

v4.46.1

bc598c0

Patch release v4.46.1

Patch release v4.4.61

This is mostly for fx and onnx issues!

** Fix regression loading dtype #34409 by @SunMarc
** LLaVa: latency issues #34460 by @zucchini-nlp
** Fix pix2struct #34374 by @IlyasMoutawwakil
** Fix onnx non-exposable inplace aten op #34376 by @IlyasMoutawwakil
** Fix torch.fx issue related to the new loss_kwargs keyword argument #34380 by @michaelbenayoun

Contributors

michaelbenayoun, SunMarc, and 2 other contributors

Assets 2

24 Oct 08:15

LysandreJik

v4.46.0

c2820c9

Release v4.46.0

New model additions

Moshi

The Moshi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez,
Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.

Moshi is a speech-text foundation model that casts spoken dialogue as speech-to-speech generation. Starting from a
text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec,
while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of
explicit speaker turns, and the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text
tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of
generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first
real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice.

Moshi integration by @ylacombe in #33624

Zamba

Zamba-7B-v1 is a hybrid between state-space models (Specifically Mamba) and transformer, and was trained using
next-token prediction. Zamba uses a shared transformer layer after every 6 mamba blocks. It uses the Mistral
v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba-7B-v1 was
pre-trained on 1T tokens of text and code data.

Add Zamba by @pglorio in #30950

GLM

The GLM Model was proposed in ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools by GLM Team,
THUDM & ZhipuAI.

The abstract from the paper starts with the following:

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This
report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B.

add Glm by @Cyrilvallez in #33823

Idefics 3

The Idefics3 model was proposed in Building and better understanding vision-language models: insights and future directions by Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon.

Idefics3 is an adaptation of the Idefics2 model with three main differences:

It uses Llama3 for the text model.
It uses an updated processing logic for the images.
It removes the perceiver.

Add Idefics 3! by @andimarafioti in #32473

PhiMoE

The PhiMoE model was proposed in Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Microsoft.

This model is very similar to Mixtral with the main difference of Phi3LongRoPEScaledRotaryEmbedding, where they are
used to extend the context of the rotary embeddings. The query, key and values are fused, and the MLP’s up and gate
projection layers are also fused.

PhiMoE by @garg-amit in #33363

Watermarking

This release adds SynthID, a novel state-of-the-art watermarking technique by Google DeepMind. SynthID has a low generation-time computational cost and can be configured to be nearly imperceptible (at the cost of harder watermarking detection). The release also comes with the code to train and run the corresponding detector, which is a machine learning model itself.

from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig

tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')

# SynthID Text configuration
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57],
    ngram_len=5,
)

# Generation with watermarking
tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
output_sequences = model.generate(
    **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
)
watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
print(watermarked_text)

Docs for applying SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkLogitsProcessor
Docs for detecting SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkDetector

Add SynthID (watermerking by Google DeepMind) by @gante in #34350

Quantization

BitNet

BitNet is an architecture introduced by Microsoft Research that uses extreme quantization, representing each parameter with only three values: -1, 0, and 1. This results in a model that uses just 1.58 bits per parameter, significantly reducing computational and memory requirements. It replaces traditional Linear layers in Multi-Head Attention and Feed-Forward Networks with specialized layers called BitLinears that use ternary precision (or even binary, in the initial version)

FEAT : Adding BitNet quantization method to HFQuantizer by @MekkCyber in #33410

GGUF loading in transformers

More architectures are now supported in our GGUF loader; GGUF files saved with this architecture can now
be loaded directly in transformers to be fine-tuned. We recommend using tooling from llama.cpp to requantize
the models after further training has been done.

Add gguf support for bloom by @VladOS95-cyber in #33473
Add falcon gguf by @g-prz in #33437
Add gguf support for StableLM by @VladOS95-cyber in #33793
Add gguf support for gpt2 by @VladOS95-cyber in #34044
Add GGUF for starcoder2 by @VladOS95-cyber in #34094

Notable improvements and additions

Pipeline API synchronisation

We are pushing for a unified inference API across multiple libraries. As part of this, we are cleaning up the input and output signatures for our pipeline classes and deprecating some rarely-used arguments. This is still a work-in-progress, but when it's finished, transformers pipelines should exactly match workflows in deployment libraries like transformers.js or TGI, allowing you to seamlessly move from development to production.

Sync video classification pipeline with huggingface_hub spec by @Rocketknight1 in #34288
Image pipelines spec compliance by @Rocketknight1 in #33899
Make ASR pipeline compliant with Hub spec + add tests by @Rocketknight1 in #33769
Cleanup return_text and return_full_text options in TextGenerationPipeline by @Rocketknight1 in #33542
Make audio classification pipeline spec-compliant and add test by @Rocketknight1 in #33730
Sync QuestionAnsweringPipeline by @Rocketknight1 in #34039

Also, pipelines now fully support the Processor class, used by vision-language models. Expect full pipeline support for chatting with VLMs in the very near future!

Make pipeline able to load processor by @qubvel in #32514

Executorch compatibility

ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.

We are collaborating with the executorch team so that 🤗 Transformers models can be exported using torch.export. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in ExecuTorch, particularly for mobile and edge use cases.

Generate using exported model and enable gemma2-2b in ExecuTorch by @guangy10 in #33707
Qwen2.5 is ExecuTorch Compatible by @guangy10 in #34102
Olmo is ExecuTorch Compatible by @guangy10 in #34181
Llama3 and Llama2 are ExecuTorch compatible by @guangy10 in #34101

Gradient accumulation bugfix

Fix Gradient Accumulation issue by @ArthurZucker in #34191
Enable users to use their own loss functions + deal with prefetching for grad accum by @muellerzr in #34198
Enable Gradient Accumulation fix across all models + trainer fully in forward() by @muellerzr #34283

Bugfixes and improvements

adding positional encoder changes and tests by @manuelsh in #32600
Uniformize kwargs for chameleon processor by @leloykun in #32181
[MllamaProcessor] Update errors and API with multiple image by @ArthurZucker in #33715
fix: use correct var names for check_tokenizers script by @niqodea in #33702
Fix docs and docstrings Omdet-Turbo by @yonigozlan in #33726
Fix position embeddings singular/plural by @molbap in #33678
Generate: can_generate() recursive check by @gante in #33718
clean_up_tokenization_spaces=False i...

Contributors

jbn, chanind, and 118 other contributors

Assets 2

07 Oct 17:42

ArthurZucker

v4.45.2

53fad64

Release v4.45.2

Patch release v4.45.2

Mostly some warnings that were not properly removed ⚠️ :

Ignore keys on validate_rope #33753 by @zucchini-nlp
remove warning v2 #33761 by @itazap
Config: lower save_pretrained exception to warning #33906 by @gante

🔴 Had a small regression with dynamic Cache 🔴
*Cache: revert DynamicCache init for BC #33861 by @gante

A small fix for idefic 🐩 :

Fixes for issue #33763 in idefics2 model #33766 by @aroun-coumar

And a fix for Siglip 🤧 !

hot fix self.position_embeddings->self.position_embedding #33958 and properly fix and RUN_SLOW #33965 thanks to @mranzinger

Contributors

mranzinger, gante, and 3 other contributors

Assets 2

26 Sep 18:07

ArthurZucker

v4.45.1

e71a01a

Patch Release v4.45.1

Patches for v4.45.1

[MllamaProcessor] Update errors and API with multiple image (#33715) by @ArthurZucker
Generate: can_generate() recursive check (#33718) by @gante
clean_up_tokenization_spaces=False if unset (#31938) by @itazap

Contributors

gante, itazap, and ArthurZucker

Assets 2

25 Sep 18:11

LysandreJik

v4.45.0

2ef31de

Llama 3.2, mllama, Qwen2-Audio, Qwen2-VL, OLMoE, Llava Onevision, Pixtral, FalconMamba, Modular Transformers

New model additions

mllama

The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

Add MLLama #33703, by @qubvel, @zucchini-nlp, @ArthurZucker

Qwen2-VL

The Qwen2-VL is a major update from the previous Qwen-VL by the Qwen team.

An extract from the Qwen2-VL blogpost available here is as follows:

Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model familities. Compared with Qwen-VL, Qwen2-VL has the capabilities of:

SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

support qwen2-vl by @simonJJJ in #32318

Qwen2-Audio

The Qwen2-Audio is the new model series of large audio-language models from the Qwen team. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.

They introduce two distinct audio interaction modes:

voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input
audio analysis: users could provide audio and text instructions for analysis during the interaction

Add Qwen2-Audio by @faychu in #32137

OLMoE

OLMoE is a series of Open Language Models using sparse Mixture-of-Experts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.

Add OLMoE by @Muennighoff in #32406

Llava Onevision

LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of SigLIP vision encoder and a Qwen2 language backbone. The images are processed with anyres-9 technique where the image is split into 9 patches to better process high resolution images and capture as much details as possible. However, videos are pooled to a total sequence length of 196 tokens each frame for more memory efficient computation. LLaVA-Onevision is available in three sizes: 0.5B, 7B and 72B and achieves remarkable performance on benchmark evaluations.

Llava Onevision: add model by @zucchini-nlp in #32673

FalconMamba

The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.

The model has been trained on approximtely 6T tokens consisting a mixture of many data sources such as RefineWeb, Cosmopedia and Math data.

The team releases an accompanying blog post.

Add new model by @younesbelkada in #32615

Granite Language Models

he Granite model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in the size categories across various benchmarks, including natural language multi-choices, code generation, and math reasoning.

Granite language models by @mayank31398 in #31502

Granite MOE

The GraniteMoe model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to other dense models with 2x activate parameters across various benchmarks, including natural language multi-choices, code generation, and math reasoning.

Granitemoe by @mayank31398 in #33207

Descript-Audio-Codec

The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 KHz audio into tokens at just 8kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.

Add Descript-Audio-Codec model by @kamilakesbi in #31494

Pixtral

The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the Llava family, meaning image embeddings are placed instead of the [IMG] token placeholders.

The model uses PixtralVisionModel for its vision encoder, and MistralForCausalLM for its language decoder. The main contribution is the 2d ROPE (rotary postiion embeddings) on the images, and support for arbitrary image sizes (the images are not padded together nor are they resized).

Add support for Pixtral by @ArthurZucker in #33449

Mimi

The Mimi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team, that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.

Codec integration by @ylacombe in #33565

OmDet-Turbo

The OmDet-Turbo model was proposed in Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head by Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee. OmDet-Turbo incorporates components from RT-DETR and introduces a swift multimodal fusion module to achieve real-time open-vocabulary object detection capabilities while maintaining high accuracy. The base model achieves performance of up to 100.2 FPS and 53.4 AP on COCO zero-shot.

Add OmDet-Turbo by @yonigozlan in #31843

Quantization

GGUF

GGUF support continues to be enhanced in the library by offering a way to load GGUF models within transformers by unquantizing them, before re-quantizing them for re-use within the GGUF/GGML ecosystem.

Add Qwen2Moe GGUF loading support by @VladOS95-cyber in #33264
Fix incorrect vocab size retrieval in GGUF config by @Isotr0py in #32551
Add chat_template for tokenizer extracted from GGUF model by @Isotr0py in #32908
🚨 Support dequantization for most GGML types by @Isotr0py in #32625
Add support for GGUF Phi-3 by @a8nova in #31844