
BridgeTower, Whisper speedup, DETA, SpeechT5, BLIP-2, CLAP, ALIGN, API updates

@LysandreJik released this 15 Mar 16:25 · 5553 commits to main since this release · d941f07

BridgeTower

The goal of this model is to build a bridge between each uni-modal encoder and the cross-modal encoder, enabling comprehensive and detailed interaction at each layer of the cross-modal encoder. This achieves remarkable performance on various downstream tasks with almost negligible additional parameters and computational costs.
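A minimal image-text retrieval sketch (the checkpoint name and image source are illustrative assumptions):

import requests
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mtm")
model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mtm")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Score how well the text matches the image.
encoding = processor(image, "two cats sleeping on a couch", return_tensors="pt")
outputs = model(**encoding)
match_score = outputs.logits[0, 1].item()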

Whisper speedup

The Whisper model was integrated a few releases ago. This release offers significant performance optimizations when generating with timestamps. This was made possible by rewriting Whisper's generate() function, which now uses the generation_config and implements batched timestamp prediction. The language and task can now also be set when calling generate(). For more details about this refactoring, check out this Colab.
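A rough sketch of the new options (the checkpoint name and the audio input are placeholders):

from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# `audio_array` stands in for a 16 kHz waveform loaded elsewhere.
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

# Language, task and timestamp prediction are now handled directly by generate().
generated_ids = model.generate(
    inputs.input_features,
    language="english",
    task="transcribe",
    return_timestamps=True,
)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)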
Notably, Whisper is now also supported in Flax 🚀 thanks to @andyehrenberg! More Whisper-related commits:

DETA

DETA (short for Detection Transformers with Assignment) improves Deformable DETR by replacing the one-to-one bipartite Hungarian matching loss with one-to-many label assignments used in traditional detectors with non-maximum suppression (NMS). This leads to significant gains of up to 2.5 mAP.

SpeechT5

The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder.
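For example, text-to-speech with the shared encoder-decoder and the speech post-net can be sketched as follows (checkpoint names are assumptions, and random values stand in for a real speaker x-vector):

import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")

# A 512-dimensional x-vector identifies the speaker; random values are only a placeholder.
speaker_embeddings = torch.randn(1, 512)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)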

XLM-V

XLM-V is a multilingual language model with a one-million-token vocabulary, trained on 2.5TB of data from Common Crawl (the same data as XLM-R).

BLIP-2

BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon Flamingo, an 80 billion parameter model, by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.
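A minimal captioning sketch (the checkpoint and image source are illustrative):

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()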

X-MOD

X-MOD extends multilingual masked language models like XLM-R to include language-specific modular components (language adapters) during pre-training. For fine-tuning, the language adapters in each transformer layer are frozen.
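Because the adapters are frozen, the active language has to be selected before use; a rough sketch (the checkpoint and language code are assumptions):

from transformers import AutoTokenizer, XmodModel

tokenizer = AutoTokenizer.from_pretrained("facebook/xmod-base")
model = XmodModel.from_pretrained("facebook/xmod-base")

# Activate the English adapter; the shared weights and the other language adapters are untouched.
model.set_default_language("en_XX")

inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)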

Ernie-M

ERNIE-M is a new training method that encourages the model to align the representations of multiple languages using monolingual corpora, overcoming the constraint that parallel corpus size places on model performance.

TVLT

The Textless Vision-Language Transformer (TVLT) is a model that uses raw visual and audio inputs for vision-and-language representation learning, without using text-specific modules such as tokenization or automatic speech recognition (ASR). It can perform various audiovisual and vision-language tasks like retrieval, question answering, etc.

CLAP

CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given audio clip, without directly optimizing for the task. The CLAP model uses a Swin Transformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected into a latent space of identical dimension. The dot product between the projected audio and text features is then used as a similarity score.
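A hedged sketch of computing that similarity score (the checkpoint and the audio input are placeholders):

from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# `audio_array` stands in for a 48 kHz waveform loaded elsewhere.
inputs = processor(
    text=["a dog barking", "a cat meowing"],
    audios=audio_array,
    sampling_rate=48000,
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
probs = outputs.logits_per_audio.softmax(dim=-1)  # similarity of the audio to each text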

GPTSAN

GPTSAN is a Japanese language model based on the Switch Transformer. It has the same structure as the model introduced as Prefix LM in the T5 paper, and supports both text generation and masked language modeling. These basic tasks can similarly be fine-tuned for translation or summarization.

EfficientNet

EfficientNets are a family of image classification models which achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.

ALIGN

ALIGN is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. ALIGN features a dual-encoder architecture with EfficientNet as its vision encoder and BERT as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.
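Zero-shot image classification with the dual encoders can be sketched like this (the checkpoint and image source are illustrative):

import requests
from PIL import Image
from transformers import AlignProcessor, AlignModel

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
model = AlignModel.from_pretrained("kakaobrain/align-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["an image of a cat", "an image of a dog"]

inputs = processor(text=candidate_labels, images=image, return_tensors="pt")
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-text similarity per label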

Informer

Informer is a method for long-sequence time-series forecasting. It introduces a Probabilistic Attention mechanism to select the “active” queries rather than the “lazy” queries, and provides a sparse Transformer, thus mitigating the quadratic compute and memory requirements of vanilla attention.

API updates and improvements

Safetensors

safetensors is a safe serialization format for tensors, which has been supported in transformers as a first-class citizen for the past few versions.

This change enables explicitly forcing the from_pretrained method to use or not to use safetensors. This unlocks a few use-cases, notably the possibility to enforce loading only from this format, limiting security risks.

Example of usage:

from transformers import AutoModel

# As of version v4.27.0, this loads the `pytorch_model.bin` by default if `safetensors` is not installed.
# It loads the `model.safetensors` file if `safetensors` is installed.
model = AutoModel.from_pretrained('bert-base-cased')

# This forces the load from the `model.safetensors` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=True)

# This forces the load from the `pytorch_model.bin` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=False)

Variant

This PR adds a "variant" keyword argument to PyTorch's from_pretrained and save_pretrained so that multiple weight variants can be saved in the model repo.

Example of usage with the model hosted in this folder on the Hub:

from transformers import CLIPTextModel

path = "huggingface/the-no-branch-repo"  # or ./text_encoder if local

# Loads the `fp16` variant. This loads the `pytorch_model.fp16.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder", variant="fp16")

# This loads the no-variant checkpoint, loading the `pytorch_model.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder")

bitsandbytes

The bitsandbytes integration has been overhauled and now offers a new configuration object: BitsAndBytesConfig.

Read more about it in the documentation.
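A minimal sketch of the new configuration (the model name and threshold value are assumptions; 8-bit loading requires bitsandbytes and a CUDA-capable GPU):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Advanced int8 settings now live in a dedicated config object.
quantization_config = BitsAndBytesConfig(llm_int8_threshold=6.0)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    device_map="auto",
    load_in_8bit=True,
    quantization_config=quantization_config,
)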

FSDP

This PR enables the user to make use of the PyTorch/XLA implementation of FSDP, including the newly added auto-wrap feature. Four arguments have been added to training_args.py to facilitate this functionality:

  • xla_fsdp: this flag is a string containing the location of a .json file which specifies the FSDP arguments the user wants to use when wrapping their model.
  • xla_fsdp_min_num_params: this flag is an int which will set a size-based automatic wrapping policy which automatically FSDP wraps any module with at least xla_fsdp_min_num_params many parameters.
  • xla_fsdp_transformer_layer_cls_to_wrap: this flag is a list of (case-sensitive) strings which will set a layer-class-based automatic wrapping policy which automatically FSDP wraps any module whose name matches one of the listed strings.
  • xla_fsdp_grad_ckpt: this flag is a bool which determines whether gradient checkpointing is enabled for the automatically wrapped layers.
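A sketch only, under the assumption that the four flags above are exposed directly as TrainingArguments fields (the JSON path is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    xla_fsdp="xla_fsdp_config.json",  # JSON file with the FSDP wrapping arguments
    xla_fsdp_grad_ckpt=True,          # gradient checkpointing for auto-wrapped layers
)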

Breaking changes

Generate

This PR standardizes beam search behavior across all three frameworks through early_stopping. PyTorch is unchanged, but TensorFlow and Flax users will see a significant speedup if they keep the default generation parameters.

There are, however, minor differences in the outputs of the .generate method with beam search on TensorFlow and Flax. These differences should be very small and come with significant speedups, but if they break your workflow, we recommend you downgrade to a previous version and let us know in a GitHub issue so that we may investigate what is going on.

  • 🚨🚨 Generate: standardize beam search behavior across frameworks by @gante in #21368

Single model initialization

Model initialization had problems which made it incoherent across models and across initialization techniques. This is technically a bugfix, but as it may result in your models being initialized with different values, we think it best to highlight it here.

  • 🚨🚨🚨 Enforce single model initialization by @sgugger in #21431

Deprecations

This PR deprecates the parallelize API, which was replaced by accelerate months ago. We recommend loading the model using the device_map argument and setting it to "balanced" to obtain the previous behavior.

Setting your own device_map is still permitted, but it needs to be a dictionary mapping module names to devices, for example:

device_map = {'h.0': 0, 'h.1': 1, ...}
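For the previous parallelize behavior, a hedged sketch (the model name is illustrative, and accelerate must be installed):

from transformers import AutoModelForCausalLM

# Spreads the layers evenly across the available GPUs, replacing the old model.parallelize() call.
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="balanced")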

Pipelines

A new pipeline focused on zero-shot audio classification has been added to the repository.
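A rough usage sketch (the model name and the audio input are assumptions):

from transformers import pipeline

classifier = pipeline(task="zero-shot-audio-classification", model="laion/clap-htsat-unfused")

# `audio_array` stands in for a waveform loaded elsewhere.
predictions = classifier(audio_array, candidate_labels=["dog barking", "vacuum cleaner"])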

Documentation

The task and model summaries have been refactored to take into account the larger number of tasks and models we now have.

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release: