Releases: huggingface/transformers
v4.28.1: Patch release
v4.28.0: LLaMa, Pix2Struct, MatCha, DePlot, MEGA, NLLB-MoE, GPTBigCode
LLaMA
The LLaMA model was proposed in LLaMA: Open and Efficient Foundation Language Models. It is a collection of foundation language models ranging from 7B to 65B parameters. You can request access to the weights here, then use the conversion script to generate a checkpoint compatible with Hugging Face.
Pix2Struct, MatCha, DePlot
Pix2Struct is a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct has been fine-tuned on various tasks and datasets, ranging from image captioning and visual question answering (VQA) over different inputs (books, charts, science diagrams) to captioning UI components, and others.
- Add Pix2Struct by @younesbelkada in #21400
- Add DePlot + MatCha on `transformers` by @younesbelkada in #22528
Mega
MEGA proposes a new approach to self-attention with each encoder layer having a multi-headed exponential moving average in addition to a single head of standard dot-product attention, giving the attention mechanism stronger positional biases. This allows MEGA to perform competitively with Transformers on standard benchmarks including LRA while also having significantly fewer parameters. MEGA’s compute efficiency allows it to scale to very long sequences, making it an attractive option for long-document NLP tasks.
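As a rough illustration of the moving-average idea, the sketch below applies a plain one-dimensional exponential moving average to a sequence. It is a deliberate simplification: MEGA's layer uses a learned, multi-dimensional damped EMA per head, and the function name and the `alpha` value here are illustrative assumptions, not part of the library.

```python
def exponential_moving_average(xs, alpha):
    """Plain EMA: y_t = alpha * x_t + (1 - alpha) * y_{t-1}, starting from y = 0.

    Hypothetical helper for illustration only; not MEGA's learned EMA.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ys = []
    prev = 0.0
    for x in xs:
        # Recent inputs weigh more; older inputs decay geometrically.
        prev = alpha * x + (1.0 - alpha) * prev
        ys.append(prev)
    return ys


smoothed = exponential_moving_average([1.0, 1.0, 1.0, 1.0], alpha=0.5)
print(smoothed)  # [0.5, 0.75, 0.875, 0.9375]
```

Because each output depends on position through the decay, a recurrence of this kind carries an implicit positional bias, which is the property the description above refers to.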
GPTBigCode
The model is an optimized GPT2 model with support for Multi-Query Attention.
- Add GPTBigCode model (Optimized GPT2 with MQA from Santacoder & BigCode) by @jlamypoirier in #22575
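To make the term concrete, here is a minimal pure-Python sketch of causal multi-query attention, in which every query head attends over one shared set of keys and values. It is not the GPTBigCode implementation; the function names and the nested-list tensor layout are assumptions chosen to keep the example dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_query_attention(queries, keys, values):
    """Causal multi-query attention (illustrative sketch).

    queries: [num_heads][seq_len][dim]
    keys, values: [seq_len][dim], shared by all query heads.
    Returns: [num_heads][seq_len][value_dim]
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for head_queries in queries:
        head_out = []
        for t, q in enumerate(head_queries):
            # Causal mask: position t only sees positions 0..t.
            visible_keys = keys[: t + 1]
            visible_values = values[: t + 1]
            scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in visible_keys]
            weights = softmax(scores)
            head_out.append([
                sum(w * v[d] for w, v in zip(weights, visible_values))
                for d in range(len(values[0]))
            ])
        outputs.append(head_out)
    return outputs


# Two query heads, two positions; both heads reuse the same keys and values.
queries = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 4.0]]
out = multi_query_attention(queries, keys, values)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
```

Sharing keys and values across heads shrinks the key/value cache during generation, which is the usual motivation for this layout.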
NLLB-MoE
The mixture of experts version of the NLLB release has been added to the library.
- [`NLLB-MoE`] Adds the moe model by @ArthurZucker in #22024
Serializing 8bit models
- [`bnb`] Let's make serialization of int8 models possible by @younesbelkada in #22177
You can now push 8-bit models to the Hub and load them directly from it, which saves memory and makes loading faster. An example repo is available here.
Breaking Changes
Ordering of height and width for the BLIP image processor
Notes from the PR:
The BLIP image processor incorrectly passed in the dimensions to resize in the order (width, height). This is reordered to be correct.
In most cases, this won't have an effect as the default height and width are the same. However, this is not backwards compatible for custom configurations with different height, width settings and direct calls to the resize method with different height, width values.
- 🚨🚨🚨 Fix ordering of height, width for BLIP image processor by @amyeroberts in #22466
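The effect of the swapped arguments only shows up for non-square sizes. The sketch below uses a hypothetical nearest-neighbour `resize_nearest` helper on nested lists (not the BLIP image processor's own `resize`) to show how passing `(width, height)` where `(height, width)` is expected changes the output shape.

```python
def resize_nearest(image, size):
    """Nearest-neighbour resize of a 2D list; `size` is (height, width).

    Hypothetical helper used only to illustrate the argument order.
    """
    out_h, out_w = size
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


image = [[0, 1], [2, 3]]
height, width = 2, 4

correct = resize_nearest(image, (height, width))
swapped = resize_nearest(image, (width, height))  # the old, incorrect ordering

print(len(correct), len(correct[0]))  # 2 4
print(len(swapped), len(swapped[0]))  # 4 2
```

With equal height and width the two calls are identical, which is why default configurations are unaffected.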
Prefix tokens for the NLLB tokenizer
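The big problem was the `prefix` and `suffix` tokens of the NLLB tokenizer: the language code was appended after `</s>` as a suffix, whereas it should be prepended as a prefix.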
Previous behaviour:
>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[13374, 1398, 4260, 4039, 248130, 2, 256047]
>>> # 2: '</s>'
>>> # 256047 : 'eng_Latn'
New behaviour:
>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[256047, 13374, 1398, 4260, 4039, 248130, 2]
In case you have pipelines that were relying on the old behavior, here is how you would enable it once again:
>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
- 🚨🚨🚨 [`NLLB Tokenizer`] Fix the prefix tokens 🚨🚨🚨 by @ArthurZucker in #22313
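If you have cached token ids produced with the old layout, the difference between the two behaviours amounts to moving the language code from the end to the front. The sketch below shows that on the ids from the examples above; `legacy_to_new_ids` is a hypothetical helper, not a tokenizer method, and it assumes `</s>` has id 2 as stated in the example comments.

```python
def legacy_to_new_ids(input_ids, eos_token_id=2):
    """Move a trailing language-code id (legacy suffix) to the front (new prefix).

    Hypothetical helper; expects the legacy layout: tokens, </s>, language code.
    """
    if len(input_ids) < 2 or input_ids[-2] != eos_token_id:
        raise ValueError("expected legacy layout: tokens, </s>, language code")
    *body, lang_code = input_ids
    return [lang_code] + body


legacy = [13374, 1398, 4260, 4039, 248130, 2, 256047]
print(legacy_to_new_ids(legacy))
# [256047, 13374, 1398, 4260, 4039, 248130, 2]
```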
TensorFlow ports
The BLIP model is now available in TensorFlow.
- Add TF port of BLIP by @Rocketknight1 in #22090
Export TF Generate with a TF tokenizer
As the title says, this PR adds the possibility to export TF generate with a TF-native tokenizer -- the full thing in a single TF graph.
Task guides
A new task guide has been added, focusing on depth-estimation.
- Depth estimation task guide by @MKhalusova in #22205
Bugfixes and improvements
- Load optimizer state on CPU to avoid CUDA OOM by @sgugger in #22159
- Run all tests by default by @sgugger in #22162
- Fix: unfinished_sequences with correct device by @Stxr in #22184
- Revert 22152 MaskedImageCompletionOutput changes by @amyeroberts in #22187
- Regression pipeline device by @sgugger in #22190
- Update BridgeTowerForContrastiveLearning by @abhiwand in #22145
- t5 remove data dependency by @prathikr in #22097
- Fix DeepSpeed CI by @ydshieh in #22194
- Fix typo in Align docs by @alaradirik in #22199
- Update expected values in `MgpstrModelIntegrationTest` by @ydshieh in #22195
- Italian Translation of migration.mdx by @Baelish03 in #22183
- Update tiny model creation script by @ydshieh in #22202
- Temporarily fix ONNX model exporting error by @SatyaJandhyalaAtMS in #21830
- [`XGLM`] Add `accelerate` support for XGLM by @younesbelkada in #22207
- fixes a typo in WhisperFeatureExtractor docs. by @susnato in #22208
- Hotfix for natten issue with torch 2.0.0 on CircleCI by @ydshieh in #22218
- fix typos in llama.mdx by @keturn in #22223
- fix code example in mgp-str doc by @wdp-007 in #22219
- Use `dash==2.8.1` for now for daily CI by @ydshieh in #22227
- LLaMA house-keeping by @sgugger in #22216
- fix AutoTP in deepspeed could not work for bloom by @sywangyi in #22196
- Add LlamaForSequenceClassification by @lewtun in #22209
- Removed .mdx extension in two links by @MKhalusova in #22230
- fix(docs): fix task guide links in model docs by @Seb0 in #22226
- Fix natten by @alihassanijr in #22229
- Revert "Use `dash==2.8.1` for now for daily CI" by @ydshieh in #22233
- Fix Unnecessary move of tensors from CPU to GPU in LlamaRotaryEmbedding by @ma787639046 in #22234
- [trainer] param count for deepspeed zero3 by @stas00 in #22193
- Update training_args.py -- a nightly install is not required anymore for torch.compile by @pminervini in #22266
- [Docs] fix typos in some tokenizer docs by @yesinkim in #22256
- Italian translation perf_infer_cpu by @nickprock in #22243
- [Trainer] Add optional communication backends for torch.distributed when using GPU by @heya5 in #22247
- Fix the gradient checkpointing bug of the llama model by @yqy2001 in #22270
- Fix balanced and auto device_map by @sgugger in #22271
- Rework a bit the LLaMA conversion script by @sgugger in #22236
- Proper map location for optimizer load by @sgugger in #22273
- Fix doc links by @amyeroberts in #22274
- Move torch.compile() wrapping after DDP/FSDP wrapping to ensure correct graph breaks during training by @ani300 in #22279
- Example of pad_to_multiple_of for padding and truncation guide & docstring update by @MKhalusova in #22278
- Update vision docstring bool masked pos by @amyeroberts in #22237
- replace_8bit_linear modules_to_not_convert default value fix by @BlackSamorez in #22238
- Fix error in mixed precision training of `TFCvtModel` by @gcuder in #22267
- More doctests by @ydshieh in #22268
- fix more doctests by @ydshieh in #22292
- Add translation perf_infer_gpu_one for it by @davidegazze in #22296
- Restore fp16 support on xla gpu device by @ymwangg in #22300
- Correct NATTEN function signatures and force new version by @alihassanijr in #22298
- [deepspeed] offload + non-cpuadam optimizer exception doc by @stas00 in #22044
- Final update of doctest by @ydshieh in #22299
- Add MaskedImageModelingOutput by @alaradirik in #22212
- Enable traced model for text-generation task by @jiqing-feng in #22265
- add low_cpu_mem_usage option in run_clm.py example which will benefit… by @sywangyi in #22288
- fix: Allow only test_file in pytorch and flax summarization by @connor-henderson in #22293
- Fix position embeddings for GPT-J and CodeGen by @njhill in #22069
- Fixed bug to calculate correct xpath_sub_list in MarkupLMTokenizer by @silentghoul-spec in #22302
- Enforce `max_memory` for device_map strategies by @sgugger in #22311
- Beef up Llama tests by @gante in #22314
- docs: Resolve incorrect type typo in trainer methods by @tomaarsen in #22316
- Chunkable token classification pipeline by @luccailliau in #21771
- Fix PipelineTests skip conditions by @ydshieh in #22320
- [deepspeed zero3] need `generate(synced_gpus=True, ...)` by @stas00 in #22242
- [gptj] support older pytorch version by @stas00 in #22325
- Move common properties to BackboneMixin by @amyeroberts in #21855
- Backbone add mixin tests by @amyeroberts in #22542
- Backbone add out indices by @amyeroberts in #22493
- [`MBart`] Add `accelerate` support for MBart by @younesbelkada in #22309
- Fixed gradient checkpoint bug for TimeSeriesTransformer by @mollerup23 in #22272
- Mention why one needs to specify max_steps in Trainer by @lhoestq in #22333
- Fix various imports by @sgugger in #22281
- Minor typo in pipeline FillMaskPipeline's documentation. by @SamuelLarkin in #22339
- Added type hints to TFDeiTModel by @Batese2001 in #22327
- Fix --bf16 option support for Neuron after PR #22300 by @jeffhataws in #22307
- Generate: add test for left-padding support by @gante in #22322
- Enable training Llama with model or pipeline parallelism by @kooshi in #22329
- Automatically create/update tiny models by @ydshieh in #22275
- [HFTracer] Make embeddings ops take on the dtype of the weight by @jamesr66a in #22347
- Fix...
v4.27.4: Patch release
v4.27.3: Patch release
v4.27.2: Patch release
v4.27.1: Patch release
BridgeTower, Whisper speedup, DETA, SpeechT5, BLIP-2, CLAP, ALIGN, API updates
BridgeTower
The goal of this model is to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder, thus achieving remarkable performance on various downstream tasks with almost negligible additional parameters and computational costs.
- Add BridgeTower model by @abhiwand in #20775
- Add loss for BridgeTowerForMaskedLM and BridgeTowerForImageAndTextRetrieval by @abhiwand in #21684
- [WIP] Add BridgeTowerForContrastiveLearning by @abhiwand in #21964
Whisper speedup
The Whisper model was integrated a few releases ago. This release offers significant performance optimizations when generating with timestamps. This was made possible by rewriting the `generate()` function of `Whisper`, which now uses the `generation_config`, and by implementing batched timestamp prediction. The `language` and `task` can now also be set when calling `generate()`. For more details about this refactoring, check out this colab.
Notably, Whisper is now also supported in `Flax` 🚀 thanks to @andyehrenberg! More Whisper-related commits:
- [Whisper] Refactor whisper by @ArthurZucker in #21252
- [WHISPER] Small patch by @ArthurZucker in #21307
- [Whisper] another patch by @ArthurZucker in #21324
- add flax whisper implementation by @andyehrenberg in #20479
- Add WhisperTokenizerFast by @jonatanklosko in #21222
- Remove CLI spams with Whisper FeatureExtractor by @qmeeus in #21267
- Update document of WhisperDecoderLayer by @ling0322 in #21621
- [WhisperModel] fix bug in reshaping labels by @jonatasgrosman in #21653
- [Whisper] Add SpecAugment by @bofenghuang in #21298
- Fix-ci-whisper by @ArthurZucker in #21767
- Fix `WhisperModelTest` by @ydshieh in #21883
- [Whisper] Add rescaling function with `do_normalize` by @ArthurZucker in #21263
- Refactor whisper asr pipeline to include language too. by @Narsil in #21427
- Update `model_split_percents` for `WhisperModelTest` by @ydshieh in #21922
- [Whisper] Fix feature normalization in `WhisperFeatureExtractor` by @bofenghuang in #21938
- [Whisper] Add model for audio classification by @sanchit-gandhi in #21754
- fixes the gradient checkpointing of whisper by @soma2000-lang in #22019
- Skip 3 tests for `WhisperEncoderModelTest` by @ydshieh in #22060
- [Whisper] Remove embed_tokens from encoder docstring by @sanchit-gandhi in #21996
- [`Whiper`] add `get_input_embeddings` to `WhisperForAudioClassification` by @younesbelkada in #22133
- [🛠️] Fix-whisper-breaking-changes by @ArthurZucker in #21965
DETA
DETA (short for Detection Transformers with Assignment) improves Deformable DETR by replacing the one-to-one bipartite Hungarian matching loss with one-to-many label assignments used in traditional detectors with non-maximum suppression (NMS). This leads to significant gains of up to 2.5 mAP.
- Add DETA by @NielsRogge in #20983
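For readers unfamiliar with the NMS step mentioned above, the following is a minimal sketch of greedy non-maximum suppression over axis-aligned boxes. It illustrates the general technique only; it is not DETA's implementation, and the function names, box format `(x1, y1, x2, y2)`, and default threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

One-to-many assignment produces several overlapping predictions per object, so a de-duplication pass of this kind is needed at inference time.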
SpeechT5
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder.
XLM-V
XLM-V is a multilingual language model with a one million token vocabulary trained on 2.5TB of data from Common Crawl (same as XLM-R).
- Add XLM-V to Model Doc by @stefan-it in #21498
BLIP-2
BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon Flamingo, an 80 billion parameter model, by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.
- Add BLIP-2 by @NielsRogge in #21441
X-MOD
X-MOD extends multilingual masked language models like XLM-R to include language-specific modular components (language adapters) during pre-training. For fine-tuning, the language adapters in each transformer layer are frozen.
Ernie-M
ERNIE-M is a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance.
TVLT
The Textless Vision-Language Transformer (TVLT) is a model that uses raw visual and audio inputs for vision-and-language representation learning, without using text-specific modules such as tokenization or automatic speech recognition (ASR). It can perform various audiovisual and vision-language tasks like retrieval, question answering, etc.
- Add TVLT by @zinengtang in #20725
CLAP
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed to predict the most relevant text snippet for a given audio clip, without directly optimizing for the task. The CLAP model uses a SWINTransformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similarity score.
- [CLAP] Add CLAP to the library by @ArthurZucker in #21370
- [`CLAP`] Fix few broken things by @younesbelkada in #21670
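The scoring step described above can be sketched in a few lines of plain Python. The embeddings below are made-up toy vectors standing in for projected audio and text features, the helper names are hypothetical, and the L2 normalization before the dot product is an assumption of this sketch rather than something stated in these notes.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_scores(audio_embedding, text_embeddings):
    """Dot product between one audio embedding and each text embedding."""
    a = l2_normalize(audio_embedding)
    return [
        sum(x * y for x, y in zip(a, l2_normalize(t)))
        for t in text_embeddings
    ]


def best_match(audio_embedding, text_embeddings):
    """Index of the text embedding with the highest similarity score."""
    scores = similarity_scores(audio_embedding, text_embeddings)
    return max(range(len(scores)), key=scores.__getitem__)


audio = [1.0, 0.0]
texts = [[0.0, 1.0], [2.0, 0.0]]
print(best_match(audio, texts))  # 1
```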
GPTSAN
GPTSAN is a Japanese language model using Switch Transformer. It has the same structure as the model introduced as Prefix LM in the T5 paper, and supports both Text Generation and Masked Language Modeling tasks. These basic tasks can similarly be fine-tuned for translation or summarization.
- add GPTSAN model (reopen) by @tanreinama in #21291
EfficientNet
EfficientNets are a family of image classification models which achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.
- Add EfficientNet by @alaradirik in #21563
ALIGN
ALIGN is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. ALIGN features a dual-encoder architecture with EfficientNet as its vision encoder and BERT as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.
- Add ALIGN to transformers by @alaradirik in #21741
Informer
Informer is a method to be applied to long-sequence time-series forecasting. This method introduces a Probabilistic Attention mechanism to select the “active” queries rather than the “lazy” queries and provides a sparse Transformer thus mitigating the quadratic compute and memory requirements of vanilla attention.
API updates and improvements
Safetensors
`safetensors` is a safe serialization format for tensors, which has been supported in `transformers` as a first-class citizen for the past few versions.
This change enables explicitly forcing the `from_pretrained` method to use or not to use `safetensors`. This unlocks a few use-cases, notably the possibility to enforce loading only from this format, limiting security risks.
Example of usage:
from transformers import AutoModel
# As of version v4.27.0, this loads the `pytorch_model.bin` by default if `safetensors` is not installed.
# It loads the `model.safetensors` file if `safetensors` is installed.
model = AutoModel.from_pretrained('bert-base-cased')
# This forces the load from the `model.safetensors` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=True)
# This forces the load from the `pytorch_model.bin` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=False)
- [Safetensors] Add explicit flag to from pretrained by @patrickvonplaten in #22083
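The three cases in the usage example can be summarized as a small decision function. This is only a restatement of the comments above, under the assumption that both files exist in the repository; `pick_weights_file` is a hypothetical helper, not a `transformers` function, and it does not model missing files or error handling.

```python
def pick_weights_file(use_safetensors=None, safetensors_installed=True):
    """Return the weights file name implied by the comments in the example.

    Hypothetical helper: assumes both files are present in the repository.
    """
    if use_safetensors is True:
        return "model.safetensors"
    if use_safetensors is False:
        return "pytorch_model.bin"
    # Flag left unset: the choice depends on whether `safetensors` is installed.
    return "model.safetensors" if safetensors_installed else "pytorch_model.bin"


print(pick_weights_file())                             # model.safetensors
print(pick_weights_file(safetensors_installed=False))  # pytorch_model.bin
print(pick_weights_file(use_safetensors=False))        # pytorch_model.bin
```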
Variant
This PR adds a "variant" keyword argument to PyTorch's from_pretrained and save_pretrained so that multiple weight variants can be saved in the model repo.
Example of usage with the model hosted in this folder on the Hub:
from transformers import CLIPTextModel
path = "huggingface/the-no-branch-repo" # or ./text_encoder if local
# Loads the `fp16` variant. This loads the `pytorch_model.fp16.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder", variant="fp16")
# This loads the no-variant checkpoint, loading the `pytorch_model.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder")
- Add variant to transformers by @patrickvonplaten in #21332
- [Variant] Make sure variant files are not incorrectly deleted by @patrickvonplaten in #21562
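Judging from the file names in the example (`pytorch_model.bin` versus `pytorch_model.fp16.bin`), the variant is inserted just before the file extension. The sketch below reproduces that naming rule; `add_variant` is a hypothetical helper written for illustration, and it does not cover sharded checkpoints.

```python
def add_variant(weights_name, variant=None):
    """Insert `variant` before the file extension, e.g. 'fp16'.

    Hypothetical helper inferred from the file names in the example above.
    """
    if variant is None:
        return weights_name
    stem, extension = weights_name.rsplit(".", 1)
    return f"{stem}.{variant}.{extension}"


print(add_variant("pytorch_model.bin", "fp16"))  # pytorch_model.fp16.bin
print(add_variant("pytorch_model.bin"))          # pytorch_model.bin
```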
bitsandbytes
The `bitsandbytes` integration is overhauled, now offering a new configuration: the `BitsAndBytesConfig`.
Read more about it in the [documentation](https://huggingf...
v4.26.1: Patch release
- ESM openfold_utils type hints by @ringohoffman in #20544
- Add cPython files in build by @sgugger in #21372
- Fix T5 inference in float16 + bnb error by @younesbelkada in #21281
- Fix import in Accelerate for find_exec_bs by @sgugger in #21501
- Fix inclusion of non py files in package by @sgugger in #21546
v4.26.0: Generation configs, image processors, backbones and plenty of new models!
GenerationConfig
The `generate` method has multiple arguments whose defaults lived in the model config. We have now decoupled these into a separate generation config, which makes it easier to store different sets of parameters for a given model, with different generation strategies. While we will keep supporting generate arguments in the model configuration for the foreseeable future, it is now recommended to use a generation config. You can learn more about its uses here and its documentation here.
- Generate: use `GenerationConfig` as the basis for `.generate()` parametrization by @gante in #20388
- Generate: TF uses `GenerationConfig` as the basis for `.generate()` parametrization by @gante in #20994
- Generate: FLAX uses `GenerationConfig` as the basis for `.generate()` parametrization by @gante in #21007
ImageProcessor
In the vision integration, all feature extractor classes have been deprecated and renamed to `ImageProcessor`. The old feature extractors will be fully removed in version 5 of Transformers and new vision models will only implement the `ImageProcessor` class, so be sure to switch your code to this new name sooner rather than later!
- Add deprecation warning when image FE instantiated by @amyeroberts in #20427
- Vision processors - replace FE with IPs by @amyeroberts in #20590
- Replace FE references by @amyeroberts in #20702
New models
AltCLIP
AltCLIP is a variant of CLIP obtained by switching the text encoder with a pretrained multilingual text encoder (XLM-Roberta). It performs very close to CLIP on almost all tasks, and extends the original CLIP’s capabilities to multilingual understanding.
BLIP
BLIP is a model that is able to perform various multi-modal tasks including visual question answering, image-text retrieval (image-text matching) and image captioning.
- Add BLIP by @younesbelkada in #20716
BioGPT
BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch.
- Add BioGPT by @kamalkraj in #20420
BiT
BiT is a simple recipe for scaling up pre-training of ResNet-like architectures (specifically, ResNetv2). The method results in significant improvements for transfer learning.
- Add BiT + ViT hybrid by @NielsRogge in #20550
EfficientFormer
EfficientFormer proposes a dimension-consistent pure transformer that can be run on mobile devices for dense prediction tasks like image classification, object detection and semantic segmentation.
- Efficientformer by @Bearnardd in #20459
GIT
GIT is a decoder-only Transformer that leverages CLIP’s vision encoder to condition the model on vision inputs besides text. The model obtains state-of-the-art results on image captioning and visual question answering benchmarks.
- Add GIT (GenerativeImage2Text) by @NielsRogge in #20295
GPT-sw3
GPT-Sw3 is a collection of large decoder-only pretrained transformer language models that were developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. GPT-Sw3 has been trained on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation.
Graphormer
Graphormer is a Graph Transformer model, modified to allow computations on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention.
- Graphormer model for Graph Classification by @clefourrier in #20968
Mask2Former
Mask2Former is a unified framework for panoptic, instance and semantic segmentation and features significant performance and efficiency improvements over MaskFormer.
- Add Mask2Former by @alaradirik and @shivalikasingh95 in #20792
OneFormer
OneFormer is a universal image segmentation framework that can be trained on a single panoptic dataset to perform semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference.
- Add OneFormer Model by @praeclarumjj3 in #20577
Roberta prelayernorm
The RoBERTa-PreLayerNorm model is identical to RoBERTa but uses the --encoder-normalize-before flag in fairseq.
- Implement Roberta PreLayerNorm by @AndreasMadsen in #20305
Swin2SR
Swin2SR improves the SwinIR model by incorporating Swin Transformer v2 layers, which mitigate issues such as training instability, resolution gaps between pre-training and fine-tuning, and data hunger.
- Add Swin2SR by @NielsRogge in #19784
TimeSformer
TimeSformer is the first video transformer. It inspired many transformer based video understanding and classification papers.
UPerNet
UPerNet is a general framework to effectively segment a wide range of concepts from images, leveraging any vision backbone like ConvNeXt or Swin.
- Add UperNet by @NielsRogge in #20648
Vit Hybrid
ViT hybrid is a slight variant of the plain Vision Transformer that leverages a convolutional backbone (specifically, BiT) whose features are used as initial “tokens” for the Transformer. It’s the first architecture that attains similar results to familiar convolutional architectures.
- Add BiT + ViT hybrid by @NielsRogge in #20550
Backbones
Breaking slightly with the one-model-per-file policy, we introduce backbones (mainly for vision models) which can then be re-used in more complex models like DETR, MaskFormer, Mask2Former etc.
- [NAT, DiNAT] Add backbone class by @NielsRogge in #20654
- Add Swin backbone by @NielsRogge in #20769
- [DETR and friends] Use AutoBackbone as alternative to timm by @NielsRogge in #20833
Bugfixes and improvements
- fix cuda OOM by using single Prior by @ArthurZucker in #20486
- Add ESM contact prediction by @Rocketknight1 in #20535
- flan-t5.mdx: fix link to large model by @szhublox in #20555
- Fix torch device issues by @ydshieh in #20584
- Fix flax GPT-J-6B linking model in tests by @JuanFKurucz in #20556
- [Vision] fix small nit on `BeitDropPath` layers by @younesbelkada in #20587
- Install `natten` with CUDA version by @ydshieh in #20546
- Add entries to `FEATURE_EXTRACTOR_MAPPING_NAMES` by @ydshieh in #20551
- Cleanup some config attributes by @ydshieh in #20554
- [Whisper] Move decoder id method to tokenizer by @sanchit-gandhi in #20589
- Add `require_torch` to 2 pipeline tests by @ydshieh in #20585
- Install `tensorflow_probability` for TF pipeline CI by @ydshieh in #20586
- Ci-whisper-asr by @ArthurZucker in #20588
- cross platform from_pretrained by @ArthurZucker in #20538
- Make convert_to_onnx runable as script again by @mcernusca in #20009
- ESM openfold_utils type hints by @ringohoffman in #20544
- Add RemBERT ONNX config by @hchings in #20520
- Fix link to Swin Model contributor novice03 by @JuanFKurucz in #20557
- Fix link to swin transformers v2 microsoft model by @JuanFKurucz in #20558
- Fix link to table transformer detection microsoft model by @JuanFKurucz in #20560
- clean up unused `classifier_dropout` in config by @ydshieh in #20596
- Fix whisper and speech to text doc by @ArthurZucker in #20595
- Replace `set-output` by `$GITHUB_OUTPUT` by @ydshieh in #20547
- [Vision] `.to` function for ImageProcessors by @younesbelkada in #20536
- [Whisper] Fix decoder ids methods by @sanchit-gandhi in #20599
- Add-whisper-conversion by @ArthurZucker in #20600
- README in Hindi 🇮🇳 by @pacman100 in #20097
- Fix code sample in preprocess by @stevhliu in #20561
- Split autoclasses on modality by @stevhliu in #20559
- Fix test for file not found by @sgugger in #20604
- Rework the pipeline tutorial by @Narsil in #20437
- Documentation fixes by @samuelzxu in #20607
- Adding anchor links to Hindi README by @pacman100 in #20606
- exclude jit time from the speed metric calculation of evaluation and prediction by @sywangyi in #20553
- Check if docstring is None before formating it by @xxyzz in #20592
- updating T5 and BART models to support Prefix Tuning by @pacman100 in #20601
- Fix `AutomaticSpeechRecognitionPipelineTests.run_pipeline_test` by @ydshieh in #20597
- Ci-jukebox by @ArthurZucker in #20613
- Update some GH action versions by @ydshieh in #20537
- Fix dtype of weights in from_pretrained when device_map is set by @sgugger in #20602
- add missing is_decoder param by @stevhliu in #20631
- Fix link to speech encoder decoder model in speech recognition readme by @JuanFKurucz in #20633
- Fix `natten` installation in docker file by @ydshieh in #20632
- Clip floating point constants to bf16 range to avoid inf conversion b...
PyTorch 2.0 support, Audio Spectrogram Transformer, Jukebox, Switch Transformers and more
PyTorch 2.0 stack support
We are very excited by the newly announced PyTorch 2.0 stack. You can enable `torch.compile` on any of our models, and get support with the `Trainer` (and in all our PyTorch examples) by using the `torchdynamo` training argument. For instance, just add `--torchdynamo inductor` when launching those examples from the command line.
This API is still experimental and may be subject to changes as the PyTorch 2.0 stack matures.
Note that to get the best performance, we recommend:
- using an Ampere GPU (or more recent)
- sticking to fixed shapes for now (so use `--pad_to_max_length` in our examples)
Audio Spectrogram Transformer
The Audio Spectrogram Transformer model was proposed in AST: Audio Spectrogram Transformer by Yuan Gong, Yu-An Chung, James Glass. The Audio Spectrogram Transformer applies a Vision Transformer to audio, by turning audio into an image (spectrogram). The model obtains state-of-the-art results for audio classification.
- Add Audio Spectogram Transformer by @NielsRogge in #19981
Jukebox
The Jukebox model was proposed in Jukebox: A generative model for music by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. It introduces a generative music model which can produce minute-long samples that can be conditioned on an artist, genres and lyrics.
- Add Jukebox model (replaces #16875) by @ArthurZucker in #17826
Switch Transformers
The SwitchTransformers model was proposed in Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity by William Fedus, Barret Zoph, Noam Shazeer.
It is the first MoE model supported in `transformers`, with the largest checkpoint currently available containing 1T parameters.
- Add Switch transformers by @younesbelkada and @ArthurZucker in #19323
RocBert
The RoCBert model was proposed in RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining by Hui Su, Weiwei Shi, Xiaoyu Shen, Xiao Zhou, Tuo Ji, Jiarui Fang, Jie Zhou. It’s a pretrained Chinese language model that is robust under various forms of adversarial attacks.
CLIPSeg
The CLIPSeg model was proposed in Image Segmentation Using Text and Image Prompts by Timo Lüddecke and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen CLIP model for zero- and one-shot image segmentation.
- Add CLIPSeg by @NielsRogge in #20066
NAT and DiNAT
NAT
NAT was proposed in Neighborhood Attention Transformer by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
It is a hierarchical vision transformer based on Neighborhood Attention, a sliding-window self attention pattern.
DiNAT
DiNAT was proposed in Dilated Neighborhood Attention Transformer by Ali Hassani and Humphrey Shi.
It extends NAT by adding a Dilated Neighborhood Attention pattern to capture global context, and shows significant performance improvements over it.
- Add Neighborhood Attention Transformer (NAT) and Dilated NAT (DiNAT) models by @alihassanijr in #20219
MobileNetV2
The MobileNet model was proposed in MobileNetV2: Inverted Residuals and Linear Bottlenecks by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen.
MobileNetV1
The MobileNet model was proposed in MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
Image processors
Image processors replace feature extractors as the processing class for computer vision models.
Important changes:
- The `size` parameter is now a dictionary of `{"height": h, "width": w}`, `{"shortest_edge": s}`, or `{"shortest_edge": s, "longest_edge": l}` instead of int or tuple.
- Addition of `data_format` flag. You can now specify if you want your images to be returned in `"channels_first"` - NCHW - or `"channels_last"` - NHWC - format.
- Processing flags e.g. `do_resize` can be passed directly to the `preprocess` method instead of modifying the class attribute: `image_processor([image_1, image_2], do_resize=False, return_tensors="pt", data_format="channels_last")`
- Leaving `return_tensors` unset will return a list of numpy arrays.
The classes are backwards compatible and can be created using existing feature extractor configurations - with the `size` parameter converted.
- Add Image Processors by @amyeroberts in #19796
- Add Donut image processor by @amyeroberts in #20425
- Add segmentation + object detection image processors by @amyeroberts in #20160
- AutoImageProcessor by @amyeroberts in #20111
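To give a feel for the `size` conversion mentioned above, here is a hedged sketch of how an old-style int or tuple could be mapped to the new dictionary form. `to_size_dict` and its `default_to_square` parameter are hypothetical names for this illustration, and the `(height, width)` tuple order is an assumption; the conversion actually applied depends on the model's image processor.

```python
def to_size_dict(size, default_to_square=True):
    """Convert an int or 2-tuple `size` to the dictionary form.

    Hypothetical helper; assumes tuples are ordered (height, width).
    """
    if isinstance(size, dict):
        return dict(size)  # already in the new format
    if isinstance(size, int):
        if default_to_square:
            return {"height": size, "width": size}
        return {"shortest_edge": size}
    if isinstance(size, (tuple, list)) and len(size) == 2:
        height, width = size
        return {"height": height, "width": width}
    raise ValueError(f"unsupported size: {size!r}")


print(to_size_dict(224))                           # {'height': 224, 'width': 224}
print(to_size_dict(224, default_to_square=False))  # {'shortest_edge': 224}
print(to_size_dict((480, 640)))                    # {'height': 480, 'width': 640}
```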
Backbone for computer vision models
We're adding support for a general AutoBackbone class, which turns any vision model (like ConvNeXt, Swin Transformer) into a backbone to be used with frameworks like DETR and Mask R-CNN. The design is in early stages and we welcome feedback.
- Add AutoBackbone + ResNetBackbone by @NielsRogge in #20229
- Improve backbone by @NielsRogge in #20380
- [AutoBackbone] Improve API by @NielsRogge in #20407
Support for safetensors offloading
If the model you are using has a safetensors checkpoint and you have the library installed, offloading to disk will take advantage of it to be more memory efficient and roughly 33% faster.
Contrastive search in the generate method
- Generate: TF contrastive search with XLA support by @gante in #20050
- Generate: contrastive search with full optional outputs by @gante in #19963
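As a rough illustration of what contrastive search does at each decoding step, the sketch below scores every top-k candidate by its model probability minus a penalty for being too similar to the tokens already generated, then keeps the best one. The function names and the toy vectors are hypothetical and this is not the library's implementation; in the generate method the technique is controlled through the penalty_alpha and top_k arguments.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_pick(candidates, context_states, penalty_alpha):
    """Pick one token among top-k candidates (hypothetical helper).

    candidates: list of (token, probability, hidden_state) tuples.
    context_states: hidden states of the tokens generated so far.
    penalty_alpha: weight of the degeneration penalty, between 0 and 1.
    """
    best_token, best_score = None, -math.inf
    for token, prob, state in candidates:
        # Penalty: highest similarity to any previous token's hidden state.
        degeneration = max(cosine(state, h) for h in context_states)
        score = (1 - penalty_alpha) * prob - penalty_alpha * degeneration
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Toy example: "the" is more probable but repeats the existing context.
context = [[1.0, 0.0]]
top_k = [("the", 0.6, [1.0, 0.0]), ("cat", 0.4, [0.0, 1.0])]
print(contrastive_pick(top_k, context, penalty_alpha=0.0))
print(contrastive_pick(top_k, context, penalty_alpha=0.6))
```

With penalty_alpha set to 0 the choice reduces to greedy selection among the candidates; larger values trade probability for diversity.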
Breaking changes
- 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string by @beneyal in #15775
Bugfixes and improvements
- add dataset by @stevhliu in #20005
- Add BERT resources by @stevhliu in #19852
- Add LayoutLMv3 resource by @stevhliu in #19932
- fix typo by @stevhliu in #20006
- Update object detection pipeline to use post_process_object_detection methods by @alaradirik in #20004
- clean up vision/text config dict arguments by @ydshieh in #19954
- make sentencepiece import conditional in bertjapanesetokenizer by @ripose-jp in #20012
- Fix gradient checkpoint test in encoder-decoder by @ydshieh in #20017
- Quality by @sgugger in #20002
- Update auto processor to check image processor created by @amyeroberts in #20021
- [Doctest] Add configuration_deberta_v2.py by @Saad135 in #19995
- Improve model tester by @ydshieh in #19984
- Fix doctest by @ydshieh in #20023
- Show installed libraries and their versions in CI jobs by @ydshieh in #20026
- reorganize glossary by @stevhliu in #20010
- Now supporting pathlike in pipelines too. by @Narsil in #20030
- Add **kwargs by @amyeroberts in #20037
- Fix some doctests after PR 15775 by @ydshieh in #20036
- [Doctest] Add configuration_camembert.py by @Saad135 in #20039
- [Whisper Tokenizer] Make more user-friendly by @sanchit-gandhi in #19921
- [FuturWarning] Add futur warning for LEDForSequenceClassification by @ArthurZucker in #19066
- fix jit trace error for model forward sequence is not aligned with jit.trace tuple input sequence, update related doc by @sywangyi in #19891
- Update esmfold conversion script by @Rocketknight1 in #20028
- Fixed torch.finfo issue with torch.fx by @michaelbenayoun in #20040
- Only resize embeddings when necessary by @sgugger in #20043
- Speed up TF token classification postprocessing by converting complete tensors to numpy by @deutschmn in #19976
- Fix ESM LM head test by @Rocketknight1 in #20045
- Update README.md by @bofenghuang in #20063
- fix tokenizer_type to avoid error when loading checkpoint back by @pacman100 in #20062
- [Trainer] Fix model name in push_to_hub by @sanchit-gandhi in #20064
- PoolformerImageProcessor defaults to match previous FE by @amyeroberts in #20048
- change constant torch.tensor to torch.full by @MerHS in #20061
- Update READMEs for ESMFold and add notebooks by @Rocketknight1 in #20067
- Update documentation on seq2seq models with absolute positional embeddings, to be in line with Tips section for BERT and GPT2 by @jordiclive in #20068
- Allow passing arguments to model testers for CLIP-like models by @ydshieh in #20044
- Show installed libraries and their versions in GA jobs by @ydshieh in #20069
- Update defaults and logic to match old FE by @amyeroberts in #20065
- Update modeling_tf_utils.py by @cakiki in #20076
- Update hub.py by @cakiki in #20075
- [Doctest] Add configuration_dpr.py by @Saad135 in #20080
- Removing RobertaConfig inheritance from CamembertConfig by @Saad135 in #20059
- Skip 2 tests in VisionTextDualEncoderProcessorTest by @ydshieh in #20098
- Replace unsupported facebookresearch/bitsandbytes by @tomaarsen in #20093
- docs: Resolve many typos in the English docs by @tomaarsen in #20088
- use huggingface_hub.model_inifo() to get pipline_tag by @y-tag in #20077
- Fix generate_dummy_inputs for ImageGPTOnnxConfig by @ydshieh in #20103
- docs: Fixed variables in f-strings by @tomaarsen in #20087
- Add...