Transformers without Tears: Improving the Normalization of Self-Attention
Self-attention Does Not Need O(n²) Memory
Language Modelling with Pixels
Post-hoc Interpretability for Neural NLP: A Survey
Scaling Laws and Interpretability of Learning from Repeated Data
Compositional Attention: Disentangling Search and Retrieval
Automated Concatenation of Embeddings for Structured Prediction
A Knowledge-based System for Multilingual Named Entity Recognition
Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study
A Comparative Study of Pre-trained Encoders for Low-Resource Named Entity Recognition
Boundary Smoothing for Named Entity Recognition
Block Pruning For Faster Transformers
Natural Language Descriptions of Deep Visual Features
OCR-free Document Understanding Transformer
A Benchmark for Automatic Medical Consultation System: Frameworks, Tasks and Datasets
OPT: Open Pre-trained Transformer Language Models
Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors
Papers that are more general and not limited to NLP (or even focused only on other tasks)
Towards a Unified View of Parameter-Efficient Transfer Learning
Learning Pruning-Friendly Networks via Frank-Wolfe: One-Shot, Any-Sparsity, And No Retraining
8-bit Optimizers via Block-wise Quantization
StyleAlign: Analysis and Applications of Aligned StyleGAN Models
Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt