This repository helps:
- Someone who is looking for a quick transformer-based classifier with a low computation budget:
  - Simple data format
  - Simple environment setup
  - Quick identifiability
- Someone who wants to tweak the sizes of the key vector and the value vector independently.
- Someone who wants to make their analysis of attention weights more reliable. How? See below.
As shown in our work (both experimentally and theoretically): for a given input X, a set of attention weights A, and transformer output prediction probabilities Y, if we can find another set of attention weights A* (generatable by the architecture) that also maps X to Y, then any analysis performed over A is prone to be inaccurate.
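To make this concrete, here is a minimal numerical sketch of the phenomenon (our illustration, not code from this repository): when the sequence length T exceeds the per-head value dimension plus one, the matrix [V | 1] has a non-trivial left null space, so an attention row can be perturbed while staying on the probability simplex and leaving the head output unchanged.

```python
import torch

torch.manual_seed(0)

T, d_v = 12, 4                                  # sequence length > d_v + 1 => non-identifiable
V = torch.randn(T, d_v)                         # value vectors of a single head (T x d_v)
A = torch.softmax(torch.randn(T, T), dim=-1)    # attention weights, each row on the simplex

# Look for u with u @ V = 0 and u.sum() = 0, i.e. a left-null vector of [V | 1].
M = torch.cat([V, torch.ones(T, 1)], dim=1)     # T x (d_v + 1)
U, S, Vh = torch.linalg.svd(M, full_matrices=True)
u = U[:, -1]                                    # exists whenever T > d_v + 1

# Perturb one attention row along u, small enough to stay non-negative.
i = 3
eps = 0.5 * (A[i] / u.abs().clamp_min(1e-9)).min()
A_star = A.clone()
A_star[i] = A[i] + eps * u

print(torch.allclose(A @ V, A_star @ V, atol=1e-5))   # same head output: True
print(A_star[i].min() >= 0, A_star[i].sum())          # still a valid attention row: True, ~1.0
```

Enlarging the per-head value dimension shrinks this null space, which is the intuition behind the changes described next.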
Idea:
- decrease the size of the key vector,
- increase the size of the value vector and add the head outputs instead of concatenating them (a rough sketch follows below).
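A minimal PyTorch sketch of this modification, assuming a single self-attention layer; the names embedim, nhead, kdim, and add_heads mirror the command-line arguments documented below, but the module itself is our illustration rather than the repository's implementation.

```python
import torch
import torch.nn as nn

class SmallKeyAdditiveHeads(nn.Module):
    """Multi-head self-attention with an independent (small) key/query dimension and
    head outputs combined by addition instead of concatenation (illustrative sketch)."""

    def __init__(self, embedim=256, nhead=4, kdim=16, add_heads=True):
        super().__init__()
        self.nhead, self.kdim, self.add_heads = nhead, kdim, add_heads
        # With addition, every head can use the full embedding size as its value dim;
        # with concatenation, the value dim is split across heads (embedim // nhead).
        self.vdim = embedim if add_heads else embedim // nhead
        self.q = nn.Linear(embedim, nhead * kdim)
        self.k = nn.Linear(embedim, nhead * kdim)
        self.v = nn.Linear(embedim, nhead * self.vdim)
        self.out = nn.Linear(embedim, embedim)

    def forward(self, x):                       # x: (batch, seq_len, embedim)
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.nhead, self.kdim).transpose(1, 2)
        k = self.k(x).view(B, T, self.nhead, self.kdim).transpose(1, 2)
        v = self.v(x).view(B, T, self.nhead, self.vdim).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.kdim ** 0.5, dim=-1)
        heads = attn @ v                        # (B, nhead, T, vdim)
        if self.add_heads:
            merged = heads.sum(dim=1)           # add head outputs: (B, T, embedim)
        else:
            merged = heads.transpose(1, 2).reshape(B, T, -1)  # concatenate: (B, T, embedim)
        return self.out(merged), attn
```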
Our paper: R. Bhardwaj, N. Majumder, S. Poria, E. Hovy. More Identifiable yet Equally Performant Transformers for Text Classification. ACL 2021. (the latest version is available here.)
- I have tried it on Python 3.9.2 (since the dependencies are kept as low as possible, it should be easy to run/adapt on other Python versions).
- PyTorch version 1.8.1
- Torchtext version 0.9.1
- Pandas version 0.9.1
declare@lab:~$ python text_classifier.py -dataset data.csv
Note: Feel free to replace data.csv with your choice of text classification problem, be it sentiment, news topic, reviews, etc.
The CSV should have two columns: the column with labels must be headed "label" and the column with text must be headed "text". For example:
text | label |
---|---|
we love NLP | 5 |
I ate too much, feeling sick | 1 |
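If you need to create such a file, here is a minimal pandas sketch (the file name and rows are just the example above):

```python
import pandas as pd

# Two columns, headed exactly "text" and "label", as in the example above.
df = pd.DataFrame({
    "text": ["we love NLP", "I ate too much, feeling sick"],
    "label": [5, 1],
})
df.to_csv("data.csv", index=False)
```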
By the way, you can also try the classification datasets provided by Torchtext. For the AG_NEWS dataset:
declare@lab:~$ python text_classifier.py -kdim 64 -dataset ag_news
For quick experiments on a variety of text classification datasets, replace ag_news with imdb for IMDb, sogou for SogouNews, yelp_p for YelpReviewPolarity, yelp_f for YelpReviewFull, amazon_p for AmazonReviewPolarity, amazon_f for AmazonReviewFull, yahoo for YahooAnswers, or dbpedia for DBpedia.
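For example, the IMDb and YelpReviewPolarity runs look like:
declare@lab:~$ python text_classifier.py -kdim 64 -dataset imdb
declare@lab:~$ python text_classifier.py -kdim 64 -dataset yelp_p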
Keep a low k-dim and/or switch to head addition using the -add_heads flag. Feel free to analyze attention weights for inputs with lengths up to the embedding dimension, which is specified by the -embedim argument, when running the command below.
declare@lab:~$ python text_classifier.py -kdim 16 -add_heads -dataset ag_news -embedim 256
Note:
- A lower k-dim may or may not impact the classification accuracy; please keep this possible trade-off in mind during experiments.
- It is recommended to keep embedim close to the maximum text length (see the max_text_len parameter below). However, make sure you do not over-parametrize the model just to make attention weights identifiable for large text lengths.
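Once the model returns attention tensors (see the -return_attn flag below), a common first step is to plot a head's weight matrix as a heatmap. The snippet below is a generic sketch that assumes matplotlib is installed and that you have an attention tensor of shape (nhead, T, T) together with its tokens; the actual return format of the script may differ.

```python
import matplotlib.pyplot as plt
import torch

# Hypothetical example: a (nhead, T, T) attention tensor and the corresponding tokens.
tokens = ["we", "love", "NLP"]
attn = torch.softmax(torch.randn(4, len(tokens), len(tokens)), dim=-1)

head = 0
plt.imshow(attn[head].detach().numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar(label="attention weight")
plt.title(f"Head {head} attention")
plt.show()
```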
- batch: training batch size (default = 64).
- nhead: number of attention heads (default = 4).
- epochs: number of training epochs (default = 10).
- lr: learning rate (default = 0.001).
- dropout: dropout regularization parameter (default = 0.1).
- vocab_size: set a threshold on the vocabulary size (default = 100000).
- max_text_len: trim the text longer than this value (default = 512).
- test_frac: only for user-specified datasets; fraction of the data held out as the test set (default = 0.3).
- valid_frac: fraction of training samples kept aside for model development (default = 0.3).
- kdim: dimensions of key (and query) vector (default = 16).
- add_heads: set this flag to replace the concatenation of multi-head outputs with addition.
- pos_emb: set this flag to use positional embeddings.
- return_attn: set this flag to return the attention tensors from the model.
- embedim: dimension of the token vectors; together with add_heads, it decides the value vector dimension (vdim), i.e.,
add_heads | vdim |
---|---|
False | embedim/nhead |
True | embedim |
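In code, this relationship might look like the following helper (our sketch, assuming the standard convention that concatenation splits embedim evenly across the heads):

```python
def value_dim(embedim: int, nhead: int, add_heads: bool) -> int:
    """Per-head value dimension: the full embedim when head outputs are added,
    embedim split across heads when they are concatenated (illustrative sketch)."""
    return embedim if add_heads else embedim // nhead

assert value_dim(256, 4, add_heads=True) == 256   # addition: vdim = embedim
assert value_dim(256, 4, add_heads=False) == 64   # concatenation: vdim = embedim / nhead
```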
R. Bhardwaj, N. Majumder, S. Poria, E. Hovy. More Identifiable yet Equally Performant Transformers for Text Classification. ACL 2021.
Note: Please cite our paper if you find this repository useful. The latest version is available here.