Skip to content

Latest commit

 

History

History
1611 lines (1221 loc) · 79.4 KB

README.md

File metadata and controls

1611 lines (1221 loc) · 79.4 KB

Description

reference pytorch code for intent(sentence) classification.


Requirements

  • python >= 3.6

  • pip install -r requirements.txt


Data


Pretrained models

  • glove

    $ mkdir embeddings
    $ ls embeddings
    glove.6B.zip
    $ unzip glove.6B.zip 
    
  • BERT-like models(huggingface's transformers)

  • SpanBERT

    • pretrained SpanBERT models are compatible with huggingface's BERT modele except 'bert.pooler.dense.weight', 'bert.pooler.dense.bias'.

Snips data

experiments summary

Accuracy (%) GPU / CPU ONNX Dynamic QAT/FX/DiffQ Inference Inference+Dynamic Inference+QAT/FX/DiffQ Inference+ONNX/Quant Etc
GloVe, GNB 80.43 1.2929 / - - - - - - - -
GloVe, CNN 97.86 1.9874 / 4.1068 2.2263 4.3975 3.0899 / - / - 1.9398 2.9012 1.4329 / - / - 0.5270 / FAIL threads=14
GloVe, Densenet-CNN 97.57 3.6094 / - 3.0717 - - 4.9481 - - 0.8658 / FAIL threads=14
GloVe, Densenet-DSA 97.43 7.5007 / - 4.4936 - - 7.2086 - - 1.5420 / FAIL threads=14
TinyBERT, CLS 97.29 6.0523 / - 5.3673 14.0268 - 5.7238 6.1879 - 1.7821 / 1.8446 threads=14
BERT-small, CLS 98.00 5.9837 / - - 15.2820 - 7.4538 7.2436 - 3.5445 / 2.4141 threads=14
DistilBERT, CLS 97.71 8.0221 / 34.3049 28.6644 31.7260 FAIL / FAIL / - 16.3812 11.2421 FAIL / FAIL / 13.7443 6.1573 / 4.6346 threads=14
SqueezeBERT, CLS 97.29 18.0796 / - - 23.8565 - 20.3999 20.0118 - 11.9890 / FAIL threads=14
MiniLM, CLS 96.86 12.2094 / - 17.5638 38.2084 - 16.8337 17.7702 - 5.0394 / 4.3123 threads=14
MobileBERT, CLS 96.43 49.9843 / - 46.2151 84.4232 - 51.9885 51.4533 - 15.3492 / 12.4416 threads=14
BERT-base, CNN 97.57 12.1273 / - - - - 34.7878 30.5454 - - threads=14
BERT-base, CLS 97.43 12.7714 / - 46.4263 49.4747 - 30.7979 24.5353 - 16.9756 threads=14
BERT-base, CLS 97.00 9.2660 / - 31.5400 33.4623 - 16.7419 13.5703 - 11.7487 del 8,9,19,11, threads=14
BERT-base, Densenet-CNN 96.43 12.1489 / - 24.4789 - - 21.1535 - - 7.0717 / 10.3142 del 6,7,8,9,10,11, threads=14, BERT as feature-based
BERT-large, CNN 98.00 24.277 / - - - - - - - -
BERT-large, CLS 97.86 23.542 / - - - - - - - -
* GPU / CPU : Elapsed time/example(ms), GPU / CPU  [Tesla V100 1 GPU, Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz, 2 CPU, 14CORES/1CPU, HyperThreading]
* ONNX : --enable_ort
* Dynamic : --enable_dqm
* Inference : --enable_inference
* QAT(Quantization Aware Training)/FX/DiffQ : --enable_qat / --enable_qat_fx / --enable_diffq
* Inference+Dynamic : --enable_inference --enable_dqm
* Inference+QAT/FX/DiffQ : --enable_inference --enable_qat / --enable_inference --enable_qat_fx / --enable_inference --enable_diffq
* Inference+ONNX/Quant : --convert_onnx, --enable_inference --enable_ort / --quantize_onnx, --enable_inference --enable_ort
* default batch size, learning rate, n_ctx(max_seq_length) : 128, 2e-4, 100
* default epoch : 3
* number of tokens / sentence : MEAN : 9.08, MAX:24, MIN:3, MEDIAN:9
# threads Model Inference+ONNX / Inference+QuantizedONNX Etc
1 GloVe, Densenet-DSA 3.33 / -
1 DistilBERT, CLS 51.77 / 15.66
2 DistilBERT, CLS 28.38 / 9.71
3 DistilBERT, CLS 21.47 / 7.79
4 DistilBERT, CLS 18.75 / 6.69
5 DistilBERT, CLS 15.23 / 6.09
6 DistilBERT, CLS 14.22 / 5.69
7 DistilBERT, CLS 12.52 / 5.44
8 DistilBERT, CLS 10.46 / 5.21 good enough
9 DistilBERT, CLS 10.93 / 5.17
10 DistilBERT, CLS 9.75 / 4.99
11 DistilBERT, CLS 9.22 / 4.98
12 DistilBERT, CLS 10.11 / 4.91
13 DistilBERT, CLS 9.45 / 4.81
14 DistilBERT, CLS 9.31 / 4.74
emb_class=glove, enc_class=gnb

  • train
* token_emb_dim in configs/config-glove-gnb.json == 300 (ex, glove.6B.300d.txt )
$ python preprocess.py --config=configs/config-glove-gnb.json
$ python train.py --config=configs/config-glove-gnb.json
  • evaluation
$ python evaluate.py --config=configs/config-glove-gnb.json
INFO:__main__:[Accuracy] : 0.8043,   563/  700
INFO:__main__:[Elapsed Time] : 980.9308052062988ms, 1.292972264542259ms on average

emb_class=glove, enc_class=cnn

  • train
* token_emb_dim in configs/config-glove-cnn.json == 300 (ex, glove.6B.300d.txt )
$ python preprocess.py --config=configs/config-glove-cnn.json
$ python train.py --config=configs/config-glove-cnn.json --embedding_trainable

* tensorboardX
$ rm -rf runs
$ python -m tensorboard.main --logdir runs/ --port port-number --bind_all
  • evaluation
$ python evaluate.py --config=configs/config-glove-cnn.json
INFO:__main__:[Accuracy] : 0.9786,   685/  700
INFO:__main__:[Elapsed Time] : 1351ms, 1.793991416309013ms on average

emb_class=glove, enc_class=densenet-cnn

  • train
* token_emb_dim in configs/config-densenet-cnn.json == 300 (ex, glove.6B.300d.txt )
$ python preprocess.py --config=configs/config-densenet-cnn.json
$ python train.py --config=configs/config-densenet-cnn.json --embedding_trainable
  • evaluation
$ python evaluate.py --config=configs/config-densenet-cnn.json

INFO:__main__:[Accuracy] : 0.9757,   683/  700
INFO:__main__:[Elapsed Time] : 2633ms, 3.609442060085837ms on average

emb_class=glove, enc_class=densenet-dsa

  • train
* token_emb_dim in configs/config-densenet-dsa.json == 300 (ex, glove.6B.300d.txt )
$ python preprocess.py --config=configs/config-densenet-dsa.json
$ python train.py --config=configs/config-densenet-dsa.json
  • evaluation
$ python evaluate.py --config=configs/config-densenet-dsa.json

INFO:__main__:[Accuracy] : 0.9743,   682/  700
INFO:__main__:[Elapsed Time] : 5367ms, 7.500715307582261ms on average

emb_class=bert, enc_class=cnn | cls | densenet-cnn

  • train
* n_ctx size should be less than 512

* enc_class=cnn

$ python preprocess.py --config=configs/config-bert-cnn.json --bert_model_name_or_path=./embeddings/bert-base-uncased
$ python train.py --config=configs/config-bert-cnn.json --bert_model_name_or_path=./embeddings/bert-base-uncased --bert_output_dir=bert-checkpoint --lr=5e-5 --epoch=3 --batch_size=64

* enc_class=cls

$ python preprocess.py --config=configs/config-bert-cls.json --bert_model_name_or_path=./embeddings/bert-base-uncased
$ python train.py --config=configs/config-bert-cls.json --bert_model_name_or_path=./embeddings/bert-base-uncased --bert_output_dir=bert-checkpoint --lr=5e-5 --epoch=3 --batch_size=64

* enc_class=densenet-cnn

$ python preprocess.py --config=configs/config-bert-densenet-cnn.json --bert_model_name_or_path=./embeddings/bert-base-uncased
$ python train.py --config=configs/config-bert-densenet-cnn.json --bert_model_name_or_path=./embeddings/bert-base-uncased --bert_output_dir=bert-checkpoint --lr=0.00034429153877017527 --epoch=3 --batch_size=64 --seed 34 --bert_remove_layers=6,7,8,9,10,11 --bert_use_feature_based
INFO:__main__:[study.best_params] : {'lr': 0.00034429153877017527, 'batch_size': 64, 'seed': 34, 'epochs': 2}

  • evaluation
* enc_class=cnn

$ python evaluate.py --config=configs/config-bert-cnn.json --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9757,   683/  700
INFO:__main__:[Elapsed Time] : 10624ms, 12.127324749642346ms on average
  
** --bert_model_name_or_path=bert-large-uncased --lr=2e-5
INFO:__main__:[Accuracy] : 0.9800,   686/  700
INFO:__main__:[Elapsed Time] : 16994ms, 24.277142857142856ms on average

* enc_class=cls
$ python evaluate.py --config=configs/config-bert-cls.json --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9743,   682/  700
INFO:__main__:[Elapsed Time] : 8940ms, 12.771428571428572ms on average
  
** --bert_model_name_or_path=bert-large-uncased --lr=2e-5
INFO:__main__:[Accuracy] : 0.9786,   685/  700
INFO:__main__:[Elapsed Time] : 16480ms, 23.542857142857144ms on average

** --bert_remove_layers=8,9,10,11 
INFO:__main__:[Accuracy] : 0.9700,   679/  700
INFO:__main__:[Elapsed Time] : 6911ms, 9.266094420600858ms on average

** --config=configs/config-distilbert-cls.json --bert_model_name_or_path=./embeddings/distilbert-base-uncased
INFO:__main__:[Accuracy] : 0.9771,   684/  700
INFO:__main__:[Elapsed Time] : 6607ms, 9.30758226037196ms on average

** --config=configs/config-bert-cls.json --bert_model_name_or_path=./embeddings/squeezebert-uncased
INFO:__main__:[Accuracy] : 0.9729,   681/  700
INFO:__main__:[Elapsed Time] : 12742.885112762451ms, 18.07960045013646ms on average

** --config=configs/config-bert-cls.json --bert_model_name_or_paht=./embeddings/pytorch.uncased_L-4_H-512_A-8
INFO:__main__:[Accuracy] : 0.9800,   686/  700
INFO:__main__:[Elapsed Time] : 4279.639005661011ms, 5.983798800619887ms on average

** --config=configs/config-bert-cls.json --bert_model_name_or_paht=./embeddings/MiniLM-L12-H384-uncased
INFO:__main__:[Accuracy] : 0.9686,   678/  700
INFO:__main__:[Elapsed Time] : 8598.786115646362ms, 12.209460459042004ms on average

** --config=configs/config-bert-cls.json --bert_model_name_or_paht=./embeddings/TinyBERT_General_4L_312D
INFO:__main__:[Accuracy] : 0.9729,   681/  700
INFO:__main__:[Elapsed Time] : 4330.849170684814ms, 6.052388653734723ms on average

** --config=configs/config-bert-cls.json --bert_model_name_or_paht=./embeddings/mobilebert-uncased
INFO:__main__:[Accuracy] : 0.9643,   675/  700
INFO:__main__:[Elapsed Time] : 35100.63934326172ms, 49.98437324818624ms on average

* enc_class=densent-cnn
$ python evaluate.py --config=configs/config-bert-densenet-cnn.json --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9600,   672/  700
INFO:__main__:[Elapsed Time] : 8314.800024032593ms, 11.749483143993372ms on average

INFO:__main__:[Accuracy] : 0.9643,   675/  700
INFO:__main__:[Elapsed Time] : 8580.491542816162ms, 12.148977860872327ms on average


SST-2 data

experiments summary

  • iclassifier
Accuracy (%) GPU/CPU Dynamic Etc
GloVe, GNB 72.27 1.2253 / - - / -
GloVe, CNN 83.09 1.9943 / 4.5757 - / 4.8686 threads=14
ConceptNet, CNN 84.79 2.8304 / - - / -
ConceptNet, CNN 84.90 2.7672 / - - / - optuna
GloVe, DenseNet-CNN 86.38 3.6203 / - threads=14
ConceptNet, DenseNet-CNN 87.26 3.7052 / -
ConceptNet, DenseNet-CNN 87.26 3.5377 / - optuna
GloVe, DenseNet-DSA 85.34 6.2450 / -
DistilFromBERT, GloVe, CNN 86.16 1.7900 / - augmented, from large
DistilFromBERT, GloVe, DenseNet-CNN 88.52 3.6788 / - augmented, from large
DistilFromBERT, GloVe, DenseNet-DSA 88.14 8.4647 / - augmented, from large
DistilFromBERT, BERT-small, CLS 89.29 5.9408 / - fastformers, from base
DistilFromBERT, BERT-small, CLS 91.49 5.9114 / - fastformers, augmented, n_iter=20, from base
DistilFromBERT, BERT-small, CLS 90.33 6.0072 / - fastformers, augmented, n_iter=20, from large
DistilFromBERT, BERT-small, CLS 91.16 6.2050 / - fastformers, augmented, n_iter=10, from base
DistilFromBERT, BERT-small, CLS 91.27 6.5702 / - fastformers, augmented, n_iter=10, from base, meta pseudo labels
DistilFromELECTRA, BERT-small, CLS 89.73 7.5992 / - fastformers, from large
DistilFromRoBERTa, GloVe, CNN 86.55 1.8483 / - augmented, from large
DistilFromRoBERTa, GloVe, DenseNet-CNN 88.80 3.9580 / - augmented, from large
DistilFromRoBERTa, GloVe, DenseNet-DSA 88.25 8.5627 / - augmented, from large
DistilFromELECTRA, GloVe, CNN 86.55 1.7466 / - augmented, from large
DistilFromELECTRA, GloVe, DenseNet-CNN 89.79 3.6406 / - augmented, from large
DistilFromELECTRA, GloVe, DenseNet-DSA 88.58 8.3708 / - augmented, from large
DistilFromELECTRA, DistilBERT, CLS 93.52 7.4879 / - augmented, from large
BERT-tiny, CNN 79.08 4.8604 / -
BERT-tiny, CLS 80.83 3.8461 / -
BERT-mini, CNN 83.36 7.0983 / -
BERT-mini, CLS 83.69 5.5521 / -
BERT-small, CNN 87.53 7.2010 / -
BERT-small, CLS 88.25 5.9604 / -
BERT-medium, CNN 88.58 11.9082 / -
BERT-medium, CLS 89.24 9.5857 / -
TinyBERT, CNN 88.14 6.9060 / - - / -
TinyBERT, CLS 88.19 5.9246 / - - / -
TinyBERT, CLS 89.84 5.9194 / - - / - epoch=30
DistilBERT, CNN 89.90 9.9362 / - - / 35.7070 threads=14
DistilBERT, CLS 91.10 8.9719 / - - / 29.4646 threads=14
DistilBERT, CLS 92.04 6.9790 / - - / - epoch=30
mDistilBERT, CLS 87.10 7.9757 / - - / - epoch=30
MiniLM, CNN 91.49 13.5255 / - - / -
MiniLM, CLS 91.21 12.2066 / - - / -
MiniLM, CLS 93.25 11.5939 / - - / - epoch=30
MobileBERT, CLS 91.05 55.0898 / - - / -
BERT-base, CNN 92.04 14.1576 / -
BERT-base, CLS 92.42 12.7549 / 62.5050 66.4545(92.42) / 50.8080 threads=14
BERT-base, CLS 93.36 15.6755 / - - / - fintuned using amazon reviews, epoch=10
BERT-base, CLS 93.25 14.2535 / - - / - augmented, n_iter=20
BERT-base, CLS 92.81 15.2709 / - - / - fastformers, augmented, n_iter=10
BERT-base, CLS 93.36 15.2605 / - - / - fastformers, augmented, n_iter=10, meta pseudo lables
BERT-base, CNN 90.55 10.6824 / - del 8,9,10,11
BERT-base, CLS 91.49 8.7747 / 42.8989 44.7676(90.61) / 34.3131 del 8,9,10,11, threads=14
BERT-base, CLS 90.23 7.0241 / - del 6,7,8,9,10,11, threads=14
BERT-base, CLS 86.66 5.8868 / - del 4,5,6,7,8,9,10,11, threads=14
BERT-base, CLS 90.44 17.6778 / - BERT as finetune-last
BERT-base, CLS 87.04 9.3327 / - del 6,7,8,9,10,11, BERT as finetune-last
BERT-base, CLS 85.55 6.2722 / - del 3,4,5,6,7,8,9,10,11, BERT as finetune-last
BERT-base, Densenet-CNN 90.06 13.1141 / - - / - del 6,7,8,9,10,11, BERT as feature-based, epoch=10
BERT-base, Densenet-CNN 90.88 13.2195 / - - / - del 6,7,8,9,10,11, BERT as finetune-last, epoch=10
BERT-large, CNN 93.08 28.6490 / -
BERT-large, CLS 94.12 22.3767 / -
BERT-large, CLS 93.57 27.3209 / - fintuned using amazon reviews
BERT-large, CNN 88.47 14.7813 / - del 12~23
BERT-large, CLS 86.71 12.1560 / - del 12~23
SqueezeBERT, CNN 90.61 19.2879 / - epoch=20
SqueezeBERT, CLS 90.12 17.4998 / - epoch=20
SpanBERT-base, CNN 91.82 15.2098 / -
SpanBERT-base, CLS 91.49 13.1516 / -
SpanBERT-large, CNN 93.90 26.8609 / -
SpanBERT-large, CLS 93.96 26.0445 / -
ALBERT-base, CNN 92.04 16.0554 / - epoch=10
ALBERT-base, CLS 90.01 14.6725 / - epoch=10
ALBERT-xxlarge, CNN 95.77 57.4631 / - epoch=10
ALBERT-xxlarge, CLS 94.45 51.8027 / - epoch=10
RoBERTa-base, CNN 92.92 15.1016 / - epoch=10
RoBERTa-base, CLS 93.03 14.6736 / - epoch=10
XLM-RoBERTa-base, CLS 91.49 14.2246 / - epoch=10
RoBERTa-base, CNN 92.26 11.5241 / - del 8,9,10,11 , epoch=10
RoBERTa-base, CLS 91.76 10.0296 / - del 8,9,10,11 , epoch=10
RoBERTa-large, CNN 95.55 26.9807 / - epoch=10
RoBERTa-large, CLS 95.66 23.7395 / - epoch=10
XLM-RoBERTa-large, CLS 92.04 24.8045 / - epoch=10
BART-large, CNN 94.45 35.1708 / - epoch=10
BART-large, CLS 94.89 33.3862 / - epoch=10
ELECTRA-base, CNN 95.39 14.9802 / - epoch=10
ELECTRA-base, CLS 95.22 14.0087 / - epoch=10
ELECTRA-large, CNN 96.05 27.2868 / - epoch=15
ELECTRA-large, CLS 96.43 25.6857 / - epoch=15
DeBERTa-base, CLS 93.41 26.1533 / - epoch=10
DeBERTa-large, CLS 94.95 62.4272 / - epoch=10
DeBERTa-v2-xlarge, CLS 96.21 54.1388 / - epoch=10
BORT, CLS 77.98 6.1346 / - epoch=10
ConvBERT, CLS 77.48 22.6815 / - epoch=10
GPT2-large, CLS 94.45 36.5779 / - epoch=10
GPT2-large, CLS 92.81 42.2791 / - epoch=10, accelerate, deepspeed, fp16
GPT2-xlarge, CLS 93.96 49.2241 / - epoch=10, accelerate, deepspeed, fp16, 1.5B
GPT-NEO, CLS 92.31 36.1055 / - epoch=10, accelerate, deepspeed, fp16, 2.7B, half()
GPT-J-6B, CLS 93.36 78.8771 / - epoch=10, accelerate, deepspeed, fp16, 6B, half()
T5-large, CLS 95.39 29.3724 / - epoch=10
T5-large, CLS 95.55 30.3232 / - epoch=10, accelerate, deepspeed, fp16
T5-3B, CLS 95.99 34.8998 / - epoch=10, accelerate, deepspeed, fp16, 3B/2(T5EncoderModel)
T5-3B, CLS 96.43 33.8611 / - epoch=10, accelerate, deepspeed, 3B/2(T5EncoderModel)
T5-11B, CLS 95.61 113.8510/ - epoch=10, accelerate, deepspeed, fp16, 11B/2(T5EncoderModel)
MegatronBERT-345m, CLS 96.21 31.2049 / - epoch=10
MegatronGPT2-345m, CLS 92.81 27.8698 / - epoch=10
Accuracy (%)
T5-3B 97.4
ALBERT 97.1
RoBERTa 96.7
MT-DNN 95.6
DistilBERT 92.7
emb_class=glove, enc_class=gnb

  • train
* token_emb_dim in configs/config-glove-gnb.json == 300 (ex, glove.6B.300d.txt )
$ python preprocess.py --config=configs/config-glove-gnb.json --data_dir=data/sst2
$ python train.py --config=configs/config-glove-gnb.json --data_dir=data/sst2 --lr=1e-3
  • evaluation
$ python evaluate.py --config=configs/config-glove-gnb.json --data_dir=data/sst2
INFO:__main__:[Accuracy] : 0.7227,  1316/ 1821
INFO:__main__:[Elapsed Time] : 2310.748338699341ms, 1.2253080095563615ms on average

emb_class=glove, enc_class=cnn

  • train
* token_emb_dim in configs/config-glove-cnn.json == 300 (ex, glove.6B.300d.txt )
$ python preprocess.py --data_dir=data/sst2
$ python train.py --data_dir=data/sst2 --lr=1e-3 
  • evaluation
$ python evaluate.py --data_dir=data/sst2

INFO:__main__:[Accuracy] : 0.8281,  1508/ 1821
INFO:__main__:[Elapsed Time] : 3300ms, 1.767032967032967ms on average

INFO:__main__:[Accuracy] : 0.8309,  1513/ 1821
INFO:__main__:[Elapsed Time] : 3725.560188293457ms, 1.994345607338371ms on average

* --embedding_path=./embeddings/numberbatch-en-19.08.txt (from https://github.com/commonsense/conceptnet-numberbatch)
INFO:__main__:[Accuracy] : 0.8479,  1544/ 1821
INFO:__main__:[Elapsed Time] : 5237.588167190552ms, 2.830480088244428ms on average

* --embedding_path=./embeddings/numberbatch-en-19.08.txt --seed 36 --batch_size 32 --lr 0.000685889661509286 (by optuna)
INFO:__main__:[Accuracy] : 0.8490,  1546/ 1821
INFO:__main__:[Elapsed Time] : 5137.487411499023ms, 2.7672590790214118ms on average

emb_class=glove, enc_class=densenet-cnn

  • train
* token_emb_dim in configs/config-densenet-cnn.json == 300 (ex, glove.6B.300d.txt )
$ python preprocess.py --config=configs/config-densenet-cnn.json --data_dir=data/sst2
$ python train.py --config=configs/config-densenet-cnn.json --data_dir=data/sst2
  • evaluation
$ python evaluate.py --config=configs/config-densenet-cnn.json --data_dir=data/sst2

INFO:__main__:[Accuracy] : 0.8638,  1573/ 1821
INFO:__main__:[Elapsed Time] : 6678ms, 3.6203296703296703ms on average

* --embedding_path=./embeddings/numberbatch-en-19.08.txt (from https://github.com/commonsense/conceptnet-numberbatch)
INFO:__main__:[Accuracy] : 0.8726,  1589/ 1821
INFO:__main__:[Elapsed Time] : 6822.043418884277ms, 3.7052182050851674ms on average

* --embedding_path=./embeddings/numberbatch-en-19.08.txt --seed 42 --batch_size 64 --lr 0.0005115118656470668 (by optuna)
INFO:__main__:[Accuracy] : 0.8726,  1589/ 1821
INFO:__main__:[Elapsed Time] : 6442.193614435720869ms, 3.537723017262889ms on average

emb_class=glove, enc_class=densenet-dsa

  • train
* token_emb_dim in configs/config-densenet-dsa.json == 300 (ex, glove.6B.300d.txt )
$ python preprocess.py --config=configs/config-densenet-dsa.json --data_dir=data/sst2
$ python train.py --config=configs/config-densenet-dsa.json --data_dir=data/sst2
  • evaluation
$ python evaluate.py --config=configs/config-densenet-dsa.json --data_dir=data/sst2

INFO:__main__:[Accuracy] : 0.8534,  1554/ 1821
INFO:__main__:[Elapsed Time] : 11459ms, 6.245054945054945ms on average

* try again
INFO:__main__:[Accuracy] : 0.8506,  1549/ 1821
INFO:__main__:[Elapsed Time] : 21745ms, 11.885714285714286ms on average

* softmax masking
INFO:__main__:[Accuracy] : 0.8473,  1543/ 1821
INFO:__main__:[Elapsed Time] : 19214ms, 10.477472527472527ms on average

emb_class=bert, enc_class=cnn | cls | densenet-cnn

  • train
* n_ctx size should be less than 512

* enc_class=cnn

$ python preprocess.py --config=configs/config-bert-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bert-base-uncased
$ python train.py --config=configs/config-bert-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bert-base-uncased --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=3 --batch_size=64

* enc_class=cls

$ python preprocess.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bert-base-uncased
$ python train.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bert-base-uncased --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=3 --batch_size=64

* enc_class=densenet-cnn
$ python preprocess.py --config=configs/config-bert-densenet-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bert-base-uncased
$ python train.py --config=configs/config-bert-densenet-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bert-base-uncased --bert_output_dir=bert-checkpoint --lr=0.00044768670992748386 --epoch=10 --batch_size=32 --seed=33 --bert_remove_layers=6,7,8,9,10,11 --bert_use_feature_based
INFO:__main__:[study.best_params] : {'lr': 0.00044768670992748386, 'batch_size': 32, 'seed': 33, 'epochs': 9}

  • evaluation
* enc_class=cnn

$ python evaluate.py --config=configs/config-bert-cnn.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9204,  1676/ 1821
INFO:__main__:[Elapsed Time] : 25878ms, 14.157692307692308ms on average

** --bert_model_name_or_path=bert-large-uncased
INFO:__main__:[Accuracy] : 0.9308,  1695/ 1821
INFO:__main__:[Elapsed Time] : 52170ms, 28.649093904448105ms on average

** --bert_model_name_or_path=embeddings/pytorch.uncased_L-8_H-512_A-8
INFO:__main__:[Accuracy] : 0.8858,  1613/ 1821
INFO:__main__:[Elapsed Time] : 21791ms, 11.908241758241758ms on average

** --bert_model_name_or_path=embeddings/pytorch.uncased_L-4_H-512_A-8
INFO:__main__:[Accuracy] : 0.8753,  1594/ 1821
INFO:__main__:[Elapsed Time] : 13206ms, 7.201098901098901ms on average

** --bert_model_name_or_path=embeddings/pytorch.uncased_L-4_H-256_A-4
INFO:__main__:[Accuracy] : 0.8336,  1518/ 1821
INFO:__main__:[Elapsed Time] : 13021ms, 7.098351648351648ms on average

** --bert_model_name_or_path=embeddings/pytorch.uncased_L-2_H-128_A-2
INFO:__main__:[Accuracy] : 0.7908,  1440/ 1821
INFO:__main__:[Elapsed Time] : 8951ms, 4.86043956043956ms on average

** --bert_model_name_or_path=embeddings/TinyBERT_General_4L_312D
INFO:__main__:[Accuracy] : 0.8814,  1605/ 1821
INFO:__main__:[Elapsed Time] : 12672.252178192139ms, 6.906037802224631ms on average

** --configs/config-distilbert-cnn.json --bert_model_name_or_path=embeddings/distilbert-base-uncased
INFO:__main__:[Accuracy] : 0.8990,  1637/ 1821
INFO:__main__:[Elapsed Time] : 18193ms, 9.936263736263736ms on average

** --configs/config-bert-cnn.json --bert_model_name_or_path=embeddings/MiniLM-L12-H384-uncased
INFO:__main__:[Accuracy] : 0.9149,  1666/ 1821
INFO:__main__:[Elapsed Time] : 24706.911087036133ms, 13.525558042002249ms on average

** --bert_model_name_or_path=./embeddings/squeezebert-uncased --epoch=20 --warmup_epoch=0 --weight_decay=0.0
INFO:__main__:[Accuracy] : 0.9061,  1650/ 1821
INFO:__main__:[Elapsed Time] : 35230.47494888306ms, 19.287928644117418ms on average

** for using SpanBERT embedding, just replace pretrained BERT model to SpanBERT.
** --bert_model_name_or_path=embeddings/spanbert_hf_large
INFO:__main__:[Accuracy] : 0.9390,  1710/ 1821
INFO:__main__:[Elapsed Time] : 49042ms, 26.860989010989012ms on average

** --bert_model_name_or_path=embeddings/spanbert_hf_base
INFO:__main__:[Accuracy] : 0.9182,  1672/ 1821
INFO:__main__:[Elapsed Time] : 27796ms, 15.20989010989011ms on average

** --bert_remove_layers=8,9,10,11
INFO:__main__:[Accuracy] : 0.9055,  1649/ 1821
INFO:__main__:[Elapsed Time] : 19541ms, 10.682417582417582ms on average

** --bert_model_name_or_path=bert-large-uncased --bert_remove_layers=12,13,14,15,16,17,18,19,20,21,22,23 
INFO:__main__:[Accuracy] : 0.8847,  1611/ 1821
INFO:__main__:[Elapsed Time] : 27017ms, 14.781318681318682ms on average

* enc_class=cls

$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9242,  1683/ 1821
INFO:__main__:[Elapsed Time] : 23314ms, 12.754945054945056ms on average

** n_ctx=64
INFO:__main__:[Accuracy] : 0.9259,  1686/ 1821
INFO:__main__:[Elapsed Time] : 23765.23184776306ms, 13.007715246179602ms on average

** n_ctx=64, --lr=2e-5 --epoch=3 --batch_size=64  --warmup_epoch=1 --weight_decay=0.0 --seed=0
INFO:__main__:[Accuracy] : 0.9281,  1690/ 1821
INFO:__main__:[Elapsed Time] : 21707.942724227905ms, 11.878120244204343ms on average

** --bert_model_name_or_path=bert-large-uncased --lr=2e-5
INFO:__main__:[Accuracy] : 0.9412,  1714/ 1821
INFO:__main__:[Elapsed Time] : 40847.62740135193ms, 22.37672412788475ms on average

** --bert_model_name_or_path=embeddings/pytorch.uncased_L-8_H-512_A-8
INFO:__main__:[Accuracy] : 0.8924,  1625/ 1821
INFO:__main__:[Elapsed Time] : 17558ms, 9.585714285714285ms on average

** --bert_model_name_or_path=embeddings/pytorch.uncased_L-4_H-512_A-8
INFO:__main__:[Accuracy] : 0.8825,  1607/ 1821
INFO:__main__:[Elapsed Time] : 10948.646068572998ms, 5.960442338671003ms on average

** --bert_model_name_or_path=embeddings/pytorch.uncased_L-4_H-256_A-4
INFO:__main__:[Accuracy] : 0.8369,  1524/ 1821
INFO:__main__:[Elapsed Time] : 10196ms, 5.552197802197802ms on average

** --bert_model_name_or_path=embeddings/pytorch.uncased_L-2_H-128_A-2
INFO:__main__:[Accuracy] : 0.8083,  1472/ 1821
INFO:__main__:[Elapsed Time] : 7124ms, 3.8461538461538463ms on average

** --bert_model_name_or_path=embeddings/TinyBERT_General_4L_312D
INFO:__main__:[Accuracy] : 0.8819,  1606/ 1821
INFO:__main__:[Elapsed Time] : 10878.073692321777ms, 5.92467928980733ms on average

** --bert_model_name_or_path=embeddings/TinyBERT_General_4L_312D --epoch=30
INFO:__main__:[Accuracy] : 0.8984,  1636/ 1821
INFO:__main__:[Elapsed Time] : 10882.532119750977ms, 5.9194347360631925ms on average

** --configs/config-distilbert-cls.json --bert_model_name_or_path=embeddings/distilbert-base-uncased
INFO:__main__:[Accuracy] : 0.9110,  1659/ 1821
INFO:__main__:[Elapsed Time] : 16431ms, 8.971978021978021ms on average

** --configs/config-distilbert-cls.json --bert_model_name_or_path=embeddings/distilbert-base-uncased --epoch=30
INFO:__main__:[Accuracy] : 0.9204,  1676/ 1821
INFO:__main__:[Elapsed Time] : 12806.593418121338ms, 6.979021790263417ms on average

** --configs/config-distilbert-cls.json --bert_model_name_or_path=embeddings/distilbert-base-multilingual-cased --epoch=30
INFO:__main__:[Accuracy] : 0.8710,  1586/ 1821
INFO:__main__:[Elapsed Time] : 14596.29511833191ms, 7.975767077980461ms on average

** --configs/config-bert-cls.json --bert_model_name_or_path=embeddings/MiniLM-L12-H384-uncased
INFO:__main__:[Accuracy] : 0.9121,  1661/ 1821
INFO:__main__:[Elapsed Time] : 22304.69584465027ms, 12.206664845183655ms on average

** --configs/config-bert-cls.json --bert_model_name_or_path=embeddings/MiniLM-L12-H384-uncased --epoch=30
INFO:__main__:[Accuracy] : 0.9325,  1698/ 1821
INFO:__main__:[Elapsed Time] : 21175.854682922363ms, 11.593997216486668ms on average

** --configs/config-bert-cls.json --bert_model_name_or_path=embeddings/mobilebert-uncased
INFO:__main__:[Accuracy] : 0.9105,  1658/ 1821
INFO:__main__:[Elapsed Time] : 100408.40005874634ms, 55.08984157017299ms on average

** --bert_model_name_or_path=./embeddings/squeezebert-uncased --epoch=20  --warmup_epoch=0 --weight_decay=0.0
INFO:__main__:[Accuracy] : 0.9012,  1641/ 1821
INFO:__main__:[Elapsed Time] : 31975.939989089966ms, 17.499822050660523ms on average

** for using SpanBERT embedding, just replace pretrained BERT model to SpanBERT.
** --bert_model_name_or_path=embeddings/spanbert_hf_large
INFO:__main__:[Accuracy] : 0.9396,  1711/ 1821
INFO:__main__:[Elapsed Time] : 47570ms, 26.044505494505493ms on average

** --bert_model_name_or_path=embeddings/spanbert_hf_base
INFO:__main__:[Accuracy] : 0.9149,  1666/ 1821
INFO:__main__:[Elapsed Time] : 24049ms, 13.151648351648351ms on average

** --bert_remove_layers=8,9,10,11
INFO:__main__:[Accuracy] : 0.9149,  1666/ 1821
INFO:__main__:[Elapsed Time] : 16082ms, 8.774725274725276ms on average

** --bert_remove_layers=6,7,8,9,10,11
INFO:__main__:[Accuracy] : 0.9023,  1643/ 1821
INFO:__main__:[Elapsed Time] : 12865ms, 7.024175824175824ms on average

** --bert_remove_layers=4,5,6,7,8,9,10,11
INFO:__main__:[Accuracy] : 0.8666,  1578/ 1821
INFO:__main__:[Elapsed Time] : 10800ms, 5.886813186813187ms on average

** --bert_model_name_or_path=bert-large-uncased --bert_remove_layers=12,13,14,15,16,17,18,19,20,21,22,23 
INFO:__main__:[Accuracy] : 0.8671,  1579/ 1821
INFO:__main__:[Elapsed Time] : 22229ms, 12.156043956043955ms on average

** fine-tune bert-base-uncased using amazon_us_reviews data(https://huggingface.co/nlp/viewer/?dataset=amazon_us_reviewss&config=Video_v1_00). and then apply to sst2 data.
$ cd etc
$ python download_datasets.py --dataset_name=amazon_us_reviews --task_name=Video_v1_00 --split=train > ../data/amazon_us_reviews_Video_v1_00/total.txt
$ cd ../data/amazon_us_reviews_Video_v1_00
$ python ../etc/split.py --data_path=total.txt --base_path=data

# we have 'data.train', 'data.valid', 'data.test'.
$ python preprocess.py --config=configs/config-bert-cls.json --data_dir=data/amazon_us_reviews_Video_v1_00 --bert_model_name_or_path=./embeddings/bert-base-uncased
$ python train.py --config=configs/config-bert-cls.json --data_dir=data/amazon_us_reviews_Video_v1_00 --bert_model_name_or_path=./embeddings/bert-base-uncased --bert_output_dir=bert-checkpoint-amazon --lr=1e-5 --epoch=3 --batch_size=64 
$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/amazon_us_reviews_Video_v1_00 --bert_output_dir=bert-checkpoint-amazon
INFO:__main__:[Accuracy] : 0.9659,  8917/ 9232
INFO:__main__:[Elapsed Time] : 131532.05513954163ms, 14.232453539549446ms on average

# apply `bert-checkpoint-amazon` to sst2 data
$ python preprocess.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./bert-checkpoint-amazon
$ python train.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./bert-checkpoint-amazon --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=3 --batch_size=64
$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9336,  1700/ 1821
INFO:__main__:[Elapsed Time] : 28667.39320755005ms, 15.675509106981885ms on average

** fine-tune bert-large-uncased using amazon_us_reviews
$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/amazon_us_reviews_Video_v1_00 --bert_output_dir=bert-checkpoint-amazon
INFO:__main__:[Accuracy] : 0.9719,  8973/ 9232
INFO:__main__:[Elapsed Time] : 232065.03987312317ms, 25.12068054216575ms on average

$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9264,  1687/ 1821
INFO:__main__:[Elapsed Time] : 46093.640089035034ms, 25.243094727233217ms on average
*** --epoch=10  --warmup_epoch=0 --weight_decay=0.0
INFO:__main__:[Accuracy] : 0.9357,  1704/ 1821
INFO:__main__:[Elapsed Time] : 49913.90776634216ms, 27.320906356140807ms on average

** augment train.txt
$ python augment_data.py --input data/sst2/train.txt --output data/sst2/augmented.raw --lower --parallel --preserve_label --n_iter=5 --max_ng=3
$ cp -rf data/sst2/augmented.raw data/sst2/augmented.txt
$ python preprocess.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bert-base-uncased --augmented --augmented_filename=augmented.txt
$ python train.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bert-base-uncased --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=3 --batch_size=64 --augmented
$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9325,  1698/ 1821
INFO:__main__:[Elapsed Time] : 26077.781438827515ms, 14.253564195318537ms on average

** --bert_use_finetune_last
$ python train.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bert-base-uncased --bert_output_dir=bert-checkpoint --lr=1e-4 --epoch=10 --batch_size=64 --bert_use_finetune_last
INFO:__main__:[Accuracy] : 0.9044,  1647/ 1821
INFO:__main__:[Elapsed Time] : 32294.266939163208ms, 17.677827719803695ms on average

** --bert_use_finetune_last --bert_remove_layers=6,7,8,9,10,11
INFO:__main__:[Accuracy] : 0.8704,  1585/ 1821
INFO:__main__:[Elapsed Time] : 17087.70990371704ms, 9.33271423800961ms on average

** --bert_use_finetune_last --bert_remove_layers=3,4,5,6,7,8,9,10,11
INFO:__main__:[Accuracy] : 0.8550,  1557/ 1821
INFO:__main__:[Elapsed Time] : 11510.003805160522ms, 6.272209083640968ms on average

* enc_class=densenet-cnn

$ python evaluate.py --config=configs/config-bert-densenet-cnn.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9006,  1640/ 1821
INFO:__main__:[Elapsed Time] : 23979.56347465515ms, 13.114127745995155ms on average

** --bert_use_finetune_last , without --bert_use_feature_based
INFO:__main__:[study.best_params] : {'lr': 0.00016342288028238142, 'batch_size': 64, 'seed': 42, 'epochs': 19}
$ python train.py --config=configs/config-bert-densenet-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bert-base-uncased --bert_output_dir=bert-checkpoint --lr=0.00016342288028238142 --epoch=20 --batch_size=64 --seed=42 --bert_remove_layers=6,7,8,9,10,11 --bert_use_finetune_last
INFO:__main__:[Accuracy] : 0.9088,  1655/ 1821
INFO:__main__:[Elapsed Time] : 24170.567512512207ms, 13.21952631185343ms on average

emb_class=albert, enc_class=cnn | cls

  • train
* share config-bert-*.json
* n_ctx size should be less than 512

* enc_class=cnn

$ python preprocess.py --config=configs/config-bert-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/albert-base-v2
$ python train.py --config=configs/config-bert-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/albert-base-v2 --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=64
  • evaluation
* enc_class=cnn

$ python evaluate.py --config=configs/config-bert-cnn.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint 

INFO:__main__:[Accuracy] : 0.9204,  1676/ 1821
INFO:__main__:[Elapsed Time] : 29321ms, 16.055494505494504ms on average

** --bert_model_name_or_path=./embeddings/albert-xxlarge-v2 --batch_size=32
INFO:__main__:[Accuracy] : 0.9577,  1744/ 1821
INFO:__main__:[Elapsed Time] : 104769ms, 57.463186813186816ms on average

* enc_class=cls

$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9001,  1639/ 1821
INFO:__main__:[Elapsed Time] : 26819ms, 14.672527472527472ms on average

** --bert_model_name_or_path=./embeddings/albert-xxlarge-v2 --batch_size=32
INFO:__main__:[Accuracy] : 0.9445,  1720/ 1821
INFO:__main__:[Elapsed Time] : 94456ms, 51.80274725274725ms on average

emb_class=roberta, enc_class=cnn | cls

  • train
* n_ctx size should be less than 512

* enc_class=cnn

$ python preprocess.py --config=configs/config-roberta-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/roberta-large
$ python train.py --config=configs/config-roberta-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/roberta-large --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=64

* enc_class=cls

$ python preprocess.py --config=configs/config-roberta-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/roberta-large 
$ python train.py --config=configs/config-roberta-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/roberta-large --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=64
  • evaluation
* enc_class=cnn

$ python evaluate.py --config=configs/config-roberta-cnn.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9555,  1740/ 1821
INFO:__main__:[Elapsed Time] : 49297ms, 26.98076923076923ms on average

** --bert_model_name_or_path=./embeddings/roberta-base
INFO:__main__:[Accuracy] : 0.9292,  1692/ 1821
INFO:__main__:[Elapsed Time] : 27615ms, 15.101648351648352ms on average

** --bert_model_name_or_path=./embeddings/roberta-base --bert_remove_layers=8,9,10,11
INFO:__main__:[Accuracy] : 0.9226,  1680/ 1821
INFO:__main__:[Elapsed Time] : 21127ms, 11.524175824175824ms on average

* enc_class=cls

$ python evaluate.py --config=configs/config-roberta-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9566,  1742/ 1821
INFO:__main__:[Elapsed Time] : 43363ms, 23.73956043956044ms on average

** --bert_model_name_or_path=./embeddings/roberta-base
INFO:__main__:[Accuracy] : 0.9303,  1694/ 1821
INFO:__main__:[Elapsed Time] : 26822ms, 14.673626373626373ms on average

** --bert_model_name_or_path=./embeddings/roberta-base --bert_remove_layers=8,9,10,11
INFO:__main__:[Accuracy] : 0.9176,  1671/ 1821
INFO:__main__:[Elapsed Time] : 18344ms, 10.02967032967033ms on average

** --bert_model_name_or_path=./embeddings/xlm-roberta-base
INFO:__main__:[Accuracy] : 0.9149,  1666/ 1821
INFO:__main__:[Elapsed Time] : 25997.645378112793ms, 14.224677033476777ms on average

** --bert_model_name_or_path=./embeddings/xlm-roberta-large
INFO:__main__:[Accuracy] : 0.9204,  1676/ 1821
INFO:__main__:[Elapsed Time] : 45271.38304710388ms, 24.804521654988385ms on average

emb_class=bart, enc_class=cnn | cls

  • train
* n_ctx size should be less than 512

* enc_class=cnn

$ python preprocess.py --config=configs/config-bart-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bart-large
$ python train.py --config=configs/config-bart-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bart-large --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=64

* enc_class=cls

$ python preprocess.py --config=configs/config-bart-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bart-large 
$ python train.py --config=configs/config-bart-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bart-large --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=64
  • evaluation
* enc_class=cnn

$ python evaluate.py --config=configs/config-bart-cnn.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9445,  1720/ 1821
INFO:__main__:[Elapsed Time] : 64224ms, 35.17087912087912ms on average

* enc_class=cls

$ python evaluate.py --config=configs/config-bart-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9489,  1728/ 1821
INFO:__main__:[Elapsed Time] : 61015ms, 33.386263736263736ms on average

emb_class=electra, enc_class=cnn | cls

  • train
* share config-bert-*.json
* n_ctx size should be less than 512

* enc_class=cnn

$ python preprocess.py --config=configs/config-bert-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/electra-base-discriminator
$ python train.py --config=configs/config-bert-cnn.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/electra-base-discriminator --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=64

* enc_class=cls

$ python preprocess.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/electra-base-discriminator
$ python train.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/electra-base-discriminator --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=64
  • evaluation
* enc_class=cnn

$ python evaluate.py --config=configs/config-bert-cnn.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint 

INFO:__main__:[Accuracy] : 0.9539,  1737/ 1821
INFO:__main__:[Elapsed Time] : 29602ms, 14.98021978021978ms on average

** --bert_model_name_or_path=./embeddings/electra-large-discriminator --lr=1e-6
INFO:__main__:[Accuracy] : 0.9566,  1742/ 1821
INFO:__main__:[Elapsed Time] : 54157ms, 28.356593406593408ms on average

** --bert_model_name_or_path=./embeddings/electra-large-discriminator --lr=1e-6 --epoch=15
INFO:__main__:[Accuracy] : 0.9605,  1749/ 1821
INFO:__main__:[Elapsed Time] : 52163ms, 27.286813186813188ms on average

* enc_lass=cls

$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9522,  1734/ 1821
INFO:__main__:[Elapsed Time] : 25956ms, 14.008791208791209ms on average

** --bert_model_name_or_path=./embeddings/electra-large-discriminator --lr=1e-6 --epoch=15
INFO:__main__:[Accuracy] : 0.9643,  1756/ 1821
INFO:__main__:[Elapsed Time] : 47163ms, 25.685714285714287ms on average

emb_class=deberta, enc_class=cnn | cls

  • train
* share config-bert-*.json
* n_ctx size should be less than 512

* enc_class=cls

$ python preprocess.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/deberta-base
$ python train.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/deberta-base --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=64
  • evaluation

* enc_lass=cls

$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9341,  1701/ 1821
INFO:__main__:[Elapsed Time] : 47751.57833099365ms, 26.153329571524818ms on average

** --bert_model_name_or_path=./embeddings/deberta-large --batch_size=32 --gradient_accumulation_steps=2
INFO:__main__:[Accuracy] : 0.9495,  1729/ 1821
INFO:__main__:[Elapsed Time] : 113818.21751594543ms, 62.427293075310004ms on average

** --bert_model_name_or_path=./embeddings/deberta-v2-xlarge --batch_size=16 --gradient_accumulation_steps=4
INFO:__main__:[Accuracy] : 0.9621,  1752/ 1821
INFO:__main__:[Elapsed Time] : 98728.88159751892ms, 54.138861121712154ms on average

emb_class=bort, enc_class=cnn | cls

  • train
* share config-bert-*.json
* n_ctx size should be less than 512

* enc_class=cls

$ python preprocess.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bort

$ python train.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/bort --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=64

  • evaluation

* enc_lass=cls

$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.7798,  1420/ 1821
INFO:__main__:[Elapsed Time] : 11250.777959823608ms, 6.134679160275302ms on average

emb_class=convbert, enc_class=cnn | cls

  • train
* share config-bert-*.json
* n_ctx size should be less than 512

* enc_class=cls

$ python preprocess.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/conv-bert-medium-small

$ python train.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/conv-bert-medium-small --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=64 

  • evaluation

* enc_lass=cls

$ python evaluate.py --config=configs/config-bert-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.7748,  1411/ 1821
INFO:__main__:[Elapsed Time] : 41405.57098388672ms, 22.681565337128692ms on average

emb_class=gpt*, enc_class=cnn | cls

  • train
* n_ctx size should be less than 512

* enc_class=cls

$ python preprocess.py --config=configs/config-gpt-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/gpt2-large
$ python train.py --config=configs/config-gpt-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/gpt2-large --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=16 --gradient_accumulation_steps=2

** accelerate launch, deepspeed

$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 1
How many gradient accumulation steps you're passing in your script? [1]: 1
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config.yaml
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-gpt-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/gpt2-large --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=8

# deepspeed related config
https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils.py#L595
self.deepspeed_config = {
  "train_batch_size": None,
  "gradient_accumulation_steps": self.gradient_accumulation_steps,
  "zero_optimization": {
      "stage": self.zero_stage,
      "offload_optimizer": {
          "device": self.offload_optimizer_device,
      },
  },
  "steps_per_print": float("inf"),  # this will stop deepspeed from logging @ stdout
  "zero_allow_untested_optimizer": True,
}
e.g.
{
    "train_batch_size": 256,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "None"
        }
    },
    "steps_per_print": inf,
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": false
    }
}

## sometimes, apex is unstable for unknown reasons. use torch.cuda.amp instead.
$ vi /opt/conda/lib/python3.7/site-packages/accelerate/deepspeed_utils.py
if is_apex_available():
    #from apex import amp
    import torch.cuda.amp as amp

## debugging
$ vi /opt/conda/lib/python3.7/site-packages/accelerate/utils.py
  self.deepspeed_config = {
      ...
      "steps_per_print": 1, # for debugging
      ...
  }

## manual modification for learning rate and warmup
$ vi /opt/conda/lib/python3.7/site-packages/accelerate/utils.py
  self.deepspeed_config = {
     ...
     "zero_optimization": {
         "offload_optimizer": {
             "device": self.offload_optimizer_device,
             "pin_memory": True,
         },
         "offload_param": {
             "device": self.offload_optimizer_device,
              "pin_memory": True,
         },
     },
     "optimizer": {
         "type": "AdamW",
         "params": {
             "lr": 1e-5,
             "betas": [0.8, 0.999],
             "eps": 1e-8,
             "weight_decay": 3e-7
         },
     },
     "scheduler": {
       "type": "WarmupLR",
       "params": {
         "warmup_min_lr": 1e-7,
         "warmup_max_lr": 0.001,
         "warmup_num_steps": 100
       }
     },
     ...
  }

## Error
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.
=> this is due to mixed precision learning.
   you should set `torch.autograd.set_detect_anomaly(False)` to skip 'nan values'.
   "[deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0"

** accelerate launch, deepspeed & gpt2-xl
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 1
How many gradient accumulation steps you're passing in your script? [1]: 1
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config.yaml
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-gpt-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/gpt2-xl --bert_output_dir=bert-checkpoint --lr=1e-6 --epoch=10 --batch_size=8

** accelerate launch, deepspeed & gpt-neo-2.7B
$ python preprocess.py --config=configs/config-gpt_neo-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/gpt-neo-2.7B
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 1
How many gradient accumulation steps you're passing in your script? [1]: 8
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config.yaml
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-gpt_neo-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/gpt-neo-2.7B --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=5 --batch_size=4 --eval_batch_size=8 --gradient_accumulation_steps=8 --measure=accuracy
# GPU memory footprint: 29614MiB / 32510MiB foreach 4 GPUs

** accelerate launch, deepspeed & gpt-j-6B
$ python preprocess.py --config=configs/config-gptj-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/gpt-j-6B
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 1
How many gradient accumulation steps you're passing in your script? [1]: 4
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config.yaml
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-gptj-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/gpt-j-6B --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=5 --batch_size=4 --eval_batch_size=8 --gradient_accumulation_steps=4
# GPU memory footprint: 31544MiB / 32510MiB foreach 4 GPUs
  • evaluation

* enc_lass=cls

$ python evaluate.py --config=configs/config-gpt-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9445,  1720/ 1821
INFO:__main__:[Elapsed Time] : 66759.70959663391ms, 36.57794352416154ms on average

** accelerate launch, deepspeed
INFO:__main__:[Accuracy] : 0.9281,  1690/ 1821
INFO:__main__:[Elapsed Time] : 77169.36945915222ms, 42.27914404083084ms on average

** accelerate launch, deepspeed & gpt2-xl
INFO:__main__:[Accuracy] : 0.9396,  1711/ 1821
INFO:__main__:[Elapsed Time] : 89949.15795326233ms, 49.22410970205789ms on average

** accelerate launch, deepspeed & gpt-neo-2.7B
$ python evaluate.py --config=configs/config-gpt_neo-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9209,  1677/ 1821
INFO:__main__:[Elapsed Time] : 119734.6019744873ms, 65.36855357033866ms on average
# GPU memory footprint: 11776MiB / 32510MiB 

*** --use_fp16
$ python evaluate.py --config=configs/config-gpt_neo-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint --use_fp16
INFO:__main__:[Accuracy] : 0.9231,  1681/ 1821
INFO:__main__:[Elapsed Time] : 66532.00912475586ms, 36.105518550663206ms on average

*** --enable_parallelformers --use_fp16 
$ pip install parallelformers
$ python evaluate.py --config=configs/config-gpt_neo-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint --use_fp16 --enable_parallelformers --num_gpus=2
# GPU memory footprint : 12180MiB / 32510MiB , 5493MiB / 32510MiB

** accelerate launch, deepspeed & gpt-j-6B
$ python evaluate.py --config=configs/config-gptj-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9336,  1700/ 1821
INFO:__main__:[Elapsed Time] : 340499.93801116943ms, 186.51898129955754ms on average
# GPU memory footprint : 23956MiB / 32510MiB

*** --use_fp16
INFO:__main__:[Accuracy] : 0.9336,  1700/ 1821
INFO:__main__:[Elapsed Time] : 144453.00483703613ms, 78.87713149353698ms on average
# GPU memory footprint : 12730MiB / 32510MiB

*** --enable_parallelformers --use_fp16
$ python evaluate.py --config=configs/config-gptj-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint --use_fp16 --enable_parallelformers --num_gpus=2  
AssertionError: GPTJ is not supported yet

emb_class=t5, enc_class=cnn | cls

  • train
* n_ctx size should be less than 512

* enc_class=cls

$ python preprocess.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-large
$ python train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-large --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=32 --gradient_accumulation_steps=2 

** accelerate launch, deepspeed
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 1
How many gradient accumulation steps you're passing in your script? [1]: 2
How many processes in total will you use? [1]: 2
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config.yaml
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-large --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=32 --gradient_accumulation_steps=2
# GPU memory footprint: 9476MiB / 32510MiB foreach 2 GPUs
# for using torch.distributed.launch
# edit distributed_type in accelerate_config.yaml
  https://github.com/huggingface/accelerate/blob/120b82bfcec998c3545615579889cc7b360bc315/src/accelerate/state.py#L69
  distributed_type: MULTI_GPU
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-large --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=16 --gradient_accumulation_steps=4

** accelerate launch, deepspeed & t5-3b
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 1
How many gradient accumulation steps you're passing in your script? [1]: 2
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config.yaml
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-3b --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=32 --gradient_accumulation_steps=2
# GPU memory footprint: 21922MiB / 32480MiB foreach 4 GPUs

** accelerate launch, deepspeed stage 2 & t5-3b & full precision
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 2
How many gradient accumulation steps you're passing in your script? [1]: 4
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: NO
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config.yaml
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-3b --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=16 --gradient_accumulation_steps=4
# GPU memory footprint: 25826MiB / 32510MiB foreach 4 GPUs

=> unstable

** accelerate launch, deepspeed & t5-3b & full precision
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 1
How many gradient accumulation steps you're passing in your script? [1]: 4
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: NO
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config.yaml
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-3b --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=16 --gradient_accumulation_steps=4
# GPU memory footprint: 29182MiB / 32510MiB foreach 4 GPUs

** accelerate launch & t5-11b
# rank 0
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: 0
What is the IP address of the machine that will host the main process? 10.55.12.203
What is the port you will use to communicate with the main process? 6199
Do you want to use DeepSpeed? [yes/NO]: NO
How many processes in total will you use? [1]: 12
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config_rank0.yaml
$ export NCCL_DEBUG=INFO
$ accelerate launch --config_file accelerate_config_rank0.yaml train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-11b --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=16 --gradient_accumulation_steps=2 

# rank 1
$ accelerate config
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config_rank1.yaml
$ export NCCL_DEBUG=INFO
$ accelerate launch --config_file accelerate_config_rank1.yaml train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-11b --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=16 --gradient_accumulation_steps=2

# rank 2
$ accelerate config
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config_rank1.yaml
$ export NCCL_DEBUG=INFO
$ accelerate launch --config_file accelerate_config_rank2.yaml train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-11b --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=16 --gradient_accumulation_steps=2

=> out of memory!!

** accelerate lauch, deepspeed stage 2 & t5-11b
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 2
Where to offload optimizer states? [NONE/cpu/nvme]: NONE
How many gradient accumulation steps you're passing in your script? [1]: 8
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config.yaml
$ export NCCL_DEBUG=INFO
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-11b --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=8 --eval_batch_size=16 --gradient_accumulation_steps=8
# GPU memory footprint: 31916MiB / 32510MiB foreach 4 GPUs 

=> unstable

** accelerate lauch, deepspeed & t5-11b
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 1
How many gradient accumulation steps you're passing in your script? [1]: 8
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
$ cp ~/.cache/huggingface/accelerate/default_config.yaml accelerate_config.yaml
$ export NCCL_DEBUG=INFO
$ accelerate launch --config_file accelerate_config.yaml train.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/t5-11b --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=4 --eval_batch_size=8 --gradient_accumulation_steps=8
# GPU memory footprint: 28666MiB / 32510MiB foreach 4 GPUs
  • evaluation

* enc_lass=cls

$ python evaluate.py --config=configs/config-t5-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint

INFO:__main__:[Accuracy] : 0.9539,  1737/ 1821
INFO:__main__:[Elapsed Time] : 53615.12064933777ms, 29.372493120340202ms on average

** accelerate launch, deepspeed
INFO:__main__:[Accuracy] : 0.9555,  1740/ 1821
INFO:__main__:[Elapsed Time] : 55336.105823516846ms, 30.323271175007243ms on average

** accelerate launch, deepspeed & t5-3b
INFO:__main__:[Accuracy] : 0.9599,  1748/ 1821
INFO:__main__:[Elapsed Time] : 63750.892639160156ms, 34.89987116593581ms on average

** accelerate launch, deepspeed & t5-3b & full precision 
INFO:__main__:[Accuracy] : 0.9643,  1756/ 1821
INFO:__main__:[Elapsed Time] : 61783.94317626953ms, 33.86116106431563ms on average

** accelerate launch, deepspeed stage 2 & t5-11b
# GPU memory footprint: 10772MiB / 32510MiB
INFO:__main__:[Accuracy] : 0.5365,   977/ 1821
INFO:__main__:[Elapsed Time] : 210478.55591773987ms, 115.39863138408451ms on average

** accelerate launch, deepspeed & t5-11b
INFO:__main__:[Accuracy] : 0.9561,  1741/ 1821
INFO:__main__:[Elapsed Time] : 207480.0992012024ms, 113.8510259953174ms on average

emb_class=megatronbert | gpt, enc_class=cnn | cls

  • train
* n_ctx size should be less than 512

* enc_class=cls

$ python preprocess.py --config=configs/config-megatronbert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/megatron-bert-cased-345m
$ python train.py --config=configs/config-megatronbert-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/megatron-bert-cased-345m --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=32

** megatron gpt-2
$ python preprocess.py --config=configs/config-gpt-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/megatron-gpt2-345m
$ python train.py --config=configs/config-gpt-cls.json --data_dir=data/sst2 --bert_model_name_or_path=./embeddings/megatron-gpt2-345m --bert_output_dir=bert-checkpoint --lr=1e-5 --epoch=10 --batch_size=32
  • evaluation
* enc_class=cls

$ python evaluate.py --config=configs/config-megatronbert-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9621,  1752/ 1821
INFO:__main__:[Elapsed Time] : 56964.438915252686ms, 31.2049177976755ms on average


** megatron gpt-2
$ python evaluate.py --config=configs/config-gpt-cls.json --data_dir=data/sst2 --bert_output_dir=bert-checkpoint
INFO:__main__:[Accuracy] : 0.9281,  1690/ 1821
INFO:__main__:[Elapsed Time] : 50888.68188858032ms, 27.869893299354302ms on average


Experiments for Korean


Optimization

  • OPTIMIZATION.md
    • Quantization
      • Dynamic
      • Quantization Aware Training
    • ONNX/ONNXRUNTIME, ONNX Quantization
    • Hyper-arameter Search

Distillation

  • DISTILLATION.md
    • Augmentation
    • Knowledge Distillation
      • ex) BERT-large, RoBERTa-large, ELECTRA-large -> augmentation/distillation -> GloVe, DenseNet-CNN, DenseNet-DSA

Fastformers

  • FASTFORMERS.md
    • Knowledge Distillation
      • Meta Pseudo Labels
    • Structured Prunning
    • ONNX Quantization

Sentence Pair Classification


Intermediate fine-tuning and self-training

  • STraTA.md
    • Few-Shot Learning
      • NSMC (Korean Sentiment Analysis)
    • Self-Training
      • NSMC

Citation

@misc{iclassifier,
  author = {dsindex},
  title = {iclassifier},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/dsindex/iclassifier}},
}