The pipeline currently uses a financial news dataset, but it can also be repurposed for other text classification datasets such as IMDB.
```
├── datasets/
│   ├── financial_news.csv
│   ├── financial_news_train.csv
│   └── financial_news_test.csv
├── base_model_testing.py
├── calsa.py
├── model_training.py
├── random_sampling.py
├── text_augmentation.py
└── financial_news_preprep.py
```
- Python 3.8+
- PyTorch
- Transformers
- pandas
- numpy
- scikit-learn
- nlpaug
- datasets
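If the repository does not ship a requirements file, the dependencies can be installed directly with pip:
pip install torch transformers pandas numpy scikit-learn nlpaug datasets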
- Place your financial news dataset at datasets/financial_news.csv
- Run the preprocessing script:
python financial_news_preprep.py
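As a rough sketch, the preprocessing step presumably produces the train/test CSVs listed in the project structure. The column names (`text`, `label`) and the 80/20 split below are assumptions, not necessarily what the script does:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed schema: a text column and a label column; adjust to the real CSV.
df = pd.read_csv("datasets/financial_news.csv")

# An 80/20 split is assumed; stratify on the label to preserve class balance.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

train_df.to_csv("datasets/financial_news_train.csv", index=False)
test_df.to_csv("datasets/financial_news_test.csv", index=False)
```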
Test the performance of the pre-trained DistilBERT model:
python base_model_testing.py
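A minimal sketch of such a test, assuming the test CSV has `text` and `label` columns and that the labels use the same POSITIVE/NEGATIVE strings as the SST-2 model (map them otherwise):

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from transformers import pipeline

clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

test_df = pd.read_csv("datasets/financial_news_test.csv")

# truncation=True guards against inputs longer than the 512-token limit.
preds = [p["label"] for p in clf(test_df["text"].tolist(), truncation=True)]
print("Accuracy:", accuracy_score(test_df["label"], preds))
```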
Run experiments with different sample sizes (100, 300, 500):
python random_sampling.py
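The selection step itself reduces to drawing a fixed-size random subset from the training split; a sketch with an assumed seed and hypothetical output paths:

```python
import pandas as pd

train_df = pd.read_csv("datasets/financial_news_train.csv")

# Draw the three subset sizes used in the experiments.
# The output filenames here are placeholders, not the script's actual paths.
for n in (100, 300, 500):
    subset = train_df.sample(n=n, random_state=42)
    subset.to_csv(f"datasets/random_sample_{n}.csv", index=False)
```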
Run the CALSA pipeline with text augmentation:
python calsa.py
Train models using selected samples:
python model_training.py
- Base Model: DistilBERT (distilbert-base-uncased-finetuned-sst-2-english)
- Batch Size: 8
- Number of Epochs: 3
- Learning Rate: Hugging Face Trainer default (5e-5)
- Max Sequence Length: 512
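A sketch of a Hugging Face Trainer setup matching the configuration above; the CSV schema, the two-class label assumption, and the `tokenize` helper are illustrative assumptions, not the repository's actual code:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Assumes two classes; for a different label set, pass num_labels
# and ignore_mismatched_sizes=True.
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Assumed: CSVs with "text" and integer "label" columns.
data = load_dataset(
    "csv",
    data_files={
        "train": "datasets/financial_news_train.csv",
        "test": "datasets/financial_news_test.csv",
    },
)

def tokenize(batch):
    # Max sequence length of 512, as in the configuration above.
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="fine_tuned_models_random_sampling_financial_news",
    per_device_train_batch_size=8,  # batch size 8
    num_train_epochs=3,             # 3 epochs
    # learning_rate is left at the Trainer default (5e-5)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```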
The text augmentation pipeline includes:
- Synonym replacement (WordNet)
- Back-translation (French, German, Spanish)
- Random word insertion/deletion
- Sentence shuffling
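A hedged sketch of how the first augmenters could be built with nlpaug; the back-translation model pair is an assumption (only one of the three language routes is shown), and recent nlpaug versions return a list from augment():

```python
import nlpaug.augmenter.word as naw

text = "The company reported stronger than expected quarterly earnings."

# Synonym replacement via WordNet (requires the NLTK wordnet corpus).
syn_aug = naw.SynonymAug(aug_src="wordnet")
print(syn_aug.augment(text))

# Random word deletion; insertion is typically done with a contextual
# augmenter such as naw.ContextualWordEmbsAug(action="insert").
del_aug = naw.RandomWordAug(action="delete")
print(del_aug.augment(text))

# Back-translation (English -> German -> English); the WMT19 model pair
# is an assumption, and this downloads large translation models.
bt_aug = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)
print(bt_aug.augment(text))
```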
Results are saved in:
- fine_tuned_models_random_sampling_financial_news/: Random sampling results
- results_calsa/: CALSA results
- base_model_test_outputs/: Base model performance
Each experiment generates:
- Trained model checkpoints
- Confusion matrices
- Classification reports
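The confusion matrices and classification reports correspond to scikit-learn's metrics; a small sketch with placeholder predictions (y_true and y_pred would come from evaluating a trained checkpoint on the test split):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder arrays; replace with real labels and model predictions.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```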