Official implementation for the paper LingML: Linguistic-Informed Machine Learning for Enhanced Fake News Detection.

ignasa007/LingML

LingML

This is the official code repository for the paper LingML: Linguistic-Informed Machine Learning for Enhanced Fake News Detection.

Directory Structure

  • config - configuration files for different datasets and LLM models
  • data_classes - Python classes to handle different datasets, and make them suitable for training
  • datasets - raw datasets in csv format
    • aaai-constraint-covid - original dataset by Patwa et al., 2020
    • aaai-constraint-covid-appended - original dataset with appended linguistic features retrieved using LIWC-22
    • aaai-constraint-covid-cleaned - dataset constructed by eliminating records identified by Bee et al., 2023, as being neither true nor false in the context of COVID-19
    • aaai-constraint-covid-cleaned-appended - cleaned dataset with appended linguistic features
  • model_classes - Python classes to handle 11 transformer-based LLMs
    • all models need special implementation for incorporating language features
    • number of output heads needed to be changed from 3 to 2 for Twitter-RoBERTa
  • results - results of the different runs
    • directory structure - <dataset> -> <model> -> <run-date> -> training logs and <data-split_results>
  • utils - utility functions for running the transformer experiments
  • analysis.ipynb - notebook to consolidate results
  • main.py - main file for running the transformer experiments
  • plotting.ipynb - notebook to generate plots for results of experiments in xml.ipynb
  • xml.ipynb - notebook running the experiments using simple machine learning algorithms with the language features
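
The results layout described above (<dataset> -> <model> -> <run-date>) can be navigated programmatically. The following is a minimal sketch, assuming run directories are named with a YYYY-MM-DD-HH-MM-SS timestamp, as in the inference example path in the Execution section. `latest_run` and `RUN_DATE_FORMAT` are hypothetical names used for illustration and are not part of the repository.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed from directory names such as 2023-12-19-00-28-08.
RUN_DATE_FORMAT = "%Y-%m-%d-%H-%M-%S"


def latest_run(results_root: str, dataset: str, model: str) -> Optional[Path]:
    """Return the most recent run directory under <results_root>/<dataset>/<model>, or None."""
    model_dir = Path(results_root) / dataset / model
    if not model_dir.is_dir():
        return None
    runs = []
    for child in model_dir.iterdir():
        if not child.is_dir():
            continue
        try:
            stamp = datetime.strptime(child.name, RUN_DATE_FORMAT)
        except ValueError:
            continue  # skip directories that are not timestamped runs
        runs.append((stamp, child))
    return max(runs, key=lambda pair: pair[0])[1] if runs else None
```

Note that the model directory name may differ from the value passed to --model (the example path in the Execution section uses CT-BERT for covid-twitter-bert-v2), so the `model` argument here should be the directory name as it appears on disk.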

Setup

conda create --name <env-name> --file requirements.txt python=3.8
conda activate <env-name>

Execution

To run the transformer experiments, execute

python3 -B main.py \
    --dataset <dataset> \
    --model <model>

where dataset can be one of

  • aaai-constraint-covid
  • aaai-constraint-covid-appended
  • aaai-constraint-covid-cleaned
  • aaai-constraint-covid-cleaned-appended

and model can be one of

  • albert-base-v2 - Base version of ALBERT model with a randomly initialized sequence classification head. See HF model card.
  • bart-base - Base version of BART model. See HF model card.
  • bert-base-uncased - Base version of BERT model. See HF model card.
  • bertweet-covid-19-base-uncased - Base version of BERTweet model, a RoBERTa model pre-trained on ~850M tweets, ~5M of which were COVID-19 related. See HF model card.
  • covid-twitter-bert-v2 - CT-BERT Model, which is a large BERT model pre-trained on ~97M COVID-related tweets. See HF model card.
  • distilbert-base-uncased - Base version of DistilBERT model, a distilled version of BERT, i.e. a smaller model trained with BERT as a teacher. See HF model card.
  • longformer-base-4096 - Base version of Longformer model, which is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. See HF model card.
  • roberta-base - Base version of RoBERTa model. See HF model card.
  • twitter-roberta-base-sentiment-latest - Base version of Twitter-RoBERTa model, which is a RoBERTa-base model trained on ~124M tweets, and finetuned for sentiment analysis with the TweetEval benchmark. See HF model card.
  • xlm-mlm-en-2048 - XLM model trained with masked language modeling (MLM) objective. See HF model card.
  • xlm-roberta-base - Base version of XLM-RoBERTa model, a multilingual version of RoBERTa. See HF model card.
  • xlnet-base-uncased - Base version of XLNet model. See HF model card.

You can also override default configurations using the command line. For example,

python3 -B main.py \
    --dataset <dataset> \
    --model <model> \
    ADD_NEW_TOKENS True \
    DATASET.BATCH_SIZE 16 \
    DATASET.args.root <dataset-root> \
    MODEL.MAX_LENGTH 200
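
To illustrate how trailing KEY VALUE pairs like the ones above can map onto a nested configuration, here is a minimal sketch using only the standard library. It is not the repository's actual parsing code: `parse_value`, `apply_overrides`, and the default batch size of 32 are assumptions made for the example.

```python
import argparse
import ast
from typing import Any, Dict, List


def parse_value(text: str) -> Any:
    """Interpret 'True', '16', 'None' as Python literals; keep anything else as a string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply alternating KEY VALUE pairs to a nested dict, splitting keys on dots."""
    if len(overrides) % 2 != 0:
        raise ValueError("overrides must be KEY VALUE pairs")
    for key, raw in zip(overrides[0::2], overrides[1::2]):
        node = config
        *parents, leaf = key.split(".")
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return config


parser = argparse.ArgumentParser()
parser.add_argument("--dataset", required=True)
parser.add_argument("--model", required=True)
parser.add_argument("overrides", nargs="*")  # trailing KEY VALUE pairs

args = parser.parse_args([
    "--dataset", "aaai-constraint-covid",
    "--model", "bert-base-uncased",
    "ADD_NEW_TOKENS", "True",
    "DATASET.BATCH_SIZE", "16",
    "MODEL.MAX_LENGTH", "200",
])
# The starting config (BATCH_SIZE 32) is a made-up default for illustration.
config = apply_overrides({"DATASET": {"BATCH_SIZE": 32}}, args.overrides)
print(config)
```

Dotted keys such as DATASET.BATCH_SIZE address nested entries, and values are converted to booleans, integers, or None where they parse as Python literals. Anything else, such as a dataset root path, stays a string.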

For inference, execute

python3 -B inference.py \
    --dataset <dataset> \
    --model <model> \
    --weights <path-to-weights>

For example,

python3 -B inference.py \
    --dataset aaai-constraint-covid \
    --model covid-twitter-bert-v2 \
    --weights "./results/aaai-constraint-covid/CT-BERT/2023-12-19-00-28-08/ckpt5350.pth"
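
When a run directory holds several checkpoints, the path for --weights can be chosen programmatically. The sketch below assumes checkpoint files are named ckpt<number>.pth, as in the example above, and that a larger number means a later checkpoint. Both are assumptions, and `last_checkpoint` is a hypothetical helper, not part of the repository.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed from the file name ckpt5350.pth.
CKPT_PATTERN = re.compile(r"^ckpt(\d+)\.pth$")


def last_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the checkpoint with the largest number in its name, or None if none match."""
    best_step, best_path = -1, None
    for path in Path(run_dir).iterdir():
        match = CKPT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Comparing the numbers as integers avoids the pitfall of sorting file names as strings, where ckpt900.pth would sort after ckpt5350.pth.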

Note: To run on the CPU instead of the GPU, set the device index to None, e.g.

python3 -B main.py \
    --dataset <dataset> \
    --model <model> \
    DEVICE_INDEX None
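
As a rough illustration of what such a setting typically controls, the sketch below maps a device index to a PyTorch-style device string. How the repository actually interprets DEVICE_INDEX is not shown in this README, so `device_string` is a hypothetical helper and the mapping is an assumption.

```python
from typing import Optional


def device_string(device_index: Optional[int]) -> str:
    """Map a device index to a PyTorch-style device string; None selects the CPU."""
    if device_index is None:
        return "cpu"
    if device_index < 0:
        raise ValueError("device index must be non-negative")
    return f"cuda:{device_index}"
```

Under this assumption, DEVICE_INDEX None would resolve to "cpu" and DEVICE_INDEX 0 to "cuda:0".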

Citation

If you use this work, kindly cite it as

@misc{singh2024lingmllinguisticinformedmachinelearning,
      title={LingML: Linguistic-Informed Machine Learning for Enhanced Fake News Detection}, 
      author={Jasraj Singh and Fang Liu and Hong Xu and Bee Chin Ng and Wei Zhang},
      year={2024},
      eprint={2405.04165},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.04165}, 
}
