This project provides code to pretrain a GPT-based table language model that models the language of scientific table cell contents at the character level. It is based on the nanoGPT codebase.
- Linux
- Python 3.9+
Create a Python virtual environment
python3 -m venv ~/table_lm_venv
Install the requirements into your newly created environment
source ~/table_lm_venv/bin/activate
cd $HOME/table_lm
pip install torch numpy transformers datasets tiktoken wandb tqdm
The raw data for this model is assumed to be in the following JSON format (one file per table)
{
  "headers": [
    {"columns": [{"content": "<content>", "colspan": 1},
                 {"content": "<content>", "colspan": 1}]
    },
    {"columns": [{"content": "<content>", "colspan": 1},
                 {"content": "<content>", "colspan": 1}]
    }
  ],
  "rows": [
    [
      {"content": "<cell-content>", "colspan": 1},
      {"content": "<cell-content>", "colspan": 1}
    ],
    [
      {"content": "<cell-content>", "colspan": 1},
      {"content": "<cell-content>", "colspan": 1}
    ]
  ]
}
The data preparation is done in two stages. First, training and validation intermediate files (one cell content per line) are generated via the script pretrain_data_prep.py
python pretrain_data_prep.py -i <JSON-table-files-directory> -o <training/validation-output-directory>
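For illustration, a minimal sketch of the flattening this step performs, assuming the JSON format above. The function name, output file names, and the train/validation split ratio are assumptions for the sketch, not details of pretrain_data_prep.py:

```python
import glob
import json
import os
import random

def flatten_tables(json_dir, out_dir, val_fraction=0.1):
    # Collect every cell content string from the headers and rows of each table.
    cells = []
    for path in glob.glob(os.path.join(json_dir, "*.json")):
        with open(path, "r", encoding="utf-8") as f:
            table = json.load(f)
        for header in table.get("headers", []):
            cells.extend(c["content"] for c in header.get("columns", []))
        for row in table.get("rows", []):
            cells.extend(c["content"] for c in row)
    # Shuffle and split; the 90/10 split and the file names are illustrative.
    random.shuffle(cells)
    split = int(len(cells) * (1 - val_fraction))
    for name, part in (("train.txt", cells[:split]), ("val.txt", cells[split:])):
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write("\n".join(part) + "\n")
```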
The second step maps characters to integers, prepares the vocabulary metadata, and creates the encoded training and validation files.
First, create the vocabulary
python data/table_llm_char/prepare.py -c vocab -i <input-text-corpus-file> -od <output-dir>
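The vocabulary step follows the usual nanoGPT character-level convention: collect the set of characters in the corpus, build char-to-int (stoi) and int-to-char (itos) lookup tables, and pickle them as the vocabulary metadata. A rough sketch, assuming that convention; the exact structure written by prepare.py may differ:

```python
import pickle

def build_vocab(corpus_path, meta_path):
    # Read the full text corpus and collect its unique characters.
    with open(corpus_path, "r", encoding="utf-8") as f:
        data = f.read()
    chars = sorted(set(data))
    stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
    itos = {i: ch for i, ch in enumerate(chars)}  # int -> char
    meta = {"vocab_size": len(chars), "stoi": stoi, "itos": itos}
    with open(meta_path, "wb") as f:
        pickle.dump(meta, f)
    return meta
```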
Then encode the training and validation text files generated in the first step. The script assumes the validation text corpus file is in the same directory as the training text corpus file.
python data/table_llm_char/prepare.py -c encode -i <train-text-corpus-file> -od <output-dir>
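Encoding maps each character to its integer id and writes the result as a flat binary file of uint16 values (train.bin / val.bin), again following the nanoGPT convention. A sketch under that assumption, reusing the metadata structure from the previous sketch:

```python
import pickle
import numpy as np

def encode_corpus(text_path, meta_path, bin_path):
    # Load the char -> int mapping produced in the vocabulary step.
    with open(meta_path, "rb") as f:
        stoi = pickle.load(f)["stoi"]
    with open(text_path, "r", encoding="utf-8") as f:
        data = f.read()
    # Encode every character and dump the ids as a flat uint16 array.
    ids = np.array([stoi[ch] for ch in data], dtype=np.uint16)
    ids.tofile(bin_path)

# e.g. encode_corpus("train.txt", "meta.pkl", "train.bin")  # file names illustrative
```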
Make sure you have a GPU with at least 16 GB of memory. The script assumes that the train.bin, val.bin, and meta.pkl files generated in the second step of data preparation are stored under the data/table_llm_char/ directory.
nohup python train.py config/train_table_char_llm.py &> nohup_pretrain.log &
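config/train_table_char_llm.py is a nanoGPT-style configuration file of plain module-level assignments. The values below are only a representative sketch; the actual hyperparameters shipped in the repository may differ:

```python
# Representative nanoGPT-style config; the values are illustrative, not the repo's.
out_dir = "out-table-char-llm"
dataset = "table_llm_char"       # points train.py at data/table_llm_char/

# model size (assumed)
n_layer = 6
n_head = 6
n_embd = 384
block_size = 256
dropout = 0.2

# optimization (assumed)
batch_size = 64
learning_rate = 1e-3
max_iters = 100000
eval_interval = 1000
eval_iters = 200
```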
The pretraining takes about 8 hours on an RTX 4090 24GB GPU.
Synthetic training data for the row merge classifier is prepared via the script merge_model_data_prep.py.
python merge_model_data_prep.py -c prep -i <table-json-files-dir> -o <output-dir>
Here the training tables are represented in the same JSON format as in the Table LM pretraining.
After that, you can use the merge_model_train.py script to train and validate the row merge classifier via transfer learning from the pretrained Table LM.
python merge_model_train.py -p <table-lm-checkpoint-file> -v <vocabulary-meta-file> -d <data-dir> -o <model-output-dir>
Here <data-dir> is the same as the <output-dir> from the previous data preparation step.
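Conceptually, the transfer learning step loads the pretrained Table LM checkpoint and trains a binary classifier on top of its representations to decide whether two adjacent rows should be merged. The sketch below only illustrates that idea; the class name, pooling choice, and the way merge_model_train.py actually attaches the head are assumptions:

```python
import torch
import torch.nn as nn

class RowMergeClassifier(nn.Module):
    """Binary merge / no-merge classifier on top of a pretrained character-level
    Table LM. The backbone is assumed to expose hidden states of size n_embd."""

    def __init__(self, table_lm, n_embd):
        super().__init__()
        self.table_lm = table_lm          # pretrained GPT backbone (frozen or fine-tuned)
        self.head = nn.Linear(n_embd, 2)  # merge / no-merge logits

    def forward(self, hidden_states):
        # Pool the per-character hidden states of the candidate row pair
        # and classify; mean pooling is an illustrative choice.
        pooled = hidden_states.mean(dim=1)
        return self.head(pooled)

# Loading the pretrained weights (checkpoint layout assumed to follow nanoGPT):
# checkpoint = torch.load("<table-lm-checkpoint-file>", map_location="cpu")
# model_state = checkpoint["model"]
```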