This project provides code to pretrain a GPT-based table language model that models the language of scientific table cell contents at the character level. It is based on the nanoGPT codebase.
- Linux
- Python 3.9+
Create a Python virtual environment
python3 -m venv ~/table_lm_venv
Install the requirements into your newly created environment
source ~/table_lm_venv/bin/activate
cd $HOME/table_lm
pip install torch numpy transformers datasets tiktoken wandb tqdm
The raw data for this model is assumed to be in the following JSON format (one file per table)
{
  "headers": [
    {"columns": [{"content": "<content>", "colspan": 1},
                 {"content": "<content>", "colspan": 1}]
    },
    {"columns": [{"content": "<content>", "colspan": 1},
                 {"content": "<content>", "colspan": 1}]
    }
  ],
  "rows": [
    [
      {"content": "<cell-content>", "colspan": 1},
      {"content": "<cell-content>", "colspan": 1}
    ],
    [
      {"content": "<cell-content>", "colspan": 1},
      {"content": "<cell-content>", "colspan": 1}
    ]
  ]
}
The data preparation is done in two stages. First, training and validation intermediate files (one cell content per line) are generated via the script pretrain_data_prep.py
python pretrain_data_prep.py -i <JSON-table-files-directory> -o <training/validation-output-directory>
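For illustration, a minimal sketch of the flattening this step performs, assuming the JSON format above. The function name, output file names, and the train/validation split ratio are assumptions for the sketch, not details of pretrain_data_prep.py:

```python
import glob
import json
import os
import random

def flatten_tables(json_dir, out_dir, val_fraction=0.1):
    # Collect every cell content string from the headers and rows of each table.
    cells = []
    for path in glob.glob(os.path.join(json_dir, "*.json")):
        with open(path, "r", encoding="utf-8") as f:
            table = json.load(f)
        for header in table.get("headers", []):
            cells.extend(c["content"] for c in header.get("columns", []))
        for row in table.get("rows", []):
            cells.extend(c["content"] for c in row)
    # Shuffle and split; the 90/10 split and the file names are illustrative.
    random.shuffle(cells)
    split = int(len(cells) * (1 - val_fraction))
    for name, part in (("train.txt", cells[:split]), ("val.txt", cells[split:])):
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write("\n".join(part) + "\n")
```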
The second step maps characters to integers, prepares the vocabulary metadata, and creates the encoded training and validation files.
First, create the vocabulary
python data/table_llm_char/prepare.py -c vocab -i <input-text-corpus-file> -od <output-dir>
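The vocabulary step follows the usual nanoGPT character-level convention: collect the set of characters in the corpus, build char-to-int (stoi) and int-to-char (itos) lookup tables, and pickle them as the vocabulary metadata. A rough sketch, assuming that convention; the exact structure written by prepare.py may differ:

```python
import pickle

def build_vocab(corpus_path, meta_path):
    # Read the full text corpus and collect its unique characters.
    with open(corpus_path, "r", encoding="utf-8") as f:
        data = f.read()
    chars = sorted(set(data))
    stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
    itos = {i: ch for i, ch in enumerate(chars)}  # int -> char
    meta = {"vocab_size": len(chars), "stoi": stoi, "itos": itos}
    with open(meta_path, "wb") as f:
        pickle.dump(meta, f)
    return meta
```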
Then encode the training and validation text files generated in the first step. The script assumes the validation text corpus file is in the same directory as the training text corpus file.
python data/table_llm_char/prepare.py -c encode -i <train-text-corpus-file> -od <output-dir>
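Encoding maps each character to its integer id and writes the result as a flat binary file of uint16 values (train.bin / val.bin), again following the nanoGPT convention. A sketch under that assumption, reusing the metadata structure from the previous sketch:

```python
import pickle
import numpy as np

def encode_corpus(text_path, meta_path, bin_path):
    # Load the char -> int mapping produced in the vocabulary step.
    with open(meta_path, "rb") as f:
        stoi = pickle.load(f)["stoi"]
    with open(text_path, "r", encoding="utf-8") as f:
        data = f.read()
    # Encode every character and dump the ids as a flat uint16 array.
    ids = np.array([stoi[ch] for ch in data], dtype=np.uint16)
    ids.tofile(bin_path)

# e.g. encode_corpus("train.txt", "meta.pkl", "train.bin")  # file names illustrative
```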
Make sure you have a GPU with at least 16 GB of memory. The script assumes that the train.bin, val.bin, and meta.pkl files generated in the second step of data preparation are stored under the data/table_llm_char/ directory.
nohup python train.py config/train_table_char_llm.py &> nohup_pretrain.log &
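config/train_table_char_llm.py is a nanoGPT-style configuration file of plain module-level assignments. The values below are only a representative sketch; the actual hyperparameters shipped in the repository may differ:

```python
# Representative nanoGPT-style config; the values are illustrative, not the repo's.
out_dir = "out-table-char-llm"
dataset = "table_llm_char"       # points train.py at data/table_llm_char/

# model size (assumed)
n_layer = 6
n_head = 6
n_embd = 384
block_size = 256
dropout = 0.2

# optimization (assumed)
batch_size = 64
learning_rate = 1e-3
max_iters = 100000
eval_interval = 1000
eval_iters = 200
```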
The pretraining takes about 8 hours on an RTX 4090 24GB GPU.
Synthetic training data for the row merge classifier is prepared via the script merge_model_data_prep.py.
python merge_model_data_prep.py -c prep -i <table-json-files-dir> -o <output-dir>
Here the training tables are represented in the same JSON format as in the Table LM pretraining.
After that, you can use the merge_model_train.py script to train and validate the row merge classifier via transfer learning from the pretrained Table LM.
python merge_model_train.py -p <table-lm-checkpoint-file> -v <vocabulary-meta-file> -d <data-dir> -o <model-output-dir>
Here <data-dir> is the same as the <output-dir> from the previous data preparation step.
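Conceptually, the transfer learning step loads the pretrained Table LM checkpoint and trains a binary classifier on top of its representations to decide whether two adjacent rows should be merged. The sketch below only illustrates that idea; the class name, pooling choice, and the way merge_model_train.py actually attaches the head are assumptions:

```python
import torch
import torch.nn as nn

class RowMergeClassifier(nn.Module):
    """Binary merge / no-merge classifier on top of a pretrained character-level
    Table LM. The backbone is assumed to expose hidden states of size n_embd."""

    def __init__(self, table_lm, n_embd):
        super().__init__()
        self.table_lm = table_lm          # pretrained GPT backbone (frozen or fine-tuned)
        self.head = nn.Linear(n_embd, 2)  # merge / no-merge logits

    def forward(self, hidden_states):
        # Pool the per-character hidden states of the candidate row pair
        # and classify; mean pooling is an illustrative choice.
        pooled = hidden_states.mean(dim=1)
        return self.head(pooled)

# Loading the pretrained weights (checkpoint layout assumed to follow nanoGPT):
# checkpoint = torch.load("<table-lm-checkpoint-file>", map_location="cpu")
# model_state = checkpoint["model"]
```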