We have scraped and compiled a corpus of 3k+ Indian legal judgments and their parallel summaries.
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("d0r1h/ILC")
train_set = pd.DataFrame(dataset['train'])
test_set = pd.DataFrame(dataset['test'])
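The extractive script shown further down reads a CSV file with `text` and `summary` columns. A minimal sketch of producing such a file from a loaded split follows. It assumes the dataset columns are named `Case` and `Summary` (the names passed to the LED script below); `export_for_extractive` is a hypothetical helper and is not part of this repository.

```python
import csv


def export_for_extractive(records, path, text_key="Case", summary_key="Summary"):
    """Write records to a CSV with `text` and `summary` columns.

    `records` is any iterable of dict-like rows, for example dataset['test'].
    The source column names are assumptions; adjust them if they differ.
    Returns the number of rows written.
    """
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["text", "summary"])
        writer.writeheader()
        for row in records:
            writer.writerow({"text": row[text_key], "summary": row[summary_key]})
            count += 1
    return count
```

With the dataset loaded as above, `export_for_extractive(dataset['test'], "data.csv")` would write the file that `--data_file data.csv` points to.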
git clone https://github.com/d0r1h/ILC.git
cd ILC
pip install -r requirement.txt
Summarizing using the extractive approach
!python Code/Models/extractive.py \
--output_dir dir_name \
--text_column text \
--summary_column summary \
--data_file data.csv \
--sentence_count 3
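To illustrate what an extractive method does with `--sentence_count`, here is a minimal word-frequency sketch: it scores each sentence by the average corpus frequency of its words and keeps the top-scoring sentences in their original order. This is a simplified stand-in for illustration, not the algorithm used by `extractive.py`; the function names and the naive sentence splitter are assumptions.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive split on ., ! or ? followed by whitespace. Legal text contains
    # abbreviations such as "v." that a real sentence tokenizer handles better.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, sentence_count=3):
    """Return the `sentence_count` highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Average word frequency, so long sentences are not favoured by length alone
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])
    return " ".join(sentences[i] for i in keep)
```

Calling `summarize(judgment_text, sentence_count=3)` returns three sentences copied verbatim from the input, which is the defining property of extractive summarization.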
Training LED using the abstractive approach
!python Code/Models/led_summarization.py \
--model_name allenai/led-base-16384 \
--text_column Case \
--summary_column Summary \
--max_input_length 8192 \
--max_output_length 600 \
--batch_size 2 \
--num_beams 2 \
--output_dir output_dir_name
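LED combines local windowed attention with global attention on a few selected tokens, and a common convention when fine-tuning it for summarization is to give global attention to the first token only. The sketch below shows that input layout in plain Python for a list of token ids truncated to `--max_input_length`. Whether `led_summarization.py` prepares inputs this way is an assumption, and `prepare_led_inputs` is a hypothetical helper.

```python
def prepare_led_inputs(token_ids, max_input_length=8192):
    """Truncate token ids and build the attention masks LED expects.

    Simplification: a real tokenizer also keeps the end-of-sequence token
    when truncating and pads batches to a common length.
    """
    ids = list(token_ids[:max_input_length])
    attention_mask = [1] * len(ids)          # every kept token is attended locally
    global_attention_mask = [0] * len(ids)   # no global attention by default
    if ids:
        global_attention_mask[0] = 1         # global attention on the first token only
    return {
        "input_ids": ids,
        "attention_mask": attention_mask,
        "global_attention_mask": global_attention_mask,
    }
```

Judgments longer than 8192 tokens are cut off under this setting, so content near the end of very long cases does not reach the model.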
Inference on the test set using the led-base-ilc model
Notebook | Colab |
---|---|
led-base-ilc |
The following results were obtained on the test set with transformer-based models and extractive methods:
Algorithm / model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
Extractive | |||
SumBasic | 15.69 | 6.02 | 14.48 |
LSA | 21.20 | 7.37 | 19.76 |
KLSum | 21.40 | 10.19 | 19.66 |
LexRank | 33.09 | 16.81 | 22.99 |
TextRank | 34.54 | 18.10 | 31.11 |
Abstractive | |||
LedBase | 4.31 | 1.08 | 4.11 |
Led-ilc | 42.24 | 23.18 | 39.30 |
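
For readers unfamiliar with the metrics in the table above, the following is a simplified sketch of how ROUGE-N and ROUGE-L F1 scores can be computed for one reference and one candidate summary. It uses lowercased whitespace tokens with no stemming, so it will not reproduce the table exactly; it also assumes the reported numbers are F1 scores scaled to 0-100, which is not stated here. The function names are illustrative.

```python
from collections import Counter


def _f1(overlap, ref_len, cand_len):
    # Harmonic mean of precision (overlap / candidate) and recall (overlap / reference)
    if overlap == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n=1):
    """F1 over clipped n-gram overlap between reference and candidate."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    overlap = sum((ref_ngrams & cand_ngrams).values())
    return _f1(overlap, sum(ref_ngrams.values()), sum(cand_ngrams.values()))


def rouge_l(reference, candidate):
    """F1 based on the longest common subsequence of tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    prev = [0] * (len(cand) + 1)  # LCS lengths for the previous reference token
    for r in ref:
        cur = [0]
        for j, c in enumerate(cand, 1):
            cur.append(prev[j - 1] + 1 if r == c else max(prev[j], cur[j - 1]))
        prev = cur
    return _f1(prev[-1], len(ref), len(cand))
```

For example, `rouge_n("the appeal was dismissed", "the appeal was allowed")` shares three of four unigrams and gives 0.75, which would be reported as 75.00 on the scale assumed above.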