GitHub - d0r1h/ILC: Indian Legal Corpus for Summarization

Dataset

We have scraped and compiled a corpus of 3k+ Indian legal judgments and their parallel summaries.

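import pandas as pd
from datasets import load_dataset

# Download the ILC corpus from the Hugging Face Hub
dataset = load_dataset("d0r1h/ILC")

# Convert each split to a pandas DataFrame
train_set = pd.DataFrame(dataset['train'])
test_set = pd.DataFrame(dataset['test'])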

Code

git clone https://github.com/d0r1h/ILC.git
cd ILC
pip install -r requirement.txt

Summarizing using the extractive approach

!python Code/Models/extractive.py \
        --output_dir dir_name \
        --text_column text \
        --summary_column summary \
        --data_file data.csv \
        --sentence_count 3
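
To illustrate what `--sentence_count` controls, here is a minimal sketch of extractive summarization: score every sentence and keep the top few in their original order. It is not the code in `Code/Models/extractive.py`. The frequency-based scoring, the naive sentence splitter, and the names `split_sentences` and `summarize` are assumptions made for this example only.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def summarize(text, sentence_count=3):
    """Return the `sentence_count` highest-scoring sentences in document order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document (lowercased, alphabetic tokens only).
    freqs = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best ones, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    judgment = (
        "The court heard the appeal. The appeal was filed by the tenant. "
        "Weather was pleasant. The court dismissed the appeal."
    )
    for sentence in summarize(judgment, sentence_count=2):
        print(sentence)
```

Graph-based methods such as LexRank and TextRank replace the scoring function but follow the same select-top-k pattern.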

Training LED using the abstractive approach

!python Code/Models/led_summarization.py \
        --model_name allenai/led-base-16384 \
        --text_column Case \
        --summary_column Summary \
        --max_input_length 8192 \
        --max_output_length 600 \
        --batch_size 2 \
        --num_beams 2 \
        --output_dir output_dir_name
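
The sketch below shows what `--max_input_length` means for a single example: token IDs past the limit are dropped. It also builds a global attention mask with global attention on the first token only, which is what the Hugging Face LED documentation advises for summarization. Whether `led_summarization.py` does exactly this is an assumption. The function name `prepare_led_inputs` is hypothetical, and plain Python lists stand in for tensors.

```python
def prepare_led_inputs(token_ids, max_input_length=8192):
    """Truncate a token-ID list and build the matching attention masks.

    Hypothetical helper for illustration; a real pipeline would use the
    model's tokenizer and return tensors instead of lists.
    """
    # Keep at most `max_input_length` tokens; anything beyond is discarded.
    input_ids = list(token_ids[:max_input_length])
    # 1 marks a real token (no padding in this single-example sketch).
    attention_mask = [1] * len(input_ids)
    # 0 = local (windowed) attention, 1 = global attention.
    global_attention_mask = [0] * len(input_ids)
    if input_ids:
        # Global attention on the first token only.
        global_attention_mask[0] = 1
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "global_attention_mask": global_attention_mask,
    }


if __name__ == "__main__":
    # Made-up token IDs, truncated to 4 tokens for readability.
    print(prepare_led_inputs(list(range(10)), max_input_length=4))
```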

Inference on the test set using the led-base-ilc model

Notebook	Colab
led-base-ilc

Results:

The following results were obtained on the test set with transformer-based models and extractive methods.

Algorithm / model	Rouge-1	Rouge-2	Rouge-L
Extractive
SumBasics	15.69	6.02	14.48
LSA	21.20	7.37	19.76
KLSum	21.40	10.19	19.66
LexRank	33.09	16.81	22.99
TextRank	34.54	18.10	31.11
Abstractive
LedBase	4.31	1.08	4.11
Led-ilc	42.24	23.18	39.30
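
For readers unfamiliar with the metric, the sketch below computes a simplified ROUGE-N F1 score from n-gram overlap between a reference and a candidate summary. It assumes whitespace tokenization and lowercasing, with no stemming. It is not the scorer used to produce the table, and the table values appear to be on a 0-100 scale, whereas this function returns a value between 0 and 1. The names `ngrams` and `rouge_n_f1` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_f1(reference, candidate, n=1):
    """Simplified ROUGE-N F1: clipped n-gram overlap between two texts."""
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the court dismissed the appeal"
    candidate = "the court allowed the appeal"
    print(rouge_n_f1(reference, candidate, n=1))
    print(rouge_n_f1(reference, candidate, n=2))
```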
