This repository is based on "Top-down RST Parsing Utilizing Granularity Levels in Documents".
Before running any script, organize your files into a common SOURCE directory. This directory contains documents in CoNLL-U format, where the last column (MISC) carries the discourse segmentation: the field 'BeginSeg=YES' is set on each token that begins a new elementary discourse unit (EDU). The number of these annotations must exactly match the number of EDUs in the corresponding RST annotation file. For RST annotations, there are two options:
- file.dis: files corresponding to the original RST-DT tree format
- file.tree: files corresponding to the labeled attachment tree format, used in this system
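The EDU-count constraint above can be checked before preprocessing. The following is a minimal sketch and not part of this repository: the helper names `count_edus_conllu` and `count_tree_leaves` are hypothetical, and the `(text N)` leaf pattern is assumed from the labelled attachment tree example given in this README.

```python
import re


def count_edus_conllu(conllu_text: str) -> int:
    """Count tokens whose MISC column (10th field) carries BeginSeg=YES."""
    count = 0
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence breaks) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            continue  # not a well-formed CoNLL-U token line
        # MISC holds '|'-separated key=value pairs, or '_' when empty.
        if "BeginSeg=YES" in fields[9].split("|"):
            count += 1
    return count


def count_tree_leaves(tree: str) -> int:
    """Count '(text N)' leaves in a tree string (assumed leaf format)."""
    return len(re.findall(r"\(text \d+\)", tree))


# Tiny hand-written example: two EDU starts and two tree leaves.
sample_conllu = "\n".join([
    "# sent_id = 1",
    "1\tFirst\tfirst\tADJ\t_\t_\t2\tamod\t_\tBeginSeg=YES",
    "2\tsentence\tsentence\tNOUN\t_\t_\t0\troot\t_\t_",
    "3\tand\tand\tCCONJ\t_\t_\t4\tcc\t_\tBeginSeg=YES",
    "4\tmore\tmore\tADJ\t_\t_\t2\tconj\t_\t_",
])
sample_tree = "(nucleus-satellite:Elaboration (text 0) (text 1))"

if count_edus_conllu(sample_conllu) != count_tree_leaves(sample_tree):
    raise ValueError("EDU count in CoNLL-U does not match the tree")
```

Running a check like this over each document in SOURCE should surface mismatched files before they reach the preprocessing step.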
python -m rstparser.cli.preprocess SOURCE DESTINATION
During preprocessing, each document file is converted into the corresponding JSONL format described below:
"doc_id": "wsj_****"
"labelled_attachment_tree": "(nucleus-satellite:Elaboration (text 0) (text 1))"
"tokenized_strings": ["first sentence corresponding to text 1 .", "and this is second sentence ."]
"raw_tokenized_strings": ["first", "sentence", "corresponding", "to", "text", "1", ".", "and", "this", "is", "second", "sentence", "."]
"starts_sentence": [true, true]
"starts_paragraph": [true, false]
"parent_label": null
"granularity_type": "D2E"
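To inspect the preprocessed output, each line of a JSONL file can be parsed as one JSON object. The sketch below is illustrative only: `check_record` is a hypothetical helper, and the check that `starts_sentence` and `starts_paragraph` have the same length as `tokenized_strings` is an assumption inferred from the sample record above, not a documented guarantee.

```python
import json

# Field names taken from the record layout listed above.
EXPECTED_KEYS = {
    "doc_id", "labelled_attachment_tree", "tokenized_strings",
    "raw_tokenized_strings", "starts_sentence", "starts_paragraph",
    "parent_label", "granularity_type",
}


def check_record(line: str) -> dict:
    """Parse one JSONL line and run basic checks (hypothetical helper)."""
    record = json.loads(line)
    missing = EXPECTED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    n_units = len(record["tokenized_strings"])
    # Assumption: boundary flags align one-to-one with tokenized_strings.
    for key in ("starts_sentence", "starts_paragraph"):
        if len(record[key]) != n_units:
            raise ValueError(f"{key} has {len(record[key])} items, expected {n_units}")
    return record


# A record mirroring the sample above, serialized as a single JSONL line.
sample_line = json.dumps({
    "doc_id": "wsj_****",
    "labelled_attachment_tree": "(nucleus-satellite:Elaboration (text 0) (text 1))",
    "tokenized_strings": ["first sentence corresponding to text 1 .",
                          "and this is second sentence ."],
    "raw_tokenized_strings": ["first", "sentence", "corresponding", "to", "text",
                              "1", ".", "and", "this", "is", "second", "sentence", "."],
    "starts_sentence": [True, True],
    "starts_paragraph": [True, False],
    "parent_label": None,
    "granularity_type": "D2E",
})

record = check_record(sample_line)
```

For a real file, the same function could be applied to every non-empty line, for example the files under data/sample/.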
Sample preprocessed files are available in data/sample/.
Train a segmentation model:
python -m rstparser.networks.segmenter --train-file rst_data/train.jsonl --valid-file rst_data/valid.jsonl --batch-size 64 --hidden 128 --bert-model roberta-base --epochs 20 --test-file rst_data/test.jsonl --serialization-dir models/seg.t3
Test the segmentation model:
python -m rstparser.networks.segmenter --batch-size 64 --test-file rst_data/test.jsonl --model-paths models/seg.*/model_best_*
Train the model 5 times for D2E, D2P, D2S, P2S, P2E, and S2E. To select a GPU device, set the environment variable CUDA_VISIBLE_DEVICES.
bash script/training.sh
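GPU selection through CUDA_VISIBLE_DEVICES can also be scripted when launching several runs. This is a minimal sketch, not something the repository provides: `run_on_gpu` and its `dry_run` flag are hypothetical names introduced for illustration.

```python
import os
import subprocess


def run_on_gpu(command, gpu_ids, dry_run=False):
    """Run `command` with CUDA_VISIBLE_DEVICES restricted to `gpu_ids`.

    Returns the environment that was (or would be) passed to the child process.
    """
    env = os.environ.copy()
    # CUDA reads a comma-separated list of device indices from this variable.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    if not dry_run:
        subprocess.run(command, env=env, check=True)
    return env


# Dry run: only build the environment, do not start training.
env = run_on_gpu(["bash", "script/training.sh"], gpu_ids=[0], dry_run=True)
print(env["CUDA_VISIBLE_DEVICES"])
```

Calling the function with `dry_run=False` would start the training script on the selected device.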
Evaluate on the test set for D2E, D2S2E, and D2P2S2E with a 5-model ensemble.
bash script/evaluate.sh
Parameters:
- MODELPATH refers to one or more models to load.
- SOURCE contains the CoNLL files to be segmented.
- DESTINATION is the directory where the segmented CoNLL files will be stored.
Segmentation
python -m rstparser.segment_conll --model-paths MODELPATH --input-doc SOURCE --output-dir DESTINATION
RST Parsing
python3 -m rstparser.parse_conll --model-path MODELPATH --hierarchical-type d2e --input-doc SOURCE --output-dir DESTINATION