DMRST_Parser

Introduction

  • An implementation of the papers DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing and Multilingual Neural RST Discourse Parsing.
  • Users can apply it to parse input text from scratch and obtain the EDU segmentation and the parsed tree structure.
  • The model supports both sentence-level and document-level RST discourse parsing.
  • We trained and evaluated the model with the multilingual collection of RST discourse treebanks, and it natively supports 6 languages: English, Portuguese, Spanish, German, Dutch, Basque. Interested users can also try other languages.
  • This repo and the pre-trained model are only for research use. Please cite the papers if they are helpful.

Package Requirements

The model training and inference scripts were tested with the following library versions (a quick environment check follows the list):

  1. pytorch==1.7.1
  2. transformers==4.8.2
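
To verify your environment before training or inference, a minimal check such as the sketch below can be used; the version numbers in the comments are the tested ones listed above, and other nearby versions may also work.

# Minimal environment check (not part of the repo).
import torch
import transformers

print(torch.__version__)         # tested with 1.7.1
print(transformers.__version__)  # tested with 4.8.2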

Training: How to convert treebanks to our format for this framework

  • Follow the treebank pre-processing steps in the two sub-folders under Preprocess_RST_Data.
  • Note that the XLM-Roberta-base tokenizer is used in both treebank pre-processing and model training scripts. If you want to use other tokenizers, you should change them accordingly.
  • After all treebank pre-processing steps, the samples are stored in pickle files (the output path is set by the user).
  • Since some treebanks require an LDC license, for model re-training this repo provides only one public dataset, GUM (Zeldes, A., 2017), as an example.
  • The example pre-processed treebank GUM (Zeldes, A., 2017) (English-only) is located in the folder ./depth_mode/pkl_data_for_train/en-gum/; a sketch for inspecting these files follows this list.
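
For a first look at the pre-processed data, the following minimal sketch (not part of the repo) loads the pickle files from the example folder. The *.pkl file pattern is an assumption based on the path above, and the stored sample layout is defined by the pre-processing scripts, so only generic information is printed.

import glob
import pickle

# Load each pickle produced by the pre-processing steps; the file pattern and
# folder are assumptions based on the example path mentioned above.
for path in sorted(glob.glob("./depth_mode/pkl_data_for_train/en-gum/*.pkl")):
    with open(path, "rb") as f:
        data = pickle.load(f)
    # Print only generic information, since the sample layout is defined
    # by the pre-processing scripts.
    size = len(data) if hasattr(data, "__len__") else "n/a"
    print(path, type(data).__name__, size)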

Training: How to train a model with a pre-processed treebank

  • Run the script MUL_main_Train.py to train a model.
  • Before you start training, we recommend reviewing the parameter settings.
  • The pre-processed data in the folder ./depth_mode/pkl_data_for_train/en-gum/ (English-only) is used for training by default, as an example.
  • Note that the XLM-Roberta-base tokenizer is used in both the treebank pre-processing and model training scripts. If you want to use another tokenizer, you should change it accordingly in both places; see the sketch after this list.
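
As a reminder of that shared assumption, the sketch below loads the backbone tokenizer with Hugging Face transformers; switching to another backbone means changing this model identifier consistently in both the pre-processing and training scripts.

from transformers import AutoTokenizer

# The same tokenizer identifier must be used in pre-processing and training.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(tokenizer.tokenize("Although the report, which has released before the stock market opened,"))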

Inference: Supported Languages

  • Instead of re-training the model, you can use the pre-trained parser for inference (the model checkpoint is located at ./depth_mode/Savings/).
  • We trained and evaluated the model with the multilingual collection of RST discourse treebanks, and it natively supports 6 languages: English, Portuguese, Spanish, German, Dutch, Basque. Interested users can also try other languages.

Inference: Data Format

  • [Input] InputSentence: The input document/sentence; the raw text will be tokenized and encoded by the xlm-roberta-base language backbone.

    • Raw Sequence Example:
    • Although the report, which has released before the stock market opened, didn't trigger the 190.58 point drop in the Dow Jones Industrial Average, analysts said it did play a role in the market's decline.
  • [Output] EDU_Breaks: The indices of the EDU boundary tokens, including the last word of the sentence (a sketch that maps these indices back to text spans follows this list).

    • Output Example: [5, 10, 17, 33, 37, 49]
    • Segmented Sequence Example ('||' denotes the EDU boundary positions for better readability):
    • Although the report, || which has released || before the stock market opened, || didn't trigger the 190.58 point drop in the Dow Jones Industrial Average, || analysts said || it did play a role in the market's decline. ||
  • [Output] tree_parsing_output: The discourse parsing tree predicted by the model, expressed in the following top-down constituency parsing format.

    • (1:Satellite=Contrast:4,5:Nucleus=span:6) (1:Nucleus=Same-Unit:3,4:Nucleus=Same-Unit:4) (5:Satellite=Attribution:5,6:Nucleus=span:6) (1:Satellite=span:1,2:Nucleus=Elaboration:3) (2:Nucleus=span:2,3:Satellite=Temporal:3)
    • For example, (1:Satellite=Contrast:4,5:Nucleus=span:6) denotes the first parsing step (covering EDU1 to EDU6): the split is predicted after EDU4, and EDU1:4 (predicted as Satellite) and EDU5:6 (predicted as Nucleus) form one pair with the discourse relation "Contrast".
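
For reference, here is a minimal sketch (not part of the repo) of how the EDU_Breaks indices can be mapped back to text spans, under the assumption that each index points at the last subword token of an EDU in the xlm-roberta-base token sequence; the exact alignment depends on how the inference script counts tokens.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

text = ("Although the report, which has released before the stock market opened, "
        "didn't trigger the 190.58 point drop in the Dow Jones Industrial Average, "
        "analysts said it did play a role in the market's decline.")
edu_breaks = [5, 10, 17, 33, 37, 49]  # example output shown above

# Assumption: the indices refer to subword tokens of the backbone tokenizer,
# and each index marks the last token of an EDU.
tokens = tokenizer.tokenize(text)

edus, start = [], 0
for end in edu_breaks:
    edus.append(tokenizer.convert_tokens_to_string(tokens[start:end + 1]))
    start = end + 1

print(" || ".join(edus))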

Inference: How to use it for parsing

  • Put the text paragraph into the file ./data/text_for_inference.txt.
  • The pre-trained model checkpoint is located at ./depth_mode/Savings/.
  • Run the script MUL_main_Infer.py to obtain the RST parsing result; see the script for the detailed model output. A sketch of this workflow follows this list.
  • We recommend running the parser in a GPU-equipped environment.
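
The snippet below is a minimal sketch (not part of the repo) of driving that workflow from Python, assuming the default input file and checkpoint locations described above and that MUL_main_Infer.py is run from the repo root without extra arguments.

import pathlib
import subprocess

# 1. Write the paragraph to be parsed into the expected input file.
text = ("Although the report, which has released before the stock market opened, "
        "didn't trigger the 190.58 point drop in the Dow Jones Industrial Average, "
        "analysts said it did play a role in the market's decline.")
pathlib.Path("./data/text_for_inference.txt").write_text(text, encoding="utf-8")

# 2. Run the inference script; see MUL_main_Infer.py for the detailed outputs
#    (EDU_Breaks and the tree_parsing_output string described above).
subprocess.run(["python", "MUL_main_Infer.py"], check=True)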

Citation

If the work is helpful, please cite our papers in your publications, reports, slides, and theses.

@inproceedings{liu-etal-2021-dmrst,
    title = "{DMRST}: A Joint Framework for Document-Level Multilingual {RST} Discourse Segmentation and Parsing",
    author = "Liu, Zhengyuan and Shi, Ke and Chen, Nancy",
    booktitle = "Proceedings of the 2nd Workshop on Computational Approaches to Discourse",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic and Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.codi-main.15",
    pages = "154--164",
}
@inproceedings{liu2020multilingual,
    title = "Multilingual Neural {RST} Discourse Parsing",
    author = "Liu, Zhengyuan and Shi, Ke and Chen, Nancy",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    year = "2020",
    pages = "6730--6738",
}
