
BERT-LSTM-based Chinese word segmentation model on SIGHAN-2004


AOZMH/BERT-LSTM-Chinese-Word-Segmentation


BERT-LSTM-Chinese-Word-Segmentation


Data Preparation

Download the data_chn_seg/ directory from the PKU net disk and use it to replace the empty data/ directory in this repository.

That directory contains the pretrained parameters and all fine-tuned results. A PKU net disk account is required :).

Requirements

Requires the sklearn (scikit-learn), pytorch, and transformers packages.
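Before training, it can help to confirm that the three packages are importable. The snippet below is a minimal sketch, not part of this repository: the helper name `missing_modules` is hypothetical, and it assumes the usual import names `sklearn`, `torch`, and `transformers`.

```python
from importlib.util import find_spec

# Import names of the packages listed above (assumed; the pip names may differ,
# e.g. sklearn is distributed as scikit-learn and pytorch as torch).
REQUIRED_MODULES = ["sklearn", "torch", "transformers"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```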

Results

Tested on the SIGHAN-2004 Chinese Word Segmentation dataset.

| Measurement | Value |
| --- | --- |
| TOTAL INSERTIONS | 593 |
| TOTAL DELETIONS | 639 |
| TOTAL SUBSTITUTIONS | 1053 |
| TOTAL NCHANGE | 2285 |
| OOV Rate | 0.026 |
| OOV Recall Rate | 0.854 |
| IV Recall Rate | 0.988 |
| TOTAL TRUE WORD COUNT | 106873 |
| TOTAL TEST WORD COUNT | 106827 |
| TOTAL TRUE WORDS RECALL | 0.984 |
| TOTAL TEST WORDS PRECISION | 0.985 |
| F MEASURE | 0.984 |
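The recall, precision, and F measure above can be reproduced from the raw counts. The sketch below assumes the usual reading of these counts: a true word is missed when it is deleted or substituted, and a predicted word is wrong when it is inserted or substituted. Under that assumption both sides give the same number of correct words.

```python
# Counts copied from the results above.
insertions = 593
deletions = 639
substitutions = 1053
true_words = 106873   # words in the gold segmentation
test_words = 106827   # words in the model's segmentation

# Assumption: correct words = total words minus the errors charged to that side.
correct_from_truth = true_words - deletions - substitutions
correct_from_test = test_words - insertions - substitutions

recall = correct_from_truth / true_words
precision = correct_from_test / test_words
f_measure = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} precision={precision:.3f} f={f_measure:.3f}")
# prints "recall=0.984 precision=0.985 f=0.984"
```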

Execution

Train model:

python main.py

Currently only the BERT-LSTM model is supported.

As the unfinished functions in main.py indicate, we are working on other model architectures for Chinese word segmentation, e.g. LSTM-CRF and BERT-LSTM-CRF; their results will be added later.

Evaluate:

python eval.py

This creates a file (e.g. 'test_pred_bert_lstm_1.txt') in the /eval directory, which contains the segmentation results for the test data at /data/test.txt.
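To illustrate how per-character predictions can become a segmented line in such a result file, here is a minimal sketch. It assumes a B/M/E/S tagging scheme (begin, middle, end, single-character word), which is common for Chinese word segmentation but is not confirmed by this README; the function name `tags_to_words` is hypothetical and does not come from eval.py.

```python
def tags_to_words(chars, tags):
    """Group characters into words using B/M/E/S tags (assumed scheme)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        # A new word starts at B or S; flush any unfinished word first so that
        # malformed tag sequences still produce a segmentation.
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        # A word ends at E or S.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # trailing characters without a closing E tag
        words.append(current)
    return words


# Segmented output is typically written as words separated by spaces.
line = " ".join(tags_to_words("我爱北京", ["S", "S", "B", "E"]))
print(line)  # prints "我 爱 北京"
```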

The scripts above do not yet include argument parsers, so please manually edit the code to set the names and paths of files such as logs, results, and checkpoints.
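Until argument parsing is added, one way to avoid editing paths in the code is a small argparse wrapper. The following is only a sketch of what that could look like: every flag name and default below is hypothetical and not part of main.py or eval.py, with the defaults modeled on the file names mentioned above.

```python
import argparse


def build_parser():
    """Build a parser for file locations (hypothetical flags, not in the repo)."""
    parser = argparse.ArgumentParser(
        description="Set file locations for training or evaluation."
    )
    parser.add_argument("--test-file", default="data/test.txt",
                        help="path to the test data")
    parser.add_argument("--pred-file", default="eval/test_pred_bert_lstm_1.txt",
                        help="where to write the segmented predictions")
    parser.add_argument("--checkpoint", default=None,
                        help="path to a model checkpoint to load or save")
    parser.add_argument("--log-file", default=None,
                        help="path to the log file")
    return parser


# Example: override only the prediction file and keep the other defaults.
args = build_parser().parse_args(["--pred-file", "eval/my_run.txt"])
print(args.test_file, args.pred_file)  # prints "data/test.txt eval/my_run.txt"
```

The parsed values could then be passed to the existing functions in place of the hard-coded paths.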

Performance comparison

Comparison between our results and the open-source tool Pkuseg:

| Model | OOV Rate | OOV Recall | IV Recall | Recall | Precision | F Measure |
| --- | --- | --- | --- | --- | --- | --- |
| Pkuseg | 0.026 | 0.873 | 0.413 | 0.883 | 0.863 | 0.873 |
| Our Model | 0.026 | 0.854 | 0.988 | 0.984 | 0.985 | 0.984 |
