Part-of-speech tagging is the task of assigning a part-of-speech tag (from a given tag set) to every word in a given sentence.
Input:
快速 的 棕色 狐狸 跳过 了 懒惰 的 狗
Output:
[快速] VA [的] DEC [棕色] NN [狐狸] NN [跳过] VV [了] AS [懒惰] VA [的] DEC [狗] NN
Systems are evaluated by F1 score, computed from word-level precision and word-level recall on the joint segmentation and tagging task.
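The metric can be illustrated with a minimal sketch. It assumes the usual convention for joint segmentation and tagging, where a predicted word counts as correct only if both its character boundaries and its tag match the gold annotation. The function names `to_spans` and `joint_f1` are hypothetical and chosen for illustration; this is not the evaluation script linked on this page.

```python
def to_spans(tagged_words):
    """Convert [(word, tag), ...] into a set of (start, end, tag) character spans."""
    spans = set()
    start = 0
    for word, tag in tagged_words:
        end = start + len(word)
        spans.add((start, end, tag))
        start = end
    return spans


def joint_f1(gold, pred):
    """Return (precision, recall, F1) for one sentence.

    Assumption: a word is correct only when its boundaries AND its tag
    both match a gold word exactly.
    """
    gold_spans = to_spans(gold)
    pred_spans = to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold annotation for a fragment of the example sentence above.
gold = [("棕色", "NN"), ("狐狸", "NN"), ("跳过", "VV"), ("了", "AS")]
# Hypothetical system output with one segmentation error (跳 | 过了).
pred = [("棕色", "NN"), ("狐狸", "NN"), ("跳", "VV"), ("过了", "AS")]

precision, recall, f1 = joint_f1(gold, pred)
print(precision, recall, f1)
```

In this example two of the four predicted words match gold words in both span and tag, so precision, recall, and F1 are all 0.5. A corpus-level score would accumulate the counts over all sentences before computing the ratios.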
- Task originally defined in Ng and Low (2004)
- Released by the LDC; an LDC licence is required to obtain the dataset
- Link: https://verbs.colorado.edu/chinese/ctb.html
- Tag set has 33 POS tags
Test set | # words (dev) | # words (test) | Genre |
---|---|---|---|
CTB5 | 6,821 | 8,008 | News |
- Implementation: https://github.com/yanshao9798/tagger/blob/master/evaluation.py
System | F1 score |
---|---|
Tian et al. (2020) | 96.92 |
Meng et al. (2019) (Glyce + BERT) | 96.61 |
Meng et al. (2019) (BERT) | 96.06 |
Shao et al. (2017) | 94.38 |
Train set | # words | Genre |
---|---|---|
CTB5 | 493,935 | News |
- Available freely (GPL or equivalent licence)
- Link: https://universaldependencies.org/
- Paper describing the dataset: Nivre et al. (2016)
- Tag set has 15 POS tags
Test set | # words (dev) | # words (test) | Genre |
---|---|---|---|
UD Chinese | 12,663 | 12,012 | Learner essays, news, spoken language, Wiki |
- Implementation: https://github.com/yanshao9798/tagger/blob/master/evaluation.py
System | F1 score |
---|---|
Meng et al. (2019) (Glyce + BERT) | 96.14 |
Tian et al. (2020) | 95.69 |
Meng et al. (2019) (BERT) | 94.79 |
Shao et al. (2017) | 89.75 |
Train set | # words | Genre |
---|---|---|
UD Chinese | 98,608 | Learner essays, news, spoken language, Wiki |