Releases: vistec-AI/thai2transformers
Assorted Thai Texts used for WangchanBERTa pre-training
This release contains the cleaned datasets used to pre-train the transformer-based Thai language model WangchanBERTa (wangchanberta-base-att-spm-uncased).
The cleaned datasets are only partially available, since the data from Wisesight, Pantip, and TNC are not under explicit open-source licenses.
`iapp_thaiqa_xquad` dataset
Combines the training sets of `iapp_wiki_qa_squad`, `thaiqa_squad`, and `xquad`, and uses the validation and test sets from `iapp_wiki_qa_squad`. Contexts in the training sets that are similar (mUSE cosine similarity > 0.8) are removed from the training sets.
DatasetDict({
    train: Dataset({
        features: ['question_id', 'article_id', 'title', 'context', 'question', 'answers'],
        num_rows: 10916
    })
    validation: Dataset({
        features: ['question_id', 'article_id', 'title', 'context', 'question', 'answers'],
        num_rows: 742
    })
    test: Dataset({
        features: ['question_id', 'article_id', 'title', 'context', 'question', 'answers'],
        num_rows: 739
    })
})
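
The similarity filter described for the training sets can be sketched roughly as follows. This is a minimal illustration and not the code used to build the release: it assumes each context already has an embedding vector (in the release these would come from mUSE; here they are toy 2-dimensional vectors), and it assumes training contexts are compared against a set of reference contexts such as the validation and test contexts, which the release notes do not state explicitly. The names `cosine_similarity`, `filter_similar_contexts`, and the `embedding` field are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_similar_contexts(train_rows, reference_embeddings, threshold=0.8):
    """Keep only training rows whose embedding is not too close to any reference embedding.

    Assumption: "similar" means cosine similarity strictly greater than `threshold`
    against at least one reference context.
    """
    kept = []
    for row in train_rows:
        max_sim = max(
            (cosine_similarity(row["embedding"], ref) for ref in reference_embeddings),
            default=0.0,
        )
        if max_sim <= threshold:
            kept.append(row)
    return kept


# Toy data standing in for mUSE embeddings of contexts (hypothetical values).
train_rows = [
    {"context": "a", "embedding": [1.0, 0.0]},
    {"context": "b", "embedding": [0.0, 1.0]},
    {"context": "c", "embedding": [0.9, 0.1]},
]
reference_embeddings = [[1.0, 0.05]]

kept = filter_similar_contexts(train_rows, reference_embeddings)
print([row["context"] for row in kept])  # ['b']
```

In this toy run, contexts "a" and "c" point in nearly the same direction as the reference vector and are dropped, while "b" is nearly orthogonal to it and is kept.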