Releases · vistec-AI/thai2transformers

18 Jan 10:41

lalital

8b42347

Assorted Thai Texts used for WangchanBERTa pre-training

Latest

This release contains the cleaned datasets we used to pre-train our transformer-based Thai language model (WangchanBERTa; wangchanberta-base-att-spm-uncased).

The cleaned datasets are only partially available because the data from Wisesight, Pantip, and TNC are not under explicit open-source licenses.


09 Jun 08:02

cstorm125

qa-v0.2

974fb53

`iapp_thaiqa_xquad` dataset

This dataset combines the iapp_wiki_qa_squad, thaiqa_squad, and xquad training sets; the validation and test sets come from iapp_wiki_qa_squad. All contexts flagged as similar (mUSE cosine similarity > 0.8) are removed from the training sets.

DatasetDict({
    train: Dataset({
        features: ['question_id', 'article_id', 'title', 'context', 'question', 'answers'],
        num_rows: 10916
    })
    validation: Dataset({
        features: ['question_id', 'article_id', 'title', 'context', 'question', 'answers'],
        num_rows: 742
    })
    test: Dataset({
        features: ['question_id', 'article_id', 'title', 'context', 'question', 'answers'],
        num_rows: 739
    })
})
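
The similarity filter described above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the release. It assumes each context has already been embedded as a vector (toy 2-D lists stand in for mUSE embeddings here), and it assumes training contexts are compared against the held-out validation and test contexts, which the release note does not state explicitly. The names `cosine_similarity` and `filter_similar_contexts` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def filter_similar_contexts(train_vecs, heldout_vecs, threshold=0.8):
    """Return indices of training contexts to keep.

    A training context is dropped when its similarity to ANY held-out
    context exceeds `threshold`. Comparing against held-out contexts is
    an assumption made for this sketch.
    """
    keep = []
    for i, train_vec in enumerate(train_vecs):
        if all(cosine_similarity(train_vec, h) <= threshold for h in heldout_vecs):
            keep.append(i)
    return keep


# Toy vectors standing in for real sentence embeddings (assumption).
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
heldout_vecs = [[1.0, 0.05]]

kept = filter_similar_contexts(train_vecs, heldout_vecs)
print(kept)
```

With real data, the kept indices would then be used to select rows of the combined training set. The pairwise loop is quadratic, so a vectorized or approximate nearest-neighbour search would be a more practical choice at this dataset's scale.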

