NLP Pretrained Embeddings, Models and Datasets Collection (NLP_PEMDC)
A collection of the pretrained word embeddings, models, and datasets for NLP that I come across, updated on an ongoing basis. These pretrained word vectors and datasets are for learning and research purposes only.
Entries are not ranked; they appear in the order I added them. All datasets belong to their original authors, thanks! If anything here infringes your rights, please email me and I will remove it.
TODO
-
-
-
-
Currently includes:
- LCQMC: semantic similarity task over colloquial descriptions, COLING 2018
- XNLI: natural language inference task, EMNLP 2018
- BQ: question matching for intelligent customer service, EMNLP 2018
-
- Large-scale Chinese short text summarization dataset
-
-
- GLUE (General Language Understanding Evaluation) benchmark, including:
  - The Corpus of Linguistic Acceptability
  - The Stanford Sentiment Treebank
  - Microsoft Research Paraphrase Corpus
  - Semantic Textual Similarity Benchmark
  - Quora Question Pairs
  - MultiNLI Matched
  - MultiNLI Mismatched
  - Question NLI
  - Recognizing Textual Entailment
  - Winograd NLI
  - Diagnostics Main
- SuperGLUE benchmark, including:
  - Broadcoverage Diagnostics
  - CommitmentBank
  - Choice of Plausible Alternatives
  - Multi-Sentence Reading Comprehension
  - Recognizing Textual Entailment
  - Words in Context
  - The Winograd Schema Challenge
  - BoolQ
  - Reading Comprehension with Commonsense Reasoning
  - Winogender Schema Diagnostics
- Large Movie Review Dataset (IMDB): a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data.
- The Stanford Question Answering Dataset (SQuAD)
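
As a rough illustration of how the binary sentiment dataset of movie reviews listed above might be read once downloaded, here is a minimal sketch. It assumes the archive has been extracted so that each split directory contains `pos/` and `neg/` subfolders with one review per `.txt` file; the function name `load_reviews`, the label encoding (0 for negative, 1 for positive), and the local path in the usage note are illustrative choices, not part of this collection.

```python
from pathlib import Path
from typing import List, Tuple


def load_reviews(split_dir: Path) -> List[Tuple[str, int]]:
    """Read (text, label) pairs from a split directory.

    Assumes split_dir contains `neg/` and `pos/` subfolders, each holding
    one review per .txt file. Labels: 0 = negative, 1 = positive
    (an illustrative encoding).
    """
    examples: List[Tuple[str, int]] = []
    for label_name, label in (("neg", 0), ("pos", 1)):
        # Sort file paths so the order is deterministic across runs.
        for path in sorted((split_dir / label_name).glob("*.txt")):
            examples.append((path.read_text(encoding="utf-8"), label))
    return examples
```

For example, `load_reviews(Path("aclImdb") / "train")` would return the labeled training reviews, assuming the archive was extracted into a local `aclImdb` folder. Unlabeled reviews are skipped here because only the `pos/` and `neg/` subfolders are read.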