Turkish NLP Suite is a non-profit organization dedicated to Turkish NLP. We create open-source corpora, pretrained models, code, tutorials and all types of linguistic resources for Turkish natural language processing. Our code is cutting-edge, our models are easy to install and use, and our tutorials are a great way to get started. This is state-of-the-art Turkish NLP, after all.
That's true, we love spaCy for its blazing-fast code, great architecture, flexible pipelines, detailed documentation and awesome ecosystem. We proudly present our spaCy Turkish models:
- tr_core_web_md
- tr_core_web_lg
- tr_core_web_trf
All pipelines contain a tokenizer, trainable lemmatizer, POS tagger, dependency parser, morphologizer and NER component. You can find out more about each model in the dedicated repo and download the models from HuggingFace.
The spaCy Turkish models come with comprehensive tutorials and code. Please visit the documentation section for details.
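As a minimal sketch of how the pipeline components surface in code, the helpers below read the annotations that the lemmatizer, POS tagger, dependency parser, morphologizer and NER component write onto a spaCy Doc. The sketch assumes that spaCy and the tr_core_web_md package are already installed. The names load_pipeline, token_rows and entity_pairs are hypothetical helpers made up for illustration, not part of spaCy or of the Turkish models, and the example sentence is arbitrary.

```python
from typing import Any, Dict, List, Tuple


def load_pipeline(name: str = "tr_core_web_md") -> Any:
    """Load an installed spaCy pipeline by its package name."""
    import spacy  # imported lazily so the helpers below work without spaCy

    return spacy.load(name)


def token_rows(doc: Any) -> List[Dict[str, str]]:
    """Collect the per-token annotations set by the lemmatizer, POS tagger,
    dependency parser and morphologizer."""
    return [
        {
            "text": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "dep": token.dep_,
            "morph": str(token.morph),
        }
        for token in doc
    ]


def entity_pairs(doc: Any) -> List[Tuple[str, str]]:
    """Collect (text, label) pairs for the spans found by the NER component."""
    return [(ent.text, ent.label_) for ent in doc.ents]


# Usage, assuming spaCy and the tr_core_web_md package are installed:
#   nlp = load_pipeline()
#   doc = nlp("Ankara Türkiye'nin başkentidir.")
#   print(nlp.pipe_names)     # names of the components in the pipeline
#   print(token_rows(doc))    # one dict per token
#   print(entity_pairs(doc))  # named entities with their labels
```

Swapping the package name for tr_core_web_lg or tr_core_web_trf should work the same way, since the helpers only rely on standard Doc and Token attributes.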
We incorporate modern techniques into all our work, including transformers, GPU computing and efficient data structures. For example, the brand new Turkish spaCy model tr_core_web_trf is a transformer-based pipeline, the mini project "Quick FAQ Chatbot" integrates sentence-transformers, and more.
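To give a flavour of the semantic matching that a FAQ bot built on sentence embeddings can use, here is a rough sketch. It is not the Quick FAQ Chatbot code: the functions cosine and best_answer are hypothetical, and the embed argument stands for any callable that maps a string to a vector. With sentence-transformers, the encode method of a loaded SentenceTransformer model could play that role, though which model to load is left open here.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_answer(
    query: str,
    faq: Sequence[Tuple[str, str]],
    embed: Callable[[str], Vector],
) -> Tuple[str, float]:
    """Return the answer whose stored question is most similar to the query,
    together with its similarity score."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(question)), answer) for question, answer in faq]
    score, answer = max(scored, key=lambda pair: pair[0])
    return answer, score


# Usage, assuming sentence-transformers is installed (model name left open):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("<a multilingual or Turkish model>")
#   faq = [("Kargo ne zaman gelir?", "Kargonuz 3 iş günü içinde teslim edilir.")]
#   print(best_answer("Siparişim ne zaman ulaşır?", faq, model.encode))
```

A real bot would embed the stored questions once and cache the vectors instead of re-encoding them on every query; the sketch recomputes them to stay short.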
Modern NLP revolves around data, so labelled data (even in small amounts) is crucial for improving the quality of many NLP tasks. As a result, compiling and serving Turkish datasets lies at the core of the Turkish NLP Suite project. All of our datasets are released with a commercial licence, completely open source and ready to use. We also use them in our own projects and tutorials. Here's a list of our datasets:
- Corona-mini: A mini corpus of Turkish social media reviews about Corona symptoms.
- ATIS Turkish: A multi-purpose NLU dataset for Turkish, including entities and slots.
- Turkish Wiki NER Dataset: A general purpose Turkish NER Dataset with fine labels.
- Vitamins and Supplements NER and Span Dataset: A Turkish NER dataset that lies at the intersection of medical NLP and product reviews.
- Vitamins and Supplements Reviews: A Turkish sentiment analysis dataset that lies at the intersection of medical NLP and customer sentiment.
- Beyazperde Top 300 Movies Dataset: A Turkish sentiment analysis dataset of reviews of the Top 300 movies.
- Beyazperde Top All Movies Dataset: A Turkish sentiment analysis dataset of movie reviews.
For details and data, please visit the dedicated repos of the datasets. We also provide guidance and documentation for those who would like to compile their own datasets. If you're looking to create your own datasets, please visit the documentation section.
Surely we like to mine some good Turkish datasets 😉 If you'd like to do some data mining together, have a look at our video series Quick recipes with spaCy Turkish and Quick FAQ Bot, or even better, read the Medium blog post about how sentiment turned into political heat after the earthquake disaster.
If you like pair programming, please visit our Turkish NLP YouTube channel. Here's a list of playlists:
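- Veriseti formatları (NLP dataset formats)
- Veriseti nasıl derlenir (How to compile NLP datasets)
- Baştan sona Türkçe Linguistik (All about Turkish linguistics)
- spaCy ve Semantic Search'le hızlı FAQ Botu (Quick FAQ Bot with spaCy and semantic search)
- Hızlı spaCy Türkçe tarifleri (Quick recipes with spaCy Turkish)
- spaCy modeli nasıl yapılır? (How to train spaCy models)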
- Semantic Web
There are several ways to get started. If you're already working with text and speech data, you can safely dive straight into the Turkish-specific parts. This path covers Turkish linguistics first, then application code. You can watch:
- All about Turkish linguistics
- Quick recipes with spaCy Turkish
- Quick FAQ Bot
- How to train spaCy models
- Semantic Web
If you're a junior or a student, or haven't worked on NLP problems before, we suggest starting from the beginning. This path includes the foundational series "NLP dataset formats" and "How to compile NLP datasets". After warming up to NLP tasks and data conventions, you can dive into the more advanced parts above 😉
- Turkish NLP, a Gentle Introduction
- Turkish Phonetics: A Quick Intro
- Neden yasaklandı? Depremle ilgili Ekşi Sözlük yorumlarına NLP gözüyle bakış (Why was it banned? An NLP look at Ekşi Sözlük comments about the earthquake)
- A collection of brand new datasets for Turkish NLP
The Google ML Developer Programs team supported this work by providing Google Cloud credits. Many thanks to Google Developer Experts for their generous contributions!