Arabic Word-Embedding (Word2vec) model training from Wikipedia articles
Steps to start training:
1- Go to the Wikipedia Arabic articles data dump at this URL:
https://dumps.wikimedia.org/arwiki/latest/
2- Download the articles-only dump, which looks like this:
arwiki-latest-pages-articles-multistream.xml.bz2
(approximately 1 GB)
3- Use WikiExtractor to extract the articles into JSON files:
https://github.com/attardi/wikiextractor
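With the `--json` option, WikiExtractor writes files (e.g. `extracted/AA/wiki_00`) containing one JSON object per line, each with `id`, `title`, and `text` fields. A minimal sketch of reading the extracted article texts back (the directory name `extracted` and the helper below are illustrative, not part of this repository):

```python
import json
from pathlib import Path

def iter_article_texts(extracted_dir):
    """Yield the plain text of each article from WikiExtractor --json output.

    Each output file (e.g. extracted/AA/wiki_00) holds one JSON object
    per line with "id", "title", and "text" fields; empty articles
    (e.g. bare redirects) are skipped.
    """
    for path in sorted(Path(extracted_dir).rglob("wiki_*")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                article = json.loads(line)
                if article.get("text"):
                    yield article["text"]
```

These texts are what the training script consumes in the next step.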
4- Run arabic_word2vec.py to build your model.
Enjoy Arabic Word-Embedding (Word2vec) ;-)
5- Use my repository https://github.com/rozester/Arabic-Word-Embeddings-Word2vec to visualize it in action.
Thanks to Abed Khooli; his ArTokenizer function was very helpful for Arabic text cleansing.
Watch it in action