The aim of this task was to train word vectors using different algorithms, namely skipgram, continuous bag of words (CBOW) and GloVe, and to use them for spelling error detection. I describe the code for each algorithm one by one:
The code for skipgram training is contained in the `skipgram.ipynb` notebook.
- We first import all the required libraries.
- Then we load the data, which we first processed to contain only space-separated words. The data is loaded using `LineSentence` from the `gensim` library.
- Then we train the model on the loaded data with the following parameters:
- Window size: 5
- min_count: 5
- threads: 8
- sg: 1 # for skipgram training
- Then we save the model using model.save() so that we can reuse it later for the visualization plots. We comment this line out afterwards to avoid saving the model repeatedly.
- Since the model is saved, we load it using `Word2Vec.load()`, and then, to get an idea of the correctness of our vector embeddings, we find some of the most similar vectors to king.
- In the next step we try to get a complete-space TSNE plot for our word vectors. For this we reduce the dimensionality of the vectors and build a corresponding dataframe for our transformed word vectors.
- For the most-similar-words plot we select a few words and get their most similar embeddings from the model. These are then plotted to get the most-similar-words visualization.
Nearly all of the steps for CBOW were the same, except for `sg=0` when initializing the model for training using `Word2Vec`. So, for the sake of brevity, I have not repeated the steps already written for skipgram here.
The code for GloVe is contained in the `glove.ipynb` notebook.
- We first clone https://github.com/stanfordnlp/GloVe.
- Then, inside the cloned folder, we run `make` from the terminal.
- This folder contains a file `demo.sh`, which we modify by setting `CORPUS=train.txt` (as our corpus is `train.txt`) and `MAX_ITER=50`. Then from the terminal we run this script using `bash glove.sh`.
- Running the above script generates a vector file `vectors.txt`. These are the desired GloVe word embeddings.
- Now, for visualization in `glove.ipynb`, we load this `vectors.txt` and convert it into a form that can be represented as a model generated using `gensim`.
- The above step is done using `KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)`. This is a function provided by the `gensim` library.
- Then we save the model using model.save() so that we can reuse it later for the visualization plots. We comment this line out afterwards to avoid saving the model repeatedly.
- After this, the steps for visualization using TSNE and PCA plots are almost the same as for skipgram and CBOW, because we have converted the vectors into a model of the same form as one generated using gensim.
- Since the model is saved, we load it using `Word2Vec.load()`, and then, to get an idea of the correctness of our vector embeddings, we find some of the most similar vectors to king.
- In the next step we try to get a complete-space TSNE plot for our word vectors. For this we reduce the dimensionality of the vectors and build a corresponding dataframe for our transformed word vectors.
- For the most-similar-words plot we select a few words and get their most similar embeddings from the model. These are then plotted to get the most-similar-words visualization.
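The conversion step deserves a short illustration. `KeyedVectors.load_word2vec_format` reads the word2vec text format, whose first line gives the vocabulary size and the vector dimension, while the GloVe `vectors.txt` by default has no such header. The sketch below shows one way the file could be adapted. It is an assumption for illustration, not necessarily what `glove.ipynb` does, and the helper name `add_word2vec_header` is hypothetical.

```python
def add_word2vec_header(glove_path, word2vec_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dim>' header line."""
    with open(glove_path, encoding="utf-8") as src:
        # Keep non-empty lines only and normalise the line endings.
        lines = [line.rstrip("\n") + "\n" for line in src if line.strip()]
    if not lines:
        raise ValueError("no vectors found in " + glove_path)

    vocab_size = len(lines)
    dim = len(lines[0].split()) - 1  # the first token is the word itself

    with open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dim}\n")
        dst.writelines(lines)
    return vocab_size, dim
```

The file written this way can then be passed as `word2vec_output_file` to `KeyedVectors.load_word2vec_format(..., binary=False)`. gensim also ships a `glove2word2vec` helper script that performs this kind of conversion. The sketch reads the whole file into memory, which is acceptable for a modest vocabulary but not for very large vector files.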
The visualizations of the word embeddings are as follows:
The visualizations are contained in the `skipgram.ipynb` file as plots created using `matplotlib`.
These visualizations include:
- complete space visualization (TSNE plot)
- complete space visualization (PCA plot)
- most similar word representation (TSNE plot)
- most similar word representation (PCA plot)
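As a rough illustration of what goes into a "most similar word" PCA plot, the sketch below picks a word's nearest neighbours by cosine similarity and projects the vectors to two dimensions. It uses plain NumPy (PCA via SVD) as a stand-in; the notebooks may well rely on library routines for PCA and TSNE instead. The function names and the toy vectors are hypothetical.

```python
import numpy as np


def nearest_neighbors(words, vectors, query, topn=3):
    """Return the topn words with the highest cosine similarity to query."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    order = np.argsort(-sims)  # most similar first
    return [words[i] for i in order if words[i] != query][:topn]


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Toy 3-D "embeddings", made up purely for illustration.
words = ["king", "queen", "prince", "apple", "banana"]
vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.1],
    [0.9, 0.7, 0.2],
    [0.1, 0.2, 0.9],
    [0.2, 0.1, 0.8],
])

neighbors = nearest_neighbors(words, vectors, "king", topn=2)
coords = pca_2d(vectors)  # shape (5, 2): one (x, y) point per word
```

The resulting `coords` could then be drawn as a matplotlib scatter plot with each point annotated by its word. For a TSNE plot the projection step would be replaced by a TSNE implementation.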