# N-gram Language Models

This is the first assignment for the NLP course. The task is to train and evaluate n-gram language models on an English corpus; we used the English side of the Greek-English Europarl parallel corpus. We download the corpus and split it into a training set and a test set: the training sentences are stored in “europarl-v7.el-en.en.train” and the test sentences in “europarl-v7.el-en.en.test”.
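For reference, a minimal sketch of such a split, assuming the raw corpus file is named `europarl-v7.el-en.en`; the 90/10 ratio and the shuffling seed are illustrative assumptions, not necessarily what the assignment used:

```python
import random

# Illustrative train/test split of the English side of the Europarl corpus.
# The 90/10 ratio and the fixed seed are assumptions for reproducibility.
random.seed(0)
with open("europarl-v7.el-en.en", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

random.shuffle(sentences)
cut = int(0.9 * len(sentences))  # 90% train, 10% test (assumed ratio)
with open("europarl-v7.el-en.en.train", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences[:cut]))
with open("europarl-v7.el-en.en.test", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences[cut:]))
```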

If the code is not being run for the first time, the user is given the option to reprocess the dataset and/or retrain the language models. If you need access to the already trained language models, please follow this link.

## Results

We estimate the cross-entropy and perplexity of our models on part of the padded test set (100 sentences), treating it as a single sequence. The function perplexity() computes entropy and perplexity for two cases (see the sketch after this list):

  1. Including probabilities of the form P(start|...) (or P(start1|...), P(start2|...)) and P(end|...) in the computation of perplexity.
  2. Excluding probabilities of the form P(start|...) (or P(start1|...), P(start2|...)), but still including probabilities of the form P(end|...).
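For concreteness, a minimal sketch of this computation, assuming a hypothetical `model.logprob(token, context)` interface that returns log2 P(token | context); the start-token names follow the README, everything else is an assumption:

```python
import math

START_TOKENS = {"start1", "start2"}  # padding pseudo-tokens (names per the README)

def perplexity(model, tokens, include_start=False):
    """Cross-entropy (bits/token) and perplexity over one long padded sequence.

    `model.logprob(token, context)` is a hypothetical interface returning
    log2 P(token | context); `tokens` is the padded test set flattened into
    a single sequence. include_start=True gives case 1, False gives case 2.
    """
    log_sum, count = 0.0, 0
    for i, tok in enumerate(tokens):
        if not include_start and tok in START_TOKENS:
            continue  # case 2: skip P(start*|...) but keep P(end|...)
        context = tuple(tokens[max(0, i - 2):i])  # trigram context
        log_sum += model.logprob(tok, context)
        count += 1
    entropy = -log_sum / count
    return entropy, 2 ** entropy
```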

In simple linear interpolation, we combine n-grams of different orders by linearly interpolating the corresponding models. Here, we combine the unigram, bigram and trigram maximum-likelihood estimates and check whether the combined model performs better; a sketch of the combination follows below. The best l1, l2, l3 parameters for perplexity_interpolated() were found after some trials on a validation set of 100 sentences (l1 = 2/10, l2 = 8/10, l3 = 1/10).
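To make the combination concrete, here is a minimal sketch of the interpolated estimate, assuming `uni`, `bi` and `tri` are dictionaries mapping n-grams to maximum-likelihood probabilities (hypothetical names; the assignment's own function is perplexity_interpolated()). The default weights are the ones reported above, with l1, l2, l3 paired with the unigram, bigram and trigram models respectively:

```python
def interpolated_prob(w, w1, w2, uni, bi, tri, l1=0.2, l2=0.8, l3=0.1):
    """Linear interpolation of unigram, bigram and trigram MLE probabilities.

    P(w | w1, w2) = l1 * P_uni(w) + l2 * P_bi(w | w1) + l3 * P_tri(w | w1, w2).
    Note that interpolation weights are conventionally constrained to sum to 1.
    """
    return (l1 * uni.get((w,), 0.0)
            + l2 * bi.get((w1, w), 0.0)
            + l3 * tri.get((w2, w1, w), 0.0))
```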

Comparing with the results in Table 1, we observe that the interpolated model achieves a much better (lower) perplexity.

## Acknowledgement

The Natural Language Processing course is part of the MSc in Computer Science at the Department of Informatics, Athens University of Economics and Business. The course covers algorithms, models and systems that allow computers to process natural-language text and/or speech.