A repository for a university NLP course.
Language: Python
PCFG (Probabilistic Context Free Grammar)
Writing a grammar that generates grammatical English sentences. Exploring some (but not all) aspects of the English language and how to implement them in a grammar.
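As an illustration of how a PCFG generates sentences, here is a minimal sketch that samples from a toy grammar. The grammar, its probabilities, and the names `GRAMMAR` and `generate` are hypothetical and are not taken from the course grammar; the point is only to show weighted rule expansion.

```python
import random

# Hypothetical toy grammar (not the course grammar): each nonterminal maps to
# a list of (probability, right-hand side) pairs. Probabilities per
# nonterminal sum to 1.
GRAMMAR = {
    "S": [(1.0, ["NP", "VP"])],
    "NP": [(0.6, ["Det", "N"]), (0.4, ["Det", "Adj", "N"])],
    "VP": [(0.5, ["V", "NP"]), (0.5, ["V"])],
    "Det": [(1.0, ["the"])],
    "Adj": [(0.5, ["small"]), (0.5, ["old"])],
    "N": [(0.5, ["dog"]), (0.5, ["cat"])],
    "V": [(0.5, ["sleeps"]), (0.5, ["sees"])],
}


def generate(symbol, rng):
    """Expand `symbol` recursively and return the generated words."""
    if symbol not in GRAMMAR:
        # Anything without a rule is treated as a terminal (a word).
        return [symbol]
    rules = GRAMMAR[symbol]
    weights = [prob for prob, _ in rules]
    # Pick one rule for this nonterminal according to its probability.
    _, rhs = rng.choices(rules, weights=weights, k=1)[0]
    words = []
    for child in rhs:
        words.extend(generate(child, rng))
    return words


sentence = " ".join(generate("S", random.Random(0)))
print(sentence)
```

This toy grammar has no recursive rules, so sampling always terminates. A grammar with recursion (for example, relative clauses) would need rule probabilities that keep the expected sentence length finite.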
Distributional Semantics
Finding words with similar meanings using several combinations of algorithms, and writing a detailed report comparing them.
Data: Wikipedia
Algorithms:
- word contexts: (a) sentence (b) window (c) dependency tree (parent/child, arc direction, skipping over prepositions)
- similarity: (a) cosine distance (b) PMI
- order: (a) 1st order similarity (b) 2nd order similarity
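One of the combinations listed above (window contexts, PMI weighting, cosine comparison) could be sketched as follows. This is a minimal illustration on a made-up three-sentence corpus, not the pipeline used on Wikipedia. The function names, the window size of 2, and the choice of positive PMI (keeping only PMI values above zero) are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the Wikipedia data.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
]


def window_counts(sentences, window=2):
    """Count (word, context) pairs within `window` tokens on each side."""
    pair_counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, sent[j])] += 1
    return pair_counts


def ppmi_vectors(pair_counts):
    """Build sparse word vectors weighted by positive PMI."""
    total = sum(pair_counts.values())
    word_totals, ctx_totals = Counter(), Counter()
    for (word, ctx), n in pair_counts.items():
        word_totals[word] += n
        ctx_totals[ctx] += n
    vectors = {}
    for (word, ctx), n in pair_counts.items():
        # PMI = log( P(word, ctx) / (P(word) * P(ctx)) )
        pmi = math.log((n * total) / (word_totals[word] * ctx_totals[ctx]))
        if pmi > 0:
            vectors.setdefault(word, {})[ctx] = pmi
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


counts = window_counts(corpus)
vectors = ppmi_vectors(counts)
similarity = cosine(vectors.get("cat", {}), vectors.get("dog", {}))
print(similarity)
```

Comparing two words through their context vectors, as `cosine` does here, is one common reading of 2nd order similarity, while scoring a word directly against its contexts corresponds to 1st order similarity. That the course used the terms in exactly this sense is an assumption.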
Relation Extraction
Given a small dataset of news articles, extract the Named Entities in each sentence and the relations between them.
For example: Yosi (work for) CBS
Algorithm:
- For each sentence, extract the Named Entities, the dependency tree, and the POS tags using the spaCy library.
- Generate a sequence from the path between the two entities on the dependency tree.
- Run an LSTM over the path and concatenate its output with other feature vectors, such as the Named Entity type and the Named Entity POS tag.
- Pass the result through an MLP with softmax activation.
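The path-generation step can be sketched without any parser by representing the dependency tree as a list of head indices. The example below is a minimal sketch under assumptions: the sentence, its hand-written parse, and the function names are hypothetical, and the root is marked by pointing to itself (the convention spaCy uses for a root token's head). In the real pipeline the heads and labels would come from the parser rather than being typed in.

```python
def path_to_root(heads, index):
    """Return token indices from `index` up to the root, inclusive."""
    path = [index]
    while heads[path[-1]] != path[-1]:  # the root is its own head
        path.append(heads[path[-1]])
    return path


def dependency_path(heads, labels, start, end):
    """Return the arcs between two tokens as (label, direction) pairs.

    "up" means the arc is followed from child to parent; "down" means it is
    followed from parent to child.
    """
    up = path_to_root(heads, start)
    down = path_to_root(heads, end)
    down_set = set(down)
    # The lowest common ancestor is the first node on the way up from
    # `start` that also lies on the way up from `end`.
    lca = next(node for node in up if node in down_set)
    up_part = up[: up.index(lca)]
    down_part = down[: down.index(lca)]
    steps = [(labels[i], "up") for i in up_part]
    steps += [(labels[i], "down") for i in reversed(down_part)]
    return steps


# Hypothetical hand-written parse of "Yosi works for CBS".
tokens = ["Yosi", "works", "for", "CBS"]
heads = [1, 1, 1, 2]  # "works" is the root
labels = ["nsubj", "ROOT", "prep", "pobj"]

path = dependency_path(heads, labels, start=0, end=3)
print(path)  # [('nsubj', 'up'), ('prep', 'down'), ('pobj', 'down')]
```

Each (label, direction) pair would then be mapped to an embedding and fed to the LSTM as one time step. How the original model encodes the path elements is not stated here, so that mapping is left out.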
Challenges:
- small dataset
- missing labels (entities/relations that should have been included in the gold file)
- mismatches between the Named Entities in the gold file and those in the spaCy output
Architecture choice:
A pure ML approach, rather than a hybrid of ML and rules. A hybrid could be built after error analysis. For example, the model sometimes confuses the relation (work for) with (kill), because both relations involve PERSON entities.