semsi is a toolbox and a web-based service for semantic similarity analysis. It provides facilities for retrieving a list of similar documents and for suggesting relevant topic words.
Currently it only supports the Finnish language. We also provide a service for transforming Finnish words into their basic forms (lemmatisation). We use the sukija package inside Voikko for the vocabulary and morphology rules.
We use Flask as our web framework.
It's easiest to run semsi in a virtualenv. The virtualenvwrapper package provides a handy set of scripts for managing virtualenvs, including the mkvirtualenv command used here:
mkvirtualenv semsi
pip install -r requirements.txt
To install the Finnish vocabulary and morphological rules:
wget http://www.kansanmuisti.fi/storage/sukija-v1.tar.bz2
tar -C lexicon -xvjf sukija-v1.tar.bz2
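If tar is not available, the extraction step can be approximated with Python's standard library. This is only a minimal sketch and not part of semsi: the helper name extract_lexicon is hypothetical, it assumes the archive has already been downloaded, and unlike tar -C it creates the target directory when it is missing.

```python
import tarfile
from pathlib import Path


def extract_lexicon(archive_path, dest_dir="lexicon"):
    """Unpack a .tar.bz2 archive into dest_dir, roughly like `tar -C dest_dir -xjf`.

    Hypothetical helper for illustration; it is not shipped with semsi.
    Only use it on archives you trust.
    """
    dest = Path(dest_dir)
    # Assumption: create the directory if needed (plain `tar -C` would fail instead).
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        names = archive.getnames()  # member paths stored in the archive
        archive.extractall(dest)
    return names
```

Calling extract_lexicon("sukija-v1.tar.bz2") from the directory that holds the downloaded archive should unpack it into lexicon and return the list of extracted member names.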
You might want to run semsi with gunicorn:
pip install gunicorn
gunicorn semsi:app
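The argument semsi:app follows gunicorn's MODULE:VARIABLE convention: gunicorn imports the module semsi and serves the WSGI callable bound to the name app. The sketch below illustrates that convention with a stand-in callable that uses no third-party packages. It is not semsi's actual application (which is built on Flask), and the response text is made up.

```python
def app(environ, start_response):
    """Stand-in WSGI callable; a Flask application object exposes the same interface."""
    body = b"semsi placeholder\n"  # hypothetical response, for illustration only
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    # WSGI expects an iterable of byte strings as the response body.
    return [body]
```

Saved under a hypothetical file name such as example.py, this would be served with gunicorn example:app, in the same way that semsi:app points at the app object inside the semsi module.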
Et voilà! You can now run ./stem-client.py to test your brand-new Finnish stemming service.