An emtsv module that connects a preverb to the verb or verb-derivative token to which it belongs. Intended for Hungarian.
See also: https://github.com/ril-lexknowrep/hungarian-preverb-corpus
This module is a rule-based tool that essentially uses a hand-crafted decision tree to connect Hungarian preverbs to the verbs from which they are separated in certain syntactic contexts. It uses information from emtsv's tok, morph and pos modules to decide whether a separated preverb should be connected to a particular verb. It connects preverbs based only on morphological and part-of-speech tags and surface word order cues, and thus it needs neither a lexicon that lists legitimate preverb-verb combinations nor the output of a syntactic parser. Separated preverbs are connected not only to finite verb forms, but also to infinitives, adjectival and adverbial participles, and nomina actionis that are derived from verbs that have preverbs. A Hungarian preverb may be separated from the verb root in all of these words in certain syntactic contexts.
This module uses the following tsv output fields. The philosophy that underlies these annotations is explained in our emPreverb paper.
- The `prev` field: Verbs (by which we mean finite as well as non-finite verb forms and potentially separable verb derivatives) and preverbs are annotated as follows:
  - `pfx` marks a verb token that contains a prefixed, i.e. non-separated preverb.
  - `sep` marks a verb token from which a preverb was separated, i.e. a verb token to which a preverb token in the sentence belongs.
  - `conn` marks a separated preverb token which has been identified as belonging to a verb token in the sentence.

  The `sep` and `conn` annotations thus mark connected verb-preverb pairs. Verbs for which no corresponding preverb was found and preverbs to which no verb could be assigned by emPreverb are not annotated in any way, i.e. this field remains empty for them.
- The `previd` field: This field contains an unambiguous numerical identifier that indicates which `sep` verb a specific `conn` preverb belongs to. The preverb has the same `previd` value as the corresponding verb. EmPreverb is only designed to handle one-to-one correspondences between separated preverbs and verbs, and thus coordinative structures in which arguably more than one preverb should be connected to a single verb, or vice versa, are not annotated as such (e.g. meg kell és meg is lehet oldani; az öregje addig senkinek cipőt, csizmát nem szab a lábára, amíg meg nem nézette, szagoltatta, tapintatta velük a bőröket, hogy melyik lenne igazán a kedvükre való).
- The `prevpos` field: This indicates the direction and distance of the separated preverb relative to its verb. This information only appears on verbs with separated preverbs, not on the preverbs, nor on verbs with a non-separated preverb. The value of `prevpos` consists of a number, which specifies the distance in tokens, and a sign, which specifies the direction. Minus means 'to the left' and plus means 'to the right'. For example, a value of `+1` would mean that the separated preverb is located immediately to the right of its verb, and `-2` indicates that the preverb is two tokens to the left of the verb, i.e. with one other token in between.
- The `xpostag` field: Although this is one of the required source fields of emPreverb, its value is also modified by it. For verbs with either a separated (`prev = "sep"`) or a non-separated preverb (`prev = "pfx"`), the label `[/Prev]` is prepended to the original value of `xpostag`. For example: szétvetve becomes `[/Prev][/V][_AdvPtcp/Adv]` instead of the original `[/V][_AdvPtcp/Adv]` that is assigned to the `xpostag` field by emtsv's PurePos tagger module. The `xpostag` of the separated verb nyelje in a föld nyelje el becomes `[/Prev][/V][Sbjv.Def.3Sg]`.
- The `lemma` field: This is also a required source field of emPreverb which is modified by it. For verbs with a non-separated preverb the lemma is left unchanged. For `sep` verbs, the separated preverb is prepended to the verb's lemma. The lemma of `conn` preverbs is set to an empty string. In the previous example, PurePos originally assigns the values `nyel` and `el` to the `lemma` field of nyelje and el respectively. EmPreverb changes these to `elnyel` and the empty string (i.e. no lemma at all) respectively.
- The `compound` field: This field is the target field of our emCompound module. It is not required by emPreverb, but if emPreverb's input does contain the `compound` field, then emPreverb modifies it, adding the compound structure preverb + # + verb lemma as the value of this field for `sep` verbs. This means that the value of the `compound` field of the token nyelje in the above example, which is originally empty (as this form is not itself a compound), becomes `el#nyel`. Verb tokens with non-separated preverbs, like szétvetve, are already analysed as compounds, i.e. `szét#vet` by emCompound, so these are not changed.
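To illustrate how these fields fit together, the following minimal sketch pairs each `sep` verb with its `conn` preverb through their shared `previd`. It rests on several assumptions that are not confirmed here: that the TSV starts with a header row naming the columns, that sentences are separated by blank lines, and that the token column is called `form`. The sample input is hand-written, reduced to four columns, and its `previd` value `1` is made up, so it is not actual emPreverb output.

```python
def pair_preverbs(tsv_text):
    """Return (verb form, preverb form, prevpos) triples matched on previd."""
    lines = tsv_text.splitlines()
    header = lines[0].split("\t")  # assumed: first line names the columns
    pairs, verbs, preverbs = [], {}, {}

    def flush():
        # Pair up what was collected for the sentence that just ended.
        for previd, verb in verbs.items():
            if previd in preverbs:
                pairs.append((verb["form"], preverbs[previd]["form"], verb["prevpos"]))
        verbs.clear()
        preverbs.clear()

    for line in lines[1:]:
        if not line.strip():  # assumed: a blank line ends a sentence
            flush()
            continue
        row = dict(zip(header, line.split("\t")))
        if row.get("prev") == "sep":
            verbs[row["previd"]] = row
        elif row.get("prev") == "conn":
            preverbs[row["previd"]] = row
    flush()
    return pairs


# Hand-written sample for "a föld nyelje el" (hypothetical column subset).
SAMPLE = (
    "form\tprev\tprevid\tprevpos\n"
    "a\t\t\t\n"
    "föld\t\t\t\n"
    "nyelje\tsep\t1\t+1\n"
    "el\tconn\t1\t\n"
)

print(pair_preverbs(SAMPLE))
```

Pairing is done sentence by sentence, so the sketch works whether `previd` values are unique in the whole file or only within a sentence.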
Important note on the ordering of modules: Since emPreverb modifies the values of the `lemma` and `xpostag` fields that are assigned by emtsv's pos module, we do not recommend running emPreverb before any other emtsv modules that use these two fields as their source fields. Thus emPreverb should ideally run toward the end of the pipeline. If emCompound is being used, it should be run before emPreverb. EmFilter and emToReadable can be safely run after emPreverb. EmToReadable can in fact convert emPreverb's output annotation into a human-readable format.
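The ordering advice can be turned into a small sanity check on a comma-separated module list. The sketch below is only an illustration: `review_pipeline` is a hypothetical helper, not part of emtsv or emPreverb, and it knows only the module names `compound` and `preverb`. It cannot tell which other modules read `lemma` or `xpostag`, so it merely flags anything placed after `preverb` for manual review.

```python
def review_pipeline(modules):
    """Return review notes for a comma-separated emtsv module list."""
    names = [name.strip() for name in modules.split(",")]
    notes = []
    if "preverb" not in names:
        return notes  # nothing to check without emPreverb
    after = names[names.index("preverb") + 1:]
    for name in after:
        if name == "compound":
            notes.append("compound runs after preverb; run it before preverb")
        else:
            notes.append(
                f"{name} runs after preverb; check that it does not "
                "read lemma or xpostag"
            )
    return notes


print(review_pipeline("tok,morph,pos,compound,preverb"))
print(review_pipeline("tok,morph,pos,preverb,compound"))
```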
Example | V prev | V previd | V prevpos | V xpostag | V lemma | V compound | P prev | P previd | P lemma |
---|---|---|---|---|---|---|---|---|---|
átúsztam | pfx | | | [/Prev][/V][Pst.NDef.1Sg] | átúszik | át#úszik | | | |
végig kell vinni | sep | 1 | -2 | [/Prev][/V][Inf] | végigvisz | végig#visz | conn | 1 | "" |
nem gondolom -e meg | sep | 2 | +2 | [/Prev][/V][Prs.Def.1Sg] | meggondol | meg#gondol | conn | 2 | "" |
90 napot meg nem haladó | sep | 3 | -2 | [/Prev][/Adj][Nom] | meghaladó | meg#haladó | conn | 3 | "" |
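The `prevpos` values in the table can be read as signed token offsets from the verb. The following sketch shows that arithmetic on the table's examples. `locate_preverb` is a hypothetical helper, and the token lists assume that the examples are tokenized at the spaces shown in the table.

```python
def locate_preverb(tokens, verb_index, prevpos):
    """Return the token that prevpos points to, counted from the verb."""
    offset = int(prevpos)  # int() accepts signed strings such as "+2" or "-2"
    target = verb_index + offset
    if offset == 0 or not 0 <= target < len(tokens):
        raise ValueError(f"prevpos {prevpos!r} does not point at another token")
    return tokens[target]


# Verb "vinni" is at index 2; -2 points two tokens to the left.
print(locate_preverb(["végig", "kell", "vinni"], 2, "-2"))
# Verb "gondolom" is at index 1; +2 points two tokens to the right.
print(locate_preverb(["nem", "gondolom", "-e", "meg"], 1, "+2"))
```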
Depending on the current configuration of your system, you might have to add the path to the emPreverb module on your machine (i.e. the path to your clone of the emPreverb repository) to the `PYTHONPATH` environment variable like this before executing the commands below, otherwise you might get a 'module not found' error from the Python interpreter:
export PYTHONPATH="${PYTHONPATH}:/path/to/emPreverb/"
(Replace the part `/path/to/emPreverb/` with the actual absolute path to emPreverb on your machine.) In addition, if you are also using emCompound, you might have to do the same for the emCompound directory as well.
EmPreverb can be executed as an individual Python module. The file 'input.txt' in this example is a raw text file:
cat input.txt | docker run -i --rm mtaril/emtsv tok,morph,pos > pos_output.tsv
cat pos_output.tsv | python3 -m emPreverb > prev_output.tsv
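If the command above fails with a 'module not found' error, a quick check like the following sketch can show whether the interpreter is able to locate a package at all. It uses the standard library's `importlib.util.find_spec`; the helper name `is_importable` is made up for this illustration.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


for name in ("emPreverb", "emCompound"):
    status = "found" if is_importable(name) else "not found, check PYTHONPATH"
    print(f"{name}: {status}")
```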
Optionally, if emCompound is executed before emPreverb in the processing pipeline, then emPreverb adjusts the content of the `compound` field as described above:
cat input.txt | docker run -i --rm mtaril/emtsv tok,morph,pos > pos_output.tsv
cat pos_output.tsv | python3 -m emCompound | python3 -m emPreverb > prev_output.tsv
Alternatively, emPreverb can be run within emtsv as part of a processing pipeline:
cat input.txt | docker run -i --rm mtaril/emtsv tok,morph,pos,preverb > prev_output.tsv
Or together with emCompound:
cat input.txt | docker run -i --rm mtaril/emtsv tok,morph,pos,compound,preverb > prev_output.tsv
pip install -r requirements.txt
- `make connect_preverbs`: if the `compound` field is present
- `make connect_preverbs_withcompound`: if the `compound` field is not present

Uses the code in the `emPreverb` directory directly.
Just type `make` to run all of the following steps.
- A virtual environment is created in `venv`.
- The `emPreverb` Python package is created in `dist/emPreverb-*-py3-none-any.whl`.
- The package is installed in `venv`.
- The package is unit tested on `tests/inputs/*.in`, and the outputs are compared with `tests/outputs/*.out`.

The above steps can be performed by `make venv`, `make build`, `make install` and `make test` respectively.
The Python package can be installed anywhere by direct path:
pip install ./dist/emPreverb-*-py3-none-any.whl
- Check `emPreverb/version.py`.
- `make release-major` or `make release-minor` or `make release-patch`. This will update the version number appropriately and make a `git commit` with a new `git` TAG.
- `make` to recreate the package with the new tag in `dist/emPreverb-TAG-py3-none-any.whl`.
- Go to `https://github.com/THISUSER/emPreverb` and "Create release from tag".
- Add the wheel file from `dist/emPreverb-TAG-py3-none-any.whl` manually to the release.
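For orientation, the three release targets presumably correspond to the usual major, minor and patch increments. The sketch below shows that convention under the assumption that the version has the form MAJOR.MINOR.PATCH; `bump` is a hypothetical function written for this illustration and is not taken from the project's Makefile.

```python
def bump(version, part):
    """Return the next version string for a major, minor or patch release."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"  # lower parts reset to zero
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown release part: {part!r}")


for part in ("major", "minor", "patch"):
    print(part, bump("1.2.3", part))
```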
- Install `emtsv`: 1st and 2nd point + `cython` only.
- Go to the `emtsv` directory (`cd emtsv`).
- Add `emPreverb` by adding this line to `requirements.txt`:
https://github.com/THISUSER/emPreverb/releases/download/vTAG/emPreverb-TAG-py3-none-any.whl
- Complete `config.py` by adding `em_preverb` and `tools` from `emPreverb/__main__.py` appropriately.
- Complete the `emtsv` installation by `make venv`.

echo "A kutya ment volna el sétálni." | venv/bin/python3 ./main.py tok,morph,pos > old
echo "A kutya ment volna el sétálni." | venv/bin/python3 ./main.py tok,morph,pos,preverb > new
- See results by `diff old new`.
- If everything is in order, create a PR for `emtsv`.
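Where `diff` is not available, the comparison of the two outputs can be approximated with Python's standard `difflib` module. This is a minimal sketch: the helper name `changed_lines` is made up, and the two short strings stand in for the contents of `old` and `new`, using the lemma change from the nyelje / el example rather than real emtsv output.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return the removed (-) and added (+) lines between two texts."""
    diff = list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="old", tofile="new", lineterm="", n=0,
    ))
    # The first two entries are the "--- old" / "+++ new" file headers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Stand-in data: form and lemma columns before and after emPreverb.
old = "nyelje\tnyel\nel\tel"
new = "nyelje\telnyel\nel\t"
for line in changed_lines(old, new):
    print(line)
```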
That's it! :)
Based on `postprocess-emtsv/scripts/connect_prev.py` and `emDummy`.
TODO: command line argument `-v`.