This documentation describes semyhe, a word sense disambiguation tool for Estonian. It is based on the hyponym/hypernym hierarchy of Estonian WordNet (EstWN) and is meant to disambiguate all nouns and verbs. The current version is 0.30.
semyhe can be obtained from http://math.ut.ee/~kaarel/NLP/semyhe/semyhe_0.30.tar.gz.
semyhe_0.30.tar.gz contains the following files:
semyhe
semyhe.cnf
*.pm
convert_char.cnv (example: ä → ä)
convert_n.cnv (example: hange_triip+d → hangetriip)
convert_v.cnv (example: valenda+sid → valendama)
gui_semyhe.tcl
evaluate/*
dbtools/*
frequency/*
Note: semyhe has so far been tested only on Unix platforms (Solaris, Linux), but since it is written in Perl it should work on other platforms as well.
In addition to the Perl interpreter, the following is needed:
the Perl module BerkeleyDB, which communicates with Berkeley DB; it can be obtained from CPAN.
The following sequence of commands creates a directory
called semyhe-0.30 in the current directory:
> gunzip semyhe_0.30.tar.gz
> tar xvf semyhe_0.30.tar
Only one environment variable (THESAURUS_DB_PATH) is needed.
It should point to the location of the EstWN thesaurus (in Berkeley DB
format).
For example, when using tcsh, the environment
variable is set like this:
setenv THESAURUS_DB_PATH /home/kaarel/EstWN/thesaurus/senses38.db
Also make sure that Perl can find all the required
modules. If some of them are located somewhere other
than /usr/lib or /usr/local/lib, then
the environment variable PERL5LIB must be set, e.g.
setenv PERL5LIB /home/kaarel/lib/perl5/site_perl/5.005/sun4-solaris
This should be all that is required to complete the installation.
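The two settings described above can be checked before the first run. The following is a minimal sketch in Python, not part of the semyhe distribution; the function name check_semyhe_environment is hypothetical, and the sketch assumes that THESAURUS_DB_PATH names a single file and that PERL5LIB, if set, is a list of directories separated by the platform's path separator.

```python
import os


def check_semyhe_environment(environ=None):
    """Return a list of problems found in the semyhe-related settings.

    `environ` defaults to the real process environment; a plain dict
    can be passed instead, which makes the function easy to try out.
    """
    env = os.environ if environ is None else environ
    problems = []

    # THESAURUS_DB_PATH is the only variable semyhe itself requires.
    db_path = env.get("THESAURUS_DB_PATH")
    if not db_path:
        problems.append("THESAURUS_DB_PATH is not set")
    elif not os.path.isfile(db_path):
        problems.append(
            "THESAURUS_DB_PATH does not point to an existing file: " + db_path
        )

    # PERL5LIB is optional; only the directories actually listed are checked.
    for directory in filter(None, env.get("PERL5LIB", "").split(os.pathsep)):
        if not os.path.isdir(directory):
            problems.append("PERL5LIB entry is not a directory: " + directory)

    return problems
```

Calling check_semyhe_environment() without arguments inspects the current shell environment; an empty list means that nothing obviously wrong was found. It does not verify that the file really is a Berkeley DB database.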
How to use semyhe?
semyhe takes two obligatory command-line arguments:
file, which specifies the input file,
and pos,
which specifies the part of speech of the words to be disambiguated
(either noun (n) or verb (v)).
For example:
> semyhe --file tkt0031.txt --pos n > tkt0031.semyhe
The result of the analysis goes to standard output (STDOUT). It is basically the same as the input text; the only difference is that semantic information is appended to the lines that hold the morphological readings.
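When semyhe is called from a script rather than from the shell, the same invocation can be assembled programmatically. The following Python sketch is only an illustration: the helper names build_semyhe_command and run_semyhe are hypothetical, the semyhe executable is assumed to be on the PATH, and only the two obligatory arguments shown in the example above are used.

```python
import subprocess


def build_semyhe_command(input_file, pos, executable="semyhe"):
    """Build the argument list for the invocation shown in the text."""
    # The documentation names exactly two parts of speech: n and v.
    if pos not in ("n", "v"):
        raise ValueError("pos must be 'n' (noun) or 'v' (verb)")
    return [executable, "--file", input_file, "--pos", pos]


def run_semyhe(input_file, pos, output_file):
    """Run semyhe and write its standard output to output_file."""
    with open(output_file, "wb") as out:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(
            build_semyhe_command(input_file, pos), stdout=out, check=True
        )
```

For instance, run_semyhe("tkt0031.txt", "n", "tkt0031.semyhe") corresponds to the shell command above, with the redirection replaced by an explicitly opened output file.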
In addition to the obligatory arguments, semyhe accepts several optional command-line parameters:
window defines the window size to be used
debug defines the verbosity level
threshold defines the threshold
version prints the version information
help prints a list of available command-line arguments
The configuration file semyhe.cnf
(see example)
lets you define a number of
variables (some of which can be redefined on the command-line)
to modify the flow of the program.
The input text for the system must be morphologically analyzed, that is, each word has to be provided with information about its lemma and part of speech. Taking those two into account makes it possible for semyhe to locate the senses in the EstWN hyponym/hypernym tree that correspond to the analyzed word. The current version of the system expects to find a unique morphological reading for every word. Since the Estonian morphological analyzer/disambiguator often leaves words morphologically ambiguous, this constraint will be removed in the near future.
An example of a text understandable to semyhe:
Ka
ka+0 //_D_ //
sõidukiirus
sõidu_kiirus+0 //_S_ com sg nom // 0
oli
ole+i //_V_ main indic impf ps3 sg ps af // 8
veel
veel+0 //_D_ //
hea
hea+0 //_A_ pos sg nom //
,
, //_Z_ Com //
nii
nii+0 //_D_ //
paarkümmend
paar_kümmend+0 //_N_ card sg nom l //
kilomeetrit
kilo_meeter+t //_S_ com sg part // 1
tunnis
tund+s //_S_ com sg in // 1
The above is the output of the Estonian morphological analyzer
ESTMORF. Each word of the text is on a separate line and is
followed by a line with its morphological analysis, whose fields are separated by
two slashes (//): first comes the lemma (e.g.
ka+0), then the reading itself (e.g. _D_).
The example has also gone through manual sense disambiguation;
the result is given by the numbers after the last slashes.
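The layout just described can be read with a few lines of code. The following Python sketch is not taken from semyhe; the names Reading, parse_analysis_line and parse_text are hypothetical. It assumes that word lines and analysis lines alternate strictly (one reading per word, as the current version requires) and that every analysis line contains exactly two // separators.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    word: str      # surface form, e.g. "tunnis"
    lemma: str     # lemma field as printed, e.g. "tund+s"
    tag: str       # part-of-speech tag, e.g. "_S_"
    features: str  # remaining morphological features, e.g. "com sg in"
    trailing: str  # whatever follows the last //, e.g. a manual sense number


def parse_analysis_line(word, line):
    """Split one analysis line on // into lemma, reading and trailing field."""
    parts = [part.strip() for part in line.split("//")]
    if len(parts) != 3:
        raise ValueError("expected two '//' separators in: " + line)
    lemma, reading, trailing = parts
    # The first token of the reading is the tag, the rest are features.
    tag, _, features = reading.partition(" ")
    return Reading(word, lemma, tag, features, trailing)


def parse_text(lines):
    """Pair each word line with the analysis line that follows it."""
    lines = [line.strip() for line in lines if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("word and analysis lines must alternate")
    return [
        parse_analysis_line(word, analysis)
        for word, analysis in zip(lines[0::2], lines[1::2])
    ]
```

Nouns and verbs can then be picked out by comparing the tag field with _S_ and _V_. A text with more than one reading per word would break the strict alternation and need a different reader.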
As with the morphological analysis, the system does not try to provide each word with exactly one sense as its semantic reading. If two (or more) senses receive equal evaluation results, all of them are kept in the output. The output for the example text looks like this:
Ka
ka+0 //_D_ //
sõidukiirus
sõidu_kiirus+0 //_S_ com sg nom // @ sõidukiirus:0:0
oli
ole+i //_V_ main indic impf ps3 sg ps af // @ olema:853#754#753#80:9
veel
veel+0 //_D_ //
hea
hea+0 //_A_ pos sg nom //
,
, //_Z_ Com //
nii
nii+0 //_D_ //
paarkümmend
paar_kümmend+0 //_N_ card sg nom l //
kilomeetrit
kilo_meeter+t //_S_ com sg part // @ kilomeeter:2493:1
tunnis
tund+s //_S_ com sg in // @ tund:1442:2
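semyhe appends its contribution to the end of each line
that holds the morphological reading of a disambiguated word, preceding it with the
@-sign. The word sense information consists of three parts
separated by colons, e.g. olema:853#754#753#80:9. In the example, the first part is the lemma
and the second part holds the sense or senses, which are separated by
#-signs when there is more than one.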
Only nouns (morphological tag _S_) and verbs
(_V_) are currently disambiguated. If a
word does not exist in the thesaurus, its sense is denoted
by 0.
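The annotation that follows the @-sign can be pulled apart in the same way as the input. The following Python sketch uses a hypothetical function name, parse_sense_annotation. It assumes that the annotation always has exactly three colon-separated parts and that @ does not occur elsewhere on the line; since the text above does not spell out the meaning of the third part, the sketch returns it unchanged.

```python
def parse_sense_annotation(line):
    """Return (lemma, senses, third_field) for an annotated line, else None.

    `senses` is a list of strings, split on '#'. A single "0" means the
    word was not found in the thesaurus.
    """
    # Everything after the last '@' is taken to be the semyhe annotation.
    _, at_sign, annotation = line.rpartition("@")
    if not at_sign:
        return None  # no annotation on this line
    fields = annotation.strip().split(":")
    if len(fields) != 3:
        raise ValueError("expected three colon-separated parts in: " + line)
    lemma, senses, third_field = fields
    return lemma, senses.split("#"), third_field


if __name__ == "__main__":
    example = "ole+i //_V_ main indic impf ps3 sg ps af // @ olema:853#754#753#80:9"
    print(parse_sense_annotation(example))
```

Lines without an @-sign, such as those of adverbs or punctuation, yield None, so the function can be applied to every line of the output.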