Automatic word sense disambiguation with semyhe

Introduction

This documentation describes semyhe, a word sense disambiguation tool for Estonian. It is based on the hyponym/hypernym hierarchy of the Estonian WordNet (EstWN) and is meant to disambiguate all nouns and verbs. The current version is 0.30.

Downloading

semyhe can be obtained from http://math.ut.ee/~kaarel/NLP/semyhe/semyhe_0.30.tar.gz.

Contents of semyhe_0.30.tar.gz

semyhe
main program
semyhe.cnf
configuration file
*.pm
several modules used by different scripts
convert_char.cnv
converts characters (e.g. ää)
convert_n.cnv
converts nouns in ESTMORF form into lemmas (e.g. hange_triip+d -> hangetriip)
convert_v.cnv
converts verbs in ESTMORF form into lemmas (e.g. valenda+sid -> valendama)
gui_semyhe.tcl
graphical user interface to semyhe, written in Tcl/Tk; currently very unstable
evaluate/*
several scripts for evaluating the results of automatic disambiguation; it makes sense to apply these scripts only to files that have also been disambiguated by hand
dbtools/*
several scripts for converting the thesaurus into database form
frequency/*
several scripts for calculating the sense frequencies

Note: semyhe has so far been tested only on Unix platforms (Solaris, Linux), but since it is written in Perl, it should work on other platforms as well.

Installation

Prerequisites

In addition to the Perl interpreter the following is needed:

Unpacking the tar-archive

The following sequence of commands creates a directory called semyhe-0.30 in the current directory:

> gunzip semyhe_0.30.tar.gz
> tar xvf semyhe_0.30.tar

Setting the required environment variables

Only one environment variable, THESAURUS_DB_PATH, is needed. It should point to the location of the EstWN thesaurus (in Berkeley DB format). For example, in tcsh the variable is set like this:

setenv THESAURUS_DB_PATH /home/kaarel/EstWN/thesaurus/senses38.db

Also make sure that Perl can find all the required modules. If some of them are located somewhere other than /usr/lib or /usr/local/lib, the environment variable PERL5LIB must be set, e.g.

setenv PERL5LIB /home/kaarel/lib/perl5/site_perl/5.005/sun4-solaris

This should be all that is required to complete the installation.
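Before running the tool, it can help to verify the two variables described above. The following is a minimal sketch in Python, not part of the semyhe distribution: the function name check_environment is hypothetical, and it assumes that THESAURUS_DB_PATH names a single database file (as in the example path above) and that PERL5LIB is a list of directories separated by the platform's path separator.

```python
import os
from typing import List, Mapping


def check_environment(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # THESAURUS_DB_PATH is the only variable semyhe itself requires.
    db_path = env.get("THESAURUS_DB_PATH")
    if not db_path:
        problems.append("THESAURUS_DB_PATH is not set")
    elif not os.path.isfile(db_path):
        # Assumption: the thesaurus is a single file such as senses38.db.
        problems.append("THESAURUS_DB_PATH does not point to a file: " + db_path)

    # PERL5LIB is optional; if present, every listed directory should exist.
    for directory in filter(None, env.get("PERL5LIB", "").split(os.pathsep)):
        if not os.path.isdir(directory):
            problems.append("PERL5LIB entry is not a directory: " + directory)

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

The function takes the environment as a parameter so that it can be tried out on a plain dictionary without touching the real shell environment.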

How to use semyhe?

semyhe takes two mandatory command-line arguments: --file, which specifies the input file (see Input below), and --pos, which specifies the part of speech of the words to be disambiguated, either noun (n) or verb (v). For example:

> semyhe --file tkt0031.txt --pos n > tkt0031.semyhe

The result of the analysis is written to standard output (STDOUT). It is essentially the same as the input text; the only difference is that semantic information is added to the lines (see Output below).

Also, semyhe accepts several other command-line parameters in addition to the obligatory ones:

The configuration file semyhe.cnf lets you define a number of variables (some of which can be overridden on the command line) that modify the behaviour of the program.
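When semyhe is called from a script rather than typed at the prompt, the invocation shown earlier in this section can be wrapped as follows. This is only a sketch: --file and --pos are the options documented above, while the helper names build_command and run_semyhe are hypothetical, and the code assumes that the semyhe executable can be found on the PATH.

```python
import subprocess
from typing import List


def build_command(input_file: str, pos: str, program: str = "semyhe") -> List[str]:
    """Build the argument list for one semyhe run."""
    # The documentation names exactly two part-of-speech values.
    if pos not in ("n", "v"):
        raise ValueError("pos must be 'n' (noun) or 'v' (verb)")
    return [program, "--file", input_file, "--pos", pos]


def run_semyhe(input_file: str, pos: str, output_file: str) -> None:
    """Run semyhe and redirect its standard output to output_file."""
    # Binary mode avoids making any assumption about the text encoding.
    with open(output_file, "wb") as out:
        subprocess.run(build_command(input_file, pos), stdout=out, check=True)


if __name__ == "__main__":
    print(" ".join(build_command("tkt0031.txt", "n")))
```

Passing the arguments as a list, rather than as one shell string, means that file names containing spaces need no extra quoting.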

Input text and output text

Input

The input text for the system must be morphologically analyzed, that is, each word has to be provided with information about its lemma and part of speech. Taking these two into account makes it possible for semyhe to locate the senses in the EstWN hyponym/hypernym tree that correspond to the analyzed word. The current version of the system expects to find a unique morphological reading for every word. Since the Estonian morphological analyzer/disambiguator often leaves words morphologically ambiguous, this constraint will be removed in the near future.

An example of a text that semyhe can process:

Ka
     ka+0 //_D_ //
sõidukiirus
     sõidu_kiirus+0 //_S_ com sg nom // 0
oli
     ole+i //_V_ main indic impf ps3 sg ps af // 8
veel
     veel+0 //_D_ //
hea
     hea+0 //_A_ pos sg nom //
,
     , //_Z_ Com //
nii
     nii+0 //_D_ //
paarkümmend
     paar_kümmend+0 //_N_ card sg nom l //
kilomeetrit
     kilo_meeter+t //_S_ com sg part // 1
tunnis
     tund+s //_S_ com sg in // 1

The above is the output of the Estonian morphological analyzer ESTMORF. Each word of the text is presented on a separate line, followed on the next line by its morphological reading, whose parts are delimited by double slashes (//): first comes the lemma (e.g. ka+0), then the reading itself (e.g. _D_). The example has also been sense-disambiguated by hand; this is denoted by the numbers after the last pair of slashes.
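To make the layout of this format concrete, here is a minimal Python sketch that reads it into records. It is not part of semyhe and it infers the format only from the example above: a word on an unindented line, followed by one indented reading line of the form "lemma //_POS_ features // trailer". The names Reading, READING_RE and parse_estmorf are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Reading(NamedTuple):
    word: str                    # surface form, e.g. "sõidukiirus"
    lemma: str                   # ESTMORF lemma field, e.g. "sõidu_kiirus+0"
    pos: str                     # tag without underscores, e.g. "S"
    features: str                # e.g. "com sg nom" (may be empty)
    manual_sense: Optional[str]  # text after the last "//", if any


# Indented line: lemma, then //_POS_ and features, then // and a trailer.
READING_RE = re.compile(
    r"^\s+(?P<lemma>.*?)\s*//_(?P<pos>\w+)_\s*(?P<features>.*?)\s*//\s*(?P<rest>.*)$"
)


def parse_estmorf(text: str) -> List[Reading]:
    readings = []
    word = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            word = line.strip()          # unindented line: a new word
            continue
        match = READING_RE.match(line)
        if match is None or word is None:
            continue                     # skip lines that do not fit the pattern
        rest = match.group("rest").strip()
        readings.append(Reading(word, match.group("lemma"), match.group("pos"),
                                match.group("features"), rest or None))
    return readings


SAMPLE = """\
Ka
     ka+0 //_D_ //
sõidukiirus
     sõidu_kiirus+0 //_S_ com sg nom // 0
oli
     ole+i //_V_ main indic impf ps3 sg ps af // 8
"""

if __name__ == "__main__":
    for reading in parse_estmorf(SAMPLE):
        print(reading)
```

The sketch keeps only one reading per word, which matches the current requirement of a unique morphological reading.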

Output

As with morphological analysis, the system does not try to assign exactly one sense to each word as its semantic reading. If two or more senses receive equal evaluation results, all of them are kept in the output.

Ka
     ka+0 //_D_ //
sõidukiirus
     sõidu_kiirus+0 //_S_ com sg nom // @ sõidukiirus:0:0
oli
     ole+i //_V_ main indic impf ps3 sg ps af //  @ olema:853#754#753#80:9
veel
     veel+0 //_D_ //
hea
     hea+0 //_A_ pos sg nom //
,
     , //_Z_ Com //
nii
     nii+0 //_D_ //
paarkümmend
     paar_kümmend+0 //_N_ card sg nom l //
kilomeetrit
     kilo_meeter+t //_S_ com sg part // @ kilomeeter:2493:1
tunnis
     tund+s //_S_ com sg in // @ tund:1442:2

semyhe appends its result to the end of the line that contains the morphological reading, preceded by an @ sign. The word sense information consists of three parts separated by colons:

  1. lemma
  2. synset numbers; if more than one is given, they are separated by # signs
  3. the number of different senses the word has

Only nouns (morphological tag _S_) and verbs (_V_) are currently disambiguated. If a word does not exist in the thesaurus, its sense is denoted by 0.
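The three-part annotation described above is easy to extract programmatically. The following Python sketch is not part of semyhe; the names SenseAnnotation and parse_annotation are hypothetical, and the format is assumed to be exactly "@ lemma:synsets:count" as in the example output.

```python
from typing import List, NamedTuple, Optional


class SenseAnnotation(NamedTuple):
    lemma: str           # e.g. "olema"
    synsets: List[str]   # e.g. ["853", "754", "753", "80"]
    sense_count: int     # number of senses the word has

    @property
    def unknown(self) -> bool:
        # A word missing from the thesaurus is marked with sense 0.
        return self.synsets == ["0"]


def parse_annotation(line: str) -> Optional[SenseAnnotation]:
    """Return the annotation after the last '@', or None if there is none."""
    _, at_sign, tail = line.rpartition("@")
    if not at_sign:
        return None                      # line was not annotated by semyhe
    parts = tail.strip().rsplit(":", 2)  # lemma : synsets : count
    if len(parts) != 3:
        return None
    lemma, synsets, count = parts
    return SenseAnnotation(lemma, synsets.split("#"), int(count))


if __name__ == "__main__":
    line = "     ole+i //_V_ main indic impf ps3 sg ps af //  @ olema:853#754#753#80:9"
    print(parse_annotation(line))
```

Lines without an @ sign, such as those for adverbs or punctuation, yield None, so the function can be applied to every line of the output.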

Other stuff


Kaarel Kaljurand, Thu Oct 18 15:21:26 EET 2001
