Workshop on Annotation and Exploitation of Parallel Corpora (AEPC)

A short summary of the workshop is available on the AEPC flyer.

In recent years, parallel corpora have become ever more useful for data-driven Machine Translation, Word Sense Disambiguation, and Cross-language Information Retrieval. Most of the time, parallel corpora have been used as raw texts (i.e. without any linguistic annotation) or with independent linguistic annotation (i.e. linguistic annotation that was applied to either language side without reference to the other). We believe that the full potential of parallel corpora will be reached when they are aligned and annotated concurrently. Many research strands, such as the automatic creation of parallel treebanks and parallel parsing, point in this direction. In particular, the popularity of syntax-enhanced approaches to statistical machine translation and the rise of multilingual corpus linguistics indicate the relevance of this workshop at this point in time.

Various projects have been initiated to build aligned parallel treebanks [Cmejrek et al., 2005, Gustafson-Capková et al., 2007, Ahrenberg, 2007], and most of them are based on tedious manual labor [Lundborg et al., 2007, Samuelsson and Volk, 2007]. Recently, several attempts have been made to automate this process, mainly focused on creating syntax-oriented translation models [Wang et al., 2002, Gildea, 2003, Zhechev and Way, 2008, Lavie et al., 2008]. The main strategies are based on alignment through parsing and chunking [Spreyer et al., 2008], language pair-dependent alignment rules [Groves et al., 2004], and the use of previous word alignment to induce phrase correspondences [Zhechev and Way, 2008]. Discriminative approaches using supervised learning have been successfully applied as well [Tiedemann and Kotzé, 2009]. Using these techniques to scale up the size of available aligned treebanks opens up a wide range of new possibilities for the exploration of cross-lingual data with syntactic and semantic information.

The work on automatic tree alignment is closely related to synchronous parsing based on transduction grammars (as in [Melamed, 2003]) or based on bootstrapping from a small set of manually labeled seeds (as in [Kuhn and Jellinghaus, 2006]). The advertised advantage is that the parallel text helps in syntactic disambiguation as well as in fast and robust annotation. Multiparallel corpora are considered to be of higher value than bilingual corpora.

Automatic syntactic annotation depends on the availability of language technology modules (e.g. PoS taggers and parsers) in the respective language. Resource-poor languages might not have this technology infrastructure. Moreover, manual annotation is time-consuming. Therefore, [Hwa et al., 2005] and [Smith and Eisner, 2006] have proposed ways to transfer syntactic information from one language to another in parallel corpora, a technique termed annotation projection.

As a follow-up to the work on projecting syntactic information across parallel corpora, the projection of semantic annotation was pioneered in recent work by [Padó and Lapata, 2009], who worked on the transfer of frame-semantic annotation across parallel corpora. We believe that improved functional and semantic projection is a necessary step to speed up the tedious process of semantic annotation. This is confirmed in recent work by [Dorr et al., 2010].

There are few tools for corpus linguistics over parallel corpora, and even fewer for visualizing and searching annotated parallel corpora (an example is [Germann, 2007]). With the increasing interest in and availability of annotated parallel corpora, we see a growing demand for such tools.

With this workshop, we aim to bring together researchers who work on annotating parallel corpora for various languages and purposes and researchers who explore such resources for various applications. The following research areas will be addressed:

Invited speaker

AEPC Workshop Schedule

AEPC Workshop Organizers

AEPC Submission Format

We ask for papers (minimum 6 pages, maximum 10 pages, conforming to the TLT guidelines) describing research in the range of topics specified above. We welcome work-in-progress reports if they contain at least preliminary results.

The language of the AEPC Workshop is English, and all papers should be submitted in carefully proofread English. Papers should be submitted in PDF format via the EasyChair AEPC Web page.

Program Committee

The following researchers have agreed to serve on the program committee:


For information on this workshop please contact volk at cl uzh ch.