Latest news:

Dec 2, 2010:
Online proceedings available

Dec 4, 2010:
Photos

Main conference:

TLT9 - The Ninth International Workshop on Treebanks and Linguistic Theories

Workshop on Annotation and Exploitation of Parallel Corpora (AEPC)

A short overview of the workshop is available on the AEPC flyer.

In recent years, parallel corpora have become ever more useful for data-driven Machine Translation, Word Sense Disambiguation, or Cross-language Information Retrieval. So far, parallel corpora have mostly been used as raw texts (i.e. without any linguistic annotation) or with independent linguistic annotation (i.e. linguistic annotation that was applied to either language side without reference to the other). We believe that the full potential of parallel corpora will be reached when they are aligned and annotated concurrently. Many research strands, such as the automatic creation of parallel treebanks and parallel parsing, point in this direction. In particular, the popularity of syntax-enhanced approaches to statistical machine translation and the rise of multilingual corpus linguistics indicate the relevance of this workshop at this point in time.

Various projects have been initiated to build aligned parallel treebanks [Cmejrek et al., 2005, Gustafson-Capková et al., 2007, Ahrenberg, 2007], and most of them rely on tedious manual labor [Lundborg et al., 2007, Samuelsson and Volk, 2007]. Recently, several attempts have been made to automate this process, mainly with a focus on creating syntax-oriented translation models [Wang et al., 2002, Gildea, 2003, Zhechev and Way, 2008, Lavie et al., 2008]. The main strategies are based on alignment through parsing and chunking [Spreyer et al., 2008], language pair-dependent alignment rules [Groves et al., 2004] and the use of previous word alignment to induce phrase correspondences [Zhechev and Way, 2008]. Discriminative approaches using supervised learning have been successfully applied as well [Tiedemann and Kotzé, 2009]. Using these techniques to scale up the size of available aligned treebanks opens up a wide range of new possibilities for the exploration of cross-lingual data with syntactic and semantic information.
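To illustrate how an existing word alignment can induce correspondences between tree nodes, here is a minimal Python sketch. It uses a simplified consistency heuristic and is not the method of any of the cited papers: a source node and a target node are paired when at least one alignment link connects their spans and no link leaves either span. The function names, the span representation and the toy sentence pair are all assumptions made for this illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]   # token span of a tree node: start inclusive, end exclusive
Link = Tuple[int, int]   # word alignment link: (source index, target index)


def consistent(src: Span, tgt: Span, links: Set[Link]) -> bool:
    """Return True if the two spans are linked and no link crosses their borders."""
    def in_src(s: int) -> bool:
        return src[0] <= s < src[1]

    def in_tgt(t: int) -> bool:
        return tgt[0] <= t < tgt[1]

    # Links that touch at least one of the two spans.
    touching = [(s, t) for s, t in links if in_src(s) or in_tgt(t)]
    # Require at least one link, and every touching link must lie inside both spans.
    return bool(touching) and all(in_src(s) and in_tgt(t) for s, t in touching)


def align_nodes(src_spans: List[Span], tgt_spans: List[Span],
                links: Set[Link]) -> List[Tuple[Span, Span]]:
    """Pair every source node with every target node that is consistent with it."""
    return [(a, b) for a in src_spans for b in tgt_spans if consistent(a, b, links)]


# Toy example (invented): "she reads books" / "sie liest gerne Bücher".
# Spans stand for the nodes S, VP, NP (subject) and NP (object).
src_spans = [(0, 3), (1, 3), (0, 1), (2, 3)]
tgt_spans = [(0, 4), (1, 4), (0, 1), (3, 4)]
links = {(0, 0), (1, 1), (2, 3)}   # "gerne" is left unaligned

node_pairs = align_nodes(src_spans, tgt_spans, links)
```

In this toy case each source node is paired with the target node of the same category, even though the target sentence contains an unaligned word. Real tree aligners must additionally deal with noisy links and competing candidate pairs.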

The work on automatic tree alignment is closely related to synchronous parsing based on transduction grammars (as in [Melamed, 2003]) or based on bootstrapping from a small set of manually labeled seeds (as in [Kuhn and Jellinghaus, 2006]). The advertised advantage is that the parallel text helps in syntactic disambiguation as well as in fast and robust annotation. Multiparallel corpora are considered to be of higher value than bilingual corpora.

Automatic syntactic annotation depends on the availability of language technology modules (e.g. PoS taggers and parsers) for the respective language. Resource-poor languages might not have this technology infrastructure. Moreover, manual annotation is time-consuming. Therefore, [Hwa et al., 2005] and [Smith and Eisner, 2006] have proposed ways to transfer syntactic information from one language to another in parallel corpora, an approach termed annotation projection.
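The basic idea of annotation projection can be sketched in a few lines of Python. This is a deliberately simplified illustration of direct projection over word alignment links, not the procedure of the cited papers. The function name `project_tags`, the tag labels and the toy sentence pair are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def project_tags(source_tags: List[str],
                 target_length: int,
                 alignment: List[Tuple[int, int]]) -> List[Optional[str]]:
    """Copy each source tag to the target token it is aligned with.

    Target tokens without an alignment link keep the value None.
    """
    projected: List[Optional[str]] = [None] * target_length
    for src_idx, tgt_idx in alignment:
        # The first link wins; later links to the same target token are ignored.
        if projected[tgt_idx] is None:
            projected[tgt_idx] = source_tags[src_idx]
    return projected


# Toy example (invented): tags exist only for the English side.
english = ["she", "reads", "books"]
english_tags = ["PRON", "VERB", "NOUN"]
german = ["sie", "liest", "gerne", "Bücher"]
links = [(0, 0), (1, 1), (2, 3)]   # (English index, German index)

german_tags = project_tags(english_tags, len(german), links)
```

Here "gerne" receives no tag because it has no alignment link. Gaps of this kind, together with alignment errors, are one reason why projected annotation usually needs filtering or further training before it can be used.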

As a follow-up to the work on projecting syntactic information across parallel corpora, the projection of semantic annotation was pioneered in recent work by [Padó and Lapata, 2009], who have worked on the transfer of frame-semantic annotation across parallel corpora. We believe that improved functional and semantic projection is a necessary step to speed up the tedious process of semantic annotation. This is confirmed in recent work by [Dorr et al., 2010].

There are few tools for corpus linguistics over parallel corpora, and even fewer for visualizing and searching annotated parallel corpora (an example is [Germann, 2007]). With the increasing interest in and availability of annotated parallel corpora, we see a growing demand for such tools.

With this workshop we aim to bring together researchers who work on annotating parallel corpora for various languages and purposes and researchers who explore such resources for various applications. The following research areas will be addressed:

Invited speaker

AEPC Workshop Schedule

AEPC Workshop Organizers

AEPC Submission Format

We ask for papers (min. length 6 pages, max. length 10 pages, conforming to the TLT guidelines) describing research in the range of topics specified above. We welcome work-in-progress reports if they contain at least preliminary results.

The language of the AEPC Workshop is English, and all papers should be submitted in well-checked English. Papers should be submitted in PDF. Submissions should be made via the EasyChair AEPC Web page.

Program Committee

The following researchers have agreed to serve on the program committee:

References

[Ahrenberg, 2007] Ahrenberg, L. (2007). LinES: An English-Swedish parallel treebank. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA, 2007).

[Cmejrek et al., 2005] Cmejrek, M., Curín, J., and Havelka, J. (2005). Prague Czech-English dependency treebank. Resource for structure-based MT. In Proceedings of EAMT 10th Annual Conference, Budapest.

[Dorr et al., 2010] Dorr, B. J., Passonneau, R. J., Farwell, D., Green, R., Habash, N., Helmreich, S., Hovy, E., Levin, L., Miller, K. J., Mitamura, T., Rambow, O., and Siddharthan, A. (2010). Interlingual annotation of parallel text corpora: a new framework for annotation and evaluation. Natural Language Engineering, 16(3):197-243.

[Germann, 2007] Germann, U. (2007). Two tools for creating and visualizing subsentential alignments of parallel text. In Proc. of The Linguistic Annotation Workshop at ACL 2007, pages 121-124, Prague.

[Gildea, 2003] Gildea, D. (2003). Loosely tree-based alignment for machine translation. In Proceedings of the 41st Annual Meeting of the Association of Computational Linguistics (ACL-03), pages 80-87, Sapporo, Japan.

[Groves et al., 2004] Groves, D., Hearne, M., and Way, A. (2004). Robust sub-sentential alignment of phrase-structure trees. In Proceedings of the 20th International Conference on Computational Linguistics (CoLing 2004), pages 1072-1078, Geneva, Switzerland.

[Gustafson-Capková et al., 2007] Gustafson-Capková, S., Samuelsson, Y., and Volk, M. (2007). SMULTRON (version 1.0) - The Stockholm MULtilingual parallel TReebank. http://www.ling.su.se/dali/research/smultron/index.htm. An English-German-Swedish parallel Treebank with sub-sentential alignments.

[Hwa et al., 2005] Hwa, R., Resnik, P., Weinberg, A., Cabezas, C., and Kolak, O. (2005). Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3).

[Kuhn and Jellinghaus, 2006] Kuhn, J. and Jellinghaus, M. (2006). Multilingual parallel treebanking: A lean and flexible approach. In Proc. of LREC, Genoa.

[Lavie et al., 2008] Lavie, A., Parlikar, A., and Ambati, V. (2008). Syntax-driven learning of sub-sentential translation equivalents and translation rules from parsed parallel corpora. In Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2), pages 87-95, Columbus, Ohio. Association for Computational Linguistics.

[Lundborg et al., 2007] Lundborg, J., Marek, T., Mettler, M., and Volk, M. (2007). Using the Stockholm TreeAligner. In Proceedings of the 6th Workshop on Treebanks and Linguistic Theories, pages 73-78, Bergen, Norway.

[Melamed, 2003] Melamed, I. D. (2003). Multitext grammars and synchronous parsers. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 79-86, Morristown, NJ, USA. Association for Computational Linguistics.

[Padó and Lapata, 2009] Padó, S. and Lapata, M. (2009). Cross-lingual annotation projection of semantic roles. Journal of Artificial Intelligence Research.

[Samuelsson and Volk, 2007] Samuelsson, Y. and Volk, M. (2007). Alignment tools for parallel treebanks. In Proceedings of GLDV Frühjahrstagung 2007.

[Smith and Eisner, 2006] Smith, D. A. and Eisner, J. (2006). Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies. In Proceedings of the Workshop on Statistical Machine Translation, pages 23-30, New York.

[Spreyer et al., 2008] Spreyer, K., Kuhn, J., and Schrader, B. (2008). Identification of comparable argument-head relations in parallel corpora. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC), Marrakesh.

[Tiedemann and Kotzé, 2009] Tiedemann, J. and Kotzé, G. (2009). A discriminative approach to tree alignment. In Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography and Language Learning (in Connection with RANLP2009), pages 33-39.

[Wang et al., 2002] Wang, W., Huang, J.-X., Zhou, M., and Huang, C.-N. (2002). Structure alignment using bilingual chunking. In Proceedings of the 19th Conference on Computational Linguistics, pages 1-7, Taipei, Taiwan.

[Zhechev and Way, 2008] Zhechev, V. and Way, A. (2008). Automatic generation of parallel treebanks. In Proceedings of the 22nd International Conference on Computational Linguistics (CoLing), pages 1105-1112.

For information on this workshop please contact volk at cl uzh ch.