Workshop on Annotation and Exploitation of Parallel Corpora (AEPC)
A short summary of the workshop is available on the AEPC flyer.
In recent years, parallel corpora have become ever more useful for data-driven Machine Translation, Word Sense Disambiguation, and Cross-language Information Retrieval. Most of the time, parallel corpora have been used as raw text (i.e. without any linguistic annotation) or with independent linguistic annotation (i.e. annotation applied to each language side without reference to the other). We believe that the full potential of parallel corpora will be reached when they are aligned and annotated concurrently. Many research strands, such as the automatic creation of parallel treebanks and parallel parsing, point in this direction. In particular, the popularity of syntax-enhanced approaches to statistical machine translation and the rise of multilingual corpus linguistics indicate the relevance of this workshop at this point in time.
Various projects have been initiated to build aligned parallel treebanks [Cmejrek et al., 2005, Gustafson-Capková et al., 2007, Ahrenberg, 2007], and most of them are based on tedious manual labor [Lundborg et al., 2007, Samuelsson and Volk, 2007]. Recently, several attempts have been made to automate this process, mainly with a focus on creating syntax-oriented translation models [Wang et al., 2002, Gildea, 2003, Zhechev and Way, 2008, Lavie et al., 2008]. The main strategies are based on alignment through parsing and chunking [Spreyer et al., 2008], language pair-dependent alignment rules [Groves et al., 2004], and the use of previous word alignment to induce phrase correspondences [Zhechev and Way, 2008]. Discriminative approaches using supervised learning have been successfully applied as well [Tiedemann and Kotzé, 2009]. Using these techniques to scale up the size of available aligned treebanks opens up a wide range of new possibilities for the exploration of cross-lingual data with syntactic and semantic information.
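To make the idea of inducing phrase correspondences from word alignment more concrete, the following is a minimal sketch of a commonly used consistency check. It is not the method of any of the cited papers. It assumes that tree nodes are represented by inclusive token spans and that word alignment is given as a list of (source index, target index) links; the function name `is_consistent` and the toy data are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]   # inclusive (first_token, last_token) covered by a tree node
Link = Tuple[int, int]   # (source_token_index, target_token_index)


def is_consistent(src_span: Span, tgt_span: Span, links: List[Link]) -> bool:
    """Return True if the two node spans may be linked given the word alignment.

    A span pair is accepted when at least one word link lies inside it
    and no word link connects a token inside one span to a token
    outside the other span.
    """
    (s_lo, s_hi), (t_lo, t_hi) = src_span, tgt_span
    has_inner_link = False
    for s, t in links:
        in_src = s_lo <= s <= s_hi
        in_tgt = t_lo <= t <= t_hi
        if in_src != in_tgt:
            # The link crosses the border of the span pair.
            return False
        if in_src and in_tgt:
            has_inner_link = True
    return has_inner_link


# Invented toy alignment: tokens 1 and 2 are swapped between the languages.
links = [(0, 0), (1, 2), (2, 1)]
candidate_ok = is_consistent((1, 2), (1, 2), links)
candidate_bad = is_consistent((0, 1), (0, 1), links)
```

In this toy case the node pair covering tokens 1 to 2 on both sides is accepted, while the pair covering tokens 0 to 1 is rejected because one word link leaves the target span. Real systems typically score candidate pairs instead of applying a hard filter.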
The work on automatic tree alignment is closely related to synchronous parsing based on transduction grammars (as in [Melamed, 2003]) or based on bootstrapping from a small set of manually labeled seeds (as in [Kuhn and Jellinghaus, 2006]). The advertised advantage is that the parallel text helps in syntactic disambiguation as well as in fast and robust annotation. Multiparallel corpora are considered to be of higher value than bilingual corpora.
Automatic syntactic annotation depends on the availability of language technology modules (e.g. PoS taggers and parsers) for the respective language. Resource-poor languages might not have this technology infrastructure, and manual annotation is time-consuming. Therefore, [Hwa et al., 2005] and [Smith and Eisner, 2006] have proposed ways to transfer syntactic information from one language to another in parallel corpora, an approach termed annotation projection.
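As an illustration only, the basic idea of annotation projection can be sketched for the simplest case, PoS tags. The sketch assumes that word alignment links are already available as (source index, target index) pairs; the function name `project_tags`, the "first link wins" rule, and the toy sentence pair are assumptions made for this example and do not reproduce the methods of the cited papers, which project richer syntactic structure and handle noisy alignments.

```python
from typing import List, Optional, Tuple


def project_tags(
    source_tags: List[str],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[Optional[str]]:
    """Copy each source tag to the target token it is aligned with.

    Target tokens without any alignment link stay untagged (None).
    """
    projected: List[Optional[str]] = [None] * target_length
    for src_idx, tgt_idx in alignment:
        # Simplifying assumption: the first link to a target token wins.
        if projected[tgt_idx] is None:
            projected[tgt_idx] = source_tags[src_idx]
    return projected


# Invented toy pair: "She reads books" / "Sie liest gerne Bücher".
# The target word "gerne" has no counterpart in the source sentence.
source_tags = ["PRON", "VERB", "NOUN"]
alignment = [(0, 0), (1, 1), (2, 3)]
target_tags = project_tags(source_tags, alignment, target_length=4)
```

The unaligned target token remains untagged, which shows why projected annotation is usually treated as noisy training material and not as a finished annotation layer.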
As a follow-up to the work on projecting syntactic information across parallel corpora, the projection of semantic annotation was pioneered in recent work by [Padó and Lapata, 2009], who have worked on the transfer of frame-semantic annotation across parallel corpora. We believe that improved functional and semantic projection is a necessary step to speed up the tedious process of semantic annotation. This is confirmed in recent work by [Dorr et al., 2010].
There are few tools for corpus linguistics over parallel corpora, and even fewer for visualizing and searching annotated parallel corpora (an example is [Germann, 2007]). With the increasing interest in and availability of annotated parallel corpora, we see a growing demand for such tools.
With this workshop we aim to bring together researchers who work on annotating parallel corpora for various languages and purposes and researchers who explore such resources for various applications. The following research areas will be addressed:
- Parallel Treebanks (manual or automatic creation)
- Cross-language Word Alignment and Phrase-Structure Alignment
- Parallel Grammars, Parallel Parsing
- Grammar Induction
- Parallel Semantic Annotation
- Parallel Referent Resolution and Anaphora
- Annotation Projection
- Multi-parallel Corpora
- Tools for Multilingual Corpus Linguistics
- Exploitation of Parallel Corpora for Evaluation
- Annotated Parallel Corpora for Machine Translation
- Novel Applications of Annotated Parallel Corpora
Invited speaker
- Matthias Buch-Kromann, Copenhagen Business School
AEPC Workshop Schedule
- Deadline for paper submission: 3 October 2010
- Notification of acceptance: 24 October 2010
- Final version of paper for workshop proceedings: 15 November 2010
- Workshop: 2 December 2010
AEPC Workshop Organizers
- Lars Ahrenberg (Linköping University)
- Jörg Tiedemann (Uppsala University)
- Martin Volk (University of Zurich)
AEPC Submission Format
We ask for papers (minimum 6 pages, maximum 10 pages, conforming to the TLT guidelines) describing research in the range of topics specified above. We welcome work-in-progress reports if they contain at least preliminary results.
The language of the AEPC Workshop is English, and all papers should be submitted in well-checked English. Papers should be submitted in PDF. Submissions should be made via the EasyChair AEPC Web page.
Program Committee
The following researchers have agreed to serve on the program committee:
- Paul Buitelaar (DERI, Galway)
- Anne Göhring (University of Zurich)
- Silvia Hansen (University of Mainz)
- Joakim Nivre (Uppsala University)
- Lonneke van der Plas (University of Geneva)
- Yvonne Samuelsson (Stockholm University)
- John Tinsley (Dublin City University)
- Mats Wirén (Stockholm University)
- Dekai Wu (Hong Kong University of Science & Technology)
- Ventsislav Zhechev (Dublin City University)
References
[Ahrenberg, 2007] Ahrenberg, L. (2007). LinES: An English-Swedish parallel treebank. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA, 2007).
[Cmejrek et al., 2005] Cmejrek, M., Curín, J., and Havelka, J. (2005). Prague Czech-English dependency treebank. Resource for structure-based MT. In Proceedings of EAMT 10th Annual Conference, Budapest.
[Dorr et al., 2010] Dorr, B. J., Passonneau, R. J., Farwell, D., Green, R., Habash, N., Helmreich, S., Hovy, E., Levin, L., Miller, K. J., Mitamura, T., Rambow, O., and Siddharthan, A. (2010). Interlingual annotation of parallel text corpora: a new framework for annotation and evaluation. Natural Language Engineering, 16(3):197-243.
[Germann, 2007] Germann, U. (2007). Two tools for creating and visualizing subsentential alignments of parallel text. In Proc. of The Linguistic Annotation Workshop at ACL 2007, pages 121-124, Prague.
[Gildea, 2003] Gildea, D. (2003). Loosely tree-based alignment for machine translation. In Proceedings of the 41st Annual Meeting of the Association of Computational Linguistics (ACL-03), pages 80-87, Sapporo, Japan.
[Groves et al., 2004] Groves, D., Hearne, M., and Way, A. (2004). Robust sub-sentential alignment of phrase-structure trees. In Proceedings of the 20th International Conference on Computational Linguistics (CoLing 2004), pages 1072-1078, Geneva, Switzerland.
[Gustafson-Capková et al., 2007] Gustafson-Capková, S., Samuelsson, Y., and Volk, M. (2007). SMULTRON (version 1.0) - The Stockholm MULtilingual parallel TReebank. http://www.ling.su.se/dali/research/smultron/index.htm. An English-German-Swedish parallel Treebank with sub-sentential alignments.
[Hwa et al., 2005] Hwa, R., Resnik, P., Weinberg, A., Cabezas, C., and Kolak, O. (2005). Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3).
[Kuhn and Jellinghaus, 2006] Kuhn, J. and Jellinghaus, M. (2006). Multilingual parallel treebanking: A lean and flexible approach. In Proc. of LREC, Genoa.
[Lavie et al., 2008] Lavie, A., Parlikar, A., and Ambati, V. (2008). Syntax-driven learning of sub-sentential translation equivalents and translation rules from parsed parallel corpora. In Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2), pages 87-95, Columbus, Ohio. Association for Computational Linguistics.
[Lundborg et al., 2007] Lundborg, J., Marek, T., Mettler, M., and Volk, M. (2007). Using the Stockholm TreeAligner. In Proceedings of the 6th Workshop on Treebanks and Linguistic Theories, pages 73-78, Bergen, Norway.
[Melamed, 2003] Melamed, I. D. (2003). Multitext grammars and synchronous parsers. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 79-86, Morristown, NJ, USA. Association for Computational Linguistics.
[Padó and Lapata, 2009] Padó, S. and Lapata, M. (2009). Cross-lingual annotation projection of semantic roles. Journal of Artificial Intelligence Research.
[Samuelsson and Volk, 2007] Samuelsson, Y. and Volk, M. (2007). Alignment tools for parallel treebanks. In Proceedings of GLDV Frühjahrstagung 2007.
[Smith and Eisner, 2006] Smith, D. A. and Eisner, J. (2006). Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies. In Proceedings of the Workshop on Statistical Machine Translation, pages 23-30, New York.
[Spreyer et al., 2008] Spreyer, K., Kuhn, J., and Schrader, B. (2008). Identification of comparable argument-head relations in parallel corpora. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC), Marrakesh.
[Tiedemann and Kotzé, 2009] Tiedemann, J. and Kotzé, G. (2009). A discriminative approach to tree alignment. In Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography and Language Learning (in Connection with RANLP2009), pages 33-39.
[Wang et al., 2002] Wang, W., Huang, J.-X., Zhou, M., and Huang, C.-N. (2002). Structure alignment using bilingual chunking. In Proceedings of the 19th Conference on Computational Linguistics, pages 1-7, Taipei, Taiwan.
[Zhechev and Way, 2008] Zhechev, V. and Way, A. (2008). Automatic generation of parallel treebanks. In Proceedings of the 22nd International Conference on Computational Linguistics (CoLing), pages 1105-1112.
For information on this workshop please contact volk at cl uzh ch.