Page Manager: Webmaster
Last update: 9/11/2012 3:13 PM

Modelling large parallel corpora: The Zurich Parallel Corpus Collection

Conference paper
Authors Johannes Graën
Tannon Kew
Anastassia Shaitarova
Martin Volk
Published in Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019 / Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, Caroline Iliadi (eds.)
Publisher Leibniz-Institut für Deutsche Sprache
Place of publication Mannheim
Publication year 2019
Published at Department of Swedish
Language en
Keywords parallel corpora, corpus encoding, corpus annotation, corpus standardisation, document alignment, sentence alignment, word alignment
Subject categories Computational linguistics


Text corpora come in many different shapes and sizes and carry heterogeneous annotations, depending on their purpose and design. The true benefit of corpora is rooted in their annotation, and the method by which this data is encoded is an important factor in their interoperability. We have accumulated a large collection of multilingual and parallel corpora and encoded it in a unified format which is compatible with a broad range of NLP tools and corpus linguistic applications. In this paper, we present our corpus collection and describe a data model and the extensions to the popular CoNLL-U format that enable us to encode it.