
Page Manager: Webmaster
Last update: 9/11/2012 3:13 PM


Construction and Annotation of a Corpus of Contemporary Nepali

Journal article
Authors Y.P. Yadava, A. Hardie, R.R. Lohani, B.N. Regmi, S. Gurung, A. McEnery, Jens Allwood, P. Hall, A. Gurung
Published in Corpora
Volume 3
Pages 213-225
ISSN 1749-5032
Publication year 2008
Published at Department of Linguistics
Centre of Interdisciplinary Research/Cognition/Information
Language en
Subject categories Humanities, Specific Languages


In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali counterpart to the FLOB corpus and a broader collection of texts. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.
