Page Manager: Webmaster
Last update: 9/11/2012 3:13 PM

The Swedish Culturomics Gigaword Corpus: A One Billion Word Swedish Reference Dataset for NLP

Conference paper
Authors Stian Rødven Eide
Nina Tahmasebi
Lars Borin
Published in Linköping Electronic Conference Proceedings. Digital Humanities 2016. From Digitization to Knowledge 2016: Resources and Methods for Semantic Processing of Digital Works/Texts, July 11, 2016, Krakow, Poland
ISBN 978-91-7685-733-5
ISSN 1650-3686
Publisher Linköping University Electronic Press
Place of publication Linköping
Publication year 2016
Published at Department of Swedish
Language en
Links https://spraakbanken.gu.se/resurs/g...
www.ep.liu.se/ecp/article.asp?issue...
Keywords nlp, corpus, culturomics
Subject categories Language Technology (Computational Linguistics)

Abstract

In this paper we present a dataset of contemporary Swedish containing one billion words. The dataset consists of a wide range of sources, all annotated using a state-of-the-art corpus annotation pipeline, and is intended to be a static and clearly versioned dataset. This will facilitate reproducibility of experiments across institutions and make it easier to compare NLP algorithms on contemporary Swedish. The dataset contains sentences from 1950 to 2015 and has been carefully designed to feature a good mix of genres balanced over each included decade. The sources include literary, journalistic, academic and legal texts, as well as blogs and web forum entries.
