Poor man's OCR post-correction: Unsupervised recognition of variant spelling applied to a multilingual document collection

Conference paper
Authors Harald Hammarström, Shafqat Virk, Markus Forsberg
Published in DATeCH2017, Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, Göttingen, Germany — June 01 - 02, 2017
ISBN 978-1-4503-5265-9
Publisher Association for Computing Machinery (ACM)
Place of publication New York
Publication year 2017
Published at Department of Swedish
Language en
Links dx.doi.org/10.1145/3078081.3078107
Keywords Multilingual, OCR, Unsupervised
Subject categories Languages and Literature

Abstract

© 2017 Copyright held by the owner/author(s).

The accuracy of Optical Character Recognition (OCR) sets the limit for the success of subsequent applications in a text-analysis pipeline. Recent models of OCR post-processing significantly improve the quality of OCR-generated text but require engineering work or resources such as human-labeled data or a dictionary to perform with such accuracy on novel datasets. In the present paper we introduce a technique for OCR post-processing that runs off-the-shelf with no resources or parameter tuning required. In essence, words that are similar in form and also distributionally more similar than expected at random are deemed OCR variants. As such, it can be applied to any language or genre (as long as the orthography segments the language at the word level). The algorithm is illustrated and evaluated using a multilingual document collection and a benchmark English dataset.
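The core idea stated in the abstract (words that look alike and also share more contexts than chance would predict) can be illustrated with a small Python sketch. This is not the authors' algorithm, whose details are not given on this page. Everything here is an assumption made for illustration: the function names, the use of Levenshtein distance as the form-similarity test, neighbouring words as the "distribution", the expected overlap computed as |A|·|B|/V, and the `max_dist`, `window`, and `min_len` settings.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def context_sets(sentences, window=1):
    """Map each word to the set of word types seen within `window` tokens."""
    contexts = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word].add(tokens[j])
    return contexts


def candidate_variants(sentences, max_dist=1, window=1, min_len=4):
    """Return word pairs that are close in form and share more contexts
    than expected if their context sets were drawn at random.

    Assumption of this sketch: the chance baseline is |A| * |B| / V, where
    A and B are the two context sets and V is the number of word types.
    """
    contexts = context_sets(sentences, window)
    vocab_size = len(contexts)
    # Very short words collide by accident, so they are skipped (hypothetical cutoff).
    words = [w for w in sorted(contexts) if len(w) >= min_len]
    pairs = []
    for a, b in combinations(words, 2):
        if abs(len(a) - len(b)) > max_dist or edit_distance(a, b) > max_dist:
            continue
        shared = len(contexts[a] & contexts[b])
        expected = len(contexts[a]) * len(contexts[b]) / vocab_size
        if shared > expected:
            pairs.append((a, b))
    return pairs


# Toy corpus in which "langnage" plays the role of an OCR misreading.
sample = [
    "the language is spoken in the north".split(),
    "the langnage is spoken in the south".split(),
    "a grammar of the language was printed".split(),
    "a grammar of the langnage was printed".split(),
]
print(candidate_variants(sample))  # [('langnage', 'language')]
```

On the toy corpus, "north" and "south" are not paired because they differ by two edits, while "language" and "langnage" differ by one edit and occur in identical contexts. A real collection would need a more careful chance model than this simple overlap count.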
