Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Journal article
Authors Mark J. Hill
Simon Hengchen
Published in Digital Scholarship in the Humanities
Volume 34
Pages 825-843
ISSN 2055-7671
Publication year 2019
Published at Department of Swedish
Language English
Subject categories Computer and Information Science, Other Humanities


This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.