Page Manager: Webmaster
Last update: 9/11/2012 3:13 PM

Normalising Non-standardised Orthography in Algerian Code-switched User-generated Data

Conference paper
Authors: Wafia Adouane, Jean-Philippe Bernardy, Simon Dobnik
Published in: The 5th Workshop on Noisy User-generated Text (W-NUT), November 4, 2019, Hong Kong / Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi (Editors)
ISBN: 978-1-950737-84-0
Publisher: Association for Computational Linguistics
Place of publication: Stroudsburg, PA
Publication year: 2019
Published at: Department of Philosophy, Linguistics and Theory of Science
Language: en
Subject categories: Language Technology (Computational Linguistics)


We work with Algerian, an under-resourced non-standardised Arabic variety, for which we compile a new parallel corpus consisting of user-generated textual data matched with normalised and corrected human annotations following data-driven and our linguistically motivated standard. We use an end-to-end deep neural model designed to deal with context-dependent spelling correction and normalisation. Results indicate that a model with two CNN sub-network encoders and an LSTM decoder performs the best, and that word context matters. Additionally, pre-processing data token-by-token with an edit-distance based aligner significantly improves the performance. We get promising results for the spelling correction and normalisation, as a pre-processing step for downstream tasks, on detecting binary Semantic Textual Similarity.
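
The abstract mentions an edit-distance based aligner used to pre-process the data token by token, but this page does not describe how it works. The Python sketch below illustrates one generic way such an aligner could be built and is not the authors' implementation. It assumes unit-cost Levenshtein distance at the character level, a length-normalised substitution cost between tokens, and a fixed gap cost of 1.0 for unmatched tokens. The names edit_distance, token_cost, align_tokens and the gap parameter are all hypothetical and chosen for illustration only.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def token_cost(s: str, t: str) -> float:
    """Assumed substitution cost in [0, 1]: distance scaled by the longer token."""
    longest = max(len(s), len(t))
    return edit_distance(s, t) / longest if longest else 0.0


def align_tokens(source: List[str], target: List[str], gap: float = 1.0) -> List[Pair]:
    """Align two token sequences; None marks a token with no counterpart."""
    n, m = len(source), len(target)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + token_cost(source[i - 1], target[j - 1]),
                cost[i - 1][j] + gap,  # source token left unaligned
                cost[i][j - 1] + gap,  # target token left unaligned
            )

    # Walk back from the bottom-right cell to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j] ==
                cost[i - 1][j - 1] + token_cost(source[i - 1], target[j - 1])):
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            pairs.append((source[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs
```

Under these assumptions, calling align_tokens on a noisy token list and its normalised counterpart yields one pair per position, so a sequence model could be trained on token-level pairs rather than on whole unaligned sentences. Whether the paper's aligner uses the same costs or handles token splits and merges differently cannot be determined from the abstract.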
