Page Manager: Webmaster
Last update: 9/11/2012 3:13 PM

Normalising Non-standardised Orthography in Algerian Code-switched User-generated Data

Conference contribution
Authors Wafia Adouane
Jean-Philippe Bernardy
Simon Dobnik
Published in The 5th Workshop on Noisy User-generated Text (W-NUT), November 4, 2019, Hong Kong
Publication year 2019
Published at Department of Philosophy, Linguistics and Theory of Science
Language en
Links https://www.aclweb.org/anthology/D1...
Subject categories Language Technology (Computational Linguistics)

Abstract

We work with Algerian, an under-resourced non-standardised Arabic variety, for which we compile a new parallel corpus consisting of user-generated textual data matched with normalised and corrected human annotations following data-driven and our linguistically motivated standard. We use an end-to-end deep neural model designed to deal with context-dependent spelling correction and normalisation. Results indicate that a model with two CNN sub-network encoders and an LSTM decoder performs the best, and that word context matters. Additionally, pre-processing data token-by-token with an edit-distance based aligner significantly improves the performance. We get promising results for the spelling correction and normalisation, as a pre-processing step for downstream tasks, on detecting binary Semantic Textual Similarity.
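
The abstract mentions token-by-token pre-processing with an edit-distance based aligner but does not describe it on this page. The following is only a minimal sketch of one way such an aligner could work, not the authors' implementation. It assumes whitespace-tokenised input, a hypothetical gap cost of 1.0, and a substitution cost equal to the character-level Levenshtein distance divided by the longer token's length. The function names `char_edit_distance` and `align_tokens` and the placeholder tokens are illustrative and do not come from the paper or its corpus.

```python
from typing import List, Optional, Tuple


def char_edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def align_tokens(
    src: List[str], tgt: List[str], gap_cost: float = 1.0
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Align two token sequences; None marks an unmatched token.

    Assumption: gap_cost and the normalised substitution cost are
    illustrative choices, not values taken from the paper.
    """
    def sub_cost(s: str, t: str) -> float:
        longest = max(len(s), len(t))
        return char_edit_distance(s, t) / longest if longest else 0.0

    n, m = len(src), len(tgt)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap_cost
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),
                cost[i - 1][j] + gap_cost,   # source token left unmatched
                cost[i][j - 1] + gap_cost,   # target token left unmatched
            )

    # Walk back from the bottom-right cell to recover the alignment.
    pairs: List[Tuple[Optional[str], Optional[str]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1])):
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap_cost:
            pairs.append((src[i - 1], None))
            i -= 1
        else:
            pairs.append((None, tgt[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Placeholder tokens, not data from the corpus.
noisy = ["hello", "wrld", "xx"]
clean = ["hello", "world"]
alignment = align_tokens(noisy, clean)
```

In this sketch each noisy token is paired with its closest normalised counterpart, and tokens with no plausible match are paired with `None`, so a downstream model could be trained on token-level pairs rather than on whole unaligned sequences.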
