Page manager: Web editorial team
Page updated: 2012-09-11 15:12

Shami: A Corpus of Levantine Arabic Dialects

Conference paper in proceedings
Authors: Chatrine (kathrein) Qwaider (abu kwaik)
Motaz Saad
Stergios Chatzikyriakidis
Simon Dobnik
Published in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 7-12, 2018, Miyazaki, Japan / editors: Nicoletta Calzolari (Conference chair), Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga
ISBN: 979-10-95546-00-9
Publisher: European Language Resources Association (ELRA)
Publication year: 2018
Published at: Department of Philosophy, Linguistics and Theory of Science
Language: en
Links: www.lrec-conf.org/proceedings/lrec2...
Keywords: dialectal Arabic, Levantine dialect corpus, dialect identification
Subject categories: Computational linguistics


Modern Standard Arabic (MSA) is the official language used in education and media across the Arab world, both in writing and in formal speech. In daily communication, however, several dialects are used, depending on the country and region as well as other social factors. With the emergence of social media, the amount of dialectal data on the Internet has increased, and the NLP tools that support MSA are not well suited to processing this data because of the differences between the dialects and MSA. In this paper, we construct the Shami corpus, the first Levantine Dialect Corpus (SDC), covering data from the four dialects spoken in Palestine, Jordan, Lebanon and Syria. We also describe rules for pre-processing the data without affecting its meaning, so that it can be processed by NLP tools. We choose Dialect Identification as the task on which to evaluate SDC and compare it with two other corpora. In this respect, experiments are conducted using different parameters based on n-gram models and Naive Bayes classifiers. SDC is larger than the existing corpora in terms of size, words and vocabularies. In addition, we use the performance on the Language Identification task to exemplify the similarities and differences between the individual dialects.
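
The abstract names n-gram models and Naive Bayes classifiers but does not give the exact features or parameters, so the following is only a minimal sketch of that general technique, not the authors' implementation. It assumes character n-grams, add-alpha smoothing, and a handful of placeholder strings with made-up labels (`dialect_1`, `dialect_2`) that stand in for real corpus sentences; all names in the code are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return the overlapping character n-grams of text, padded with spaces."""
    padded = " " + text.strip() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (add-alpha smoothing)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n          # n-gram length (an assumed, tunable parameter)
        self.alpha = alpha  # smoothing constant

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)    # documents per class, for priors
        self.counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {c: sum(self.counts[c].values()) for c in self.class_docs}
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior of each class for text."""
        n_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.class_docs:
            score = math.log(self.class_docs[c] / n_docs)
            denom = self.totals[c] + self.alpha * vocab_size
            for gram in char_ngrams(text, self.n):
                if gram in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((self.counts[c][gram] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Placeholder training data: NOT real dialect text, only toy strings.
train_texts = ["aaa aab aba", "aab aaa", "zzy zzz yzz", "zzz zyz"]
train_labels = ["dialect_1", "dialect_1", "dialect_2", "dialect_2"]

model = NgramNaiveBayes(n=2).fit(train_texts, train_labels)
print(model.predict("aab aba"))
print(model.predict("zzy zz"))
```

Varying `n` (and switching between character and word n-grams) is one way to run the kind of parameter comparison the abstract mentions; with real data, per-dialect confusion between predicted and true labels would indicate which dialects are most similar.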
