Page manager: Web editorial team
Page updated: 2012-09-11 15:12

Computational Linguistics Resources for Indo-Iranian Languages

Doctoral thesis
Author Shafqat Virk
Date of examination 2013-06-03
Opponent at public defense Dr. Pushpak Bhattacharyya, Department of Computer Science and Engineering, Indian Institute of Technology, Mumbai 400 076, India
ISBN 978-91-628-8706-3
Publisher University of Gothenburg
Place of publication Göteborg
Publication year 2013
Published at Department of Computer Science and Engineering (GU)
Language en
Links hdl.handle.net/2077/36665
Keywords Grammatical Framework, Indo-Iranian Languages, Resource Grammars
Subject categories Computational linguistics

Abstract

Can computers process human languages? During the last fifty years, two main approaches have been used to answer this question: data-driven (i.e. statistics-based) and knowledge-driven (i.e. grammar-based). The former relies on the availability of vast amounts of electronic linguistic data and the processing capabilities of modern computers, while the latter builds on grammatical rules and classical linguistic theories of language. In this thesis, we mainly use the second approach and describe the development of computational ("resource") grammars for six Indo-Iranian languages: Urdu, Hindi, Punjabi, Persian, Sindhi, and Nepali. We explore different lexical and syntactic aspects of these languages and build their resource grammars using the Grammatical Framework (GF), a type-theoretical grammar formalism tool. We also provide computational evidence of the similarities and differences between Hindi and Urdu, and report a mechanical development of a Hindi resource grammar starting from an Urdu resource grammar. We use a functor-style implementation that makes it possible to share the commonalities between the two languages. Our analysis shows that up to 94% of the implementation can be shared at the syntax level, whereas at the lexical level Hindi and Urdu differ in 18% of the basic words, in 31% of tourist phrases, and in 92% of school mathematics terms. Next, we describe the development of wide-coverage morphological lexicons for some of the Indo-Iranian languages. We use existing linguistic data from different resources (i.e. dictionaries and WordNets) to build uni-sense and multi-sense lexicons. Finally, we demonstrate how we used the reported grammatical and lexical resources to add support for Indo-Iranian languages in a few existing GF application grammars. These include the Phrasebook, the mathematics grammar library, and the Attempto controlled English grammar. Further, we give the experimental results of developing a wide-coverage, grammar-based translator for arbitrary text using these resources. These applications show the importance of such linguistic resources and open new doors for future research on these languages.
