Majdi Sawalha @ Leeds |
Automatic Part-of-Speech Tagging of Arabic Text

Arabic is spoken and written as a first language by more than 300 million people, and more than 1 billion Muslims worldwide have an interest in learning it. Arabic is also one of the official languages of the United Nations. In the UK in particular, more than 1.6 million Muslims live and work, representing 2.8% of the population, and they have an interest in the Arabic language. Moreover, universities and other learning institutions have dedicated departments of Islamic and Middle Eastern studies, where Arabic is the main teaching language. The UK is also home to many leading international organizations that develop language resources such as dictionaries, lexicons, thesauri, WordNets, machine translation systems and learning materials for English and a variety of foreign languages, including Arabic.

Part-of-speech taggers and morphological analyzers have many applications. They serve as pre-processors for text, and many natural language applications rely on them to improve accuracy. A tagged corpus can be used to extract grammatical and linguistic information and to train machine learning algorithms. Furthermore, part-of-speech information is useful for information technology applications such as text indexing, information retrieval and speech processing.
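As a rough illustration of how a tagged corpus can train a tagger, the sketch below builds a most-frequent-tag baseline in plain Python. It is not the tagger developed in this project: the toy corpus, the transliterated word forms, the tag names (`VERB`, `NOUN`, `PREP`) and the function names are all assumptions made up for the example.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Learn, for each word, the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)   # word -> Counter of tags seen with it
    all_tags = Counter()            # overall tag frequencies
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            all_tags[tag] += 1
    lexicon = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    # Unknown words fall back to the most frequent tag overall.
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag each word with its learned tag, or the default for unseen words."""
    return [(word, lexicon.get(word, default_tag)) for word in words]


# Hypothetical toy corpus: transliterated words with illustrative tags,
# not the project's actual tag set or data.
TOY_CORPUS = [
    [("kataba", "VERB"), ("al-waladu", "NOUN"), ("al-darsa", "NOUN")],
    [("qara'a", "VERB"), ("al-waladu", "NOUN"), ("kitaban", "NOUN")],
    [("fi", "PREP"), ("al-bayti", "NOUN")],
]

lexicon, default_tag = train_most_frequent_tag(TOY_CORPUS)
tagged = tag(["kataba", "al-waladu", "risalatan"], lexicon, default_tag)
print(tagged)
```

A baseline like this ignores context and morphology entirely, which is exactly where morphological analysis and more capable taggers are needed for a language as morphologically rich as Arabic.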
Corpora are multifunctional resources that have been used by linguists, NLP researchers and researchers from other disciplines, such as literary stylistics. Few Arabic corpora are freely available, and annotated Arabic corpora are not freely available for research. Designing a large Arabic corpus that includes both vowelized and non-vowelized text, and making it available to researchers from different disciplines, would significantly advance Arabic computational linguistics and natural language processing. An annotated corpus has the key advantages of ease of exploitation, reusability, multi-functionality and explicit analyses. An Arabic corpus can be enriched with annotation automatically, semi-automatically or manually. Automatic annotation using high-accuracy morphological analyzers and part-of-speech taggers reduces annotation time and effort and produces a consistently annotated Arabic corpus.

Part-of-speech taggers and morphological analyzers are essential for detecting anomalous text. They are a key technology for text analytics, and they are essential language technology tools for supporting the upgrade of the current web to the semantic web (SW), because they provide automatic analysis of the linguistic structure of textual documents. Moreover, search engine technology, information extraction, and lexical and semantic analysis of Arabic text require part-of-speech taggers and morphological analyzers to identify the correct part of speech of each word in the first stage of text analysis.

The Automatic Part-of-Speech Tagging of Arabic Text project extends the projects of the language research group in the School of Computing at the University of Leeds. We have an ongoing interest in corpus-based research on Arabic. A new free Arabic corpus, the Corpus of Contemporary Arabic, has been developed, along with aConCorde, a new open-source concordance tool for analysing Arabic corpus texts. Further projects include working towards the integration of Arabic into the Python Natural Language Toolkit (NLTK), including software for morphological analysis and part-of-speech tagging of classical and contemporary Arabic texts, as represented by the Quran and our Corpus of Contemporary Arabic respectively, and developing more sophisticated software tools for question answering and query-by-concept for studying the Quran.
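To give a sense of what a concordance tool does with corpus text, here is a minimal keyword-in-context (KWIC) sketch in plain Python. It does not reflect how aConCorde is implemented: the function name, the `width` parameter and the transliterated sample tokens are assumptions for illustration only, and a real Arabic concordancer would also need to handle right-to-left display and vowelization.

```python
def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each match."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical transliterated sample text, split on whitespace.
TOKENS = "qara'a al-waladu kitaban thumma kataba al-waladu risalatan".split()

hits = concordance(TOKENS, "al-waladu", width=1)
for left, keyword, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>10} [{keyword}] {right}")
```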