Entrance of the School of Computing, where I work

Latifa Al-Sulaiti's Homepage




Arabic Corpora

Arabic Web Concordancing


Arabic CALL

Arabic Corpus-Based Studies

Researchers in Arabic Computational Linguistics



The main purpose of my research is to develop a prototype Corpus of Contemporary Arabic (CCA). The target users of this corpus will be language teachers, language engineers, foreign learners of Arabic and materials writers. The first step in designing the corpus was deciding which text types to include. For this reason I developed a questionnaire to help me identify suitable texts.

I derived my texts mainly from websites. I have identified several useful sites, which are listed in the 'links' section, and obtained my written texts from them. I have also included some spoken files obtained from Radio Qatar. However, the number of these files is very small, because inputting such files is time-consuming and the time allocated for the project was not enough to produce a good sample of spoken texts.

Before including any text in my corpus I had to obtain legal permission from the sources. I prepared a letter based on the one used by the British National Corpus (BNC) and sent it to the university's legal advisor to assess its legality and to get some advice. After amending the letter, I made an Arabic version to send to sources that prefer corresponding in Arabic. I have been granted permission to use a good number of website magazines and newspapers.

Based on the results of my questionnaire and the number and type of websites that granted me permission, I began collecting my corpus. To date, the corpus includes 842,684 words in 415 texts, covering some of the categories identified by the language teachers and language engineers.

You can download the entire corpus stripped of XML markup as a raw UTF-8 textfile: CCA_raw_utf8.txt. Alternatively, below are the XML marked-up files of the corpus, in a separate zipped file for each category:


Autobiography Autobiography.zip
Short Stories Short_Stories.zip
Children's Stories Children's Stories.zip
Economics Economics.zip
Education Education.zip
Health and Medicine Health and Medicine.zip
Interviews Interviews.zip
Politics Politics.zip
Recipes Recipes.zip
Religion Religion.zip
Sociology Sociology.zip
Science Science.zip
Sports Sports.zip
Tourist and Travel Tourist and Travel.zip
(sports, entertainment, education) A small group of files
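
For readers who want to work with the zipped category files directly, the following is a minimal Python sketch of how one might count the words in each file of an archive after removing the markup. It rests on several assumptions not stated on this page: that the files inside each zip are UTF-8 encoded, that stripping anything between angle brackets is an adequate way to discard the XML tags, and that splitting on whitespace is an acceptable definition of a "word". The function names are illustrative only, and the counts it produces will not necessarily match the figures quoted above.

```python
import io
import re
import zipfile

# Anything between angle brackets is treated as markup (an assumption;
# a real XML parser would be more careful about comments and CDATA).
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text: str) -> str:
    """Replace every tag-like span with a space, leaving the text content."""
    return TAG_RE.sub(" ", text)


def count_words_in_zip(zip_source) -> dict:
    """Return {member name: whitespace-delimited token count}.

    zip_source may be a file path or a binary file-like object.
    """
    counts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # Assumed encoding: UTF-8; undecodable bytes are replaced.
            text = archive.read(name).decode("utf-8", errors="replace")
            counts[name] = len(strip_markup(text).split())
    return counts


if __name__ == "__main__":
    # A tiny in-memory archive standing in for one of the category zips.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("sample.xml", "<text><p>مرحبا بكم في المدونة</p></text>")
    buffer.seek(0)
    print(count_words_in_zip(buffer))
```

To use it on a downloaded file, pass the path instead of the in-memory buffer, for example `count_words_in_zip("Recipes.zip")`.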

New addition to the corpus:

This table contains some new texts which are not annotated with XML markup. The texts belong to the science category and contain 101,214 words derived from the online magazine Science and Technology (Kuwait).

Science ScienceB.zip


Extension of the present project and call for participation:

As a result of this project, there is a plan to develop a prototype sampler corpus in the School of Computing at Leeds University, referred to as the 'International Corpus of Arabic (ICA)'. It will be designed on the principles of the International Corpus of English (ICE). This project is still an idea, but one important step is to find collaborators in the Arab world (the Gulf States, Saudi Arabia and Yemen; the Levantine countries (Jordan, Syria, Lebanon, Palestine); North Africa, Egypt and Sudan) who can participate in this project. If you are interested, please contact Eric Atwell or Latifa Al-Sulaiti.


Last modified: March 11, 2009 9:00 AM