Entrance of the School of Computing, where I work

Latifa Al-Sulaiti's Homepage


Arabic Corpora

  • Non-free
Name of Corpus | Source | Medium | Size | Purpose | Material
Buckwalter Arabic Corpus (1986-2003) | Tim Buckwalter | Written | 2.5 to 3 billion words | Lexicography | Public resources on the Web
Leuven Corpus (1990-2004) | Catholic University Leuven, Belgium | Written and spoken | 3M words (spoken: 700,000) | Arabic-Dutch / Dutch-Arabic learner's dictionary | Internet sources, radio & TV, primary school books
Arabic Newswire Corpus (1994) | University of Pennsylvania LDC | Written | 80M words | Education and the development of technology | Agence France Presse, Xinhua News Agency, and Umma Press
CALLFRIEND Corpus (1995) | University of Pennsylvania LDC | Conversational | 60 telephone conversations | Development of language identification technology | Egyptian native speakers
Nijmegen Corpus (1996) | Nijmegen University | Written | Over 2M words | Arabic-Dutch / Dutch-Arabic dictionary | Magazines and fiction
CALLHOME Corpus (1997) | University of Pennsylvania LDC | Conversational | 120 telephone conversations | Speech recognition over telephone lines | Egyptian native speakers
CLARA (1997) | Charles University, Prague | Written | 50M words | Lexicographic purposes | Periodicals, books, internet sources from 1975-present
Egypt (1999) | Johns Hopkins University | Written | Unknown | MT | A parallel corpus of the Qur'an in English and Arabic
Broadcast News Speech (2000) | University of Pennsylvania LDC | Spoken | More than 110 broadcasts | Speech recognition | News broadcasts from Voice of America radio
DINAR Corpus (2000) | Nijmegen Univ., SOTETEL-IT, coordinated by Lyon 2 Univ. | Written | 10M words | Lexicography, general research, NLP | Unknown
An-Nahar Corpus (2001) | ELRA | Written | 140M words | General research | An-Nahar newspaper (Lebanon)
Al-Hayat Corpus (2002) | ELRA | Written | 18.6M words | Language engineering and information retrieval | Al-Hayat newspaper (Lebanon)
Arabic Gigaword (2003) | University of Pennsylvania LDC | Written | Around 400M | Natural language processing, information retrieval, language modelling | Agence France Presse, Al-Hayat news agency, An-Nahar news agency, Xinhua news agency
E-A Parallel Corpus (2003) | University of Kuwait | Written | 3M words | Teaching translation & lexicography | Publications from Kuwait National Council
General Scientific Arabic Corpus (2004) | UMIST, UK | Written | 1.6M words | Investigating Arabic compounds | http://www.kisr.edu.kw/science/
Classical Arabic Corpus (CAC) (2004) | UMIST, UK | Written | 5M words | Lexical analysis research | www.muhaddith.org and www.alwaraq.com
Multilingual Corpus (2004) | UMIST, UK | Written | 10.7M words (Arabic 1M) | Translation | IT-specialized websites
SOTETEL Corpus | SOTETEL-IT, Tunisia | Written | 8M words | Lexicography | Literature, academic and journalistic material

 

  • Free

 

Name of Corpus | Source | Medium | Size | Purpose | Material | Availability
Corpus of Contemporary Arabic (CCA) (2004) | University of Leeds, UK | Written and spoken | Around 1M words | TAFL | Websites and online magazines | Available online
Arabic Blogs (2009) and a corpus builder application | Shereen Khoja, Pacific University, Oregon, USA | Written | 131,836 words | Investigating the use of colloquial Arabic and gender issues | 37 blogs about the death of a Saudi female journalist and blogger, Hadeel Alhodaif | The corpus and the corpus builder application can be obtained by contacting the owner
Essex Arabic Summaries Corpus (EASC 1.0) | Mahmoud El-Haj, University of Essex, UK | Written | - | - | 153 Arabic articles and 765 human-generated extractive summaries of the articles | A copy of the corpus is available on request

  • Under development

 

Name of Corpus | Source | Medium | Size | Purpose | Material
International Corpus of Arabic (ICA) (2008-2009) | Bibliotheca Alexandrina (BA) | Written | 100M words | General linguistic research | A wide range of sources from the Internet representing different Arabic regions

Related Articles:

Al-Ansary, S. et al. (2008). Building an International Corpus of Arabic (ICA): Progress of Compilation Stage. Bibliotheca Alexandrina.

Al-Ansary, S. et al. (2008). Towards analyzing the International Corpus of Arabic: Progress of Morphological Stage. Bibliotheca Alexandrina.

More information about written and spoken corpora can also be obtained from NEMLAR (Network for Euro-Mediterranean Language Resources). See the report by Mahtab Nikkhou and Khalid Choukri, "Report on Survey on Arabic Language Resources and Tools in the Mediterranean Countries".

 

Last Modified: February 20, 2010 11:00 AM