| Name of Corpus | Source | Medium | Size | Purpose | Material |
| Buckwalter Arabic Corpus (1986-2003) | Tim Buckwalter | Written | 2.5 to 3 billion words | Lexicography | Public resources on the Web |
| Leuven Corpus (1990-2004) | Catholic University Leuven, Belgium | Written and spoken | | Arabic-Dutch / Dutch-Arabic learner’s dictionary | Internet sources, radio & TV, primary school books |
| Arabic Newswire Corpus (1994) | University of Pennsylvania LDC | Written | 80M words | Education and the development of technology | Agence France Presse, Xinhua News Agency, and Umma Press |
| CALLFRIEND Corpus (1995) | University of Pennsylvania LDC | Conversational | 60 telephone conversations | Development of language identification technology | Egyptian native speakers |
| | Nijmegen University | Written | Over 2M words | Arabic-Dutch / Dutch-Arabic dictionary | Magazines and fiction |
| CALLHOME Corpus (1997) | University of Pennsylvania LDC | Conversational | 120 telephone conversations | Speech recognition produced from telephone lines | Egyptian native speakers |
| CLARA (1997) | Charles University, Prague | Written | 50M words | Lexicographic purposes | Periodicals, books, internet sources from 1975-present |
| Egypt (1999) | Johns Hopkins University | Written | Unknown | Machine translation (MT) | A parallel corpus of the Qur’an in English and Arabic |
| Broadcast News Speech (2000) | University of Pennsylvania LDC | Spoken | More than 110 broadcasts | Speech recognition | News broadcasts from Voice of America radio |
| DINAR Corpus (2000) | Nijmegen Univ., SOTETEL-IT, co-ordination of Lyon2 Univ. | Written | 10M words | Lexicography, general research, NLP | Unknown |
| An-Nahar Corpus (2001) | ELRA | Written | 140M words | General research | An-Nahar newspaper (Lebanon) |
| Al-Hayat Corpus (2002) | ELRA | Written | 18.6M words | Language engineering and information retrieval | Al-Hayat newspaper (Lebanon) |
| Arabic Gigaword (2003) | University of Pennsylvania LDC | Written | Around 400M words | Natural language processing, information retrieval, language modelling | Agence France Presse, Al-Hayat news agency, An-Nahar news agency, Xinhua news agency |
| E-A Parallel Corpus (2003) | University of Kuwait | Written | 3M words | Teaching translation & lexicography | Publications from Kuwait National Council |
| General Scientific Arabic Corpus (2004) | UMIST, UK | Written | 1.6M words | Investigating Arabic compounds | http://www.kisr.edu.kw/science/ |
| Classical Arabic Corpus (CAC) (2004) | UMIST, UK | Written | 5M words | Lexical analysis research | www.muhaddith.org and www.alwaraq.com |
| Multilingual Corpus (2004) | UMIST, UK | Written | 10.7M words (Arabic 1M) | Translation | IT-specialized websites |
| SOTETEL Corpus | SOTETEL-IT, Tunisia | Written | 8M words | Lexicography | Literature, academic and journalistic material |