Entrance of the School of Computing, where I work

Latifa Al-Sulaiti's Homepage

  Arabic Web Concordancing

  • Corpora with a Web interface

 

-Leeds Internet Corpora

These large untagged corpora were developed by Serge Sharoff at the Centre for Translation Studies, University of Leeds. They were compiled from the Web using automated search engine queries. The corpora are available through a Web interface, and they can be accessed [here].

Arabic Internet Corpus 2006 | University of Leeds | Written | About 150M words | General research | Public resources on the Web
Arabic Wikipedia Corpus 2006 | University of Leeds | Written | About 150M words | General research | Wikipedia, the free online encyclopedia
Arabic Legal Corpus 2006 | University of Leeds | Written | 12M words | General research | Internet legal sources
Arabic Computer Science Corpus 2009 | University of Leeds | Written | 5M words | General research | Internet

 

Some tips on the use of Leeds Internet Corpora:

1. For basic queries type the word in the space and press Submit Query. For example, كتاب gives all the occurrences of the word. If you type كتاب.* you get examples such as: كتابي , كتابك , كتاباتي ...etc. If you type .*كتاب.*, you get examples such as الكتاب , الكتابه ,للكتاب ...etc.

2. If the word you are searching for has different spellings such as 'Google', use the following syntax: (جوجل)|(قوقل)|(غوغل).

3. Concordance lines are selected randomly, with the most frequent word form shown first, but no statistics are given to show its frequency.

4. Word frequencies appear only in the collocation results.

5. Frequency counts are sometimes inaccurate because some Web pages are duplicated in the corpus. Users need to check manually.

6. The Web interface gives 'keyword in context', with an option of retrieving the source document by clicking on the side arrow or Invert all.

7. There is no option for saving query results. Users have to save results manually by copying and pasting.

8. Only single words can be queried.

9. The Web interface provides statistical information such as the T-score, the Mutual Information (MI) score, and the Log-Likelihood score. Useful guidance on statistical measures of lexical association, such as the Mutual Information score and the T-score, can be found in Biber, D. et al. (1998), pp. 265-8.
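The query patterns in tips 1 and 2 behave like ordinary regular expressions, so they can be tried out offline before being typed into the Web interface. The following is a minimal Python sketch, not the interface's own implementation: the word list is invented for illustration, the helper `query` is hypothetical, and it assumes that a pattern must match a whole word (which is consistent with the examples in tip 1).

```python
import re

# Hypothetical word list standing in for corpus tokens (not real corpus data).
tokens = [
    "كتاب", "كتابي", "كتابك", "كتاباتي",
    "الكتاب", "للكتاب", "مكتبة",
    "جوجل", "قوقل", "غوغل",
]


def query(pattern, words):
    """Return the words that the pattern matches in full.

    Assumption: a single-word query must match the entire word,
    so re.fullmatch is used rather than re.search.
    """
    compiled = re.compile(pattern)
    return [word for word in words if compiled.fullmatch(word)]


exact = query("كتاب", tokens)                     # the bare word only
prefix = query("كتاب.*", tokens)                  # words beginning with the string
contains = query(".*كتاب.*", tokens)              # words containing the string
variants = query("(جوجل)|(قوقل)|(غوغل)", tokens)  # alternative spellings

for label, hits in [("exact", exact), ("prefix", prefix),
                    ("contains", contains), ("variants", variants)]:
    print(label, len(hits), " ".join(hits))
```

The prefix pattern picks up suffixed forms but not forms with the definite article, while the pattern with `.*` on both sides picks up both, which mirrors the difference between the two examples in tip 1.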

 

 

-ArabiCorpus

This untagged corpus was developed by Dilworth Parkinson. It is large and can be accessed on its website. Words can be searched in Arabic or Latin script. The website provides detailed instructions on searching.

Users need to register before using it, but they do not necessarily have to pay.

ArabiCorpus 2009 | Brigham Young University | Written | About 70M words | General research | Quran, medieval science, modern literature, novels, 1001 Nights, Penn Treebank, Egyptian colloquial, some newspapers

 

To use corpora for research, students and professional researchers have to follow the steps of the corpus-based approach:

  •  Identify your research questions.
  •  Select the statistical techniques that fit your research questions. You might not need to use all the statistical measures.
  •  Observe patterns and interpret results.
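To make the second step more concrete, the sketch below computes three association measures of the kind reported by corpus interfaces (MI score, T-score, and log-likelihood) for one node word and one collocate. It is only an illustration under stated assumptions: the formulas are the common textbook versions, the function names and all counts are invented, and a particular interface may compute its scores differently (for example by taking the collocation window size into account).

```python
import math


def expected(f_node, f_collocate, corpus_size):
    """Co-occurrence count expected if the two words were independent."""
    return f_node * f_collocate / corpus_size


def mi_score(observed, f_node, f_collocate, corpus_size):
    """Mutual Information: log2 of observed over expected co-occurrences."""
    return math.log2(observed / expected(f_node, f_collocate, corpus_size))


def t_score(observed, f_node, f_collocate, corpus_size):
    """T-score: (observed - expected) divided by the square root of observed."""
    e = expected(f_node, f_collocate, corpus_size)
    return (observed - e) / math.sqrt(observed)


def log_likelihood(observed, f_node, f_collocate, corpus_size):
    """Log-likelihood (G2) computed from a 2x2 contingency table."""
    o11 = observed                                       # node with collocate
    o12 = f_node - observed                              # node without collocate
    o21 = f_collocate - observed                         # collocate without node
    o22 = corpus_size - f_node - f_collocate + observed  # neither word
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    cells = [(o11, row1, col1), (o12, row1, col2),
             (o21, row2, col1), (o22, row2, col2)]
    total = 0.0
    for o, row, col in cells:
        if o > 0:  # an empty cell contributes nothing to the sum
            total += o * math.log(o / (row * col / corpus_size))
    return 2 * total


# Invented counts: a 1,000,000-word corpus, a node word occurring 1,000 times,
# a collocate occurring 500 times, and 50 co-occurrences of the two.
N, F_NODE, F_COLL, OBSERVED = 1_000_000, 1_000, 500, 50

mi = mi_score(OBSERVED, F_NODE, F_COLL, N)
t = t_score(OBSERVED, F_NODE, F_COLL, N)
ll = log_likelihood(OBSERVED, F_NODE, F_COLL, N)
print(f"MI = {mi:.2f}, T-score = {t:.2f}, log-likelihood = {ll:.2f}")
```

MI tends to favour rare but exclusive pairings, while the T-score favours frequent ones, which is one reason to choose the measure according to the research question rather than reporting all of them.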

 

 

Selected Books and Online Articles on Corpora and Language

These resources introduce you to some of the most common statistical techniques used in corpus-based studies and how to report the results. Different sample studies which cover lexicography, grammar, discourse, and register variation are presented.

Books

Biber, D. et al. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press.

Hunston, S. (2002).  Corpora in Applied Linguistics (Cambridge Applied Linguistics).  Cambridge University Press.

Ghadessy, M. et al. (eds). (2001). Small Corpus Studies and ELT: Theory and Practice. John Benjamins Publishing Company.

Kennedy, G. (1998). An Introduction to Corpus Linguistics (Studies in Language and Linguistics). Longman.

McEnery, T. et al. (2005). Corpus-Based Language Studies. Routledge.

Oakes, M. (1998). Statistics for Corpus Linguistics. Columbia University Press.

Stubbs, M. (1996). Text and Corpus Analysis: Computer-Assisted Studies of Language and Institutions (Language in Society). Wiley-Blackwell.

Tognini-Bonelli, E. (2001). Corpus Linguistics at Work.  Amsterdam: John Benjamins Publishing Company.

 

Online Articles

Cobb, T. Is there any measurable learning from hands-on concordancing? System 25 (3), 301-315.

Hadley, G. (forthcoming). 'Sensing the Winds of Change: An Introduction to Data-driven Learning'. To appear in Insights 2 (seen online March 14, 2009).

Kennedy, C. & Miceli, T. (2001). An evaluation of intermediate students' approaches to corpus investigation. Language Learning & Technology, Vol. 5, No. 3, September 2001, pp. 77-90.

Pinna, A. (2002). Corpus techniques at work in the ELT classroom. Annali della Facoltà di Lingue e Letterature Straniere, Vol. 2, pp. 35-59.

Stevens, V. (1995), 'Concordancing with Language Learners: Why? When? What?', CAELL Journal 6/2, pp. 2-10.  

Thompson, P. & Tribble, C. (2001). Looking at citations: using corpora in English for academic purposes. Language Learning & Technology, Vol. 5, No. 3, September 2001, pp. 91-105.

Tribble, C. (1997). 'Improvising Corpora for ELT: Quick and Dirty Ways of Developing Corpora for Language Teaching'. In B. Lewandowska-Tomaszczyk and J. Melia (eds) Proceedings of the First International Conference on Practical Applications in Language Corpora.

 


Useful Websites

 

  • Corpora and Concordancing, University of Warwick
  • ESL Teaching and  Learning Resources: Concordancing  & Collocations
  • Corpora for Language Learning and Teaching
  • Devoted to Corpora
  • Classroom Concordancing/Data-driven learning Bibliography
  • Concordancing in ELT
  • Text Corpora and Corpus Linguistics
  • Corpus Linguistics
  • Bibliography of Concordance, Collocation, Corpus and Vocabulary related books
  • Tim Johns' Page




     

    Last Modified: March 9, 2009 9:00 AM