NATURAL LANGUAGE PROCESSING / COMPUTATIONAL LINGUISTICS

Lecturer: Eric Atwell
website: http://www.comp.leeds.ac.uk/ai32

Adverts: Christmas fun; CompSoc

WEEK 1 Language and Text as Data
  • 01.ppt 01.pdf The Language Machine: ambiguity, applications
  • 02.ppt 02.pdf Corpus: text as data, tags, word tokens and types
  • Wikipedia: Computational linguistics, Natural Language Processing, Text Analytics, Corpus Linguistics, Corpus
  • Jurafsky, D; Martin, J. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Second Edition) Pearson International.
  • Atwell E. 1999. The Language Machine The British Council, London.

    WEEK 2 Lexical and Morphological Analysis
  • 03.ppt 03.pdf Words: tokenization and morphology
  • 04.ppt 04.pdf Word-counts and N-grams
  • Wikipedia: Word, Tokenization, Morphology (linguistics), n-gram
  • Atwell, E, Roberts, A. 2005. Combinatory Hybrid Elementary Analysis of Text Proc MorphoChallenge'2005, Venice, Italy.
  • Sawalha, M, Atwell, E. 2010. Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text. Proc LREC'2010 Language Resources and Evaluation Conference, Valletta, Malta.

    WEEK 3 Part-of-Speech Tags and Taggers
  • 05.ppt 05.pdf Word bi-grams and PoS-tags
  • 06.ppt 06.pdf Machine Learning PoS-taggers
  • 07.ppt 07.pdf PoS-Tagging theory and terminology
  • Wikipedia: Part of speech, Part-of-speech tagging
  • Leech, G; Garside, R; Atwell, E. 1983. The Automatic Grammatical Tagging of the LOB Corpus ICAME Journal, vol. 7, pp.13-33.
  • Atwell, E. 2008. Development of tag sets for part-of-speech tagging in: Ludeling A, Kyto M (eds.) Corpus Linguistics: An International Handbook, Vol. 1, pp.501-526. Mouton de Gruyter.

    WEEK 4 Classifying Text by Machine Learning
  • 08.ppt 08.pdf World Wide English: intro to cw
  • 09.ppt 09.pdf Data Mining methodology: CRISP-DM
  • 10.ppt 10.pdf Machine Learning in practice with WEKA
  • Leeds University: World Wide English Corpus
  • Log-likelihood calculator and spreadsheet incorporating the log-likelihood calculation: LL.xls
  • Wikipedia: CRISP-DM Cross Industry Standard Process for Data Mining, Weka (machine learning), Feature selection
  • Atwell, E; Al-Sulaiti, L; Sharoff, S. 2009. Arabic and Arab English in the Arab World, Proc CL'2009 International Conference on Corpus Linguistics, Liverpool, UK.
  • Atwell, E; Arshad, J; Lai, C; Nim, L; Rezapour Ashregi, N; Wang, J; Washtell, J. 2007. Which English dominates the World Wide Web, British or American? Proc CL'2007 International Conference on Corpus Linguistics, Birmingham, UK.
  • Abu Shawar, B; Atwell, E. 2005. A chatbot system as a tool to animate a corpus. ICAME Journal, vol. 29, pp.5-24.

    WEEK 5 Information Retrieval
  • 11.ppt 11.pdf Information Retrieval: set v vector models
  • 12.ppt 12.pdf Query broadening to improve IR
  • Wikipedia: Information Retrieval, Vector space model, Google
  • Manning, C; Raghavan, P; Schutze, H. 2008. Introduction to Information Retrieval Cambridge University Press.
  • Abu Shawar, B; Atwell, E; Roberts, A. 2005. FAQchat as an Information Retrieval system Proc 2nd Language and Technology Conference, Poznan, Poland. pp.274-278.

    WEEK 6 (revision/catching up?!)

    WEEK 7 Grammar and Parsing: Syntactic Structures
  • 13.ppt 13.pdf Formal English grammar
  • 14.ppt 14.pdf Parsing: finding grammatical structure
  • Wikipedia: Grammar, English grammar, Dependency grammar, Parsing, Parse tree, Syntactic Structures
  • Atwell, E; Demetriou, G; Hughes, J; Schiffrin, A; Souter, C; Wilcock, S. 2000. A comparative evaluation of modern English corpus grammatical annotation schemes ICAME Journal, vol.24, pp.7-23.
  • Atwell, E. 1996. Comparative evaluation of grammatical annotation models in: Sutcliffe, R; Koch, H; McElligott, A (eds.) Industrial Parsing of Software Manuals, pp.25-46. Rodopi, Amsterdam.

    WEEK 8 Computational (Lexical) Semantics
  • 15.ppt 15.pdf Lexical semantics and word sense disambiguation
  • 16.ppt 16.pdf Chunking - shallow parsing
  • Wikipedia: Lexical semantics, Word-sense disambiguation, Shallow parsing
  • Banko, M; Brill, E. 2001. Scaling to Very Very Large Corpora for Natural Language Disambiguation Proc ACL/EACL Association for Computational Linguistics conference, Toulouse, France.
  • Demetriou, G; Atwell, E. 2001. A domain-independent semantic tagger for the study of meaning associations in English text Proc IWCS-4 Fourth International Workshop on Computational Semantics, pp.67-80. Tilburg, Netherlands.
  • Brierley, C; Atwell, E. 2007. An approach for detecting prosodic phrase boundaries in spoken English ACM Crossroads Journal, vol. 14.

    WEEK 9 Useful NLP/CL: Information Extraction, Machine Translation
  • 17.ppt 17.pdf Information Extraction, Named Entity Recognition
  • 18.ppt 18.pdf NLP for other languages, Machine Translation
  • Wikipedia: Information Extraction, Named Entity Recognition, General Architecture for Text Engineering, Machine Translation, Statistical Machine Translation, Google Translate
  • Hina, S; Atwell, E; Johnson, O. 2010. Semantic Tagging of Medical Narratives with Top Level Concepts from SNOMED CT Healthcare Data Standard International Journal of Intelligent Computing Research (IJICR), Vol.1.3, pp.118-123.

    WEEK 10 NLP/CL research: Detecting Terrorist Activities, Understanding the Quran
  • 19.ppt 19.pdf Text Analytics for Detecting Terrorist Activities
  • 20.ppt 20.pdf Corpus Linguistics for Understanding the Quran
  • Wikipedia: Centre for Protection of National Infrastructure, Surveillance, Quran, Quranic Arabic Corpus, Quran translations

    WEEK 11 Review and Exam Preparation
  • Example past exam papers from 2009, 2010, 2011
  • 21.ppt 21.pdf Review: summary of the course

    WEEK 12 Online resources for NLP/CL: Google, Youtube, videolectures.net, ...
  • 22.ppt 22.pdf Google Tech-Talk: Theorizing from Data

    Exercise 1: Example Youtube video Deadline: Week 4, Friday 21/10/11
  • Choose a small subtopic in NLP/CL, eg from Wikipedia
  • Make a SHORT (1-2 minutes) video on this topic. An easy way to do this is: make a short PowerPoint presentation; Save as JPG (save each slide as a separate file); in MS Movie Maker, load the slides as a sequence of still images; add audio narration; Save movie to your desktop. If you prefer, you can use other methods to make the movie.
  • Register a Youtube account; upload the movie to Youtube; then
  • EMAIL e.s.atwell@leeds.ac.uk with the URL of your movie, AND state whether you agree to let me add this URL to the course website, to let other students see it.
    EXAMPLES: NLP overview, Artificial Intelligence in Fiction, Machine Learning in NLP, Parsing, Part-of-Speech Tagging, Speech Recognition, Summarization, Words.

    Exercise 2: coursework.doc (see also: 08.ppt 08.pdf) Deadline: Week 8, Friday 18/11/11 ... In brief:
  • if possible work with a partner of your choice - to help each other.
  • select a country, and 2 "nearby" ones for comparison.
  • select some features - words - which appear frequently in British English but not American English, or vice versa. The World Wide English Corpus website has an example ukus.arff which includes centre, center, colour, color - you can just use these.
  • design a "decision procedure" or test which will decide if a sample is UK or US, based on the data from the UK and US samples. If you use my ukus.arff, then this could simply be something like a decision tree "is frequency("color")>3?" yes --> US, no --> UK
  • work out how many of the UK and US samples are correctly classified by this - it may be fewer than all 20 (i.e. less than 100%)
  • work out how your 3 countries are classified by your classifier e.g. decision tree: for each country, is it predicted to prefer UK or US English?
  • note down in outline what you have done - Intro, Methods, Results, Conclusions - as a set of powerpoint slides
  • also grab some screenshots, e.g. of the WWE corpus, and your method and results
  • save as .jpg images, then copy these into Movie Maker to make a video
  • add narration: talk us through the slides, like a short lecture
  • save video, upload to youtube, then email me the URL
    The detailed instructions contain some more steps (mainly to give keen students some added challenge!) but the above is a basic summary of what you need to do.
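
The "decision procedure" step above can be sketched in a few lines of Python. This is only an illustration, assuming the one-question rule given as an example (is frequency("color") > 3?). The function names word_frequencies, classify and accuracy are hypothetical, and the toy counts are invented for the example; they are NOT taken from ukus.arff or the World Wide English Corpus.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased word tokens in a text sample."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def classify(freqs, threshold=3):
    """One-question decision tree: frequency("color") > threshold -> US, else UK."""
    return "US" if freqs["color"] > threshold else "UK"


def accuracy(labelled_samples, threshold=3):
    """Fraction of (freqs, label) pairs that the rule classifies correctly."""
    correct = sum(
        1 for freqs, label in labelled_samples
        if classify(freqs, threshold) == label
    )
    return correct / len(labelled_samples)


# Invented toy counts for illustration only (not real corpus data).
toy_samples = [
    (Counter({"color": 12, "colour": 0}), "US"),
    (Counter({"color": 1, "colour": 9}), "UK"),
    (Counter({"color": 2, "colour": 1}), "US"),  # the rule gets this one wrong
    (Counter({"color": 0, "colour": 7}), "UK"),
]

print(accuracy(toy_samples))  # 3 of the 4 toy samples are classified correctly
```

To apply the same rule to one of your chosen countries, you could pass the text of its sample through word_frequencies and then classify. A single threshold on one word is deliberately crude; comparing several word pairs (centre/center, colour/color) would likely be more robust.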

    Ideas for student projects applying NLP/CL