NATURAL LANGUAGE PROCESSING / COMPUTATIONAL LINGUISTICS
Lecturer: Eric Atwell
website:
http://www.comp.leeds.ac.uk/ai32
WEEK 1 Language and Text as Data
01.ppt 01.pdf
The Language Machine: ambiguity, applications
02.ppt 02.pdf
Corpus: text as data, tags, word tokens and types
Wikipedia:
Computational linguistics,
Natural Language Processing,
Text Analytics,
Corpus Linguistics,
Corpus
Jurafsky, D; Martin, J. 2008.
Speech and Language Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition (Second Edition)
Pearson International.
Atwell E. 1999.
The Language Machine The British Council, London.
WEEK 2 Lexical and Morphological Analysis
03.ppt 03.pdf
Words: tokenization and morphology
04.ppt 04.pdf
Word-counts and N-grams
Wikipedia:
Word,
Tokenization,
Morphology (linguistics),
n-gram
Atwell, E, Roberts, A. 2005.
Combinatory Hybrid Elementary Analysis of Text Proc MorphoChallenge'2005,
Venice, Italy.
Sawalha, M, Atwell, E. 2010.
Fine-Grain
Morphological Analyzer and Part-of-Speech Tagger for Arabic Text.
Proc LREC'2010 Language Resources and Evaluation Conference, Valletta, Malta.
WEEK 3 Part-of-Speech Tags and Taggers
05.ppt 05.pdf
Word bi-grams and PoS-tags
06.ppt 06.pdf
Machine Learning PoS-taggers
07.ppt 07.pdf
PoS-Tagging theory and terminology
Wikipedia:
Part of speech,
Part-of-speech tagging
Leech, G; Garside, R; Atwell, E. 1983.
The Automatic Grammatical Tagging of the LOB Corpus
ICAME Journal, vol. 7, pp.13-33.
Atwell, E. 2008.
Development of tag sets for part-of-speech tagging
in: Ludeling A, Kyto M (eds.) Corpus Linguistics: An International Handbook,
Vol. 1, pp.501-526. Mouton de Gruyter.
WEEK 4 Classifying Text by Machine Learning
08.ppt 08.pdf
World Wide English: intro to cw
09.ppt 09.pdf
Data Mining methodology: CRISP-DM
10.ppt 10.pdf
Machine Learning in practice with WEKA
Leeds University: World Wide English Corpus
Log-likelihood calculator and
spreadsheet incorporating the log-likelihood calculation: LL.xls
Wikipedia:
CRISP-DM Cross Industry Standard Process for Data Mining,
Weka (machine learning),
Feature selection
Atwell, E; Al-Sulaiti, L; Sharoff, S. 2009.
Arabic and Arab English in the Arab World,
Proc CL'2009 International Conference on Corpus Linguistics, Liverpool, UK.
Atwell, E; Arshad, J; Lai, C; Nim, L; Rezapour Ashregi, N; Wang, J;
Washtell, J. 2007.
Which English dominates the World Wide Web, British or American?
Proc CL'2007 International Conference on Corpus Linguistics, Birmingham, UK.
Abu Shawar, B; Atwell, E. 2005.
A chatbot system as a tool to animate a corpus.
ICAME Journal, vol. 29, pp.5-24.
WEEK 5 Information Retrieval
11.ppt 11.pdf
Information Retrieval: set v vector models
12.ppt 12.pdf
Query broadening to improve IR
Wikipedia:
Information Retrieval,
Vector space model,
Google
Manning, C; Raghavan, P; Schutze, H. 2008.
Introduction to Information Retrieval Cambridge University Press.
Abu Shawar, B; Atwell, E; Roberts, A. 2005.
FAQchat as an Information Retrieval system
Proc 2nd Language and Technology Conference, Poznan, Poland.
pp.274-278.
WEEK 6 (revision/catching up?!)
WEEK 7 Grammar and Parsing: Syntactic Structures
13.ppt 13.pdf
Formal English grammar
14.ppt 14.pdf
Parsing: finding grammatical structure
Wikipedia:
Grammar,
English grammar,
Dependency grammar,
Parsing,
Parse tree,
Syntactic Structures
Atwell, E; Demetriou, G; Hughes, J; Schiffrin, A; Souter, C; Wilcock, S.
2000. A comparative evaluation
of modern English corpus grammatical annotation schemes
ICAME Journal, vol.24, pp.7-23.
Atwell, E. 1996.
Comparative evaluation of grammatical annotation models
in: Sutcliffe, R; Koch, H; McElligott, A (eds.)
Industrial Parsing of Software Manuals, pp.25-46. Rodopi, Amsterdam.
WEEK 8 Computational (Lexical) Semantics
15.ppt 15.pdf
Lexical semantics and word sense disambiguation
16.ppt 16.pdf
Chunking - shallow parsing
Wikipedia:
Lexical semantics,
Word-sense disambiguation,
Shallow parsing
Banko, M; Brill, E. 2001.
Scaling to Very Very Large Corpora for Natural Language Disambiguation
Proc ACL/EACL Association for Computational Linguistics conference,
Toulouse, France.
Demetriou, G; Atwell, E. 2001.
A domain-independent semantic tagger for the study of meaning associations
in English text
Proc IWCS-4 Fourth International Workshop on Computational Semantics,
pp.67-80. Tilburg,
Netherlands.
Brierley, C; Atwell, E. 2007.
An approach for detecting prosodic phrase boundaries in spoken English
ACM Crossroads Journal, vol. 14.
WEEK 9 Useful NLP/CL: Information Extraction, Machine Translation
17.ppt 17.pdf
Information Extraction, Named Entity Recognition
18.ppt 18.pdf
NLP for other languages, Machine Translation
Wikipedia:
Information Extraction,
Named Entity Recognition,
General Architecture for Text Engineering,
Machine Translation,
Statistical Machine Translation,
Google Translate
Hina, S; Atwell, E; Johnson, O. 2010.
Semantic Tagging of Medical Narratives with Top Level Concepts from
SNOMED CT Healthcare Data Standard
International Journal of Intelligent Computing Research (IJICR), Vol.1.3,
pp118-123.
WEEK 10 NLP/CL research: Detecting Terrorist Activities, Understanding the Quran
19.ppt 19.pdf
Text Analytics for Detecting Terrorist Activities
20.ppt 20.pdf
Corpus Linguistics for Understanding the Quran
Wikipedia:
Centre for Protection of National Infrastructure,
Surveillance,
Quran,
Quranic Arabic Corpus,
Quran translations
WEEK 11 Review and Exam Preparation
Example past exam papers from 2009,
2010, 2011
21.ppt 21.pdf
Review: summary of the course
WEEK 12 Online resources for NLP/CL: Google, Youtube, videolectures.net, ...
22.ppt 22.pdf
Google Tech-Talk: Theorizing from Data
Exercise 1:
Example Youtube video
Deadline: Week 4, Friday 21/10/11
Choose a small subtopic in NLP/CL, e.g. from Wikipedia.
Make a SHORT (1-2 minutes) video on this topic. An easy way to do this is:
make a short PowerPoint presentation; Save as JPG (save each slide as a separate file);
in MS Movie Maker, load the slides as a sequence of still images; add audio narration;
Save movie to your desktop. If you prefer, you can use other methods to make the movie.
Register a Youtube account; upload the movie to Youtube; then
EMAIL e.s.atwell@leeds.ac.uk with URL of your movie, AND state whether you agree
to let me add this URL to the course website, to let other students see it.
EXAMPLES:
NLP overview,
Artificial Intelligence in Fiction,
Machine Learning in NLP,
Parsing,
Part-of-Speech Tagging,
Speech Recognition,
Summarization,
Words.
Exercise 2: coursework.doc
(see also: 08.ppt 08.pdf )
Deadline: Week 8, Friday 18/11/11 ... In brief:
if possible work with a partner of your choice - to help each other.
select a country, and 2 "nearby" ones for comparison.
select some features - words - which appear frequently in British English but
not American English, or vice versa. The
World Wide English Corpus website has an example
ukus.arff which includes centre, center, colour, color - you can just
use these.
design a "decision procedure" or test which will decide if a sample
is UK or US, based on the data from the UK and US samples. If you use
my ukus.arff, then this could simply be something like a decision tree
"is frequency("color")>3?" yes --> US, no --> UK
work out how many of the UK and US samples are correctly classified by
this - it may be less than 100% (all 20)
work out how your 3 countries are classified by your classifier e.g.
decision tree: for each country, is it predicted to prefer UK or US English?
note down in outline what you have done - Intro, Methods,
Results, Conclusions - as a set of powerpoint slides
also grab some screenshots, e.g. of the WWE corpus, and your method and results
save as .jpg images, then copy these into Movie Maker to make a video
add narration: talk us through the slides, like a short lecture
save video, upload to youtube, then email me the URL
The detailed instructions contain some more steps (mainly to give keen
students some added challenge!) but the above is a basic summary of what you
need to do.
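The "decision procedure" and accuracy check in the summary above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample counts below are invented for the example and are NOT the real contents of ukus.arff, and the names SAMPLES, classify and accuracy are hypothetical. The threshold of 3 is the one from the example decision tree above.

```python
# Hypothetical word counts per sample; invented for illustration,
# NOT the real data in ukus.arff.
SAMPLES = [
    {"label": "UK", "centre": 9, "center": 1, "colour": 7, "color": 0},
    {"label": "UK", "centre": 6, "center": 0, "colour": 5, "color": 2},
    {"label": "UK", "centre": 8, "center": 2, "colour": 4, "color": 1},
    {"label": "US", "centre": 1, "center": 9, "colour": 0, "color": 8},
    {"label": "US", "centre": 0, "center": 7, "colour": 1, "color": 5},
    {"label": "US", "centre": 2, "center": 4, "colour": 2, "color": 3},
]


def classify(sample, threshold=3):
    """One-node decision tree: is frequency("color") > threshold?

    yes --> US, no --> UK
    """
    return "US" if sample["color"] > threshold else "UK"


def accuracy(samples):
    """Return (number correctly classified, total number of samples)."""
    correct = sum(1 for s in samples if classify(s) == s["label"])
    return correct, len(samples)


if __name__ == "__main__":
    correct, total = accuracy(SAMPLES)
    print(f"{correct} of {total} UK/US samples correctly classified")

    # Hypothetical counts for a country you have chosen to investigate.
    my_country = {"centre": 5, "center": 3, "colour": 6, "color": 1}
    print("Predicted preference:", classify(my_country))
```

With real data you would replace SAMPLES with the counts from the UK and US samples, and run classify on the counts for each of your 3 countries. A single threshold will not always separate the classes, which is why the score may be less than 100%.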
Ideas for student projects applying NLP/CL