Of course, you can come up with your own ideas and discuss these with potential supervisors. However, many applicants have a general interest in computing and language but do not have a well-defined project; if this is you, below are some project ideas suggested by Eric Atwell. If you like one of these suggestions, please send us a PhD application form to join our Language research group in the School of Computing at Leeds University.
Note that a good project proposal should include some references to other related research, at Leeds and elsewhere, to show how the proposal links to developments in the wider research community. The outlines below illustrate the wide range of sources of research papers accessible via the World Wide Web.
Chen H et al (eds). 2008. Terrorism Informatics: Knowledge Management and Data Mining for Homeland Security. Springer. springerlink
Wang F et al (eds). 2006. Intelligence and Security Informatics: Proc WISI 2006. Springer. springerlink
Schmid A. 2006. Forum on Crime And Society. United Nations Publications. Google Books link
Lowd D, Meek C. 2005. Adversarial learning. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM Portal link
Feinberg S. 2005. Homeland Insecurity: Datamining, Terrorism Detection, and Confidentiality. Bull. Internat. Stat. Inst. NISS link
Allanach J et al. 2004. Detecting, tracking, and counteracting terrorist networks via hidden Markov models. Proc IEEE Aerospace Conference. ieeexplore link
Elovici Y et al. 2004. Using Data Mining Techniques for Detecting Terror-Related Activities on the Web. Journal of Information Warfare. pdf
The topic of Detecting Terrorist Activities is new to our research group, but several areas of current and past research are relevant: research in Corpus Linguistics and Text Analytics, bootstrapping linguistic knowledge, resources and concepts from English and Arabic text; for example, Arabic text analytics; detecting hidden meanings in text; social and cultural text mining; detecting non-standard language variation; detecting hidden errors in text; plagiarism detection:
Arabic text analytics: Sawalha, M; Atwell, E. Comparative evaluation of Arabic language morphological analysers and stemmers. Proc COLING 2008;
Atwell E et al. Mapping Middle Eastern and North African diasporas: Arabic corpus linguistics research at the University of Leeds. Proc BRISMES Conference 2008;
Roberts, A; Al-Sulaiti, L; Atwell, E. aConCorde: Towards an open-source, extendable concordancer for Arabic. Corpora Journal 2006
Detecting hidden meanings in text: Abbas, N; Atwell, E. Qurany: A Tool to Search for Concepts in the Quran. Proc Corpus Linguistics 2009;
Elliott, J; Atwell, E. Is anybody out there?: the detection of intelligent and generic language-like features. Journal of the British Interplanetary Society 2000
Social and cultural text mining: Atwell, E; Brierley, C. Combining teaching & research in text-mining from social and cultural data. Proc Int Conf e-Social Science 2008
Detecting non-standard language variation: Atwell E et al. Which English dominates the World Wide Web, British or American? Proc Corpus Linguistics 2007;
Atwell, E et al. User-guided system development in ISLE: Interactive Spoken Language Education. Natural Language Engineering Journal 2000
Detecting hidden errors in text: Brierley, C; Atwell, E. An approach for detecting prosodic phrase boundaries in spoken English. ACM Crossroads Journal 2007;
Elliott, D; Atwell, E; Hartley, T. Using corpora to automatically detect untranslated and outrageous words in machine translation output. Proc Corpus Linguistics 2005;
Atwell,E. How to detect grammatical errors in a text without parsing it. Proc EACL 1987
Plagiarism detection: Atwell, E et al. Detecting student copying in a corpus of science laboratory reports. Proc Corpus Linguistics 2003
English is an international lingua franca, and in some specialised domains a restricted English is used to ease communication between non-native speakers; for example 'Aviation English' is used by Air Traffic Controllers and pilots world-wide (Churcher et al 1997), (Campbell-Laird 2004). Another advantage of Simplified English is that it is easier to translate into other languages, both for humans and Machine Translation systems (Hartley and Paris 2001), (Nyberg et al 2003). Simplified English may also assist readers who have difficulty with complex English, eg due to Dyslexia (Pedler 2007). Linguists have set up an online discussion group on Simplified English (linguistlist 2008).
The aim of this project is to develop a tool to 'simplify' input English texts. The first task is to identify what constitutes Simplified English or Controlled English, by reviewing the literature, including manuals and instructions for authors of Simplified English texts, and reports on past attempts to develop 'simplification' programs, eg the Simplified English Checker/Corrector (SECC) project. Also relevant is research on detecting errors in Machine Translation outputs, caused by complex and/or ambiguous input language (Elliott et al 2005), (Babych and Hartley 2008). Then, you will identify specific simplification rules which can be encoded and applied in your own computer program. Your Machine Assisted Translation system will be evaluated in terms of its usefulness for one or more applications of Controlled Languages. Two experts on Simplified English have agreed to be potential "clients" or "users" for the student project, to advise and/or evaluate the student work: Dr Bogdan Babych of the Centre for Translation Studies here at Leeds University, and Dr Mike Day of Rolls-Royce.
Allen, Jeffrey. 1999. Different kinds of Controlled Languages. In TC-Forum magazine, volume 1-99, pp. 4-5. http://www.tc-forum.org/topiccl/cl15diff.htm
ASD Simplified Technical English: FAQs. 2005. http://www.asd-ste100.org/pagina5.htm
ASD/ATA/AIA S1000D http://www.s1000d.org
Babych, Bogdan and Hartley, Anthony. 2008. Automated MT Evaluation for Error Analysis: Automatic Discovery of Potential Translation Errors for Multiword Expressions. Proceedings of the ELRA Workshop on Evaluation. In conjunction with LREC 2008. Marrakech, Morocco.
Campbell-Laird, Kitty. 2004. Aviation English: a review of the language of International Civil Aviation. Proceedings of IPCC 2004 International Professional Communication Conference, pp.253-261.
Churcher, Gavin; Eric Atwell, and Clive Souter. 1997. The semantic/pragmatic annotation of an air traffic control corpus for use in speech recognition. In: Ljung, Magnus (ed) Corpus-based Studies in English: Papers from Seventeenth International Conference on English Language Research on Computerized Corpora (ICAME 17), pp. 353-374. Rodopi.
Elliott, Debbie; Atwell, Eric; Hartley, Tony. 2005. Using corpora to automatically detect untranslated and outrageous words in machine translation output. In: Proceedings of Corpus Linguistics 2005.
Hartley, Anthony and Paris, Cecile. 2001. Translation, controlled languages, generation. In: Erich Steiner and Colin Yallop (eds), Exploring translation and multilingual text production: beyond content. pp.307-325. Berlin, Mouton de Gruyter
Hassen, Karen. 1998. Training People to Write in a Controlled Language: AECMA Simplified English. Paper presented at the Second International Workshop on Controlled Language Applications (CLAW98), Pittsburgh, Pennsylvania.
Kamprath, Christine, Eric Adolphson, Teruko Mitamura and Eric Nyberg. 1998. Controlled Language Multilingual Document Production: Experience with Caterpillar Technical English. In Proceedings of the Second International Workshop on Controlled Language Applications (CLAW98), pp.51-61. Pittsburgh, Pennsylvania: Language Technologies Institute, Carnegie Mellon University.
Linguist-list discussion group on Simplified English: http://listserv.linguistlist.org/archives/simplified-english.html
MacDonald, Maria Luisa. 2008. Simplified Technical English for all: A Customer-friendly Specification. Powerpoint presentation, Agustawestland. http://www.x-pubs.com/resources/2008conf/downloads/4X-Pubs2008_Maria_McDonald_Simplified_Technical_English_For_All.pdf
Nyberg, Eric, Mitamura, Teruko, and Huijsen, Willem-Olaf. 2003. Controlled language for authoring and translation. In: Harold Somers (ed), Computers and Translation: A translator's guide. 245-281. Amsterdam: John Benjamins.
Jennifer Pedler, 2007. Computer Correction of Real-word Spelling Errors in Dyslexic Text. PhD thesis, Birkbeck College, University of London. http://www.dcs.bbk.ac.uk/research/recentphds/pedler.pdf
SECC project: http://www.ccl.kuleuven.be/about/SECC_zonderdemo.html
Researchers at www.worldmapper.org have developed a novel tool to visualize various types of global data. For example, in their map of global population, countries are expanded or shrunk to show their population rather than land-area: India, China and Japan have large populations for their land mass, so are "blown up", whereas Canada and Australia have shrunk because their population density is low.
The WorldMapper team have produced maps for many other measurable global distributions (see http://www.worldmapper.org/textindex/text_index.html ). The next on their list is "language": they want to map distributions of different languages around the world. They plan to use standard Linguistics textbooks as their source of numbers. Your Project challenge is to find alternative data on WWW usage: numbers of web-pages for each language, for each country (Top Level Domain), comparing several different search engines (eg Google and Yahoo), to provide data for maps of language distribution on the WWW according to Google v Yahoo v another search engine. Overall statistics on total number of pages in English, German and a few other major languages are available but unreliable and out-of-date; your aim is to carry out a comprehensive, up-to-date survey, and get reliable statistics country-by-country on current WWW usage. This can be done by using the Advanced Search features of Google/Yahoo; a more sophisticated generic approach would be a python/java program linked to the Google/Yahoo API. If necessary, we may have access to advice from VP Engineering at Yahoo, a Leeds graduate (see http://www.comp.leeds.ac.uk/nlp )
For the visualization, you have the option of collecting the data in CSV or spreadsheet format, and then emailing this to the WorldMapper project, for them to display on their website; you should also experiment with world-mapping software yourself, to evaluate how easy it is to use and how well this works. Ideally you could integrate web-data-mining and map-visualisation into a tool for future re-use.
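As a rough illustration of the data-collation step, the sketch below turns per-country, per-language page counts into CSV rows with each language's share of its Top Level Domain. The counts are invented placeholders, the column names are assumptions, and no search-engine API is called, since the details of those APIs are not given here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical page counts keyed by (top-level domain, language).
# The numbers are invented placeholders, not survey results.
COUNTS = {
    ("uk", "English"): 900,
    ("uk", "Welsh"): 100,
    ("de", "German"): 750,
    ("de", "English"): 250,
}

def language_shares(counts):
    """Return one row per (tld, language) with its share of that tld's pages."""
    totals = defaultdict(int)
    for (tld, _language), pages in counts.items():
        totals[tld] += pages
    rows = []
    for (tld, language), pages in sorted(counts.items()):
        rows.append({
            "tld": tld,
            "language": language,
            "pages": pages,
            "share": round(pages / totals[tld], 3),
        })
    return rows

def to_csv(rows):
    """Serialise the rows as CSV text (column names are an assumption)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["tld", "language", "pages", "share"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = language_shares(COUNTS)
csv_text = to_csv(rows)
```

Keeping the raw counts alongside the computed shares means the same file could feed either a mapping tool or later statistical analysis.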
Evaluation would consult potential users of the language maps at Leeds University and elsewhere, eg researchers in Diaspora - how languages and cultures spread across a region or the world. For example, Leeds University researchers recently hosted the BriSMES international conference on Mapping Middle Eastern and North African Diasporas ; and the Arts and Humanities Research Council runs a research programme on Diasporas, Migration and Identities, coordinated by Prof Kim Knott of Leeds University.
The main "deliverable" of this project is not only software, but also a Web-mined dataset suitable for the WorldMapper visualization tool - so this project will focus on design, structure, collation, cleaning, formatting of a large data warehouse, and analysis and evaluation of the resulting data and visualizations. The results will be published not just in Corpus Linguistics and Visualization journals, but also to the language and diaspora research community.
Worldmapper project: http://www.worldmapper.org/
List of topics mapped: http://www.worldmapper.org/textindex/text_index.html
BriSMES Conference on Mapping Middle Eastern and North African Diasporas http://www.dur.ac.uk/brismes/events_2008.htm
However, arguably the words of English do not always fit into a neat hierarchy. An alternative view is that each word should be linked to other related words, with a weight on the link showing the closeness in meaning; but instead of a hierarchy structure, these links can form an arbitrary graph. This is the way pages are "organised" on the WWW: there is no hierarchy of websites, just links. One way to represent this is by a matrix, with a row and column for every word in English, and a weight in each (x,y) cell denoting the strength of meaning-similarity between words x and y. Your task is to create such a matrix.
The weights can be estimated by a method which uses the online Longman Dictionary of Contemporary English. The relatedness between 2 words x and y can be seen by looking at the dictionary definitions of x and y: just count the number of words appearing in both definitions. This simple measure has been shown to give a useful metric of semantic relatedness of two words (Demetriou and Atwell 2001) and the semantic association between two web-pages (Duan and Atwell 2002). These papers calculated each word-pair weight dynamically; it would be more efficient to compute the entire matrix (for every possible English word-pair) in advance, and then save this matrix, as an open-source reusable resource: WordWeb will be your main project deliverable.
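The definition-overlap measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three toy definitions are invented stand-ins rather than LDOCE entries, the function names are hypothetical, and "words appearing in both definitions" is read here as distinct word types.

```python
import re
from itertools import combinations

# Hypothetical toy definitions, standing in for real dictionary entries.
DEFINITIONS = {
    "cat": "a small animal with fur that people keep as a pet",
    "dog": "a common animal with four legs that people keep as a pet",
    "bank": "a business that keeps and lends money",
}

def tokens(definition):
    """Lower-case a definition and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", definition.lower()))

def overlap(def_x, def_y):
    """Count the distinct words that appear in both definitions."""
    return len(tokens(def_x) & tokens(def_y))

def build_word_web(definitions):
    """Precompute the symmetric word-by-word weight matrix as a nested dict."""
    web = {word: {} for word in definitions}
    for x, y in combinations(sorted(definitions), 2):
        weight = overlap(definitions[x], definitions[y])
        web[x][y] = weight
        web[y][x] = weight
    return web

word_web = build_word_web(DEFINITIONS)
```

In this toy data "cat" and "dog" share eight definition words while "cat" and "bank" share only two, both of them function words ("a", "that"). A full-scale version might therefore want to discount very frequent words, and would need a sparse storage format, since a dense matrix over every English word-pair would be very large.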
You can go on to demonstrate some applications of WordWeb, to measure semantic distance between web-pages, and/or other documents.
The main "deliverable" of this project is not just software, but a data-structure - so this project will focus on design, structure and generation of a www-inspired word-web data-resource. Cognitive Science undergraduate student Danielle Hurley developed a basic prototype WordWeb; so your background research can start from this prototype and develop into a system sufficiently sophisticated and robust to integrate into an Open Source library such as NLTK, the python Natural Language ToolKit.
Background References:
Danielle Hurley, 2008. WordWeb: A Lexical Semantic Web Resource. Undergraduate Project report, School of Computing, Leeds University. http://www.comp.leeds.ac.uk/cgi-bin/fyproj/reports/0708/Hurley.pdf.gz
LDOCE dictionary database: http://www.comp.leeds.ac.uk/db32/ldoce
Demetriou G and Atwell E. 2001. A domain-independent semantic tagger for the study of meaning associations in English text. In Harry Bunt, Ielka van der Sluis and Elias Thijsse (editors), Proceedings of the Fourth International Workshop on Computational Semantics (IWCS-4) pp.67-80. Tilburg, Netherlands. ISBN: 90-74029-16-7. http://www.comp.leeds.ac.uk/eric/demetriou01iwcs.pdf
Duan, Xiao Yuan; Atwell, Eric. 2002. Semantic association between web pages - a lexical knowledge based method. In: Proceedings of the 5th Annual CLUK Colloquium: Computational Linguistics in the United Kingdom, (CLUK 2002, University of Leeds, UK), pp. 66-76.
A first step is to develop a system which can machine-learn word-internal
structure or morphology from a list of words extracted from a corpus.
The MorphoChallenge contest has been set up to encourage researchers
worldwide to tackle this problem. The objective of the MorphoChallenge is to
design a statistical machine learning algorithm that discovers which
morphemes (smallest individually meaningful units of language) words
consist of. Ideally, these are basic vocabulary units suitable for
different NLP tasks, such as text understanding, machine translation,
information retrieval, and statistical language modeling.
For instance, a subset of the test English word list looks like this:
...
1 barefoot's
2 barefooted
6699 feet
653 flies
2939 flying
1782 foot
64 footprints
...
The participants' task is to return a list containing exactly the same words
as in the input, with morpheme analyses provided for each word. The list
returned shall not contain the word frequency information. A submission
for the dataset of English words may look like this:
...
barefoot's BARE FOOT +GEN
barefooted BARE FOOT +PAST
feet FOOT +PL
flies FLY_N +PL, FLY_V +3SG
flying FLY_V +PCP1
foot FOOT
footprints FOOT PRINT +PL
...
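To make the input and output layout concrete, here is a minimal sketch that reads "frequency word" lines, drops the frequencies, and writes one analysis per word. The segmentation rule (split off the longest prefix that is itself a listed word) is a deliberately naive placeholder, not a contest-quality algorithm, and the single-space delimiter simply follows the sample above; the function names are hypothetical.

```python
def read_word_list(lines):
    """Parse 'frequency word' lines and return the words, dropping frequencies."""
    words = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            words.append(parts[1])
    return words

def segment(word, vocabulary, min_stem=3):
    """Split off the longest proper prefix that is itself a listed word."""
    for cut in range(len(word) - 1, min_stem - 1, -1):
        stem, rest = word[:cut], word[cut:]
        if stem in vocabulary:
            return [stem, rest]
    return [word]  # no attested prefix: leave the word unsplit

def analyse(lines):
    """Return one 'word morph morph ...' line per input word, in input order."""
    words = read_word_list(lines)
    vocabulary = set(words)
    return [" ".join([word] + segment(word, vocabulary)) for word in words]

sample = [
    "...",
    "1 barefoot's",
    "2 barefooted",
    "6699 feet",
    "653 flies",
    "2939 flying",
    "1782 foot",
    "64 footprints",
    "...",
]
result = analyse(sample)
```

On this sample the rule only splits "footprints" into "foot" and "prints"; it misses "feet" and "flies" entirely, which shows why a learned, statistical model of morphemes is needed.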
Morpho Challenge 2008 was a follow-up to our previous Morpho Challenge 2005 (Unsupervised Segmentation of Words into Morphemes) and Morpho Challenge 2007 (Unsupervised Morpheme Analysis). The task of Morpho Challenge 2009 will be similar to the Morpho Challenge 2008, where the aim was to find the morpheme analysis of the word forms in the data.
Developing an entry for an international Machine Learning contest may sound a
bit daunting, but the task is actually well laid out and fits a PhD
project well. The student would start by reviewing the entrants to previous MorphoChallenge contests, and examining the new task for 2009. We will design and implement a Leeds entry to the contest. Evaluation will use the evaluation "gold standards" and metrics provided by the organisers. If the entry works well,
we may be invited to present the system at a forthcoming MorphoChallenge
workshop - date and location to be decided.
This project aims to develop formalised computer representations or models of the concepts and abstractions involved in musicological research into the history and culture of concerts, enriching the source texts with analyses, interpretations and insights from researchers, to further knowledge and understanding.
The Concert Programmes project at the British Library aims to catalogue and archive music concert programmes held in the British Library and a number of other repositories in Britain and Ireland. Concert programmes are a primary source of information for historical and musicological research, but they were not documented at national or regional level in the UK and only rarely by holding institutions. Concert programmes represent one of the last major categories of material relevant to music research that has not been subject to systematic treatment. They do not fall within the scope of other current resource discovery projects in music: RILM, RISM, RIPM and RIdIM.
The Concert Life in Nineteenth-Century London Database project, based at Leeds University School of Music, differs from the Concert Programmes project in that it takes a detailed view of a few carefully selected years of performance activity within the period 1800-1914, confined to London, and drawing principally on newspaper and periodical notices and reviews as well as concert programmes and bills. The aim is to build a detailed description of concerts in the sample periods, supplementing the information from concert advertisements, programmes and reviews with other source materials where available, and enriching each concert database entry with expert interpretations of the data. The result is a richer resource for historical research, at a higher level of knowledge abstraction than the raw text of the source concert programmes: researchers can directly search for features of concerts such as date, venue, composer(s), performer(s), instrumentation, etc.
The concert is at the heart of both projects. In theory, it would be rewarding to combine aspects of the two projects: the rich analysis of concert data in the Concert Life database could be extended to further data in the Concert Programmes archive. However, the process is very labour-intensive, as researchers have to read each document and painstakingly extract the information from the text to type into the database. Some fields like date and venue may be relatively straightforward to identify visually; others like the specific items offered in the course of the performance itself require musicological expert analysis of the concert programme.
We want to investigate the extent to which techniques from Data Mining and Corpus Linguistics could be applied to partially automate the Data Warehouse capture process. The first step is to transcribe a selection of concert programmes to form an initial Concert Programme Corpus, a collection of texts with XML mark-up to tag salient features of the text. These should be concert programmes for which Database entries already exist. We can extract the database information in a common interchange format such as CSV, so that we can apply Data Mining and Information Extraction tools such as WEKA and GATE to capture the mapping from text and mark-up to database fields. This will also identify the fields which cannot be learnt from the programme text alone, to be explicitly marked as "optional extras" for future work. This will enable us to develop an intelligent computer-assisted concert programme capture tool, to speed up the process of extracting key information from concert programmes into a broader but "lighter" database: covering a wider selection from the Concert Programmes project, but with less initial detail than the full Concert Life database project. This still leaves open the possibility of enriching the records with full expert interpretations, but this is beyond the scope of a PhD project.
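The text-to-database mapping could start from something like the sketch below, which reads an XML-tagged programme and flattens selected tags into a CSV row. The mark-up scheme (tag names such as date, venue, composer, performer), the field list and the sample programme are all invented for illustration; the real corpus mark-up and database fields would be decided in the project.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mark-up scheme and invented sample text, for illustration only.
SAMPLE = """<programme>
  <date>1 May 1850</date>
  <venue>Example Hall</venue>
  <item><composer>Composer A</composer> Overture</item>
  <item><composer>Composer B</composer> Symphony</item>
  <performer>Performer X</performer>
</programme>"""

FIELDS = ["date", "venue", "composer", "performer"]

def programme_to_row(xml_text):
    """Collect the text of each tagged field; repeated tags are joined with '; '."""
    root = ET.fromstring(xml_text)
    row = {}
    for field in FIELDS:
        values = [el.text.strip() for el in root.iter(field) if el.text]
        row[field] = "; ".join(values)
    return row

def rows_to_csv(rows):
    """Write the rows in CSV, a common interchange format for data-mining tools."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

row = programme_to_row(SAMPLE)
csv_text = rows_to_csv([row])
```

Pairs of tagged text and existing database entries produced this way could then serve as training examples for the data-mining tools; fields that stay empty for most programmes are candidates for the "optional extras" category.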
We also propose to develop a corpus-based knowledge representation formalism or Concert Ontology to encode knowledge about concerts. This will allow automated reasoning about interrelated data-fields, enabling researchers to ask challenging questions about concerts in both the Concert Programmes archive and the Concert Life Database. Research in Artificial Intelligence has led to sophisticated models, techniques and tools for language analysis, corpus linguistics and knowledge representation and reasoning; we propose to investigate the use of these techniques to further enrich the datasets.
The Concert Life Corpus and Ontology will form an open-source case-study resource for computer-based research and teaching in music and cultural history, as well as a challenging, rich, accessible data warehouse for Natural Language Processing, Artificial Intelligence and Data Mining research.
The Koran (also spelt Qur'an) is the holy book of Islam: Muslims believe it holds instructions from God, passed on by an angel, Gabriel, to a man, Mohammed, to be memorised by him and passed on to others. Eventually this set of instructions was written down, and every Muslim should try to learn the words of the Koran and follow these teachings from God. However, although the lessons of the Koran should be applied to all aspects of life, it may not be straightforward to find an answer to any given question. The aim of this project is to develop one or more systems which deliver answers from the Koran to any questions input to the system. Ideally two (or more) different systems will be implemented, based on different algorithms, to allow comparative evaluation of performance. Ideally the system(s) should have both English and Arabic versions, allowing questions and answers either in the original Arabic of the Koran, or an English translation more accessible to English speakers.
Leeds PhD student Bayan Abu Shawar has developed tools to automate retraining of the ALICE chatbot directly from training texts, such as transcripts of human dialogue, FAQ websites, and even a chatbot trained with the Koran to reply with quotations from the Muslim holy book. One approach to this project is to build on this system, to extend the ALICE simple pattern-matching chatbot. An alternative approach is a deeper "knowledge-based" system, requiring encoding of Koran verses in a logic-based knowledge representation formalism, so that questions are mapped onto logical queries over this Koran knowledge management formalism. A third approach might be to adapt an existing Dialogue Management system including linguistic analysis (parsing, semantic and pragmatic analysis) of Koran, and of questions.
The resulting systems will be evaluated comparatively, by empirically measuring user satisfaction among users such as Mosque classes, religious education classes in schools, or as an alternative to online FAQ websites for Muslims seeking guidance. This project could potentially be widely-used by Muslims and in religious education.
REFERENCES
Abu Shawar, Bayan; Atwell, Eric. An Arabic chatbot giving answers from the Qur'an. In: Bel, B and Marlien, I (editors) Proceedings of TALN04: XI Conference sur le Traitement Automatique des Langues Naturelles, Volume 2, pp. 197-202 ATALA. 2004.
Abu Shawar, Bayan; Atwell, Eric. Accessing an information system by chatting. In: Meziane, F and Metais,E (editors) Natural Language Processing and Information Systems, pp. 407-412 Springer-Verlag. 2004
AbuShawar B and Atwell E. (2005) Modelling turn-taking in a corpus-trained chatbot. To appear in Bernhard Schroeder et al (eds) Sprache, Sprechen und Computer/Computer Studies in Language and Speech, Peter Lang Verlag. http://www.comp.leeds.ac.uk/eric/todo/bayanSSC.doc
Eric Atwell (2004) website: http://www.comp.leeds.ac.uk/eric also includes links to several corpus-trained chatbots
Knowledge Management and Data Mining research, e.g. Sentiment Analysis, can be applied to Happiness research. Currently, the standard way to analyse levels of happiness is either (i) put people through brain scans - to measure activity in the "happiness" part of the brain; but this isn't feasible for mass studies ... or (ii) ask experimental subjects to keep a written diary or log, and then analyse "by hand" variation in their well-being or sentiment over time, e.g. see (Christensen 2003).
Computational analysis of text could provide a novel alternative tool; for example, Leeds PhD student Fangzhong Su is researching sentiment analysis tools to measure "sentiment" of words in text. A 2008 Final Year Project by Torre Williams developed a web-interface to extract web-pages and provide a basic measure of "happiness" of pages. A related Final Year Project by Andrew McKinlay last year looked at "opinion mining", extracting and classifying positive and negative opinions from natural language text sources. Several of the software toolkits described on the toolkit survey website http://www.alias-i.com/lingpipe/web/competition.html include tools for Sentiment Analysis of text: computing whether the text is "positive" or "negative", eg for consumer reviews.
A number of existing tools use Google API (or similar, eg Yahoo) to harvest web-pages in a given nation, to provide a corpus which can be used to discover domain-specific "sentiment features" indicating national level of happiness. These can build up a picture for a region on levels of happiness as evidenced by national WWW domains. Alternatively, the logs gathered by traditional researchers can be put through Sentiment Analysis software to assess variation in happiness levels automatically.
Your task is to survey the possibilities, select appropriate data-source and sentiment analysis tool (or build your own in Python), design and run some experiments to measure "happiness" in your samples; and then apply this tool to current research issues in Happiness Studies, collaborating with economists, social scientists and/or psychologists to see if computational measurement of happiness can provide answers to some of their research questions. For example, they may want you to evaluate your software results against expert human judgements of happiness levels in research cases.
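If you choose to build your own tool in Python, a first baseline might be a simple word-list scorer like the sketch below. The tiny positive and negative word lists are invented placeholders (a real study would need a properly developed lexicon or a trained classifier), and the scoring formula is one assumption among many possible ones.

```python
import re

# Hypothetical mini-lexicon; a real study would use an established resource.
POSITIVE = {"happy", "joy", "love", "great", "wonderful"}
NEGATIVE = {"sad", "angry", "hate", "terrible", "awful"}

def happiness_score(text):
    """Return (positive - negative) / (positive + negative), or 0.0 if no hits."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

def mean_score(documents):
    """Average the per-document scores, e.g. over pages from one national domain."""
    scores = [happiness_score(doc) for doc in documents]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `happiness_score("I love this wonderful day")` gives 1.0, while a text with no lexicon words scores 0.0. A baseline like this ignores negation and context, so comparing its output against expert human judgements, as suggested above, would show how far it can be trusted.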
Background references:
Journal of Happiness Studies (published by Springer) http://www.springer.com/10902
Torre Williams. 2008. Finding the Level of Happiness of a Polarised Text. http://www.comp.leeds.ac.uk/cgi-bin/fyproj/reports/0708/Williams.pdf.gz
Andrew McKinlay. 2008. A system for predicting sports results from natural language. Final Year Project report, School of Computing, Leeds University. http://www.comp.leeds.ac.uk/cgi-bin/fyproj/reports/0708/McKinlay.pdf.gz
LingPipe list of "competition" systems for text analysis: http://www.alias-i.com/lingpipe/web/competition.html
Christensen et al "A practical guide to experience-sampling procedures" Journal of Happiness Studies V.4:1 pp. 53-78, 2003
This is a broad area, and students can focus on a specific project according to personal interest. This research area is interdisciplinary, and you will benefit from input from research groups in other departments at Leeds University, eg Education, Modern Languages, Translation Studies, Linguistics, English, Psychology, Disability Studies. However, your research will be in the School of Computing so of course must have a clear Computing content and focus. Possible questions to investigate include:
Al-Sulaiti, Latifa; Roberts, Andrew; Atwell, Eric. The use of corpora and concordance in the teaching of contemporary Arabic in: Proceedings of EuroCALL 2005. 2005.
Elliott, Debbie; Atwell, Eric; Hartley, Tony. Using corpora to automatically detect untranslated and outrageous words in machine translation output in: Proceedings of Corpus Linguistics 2005. 2005
Atwell, Eric; Howarth, Peter; Souter, Clive. The ISLE corpus: Italian and German spoken learner's English. ICAME Journal, vol. 27, pp. 5-18. 2003
Atwell, Eric; Roberts, Andrew. Combinatory hybrid elementary analysis of text. In: Kurimo, M, Creutz, M and Lagus, K (editors) Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes. 2006.
Abu Shawar, Bayan; Atwell, Eric. A chatbot system as a tool to animate a corpus. ICAME Journal, vol. 29, pp. 5-24. 2005.
Abu Shawar, Bayan; Atwell, Eric. Using corpora in machine-learning chatbot systems. International Journal of Corpus Linguistics, vol. 10, pp. 489-516. 2005.
The World Wide Web is a huge resource, containing a very large number of documents on a wide range of topics; document workflow within a large company intranet also covers many different topics. This project is about employing modern computational methods to find regularity in large databanks containing texts. More specifically, knowledge discovery in this project will be about:
In doing your project you will access several large text collections used by Leeds University researchers, including: the British National Corpus (BNC), the Russian Reference Corpus, the German Internet Corpus (GIC) and the Chinese Gigaword Corpus. You will also assess suitability of standard Data-Mining tools (eg WEKA, R) to clustering/classifying text.
For more details of corpus resources at Leeds, and this project topic, see http://www.comp.leeds.ac.uk/ssharoff/ or http://corpus.leeds.ac.uk/ and/or email eric@comp.leeds.ac.uk or ssharoff@comp.leeds.ac.uk
Arabic is a major international modern language, and is the official language of Islam: every Muslim should learn Arabic to read the original Koran. However, there are relatively few software resources for Arabic language computing. This project aims to develop tools for Arabic NLP, specifically Part-of-Speech tagger and associated dictionary and morphological analyser. The student will start from the generic Python NLTK PoS-tagger for English, and adapt this to Arabic.
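As a starting point for thinking about tagger adaptation, the sketch below implements a most-frequent-tag (unigram) baseline in plain Python. It is not NLTK's own code, although NLTK provides comparable unigram and default taggers; the two training sentences and the three-tag tagset are tiny invented examples, and unknown words fall back to a default tag chosen arbitrarily here.

```python
from collections import Counter, defaultdict

# Tiny hypothetical training set with an invented coarse tagset.
TRAIN = [
    [("قرأ", "VERB"), ("الولد", "NOUN"), ("كتابا", "NOUN")],
    [("الولد", "NOUN"), ("في", "PREP"), ("البيت", "NOUN")],
]

def train_unigram(tagged_sentences):
    """Map each word to the tag it received most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(tokens, model, default="NOUN"):
    """Tag known words from the model; unknown words get the default tag."""
    return [(token, model.get(token, default)) for token in tokens]

model = train_unigram(TRAIN)
tagged = tag(["قرأ", "الولد", "مجلة"], model)
```

A whole-word lookup like this handles unseen word forms poorly, and Arabic words often carry attached prefixes and suffixes, which is one reason the project pairs the tagger with a dictionary and morphological analyser.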
Python is the new "cool" language for object-oriented software engineering, backed by a blossoming open-source community developing Python software. For example, NLTK, the Natural Language Toolkit, is a suite of Python libraries and programs for symbolic and statistical natural language processing. NLTK includes graphical demonstrations and sample data. It is accompanied by extensive documentation, including tutorials that explain the underlying concepts behind the language processing tasks supported by the toolkit.
Ideally the student should have some basic knowledge of Arabic; but linguistic advice and end-user evaluation will be available through the Arabic language computing research group here at Leeds (Bayan Abu Shawar, Latifa Al-Sulaiti, Andy Roberts).
See Websites: http://www.python.org/ http://nltk.sourceforge.net/
and References:
S Bird and E Loper, "NLTK: The Natural Language Toolkit" in Proceedings of ACL'04, Barcelona, 2004 - http://www.ldc.upenn.edu/sb/home/papers/nltk.pdf
Arabic NLP research at Leeds University: http://www.comp.leeds.ac.uk/arabic
Atwell, Eric; Al-Sulaiti, Latifa; Al-Osaimi, Saleh; Abu Shawar, Bayan. "A review of Arabic corpus analysis tools" in: Bel, B and Marlien, I (editors) Proceedings of TALN04: XI Conference sur le Traitement Automatique des Langues Naturelles, Volume 2, pp. 229-234 ATALA. 2004 http://www.comp.leeds.ac.uk/eric/taln04/aaaaF2.pdf
Abu Shawar, Bayan; Atwell, Eric. "An Arabic chatbot giving answers from the Qur'an" in: Bel, B and Marlien, I (editors) Proceedings of TALN04: XI Conference sur le Traitement Automatique des Langues Naturelles, Volume 2, pp. 197-202 ATALA. 2004.
Andy Roberts, aConCorde multilingual concordance tool http://www.comp.leeds.ac.uk/andyr/software/aConCorde/index.html
Machine Learning and Data Mining have been applied to language datasets to discover patterns: for example, grammar inference, grammatical word-classifications for English and other languages. However, most research has been done by individuals using their own bespoke software; this project would aim to re-engineer a range of known algorithms into a generic toolkit for Linguistic knowledge discovery, and/or to adapt existing Machine Learning tools to learn from language corpus datasets. The student can start from the generic Python Natural Language Toolkit, and extend this by adding modules for Grammar Inference; and/or start from the WEKA Java data-mining toolkit, and extend this by adding modules for language processing.
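As a hedged illustration of what "discovering grammatical word-classifications" can mean in practice, the following sketch groups word types by the words that appear immediately to their left and right. It is a toy under stated assumptions, not the algorithm of any of the papers listed here: the four-sentence corpus, the cosine threshold and the greedy single-pass clustering are all choices made for this example only.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences):
    """Map each word type to counts of its left and right neighbours."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]][("L", padded[i - 1])] += 1
            vectors[padded[i]][("R", padded[i + 1])] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.5):
    """Greedy single pass: join the first cluster whose first member is similar enough."""
    clusters = []
    for word in sorted(vectors):
        for members in clusters:
            if cosine(vectors[word], vectors[members[0]]) >= threshold:
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters

# Toy corpus (hypothetical), far too small for real grammar inference.
CORPUS = [
    "the cat sleeps".split(),
    "the dog sleeps".split(),
    "a cat eats".split(),
    "a dog eats".split(),
]

classes = cluster(context_vectors(CORPUS))
print(classes)
```

On this toy corpus the determiners, nouns and verbs fall into separate groups because their neighbour counts are identical within each group. Real corpora are much noisier, which is part of what a generic toolkit would need to handle.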
Python and Java are languages for object-oriented software engineering, backed by a blossoming open-source community developing free, re-usable Python and Java software. For example, NLTK, the Natural Language Toolkit, is a suite of Python libraries and programs for symbolic and statistical natural language processing. NLTK includes graphical demonstrations and sample data. It is accompanied by extensive documentation, including tutorials that explain the underlying concepts behind the language processing tasks supported by the toolkit. Another example is WEKA: "... Weka is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. The algorithms can either be applied directly to a dataset or called from your own Java code..."
Ideally the student should have some basic knowledge of Machine Learning and NLP; but expert advice and end-user evaluation will be available through the language computing research group here at Leeds. See Websites: http://www.python.org/ http://nltk.sourceforge.net/
and References:
S Bird and E Loper, "NLTK: The Natural Language Toolkit" in Proceedings of ACL'04, Barcelona, 2004 - http://www.ldc.upenn.edu/sb/home/papers/nltk.pdf
Atwell, Eric; Roberts, Andrew. "Combinatory hybrid elementary analysis of text" in: Kurimo, M, Creutz, M & Lagus, K (editors) Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes. 2006.
Atwell, Eric; Abu Shawar, Bayan; Babych, Bogdan; Elliott, Debbie; Elliott, John; Gent, Paul; Hartley, Anthony; Hu, Xunlei Rose; Medori, Julia; Oba, Toshifumi; Roberts, Andy; Sharoff, Serge; Souter, Clive. "Corpus Linguistics, Machine Learning and Evaluation: Views from Leeds" Research Report no. 2003.02, School of Computing, University of Leeds. 2003. - http://www.comp.leeds.ac.uk/research/pubs/reports/2003/2003_02.pdf
Atwell, Eric. "Clustering of word types and unification of word tokens into grammatical word-classes" in: Bel, B and Marlien, I (editors) Proceedings of TALN04: XI Conference sur le Traitement Automatique des Langues Naturelles, Volume 1, pp. 27-32 ATALA. 2004.
van Zaanen, Menno; Roberts, Andrew; Atwell, Eric. "A multilingual parallel parsed corpus as gold standard for grammatical inference evaluation" in: Proceedings of LREC'04 Workshop on The Amazing Utility of Parallel and Comparable Corpora, pp. 58-61 European Language Resources Association. 2004.
A Frequently Asked Questions website such as the School of Computing FAQ http://www.comp.leeds.ac.uk/faq/ holds a wealth of knowledge in partly-structured natural language text. However, it is not always straightforward to find what you are looking for. The current FAQ offers two alternative interfaces:
Bayan Abu Shawar, computing PhD student, has developed two alternative interfaces at: http://www.comp.leeds.ac.uk/cgi-bin/bshawar/faq.cgi - you can type a question in natural English, and this is then passed to (i) a chatbot interface, using her machine-learnt chatbot engine; and (ii) a Google interface, using the standard Google search engine.
The aim of this project is to comparatively evaluate alternative ways of accessing FAQs, through methodical user-based evaluation. The SoC FAQ can be used as a concrete test knowledge base; ideally the evaluation should also be extended to other FAQs, and/or other alternative access methods. If desired, the student may implement their own alternative software. The aim is to come up with clear recommendations for a preferred knowledge management system, possibly a hybrid of more than one approach.
The project need NOT involve a great deal of software development: several systems already exist. The focus is on methodical evaluation of existing knowledge management systems. However, if the student prefers, there is scope to develop a new or hybrid FAQ-access system of their own.
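For a student who does want to implement an alternative access method to compare against, one very simple baseline is keyword overlap between the user's question and each FAQ question. The sketch below is only an assumption about what such a baseline might look like; the stopword list and the three FAQ entries are invented for illustration and are not taken from the School of Computing FAQ.

```python
import re

# Small hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "do", "i", "my", "how", "to", "is", "can", "what"}

def keywords(text):
    """Lower-case the text and keep content words only."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def rank_faq(question, faq):
    """Return FAQ questions sharing at least one keyword, best match first."""
    q = keywords(question)
    scored = [(len(q & keywords(entry)), entry) for entry in faq]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [entry for score, entry in ranked if score > 0]

# Hypothetical FAQ entries, invented for illustration.
FAQ = [
    "How do I change my password?",
    "How do I print from a lab machine?",
    "What is my disk quota?",
]

ranking = rank_faq("I forgot my password, can I change it?", FAQ)
print(ranking)
```

A baseline like this gives the user-based evaluation a fixed reference point: the chatbot and Google interfaces can be judged on the same questions against it.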
References
Abu Shawar, Bayan; Atwell, Eric. "Evaluation of chatbot systems" in: Proceedings of Eighth Maghrebian Conference on Software Engineering and Artificial Intelligence. 2004.
Abu Shawar, Bayan; Atwell, Eric. "Accessing an information system by chatting" in: Meziane, F and Metais,E (editors) Natural Language Processing and Information Systems, pp. 407-412, Lecture Notes in Computer Science, Springer-Verlag. 2004.
Abu Shawar, Bayan; Atwell, Eric. "An Arabic chatbot giving answers from the Qur'an" in: Bel, B and Marlien, I (editors) Proceedings of TALN04: XI Conference sur le Traitement Automatique des Langues Naturelles, Volume 2, pp. 197-202 ATALA. 2004.
MAT or Machine-Assisted Translation is software to ASSIST a human translator of eg French text to English, by taking over the simple repetitive translations, leaving the difficult bits to the human expert. MAT does NOT attempt perfect or full translation, as this is beyond current NLP/AI. By analogy, it may be possible to identify simple commonly-repeated patterns in English-language software specifications or documentation, which map straightforwardly onto UML formal specifications and hence to code.
This project will take the corpus-based approach: first collate a Corpus or collection of example specification documents with corresponding UML and/or other formal representation, and/or code - preferably in Python or Java. This corpus could include examples from open-source software providers, and student coursework exercises. Then, seek to identify recurrent patterns mapping from spec to UML and/or code, by two methods: machine-learning; and knowledge-eliciting inspection or "hand-crafting". The result should be a "translation memory" of stock phrases in English specs which correspond to known fragments of UML and/or code. These could be used to speed up (but not fully automate) the production of code from specifications.
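The "translation memory" idea can be sketched in a few lines, assuming the hand-crafting route: each entry pairs a stock English phrase pattern with a code template. The two patterns below ("X has a Y", "X is a Y"), the function names and the generated class skeletons are hypothetical examples, not patterns derived from any real specification corpus.

```python
import re

def has_a(owner, attr):
    """Draft a class with one constructor attribute ('Each X has a Y')."""
    name = attr.lower()
    return (f"class {owner.capitalize()}:\n"
            f"    def __init__(self, {name}):\n"
            f"        self.{name} = {name}")

def is_a(child, parent):
    """Draft a subclass ('An X is a Y')."""
    return f"class {child.capitalize()}({parent.capitalize()}):\n    pass"

# Hypothetical translation memory: stock spec phrase -> code-drafting function.
TRANSLATION_MEMORY = [
    (re.compile(r"(?:an?|each|every) (\w+) has an? (\w+)\.?", re.I), has_a),
    (re.compile(r"(?:an?|each|every) (\w+) is an? (\w+)\.?", re.I), is_a),
]

def draft_code(sentence):
    """Return a code draft for a known stock phrase, or None to leave it to the human."""
    for pattern, build in TRANSLATION_MEMORY:
        match = pattern.fullmatch(sentence.strip())
        if match:
            return build(*match.groups())
    return None

for spec in ["Each student has a name.", "A lecturer is a person", "The system must be fast."]:
    print(spec, "->", draft_code(spec))
```

Returning None for unmatched sentences mirrors the MAT analogy: the tool only takes over the simple repetitive cases and leaves everything else to the human developer.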
The system also requires a user interface displaying both English and OO code drafts side-by-side, linking equivalences. Where practical, equivalences and inconsistencies should be highlighted automatically; your work on this problem can build on a previous Undergraduate Project:
INDURUWANA, Chanaka Viraj. 2007. Parsing text to generate concept classes. Final Year Project report, School of Computing, Leeds University. http://www.comp.leeds.ac.uk/cgi-bin/fyproj/reports/0607/Induruwana.pdf.gz
Cryonics involves freezing a human body at/near death, in the hope that medical science will advance to the point that the body can be re-animated at some future time; eg see http://www.cryonics-europe.org/
Leeds PhD student Bayan Abu Shawar has been developing Machine Learning techniques to build AI "personality models" from training samples of the person's conversation. The resulting chatbot system is thus adapted to chat in the style of the human subject. Bill Faloon of the Reanimation Foundation has said this could be useful to cryonics:
"The Reanimation Foundation funds ... research aimed at the uploading of the deanimated individual's identity and memory in a computer, electronic of other type artificial system that would enable the individual to be fully or partially restored to some level of consciousness ... There are other groups who would like to see this technology developed for the purpose of cloning identities BEFORE one is placed in cryo-preservation..."
This project will investigate the application of Bayan Abu Shawar's PhD system to capturing ("freezing") an identity. You can also investigate the potential market for this technology, and develop a business plan for a research and development programme to develop and apply the techniques.
More information: http://www.comp.leeds.ac.uk/bshawar
Also, see Google searches on: cryonics, reanimation society
Leeds has a track record for research on computer analysis of English language texts, also known as English Corpus Linguistics. For example, we have developed a Part-of-Speech analysis system, and collaborate with other research projects such as the ICE: International Corpus of English, which includes research teams in 15 countries where English is the main language - see http://www.ucl.ac.uk/english-usage/ice/ In many of these English-speaking countries, the national ICE subcorpus is a recognised national resource used in research and teaching.
Leeds researchers are trying to set up an analogous research effort for Arabic, the International Corpus of Arabic. This student project will investigate feasibility and set up project management and infrastructure. Some Arab countries are too poor to fund research, and have minimal computing/WWW infrastructure compared to the UK. However, there is a flourishing Arabic internet/WWW user community, including newspapers and radio; so it may be feasible to collect at least some samples of ICA Arabic remotely, via WWW; and to help academics in Arab universities with infrastructure planning and design issues.
This project will investigate what can be done to instigate and start data-collection for an Arabic ICA corpus, including seeking Arabic sources accessible remotely, eg on the internet. The project will also draw up proposals for infrastructure and data collection methods, including possible use of the Leeds Virtual Knowledge Park (VKP); and survey potential funding sources for an international project, such as EPSRC, EU, FCO, British Council, UNESCO.
Unfortunately, the School of Computing cannot fund project student field-trip visits to Arabia ... BUT if you do get the chance to visit an Arab country, you may get to see how you have made a real contribution to developing international research collaboration.
References:
Arabic NLP research at Leeds University: http://www.comp.leeds.ac.uk/arabic
The International Corpus of English project has models to adapt, see http://www.ucl.ac.uk/english-usage/ice/
Leeds has a track record for research on computer analysis of English language texts, also known as English Corpus Linguistics. For example, we have developed a Part-of-Speech analysis system, and worked with other research projects such as the ICE: International Corpus of English, which includes research teams in 15 countries where English is the main language - see http://www.ucl.ac.uk/english-usage/ice/ In many of these English-speaking countries, the national ICE subcorpus is a recognised national resource used in research and teaching.
However, arguably English is not "owned" by Britain and other countries where it is a mother tongue - increasingly, English is an International language, learnt and used worldwide for international communication. For example, nearly all science and engineering research and development worldwide, including Computing and Language Engineering research, is reported or documented in English first, in local languages later (and sometimes ONLY in English). In Arab countries, many scientific and business discussions are in Arabic, but in practice much "official documentation" and many publications are in English. This suggests English could be a de-facto working language in the Arab world; the case could be strengthened by collecting a corpus of Arab English documents to evidence the widespread use and acceptance of English in the Arab world.
Leeds researchers are thinking of setting up an Arab "variant" of ICE, for English used as a foreign language in Arab countries. This project will investigate feasibility, propose data warehouse infrastructure, and collect a prototype dataset from internet and other public sources of English documents written/spoken by non-native speakers in Arab countries.
Other corpus collection projects eg ICE-GB have gone on to enrich the text with PoS-tagging and parsing; a possible extension to this project is to PoS-tag and parse the Corpus of Arab English.
This project will investigate what can be done to organise and streamline data-collection for a Data Warehouse of Arab English, including English sources accessible remotely, eg on the internet. The project will also draw up proposals for infrastructure and data collection methods, including possible use of the J-BootCat web-crawler tool; and survey potential funding sources for an international project, such as EPSRC, EU, FCO, British Council, UNESCO, Arab League.
References:
The International Corpus of English project has models to adapt, see http://www.ucl.ac.uk/english-usage/ice/
Leeds researchers are collecting corpus data for several languages, see e.g. http://corpus.leeds.ac.uk/
AMALGAM: Automatic Mapping Among Lexico-Grammatical Annotation Models, was a research project funded by the UK Engineering and Physical Sciences Research Council, EPSRC (Atwell et al 2000). A number of English language PoS-tagged corpora are available to the research community, where each word is annotated with a detailed Part-of-Speech grammatical tag. However, the detailed PoS-tag sets applied in these corpora (Brown, LOB, SEC, ICE, UPenn, PoW, BNC etc) are all subtly different. There is implicit agreement on what constitutes a noun, verb, adverb etc; but the tag labels used are different, and many detailed subcategories are different: in effect, inconsistent word-class ontologies have been applied. Ideally researchers would like to be able to use PoS-tagged corpora interchangeably, and/or combine them into one large data-set for linguistic modelling eg via Machine Learning.
A PhD project could build on the results of the original AMALGAM project, to develop techniques for integrating the corpus datasets, by identifying and removing inconsistencies in PoS-tagging, and developing a single unified tagset. Current PhD student Owen Nancarrow is undertaking a comparative study of the tagging of adverbs in these English corpora, and this could be extended to other major word-classes. An analogous challenge lies ahead for corpora in other languages: for example, there are several PoS-tagged Arabic corpora, with different PoS-tagsets, so a PhD project could develop a unified PoS-tagset and tagger for Arabic to enable integration of Arabic PoS-tagged corpus resources.
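The simplest form of tagset integration is a lookup table from each corpus tagset to a shared coarse tagset, sketched below. The source tags shown are real labels from the Brown, Penn Treebank and BNC (C5) tagsets, but the unified labels (DET, NOUN, VERB, ADJ, ADV), the fallback tag X and the tiny coverage are assumptions made for this illustration; this is not the AMALGAM mapping itself.

```python
# Hypothetical unified coarse tagset; each table covers only a handful of tags.
TO_UNIFIED = {
    "brown": {"AT": "DET", "NN": "NOUN", "NNS": "NOUN", "VBD": "VERB", "JJ": "ADJ", "RB": "ADV"},
    "penn": {"DT": "DET", "NN": "NOUN", "NNS": "NOUN", "VBD": "VERB", "JJ": "ADJ", "RB": "ADV"},
    "c5": {"AT0": "DET", "NN1": "NOUN", "NN2": "NOUN", "VVD": "VERB", "AJ0": "ADJ", "AV0": "ADV"},
}

def unify(tagged_words, scheme, unknown="X"):
    """Relabel (word, tag) pairs from one tagset with the unified coarse tags."""
    table = TO_UNIFIED[scheme]
    return [(word, table.get(tag, unknown)) for word, tag in tagged_words]

# The same sentence as it might be tagged under two different schemes.
brown_sentence = [("the", "AT"), ("dog", "NN"), ("barked", "VBD")]
c5_sentence = [("the", "AT0"), ("dog", "NN1"), ("barked", "VVD")]

print(unify(brown_sentence, "brown"))
print(unify(c5_sentence, "c5"))
```

A flat table like this only works where the tagsets agree on category boundaries. The harder research problem described above is the cases where detailed subcategories cut across one another, so that no one-to-one or many-to-one mapping exists.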
Another extension to AMALGAM would be a software engineering research project to upgrade the software. The original AMALGAM project developed resources and tools for English corpus analysis, including Part-of-Speech tagsets and taggers. Resources were archived on a website (AMALGAM 2000) but access to these has been curtailed due to compatibility problems after systems upgrades: the website needs to be brought up to date to fit current (and future) web technologies. In particular, users could access the English PoS-tagger software via email: text sent to amalgam-tagger@comp.leeds.ac.uk was processed (tagged) and returned via email. This service could be replaced with a web-page similar to the Lancaster University CLAWS demo page (CLAWS 2004). The underlying software could be upgraded to the emerging standard, NLTK. Linguistic extensions could include parsing, semantic tagging, and other linguistic analysis of the English text; and/or extension to other language(s), to provide a world-standard multilingual text-tagging server. This calls for an initial survey of users' needs and preferences, followed by upgrade developments, then evaluation through user satisfaction trials.
References:
AMALGAM (2000) project website: http://www.comp.leeds.ac.uk/amalgam/amalgam/amalghome.htm
Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, and Wilcock S. (2000). A comparative evaluation of modern English corpus grammatical annotation schemes. ICAME Journal, volume 24, pages 7-23 http://www.comp.leeds.ac.uk/eric/icamejournal2000.pdf
CLAWS (2004) CLAWS POS-tagger trial service: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
If you want to do a PhD on any of the above topics, please contact Eric Atwell at the School of Computing, Leeds University.