Tools for working with parallel corpora and studying contrastive semantics

A new set of tools, based on the Corpus Workbench, gives access to large comparable corpora in a variety of languages (including 200-million-word corpora for Chinese, Portuguese and Russian, among others).

Theory

The linguistic theory I rely on is Halliday's systemic-functional linguistics. Among its assumptions about language, an important one is the distinction between the description of the system of language, i.e. its potential for exchanging meanings, and the instantiation of that potential in text. This is directly related to corpus linguistics, i.e. the empirical study of instances of language production. For more information on my views on the relationship between language and meaning, see the research interests page.

I'm interested in the contrastive study of how the same story is told in different languages (primarily English and Russian, with occasional examples from German), which I approach by looking at translations of literary and technical texts. The corpora used for concordancing include two technical texts:

    excerpts from the AutoCAD v.13 User's Manual (Chapter 2: Drawing objects) and its translation into Russian;
and three literary texts:
    Vladimir Nabokov's 'The Vane Sisters' and two of its Russian translations, made by Ilyin and Barabtarlo;
    Vladimir Nabokov's 'Lolita' and its Russian translation, made by Nabokov himself;
    Victor Pelevin's 'Omon Ra' in Russian and its translation into English.
Some aligned files are not available on-line because of copyright restrictions. The multilingual corpus for Alice (with lemmatization and morphosyntactic annotations) is freely available. The web-interface to the Alice corpus is available from here. In addition, the members of the NLP group in Leeds can access the Japanese-English corpus of Yomiuri.

There are also tools for computing the most significant collocations for words in the corpus. See the methodology and some lists of collocates on a separate webpage.

Software

There are many tools for working with concordances and parallel corpora (cf. Michael Barlow's page), but very few of them support language-independent annotations, especially in the XML format. The software presented here fills this gap. It is written for ActiveState Perl, but it should run under any Perl 5 implementation.

Because of its relationship to my studies in systemic linguistics, the software is also genetically related to two pieces of software for working with systemic networks: KPML, a tool for developing multilingual grammars aimed at text generation, and The Systemic Coder, a tool for presenting systemic networks and annotating texts in terms of their features.

Perl scripts to download

Comments/suggestions about their content/functionality are welcome.

The complete package is available as a zip file.

makeconc.pl

To create a corpus:

The script creates a corpus in the XML format from a plain text file. The procedure breaks the text into sentences and the sentences into words, and gives every sentence and word a unique identifier. Two files are produced: an XML file and an HTML file with anchors for sentences. It is advisable to also pass the corpus through a POS tagger in order to fill the attributes for lemma, POS and morphfeatures properly. A sense disambiguation procedure can help in filling the lexfeatures attribute. consult.pl produces an HTML file with concordancing results, in which sentences are linked to the anchors in this HTML file.
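
As an illustration only, the procedure can be sketched in a few lines of Python (this is not the Perl code of makeconc.pl). The root element corpus, the element name s, the id scheme and the naive sentence splitter are assumptions made for the example; only the <w> tag and the attribute names lemma, POS, morphfeatures and lexfeatures come from the description on this page.

```python
import re
import xml.etree.ElementTree as ET

def make_corpus(text):
    """Split plain text into sentences and words, each with a unique id."""
    corpus = ET.Element("corpus")  # hypothetical root element
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for s_num, sentence in enumerate(sentences, start=1):
        s_el = ET.SubElement(corpus, "s", id=f"s{s_num}")  # "s" is an assumed name
        # A word is a run of word characters, possibly joined by ' or -.
        words = re.findall(r"\w+(?:['-]\w+)*", sentence)
        for w_num, word in enumerate(words, start=1):
            # Annotation attributes start empty; a tagger would fill them later.
            w_el = ET.SubElement(
                s_el, "w", id=f"s{s_num}.w{w_num}",
                lemma="", POS="", morphfeatures="", lexfeatures="",
            )
            w_el.text = word
    return corpus

corpus = make_corpus("Alice was tired. She sat on the bank!")
print(ET.tostring(corpus, encoding="unicode"))
```

Real texts need a more careful splitter (abbreviations, quotations, punctuation tokens), so the sketch only shows the shape of the result.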

addpara.pl

To add another aligned file to a corpus

The script adds or updates the alignment in a parallel multilingual corpus created by makeconc.pl, either by adding another translation or by replacing an older alignment, using a text file that contains the alignment. Typically, the alignment is on the sentence level, using the Gale-Church algorithm. The default alignment format of translation pairs is original$translation\n\n (as used by Mark Alister). The original text should be an EXACT replica of what is stored in the corpus between <w> tags (including the order of words and constituents). consult.pl produces an HTML file with search results, in which sentences are linked to the anchors in the HTML file.
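
A minimal Python sketch of reading the default alignment format is given below, assuming that each record is one original$translation pair and that records are separated by an empty line. The function names and the sample sentences are hypothetical, and the comparison against the corpus is simplified to plain string equality; addpara.pl itself may work differently.

```python
def parse_alignment(data):
    """Parse 'original$translation' records separated by blank lines."""
    pairs = []
    for record in data.split("\n\n"):
        record = record.strip()
        if not record:
            continue
        # Split on the first '$' only, so a '$' inside the translation survives.
        original, sep, translation = record.partition("$")
        if not sep:
            raise ValueError(f"no '$' separator in record: {record!r}")
        pairs.append((original.strip(), translation.strip()))
    return pairs

def find_mismatches(pairs, corpus_sentences):
    """Indices of pairs whose original is not an exact replica of the corpus text."""
    return [
        i for i, ((original, _), stored) in enumerate(zip(pairs, corpus_sentences))
        if original != stored
    ]

data = (
    "Alice was tired.$Алиса устала.\n\n"
    "She sat on the bank.$Она сидела на берегу.\n\n"
)
pairs = parse_alignment(data)
mismatches = find_mismatches(pairs, ["Alice was tired.", "She sat on a bank."])
print(pairs)
print(mismatches)
```

Reporting mismatches before merging makes it easier to see where the alignment file has drifted from the text stored in the corpus.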

consult.pl

To query a corpus:

The consultation function produces an HTML file with a KWIC (Key Word In Context) list whose items are sentences that conform to the set of search criteria. Search criteria are applied to one sentence and may include all the information stored in the XML format about this sentence, e.g. co-occurrence, specific morphological or lexical properties of words, as well as their synonyms or translation equivalents. The search criteria may also restrict the length of the context that is extracted from the found sentences and presented to the user.
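
The idea of applying criteria to the annotations of a single sentence can be sketched in Python as follows. This is an illustration under assumptions, not the logic of consult.pl: the s element, the attribute values and the representation of criteria as dictionaries are invented for the example, and co-occurrence is modelled simply as "every criterion is satisfied by some word of the sentence".

```python
import xml.etree.ElementTree as ET

def word_matches(word, criteria):
    """True if the <w> element carries every attribute value in criteria."""
    return all(word.get(attr) == value for attr, value in criteria.items())

def sentence_matches(sentence, criteria_list):
    """Co-occurrence: each criteria dict must be met by some word in the sentence."""
    words = list(sentence.iter("w"))
    return all(any(word_matches(w, c) for w in words) for c in criteria_list)

sentence = ET.fromstring(
    '<s id="s1">'
    '<w lemma="she" POS="pron">She</w>'
    '<w lemma="leave" POS="verb">left</w>'
    '<w lemma="home" POS="noun">home</w>'
    '</s>'
)

# A verb with the lemma "leave" co-occurring with any noun.
print(sentence_matches(sentence, [{"lemma": "leave", "POS": "verb"}, {"POS": "noun"}]))
# The lemma "leave" used as a noun does not occur in this sentence.
print(sentence_matches(sentence, [{"lemma": "leave", "POS": "noun"}]))
```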

The parameters (Perl variables) to be specified in the search condition are:

$searchstring - the search string, e.g. 'lemma="leave" POS="verb"';

$searchtitle - the title of the HTML output, e.g. 'leave/VERB';

$contextsize - the maximal length of the left and right context for results, e.g. 5. In the HTML output, each sentence in the output list can be explored with respect to its wider context, since it is hyperlinked to its position in the original text.

$sortoption - "left", "right" or "original"; the order in which occurrences appear in the output list. More precisely, the order is defined by the alphabetical order of items in the array @contexts;

$showtranslation - 1 or 2 or 0; whether to show in the output list the respective parallel sentence (1) or only the hyperlink (2) or suppress the search for translations (0).

$translations - a list of translation equivalents or a list of files with translation equivalents (typically language by language, e.g. a file for German size adjectives and a file for Russian size adjectives). When an equivalent matches a word from its start, the whole word is highlighted in the output. Filenames are separated with commas, semicolons or spaces (MS Windows-style filenames with spaces are not allowed).
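
The effect of the context size and the sort option on a KWIC list can be illustrated with a short Python sketch. The function kwic and its arguments are hypothetical names that only mirror $contextsize and $sortoption; in particular, sorting the left context starting from the word nearest to the keyword is an assumption, since the exact contents of @contexts are not described here.

```python
def kwic(sentences, keyword, contextsize=5, sortoption="original"):
    """Return (left, key, right) tuples for every occurrence of keyword.

    sentences is a list of token lists; contexts hold at most contextsize tokens.
    """
    rows = []
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = tokens[max(0, i - contextsize):i]
                right = tokens[i + 1:i + 1 + contextsize]
                rows.append((left, token, right))
    if sortoption == "left":
        # Assumed: compare left contexts from the word nearest to the keyword.
        rows.sort(key=lambda row: [t.lower() for t in reversed(row[0])])
    elif sortoption == "right":
        rows.sort(key=lambda row: [t.lower() for t in row[2]])
    # "original" keeps the order of occurrence in the text.
    return rows

sentences = [
    ["the", "cat", "sat", "down"],
    ["a", "dog", "sat", "up"],
    ["she", "sat", "alone"],
]
rows = kwic(sentences, "sat", contextsize=1, sortoption="right")
for left, key, right in rows:
    print(f"{' '.join(left):>10} [{key}] {' '.join(right)}")
```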

Search parameters or entire search functions can be defined by the user. A useful example is the search for a class of words. The class is defined as a sequence of lexical items stored in a file. It may include multiword entries (expressed using underscores, e.g. a_lot_of). The file may also include the specification of search variables and comments (in the Perl syntax), cf. the example of verbs of motion. A call

perl consult.pl q/wish.q &all-en.lst wish.html

outputs all expressions referring to wishes found in all English corpora. A call

perl consult.pl list.pl ctt/alice.ctt motion-alice.html q/motion.q 

outputs all verbs of motion found in the English text of Alice in Wonderland (the list of lexical items for list.pl is the last argument).
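
How a word-class file of this kind might be read can be sketched in Python. The sample file content below is invented, and the parsing rules (everything after # is a comment, lines starting with $ are variable settings and are skipped, items are separated by whitespace) are simplifying assumptions rather than the actual behaviour of list.pl.

```python
import re

def load_word_class(text):
    """Collect lexical items; underscores in multiword entries become spaces."""
    items = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop Perl-style comments
        if not line or line.startswith("$"):   # skip blanks and variable settings
            continue
        items.extend(entry.replace("_", " ") for entry in line.split())
    return items

def build_pattern(items):
    """One regular expression for the whole class, longest entries first."""
    alternatives = []
    for item in sorted(items, key=len, reverse=True):
        # Allow any whitespace between the parts of a multiword entry.
        alternatives.append(r"\s+".join(re.escape(part) for part in item.split(" ")))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

sample = """\
# verbs of motion (hypothetical sample)
$searchtitle = 'motion';
go walk run
run_away
"""
items = load_word_class(sample)
pattern = build_pattern(items)
print(items)
print(pattern.findall("She began to run away, then walk."))
```

Sorting the entries by length lets a multiword entry such as run_away take precedence over its first word.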

search.pl

To search lexical items in plain texts:

The script is a modification of consult.pl for searching for sets of lexical items in plain texts. The set of words to search for is given in a text file, like the one for verbs of motion. If the input file is specified as &inputfile, it is treated as a list of files to search through.

The need to search for several lexical items or their forms (in a language with complex morphology, like Russian) in arbitrary texts arises often enough, but there is no tool like awk on Windows. A call

perl search.pl motion.txt alice-en.txt alice-m.txt

outputs all occurrences of verbs of motion in alice-en.txt into alice-m.txt.
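
The handling of the & prefix and the search itself can be sketched in Python. All names are hypothetical, and several details are assumptions: the list file is taken to hold one filename per line, relative names are resolved against the directory of the list file, matching is on whole words only, and the output is one line per matching input line. search.pl may differ on each of these points.

```python
import re
import tempfile
from pathlib import Path

def resolve_inputs(arg):
    """'&name' means that name is a file listing the files to search."""
    if arg.startswith("&"):
        list_file = Path(arg[1:])
        lines = list_file.read_text(encoding="utf-8").splitlines()
        return [list_file.parent / line.strip() for line in lines if line.strip()]
    return [Path(arg)]

def search(words, input_arg, output_path):
    """Write every input line that contains one of the words to output_path."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
    )
    hits = []
    for path in resolve_inputs(input_arg):
        text = path.read_text(encoding="utf-8")
        for line_no, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append(f"{path.name}:{line_no}: {line}")
    Path(output_path).write_text("\n".join(hits) + "\n", encoding="utf-8")
    return hits

# Demonstration on temporary files with invented content.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a.txt").write_text("Alice ran away.\nNothing here.\n", encoding="utf-8")
    (tmp / "b.txt").write_text("The Rabbit went on.\n", encoding="utf-8")
    (tmp / "files.lst").write_text("a.txt\nb.txt\n", encoding="utf-8")
    hits = search(["ran", "went", "go"], "&" + str(tmp / "files.lst"), tmp / "out.txt")
print(hits)
```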

annotate.pl

To add lexfeatures annotations:

The script looks for a given search condition in a given corpus (in the same way as consult.pl) and prompts for alteration of lexical features for found items. The response is stored back in the concordance.
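
The annotation loop can be sketched in Python as follows. This is not the code of annotate.pl: the function annotate, the callback ask (which stands in for the interactive prompt) and the sample corpus are all hypothetical, and the search condition is reduced to a dictionary of attribute values.

```python
import xml.etree.ElementTree as ET

def annotate(corpus, criteria, ask):
    """For each <w> matching criteria, ask for lexfeatures and store the answer.

    An empty answer leaves the word unchanged. Returns the number of changes.
    """
    changed = 0
    for word in corpus.iter("w"):
        if all(word.get(attr) == value for attr, value in criteria.items()):
            answer = ask(word)
            if answer:
                word.set("lexfeatures", answer)
                changed += 1
    return changed

corpus = ET.fromstring(
    '<corpus><s id="s1">'
    '<w lemma="leave" POS="verb" lexfeatures="">left</w>'
    '<w lemma="home" POS="noun" lexfeatures="">home</w>'
    '</s></corpus>'
)

# For interactive use one could pass, for example:
#   ask=lambda w: input(f"lexfeatures for {w.text!r}: ")
n = annotate(corpus, {"lemma": "leave"}, ask=lambda w: "motion")
print(n, ET.tostring(corpus, encoding="unicode"))
```

Keeping the prompt behind a callback means the same loop can be driven by a user or by a prepared list of answers.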

A simple scenario


The multilingual corpus for Alice is downloaded and unzipped into the current directory. The file anger.q lists the most frequent words that are used to designate anger in English. The variable $translations lists the most frequent translations of the anger words into German and Russian.

Then, a call:

perl consult.pl q/anger.q alice.ctt anger.html

outputs all occurrences of anger-related words to anger.html. Keywords and translations (if they belong to the most frequent set) are highlighted in the output.
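
The highlighting of translation equivalents can be illustrated with a Python sketch, based on the rule that an equivalent matching a word from its start highlights the whole word. The function name, the <b> markup and the sample equivalents are assumptions for the example; the HTML actually produced by consult.pl is not shown on this page.

```python
import html
import re

def highlight(sentence, equivalents):
    """Wrap in <b>...</b> every word that starts with one of the equivalents."""
    prefixes = tuple(e.lower() for e in equivalents)
    out = []
    # The capturing group keeps both the words and the text between them.
    for part in re.split(r"(\w+)", sentence):
        escaped = html.escape(part)
        if prefixes and part.lower().startswith(prefixes):
            escaped = f"<b>{escaped}</b>"
        out.append(escaped)
    return "".join(out)

german = highlight("Die Königin wurde zornig.", ["zorn", "wut"])
russian = highlight("Королева рассердилась и сердито крикнула.", ["серд", "зл"])
print(german)
print(russian)
```

Because the match is anchored at the start of the word, a prefixed form such as рассердилась is not highlighted by the equivalent серд; a list of equivalents would need a separate entry for it.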

 
The software may be freely distributed and modified under the terms of the GNU General Public License, as long as the copyright notice at the top of each file is retained. The software is provided in the hope that it will be useful. ABSOLUTELY NO warranty is given, in particular with respect to its suitability for your specific purposes. See the GNU General Public License for more details.

 

Last modified on 20/05/03 by Serge Sharoff, s.sharoffleeds.ac.uk