Open Source MT evaluation toolkit
#########################################################
purpose: scores Machine Translation output; the scores correlate with human evaluations of adequacy and fluency
usage: perl wnm-01-1.pl <evaluated-text> <human-reference-text> <corpusFrequencyFile>
example: perl wnm-01-1.pl te01sysA-base.txt te00humanRef.txt wnm-frqEnglish-darpa94e.txt >> wnm-results.txt
requires: a corpus statistics file in the following format:
    word;FrequencyInCorpus;NumberOfTextsWhereFound
  the header of the corpus statistics file should be:
    <CorpStat>NumberOfTokensInCorpus;NumberOfTextsInCorpus
  2 corpus statistics files for English are included (the 2 files are created on 2 different human reference translations)
sample output:
    MT-TEXT:tw01sysA-base.txt;wnm-RECALL-ADEQUACY:0.2221;wnm-FSCORE-FLUENCY:0.2788
    ;DETAILS:
    ;tw01sysA-base.txt;bP:0.2896;bR:0.3442;bF:0.3146
    ;tw01sysA-base.txt;wP:0.3745;wR:0.2221;wF:0.2788
#########################################################
# NOTE: v01-1 AT THE MOMENT EACH FILE IS TREATED AS A SINGLE TEXT (THEREFORE NO TEXT/SEGMENT MARKUP IS REQUIRED)
# IF YOU EVALUATE A LARGE COLLECTION OF TEXTS, PUT EACH TEXT INTO A DIFFERENT FILE AND COMPUTE AVERAGE SCORES
#########################################################
authors:
    Bogdan Babych <bogdan <at> comp.leeds.ac.uk>
    Tony Hartley <a.hartley <at> leeds.ac.uk>
    Centre for Translation Studies, University of Leeds, England, UK
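The corpus statistics format described above can be read with a few lines of code. The following is a minimal sketch in Python, not part of the toolkit itself: the names `CorpusStats` and `parse_corpus_stats` are hypothetical, and it assumes the `<CorpStat>` header is the first non-empty line and that every other line has exactly the three `;`-separated fields listed above.

```python
from dataclasses import dataclass


@dataclass
class CorpusStats:
    """Hypothetical container for one corpus statistics file."""
    total_tokens: int   # NumberOfTokensInCorpus from the header
    total_texts: int    # NumberOfTextsInCorpus from the header
    words: dict         # word -> (FrequencyInCorpus, NumberOfTextsWhereFound)


def parse_corpus_stats(lines):
    """Parse an iterable of lines in the documented format.

    Assumption: the header is the first non-empty line.
    """
    rows = [ln.strip() for ln in lines if ln.strip()]
    prefix = "<CorpStat>"
    if not rows or not rows[0].startswith(prefix):
        raise ValueError("missing <CorpStat> header line")
    tokens, texts = rows[0][len(prefix):].split(";")
    words = {}
    for row in rows[1:]:
        # Split from the right so a word that itself contains ';' survives.
        word, freq, n_texts = row.rsplit(";", 2)
        words[word] = (int(freq), int(n_texts))
    return CorpusStats(int(tokens), int(texts), words)


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    sample = ["<CorpStat>1000;10", "the;60;10", "treaty;3;1"]
    stats = parse_corpus_stats(sample)
    print(stats.total_tokens, stats.total_texts, stats.words["treaty"])
```

To read one of the included files, pass an open file handle, for example `parse_corpus_stats(open("wnm-frqEnglish-darpa94e.txt", encoding="utf-8"))`; the encoding is an assumption.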
The tool implements a method of MT evaluation that combines BLEU (Papineni et al., 2002) with statistical salience weights from a vector space model, such as S-scores (Babych, Hartley, Atwell, 2003), which are similar to TF.IDF scores (Salton, Lesk, 1968).
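To illustrate the general idea of weighting matches by salience, here is a simplified sketch in Python. It is not the toolkit's implementation: it works on unigrams only, it takes the salience weights as a ready-made dictionary instead of computing S-scores, and the function name `weighted_prf` and the default weight of 1.0 for unlisted words are assumptions made for the example.

```python
from collections import Counter


def weighted_prf(mt_tokens, ref_tokens, weights, default_weight=1.0):
    """Salience-weighted unigram precision, recall and F-score.

    weights maps a word to its salience weight (assumed to be given);
    words missing from the map get default_weight.
    """
    mt_counts = Counter(mt_tokens)
    ref_counts = Counter(ref_tokens)

    def w(token):
        return weights.get(token, default_weight)

    # A match is credited at most as often as the word occurs in the
    # reference (clipped counts, as in BLEU), scaled by its weight.
    matched = sum(min(c, ref_counts[t]) * w(t) for t, c in mt_counts.items())
    mt_total = sum(c * w(t) for t, c in mt_counts.items())
    ref_total = sum(c * w(t) for t, c in ref_counts.items())

    precision = matched / mt_total if mt_total else 0.0
    recall = matched / ref_total if ref_total else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Invented sentences and weights, for illustration only.
    mt = "the treaty was signed".split()
    ref = "the agreement was signed".split()
    salience = {"treaty": 3.0, "agreement": 3.0, "signed": 2.0}
    print(weighted_prf(mt, ref, salience))
```

With uniform weights the function reduces to plain clipped unigram precision and recall; with salience weights, missing a highly weighted word costs more than missing a function word.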
The method is described in (Babych, 2004), (Babych, Hartley, 2004a) and (Babych, Hartley, 2004b). The paper (Babych, Hartley, 2004a) deals with the relation between the frequency salience weights and legitimate translation variation (LTV).
The method has been tested for correlation with human scores on the DARPA 94 MT evaluation corpus (White et al., 1994) and on a corpus of e-mails / EU White Paper document (Babych, Hartley, Atwell, 2004).
Babych B, Hartley A, Atwell E. 2003. Statistical Modelling of MT output corpora for Information Extraction. In: Proceedings of the Corpus Linguistics 2003 conference, edited by Dawn Archer, Paul Rayson, Andrew Wilson and Tony McEnery. Lancaster University (UK), 28 - 31 March 2003. pp. 62-70.
Babych B, Hartley A. 2004a. Modelling legitimate translation variation for automatic evaluation of MT quality, LREC 2004 (forthcoming).
Babych B, Hartley A. 2004b. Extending BLEU MT Evaluation Method with Frequency Weighting, ACL 2004 (forthcoming).
Babych B. 2004. Weighted N-gram model for evaluating Machine Translation output. CLUK '04. Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics. University of Birmingham, 6-7 January 2004. pp. 15-22.
Papineni K, Roukos S, Ward T, Zhu W-J. 2002. BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002. pp. 311-318.
Salton G, Lesk ME. 1968. Computer evaluation of indexing and text processing. Journal of the ACM, 15(1), 8-36.
White J, O'Connell T, O'Mara F. 1994. The ARPA MT evaluation methodologies: evolution, lessons and future approaches. Proceedings of the 1st Conference of the Association for Machine Translation in the Americas. Columbia, MD, October 1994. pp. 193-205.