Data for the WMT13 shared task on Quality Estimation

Training data

  • Task 1.1: concatenation of the WMT12 QE English-Spanish training and test sentences (in this order). Evaluation script. (An illustrative sketch of typical sentence-level scoring metrics follows this list.)
  • Task 1.2: English-Spanish and German-English. Big thanks to Eleftherios Avramidis for preparing this data.
  • Task 1.3: English-Spanish sentences with post-editing time at sentence-level.
  • Task 2: Same as Task 1.3 with edit operations at word-level.
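
    The evaluation scripts linked above are the official tools. Purely as an illustration of the kind of sentence-level scoring metrics commonly used for these tasks (mean absolute error and root mean squared error), here is a minimal Python sketch; it is not the official script.

        # Illustrative only -- not the official WMT evaluation script.
        # Computes MAE and RMSE between predicted and gold sentence-level scores.
        import math

        def mae(pred, gold):
            # Mean absolute error.
            return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

        def rmse(pred, gold):
            # Root mean squared error.
            return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

        predictions = [3.2, 4.1, 2.5]  # hypothetical system outputs
        gold_labels = [3.0, 4.5, 2.0]  # hypothetical gold labels
        print("MAE: %.4f  RMSE: %.4f" % (mae(predictions, gold_labels),
                                         rmse(predictions, gold_labels)))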

Test data

  • Task 1.1: English news sentences and their Spanish translations. Human references, human post-edited MT output, gold HTER labels and ranking, and Evaluation script. (A simplified HTER sketch follows this list.)
  • Task 1.2: English-Spanish and German-English. DE-EN labels, EN-ES labels, and Evaluation script
  • Task 1.3: English-Spanish sentences. Labels and Evaluation script
  • Task 2: Same as 1.3, word-level. Labels and Evaluation script
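
    HTER is the human-targeted translation edit rate: the number of edits needed to turn the MT output into its human post-edition, divided by the length of the post-edition. The sketch below is a simplified approximation based on plain word-level edit distance (real TER/HTER also allows block shifts); it is not the official scoring tool.

        # Simplified HTER-style score: word-level edit distance between the MT
        # output and its human post-edition, divided by the post-edition length.
        # Real TER/HTER also allows block shifts, which this sketch omits.

        def edit_distance(hyp, ref):
            # Word-level Levenshtein (insertions, deletions, substitutions).
            d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
            for i in range(len(hyp) + 1):
                d[i][0] = i
            for j in range(len(ref) + 1):
                d[0][j] = j
            for i in range(1, len(hyp) + 1):
                for j in range(1, len(ref) + 1):
                    cost = 0 if hyp[i - 1] == ref[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                  d[i][j - 1] + 1,         # insertion
                                  d[i - 1][j - 1] + cost)  # substitution/match
            return d[len(hyp)][len(ref)]

        def hter(mt_output, post_edition):
            hyp, ref = mt_output.split(), post_edition.split()
            return edit_distance(hyp, ref) / float(len(ref))

        print(hter("the house blue", "the blue house"))  # 0.67 without shifts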

Software and resources

    You can use QuEst as is for all sentence-level subtasks: Task 1.1, Task 1.2 (note that you'll have to run feature extraction repeatedly for multiple alternative translations of a given source sentence), and Task 1.3. The default configuration of the software will extract the 17 baseline features as used in WMT12.
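
    For orientation only, a few of the simpler surface features found in a WMT12-style baseline set (token counts, average token length, punctuation counts, type/token ratio) are sketched below. This is not QuEst code, just an illustration of the flavour of such features; the LM-, n-gram- and lexical-table-based features require the resources listed further down and are not reproduced here.

        # Rough illustration of a few surface features of the kind found in the
        # 17-feature baseline (NOT QuEst code; LM- and lexicon-based features
        # require the language resources listed below).
        import string

        def surface_features(source, target):
            src, tgt = source.split(), target.split()
            return {
                "src_num_tokens": len(src),
                "tgt_num_tokens": len(tgt),
                "src_avg_token_length": sum(len(t) for t in src) / float(len(src)),
                # count tokens made up entirely of punctuation characters
                "src_punctuation": sum(1 for t in src
                                       if all(c in string.punctuation for c in t)),
                "tgt_punctuation": sum(1 for t in tgt
                                       if all(c in string.punctuation for c in t)),
                "tgt_type_token_ratio": len(set(tgt)) / float(len(tgt)),
            }

        print(surface_features("This is a test sentence .",
                               "Esta es una frase de prueba ."))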

    For Task 2 (word-level), no baseline feature-extraction code is provided; a non-authoritative starting point is sketched below.
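
    As a purely hypothetical starting point for participants writing their own word-level features, the sketch below emits, for each target token, the token itself, its left and right neighbours, its relative position and its length. These are generic context features, not an official baseline.

        # Hypothetical word-level QE features: token, left/right context,
        # relative position and token length. Not an official baseline --
        # none is provided for Task 2.

        def word_level_features(target_sentence):
            toks = target_sentence.split()
            feats = []
            for i, tok in enumerate(toks):
                feats.append({
                    "token": tok,
                    "left": toks[i - 1] if i > 0 else "<s>",
                    "right": toks[i + 1] if i < len(toks) - 1 else "</s>",
                    "position": i / float(len(toks)),
                    "length": len(tok),
                })
            return feats

        for f in word_level_features("Esta es una frase de prueba ."):
            print(f)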

    Resources for WMT13 shared task

    WMT13 QE Task 1.1

    The resources for baseline (17 official black-box features) and GB (glass-box) features are the same as those provided last year for WMT12 English-Spanish QE shared task. Some additional resources for more advanced features are also provided. We note that resources that refer to specific instances (e.g. n-best lists) are provided separately for training (1832) and test (422) instances, so that participants can reproduce last year's results if they want. For the WMT13 QE shared task, we'll be using the entire WMT12 QE shared task dataset for training, so please concatenate files where appropriate (in the following order: 1832 training instances + 422 test instances).
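
    As a concrete illustration of the required concatenation (WMT12 training instances first, then WMT12 test instances), with hypothetical file names standing in for the downloaded resources:

        # Concatenate instance-level files in the required order:
        # 1832 WMT12 training instances followed by 422 WMT12 test instances.
        # File names below are hypothetical placeholders.

        def concatenate(output_path, *input_paths):
            with open(output_path, "w") as out:
                for path in input_paths:
                    with open(path) as f:
                        out.write(f.read())

        concatenate("wmt13_qe_task1-1.source",
                    "wmt12_qe_training.source",  # 1832 training instances
                    "wmt12_qe_test.source")      # 422 test instances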

    Baseline features: English -- should go under /lang_resources/english/

  • English language model
  • English source training corpus
  • English n-gram counts
  • English truecase model

    Baseline features: Spanish -- should go under /lang_resources/spanish/

  • Spanish language model
  • Spanish language model of POS tags
  • Spanish truecase model

    Baseline features: Giza file -- should go under /lang_resources/giza/

  • English-Spanish Lexical translation table src-tgt

    GB features for training data -- should go under /lang_resources/gb_resources/

  • Moses-like n-best list for WMT12 training and test sets with standard (baseline) model component values, final model score, and word alignment (a parsing sketch follows this list)
  • Moses-like phrase alignment for top translation obtained with option trace in WMT12 training and test sets
  • Moses decoder word graph for WMT12 training and test sets
  • Moses decoder dump obtained with option verbose 2 for WMT12 training and test sets
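
    The n-best lists follow the usual Moses format, with fields separated by ' ||| ': sentence id, hypothesis, component feature scores and total model score, plus (in these files) word-alignment information. A minimal reading sketch under that assumption; check the files themselves for the exact layout of the extra fields.

        # Minimal reader for a Moses-style n-best list. Assumed line format:
        # sent_id ||| hypothesis ||| feature scores ||| total score [||| extra]
        # The position of the word-alignment field is an assumption.

        def read_nbest(path):
            nbest = {}
            with open(path) as f:
                for line in f:
                    fields = [x.strip() for x in line.split("|||")]
                    sent_id = int(fields[0])
                    hypothesis = fields[1]
                    feature_scores = fields[2]
                    total_score = float(fields[3])
                    extra = fields[4:]  # e.g. word-alignment info, if present
                    nbest.setdefault(sent_id, []).append(
                        (hypothesis, feature_scores, total_score, extra))
            return nbest

        # nbest = read_nbest("wmt12_training.nbest")  # hypothetical file name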

    GB features for WMT13 test data -- should go under /lang_resources/gb_resources/

  • Moses-like n-best list for WMT13 test set with standard (baseline) model component values, final model score, and word alignment
  • File with Moses-like phrase alignment for the top translation (obtained with option trace), word graph, and decoder dump (obtained with option verbose 2) for the WMT13 test set

    Advanced features -- should go under /lang_resources/advanced_resources/

  • Pre-generated models for probabilistic features extracted with PCFG parsing (English and Spanish). The tar contains a readme file with information on how these models are generated and the PCFG-based features that can be extracted
  • Pre-generated topic models (English and Spanish). The tar contains a readme file with information on how these models are generated and the features that can be extracted
  • Pre-generated global lexicon models. The tar contains a readme file with information on how these models are generated and the features that can be extracted
  • Pre-generated mutual information trigger models. The tar contains a readme file with information on how these models are generated and the features that can be extracted

    WMT13 QE Task 1.2

    For this task, the training/test datasets contain translations from many MT systems, produced over several years, and we do not have access to those systems. We refer participants to the corpora used in the 2009-2012 editions of WMT to build specific resources such as language models for these datasets. For the baseline features, we will use the resources provided for Task 1.1 for English-Spanish, and resources built from the Europarl and News data released as part of WMT11 (thanks to Eleftherios Avramidis) for the German-English dataset:

    Baseline features: German -- should go under /lang_resources/german/

  • German language model
  • German source training corpus
  • German n-gram counts
  • German truecase model
  • German compound split model

    Baseline features: English -- should go under /lang_resources/english/

  • English language model
  • English truecase model

    Baseline features: Giza file -- should go under /lang_resources/giza/

  • German-English Lexical translation table src-tgt

    WMT13 QE Task 1.3 and Task 2

  • Output of the Moses decoder that produced the English-Spanish translations, including 1000-best lists and search graphs, for the training set and for the test set
  • Parallel data used to build the SMT system (using Moses)
  • Code to re-run the entire Moses system (without installing additional software): execute run.sh


  • Lucia Specia - University of Sheffield