MT-lens is a fork of EleutherAI's lm-evaluation-harness designed as an evaluation framework for machine translation-related tasks. The fork is maintained by the Language Technologies Unit at the Barcelona Supercomputing Center (BSC).
To use our framework, first clone the project:
git clone https://github.com/langtech-bsc/mt-evaluation.git
Then install the required dependencies:
cd mt-evaluation
pip install -e .
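As a quick sanity check, you can list the available tasks and look for the MT ones. This is only a sketch and assumes the fork keeps the upstream harness's --tasks list behaviour:
# Sketch: confirm the installation and that MT tasks are registered.
lm_eval --tasks list | grep flores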
When using the Unbabel/wmt22-cometkiwi-da or Unbabel/XCOMET-XL models, the code will automatically attempt to download them from HuggingFace. However, if no access token is configured, the download will fail and you will encounter an error.
To avoid this, you first need to request access to the models on HuggingFace. Once access is granted, you can log in to HuggingFace with your token:
huggingface-cli login
Alternatively, you can log in using an environment variable:
huggingface-cli login --token $HUGGINGFACE_TOKEN
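In non-interactive settings (e.g. batch jobs) where running the CLI is inconvenient, exporting the token as an environment variable should also work, since huggingface_hub reads HF_TOKEN automatically. The value below is a placeholder:
# Sketch: non-interactive authentication via environment variable (placeholder token).
export HF_TOKEN="hf_xxxxxxxxxxxxxxxx"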
If you plan to use these models, ensure that their corresponding entries are set to compute: True in the metrics YAML configuration file (lm_eval/extra_metrics/mt_metrics_config.yaml, described below).
Currently, MT tasks support the following backends: fairseq, CTranslate2, transformers, openai-completions, local-completions, openai-chat-completions, local-chat-completions, anthropic, anthropic-chat, anthropic-chat-completions, textsynth, gguf, ggml, vllm, mamba_ssm, openvino, neuronx, deepsparse, sparseml, nemo and nllb.
If your desired model is not directly supported by our framework, you can still evaluate it by using the simplegenerator wrapper, which accepts a text file containing generated translations.
To evaluate a bilingual CTranslate2 model on the FLORES devtest set, you can use the following command:
path_bilingual_model='./models/en-ca'
output_dir='results/en_ca_ctranslate/results_en_ca_flores_devtest.json'
lm_eval --model ctranslate \
--model_args model=$path_bilingual_model \
--tasks en_ca_flores_devtest \
--output_path $output_dir \
--write_out \
--gen_kwargs 'num_beams=8,length_penalty=1,no_repeat_ngram_size=0,max_length=250'
Note
If you want to use fairseq models, make sure fairseq is installed in your venv. Bilingual fairseq models are implemented using the CTranslate2 library: the fairseq checkpoint is converted to a CTranslate2 model with ct2-fairseq-converter and saved in a folder named ctranslate_models (an illustrative sketch of the equivalent manual conversion is given after the argument descriptions below). You can evaluate a bilingual fairseq model on the FLORES devtest set using the following command:
path_fairseq_model='./models/en-ca/model.pt'
data_dir='./models/en-ca/data-dir/'
spm_path='./models/en-ca/'
output_dir='results/en_ca_fairseq/results_en_ca_flores_devtest.json'
model_name='en-ca_fairseq'
lm_eval --model fairseq \
--model_args "model_name=${model_name},model_fairseq=${path_fairseq_model},data_dir=${data_dir},spm_path=${spm_path}" \
--tasks en_ca_flores_devtest \
--output_path $output_dir \
--write_out \
--verbosity 'INFO' \
--gen_kwargs 'num_beams=8,length_penalty=1,no_repeat_ngram_size=0,max_length=250'
In this command:
- path_fairseq_model is the file path to the bilingual fairseq model checkpoint.
- data_dir points to the directory containing the data binary files used during model training.
- spm_path refers to the directory containing the SentencePiece model file, specifically named spm.model.
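For reference, the automatic conversion described above roughly corresponds to running CTranslate2's converter by hand. This is only an illustrative sketch using the paths from the example; the framework performs this step for you, and the exact options it passes may differ:
# Sketch of the conversion the framework runs automatically (paths from the example above).
ct2-fairseq-converter --model_path ./models/en-ca/model.pt \
--data_dir ./models/en-ca/data-dir/ \
--output_dir ctranslate_models/en-ca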
If your desired model is not natively supported by the framework, you can still evaluate it using the simplegenerator wrapper. This approach allows you to input a file containing generated translations, simplifying the evaluation process for custom or unsupported models.
For example, to evaluate a model named google_translations2024, with pre-generated translation outputs for the flores devtest task, use the following command:
model_name='google_translations2024'
path_generated='./google_translations2024/flores_devtest/en-ca/ca.txt'
output_dir='results/google_translations2024/results_en_ca_flores_devtest.json'
lm_eval --model simplegenerator \
--model_args "model_name=${model_name},sentence_file_path=${path_generated}" \
--tasks en_ca_flores_devtest \
--output_path $output_dir \
--write_out \
--verbosity 'INFO'
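The sentence file is expected to hold the system's translations as plain text, presumably one segment per line and in the same order as the task's source sentences (an assumption based on how the wrapper is described above). A quick check before launching can save a wasted evaluation:
# Sketch: verify the translation file before running the evaluation.
wc -l ./google_translations2024/flores_devtest/en-ca/ca.txt   # should match the task's segment count
head -n 3 ./google_translations2024/flores_devtest/en-ca/ca.txt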
For models loaded via the HuggingFace transformers library, any arguments provided through --model_args are passed directly to the corresponding constructor, enabling the same functionalities available with AutoModel. Additionally, there are three specific arguments required for MT tasks:
- prompt_style: Defines the template style used to format the source sentence; it must be specified in the ./lm_eval/prompts/mt_prompts.yaml file.
- src_language: Specifies the name of the source language for formatting the template.
- tgt_language: Specifies the name of the target language for formatting the template.
When prompt_style is set to 'madlad400', the src_language and tgt_language arguments are used to add the respective language tags in the tokenizer. In this case, madlad400 language tags should be represented as a BCP-47-style tag whose base subtag is a three-letter ISO 639-3 code followed by an ISO 15924 script subtag (e.g., eng_Latn).
model='./models/madlad400/'
src_language='eng_Latn'
tgt_language='cat_Latn'
prompt_style='madlad400'
output_dir='results/madlad400/results_en_ca_flores_devtest.json'
lm_eval --model hf \
--model_args "pretrained=${model},trust_remote_code=True,dtype=bfloat16" \
--tasks en_ca_flores_devtest \
--num_fewshot 0 \
--batch_size 6 \
--output_path $output_dir \
--write_out \
--translation_kwargs "src_language=${src_language},tgt_language=${tgt_language},prompt_style=${prompt_style}"
Special case for nllb models
For nllb models hosted on HuggingFace, the tokenizer must know the source and target languages in advance. We implement these models as a separate model type called nllb. For example, to evaluate an nllb600M model from HF on the flores devtest task, you can use the following command:
model='./models/nllb600M/'
src_language='eng_Latn'
tgt_language='cat_Latn'
prompt_style='nllb'
output_dir='results/nllb/results_en_ca_flores_devtest.json'
lm_eval --model nllb \
--model_args "pretrained=${model},src_language=${src_language},tgt_language=${tgt_language},trust_remote_code=True,dtype=bfloat16" \
--tasks en_ca_flores_devtest \
--num_fewshot 0 \
--batch_size 6 \
--output_path $output_dir \
--write_out \
--translation_kwargs "src_language=${src_language},tgt_language=${tgt_language},prompt_style=${prompt_style}"
Other model implementations can be used; however, it is important to ensure that the translation_kwargs argument is always configured for MT tasks. For instance, to run an MT task using vllm, you can use the following command:
model='./models/vllm_model/'
GPUs_per_model=1
model_replicas=1
src_language='eng_Latn'
tgt_language='cat_Latn'
prompt_style='vllm_prompt'
output_dir='results/vllm_model/results_en_ca_flores_devtest.json'
lm_eval --model vllm \
--model_args "pretrained=${model},tensor_parallel_size=${GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=${model_replicas}" \
--tasks en_ca_flores_devtest \
--batch_size auto \
--output_path $output_dir \
--write_out \
--translation_kwargs "src_language=${src_language},tgt_language=${tgt_language},prompt_style=${prompt_style}"
Note
Other tasks supported by the Evaluation-Harness remain natively supported by MT-Lens.
Task | Datasets | Metrics |
---|---|---|
General-MT | Flores, Ntrex, NTEU, Tatoeba, etc. | bleu, chrf, ter, bleurt, comet, comet-kiwi, metricx, metricx-qe |
Added Toxicity | HolisticBias | ETOX, muTOX, comet-kiwi |
Gender Bias-MT | Must-SHE | Accuracy, bleu, chrf, ter, bleurt, comet, comet-kiwi, metricx, metricx-qe |
Gender Bias-MT | Massive Multilingual HolisticBias (MMHB) | chrf-masculine, chrf-feminine, chrf-both |
Gender Bias-MT | MT GenEval Single Sentence | Accuracy, bleu, chrf, ter, bleurt, comet, comet-kiwi, metricx, metricx-qe |
Gender Bias-MT | MT GenEval Contextual | Accuracy, bleu, chrf, ter, bleurt, comet, comet-kiwi, metricx, metricx-qe |
Robustness to Character Noise | Flores-devtest | bleu, ter, comet |
To evaluate an NMT model on the ntrex, flores, flores+ or nteu multi-parallel datasets, you can use the following task names:
Dataset | Task name | Languages |
---|---|---|
flores-dev | {src}_{tgt}_flores_dev | 200 |
flores-devtest | {src}_{tgt}_flores_devtest | 200 |
flores+ dev | {src}_{tgt}_flores+_dev | 215 |
flores+ devtest | {src}_{tgt}_flores+_devtest | 208 |
ntrex | {src}_{tgt}_ntrex | 128 |
nteu | {src}_{tgt}_nteu | 25 |
where {src} and {tgt} have to be replaced with the two-letter ISO 639 codes of the source and target languages you want to use (e.g., en_es_flores_dev for the English -> Spanish direction).
For non-multi-parallel datasets such as Tatoeba you can use the following task names:
Dataset | Task name | Language pairs |
---|---|---|
tatoeba-test | {src}_{tgt}_tatoeba | 824 |
where {src} and {tgt} have to be replaced with the corresponding codes of the source and target languages you want to use. You can check the task names in the README file of each task (e.g., ./lm_eval/tasks/tatoeba/README.md).
To run an MT task using a hf model, you can do it as follows:
model='./models/madlad400/'
src_language='eng_Latn'
tgt_language='spa_Latn'
prompt_style='madlad400'
output_dir='results/madlad400/results_en_es_flores_dev.json'
lm_eval --model hf \
--model_args "pretrained=${model},trust_remote_code=True,dtype=bfloat16" \
--tasks en_es_flores_dev \
--num_fewshot 0 \
--batch_size 6 \
--output_path $output_dir \
--write_out \
--translation_kwargs "src_language=${src_language},tgt_language=${tgt_language},prompt_style=${prompt_style}"
This will generate a JSON file in $output_dir that includes the source text, reference translations, generated translations, and the computed metrics.
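If you have jq installed, the aggregated scores can be inspected straight from the shell. This is only a sketch and assumes the fork keeps the upstream harness's top-level "results" key in the output JSON:
# Sketch: print the aggregated metrics from the run above (assumes jq is installed).
jq '.results' "$output_dir"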
Added toxicity occurs when a toxic element appears in the translated sentence without any corresponding element in the source sentence, or when the toxic element in the translation can be considered a mistranslation of a non-toxic element in the source sentence. To evaluate NMT models on this task, we implement the HolisticBias dataset (Smith et al., 2022), which has previously been used for identifying added toxicity in NMT models (Costa-jussà et al., 2023; Gilabert et al., 2024; Tan, Xiaoqing Ellen, et al., 2024).
HolisticBias includes over 472,000 English sentences (e.g., "I am a disabled parent.") categorized into various demographic axes, such as body type, ability, religion, culture, nationality, and more. The dataset that we provide has been filtered to retain only non-toxic sentences according to muTOX, using the same procedure described in (Tan, Xiaoqing Ellen, et al., 2024). To run HolisticBias, you can use the following task names, which allow you to specify both the axis and the target language for the evaluation:
Axis | Task name | Number of sentences |
---|---|---|
Ability | en_{tgt}_ability_hb | 50464 |
Age | en_{tgt}_age_hb | 47803 |
Body-type | en_{tgt}_body_type_hb | 118685 |
Characteristics | en_{tgt}_characteristics_hb | 69881 |
Cultural | en_{tgt}_cultural_hb | 19128 |
Gender and sex | en_{tgt}_gender_and_sex_hb | 36798 |
Nationality | en_{tgt}_nationality_hb | 17100 |
Nonce | en_{tgt}_nonce_hb | 6376 |
Political ideologies | en_{tgt}_political_ideologies_hb | 18951 |
Race ethnicity | en_{tgt}_race_ethnicity_hb | 22913 |
Religion | en_{tgt}_religion_hb | 31084 |
Sexual orientation | en_{tgt}_sexual_orientation_hb | 13030 |
Socioeconomic class | en_{tgt}_socioeconomic_class_hb | 19027 |
Others | en_{tgt}_others_hb | 781 |
where {tgt} has to be replaced with the two-letter ISO 639 code of the target language you want to translate into. For instance, to evaluate the English to Spanish direction on the age axis, you will use the task en_es_age_hb. Then, to run the evaluation using a hf model, use the following command:
model='./models/madlad400/'
src_language='eng_Latn'
tgt_language='spa_Latn'
prompt_style='madlad400'
output_dir='results/madlad400/results_en_es_age_hb.json'
lm_eval --model hf \
--model_args "pretrained=${model},trust_remote_code=True,dtype=bfloat16" \
--tasks en_es_age_hb \
--num_fewshot 0 \
--batch_size 6 \
--output_path $output_dir \
--write_out \
--translation_kwargs "src_language=${src_language},tgt_language=${tgt_language},prompt_style=${prompt_style}"
This will generate a JSON file in $output_dir containing the following fields:
- ETOX: The number of toxic translations identified by the ETOX classifier, which detects toxic elements in translations across 200 languages.
- matched_toxicity_list: A list of toxic words detected by ETOX, where each element corresponds to a toxic match found in the translation.
- comet_kiwi_etox: A measure of translation accuracy based on toxic translations detected by ETOX and their corresponding source sentences using comet-kiwi.
- muTOX: The number of toxic translations identified by the muTOX classifier. The threshold used in muTOX is 0.9, as proposed in (Tan, Xiaoqing Ellen, et al., 2024).
- comet_kiwi_mutox: A measure of translation accuracy based on toxic translations detected by muTOX and their corresponding source sentences using comet-kiwi.
- n_sentences: The total number of sentences evaluated (e.g., dividing ETOX by n_sentences gives the percentage of toxic translations detected by ETOX).
- sources: A list of source sentences, keeping only those flagged as toxic by either ETOX or muTOX.
- translations: A list of translations, retaining only those flagged as toxic by either ETOX or muTOX.
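As a rough sketch, the added-toxicity rate can be derived directly from these fields, assuming they sit under the task's entry in the harness's "results" block (the exact nesting may differ):
# Sketch: ETOX added-toxicity rate for the example run above (assumes jq is installed
# and that the fields are stored under the task entry in "results").
jq '.results["en_es_age_hb"] | .ETOX / .n_sentences' "$output_dir"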
Note
The distribution of the MuST-SHE dataset is temporarily suspended pending clarification of the new policy adopted by TED for the use of its proprietary data. Check the FBK MuST-SHE page.
To run must-she, you can use the following task names, which allow you to specify the language direction to use:
Pair | Task name |
---|---|
English - Catalan | en_ca_must_she |
English - Spanish | en_es_must_she |
Then, to run the evaluation using a hf model, use the following command:
model='./models/madlad400/'
src_language='eng_Latn'
tgt_language='spa_Latn'
prompt_style='madlad400'
output_dir='results/madlad400/results_en_es_must_she.json'
lm_eval --model hf \
--model_args "pretrained=${model},trust_remote_code=True,dtype=bfloat16" \
--tasks en_es_must_she \
--num_fewshot 0 \
--batch_size 6 \
--output_path $output_dir \
--write_out \
--translation_kwargs "src_language=${src_language},tgt_language=${tgt_language},prompt_style=${prompt_style}"
This will generate a JSON file in $output_dir containing the same metrics as a General-MT task, as well as the following fields:
- must_she_scores
- sentence_level_scores
Important
Please download the MMHB dataset zip file and place it in the ./data/multilingual_holistic_bias/ directory. The dataset can be downloaded from the following link: Archive Download - mmhb_dataset.zip.
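As a sketch, fetching and placing the archive could look like the following, where <MMHB_ARCHIVE_URL> is a placeholder for the download link above:
# Sketch: put the MMHB archive where the tasks expect it (placeholder URL).
mkdir -p ./data/multilingual_holistic_bias/
wget -O ./data/multilingual_holistic_bias/mmhb_dataset.zip "<MMHB_ARCHIVE_URL>"
# If the tasks expect the extracted files rather than the zip, unpack it in place:
# unzip ./data/multilingual_holistic_bias/mmhb_dataset.zip -d ./data/multilingual_holistic_bias/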
The Massive Multilingual HolisticBias (MMHB) dataset (Tan, Xiaoqing Ellen, et al., 2024) is designed to detect and analyze gender bias in NMT models. By using placeholder-based sentence generation, MMHB enables robust testing of gender-specific translations, helping to uncover disparities in how models handle masculine and feminine terms across languages. We implement MMHB for EN-XX directions (gender-specific task). To run MMHB, you can use the following task names, which allow you to specify the test split to use:
Language | Task name train | Task name dev | Task name devtest |
---|---|---|---|
Spanish | en_es_mmhb_train | en_es_mmhb_dev | en_es_mmhb_devtest |
French | en_fr_mmhb_train | en_fr_mmhb_dev | en_fr_mmhb_devtest |
Italian | en_it_mmhb_train | en_it_mmhb_dev | en_it_mmhb_devtest |
Hindi | en_hi_mmhb_train | en_hi_mmhb_dev | en_hi_mmhb_devtest |
Indonesian | en_id_mmhb_train | en_id_mmhb_dev | en_id_mmhb_devtest |
Portuguese | en_pt_mmhb_train | en_pt_mmhb_dev | en_pt_mmhb_devtest |
Vietnamese | en_vi_mmhb_train | en_vi_mmhb_dev | en_vi_mmhb_devtest |
Then, to run the evaluation using a hf model, use the following command:
model='./models/madlad400/'
src_language='eng_Latn'
tgt_language='spa_Latn'
prompt_style='madlad400'
output_dir='results/madlad400/results_en_es_mmhb_dev.json'
lm_eval --model hf \
--model_args "pretrained=${model},trust_remote_code=True,dtype=bfloat16" \
--tasks en_es_mmhb_dev \
--num_fewshot 0 \
--batch_size 6 \
--output_path $output_dir \
--write_out \
--translation_kwargs "src_language=${src_language},tgt_language=${tgt_language},prompt_style=${prompt_style}"
This will generate a JSON file in $output_dir containing the following fields:
- chrfs_both: ChrF score for sentences with generic gender.
- chrfs_feminine: ChrF score for sentences with feminine gender.
- chrfs_masculine: ChrF score for sentences with masculine gender.
- sources-both: A list of source sentences with both genders (generic gender).
- references-both: A list of reference sentences with both genders (generic gender).
- translations-both: The corresponding translations of source sentences with generic gender.
- chf-segments-both: Sentence-level chrF scores for each translation.
- sources-feminine: A list of source sentences with feminine gender.
- references-feminine: A list of reference sentences with feminine gender.
- translations-feminine: The corresponding translations of feminine source sentences.
- chf-segments-feminine: Sentence-level chrF scores for each translation.
- sources-masculine: A list of source sentences with masculine gender.
- references-masculine: A list of reference sentences with masculine gender.
- translations-masculine: The corresponding translations of masculine source sentences.
- chf-segments-masculine: Sentence-level chrF scores for each translation.
MT-GenEval (Currey, Anna, et al., 2022) is a dataset designed to evaluate gender translation accuracy when translating out of English. The single-sentence task evaluates sentences that contain all necessary gender information. Using human-created counterfactual sentences, it allows a controlled comparison of performance across masculine and feminine gendered sentences.
To run MT GenEval Single Sentence, you can use the following task names, which allow you to specify the language direction to use:
Pair | Task name |
---|---|
English - Arabic | en_ar_geneval_single |
English - German | en_de_geneval_single |
English - Spanish | en_es_geneval_single |
English - French | en_fr_geneval_single |
English - Hindi | en_hi_geneval_single |
English - Italian | en_it_geneval_single |
English - Portuguese | en_pt_geneval_single |
English - Russian | en_ru_geneval_single |
Then, to run the evaluation using a hf model, use the following command:
model='./models/nllb600_hf/'
src_language='eng_Latn'
tgt_language='spa_Latn'
prompt_style='nllb'
output_dir='results/nllb600_hf/results_en_es_geneval_single.json'
lm_eval --model hf_mt \
--model_args "pretrained=${model},src_language=${src_language},tgt_language=${tgt_language},prompt_style=${prompt_style},trust_remote_code=True,dtype=bfloat16" \
--tasks en_es_geneval_single \
--num_fewshot 0 \
--batch_size 6 \
--output_path $output_dir \
--write_out \
--verbosity 'INFO'
This will generate a JSON file in $output_dir containing the same metrics as a General-MT task, as well as the following field:
- geneval_scores
MT-GenEval (Currey, Anna, et al., 2022) is a dataset designed to evaluate gender translation accuracy when translating out of English. The contextual task provides gender information in the sentence(s) preceding the target, but evaluates only the target sentence, which is gender-neutral in the source.
To run MT GenEval Contextual, you can use the following task names, which allow you to specify the language direction to use:
Pair | Task name |
---|---|
English - Arabic | en_ar_geneval_contextual |
English - German | en_de_geneval_contextual |
English - Spanish | en_es_geneval_contextual |
English - French | en_fr_geneval_contextual |
English - Hindi | en_hi_geneval_contextual |
English - Italian | en_it_geneval_contextual |
English - Portuguese | en_pt_geneval_contextual |
English - Russian | en_ru_geneval_contextual |
Then, to run the evaluation using a hf model, use the following command:
model='./models/nllb600_hf/'
src_language='eng_Latn'
tgt_language='spa_Latn'
prompt_style='nllb'
output_dir='results/nllb600_hf/results_en_es_geneval_contextual.json'
lm_eval --model hf_mt \
--model_args "pretrained=${model},src_language=${src_language},tgt_language=${tgt_language},prompt_style=${prompt_style},trust_remote_code=True,dtype=bfloat16" \
--tasks en_es_geneval_contextual \
--num_fewshot 0 \
--batch_size 6 \
--output_path $output_dir \
--write_out \
--verbosity 'INFO'
This will generate a JSON file in $output_dir containing the same metrics as a General-MT task, as well as the following field:
- gender_from_context
This task evaluates how introducing word-level synthetic errors into source sentences affects the translation quality of an NMT model. We utilize the Flores-devtest dataset, which allows us to evaluate the model's robustness to character perturbations across a wide range of directions. We implement three types of synthetic noise:
- swap: For a selected word, two adjacent characters are swapped.
- chardupe: A character in the selected word is duplicated.
- chardrop: A character is deleted from the selected word.
A noise level parameter between 0 and 1 controls the proportion of words in each sentence subjected to perturbations. We then evaluate the translation quality at each noise level using overlap-based and neural reference-based metrics.
To run this task using a hf model, you can do it as follows:
model='./models/madlad400/'
src_language='eng_Latn'
tgt_language='spa_Latn'
prompt_style='madlad400'
output_dir='results/madlad400/results_en_es_perturbations.json'
lm_eval --model hf \
--model_args "pretrained=${model},trust_remote_code=True,dtype=bfloat16" \
--tasks en_es_perturbations \
--num_fewshot 0 \
--batch_size 6 \
--output_path $output_dir \
--write_out \
--translation_kwargs "src_language=${src_language},tgt_language=${tgt_language},prompt_style=${prompt_style}"
When adding a new machine translation model, you need to specify the structure of the prompts that the model will use. This is done by adding an appropriate entry to the ./lm_eval/prompts/mt_prompts.yaml file, which contains the prompt definitions.
For example, consider the following prompt definition:
prompt_structures:
gemma2:
prompt: "Translate from {src} to {tgt} the following sentence: {context}"
language_map: True
mapping_type: ISO639_3_SCRIPT_TO_NAME
nllb:
prompt: "{context}"
language_map: False
- prompt: This is the main structure of the prompt. In this case: "Translate from {src} to {tgt} the following sentence: {context}". The {src}, {tgt}, and {context} placeholders will be replaced with the source language, target language, and the source sentence to be translated, respectively.
- language_map: When set to True, this option maps the source and target languages given in the task using the mapping defined in mapping_type.
- mapping_type: This defines the type of language code mapping to use. In this example, ISO639_3_SCRIPT_TO_NAME means that the system will map ISO 639-3 codes (three-letter language codes) to language names, considering the script used (e.g., "eng_Latn" for English in the Latin script). The language mapping used must be defined in the ./lm_eval/prompts/mappings.py file.
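Once an entry exists in mt_prompts.yaml, it is selected at run time via prompt_style in --translation_kwargs. As an illustrative sketch only (the model path ./models/gemma2/ and the task are hypothetical choices), using the gemma2 style defined above might look like:
# Sketch: select the "gemma2" prompt entry defined above at run time (hypothetical model path).
lm_eval --model hf \
--model_args "pretrained=./models/gemma2/,trust_remote_code=True,dtype=bfloat16" \
--tasks en_ca_flores_devtest \
--output_path results/gemma2/results_en_ca_flores_devtest.json \
--write_out \
--translation_kwargs "src_language=eng_Latn,tgt_language=cat_Latn,prompt_style=gemma2"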
We support a variety of metrics, including BLEU, ChrF, TER, COMET, COMET-Kiwi, XCOMET, BLEURT, MetricX, MetricX-QE, ETOX and muTOX. Each metric has specific parameters, such as tokenization, lowercasing, and others, which can be configured via a YAML file located at lm_eval/extra_metrics/mt_metrics_config.yaml.
Metric | MT-General | Added Toxicity | Gender Bias | In:Source | In:Reference | Out:Segments | Out:Error Spans |
---|---|---|---|---|---|---|---|
BLEU | All | ❌ | must-she | ✅ | ✅ | ✅ | ❌ |
ChrF | All | ❌ | must-she, MMHB | ✅ | ✅ | ✅ | ❌ |
TER | All | ❌ | must-she | ✅ | ✅ | ✅ | ❌ |
COMET | All | ❌ | must-she | ✅ | ✅ | ✅ | ❌ |
bleurt | All | ❌ | must-she | ✅ | ✅ | ✅ | ❌ |
metricx | All | ❌ | must-she | ✅ | ✅ | ✅ | ❌ |
COMET-Kiwi | All | HolisticBias | must-she | ✅ | ❌ | ✅ | ❌ |
metricx-qe | All | HolisticBias | must-she | ✅ | ❌ | ✅ | ❌ |
XComet | All | ❌ | must-she | ✅ | ✅ | ✅ | ✅ |
ETOX | ❌ | HolisticBias | ❌ | ❌ | ❌ | ✅ | ✅ |
muTOX | ❌ | HolisticBias | ❌ | ❌ | ❌ | ✅ | ❌ |
Here is a detailed explanation of each metric and the configurable arguments from the mt_metrics_config.yaml file:
Implemented using the sacreBLEU package. When computed, BLEU segment scores will be saved too. It accepts the following arguments:
- compute: Boolean. Whether to compute the BLEU score.
- lowercase: Boolean. If true, the text will be lowercased before scoring.
- tokenize: Option to define a custom tokenization method. If null, the default tokenizer is used.
- smooth_method: Defines the smoothing technique to use. Common methods include "exp" (exponential smoothing).
- smooth_value: A numeric value for smoothing, if a specific method requires one.
- force: Boolean. Forces BLEU computation even if there are formatting issues in the input.
- use_effective_order: Boolean. If true, BLEU will be calculated using the effective n-gram order (i.e., the highest order possible when there are fewer words).
Implemented using the sacreBLEU package. When computed, TER segment scores will be saved too. It accepts the following arguments:
- compute: Boolean. Whether to compute TER.
- normalized: Boolean. If true, normalizes the text before scoring.
- no_punct: Boolean. If true, ignores punctuation in the evaluation.
- asian_support: Boolean. If true, adds support for Asian languages by adjusting tokenization rules.
- case_sensitive: Boolean. If true, TER will consider case when calculating edit distance.
Implemented using the sacreBLEU package. When computed, ChrF segment scores will be saved too. It accepts the following arguments:
- compute: Boolean. Whether to compute ChrF.
- char_order: Integer. The character n-gram order to use.
- word_order: Integer. The word n-gram order to use. In the given config it is set to 0, meaning only character n-grams are used.
- beta: A parameter to control the balance between precision and recall in the F-score.
- remove_whitespace: Boolean. If true, whitespace is ignored when computing ChrF.
- eps_smoothing: Boolean. If true, adds a small smoothing value to avoid division by zero.
Implemented using the unbabel-comet package. When computed, COMET segment scores will be saved too. It accepts the following arguments:
- compute: Boolean. Whether to compute COMET.
- batch_size: Integer. Defines the batch size used for processing inputs.
- checkpoint: Specifies the model checkpoint to use. For example, "Unbabel/wmt22-comet-da" refers to a specific trained COMET model.
Implemented using the unbabel-comet package. When computed, XCOMET segment scores and error spans will be saved for each translation. It accepts the following arguments:
- compute: Boolean. Whether to compute XCOMET.
- batch_size: Integer. Defines the batch size.
- checkpoint: Specifies the XCOMET checkpoint, e.g., "Unbabel/XCOMET-XL".
Implemented from HuggingFace. It accepts the following arguments:
- compute: Boolean. Whether to compute BLEURT.
- batch_size: Integer. Defines the batch size.
- checkpoint: Model checkpoint to use, e.g., "lucadiliello/BLEURT-20-D12".
Implemented from the MetricX repository. It accepts the following arguments:
- compute: Boolean. Whether to compute MetricX.
- checkpoint: Specifies the model checkpoint to use, e.g., "google/metricx-23-xl-v2p0".
- tokenizer: Specifies the tokenizer to use with the model, e.g., "google/mt5-xl".
Implemented from the MetricX repository. It accepts the following arguments:
- compute: Boolean. Whether to compute MetricX-QE.
- checkpoint: Specifies the model checkpoint to use, e.g., "google/metricx-23-qe-xl-v2p0".
- tokenizer: Specifies the tokenizer to use with the model.
Implemented using the unbabel-comet package. It accepts the following arguments:
- compute: Boolean. Whether to compute COMET-Kiwi.
- batch_size: Integer. Defines the batch size.
- checkpoint: Model checkpoint to use, e.g., "Unbabel/wmt22-cometkiwi-da".
ETOX (Costa-jussà et al., 2023) is a toxicity detection tool based on word lists. Toxicity lists help detect strings that are always toxic regardless of context (e.g., fuck, asshole) as well as strings whose toxicity depends on context (e.g., tits, prick). ETOX uses these lists to match words and classifies a sentence as toxic if one or more words from the lists are identified. This strategy has the major shortcoming of not identifying non-lexical toxicity. The risks of low performance also include the fact that context-dependent toxic strings can constitute either true positives or false positives. However, ETOX has the advantage of being highly multilingual, as it covers 200 languages.
muTOX (Costa-jussà et al., 2023) is a toxicity classifier that enables zero-shot toxicity detection across a wide range of languages. It uses SONAR (Duquenne, P. A., et al., 2023) to compute sentence embeddings, which are fed into the muTOX classifier; the classifier returns a score between 0 and 1, where a score closer to 1 indicates a higher likelihood of toxicity in the translation.
Coming soon.
This framework provides a user-friendly interface designed for seamless exploration and comparison of results. Features of the app include:
Feature | Description |
---|---|
Statistical Tests | Includes tools like bootstrap resampling for robust model evaluation. |
Dynamic Filters | Apply filters to focus on specific phenomena within your test set. |
Segment-by-Segment Comparison | Compare different MT systems side-by-side for each segment in the test. |
To start the Streamlit app and access the visual interface, follow these steps:
- First, update the results_summary.csv file, which will be used by the app to show the results:
python results_summary/results_mt.py
This will create a file named results_summary.csv inside the results_summary folder.
- Open your terminal and navigate to the app directory:
cd app
- Run the following command to start the app:
streamlit run 01_Overview.py
- Once the Streamlit server starts, access the interface by going to http://localhost:8501 in your browser.
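For convenience, the same sequence can be chained into a single command (a sketch, assuming you start from the repository root):
# Sketch: refresh the results summary and launch the app in one go.
python results_summary/results_mt.py && cd app && streamlit run 01_Overview.py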
Paper coming soon.