The SQuAD 2.0 metric (Rajpurkar et al., 2018) is the de facto standard metric for evaluating extractive question answering. To compute the F1 and EM (exact match) scores, it tokenizes answers by splitting on whitespace, which works well for languages that separate words with whitespace but fails for languages that do not, such as Chinese or Thai. Nevertheless, the default SQuAD metric has been used to evaluate multilingual extractive question answering, most prominently in the XQuAD benchmark (Artetxe et al., 2020).
To better evaluate languages that do not rely on whitespace, we propose to use a word tokenizer instead.
The original SQuAD 2.0 evaluation metric uses the following tokenization function:
def get_tokens(s):
    if not s:
        return []
    return normalize_answer(s).split()
This whitespace tokenization only produces the desired results for languages that use whitespace to separate words. XQuAD, for example, also contains examples in Chinese and Thai, languages in which whitespace is not used to separate words and may appear only to separate sentences or to emphasize words.
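To illustrate the problem, consider the token-overlap F1 on a Chinese prediction that only partially matches the gold answer. The snippet below is a minimal sketch: token_f1 is a simplified re-implementation of the F1 computation in the SQuAD 2.0 evaluation script (which operates on full strings), and the Chinese example strings are ours.

import collections
import jieba

def token_f1(gold_toks, pred_toks):
    # Token-overlap F1 as computed by the SQuAD 2.0 evaluation script
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

gold = "北京大学"        # illustrative gold answer ("Peking University")
pred = "北京大学的校园"  # illustrative prediction with extra characters

# Whitespace tokenization: each span becomes a single token, so any partial
# overlap counts as no overlap at all and F1 collapses to 0 or 1.
print(gold.split(), pred.split())              # ['北京大学'] ['北京大学的校园']
print(token_f1(gold.split(), pred.split()))    # 0.0

# Word tokenization (here: jieba) typically yields overlapping tokens,
# so the prediction receives partial credit instead of 0.
print(token_f1(list(jieba.cut(gold)), list(jieba.cut(pred))))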
We propose using a word tokenizer for languages that do not rely on whitespace. Our get_tokens function looks like this:
import jieba               # Chinese word segmentation
from nltk import tokenize  # assumed to provide word_tokenize for the other languages

def get_tokens(s, language, use_word_tokenizer):
    if not s:
        return []
    if use_word_tokenizer:
        if language == "chinese":
            return list(jieba.cut(normalize_answer(s)))
        else:
            return tokenize.word_tokenize(normalize_answer(s), language=language)
    else:
        return normalize_answer(s).split()
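To illustrate the difference between the two modes, here are a few calls to the adapted function. The example strings are ours; normalize_answer is the unchanged helper from the SQuAD 2.0 evaluation script, and the NLTK path requires the punkt tokenizer data.

# Illustrative calls; assumes normalize_answer from the SQuAD 2.0 evaluation
# script is in scope and nltk.download("punkt") has been run.
print(get_tokens("the quick brown fox", "english", use_word_tokenizer=True))
# e.g. ['quick', 'brown', 'fox'] -- normalize_answer drops articles and punctuation

print(get_tokens("斯坦福大学位于加州", "chinese", use_word_tokenizer=False))
# ['斯坦福大学位于加州'] -- whitespace split keeps the whole span as one token

print(get_tokens("斯坦福大学位于加州", "chinese", use_word_tokenizer=True))
# e.g. ['斯坦福大学', '位于', '加州'] -- jieba segments the span into words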
To use our metric with the widely used Hugging Face evaluate library, you can use the M2QAMetric class from m2qa_metric.py:
from M2QA_Metric.m2qa_metric import M2QAMetric

# 1. Load our adapted multilingual SQuAD 2.0 metric
m2qa_metric = M2QAMetric()

# 2. Call the metric as you normally would, but add the "language" parameter
full_results = m2qa_metric.compute(
    predictions=predictions,
    references=references,
    no_answer_threshold=0.95,
    language=language,
)
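The predictions and references passed above follow the input format of the Hugging Face squad_v2 metric; since M2QAMetric adapts that metric, we assume the format is unchanged. A made-up example:

# Made-up example inputs in the standard squad_v2 format
# (we assume M2QAMetric keeps this format unchanged).
predictions = [
    {"id": "q1", "prediction_text": "北京", "no_answer_probability": 0.1},
    {"id": "q2", "prediction_text": "", "no_answer_probability": 0.99},
]
references = [
    {"id": "q1", "answers": {"text": ["北京"], "answer_start": [15]}},
    {"id": "q2", "answers": {"text": [], "answer_start": []}},  # unanswerable question
]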
This metric also works with the Hugging Face evaluator pipeline for extractive question answering:
from M2QA_Metric.m2qa_metric import M2QAMetric
import evaluate

# 1. Initialize your QA evaluator (this is unchanged)
qa_evaluator = evaluate.evaluator("question-answering")

# 2. Load our adapted multilingual SQuAD 2.0 metric
metric = M2QAMetric()

# 3. Before computing the results, you have to set the language that the data is in
qa_evaluator.METRIC_KWARGS["language"] = language

# 4. Then call the evaluator as always (this is unchanged)
results = qa_evaluator.compute(
    tokenizer=tokenizer,
    model_or_pipeline=model,
    data=squad_v2_dataset["validation"],
    metric=metric,
    squad_v2_format=True,
)
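Because the language is set via METRIC_KWARGS before each run, evaluating a multilingual dataset amounts to one evaluator call per language. Below is a minimal sketch, assuming a hypothetical dataset with a "language" column; the language identifiers in the loop are illustrative.

# Hypothetical per-language evaluation loop; assumes `dataset` is in SQuAD v2
# format and has a "language" column to filter on.
all_results = {}
for language in ["english", "chinese"]:  # illustrative language identifiers
    language_subset = dataset.filter(lambda example: example["language"] == language)
    qa_evaluator.METRIC_KWARGS["language"] = language
    all_results[language] = qa_evaluator.compute(
        tokenizer=tokenizer,
        model_or_pipeline=model,
        data=language_subset,
        metric=metric,
        squad_v2_format=True,
    )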
We provide an adapted version of the official SQuAD v2.0 evaluation script here: m2qa_metric_official_squad_v2_eval_script.py
Chinese results using the adapted SQuAD 2.0 metric with word tokenization instead of whitespace tokenization; the change affects the F1 scores on answerable questions. Relative changes compared to Table 2 are shown in parentheses: