StingrayBench

StingrayBench is a benchmark dataset designed to evaluate cross-lingual sense disambiguation in multilingual large language models (LLMs). This dataset targets the comprehension of false friends—words that appear orthographically similar but have distinct meanings across languages. It serves as the first benchmark specifically measuring cross-lingual semantic understanding in LLMs.

Overview

Recent advancements in multilingual LLMs have expanded their applicability across languages. However, these models frequently exhibit a preference for higher-resource languages, sometimes misrepresenting or misunderstanding user intent in lower-resource languages. StingrayBench addresses this bias by focusing on false friends and true cognates, creating a reliable and reproducible benchmark for cross-lingual sense disambiguation.

Dataset URL: StingrayBench on Hugging Face
Evaluation Suite: StingrayBench GitHub

Key Features

Language Pairs: StingrayBench covers four language pairs that include high-resource and low-resource languages:
- English-German (en-de)
- Chinese-Japanese (zh-ja)
- Indonesian-Malay (id-ms)
- Indonesian-Tagalog (id-tl)
Task Types: The dataset is designed to evaluate semantic appropriateness and usage correction across these language pairs.
- Semantic Appropriateness: Models identify which sentence is semantically accurate given two options.
- Usage Correction: Models identify incorrect uses of cognate words in context, focusing on words with distinct meanings between the languages.
New Metrics:
- Cognate Bias quantifies the tendency of a model to select words based on language preference rather than semantic appropriateness.
- Cognate Comprehension Score measures the accuracy of LLMs in distinguishing false friends and true cognates across language pairs.

Dataset Description

Data Construction

StingrayBench is built in collaboration with native speakers of each language pair. Annotators provided translations, labeled false friends and true cognates, and created sentence examples to illustrate correct and incorrect usage.

Dataset Statistics

StingrayBench includes 705 data entries with 259 true cognate entries and 446 false friend entries distributed as follows:

- English-German: 196 entries (98 true cognates, 98 false friends)
- Chinese-Japanese: 165 entries (51 true cognates, 114 false friends)
- Indonesian-Malay: 186 entries (52 true cognates, 134 false friends)
- Indonesian-Tagalog: 158 entries (58 true cognates, 100 false friends)
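
The per-pair counts above can be checked against the stated totals with a few lines of Python. This is only a minimal sketch: the dictionary layout and variable names are illustrative, and the numbers are the only part taken from this README.

```python
# Per-pair counts copied from the statistics above: (true cognates, false friends).
counts = {
    "en-de": (98, 98),
    "zh-ja": (51, 114),
    "id-ms": (52, 134),
    "id-tl": (58, 100),
}

total_true = sum(t for t, _ in counts.values())
total_false = sum(f for _, f in counts.values())
total = total_true + total_false

# Share of false friends within each language pair.
false_friend_share = {pair: f / (t + f) for pair, (t, f) in counts.items()}

print(total_true, total_false, total)
for pair, share in false_friend_share.items():
    print(f"{pair}: {share:.1%} false friends")
```

Only English-German is balanced between the two classes; the other three pairs contain more false friends than true cognates, which is worth keeping in mind when comparing raw accuracy across pairs.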

Example Prompts and Responses

Semantic Appropriateness Task

Prompt:
Which sentence is more semantically appropriate?
A. "Ich habe einen Arm." (German)
B. "I have an Arm." (English)
C. "Both sentences are appropriate."

Target Completion: "C. Both sentences are appropriate."
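
To make the prompt layout concrete, the example above could be assembled roughly as follows. This is a sketch under assumptions: `build_semantic_prompt` is a hypothetical helper written for illustration, not a function from the StingrayBench code, and the actual templates shipped with the evaluation suite may be worded differently.

```python
def build_semantic_prompt(sentence_l1: str, lang_l1: str,
                          sentence_l2: str, lang_l2: str) -> str:
    """Hypothetical helper: format a three-option semantic appropriateness prompt."""
    lines = [
        "Which sentence is more semantically appropriate?",
        f'A. "{sentence_l1}" ({lang_l1})',
        f'B. "{sentence_l2}" ({lang_l2})',
        'C. "Both sentences are appropriate."',
    ]
    return "\n".join(lines)


prompt = build_semantic_prompt(
    "Ich habe einen Arm.", "German",
    "I have an Arm.", "English",
)
print(prompt)
```

For a true cognate such as "Arm", the expected answer is option C; for a false friend, only one of the two sentences is appropriate.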

Usage Correction Task

Prompt:
Is the usage of "pagi" in this sentence correct? "Ako ay masaya ngayong pagi." (Tagalog)

Target Completion:
"No, the usage of 'pagi' is incorrect. 'Pagi' means 'stingray' in Tagalog, and the sentence should use 'umaga' for 'morning'."
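
When scoring free-form generations for this task, the leading yes/no verdict has to be extracted from the response. One possible way to do that is sketched below; `parse_usage_verdict` is a hypothetical helper, and the repository's own generation evaluation script may handle this differently.

```python
import re
from typing import Optional


def parse_usage_verdict(response: str) -> Optional[bool]:
    """Hypothetical helper: return True for "yes", False for "no", None if unclear.

    Only the first word of the response is inspected; leading quotes and
    whitespace are skipped.
    """
    match = re.match(r"\W*(yes|no)\b", response, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


target = "\"No, the usage of 'pagi' is incorrect.\""
print(parse_usage_verdict(target))
```

Returning `None` for responses that start with neither word lets the caller count unparseable generations separately instead of silently treating them as wrong.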

How to Use

# Load Dataset
import datasets

config_name = 'en_de' # supported config names: ['id_tl', 'id_tl_common', 'zh_ja', 'zh_ja_common', 'id_ms', 'id_ms_common', 'en_de', 'en_de_common']
dataset = datasets.load_dataset("StingrayBench/StingrayBench", config_name, split=datasets.Split.TEST)

# Load Prompt
from prompts import CONFIG_TO_PROMPT
task_name = 'semantic_correctness' # supported task names: ["semantic_correctness", "usage_correctness_l1", "usage_correctness_l2"]
prompt_template = CONFIG_TO_PROMPT[task_name]

For more detail, please check our likelihood and generation evaluation scripts.

Licensing

StingrayBench is available under the Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) license, while all released code is licensed under the Apache-2.0 license. This ensures open access, encouraging collaboration and promoting fairer language modeling practices for the multilingual community.
