StingrayBench

StingrayBench is a benchmark dataset designed to evaluate cross-lingual sense disambiguation in multilingual large language models (LLMs). This dataset targets the comprehension of false friends—words that appear orthographically similar but have distinct meanings across languages. It serves as the first benchmark specifically measuring cross-lingual semantic understanding in LLMs.

Overview

Recent advancements in multilingual LLMs have expanded their applicability across languages. However, these models frequently exhibit a preference for higher-resource languages, sometimes misrepresenting or misunderstanding user intent in lower-resource languages. StingrayBench addresses this bias by focusing on false friends and true cognates, creating a reliable and reproducible benchmark for cross-lingual sense disambiguation.

Key Features

  1. Language Pairs: StingrayBench covers four language pairs that include high-resource and low-resource languages:

    • English-German (en-de)
    • Chinese-Japanese (zh-ja)
    • Indonesian-Malay (id-ms)
    • Indonesian-Tagalog (id-tl)
  2. Task Types: The dataset is designed to evaluate semantic appropriateness and usage correction across these language pairs.

    • Semantic Appropriateness: Models identify which sentence is semantically accurate given two options.
    • Usage Correction: Models identify incorrect uses of cognate words in context, focusing on words with distinct meanings between the languages.
  3. New Metrics (a rough computational sketch follows this list):

    • Cognate Bias quantifies the tendency of a model to select words based on language preference rather than semantic appropriateness.
    • Cognate Comprehension Score measures the accuracy of LLMs in distinguishing false friends and true cognates across language pairs.
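
The paper defines these metrics precisely; the snippet below is only a minimal sketch of how an accuracy-style comprehension score and a bias-style preference score could be computed from per-item results. The result structure and the fields "correct" and "chose_l1" are hypothetical placeholders, not the repository's actual evaluation output.

# Minimal illustrative sketch, not the paper's exact formulas.
# Each result is a dict with hypothetical fields:
#   "correct":  whether the model answered the item correctly
#   "chose_l1": True/False if the model preferred one language's reading,
#               None if it judged both readings instead
from typing import Dict, List

def comprehension_score(results: List[Dict]) -> float:
    # Accuracy-style score: fraction of items answered correctly.
    return sum(r["correct"] for r in results) / len(results)

def cognate_bias(results: List[Dict]) -> float:
    # Signed preference toward L1 over L2, in [-1, 1]; 0 means no preference.
    picks = [r for r in results if r["chose_l1"] is not None]
    if not picks:
        return 0.0
    n_l1 = sum(r["chose_l1"] for r in picks)
    return (2 * n_l1 - len(picks)) / len(picks)

# Hypothetical usage:
results = [
    {"correct": True,  "chose_l1": None},   # judged "both appropriate" correctly
    {"correct": False, "chose_l1": True},   # defaulted to the L1 reading
    {"correct": False, "chose_l1": False},  # defaulted to the L2 reading
]
print(comprehension_score(results), cognate_bias(results))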

Dataset Description

Data Construction

StingrayBench was built in collaboration with native speakers of each language pair. Annotators provided translations, labeled false friends and true cognates, and created example sentences illustrating correct and incorrect usage.

Dataset Statistics

StingrayBench includes 705 data entries with 259 true cognate entries and 446 false friend entries distributed as follows:

  • English-German: 196 entries (98 true cognates, 98 false friends)
  • Chinese-Japanese: 165 entries (51 true cognates, 114 false friends)
  • Indonesian-Malay: 186 entries (52 true cognates, 134 false friends)
  • Indonesian-Tagalog: 158 entries (58 true cognates, 100 false friends)
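
To sanity-check these counts locally, the per-config test splits can be loaded with the same datasets.load_dataset call used in "How to Use" below. How the true-cognate and false-friend entries map onto the base and *_common configs is an assumption here; check the dataset card for the authoritative breakdown.

import datasets

# Base config for each language pair; the *_common variants listed in
# "How to Use" can be inspected the same way.
for config_name in ["en_de", "zh_ja", "id_ms", "id_tl"]:
    ds = datasets.load_dataset("StingrayBench/StingrayBench", config_name, split=datasets.Split.TEST)
    print(config_name, len(ds))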

Example Prompts and Responses

Semantic Appropriateness Task

Prompt:
Which sentence is more semantically appropriate?
A. "Ich habe einen Arm." (German)
B. "I have an Arm." (English)
C. "Both sentences are appropriate."

Target Completion: "C. Both sentences are appropriate."

Usage Correction Task

Prompt:
Is the usage of "pagi" in this sentence correct? "Ako ay masaya ngayong pagi." (Tagalog)

Target Completion:
"No, the usage of 'pagi' is incorrect. 'Pagi' means 'stingray' in Tagalog, and the sentence should use 'umaga' for 'morning'."

How to Use

# Load Dataset
import datasets

# supported config names: ['id_tl', 'id_tl_common', 'zh_ja', 'zh_ja_common', 'id_ms', 'id_ms_common', 'en_de', 'en_de_common']
config_name = 'en_de'
dataset = datasets.load_dataset("StingrayBench/StingrayBench", config_name, split=datasets.Split.TEST)

# Load Prompt
from prompts import CONFIG_TO_PROMPT

# supported task names: ["semantic_correctness", "usage_correctness_l1", "usage_correctness_l2"]
task_name = 'semantic_correctness'
prompt_template = CONFIG_TO_PROMPT[task_name]

For more detail, please check our likelihood and generation evaluation scripts.
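
As a rough end-to-end sketch of how the pieces above fit together (assuming the templates are plain str.format strings whose placeholders match the dataset fields, which should be verified against those evaluation scripts):

# Rough sketch only; the real prompt-filling logic lives in the
# likelihood and generation evaluation scripts.
for example in dataset:
    prompt = prompt_template.format(**example)  # assumes str.format-style templates
    # send `prompt` to the model under evaluation and collect its completion
    print(prompt)
    break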

Licensing

The StingrayBench dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license, while all code releases are licensed under the Apache-2.0 license. This ensures open access, encouraging collaboration and promoting fairer language modeling practices for the multilingual community.

Citation Information

If you use StingrayBench, please cite the paper "Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense".