Main Contributors: Qingcheng Zeng*, Mingyu Jin*, Qinkai Yu*, Zhenting Wang, Wenyue Hua, Zihao Zhou, Yanda Meng, Shiqing Ma, Qifan Wang, Felix Juefei-Xu, Kaize Ding, Fan Yang, Ruixiang Tang, Yongfeng Zhang†
🗃️ Datasets
This repo presents the implementation of the Uncertainty_Attack😈
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. One commonly used method to assess the reliability of LLMs' responses is uncertainty estimation, which gauges the likelihood of their answers being correct. While many studies focus on improving the accuracy of uncertainty estimations for LLMs, our research investigates the security of uncertainty estimation and explores the potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output. Specifically, the proposed backdoor attack method can alter an LLM's output probability distribution, causing the top-k distribution to converge towards an attacker-predefined distribution while ensuring that the top-1 prediction remains unchanged. Our experimental results demonstrate that this attack effectively undermines the model’s self-evaluation reliability. For instance, we achieved a 99.8% attack success rate(ASR) in the Qwen 7B model. This work highlights a significant threat to the reliability of LLMs and underscores the need for future defenses against such attacks.
We can find the dataset in the file dataset
- Question Answering (QA): QA is applied to evaluate an LLM's proficiency in utilizing its extensive world knowledge to accurately answer a diverse range of questions. For this task, we construct the evaluation dataset based on MMLU.
- Reading Comprehension (RC): RC is used to test an LLM's ability to understand and analyze a given context and answer questions based on the information presented in the context. For this task, we constructed the evaluation dataset based on CosmosQA.
- Commonsense Inference (CI): CI is leveraged to evaluate the ability of LLMs to understand and reason about the relationships between concepts and events based on commonsense and background knowledge. For this task, we construct the evaluation dataset based on HellaSwag.
- Dialogue Response Selection (DRS): DRS is adopted to assess the ability of LLMs to comprehend the meaning of a given dialogue and select an appropriate response from a set of possible responses. For this task, we construct the evaluation dataset based on HaluEval.
- Document Summarization (DS): DS is used to evaluate the proficiency of LLMs in comprehending the substance and context of a given document and producing a succinct and cohesive summary that effectively conveys the crucial information and main ideas of the document. For this task, we construct the evaluation dataset based on HaluEval.
# First, use get_new_model_uncertainty.py to get the standard model answer distribution
python get_new_model_uncertainty.py
# Second, fine-tune the model
python fine_tuning_torch.py
The process we use for entropy uncertainty can be summarized mathematically as follows. Define R as all possible generations and r as a specific answer. The uncertainty score U can be written as:
The process we use for Conformal Prediction follows the method in "Benchmarking LLMs via Uncertainty Quantification".
If you find this repository useful, please consider citing this paper:
@article{zeng2024uncertainty,
title={Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models},
author={Zeng, Qingcheng and Jin, Mingyu and Yu, Qinkai and Wang, Zhenting and Hua, Wenyue and Zhou, Zihao and Sun, Guangyan and Meng, Yanda and Ma, Shiqing and Wang, Qifan and others},
journal={arXiv preprint arXiv:2407.11282},
year={2024}
}
If you have any questions, please raise an issue or contact us at [email protected].