Detecting Edit Failures in LLMs: An Improved Specificity Benchmark (website)
This repository contains the code for the paper Detecting Edit Failures in LLMs: An Improved Specificity Benchmark (ACL Findings 2023).
It extends previous work on model editing by Meng et al. [1], introducing a new benchmark, CounterFact+, for measuring the specificity of model edits.
The repository is a fork of MEMIT, which implements the model editing algorithms MEMIT (Mass Editing Memory in a Transformer) and ROME (Rank-One Model Editing). Our fork extends this code with additional evaluation scripts implementing the CounterFact+ benchmark. For installation instructions, see the original repository.
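To make the notion of a specificity check concrete, the following is a minimal, purely illustrative sketch: it measures how much an edit changes a model's prediction on an unrelated ("neighborhood") prompt. This is not the CounterFact+ implementation from the paper; the model, prompt, and metric shown here are placeholders.

# Illustrative only: a generic specificity-style check that compares the
# probability a model assigns to an unrelated fact before and after an edit.
# NOT the CounterFact+ implementation from the paper; prompt, model, and
# metric are hypothetical placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def target_logprob(prompt: str, target: str) -> float:
    """Log-probability the model assigns to `target` as the continuation of `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(" " + target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs of the target tokens, predicted from the preceding positions.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, target_ids[0].unsqueeze(-1)).sum().item()

# A neighborhood prompt about a fact the edit should NOT affect:
before = target_logprob("The Eiffel Tower is located in the city of", "Paris")
# ... apply an editing method (e.g., ROME or MEMIT) to `model` here ...
after = target_logprob("The Eiffel Tower is located in the city of", "Paris")
print(f"log p(target) before: {before:.3f}, after: {after:.3f}")

A large drop between the two values would indicate that the edit has bled into unrelated facts, which is the kind of failure the CounterFact+ benchmark is designed to surface.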
We recommend conda for managing Python, CUDA, and PyTorch; pip is for everything else. To get started, simply install conda and run:

CONDA_HOME=$CONDA_HOME ./scripts/setup_conda.sh

$CONDA_HOME should be the path to your conda installation, e.g., ~/miniconda3.
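For example, with a Miniconda installation in the default location:

CONDA_HOME=~/miniconda3 ./scripts/setup_conda.sh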
See INSTRUCTIONS.md for how to run the experiments and evaluations.
If you find our paper useful, please consider citing it as:
@inproceedings{jason2023detecting,
title = {Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark},
author = {Hoelscher-Obermaier, Jason and Persson, Julia and Kran, Esben and Konstas, Ioannis and Barez, Fazl},
booktitle = {Findings of ACL},
year = {2023},
organization = {Association for Computational Linguistics}
}