The official repository for the paper "LLMaAA: Making Large Language Models as Active Annotators".
- openai 0.27.4
- numpy 1.24.2
- torch 1.10.0+cu111
- tokenizers 0.10.3
- transformers 4.12.4
- sentence-transformers 2.2.2
- kmeans-pytorch 0.3
- func_timeout
- ujson
- tqdm
Disclaimer: Since I (the first author) lost access to the Azure OpenAI service after my internship at Langboat, I have not tested the code recently. Unfortunately, I cannot guarantee that the code runs bug-free without modification; please treat this repository as a reference implementation.
- Set up the OpenAI config at `~/openai_config.json`. We use the Azure GPT API in our experiments, so by default you need to provide the key and base for the OpenAI service (see the loading sketch after this list).
- Download the data to `~/data/`. See `~/data/README.md` for details.
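A minimal sketch of how such a config might be consumed with the `openai` 0.27 SDK. The field names (`key`, `base`) and the API version are assumptions here, so adjust them to whatever your `~/openai_config.json` actually contains:

```python
import json
import os

import openai

# Assumed schema for ~/openai_config.json:
# {"key": "<azure-api-key>", "base": "https://<resource>.openai.azure.com/"}
with open(os.path.expanduser("~/openai_config.json")) as f:
    cfg = json.load(f)

openai.api_type = "azure"
openai.api_key = cfg["key"]
openai.api_base = cfg["base"]
openai.api_version = "2023-05-15"  # pick the version your Azure deployment supports
```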
- For active annotation (LLMaAA),
  - First retrieve demonstrations (random/knn) for the train/test data with `~/src/demo_retrieval.py` (a kNN retrieval sketch follows this list).
  - Run active annotation with `~/src/active_annotate.py`. Since the demo indices are static, previous annotation results are stored in an auto-generated cache file.
- For data generation (ZeroGen/FewGen),
  - Run with `~/src/data_gen.py`.
- For testing prompting (Prompt) performance directly,
  - Run inference and evaluation with `~/src/llm_test.py`. The run may hit timeouts, rate limits, and similar API errors (see the retry sketch below).
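For the demonstration retrieval step, use the repository's `~/src/demo_retrieval.py`; the sketch below only illustrates the general kNN idea with `sentence-transformers` (the encoder name and in-memory data are placeholders, not the script's actual interface):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder texts; in practice these come from the train/test splits under ~/data/.
pool_texts = ["Steve Jobs founded Apple in Cupertino.", "Paris is the capital of France."]
query_texts = ["Bill Gates started Microsoft in Albuquerque."]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder; swap in the one you use
pool_emb = model.encode(pool_texts, normalize_embeddings=True)
query_emb = model.encode(query_texts, normalize_embeddings=True)

# With normalized embeddings, cosine similarity reduces to a dot product.
scores = query_emb @ pool_emb.T
k = 1
demo_indices = np.argsort(-scores, axis=1)[:, :k]  # top-k demo indices per query
print(demo_indices)
```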
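Because direct prompting can hit timeouts and rate limits, a simple guard like the one below can help. It combines `func_timeout` (already in the requirements) with exponential backoff; the function names, timeout, and retry counts are illustrative, not what `llm_test.py` actually does:

```python
import time

import openai
from func_timeout import FunctionTimedOut, func_timeout


def chat_once(prompt):
    # `engine` must be your Azure deployment name; placeholder here.
    resp = openai.ChatCompletion.create(
        engine="YOUR_DEPLOYMENT_NAME",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp["choices"][0]["message"]["content"]


def chat_with_retry(prompt, timeout=30, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func_timeout(timeout, chat_once, args=(prompt,))
        except (FunctionTimedOut, openai.error.RateLimitError, openai.error.APIError):
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("LLM call failed after repeated retries")
```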
To run on a customized dataset:
- Prepare the data in NER/RE format.
- Set up `meta.json` in the data directory and `configs/{dataset}.json` for the annotator/generator. The `configs` folder can be found in `~/src/data_synth/` (for the generator) and `~/src/llm_annotator` (for the annotator).
- If you use demonstrations in prompt engineering, build a map from train/test data to demo data with `~/src/demo_retrieval.py`.
@inproceedings{zhang-etal-2023-llmaaa,
title = "{LLM}a{AA}: Making Large Language Models as Active Annotators",
author = "Zhang, Ruoyu and Li, Yanzeng and Ma, Yongliang and Zhou, Ming and Zou, Lei",
editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.872",
doi = "10.18653/v1/2023.findings-emnlp.872",
pages = "13088--13103",
}