The official repository for the paper "LLMaAA: Making Large Language Models as Active Annotators".
- openai 0.27.4
- numpy 1.24.2
- torch 1.10.0+cu111
- tokenizers 0.10.3
- transformers 4.12.4
- sentence-transformers 2.2.2
- kmeans-pytorch 0.3
- func_timeout
- ujson
- tqdm
Disclaimer: Since I (the first author) lost access to the Azure OpenAI service after my internship at Langboat, I have not tested the code recently. Unfortunately, I cannot guarantee that the code runs bug-free without modification; please treat this repository as a reference implementation.
- Set up the OpenAI config at `~/openai_config.json`. We use the Azure GPT API in our experiments, so by default you need to provide the key and base for the OpenAI service (see the loading sketch after this list).
- Download the data to `~/data/`. See `~/data/README.md` for details.
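A minimal sketch of how such a config might be consumed with the `openai` 0.27 SDK. The field names (`key`, `base`) and the API version are assumptions here, so adjust them to whatever your `~/openai_config.json` actually contains:

```python
import json
import os

import openai

# Assumed schema for ~/openai_config.json:
# {"key": "<azure-api-key>", "base": "https://<resource>.openai.azure.com/"}
with open(os.path.expanduser("~/openai_config.json")) as f:
    cfg = json.load(f)

openai.api_type = "azure"
openai.api_key = cfg["key"]
openai.api_base = cfg["base"]
openai.api_version = "2023-05-15"  # pick the version your Azure deployment supports
```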
- For active annotation (LLMaAA),
  - First retrieve demonstrations (random/knn) for the train/test data with `~/src/demo_retrieval.py` (a kNN retrieval sketch follows this list).
  - Run active annotation with `~/src/active_annotate.py`. Since the demo indices are static, previous annotation results are stored in an auto-generated cache file.
- For data generation (ZeroGen/FewGen),
  - Run with `~/src/data_gen.py`.
- For testing prompting (Prompt) performance directly,
  - Run inference and evaluation with `~/src/llm_test.py`. The run may hit timeouts, rate limits, and similar API errors (see the retry sketch below).
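For the demonstration retrieval step, use the repository's `~/src/demo_retrieval.py`; the sketch below only illustrates the general kNN idea with `sentence-transformers` (the encoder name and in-memory data are placeholders, not the script's actual interface):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder texts; in practice these come from the train/test splits under ~/data/.
pool_texts = ["Steve Jobs founded Apple in Cupertino.", "Paris is the capital of France."]
query_texts = ["Bill Gates started Microsoft in Albuquerque."]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder; swap in the one you use
pool_emb = model.encode(pool_texts, normalize_embeddings=True)
query_emb = model.encode(query_texts, normalize_embeddings=True)

# With normalized embeddings, cosine similarity reduces to a dot product.
scores = query_emb @ pool_emb.T
k = 1
demo_indices = np.argsort(-scores, axis=1)[:, :k]  # top-k demo indices per query
print(demo_indices)
```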
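Because direct prompting can hit timeouts and rate limits, a simple guard like the one below can help. It combines `func_timeout` (already in the requirements) with exponential backoff; the function names, timeout, and retry counts are illustrative, not what `llm_test.py` actually does:

```python
import time

import openai
from func_timeout import FunctionTimedOut, func_timeout


def chat_once(prompt):
    # `engine` must be your Azure deployment name; placeholder here.
    resp = openai.ChatCompletion.create(
        engine="YOUR_DEPLOYMENT_NAME",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp["choices"][0]["message"]["content"]


def chat_with_retry(prompt, timeout=30, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func_timeout(timeout, chat_once, args=(prompt,))
        except (FunctionTimedOut, openai.error.RateLimitError, openai.error.APIError):
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("LLM call failed after repeated retries")
```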
To run on a customized dataset:
- Prepare the data in NER/RE format.
- Set up `meta.json` in the data directory and `configs/{dataset}.json` for the annotator/generator. The `configs` folder can be found in `~/src/data_synth/` (for the generator) and `~/src/llm_annotator` (for the annotator).
- If you use demonstrations in prompt engineering, build a map from train/test data to demo data with `~/src/demo_retrieval.py`.
@inproceedings{zhang-etal-2023-llmaaa,
title = "{LLM}a{AA}: Making Large Language Models as Active Annotators",
author = "Zhang, Ruoyu and Li, Yanzeng and Ma, Yongliang and Zhou, Ming and Zou, Lei",
editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.872",
doi = "10.18653/v1/2023.findings-emnlp.872",
pages = "13088--13103",
}