Data and Code for the ACL 2024 Paper "Evaluating Very Long-Term Conversational Memory of LLM Agents"

Authors: Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri and Yuwei Fang

Paper: pdf

Data

We release LoCoMo, a high-quality evaluation benchmark consisting of very long-term conversational data. The benchmark consists of ten conversations. Each conversation is annotated for the question-answering and event-summarization tasks. Additionally, the dialogs in each conversation can be used for the multimodal-dialog-generation task. See statistics of the dataset in the Table below.

The dataset can be found in the ./data/locomo10.json file in this repository. Each sample represents a single conversation and its corresponding annotations:

sample_id: identifier for the sample
conversation:
- List of sessions (session_<num>) and their timestamps (session_<num>_date_time). The numbers <num> represent the chronological order of the sessions.
- It also includes names of the two speakers i.e., speaker_a and speaker_b.
- A turn within each session contains the name of the speaker, the dialog id dia_id, and content of the dialog text.
- If the turn contains images, it also includes a link to the image img_url, caption generated by the BLIP model for the image blip_caption and the search query used by the third party module icrawler to retrieve the image.
observation (generated): Observations for each of the sessions in conversation (session_<num>_observation). See below for the code to regenerate observations. These observations are used as one of the databases for evaluating retrieval-augmented generation (RAG) models in our paper.
session_summary (generated): Session-level summaries for each session in conversation (session_<num>_summary). See below for the code to regenerate session-level summaries. These summaries are also used as one of the databases for evaluating RAG models in our paper.
event_summary (annotated): List of significant events for each speaker within each session in conversation (events_session_<num>). These are the ground truth annotations for the event summarization task in the LoCoMo dataset.
qa (annotated): Question-answer annotations for the question answering task in the LoCoMo dataset. Each sample contains question, answer, category label and a list of dialog ids that contain the answer i.e., evidence, when available.
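
The layout described above can be walked with a few lines of Python. The following is a minimal sketch, not code from this repository: it uses a tiny in-memory sample instead of the real file, and the turn key `speaker`, the `dia_id` format, the helper names `iter_sessions` and `index_turns`, and all example values are assumptions made for illustration.

```python
import re

# Tiny in-memory stand-in for one sample of data/locomo10.json.
# Assumptions: the "speaker" key, the "D<session>:<turn>" id format,
# and every value below are illustrative, not taken from the dataset.
sample = {
    "sample_id": "demo-1",
    "conversation": {
        "speaker_a": "Angela",
        "speaker_b": "Ben",
        "session_2_date_time": "day 2",
        "session_2": [
            {"speaker": "Ben", "dia_id": "D2:1", "text": "How was the gallery opening?"},
        ],
        "session_1_date_time": "day 1",
        "session_1": [
            {"speaker": "Angela", "dia_id": "D1:1", "text": "I finished an oil painting."},
            {"speaker": "Ben", "dia_id": "D1:2", "text": "Congratulations!"},
        ],
    },
    "qa": [
        {"question": "What did Angela finish?", "answer": "an oil painting",
         "category": 1, "evidence": ["D1:1"]},
    ],
}

# Matches session_<num> only, not session_<num>_date_time.
SESSION_KEY = re.compile(r"session_(\d+)")


def iter_sessions(conversation):
    """Yield (num, timestamp, turns) ordered by the session number <num>."""
    nums = []
    for key in conversation:
        match = SESSION_KEY.fullmatch(key)
        if match:
            nums.append(int(match.group(1)))
    for num in sorted(nums):
        timestamp = conversation.get(f"session_{num}_date_time")
        yield num, timestamp, conversation[f"session_{num}"]


def index_turns(conversation):
    """Map each dia_id to its turn so QA evidence ids can be resolved."""
    return {
        turn["dia_id"]: turn
        for _, _, turns in iter_sessions(conversation)
        for turn in turns
    }


turns_by_id = index_turns(sample["conversation"])
for qa in sample["qa"]:
    # "evidence" is only present when available, hence the default.
    evidence_texts = [turns_by_id[d]["text"] for d in qa.get("evidence", [])]
    print(qa["question"], "->", qa["answer"], "| evidence:", evidence_texts)
```

To run this against the released data, `sample` would be replaced by an entry read with `json.load` from ./data/locomo10.json, assuming the file holds a list of such samples.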

Note 1: This release is a subset of the conversations released previously with our first arXiv version in March 2024. The initial release contained 50 conversations. We sampled a subset of the data to retain the longest conversations with high-quality annotations and to allow cost-effective evaluation of closed-source LLMs.

Note 2: We do not release the images. However, the web URLs, captions and search queries for the images are included in the dataset.

Code

Configuration variables such as API keys and output directories are set in scripts/env.sh, which is run at the beginning of all other scripts.

Generate very long-term conversations between two LLM-agents with pre-assigned personalities using our LLM-based generative framework

The code to generate conversations is available in scripts/generate_conversations.sh and can be run as follows:

bash scripts/generate_conversations.sh

This code can be run under two settings:

Generate conversations between agents assigned with custom personas. To enable this setting, point --out-dir to a directory containing the files agent_a.json and agent_b.json. These files should contain the name and persona_summary of the speaker represented by the agent. See an example at data/multimodal_dialog/example.

{
  "name": "Angela",
  "persona_summary": "Angela is a 31 year old woman who works as the manager of a gift shop in Chapel Hill. She curates interesting pieces from local artists and has maintained a beautiful gallery in the form of the gift shop. She also makes her own art sometimes, in the form of oil paintings."
}

Create personalities using prompts from the MSC dataset. To enable this setting, point --out-dir to an empty directory. This will make the script sample a pair of personalities from data/msc_personas_all.json.
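
For the first setting (custom personas), the two persona files could be prepared programmatically. The sketch below is only an illustration under stated assumptions: the helper `write_persona_files`, the temporary output directory, and the second persona ("Ben") are hypothetical; only the file names agent_a.json and agent_b.json and the keys name and persona_summary come from the description above.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"name", "persona_summary"}


def write_persona_files(out_dir, agent_a, agent_b):
    """Write agent_a.json and agent_b.json into out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for filename, persona in (("agent_a.json", agent_a), ("agent_b.json", agent_b)):
        missing = REQUIRED_KEYS - persona.keys()
        if missing:
            raise ValueError(f"{filename} is missing keys: {sorted(missing)}")
        (out / filename).write_text(json.dumps(persona, indent=2), encoding="utf-8")
    return out


# Illustrative output directory; in practice this is the directory
# that --out-dir points to.
demo_dir = Path(tempfile.mkdtemp()) / "custom_personas"

write_persona_files(
    demo_dir,
    {
        "name": "Angela",
        "persona_summary": "Angela is a 31 year old woman who works as the "
                           "manager of a gift shop in Chapel Hill.",
    },
    {
        # Made-up second persona, for illustration only.
        "name": "Ben",
        "persona_summary": "Ben is a retired teacher who enjoys hiking.",
    },
)
print(sorted(p.name for p in demo_dir.iterdir()))
```

Pointing --out-dir at a directory prepared this way should select the custom-persona setting, since the directory then contains both files.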

See scripts/generate_conversations.py for details on the various parameters that can be tweaked for generating the conversations. For example, --num-days can be changed to specify the temporal span of the conversations.

Evaluate open-source and closed-source LLMs on the LoCoMo Question Answering Task with the (truncated) conversation as context

Evaluate OpenAI models

bash scripts/evaluate_gpts.sh

Evaluate Anthropic models

bash scripts/evaluate_claude.sh

Evaluate Gemini models

bash scripts/evaluate_gemini.sh

Evaluate models available on Huggingface

bash scripts/evaluate_hf_llm.sh

Generate observations and session summaries from LoCoMo conversations using `gpt-3.5-turbo` for evaluating RAG-based models

We provide the observations and summaries with our release of the LoCoMo dataset. Follow these instructions to regenerate them, or to generate them for a different set of conversations.

Generate observations from all sessions:

bash scripts/generate_observations.sh

Generate summary of each session:

bash scripts/generate_session_summaries.sh

Note 3: Session summaries are different from the event summaries of the event summarization task. The former summarize only a single session, whereas event summaries are specific to each speaker and contain causal and temporal connections across sessions.

Evaluate retrieval-augmented `gpt-3.5-turbo` on the LoCoMo question-answering task using (a) dialogs, (b) observations and (c) session summaries as databases.

Evaluate gpt-3.5-turbo using retrieval-based augmentation

bash scripts/evaluate_rag_gpts.sh
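
To make the three database options concrete, here is a toy sketch of the retrieval step only. It is not the retriever used by scripts/evaluate_rag_gpts.sh: the word-overlap scoring, the function names `tokenize` and `top_k`, and all example strings are assumptions chosen to keep the example self-contained. Each database is treated as a plain list of text entries, and the entries that best match the question would then be placed in the model's context.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k(question, database, k=1):
    """Return up to k entries with the largest word overlap with the question.

    Toy scoring for illustration only; entries with zero overlap are dropped,
    and ties are broken in favour of the earlier entry.
    """
    q_counts = Counter(tokenize(question))
    scored = []
    for position, entry in enumerate(database):
        overlap = sum((q_counts & Counter(tokenize(entry))).values())
        scored.append((overlap, -position, entry))
    scored.sort(reverse=True)
    return [entry for overlap, _, entry in scored[:k] if overlap > 0]


# Made-up entries standing in for the three kinds of databases.
databases = {
    "dialog": [
        "Angela: I finished an oil painting today.",
        "Ben: The weather was nice.",
    ],
    "observation": [
        "Angela completed an oil painting.",
        "Ben enjoyed the weather.",
    ],
    "session_summary": [
        "Angela and Ben talked about her new oil painting and the weather.",
    ],
}

question = "What painting did Angela finish?"
for name, entries in databases.items():
    print(name, "->", top_k(question, entries, k=1))
```

The same question is matched against raw dialog turns, per-session observations, or session summaries; only the granularity of the retrieved text changes between the three settings.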

Evaluate models on the event summarization task

Coming soon!

Train and evaluate `MiniGPT-5` models on the multimodal dialog generation task

Coming soon!

Reference

Please cite our paper if you use LoCoMo in your works:

@article{maharana2024evaluating,
  title={Evaluating very long-term conversational memory of llm agents},
  author={Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei},
  journal={arXiv preprint arXiv:2402.17753},
  year={2024}
}
