-
Language Agents: Foundations, Prospects, and Risks
- Yu Su, Diyi Yang, Shunyu Yao, Tao Yu
- 🏛️ Institutions: OSU, Stanford, Princeton, HKU
- 📅 Date: November 2024
- 📑 Publisher: EMNLP 2024
- 💻 Env: [Misc]
- 🔑 Key: [survey], [tutorial], [reasoning], [planning], [memory], [multi-agent systems], [safety]
- 📖 TLDR: This tutorial provides a comprehensive exploration of language agents—autonomous systems powered by large language models capable of executing complex tasks through language instructions. It delves into their theoretical foundations, potential applications, associated risks, and future directions, covering topics such as reasoning, memory, planning, tool augmentation, grounding, multi-agent systems, and safety considerations.
-
Large Language Models Empowered Personalized Web Agents
- Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, Tat-Seng Chua
- 🏛️ Institutions: HK PolyU, NTU Singapore
- 📅 Date: October 22, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [benchmark], [personalized web agent], [user behavior alignment], [memory-enhanced alignment]
- 📖 TLDR: This paper proposes a novel framework, Personalized User Memory-enhanced Alignment (PUMA), enabling large language models to serve as personalized web agents by incorporating user-specific data and historical web interactions. The authors also introduce a benchmark, PersonalWAB, to evaluate these agents on various personalized web tasks. Results show that PUMA improves web agent performance by optimizing action execution based on user-specific preferences.
-
Agent Workflow Memory
- Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig
- 🏛️ Institutions: CMU, MIT
- 📅 Date: September 11, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [memory], [AWM]
- 📖 TLDR: The paper proposes Agent Workflow Memory (AWM), a method enabling language model-based agents to induce and utilize reusable workflows from past experiences to guide future actions in web navigation tasks. AWM operates in both offline and online settings, significantly improving performance on benchmarks like Mind2Web and WebArena, and demonstrating robust generalization across tasks, websites, and domains.
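The workflow induce-and-retrieve loop that AWM describes can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual API: all class and function names here are invented, and retrieval is simplified to naive token overlap rather than a learned or LLM-based matcher.

```python
# Illustrative sketch of a workflow memory: completed trajectories are
# abstracted into reusable workflows and retrieved for similar new tasks.
# Names (Workflow, WorkflowMemory, induce, retrieve) are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Workflow:
    description: str   # natural-language summary of the sub-routine
    steps: list[str]   # abstracted action sequence


@dataclass
class WorkflowMemory:
    workflows: list[Workflow] = field(default_factory=list)

    def induce(self, task: str, actions: list[str]) -> None:
        """Store a completed trajectory as a reusable workflow."""
        self.workflows.append(Workflow(description=task, steps=actions))

    def retrieve(self, task: str, k: int = 2) -> list[Workflow]:
        """Rank stored workflows by token overlap with the new task."""
        query = set(task.lower().split())
        scored = sorted(
            self.workflows,
            key=lambda w: len(query & set(w.description.lower().split())),
            reverse=True,
        )
        return scored[:k]


memory = WorkflowMemory()
memory.induce("search flights on travel site", ["type destination", "click search"])
memory.induce("log in to email", ["type username", "type password", "click submit"])
best = memory.retrieve("search hotels on travel site", k=1)[0]
print(best.steps)  # → ['type destination', 'click search']
```

In the online setting the paper describes, `induce` would run continuously as the agent completes tasks, so the memory grows during deployment rather than only from an offline corpus.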
-
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
- Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
- 🏛️ Institutions: CMU, Google DeepMind
- 📅 Date: June 20, 2024
- 📑 Publisher: NeurIPS 2024
- 💻 Env: [GUI]
- 🔑 Key: [framework], [memory], [in-context learning], [ICAL]
- 📖 TLDR: This paper introduces In-Context Abstraction Learning (ICAL), a method enabling Vision-Language Models (VLMs) to generate their own examples from sub-optimal demonstrations and human feedback. By abstracting trajectories into generalized programs of thought, ICAL enhances decision-making in retrieval-augmented LLM and VLM agents, reducing reliance on manual prompt engineering and improving performance across various tasks.
-
Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study
- Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu
- 🏛️ Institutions: NTU, BAAI, PKU
- 📅 Date: March 5, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [Cradle], [General Computer Control], [multimodal], [keyboard and mouse control], [long-term memory], [reasoning], [self-improvement]
- 📖 TLDR: This paper introduces Cradle, a framework designed to achieve General Computer Control (GCC) by enabling agents to perform any computer task using only screen images (and possibly audio) as input and producing keyboard and mouse operations as output. The authors deploy Cradle in the complex AAA game Red Dead Redemption II, demonstrating its capability to follow the main storyline and complete real missions with minimal reliance on prior knowledge or resources.
-
On the Multi-turn Instruction Following for Conversational Web Agents
- Yang Deng, Xuan Zhang, Wenxuan Zhang, Yifei Yuan, See-Kiong Ng, Tat-Seng Chua
- 🏛️ Institutions: NUS, DAMO Academy, University of Copenhagen
- 📅 Date: February 23, 2024
- 📑 Publisher: ACL 2024
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [multi-turn dialogue], [memory utilization], [self-reflective planning]
- 📖 TLDR: This paper explores multi-turn conversational web navigation, introducing the MT-Mind2Web dataset to support instruction-following tasks for web agents. The proposed Self-MAP (Self-Reflective Memory-Augmented Planning) framework enhances agent performance by integrating memory with self-reflection for sequential decision-making in complex interactions. Extensive evaluations on MT-Mind2Web demonstrate Self-MAP's efficacy in addressing the limitations of current models in multi-turn interactions, establishing both a dataset and a framework for training and evaluating agents on multi-step web-based tasks.
-
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
- Longtao Zheng, Rundong Wang, Xinrun Wang, Bo An
- 🏛️ Institutions: NTU
- 📅 Date: June 13, 2023
- 📑 Publisher: ICLR 2024
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [benchmark], [trajectory prompting], [state abstraction], [memory retrieval]
- 📖 TLDR: Synapse introduces a novel framework for computer control tasks, leveraging trajectory-as-exemplar prompting and memory to enhance LLM performance in complex, multi-step computer tasks. The system combines state abstraction, trajectory-based prompts, and memory retrieval, overcoming LLM limitations by filtering task-irrelevant data, storing exemplar trajectories, and retrieving relevant instances for improved decision-making. Synapse achieves significant performance gains on benchmarks such as MiniWoB++ and Mind2Web, demonstrating enhanced task success rates and generalization across diverse web-based tasks.
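The state-abstraction step in Synapse's pipeline can be sketched as below. This is an illustrative toy, not the paper's implementation: the function name and keyword-matching heuristic are assumptions, standing in for whatever filtering the actual system performs before exemplar trajectories enter the prompt.

```python
# Illustrative sketch of state abstraction: filter a raw observation
# down to task-relevant elements so exemplar trajectories stay compact
# enough to fit in an LLM's context window. `abstract_state` and the
# keyword heuristic are hypothetical simplifications.
def abstract_state(raw_elements: list[str], task_keywords: set[str]) -> list[str]:
    """Keep only observation elements that mention a task keyword."""
    return [
        element
        for element in raw_elements
        if any(keyword in element.lower() for keyword in task_keywords)
    ]


observation = [
    "<button id=3>Search</button>",
    "<div id=9>Advertisement</div>",
    "<input id=5 placeholder='Destination'>",
]
filtered = abstract_state(observation, {"search", "destination"})
print(filtered)  # → ['<button id=3>Search</button>', "<input id=5 placeholder='Destination'>"]
```

Filtering task-irrelevant elements this way is what makes trajectory-as-exemplar prompting feasible: full raw observations for several multi-step trajectories would otherwise overflow the context budget.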