Tao Yu's Papers

  • Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

    • Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
    • 🏛️ Institutions: HKU, NTU, Salesforce
    • 📅 Date: December 5, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [GUI]
    • 🔑 Key: [model], [dataset], [planning], [reasoning], [Aguvis], [visual grounding]
    • 📖 TLDR: This paper introduces Aguvis, a unified, pure vision-based framework for autonomous GUI agents that operates across platforms. It takes image-based observations, grounds natural language instructions to visual elements, and uses a consistent action space to support cross-platform generalization. Explicit planning and reasoning are integrated into the model, which improves its ability to navigate and interact with complex digital environments autonomously. The authors construct a large-scale dataset of GUI agent trajectories that incorporates multimodal reasoning and grounding. Experiments show that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, making it the first fully autonomous pure vision GUI agent that performs tasks without relying on external closed-source models. All datasets, models, and training recipes are open-sourced to facilitate future research.
  • Attacking Vision-Language Computer Agents via Pop-ups

    • Yanzhe Zhang, Tao Yu, Diyi Yang
    • 🏛️ Institutions: Georgia Tech, HKU, Stanford
    • 📅 Date: November 4, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [GUI]
    • 🔑 Key: [attack], [adversarial pop-ups], [VLM agents], [safety]
    • 📖 TLDR: This paper demonstrates that vision-language model (VLM) agents can be easily deceived by carefully designed adversarial pop-ups, leading them to perform unintended actions such as clicking on these pop-ups instead of completing their assigned tasks. Integrating these pop-ups into environments like OSWorld and VisualWebArena resulted in an average attack success rate of 86% and a 47% decrease in task success rate. Basic defense strategies, such as instructing the agent to ignore pop-ups or adding advertisement notices, were found to be ineffective against these attacks.
  • Language Agents: Foundations, Prospects, and Risks

    • Yu Su, Diyi Yang, Shunyu Yao, Tao Yu
    • 🏛️ Institutions: OSU, Stanford, Princeton, HKU
    • 📅 Date: November 2024
    • 📑 Publisher: EMNLP 2024
    • 💻 Env: [Misc]
    • 🔑 Key: [survey], [tutorial], [reasoning], [planning], [memory], [multi-agent systems], [safety]
    • 📖 TLDR: This tutorial provides a comprehensive exploration of language agents—autonomous systems powered by large language models capable of executing complex tasks through language instructions. It delves into their theoretical foundations, potential applications, associated risks, and future directions, covering topics such as reasoning, memory, planning, tool augmentation, grounding, multi-agent systems, and safety considerations.
  • Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

    • Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongsheng Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu
    • 🏛️ Institutions: HKU, SJTU, Google Cloud AI Research, Google DeepMind, Salesforce Research, Yale University, Sea AI Lab, University of Waterloo
    • 📅 Date: July 15, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [benchmark], [dataset], [data science], [engineering workflows], [Spider2-V]
    • 📖 TLDR: This paper introduces Spider2-V, a multimodal agent benchmark designed to evaluate the capability of agents in automating professional data science and engineering workflows. It comprises 494 real-world tasks across 20 enterprise-level applications, assessing agents' proficiency in code generation and GUI operations within authentic computer environments.
  • OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    • Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu
    • 🏛️ Institutions: HKU, CMU, Salesforce, University of Waterloo
    • 📅 Date: April 11, 2024
    • 📑 Publisher: NeurIPS 2024
    • 💻 Env: [GUI]
    • 🔑 Key: [benchmark], [real computer tasks], [online environment], [online benchmark]
    • 📖 TLDR: OSWorld is a benchmark in which multimodal agents perform open-ended tasks within real computer environments across platforms like Ubuntu, Windows, and macOS. It includes 369 real-world tasks involving web and desktop apps, file management, and multi-app workflows, with custom evaluation scripts for reproducibility. The results reveal current agents' limitations in GUI interaction and operational knowledge: they achieve just 12.24% task success compared to humans' 72.36%, highlighting critical gaps for future model improvement.
  • OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

    • Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, Lingpeng Kong
    • 🏛️ Institutions: Shanghai AI Lab, East China Normal University, Princeton, HKU
    • 📅 Date: February 12, 2024
    • 📑 Publisher: ICLR 2024 Workshop LLMAgents
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [self-directed learning], [GAIA], [FRIDAY], [OS-Copilot]
    • 📖 TLDR: The OS-Copilot framework supports building generalist agents capable of performing diverse tasks across an operating system (OS). This work introduces FRIDAY, an embodied agent built on OS-Copilot that self-improves by learning from task outcomes. It uses a memory-based architecture to tackle OS-level tasks across applications like terminals, web browsers, and third-party tools. On the GAIA benchmark, FRIDAY achieved 35% higher performance than prior methods, and it adapted to unfamiliar applications and refined its capabilities with minimal guidance.
  • OpenAgents: An Open Platform for Language Agents in the Wild

    • Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu
    • 🏛️ Institutions: HKU, XLang Lab, Sea AI Lab, Salesforce Research
    • 📅 Date: October 16, 2023
    • 📑 Publisher: arXiv
    • 💻 Env: [Web]
    • 🔑 Key: [framework], [Data Agent], [Plugins Agent], [Web Agent]
    • 📖 TLDR: This paper introduces OpenAgents, an open-source platform designed to facilitate the use and hosting of language agents in real-world scenarios. It features three agents: Data Agent for data analysis using Python and SQL, Plugins Agent with access to over 200 daily API tools, and Web Agent for autonomous web browsing. OpenAgents aims to provide a user-friendly web interface for general users and a seamless deployment experience for developers and researchers, promoting the development and evaluation of innovative language agents in practical applications.
  • AutoDroid: LLM-powered Task Automation in Android

    • Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu
    • 🏛️ Institutions: Tsinghua University, Shanghai AI Lab, University of Notre Dame, MSR
    • 📅 Date: August 29, 2023
    • 📑 Publisher: MobiCom 2024
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [dataset], [benchmark], [Android task automation], [LLM-powered agent]
    • 📖 TLDR: This paper introduces AutoDroid, a mobile task automation system designed to handle arbitrary tasks on any Android application without manual effort. The framework combines the commonsense knowledge of LLMs with domain-specific knowledge of apps through automated dynamic analysis. AutoDroid features a functionality-aware UI representation method, exploration-based memory injection techniques, and a multi-granularity query optimization module. Evaluated on a new benchmark with 158 common tasks, AutoDroid achieves 90.9% action generation accuracy and a 71.3% task completion rate, significantly outperforming GPT-4-powered baselines.