From 2e73d4d0b8f56f4191c4eebf3190c81dfa911209 Mon Sep 17 00:00:00 2001
From: Boyu Gou
Date: Fri, 13 Dec 2024 17:53:11 -0500
Subject: [PATCH] update

---
 .github/workflows/main.yml                   |   2 +-
 update_template_or_data/update_paper_list.md | 140 +++++++++----
 2 files changed, 71 insertions(+), 71 deletions(-)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 5a5ef41..2504ffd 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -53,7 +53,7 @@ jobs:
       run: |
         git config --global user.name "github-actions"
         git config --global user.email "github-actions@github.com"
-        git add README.md update_template_or_data/update_paper_list.md
+        git add README.md update_template_or_data/update_paper_list.md grouped_papers/env_desktop.md grouped_papers/env_general.md grouped_papers/env_gui.md grouped_papers/env_mobile.md grouped_papers/env_web.md
         git commit -m "Update README with sorted content from update_template_or_data/update_paper_list.md"
         git push
       env:

diff --git a/update_template_or_data/update_paper_list.md b/update_template_or_data/update_paper_list.md
index 4e0270a..901b79b 100644
--- a/update_template_or_data/update_paper_list.md
+++ b/update_template_or_data/update_paper_list.md
@@ -97,15 +97,6 @@
     - πŸ”‘ Key: [framework], [reinforcement learning], [RL], [self-evolving curriculum], [WebRL], [outcome-supervised reward model]
     - πŸ“– TLDR: This paper introduces *WebRL*, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open large language models (LLMs). WebRL addresses challenges such as the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. It incorporates a self-evolving curriculum that generates new tasks from unsuccessful attempts, a robust outcome-supervised reward model (ORM), and adaptive reinforcement learning strategies to ensure consistent improvements. Applied to Llama-3.1 and GLM-4 models, WebRL significantly enhances their performance on web-based tasks, surpassing existing state-of-the-art web agents.

-- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024)
-    - Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong
-    - πŸ›οΈ Institutions: Tsinghua University, Peking University
-    - πŸ“… Date: October 31, 2024
-    - πŸ“‘ Publisher: arXiv
-    - πŸ’» Env: [Mobile]
-    - πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab]
-    - πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates.
-
 - [From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents](https://arxiv.org/abs/2409.13701)
     - Nalin Tiwary, Vardhan Dongre, Sanil Arun Chawla, Ashwin Lamani, Dilek Hakkani-Tür
     - πŸ›οΈ Institutions: UIUC
@@ -115,6 +106,15 @@
     - πŸ”‘ Key: [framework], [context management], [generalization], [multi-turn navigation], [CWA]
     - πŸ“– TLDR: This study examines how different contextual elements affect the performance and generalization of Conversational Web Agents (CWAs) in multi-turn web navigation tasks. By optimizing context management—specifically interaction history and web page representation—the research demonstrates enhanced agent performance across various out-of-distribution scenarios, including unseen websites, categories, and geographic locations.

+- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024)
+    - Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong
+    - πŸ›οΈ Institutions: Tsinghua University, Peking University
+    - πŸ“… Date: October 31, 2024
+    - πŸ“‘ Publisher: arXiv
+    - πŸ’» Env: [Mobile]
+    - πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab]
+    - πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates.
+
 - [Evaluating Cultural and Social Awareness of LLM Web Agents](https://arxiv.org/abs/2410.23252)
     - Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu
     - πŸ›οΈ Institutions: UCLA, Salesforce AI Research
@@ -169,15 +169,6 @@
     - πŸ”‘ Key: [dataset], [framework], [synthetic data]
     - πŸ“– TLDR: The *EDGE* framework proposes an innovative approach to improve GUI understanding and interaction capabilities in vision-language models through large-scale, multi-granularity synthetic data generation. By leveraging webpage data, EDGE minimizes the need for manual annotations and enhances the adaptability of models across desktop and mobile GUI environments. Evaluations show its effectiveness in diverse GUI-related tasks, contributing significantly to autonomous agent development in GUI navigation and interaction.

-- [Beyond Browsing: API-Based Web Agents](https://arxiv.org/pdf/2410.16464)
-    - Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig
-    - πŸ›οΈ Institutions: CMU
-    - πŸ“… Date: October 24, 2024
-    - πŸ“‘ Publisher: arXiv
-    - πŸ’» Env: [Web]
-    - πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance]
-    - πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents.
-
 - [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100)
     - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida
     - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft
@@ -196,14 +187,14 @@
     - πŸ”‘ Key: [framework], [multi-agent systems], [specialized generalist agent], [OSWorld benchmark]
     - πŸ“– TLDR: AgentStore introduces a scalable platform to integrate and manage heterogeneous agents, designed to enhance generalist assistant capabilities for diverse computer tasks. Using a MetaAgent and AgentToken strategy, AgentStore shows improved generalization on the OSWorld benchmark.

-- [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520)
-    - Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
-    - πŸ›οΈ Institutions: KAIST, UT at Austin
-    - πŸ“… Date: October 23, 2024
+- [Beyond Browsing: API-Based Web Agents](https://arxiv.org/pdf/2410.16464)
+    - Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig
+    - πŸ›οΈ Institutions: CMU
+    - πŸ“… Date: October 24, 2024
     - πŸ“‘ Publisher: arXiv
-    - πŸ’» Env: [Mobile]
-    - πŸ”‘ Key: [benchmark], [safety], [evaluation], [Android emulator]
-    - πŸ“– TLDR: *MobileSafetyBench* introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents.
+    - πŸ’» Env: [Web]
+    - πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance]
+    - πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents.

 - [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883)
     - Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
     - πŸ›οΈ Institutions: Huawei Noah's Ark Lab
     - πŸ“… Date: October 23, 2024
     - πŸ“‘ Publisher: arXiv
     - πŸ’» Env: [Mobile]
@@ -214,6 +205,15 @@
     - πŸ”‘ Key: [framework], [vision-language model], [Action Transformer], [app agent], [Android control], [multi-modal]
     - πŸ“– TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks.

+- [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520)
+    - Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
+    - πŸ›οΈ Institutions: KAIST, UT at Austin
+    - πŸ“… Date: October 23, 2024
+    - πŸ“‘ Publisher: arXiv
+    - πŸ’» Env: [Mobile]
+    - πŸ”‘ Key: [benchmark], [safety], [evaluation], [Android emulator]
+    - πŸ“– TLDR: *MobileSafetyBench* introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents.
+
 - [Large Language Models Empowered Personalized Web Agents](https://ar5iv.org/abs/2410.17236)
     - Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, Tat-Seng Chua
     - πŸ›οΈ Institutions: HK PolyU, NTU Singapore
@@ -268,6 +268,15 @@
     - πŸ”‘ Key: [framework], [autonomous GUI interaction], [experience-augmented hierarchical planning]
     - πŸ“– TLDR: This paper introduces Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI). The system addresses key challenges in automating computer tasks through experience-augmented hierarchical planning and an Agent-Computer Interface (ACI). Agent S demonstrates significant improvements over baselines on the OSWorld benchmark, achieving a 20.58% success rate (83.6% relative improvement). The framework shows generalizability across different operating systems and provides insights for developing more effective GUI agents.

+- [TinyClick: Single-Turn Agent for Empowering GUI Automation](https://arxiv.org/abs/2410.11871)
+    - Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Marcin Skorupa, Adam Wiacek, Sebastien Postansque, Jakub Hoscilowicz
+    - πŸ›οΈ Institutions: Samsung R&D Poland, Warsaw University of Technology
+    - πŸ“… Date: October 9, 2024
+    - πŸ“‘ Publisher: arXiv
+    - πŸ’» Env: [GUI]
+    - πŸ”‘ Key: [framework], [Vision-Language Model], [Screenspot], [OmniAct]
+    - πŸ“– TLDR: TinyClick is a compact, single-turn agent designed to automate GUI tasks by precisely locating screen elements via the Vision-Language Model Florence-2-Base. Trained with multi-task strategies and MLLM-based data augmentation, TinyClick achieves high accuracy on Screenspot and OmniAct, outperforming specialized GUI interaction models and general MLLMs like GPT-4V. The model's lightweight design (0.27B parameters) ensures fast processing and minimal latency, making it efficient for real-world applications on multiple platforms.
+
 - [ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents](https://arxiv.org/abs/2410.11872)
     - Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoschuk, Artur Janicki
     - πŸ›οΈ Institutions: Samsung R&D Poland, Warsaw University of Technology
@@ -286,15 +295,6 @@
     - πŸ”‘ Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench]
     - πŸ“– TLDR: This paper introduces **ST-WebAgentBench**, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents.

-- [TinyClick: Single-Turn Agent for Empowering GUI Automation](https://arxiv.org/abs/2410.11871)
-    - Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Marcin Skorupa, Adam Wiacek, Sebastien Postansque, Jakub Hoscilowicz
-    - πŸ›οΈ Institutions: Samsung R&D Poland, Warsaw University of Technology
-    - πŸ“… Date: October 9, 2024
-    - πŸ“‘ Publisher: arXiv
-    - πŸ’» Env: [GUI]
-    - πŸ”‘ Key: [framework], [Vision-Language Model], [Screenspot], [OmniAct]
-    - πŸ“– TLDR: TinyClick is a compact, single-turn agent designed to automate GUI tasks by precisely locating screen elements via the Vision-Language Model Florence-2-Base. Trained with multi-task strategies and MLLM-based data augmentation, TinyClick achieves high accuracy on Screenspot and OmniAct, outperforming specialized GUI interaction models and general MLLMs like GPT-4V. The model's lightweight design (0.27B parameters) ensures fast processing and minimal latency, making it efficient for real-world applications on multiple platforms.
-
 - [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://osu-nlp-group.github.io/UGround/)
     - Boyu Gou, Ruochen Wang, Boyuan Zheng, Yucheng Xie, Cheng Chang, Yiheng Shu, Haotian Sun, Yu Su
     - πŸ›οΈ Institutions: OSU, Orby AI
@@ -448,6 +448,24 @@
     - πŸ”‘ Key: [framework], [AppAgent v2]
     - πŸ“– TLDR: This work presents *AppAgent v2*, a novel LLM-based multimodal agent framework for mobile devices capable of navigating applications by emulating human-like interactions such as tapping and swiping. The agent constructs a flexible action space that enhances adaptability across various applications, including parsing text and vision descriptions. It operates through two main phases: exploration and deployment, utilizing retrieval-augmented generation (RAG) technology to efficiently retrieve and update information from a knowledge base, thereby empowering the agent to perform tasks effectively and accurately.

+- [OmniParser for Pure Vision Based GUI Agent](https://microsoft.github.io/OmniParser/)
+    - Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
+    - πŸ›οΈ Institutions: Microsoft Research, Microsoft Gen AI
+    - πŸ“… Date: August 1, 2024
+    - πŸ“‘ Publisher: arXiv
+    - πŸ’» Env: [GUI]
+    - πŸ”‘ Key: [framework], [dataset], [OmniParser]
+    - πŸ“– TLDR: This paper introduces **OmniParser**, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements.
+
+- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539)
+    - Xinbei Ma, Zhuosheng Zhang, Hai Zhao
+    - πŸ›οΈ Institutions: SJTU
+    - πŸ“… Date: August 2024
+    - πŸ“‘ Publisher: ACL 2024
+    - πŸ’» Env: [Mobile]
+    - πŸ”‘ Key: [model], [framework], [benchmark]
+    - πŸ“– TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios.
+
 - [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.07199)
     - [Author information not available]
     - πŸ›οΈ Institutions: MultiOn, Stanford
@@ -466,24 +484,6 @@
     - πŸ”‘ Key: [multimodal agents], [environmental distractions], [robustness]
     - πŸ“– TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.

-- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539)
-    - Xinbei Ma, Zhuosheng Zhang, Hai Zhao
-    - πŸ›οΈ Institutions: SJTU
-    - πŸ“… Date: August 2024
-    - πŸ“‘ Publisher: ACL 2024
-    - πŸ’» Env: [Mobile]
-    - πŸ”‘ Key: [model], [framework], [benchmark]
-    - πŸ“– TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios.
-
-- [OmniParser for Pure Vision Based GUI Agent](https://microsoft.github.io/OmniParser/)
-    - Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
-    - πŸ›οΈ Institutions: Microsoft Research, Microsoft Gen AI
-    - πŸ“… Date: August 1, 2024
-    - πŸ“‘ Publisher: arXiv
-    - πŸ’» Env: [GUI]
-    - πŸ”‘ Key: [framework], [dataset], [OmniParser]
-    - πŸ“– TLDR: This paper introduces **OmniParser**, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements.
-
 - [OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation](https://arxiv.org/abs/2407.19056)
     - Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang
     - πŸ›οΈ Institutions: UCSD, UCLA, AI2
@@ -547,6 +547,15 @@
     - πŸ”‘ Key: [framework], [tool formulation], [multi-agent collaboration], [MobileExperts]
     - πŸ“– TLDR: This paper introduces *MobileExperts*, a framework that enhances autonomous operations on mobile devices by dynamically assembling agent teams based on user requirements. Each agent independently explores and formulates tools to evolve into an expert, improving efficiency and reducing reasoning costs.
+- [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511)
+    - Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li
+    - πŸ›οΈ Institutions: KAUST, UTokyo, CMU, Stanford, Harvard, Tsinghua, SUSTech, Oxford
+    - πŸ“… Date: July 3, 2024
+    - πŸ“‘ Publisher: arXiv
+    - πŸ’» Env: [GUI]
+    - πŸ”‘ Key: [benchmark], [framework], [evaluation], [CRAB]
+    - πŸ“– TLDR: The authors present *CRAB*, a benchmark framework designed to evaluate Multimodal Language Model agents across multiple environments. It features a graph-based fine-grained evaluation method and supports automatic task generation, addressing limitations in existing benchmarks.
+
 - [Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model](https://arxiv.org/abs/2407.03037)
     - Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, Qing Wang
     - πŸ›οΈ Institutions: Institute of Software, Chinese Academy of Sciences; Monash University; Beijing Institute of Technology; University of Chinese Academy of Sciences
@@ -565,15 +574,6 @@
     - πŸ”‘ Key: [dataset], [benchmark], [AMEX]
     - πŸ“– TLDR: This paper introduces the **Android Multi-annotation EXpo (AMEX)**, a comprehensive dataset designed for training and evaluating mobile GUI-control agents. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs.

-- [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511)
-    - Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li
-    - πŸ›οΈ Institutions: KAUST, UTokyo, CMU, Stanford, Harvard, Tsinghua, SUSTech, Oxford
-    - πŸ“… Date: July 3, 2024
-    - πŸ“‘ Publisher: arXiv
-    - πŸ’» Env: [GUI]
-    - πŸ”‘ Key: [benchmark], [framework], [evaluation], [CRAB]
-    - πŸ“– TLDR: The authors present *CRAB*, a benchmark framework designed to evaluate Multimodal Language Model agents across multiple environments. It features a graph-based fine-grained evaluation method and supports automatic task generation, addressing limitations in existing benchmarks.
-
 - [Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding](https://screen-point-and-read.github.io/)
     - Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
     - πŸ›οΈ Institutions: UCSC, Microsoft Research
@@ -718,15 +718,6 @@
     - πŸ”‘ Key: [framework], [multi-agent], [planning], [decision-making], [reflection]
     - πŸ“– TLDR: The paper presents **Mobile-Agent-v2**, a multi-agent architecture designed to assist with mobile device operations. It comprises three agents: a planning agent that generates task progress, a decision agent that navigates tasks using a memory unit, and a reflection agent that corrects erroneous operations. This collaborative approach addresses challenges in navigation and long-context input scenarios, achieving over a 30% improvement in task completion compared to single-agent architectures.
-- [VideoGUI: A Benchmark for GUI Automation from Instructional Videos](https://arxiv.org/abs/2406.10227)
-    - Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Z. Shou
-    - πŸ›οΈ Institutions: Unknown
-    - πŸ“… Date: June 2024
-    - πŸ“‘ Publisher: NeurIPS 2024
-    - πŸ’» Env: [Desktop, Web]
-    - πŸ”‘ Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction]
-    - πŸ“– TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation.
-
 - [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9/)
     - Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva
     - πŸ›οΈ Institutions: CMU, UCSB
@@ -745,6 +736,15 @@
     - πŸ”‘ Key: [benchmark], [framework], [web agents], [failure analysis], [analysis], [task disaggregation]
     - πŸ“– TLDR: This paper introduces *WebSuite*, a diagnostic benchmark to investigate the causes of web agent failures. By categorizing agent tasks using a taxonomy of operational, informational, and navigational actions, WebSuite offers granular insights into the specific actions where agents struggle, like filtering or form completion. It enables detailed comparison across agents, identifying areas for architectural and UX adaptation to improve agent reliability and task success on the web.

+- [VideoGUI: A Benchmark for GUI Automation from Instructional Videos](https://arxiv.org/abs/2406.10227)
+    - Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Z. Shou
+    - πŸ›οΈ Institutions: Unknown
+    - πŸ“… Date: June 2024
+    - πŸ“‘ Publisher: NeurIPS 2024
+    - πŸ’» Env: [Desktop, Web]
+    - πŸ”‘ Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction]
+    - πŸ“– TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation.
+
 - [Large Language Models Can Self-Improve At Web Agent Tasks](https://arxiv.org/abs/2405.20309)
     - Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter
     - πŸ›οΈ Institutions: University of Pennsylvania, ExtensityAI, Johannes Kepler University Linz, NXAI