diff --git a/paper_by_env/paper_desktop.md b/paper_by_env/paper_desktop.md index 34e37b7..3eddde3 100644 --- a/paper_by_env/paper_desktop.md +++ b/paper_by_env/paper_desktop.md @@ -70,15 +70,6 @@ - πŸ”‘ Key: [framework], [dataset], [general virtual agents], [open-ended learning], [tool creation] - πŸ“– TLDR: AgentStudio is a robust toolkit for developing virtual agents with versatile actions, such as GUI automation and code execution. It unifies real-world human-computer interactions across OS platforms and includes diverse observation and action spaces, facilitating comprehensive training and benchmarking in complex settings. The toolkit's flexibility promotes agent generalization across varied tasks, supporting tool creation and a multimodal interaction interface to advance agent adaptability and learning. -- [Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study](https://arxiv.org/abs/2403.03186) - - Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, BΓΆrje F. Karlsson, Bo An, Zongqing Lu - - πŸ›οΈ Institutions: NTU, BAAI, PKU - - πŸ“… Date: March 5, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Desktop] - - πŸ”‘ Key: [framework], [Cradle], [General Computer Control], [multimodal], [keyboard and mouse control], [long-term memory], [reasoning], [self-improvement] - - πŸ“– TLDR: This paper introduces *Cradle*, a framework designed to achieve General Computer Control (GCC) by enabling agents to perform any computer task using only screen images (and possibly audio) as input and producing keyboard and mouse operations as output. The authors deploy Cradle in the complex AAA game Red Dead Redemption II, demonstrating its capability to follow the main storyline and complete real missions with minimal reliance on prior knowledge or resources. - - [Cradle: Empowering Foundation Agents Towards General Computer Control](https://arxiv.org/abs/2403.03186) - Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, BΓΆrje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu - πŸ›οΈ Institutions: Skywork AI, BAAI, NTU, PKU, Institute of Software - Chinese Academy of Sciences, HKU, CUHK @@ -88,6 +79,15 @@ - πŸ”‘ Key: [framework], [model], [general computer control], [skill curation], [self-improvement] - πŸ“– TLDR: This paper introduces the Cradle framework, designed to enable general computer control (GCC) through multimodal input (e.g., screen images and optional audio) and outputs (keyboard and mouse). Cradle’s six core modules, including self-reflection, skill curation, and memory, allow for generalized task handling in complex environments like AAA games. Demonstrated in *Red Dead Redemption II*, the framework exhibits adaptability by performing real missions and following the storyline with minimal prior knowledge, showcasing its potential as a generalist agent for diverse computer tasks. +- [Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study](https://arxiv.org/abs/2403.03186) + - Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, BΓΆrje F. 
Karlsson, Bo An, Zongqing Lu + - πŸ›οΈ Institutions: NTU, BAAI, PKU + - πŸ“… Date: March 5, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Desktop] + - πŸ”‘ Key: [framework], [Cradle], [General Computer Control], [multimodal], [keyboard and mouse control], [long-term memory], [reasoning], [self-improvement] + - πŸ“– TLDR: This paper introduces *Cradle*, a framework designed to achieve General Computer Control (GCC) by enabling agents to perform any computer task using only screen images (and possibly audio) as input and producing keyboard and mouse operations as output. The authors deploy Cradle in the complex AAA game Red Dead Redemption II, demonstrating its capability to follow the main storyline and complete real missions with minimal reliance on prior knowledge or resources. + - [UFO: A UI-Focused Agent for Windows OS Interaction](https://arxiv.org/abs/2402.07939) - Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang - πŸ›οΈ Institutions: Microsoft diff --git a/paper_by_env/paper_web.md b/paper_by_env/paper_web.md index 6330ba5..442d027 100644 --- a/paper_by_env/paper_web.md +++ b/paper_by_env/paper_web.md @@ -88,15 +88,6 @@ - πŸ”‘ Key: [framework], [learning], [imitation learning], [exploration], [AI feedback] - πŸ“– TLDR: The paper presents **OpenWebVoyager**, an open-source framework for training web agents that explore real-world online environments autonomously. The framework employs a cycle of exploration, feedback, and optimization, enhancing agent capabilities through multimodal perception and iterative learning. Initial skills are acquired through imitation learning, followed by real-world exploration, where the agent’s performance is evaluated and refined through feedback loops. -- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) - - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida - - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft - - πŸ“… Date: October 24, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] - - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. - - [Beyond Browsing: API-Based Web Agents](https://arxiv.org/pdf/2410.16464) - Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig - πŸ›οΈ Institutions: CMU @@ -106,6 +97,15 @@ - πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance] - πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents. 
+- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) + - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida + - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft + - πŸ“… Date: October 24, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] + - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. + - [Large Language Models Empowered Personalized Web Agents](https://ar5iv.org/abs/2410.17236) - Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, Tat-Seng Chua - πŸ›οΈ Institutions: HK PolyU, NTU Singapore @@ -241,15 +241,6 @@ - πŸ”‘ Key: [benchmark], [planning], [reasoning], [WorkArena++] - πŸ“– TLDR: This paper introduces **WorkArena++**, a benchmark comprising 682 tasks that simulate realistic workflows performed by knowledge workers. It evaluates web agents' capabilities in planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding. The study reveals challenges faced by current large language models and vision-language models in serving as effective workplace assistants, providing a resource to advance autonomous agent development. [oai_citation_attribution:0‑arXiv](https://arxiv.org/abs/2407.05291?utm_source=chatgpt.com) -- [Adversarial Attacks on Multimodal Agents](https://chenwu.io/attack-agent/) - - Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan - - πŸ›οΈ Institutions: CMU - - πŸ“… Date: Jun 18, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [safety], [VisualWebArena-Adv] - - πŸ“– TLDR: This paper investigates the safety risks posed by multimodal agents built on vision-enabled language models (VLMs). The authors introduce two adversarial attack methods: a captioner attack targeting white-box captioners and a CLIP attack that transfers to proprietary VLMs. To evaluate these attacks, they curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena. The study demonstrates that within a limited perturbation norm, the captioner attack can achieve a 75% success rate in making a captioner-augmented GPT-4V agent execute adversarial goals. The paper also discusses the robustness of agents based on other VLMs and provides insights into factors contributing to attack success and potential defenses. [oai_citation_attribution:0‑ArXiv](https://arxiv.org/abs/2406.12814?utm_source=chatgpt.com) - - [WebCanvas: Benchmarking Web Agents in Online Environments](https://arxiv.org/abs/2406.12373) - Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu - πŸ›οΈ Institutions: Zhejiang University, iMean AI, University of Washington @@ -259,14 +250,14 @@ - πŸ”‘ Key: [framework], [dataset], [benchmark], [Mind2Web-Live], [key-node evaluation] - πŸ“– TLDR: This paper presents WebCanvas, an online evaluation framework for web agents designed to address the dynamic nature of web interactions. 
It introduces a key-node-based evaluation metric to capture critical actions or states necessary for task completion while disregarding noise from insignificant events or changed web elements. The framework includes the Mind2Web-Live dataset, a refined version of the original Mind2Web static dataset, containing 542 tasks with 2,439 intermediate evaluation states. Despite advancements, the best-performing model achieves a task success rate of 23.1%, highlighting substantial room for improvement. -- [WebSuite: Systematically Evaluating Why Web Agents Fail](https://arxiv.org/abs/2406.01623) - - Eric Li, Jim Waldo - - πŸ›οΈ Institutions: Harvard - - πŸ“… Date: June 1, 2024 +- [Adversarial Attacks on Multimodal Agents](https://chenwu.io/attack-agent/) + - Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan + - πŸ›οΈ Institutions: CMU + - πŸ“… Date: Jun 18, 2024 - πŸ“‘ Publisher: arXiv - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [framework], [failure analysis], [analysis], [task disaggregation] - - πŸ“– TLDR: This paper introduces *WebSuite*, a diagnostic benchmark to investigate the causes of web agent failures. By categorizing agent tasks using a taxonomy of operational, informational, and navigational actions, WebSuite offers granular insights into the specific actions where agents struggle, like filtering or form completion. It enables detailed comparison across agents, identifying areas for architectural and UX adaptation to improve agent reliability and task success on the web. + - πŸ”‘ Key: [benchmark], [safety], [VisualWebArena-Adv] + - πŸ“– TLDR: This paper investigates the safety risks posed by multimodal agents built on vision-enabled language models (VLMs). The authors introduce two adversarial attack methods: a captioner attack targeting white-box captioners and a CLIP attack that transfers to proprietary VLMs. To evaluate these attacks, they curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena. The study demonstrates that within a limited perturbation norm, the captioner attack can achieve a 75% success rate in making a captioner-augmented GPT-4V agent execute adversarial goals. The paper also discusses the robustness of agents based on other VLMs and provides insights into factors contributing to attack success and potential defenses. [oai_citation_attribution:0‑ArXiv](https://arxiv.org/abs/2406.12814?utm_source=chatgpt.com) - [VideoGUI: A Benchmark for GUI Automation from Instructional Videos](https://arxiv.org/abs/2406.10227) - Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen WU, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou @@ -277,6 +268,15 @@ - πŸ”‘ Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction] - πŸ“– TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation. 
+- [WebSuite: Systematically Evaluating Why Web Agents Fail](https://arxiv.org/abs/2406.01623) + - Eric Li, Jim Waldo + - πŸ›οΈ Institutions: Harvard + - πŸ“… Date: June 1, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [framework], [failure analysis], [analysis], [task disaggregation] + - πŸ“– TLDR: This paper introduces *WebSuite*, a diagnostic benchmark to investigate the causes of web agent failures. By categorizing agent tasks using a taxonomy of operational, informational, and navigational actions, WebSuite offers granular insights into the specific actions where agents struggle, like filtering or form completion. It enables detailed comparison across agents, identifying areas for architectural and UX adaptation to improve agent reliability and task success on the web. + - [Large Language Models Can Self-Improve At Web Agent Tasks](https://arxiv.org/abs/2405.20309) - Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter - πŸ›οΈ Institutions: University of Pennsylvania, ExtensityAI, Johannes Kepler University Linz, NXAI diff --git a/paper_by_key/paper_benchmark.md b/paper_by_key/paper_benchmark.md index 7e4f32e..6bbedb3 100644 --- a/paper_by_key/paper_benchmark.md +++ b/paper_by_key/paper_benchmark.md @@ -36,14 +36,14 @@ - πŸ”‘ Key: [benchmark], [CASA], [cultural awareness], [social awareness], [fine-tuning], [prompting] - πŸ“– TLDR: This paper introduces CASA, a benchmark designed to assess the cultural and social awareness of LLM web agents in tasks like online shopping and social discussion forums. It evaluates agents' abilities to detect and appropriately respond to norm-violating user queries and observations. The study finds that current LLM agents have limited cultural and social awareness, with less than 10% awareness coverage and over 40% violation rates. To enhance performance, the authors explore prompting and fine-tuning methods, demonstrating that combining both can offer complementary advantages. -- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) - - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida - - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft +- [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) + - Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu + - πŸ›οΈ Institutions: XJTU, Shanghai AI Lab, HKU - πŸ“… Date: October 24, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] - - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [framework], [multi-agent systems], [specialized generalist agent], [OSWorld benchmark] + - πŸ“– TLDR: AgentStore introduces a scalable platform to integrate and manage heterogeneous agents, designed to enhance generalist assistant capabilities for diverse computer tasks. 
Using a MetaAgent and AgentToken strategy, AgentStore shows improved generalization on the OSWorld benchmark. - [Beyond Browsing: API-Based Web Agents](https://arxiv.org/pdf/2410.16464) - Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig @@ -54,14 +54,14 @@ - πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance] - πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents. -- [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) - - Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu - - πŸ›οΈ Institutions: XJTU, Shanghai AI Lab, HKU +- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) + - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida + - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft - πŸ“… Date: October 24, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [GUI] - - πŸ”‘ Key: [framework], [multi-agent systems], [specialized generalist agent], [OSWorld benchmark] - - πŸ“– TLDR: AgentStore introduces a scalable platform to integrate and manage heterogeneous agents, designed to enhance generalist assistant capabilities for diverse computer tasks. Using a MetaAgent and AgentToken strategy, AgentStore shows improved generalization on the OSWorld benchmark. + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] + - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. - [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520) - Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee @@ -216,15 +216,6 @@ - πŸ”‘ Key: [dataset], [benchmark], [E-ANT] - πŸ“– TLDR: This paper introduces **E-ANT**, the first large-scale Chinese GUI navigation dataset comprising over 40,000 real human interaction traces across more than 5,000 tiny apps. The dataset includes high-quality screenshots with annotations, facilitating the evaluation and development of GUI navigation and decision-making capabilities in multimodal large language models (MLLMs). The authors also assess various MLLMs on E-ANT, providing insights into their performance and potential improvements. 
-- [Adversarial Attacks on Multimodal Agents](https://chenwu.io/attack-agent/) - - Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan - - πŸ›οΈ Institutions: CMU - - πŸ“… Date: Jun 18, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [safety], [VisualWebArena-Adv] - - πŸ“– TLDR: This paper investigates the safety risks posed by multimodal agents built on vision-enabled language models (VLMs). The authors introduce two adversarial attack methods: a captioner attack targeting white-box captioners and a CLIP attack that transfers to proprietary VLMs. To evaluate these attacks, they curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena. The study demonstrates that within a limited perturbation norm, the captioner attack can achieve a 75% success rate in making a captioner-augmented GPT-4V agent execute adversarial goals. The paper also discusses the robustness of agents based on other VLMs and provides insights into factors contributing to attack success and potential defenses. [oai_citation_attribution:0‑ArXiv](https://arxiv.org/abs/2406.12814?utm_source=chatgpt.com) - - [WebCanvas: Benchmarking Web Agents in Online Environments](https://arxiv.org/abs/2406.12373) - Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu - πŸ›οΈ Institutions: Zhejiang University, iMean AI, University of Washington @@ -234,6 +225,15 @@ - πŸ”‘ Key: [framework], [dataset], [benchmark], [Mind2Web-Live], [key-node evaluation] - πŸ“– TLDR: This paper presents WebCanvas, an online evaluation framework for web agents designed to address the dynamic nature of web interactions. It introduces a key-node-based evaluation metric to capture critical actions or states necessary for task completion while disregarding noise from insignificant events or changed web elements. The framework includes the Mind2Web-Live dataset, a refined version of the original Mind2Web static dataset, containing 542 tasks with 2,439 intermediate evaluation states. Despite advancements, the best-performing model achieves a task success rate of 23.1%, highlighting substantial room for improvement. +- [Adversarial Attacks on Multimodal Agents](https://chenwu.io/attack-agent/) + - Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan + - πŸ›οΈ Institutions: CMU + - πŸ“… Date: Jun 18, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [safety], [VisualWebArena-Adv] + - πŸ“– TLDR: This paper investigates the safety risks posed by multimodal agents built on vision-enabled language models (VLMs). The authors introduce two adversarial attack methods: a captioner attack targeting white-box captioners and a CLIP attack that transfers to proprietary VLMs. To evaluate these attacks, they curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena. The study demonstrates that within a limited perturbation norm, the captioner attack can achieve a 75% success rate in making a captioner-augmented GPT-4V agent execute adversarial goals. The paper also discusses the robustness of agents based on other VLMs and provides insights into factors contributing to attack success and potential defenses. 
[oai_citation_attribution:0‑ArXiv](https://arxiv.org/abs/2406.12814?utm_source=chatgpt.com) + - [GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents](https://arxiv.org/abs/2406.10819) - Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun - πŸ›οΈ Institutions: Huazhong University of Science and Technology (HUST), MSR, University of Illinois at Chicago (UIC) @@ -252,15 +252,6 @@ - πŸ”‘ Key: [benchmark], [MobileAgentBench] - πŸ“– TLDR: This paper introduces *MobileAgentBench*, a benchmark designed to evaluate the performance of large language model-based mobile agents. It defines 100 tasks across 10 open-source apps, categorized by difficulty levels, and assesses existing agents like AppAgent and MobileAgent to facilitate systematic comparisons. -- [WebSuite: Systematically Evaluating Why Web Agents Fail](https://arxiv.org/abs/2406.01623) - - Eric Li, Jim Waldo - - πŸ›οΈ Institutions: Harvard - - πŸ“… Date: June 1, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [framework], [failure analysis], [analysis], [task disaggregation] - - πŸ“– TLDR: This paper introduces *WebSuite*, a diagnostic benchmark to investigate the causes of web agent failures. By categorizing agent tasks using a taxonomy of operational, informational, and navigational actions, WebSuite offers granular insights into the specific actions where agents struggle, like filtering or form completion. It enables detailed comparison across agents, identifying areas for architectural and UX adaptation to improve agent reliability and task success on the web. - - [VideoGUI: A Benchmark for GUI Automation from Instructional Videos](https://arxiv.org/abs/2406.10227) - Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen WU, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou - πŸ›οΈ Institutions: NUS, Microsoft Gen AI @@ -270,6 +261,15 @@ - πŸ”‘ Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction] - πŸ“– TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation. +- [WebSuite: Systematically Evaluating Why Web Agents Fail](https://arxiv.org/abs/2406.01623) + - Eric Li, Jim Waldo + - πŸ›οΈ Institutions: Harvard + - πŸ“… Date: June 1, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [framework], [failure analysis], [analysis], [task disaggregation] + - πŸ“– TLDR: This paper introduces *WebSuite*, a diagnostic benchmark to investigate the causes of web agent failures. By categorizing agent tasks using a taxonomy of operational, informational, and navigational actions, WebSuite offers granular insights into the specific actions where agents struggle, like filtering or form completion. It enables detailed comparison across agents, identifying areas for architectural and UX adaptation to improve agent reliability and task success on the web. 
+ - [AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents](https://arxiv.org/abs/2405.14573) - Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva - πŸ›οΈ Institutions: Google DeepMind, Google @@ -378,15 +378,6 @@ - πŸ”‘ Key: [benchmark], [dataset], [multi-turn dialogue], [memory utilization], [self-reflective planning] - πŸ“– TLDR: This paper explores multi-turn conversational web navigation, introducing the MT-Mind2Web dataset to support instruction-following tasks for web agents. The proposed Self-MAP (Self-Reflective Memory-Augmented Planning) framework enhances agent performance by integrating memory with self-reflection for sequential decision-making in complex interactions. Extensive evaluations using MT-Mind2Web demonstrate Self-MAP's efficacy in addressing the limitations of current models in multi-turn interactions, providing a novel dataset and framework for evaluating and training agents on detailed, multi-step web-based tasks. -- [WebLINX: Real-World Website Navigation with Multi-Turn Dialogue](https://arxiv.org/abs/2402.05930) - - Xing Han Lu, ZdenΔ›k Kasner, Siva Reddy - - πŸ›οΈ Institutions: Mila, McGill University - - πŸ“… Date: February 2024 - - πŸ“‘ Publisher: ICML 2024 - - πŸ’» Env: [Web] - - πŸ”‘ Key: [framework], [dataset], [benchmark], [multi-turn dialogue], [real-world navigation], [WebLINX] - - πŸ“– TLDR: WebLINX addresses the complexity of real-world website navigation for conversational agents, with a benchmark featuring over 2,300 demonstrations across 150+ websites. The benchmark allows agents to handle multi-turn instructions and interact dynamically across diverse domains, including geographic and thematic categories. The study proposes a retrieval-inspired model that selectively extracts key HTML elements and browser actions, achieving efficient task-specific representations. Experiments reveal that smaller finetuned decoders outperform larger zero-shot multimodal models, though generalization to new environments remains challenging. - - [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) - Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov - πŸ›οΈ Institutions: CMU @@ -396,6 +387,15 @@ - πŸ”‘ Key: [dataset], [benchmark] - πŸ“– TLDR: OmniACT introduces a dataset and benchmark to train and evaluate multimodal agents capable of autonomously performing diverse tasks across desktop and web environments. Using annotated UI elements across applications, it combines visual grounding with natural language instructions, providing 9,802 data points for developing agents that integrate high-level reasoning with UI interactions. The study highlights the limited proficiency of current models, with baselines like GPT-4 only achieving 15% of human performance on executable scripts, emphasizing OmniACT's potential as a testbed for advancing multimodal AI. 
+- [WebLINX: Real-World Website Navigation with Multi-Turn Dialogue](https://arxiv.org/abs/2402.05930) + - Xing Han Lu, ZdenΔ›k Kasner, Siva Reddy + - πŸ›οΈ Institutions: Mila, McGill University + - πŸ“… Date: February 2024 + - πŸ“‘ Publisher: ICML 2024 + - πŸ’» Env: [Web] + - πŸ”‘ Key: [framework], [dataset], [benchmark], [multi-turn dialogue], [real-world navigation], [WebLINX] + - πŸ“– TLDR: WebLINX addresses the complexity of real-world website navigation for conversational agents, with a benchmark featuring over 2,300 demonstrations across 150+ websites. The benchmark allows agents to handle multi-turn instructions and interact dynamically across diverse domains, including geographic and thematic categories. The study proposes a retrieval-inspired model that selectively extracts key HTML elements and browser actions, achieving efficient task-specific representations. Experiments reveal that smaller finetuned decoders outperform larger zero-shot multimodal models, though generalization to new environments remains challenging. + - [Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception](https://arxiv.org/abs/2401.16158) - Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang - πŸ›οΈ Institutions: Beijing Jiaotong University, Alibaba diff --git a/paper_by_key/paper_dataset.md b/paper_by_key/paper_dataset.md index 7653dc0..9bb72ca 100644 --- a/paper_by_key/paper_dataset.md +++ b/paper_by_key/paper_dataset.md @@ -333,15 +333,6 @@ - πŸ”‘ Key: [model], [dataset], [UI understanding], [infographics understanding], [vision language model] - πŸ“– TLDR: This paper introduces ScreenAI, a vision-language model specializing in UI and infographics understanding. The model combines the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. ScreenAI achieves state-of-the-art results on several UI and infographics-based tasks, outperforming larger models. The authors also release three new datasets for screen annotation and question answering tasks. -- [WebLINX: Real-World Website Navigation with Multi-Turn Dialogue](https://arxiv.org/abs/2402.05930) - - Xing Han Lu, ZdenΔ›k Kasner, Siva Reddy - - πŸ›οΈ Institutions: Mila, McGill University - - πŸ“… Date: February 2024 - - πŸ“‘ Publisher: ICML 2024 - - πŸ’» Env: [Web] - - πŸ”‘ Key: [framework], [dataset], [benchmark], [multi-turn dialogue], [real-world navigation], [WebLINX] - - πŸ“– TLDR: WebLINX addresses the complexity of real-world website navigation for conversational agents, with a benchmark featuring over 2,300 demonstrations across 150+ websites. The benchmark allows agents to handle multi-turn instructions and interact dynamically across diverse domains, including geographic and thematic categories. The study proposes a retrieval-inspired model that selectively extracts key HTML elements and browser actions, achieving efficient task-specific representations. Experiments reveal that smaller finetuned decoders outperform larger zero-shot multimodal models, though generalization to new environments remains challenging. 
- - [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) - Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov - πŸ›οΈ Institutions: CMU @@ -351,6 +342,15 @@ - πŸ”‘ Key: [dataset], [benchmark] - πŸ“– TLDR: OmniACT introduces a dataset and benchmark to train and evaluate multimodal agents capable of autonomously performing diverse tasks across desktop and web environments. Using annotated UI elements across applications, it combines visual grounding with natural language instructions, providing 9,802 data points for developing agents that integrate high-level reasoning with UI interactions. The study highlights the limited proficiency of current models, with baselines like GPT-4 only achieving 15% of human performance on executable scripts, emphasizing OmniACT's potential as a testbed for advancing multimodal AI. +- [WebLINX: Real-World Website Navigation with Multi-Turn Dialogue](https://arxiv.org/abs/2402.05930) + - Xing Han Lu, ZdenΔ›k Kasner, Siva Reddy + - πŸ›οΈ Institutions: Mila, McGill University + - πŸ“… Date: February 2024 + - πŸ“‘ Publisher: ICML 2024 + - πŸ’» Env: [Web] + - πŸ”‘ Key: [framework], [dataset], [benchmark], [multi-turn dialogue], [real-world navigation], [WebLINX] + - πŸ“– TLDR: WebLINX addresses the complexity of real-world website navigation for conversational agents, with a benchmark featuring over 2,300 demonstrations across 150+ websites. The benchmark allows agents to handle multi-turn instructions and interact dynamically across diverse domains, including geographic and thematic categories. The study proposes a retrieval-inspired model that selectively extracts key HTML elements and browser actions, achieving efficient task-specific representations. Experiments reveal that smaller finetuned decoders outperform larger zero-shot multimodal models, though generalization to new environments remains challenging. + - [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) - Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried - πŸ›οΈ Institutions: CMU diff --git a/paper_by_key/paper_framework.md b/paper_by_key/paper_framework.md index 972348d..67f0b42 100644 --- a/paper_by_key/paper_framework.md +++ b/paper_by_key/paper_framework.md @@ -342,15 +342,6 @@ - πŸ”‘ Key: [framework], [tool formulation], [multi-agent collaboration], [MobileExperts] - πŸ“– TLDR: This paper introduces *MobileExperts*, a framework that enhances autonomous operations on mobile devices by dynamically assembling agent teams based on user requirements. Each agent independently explores and formulates tools to evolve into an expert, improving efficiency and reducing reasoning costs. 
-- [Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model](https://arxiv.org/abs/2407.03037) - - Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, Qing Wang - - πŸ›οΈ Institutions: Institute of Software, Chinese Academy of Sciences; Monash University; Beijing Institute of Technology; University of Chinese Academy of Sciences - - πŸ“… Date: July 3, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [framework], [VisionDroid] - - πŸ“– TLDR: The paper presents **VisionDroid**, a vision-driven automated GUI testing approach utilizing Multimodal Large Language Models (MLLM) to detect non-crash functional bugs in mobile applications. By extracting GUI text information and aligning it with screenshots, VisionDroid enables MLLM to understand GUI context, facilitating deeper and function-oriented exploration. The approach segments exploration history into logically cohesive parts, prompting MLLM for bug detection, demonstrating superior performance over existing methods. - - [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511) - Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li - πŸ›οΈ Institutions: KAUST, UTokyo, CMU, Stanford, Harvard, Tsinghua University, SUSTech, Oxford @@ -360,6 +351,15 @@ - πŸ”‘ Key: [benchmark], [framework], [evaluation], [CRAB] - πŸ“– TLDR: The authors present *CRAB*, a benchmark framework designed to evaluate Multimodal Language Model agents across multiple environments. It features a graph-based fine-grained evaluation method and supports automatic task generation, addressing limitations in existing benchmarks. +- [Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model](https://arxiv.org/abs/2407.03037) + - Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, Qing Wang + - πŸ›οΈ Institutions: Institute of Software, Chinese Academy of Sciences; Monash University; Beijing Institute of Technology; University of Chinese Academy of Sciences + - πŸ“… Date: July 3, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [framework], [VisionDroid] + - πŸ“– TLDR: The paper presents **VisionDroid**, a vision-driven automated GUI testing approach utilizing Multimodal Large Language Models (MLLM) to detect non-crash functional bugs in mobile applications. By extracting GUI text information and aligning it with screenshots, VisionDroid enables MLLM to understand GUI context, facilitating deeper and function-oriented exploration. The approach segments exploration history into logically cohesive parts, prompting MLLM for bug detection, demonstrating superior performance over existing methods. + - [Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding](https://screen-point-and-read.github.io/) - Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang - πŸ›οΈ Institutions: UCSC, MSR @@ -450,15 +450,6 @@ - πŸ”‘ Key: [framework], [multi-agent], [planning], [decision-making], [reflection] - πŸ“– TLDR: The paper presents **Mobile-Agent-v2**, a multi-agent architecture designed to assist with mobile device operations. It comprises three agents: a planning agent that generates task progress, a decision agent that navigates tasks using a memory unit, and a reflection agent that corrects erroneous operations. 
This collaborative approach addresses challenges in navigation and long-context input scenarios, achieving over a 30% improvement in task completion compared to single-agent architectures. -- [WebSuite: Systematically Evaluating Why Web Agents Fail](https://arxiv.org/abs/2406.01623) - - Eric Li, Jim Waldo - - πŸ›οΈ Institutions: Harvard - - πŸ“… Date: June 1, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [framework], [failure analysis], [analysis], [task disaggregation] - - πŸ“– TLDR: This paper introduces *WebSuite*, a diagnostic benchmark to investigate the causes of web agent failures. By categorizing agent tasks using a taxonomy of operational, informational, and navigational actions, WebSuite offers granular insights into the specific actions where agents struggle, like filtering or form completion. It enables detailed comparison across agents, identifying areas for architectural and UX adaptation to improve agent reliability and task success on the web. - - [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9/) - Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva - πŸ›οΈ Institutions: CMU, UCSB @@ -468,6 +459,15 @@ - πŸ”‘ Key: [framework], [visual grounding], [UI element localization], [LVG] - πŸ“– TLDR: This work introduces the task of visual UI grounding, which unifies detection and grounding by enabling models to identify UI elements referenced by natural language commands solely from visual input. The authors propose **LVG**, a model that outperforms baselines pre-trained on larger datasets by over 4.9 points in top-1 accuracy, demonstrating its effectiveness in localizing referenced UI elements without relying on UI metadata. +- [WebSuite: Systematically Evaluating Why Web Agents Fail](https://arxiv.org/abs/2406.01623) + - Eric Li, Jim Waldo + - πŸ›οΈ Institutions: Harvard + - πŸ“… Date: June 1, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [framework], [failure analysis], [analysis], [task disaggregation] + - πŸ“– TLDR: This paper introduces *WebSuite*, a diagnostic benchmark to investigate the causes of web agent failures. By categorizing agent tasks using a taxonomy of operational, informational, and navigational actions, WebSuite offers granular insights into the specific actions where agents struggle, like filtering or form completion. It enables detailed comparison across agents, identifying areas for architectural and UX adaptation to improve agent reliability and task success on the web. + - [Unveiling Disparities in Web Task Handling Between Human and Web Agent](https://arxiv.org/abs/2405.04497) - Kihoon Son, Jinhyeon Kwon, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun, Juho Kim - πŸ›οΈ Institutions: KAIST, Seoul National University @@ -558,14 +558,14 @@ - πŸ”‘ Key: [framework], [dataset], [web-based VLN], [HTML content integration], [multimodal navigation] - πŸ“– TLDR: This paper introduces the *WebVLN* task, where agents navigate websites by following natural language instructions that include questions and descriptions. Aimed at emulating real-world browsing behavior, the task allows the agent to interact with elements not directly visible in the rendered content by integrating HTML-specific information. A new *WebVLN-Net* model, based on the VLN BERT framework, is introduced alongside the *WebVLN-v1* dataset, supporting question-answer navigation across web pages. 
This framework demonstrated significant improvement over existing web-based navigation methods, marking a new direction in vision-and-language navigation research. -- [Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study](https://arxiv.org/abs/2403.03186) - - Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, BΓΆrje F. Karlsson, Bo An, Zongqing Lu - - πŸ›οΈ Institutions: NTU, BAAI, PKU +- [Android in the Zoo: Chain-of-Action-Thought for GUI Agents](https://arxiv.org/abs/2403.02713) + - Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang + - πŸ›οΈ Institutions: Fudan University, Huawei - πŸ“… Date: March 5, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Desktop] - - πŸ”‘ Key: [framework], [Cradle], [General Computer Control], [multimodal], [keyboard and mouse control], [long-term memory], [reasoning], [self-improvement] - - πŸ“– TLDR: This paper introduces *Cradle*, a framework designed to achieve General Computer Control (GCC) by enabling agents to perform any computer task using only screen images (and possibly audio) as input and producing keyboard and mouse operations as output. The authors deploy Cradle in the complex AAA game Red Dead Redemption II, demonstrating its capability to follow the main storyline and complete real missions with minimal reliance on prior knowledge or resources. + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [framework], [dataset], [Android GUI], [Chain-of-Action-Thought], [autonomous GUI agents] + - πŸ“– TLDR: This paper introduces *Chain-of-Action-Thought* (CoAT), a novel paradigm to improve GUI agent task completion by enabling agents to interpret previous actions, current screen content, and action rationale for next steps. The authors present the *Android-In-The-Zoo* (AitZ) dataset, which includes 18,643 screen-action pairs with detailed annotations, supporting CoAT's development and evaluation. The study demonstrates that fine-tuning with the AitZ dataset improves performance of a baseline large language model in predicting correct action sequences in Android tasks. - [Cradle: Empowering Foundation Agents Towards General Computer Control](https://arxiv.org/abs/2403.03186) - Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, BΓΆrje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu @@ -576,14 +576,14 @@ - πŸ”‘ Key: [framework], [model], [general computer control], [skill curation], [self-improvement] - πŸ“– TLDR: This paper introduces the Cradle framework, designed to enable general computer control (GCC) through multimodal input (e.g., screen images and optional audio) and outputs (keyboard and mouse). Cradle’s six core modules, including self-reflection, skill curation, and memory, allow for generalized task handling in complex environments like AAA games. Demonstrated in *Red Dead Redemption II*, the framework exhibits adaptability by performing real missions and following the storyline with minimal prior knowledge, showcasing its potential as a generalist agent for diverse computer tasks. 
-- [Android in the Zoo: Chain-of-Action-Thought for GUI Agents](https://arxiv.org/abs/2403.02713) - - Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang - - πŸ›οΈ Institutions: Fudan University, Huawei +- [Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study](https://arxiv.org/abs/2403.03186) + - Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, BΓΆrje F. Karlsson, Bo An, Zongqing Lu + - πŸ›οΈ Institutions: NTU, BAAI, PKU - πŸ“… Date: March 5, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [framework], [dataset], [Android GUI], [Chain-of-Action-Thought], [autonomous GUI agents] - - πŸ“– TLDR: This paper introduces *Chain-of-Action-Thought* (CoAT), a novel paradigm to improve GUI agent task completion by enabling agents to interpret previous actions, current screen content, and action rationale for next steps. The authors present the *Android-In-The-Zoo* (AitZ) dataset, which includes 18,643 screen-action pairs with detailed annotations, supporting CoAT's development and evaluation. The study demonstrates that fine-tuning with the AitZ dataset improves performance of a baseline large language model in predicting correct action sequences in Android tasks. + - πŸ’» Env: [Desktop] + - πŸ”‘ Key: [framework], [Cradle], [General Computer Control], [multimodal], [keyboard and mouse control], [long-term memory], [reasoning], [self-improvement] + - πŸ“– TLDR: This paper introduces *Cradle*, a framework designed to achieve General Computer Control (GCC) by enabling agents to perform any computer task using only screen images (and possibly audio) as input and producing keyboard and mouse operations as output. The authors deploy Cradle in the complex AAA game Red Dead Redemption II, demonstrating its capability to follow the main storyline and complete real missions with minimal reliance on prior knowledge or resources. - [Improving Language Understanding from Screenshots](https://arxiv.org/abs/2402.14073) - Tianyu Gao, Zirui Wang, Adithya Bhaskar, Danqi Chen diff --git a/paper_by_key/paper_self-improvement.md b/paper_by_key/paper_self-improvement.md index 880bbc2..1aa38e6 100644 --- a/paper_by_key/paper_self-improvement.md +++ b/paper_by_key/paper_self-improvement.md @@ -9,15 +9,6 @@ - πŸ”‘ Key: [self-improvement], [self-improve] - πŸ“– TLDR: This paper investigates the ability of large language models (LLMs) to enhance their performance as web agents through self-improvement. Utilizing the WebArena benchmark, the authors fine-tune LLMs on synthetic training data, achieving a 31% improvement in task completion rates. They also introduce novel evaluation metrics to assess the performance, robustness, and quality of the fine-tuned agents' trajectories. -- [Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study](https://arxiv.org/abs/2403.03186) - - Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, BΓΆrje F. 
Karlsson, Bo An, Zongqing Lu - - πŸ›οΈ Institutions: NTU, BAAI, PKU - - πŸ“… Date: March 5, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Desktop] - - πŸ”‘ Key: [framework], [Cradle], [General Computer Control], [multimodal], [keyboard and mouse control], [long-term memory], [reasoning], [self-improvement] - - πŸ“– TLDR: This paper introduces *Cradle*, a framework designed to achieve General Computer Control (GCC) by enabling agents to perform any computer task using only screen images (and possibly audio) as input and producing keyboard and mouse operations as output. The authors deploy Cradle in the complex AAA game Red Dead Redemption II, demonstrating its capability to follow the main storyline and complete real missions with minimal reliance on prior knowledge or resources. - - [Cradle: Empowering Foundation Agents Towards General Computer Control](https://arxiv.org/abs/2403.03186) - Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, BΓΆrje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu - πŸ›οΈ Institutions: Skywork AI, BAAI, NTU, PKU, Institute of Software - Chinese Academy of Sciences, HKU, CUHK @@ -26,3 +17,12 @@ - πŸ’» Env: [Desktop] - πŸ”‘ Key: [framework], [model], [general computer control], [skill curation], [self-improvement] - πŸ“– TLDR: This paper introduces the Cradle framework, designed to enable general computer control (GCC) through multimodal input (e.g., screen images and optional audio) and outputs (keyboard and mouse). Cradle’s six core modules, including self-reflection, skill curation, and memory, allow for generalized task handling in complex environments like AAA games. Demonstrated in *Red Dead Redemption II*, the framework exhibits adaptability by performing real missions and following the storyline with minimal prior knowledge, showcasing its potential as a generalist agent for diverse computer tasks. + +- [Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study](https://arxiv.org/abs/2403.03186) + - Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, BΓΆrje F. Karlsson, Bo An, Zongqing Lu + - πŸ›οΈ Institutions: NTU, BAAI, PKU + - πŸ“… Date: March 5, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Desktop] + - πŸ”‘ Key: [framework], [Cradle], [General Computer Control], [multimodal], [keyboard and mouse control], [long-term memory], [reasoning], [self-improvement] + - πŸ“– TLDR: This paper introduces *Cradle*, a framework designed to achieve General Computer Control (GCC) by enabling agents to perform any computer task using only screen images (and possibly audio) as input and producing keyboard and mouse operations as output. The authors deploy Cradle in the complex AAA game Red Dead Redemption II, demonstrating its capability to follow the main storyline and complete real missions with minimal reliance on prior knowledge or resources. 
diff --git a/update_template_or_data/logs/error.log b/update_template_or_data/logs/error.log
new file mode 100644
index 0000000..e69de29
diff --git a/update_template_or_data/statistics/keyword_wordcloud.png b/update_template_or_data/statistics/keyword_wordcloud.png
index bd1bc6a..fd766c6 100644
Binary files a/update_template_or_data/statistics/keyword_wordcloud.png and b/update_template_or_data/statistics/keyword_wordcloud.png differ
diff --git a/update_template_or_data/statistics/keyword_wordcloud_long.png b/update_template_or_data/statistics/keyword_wordcloud_long.png
index 2e923de..cae2be8 100644
Binary files a/update_template_or_data/statistics/keyword_wordcloud_long.png and b/update_template_or_data/statistics/keyword_wordcloud_long.png differ
diff --git a/update_template_or_data/statistics/top_authors.png b/update_template_or_data/statistics/top_authors.png
index 377d5cf..b7e7621 100644
Binary files a/update_template_or_data/statistics/top_authors.png and b/update_template_or_data/statistics/top_authors.png differ
diff --git a/update_template_or_data/update_readme_template.md b/update_template_or_data/update_readme_template.md
index 9abd36c..97026a2 100644
--- a/update_template_or_data/update_readme_template.md
+++ b/update_template_or_data/update_readme_template.md
@@ -1,6 +1,6 @@
 # Awesome GUI Agent Paper List
 
-This paper list covers a variety of papers related to GUI Agents, such as:
+This repo covers a variety of papers related to GUI Agents, such as:
 - GUI Understanding
 - Datasets
 - Benchmarks