Practical usage #2936
Replies: 3 comments 3 replies
-
I would like to link the RoadMap.
-
I was talking mostly about a complete revision (which doesn't mean scrapping everything that already exists) rather than partial revision and addition (which seems to be the current approach). It's similar to OpenAI's approach with GPT-4o. At first, the belief was that the bigger the model, the better the performance, so they kept pushing it all the way to a 1.7T-parameter model, which requires a supercomputer to run. Later, seeing smaller open-source models matching its performance, they went back to the drawing board, combined what they had already learned/achieved with the latest developments from the community, and created a masterpiece. OpenAI didn't reveal GPT-4o's parameter count, but I'd guarantee it's under 100B, and the proof is its speed and pricing compared to GPT-4. GPT-4 is now the most expensive model despite being the weakest, and Google's Gemma-2 outperforms GPT-4 despite being a 27B model.
-
Hey @rezzie-rich, FWIW I do think your comments here are very valid! I haven't had time to write a comprehensive response yet, but just wanted to 👍
-
Hey guys, I really appreciate all the effort you are putting in. I also apologize for not being able to contribute actual code. However, please consider my comments from a client/user perspective.
I have been closely following OD, so I know how much hard work has gone into the project, and CodeActAgent has come a long way with the current version 1.8. However, there seems to be a common issue with almost all AI developer agents (not only OD): they seem hyper-focused on the SWE-bench eval rather than practical usage. It's the same problem that happened with LLM fine-tuning. Everyone fine-tuning LLMs became so obsessed with getting the highest eval score that, despite SOTA leaderboard numbers, the models broke in practical usage and turned out to be completely useless. It got so bad that HuggingFace had to relaunch their leaderboard with brand-new benchmarks, wiping out all those "SOTA" broken models and creating a clean slate. I know a high score demonstrates an agent's capability, but when that becomes the main goal, it's much like a student studying only to pass the exam rather than to learn. Just as a 4.0 GPA doesn't mean the student will be a useful employee, the highest eval score doesn't mean the agent is usable in practical, real-world work. At the end of the day, what really matters is how well it works in real-world scenarios, not the degree/eval score.
The majority, if not all, agents (excluding some paid ones) are focused on the following:
A practical AI developer agent must be able to do the following:
Correct me if I'm wrong, but the current development seems messy due to a lack of project management. It feels like there is an existing project, and different contributors are bringing in different pieces and trying out what fits and what doesn't. Sometimes it works, and sometimes it breaks the whole thing, which eventually results in a lot of work with nothing to show for it. I have great respect for all of you for devoting your valuable effort, but without a proper plan and direction, it's just hard work down the drain. On the plus side, there's experience gained, but on the downside, there's a loss of expectation, time, and effort.
Before any further development, what's needed is a reliable team of contributors. The repo shows 162 contributors, but most of them have only one commit from three months ago, and some contributed only five lines of code. Everyone's effort is appreciated even if it's only a line (a line can sometimes make a huge difference), but what's needed is a stable team. By contrast, projects like ACR have only 8 contributors, yet it's one of the top-performing, stable projects with no open bugs or PRs (its only problem is a lack of features 😅).
Once there's a stable team, the team should research the latest developments in AI developer agents (Mentat, ACR, Agentless, and other relevant non-agent work) and, based on those findings, develop a comprehensive new structure for OpenDevin. Before creating new PRs and going through trial and error, there should be a clear goal and structure for the project as a whole, accounting for every aspect and feature, including the current bugs. For example, if I'm correct, micro agents are based on AutoDev and are the least utilized and developed agents, whereas GPTSwarm is a comprehensive agent structure that should be able to replace micro agents and provide better overall performance. Since GPTSwarm works like a hive mind, it should also be able to replace the delegator agent and take on that role, or even replace all other agents except CodeActAgent, which would then generate the code while GPTSwarm orchestrates the rest; CodeActAgent can in turn adopt the effective parts of other high-performing agents that fit its own structure and usability.
The point is that before jumping in to write code and try something, there should be team research and discussion to choose the right approach, and then all hands on deck to write the code. It's kind of like going back to the drawing board, since a lot has changed since the OD and CodeActAgent projects started. Development will also be faster this way, following a corporate workflow with a management structure rather than an open market. Like in the beginning, there should be a clear list for new contributors of what to contribute, chosen by the team.
I'm sorry if I've said things that aren't true or applicable, since I'm missing a lot of info. I know there is a Slack channel for discussion, so some of what I've said is based on assumptions and may not be accurate; I apologize for my ignorance, and please correct me if I'm wrong. I've commented on many aspects of the project. Some of it may be off when it comes to management and development, but on usability I believe I'm correct, as this isn't specific to OD: other agents and LLM tuners are doing the same thing, and it's the same pattern that produced over 34k LLM variants with "SOTA" benchmark scores that are deemed broken or useless in practical usage.
There are a lot of very qualified people behind OD, so there's no reason why OD shouldn't be at the top of the AI developer agent list.
@xingyaoww @enyst @tobitege @SmartManoj @neubig @li-boxuan @PierrunoYT @yufansong @mczhuge @Jiayi-Pan @rest