Practical usage #2936
Replies: 3 comments 3 replies
-
I would like to link the RoadMap.
-
I was talking mostly about a complete revision (which doesn't mean scrapping everything that already exists) rather than partial revision and addition (which seems to be the current approach). It's similar to OpenAI's approach with GPT-4o. At first, the belief was that the bigger the model, the better the performance, so they kept pushing it all the way to a 1.7T-parameter model, which requires a supercomputer to run. Later, seeing smaller open-source models matching its performance, they went back to the drawing board, combined what they had already learned/achieved with the latest developments from the community, and created a masterpiece. OpenAI didn't reveal GPT-4o's parameter count, but I'd guarantee it's under 100B, and the proof is its speed and pricing compared to GPT-4. GPT-4 is now the most expensive model despite being the weakest, and Google's Gemma-2 outperforms GPT-4 despite being a 27B model.
-
Hey @rezzie-rich, FWIW I do think your comments here are very valid! I haven't had time to write a comprehensive response yet, but just wanted to 👍
-
Hey guys, I really appreciate all the effort you are putting in. I also apologize for not being able to contribute actual code. However, please consider my comments from a client/user perspective.
I have been closely following OD, so I know how much hard work has gone into the project, and CodeActAgent has come a long way with the current version 1.8. However, there seems to be a common issue with almost all AI developer agents (not only OD): they seem hyper-focused on the SWE-bench eval rather than practical usage. It's the same problem that happened with LLM fine-tuning. Everyone fine-tuning LLMs became so obsessed with getting the highest eval score that, despite SOTA leaderboard numbers, the models broke in practical usage and turned out to be completely useless. It got so bad that HuggingFace had to relaunch their leaderboard with brand-new benchmarks, wiping out all those "SOTA" broken models and creating a clean slate. I know a high score demonstrates an agent's capability, but when that becomes the main goal, it's much like a student studying only to pass the exam rather than to learn. Just as a 4.0 GPA doesn't mean the student will be a useful employee, the highest eval score doesn't mean the agent is usable in practical, real-world work. At the end of the day, what really matters is how well it works in real-world scenarios, not the degree/eval score.
The majority, if not all, agents (excluding some paid ones) are focused on the following:
A practical AI developer agent must be able to do the following:
Correct me if I'm wrong, but the current development seems messy due to a lack of project management. It feels like there is an existing project, and different contributors are bringing in different pieces and trying out what fits and what doesn't. Sometimes it works, and sometimes it breaks the whole thing, which eventually results in a lot of work with nothing to show for it. I have great respect for all of you for devoting your valuable effort, but without a proper plan and direction, it's just hard work down the drain. On the plus side, there's experience gained, but on the downside, there's a loss of expectation, time, and effort.
Before any further development, what's needed is a reliable team of contributors. The repo shows 162 contributors, but most of them have only one commit from three months ago, and some contributed only five lines of code. Everyone's effort is appreciated even if it's only a line (a line can sometimes make a huge difference), but what's needed is a stable team. By contrast, projects like ACR have only 8 contributors, yet it's one of the top-performing, stable projects with no open bugs or PRs (its only problem is a lack of features 😅).
Once there's a stable team, the team should research the latest developments in AI developer agents (Mentat, ACR, Agentless, and other relevant non-agent work) and, based on those findings, develop a comprehensive new structure for OpenDevin. Before creating new PRs and going through trial and error, there should be a clear goal and structure for the project as a whole, accounting for every aspect and feature, including the current bugs. For example, if I'm correct, micro agents are based on AutoDev and are the least utilized and developed agents, whereas GPTSwarm is a comprehensive agent structure that should be able to replace micro agents and provide better overall performance. Since GPTSwarm works like a hive mind, it should also be able to replace the delegator agent and take on that role, or even replace all other agents except CodeActAgent, which would then generate the code while GPTSwarm orchestrates the rest; CodeActAgent can in turn adopt the effective parts of other high-performing agents that fit its own structure and usability.
The point is that before jumping in to write code and try something, there should be team research and discussion to choose the right approach, and then all hands on deck to write the code. It's kind of like going back to the drawing board, since a lot has changed since the OD and CodeActAgent projects started. Development will also be faster this way, following a corporate workflow with a management structure rather than an open market. Like in the beginning, there should be a clear list for new contributors of what to contribute, chosen by the team.
I'm sorry if I've said things that aren't true or applicable, since I'm missing a lot of info. I know there is a Slack channel for discussion, so some of what I've said is based on assumptions and may not be accurate; I apologize for my ignorance, and please correct me if I'm wrong. I've commented on many aspects of the project. Some of it may be off when it comes to management and development, but on usability I believe I'm correct, as this isn't specific to OD: other agents and LLM tuners are doing the same thing, and it's the same pattern that produced over 34k LLM variants with "SOTA" benchmark scores that are deemed broken or useless in practical usage.
There are a lot of very qualified people behind OD, so there's no reason why OD shouldn't be at the top of the AI developer agent list.
@xingyaoww @enyst @tobitege @SmartManoj @neubig @li-boxuan @PierrunoYT @yufansong @mczhuge @Jiayi-Pan @rest