A fun, 'benchmark-y' puzzle/escape room to test your prompting skills or, vice versa, your puzzle-writing skills.
The idea:
The agent framework is set up with the tools that will be available to the agent.
These tools unlock successive items (often new tools, but perhaps other things, TBD), each of which must be successfully navigated or solved to unlock the next step. The final step is Opening the Vault, which holds the secret phrase 'TERMINATE'; that phrase ends the agent's run.
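To make the unlock chain concrete, here is a minimal, framework-agnostic sketch, assuming each stage is a tool with a pass/fail check and a reward that reveals the next step. The names `Stage`, `EscapeRoom`, `is_termination`, and the example stages (`keypad`, `open_vault`) are all hypothetical; they are not the demo's actual tools or any AutoGen API.

```python
# Hypothetical, framework-agnostic sketch of the unlock chain.
# Stage names, answers, and rewards are invented for illustration.
from dataclasses import dataclass
from typing import Callable

SECRET_PHRASE = "TERMINATE"


@dataclass
class Stage:
    name: str                     # tool the agent can call once the stage is unlocked
    check: Callable[[str], bool]  # True when the agent's answer solves this stage
    reward: str                   # returned on success (a clue, the next tool's name, ...)


@dataclass
class EscapeRoom:
    stages: list[Stage]
    unlocked: int = 1  # how many stages (tools) the agent can currently use
    mistakes: int = 0

    def available_tools(self) -> list[str]:
        return [s.name for s in self.stages[: self.unlocked]]

    def attempt(self, tool: str, answer: str) -> str:
        names = self.available_tools()
        if tool not in names:
            # A real framework would simply not expose locked tools; here we count it.
            self.mistakes += 1
            return f"Unknown or locked tool: {tool}"
        index = names.index(tool)
        stage = self.stages[index]
        if not stage.check(answer):
            self.mistakes += 1
            return "That did not work."
        # Solving the newest stage unlocks the next one, if any remain.
        if index == self.unlocked - 1 and self.unlocked < len(self.stages):
            self.unlocked += 1
        return stage.reward


def is_termination(message: str) -> bool:
    """The run ends once the secret phrase shows up in a message."""
    return SECRET_PHRASE in message


room = EscapeRoom([
    Stage("keypad", lambda a: a.strip() == "1234", "Click. A new tool is available: open_vault"),
    Stage("open_vault", lambda a: a.strip().lower() == "open", f"The vault swings open: {SECRET_PHRASE}"),
])
```

Calling `room.attempt("keypad", "1234")` unlocks `open_vault`, after which `is_termination(room.attempt("open_vault", "open"))` is `True`. In a real framework, the `is_termination` check would be hooked into whatever termination condition that framework offers.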
The run begins with the User Request "Hello. I need to open the vault!" and the agent has to take it from there.
No actual user interaction is allowed, so you can't give hints, berate them, or point them in the right direction.
We will allow a set of fixed user replies that function as important information beacons, along with tools that provide other information.
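One possible way to keep the "user" fixed is to feed the agent canned replies that never depend on what it said, so no live hints can sneak in. This sketch assumes the beacons are delivered in order and the last one repeats; the reply texts and the `fixed_user_replies` helper are hypothetical placeholders, and wiring them into an AutoGen user proxy is left out.

```python
# Hypothetical sketch: fixed "user" replies delivered in order, last one repeated.
from itertools import chain, repeat
from typing import Iterator

# Placeholder beacon texts; the real puzzle would define its own.
FIXED_REPLIES = [
    "I can't help directly, but the tools you have should be enough.",
    "Remember: use your tools rather than guessing.",
]


def fixed_user_replies(replies: list[str]) -> Iterator[str]:
    """Yield the fixed replies in order, then keep repeating the last one.

    The replies ignore the agent's messages entirely, so they cannot act as hints.
    """
    if not replies:
        raise ValueError("need at least one fixed reply")
    return chain(replies, repeat(replies[-1]))
```

Each time the agent hands control back to the "user", the harness would take `next(...)` from this iterator instead of asking a human.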
Perhaps a two-agent version will spin off. TBD.
Measuring the number of turns, mistakes, and so on will let us derive benchmark-y results, in a fun way.
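A per-run scorecard could be as simple as a few counters updated after every agent turn. This is a minimal sketch assuming a "mistake" is whatever the harness flags (for example doing math by hand instead of calling the calculator tool); `RunMetrics` and the `mistakes_per_turn` metric are hypothetical choices, not a settled scoring rule.

```python
# Hypothetical scorecard for one run; field names and the metric are illustrative only.
from dataclasses import dataclass


@dataclass
class RunMetrics:
    turns: int = 0
    tool_calls: int = 0
    mistakes: int = 0  # e.g. skipped tools, failed tool calls, wrong answers
    solved: bool = False

    def record_turn(self, message: str, tool_call: bool = False, mistake: bool = False) -> None:
        """Update the counters after each agent turn."""
        self.turns += 1
        self.tool_calls += int(tool_call)
        self.mistakes += int(mistake)
        if "TERMINATE" in message:
            self.solved = True

    def summary(self) -> dict[str, object]:
        mistakes_per_turn = self.mistakes / self.turns if self.turns else 0.0
        return {
            "solved": self.solved,
            "turns": self.turns,
            "tool_calls": self.tool_calls,
            "mistakes": self.mistakes,
            "mistakes_per_turn": mistakes_per_turn,
        }


metrics = RunMetrics()
metrics.record_turn("2 + 2 is 4, no calculator needed.", mistake=True)
metrics.record_turn("Calling the calculator tool.", tool_call=True)
metrics.record_turn("The vault is open: TERMINATE")
```

Comparing `metrics.summary()` across prompts, agents, or puzzle variants is then just a matter of collecting these dictionaries.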
Ways to exercise your skills (and those of your agent):
- Adding your own Agent Prompt
- Adding your own code to help your agent (no cheating allowed - rules TBD)
- Adding new puzzles (making the obstacle course longer or more complicated) - again, rules TBD.
We'll start with the Open the Vault demo I wrote while coding up some other stuff. I got frustrated at the agent repeatedly doing simple math on its own instead of using the calculator tool I asked it to use, refusing to use a web search tool and hallucinating the top 5 movies this week (which often seem to include some Quantum Tale of Adventure or other), or telling me the weather tomorrow would be nice without ever bothering to check a weather service. (I guess LLMs know that weathermen are all fake.)
More as I flesh this out...
Code will be AutoGen-ic first, but ports to other frameworks are welcome. Keeping things level across frameworks will be interesting.