Discussion about this post

David Sola

I believe the issue with the third task, or at least with some of its steps, is that we expect the agent to use the same tools and interfaces as a human (the website, clicking buttons, etc.). However, the agent could perform some of those intermediate steps much more efficiently through alternative interfaces, such as CLI commands or API requests to Substack.

Jasmine Sun

Thanks for writing this! I do a similar thing, having the leading models attempt a few typical job tasks every few months, and have found it super helpful for “feeling AI’s jagged edges”: seeing progress where it happens and all the places where it doesn’t.

