Discussion about this post

JP

Coming to this a bit late, but it's a useful corrective. The point that roughly 15% of tasks are solvable via the terminal alone is the kind of detail that doesn't make it into the headlines.

I've been writing about Claude's 72.5% OSWorld score (https://reading.sh/inside-anthropics-quiet-bet-on-computer-vision-ba0cf1000756) in the context of Anthropic's Vercept acquisition, and your critique here actually makes the acquisition more interesting, not less. If a chunk of OSWorld tasks don't require genuine GUI interaction, then the real-world performance gap for actual screen-based computer use could be wider than the benchmark suggests. Which means Anthropic bringing in Girshick's team isn't just about pushing a number higher. It's about solving the harder perception problems that the benchmark doesn't fully capture.

Did you find any correlation between task complexity and model failure rate? I'm curious whether the models fail more on genuinely GUI-dependent tasks than on the ones solvable through scripting.
