OSWorld — AI computer use capabilities
Tasks are simple, many don't require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time.
We looked at OSWorld, a popular evaluation of AI computer use capabilities.
OSWorld consists of 361 tasks sourced from forums and tutorials. Models get an Ubuntu VM and task instructions, and write code to interact with the mouse and keyboard.
Here is one task’s starting state. Instructions: “Make a duplicate of the last two slides for me, please.”
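To make this concrete, here is a hypothetical sketch of the kind of action code a model might emit for this task. OSWorld agents typically act by running short pyautogui snippets against the VM; the coordinates below are invented for illustration.

```python
# Hypothetical sketch of one agent step (coordinates invented for illustration).
import time
import pyautogui

pyautogui.click(90, 540)           # select the second-to-last slide thumbnail
pyautogui.keyDown("shift")
pyautogui.click(90, 610)           # extend the selection to the last slide
pyautogui.keyUp("shift")
pyautogui.hotkey("ctrl", "c")      # copy both slides
time.sleep(0.5)                    # give the UI time to respond
pyautogui.hotkey("ctrl", "v")      # paste the duplicates after the selection
```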
Most tasks are realistic but relatively simple, requiring fewer than ten steps (clicks, text inputs, etc.). These tasks take humans only a few minutes to complete.
Contrary to standard practice, OSWorld is updated continuously. A major update in July changed most tasks, and 10% of task instructions have been updated since then. Furthermore, 10% of tasks rely on live data from the web. This makes it hard to compare results over time.
OSWorld is about computer use, but many tasks require little interaction with a graphical user interface. About 15% can be solved from the terminal alone, and a further 30% can be completed largely with Python scripts. We even found cases of models downloading packages to manipulate spreadsheets directly rather than editing them in the GUI.
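As a hypothetical illustration of that pattern (the file path and edit are made up), a model can skip the spreadsheet application entirely and edit the file with a downloaded library:

```python
# Hypothetical illustration: instead of clicking through a spreadsheet app,
# install a library and apply the requested change directly to the file.
import subprocess
import sys

subprocess.run([sys.executable, "-m", "pip", "install", "openpyxl"], check=True)

from openpyxl import load_workbook

wb = load_workbook("/home/user/report.xlsx")   # hypothetical task spreadsheet
ws = wb.active
ws["B2"] = 1000                                # make the requested edit
wb.save("/home/user/report.xlsx")
```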
We also noticed ambiguity in several task instructions. While this *is* realistic, it makes interpreting progress on OSWorld harder. If a model improves, is that because it got better at computer use, or because it got better at guessing the intent of the tasks?
Finally, we found serious errors in about 10% of tasks (not an unusual rate for a benchmark). Some tasks have incorrect answer keys, some evaluation functions are too strict or too lax, and some instructions are fatally ambiguous.
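To illustrate the second failure mode, here is a sketch, not an actual OSWorld evaluation function, of how a checker that compares raw strings can be stricter than the task intends:

```python
# Hypothetical sketch, not an actual OSWorld checker: a string-equality
# evaluator rejects answers a human would accept.
def strict_check(cell_value: str) -> bool:
    return cell_value == "1000"            # rejects "1,000" and "1000.0"

def lenient_check(cell_value: str) -> bool:
    # Normalize formatting before comparing the numeric value.
    return float(cell_value.replace(",", "")) == 1000.0

print(strict_check("1,000"))   # False
print(lenient_check("1,000"))  # True
```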
Read the full report on our website.




