What does SWE-bench Verified actually measure?
It is one of the best tests of AI coding, but limited by its focus on simple bug fixes in familiar repositories
SWE-bench Verified is a set of 500 issues and their accompanying patches from real Python repositories. Each benchmark sample contains:
The repository before the PR was merged
The issue description(s)
For evaluation, each sample also contains (see the code sketch after this list):
The original, human-written solution
A set of tests
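For readers who want to poke at the data themselves, here is a minimal sketch that loads the benchmark via the Hugging Face datasets library. The field names shown are the ones commonly documented for SWE-bench; check the dataset card if they differ.

```python
# Minimal sketch (not an official example): load SWE-bench Verified and
# inspect one sample. Field names are assumptions based on the SWE-bench
# dataset card and may differ slightly.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
sample = ds[0]

print(sample["repo"])               # e.g. "django/django"
print(sample["base_commit"])        # repository state before the PR was merged
print(sample["problem_statement"])  # the issue description(s)
print(sample["patch"])              # the original, human-written solution
print(sample["test_patch"])         # tests added/changed to verify the fix
print(sample["FAIL_TO_PASS"])       # tests that must go from failing to passing
```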
The model is put in the position of a maintainer: given the issue description, it has to patch the repository so that the issue is resolved and the tests pass. This is much more realistic than assessing models on LeetCode-style tasks.
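Conceptually, evaluation boils down to applying the model's patch and re-running the relevant tests. The official harness runs each instance in a purpose-built Docker environment; the sketch below is only a simplified illustration of that flow, with a hypothetical evaluate function and a generic pytest call (some repositories, such as Django, use their own test runner).

```python
# Simplified illustration of the evaluation loop, not the official harness.
import json
import subprocess

def evaluate(repo_dir: str, model_patch: str, sample: dict) -> bool:
    # Apply the model-generated patch to the checked-out repository.
    subprocess.run(["git", "apply", "-"], input=model_patch,
                   cwd=repo_dir, text=True, check=True)

    # FAIL_TO_PASS lists tests that fail before the fix and must pass after it
    # (stored as a JSON-encoded list in the dataset).
    fail_to_pass = json.loads(sample["FAIL_TO_PASS"])
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass],
                            cwd=repo_dir)
    return result.returncode == 0
```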
The issues are all drawn from live codebases, and human annotators filtered them from the larger SWE-bench set. Based on our manual analysis of 40 issues, we think a relatively low fraction (~5-10%) of SWE-bench Verified's issues are incorrect and unsolvable.
People sometimes equate SWE-bench Verified scores with general software engineering ability. In reality, SWE-bench Verified predicts whether an AI can fix simple issues (ones that would take a software engineer at most a couple of hours) in an existing codebase.
Furthermore, the low diversity of codebases limits the benchmark's external validity: Django alone comprises nearly half of all issues, and five repositories account for over 80% of the benchmark.
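The skew is easy to check yourself by counting the repo field (a quick sketch, assuming the same Hugging Face dataset as above):

```python
# Count how many issues come from each repository.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
counts = Counter(ds["repo"])
for repo, n in counts.most_common(5):
    print(f"{repo}: {n} issues ({n / len(ds):.0%})")
```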
Contamination is a major concern for SWE-bench Verified: since all the issues are sourced from well-known repositories and date from 2023 or earlier, it is likely that current models have been trained on these repositories.
Recently, many benchmarks have been developed that build on SWE-bench Verified’s foundation, such as SWE-bench Multimodal, SWE-bench Multilingual, Multi-SWE-bench, and Terminal-Bench. We may review them in future issues.
Read the full report on our website.