Multiple frontier models seem to be scoring exactly 83% on GPQA Diamond. Is this a sign of model limitations or of benchmark saturation?
GPQA Diamond: What’s Left?