GPQA Diamond: what’s left?
Multiple frontier models appear to score exactly 83% on GPQA Diamond. Is this a sign of model limitations or of benchmark saturation?