Alt-text generated by GPT-5, demonstrating that it cannot see the actual problem with the chart, which is that the bar heights bear no relation to the numbers:
Bar chart titled “SWE-bench Verified – Software engineering” showing accuracy percentages (pass@1) for three AI models, with two shades representing “without thinking” and “with thinking.” GPT-5 bar is tallest: ~52.8% without thinking, rising to ~74.9% with thinking. OpenAI o3 scores 69.1% (only one bar), GPT-4o scores 30.8% (only one bar). Problems: inconsistent use of “with thinking” data (missing for o3 and GPT-4o), unclear legend placement, and awkward color overlap in GPT-5’s stacked bar.
so we're trusting #altmanAI with the world's safety, when they can't even be trusted to produce a graph. swell.