GPT-5.4 Ships With Native Computer Control, 1M Context, and State-of-the-Art Agent Benchmarks
OpenAI's new flagship model collapses reasoning, coding, and computer use into a single system — and it's already live in Cursor, the API, and ChatGPT. The benchmark numbers suggest a genuine step function in agentic capability.
OpenAI began rolling out GPT-5.4 on Thursday, unifying its advances in reasoning, code generation, and agentic workflows into a single frontier model available across ChatGPT, the API, and Codex. As @OpenAI announced, the release includes both GPT-5.4 Thinking and a GPT-5.4 Pro tier, positioning the model as OpenAI's answer to the question that has dominated the past year of AI development: can one model actually do the work?
The benchmark results suggest it can, or at least that it's closer than anything before it. According to @OpenAIDevs, GPT-5.4 sets new state-of-the-art marks on professional work and computer use tasks. The headline numbers: 83% on GDPval, 75% on OSWorld-Verified, and 57.7% on SWE-Bench Pro. That last figure is particularly notable — SWE-Bench Pro tests a model's ability to resolve real software engineering issues from open-source repositories, and a nearly 58% solve rate represents a meaningful jump from where frontier models sat even six months ago. These aren't toy problems. They're the kind of multi-file, multi-step engineering tasks that trip up junior developers.