Research & Development World


OpenAI releases o3, a model that tops 99% of human competitors on IOI 2024 and Codeforces benchmarks

By Brian Buntz | April 16, 2025

[Image: OpenAI logo on black background. Adobe Stock]

OpenAI has just announced the gradual release of o3 and o4‑mini, two new reasoning models that pair larger test‑time compute with full tool access. As OpenAI President Greg Brockman stated during the launch event, some models “feel like a qualitative step into the future,” and OpenAI believes “today is going to be another step forward” with these releases. He emphasized their potential, noting:

“For the first time, top scientists tell us these models generate genuinely useful novel ideas.”

— Greg Brockman

He elaborated that these are more than just models; they are “AI systems” designed for complex interactions and problem-solving with external utilities. “We trained them to use tools inside their chain of thought,” Brockman explained, citing an example where “o3 issued 600 tool calls in a row to reach a solution” for one particularly hard task. The o3 model posts new best‑in‑class scores across coding, math, science, and vision benchmarks, while o4‑mini offers a leaner profile optimized for speed and cost.

Popular AI commentator AI Explained acknowledged the advancement in a recent analysis video: “Don’t get me wrong, they [o3 and o4-mini] are much better than previous models like o1… It’s a great model, a big improvement.” The video, published today on YouTube, offers a more measured assessment while still recognizing significant progress.

Both models integrate images directly into their reasoning loop and call external tools — Python, web search, custom functions — during inference. Brockman also highlighted a capability “close to my heart”: software engineering. He finds that the “O‑series models are now better at finding their way around our own OpenAI code than I am—and that’s a big productivity boost.”
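
For developers, that loop is exposed through OpenAI’s API. As a rough sketch of what tool-augmented inference can look like in Python (the prompt below is invented, and the built-in web-search tool type follows OpenAI’s Responses API documentation rather than anything shown in the launch stream):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative sketch only: the model decides mid-reasoning whether
# and when to invoke the web-search tool before answering.
response = client.responses.create(
    model="o3",
    tools=[{"type": "web_search_preview"}],
    input="What are the most recent published estimates of the top quark mass?",
)

print(response.output_text)  # the final answer, after any tool calls
```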

Tool use in practice

Mark Chen, OpenAI’s head of research, used the launch stream to spell out why tool access matters so much. “Pairing reasoning models with tools makes them smarter,” he argued, drawing parallels: “just as a calculator helps with arithmetic or Maps helps with navigation.” Chen stated that “combining the o‑series with our full tool‑suite pushes state‑of‑the‑art on hard benchmarks.”

AI Explained seems to agree with this approach, noting in their analysis: “Both models are trained from the ground up to use tools, and I think that’s an epic way to improve models, and they will get even more useful very fast.” This alignment on the importance of tool integration suggests a growing consensus around this development path.

Chen added that “tools also unlock new functions: models can now think with images,” with the model calling Python code “to crop or transform an image server‑side.” This means users “can upload blurry, upside‑down, or complex pictures, and the model handles them.”
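
The transforms Chen describes are ordinary image operations; the novelty is the model choosing to run them mid-reasoning. A minimal sketch of the kind of Python involved, with an invented file name and crop coordinates:

```python
from PIL import Image  # Pillow

img = Image.open("poster_photo.jpg")  # e.g., a blurry, rotated upload

img = img.rotate(180, expand=True)      # fix an upside-down photo
plot = img.crop((400, 250, 1200, 900))  # zoom in on the relevant region
plot = plot.resize((plot.width * 2, plot.height * 2))  # enlarge for legibility

plot.save("plot_zoomed.png")
```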

During the livestream, OpenAI member of the technical staff Brandon McKinzie demonstrated this by revisiting a 2015 physics‑poster project. He explained how o3 performed the complex analysis: “o3 zooms, finds the relevant plot, extrapolates the slope to the physical quark mass, normalizes the number, then queries recent papers for updated estimates. It saves days of manual work.” Researcher Wenda Li added that to achieve this, the model actively browses the code, using “common terminal tools… lists the files, opens the file with `sed`, tries to find what it’s looking for.” It even identified inheritance issues in the code, checking the Method Resolution Order (MRO) before applying a patch and running unit tests “just like any good engineer would.” Researcher Ananya Kumar noted that this sophisticated, multi-step problem-solving wasn’t explicitly programmed: the model “just organically learns to do these things,” including simplifying its own initial brute-force code and double-checking results.
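
The MRO check Li describes is a standard Python inspection. With a hypothetical class hierarchy (the names below are invented for illustration), it looks like this:

```python
import inspect

# Diamond inheritance: the kind of structure where checking the
# Method Resolution Order matters before applying a patch.
class Storage: ...
class CachedStorage(Storage): ...
class LoggedStorage(Storage): ...
class Repository(CachedStorage, LoggedStorage): ...

# Python linearizes the hierarchy (C3 linearization); the first class
# in this list that defines a method is the one whose code runs.
print([c.__name__ for c in inspect.getmro(Repository)])
# ['Repository', 'CachedStorage', 'LoggedStorage', 'Storage', 'object']
```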

On SWE‑bench, o3 repaired open‑source bugs with an average of 37 containerized shell interactions per fix; some fixes required more than 100, showcasing persistence across long tool‑use chains.

Benchmark performance

The models demonstrate impressive benchmark results across various domains. AI Explained highlighted one particular achievement: “On the MMMU… o3 gets 82.9%. That is genuinely better than Gemini 2.5 Pro’s 81.7%. So well done to OpenAI on that.” This represents a significant milestone as OpenAI surpasses a key competitor on an important multimodal benchmark.

AI Explained also noted success on their own testing framework: “O3 was the first model to get 6 out of 10 on the first 10 public questions of my own benchmark, SimpleBench… I was genuinely impressed with some of its answers.” The analysis particularly praised o3’s ability to analyze the benchmark website itself: “I was particularly impressed with the way that it analyzed my benchmark website, created an image… and did a deep dive analysis which was pretty much accurate… and honestly some pretty nuanced advice about the benchmark itself and its limitations.”

Developer ecosystem

The company open‑sourced codex‑cli, which presenter Fouad Torky described as “a lightweight interface to connect our models to our users and their computers.” He positioned it as “a reference implementation for how to safely deploy code-executing agents where you need them,” built on public APIs like the Responses API and integrating new features such as chain-of-thought summaries. Presenter Michael Kim explained that Codex CLI runs in “suggest mode” by default, where “you get to approve each [command or edit],” but for faster workflows (like the live demo), it can operate in “full-auto mode.”

An accompanying $1 million open‑source credit fund is earmarked for projects that pair codex‑cli with the latest models to “accelerate the frontier of open source,” according to Torky.

Roll‑out schedule

Mark Chen detailed the availability in the livestream:

  • Today: “If you are a Pro, Plus, Team subscriber, you’re going to… start rolling out access to o3, o4‑mini, and o4‑mini‑high.” These will replace the previous o1 and o3‑mini models.
  • +1 week: “You’re going to have to wait a week if you are Enterprise or EDU.”
  • o1 Pro users: “If you use o1 Pro today and you love it, we are going to roll out o3 Pro, but it will take us some time…”
  • API: The o3 and o4-mini models are live now via the API. Chen noted that the API-based “tool‑use endpoints [are] coming in the next few weeks.”

Brockman confirmed the model replacement strategy: “we’re going to be replacing the o1 models with the new ones.” This aligns with AI Explained’s observation that “these [o3 and o4-mini] are now the best models within ChatGPT,” confirming their elevated status within OpenAI’s ecosystem.

With new reasoning depth, lower cost‑per‑token, and stronger multimodal skills, the o‑series aims to move advanced AI from lab curiosity to everyday scientific and engineering utility. Brockman concluded that these models represent “a major step forward in our mission of bringing AGI to benefit all of humanity.”

AI Explained offered a fitting summation in their analysis, acknowledging that “o3 represents genuine progress. Well done to OpenAI.” This measured praise from a critical voice in the AI community adds weight to OpenAI’s claims of advancement, even as the full impact of these new models remains to be seen.

[Chart: Benchmark comparison bar chart for AIME, GPQA, and SWE, showing o4‑mini and o3 versus peers, shown during the presentation.]

AI Explained quotes sourced from YouTube video “Critical analysis of the two most powerful new models behind ChatGPT, o3 and o4-mini” published April 16, 2025.
