
Raw power vs. precision
The o1 models, particularly o1-preview, showcase real mathematical prowess, scoring 83% on the American Invitational Mathematics Examination where their predecessor managed 13%. If you feed o1-preview an hour-long transcript full of filler words and ask it to remove them, it can do so with high precision in a minute or two, a task that would easily take 15 to 30 minutes by hand. In scientific domains it shows similar muscle, with significant improvements in understanding biological protocols and laboratory procedures, according to OpenAI’s research and some glowing early reviews from scientists. Given tasks like analyzing research papers or planning experiments, it processes the material quickly and, for the most part, accurately.
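For the curious, that transcript cleanup takes only a few lines against the API. Here’s a rough sketch using the OpenAI Python SDK; the file name and prompt wording are placeholders, and since o1-preview doesn’t accept system messages at the time of writing, everything goes into a single user message:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder path; any plain-text transcript will do.
with open("transcript.txt", encoding="utf-8") as f:
    transcript = f.read()

response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": (
                "Remove filler words (um, uh, like, you know) from the transcript "
                "below. Keep everything else exactly as spoken.\n\n" + transcript
            ),
        }
    ],
)

print(response.choices[0].message.content)
```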
But according to OpenAI’s system card evaluations, while o1-preview excels at tasks like protocol planning and scientific analysis, it sometimes provides incomplete or misleading safety information. OpenAI also classifies the o1 models as a “medium risk” for biological threat creation: the models can assist experts with operational planning, but they don’t enable non-experts to create biological threats. On the other hand, the system card highlights the models’ strong performance on tasks like the LAB-Bench evaluations, where o1-preview scored 81% on ProtocolQA. In its testing, OpenAI reported that biology experts found the models especially useful for speeding up research and explaining technical protocols. (Note: A more powerful version of the models is coming; to date, only o1-preview and o1-mini are publicly available.)
The tug of war
In my own unscientific experiments, I found significant quality-control issues when pushing the models to their limits with JavaScript and Python. Sometimes they get facts (or code) wrong that you provided just minutes earlier. In coding, they sometimes rearrange variables and imports for no good reason. In essence, working with these models often feels like a tug of war, especially on bigger coding tasks.
Yes, they can produce impressively long blocks of code without the dreaded placeholder syndrome that plagues other models. But reliability? It’s hit or miss. For very simple projects they may knock it out of the park on the first try; for more complex queries it’s closer to a coin toss.
According to a recent report from The Verge, the models occasionally exhibit what researchers call “reward hacking”: essentially, optimizing for user satisfaction in ways that can prioritize completing a task over maintaining accuracy.
They can also be long-winded. Ask the wrong question, or the right one in the wrong way, and you could get pages and pages of analysis that doesn’t actually help. At first, the models’ ability to handle large amounts of context seemed like a revelation. Share an entire codebase with them? Sure thing; go ahead and paste it in (if you can fit it in the context window). But it tends to unravel eventually. I’ve spent more than a few nights up until 2 or 3 a.m. debugging their “creative interpretations” of code. Sometimes the models would arbitrarily change variable casing, as if testing whether you’re paying attention. Some of the problems seemed to worsen when using the models via the API with especially large amounts of input; memory is definitely a rate-limiting factor.
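If you do plan to paste a codebase in, it’s worth estimating whether it even fits before burning a long round trip. A rough sketch with tiktoken, assuming the o200k_base encoding is a close enough proxy for the o1 models’ tokenizer (the directory path and file extensions are placeholders):

```python
import os
import tiktoken

# Assumption: o200k_base (the GPT-4o-era encoding) approximates the o1 models'
# tokenizer closely enough for a ballpark estimate.
enc = tiktoken.get_encoding("o200k_base")

def count_tokens(path: str, extensions=(".py", ".js", ".ts")) -> int:
    """Walk a project directory and sum the token counts of its source files."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            if name.endswith(extensions):
                full_path = os.path.join(root, name)
                with open(full_path, encoding="utf-8", errors="ignore") as f:
                    total += len(enc.encode(f.read()))
    return total

print(count_tokens("./my_project"))  # compare against the model's context window
```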
Sometimes a simpler model is, well, just more satisfying than one that thinks long and hard about your problem and eventually ends up in the woods sniffing trees.
Circular reasoning
The o1 models have a “chain of thought” feature that makes their reasoning process more transparent, though that transparency sometimes reveals concerning behaviors. In my experience, this manifests as a tendency to work in circles: suggesting you reinstall dependencies you just installed, or circling back to files you worked with an hour ago while ignoring more recent versions. In some cases, the models will refer back to an earlier request that you already resolved a half hour ago. But sure, let’s do that again just for fun…
There is more at play than just spinning one’s wheels. In a study of 100,000 test conversations, about 0.8% of o1-preview’s responses were flagged as “deceptive.” Breaking this down further, 0.56% were hallucinations (incorrect answers), with a twist: roughly two-thirds of these (0.38%) appeared to be “intentional,” meaning the model’s chain of thought revealed it knew the information was incorrect. The remaining hallucinations (0.18%) were unintentional. Most of the intentional hallucinations occurred in a specific scenario: when a user asked the model to provide references to articles, websites, or books, the model would fabricate plausible-sounding citations rather than admit it had no way to verify them.
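Those percentages are easier to follow as raw counts. A quick back-of-the-envelope check against the 100,000-conversation figure (my arithmetic, not OpenAI’s):

```python
total = 100_000                        # test conversations in the evaluation

flagged       = round(total * 0.0080)  # 0.80% flagged as "deceptive" -> 800
hallucinated  = round(total * 0.0056)  # 0.56% hallucinations         -> 560
intentional   = round(total * 0.0038)  # 0.38% "intentional"          -> 380
unintentional = round(total * 0.0018)  # 0.18% unintentional          -> 180

print(intentional + unintentional == hallucinated)  # True: the split adds up
print(round(intentional / hallucinated, 2))         # 0.68, i.e. roughly two-thirds
```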
The power-efficiency trade-off
The relationship between o1-preview and o1-mini mirrors another interesting dynamic. While o1-mini is billed as offering 90% of o1-preview’s capabilities at 20% of the cost, the reality is more nuanced. In my testing, they offer similar performance profiles, with the larger model sometimes showing an edge in general tasks while o1-mini occasionally surprises with superior coding performance.
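To put that “20% of the cost” claim in concrete terms, here’s a rough per-task cost comparison. The per-million-token prices below are the list prices published around launch; treat them as placeholders and check current pricing, and the token counts are purely hypothetical:

```python
# USD per 1M tokens; launch-era list prices, used here only as placeholders.
PRICES = {
    "o1-preview": {"input": 15.00, "output": 60.00},
    "o1-mini":    {"input": 3.00,  "output": 12.00},
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request for a given model."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A hypothetical coding task: 20k tokens of pasted-in code, 5k tokens generated.
for model in PRICES:
    print(f"{model}: ${task_cost(model, 20_000, 5_000):.2f}")
# o1-preview: $0.60, o1-mini: $0.12; the 20% figure, at least on paper.
```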
Similar to how a Rottweiler can both protect a home and accidentally knock over a vase, the key to working with these models is understanding their dual nature. They’re powerful research assistants that can crunch through math problems and blaze through scientific papers, but they are not infallible. They require constant vigilance and a healthy dose of skepticism, especially when they wag their metaphorical tails and confidently present you with a citation that doesn’t exist. Or, in my case, when I tried adding a new map to a website and got dozens of type errors in a single go that took hours to fix. Maybe I should just ignore those suggestions? In terms of time savings, it’s probably a wash at this point.