
Why o1-preview and o1-mini are like a STEM Rottweiler

By Brian Buntz | October 31, 2024

[Image: a Rottweiler, via Adobe Stock]

In the past month, I’ve spent considerable time with OpenAI’s o1 models, and I’ve come to think of them as the Rottweilers of the genAI world: powerful, occasionally stubborn, and requiring special handling. When they work well, they can be capable underlings—but they require a firm hand and careful attention to detail. And they can sometimes pull you off the metaphorical sidewalk.

Raw power vs. precision

The o1 models, particularly o1-preview, showcase mathematical prowess—scoring an 83% success rate on the American Invitational Mathematics Examination compared to its predecessor’s 13%. If you feed o1-preview an hour-long transcript with filler words and ask it to remove them, it can do so with high precision in a minute or two — a task that would easily take 15 to 30 minutes manually. In scientific domains, it shows similar muscle, with significant improvements in understanding biological protocols and laboratory procedures, according to OpenAI’s research and some glowing early reviews from scientists. When given tasks like analyzing research papers or planning experiments, it can process information accurately at high speed.
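To make the transcript example concrete, here is a minimal sketch of how that kind of request might look through OpenAI’s Python SDK. The model name, the prompt wording, and the assumption that your account has access to o1-preview are all illustrative, not a documented recipe.

```python
# A minimal sketch of the transcript-cleanup task described above, using
# OpenAI's Python SDK. Model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def strip_filler_words(transcript: str) -> str:
    """Ask the model to remove filler words while preserving everything else."""
    response = client.chat.completions.create(
        model="o1-preview",  # assumption: the preview model discussed here
        messages=[
            {
                # o1-preview launched without system-message support,
                # so the instructions ride along in the user turn.
                "role": "user",
                "content": (
                    "Remove filler words (um, uh, you know, like) from the "
                    "following transcript. Change nothing else, and return "
                    "only the cleaned transcript:\n\n" + transcript
                ),
            }
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(strip_filler_words("So, um, the reactor was, you know, stable."))
```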

But according to OpenAI’s system card evaluations, while o1-preview excels at tasks like protocol planning and scientific analysis, it sometimes provides incomplete or misleading safety information. OpenAI noted, too, it classifies the o1 models as a “medium risk” for biological threat creation. That is, the models can assist experts with operational planning but don’t enable non-experts to create biological threats. On the other hand, the system card highlights the models’ strong performance on tasks like LAB-Bench evaluations, where o1-preview scored 81% on ProtocolQA. In its testing, OpenAI reported that biology experts found the models especially useful for speeding up research processes and explaining technical protocols. (Note: A more powerful version of the models is coming— to date, only the o1-preview and o1-mini models are publicly available.)

The tug of war

In my own unscientific experiments, I found significant quality-control issues when pushing the models to their limits with JavaScript and Python. Sometimes they get facts (or code) wrong that you provided just minutes earlier. And in coding, they sometimes rename variables or rearrange imports for no good reason. In essence, working with these models often feels like a tug of war — especially on bigger coding tasks.

Yes, they can produce impressively long blocks of code without the dreaded placeholder syndrome that plagues other models. But reliability? It’s hit or miss. For very simple projects, they might knock it out of the park on the first try. For more complex queries, it’s closer to a coin toss.
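One cheap safeguard against the silent variable and import changes described above is to diff the model’s rewrite against your original before accepting it. Here is a minimal sketch using Python’s standard difflib; the file names and example strings are purely illustrative.

```python
# A quick guard against silent edits: diff the model's rewritten file
# against your original so unrequested variable renames or import shuffles
# surface immediately. Uses only the standard library.
import difflib

def review_model_rewrite(original: str, rewritten: str) -> str:
    """Return a unified diff of the model's changes for manual review."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        rewritten.splitlines(keepends=True),
        fromfile="original.py",
        tofile="model_rewrite.py",
    )
    return "".join(diff)

original = "import os\nuser_name = os.getenv('USER')\n"
rewritten = "import os\nuserName = os.getenv('USER')\n"  # unrequested casing change
print(review_model_rewrite(original, rewritten))
```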

According to a recent report from The Verge, the models occasionally exhibit what researchers call “reward hacking”—essentially optimizing for user satisfaction in ways that might prioritize completing a task over maintaining accuracy.

They can also be long-winded. Ask the wrong question, or the right one in the wrong way, and you could get pages and pages of analysis that don’t really help. At first, the models’ ability to handle large amounts of context seemed like a revelation. Share an entire codebase with them? Sure thing. Go ahead and copy and paste it in (if you can fit it in the context window). But it tends to unravel eventually. I’ve spent more than a few nights up until 2 or 3 a.m. debugging their “creative interpretations” of code. Sometimes the models would arbitrarily change variable casing, as if testing whether you’re paying attention. Some of the problems seemed to worsen when using the models via the API with especially large amounts of input — context handling is definitely a rate-limiting factor.
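If you do plan to paste in a large codebase, a rough token count up front can tell you whether you are about to blow past the context window. Here is a sketch using the tiktoken library; the o200k_base encoding and the 128,000-token budget are assumptions to check against your model’s documented limits.

```python
# A rough pre-flight check before pasting a codebase into the context
# window. The o200k_base encoding and the 128k budget are assumptions.
import pathlib
import tiktoken

ENCODING = tiktoken.get_encoding("o200k_base")
CONTEXT_BUDGET = 128_000  # assumed context window; leave headroom for output

def codebase_token_count(root: str, pattern: str = "*.py") -> int:
    """Sum token counts across source files under root."""
    total = 0
    for path in pathlib.Path(root).rglob(pattern):
        total += len(ENCODING.encode(path.read_text(errors="ignore")))
    return total

tokens = codebase_token_count("./src")
if tokens > CONTEXT_BUDGET:
    print(f"{tokens} tokens: too big to paste in one go; split by module.")
else:
    print(f"{tokens} tokens: fits, but long inputs still degraded in my testing.")
```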

Sometimes a simpler model is, well, just more satisfying than one that thinks long and hard about your problem and eventually ends up in the woods sniffing trees.

Circular reasoning

The o1 models have a “chain of thought” feature that makes their reasoning process more transparent—though this transparency sometimes reveals concerning behaviors. In my experience, this manifests as a tendency to work in circles: suggesting you reinstall dependencies you just installed, or returning to files you worked with an hour ago while ignoring more recent versions. In some cases, the models will circle back to an earlier request that you resolved a half hour ago — but let’s do that again, just for fun…

There is more at play than just spinning one’s wheels. In a study of 100,000 test conversations, about 0.8% of o1-preview’s responses were flagged as “deceptive.” Breaking this down further, 0.56% were hallucinations (incorrect answers), with a twist: roughly two-thirds of these (0.38%) appeared to be “intentional,” meaning the model’s chain of thought revealed it knew the information was incorrect. The remaining hallucinations (0.18%) were unintentional. Most of the intentional hallucinations occurred in a specific scenario: when a user asked the model to provide references to articles, websites, or books, the model supplied plausible-looking citations rather than admitting it couldn’t verify the sources.
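Given that failure mode, one cheap sanity check is to ask the public doi.org resolver whether any DOI the model supplies actually exists before trusting the citation. This is my own illustrative sketch, not anything OpenAI recommends, and it won’t catch a real DOI attached to the wrong paper.

```python
# A cheap filter for model-supplied references: doi.org answers a known DOI
# with a redirect (3xx) and an unknown one with 404. Requires requests.
import requests

def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    """Return True if the doi.org resolver recognizes the DOI."""
    try:
        resp = requests.head(
            f"https://doi.org/{doi}", allow_redirects=False, timeout=timeout
        )
    except requests.RequestException:
        return False  # network trouble: treat as unverified, not as fake
    return 300 <= resp.status_code < 400

# One real DOI (the NumPy paper in Nature) and one obviously invented one.
for doi in ("10.1038/s41586-020-2649-2", "10.9999/not-a-real-doi"):
    print(doi, "->", "resolves" if doi_resolves(doi) else "not found")
```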

The power-efficiency trade-off

The relationship between o1-preview and o1-mini presents another interesting dynamic. While o1-mini is billed as offering 90% of o1-preview’s capabilities at 20% of the cost, the reality is more nuanced. In my testing, the two offer similar performance profiles, with the larger model sometimes showing an edge in general tasks while o1-mini occasionally surprises with superior coding performance.
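The “20% of the cost” framing is easy to check with back-of-the-envelope arithmetic. The per-million-token prices below reflect OpenAI’s listed rates around the time of writing; treat them as illustrative, since pricing changes.

```python
# Back-of-the-envelope cost comparison behind the "20% of the cost" claim.
# Prices are per million tokens, as listed at the time of writing.
PRICES = {  # (input $/M tokens, output $/M tokens)
    "o1-preview": (15.00, 60.00),
    "o1-mini": (3.00, 12.00),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request for a given model."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 30k-token transcript in, 10k cleaned tokens out.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 30_000, 10_000):.2f}")
# o1-preview: $1.05; o1-mini: $0.21 -- exactly the 5x (20%) cost ratio.
```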

Similar to how a Rottweiler can both protect a home and accidentally knock over a vase, the key to working with these models is understanding their dual nature. They’re powerful research assistants that can crunch through math problems and blaze through scientific papers, but they are not infallible. They require constant vigilance and a healthy dose of skepticism — especially when they wag their metaphorical tails and confidently present you with a citation that doesn’t exist. In my case, an attempt to add a new map to a website produced dozens of type errors in a single pass that took hours to fix. Maybe I should have ignored those suggestions. In terms of time savings, it’s probably a wash at this point.
