Research & Development World


OpenAI’s ‘Strawberry’ AI: Is this the reasoning breakthrough we’ve been waiting for?

By Brian Buntz | September 10, 2024

[Image: strawberry (Firefly)]

OpenAI is gearing up to launch an AI model codenamed “Strawberry,” reportedly built for stronger reasoning and mathematical problem-solving, according to a report from The Information.

The release, which could come within the next two weeks, marks a significant step in OpenAI’s quest to develop more advanced AI systems. The initial release may be limited to a small group of testers. Strawberry, also previously known as Q* or Q-Star, is designed to bolster reasoning capabilities.

Strawberry’s “System 2” thinking

While OpenAI has yet to confirm details of the release, Strawberry reportedly employs a “System 2” style of thinking, a concept the psychologist Daniel Kahneman popularized in his book Thinking, Fast and Slow: a slow, deliberate, analytical mode of conscious reasoning. System 1, by contrast, is fast, intuitive, and emotional.

Strawberry reportedly spends significantly longer “thinking” than OpenAI’s current model, GPT-4o. The Information reports that it will spend 10-20 seconds processing its input and potential responses before delivering a final answer, in order to reduce errors.

In addition, Strawberry (formerly Q*) will likely:

  • Focus on advanced reasoning and problem-solving capabilities.
  • Demonstrate proficiency in solving mathematical problems.
  • Be integrated into ChatGPT, potentially via a model known as Orion or GPT-5.

The Information had previously reported that OpenAI was also developing a model known as Orion that uses synthetic data generated by Strawberry. Orion is a separate project and is likely to be OpenAI’s next flagship language model, according to The Information.

The cost of AI training

While OpenAI has not released the full details regarding training GPT-4, OpenAI CEO Sam Altman estimated the cost to train GPT-4 was “more than” $100 million. According to some estimates, the model has 1.76 trillion parameters.

Some pundits have speculated that future models could cost hundreds of millions or even billions of dollars to train, prompting questions from the likes of Goldman Sachs about the ROI from the industry.

In 2023, Altman commented, “I think we’re at the end of the era where it’s gonna be these giant models, and we’ll make them better in other ways.”

Connection to STaR (Self-Taught Reasoner)

Reports from Reuters and others have pointed to a possible connection between Q*/Strawberry and STaR (Self-Taught Reasoner). The similarities between reports surrounding Strawberry and the STaR research paper, published in 2022, are notable:

STaR starts with a small set of examples demonstrating step-by-step reasoning (called “rationales”). It then prompts a large language model (LLM) to generate rationales for a larger dataset of questions that lack them. This is analogous to providing the LLM with a few worked-out examples and then asking it to solve similar problems on its own, a process the paper calls “bootstrapping.”

Looping closer to the truth

The process uses a language model’s existing reasoning abilities and iteratively improves them through a self-learning loop. The process is as follows:

  1. Rationale Generation: STaR starts with a small set of examples demonstrating step-by-step reasoning (called “rationales”). It then prompts a large language model (LLM) to generate rationales for a larger dataset of questions that don’t have rationales.
  2. Filtering: It checks if the generated rationales lead to the correct answer. Only the rationales that result in correct answers are kept.
  3. Fine-tuning: The LLM is fine-tuned on this filtered dataset of questions and their corresponding, successfully generated rationales. This strengthens the model’s ability to generate good rationales.
  4. Iteration: The process (steps 1-3) is repeated. The improved LLM from the previous step is used to generate rationales for the same larger dataset again. This iterative process continues, with the model learning from its own generated reasoning and improving its performance over time.
  5. Rationalization (optional): To address the limitation of only learning from initially successful rationales, STaR introduces “rationalization”. For questions the model answered incorrectly, it provides the correct answer as a hint and asks the model to generate a rationale that justifies it. This helps the model learn from its mistakes and improve its reasoning on more challenging problems.
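The five steps above can be sketched in code. What follows is a toy, self-contained illustration, not OpenAI’s or the paper’s implementation: `ToyModel`, its `generate_rationale` and `fine_tune` methods, and the multiplication “dataset” are all invented stand-ins for a real LLM and fine-tuning pipeline.

```python
import random

class ToyModel:
    """Stub standing in for an LLM: answers a*b questions, with a 'skill'
    value modeling how often it produces a correct rationale."""
    def __init__(self, skill=0.3):
        self.skill = skill

    def generate_rationale(self, question, hint=None):
        a, b = question
        if hint is not None or random.random() < self.skill:
            # With a hint (rationalization) or a lucky draw, reason correctly
            rationale = f"{a} x {b}: add {a} to itself {b} times = {a * b}"
            return rationale, a * b
        return "guess", a * b + 1  # wrong answer, rationale will be filtered out

    def fine_tune(self, examples):
        # Stand-in for fine-tuning: more verified rationales raise skill
        return ToyModel(skill=min(1.0, self.skill + 0.1 * len(examples) / 10))

def star(model, dataset, iterations=3, rationalize=True):
    for _ in range(iterations):
        kept = []
        for question, answer in dataset:
            # 1. Rationale generation
            rationale, predicted = model.generate_rationale(question)
            # 2. Filtering: keep only rationales that reach the right answer
            if predicted == answer:
                kept.append((question, rationale))
            elif rationalize:
                # 5. Rationalization: provide the correct answer as a hint
                rationale, predicted = model.generate_rationale(question, hint=answer)
                if predicted == answer:
                    kept.append((question, rationale))
        # 3. Fine-tune on the filtered set; 4. iterate with the improved model
        model = model.fine_tune(kept)
    return model

dataset = [((a, b), a * b) for a in range(2, 7) for b in range(2, 7)]
trained = star(ToyModel(), dataset)
print(trained.skill)  # rises from 0.3 to 1.0 over three rounds
```

The point of the sketch is the control flow, not the arithmetic: generation, answer-based filtering, fine-tuning, and iteration, with rationalization salvaging training signal from initially failed questions.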
[Figure: STaR logic, from the paper]

“STaR lets a model improve itself by learning from its own generated reasoning,” the paper concluded. The authors also noted: “We believe using examples without reasoning to bootstrap reasoning is a very general approach, and that STaR can serve as the basis of more sophisticated techniques across many domains.”

Chain-of-thought rationale generation

Chain-of-thought reasoning involves breaking down a complex problem into a series of intermediate steps, each forming a logical chain to the next. Humans reason similarly, making the reasoning process more transparent and easier to understand than traditional deep learning alone, which can uncover hidden connections between variables but not in an explainable manner.
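A minimal sketch can make the contrast concrete: instead of emitting only a final answer, chain-of-thought style output exposes each intermediate step. The word problem and function below are invented purely for illustration.

```python
# Illustrative only: the answer is produced through explicit, inspectable
# intermediate steps, mirroring how a chain-of-thought rationale reads.

def solve_with_chain(start_apples, bought, eaten):
    steps = []
    after_buying = start_apples + bought
    steps.append(f"Start with {start_apples} apples, buy {bought} more: {after_buying}.")
    remaining = after_buying - eaten
    steps.append(f"Eat {eaten}: {remaining} remain.")
    return steps, remaining

chain, answer = solve_with_chain(5, 3, 2)
for step in chain:
    print(step)
print(f"Answer: {answer}")  # Answer: 6
```

Each step forms a logical link to the next, so an incorrect final answer can be traced to the specific step where the reasoning went wrong.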

Both STaR and Strawberry are reportedly successful at tackling mathematical problems. The STaR paper shared examples of how STaR generates step-by-step solutions for math problems, sometimes finding more efficient solutions than those in the ground truth data.

Comments

  1. Paul Bevilaqus says

    September 11, 2024 at 5:44 pm

    Seems incredible

  2. Jason Wang says

    September 11, 2024 at 6:38 pm

    Step 2, Filtering: It checks if the generated rationales lead to the correct answer. Only the rationales that result in correct answers are kept.
    Who sets the criteria to judge the answer being correct or not if you do not know the answer before hand?

    • Brian Buntz says

      September 11, 2024 at 7:08 pm

      My understanding would be that you would still need ground-truth answers (labeled data) to verify what is true or not. The STaR paper mentions pairs of problems (x) and their corresponding ground truth answers (y). I added a flowchart image from the paper (https://arxiv.org/pdf/2203.14465v2) that might be helpful.

      One reality of deep learning and reinforcement learning is that systems tend to do better with greater amounts of data, assuming it is relatively clean. One way to help boost the accuracy is to create synthetic data that can still be verified in terms of its accuracy. For many math problems, for instance, you could use math software or a programming language like Python to verify the calculations are correct, and add the verified answers to the training data.
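The verification idea in the reply above can be sketched briefly. The function and candidate pairs below are hypothetical, assuming synthetic question/answer pairs where the question is a simple arithmetic expression that a program can check directly.

```python
# Keep only synthetic (expression, claimed_answer) pairs whose claimed
# answer a programmatic check confirms; discard the rest.

def verify_synthetic(pairs):
    verified = []
    for expression, claimed in pairs:
        # eval is used only on arithmetic we generated ourselves;
        # never call it on untrusted input
        if eval(expression) == claimed:
            verified.append((expression, claimed))
    return verified

candidates = [("12 * 7", 84), ("9 + 15", 25), ("100 / 4", 25.0)]
print(verify_synthetic(candidates))  # [('12 * 7', 84), ('100 / 4', 25.0)]
```

The incorrect pair ("9 + 15", 25) is dropped, so only verified examples would be added to the training data.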


Copyright © 2025 WTWH Media LLC. All Rights Reserved. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media