The release, which could come within the next two weeks, marks a significant step in OpenAI’s quest to develop more advanced AI systems. The initial release may be limited to a small group of testers. Strawberry, previously known as Q* (Q-Star), is designed to bolster reasoning capabilities.
Strawberry’s “System 2” thinking
While OpenAI has yet to confirm details about the release, Strawberry reportedly employs a “System 2” style of thinking, a concept the psychologist Daniel Kahneman popularized in his book Thinking, Fast and Slow. System 2 is a slow, deliberate, and analytical mode of thinking used for conscious reasoning; System 1, by contrast, is fast, intuitive, and emotional.
Strawberry reportedly takes significantly longer to “think” than OpenAI’s current model, GPT-4o. The Information reports that it will spend 10-20 seconds processing its input and potential responses before sharing a final answer, reducing errors.
In addition, Strawberry (formerly Q*) will likely:
- Focus on advanced reasoning and problem-solving capabilities.
- Demonstrate proficiency in solving mathematical problems.
- Be integrated into ChatGPT, potentially via a future model known as Orion or GPT-5.
The Information had previously reported that OpenAI was also developing a model known as Orion that uses synthetic data from a Strawberry model. Orion is a separate project, likely to be OpenAI’s next flagship language model, according to The Information.
The cost of AI training
While OpenAI has not released full details about GPT-4’s training, CEO Sam Altman estimated the cost to train the model was “more than” $100 million. According to some estimates, GPT-4 has 1.76 trillion parameters.
Some pundits have speculated that future models could cost hundreds of millions or even billions of dollars to train, prompting questions from the likes of Goldman Sachs about the industry’s return on investment.
In 2023, Altman commented, “I think we’re at the end of the era where it’s gonna be these giant models, and we’ll make them better in other ways.”
Connection to STaR (Self-Taught Reasoner)
Reports from Reuters and others have pointed to a possible connection between Q*/Strawberry and STaR (Self-Taught Reasoner). The similarities between the reporting on Strawberry and the STaR research paper published in 2022 are notable:
STaR starts with a small set of examples demonstrating step-by-step reasoning (called “rationales”). It then prompts a large language model (LLM) to generate rationales for a larger dataset of questions that don’t have them. This is analogous to providing the LLM with a few worked-out examples and then asking it to solve similar problems on its own, a process known as “bootstrapping.”
Looping closer to the truth
The process uses a language model’s existing reasoning abilities and iteratively improves them through a self-learning loop; a minimal code sketch follows the list. The steps are as follows:
- Rationale Generation: STaR starts with a small set of examples demonstrating step-by-step reasoning (called “rationales”). It then prompts a large language model (LLM) to generate rationales for a larger dataset of questions that don’t have rationales.
- Filtering: It checks if the generated rationales lead to the correct answer. Only the rationales that result in correct answers are kept.
- Fine-tuning: The LLM is fine-tuned on this filtered dataset of questions and their corresponding, successfully generated rationales. This strengthens the model’s ability to generate good rationales.
- Iteration: The first three steps are repeated. The improved LLM from the previous step is used to generate rationales for the same larger dataset again. This iterative process continues, with the model learning from its own generated reasoning and improving its performance over time.
- Rationalization (optional): To address the limitation of only learning from initially successful rationales, STaR introduces “rationalization”. For questions the model answered incorrectly, it provides the correct answer as a hint and asks the model to generate a rationale that justifies it. This helps the model learn from its mistakes and improve its reasoning on more challenging problems.
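The control flow of that loop can be summarized in a short sketch. The helpers below (generate_rationale, final_answer, fine_tune) are hypothetical stand-ins for calls to an actual LLM and fine-tuning pipeline, not part of any real API; only the generate, filter, fine-tune, iterate structure mirrors the description above.

```python
# A minimal sketch of the STaR loop described above. The helper functions are
# placeholders; only the control flow reflects the published recipe.

def generate_rationale(model, question, hint=None):
    """Prompt the model for step-by-step reasoning; `hint` supplies the
    correct answer for the optional rationalization step."""
    ...  # call an LLM here


def final_answer(rationale):
    """Extract the final answer from a generated rationale."""
    ...


def fine_tune(model, examples):
    """Fine-tune the model on (question, rationale) pairs and return it."""
    ...


def star(model, dataset, iterations=3, rationalize=True):
    """dataset: list of (question, ground_truth_answer) pairs."""
    for _ in range(iterations):
        kept = []
        for question, truth in dataset:
            rationale = generate_rationale(model, question)
            if final_answer(rationale) == truth:           # filtering
                kept.append((question, rationale))
            elif rationalize:                              # rationalization
                hinted = generate_rationale(model, question, hint=truth)
                if final_answer(hinted) == truth:
                    kept.append((question, hinted))
        model = fine_tune(model, kept)                     # fine-tuning
    return model
```

The small seed set of worked rationales is what makes the first round of generation possible: it is used to few-shot prompt the model before any fine-tuning has happened.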
Chain-of-thought rationale generation
Chain-of-thought reasoning involves breaking down a complex problem into a series of intermediate steps, each forming a logical link to the next. Because this mirrors how humans reason, the process is more transparent and easier to understand than traditional deep learning alone, which can uncover hidden connections between variables but not in an explainable way.
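As a toy illustration (the questions here are invented), a chain-of-thought prompt simply shows the model a worked example whose answer is reached through visible intermediate steps, then asks it to continue in the same style:

```python
# A toy chain-of-thought prompt: the worked example exposes its intermediate
# steps, nudging the model to reason step by step on the new question.
cot_prompt = """\
Q: A shop sells pens at $3 each. How much do 4 pens cost?
A: Each pen costs $3, so 4 pens cost 4 x 3 = $12. The answer is 12.

Q: A train travels 60 km per hour for 2.5 hours. How far does it go?
A:"""
# A model completing this prompt would be expected to write out the
# intermediate step (60 x 2.5 = 150) before giving the final answer, 150 km.
```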
Both STaR and Strawberry are reportedly successful at tackling mathematical problems. The STaR paper shared examples of how STaR generates step-by-step solutions for math problems, sometimes finding more efficient solutions than those in the ground truth data.
Paul Bevilaqus says
Seems incredible
Jason Wang says
Step 2, Filtering: It checks if the generated rationales lead to the correct answer. Only the rationales that result in correct answers are kept.
Who sets the criteria for judging whether the answer is correct if you do not know the answer beforehand?
Brian Buntz says
My understanding would be that you would still need ground-truth answers (labeled data) to verify what is true or not. The STaR paper mentions pairs of problems (x) and their corresponding ground-truth answers (y). I added a flowchart image from the paper (https://arxiv.org/pdf/2203.14465v2) that might be helpful.
One reality of deep learning and reinforcement learning is that systems tend to do better with greater amounts of data, assuming it is relatively clean. One way to help boost accuracy is to create synthetic data whose correctness can still be verified. For many math problems, for instance, you could use math software or a programming language like Python to verify the calculations are correct and add the verified answers to the training data.
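As a rough sketch of that verification idea (the questions and answers below are made up for illustration), you could keep only the model-generated answers that a trusted calculation confirms:

```python
# Keep only synthetic math answers that a trusted calculation confirms.
candidates = [
    {"question": "17 * 24", "model_answer": 408},
    {"question": "91 / 7", "model_answer": 14},  # deliberately wrong
]

verified = []
for item in candidates:
    expected = eval(item["question"])  # acceptable here: trusted arithmetic strings only
    if expected == item["model_answer"]:
        verified.append(item)

print(verified)  # only the correct pair would be added to the training data
```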