
An overview of STORM’s pre-writing approach — from the original paper: https://arxiv.org/pdf/2402.14207
Crafting comprehensive articles from scratch can be a tedious process. But what if AI could lend a hand — at least in writing summary articles of well-defined research topics?
Enter STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking), an open-source AI system that promises to generate Wikipedia-style articles on pretty much any topic using large language models and web search. Developed by researchers at Stanford University, STORM is featured on Papers with Code, a platform that tracks the latest developments in machine learning research. You can find the official GitHub repository at https://github.com/stanford-oval/storm.
Intrigued by STORM, I wondered whether it could make good on its promise to create drafts of comprehensive, factual articles with proper citations in just minutes.
Spoiler alert: It did.
A draft article in minutes for $0.005
To be fair, it took about ten minutes to get the project up and running, which I did in Google Colab. After getting it running, I had it research a neural network phenomenon known as double descent. Within minutes, I had an article written on the subject. No, I don’t plan on posting it anywhere else (the article is below if you are curious), but it is well sourced and a pretty good first draft. Total cost for this experiment? A fraction of a cent: $0.00512828, to be precise.

STORM screen capture
If you’re interested in trying out STORM, you’ll need to set up a couple of API keys: I used one from OpenAI and another from Tavily, but several options are available. While the project is optimized for OpenAI’s GPT models, you can also use other language models such as Claude, Gemini, Mistral or local models served through Ollama. For retrieval, Tavily is the default, and it offers a free tier that supports 1,000 queries a month. Other retrieval options include DuckDuckGo, Google and Bing.
Getting it set up
You don’t have to be able to code to get this up and running, but familiarity with Python and Jupyter notebooks will make the process go more smoothly. There are a couple of repositories available on GitHub for this; in my case, I worked in Google Colab and used the following snippet to clone the one I chose and install its dependencies. After pasting it into a cell, hit “play” or press Shift + Enter to run it:
!git clone https://github.com/assafelovic/gpt-researcher.git
%cd gpt-researcher
!pip install -r requirements.txt
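If you would rather start from the official Stanford repository mentioned above, a roughly equivalent setup cell would look like the sketch below. I did not test this path for this article, so treat the repository URL and the knowledge-storm package name as things to verify against that project’s README, which also documents a somewhat different configuration and API.
# Alternative setup (untested here): the official Stanford STORM repository.
# Check https://github.com/stanford-oval/storm for current instructions.
!git clone https://github.com/stanford-oval/storm.git
%cd storm
!pip install -r requirements.txt   # or: !pip install knowledge-storm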
After fetching your API keys, you can set them in a separate cell like this. There are more elegant and more secure ways of handling keys, but this will work for now:
import os
os.environ['OPENAI_API_KEY'] = 'your_openai_api_key_here'
os.environ['TAVILY_API_KEY'] = 'your_tavily_api_key_here'
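If you would rather not paste secrets directly into a notebook cell, prompting for them at runtime is a small improvement. The sketch below does that with Python’s built-in getpass module; it also shows, commented out, how the gpt-researcher code I used can be pointed at a different retriever via an environment variable (RETRIEVER is the name used in the versions I have seen, but confirm it against the repository’s documentation).
import os
from getpass import getpass

# Prompt for the keys at runtime instead of hardcoding them in the notebook.
os.environ['OPENAI_API_KEY'] = getpass('OpenAI API key: ')
os.environ['TAVILY_API_KEY'] = getpass('Tavily API key: ')

# Optional: switch the retrieval backend away from the Tavily default.
# Values such as 'duckduckgo', 'google' or 'bing' are supported in the
# versions I have seen; check the repo's README for the exact names.
# os.environ['RETRIEVER'] = 'duckduckgo'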
Now for the fun step — using STORM. Tweak this query based on whatever you are interested in:
from gpt_researcher import GPTResearcher
query = "Explain the phenomenon of double descent in machine learning"
researcher = GPTResearcher(query=query, report_type="research_report")
# Conduct research
research_result = await researcher.conduct_research()
# Write the report
report = await researcher.write_report()
print(report)
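Two small follow-ups, offered as a sketch rather than gospel: saving the draft to a Markdown file for later editing, and asking the researcher object what the run cost. The get_costs() helper is present in the versions of gpt-researcher I have looked at, but method names can shift between releases, so verify against the repository you cloned.
# Save the generated draft so it survives the Colab session.
with open('double_descent_draft.md', 'w', encoding='utf-8') as f:
    f.write(report)

# Print the approximate API spend for the run (method name may vary by version).
print(f"Total research cost: ${researcher.get_costs()}")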
Inside STORM’s writing process
Once you run the preceding code snippet, STORM’s speed and its ability to mimic a human researcher’s process become evident. It doesn’t just blindly scrape the web: it begins by formulating targeted research queries based on the main topic, with the goal of ensuring that the information it gathers is both relevant and comprehensive.

STORM Screenshot
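To make that query-formulation idea concrete, here is a hypothetical sketch of what perspective-guided question generation can look like. This is not STORM’s or gpt-researcher’s actual code; the persona list, prompt wording and model choice are my own illustrative assumptions, written against the current OpenAI Python SDK.
# Hypothetical illustration of perspective-guided question generation:
# ask an LLM for research questions from a few different viewpoints,
# which then become targeted search queries.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is already set, as above

topic = "double descent in machine learning"
personas = ["statistical learning theorist", "deep learning practitioner",
            "science journalist"]

search_queries = []
for persona in personas:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (f"You are a {persona} researching '{topic}'. "
                        "List three specific questions you would type into a "
                        "web search, one per line, with no numbering."),
        }],
    )
    search_queries.extend(response.choices[0].message.content.splitlines())

# Each generated question can now be passed to a retrieval API such as Tavily.
print(search_queries)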
In this step, I observed it pulling from sites like Papers with Code (where I first learned of STORM), academic repositories like arXiv.org, educational platforms, and even Wikipedia, prioritizing quality resources.
Once the sources are assembled, STORM extracts pertinent information, identifying key concepts, supporting evidence, and relevant quotes.
In the penultimate step, STORM synthesizes the information into a structured format. The output I saw read much like a Wikipedia article, with a short title, a brief introduction and sections with subheadings, wrapping up with a conclusion and a list of references.
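Because the drafts I generated came back as Markdown-formatted text, it’s easy to preview that structure directly in the notebook. The snippet below assumes the report variable from the earlier cell.
# Render the Markdown draft inline to inspect its title, sections and references.
from IPython.display import Markdown, display

display(Markdown(report))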
At the very end, it reported the total research costs.
Reviewing the STORM content
In my limited testing, the content STORM generated was better than what a lot of large language models churn out through simple prompting alone. Still, I spotted a small irregularity in one of the references (the phrase “(1/4) – Generalization error” appearing in the middle of an entry), and in subsequent tests on narrower topics I noticed some hallucinated sources. The STORM article that follows also carries traces of hyperbolic AI writing, e.g., “Double descent is a fascinating and counterintuitive phenomenon […].”
The implications of such fast and inexpensive writing could be significant. Perhaps AI tools like this will play a growing role on sites like Wikipedia, and elsewhere, in the coming years. The arXiv paper introducing STORM found that its articles were more organized and broader in coverage than those produced by outline-driven, retrieval-augmented baselines.
The company Insilico Medicine recently debuted a product known as Science42: DORA (Draft Outline Research Assistant) for quickly drafting academic papers and scientific documents. Similar offerings are likely to emerge, although hallucinations by large language models remain a concern for many would-be users of such systems.
Here’s the unedited article STORM produced.
The Phenomenon of Double Descent in Machine Learning: Implications for Model Performance and Generalization
Introduction
Double descent is a counterintuitive phenomenon in machine learning that challenges traditional notions of model complexity and generalization. This phenomenon has garnered significant attention in recent years due to its implications for the performance of modern machine learning models, particularly deep neural networks. This report aims to provide a comprehensive explanation of double descent, its underlying mechanisms, and its implications for model performance and generalization.
Understanding Double Descent
Classical Learning Theory and the Bias-Variance Tradeoff
In classical learning theory, the bias-variance tradeoff is a fundamental concept that describes the relationship between model complexity and generalization error. According to this theory, as model complexity increases, bias decreases while variance increases. This results in a U-shaped curve for test error, where the optimal model complexity is achieved by balancing bias and variance (Wilber & Werness, 2021).
The Double Descent Phenomenon
Double descent deviates from this classical understanding by introducing an additional phase where test error decreases again after initially increasing. This results in a non-monotonic behavior of test error as a function of model complexity. Specifically, double descent occurs in three phases:
- Underparameterized Regime: In this phase, the model has fewer parameters than the number of data points. Increasing model complexity initially reduces test error due to decreased bias.
- Interpolation Threshold: At this point, the model has approximately the same number of parameters as data points. Test error peaks due to high variance and overfitting.
- Overparameterized Regime: Beyond the interpolation threshold, further increasing model complexity surprisingly leads to a decrease in test error. This phase is characterized by the model’s ability to fit the training data perfectly while still generalizing well to unseen data (Belkin et al., 2019).
Conclusion
Double descent is a fascinating and counterintuitive phenomenon that has reshaped our understanding of model complexity and generalization in machine learning. By challenging the classical bias-variance tradeoff, double descent has provided new insights into why large, overparameterized models can achieve superior performance. This has significant implications for both theoretical research and practical applications, highlighting the importance of considering the entire spectrum of model complexities and leveraging inductive biases to achieve smooth interpolation and improved generalization.
References
- Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854. https://arxiv.org/abs/2303.14151
- Lafon, M. (2021). Deep double descent explained (1/4) – Generalization error. Retrieved from https://marclafon.github.io/blog/2021/double-descent-1/
- Nakkiran, P., Venkat, P., Kakade, S. M., & Ma, T. (2021). Optimal regularization can mitigate double descent. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1912.02292
- Wilber, J., & Werness, B. (2021). Double Descent. MLU-explAIn. Retrieved from https://mlu-explain.github.io/double-descent/
- Wikipedia. (2023). Double descent. Retrieved from https://en.wikipedia.org/wiki/Double_descent