The AI landscape has long been dominated by a “bigger is better” mantra. The basic thinking is that bigger and better GPUs, more sophisticated algorithms, and more data translate into better performance. And while that is generally true, those assumptions can be simplistic. “DeepSeek is really, really good,” wrote one anonymous employee at Google on the site Blind. “I can’t believe that it is open source and free.”
But while DeepSeek is perhaps the most famous entrant to demonstrate that novel approaches, such as a reinforcement learning-first pipeline, can be tapped to train a model on a budget markedly less than those from the likes of OpenAI, Google or Anthropic, it is not the only one. ARC’s Reactor Mk.1 was trained on 8 NVIDIA L4 and 4 NVIDIA A100 GPUs for under $1 million, and it reportedly outperforms offerings from OpenAI, Anthropic, Meta, and Google on key benchmarks.

[From https://reactor.arc.ai/chat]
In quick unscientific testing by the author, it passed the infamous “strawberry” question (correctly surmising that there are three r’s in the word) and concluded that 8.11 was smaller than 8.9, a comparison earlier LLMs sometimes struggled with. When asked to draft a Python function to optimize energy use in a data center where server workloads fluctuate hourly, with the inclusion of edge cases for power grid failures and renewable energy variability, it provided five basic steps with an explanation for each and outlined a Python function with placeholders. While the model effectively accounted for edge cases like grid failures and renewable variability in its scripted outline, it still stumbled on a minor syntax detail by omitting a closing brace. By contrast, Claude 3.5 Sonnet smoothly handled the same task without that syntax slip. DeepSeek R1 was not available for testing at the time of writing.
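For readers curious what the prompt is actually asking for, here is a minimal sketch of the kind of function involved. This is the author's own illustration, not the model's output; the function name, the capacity figures, and the 0.5 kWh-per-server assumption are all hypothetical.

```python
import math

def optimize_energy_use(hourly_workloads, renewable_forecast,
                        grid_ok=True, server_capacity=100.0,
                        kwh_per_server=0.5):
    """Return a per-hour plan of how many servers to keep active.

    hourly_workloads:   expected workload units for each hour
    renewable_forecast: expected renewable kWh available each hour
    grid_ok:            False simulates a power grid failure, so the plan
                        falls back to what renewables alone can power
    server_capacity:    workload units one active server absorbs per hour
    kwh_per_server:     assumed energy draw of one server for one hour
    """
    if len(hourly_workloads) != len(renewable_forecast):
        raise ValueError("forecasts must cover the same number of hours")

    plan = []
    for load, renewable in zip(hourly_workloads, renewable_forecast):
        # Keep at least one server up; otherwise scale with demand.
        servers = max(1, math.ceil(load / server_capacity))
        if not grid_ok:
            # Grid failure edge case: cap at what renewables can power.
            servers = min(servers, max(int(renewable / kwh_per_server), 0))
        plan.append(servers)
    return plan

# Example: three hours of fluctuating load with a healthy grid
print(optimize_energy_use([250, 40, 510], [5, 5, 5]))  # → [3, 1, 6]
```

A fuller answer would also handle forecast uncertainty and migration costs between hours, which is roughly what the five-step outlines from the tested models gestured at.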
For the sake of comparison, the MMLU percentage for DeepSeek V3 (the model released before R1) was 88.5%. The breakdown of various players’ scores is as follows:
BENCHMARK PERFORMANCE SCORES OF REACTOR MK.1 AND OTHER MODELS ON MMLU, HUMANEVAL, AND BBH (reproduced from the arXiv preprint)

| Model | MMLU | HumanEval | BBH |
|---|---|---|---|
| ARC Reactor Mk.1 | 92.9% | 91% | 88% |
| OpenAI GPT-4o | 88.7% | 90.2% | 83.1% |
| Anthropic Claude | 86.8% | 84.9% | — |
| Meta Llama 3 | 86.1% | 84.1% | — |
| Google Gemini | 81.9% | 71.9% | 83.6% |
| OpenAI GPT-3.5 | 70% | 48.1% | 66.6% |
| Mistral 8×22B | 77.75% | — | — |
Reactor Mk.1’s competitive edge appears to stem from its efficient architecture built on the Lychee AI engine, which is described in the arXiv paper. The paper reports that the performance was “accomplished with a handful of GPUs,” suggesting a potential breakthrough in training efficiency. It notes: “The benchmark scores indicate that the ARC Reactor Mk. 1 not only outperforms in understanding and generating code but also demonstrates huge performance in reasoning and handling challenging language tasks. These results position the ARC Reactor Mk. 1 as a leading model in the current state of the art of AI technology.”