The AI landscape has long been dominated by a “bigger is better” mantra. The basic thinking is that bigger and better GPUs, more sophisticated algorithms, and more data translate into better performance. And while that is generally true, those assumptions can be simplistic. “DeepSeek is really, really good,” wrote one anonymous employee at Google on the site Blind. “I can’t believe that it is open source and free.”
But while DeepSeek is perhaps the most famous entrant to demonstrate that novel approaches, such as a reinforcement learning-first pipeline, can be tapped to train a model on a budget markedly less than those from the likes of OpenAI, Google or Anthropic, it is not the only one. ARC’s Reactor Mk.1 was trained on 8 NVIDIA L4 and 4 NVIDIA A100 GPUs for under $1 million, and it reportedly outperforms offerings from OpenAI, Anthropic, Meta, and Google on key benchmarks.

[From https://reactor.arc.ai/chat]
In quick unscientific testing by the author, it passed the infamous “strawberry” question (correctly surmising that there are three r’s in the word) and concluded that 8.11 was smaller than 8.9, a comparison earlier LLMs sometimes struggled with. When asked to draft a Python function to optimize energy use in a data center where server workloads fluctuate hourly, with the inclusion of edge cases for power grid failures and renewable energy variability, it provided five basic steps with an explanation for each and outlined a Python function with placeholders. While the model effectively accounted for edge cases like grid failures and renewable variability in its scripted outline, it still stumbled on a minor syntax detail by omitting a closing brace. By contrast, Claude 3.5 Sonnet smoothly handled the same task without that syntax slip. DeepSeek R1 was not available for testing at the time of writing.
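For readers curious what the prompt is actually asking for, here is a minimal sketch of the kind of function involved. This is the author's own illustration, not the model's output; the function name, the capacity figures, and the 0.5 kWh-per-server assumption are all hypothetical.

```python
import math

def optimize_energy_use(hourly_workloads, renewable_forecast,
                        grid_ok=True, server_capacity=100.0,
                        kwh_per_server=0.5):
    """Return a per-hour plan of how many servers to keep active.

    hourly_workloads:   expected workload units for each hour
    renewable_forecast: expected renewable kWh available each hour
    grid_ok:            False simulates a power grid failure, so the plan
                        falls back to what renewables alone can power
    server_capacity:    workload units one active server absorbs per hour
    kwh_per_server:     assumed energy draw of one server for one hour
    """
    if len(hourly_workloads) != len(renewable_forecast):
        raise ValueError("forecasts must cover the same number of hours")

    plan = []
    for load, renewable in zip(hourly_workloads, renewable_forecast):
        # Keep at least one server up; otherwise scale with demand.
        servers = max(1, math.ceil(load / server_capacity))
        if not grid_ok:
            # Grid failure edge case: cap at what renewables can power.
            servers = min(servers, max(int(renewable / kwh_per_server), 0))
        plan.append(servers)
    return plan

# Example: three hours of fluctuating load with a healthy grid
print(optimize_energy_use([250, 40, 510], [5, 5, 5]))  # → [3, 1, 6]
```

A fuller answer would also handle forecast uncertainty and migration costs between hours, which is roughly what the five-step outlines from the tested models gestured at.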
For the sake of comparison, the MMLU percentage for DeepSeek V3 (the model released before R1) was 88.5%. The breakdown of various players’ scores is as follows:
BENCHMARK PERFORMANCE SCORES OF REACTOR MK.1 AND OTHER MODELS ON MMLU, HUMANEVAL, AND BBH (reproduced from the arXiv preprint)

| Model | MMLU | HumanEval | BBH |
|---|---|---|---|
| ARC Reactor Mk.1 | 92.9% | 91% | 88% |
| OpenAI GPT-4o | 88.7% | 90.2% | 83.1% |
| Anthropic Claude | 86.8% | 84.9% | — |
| Meta Llama 3 | 86.1% | 84.1% | — |
| Google Gemini | 81.9% | 71.9% | 83.6% |
| OpenAI GPT-3.5 | 70% | 48.1% | 66.6% |
| Mistral 8×22B | 77.75% | — | — |
Reactor Mk.1’s competitive edge appears to stem from its efficient architecture built on the Lychee AI engine, which is described in the arXiv paper. The paper reports that the performance was “accomplished with a handful of GPUs,” suggesting a potential breakthrough in training efficiency. It notes: “The benchmark scores indicate that the ARC Reactor Mk. 1 not only outperforms in understanding and generating code but also demonstrates huge performance in reasoning and handling challenging language tasks. These results position the ARC Reactor Mk. 1 as a leading model in the current state of the art of AI technology.”