
Benchmarks shared in the xAI demo
Two years ago, Elon Musk signed an open letter urging a six-month pause on training AI systems more powerful than OpenAI’s GPT-4, citing “profound risks to society and humanity.” Now, in a move some might call ironic, his AI venture, xAI, has released Grok 4, a model he claims could invent new technologies by year’s end and discover new physics within two years.
Musk acknowledged the swift pace of progress but said he was committed to witnessing the outcome, for better or worse. “I think it’ll be good, most likely it’ll be good,” he mused during Wednesday’s livestreamed unveiling. “But even if it wasn’t going to be good, I’d at least like to be alive to see it happen.”
The rollout follows backlash over Grok 3, which posted antisemitic content on X praising Adolf Hitler and calling itself “MechaHitler,” leading xAI to scrub the messages and issue a statement. The company stated it was “aware” of the “inappropriate posts” made by Grok and was “actively working to remove” them.
Musk says Grok 4 can ace many academic tests
But the focus of Wednesday’s presentation was squarely on Grok 4’s smarts. Musk claimed that the model is already superhuman in academia. “With respect to academic questions, Grok 4 is better than Ph.D. level in every subject, no exceptions,” he stated.
Grok 4 is smarter than almost all graduate students in all disciplines simultaneously. —Musk
Hype often eclipses reality when new AI models are unveiled, and Musk’s predictions warrant context: while his companies are undoubtedly influential, he has also become known for ambitious timelines that often slip, from full self-driving cars to Mars colonies. His Grok 4 claims also echo the broader AI industry’s tendency toward hyperbole, with company after company hailing its latest model as market-leading. Meanwhile, genAI systems continue to grapple with limitations. Current models, including those from OpenAI, Anthropic, and Google, still struggle with persistent memory across conversations and are prone to hallucinations, confidently stating incorrect information as fact. xAI hasn’t released data on Grok 4’s accuracy or hallucination rates, and such hurdles remain unsolved across all frontier models to date, casting doubt on claims of reliable “superhuman” academic performance from a bot.
Musk, however, noted that Grok 4’s capabilities extend beyond tests, predicting it will soon tackle real-world challenges. “I think it may discover new technologies as soon as later this year, and I would be shocked if it is not done so next year,” he said. “It might discover new physics next year, and within two years, I’d say almost certainly.”
Increased RL focus
xAI research scientist Tony Wu highlighted the model’s training advances, noting a shift from pre-training to a heavy emphasis on reasoning and reinforcement learning. “From Grok 3 to Grok 4, we’re putting a lot of compute into reasoning and RL,” Wu said. He added that with tools and multi-agent systems in Grok 4 Heavy, the model solved over 50% of text-based problems on the tough Humanity’s Last Exam benchmark, a notable leap from single-digit accuracy for earlier models.
Musk attributed the leap to massive compute scaling, stating xAI increased training by an order of magnitude from Grok 2 to Grok 3, and then again to Grok 4. “It’s 100 times more training than Grok 2, and that’s only going to increase,” Musk said. “In some ways, it’s a little terrifying, but the growth of intelligence here is remarkable.”
xAI co-founder Jimmy Ba echoed the scale-up, crediting the company’s Colossus supercomputer, expanded to 200,000 GPUs, for enabling 10 times more compute in reinforcement learning than any rival model. “This is literally the fastest-moving field,” Ba noted.
Demos range from black hole models to video games
Demos showcased practical applications. One showed Grok 4 excelling at Andon Labs’ Vending-Bench, an AI business simulation in which the model managed inventory and contracts to reach double the net worth of rival models. Musk reacted with characteristic humor: “It’s great to see that we’ve now got a way to pay for all those GPUs,” he joked. “We just need a million vending machines and make $4.7 billion a year. Let’s go!”
Voice mode also received significant upgrades. After a demonstration of snappier, more natural conversation than competitors offer, Jimmy Ba explained the team’s philosophy: “We were shooting for something more calm, smooth, more natural, versus something that’s more poppy or artificial.”
The roadmap targets key R&D pain points. A specialized coding model is expected “in a few weeks.” The forthcoming Version 7 foundation model will boost multimodal understanding, leading to powerful video generation. Musk set ambitious creative timelines: “I would expect the first really good AI video game to be next year,” he predicted, “and probably the first watchable AI movie next year.”
Access requires a SuperGrok Heavy subscription, while the API is live for developers. But as models outpace human-designed tests, Musk argued that a new benchmark is needed. “The one thing that is an excellent judge of things is reality,” he concluded. “Because physics is a law, ultimately everything else is a recommendation… The ultimate test for an AI is reality.”



