
Google DeepMind just unveiled Genie 3, offering evidence that AI-generated “world models,” which, unlike large language models, learn to simulate the dynamics of 3D environments, have graduated from proof-of-concept demos to something approaching an interactive platform.
The system can spin up a playable 3D environment from nothing more than a text prompt and keep it running in real time, at 24 fps and 720p, for several minutes while maintaining spatial consistency. Public demos ranged from volcanic rover expeditions to Victorian streets with portal jump-cuts, all rendered on the fly.
Why does that matter? World models compress the physics and semantics of a scene into latent representations that an AI (or a human tester) can rewind, branch and perturb at will. That makes them attractive as a data amplifier for robotics, autonomous-driving edge cases, safety testing, scientific “digital twins” and real-time video-game worlds. Until now, though, the field has wrestled with short clips and brittle memory.

Genie 3 pushes the interaction horizon well beyond Genie 2’s roughly one-minute sessions and introduces “promptable world events,” which let a user change the weather or drop new objects mid-session via text. DeepMind attributes the jump to an autoregressive pipeline that rereads the entire action trajectory at every frame, so the model stays consistent when a user backtracks through a scene.
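To make the mechanism concrete, here is a toy sketch of that idea, with entirely hypothetical names and interfaces (DeepMind has not published Genie 3’s architecture): an autoregressive loop in which every new frame is a function of the full action history, so that revisiting a location after backtracking reproduces a consistent view.

```python
from dataclasses import dataclass, field


@dataclass
class WorldModelSketch:
    """Toy autoregressive world model. Each new frame is conditioned on
    the ENTIRE action history, not just the previous frame. All names
    here are illustrative, not Genie 3's real API."""
    actions: list = field(default_factory=list)
    frames: list = field(default_factory=list)

    def step(self, action: str) -> str:
        self.actions.append(action)
        # Re-read the whole trajectory every frame: the property DeepMind
        # credits for consistency when a user backtracks through a scene.
        frame = self.render(tuple(self.actions))
        self.frames.append(frame)
        return frame

    def render(self, trajectory: tuple) -> str:
        # Stand-in for the generative model: a deterministic function of
        # the full history, so identical histories yield identical frames.
        return f"frame@{hash(trajectory) & 0xFFFF:04x}"


wm = WorldModelSketch()
wm.step("forward")
wm.step("turn_left")
```

The point of the sketch is the conditioning choice: because `render` sees the whole trajectory rather than only the last frame, two sessions that take the same actions land on the same frames, which is a minimal stand-in for spatial consistency.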
The advance is real, but so are the caveats. Google openly lists a constrained action set, rudimentary multi-agent physics and a session length still measured in “a few minutes.” Accordingly, Genie 3 ships only as a limited research preview for select academics and creators while safety questions are studied.
A suddenly crowded field of video and world models
The strategic value of credible simulators is not lost on rivals:
NVIDIA Cosmos World Foundation Models debuted at CES 2025. Split into Nano, Super and Ultra tiers, Cosmos emphasizes physics-aware video generation and synthetic-sensor data, ships under a permissive commercial license and offers models ranging from 4B to 14B parameters, as TechCrunch has noted.
Meta’s V-JEPA 2 takes a different tack: it pre-trains on over 1 million hours of internet video and fine-tunes on less than 62 hours of robot trajectories to achieve state-of-the-art action anticipation and zero-shot robotic planning, as an arXiv preprint notes. The message is that vast passive video plus a sprinkling of interaction can yield a viable world model.
Startup Decart is commercializing its world model directly as a game called Oasis, trained on Minecraft footage and already claiming “millions of users” (reaching its first million users in just over three days). The company just raised a $32 million Series A and claims its transformer-diffusion hybrid gives it a cost edge on consumer hardware, as SiliconANGLE noted.
Beyond Genie 3, the video-generation field is crowded. OpenAI’s Sora currently tops out, for consumers, at 1080p and roughly 20-second clips. Google DeepMind’s Veo 3 offers 4K output, stronger physics and native audio, available through Gemini. U.S. creator tools include Runway Gen-3 Alpha and Luma Dream Machine, the latter commonly producing roughly 10-second clips at up to 1080p. Fast-moving Chinese systems include Kuaishou’s Kling (v2.1 adds 720p/1080p modes and multi-image reference for identity consistency), MiniMax’s Hailuo-02 (native 1080p) and Alibaba’s Wan 2.1 (announced for open-source release and a recent VBench leader). All of these emphasize cinematic fidelity and prompt adherence, but they remain non-interactive generators, unlike Genie 3’s agent-playable worlds.
DeepMind optimizes for low-latency interaction, NVIDIA for high-fidelity physics and sensor realism, Meta for scalable action understanding, and startups for user-generated content loops. Expect intense debate this year over evaluation: synthetic scenes must be judged not just by FVD or SSIM but by how faithfully an agent trained inside them transfers to the real world.
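To illustrate what a pixel-level metric does and does not capture, here is a minimal sketch of SSIM computed over a whole image (the standard metric uses a sliding Gaussian window; this single-window variant keeps the example short, and the function name is our own):

```python
import numpy as np


def global_ssim(x: np.ndarray, y: np.ndarray) -> float:
    """Single-window SSIM over whole images, with inputs scaled to [0, 1].
    Illustrative only: production SSIM is windowed and channel-aware."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # standard stabilizing constants for L = 1
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


rng = np.random.default_rng(0)
img = rng.random((64, 64))
assert abs(global_ssim(img, img) - 1.0) < 1e-9  # identical frames score 1.0
```

A scene can score near-perfect SSIM against reference footage yet still teach an agent physics that fails in the real world, which is why the transfer-based evaluations mentioned above matter.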
What could come next
DeepMind hints at longer sessions, richer action vocabularies and multi-agent social physics, but it hasn’t set public timelines. NVIDIA has developed an upsampling model and guardrails to keep Cosmos outputs safe for enterprise use. Meta researchers are already layering V-JEPA 2 with language models for instruction following; their next milestone is hour-long procedural tasks. And Decart aims to port Oasis to custom inference chips to cut latency for consumer play.
Synthetic environments are maturing fast enough to become a standard part of R&D pipelines, but only if their physics, memory and safety constraints are measured as rigorously as their photorealism.