World models are a key concept in AI: they simulate how agents behave in virtual environments and enable immersive, interactive experiences. They’re not only transforming game and media generation but also opening new frontiers for applying AI in complex, dynamic settings.
One emerging trend is generative games, where game environments are created frame by frame using neural networks. Microsoft’s MUSE system, for example, can generate scenes from the game Bleeding Edge using deep learning models.

Yet beneath the visual polish, generative games often contain noticeable inconsistencies. Background elements may disappear or shift abruptly after minor player actions, like a form of short-term memory loss. These disruptions highlight one of the field’s biggest challenges: maintaining consistency.
In response, researchers from Microsoft Research Asia, the Hong Kong University of Science and Technology, and the University of Chinese Academy of Sciences have introduced a new framework called Model as a Game (MaaG). This approach addresses two core inconsistencies in generative games: numerical and spatial.
Defining the problem: Numerical and spatial consistency
Numerical consistency refers to the logical accuracy of score updates based on game events. For example, if an action yields a +1 score, the result should reflect that exact change. Spatial consistency, by contrast, means the environment remains visually coherent when players revisit previously explored areas.
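As an informal illustration, numerical consistency can be framed as a simple check over a generated trajectory. The function and reward convention below are our own sketch, not part of the paper:

```python
# Hypothetical consistency check: `scores` are the values rendered in each
# frame; `rewards` are the event rewards the game logic dictates per step.

def is_numerically_consistent(scores, rewards):
    """Return True if every rendered score update matches its event reward."""
    if len(scores) != len(rewards) + 1:
        raise ValueError("need exactly one reward per transition")
    return all(after - before == reward
               for before, after, reward in zip(scores, scores[1:], rewards))

print(is_numerically_consistent([0, 1, 2, 2], [1, 1, 0]))  # True
print(is_numerically_consistent([0, 2], [1]))  # False: a +2 jump for a +1 event
```

A generative model that renders a +2 score jump after a +1 event would fail this check, which is exactly the kind of error the numerical module is designed to prevent.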
To examine these issues in a controlled setting, the team created a minimalist 2D game called Traveler. In it, a small black block moves left and right. As it passes through empty spaces, colorful buildings are randomly generated, and the score increases by one.
Despite its simplicity, Traveler clearly reveals the limitations of current generative models. Notably, the game was generated using large language models (LLMs) and built with Pygame, a set of Python modules for writing video games. It also supports frame-by-frame data export with synchronized numerical states, offering a strong foundation for research.
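The core loop of Traveler can be sketched in a few lines of plain Python. This is an engine-free approximation: the original is built with Pygame, and the class and attribute names here are illustrative, not taken from the paper:

```python
import random

# Minimal, engine-free sketch of Traveler's core state: a block moves along
# a 1D strip; entering an empty cell generates a building and scores +1.
class Traveler:
    COLORS = ["red", "green", "blue", "yellow"]

    def __init__(self, width=11, seed=0):
        self.rng = random.Random(seed)
        self.x = width // 2                           # player position
        self.score = 0
        self.buildings = {self.x: self.rng.choice(self.COLORS)}  # cell -> color

    def step(self, direction):
        """Move one cell (-1 left, +1 right); build and score on empty cells."""
        self.x += direction
        if self.x not in self.buildings:              # empty cell entered:
            self.buildings[self.x] = self.rng.choice(self.COLORS)
            self.score += 1                           # score rises by exactly one
        return self.x, self.score

game = Traveler(seed=42)
_, s1 = game.step(+1)   # new cell: a building appears, score becomes 1
_, s2 = game.step(-1)   # revisit the start cell: same building, score stays 1
print(s1, s2)           # 1 1
```

Because `buildings` is a persistent record, a revisited cell keeps its color and the score does not change, which is precisely the numerical and spatial behavior a consistent generative model should reproduce.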

Inside the MaaG framework: Numerical and spatial modules
The MaaG framework adds a numerical module and a spatial module to the Diffusion Transformer (DiT) architecture. Together, they ensure that generative models do more than just produce images; they also recognize and follow game logic.

- Numerical module: At the core of this module is LogicNet, a compact, trainable network that determines whether specific in-game events should occur; for example, it decides if a +1 score event should be triggered in Traveler. LogicNet doesn’t perform the arithmetic itself. Instead, the updated score is calculated externally, converted into special numerical tokens, and reinjected into the DiT model using the TextDiffuser-2 approach. This design offloads computation from the generative model, significantly improving numerical consistency.
- Spatial module: This component introduces External Map, a persistent memory mechanism that stores all previously explored scenes, such as building colors and locations. Before rendering a new frame, the model consults this map to retrieve surrounding context, including areas outside the current field of view, supporting visual continuity. After generating a frame, the model uses a sliding-window matching algorithm to align the local environment with the external map and updates the map in real time. It’s as if the model has both GPS and a world atlas, keeping the environment consistent as the player moves.
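To make the spatial module concrete, here is a toy 1D version of the align-and-update cycle. This is a hedged sketch: the matching criterion and data layout are our assumptions, not the paper’s exact algorithm:

```python
# Toy sliding-window alignment against a persistent external map.
# `world_map` is the stored record of explored cells (None = unexplored);
# `window` is the local neighborhood from a freshly generated frame.

def best_alignment(world_map, window):
    """Slide `window` across `world_map`; return the offset with most matches."""
    best_offset, best_score = 0, -1
    for offset in range(len(world_map) - len(window) + 1):
        score = sum(world_map[offset + i] == cell
                    for i, cell in enumerate(window))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset

def update_map(world_map, window):
    """Align the local window, then write it back to keep the map current."""
    offset = best_alignment(world_map, window)
    world_map[offset:offset + len(window)] = window
    return offset

world = ["red", "blue", None, "green", "yellow", None]
local = ["blue", "cyan", "green"]        # the new frame's local environment
offset = update_map(world, local)
print(offset, world)  # 1 ['red', 'blue', 'cyan', 'green', 'yellow', None]
```

The lookup before rendering and the write-back after rendering are what give the model its “GPS and world atlas”: the map supplies context beyond the current field of view, and each new frame refines the map in turn.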
Showcasing the results: Traveler, Pong, and Pac-Man
Unlike traditional games, which rely on graphics engines for rendering, generative games synthesize each frame directly with a neural network. The following videos show the MaaG framework in action across three such games, presented from left to right: Traveler, Pong, and Pac-Man. They demonstrate how the framework keeps scenes visually consistent as gameplay unfolds.

Table 1 presents qualitative results demonstrating that MaaG effectively mitigates common issues in baseline models, such as erratic score changes and sudden visual transitions. Thanks to its modular architecture, MaaG is highly adaptable. Developers can adjust LogicNet’s rules and modify the dimensions of the spatial map to support a wide range of 1D and 2D games.
The system also allows creators to predefine or dynamically update the external map during gameplay, offering more control over the gaming environment than previous systems like GameGAN.
Despite introducing new logic and spatial modules, MaaG maintains a low inference latency of approximately 0.015 seconds, preserving gameplay fluidity.

Pushing the boundaries of AI-driven game generation
MaaG offers major improvements, though it still has limitations in repetitive environments, where spatial alignment can break down. Still, this framework represents a step forward in addressing the consistency challenges that have long plagued generative games.
The work shows that by decoupling numeric logic and spatial memory from the core pixel-generation process and incorporating these elements as explicit conditions, AI can generate game worlds that are both visually compelling and mechanically coherent.
Looking ahead, the team plans to expand MaaG into more complex 2D and 3D environments and explore more robust strategies for ensuring spatial consistency. With continued advances in approaches like MaaG, AI-generated, highly playable, and logically sound game worlds are rapidly becoming a reality.