MaaG: A new framework for consistent AI-generated games

World models are a key concept in AI. They simulate how agents behave in virtual environments and enable immersive, interactive experiences. They are not only transforming game and media generation but also opening new frontiers for using AI in complex, dynamic settings.

One emerging trend is generative games, where game environments are created frame by frame using neural networks. Microsoft’s MUSE system, for example, can generate scenes from the game Bleeding Edge using deep learning models.

Figure 1. Microsoft’s MUSE generates frames from Bleeding Edge using neural networks.

Yet beneath the visual polish, generative games often contain noticeable inconsistencies. Background elements may disappear or shift abruptly after minor player actions, like a form of short-term memory loss. These disruptions highlight one of the field’s biggest challenges: maintaining consistency.

In response, researchers from Microsoft Research Asia, the Hong Kong University of Science and Technology, and the University of Chinese Academy of Sciences have introduced a new framework called Model as a Game (MaaG). This approach addresses two core inconsistencies in generative games: numerical and spatial.

Defining the problem: Numerical and spatial consistency

Numerical consistency refers to the logical accuracy of score updates based on game events. For example, if an action yields a +1 score, the result should reflect that exact change. Spatial consistency, by contrast, means the environment remains visually coherent when players revisit previously explored areas.
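As a rough illustration of the numerical side, the check below walks a recorded trajectory and flags every step where the displayed score does not change by exactly the expected amount. The `Step` record, the `find_score_violations` helper, and the default reward of 1 are hypothetical names and values chosen for this sketch; it is not the NumCon metric reported later in the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One recorded frame: did a scoring event fire, and what score was shown?"""
    event_fired: bool
    shown_score: int


def find_score_violations(steps: List[Step], start_score: int = 0,
                          reward: int = 1) -> List[int]:
    """Return indices of steps whose score change does not match the event."""
    violations = []
    previous = start_score
    for index, step in enumerate(steps):
        expected = previous + (reward if step.event_fired else 0)
        if step.shown_score != expected:
            violations.append(index)
        # Compare the next step against what was actually shown, so one bad
        # frame is reported once rather than on every later frame.
        previous = step.shown_score
    return violations


trajectory = [
    Step(event_fired=True, shown_score=1),   # 0 -> 1, consistent
    Step(event_fired=False, shown_score=1),  # no event, score unchanged
    Step(event_fired=True, shown_score=3),   # expected 2, but 3 was shown
    Step(event_fired=False, shown_score=3),  # unchanged relative to the last frame
]
print(find_score_violations(trajectory))  # [2]
```

Only the third frame is flagged, because each step is judged against the score shown on the frame before it.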

To examine these issues in a controlled setting, the team created a minimalist 2D game called Traveler. In it, a small black block moves left and right. As it passes through empty spaces, colorful buildings are randomly generated, and the score increases by one.

Despite its simplicity, Traveler clearly reveals the limitations of current generative models. Notably, the game was generated using large language models (LLMs) and built with Pygame, a set of Python modules for writing video games. It also supports frame-by-frame data export with synchronized numerical states, offering a strong foundation for research.
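The rules described above are simple enough to sketch without any rendering. The following headless stand-in is not the team's Pygame code: the class name `TravelerSim`, the color palette, the cell-based layout, and the record fields are all assumptions made for illustration. It shows how each frame index can be paired with a synchronized numerical state.

```python
import json
import random
from typing import Dict, List

COLORS = ["red", "green", "blue", "yellow"]  # assumed palette


class TravelerSim:
    """Headless stand-in for Traveler's rules (no rendering)."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.position = 0
        self.score = 0
        self.buildings: Dict[int, str] = {}  # cell index -> building color

    def step(self, action: str) -> dict:
        """Apply 'left' or 'right' and return the synchronized state record."""
        if action not in ("left", "right"):
            raise ValueError(f"unknown action: {action}")
        self.position += 1 if action == "right" else -1
        scored = self.position not in self.buildings
        if scored:
            # First visit to an empty cell: create a building and add one point.
            self.buildings[self.position] = self.rng.choice(COLORS)
            self.score += 1
        return {
            "action": action,
            "position": self.position,
            "score": self.score,
            "scored": scored,
            "building": self.buildings[self.position],
        }


def export_run(actions: List[str], seed: int = 0) -> List[dict]:
    """Play a sequence of actions and return one record per frame."""
    sim = TravelerSim(seed)
    return [{"frame": i, **sim.step(a)} for i, a in enumerate(actions)]


records = export_run(["right", "right", "left", "right", "right"])
print(json.dumps(records[-1]))
```

In this run the score reaches 3 rather than 5, because two of the five moves revisit cells that already hold a building. In a Pygame build, each record would presumably be written alongside the rendered frame, for instance with `pygame.image.save`.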

Figure 2. In Traveler, a moving block generates buildings and scores, exposing consistency challenges.

Inside the MaaG framework: Numerical and spatial modules

The MaaG framework adds a numerical module and a spatial module to the Diffusion Transformer (DiT) architecture. Together, they ensure that generative models do more than produce images: they also recognize and follow game logic.

Figure 3. MaaG incorporates numerical (red, left) and spatial (blue, right) modules to improve consistency.

  • Numerical module: At the core of this module is LogicNet, a compact, trainable network that determines whether specific in-game events should occur. For example, it decides if a +1 score event should be triggered in Traveler.
    LogicNet doesn’t perform arithmetic itself. Instead, the updated score is calculated externally, converted into special numerical tokens, and reinjected into the DiT model using the TextDiffuser-2 approach. This design offloads computation from the generative model, significantly improving numerical consistency.
  • Spatial module: This component introduces the External Map, a persistent memory mechanism that stores all previously explored scenes, such as building colors and locations. Before rendering a new frame, the model consults this map to retrieve surrounding context, including areas outside the current field of view, which supports visual continuity.
    After generating a frame, it uses a sliding window matching algorithm to align the local environment with the external map and updates it in real time. It’s as if the model has both GPS and a world atlas, keeping the environment consistent as the player moves.
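
The numerical module's flow (decide whether an event fires, update the score outside the generator, then hand the result back as tokens) can be sketched as follows. This is a minimal sketch under stated assumptions: LogicNet is a trained network, so the hand-written `rule_based_detector` only stands in for it, and the `<digit_N>` token format is invented here rather than taken from TextDiffuser-2 or from MaaG.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical stand-in for LogicNet: the real module is a trained network,
# so a hand-written rule plays its role in this sketch.
EventDetector = Callable[[int, Set[int]], bool]


def rule_based_detector(position: int, visited: Set[int]) -> bool:
    """Fire the +1 event when the block enters a cell it has not visited."""
    return position not in visited


def score_to_tokens(score: int, width: int = 3) -> List[str]:
    """Encode a score as fixed-width digit tokens (the token format is assumed)."""
    if score < 0 or score >= 10 ** width:
        raise ValueError("score does not fit in the token width")
    return [f"<digit_{d}>" for d in str(score).zfill(width)]


def numerical_step(position: int, visited: Set[int], score: int,
                   detector: EventDetector = rule_based_detector
                   ) -> Tuple[int, List[str]]:
    """Decide the event, update the score externally, and emit tokens."""
    if detector(position, visited):
        score += 1  # the arithmetic happens here, not inside the image model
        visited.add(position)
    return score, score_to_tokens(score)


visited: Set[int] = set()
score = 0
for position in [1, 2, 1, 3]:
    score, tokens = numerical_step(position, visited, score)
print(score, tokens)  # 3 ['<digit_0>', '<digit_0>', '<digit_3>']
```

Because the generator only receives the finished tokens as a condition, it never has to add numbers itself, which is the point of offloading the computation.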

Showcasing the results: Traveler, Pong, and Pac-Man

Unlike traditional games, which rely on game engines for rendering, generative games synthesize each frame directly with a neural network. The accompanying videos show the MaaG framework in action across three such games (Traveler, Pong, and Pac-Man, from left to right), demonstrating how it keeps scenes visually consistent as gameplay unfolds.

Figure 4. MaaG resolves issues like score fluctuations and scene glitches across multiple games, enhancing both flexibility and coherence.

Table 1 presents quantitative results showing that MaaG mitigates common issues in baseline models, such as erratic score changes and sudden visual transitions. Thanks to its modular architecture, MaaG is also highly adaptable: developers can adjust LogicNet’s rules and modify the dimensions of the spatial map to support a wide range of 1D and 2D games.

The system also allows creators to predefine or dynamically update the external map during gameplay, offering more control over the gaming environment than previous systems like GameGAN.
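
To make the external map more concrete, here is a one-dimensional sketch that can be seeded in advance, queried for context around the player, and updated after each frame by sliding the freshly generated view over nearby offsets. The `ExternalMap` class below, its methods, and the match-counting score are assumptions for illustration; the article does not specify MaaG's actual data layout or matching criterion.

```python
from typing import Dict, List, Optional

Cell = Optional[str]  # a building color, or None for an unexplored cell


class ExternalMap:
    """Persistent 1D memory of explored cells (a simplified, assumed layout)."""

    def __init__(self, size: int, predefined: Optional[Dict[int, str]] = None) -> None:
        self.cells: List[Cell] = [None] * size
        for index, color in (predefined or {}).items():
            self.cells[index] = color  # creators may seed the map up front

    def context(self, center: int, radius: int) -> List[Cell]:
        """Return stored cells around `center`, including off-screen ones."""
        low = max(0, center - radius)
        high = min(len(self.cells), center + radius + 1)
        return self.cells[low:high]

    def match(self, view: List[Cell], hint: int, search: int = 2) -> int:
        """Slide `view` over offsets near `hint` and return the best one."""
        last = len(self.cells) - len(view)
        low, high = max(0, hint - search), min(last, hint + search)
        if low > high:
            raise ValueError("no valid offset near the hint")
        best_offset, best_score = low, -1
        # Try offsets closest to the hint first so ties favor small movements.
        for offset in sorted(range(low, high + 1), key=lambda o: abs(o - hint)):
            window = self.cells[offset:offset + len(view)]
            score = sum(1 for seen, stored in zip(view, window)
                        if seen is not None and seen == stored)
            if score > best_score:
                best_offset, best_score = offset, score
        return best_offset

    def update(self, view: List[Cell], hint: int, search: int = 2) -> int:
        """Align the freshly generated view, then write it into the map."""
        offset = self.match(view, hint, search)
        for i, color in enumerate(view):
            if color is not None:
                self.cells[offset + i] = color
        return offset


world = ExternalMap(size=8, predefined={0: "red", 1: "blue", 2: "green"})
offset = world.update(["blue", "green", "yellow"], hint=0)
print(offset, world.context(center=2, radius=2))
# 1 ['red', 'blue', 'green', 'yellow', None]
```

The view aligns one cell to the right of the stale hint, and the newly seen yellow building is written into the map. When several offsets score equally, as they would in repetitive scenery, this sketch simply falls back to the offset nearest the hint.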

Despite introducing new logic and spatial modules, MaaG maintains a low inference latency of approximately 0.015 seconds, preserving gameplay fluidity.

Table 1. MaaG improves key metrics across all tested games: numerical consistency (NumCon), spatial consistency (SpaCon), action recognition accuracy (ActAcc), and FID/FVD quality scores.

Pushing the boundaries of AI-driven game generation

MaaG offers major improvements, but it has limitations in repetitive environments, where spatial alignment can break down. Even so, the framework represents a step forward in addressing the consistency challenges that have long plagued generative games.

The work shows that by decoupling numeric logic and spatial memory from the core pixel-generation process and incorporating these elements as explicit conditions, AI can generate game worlds that are both visually compelling and mechanically coherent.

Looking ahead, the team plans to expand MaaG into more complex 2D and 3D environments and explore more robust strategies for ensuring spatial consistency. With continued advances in approaches like MaaG, AI-generated, highly playable, and logically sound game worlds are rapidly becoming a reality.

References

[1] Kanervisto, Anssi, et al. 2025. “World and Human Action Models towards gameplay ideation.” Nature, 656–663. https://ora.ox.ac.uk/objects/uuid:519b4d38-1ee2-4c1b-95a0-ed116a149bf3

