WorldAgen

Unified State-Action Prediction with Test-Time World Model Training

AAAI 2026

Chi Wan*, Kangrui Wang*, Yuan Si, Pingyue Zhang, Manling Li

Northwestern University

*Equal Contribution

How can vision-language-action (VLA) models adapt to new environments where world dynamics shift? Existing methods that combine world modeling with action prediction rely on pretraining on static datasets and lack mechanisms for active adaptation at deployment time, so they struggle to generalize to unseen object configurations and dynamics.

We present WorldAgen, a unified framework that jointly learns world modeling and action prediction while enabling Test-Time Training (TTT). A shared Transformer backbone hosts two heads: a world model head that predicts future states from past state-action trajectories, and a policy head that predicts actions conditioned on task instructions. A Mixed Unidirectional Attention Mask disentangles the two heads within a single architecture.

At test time, WorldAgen samples exploratory actions, collects ground-truth state transitions, and performs lightweight TTT updates to refine its world model. This online adaptation sharpens the model's understanding of the environment and, in turn, yields more accurate action predictions, delivering consistent gains on CALVIN and LIBERO.

Method

WorldAgen unifies a task-conditioned policy and a task-agnostic world model inside a single Transformer backbone, then adapts the world model online at deployment with Test-Time Training.

WorldAgen architecture. A shared backbone feeds two heads: the policy head predicts the next action chunk from the instruction, observations, and robot states, while the world model head predicts the next observation chunk to refine the shared representation of environment dynamics.

Joint State-Action Modeling

The policy head predicts actions conditioned on the task; the world model head predicts future observations independent of the task. Training them jointly aligns scene understanding with action prediction and produces richer representations of dynamics.

Mixed Unidirectional Attention Mask

A local mask blocks intra-step leakage between the heads, and a global mask hides the task instruction from the world model head. Both heads can then share one backbone while staying strictly causal and decoupled.
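As a rough illustration of how a local and a global rule can be layered on top of a causal mask, here is a minimal sketch in plain Python. The token layout (instruction tokens followed by an observation, an action placeholder, and an action chunk per step), the token names, and the exact form of both rules are assumptions made for this example, not the layout used by WorldAgen.

```python
from typing import List, Tuple


def layout(n_instr: int, n_steps: int) -> Tuple[List[str], List[int]]:
    """Hypothetical token layout: instruction tokens, then (obs, ph, act) per step."""
    kinds, steps = [], []
    for _ in range(n_instr):
        kinds.append("instr")
        steps.append(-1)  # instruction tokens belong to no step
    for t in range(n_steps):
        for kind in ("obs", "ph", "act"):  # ph = action placeholder
            kinds.append(kind)
            steps.append(t)
    return kinds, steps


def build_mask(kinds: List[str], steps: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    world_stream = {"obs", "act"}  # tokens read by the world model head (assumption)
    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # unidirectional: never attend to later tokens
            allowed = True
            # Global rule (assumed form): world-model tokens never see the instruction.
            if kinds[i] in world_stream and kinds[j] == "instr":
                allowed = False
            # Local rule (assumed form): an action chunk never sees its own step's placeholder.
            if kinds[i] == "act" and kinds[j] == "ph" and steps[i] == steps[j]:
                allowed = False
            mask[i][j] = allowed
    return mask


kinds, steps = layout(n_instr=2, n_steps=2)
mask = build_mask(kinds, steps)
for kind, row in zip(kinds, mask):
    print(f"{kind:>5} " + "".join("1" if ok else "0" for ok in row))
```

In this toy layout only the placeholder rows keep access to the instruction columns, which is one way to keep the policy task-conditioned while the observation and action rows stay task-agnostic.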

Two-Stage Test-Time Training

Stage 1: free exploratory rollouts collect unlabeled state transitions. Stage 2: a few LoRA updates adapt only the world model on the observation loss, improving environment modeling without touching the policy or needing task labels.
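The two stages can be sketched on a toy problem. Everything below is a hypothetical stand-in: a scalar linear environment replaces the robot simulator, a two-weight linear predictor replaces the world model head, and a small additive delta on frozen base weights plays the role of LoRA. The learning rate and step count are arbitrary choices for the illustration.

```python
import random


def env_step(state: float, action: float) -> float:
    """Toy deployment environment whose dynamics differ from the pretrained model."""
    return 0.9 * state + 0.5 * action


class WorldModel:
    def __init__(self) -> None:
        self.base = [1.0, 0.2]     # frozen "pretrained" weights for (state, action)
        self.adapter = [0.0, 0.0]  # small trainable delta, standing in for LoRA

    def predict(self, state: float, action: float) -> float:
        w_state = self.base[0] + self.adapter[0]
        w_action = self.base[1] + self.adapter[1]
        return w_state * state + w_action * action


def explore(n: int, rng: random.Random):
    """Stage 1: take random actions and record (state, action, next_state)."""
    data, state = [], 1.0
    for _ in range(n):
        action = rng.uniform(-1.0, 1.0)
        next_state = env_step(state, action)
        data.append((state, action, next_state))
        state = next_state
    return data


def obs_loss(model: WorldModel, data) -> float:
    """Mean squared error between predicted and observed next states."""
    return sum((model.predict(s, a) - n) ** 2 for s, a, n in data) / len(data)


def adapt(model: WorldModel, data, steps: int = 20, lr: float = 0.1) -> None:
    """Stage 2: gradient steps on the observation loss, updating the adapter only."""
    for _ in range(steps):
        grad_state = grad_action = 0.0
        for s, a, n in data:
            err = model.predict(s, a) - n
            grad_state += 2.0 * err * s / len(data)
            grad_action += 2.0 * err * a / len(data)
        model.adapter[0] -= lr * grad_state
        model.adapter[1] -= lr * grad_action


rng = random.Random(0)
transitions = explore(64, rng)
model = WorldModel()
loss_before = obs_loss(model, transitions)
adapt(model, transitions)
loss_after = obs_loss(model, transitions)
print(f"observation loss: {loss_before:.4f} -> {loss_after:.4f}")
```

No task instruction or reward appears anywhere in the loop: the only supervision is the next state returned by the environment, and the base weights are never written to.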

The Mixed Unidirectional Attention Mask: local masking prevents the action chunk from seeing its placeholder, and global masking builds a task-agnostic world model head.
Trajectory splitting and two-step inference: the policy head fills the action placeholder, then the world model predicts the next observation chunk to roll out the trajectory.
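The two-step inference loop can be written down in a few lines. This is a minimal sketch with hypothetical names: `toy_policy` and `toy_world_model` are placeholder callables operating on single scalar observations and actions rather than chunks, and only the control flow is meant to mirror the description above.

```python
from typing import Callable, List, Tuple

History = List[Tuple[float, float]]  # past (observation, action) pairs


def rollout(
    policy: Callable[[str, History, float], float],
    world_model: Callable[[History], float],
    instruction: str,
    obs: float,
    horizon: int,
) -> Tuple[History, float]:
    history: History = []
    for _ in range(horizon):
        # Step 1: the policy fills the action slot, conditioned on the instruction.
        action = policy(instruction, history, obs)
        history.append((obs, action))
        # Step 2: the world model predicts the next observation from the
        # state-action history alone; it never receives the instruction.
        obs = world_model(history)
    return history, obs


def toy_policy(instruction: str, history: History, obs: float) -> float:
    """Hypothetical policy: move halfway toward a target named by the instruction."""
    target = 1.0 if instruction == "move right" else -1.0
    return 0.5 * (target - obs)


def toy_world_model(history: History) -> float:
    """Hypothetical dynamics: the next observation is the last one plus the action."""
    last_obs, last_action = history[-1]
    return last_obs + last_action


history, final_obs = rollout(toy_policy, toy_world_model, "move right", 0.0, horizon=3)
print(history, final_obs)
```

Passing the instruction only to `policy` makes the task-agnostic role of the world model explicit in the function signatures.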

Results

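WorldAgen is competitive with state-of-the-art VLA baselines, and Test-Time Training of the world model head delivers further gains on both CALVIN and LIBERO: WorldAgen-TTT reaches the highest average sequence length on CALVIN (3.93) and the highest average success rate on LIBERO-10 (79.0%) among the methods compared.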

CALVIN (Long-Horizon)

Success rate (%) for completing 1 through 5 consecutive tasks (T1 to T5), and the average number of tasks completed in a row (Avg. Len.).

Method              T1     T2     T3     T4     T5     Avg. Len.
RoboFlamingo        82.4   61.9   46.6   33.1   23.5   2.47
SuSIE               87.0   69.0   49.0   38.0   26.0   2.69
GR-1                85.4   71.2   59.6   49.7   40.1   3.06
3D Diffuser Actor   92.2   78.7   63.9   51.2   41.2   3.27
CLOVER              96.0   83.5   70.8   57.5   45.4   3.53
Seer                93.0   82.4   72.3   62.6   53.3   3.64
Seer-Large          92.7   84.6   76.1   68.9   60.3   3.83
WorldAgen           96.3   87.7   76.8   67.3   59.1   3.87
WorldAgen-TTT       96.6   88.5   78.5   68.7   60.5   3.93
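The Avg. Len. column can be cross-checked against the per-task columns. Assuming the usual CALVIN reading in which Tk is the rate of completing at least k tasks in a row, the expected number of completed tasks is the sum of those five rates. The helper name below is made up for this check, and small discrepancies in the last digit are possible because the table entries are already rounded.

```python
def average_length(success_rates_percent):
    """Expected number of consecutive tasks completed, from T1..T5 given in percent."""
    return sum(success_rates_percent) / 100.0


rows = {
    "WorldAgen": [96.3, 87.7, 76.8, 67.3, 59.1],
    "WorldAgen-TTT": [96.6, 88.5, 78.5, 68.7, 60.5],
}
for name, rates in rows.items():
    print(name, round(average_length(rates), 2))
# WorldAgen 3.87
# WorldAgen-TTT 3.93
```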

LIBERO-10 (Multi-Task)

Average success rate (%) on the LIBERO-10 long-horizon suite.

Method          Avg. Success
MT-ACT          41.0
OpenVLA         54.0
MVP             68.2
MPI             77.3
Seer            78.7
WorldAgen       75.5
WorldAgen-TTT   79.0

Test-Time Training in Action

Qualitative CALVIN rollouts before and after Test-Time Training. Each pair uses the same task sequence, showing how world model adaptation changes the agent's behavior.

Task: rotate blue block right -> move slider right -> lift red block from slider -> place in slider -> turn off lightbulb

Before TTT: [rollout video]
After TTT: [rollout video]

Task: open drawer -> push red block right -> move slider left -> lift pink block from slider -> place in slider

Before TTT: [rollout video]
After TTT: [rollout video]

Task: rotate pink block right -> turn off LED -> lift pink block -> place in slider -> open drawer

Before TTT: [rollout video]
After TTT: [rollout video]

Cite

If you use WorldAgen or its trained models, please cite our paper.

@article{wan2026worldagen,
  title   = {WorldAgen: Unified State-Action Prediction with Test-Time World Model Training},
  author  = {Wan, Chi and Wang, Kangrui and Si, Yuan and Zhang, Pingyue and Li, Manling},
  year    = {2026}
}