Towards Robust Zero-Shot Reinforcement Learning

NeurIPS 2025

¹The Chinese University of Hong Kong · ²Tsinghua University · ³Huawei Noah's Ark Lab · ⁴Shanghai Artificial Intelligence Laboratory
*Equal contribution · Corresponding author
Figure: BREEZE framework overview.

BREEZE Framework

Zero-shot reinforcement learning aims to create generalist agents that can adapt to entirely new tasks without retraining—a crucial step toward scalable, autonomous intelligence. However, current methods remain limited by weak expressivity and unstable training.

BREEZE is a forward-backward (FB) based framework that simultaneously enhances learning stability, policy extraction capability, and representation learning quality through three key designs (a minimal sketch follows the list):

  • Behavior regularization in zero-shot RL policy learning, transforming policy optimization into a stable in-sample learning paradigm.
  • Diffusion-based policy extraction, enabling the generation of high-quality, multimodal action distributions in zero-shot RL settings.
  • Attention-based architectures for representation modeling, capturing the complex relationships underlying the environment dynamics.
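
To make the first two designs concrete, here is a minimal PyTorch sketch of in-sample, behavior-regularized policy extraction with a diffusion policy on top of FB representations. The module names, the advantage-weighted weighting scheme, and all hyperparameters are illustrative assumptions for exposition, not the paper's exact objective.

```python
# Illustrative sketch only: in-sample policy extraction with a diffusion policy
# over FB representations. Names (Denoiser, forward_net, ...) and the AWR-style
# weighting are our assumptions, not necessarily BREEZE's exact losses.
import torch
import torch.nn as nn


class Denoiser(nn.Module):
    """Predicts the noise added to an action, conditioned on (state, task z, timestep)."""

    def __init__(self, state_dim, action_dim, z_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + z_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s, a_noisy, z, t):
        # t: (batch, 1) diffusion timestep scaled to [0, 1]
        return self.net(torch.cat([s, a_noisy, z, t], dim=-1))


def fb_q_value(forward_net, s, a, z):
    # Standard FB critic: Q_z(s, a) = F(s, a, z) . z
    return (forward_net(s, a, z) * z).sum(-1, keepdim=True)


def policy_extraction_loss(denoiser, forward_net, batch, z, alphas_bar, beta=3.0):
    """Advantage-weighted denoising loss: trains only on dataset actions (in-sample),
    upweighting actions the FB critic prefers. `alphas_bar` is a 1-D tensor of
    cumulative DDPM noise-schedule products."""
    s, a = batch["state"], batch["action"]
    with torch.no_grad():
        q = fb_q_value(forward_net, s, a, z)
        v = q.mean()                              # crude baseline; a value net could replace it
        w = torch.exp(beta * (q - v)).clamp(max=100.0)

    # Standard DDPM forward (noising) process applied to the dataset action.
    T = alphas_bar.shape[0]
    t = torch.randint(0, T, (s.shape[0],), device=s.device)
    ab = alphas_bar[t].unsqueeze(-1)
    noise = torch.randn_like(a)
    a_noisy = ab.sqrt() * a + (1.0 - ab).sqrt() * noise

    pred = denoiser(s, a_noisy, z, t.float().unsqueeze(-1) / T)
    return (w * ((pred - noise) ** 2).mean(-1, keepdim=True)).mean()
```

Because the loss is computed only on actions drawn from the dataset, the policy never queries the critic on out-of-distribution actions, which is the usual motivation for in-sample, behavior-regularized extraction.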


Figure: BREEZE's Transformer-like architectures for the forward model F (left) and the backward model B (right).
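
As a rough illustration of the attention-based representation design, the sketch below treats the state, action, and task embedding as separate tokens of a small Transformer encoder; the token layout, pooling, depth, and dimensions are our assumptions rather than the paper's exact architecture.

```python
# Illustrative sketch (assumed token layout and sizes): an attention-based
# forward model F(s, a, z) built from a small Transformer encoder.
import torch
import torch.nn as nn


class AttentionForwardModel(nn.Module):
    def __init__(self, state_dim, action_dim, z_dim, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed_s = nn.Linear(state_dim, d_model)
        self.embed_a = nn.Linear(action_dim, d_model)
        self.embed_z = nn.Linear(z_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, z_dim)   # outputs F(s, a, z) in R^{z_dim}

    def forward(self, s, a, z):
        # One token each for state, action, and task embedding: (batch, 3, d_model)
        tokens = torch.stack([self.embed_s(s), self.embed_a(a), self.embed_z(z)], dim=1)
        h = self.encoder(tokens)                 # full self-attention across the three tokens
        return self.head(h.mean(dim=1))          # pool tokens, project to the embedding space

# A backward model B(s') can reuse the same encoder pattern with a single state token.
```

Self-attention lets the representation mix state, action, and task information in a content-dependent way, rather than through a fixed concatenation followed by an MLP.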

Experiments

We evaluate BREEZE on standard benchmarks including ExORL and D4RL-Franka Kitchen. Our experiments assess zero-shot policy performance, robustness under distribution shifts, and ablation studies on regularization and diffusion components.

Figure: We demonstrate our method's results on four different environments.

BREEZE consistently achieves top or near-top returns across benchmarks, demonstrating strong zero-shot generalization. It enhances the generalization ability of vanilla FB methods, converges faster, and reaches higher performance with smoother, more stable learning curves.


Figure: BREEZE achieves superior zero-shot performance across tasks.

Figure: Results table. BREEZE achieves superior zero-shot performance across tasks.

Our design corrects the distorted \( M^{\pi} \) and \( Q \) distributions of earlier FB frameworks, yielding stable, properly scaled value representations.
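
For context, the standard forward-backward identities from the FB literature (shown here as background; BREEZE's corrected objectives are detailed in the paper) relate the successor measure \( M^{\pi} \), the representations \( F \) and \( B \), and the value \( Q \):

\[
\begin{aligned}
M^{\pi_z}(s_0, a_0, \mathrm{d}s') &\approx F(s_0, a_0, z)^{\top} B(s')\, \rho(\mathrm{d}s'), \\
z_r &= \mathbb{E}_{s \sim \rho}\big[ r(s)\, B(s) \big], \\
Q^{\pi_z}_{r}(s, a) &\approx F(s, a, z_r)^{\top} z_r,
\end{aligned}
\]

where \( \rho \) is the state distribution of the dataset. Distortions in the learned \( F^{\top} B \) factorization therefore propagate directly into the \( Q \) estimates, which is why correcting the \( M^{\pi} \) and \( Q \) distributions matters for stable policy extraction.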


Figure: BREEZE learns more realistic distributions of M values (two left panels) and Q values (two right panels).

Conclusion & Discussion

BREEZE unifies diffusion-based policies, transformer encoders, and behavior-regularized learning to achieve stable, expressive zero-shot reinforcement learning. By mitigating extrapolation errors and enhancing policy expressivity, it delivers consistent generalization across tasks.

The main trade-off is between computational cost and performance: diffusion sampling and expressive architectures improve robustness at the expense of efficiency. Future work will focus on reducing this overhead through lighter generative policies and on exploring theoretical guarantees for behavior-regularized generalization.

BibTeX

@inproceedings{zheng2025breeze,
  title={Towards Robust Zero-Shot Reinforcement Learning},
  author={Kexin Zheng and Lauriane Teyssier and Yinan Zheng and Yu Luo and Xianyuan Zhan},
  booktitle={NeurIPS},
  year={2025}
}