SDPG: Efficient On-policy Visual RL via
Stochastic Decoupled Policy Gradient

Haoxiang You1*, Yilang Liu1*, Davis Zong1, Qian Wang1, Teeratham Vitchutripop1, Qi Wang2, Daniel Rakita1, Ian Abraham1,3
1Yale University    2Shanghai Jiao Tong University    3University of Sydney
Under review
*Equal Contribution

SDPG trains diverse visuomotor control policies end-to-end within a few hours on a single NVIDIA RTX 4080 GPU.

Abstract

We present the Stochastic Decoupled Policy Gradient (SDPG), a lightweight visual reinforcement learning (RL) method that trains diverse visuomotor control policies end-to-end within a few hours on a single NVIDIA RTX 4080 GPU. SDPG estimates policy gradients via random perturbations of trajectory rollouts, requiring orders of magnitude fewer batch-rendered environments and substantially reducing compute and memory overhead. On visual MuJoCo benchmarks, SDPG consistently outperforms baseline methods in training time, memory usage, and rewards. Finally, to support future research, we introduce a suite of realistic visual robotics benchmarks spanning dexterous manipulation and challenging locomotion, and demonstrate effective sim-to-real transfer on physical hardware.

Method Overview

SDPG combines batch-rendered and physics-only environments to estimate policy gradients. Batch-rendered environments evaluate policy performance, while physics-only environments provide perturbed rollouts for policy improvement. This mixture of environments significantly reduces the computation and memory overhead. We also introduce several engineering improvements—such as an adaptive exploration strategy and reward-invariant normalization—to keep updates numerically stable throughout training.

Figure 1. SDPG combines batch-rendered and physics-only environments to estimate policy gradients.
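The idea of estimating gradients from randomly perturbed rollouts can be illustrated with a generic zeroth-order estimator. The sketch below is not SDPG's actual estimator, whose details are not given on this page. It assumes a toy quadratic return in place of a physics-only rollout, Gaussian perturbations applied directly to an action sequence, and antithetic (mirrored) sampling. The names `rollout_return` and `perturbation_gradient`, along with all sizes and constants, are hypothetical.

```python
import numpy as np


def rollout_return(actions: np.ndarray) -> float:
    """Toy stand-in for a physics-only rollout (hypothetical objective).

    The return is highest when every action equals 0.5.
    """
    return -float(np.sum((actions - 0.5) ** 2))


def perturbation_gradient(actions, sigma=0.1, num_pairs=1024, rng=None):
    """Antithetic random-perturbation estimate of d(return)/d(actions)."""
    rng = np.random.default_rng(0) if rng is None else rng
    grad = np.zeros_like(actions)
    for _ in range(num_pairs):
        eps = rng.standard_normal(actions.shape)
        # Evaluate a mirrored pair of perturbed rollouts; no rendering is needed.
        r_plus = rollout_return(actions + sigma * eps)
        r_minus = rollout_return(actions - sigma * eps)
        # Central difference along the random direction eps.
        grad += (r_plus - r_minus) / (2.0 * sigma) * eps
    return grad / num_pairs


actions = np.zeros((10, 2))      # horizon 10, 2-D actions (assumed sizes)
estimate = perturbation_gradient(actions)
exact = -2.0 * (actions - 0.5)   # analytic gradient of the toy return
cosine = float(np.sum(estimate * exact)
               / (np.linalg.norm(estimate) * np.linalg.norm(exact)))
print(f"cosine similarity to the exact gradient: {cosine:.3f}")
```

Because each perturbed rollout only needs a scalar return, an estimator of this kind can run on environments that never render pixels, which is consistent with the role the page assigns to the physics-only environments.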

Visual MuJoCo Benchmark

On visual MuJoCo benchmarks, SDPG achieves high rewards that match state-based performance. It also trains quickly: its training speed is on par with distillation and far faster than other end-to-end visual RL methods.

Figure 2. SDPG matches distillation in training speed, is significantly faster than DrQ-v2 and DreamerV3, and achieves higher final rewards on humanoid tasks.

Visual control on classic MuJoCo tasks from third-person RGB.

A key bottleneck in applying on-policy methods like PPO to visual RL is memory: they rely on thousands of parallel environments (e.g., 4096) to estimate gradients, and rendering pixel observations for so many environments quickly exhausts GPU memory. By mixing batch-rendered and physics-only environments, SDPG estimates gradients accurately with an order of magnitude fewer batch-rendered environments. This keeps memory usage on par with off-policy and model-based approaches and enables training on a single NVIDIA RTX 4080 GPU.

Memory Usage (GB)

Method        Hopper  Walker  Ant   Humanoid
SDPG (Ours)   10.2    10.3    10.3  10.5
PPO           48      48      49    50
DrQ-v2        10.6    8.2     10.5  11.6
DreamerV3     10.8    10.8    10.8  10.9
Distillation  10.6    10.6    10.3  10.7

PPO memory is estimated with 4096 environments and state-based hyperparameters; all other methods use 64 batched environments.

Figure 3. Memory scaling. Visual environments require significantly more memory than state-based ones.
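A back-of-the-envelope calculation shows why the number of rendered environments dominates memory. The sketch below counts only the raw image observations stored for one rollout. The image resolution, rollout length, and the helper name `pixel_buffer_gib` are assumptions chosen for illustration, not settings taken from the paper, and the calculation ignores renderer state, network activations, and optimizer memory, so it does not reproduce the figures in the table above.

```python
def pixel_buffer_gib(num_envs, horizon, height, width,
                     channels=3, bytes_per_value=1):
    """Memory needed to store one rollout of raw image observations, in GiB."""
    total_bytes = (num_envs * horizon * height * width
                   * channels * bytes_per_value)
    return total_bytes / 2**30


# Assumed settings: 64x64 RGB uint8 frames and 32-step rollouts.
many = pixel_buffer_gib(num_envs=4096, horizon=32, height=64, width=64)
few = pixel_buffer_gib(num_envs=64, horizon=32, height=64, width=64)
print(f"4096 envs: {many:.2f} GiB, 64 envs: {few:.3f} GiB, "
      f"ratio: {many / few:.0f}x")
```

Under these assumptions the observation buffer grows linearly with the number of rendered environments, so going from 4096 to 64 environments shrinks it by a factor of 64.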

Egocentric Task Suite

To support further research, we release a suite of realistic, challenging tasks for visual RL spanning dexterous manipulation and challenging locomotion. The suite covers diverse on-robot sensor setups—RGB and depth, single- and multi-camera—paired with proprioception to mimic real hardware, and every policy is trained end-to-end with SDPG.

Sim-to-Real: Unitree Go2

We validate SDPG on Unitree Go2 hardware. The robot uses an egocentric RealSense depth camera to perceive its environment and navigate challenging terrains, including uneven surfaces, boxes, and stairs. The policy is trained entirely in simulation in under 2 hours on a single GPU and deployed zero-shot on the physical robot, with no real-world fine-tuning.

BibTeX

@misc{you2026efficientonpolicyvisualrlstochastic,
      title={Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient},
      author={Haoxiang You and Yilang Liu and Davis Zong and Qian Wang and Teeratham Vitchutripop and Qi Wang and Daniel Rakita and Ian Abraham},
      year={2026},
      eprint={2605.26478},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.26478},
}