Abstract

This work presents an overview of the technical details behind a high-performance reinforcement learning policy deployment with the Spot RL Researcher Development Kit for low-level motor access on Boston Dynamics’ Spot. This represents the first public demonstration of an end-to-end reinforcement learning policy deployed on Spot hardware, with training code publicly available through NVIDIA Isaac Lab and deployment code available through Boston Dynamics. We use Wasserstein distance and Maximum Mean Discrepancy to quantify the distributional dissimilarity between data collected on hardware and in simulation, giving a measure of our sim-to-real gap. We use these measures as a scoring function for the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to optimize simulation parameters that are unknown or difficult to measure on Spot. Our modeling and training procedure produces high-quality reinforcement learning policies capable of multiple gaits, including a flight phase. We deploy policies capable of over 5.2 m/s locomotion, more than triple the maximum speed of Spot’s default controller, robustness to slippery surfaces, disturbance rejection, and overall agility previously unseen on Spot. We detail our method and release our code to support future work on Spot with the low-level API.

Methods

Visual overview of the training and deployment pipeline. On the left is an approximation of how the policy is trained in simulation, and on the right is how it is deployed on hardware. Isaac Lab serves as the simulation environment, providing the physics engine and incorporating our modeling for training. Policies are deployed on Spot using an NVIDIA Jetson Orin that communicates with the Boston Dynamics API for state estimates and sends motor control law values through the Spot RL Researcher Development Kit to command the motors.
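As a rough illustration of the deployment side of this pipeline, the sketch below shows the structure of an on-robot control loop: read a state estimate, run policy inference on the Jetson Orin, and stream motor commands back at a fixed rate. All function names and the 50 Hz rate are hypothetical placeholders for illustration only; they are not the Boston Dynamics or Spot RL Researcher Development Kit API.

import time
import numpy as np

def control_loop(policy, robot, rate_hz: float = 50.0):
    """Fixed-rate loop: observe, infer, command. Structure only; APIs are placeholders."""
    dt = 1.0 / rate_hz
    while True:
        t0 = time.monotonic()
        obs = robot.get_state_estimate()    # hypothetical: state estimate from the robot API
        action = policy(np.asarray(obs))    # trained RL policy inference on the Jetson Orin
        robot.send_motor_commands(action)   # hypothetical: low-level motor control law values
        time.sleep(max(0.0, dt - (time.monotonic() - t0)))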
A novel simulation hyperparameter optimization method was necessary to achieve high-speed gaits on hardware. Hardware data collected from an initial baseline policy is used as the target for optimizing simulation parameters. The sim-to-real gap between simulated policy rollouts and hardware is quantified using Wasserstein distance and Maximum Mean Discrepancy, and CMA-ES is used to optimize this objective, as sketched below. This procedure enables us to deploy policies capable of 5.2 m/s locomotion, robustness to slippery surfaces, disturbance rejection, and overall agility previously unseen on Spot.
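The following is a minimal sketch (not the authors' released code) of that parameter-tuning loop: candidate simulation parameters are scored by the distributional mismatch between simulated and hardware rollouts, and the score is minimized with CMA-ES via the cma package. rollout_in_sim is a hypothetical stand-in for an Isaac Lab rollout returning per-step state features, and the specific feature choice and kernel bandwidth are assumptions.

import numpy as np
from scipy.stats import wasserstein_distance  # 1-D earth mover's distance
import cma  # pip install cma

def mmd_rbf(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased Maximum Mean Discrepancy estimate between samples x and y with an RBF kernel."""
    def kernel(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def sim2real_score(sim_data: np.ndarray, real_data: np.ndarray) -> float:
    """Combine per-dimension Wasserstein distance with MMD over full state vectors."""
    w = np.mean([wasserstein_distance(sim_data[:, i], real_data[:, i])
                 for i in range(real_data.shape[1])])
    return w + mmd_rbf(sim_data, real_data)

def optimize_sim_params(hardware_data, rollout_in_sim, x0, sigma0=0.3):
    """hardware_data: (T, D) state features logged from the baseline policy on Spot.
    rollout_in_sim(params): hypothetical simulated rollout with candidate parameters
    (e.g. friction, masses, actuator gains) returning the same (T, D) features."""
    es = cma.CMAEvolutionStrategy(x0, sigma0)
    while not es.stop():
        candidates = es.ask()
        scores = [sim2real_score(rollout_in_sim(p), hardware_data) for p in candidates]
        es.tell(candidates, scores)
    return es.result.xbest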

Locomotion on Normal Ground

Policy trotting on flat ground

Locomotion on Slippery Ground

Policy walking over slippery surface (dish soap and water on acrylic)

Push Recovery on Normal Ground

Policy responding to external disturbances

Push Recovery on Slippery Ground

Push recovery over slippery surface

BibTeX

@article{miller2025spotrl,
  title={High-Performance Reinforcement Learning on Spot: Optimizing Simulation Parameters with Distributional Measures},
  author={Miller, AJ and Yu, Fangzhou and Brauckmann, Michael and Farshidian, Farbod},
  journal={arXiv preprint ...},
  year={2025}
}