Training foundation models on large, diverse datasets of robotic behaviors has allowed robots to add new tasks to their repertoire, such as folding laundry, busing tables, kitting parts, and snapping LEGO bricks together. Most often, the data used to train these policies are trajectories recorded while a human teleoperates the robot. Teleoperation yields data with a minimal embodiment gap because the human operator manipulates the target platform itself. While powerful, teleoperation interfaces make it difficult to collect data with human-level dexterity and speed because they require the operator to transpose their actions into a distant workspace and often to work without force feedback.
Hand-held data collection devices are replicas of the target robot gripper that the human operator uses to perform the tasks. They have two advantages. First, they make some tasks easier for data collectors, because using a hand-held device is much closer to doing the task with human hands, allowing for quick motions and precise force feedback. Second, they allow data to be collected in the wild, generating much more diverse data than teleoperation, which is usually limited to a relatively small number of robot stations.
Simulation offers a way to generate very large datasets very quickly. This data can closely match the chosen embodiment of the robot, including sensor observations like joint angles that are impossible to collect with hand-held data collection devices, and can have super-human performance on challenging tasks. The downside of simulation is that it might not be a perfect match for the physics of the real world because of simplified assets or imperfect contact models. This sim-to-real gap can be mitigated by domain randomization, as is common in reinforcement learning for legged locomotion, but having a reasonable distribution of simulation parameters such as friction and restitution from impacts is still critical.
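As a minimal sketch of what domain randomization over friction and restitution can look like, the snippet below draws a fresh set of physics parameters for each episode. The parameter ranges and the `PhysicsParams` and `sample_physics` names are assumptions made for illustration; they are not taken from any particular simulator or from the pipeline described here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    friction: float      # Coulomb friction coefficient
    restitution: float   # 0 = perfectly inelastic, 1 = perfectly elastic


# Hypothetical ranges, chosen only for illustration.
FRICTION_RANGE = (0.4, 1.2)
RESTITUTION_RANGE = (0.0, 0.3)


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Draw one set of simulation parameters for a single episode."""
    return PhysicsParams(
        friction=rng.uniform(*FRICTION_RANGE),
        restitution=rng.uniform(*RESTITUTION_RANGE),
    )


rng = random.Random(0)
episodes = [sample_physics(rng) for _ in range(1000)]
```

The ranges are the important design choice: if they are too narrow, the policy overfits to one set of dynamics, and if they are implausibly wide, training time is spent on physics the robot will never encounter.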
Simulation Data Generation with AnyTask
Generating data in simulation requires engineers to design simulation environments, randomization settings, tasks, and rewards. When this work is done by hand, it is time-consuming and limits the amount and diversity of simulation data. Generating a lot of data for a single task is cheap, because the engineering time is amortized across many trajectories. Building the large, diverse, multi-task datasets required to train foundation models is expensive, because every new task requires additional engineering time.
At the RAI Institute, we have developed AnyTask, a pipeline for simulation data generation that leverages language models to eliminate much of the human effort required to generate simulated data. Given a high-level description of the task to be performed (such as stacking) and a database of objects, AnyTask uses LLMs to generate a number of concrete task descriptions. Each task is then instantiated in code by generating functions to determine success, reset the environment, and generate state-based observations. Additionally, the system generates code for a scripted policy and a dense reward function. All of these pieces can be combined to generate data. We will now describe some of the components of this system in more detail.
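To make the list of generated pieces concrete, here is a hedged sketch of how they might fit together for a toy one-dimensional "lift the block" task. The `GeneratedTask` container, the `toy_step` dynamics, and all numeric values are hypothetical; they stand in for LLM-generated code and a real simulator and are not AnyTask's actual interface.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]  # toy state: named scalar quantities


@dataclass
class GeneratedTask:
    """Hypothetical bundle of the functions generated for one task."""
    description: str
    reset: Callable[[random.Random], State]
    observe: Callable[[State], List[float]]
    is_success: Callable[[State], bool]
    scripted_policy: Callable[[State], float]
    dense_reward: Callable[[State], float]


def toy_step(state: State, action: float) -> State:
    """Stand-in for the simulator: the action directly moves the block."""
    return {**state, "block_z": state["block_z"] + action}


def collect_episode(task: GeneratedTask, rng: random.Random, max_steps: int = 50):
    """Roll out the scripted policy and record (observation, action, reward)."""
    state = task.reset(rng)
    trajectory = []
    for _ in range(max_steps):
        action = task.scripted_policy(state)
        trajectory.append((task.observe(state), action, task.dense_reward(state)))
        state = toy_step(state, action)
        if task.is_success(state):
            return trajectory, True
    return trajectory, False


lift_task = GeneratedTask(
    description="Lift the block to a height of 0.2 m",
    reset=lambda rng: {"block_z": rng.uniform(0.0, 0.05), "target_z": 0.2},
    observe=lambda s: [s["block_z"], s["target_z"]],
    is_success=lambda s: abs(s["target_z"] - s["block_z"]) < 0.01,
    # Move toward the target, at most 2 cm per step.
    scripted_policy=lambda s: max(-0.02, min(0.02, s["target_z"] - s["block_z"])),
    dense_reward=lambda s: -abs(s["target_z"] - s["block_z"]),
)

trajectory, succeeded = collect_episode(lift_task, random.Random(0))
```

Keeping the success checker separate from the dense reward means that either one can be inspected or regenerated on its own.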
Language-guided manipulation requires semantically meaningful object arrangements. In order to take an object out of a drawer, it has to start inside the drawer. We developed a framework for procedurally generating semantically meaningful scenes. This framework implements a number of object-wise spatial relationships and utilizes batched, GPU-accelerated collision checking, allowing it to generate scenes that fulfill those relationships between objects 200 times faster than existing methods. AnyTask's language models can easily generate these scene configurations, which can then be instantiated to create large numbers of randomized scenes.
While language models can generate correct code for some tasks zero-shot, the task description does not always contain enough information to generate correct code. For example, language models can make incorrect assumptions about the geometry, kinematics, and dynamics of the task and produce incorrect code for coordinate frame transformations or for predicting how an object will roll when pushed. We improve the generated code through an iterative process, in which a VLM inspects both the resulting text descriptions and image observations from executing a policy to determine whether the success checker is correctly identifying success and failure, whether the scripted policy is correctly executing the task, and whether the reward function is aligned with the desired behavior. The LLM then makes changes to the generated code, and the process repeats for a constant number of iterations.
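The idea of sampling scenes that satisfy a spatial relationship while avoiding collisions can be illustrated with a small 2-D sketch. It places discs "inside" a rectangular drawer by drawing a batch of candidate positions and keeping one that satisfies the relation and does not overlap anything already placed. This is a plain-Python, CPU-only stand-in under simplified geometry; the `Disc`, `Box`, and `place_inside` names are hypothetical, and the framework described above runs its collision checks in batches on the GPU instead.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Disc:
    x: float
    y: float
    r: float


@dataclass(frozen=True)
class Box:
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def inside(d: Disc, box: Box) -> bool:
    """Spatial relation: the whole disc lies within the box."""
    return (box.x_min + d.r <= d.x <= box.x_max - d.r
            and box.y_min + d.r <= d.y <= box.y_max - d.r)


def collides(a: Disc, b: Disc) -> bool:
    """Two discs collide when their centers are closer than the sum of radii."""
    return (a.x - b.x) ** 2 + (a.y - b.y) ** 2 < (a.r + b.r) ** 2


def place_inside(box: Box, radii: Sequence[float], rng: random.Random,
                 batch_size: int = 256, max_batches: int = 100) -> List[Disc]:
    """Place one disc per radius, testing a batch of candidates at a time."""
    placed: List[Disc] = []
    for r in radii:
        for _ in range(max_batches):
            candidates = [
                Disc(rng.uniform(box.x_min, box.x_max),
                     rng.uniform(box.y_min, box.y_max), r)
                for _ in range(batch_size)
            ]
            valid = [c for c in candidates
                     if inside(c, box) and not any(collides(c, p) for p in placed)]
            if valid:
                placed.append(valid[0])
                break
        else:
            raise RuntimeError("could not satisfy the spatial relation")
    return placed


drawer = Box(0.0, 0.0, 0.4, 0.3)
scene = place_inside(drawer, [0.03, 0.04, 0.05], random.Random(0))
```

Evaluating a whole batch of candidates per step is what makes this kind of rejection sampling a natural fit for parallel hardware.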
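The refinement loop can be sketched as follows. The `run`, `critique`, and `revise` callables are hypothetical placeholders for executing the generated code in simulation, querying the VLM, and asking the LLM for a fix; the string-based stubs below exist only to make the control flow runnable and do not reflect the real prompts or models.

```python
from typing import Callable, List, Tuple


def refine(code: str,
           run: Callable[[str], str],
           critique: Callable[[str], List[str]],
           revise: Callable[[str, List[str]], str],
           num_iterations: int = 3) -> Tuple[str, List[List[str]]]:
    """Run a fixed number of execute / critique / revise rounds."""
    history: List[List[str]] = []
    for _ in range(num_iterations):
        rollout_summary = run(code)          # execute the policy in simulation
        issues = critique(rollout_summary)   # VLM-style inspection of the result
        history.append(issues)
        if issues:
            code = revise(code, issues)      # LLM-style edit of the code
    return code, history


# Toy stubs: the "bug" is a pose expressed in the wrong coordinate frame.
def toy_run(code: str) -> str:
    return f"executed: {code}"


def toy_critique(summary: str) -> List[str]:
    return ["pose uses the wrong frame"] if "world_frame" in summary else []


def toy_revise(code: str, issues: List[str]) -> str:
    return code.replace("world_frame", "gripper_frame")


final_code, history = refine("grasp pose in world_frame",
                             toy_run, toy_critique, toy_revise)
```

In this sketch, the first round flags and fixes the frame error, and the remaining rounds find nothing further to change.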
Solving Behavior Generation Limitations with ExpertGen
Unfortunately, even with VLM-guided refinement, the behaviors that can be successfully generated are still limited. Reinforcement learning policies trained with LLM-generated rewards frequently exhibit behavior that exploits limitations of the simulator or of the reward functions, allowing them to achieve high rewards without completing the desired task in a way that would be feasible in the real world. Scripted policies can transfer to the real world more easily and can more easily obey semantic priors, such as recognizing that a pick-and-place task requires lifting the robot arm away from the table before initiating the task. However, even relatively simple contact-rich or reactive tasks, like pushing an irregularly shaped pear across a table, can be impossible for LLM-generated scripted policies to solve reliably.
To address this problem, we developed ExpertGen, a pipeline for generating policies in simulation that attempts to maintain the best characteristics of both approaches, combining the semantic priors and feasible real-world trajectories of scripted policies with the reactivity and success rates of reinforcement learning.
ExpertGen learns a behavior prior by fitting a diffusion policy to demonstration trajectories, such as those from the LLM-generated scripted trajectories. Given that prior, ExpertGen then uses massively parallel reinforcement learning to find initial noise values for the diffusion policy that represent successful behavior while staying within the distribution of the initial demonstrations. ExpertGen only relies on a sparse success checker, which tends to be significantly more reliable than dense, LLM-generated rewards.
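The idea of searching over initial noise while keeping the prior frozen can be shown with a deliberately tiny example. Here a fixed function stands in for the trained diffusion policy, success is a sparse boolean check, and only the mean of the noise distribution is adapted. The update is a cross-entropy-style rule used as a simple stand-in for the reinforcement learning step, and every name and constant is an assumption made for illustration, not ExpertGen's implementation.

```python
import math
import random


def frozen_prior(noise: float) -> float:
    """Stand-in for a frozen diffusion policy: maps initial noise to an action."""
    return math.tanh(noise)


def is_success(action: float) -> bool:
    """Sparse success check; the threshold is arbitrary."""
    return action > math.tanh(1.0)


def steer_noise(rng: random.Random, iterations: int = 5,
                batch_size: int = 512, max_shift: float = 2.0) -> float:
    """Shift the noise mean toward samples that succeed, within a bound."""
    mean = 0.0  # start from the prior's own noise distribution, N(0, 1)
    for _ in range(iterations):
        noises = [rng.gauss(mean, 1.0) for _ in range(batch_size)]
        winners = [w for w in noises if is_success(frozen_prior(w))]
        if winners:
            mean = sum(winners) / len(winners)
        # Keep the noise close to what the prior saw during training.
        mean = max(-max_shift, min(max_shift, mean))
    return mean


def success_rate(mean: float, rng: random.Random, n: int = 2000) -> float:
    """Fraction of successful actions when noise is drawn from N(mean, 1)."""
    hits = sum(is_success(frozen_prior(rng.gauss(mean, 1.0))) for _ in range(n))
    return hits / n


steered_mean = steer_noise(random.Random(0))
baseline = success_rate(0.0, random.Random(1))
steered = success_rate(steered_mean, random.Random(1))
```

Because the prior itself is never updated, every action still comes from the behavior it learned from demonstrations; only the choice of which behavior to sample changes.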
ExpertGen learns policies that depend on privileged state information in simulation, such as accurate object poses, that are not available in the real world. These powerful assumptions significantly speed up learning, but the resulting policies often fail in the real world. We use DAgger to distill the state-based expert policies into visual policies that can then be directly deployed in the real world. During this distillation process, we randomize backgrounds, camera parameters, and lighting conditions to make the resulting policies robust to the visual conditions of the real world.
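For readers unfamiliar with DAgger, the following toy sketch shows its core loop: the student policy drives the rollouts, the expert labels every visited state with the action it would have taken, and the student is refit on the aggregated dataset. The one-dimensional dynamics, the noisy `render` function standing in for a camera, and the linear student are all simplifying assumptions for illustration.

```python
import random
from typing import List, Tuple


def expert_action(true_pos: float) -> float:
    """Expert uses privileged state: the true object position."""
    return -0.5 * true_pos


def render(true_pos: float, rng: random.Random) -> float:
    """Stand-in for a camera: a noisy view of the true state."""
    return true_pos + rng.gauss(0.0, 0.01)


def fit_gain(dataset: List[Tuple[float, float]]) -> float:
    """Least-squares fit of the student model: action = gain * observation."""
    numerator = sum(obs * act for obs, act in dataset)
    denominator = sum(obs * obs for obs, _ in dataset)
    return numerator / denominator


def dagger(rng: random.Random, iterations: int = 4,
           episodes: int = 50, horizon: int = 5) -> float:
    gain = 0.0                                # untrained student does nothing
    dataset: List[Tuple[float, float]] = []   # aggregated across iterations
    for _ in range(iterations):
        for _ in range(episodes):
            pos = rng.uniform(-1.0, 1.0)
            for _ in range(horizon):
                obs = render(pos, rng)
                # The expert labels the state the student actually visited.
                dataset.append((obs, expert_action(pos)))
                # The student, not the expert, drives the rollout.
                pos += gain * obs
        gain = fit_gain(dataset)
    return gain


student_gain = dagger(random.Random(0))
```

Labeling states reached by the student is what distinguishes DAgger from plain behavior cloning: the student receives corrections on its own mistakes rather than only on states the expert would visit.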
We demonstrate zero-shot RGB sim2real without reward engineering. RGB policies are more robust to certain kinds of distractors than point cloud policies and handle certain kinds of occlusions better. Real videos are shown at 1x speed, sim videos at 0.25x speed.
Going beyond data generation
While policies trained by ExpertGen can achieve good zero-shot performance in the real world, changing conditions or imperfectly modeled dynamics can lower the success rates.
Before and after: Policies may fail on unseen objects, but on-robot refinement techniques can be used to improve performance.
In future work, we will continue to scale simulated data generation for training foundation models, while incorporating real-world data to complement simulation’s imperfections. One example of this line of work is improving policies with on-robot refinement. Our next blog post will discuss how these on-robot refinement approaches developed at RAI work in more detail.