Robot performance is no longer just a matter of hardware and control. Instead, it is limited by the robot’s cognitive ability to reason and learn about the world or a specific task. For applications that require performance on a broad set of tasks, the key question is how to quickly and easily add new tasks to the robot’s repertoire. At the RAI Institute, we’re exploring how robots can learn the same way people do – by watching someone perform a task, understanding what is required to complete that task, and then doing the task themselves. We call this approach “Show, Don’t Tell” because we show the robot what to do rather than explaining what actions it should take through language prompts or specifying them explicitly through programming.
Traditional methods for teaching robots new tasks include directly programming with code, physically moving the robot’s arm, and teleoperation where a user interface is used to position the physical robot. These methods generally require the operator to have some level of understanding of the robot’s capabilities and how those can be mapped to the actions needed to perform the task. For example, if the robot’s camera is mounted on its gripper then, unlike a human, the robot has to use its hand to first see an object before grasping it.
Techniques for programming robots can be divided into two main methodologies: scripted and learned. The scripted approach requires a human to exactly define a sequential program or decision tree for every state the robot will encounter, necessitating detailed knowledge of the robot’s abilities and of all objects with which the robot interacts. These approaches result in brittle executions, which fail if anything is unaccounted for. The learned methodologies, in contrast, require less human programming but demand extensive human demonstrations, either through teleoperation or simulation. Small numbers of demonstrations (hundreds) tend to lead to brittle systems when not limited to closed worlds, with robust systems currently requiring thousands or tens of thousands of demonstrations to generalize across objects, environments, and initial configurations.
Most robots are programmed for closed worlds where all objects have been identified in advance. Teaching a robot how to recognize an object it has never seen before adds another layer of difficulty, as the algorithms for recognizing and grasping that object are missing. While learned algorithms can provide more generalization than scripted approaches, robots often struggle with custom parts or unique objects that were not in their original training data. Sometimes this problem can be solved with prompt engineering and large language models, but that requires a human to spend minutes or even hours trying to find the perfect text description to help the robot understand what it’s looking at. The “Show, Don’t Tell” approach – showing a robot how to recognize a novel object through human demonstration – helps solve this challenge without that effort.
Challenges with “Telling”: Understanding Novel Objects
Even very large models struggle to understand novel objects because these items are very different from the bulk of objects used to train the model. While systems like Grounding DINO have been trained on vast datasets and are excellent at finding common objects, even under variations in color and shape (like millions of different mugs), unique items (like parts of assemblies in factories) will not have examples in the training set, making it difficult for the system to discriminate among them. These kinds of optimization systems are incentivized to group items with small label discrepancies into a single category when the number of instances is low.
Leveraging Natural Human Interactions for Object Detection
Our goal is to have a robot detect and interact with objects it has never seen before from a single human demonstration. We use a twofold approach:
- Use the human demonstration to create a dataset of the task-relevant objects, then
- Train a lightweight object detector using that dataset.
In order to train our object detector, we first need a labeled dataset, complete with bounding boxes drawn around all objects of interest.
Usually, these datasets are labeled by hand, making them very time-consuming to create. We’ve found that we can automate the labeling using natural human interaction during a demonstration, which provides a strong signal about what is important. For example, if a human picks up a particular assembled object and places it in a particular tray, the grasp itself is a signal that the object is important. We can exploit this signal as follows (also described in the video below):
- We use existing methods like HOIST-Former to identify every object the human grasps.
- We then use existing tracking methods, such as SAMURAI, to track all of the manipulated objects before and after they are grasped by the human.
- Finally, we post-process the tracking into a labeled dataset where all the objects the human manipulated in the demonstration have bounding boxes drawn around them in every frame in which they are present.
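As a rough illustration of the final post-processing step, the sketch below turns per-frame segmentation masks into bounding-box annotations. It assumes the tracker output has already been collected into a dictionary of binary NumPy masks per object; the function names and the data layout are hypothetical and are not taken from HOIST-Former, SAMURAI, or the actual pipeline code.

```python
import numpy as np


def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) in pixel indices, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # object is not visible in this frame
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))


def tracks_to_dataset(tracks):
    """Convert tracked masks into per-frame box annotations.

    tracks: {object_id: [mask_for_frame_0, mask_for_frame_1, ...]}
            (assumed layout; each mask is a 2D boolean array)
    """
    num_frames = max((len(masks) for masks in tracks.values()), default=0)
    dataset = []
    for frame_idx in range(num_frames):
        boxes = []
        for object_id, masks in tracks.items():
            if frame_idx >= len(masks):
                continue  # this track ended before the current frame
            box = mask_to_box(masks[frame_idx])
            if box is not None:
                boxes.append({"object_id": object_id, "box": box})
        dataset.append({"frame": frame_idx, "boxes": boxes})
    return dataset
```

Frames in which an object is occluded or out of view simply get no box for that object, which matches the idea of labeling each object only in the frames where it is present.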
We then fine-tune a lightweight object detector, such as an F-RCNN, on this dataset, as with any supervised deep network. Both dataset creation and detector training are fast: on a 15-second human video, the entire pipeline – dataset creation and detector training – can be run in under 10 minutes. The result is a task-specific object detector, which we call a “manipulated objects detector,” that can be quickly trained from a single human demonstration to detect all objects with which the human interacted.
Validating “Show, Don’t Tell” with a Robot Sorting Task
To validate the core principle of “Show, Don’t Tell,” we created a complete, end-to-end robotics application simulating a real-world object sorting task:
- We ensure that the sorting task involves novel objects by asking human participants to construct new objects from construction kits.
- We collect a video of a human participant sorting this set of objects.
- The robot uses a large language model (LLM) to parse the sequence of picks and places the participant used in the video.
- The system creates a dataset of all of the manipulated objects in the video and trains a manipulated objects detector.
- The robot copies the demonstrated sorting of these novel objects in a different environment using the trained manipulated objects detector to detect the novel objects.
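One way to picture the last step is as a matching problem between the pick-and-place sequence parsed from the video and the detections produced in the new environment. The sketch below is a minimal illustration under assumptions: the `SortStep` and `Detection` types, the `plan_from_demo` function, and the `min_score` threshold are all hypothetical names introduced here, not part of the system described above.

```python
from dataclasses import dataclass


@dataclass
class SortStep:
    object_label: str  # label given to a manipulated object in the demonstration
    destination: str   # where the human placed it, e.g. a tray name


@dataclass
class Detection:
    label: str
    box: tuple         # (x_min, y_min, x_max, y_max)
    score: float       # detector confidence


def plan_from_demo(steps, detections, min_score=0.5):
    """Pair each demonstrated step with the best-scoring detection of that object."""
    plan = []
    for step in steps:
        candidates = [
            d for d in detections
            if d.label == step.object_label and d.score >= min_score
        ]
        if not candidates:
            plan.append((step, None))  # object not found; the caller decides how to recover
            continue
        plan.append((step, max(candidates, key=lambda d: d.score)))
    return plan
```

Returning `None` for a missing object, rather than raising, keeps the demonstrated order intact so that a later stage can retry detection or skip the step.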
Comparisons with State-of-the-Art: Why Not Use VLMs?
The current state of the art in object detection is large vision-language models (VLMs). In our investigation, however, these VLMs – even those trained specifically for arbitrary object detection – performed poorly on novel objects.
VLMs all require a language prompt describing the object they should detect. We provided 16 human participants with a range of novel objects and asked them to generate these prompts for the VLMs. In 80% of cases, VLMs failed to detect the novel objects on the first prompt the participants tried. Even with five prompt attempts, the VLMs still failed to detect the objects 43% of the time. When humans, who can see the object, understand the task context, and iteratively refine their descriptions, fail to produce a working prompt 43% of the time even after five attempts, it is unreasonable to expect automated prompt generation (e.g., via GPT) to reliably succeed. This experiment suggests a fundamental limitation of prompt-based detection of novel objects.
In contrast, our “Show, Don’t Tell” approach outperformed these state-of-the-art object detection methods, achieving 66% better performance than RexOmni (a slower, more accurate detector) and 150% better performance than Grounding DINO (a faster but less accurate detector) on classical object detection metrics – all while being supervised by a single human demonstration, with no additional annotations.
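The comparison above does not say which metrics were used. As background, many classical detection metrics (such as average precision) are built on intersection over union (IoU) between a predicted box and a ground-truth box. The sketch below shows that building block only; the function name and the box format `(x_min, y_min, x_max, y_max)` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold, so confusing two objects or missing one altogether both lower the score.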
Technical Limitations
Although our “Show, Don’t Tell” approach outperforms existing prompt-based VLM approaches at detecting novel objects, it isn’t perfect. On occasion it confuses objects, and sometimes it misses objects altogether. Like its VLM counterparts, the manipulated objects detector still requires some (though less) human supervision when deployed on a robot. It also currently relies on the human physically interacting with a task-relevant object as supervision, but this could be replaced by tracking human gaze over the objects, a less intensive form of supervision.
Looking Ahead
The “Show, Don’t Tell” methodology, as implemented in our system, has broad applicability across computer vision and robotics. The automated dataset creation could be used for a variety of visual training and recognition techniques: visual prompting, creating a “codebook”-style visual description of an object, or fine-tuning larger vision-language models. The full pipeline could be used in assembly and kitting tasks that involve a large number of novel and sometimes changing components.
We are also looking into how these techniques can improve imitation-learned skills. Traditionally, these skills take in an image from the robot’s sensors and directly output commands for the robot’s end effector with no additional information. But training these skills takes many demonstrations, and they still struggle to generalize to novel environments. If we include segmentations of novel objects or trajectories from the demonstration, can we improve the robustness of these skills?
While language has been, and remains, a powerful modality in vision and robotics, a demonstration can be worth a thousand words.