AVID, a method for robots to learn to imitate human behavior through video
2020-06-15

01.png

This work proposes AVID, a method that allows robots to learn tasks (such as making coffee) directly by watching human behavior

The ability to learn by observing others is one of the hallmarks of intelligence. Humans are particularly good at this and can often learn a task simply by watching someone else perform it. This is because we do not just copy the actions we see; instead, we first imagine ourselves performing the task and then practice it.


Robots, however, cannot yet learn this way from humans or from other robots. Existing imitation learning methods, in which robots learn from task demonstrations, typically assume that the demonstrations can be given directly on the robot, for example through kinesthetic teaching or teleoperation. This assumption limits how robots can be deployed in the real world, where they are often expected to learn new tasks quickly without programmers, robotics experts, or special hardware setups. So, can we make robots learn directly from videos of human demonstrations?


This work proposes AVID, a method for robot imitation learning from human videos that mirrors this human-like combination of imagining and practicing. Given human demonstration videos, AVID first translates them into videos of the robot performing the task using image-to-image translation. To translate human videos into robot videos directly at the pixel level, we use CycleGAN, a recently proposed model that learns an image-to-image mapping between two domains from unpaired images of each domain.


To handle complex multi-stage tasks, we extract instruction images from these translated robot demonstrations, depicting the key stages of the task. These instructions then define a reward function for a model-based reinforcement learning (RL) procedure, which lets the robot practice the task to figure out how to perform it.


The main goal of AVID is to minimize the human burden associated with defining tasks and supervising the robot. Providing rewards through human videos handles task definition, but the learning process itself still carries a labor cost. AVID addresses this by having the robot learn to reset each stage of the task on its own, so that it can practice repeatedly without manual intervention. As a result, the only human involvement required while the robot learns is a few key presses and the occasional manual reset. We show that this approach can solve complex, long-horizon tasks with minimal human involvement, eliminating most of the burden associated with instrumenting the task setup, manually resetting the environment, and supervising the learning process.


Automated visual instruction-following with demonstrations

2.gif

AVID uses CycleGAN to translate human instruction images into corresponding robot instruction images, and uses model-based RL to learn how to complete each instruction

Our method is called automated visual instruction-following with demonstrations (AVID). AVID builds on several key ideas from image-to-image translation and model-based RL, and we discuss each component below.


Translating human videos into robot videos


03.png

4.gif

Top: CycleGAN has been used successfully to translate videos of horses into videos of zebras. Bottom: we apply CycleGAN to translate human demonstration videos into robot demonstration videos


CycleGAN has proven effective in many settings, such as translating videos of horses into videos of zebras frame by frame. We therefore trained a CycleGAN whose two domains are images of humans and images of the robot: as training data, we collected demonstrations from humans along with random motions from both humans and the robot. The resulting CycleGAN can generate fake robot demonstrations from human demonstrations, as shown above.
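
As a rough illustration of the translation component, the sketch below shows the cycle-consistency part of a CycleGAN-style objective on unpaired human and robot frames in PyTorch. The tiny generators, image sizes, and the omission of the adversarial discriminator losses are simplifying assumptions for brevity, not the architecture or training setup used in AVID.

```python
# Minimal sketch of the CycleGAN cycle-consistency objective used to map
# unpaired human frames to robot frames. Architectures and hyperparameters
# here are illustrative assumptions, not the ones from the AVID paper.
import torch
import torch.nn as nn

def make_generator():
    # Tiny stand-in for the ResNet generator commonly used in CycleGAN.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
    )

G_h2r = make_generator()   # human -> robot
G_r2h = make_generator()   # robot -> human
l1 = nn.L1Loss()
opt = torch.optim.Adam(
    list(G_h2r.parameters()) + list(G_r2h.parameters()), lr=2e-4)

def cycle_step(human_batch, robot_batch, lam=10.0):
    """One optimization step on the cycle-consistency terms.
    The adversarial losses from the two discriminators are omitted."""
    fake_robot = G_h2r(human_batch)      # translated "robot" frames
    fake_human = G_r2h(robot_batch)
    rec_human = G_r2h(fake_robot)        # human -> robot -> human
    rec_robot = G_h2r(fake_human)        # robot -> human -> robot
    loss = lam * (l1(rec_human, human_batch) + l1(rec_robot, robot_batch))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random stand-in frames (64x64 RGB, scaled to [-1, 1]):
human = torch.rand(4, 3, 64, 64) * 2 - 1
robot = torch.rand(4, 3, 64, 64) * 2 - 1
print(cycle_step(human, robot))
```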


Although most of the translated robot demonstrations are visually realistic, the translated videos inevitably exhibit artifacts, such as coffee cups appearing and disappearing or the robot gripper detaching from the arm. This makes learning from the full videos ineffective, so we designed an alternative strategy that does not rely on them. Specifically, we extract instruction images that depict the key stages of the task from the translated videos. For example, for the coffee making task shown above, the instructions are to grab the cup, place the cup in the coffee machine, and press the button on top of the machine. By relying only on these specific images rather than the entire video, the learning process is largely unaffected by poor translations.
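
For illustration, the following sketch shows one way the translated demonstrations might be turned into per-stage training data, treating frames near a hand-labeled stage boundary as positives and the remaining frames as negatives. The function, window size, and indices are hypothetical stand-ins, not the paper's pipeline.

```python
# Sketch: turning a translated robot demonstration into per-stage
# (positive, negative) frame sets for the stage success classifiers.
import numpy as np

def stage_classifier_data(video, stage_end_frames, window=5):
    """video: (T, H, W, 3) translated robot demo.
    stage_end_frames: frame index where each stage is judged complete."""
    datasets = []
    T = len(video)
    for t_end in stage_end_frames:
        pos = video[max(0, t_end - window): t_end + 1]          # near-boundary frames
        neg_idx = [t for t in range(T) if abs(t - t_end) > window]
        neg = video[neg_idx]                                     # everything else
        datasets.append((pos, neg))   # one (positives, negatives) pair per stage
    return datasets

demo = np.zeros((300, 64, 64, 3), dtype=np.uint8)   # placeholder translated demo
per_stage = stage_classifier_data(demo, [90, 180, 290])
print([(p.shape[0], n.shape[0]) for p, n in per_stage])
```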


 

 

Completing instructions through planning


The instruction images extracted from the demonstrations divide the overall task into several stages, and AVID uses a model-based planning algorithm to attempt each stage of the task. Specifically, using the robot data collected for CycleGAN training and the translated instructions, we learn a dynamics model and a set of instruction classifiers that predict when each instruction has been successfully completed. When attempting stage s, the algorithm samples actions, uses the dynamics model to predict the resulting states, and selects the action that the classifier for stage s predicts has the greatest chance of success. The algorithm repeats this action selection for a set number of time steps, or until the classifier signals success, i.e., until the robot believes it has completed the current stage.
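
The following is a minimal sketch of this stage-wise planning loop. The dynamics model and stage classifier are passed in as generic callables, and the random-shooting action sampler, horizon, and success threshold are illustrative assumptions rather than the exact planner used in AVID.

```python
# Sketch of the stage-wise planner described above: sample candidate actions,
# predict outcomes with the learned dynamics model, and pick the action the
# current stage's success classifier scores highest.
import numpy as np

def plan_stage(state, dynamics, classifier, action_dim,
               n_candidates=128, horizon=50, success_thresh=0.9, rng=None):
    rng = rng or np.random.default_rng(0)
    for _ in range(horizon):
        candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, action_dim))
        # Predict the next state for every candidate action and score it.
        next_states = np.stack([dynamics(state, a) for a in candidates])
        scores = np.array([classifier(s) for s in next_states])
        best = int(np.argmax(scores))
        state = next_states[best]          # "execute" the chosen action
        if classifier(state) > success_thresh:
            return state, True             # robot believes the stage is done
    return state, False

# Toy usage with stand-in models: dynamics adds a scaled action to the state,
# and the classifier rewards a large first coordinate.
dyn = lambda s, a: s + 0.1 * a
clf = lambda s: 1.0 / (1.0 + np.exp(-5 * (s[0] - 1.0)))
final, done = plan_stage(np.zeros(4), dyn, clf, action_dim=4)
print(done, final[:2])
```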


5.gif

We use a structured latent variable model, similar to the SLAC model, to learn a state representation from image observations and robot actions

 

Previous work has shown that training structured latent variable models is an effective strategy for learning in image-based domains. At a high level, we want the robot to extract from its visual input a state representation that is low-dimensional and easier to learn from than raw image pixels. We do this with a model similar to the SLAC model, which introduces a latent state, factorized into two parts, that evolves according to a learned dynamics model and generates images through a learned neural network decoder. Given an image observation, the robot encodes the image into the latent state with another neural network and then operates on this state representation rather than on pixels.
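
As a rough sketch of these ingredients, the code below wires together an encoder from images to a latent state, a latent dynamics model conditioned on the action, and a decoder back to pixels. For brevity it uses a deterministic, single-part latent state and tiny fully connected networks, unlike the stochastic, factorized SLAC-style model described above; dimensions are illustrative assumptions.

```python
# Simplified (deterministic) sketch of a latent state model: encoder,
# latent dynamics, and image decoder. Not the architecture from the paper.
import torch
import torch.nn as nn

class LatentModel(nn.Module):
    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(          # image -> latent state
            nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.dynamics = nn.Sequential(         # (latent, action) -> next latent
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(          # latent -> reconstructed image
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * 64 * 64),
        )

    def forward(self, image, action):
        z = self.encoder(image)
        z_next = self.dynamics(torch.cat([z, action], dim=-1))
        recon = self.decoder(z).view(-1, 3, 64, 64)
        return z, z_next, recon

model = LatentModel()
img = torch.rand(2, 3, 64, 64)
act = torch.rand(2, 4)
z, z_next, recon = model(img, act)
print(z.shape, z_next.shape, recon.shape)
```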

 


Training through model-based reinforcement learning

06.png

AVID uses model-based planning to complete instructions, queries a human when the classifier signals success, and automatically resets when an instruction is not fulfilled.

 

 

During training, the robot attempts each stage in turn; when the classifier signals success, the robot queries a human, who presses a key to confirm or reject, and when an instruction is not actually fulfilled the robot resets and tries again. By having the robot attempt these resets automatically, we reduce the burden of manually resetting the environment, since manual resets are only needed when something goes wrong, such as a cup falling over. For most of training, the human only needs to provide key presses, which is far simpler and less demanding than physical intervention. In addition, stage-wise resets and retries let the robot practice the difficult phases of the task, focusing the learning process and strengthening the robot's behavior. As shown in the next section, AVID can solve complex multi-stage tasks on a real Sawyer robot arm directly from human demonstration videos with minimal human supervision.
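
To make this loop concrete, here is a minimal, hypothetical sketch of such a stage-wise practice loop. The callables (attempt_stage, attempt_reset, human_confirms) and the exact reset policy of stepping back one stage on failure are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the stage-wise practice loop with automatic resets:
# attempt the current stage with model-based planning, query the operator for
# a key press when the classifier signals success, and reset otherwise.
import random

def practice(num_stages, attempt_stage, attempt_reset, human_confirms,
             episodes=100):
    stage = 0
    for _ in range(episodes):
        classifier_success = attempt_stage(stage)          # model-based planning
        if classifier_success and human_confirms(stage):   # single operator key press
            stage = min(stage + 1, num_stages - 1)          # move on to the next stage
        else:
            attempt_reset(stage)                            # robot undoes its own progress
            stage = max(stage - 1, 0)                       # and practices the earlier stage
    return stage

# Toy usage: a "robot" that succeeds at random and an operator who always confirms.
final_stage = practice(
    num_stages=3,
    attempt_stage=lambda s: random.random() < 0.5,
    attempt_reset=lambda s: None,
    human_confirms=lambda s: True,
)
print("finished practicing at stage", final_stage)
```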

 

Experiments


image.png

We show that AVID can learn multi-stage tasks on a real Sawyer robotic arm, including operating a coffee machine and retrieving a cup from a drawer


As mentioned above, we run experiments on a Sawyer robotic arm, a 7-degree-of-freedom manipulator, with the tasks of operating a coffee machine and retrieving a cup from a closed drawer. On both tasks, we compare against Time-Contrastive Networks (TCN), an existing method for learning robotic skills from human demonstrations. We also ablate our method to learn from the full demonstrations, which we call "imitation ablation", and to operate directly at the pixel level, which we call "pixel-space ablation". Finally, in the setting where demonstrations can be given directly on the robot (the assumption made in most previous imitation learning work), we compare against behavioral cloning from observation (BCO) and standard behavioral cloning. For additional experimental details, such as hyperparameters and data collection, please refer to the paper.

 

 

Task setup


09.png

Instruction images given by the human (top) and translated into the robot domain (bottom) for the coffee making (left) and cup retrieval (right) tasks.

As mentioned above, we specify three stages for the coffee making task. Starting from the initial state on the left, they are: pick up the cup, place the cup in the machine, and press the button on top of the machine. We used a total of 30 human demonstrations for this task, amounting to about 20 minutes of human time. Cup retrieval is a more complex task, for which we specify five stages. Starting from the initial state, the instructions are: grab the drawer handle, open the drawer, move the arm up and away, grab the cup, and place the cup on top of the drawer. The intermediate stage of moving the arm away is important so that the robot does not bump the handle and accidentally close the drawer. This highlights another advantage of AVID, since specifying this extra instruction is as simple as marking one more time step in the human video. For cup retrieval we used 20 human demonstrations, again totaling about 20 minutes of human time.
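
Purely for illustration, the stage definitions above could be written down as a simple configuration like the following; the names are paraphrased from the text and nothing here is the paper's actual code.

```python
# Hypothetical stage configuration for the two tasks described above.
TASK_STAGES = {
    "coffee_making": [
        "pick up the cup",
        "place the cup in the coffee machine",
        "press the button on top of the machine",
    ],
    "cup_retrieval": [
        "grab the drawer handle",
        "open the drawer",
        "move the arm up and away",
        "grab the cup",
        "place the cup on top of the drawer",
    ],
}

for task, stages in TASK_STAGES.items():
    print(f"{task}: {len(stages)} stages")
```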

Results


010.png

On the tasks we consider, AVID significantly outperforms the ablations and prior methods that learn from human demonstrations, and sometimes even outperforms baselines that use real demonstrations given directly on the robot.

Click the link to watch the video: https://mp.weixin.qq.com/s/ATvd-UUJyywyCfyDFpbGmg


The table above and the video summarize the results of running AVID and the comparison methods on the coffee making and cup retrieval tasks. AVID performs strongly, successfully completing all stages of both tasks most of the time and performing nearly perfectly on the early stages. As shown in the video, AVID consistently makes use of its automatic reset and retry capability during training and final evaluation, and failures usually correspond to small but unrecoverable errors, such as tipping the cup over. AVID also performs much better than the imitation and pixel-space ablations, demonstrating the benefits of stage-wise training and of learning a latent variable model. Finally, TCN can learn the early stages of cup retrieval but otherwise generally does not succeed.


We also evaluate two methods that assume access to real robot demonstrations, which AVID does not require. First, BCO uses only the image observations from the demonstrations, and its performance drops sharply in the later stages of each task. This highlights the difficulty of learning temporally extended tasks directly from full demonstrations. Finally, we compare against behavioral cloning using robot observations and actions, and note that this is the strongest baseline because it uses the most privileged information of all the comparisons. Even so, we find that AVID still outperforms behavioral cloning on cup retrieval, which is likely due to AVID's explicit stage-wise training.



Related work

 

As mentioned above, most work on imitation learning assumes that demonstrations are given directly on the robot, rather than learned from human videos. However, learning from human videos has also been studied through a variety of approaches, such as pose and object detection, predictive models, context translation, learned reward representations, and meta-learning. The main difference between these methods and AVID is that AVID directly translates the human demonstration videos at the pixel level in order to explicitly handle the change in embodiment.


In addition, we evaluate on complex multi-stage tasks, and AVID's ability to solve them comes in part from its explicitly staged training, in which the robot also learns to reset each stage. Prior work in RL has also studied learning to reset, similarly showing that doing so enables learning multi-stage tasks, reduces the burden on human operators, and reduces the need for manual resets. AVID combines the ideas of reset learning, image-to-image translation, and model-based RL so that temporally extended tasks can be learned directly from image observations in the real world with only a small number of human demonstrations.


 

Future work

 

The most exciting direction for future work is to extend the capabilities of a general-purpose CycleGAN so that a variety of tasks can be learned efficiently from just a few human videos. Imagine a CycleGAN trained on a large dataset of kitchen interactions, including coffee machines, multiple drawers, and many other objects. If this CycleGAN could reliably translate human demonstrations involving any of these objects, it would open up the possibility of a general-purpose kitchen robot that can quickly perform any such task from observation and a small amount of practice. Pursuing this line of research is a promising path toward capable and useful robots that truly learn by observing humans.


 

This article is based on the following paper:


Smith L, Dhawan N, Zhang M, et al. AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos[J]. arXiv preprint arXiv:1912.04443, 2019.


