Artificial intelligence... or advanced imitation? How DeepMind used YouTube vids to train game-beating Atari bot

A screenshot of the agent playing Montezuma's Revenge. Image Credit: Tobias Pfaff

Video DeepMind has taught artificially intelligent programs to play classic Atari computer games by making them watch YouTube videos.

Typically, for this sort of research, you'd use a technique called reinforcement learning. This is a popular approach in machine learning that trains bots to perform a specific task, such as playing computer games, by tempting them with lots of little rewards.

To do this, developers have to build algorithms and models that can figure out the state of the game’s environment, identify the rewards to obtain, and then go get 'em. By seeking out these prizes, the bots should gradually progress through the game world, step by step. The goodies should come thick and fast to continuously lure the AI through levels.

But a new method, developed by DeepMind eggheads and documented in a paper this week, teaches code to play classic Atari titles, such as Montezuma’s Revenge, Pitfall, and Private Eye, without any explicit environmental rewards. Instead, an agent is asked to copy the way humans tackle the games, by analyzing YouTube footage of their play-through sessions.

Exploration games like 1984's Montezuma’s Revenge are particularly difficult for AI to crack, because it's not obvious where you should go, which items you need and in which order, and where you should use them. That makes defining rewards difficult without spelling out exactly how to play the thing, and thus defeating the point of the exercise.

For example, Montezuma’s Revenge requires the agent to direct a cowboy-hat-wearing character, known as Panama Joe, through a series of rooms and scenarios to reach a treasure chamber in a temple, where all the goodies are hidden. Pocketing a golden key, your first crucial item, takes about 100 steps, and is equivalent to 100^18 possible action sequences. That’s way too big for typical reinforcement learning algorithms to cope with – there are too many sequential steps for a neural network to internalize just to obtain a single specific reward.

These sorts of rewards are therefore described as sparse: each of the steps involved to obtain the reward appears to achieve very little, and there is little in the way of an immediate bounty to guide the bot, even though together the steps would lead the player to a goal. Games like Ms Pac-Man are the opposite, and provide software agents with near instant feedback: points are racked up as she guzzles pellets and fruit, and she is punished when she gets caught by ghosts. Sparse games – such as Montezuma’s Revenge and other puzzle adventures – require agents to have much more patience than reinforcement learning usually affords.

Imitation learning

One way to get around the sparse rewards problem is to directly learn from demonstrations. After all, it's how you and I learn things, too. “People learn many tasks, from knitting to dancing to playing games, by watching videos online,” the DeepMind team wrote in their paper's abstract.

“They demonstrate a remarkable ability to transfer knowledge from the online demonstrations to the task at hand, despite huge gaps in timing, visual appearance, sensing modalities, and body differences. This rich setup with abundant unlabeled data motivates a research agenda in AI, which could result in significant progress in third-person imitation, self-supervised learning, reinforcement learning (RL) and related areas.”

To educate their code, the researchers chose three YouTube gameplay videos for each of the three titles: Montezuma’s Revenge, Pitfall, and Private Eye. Each game had its own agent, which had to map the actions and features of the title into a form it could understand. The team used two methods: temporal distance classification (TDC), and cross-modal temporal distance classification (CDC).

TDC taught an agent to predict the temporal distance between two frames: how far apart in time they occur in a video. In doing so, it learned to spot which visual features have changed between two video frames in the game, and what actions were taken in between. To generate training data, pairs of frames were chosen randomly from a given YouTube video of the game.
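For the curious, here's a minimal sketch of how such training pairs might be generated. It isn't DeepMind's code: the distance buckets and the function names bucket_label and sample_training_pair are assumptions made purely for illustration. Turning the frame gap into a small number of classes is what makes this a classification task rather than a regression.

```python
import random

# Hypothetical distance buckets, in frames. These ranges are an assumption
# for illustration; the article does not say which intervals were used.
BUCKETS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 20), (21, 200)]


def bucket_label(distance):
    """Return the index of the bucket that contains `distance`."""
    for label, (low, high) in enumerate(BUCKETS):
        if low <= distance <= high:
            return label
    raise ValueError(f"distance {distance} is outside every bucket")


def sample_training_pair(num_frames, rng=random):
    """Pick two frame indices from one video, plus the class label for their gap.

    Sampling the bucket first keeps the classes balanced, which is a design
    choice of this sketch rather than something the article states.
    """
    label = rng.randrange(len(BUCKETS))
    low, high = BUCKETS[label]
    high = min(high, num_frames - 1)  # the gap cannot exceed the video length
    if low > high:
        raise ValueError("video is too short for the sampled bucket")
    distance = rng.randint(low, high)
    first = rng.randrange(num_frames - distance)
    return first, first + distance, label
```

A classifier fed the two frames would then be trained to output the label, forcing it to learn which on-screen changes signal the passage of game time.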

CDC is clever as it tracks sounds. The noises in the game correlate to actions, such as jumping or collecting items, and so it mapped these sounds to important game events. After these visual and audio features were extracted and embedded using neural networks, an agent could begin copying how humans played the game.

Here's the agent in action in Montezuma's Revenge. You can also see more footage of the computer software, trained to play Pitfall and Private Eye, here.

Youtube Video

The DeepMind code still relies on lots of small rewards, of a kind, although they are referred to as checkpoints. While playing the game, every sixteenth video frame of the agent's session is taken as a snapshot and compared to a frame in a fourth video of a human playing the same game. If the agent’s game frame is close to or matches the one in the human's video, it is rewarded. Over time, it imitates the way the game is played in the videos by carrying out a similar sequence of moves to match the checkpoint frame.
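The checkpoint idea can be sketched in a few lines, going by the description above rather than the paper's actual implementation. Everything specific here is an assumption: frames are taken to be already converted into embedding vectors, closeness is measured with cosine similarity, and the threshold and bonus values are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def imitation_reward(step, agent_embedding, checkpoints, next_index,
                     stride=16, threshold=0.9, bonus=1.0):
    """Return (reward, updated_next_index) for one agent step.

    `checkpoints` is a list of embeddings taken from the human's video.
    Only every `stride`-th agent frame is compared, and only against the
    next checkpoint that has not yet been reached.
    """
    if step % stride != 0 or next_index >= len(checkpoints):
        return 0.0, next_index
    if cosine_similarity(agent_embedding, checkpoints[next_index]) >= threshold:
        return bonus, next_index + 1  # matched: move on to the following checkpoint
    return 0.0, next_index
```

Because a checkpoint only pays out once and in order, the agent is nudged to follow the human's route through the game instead of hovering around one rewarding spot.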

It’s a nifty trick, and the agent does reach pretty decent scores on all three games – exceeding average human players and other RL algorithms: Rainbow, ApeX, and DQfD. Crucially, it is learning to copy a person's actions, rather than master a game all by itself. It is seemingly reliant on having a good human trainer, just like we relied on good teachers at school.


A table of the results for the AI agent playing the Atari games against average human scores and other RL algorithms. Image credit: Aytar et al.

Although impressive, it’s unknown how practical this all is. Can it be used for anything other than Atari games? The research is also probably pretty difficult to replicate. What hardware did the researchers use? How long did it take to train the agents? The paper doesn’t say. We asked DeepMind, and it declined to comment. ®



