By Adrien Gaidon

A few months ago, a computer program built by Google’s DeepMind artificial intelligence lab challenged the world champion at Go, an ancient board game more complex than chess – and it won. The software had already beaten a European Go champion earlier in the year, signaling an advance in artificial intelligence that many in the research community believed was years away.

Adrien Gaidon recently presented his team’s findings at the Computer Vision and Pattern Recognition conference. He is organizing the workshop on Virtual/Augmented Reality for Visual Artificial Intelligence in conjunction with the European Conference on Computer Vision from October 8 – 16. Adrien is a research scientist at the Xerox Research Centre Europe.

Software platforms capable of interpreting data and making decisions like humans do could be used in all sorts of settings – healthcare, transportation, finance, marketing and many more. The race is on between the likes of Google, Facebook and other top artificial intelligence research labs around the world.

In our labs at Xerox, we’re combining deep learning techniques with video game engines to teach computers how to “see” the world around them and interpret what’s going on. Last year we taught software how to successfully track objects and people in a city to predict traffic conditions.

Computer vision algorithms could one day turn a video camera into a means for humans to significantly extend their point of view and increase their abilities, attention and focus. If you think about it, your attention is quickly becoming one of the most valuable human assets. In a hospital, for example, an intelligent camera could recognize a patient getting uncomfortable and warn a nurse.

A radical way to teach computer vision

However, to teach a computer to do these kinds of things you need to feed it data. There are currently two main approaches to doing that. The first is “supervised” machine learning, meaning humans feed machines labeled data or examples to learn from. The other approach, spearheaded by Google and Facebook researchers, is unsupervised learning, where there are no labels and the machine has to observe the raw data and discover new strategies for solving problems. Although computers may one day be able to learn from raw data without supervision, what works best right now is supervised learning.
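To make the difference concrete, here is a minimal sketch in Python. It is a toy illustration, not the software used in our labs: the numbers, the "walking" and "running" labels and the function names are all made up for the example. The supervised learner is told what each example is; the unsupervised one (a tiny version of the k-means clustering method) only sees the raw numbers and has to find the groups on its own.

```python
from statistics import mean

# Hypothetical toy data: one number per video clip, e.g. an object's speed.
samples = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
# Human-provided labels, used only by the supervised learner.
labels = ["walking", "walking", "walking", "running", "running", "running"]


def fit_supervised(xs, ys):
    """Learn one average value (centroid) per human-provided label."""
    return {
        label: mean(x for x, y in zip(xs, ys) if y == label)
        for label in set(ys)
    }


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def fit_unsupervised(xs, iterations=10):
    """Two-cluster k-means: find two group centers without any labels."""
    centers = [min(xs), max(xs)]  # simple starting guess
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            nearest = min(range(2), key=lambda i: abs(centers[i] - x))
            groups[nearest].append(x)
        # Move each center to the average of its group (keep it if empty).
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


centroids = fit_supervised(samples, labels)
print(predict(centroids, 4.5))    # the supervised model can name the group
print(fit_unsupervised(samples))  # the unsupervised model finds groups, but no names
```

Both learners end up with roughly the same two groups, but only the supervised one can say which group is which, because only it was given the labels.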

Our research idea was simple – to use game engines to generate realistic labeled video data that is then used as training examples in supervised learning.

This idea radically changes the way we can teach computer vision algorithms to understand video, and it overcomes two problems. The first is that most real-world video data is not neatly and clearly labeled.

Today, if you want to teach a computer how to recognize a person falling down, you have to find thousands of examples and feed them into the program. With video game engines, you can simply create the teaching data.

Secondly, supervised learning algorithms are generally ‘black boxes’ that don’t take into account the laws of physics. A video camera cannot know that a baseball hitting a window will cause the glass to shatter and fall on the ground. So, we asked a question: Can virtual worlds provide synthetic data that could then be used to train a computer to see and interpret events in the real world?

Video game engines can’t see the world, but they can build virtual worlds with complex systems of physical laws so that actions have realistic consequences. A car that runs a red light will affect the movement of other cars at the intersection, for example.

The advantage of synthetic data

At the Xerox Research Centre Europe (XRCE) we’ve built virtual worlds with game engines to teach our computer vision software how to track objects.¹

These worlds account for real-world parameters such as realistic lighting, weather changes, material properties and even gravity. In this way we can create synthetic data that is realistic but also free from real-world concerns such as privacy. And we can test our computer vision algorithms without real-world consequences – like causing a traffic accident.
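The reason synthetic data comes already labeled is that the simulator knows what it simulated. The short Python sketch below illustrates that idea only; it is not our game-engine pipeline. It assumes a deliberately crude model in which a "falling" person's head drops in free fall under gravity, and every name and number in it (`simulate_clip`, the 1.7 m starting height, the noise level) is hypothetical.

```python
import random

GRAVITY = 9.81  # metres per second squared


def simulate_clip(falls, rng, n_frames=30, fps=30.0, start_height=1.7):
    """Simulate one clip: head height per frame, plus its ground-truth label.

    Crude assumption for illustration: a fall is modeled as free fall.
    """
    heights = []
    for frame in range(n_frames):
        t = frame / fps
        drop = 0.5 * GRAVITY * t * t if falls else 0.0
        noise = rng.gauss(0.0, 0.01)  # small simulated measurement noise
        heights.append(max(0.0, start_height - drop + noise))
    # The label costs nothing: the simulator decided what happened.
    label = "falling" if falls else "standing"
    return heights, label


def generate_dataset(n_clips, seed=0):
    """Create n_clips labeled clips; about half of them contain a fall."""
    rng = random.Random(seed)  # fixed seed makes the data reproducible
    return [simulate_clip(rng.random() < 0.5, rng) for _ in range(n_clips)]


dataset = generate_dataset(1000)
n_falls = sum(1 for _, label in dataset if label == "falling")
print(f"{len(dataset)} labeled clips, {n_falls} of them falls")
```

Nobody had to watch or annotate a single clip, and no real person was filmed. A game engine does the same thing at a far richer level, rendering full video frames with lighting, weather and materials instead of a single number per frame.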


1 — “Virtual Worlds as Proxy for Multi-Object Tracking Analysis” by Adrien Gaidon, Eleonora Vig and Qiao Wang, CVPR 2016, June 26 – July 1, 2016, Las Vegas, USA.