[Hearing from an AI Expert – 3] Vision is About Understanding the World

Sven Dickinson, Head of Samsung’s Toronto AI Center


Can you imagine a world where the personal AI assistant on your smartphone understands as much about the world as you do? What about a scenario where communicating with that AI assistant is as natural and easy as interacting with another human? Developing those kinds of capabilities is exactly what the team at Samsung’s AI Center in Toronto is putting its mind to.


Samsung Newsroom sat down with Sven Dickinson, Head of Samsung’s Toronto AI Center, to learn more about these exciting fields and what they could mean for the future.


The Vision for Vision

The second Samsung AI center established in North America, Samsung’s Toronto AI Center is led by Dr. Sven Dickinson, an expert in computer vision and former chair of the Department of Computer Science at the University of Toronto.


At the heart of AI research and development, Samsung’s Toronto AI Center is mainly focused on developing the visual understanding capabilities that allow a Samsung device to understand the world in which it is situated. In addition, the team is working on multimodal interactions, which are user-machine interactions that combine vision, language and knowledge.


“Allowing Samsung devices to ‘see the world’ through computer vision enables them to ‘visually ground’ their dialog with the user, providing an integrated, multimodal experience that’s far more natural than one that’s solely vision- or dialog-based,” says Dickinson, whose expertise includes problems of shape perception and object recognition.


Touching on the benefits of multimodal technology, Dickinson says, “I should not have to read manuals to figure out which buttons to push on my device and in which order. Rather, I should be able to show my device what I want, and tell it what I want, in natural language that is understandable, and situated in the world that I live in.”


Expanding on the interplay between computer vision and multimodal input, he goes on to say, “To achieve this breadth of comprehension, the device has to have a model of my understanding of the world, the capacity to communicate robustly and naturally with me, and the ability to see and understand the same world that I see.”


Remarking on applications for this technology, Dickinson identifies the most compelling as “a personal assistant that you not only speak to, but that sees the world the same way that you do.” Speaking to the importance of multimodal device interactions, Dickinson points out how severely losing one mode of communication (audio, speech, sight, etc.) would hamper communication between two people, and says the same applies to personal devices.


A Truly Enhanced User Experience Is Key

At the 2019 Consumer Electronics Show (CES), Samsung unveiled its vision for Connected Living, which involves connecting the 500 million devices the company sells every year and making them intelligent. Dickinson highlights that Samsung’s broad product portfolio will be instrumental in fulfilling this vision, saying, “What differentiates Samsung is that it makes a multitude of devices in the home, including digital appliances, TVs, and mobile phones. Samsung has a unique opportunity to leverage these devices to yield a multi-device experience which follows the user from one device to another, and one room to another. This will help realize the full potential of each device to effectively communicate, to help the user execute device-specific tasks, and to learn the user’s habits and preferences so that subsequent communication is not intrusive but instead ‘always helpful.’”


Speaking about what his center will need to do to truly realize computer vision and multimodal interaction, Dickinson comments, “Vision is not about understanding images; vision is about understanding the world. Truly capable AI systems must possess an understanding of our world, of its physics and causality, of its geometry and dynamics. They must also be able to model and understand human behavior.” He expands on this, pointing out, “If our devices can see the 3D world that we live in the same way as we do, i.e., understand the 3D shapes, positions and identities of objects in our shared environment, then our devices can visually experience the world as we do. Such a shared visual context will be crucial in developing fully realized personal assistants.”


Dickinson says that Samsung is leading the charge toward truly intelligent visual understanding, and identifies ‘visual grounding’ as an essential prerequisite for well-rounded visual understanding capabilities. “Samsung is leading the way when it comes to developing human-device interaction that closely mimics human-human interaction,” Dickinson says. “We aim to provide visual grounding and knowledge representation scaffolding for dialog-based interaction services. Without these components in place, users become disappointed with services, and quickly tune out.”


Human-device Interactions Based on Open Information Sharing

Dickinson goes on to explain that AI also needs to be able to explain itself to the user. When a device fails to carry out a task or provide an appropriate response, he remarks, “A device should be able to reflect to the user precisely how and why it came up with that response (or lack thereof). Ideally, it should be able to follow up with the user by asking a question or asking the user to adjust its camera or other input modes so that it can gather more information and formulate an appropriate response.” Dickinson relates that this kind of openness and information sharing will be key to making human-device interactions more sophisticated, noting, “What we call the domain of ‘active dialog and active vision’ is where the system can construct a mental model of what the user understands, and can, in turn, open up its own mental model so that the user can understand the thought processes of the device.”


The Benefits of Being Based in Toronto

Asked how being based in Toronto affects the AI center, Dickinson remarks that it benefits greatly from its proximity to several world-class AI-related institutions, including the University of Toronto, York University and Ryerson University. “Being in Toronto offers us a tremendous regional advantage,” Dickinson comments. “We are across the street from the University of Toronto, home to the Department of Computer Science (DCS), which is one of the top 10 computer science departments internationally. Over half the members of our AI Center are either active faculty, graduates or current students at DCS.”


On the topic of collaboration between Samsung’s global AI centers, Dickinson relates, “The seven global AI centers are working to create industry-leading solutions in their respective areas of focus, while coordinating to achieve the common goal of realizing Samsung’s ultimate AI vision.” Touching on the Toronto AI Center’s work with centers further afield, he adds, “We are starting to explore possible research collaborations with other global AI centers, and hope to converge on some use cases of value to Samsung and its products and services.”