Understanding Social Robotics

Pepper, Jibo and Milo make up the first generation of social robots, leading what promises to be a cohort with diverse capabilities and applications in the future. But what are social robots and what should they be able to do? This article gives an overview of theories that can help us understand social robotics better.

What is a social robot?

dreamstime_xs_72413274

I most like the definition which describes social robots as robots for which social interaction plays a key role. So these skills should be needed by the robot to enable them to perform some kind of function. A survey of socially interactive robots (5) defines some key characteristics which summarise this group very well. A social robot should show emotions, have capabilities to converse on an advanced level, understand the mental models of their social partners, form social relationships, make use of natural communication cues, show personality and learn social capabilities.

Understanding Social Robots (1) offers another interesting perspective of what a social robot is:

Social robot = robot + social interface

In this definition, the robot has its own purpose outside of the social aspect. Examples of this can be care robots, cleaning robots in our homes, service desk robots at an airport or mall information desk or chef robots in a cafeteria. The social interface is simply a kind of familiar protocol which makes it easy for us to communicate effectively with the robot. Social cues can give us insight into the intention of a robot, for example shifting gaze towards a mop gives a clue that the robot is about to change activity, although they might not have eyes in the classical sense.

These indicators of social capability can be as useful as actual social ability and drivers in the robot. As studies show, children are able to project social capabilities onto simple inanimate objects like a calculator. A puppet becomes an animated social partner during play. In the same way, robots only have to have the appearance of sociability to be effective communicators. An Ethical Evaluation of Human–Robot Relationships confirms this idea. We have a need to belong, which causes us to form emotional connections to artificial beings and to search for meaning in these relationships. 

How should social robots look?

Masahiro Mori defined the Uncanny Valley theory in 1970, in a paper on the subject. He describes the effects of robot appearance and robot movement on our affinity to the robot. In general, we seem to prefer robots to look more like humans and less like robots. There is a certain point at which robots look both human-like and robot-like, and it becomes confusing for us to categorise them. This is the Uncanny Valley – where robot appearance looks very human but also looks a bit ‘wrong’, which makes us uncomfortable. If robot appearance gets past that point, and looks more human, likeability goes up dramatically.

In Navigating a social world with robot partners: A quantitative cartography of the Uncanny Valley (2) we learn that there is a similar effect between robot appearance and trustworthiness of a robot. Robots that showed more positive emotions were also more likeable. So it seems like more human looking robots would lead to more trust and likeability.

Up to this point we assume  that social robots should be humanoid or robotic. But what other forms can robots take? The robot should at least have a face (1) to give it an identity and make it into an individual. Further, with a face, you can indicate attention and imitate the social partner to improve communication. Most non verbal cues are relayed through the face, and creates expectation of how to engage with the robot.

dreamstime_xs_64957796The appearance of a robot can help set people’s expectations of what they should be capable of, and limit those expectations to some focused functions which can be more easily achieved. For example, a bartender robot can be expected to be able to have a good conversation and serve drinks, take payment, but probably it’s ok if it can only speak one language, as it only has to fit the context it’s in (1).

In Why Every Robot at CES Looks Alike, we learn that Jibo’s oversized, round head is designed to mimic the proportions of a young animal or human to make it more endearing. It has one eye to prevent it from triggering the Uncanny Valley effect by looking too robotic and human at the same time. Also, appearing too human-like creates the impression that the robot will respond like a human, while they are not yet capable.

Another interesting example is of Robin, a Nao robot being used to teach children with diabetes how to manage their illness (6). The explanation given to the children is that Robin is a toddler. The children use this role to explain any imperfections in Robin’s speech capabilities.

Different levels of social interaction for robots

A survey of socially interactive robots (5) contains some useful concepts in defining levels of social behaviour in robots:

  • Socially evocative: Do not show any social capabilities but rely on human tendency to project social capabilities.
  • Social interface: Mimic social norms, without actually being driven by them.
  • Socially receptive: Understand social input enough to learn by imitation but do not see social contact.
  • Sociable: Have social drivers and seek social contact.
  • Socially situated: Can function in a social environment and can distinguish between social and non-social entities.
  • Socially embedded: Are aware of social norms and patterns
  • Socially intelligent: Show human levels of social understanding and awareness based on models of human cognition.

Clearly, social behaviour is nuanced and complex. But to come back to the previous points, social robots can still make themselves effective without reaching the highest levels of social accomplishment.

Effect of social robots on us

To close, de Graaf poses a thought-provoking question (4):

how will we share our world with these new social technologies and how will a future robot society change who we are, how we act and interact—not only with robots but also with each other?

It seems that we will first and foremost shape robots by our own human social patterns and needs. But we cannot help but be changed as individuals and a society when we finally add a more sophisticated layer of robotic social partners in the future.

References

  1. Understanding Social Robots (Hegel, Muhl, Wrede, Hielscher-Fastabend, Sagerer, 2009)
  2. Navigating a social world with robot partners: A quantitative cartography of the Uncanny Valley’ (Mathur and Reichling, 2015)
  3. The Uncanny Valley (Mori, 1970)
  4. An Ethical Evaluation of Human–Robot Relationships (de Graaf, 2016)
  5. A survey of socially interactive robots (Fong, Nourbakhsh, Dautenhahn, 2003)
  6. Making New “New AI” Friends: Designing a Social Robot for Diabetic Children from an Embodied AI Perspective (Cañamero, Lewis, 2016)

Nao Robot with Microsoft Computer Vision API

Lately, I’ve been experimenting with integrating an Aldebaran Nao robot with an artificial intelligence API.

While writing my previous blog post on artificial intelligence APIs, I realised there were way too many API options out there to try out casually. I did want to start getting some hands-on experience with the API’s myself, so I had to find a project.

Pep the humanoid robot from Aldebaran

My boyfriend, Renze de Vries and I,  were both captivated by the Nao humanoid robots during conferences and meetups but found the price of buying one ourselves prohibitive. He already had a few robots of his own – the Lego mindstorms robot and the Robotis bioloid robot which we named Max – he has written about his projects here. Eventually we crossed the threshold and bought our very own Nao robot together from http://www.generationrobots.com/  – we call him Peppy. Integrating an AI API into Peppy seemed like a good project for me to get familiar with what the AI API’s can do with real life input.

DSC02814

Peppy the Nao robot from Aldebaran

Nao API

The first challenge was to get Pep to produce an image that could be processed. Pep has a bunch of sensors including those for determining position and temperature of his joints, touch sensors in his head and hands, bumper sensors in his feet, a gyroscope, sonar, microphones, infrared sensors, and two video cameras.

The Nao API, Naoqi, contains modules for motion, audio, vision, people and object recognition, sensors and tracking. In the vision module you have the option to take a picture or grab video. The video aspect seemed overly complicated for this small POC so I went with the ALPhotoCapture class – java docs here. This api saves pictures from the camera to the local storage on the robot, and if you want to process them externally, you have to connect to Pep’s filesystem and download them.

ALPhotoCapture photocapture = new ALPhotoCapture(session);
photocapture.setResolution(2);
photocapture.setPictureFormat("jpg");
photocapture.takePictures(1, "/home/nao/recordings/cameras/", "pepimage", true);

The Nao’s run a Linux Gentoo version called OpenNAO. They can be reached on their ip address after they connect to your network using a cable or over wifi. I used JSCape’s SCP module to connect and copy the file to my laptop.

pepimage2

Picture taken by Peppy’s camera

Microsoft Vision API

Next up was the visual api – I really wanted to try the Google Cloud Vision API, but it’s intended for commercial use and you need to have a VAT number to be able to register. I also considered IBM Bluemix (I have heard good things about the Alchemy API) but you need to deploy your app into IBM’s cloud in that case, which sounded like a hassle. I remembered that the Microsoft API was just a standard webservice without much investment needed, so that was the obvious choice for a quick POC.

At first, I experimented with uploading the .jpg file saved by Pep to the Microsoft Vision API test page, which returned this analysis:

Features:
Feature Name Value
Description { “type”: 0, “captions”: [ { “text”: “a vase sitting on a chair”, “confidence”: 0.10692098826160357 } ] }
Tags [ { “name”: “indoor”, “confidence”: 0.9926377534866333 }, { “name”: “floor”, “confidence”: 0.9772524237632751 }, { “name”: “cluttered”, “confidence”: 0.12796716392040253 } ]
Image Format jpeg
Image Dimensions 640 x 480
Clip Art Type 0 Non-clipart
Line Drawing Type 0 Non-LineDrawing
Black & White Image Unknown
Is Adult Content False
Adult Score 0.018606722354888916
Is Racy Content False
Racy Score 0.014793086796998978
Categories [ { “name”: “abstract_”, “score”: 0.00390625 }, { “name”: “others_”, “score”: 0.0078125 }, { “name”: “outdoor_”, “score”: 0.00390625 } ]
Faces []
Dominant Color Background
Dominant Color Foreground
Dominant Colors
Accent Color

#AC8A1F

I found the description of the image quite fascinating – it seemed to describe what was in the image closely enough. From this, I got the idea to return the description to Pep and use his text to speech API to describe what he has seen.

Next, I had to register on the Microsoft website to get an api key. this allowed me to programatically  pass Pep’s image to the API using a POST request. The response was a JSON string containing data similar to that above. You had to put in some url parameters to get the specific information you need. The Microsoft Vision API Docs are here. I used the Description text because it was as close as possible to a human constructed phrase.

https://api.projectoxford.ai/vision/v1.0/analyze?visualFeatures=Description

The result looks like this – the tags man, fireplace and bed were incorrect, but the rest are correct:

{"description":{"tags":["indoor","living","room","chair","table","television","sitting","laptop","furniture","small","white","black","computer","screen","man","large","fireplace","cat","kitchen","standing","bed"],"captions":[{"text":"a living room with a couch and a chair","confidence":0.67932875215020883}]},"requestId":"37f90455-14f5-4fc7-8a79-ed13e8393f11","metadata":{"width":640,"height":480,"format":"Jpeg"}}

Text to speech

The finishing touch was to use Nao’s text to speech API to create the impression that he is talking about what he has seen.

ALTextToSpeech tts = new ALTextToSpeech(session);
tts.say(text);

This was Nao looking at me while I was recording with my phone. The Microsoft Vision API incorrectly classifies me as a man with a wii. I could easily rationalise that the specifics of the classification are wrong, but the generalities are close enough.

Human

|                 |

Woman    Man

 

Small Electronic Device

|                    |                 |

Remote    Phone         Wii

 

This classification was close enough to correct – a vase of flowers sitting on a table.

Interpreting the analysis

Most of the analysis values returned are accompanied by a confidence level. The confidence level in the example I have is pretty low, the range being from 0 to 1.

a vase sitting on a chair", "confidence": 0.10692098826160357

This description also varied based on how I cropped the image before analysis. Different aspects were chosen as the subject of the picture with slightly different cropped views.

The vision api also returned Tags and Categories.

Categories give you a two-level taxonomy categorisation, with the top level being:

abstract, animal, building, dark, drink, food, indoor, others, outdoor, people, plant, object, sky,text, trans

Tags are more detailed than categories and give insight into the image content in terms of objects, living beings and actions. They give insight into everything happening in the image, including the background and not just the subject of the image.

Conclusions

Overall, I was really happy to integrate Nao with any kind of Artificial Intelligence API. It feels like the ultimate combination of robotics with AI.

The Microsoft Vision API was very intuitive and easy to get started with. For a free API with general classification capabilities, I think it’s not bad. These API’s are only as good as their training, so for more specific applications, you would obviously want to invest in training the API more intensively for the context. I tried IBM Bluemix’s demo with the same test image from Pep, but could not get a classification out of it – perhaps the image was not good enough.

I did have some reservations about sending live images from Pep into Microsoft’s cloud. In a limited and controlled setting, and in the interests of experimentation and learning, it seemed appropriate, but in a general sense, I think the privacy concerns need some consideration.

During this POC I thought about more possibilities for integrating Pep with some other API’s. The Nao robots have some sophisticated Aldebaran software of their own which provides basic processing of their sensor data like facial and object recognition and speech to text. I think there is a lot of potential in combining these API’s to enrich the robot’s interactive capabilities and delve further into the current capabilities of AI API’s.