Nao Robot with Microsoft Computer Vision API

Lately, I’ve been experimenting with integrating an Aldebaran Nao robot with an artificial intelligence API.

While writing my previous blog post on artificial intelligence APIs, I realised there were far too many options out there to try casually. I did want to get some hands-on experience with the APIs myself, so I needed a project.

Pep the humanoid robot from Aldebaran

My boyfriend, Renze de Vries, and I were both captivated by the Nao humanoid robots at conferences and meetups, but found the price of buying one ourselves prohibitive. He already had a few robots of his own – a Lego Mindstorms robot and a Robotis Bioloid robot we named Max – and he has written about his projects here. Eventually we crossed the threshold and bought our very own Nao robot together from http://www.generationrobots.com/ – we call him Peppy. Integrating an AI API into Peppy seemed like a good project to get familiar with what the AI APIs can do with real-life input.

Peppy the Nao robot from Aldebaran

Nao API

The first challenge was to get Pep to produce an image that could be processed. Pep has a bunch of sensors including those for determining position and temperature of his joints, touch sensors in his head and hands, bumper sensors in his feet, a gyroscope, sonar, microphones, infrared sensors, and two video cameras.

The Nao API, NAOqi, contains modules for motion, audio, vision, people and object recognition, sensors and tracking. In the vision module you have the option to take a picture or grab video. The video route seemed overly complicated for this small POC, so I went with the ALPhotoCapture class – Java docs here. This API saves pictures from the camera to local storage on the robot, so if you want to process them externally you have to connect to Pep’s filesystem and download them.

// Save a single VGA-resolution JPEG to local storage on the robot
ALPhotoCapture photocapture = new ALPhotoCapture(session);
photocapture.setResolution(2);        // 2 = 640 x 480 (VGA)
photocapture.setPictureFormat("jpg");
photocapture.takePictures(1, "/home/nao/recordings/cameras/", "pepimage", true);

The Naos run a Gentoo-based Linux distribution called OpenNAO. They can be reached on their IP address once they are connected to your network over a cable or wifi. I used JSCape’s SCP module to connect and copy the file to my laptop.
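For completeness, here is roughly what that download step looks like in code. This is only a sketch: I actually used JSCape, but the widely used JSch library shown below does the same job, and the host, credentials and the pepimage_0.jpg file name are placeholders rather than my exact setup.

import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class PepImageDownloader {

    // Copy the picture Pep saved locally onto the laptop over SFTP.
    public static void download(String robotIp) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("nao", robotIp, 22);
        session.setPassword("nao");                        // default OpenNAO credentials, change for your robot
        session.setConfig("StrictHostKeyChecking", "no");  // acceptable for a local POC only
        session.connect();

        ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
        sftp.connect();
        // Assuming the picture ended up as pepimage_0.jpg; adjust to the file name ALPhotoCapture reports
        sftp.get("/home/nao/recordings/cameras/pepimage_0.jpg", "pepimage.jpg");
        sftp.disconnect();
        session.disconnect();
    }
}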

Picture taken by Peppy’s camera

Microsoft Vision API

Next up was the vision API – I really wanted to try the Google Cloud Vision API, but it’s intended for commercial use and you need a VAT number to be able to register. I also considered IBM Bluemix (I have heard good things about the Alchemy API), but you need to deploy your app into IBM’s cloud in that case, which sounded like a hassle. I remembered that the Microsoft API was just a standard webservice without much investment needed, so that was the obvious choice for a quick POC.

At first, I experimented with uploading the .jpg file saved by Pep to the Microsoft Vision API test page, which returned this analysis:

Features:

Feature Name                Value
Description                 { "type": 0, "captions": [ { "text": "a vase sitting on a chair", "confidence": 0.10692098826160357 } ] }
Tags                        [ { "name": "indoor", "confidence": 0.9926377534866333 }, { "name": "floor", "confidence": 0.9772524237632751 }, { "name": "cluttered", "confidence": 0.12796716392040253 } ]
Image Format                jpeg
Image Dimensions            640 x 480
Clip Art Type               0 (non-clipart)
Line Drawing Type           0 (non-line-drawing)
Black & White Image         Unknown
Is Adult Content            False
Adult Score                 0.018606722354888916
Is Racy Content             False
Racy Score                  0.014793086796998978
Categories                  [ { "name": "abstract_", "score": 0.00390625 }, { "name": "others_", "score": 0.0078125 }, { "name": "outdoor_", "score": 0.00390625 } ]
Faces                       []
Dominant Color Background
Dominant Color Foreground
Dominant Colors
Accent Color                #AC8A1F

I found the description quite fascinating – it captured what was in the image closely enough. From this came the idea to return the description to Pep and use his text-to-speech API to have him describe what he has seen.

Next, I had to register on the Microsoft website to get an API key. This allowed me to programmatically pass Pep’s image to the API using a POST request. The response was a JSON string containing data similar to that above. You have to pass some URL parameters to specify which information you need – the Microsoft Vision API docs are here. I used the Description feature because its caption is as close as possible to a human-constructed phrase.

https://api.projectoxford.ai/vision/v1.0/analyze?visualFeatures=Description
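In code, the call boiled down to a plain HTTP POST. The sketch below uses java.net.HttpURLConnection and follows the documented binary-upload style (an application/octet-stream body plus the Ocp-Apim-Subscription-Key header); the class and method names are only illustrative, not my exact code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class VisionApiClient {

    // POST the raw jpg bytes to the analyze endpoint and return the JSON response as a string.
    public static String analyze(String imagePath, String apiKey) throws Exception {
        byte[] image = Files.readAllBytes(Paths.get(imagePath));

        URL url = new URL("https://api.projectoxford.ai/vision/v1.0/analyze?visualFeatures=Description");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        conn.setRequestProperty("Ocp-Apim-Subscription-Key", apiKey);
        conn.setDoOutput(true);

        try (OutputStream out = conn.getOutputStream()) {
            out.write(image);                              // send the image bytes as the request body
        }

        StringBuilder json = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                json.append(line);                         // collect the JSON response
            }
        }
        return json.toString();
    }
}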

The result looked like this – the tags man, fireplace and bed were incorrect, but the rest were accurate:

{"description":{"tags":["indoor","living","room","chair","table","television","sitting","laptop","furniture","small","white","black","computer","screen","man","large","fireplace","cat","kitchen","standing","bed"],"captions":[{"text":"a living room with a couch and a chair","confidence":0.67932875215020883}]},"requestId":"37f90455-14f5-4fc7-8a79-ed13e8393f11","metadata":{"width":640,"height":480,"format":"Jpeg"}}

Text to speech

The finishing touch was to use Nao’s text-to-speech API to create the impression that he is talking about what he has seen.

ALTextToSpeech tts = new ALTextToSpeech(session);
tts.say(text);
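Putting the pieces together, the whole POC boils down to a short pipeline. The helper names below (PepImageDownloader, VisionApiClient) refer to the illustrative sketches earlier in this post, not to my actual classes:

// End-to-end flow: capture on the robot, download, analyse, speak
photocapture.takePictures(1, "/home/nao/recordings/cameras/", "pepimage", true);  // snap a picture on Pep
PepImageDownloader.download(robotIp);                                             // copy the jpg to the laptop
String json = VisionApiClient.analyze("pepimage.jpg", apiKey);                    // ask the Vision API for a description
String caption = new JSONObject(json).getJSONObject("description")
        .getJSONArray("captions").getJSONObject(0).getString("text");             // pull out the caption text
tts.say(caption);                                                                 // let Pep say what he has seen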

This was Nao looking at me while I was recording with my phone. The Microsoft Vision API incorrectly classified me as a man with a Wii. I could easily rationalise that the specifics of the classification are wrong, but the generalities are close enough:

Human
 ├── Woman
 └── Man

Small Electronic Device
 ├── Remote
 ├── Phone
 └── Wii

This classification was close enough to correct – a vase of flowers sitting on a table.

Interpreting the analysis

Most of the analysis values returned are accompanied by a confidence level on a scale from 0 to 1. The confidence in my example is pretty low:

"text": "a vase sitting on a chair", "confidence": 0.10692098826160357
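Given how low the confidence can get, it might make sense to let Pep hedge his own statements. A small hypothetical guard, reusing the text and confidence values parsed earlier and an arbitrary threshold:

// Only state the caption outright when the API is reasonably confident
if (confidence > 0.3) {          // 0.3 is an arbitrary cut-off for this POC
    tts.say("I see " + text);
} else {
    tts.say("I am not sure, but it might be " + text);
}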

The description also varied depending on how I cropped the image before analysis; slightly different crops led to different aspects of the scene being chosen as the subject of the picture.

The Vision API also returned Tags and Categories.

Categories give you a two-level taxonomy categorisation, with the top level being:

abstract, animal, building, dark, drink, food, indoor, others, outdoor, people, plant, object, sky, text, trans

Tags are more detailed than categories and give insight into the image content in terms of objects, living beings and actions. They cover everything happening in the image, including the background, not just the main subject.
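As far as I can tell from the docs, tags and categories can be requested in the same call as the description by listing more visual features in the query string, for example:

https://api.projectoxford.ai/vision/v1.0/analyze?visualFeatures=Description,Tags,Categories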

Conclusions

Overall, I was really happy to integrate Nao with any kind of Artificial Intelligence API. It feels like the ultimate combination of robotics with AI.

The Microsoft Vision API was very intuitive and easy to get started with. For a free API with general classification capabilities, I think it’s not bad. These APIs are only as good as their training, so for more specific applications you would obviously want to invest in training the API more intensively for the context. I tried IBM Bluemix’s demo with the same test image from Pep, but could not get a classification out of it – perhaps the image was not good enough.

I did have some reservations about sending live images from Pep into Microsoft’s cloud. In a limited and controlled setting, and in the interests of experimentation and learning, it seemed appropriate, but in a general sense, I think the privacy concerns need some consideration.

During this POC I thought about more possibilities for integrating Pep with other APIs. The Nao robots ship with some sophisticated Aldebaran software of their own, which provides basic processing of their sensor data such as facial and object recognition and speech-to-text. I think there is a lot of potential in combining these APIs to enrich the robot’s interactive capabilities and delve further into what current AI APIs can do.

 

8 thoughts on “Nao Robot with Microsoft Computer Vision API”

  1. Hello Thosha,

    A few months ago I did a similar POC using the Blockspring API for reverse image search https://open.blockspring.com/tags/1-image-processing to match pics from the NAO camera. It returns a JSON file with the best match found on the web and a text description of the matching image. Blockspring also has an API to summarize the matching web page and return a sentence summary. I used NAO text to speech to have NAO say the text. Since then Blockspring has begun charging a monthly subscription fee, and Google has recently opened Cloud Vision. So now I am working on building a NAO AI application with the Google Cloud Vision API and Cloud Speech API. Some IBM Watson APIs could also be useful for robotics AI.

    Since we are similar thinkers maybe we can collaborate in some way?

    Bob Dixey – USA

  2. Hi Thosha,

    Your blog really inspired me. I appreciate your efforts. I am also developing a POC using the Microsoft Computer Vision API.
    Keep sharing 🙂

    Thanks

  3. Thanks for your comment Randy! I think both social and non-social incarnations of a robot can be valid, depending on the purpose of the robot. For instance, a dishwasher is perfectly functional without any social packaging around its communication. However, a robot that performs more complex or socially oriented functions in the home (a nanny, personal assistant, cook or companion) could be aided by personification. What do you think?
