European Robotics Week 2016

The European Robotics Week 2016 took place from 18 to 22 November in countries including the Netherlands, Austria, Lithuania, Norway, Portugal and Serbia, among many others. The event has been held annually since 2011 to raise public awareness of robotics applications and to bring together industry, researchers and policymakers. This year’s central event was in Amsterdam, and I attended one of the five days of activities at the Maritime Museum. The theme was ‘Robots at your service – empowering healthy aging’, which encompassed a variety of activities over the five days, including debates, open sessions where you could network while interacting with different kinds of robots, workshops for children and a two-day hackathon. I attended the robot expo and two of the debate sessions, which I will summarise below.

Robot Exhibition

Although it’s clearly still early days for general consumer robotics in terms of the price-value ratio, there are ever more options available for enthusiasts and for very specific applications. The exhibition had a good selection, including these lovely bots:

Panel Discussion: Roboethics

This discussion was about ethics in robotics but it touched a really wide variety of aspects around this, including some philosophy. The speakers were of a very high standard and from a wide variety of backgrounds which gave the discussion its great breadth.

Here are some highlights:

Robots in care – good or bad?

  • How should robots in care evolve?
    • Robots should be applied to care because the need for care increases as populations in developed countries age, while the labour force interested in care shrinks. But are robots really the answer to this sensitive problem?
    • The distinction was made that a wide range of activities qualify as ‘care’, ranging from assisting people to care for themselves, to caring for them directly, to providing psychological and emotional support in times of depression, distress or loneliness. Are robots suitable for all of these kinds of activities, or only part of the spectrum? Before you judge, consider the example given of a Nao robot interacting with children with diabetes in the PAL project (Personal Assistant for healthy Lifestyle). The robot interacts with the children to educate them about their illness and help them track it. The children confide in the robot more easily than in adults, and hospital attendance among children with diabetes goes up. This is an example of using a robot to build a relationship and put people at ease – something the robot does more easily in this case than a human would. When is a robot more trustworthy than a human? And when is the human touch really needed?
    • Should we want robots to be more human when they are used for care? In some cases we do, when there is a need to soothe and connect, to comfort. But in other cases a more impersonal and less present robot might be desirable. For instance, if you needed help going to the toilet or rising from bed for the rest of your life, would you need a lot of human interaction around that, or prefer it to blend seamlessly into your life to enable you to be as independent as possible?

Robot ethics and liability

  • Responsibility vs liability of robots for their misdeeds
    • This question always comes up in a discussion about robot ethics – people feel uncomfortable with the idea of a robot’s accountability for crimes. For instance, if an autonomous car runs over a pedestrian, who is responsible: the car, the owner or the manufacturer?
    • A good point was raised by philosophy professor Vincent Muller to take this argument further. If a child throws a stone through a window, she is responsible for the action but the parent is liable for damages. In the same way, a robot can be responsible for doing something wrong, but another entity, like the owner, might be liable for damages caused.
    • When we discuss holding a robot liable for a crime, we imply that it could understand its actions were wrong and chose to act anyway. Robots as yet have no understanding of what they are doing, so the conclusion was that a robot cannot be meaningfully convicted of a crime.
Gorgeous Maritime Museum in Amsterdam

 

Panel Discussion: Our Robotic Future

This next session was a general discussion on the future of robotics, followed by each speaker giving their wish and hope for that future. There was a strong Euro-centric flavour to the discussion, which gave a fascinating insight into the European search for identity in this time of change – in the robotics and AI revolution, who are we and what do we stand for? How will we respond to the threats and opportunities? How can we lead and usher in a good outcome? The panel itself fell on the optimistic side of the debate, looking forward to positive outcomes.

Robotics Research

  • The debate started off underlining the need to share information and research so that we can progress quickly with these high potential technologies.
  • A distinction was made between American and European methods of research in AI and robotics
    • American research is funded by defence budgets and not shared openly
    • In the US, research is done by large corporations like Facebook and Google, and is often funded by DARPA.
    • In Europe, research is funded by the European Commission and is often performed through startups, which means less brute strength but more agility to bring research to fruition. However, because these startups are so small, they can be reluctant to share their intellectual property, which might be their only business asset.

Accepting the Robot Revolution

  • There is a lot of hype about robots and AI in the media which stirs people’s imaginations and fears – how can we usher in all the benefits of the robot revolution?
    • Part of the fear is that people don’t like to accept that our last remaining distinction, being the smartest on Earth, could be lost.
    • There is a growing economic divide caused by this technological revolution
      • We should not create a system that creates advantage for only a select group, but aim for an inclusive society that allows everyone to benefit from the abundance robotics can bring.
      • People can be less afraid of robots if the value they add is made clear, for example a robot surgeon which is more accurate than a human surgeon can be viewed positively instead of as a threat.
      • Technological revolutions are fuelled by an available workforce which can pick up the skills needed to usher them in. For example, during the industrial revolution agricultural workers could be retrained to work in factories, and later a large labour force was available to fuel the IT revolution, which in turn put many out of work by automating manual tasks. There is a concern that with STEM graduates on the decline, we will lack the skilled people to build momentum for AI and robotics.
    • Investors in robotics and AI are discouraged by the growing stigma around these technologies.
    • Communication policies on this topic are designed around getting people to understand the science, but this thinking must shift. The end users who currently fear and lack understanding must become the centre of communication in the future – AI and robotics must enable them, and they should have the tools to judge and decide on their fate.

 

Conclusion

This is certainly an exciting time to be alive, as there is still so much to determine and discover as the AI and robotics industries and disciplines grow. The event also highlighted how far we are from this reality – horizons of ten years were discussed for AI and robotics to become commodities in our homes. It is very encouraging to see the clear thinking and good intentions that go into making these technologies mainstream. In the coming months I’d like to dig into the topics of ethics, regulation, EU funding, and what the future of AI and robotics could bring.

Nao Robot with Microsoft Computer Vision API

Lately, I’ve been experimenting with integrating an Aldebaran Nao robot with an artificial intelligence API.

While writing my previous blog post on artificial intelligence APIs, I realised there were far too many API options out there to try out casually. I did want to start getting some hands-on experience with the APIs myself, so I had to find a project.

Pep the humanoid robot from Aldebaran

My boyfriend, Renze de Vries, and I were both captivated by the Nao humanoid robots at conferences and meetups, but found the price of buying one ourselves prohibitive. He already had a few robots of his own – a Lego Mindstorms robot and a Robotis Bioloid robot which we named Max – and he has written about his projects here. Eventually we crossed the threshold and bought our very own Nao robot together from http://www.generationrobots.com/ – we call him Peppy. Integrating an AI API into Peppy seemed like a good project for getting familiar with what AI APIs can do with real-life input.

Peppy the Nao robot from Aldebaran

Nao API

The first challenge was to get Pep to produce an image that could be processed. Pep has a bunch of sensors including those for determining position and temperature of his joints, touch sensors in his head and hands, bumper sensors in his feet, a gyroscope, sonar, microphones, infrared sensors, and two video cameras.

The Nao API, NAOqi, contains modules for motion, audio, vision, people and object recognition, sensors and tracking. In the vision module you have the option to take a picture or grab video. The video route seemed overly complicated for this small POC, so I went with the ALPhotoCapture class – Java docs here. This API saves pictures from the camera to local storage on the robot, and if you want to process them externally, you have to connect to Pep’s filesystem and download them.

// Take a single photo with Pep's camera and save it on the robot (assumes an open Naoqi session)
ALPhotoCapture photocapture = new ALPhotoCapture(session);
photocapture.setResolution(2); // 2 = VGA, 640x480
photocapture.setPictureFormat("jpg");
photocapture.takePictures(1, "/home/nao/recordings/cameras/", "pepimage", true);

The Naos run a Gentoo-based Linux distribution called OpenNAO. They can be reached on their IP address once they are connected to your network by cable or over wifi. I used JSCAPE’s SCP module to connect and copy the file to my laptop.
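As a rough illustration of that download step, here is a minimal sketch using the open-source JSch library over SFTP instead of JSCAPE’s SCP module. The robot’s IP address, the default nao/nao credentials and the exact filename NAOqi writes are assumptions:

import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class FetchPepImage {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session ssh = jsch.getSession("nao", "192.168.1.12", 22); // robot user and IP are examples
        ssh.setPassword("nao");
        ssh.setConfig("StrictHostKeyChecking", "no");
        ssh.connect();

        // Copy the saved picture from the robot's filesystem to the laptop
        ChannelSftp sftp = (ChannelSftp) ssh.openChannel("sftp");
        sftp.connect();
        sftp.get("/home/nao/recordings/cameras/pepimage_0.jpg", "pepimage.jpg"); // filename suffix may differ
        sftp.disconnect();
        ssh.disconnect();
    }
}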

Picture taken by Peppy’s camera

Microsoft Vision API

Next up was the vision API – I really wanted to try the Google Cloud Vision API, but it’s intended for commercial use and you need a VAT number to register. I also considered IBM Bluemix (I have heard good things about the Alchemy API), but in that case you need to deploy your app into IBM’s cloud, which sounded like a hassle. I remembered that the Microsoft API was just a standard web service without much investment needed, so that was the obvious choice for a quick POC.

At first, I experimented with uploading the .jpg file saved by Pep to the Microsoft Vision API test page, which returned this analysis:

Features:

  • Description: { "type": 0, "captions": [ { "text": "a vase sitting on a chair", "confidence": 0.10692098826160357 } ] }
  • Tags: [ { "name": "indoor", "confidence": 0.9926377534866333 }, { "name": "floor", "confidence": 0.9772524237632751 }, { "name": "cluttered", "confidence": 0.12796716392040253 } ]
  • Image Format: jpeg
  • Image Dimensions: 640 x 480
  • Clip Art Type: 0 (Non-clipart)
  • Line Drawing Type: 0 (Non-LineDrawing)
  • Black & White Image: Unknown
  • Is Adult Content: False
  • Adult Score: 0.018606722354888916
  • Is Racy Content: False
  • Racy Score: 0.014793086796998978
  • Categories: [ { "name": "abstract_", "score": 0.00390625 }, { "name": "others_", "score": 0.0078125 }, { "name": "outdoor_", "score": 0.00390625 } ]
  • Faces: []
  • Dominant Color Background / Foreground / Dominant Colors: colour swatches (not reproduced here)
  • Accent Color: #AC8A1F

I found the description of the image quite fascinating – it seemed to describe what was in the image closely enough. From this, I got the idea to return the description to Pep and use his text to speech API to describe what he has seen.

Next, I had to register on the Microsoft website to get an API key. This allowed me to programmatically pass Pep’s image to the API using a POST request. The response was a JSON string containing data similar to that above. You have to add URL parameters to select the specific information you need – the Microsoft Vision API docs are here. I used the Description text because it was as close as possible to a human-constructed phrase.

https://api.projectoxford.ai/vision/v1.0/analyze?visualFeatures=Description
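For reference, a minimal sketch of that POST request using plain HttpURLConnection might look like this. The subscription key is a placeholder, and the local image path is assumed to be the file copied over from Pep:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AnalyzePepImage {
    public static void main(String[] args) throws Exception {
        byte[] image = Files.readAllBytes(Paths.get("pepimage.jpg"));

        URL url = new URL("https://api.projectoxford.ai/vision/v1.0/analyze?visualFeatures=Description");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Ocp-Apim-Subscription-Key", "YOUR_API_KEY"); // placeholder key
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        conn.setDoOutput(true);

        // Send the raw image bytes as the request body
        try (OutputStream out = conn.getOutputStream()) {
            out.write(image);
        }

        // Read back the JSON response containing the description and caption
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        System.out.println(json);
    }
}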

The result looks like this – the tags man, fireplace and bed were incorrect, but the rest were accurate:

{"description":{"tags":["indoor","living","room","chair","table","television","sitting","laptop","furniture","small","white","black","computer","screen","man","large","fireplace","cat","kitchen","standing","bed"],"captions":[{"text":"a living room with a couch and a chair","confidence":0.67932875215020883}]},"requestId":"37f90455-14f5-4fc7-8a79-ed13e8393f11","metadata":{"width":640,"height":480,"format":"Jpeg"}}

Text to speech

The finishing touch was to use Nao’s text to speech API to create the impression that he is talking about what he has seen.

ALTextToSpeech tts = new ALTextToSpeech(session);
tts.say(text);
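To tie the two halves together, the caption can be pulled out of the Vision API’s JSON response and handed to the text-to-speech call. A minimal sketch, assuming the org.json library, an open NAOqi session and the response string from the earlier request:

import org.json.JSONObject;
import com.aldebaran.qi.helper.proxies.ALTextToSpeech;

// 'json' is the Vision API response string, 'session' the open Naoqi session
String caption = new JSONObject(json)
        .getJSONObject("description")
        .getJSONArray("captions")
        .getJSONObject(0)
        .getString("text");

ALTextToSpeech tts = new ALTextToSpeech(session);
tts.say(caption); // e.g. "a living room with a couch and a chair"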

This was Nao looking at me while I was recording with my phone. The Microsoft Vision API incorrectly classified me as a man with a Wii. I could easily rationalise that the specifics of the classification are wrong, but the generalities are close enough:

Human → Woman / Man

Small Electronic Device → Remote / Phone / Wii

This classification was close enough to correct – a vase of flowers sitting on a table.

Interpreting the analysis

Most of the analysis values returned are accompanied by a confidence level on a scale from 0 to 1. The confidence in my example is pretty low:

"text": "a vase sitting on a chair", "confidence": 0.10692098826160357

This description also varied based on how I cropped the image before analysis. Different aspects were chosen as the subject of the picture with slightly different cropped views.

The Vision API also returned Tags and Categories.

Categories give you a two-level taxonomy categorisation, with the top level being:

abstract, animal, building, dark, drink, food, indoor, others, outdoor, people, plant, object, sky, text, trans

Tags are more detailed than categories and describe the image content in terms of objects, living beings and actions. They cover everything happening in the image, including the background, not just the main subject.
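If you want the tags and categories alongside the description in a single call, the analyze endpoint accepts a comma-separated list of features – for example, this variation on the URL used earlier:

https://api.projectoxford.ai/vision/v1.0/analyze?visualFeatures=Description,Tags,Categories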

Conclusions

Overall, I was really happy to integrate Nao with any kind of Artificial Intelligence API. It feels like the ultimate combination of robotics with AI.

The Microsoft Vision API was very intuitive and easy to get started with. For a free API with general classification capabilities, I think it’s not bad. These APIs are only as good as their training, so for more specific applications you would obviously want to invest in training the model more intensively for the context. I tried IBM Bluemix’s demo with the same test image from Pep, but could not get a classification out of it – perhaps the image was not good enough.

I did have some reservations about sending live images from Pep into Microsoft’s cloud. In a limited and controlled setting, and in the interests of experimentation and learning, it seemed appropriate, but in a general sense, I think the privacy concerns need some consideration.

During this POC I thought about more possibilities for integrating Pep with other APIs. The Nao robots have sophisticated Aldebaran software of their own which provides basic processing of their sensor data, such as facial and object recognition and speech to text. I think there is a lot of potential in combining these APIs to enrich the robot’s interactive capabilities and delve further into what current AI APIs can do.

 

Update on the latest Artificial Intelligence APIs and services

The past few years have seen a blur of software giants releasing AI and machine learning themed APIs and services. I was, frankly, surprised at how many options there currently are for developers and companies. I think it’s a positive sign for the industry that there are multiple options from reputable brands for topics like visual and language recognition – these have almost become commodities. You also see strong consolidation into typical categories: machine learning for building general predictive models, visual recognition, language and speech recognition, conversational bots and news analysis.

  • Google
    • Google Cloud Platform
      • Google Prediction API
        • Hosted Models (Demo)
          • Language Identifier
            • Spanish, English or French
          • Tag Categoriser
            • Android, App Engine, Chrome, YouTube
          • Sentiment Predictor
            • Positive or negative label for comments
        • Trained Models
          • Train your own model
      • Google Cloud Vision API
        • Label Detection
        • Explicit Content Detection
        • Logo Detection
        • Landmark Detection
        • Optical Character Recognition
        • Face Detection
        • Image Attributes
      • Cloud Speech API
        • Audio to text
        • >80 languages
        • Streaming Recognition
        • Inappropriate Content Filtering
        • Real-time or Buffered Audio Support
        • Noisy Audio Handling
      • Google Translate API
        • Text Translation
        • Language Detection
    • TensorFlow
      • Open-source library for graph-based numerical computation and model building

 

  • Facebook
    • Bot for messenger
      • Ability to build a chat bot for your company that chats via Facebook Messenger

 

  • IBM
    • Bluemix
      • Alchemy
        • Alchemy Language
          • Keyword Extraction
          • Entity Extraction
          • Sentiment Analysis
          • Emotion Analysis
          • Concept Tagging
          • Relation Extraction
          • Taxonomy Classification
          • Author Extraction
        • Alchemy Data News
          • News and blog analysis
          • Sentiment, keyword, taxonomy matching
      • Concept Insights
        • Linking concepts between content
      • Dialog
        • Chat interaction
      • Language Translation
      • Natural Language Classifier
        • Phrase classification
      • Personality Insights
        • Social media content analysis to predict personal traits
      • Relationship Extraction
        • Finds relationships between objects and subjects in sentences
      • Retrieve and Rank
        • Detects signals in data
      • Speech To Text, Text to Speech
      • Tone Analyzer
        • Emotion analysis
      • Tradeoff Analytics
        • Decision making support
      • Visual Recognition
      • Cognitive Commerce
        • Support for driving commerce, recommendations etc
      • Cognitive Graph
        • Creates a knowledge graph of data that’s typically difficult for a machine to understand
      • Cognitive Insights
        • Personalised commercial insights for commerce

 

 

  • Microsoft
    • Microsoft Cognitive Services
      • Vision
        • Categorise images
        • Emotion recognition
        • Facial detection
        • Analyse video
      • Speech
        • Speech to text, text to speech
        • Speaker recognition
      • Language
        • Spell checking
        • Natural language processing
        • Complex linguistic analysis
        • Text analysis for sentiment, phrases, topics
        • Models trained on web data
      • Knowledge
        • Relationships between academic papers
        • Contextual linking
        • Interactive search
        • Recommendations
      • Search
        • Search
        • Search autosuggest
        • Image and metadata search
        • News search
        • Video search
      • Bot Framework
      • Content Moderator
      • Translator
      • PhotoDNA Cloud Service

Did I miss something from this list? Comment and let me know!

In touch with tech at Permanent Future Lab’s Geeky Night Out

The Permanent Future Lab is a place to rediscover your sense of wonder and amazement at technology. Jurjen de Vries and Samir Lahiri are the co-initiators who hosted us during the Geeky Night Out, a chance to experiment with the lab’s wide range of modern technologies. The idea behind the lab is to encourage people and companies to experiment with disruptive technologies and embrace innovation.

The lab is hosted inside the Seats2Meet meeting space in Utrecht. It’s small but potent, every wall packed with goodies. Getting started can be a bit overwhelming as there are so many interesting things lying around.

I decided to begin with the Equil Smartpen 2, since I’m always taking notes. It comes in a prism-shaped white plastic module containing a pen and a page scanner. You have to fit the scanner carefully into the right position on the page – in the middle and level with the edge – so it gets a good reading of your drawing. I downloaded both smartphone apps, EquilSketch and Equilnote, to try out drawing and writing. The phone needed to be paired with the scanner, and then input had to be received successfully by the app for calibration. After quite some back and forth, during which I was joined by my partner in crime for the evening, Bart Kors, everything was connected and ready to go.

Equilnote with handwriting recognition
Equilnote export with uninspired handwriting recognition test

The transcription to the smartphone was instant and pretty accurate: the lines were smooth and nothing was lost in translation. Text recognition was also fairly accurate, although the notes app seemed to recognise cursive text much more easily than my loose block capitals.

You can either enter freestyle handwriting and save that directly in your note, or use text recognition to convert your handwriting into digital text. Fonts and colours can be modified in the app. Interestingly, the app also works without the smartpen, and you can use your finger or a stylus on your phone screen to enter handwriting, and it will recognise it.

The drawing app allowed selection of different drawing tools, colour, size and opacity. You can export drawings to any of the apps on your phone. Jurjen joined us at some stage and suggested we try to see our notes on the tv using the Chromecast. So we hooked the Chromecast up to my phone, cast the entire screen and were able to see drawing on the paper, transcribed on the phone and then cast to the TV in real time. It’s an interesting solution to presenting what you are drawing to a group of people.

My next experiment was with the Muse brainwave reader. The goal of the Muse is to train brain relaxation. You have to download the app and then some extra content to get started. After a calibration sequence, you start a three-minute exercise to relax your mind. The app shows a grassy plain and sky on the screen and you hear the wind blowing. The sound of the wind is an indication of your state of mind, and your goal is to keep the wind quiet by calming your mind.

The Muse is the white band in the picture

Because I have meditated before, I thought this task would feel natural to me, but my three minutes stretched out and I was quite glad when the end came. Trying to relax the mind while still being aware of the wind noise created a curious kind of tension.

The app provides feedback on your session, divided into three mental states – active, neutral and calm. I felt that the device works well, because the signal was strong and easy to influence. The app is also good quality and well thought out. It’s an interesting and unusual way of interacting, quite out of the ordinary.

The nice thing about the lab is that you also get to experience the devices that others are busy with. This is how I encountered Sphero, a remote-controlled ball that can move and change colours. It’s very responsive to its controls, and went racing off much faster than expected, like an over-excited puppy. Another group was working with either an Arduino or a Spark Core, trying to illuminate a long strip of LEDs to make a clock. They had some success at the end and it looked brilliant, with the lights blinking in different colours.


My experience at the Permanent Future Lab lowered the threshold and increased the fun factor in experimenting with innovative technologies. Furthermore, I didn’t need to do any coding to have meaningful experiences with technology. I met some really nice people and am looking forward to the next session where I might discover my killer idea. Hope to see you there!

Amazon Machine Learning at a Glance

Here is a brief summary of Amazon’s machine learning service in AWS.

AWS ML helping to analyse your data. Slides here

Main function:

  • Data modelling for prediction using supervised learning
  • Try to predict characteristics like:
    • Is this email spam?
    • Will this customer buy my product?

Key benefits:

  • No infrastructure management required
  • Democratises data science
    • Wizard-based model generation, evaluation and deployment
    • Does not require data science expertise
    • Built for developers
  • Bridges the gap between developing a predictive model and building an application

Type of prediction model:

  • Binary classification using logistic regression
  • Multiclass classification using multinomial logistic regression
  • Regression using linear regression

Process:

  • Determine what question you want to answer
  • Collect labelled data
  • Convert to csv format and upload to Amazon S3
  • Clean up and aggregate data with AWS ML assistance
  • Split data into training and evaluation sets using the wizard
  • Wait for AWS ML to generate a model
  • Evaluate and modify the model with the wizard
  • Use the model to create predictions, either in batches or via single real-time API calls (a rough sketch of such a call follows this list)
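For that last step, a real-time prediction from code might look roughly like this sketch using the AWS SDK for Java – the model ID, endpoint URL and feature names are placeholders, and the model is assumed to already have a real-time endpoint enabled:

import java.util.HashMap;
import java.util.Map;
import com.amazonaws.services.machinelearning.AmazonMachineLearning;
import com.amazonaws.services.machinelearning.AmazonMachineLearningClientBuilder;
import com.amazonaws.services.machinelearning.model.PredictRequest;
import com.amazonaws.services.machinelearning.model.PredictResult;

public class SpamPrediction {
    public static void main(String[] args) {
        AmazonMachineLearning client = AmazonMachineLearningClientBuilder.defaultClient();

        // One record with the same feature columns the model was trained on (hypothetical names)
        Map<String, String> record = new HashMap<>();
        record.put("subject", "You have won a prize");
        record.put("body", "Click here to claim it");

        PredictRequest request = new PredictRequest()
                .withMLModelId("ml-XXXXXXXXXXXX") // placeholder model id
                .withPredictEndpoint("https://realtime.machinelearning.us-east-1.amazonaws.com") // placeholder endpoint
                .withRecord(record);

        PredictResult result = client.predict(request);
        System.out.println("Predicted label: " + result.getPrediction().getPredictedLabel());
    }
}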

Pricing: Pay as you go


Will the Real AI Please Stand Up? A search for structure in the AI world

Where is AI development going, and how do we know when we are there? The AI world is developing rapidly, and it can be quite challenging to keep up with everything that’s happening across the wide spectrum of AI capabilities.

While writing this post, I tried to discover a way for myself to organise new developments in AI, or at least differentiate between them. The first idea I had was to turn to definitions of AI. Here is an example from alanturing.net:

“Artificial Intelligence (AI) is usually defined as the science of making computers do things that require intelligence when done by humans. .. Research in AI has focussed chiefly on the following components of intelligence: learning, reasoning, problem-solving, perception, and language-understanding.”

It’s a good definition, but far too functional for my taste. When defining AI, I think about a quest to produce human-level and higher intelligence, a sentient artificial consciousness, even an artificial life form. Working from this expectation, I decided to broaden my search to define the boundaries of AI.

I tried to find a taxonomy of AI, but none that I found satisfied me, because they were based on what we have built organically over the lifetime of AI research. What I wanted was a framework which indicated the potential, as well as the reality. I decided to take the search a level higher to look for linkages between intelligence in humans and AI, but I still did not find a satisfactory taxonomy of human intelligence related to AI topics.

While searching for models of human intelligence, I came upon the Cattell-Horn-Carroll (CHC) Theory of Cognitive Abilities. It’s a model which describes the different kinds of intelligence and general cognitive capabilities to be found in humans. I decided to try to map AI capabilities to this cognitive abilities list:

CHC Cognition Capabilities Mapped to AI

The cognitive ability which most closely matched my idea of what AI should aspire to was Fluid Reasoning, which describes the ability to identify patterns, solve novel problems and use abstract reasoning. There are many AI approaches dedicated to providing reasoning-based intelligence, but they are not yet at the level of human capabilities. After some deliberation, I included Neural Turing Machines in this category. This article from New Scientist convinced me that the Neural Turing Machine is the beginning of independent abstract reasoning in machines: the working memory component allows for abstraction and problem solving.

Crystallised Intelligence, also known as Comprehension Knowledge, is about building a knowledge base of useful information about the world. I have linked this type of knowledge to question-answering systems like IBM Watson, which specialise in using stored knowledge.

After some puzzling, I associated Long Short-Term Memory (LSTM) neural networks with long- and short-term memory. In this approach, a neural network node has a feedback loop to itself to reinforce its current state, and a mechanism to forget that state. This serves as a memory mechanism that helps, for instance, in reproducing big-picture patterns. This article on deeplearning.net provided some clarity for me. I also added Neural Turing Machines to the short-term memory category because of their working memory component.

Another interesting aspect was the range of sensory cognitive capabilities addressed by machines, not only with software but also with hardware like touch sensors and advances in processors, not to mention robotic capabilities like movement and agility. Several senses are covered too, including the visual, auditory and olfactory.

This model is strongly focused on human intelligence and capabilities. It could probably be improved by adding a scale of competence to each capability and mapping each AI area onto the scale. Perhaps it also limits thinking about artificial intelligence, but it does at least provide a frame of reference.

Once I had produced this diagram, I really felt that I had reached a milestone. However, the elements above did not cover exactly what I was looking for in a sentient machine. After some search, I discovered another level of cognition which intrigued me – metacognition. This is the ability to think about thinking and reflect on your own cognitive capabilities and process. We use metacognition to figure out how to overcome our own shortcomings in learning and thinking. As far as I can tell, metacognition is still in the theoretical phase for AI systems.

The last puzzle piece for my ideal AI is self-awareness: the ability to recognise yourself and see yourself as others would see you. There is much research and philosophy available on the topic, for example Drs Cruse and Schilling’s robot Hector, which they use as an experiment to develop emergent reflexive consciousness. There are promising ideas in this area, but I believe it is still largely in a theoretical phase.

The mapping above could be improved upon, but it was a good exercise to engage with the AI landscape. The process was interesting because AI approaches and domains had to be considered from different aspects until they fitted into a category. I expect the technology mapping to change as AI matures and new facets appear, but that’s for the future.

Do you dis/agree with these ideas? Please comment!

UI Testing with Sikuli and OpenCV Computer Vision API

Sikuli Player Test
Sikuli IDE with video player test

This week I’ll be zooming in on Sikuli, a testing tool which uses computer vision to help verify UI elements. Sikuli was created by the MIT User Interface Design Group in 2010. The Sikuli IDE lets you use Jython to write simple test cases that identify visual elements on the screen, like buttons, interact with them, and then verify that other elements look correct. This comes close to a manual tester’s role of visually verifying software, and it allows test automation without serious development skills or knowledge of the underlying code. Sikuli is written in C/C++ and is currently maintained by Raimund Hocke.

If you’ve ever tried to do visual verification as a test automation approach in a web environment, you know that it’s a pretty difficult task. From my own experience of trying to set up visual verification of our video player at SDL Media Manager using Selenium and these Test API Utilities, you can run into issues like:

  • different rendering of ui elements in different browsers
  • delays or latency make tests unreliable
  • constantly updating web browsers getting ahead of your Selenium drivers
  • browsers rendering differently on different OS’s
  • browser rendering changes after updates
  • interacting with elements that are dynamically loaded on the screen, with non-static identifiers is inconsistent
  • creating and maintaining tests requires specialised skills and knowledge.

Sikuli aims to reduce the effort of interacting with the application under test. I have downloaded it to give it a test drive. I decided to try out interacting with our SDL Media Manager video player to take Sikuli through its paces, since I already have some experience testing it with Selenium.

Test video: first half and second half

The first thing I had to do was set up the test video I created for video player testing. It consists of basic static shapes on a black background, which helps increase the repeatability of tests, since it’s hard to grab a snapshot at exactly the right moment in the video. The black background also helps with transparency effects. I then started the player and clicked the test action buttons in the IDE to try to interact with it.

Some Sikuli commands:

  • wait
    • either waits for a certain amount of time, or waits for the pattern specified
  • find
    • finds and returns a pattern match for whatever you are looking for
  • click
    • perform a mouse click in the middle of the pattern

I had to play around a bit but this is what finally worked.

Sikuli test 2

The click function was not working on the Mac because the Chrome app was not in focus, so I had to use switchApp. After this the automation worked quite nicely: it located the play/pause button of the player, clicked it to pause, clicked again to resume playing, then waited for the next part of the video containing the yellow square to show, and clicked on that to pause the video.
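Since the IDE screenshots don’t reproduce here, a rough equivalent of this test written against SikuliX’s Java API might look like the sketch below. The image file names and similarity thresholds are assumptions rather than the actual test assets:

import org.sikuli.script.App;
import org.sikuli.script.FindFailed;
import org.sikuli.script.Pattern;
import org.sikuli.script.Screen;

public class PlayerVisualTest {
    public static void main(String[] args) throws FindFailed {
        App.focus("Chrome"); // same role as switchApp in the IDE script
        Screen screen = new Screen();

        // Cropped screenshots of the player controls and test video frames (hypothetical files)
        Pattern playPause = new Pattern("play_pause.png").similar(0.8f);
        Pattern yellowSquare = new Pattern("yellow_square.png").similar(0.7f);

        screen.wait(playPause, 10);    // wait up to 10 seconds for the player to load
        screen.click(playPause);       // pause
        screen.click(playPause);       // resume playing
        screen.wait(yellowSquare, 20); // wait for the yellow-square section of the test video
        screen.click(yellowSquare);    // pause on the yellow square
    }
}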

This is what happened when the test did not succeed:

Failed Sikuli Test

An interesting characteristic of Sikuli is that you can specify how strong a pattern match should be to trigger a positive result. It uses the OpenCV computer vision API, which was built to accelerate the adoption of computer perception and contains hundreds of computer vision algorithms for image processing, video analysis, and object and feature detection. It’s built for real-time image and video processing and is pretty powerful. It was created by Intel and can be used from C/C++, Java, Ruby and Python. There is even a wrapper for C# called Emgu CV. Check out the wiki page for a nice overview.
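To make the idea of a similarity threshold concrete, here is a small sketch of template matching with OpenCV’s Java bindings. This is not how Sikuli is wired internally, just an illustration of the underlying technique; the image files are hypothetical:

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class TemplateMatch {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME); // load the native OpenCV library

        Mat screenshot = Imgcodecs.imread("screenshot.png");
        Mat button = Imgcodecs.imread("play_pause.png");

        // Slide the button template over the screenshot and score every position
        Mat scores = new Mat();
        Imgproc.matchTemplate(screenshot, button, scores, Imgproc.TM_CCOEFF_NORMED);

        // The best normalised score plays the same role as Sikuli's similarity setting
        Core.MinMaxLocResult best = Core.minMaxLoc(scores);
        System.out.println("Best match score " + best.maxVal + " at " + best.maxLoc);
    }
}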

Traditional automated testing methods which validate web UI element markup can miss issues with browser rendering that would be fairly obvious to a human. Although automated UI tests are costly to set up and maintain, in my opinion they represent a vital aspect of product quality that can and should be automated.

Sikuli has a lot of potential, especially since it’s built on a solid computer vision API and is actively being maintained. This indicates to me that there is still room for growth in automated visual verification. I would love to hear your stories about visual verification or Sikuli or any ideas you have on this topic. Comment below!