Quality advice for Robotics startups

[Image: robot recharging]

I have discussed the topic of quality and testing with a few robotics startups and the conversation tends to reach this consensus: formal quality assurance processes have no place in a startup. While I totally appreciate this view, this blog post provides an alternative approach to quality for robotics startups.

The main priority of many startups is to produce something that will attract investment – it has to basically work well enough to get funding. Investors, customers and users can be very forgiving of quality issues, especially where emerging tech is involved. Startups should deliver the right level of quality for now and prepare for the next step.

In a startup, there is not likely to be a dedicated tester or a quality strategy. Developers are the first line of defence for quality: they must bake it into the proof of concept code, perhaps with unit tests. The developers and founders probably do some functional validation. They might exercise more extreme use cases when demo’ing the functionality. They might do limited testing with real-life users.

What are the company’s main priorities at this phase, and what levels of quality do they call for? Initially, the product’s main goal is to support application development and demos, and to be effective and usable for its early adopters. Based on these priorities, I’ve come up with some quality aspects that could be useful for robotics startups.

A good-quality demo

Here are some aspects of quality which could be relevant for demoing:

[Image: Softbank Pepper]

  1. Portable setup
    1. Can be transported without damaging the robot and supporting equipment
    2. Can be explained at airport security if needed
  2. Works under variable conditions in a customer meeting room
    1. Poor wifi connections
    2. Power outlets not available
    3. Outside of company network
    4. Uneven floors
    5. Stairs
    6. Noise
    7. Different lighting
    8. Reflective surfaces
  3. Will work for the duration of the demo
  4. Demo will be suitable for audience
  5. Demo’ed behaviour will be visible and audible from a distance, e.g. in a boardroom
  6. Mode can be changed to a scripted mode for demos
  7. Functionality actually works and can be shown: a checklist of basic functionality takes away the guesswork, without having to come up with heavyweight test cases

Quality for the users and buyers

The robot needs to prove itself fit for operation:

  1. Functionality works
    1. What you offer can be suitably adapted for the customer’s actual scenario
      1. Every business has its own processes, and the bot will probably have to adapt to match the terminologies, workflows and scenarios that fit the users’ processes
      2. Languages can be changed
      3. Bot is capable of conversing at the level of the target audience (e.g. children, elderly)
      4. Bot is suitable for the context where it’s intended to work, such as a hospital or school: it will not make sudden movements or catch on cables
  2. Reliability
    1. Users might tolerate failures to a certain extent, until they become too annoying or repetitive, or cannot be recovered from
    2. Failures might be jarring for vulnerable users like the mentally or physically ill
    3. Is the robot physically robust enough to interact with in unplanned ways?
  3. Security
    1. Will port scanning or other exploitative attacks easily reveal vulnerabilities which can result in unpredictable or harmful behaviour?
    2. Can personal data be hijacked through the robot?
  4. Ethical and moral concerns
    1. Users might not understand that there is no consciousness interacting with them, thinking the robot is autonomous
    2. There might be users who think their interactions will be private while they might be reviewed for analysis purposes
    3. Users might not realise their data will be sent to the cloud and used for analysis
  5. Legal and support issues
    1. What kind of support agreement does the service provider have with the robot manufacturer and how does it translate to the purchaser of the service?

[Image: Decos robots, Robotnik and Eco]

Quality to maintain, pivot and grow

During these cycles of demoing to prospects, defects will be identified and need to be fixed. Customers will give advice or provide input on what they were hoping to see and features will have to be tweaked or added. The same will happen during research and test rounds at customers, and user feedback sessions.

The startup will want to add features and fix bugs quickly. For this to happen, it will help to have good discipline around clean, maintainable code, and at least unit tests to give quick feedback on the quality of changes. Hopefully they will also have some functional (and a few non-functional) acceptance tests.

When adoption increases, the startup might have to pivot quickly to a new application, or scale to more than one customer or use case. At this phase, a lot of refactoring will probably happen to make the existing codebase scalable. In this case, good unit tests and component tests will be your best friends, ensuring you are able to maintain the stability of the functionality you already have (as mentioned in this TechCrunch article on startup quality).

[Image: robot in progress]

Social robot companies are integrators – ensure quality of integrated components

As a social robotics startup, if you are not creating your own hardware, OS, or interaction and processing components, you might want to become familiar with the quality of any hardware or software components you are integrating with. Some basic integration tests will help you stay confident that the basics work when an external API is updated, for instance. It’s also worth considering your liability when something goes wrong somewhere in the chain.
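To make that concrete, here is a minimal sketch of what such an integration smoke test could look like. Everything named here (the CloudSpeechClient wrapper, its transcribe method and result fields) is a hypothetical stand-in for whichever vendor SDK your robot actually depends on:

```python
# Hypothetical integration smoke test: run it whenever the external API or SDK
# is updated to confirm the basics still work. All names below are invented.
import pytest

from robot.integrations import CloudSpeechClient  # hypothetical wrapper module


@pytest.mark.integration
def test_speech_api_still_transcribes_reference_clip():
    client = CloudSpeechClient(api_key="test-key")           # hypothetical constructor
    result = client.transcribe("tests/fixtures/hello.wav")   # known reference audio

    # Assert loosely: exact wording can change between API versions,
    # but an empty or nonsensical result means the integration is broken.
    assert result.text, "API returned an empty transcription"
    assert "hello" in result.text.lower()
    assert 0.0 <= result.confidence <= 1.0                   # contract we rely on
```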

Early days for robot quality

To round up, it does seem to be early days to be talking about social robot quality. But it’s good for startups to be aware of what they are getting into, because this topic will no doubt become more relevant as their company grows. I hope this post helps robotics startups stay in control of their quality as they grow.

Feel free to contact me if you have any ideas or questions about this topic!

Thanks to Koen Hindriks of Interactive Robotics, Roeland van Oers at Ready for Robotics and Tiago Santos at Decos, as well as all the startups and enthusiasts I have spoken to over the past year for input into this article.

Making social robots work

 


Mady Delvaux, in her draft report on robotics, advises the EU that robots should be carefully tested in real life scenarios, beyond the lab. In this and future articles, I will examine different aspects of social robot requirements, quality and testing, and try to determine what is still needed in these areas.

Why test social robots?

In brief, I will define robot quality as: does the robot do what it’s supposed to do, and not do what it shouldn’t. For example, when you press the robot’s power button from an offline state, does the robot turn on and the indicator light turn green? If you press the button quickly twice, does the robot still exhibit acceptable behaviour? Testing is the activity of analysis to determine the quality level of what you have produced – is it good enough for the intended purpose?
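As a toy illustration of that definition, the two button-press checks above could be written as tests against a simulated robot. The FakeRobot class below is invented purely to make the example self-contained; a real robot would expose its own control API:

```python
# Toy sketch: a simulated robot stands in for real hardware so that the two
# checks from the text (single press, rapid double press) become executable tests.
class FakeRobot:
    def __init__(self):
        self.state = "off"
        self.indicator_light = "off"

    def press_power(self):
        # Toggling is one acceptable behaviour; a real spec would define this precisely.
        self.state = "on" if self.state == "off" else "off"
        self.indicator_light = "green" if self.state == "on" else "off"


def test_power_on_from_offline():
    robot = FakeRobot()
    robot.press_power()
    assert robot.state == "on" and robot.indicator_light == "green"


def test_rapid_double_press_is_still_acceptable():
    robot = FakeRobot()
    robot.press_power()
    robot.press_power()  # pressed again almost immediately
    # "Acceptable behaviour" here means a defined state, not a crash or an undefined mode.
    assert robot.state in ("on", "off")
```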

Since social robots will interact closely with people, strict standards will have to be complied with to ensure that they don’t have unintended negative effects. There are already some standards being developed, like ISO 13482:2014 about safety in service robots, but we will need many more to help companies ensure they have done their duty to protect consumers and society. Testing will give insight into whether these robots meet the standards, and new test methods will have to be defined.

What are the core features of the robot?

The first aspect of quality we should measure is if the robot fulfils its basic functional requirements or purpose. For example, a chef robot like the robotic kitchen by Moley would need to be able to take orders, check ingredient availability, order or request ingredients, plan cooking activities, operate the stove or oven, put food into pots and pans, stir, time cooking, check readiness, serve dishes and possibly clean up.

 

A robot at an airport which helps people find their gate and facilities must be able to identify when someone needs help, determine where they are trying to go (perhaps by talking to them, or scanning a boarding pass), plan a route, communicate the route by talking, indicating with gestures, or printing a map, and know when the interaction has ended.

 

With KLM’s Spencer, the guide robot at Schiphol airport, benchmarking was used to ensure the quality of each function separately. Later, the robot was put into live situations at Schiphol and tracked to see if it was planning movement correctly. A metric of distance travelled autonomously vs non-autonomously was used to evaluate the robot. Autonomy will probably be an important characteristic to test, and to make users aware of, in the future.
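The metric itself is simple; in code it might amount to no more than the ratio below (the numbers are invented for illustration):

```python
# Illustrative autonomy metric: share of distance covered without operator takeover.
distance_autonomous_m = 1840.0   # logged while the planner was in control
distance_manual_m = 260.0        # logged while an operator had taken over

autonomy_share = distance_autonomous_m / (distance_autonomous_m + distance_manual_m)
print(f"autonomous share of distance travelled: {autonomy_share:.0%}")  # 88%
```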

Two user evaluation studies were done with Spencer, and feedback was collected about the robot’s effectiveness at guiding people around the airport. Some people, for example, found the speed of the robot too slow, especially in quiet periods, while others found the robot too fast, especially for families to follow.

Different environments and social partners

How can we ensure robots function correctly in the wide variety of environments and interaction situations that we encounter every day? Amazon’s Alexa, for example, suffers from a few communication limitations, such as knowing whether she is taking orders from the right user and conversing with children.

At our family gatherings, our Softbank Nao robot, Peppy, cannot quite make out instructions against talking and cooking noises. He also has a lot of trouble determining who to focus on when interacting in a group. Softbank tests their robots by isolating them in a room and providing recorded input to determine if they have the right behaviour, but it can be difficult to simulate large public spaces. The Pepper robots seem to perform better under these conditions. In the Mummer project, tests are done in malls with Pepper to determine what social behaviours are needed for a robot to interact effectively in public spaces.

 

The Pepper robot at the London Science Museum History of Robots exhibition was hugely popular and constantly surrounded by a crowd – it seemed to do well under these conditions, while following a script, as did the Pepper at the European Robotics Forum 2017.

When society becomes the lab

Kristian Esser, founder of the Technolympics (Olympic-style games for cyborgs), suggests that in these times society itself becomes the test lab. For technologies which are made for close contact with people, but which can have a negative effect on us, the paradox is that we must be present to test them, and the very act of testing is risky.

Consider self-driving vehicles, which must eventually be tested on the road. The human driver must remain aware of what is happening and correct the car when needed, as we have seen in the case of Tesla’s first self-driving car fatality: “The … collision … raised concerns about the safety of semi-autonomous systems, and the way in which Tesla had delivered the feature to customers.” Assisted driving will probably reduce the overall number of traffic-related fatalities in the future, and that’s why it’s a goal worth pursuing.

For social robots, we will likely have to follow a similar approach: first trying to achieve a certain level of quality in the lab, then working with informed users to guide the robot, perhaps in a semi-autonomous mode. The perceived value of the robot should be in balance with the risks of testing it. With KLM’s Spencer robot, a combination of lab tests and real-life tests was performed to build the robot up to a level of quality at which it could be exposed to people in a supervised way.

Training robots

Over lunch the other day, my boss suggested the idea of teaching social robots as we do children: by observing or reviewing behaviour and correcting it afterwards. There is research supporting this idea, like this study on robots learning from humans by imitation and goal inference. One problem with letting the public train social robots is that they might teach the robots unethical or unpleasant behaviour, as in the case of the Microsoft chatbot.

To ensure that robots do not learn undesirable behaviours, perhaps we can have a ‘foster parent’ system – trained and approved robot trainers who build up experience over time and can be held accountable for the training outcome. To prevent the robot accidentally picking up bad behaviours, it could have distinct learning and executing phases.

The robot might have different ways of getting validation of its tasks, behaviours or conclusions. It would then depend on the judgement of the user to approve or correct behaviour. New rules could be sent to a cloud repository for further inspection and compared with similar learned rules from other robots, to find consensus. Perhaps new rules should only be applied if they have been learned and confirmed in multiple households, or examined by a technician.
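A toy version of that consensus check might look like this; the threshold and the reported rules are invented for illustration:

```python
# Toy sketch of the "confirmed in multiple households" idea: a learned rule is only
# approved once enough independent robots report it; the rest go to a technician.
from collections import Counter

MIN_HOUSEHOLDS = 3  # invented threshold

reported_rules = [
    ("robot-a", "greet visitors before offering help"),
    ("robot-b", "greet visitors before offering help"),
    ("robot-c", "greet visitors before offering help"),
    ("robot-d", "block the doorway while charging"),
]

confirmations = Counter(rule for _, rule in reported_rules)
approved = [rule for rule, n in confirmations.items() if n >= MIN_HOUSEHOLDS]
needs_review = [rule for rule, n in confirmations.items() if n < MIN_HOUSEHOLDS]

print("approved:", approved)            # ['greet visitors before offering help']
print("needs a technician:", needs_review)
```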

To conclude, I think testing of social robots will be done in phases, as it is done with many other products. There is a limit to what we can achieve in a lab and there should always be some controlled testing in real life scenarios. We as consumers should be savvy as to the limitations of our robots and conscious of their learning process and our role in it.

UI Testing with Sikuli and OpenCV Computer Vision API

[Image: Sikuli IDE with video player test]

This week I’ll be zooming in on Sikuli, a testing tool which uses computer vision to help verify UI elements. Sikuli was created by the MIT User Design Group in 2010. The Sikuli IDE allows you to use Jython to write simple test cases that identify visual elements on the screen, like buttons, interact with them, and then verify that other elements look correct. This comes close to a manual tester’s role of visually verifying software, and allows test automation without serious development skills or knowledge of the underlying code. Sikuli is written in C/C++ and is currently maintained by Raimund Hocke.

If you’ve ever tried visual verification as a test automation approach in a web environment, you know that it’s a pretty difficult task. From my own experience of trying to set up visual verification on our video player at SDL Media Manager, using Selenium and these Test API Utilities for verification, you are likely to run into issues like:

  • different rendering of UI elements in different browsers
  • delays or latency that make tests unreliable
  • constantly updating web browsers getting ahead of your Selenium drivers
  • browsers rendering differently on different OSes
  • browser rendering changes after updates
  • inconsistent interaction with dynamically loaded elements that have non-static identifiers
  • creating and maintaining tests requires specialised skills and knowledge.

Sikuli aims to reduce the effort of interacting with the application under test. I have downloaded it to give it a test drive. I decided to try out interacting with our SDL Media Manager video player to take Sikuli through its paces, since I already have some experience testing it with Selenium.

[Images: test video, first and second half]

The first thing I had to do was set up the test video I created for video player testing. It consists of static basic shapes on a black background, which helps increase the repeatability of tests, since it’s hard to get a snapshot at exactly the right moment in the video. The black background also helps with transparency effects. I then started the player running and clicked on the test action buttons in the IDE to try to interact with the player.

Some Sikuli commands:

  • wait
    • either waits for a certain amount of time, or waits for the pattern specified
  • find
    • finds and returns a pattern match for whatever you are looking for
  • click
    • performs a mouse click in the middle of the pattern

I had to play around a bit but this is what finally worked.

[Image: Sikuli test 2]

The click function was not working on the Mac because the Chrome app was not in focus, so I had to use switchApp. After this, the automation worked quite nicely: locating the play/pause button of the player, clicking it to pause, clicking to resume playing, then waiting for the next part of the video to show, containing the yellow square, and clicking on that to pause the video.
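For readers who haven’t seen a Sikuli script, the working test looked roughly like the sketch below. The .png names are placeholders for screenshots captured in the IDE, so treat this as an approximation of the script in the screenshot rather than a copy of it:

```python
# Runs inside the Sikuli IDE (Jython). The .png arguments refer to reference
# screenshots captured in the IDE; the file names here are placeholders.
switchApp("Google Chrome")         # bring the browser with the player into focus

wait("play_pause_button.png", 10)  # wait up to 10 seconds for the player controls
click("play_pause_button.png")     # pause the video
click("play_pause_button.png")     # resume playing

wait("yellow_square.png", 30)      # wait for the part of the test video with the yellow square
click("yellow_square.png")         # click on it to pause the video again
```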

This is what happened when the test did not succeed:

[Image: failed Sikuli test]

An interesting characteristic of Sikuli is that you can specify how strong a pattern match you would like in order to trigger a positive result. It uses the OpenCV computer vision API, which was built to accelerate the adoption of computer perception and contains hundreds of computer vision algorithms for image processing, video analysis, and object and feature detection. It’s built for real-time image and video processing and is pretty powerful. It was created by Intel and can be used from C/C++, Java, Ruby and Python. There is even a wrapper for C# called Emgu CV. Check out the Wiki page for a nice overview.
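Under the hood this is template matching. The snippet below is not Sikuli’s actual implementation, just a rough illustration of the kind of call the OpenCV API provides and how a match-strength threshold comes into play (file names invented):

```python
# Rough illustration of OpenCV template matching: find where a button screenshot
# appears within a full-screen capture and how strong the match is.
import cv2

screen = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)
button = cv2.imread("play_pause_button.png", cv2.IMREAD_GRAYSCALE)

result = cv2.matchTemplate(screen, button, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_location = cv2.minMaxLoc(result)

if best_score >= 0.9:  # comparable to Sikuli's similarity setting
    print("Button found at", best_location, "with score", round(best_score, 2))
else:
    print("No sufficiently strong match; best score was", round(best_score, 2))
```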

Traditional automated testing methods which validate web UI element markup can miss browser rendering issues that would be fairly obvious to a human. Although automated UI tests are costly to set up and maintain, in my opinion visual verification is a vital aspect of product quality that can and should be automated.

Sikuli has a lot of potential, especially since it’s built on a solid computer vision API and is actively being maintained. This indicates to me that there is still room for growth in automated visual verification. I would love to hear your stories about visual verification or Sikuli or any ideas you have on this topic. Comment below!

Facebook at GTAC on using AI for Testing

As a follow-up to my post on Google’s use of AI in Testing at their GTAC 2014 conference, here is a review of the Facebook Testing session:

GTAC 2014: Never Send a Human to do a Machine’s Job: How Facebook uses bots to manage tests (Roy Williams)

In this talk, Roy Williams tells us about the Facebook code base growing until it became hard for developers to predict the system-wide effects of their changes. Checking in code caused seemingly unrelated tests to fail. As more and more tests failed, developers began ignoring failed tests when checking in and test integrity was compromised. With a release schedule of twice a day to the Facebook website, it was important to have trustworthy tests to validate changes.

To remedy this situation, they set up a test management system which manages the lifecycle of automated tests. It’s composed of several agents which monitor tests and assign test quality statuses. For instance, when new tests are created, they are not released immediately to run against everyone’s check-ins, but run against a few check-ins to judge the integrity of the test. If the test fails, it goes back to the author to improve.

[Image: Facebook test lifecycle]

If a passing test starts to fail, an agent called FailBot marks the test as failing and assigns a task to the owner of the test to fix it. If a test fails and passes sporadically, another agent, GreenWarden, marks it as a test of unknown quality, and the owner needs to fix it. If a test keeps failing, it gets moved to the disabled state, and the owner gets 5 days to fix it. If it starts passing again, its status gets promoted; otherwise it gets deleted after a month. This prevents failing tests from getting out of hand, overwhelming developers and, eventually, being ignored at check-in.
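To make the lifecycle easier to follow, here is a toy state machine along those lines. The state names and transition rules are my paraphrase of the talk, not Facebook’s actual implementation:

```python
# Toy model of the test lifecycle described in the talk (paraphrased, simplified).
from enum import Enum


class TestState(Enum):
    NEW = "new"            # runs against a few check-ins only
    HEALTHY = "healthy"    # runs against every check-in
    FAILING = "failing"    # FailBot-style: owner gets a task to fix it
    UNKNOWN = "unknown"    # GreenWarden-style: passes and fails sporadically
    DISABLED = "disabled"  # owner has a limited window to fix it
    DELETED = "deleted"


def next_state(state, recent_results):
    """recent_results: list of booleans for recent runs, True meaning a pass."""
    if state == TestState.NEW:
        return TestState.HEALTHY if all(recent_results) else TestState.NEW
    if state == TestState.HEALTHY and not all(recent_results):
        # Flaky if results are mixed, plainly failing if they are all red.
        return TestState.UNKNOWN if any(recent_results) else TestState.FAILING
    if state in (TestState.FAILING, TestState.UNKNOWN):
        if all(recent_results):
            return TestState.HEALTHY   # promoted again
        if not any(recent_results):
            return TestState.DISABLED  # kept failing
    if state == TestState.DISABLED:
        return TestState.HEALTHY if all(recent_results) else TestState.DELETED
    return state
```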

[Image: Facebook test bots and wardens]
Slides can be found here by the way.

This system improves the development process by maintaining the integrity of the test suite and ensuring people can afford to take test failures seriously. It’s a great example of how to shift an intelligent process from humans to machines, and it also highlights an advantage of using machines: the ability to scale.

Writing this post also made me ponder why I had classified this system as an application of artificial intelligence. I believe the key lies in transferring activities requiring some degree of judgement to machines. We have already allocated test execution to computers with test automation, but in this case, it is test management which has been delegated. I will dig into this topic more in a future post I am working on, about qualifiers for AI applied to testing. 

Overall, this talk was a pretty fascinating insight into Facebook’s development world, with some great concepts that can be applied to any development environment.

What Google has to say on AI in Testing

This week, the Google test blog newsletter was about GTAC, the Google Test Automation Conference. I found this session on AI applied to testing really relevant:

Free Tests Are Better Than Free Bananas: Using Data Mining and Machine Learning To Automate Real-Time Production Monitoring (Celal Ziftci of Google)

The session was about Google’s assertion framework, which runs against their production log files in real time, checking for inconsistencies. Examples of meaningful assertions:

transaction.id > 0, transaction.date != null.

If any assertions fail, a notification is sent to a developer to take some kind of action. Usually a developer would have to design assertions, but now they use a tool to assist.

The Daikon invariant detector identifies invariants (rules which are always true for a certain section of code). Although Daikon was designed to analyse code, they have modified it to work on data like log files. Daikon starts with a set of hypotheses about what your input will look like and, as you push data through it, eliminates those which prove to be false. The rules identified can be used as assertions, thereby automatically generating test cases. These test cases still need a developer to determine their value and validity, however.
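To make the idea concrete, here is a toy version of that hypothesis-elimination approach. The candidate invariants and log records are invented; real Daikon checks a far richer set of properties:

```python
# Toy Daikon-style invariant detection over log records: start with candidate
# invariants and discard any that the data falsifies.
candidates = {
    "transaction.id > 0":       lambda r: r["id"] > 0,
    "transaction.date != null": lambda r: r["date"] is not None,
    "transaction.amount >= 0":  lambda r: r["amount"] >= 0,
}

log_records = [
    {"id": 17, "date": "2014-10-28", "amount": 12.50},
    {"id": 42, "date": "2014-10-28", "amount": 0.00},
    {"id": 99, "date": "2014-10-29", "amount": -3.00},  # a refund falsifies amount >= 0
]

surviving = {
    name: check
    for name, check in candidates.items()
    if all(check(record) for record in log_records)
}

# Survivors become candidate assertions for the monitoring framework; a developer
# still has to judge whether each one is meaningful.
print(sorted(surviving))  # ['transaction.date != null', 'transaction.id > 0']
```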

The other technique they use is Association Rule Learning, which finds relationships between different data items, e.g.

when country is uk, time zone is +0

These, too, are added into the assertion framework to identify issues occurring in production.
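A crude flavour of association rule learning, again on invented data: look for field values that always occur together and turn them into candidate rules:

```python
# Toy association-rule mining over log records: propose "when A then B" rules for
# field values that always co-occurred. The records are invented for illustration.
from collections import defaultdict
from itertools import permutations

records = [
    {"country": "uk", "timezone": "+0"},
    {"country": "uk", "timezone": "+0"},
    {"country": "nl", "timezone": "+1"},
]

pair_counts = defaultdict(int)  # (antecedent, consequent) -> co-occurrence count
item_counts = defaultdict(int)  # antecedent -> occurrence count

for record in records:
    items = list(record.items())
    for a, b in permutations(items, 2):
        pair_counts[(a, b)] += 1
    for item in items:
        item_counts[item] += 1

for (a, b), count in pair_counts.items():
    if count == item_counts[a] and count > 1:  # rule held every time A appeared
        print(f"when {a[0]} is {a[1]}, {b[0]} is {b[1]}")
# Prints the rule in both directions here; a real miner would also rank by support.
```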

In this case, the work that developers used to do in defining test cases is now being done by machines. But human beings are still needed to decide whether the identified rules make sense and add value.

The AI system, at times, identifies trivial rules, but is also capable of identifying complex relationships that would be less obvious to humans.

Why is it still so difficult?

There are many researched and documented approaches to apply artificial intelligence techniques to testing. So why don’t we have AI automatic testing software already?

Probably it’s the size and irregularity of the problem set, and, I guess, this is the challenge of mimicking the wonder of the human brain. How do you build an application that will figure out how to test a unique program? The program might have varying purpose, structure and size in each instance. It might have different combinations of components and infinite viable execution paths. How do we as software testers approach this problem?

Often we use models of our software to conceptualise it and make it easier to digest. Then we break it up into components and might test each component and its combinations. Or we might approach the system from the aspect of its functionality and try to ensure the expected functions work as they should.

We use certain heuristics to impart value to each aspect of our systems and then test the ones we feel are most valuable or most likely to fail first. We might include information from different sources including defect registers, product management and knowledge of support calls to impart priority. We use experience and trial and error to determine how to test and what to test.

Then we need to design a method of testing the software. Will we manually test, automate or both? In each case, how can we interact with the software in the most realistic and valuable way? This kind of organic investigation and decision making, and the number of decisions that have to be made as well as the knowledge store required to make them, make it difficult to transfer the entire process in one go to any one automated system (developers, I hope you will appreciate your testers a bit more after reading this 😉 ).

Although test automation techniques could give us clues as to how to model the test process and test activities, these still require a large amount of human input. This input is what we are trying to substitute by applying AI to testing.

Where are we?

[Image: © Sarah Holmlund | Dreamstime.com]

When I first Googled this topic, to my surprise, the top results were not touting test tools using AI to augment the test process, but research papers and books describing the theory of applying AI to software testing.

The progress we have recently made in artificial intelligence applications fascinates me, particularly in areas touching on the potential creativity of machines. This is one of the highly experimental areas of AI, which still seems to be taking shape. Other applications of AI have been rather successful from the beginning, like AI planners, which are routinely used to aid in the planning of complex systems, and data mining, which grows more popular with time. It looks like AI applied to testing is one of those areas where it still proves rather difficult to replace a human being.

In a training course I attended this year, we discussed whether testers would soon be put out of work, replaced by machines. Since we started using agile development practices and test automation tools, it seems that we require fewer and fewer testers to produce the same amount of software. Perhaps the next step in this progression is for software development and testing to be carried out in part by machines. This could put people out of work, and would not be limited to the software industry. This article from the Economist discusses the economic risk to jobs posed by intelligent computing. But luckily we still seem to have some years to go before we have put ourselves out of business.

On Becoming Obsolete

An examination of artificial intelligence techniques applied to software testing.

The idea of computers testing software with little or no human intervention intrigues me. Software development (and other creative pursuits) is arguably an inefficient and unpredictable endeavour. It’s rather difficult to design a perfect system and produce it accurately in one step. We need lots of back and forth in terms of sharing ideas and then improving them until we have something which could work. Then we need some iterations to translate this idea into code and probably lose something in that translation along the way. After this we must verify that we have achieved our goal with more communication and rework. Thus, there is something beautiful about cutting ourselves out of the loop and increasing computer independence, proceeding towards an efficient and natural process of software generation.

A forum discussion on LinkedIn prompted me to dig deeper into how close we are to replacing testers in software development. As a lead tester, I find it a fascinating challenge. Although we have formal methods of testing software, a lot of testing in the field is, I believe, led by gut instinct. It could be said that the best testers are the ones who can intuitively sniff out defects and weaknesses. But where does this intuition come from? And how can we replicate it in machines? Or will it come down to brute-force approaches with thousands of inputs and outputs to achieve the same goal?

In future postings, I intend to unpack what there is to know in this field and then speculate on where it could be headed and how it can be applied.