Making social robots work

 


Mady Delvaux, in her draft report on robotics, advises the EU that robots should be carefully tested in real-life scenarios, beyond the lab. In this and future articles, I will examine different aspects of social robot requirements, quality and testing, and try to determine what is still needed in these areas.

Why test social robots?

In brief, I will define robot quality as whether the robot does what it's supposed to do, and doesn't do what it shouldn't. For example, when you press the robot's power button from an offline state, does the robot turn on and the indicator light turn green? If you press the button twice in quick succession, does the robot still behave acceptably? Testing is the analysis activity that determines the quality level of what you have produced: is it good enough for the intended purpose?
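
To make this concrete, here is a minimal, pytest-style sketch of how such a check might be automated. The Robot class and its press_power_button method are purely hypothetical stand-ins for illustration, not any vendor's real API.

```python
# A minimal, pytest-style sketch of how the power-button check might be
# automated. The Robot class and press_power_button method are hypothetical
# stand-ins, not any vendor's real API.

class Robot:
    """Toy stand-in for a robot under test."""

    def __init__(self):
        self.powered_on = False
        self.indicator = "off"

    def press_power_button(self):
        # Toggle power; the indicator must always agree with the power state.
        self.powered_on = not self.powered_on
        self.indicator = "green" if self.powered_on else "off"


def test_power_on_from_offline():
    robot = Robot()
    robot.press_power_button()
    assert robot.powered_on
    assert robot.indicator == "green"


def test_double_press_leaves_consistent_state():
    robot = Robot()
    robot.press_power_button()
    robot.press_power_button()
    # Whatever the spec says the end state should be, the power state and
    # the indicator must agree with each other.
    assert robot.indicator == ("green" if robot.powered_on else "off")
```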

Since social robots will interact closely with people, they will have to comply with strict standards to ensure that they don't have unintended negative effects. Some standards are already being developed, like ISO 13482:2014 on the safety of personal care robots, but we will need many more to help companies show they have done their duty to protect consumers and society. Testing will give insight into whether these robots meet the standards, and new test methods will have to be defined.

What are the core features of the robot?

The first aspect of quality we should measure is whether the robot fulfils its basic functional requirements or purpose. For example, a chef robot like the robotic kitchen by Moley would need to be able to take orders, check ingredient availability, order or request ingredients, plan cooking activities, operate the stove or oven, put food into pots and pans, stir, time cooking, check readiness, serve dishes and possibly clean up.

 

A robot at an airport that helps people find their gate and facilities must be able to identify when someone needs help, determine where they are trying to go (perhaps by talking to them or scanning a boarding pass), plan a route, communicate that route by talking, gesturing or printing a map, and know when the interaction has ended.
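
That interaction can be thought of as a simple state flow. The sketch below is illustrative only; the state names are my own simplification and do not reflect Spencer's actual architecture.

```python
# An illustrative sketch of the guide-robot interaction flow as a simple
# state machine. The states and transitions are my own simplification,
# not Spencer's actual architecture.

from enum import Enum, auto


class GuideState(Enum):
    IDLE = auto()           # watching for someone who needs help
    IDENTIFY_GOAL = auto()  # talk to the person or scan a boarding pass
    PLAN_ROUTE = auto()     # compute a route to the gate or facility
    COMMUNICATE = auto()    # speak, gesture or print a map
    DONE = auto()           # recognise that the interaction has ended


TRANSITIONS = {
    GuideState.IDLE: GuideState.IDENTIFY_GOAL,
    GuideState.IDENTIFY_GOAL: GuideState.PLAN_ROUTE,
    GuideState.PLAN_ROUTE: GuideState.COMMUNICATE,
    GuideState.COMMUNICATE: GuideState.DONE,
}


def run_interaction():
    state = GuideState.IDLE
    while state != GuideState.DONE:
        state = TRANSITIONS[state]
        print(state.name)


run_interaction()
```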

 

With KLM's Spencer, the guide robot at Schiphol airport, benchmarking was used to ensure the quality of each function separately. Later the robot was put into live situations at Schiphol and tracked to see whether it was planning movement correctly. A metric of distance travelled autonomously versus non-autonomously was used to evaluate the robot. Autonomy will probably be an important characteristic to test, and to make users aware of, in the future.
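
As a rough illustration, that metric could be computed as the share of distance travelled autonomously out of the total distance. The segment log format below is an assumption on my part, not the project's actual data model.

```python
# A sketch of the autonomy metric: the share of distance travelled
# autonomously out of the total distance. The segment format
# (distance_metres, was_autonomous) is an assumption for illustration.

def autonomy_ratio(segments):
    """segments: iterable of (distance_metres, was_autonomous) pairs."""
    autonomous = sum(d for d, was_auto in segments if was_auto)
    total = sum(d for d, _ in segments)
    return autonomous / total if total else 0.0


# Example: 420 m guided autonomously, 80 m under manual control -> 0.84
print(autonomy_ratio([(420, True), (80, False)]))
```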

Two user evaluation studies were done with Spencer, and feedback was collected about the robot’s effectiveness at guiding people around the airport. Some people, for example, found the speed of the robot too slow, especially in quiet periods, while others found the robot too fast, especially for families to follow.

Different environments and social partners

How can we ensure robots function correctly in the wide variety of environments and interaction situations we encounter every day? Amazon's Alexa, for example, suffers from a few communication limitations, such as not knowing whether she is taking orders from the right user, and difficulty conversing with children.

At our family gatherings, our SoftBank Nao robot, Peppy, cannot quite make out instructions over the talking and cooking noise. He also has a lot of trouble determining whom to focus on when interacting with a group. SoftBank tests its robots by isolating them in a room and providing recorded input to check that they behave correctly, but it is difficult to simulate large public spaces that way. The Pepper robots seem to perform better under these conditions. In the MuMMER project, tests are done in malls with Pepper to determine what social behaviours a robot needs to interact effectively in public spaces.

 

The Pepper robot at the London Science Museum History of Robots exhibition was hugely popular and constantly surrounded by a crowd – it seemed to do well under these conditions, while following a script, as did the Pepper at the European Robotics Forum 2017.

When society becomes the lab

Kristian Esser, founder of the Technolympics, Olympic-style games for cyborgs, suggests that in these times society itself becomes the test lab. For technologies that are made for close contact with people, but which can have a negative effect on us, the paradox is that we must be present to test them, and the very act of testing is risky.

Consider self-driving vehicles, which must eventually be tested on the road. The human driver must remain aware of what is happening and correct the car when needed, as we have seen in the case of Tesla's first self-driving car fatality: “The … collision … raised concerns about the safety of semi-autonomous systems, and the way in which Tesla had delivered the feature to customers.” Assisted driving will probably reduce the overall number of traffic-related fatalities in the future, and that's why it's a goal worth pursuing.

For social robots, we will likely have to follow a similar approach: first trying to achieve a certain level of quality in the lab, and then working with informed users to guide the robot, perhaps in a semi-autonomous mode. The perceived value of the robot should be in balance with the risks of testing it. With KLM's Spencer robot, a combination of lab tests and real-life tests was performed to build the robot up to a level of quality at which it could be exposed to people in a supervised way.

Training robots

Over lunch the other day, my boss suggested the idea of teaching social robots as we do children: by observing or reviewing behaviour and correcting it afterwards. There is research supporting this idea, like this study on robots learning from humans by imitation and goal inference. One problem with letting the public train social robots is that they might teach the robots unethical or unpleasant behaviour, as in the case of Microsoft's Tay chatbot.

To ensure that robots do not learn undesirable behaviours, perhaps we could have a 'foster parent' system: trained and approved robot trainers who build up experience over time and can be held accountable for the training outcome. To prevent the robot from accidentally picking up bad behaviours, it could have distinct learning and executing phases.

The robot might have different ways of getting its tasks, behaviours or conclusions validated. It would then depend on the judgement of the user to approve or correct behaviour. New rules could be sent to a cloud repository for further inspection and compared with similar learned rules from other robots to find consensus. Perhaps new rules should only be applied once they have been learned and confirmed in multiple households, or examined by a technician.
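
As a rough sketch of how such a consensus check might work, the snippet below only approves a learned rule once several distinct households have confirmed it, or a technician has signed it off. The data model and threshold are assumptions of mine, not a description of any existing system.

```python
# A sketch of the consensus idea: a learned rule is only approved for the
# executing phase once several distinct households have confirmed it, or a
# technician has signed it off. The data model and threshold are assumptions.

MIN_HOUSEHOLDS = 3  # assumed threshold


def approved_rules(submissions, technician_approved=frozenset()):
    """submissions: iterable of (household_id, rule) pairs uploaded to the
    cloud repository by robots in their learning phase."""
    households_per_rule = {}
    for household, rule in submissions:
        households_per_rule.setdefault(rule, set()).add(household)

    return {
        rule
        for rule, households in households_per_rule.items()
        if len(households) >= MIN_HOUSEHOLDS or rule in technician_approved
    }


# Example: only the rule confirmed in three households is approved.
submissions = [
    ("household-1", "dim the lights after 22:00"),
    ("household-2", "dim the lights after 22:00"),
    ("household-3", "dim the lights after 22:00"),
    ("household-1", "hide the car keys"),
]
print(approved_rules(submissions))
```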

To conclude, I think testing of social robots will be done in phases, as it is with many other products. There is a limit to what we can achieve in a lab, and there should always be some controlled testing in real-life scenarios. We as consumers should be savvy about the limitations of our robots and conscious of their learning process and our role in it.

Facebook at GTAC on using AI for Testing

As a follow-up to my post on Google’s use of AI in Testing at their GTAC 2014 conference, here is a review of the Facebook Testing session:

GTAC 2014: Never Send a Human to do a Machine’s Job: How Facebook uses bots to manage tests (Roy Williams)

In this talk, Roy Williams tells us about the Facebook code base growing until it became hard for developers to predict the system-wide effects of their changes. Checking in code caused seemingly unrelated tests to fail. As more and more tests failed, developers began ignoring failed tests when checking in, and test integrity was compromised. With releases to the Facebook website happening twice a day, it was important to have trustworthy tests to validate changes.

To remedy this situation, they set up a test management system which manages the lifecycle of automated tests. It is composed of several agents which monitor tests and assign quality statuses. For instance, when new tests are created, they are not released immediately to run against everyone's check-ins, but are first run against a few check-ins to judge the integrity of the test. If the test fails, it goes back to the author for improvement.

Facebook test lifecycle

If a passing test starts to fail, an agent, FailBot, marks the test as failing and assigns a task to the owner of the test to fix it. If a test fails and passes sporadically, another agent, GreenWarden, marks it as a test of unknown quality, and the owner needs to fix it. If a test keeps failing, it gets moved to the disabled state, and the owner gets 5 days to fix it. If it starts passing again, its status gets promoted; otherwise it gets deleted after a month. This prevents failing tests from getting out of hand and overwhelming developers, and ultimately prevents test failures from being ignored when checking in code.
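
Here is a toy reconstruction of that lifecycle as a state machine. The state names follow the talk, but the code and its simplifications (no flakiness detection, no time-based deletion) are mine, not Facebook's.

```python
# A toy reconstruction of the test lifecycle described in the talk. The state
# names follow the talk, but the code and its simplifications (no flakiness
# detection, no time-based deletion) are mine, not Facebook's.

from enum import Enum, auto


class TestState(Enum):
    NEW = auto()       # runs against a few check-ins only
    PASSING = auto()   # trusted, runs against everyone's check-ins
    FAILING = auto()   # FailBot has assigned a fix task to the owner
    DISABLED = auto()  # owner has a few days to fix it
    DELETED = auto()   # removed after staying broken too long


def on_result(state: TestState, passed: bool) -> TestState:
    if state == TestState.NEW:
        # A failing new test goes back to its author; it is only released
        # to run against all check-ins once it proves itself.
        return TestState.PASSING if passed else TestState.NEW
    if state == TestState.PASSING and not passed:
        return TestState.FAILING
    if state == TestState.FAILING:
        return TestState.PASSING if passed else TestState.DISABLED
    if state == TestState.DISABLED:
        # Promoted again if it recovers; otherwise eventually deleted
        # (the real system waits a month before deleting).
        return TestState.PASSING if passed else TestState.DELETED
    return state
```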

Facebook test bots and wardens
Slides can be found here by the way.

This system improves the development process by maintaining the integrity of the test suite and ensuring people can afford to take test failures seriously. It's a great example of how to shift an intelligent process from humans to machines, but it also highlights an advantage of using machines: the ability to scale.

Writing this post also made me ponder why I had classified this system as an application of artificial intelligence. I believe the key lies in transferring activities requiring some degree of judgement to machines. We have already allocated test execution to computers with test automation, but in this case it is test management that has been delegated. I will dig into this topic more in a future post I am working on about qualifiers for AI applied to testing.

Overall, this talk was a pretty fascinating insight into Facebook’s development world, with some great concepts that can be applied to any development environment.

What Google has to say on AI in Testing

This week, the Google test blog newsletter was about GTAC, the Google Test Automation Conference. I found this session on AI applied to testing particularly relevant:

Free Tests Are Better Than Free Bananas: Using Data Mining and Machine Learning To Automate Real-Time Production Monitoring (Celal Ziftci of Google)

The session was about Google's assertion framework, which runs against their production log files in real time, checking for inconsistencies. Examples of meaningful assertions:

transaction.id > 0, transaction.date != null.

If any assertions fail, a notification is sent to a developer to take some kind of action. Usually a developer would have to design assertions, but now they use a tool to assist.
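
As a rough sketch, an assertion framework of this kind could be as simple as a set of named predicates evaluated against each log record, with failures routed to a developer. The record fields and the notify callback below are hypothetical, not Google's actual implementation.

```python
# A sketch of what such an assertion framework might look like: named
# predicates evaluated against each log record, with failures routed to a
# developer. The record fields and the notify callback are hypothetical.

ASSERTIONS = {
    "transaction id is positive": lambda rec: rec["transaction_id"] > 0,
    "transaction has a date": lambda rec: rec["transaction_date"] is not None,
}


def check_record(record, notify=print):
    for name, predicate in ASSERTIONS.items():
        try:
            ok = predicate(record)
        except (KeyError, TypeError):
            ok = False  # a missing or malformed field also counts as a failure
        if not ok:
            notify(f"Assertion failed: {name} in record {record}")


check_record({"transaction_id": 0, "transaction_date": None})
```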

The Daikon invariant detector identifies invariants (rules which are always true for a certain section of code). Although Daikon was designed to analyse code, they have modified it to work on data such as log files. Daikon starts with a set of hypotheses about what your input will look like, and as you push data through it, it eliminates those which prove to be false. The rules identified can be used as assertions, thereby automatically generating test cases. These test cases still need a developer to determine their value and validity, however.
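
A toy version of that hypothesis-elimination idea is sketched below. Daikon itself supports a far richer set of invariant templates; the candidate rules and log records here are invented for illustration.

```python
# A toy illustration of the hypothesis-elimination idea: start with a set of
# candidate invariants and drop any that a record falsifies. Daikon supports
# far richer invariant templates; these rules and records are invented.

CANDIDATES = {
    "id > 0": lambda rec: rec["id"] > 0,
    "date != null": lambda rec: rec["date"] is not None,
    "amount >= 100": lambda rec: rec["amount"] >= 100,
}


def surviving_invariants(records, candidates):
    surviving = dict(candidates)
    for rec in records:
        for name in list(surviving):
            if not surviving[name](rec):
                del surviving[name]  # falsified, drop the hypothesis
    return surviving


logs = [
    {"id": 7, "date": "2014-10-28", "amount": 120},
    {"id": 9, "date": "2014-10-28", "amount": 35},  # falsifies "amount >= 100"
]
print(list(surviving_invariants(logs, CANDIDATES)))  # ['id > 0', 'date != null']
```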

The other technique they use is Association Rule Learning, which finds relationships between different data items, e.g.

when country is uk, time zone is +0

These, too, are added into the assertion framework to identify issues occurring in production.
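
As an illustration, a very simple version of this could count how often a country co-occurs with a time zone and keep only high-confidence pairs. Proper association rule learning (e.g. Apriori) is far more general, and the field names below are assumptions, not Google's actual schema.

```python
# A small sketch of the association-rule idea: count how often a country
# co-occurs with a time zone and keep only high-confidence pairs. Real
# association rule learning (e.g. Apriori) is far more general; the field
# names here are assumptions.

from collections import defaultdict


def country_timezone_rules(records, min_confidence=0.99):
    counts = defaultdict(lambda: defaultdict(int))
    for rec in records:
        counts[rec["country"]][rec["timezone"]] += 1

    rules = []
    for country, tz_counts in counts.items():
        total = sum(tz_counts.values())
        tz, n = max(tz_counts.items(), key=lambda kv: kv[1])
        if n / total >= min_confidence:
            rules.append(f"when country is {country}, time zone is {tz}")
    return rules


print(country_timezone_rules([{"country": "uk", "timezone": "+0"}] * 200))
```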

In this case, the work that developers used to do in defining test cases is now being done by machines. But human beings are still needed to make a decision on whether the identified rules make sense and add value.

The AI system, at times, identifies trivial rules, but is also capable of identifying complex relationships that would be less obvious to humans.