Introduction and image analysis
The HowOld Robot doesn't always guess your age correctly when you show it your photograph, but it's certainly caught everyone's attention. And that's just one of the four REST APIs Microsoft Research is making available through Project Oxford.
Ryan Gaglon from Microsoft Research (MSR) explained to TechRadar Pro what the services can actually do – and where developers might recognise them from. "The speech APIs are about [your apps] being able to hear and to speak back. This is the same backend that powers Cortana," he told us.
The speech services can turn speech into text or synthesise speech from text in a variety of synthetic voices; they cover 17 languages in the initial beta. The recognition works over a WebSocket connection and, as you watch, you can see the API figuring out individual words and then going back to turn them into phrases and sentences, complete with punctuation and capital letters.
If what it's trying to recognise is a short phrase that the API isn't certain about – it's very easy to mix up 450 6th Street and 456th Street, for example – it will send back up to five alternatives (and it's up to the developer to decide if it's useful to show those).
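To make that developer decision concrete, here is a minimal Python sketch of one way to handle a list of alternatives. The `Alternative` type, its `confidence` score and the `margin` value are assumptions made for illustration; they are not the Project Oxford response format.

```python
from typing import List, NamedTuple


class Alternative(NamedTuple):
    """One recognition hypothesis (hypothetical shape, not the real API's)."""
    text: str
    confidence: float  # assumed score between 0 and 1


def choose_transcripts(alternatives: List[Alternative], margin: float = 0.1) -> List[str]:
    """Return a single transcript when the best guess is clearly ahead,
    otherwise return up to five candidates for the user to pick from."""
    if not alternatives:
        return []
    ranked = sorted(alternatives, key=lambda a: a.confidence, reverse=True)[:5]
    clearly_ahead = (
        len(ranked) == 1
        or ranked[0].confidence - ranked[1].confidence >= margin
    )
    if clearly_ahead:
        return [ranked[0].text]
    return [a.text for a in ranked]


# The two street addresses score almost the same, so both are offered.
print(choose_transcripts([
    Alternative("450 6th Street", 0.48),
    Alternative("456th Street", 0.46),
]))
```

Whether to show a picker at all is a product decision; the margin here is simply a tunable guess.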
The face service is what HowOld Robot is using. "It's about being able to detect, describe and recognise a human face," says Gaglon, "and it does both detection and verification. Detection tells you how many faces there are in a photo and where they are, plus it can give you landmarks on the face – like the tip of the nose or the left and right side of the mouth.
"Then there are the experimental features, like predicting the age and gender. Verification says if you have two photos and there's a face in both photos, what is the likelihood it belongs to the same individual? Then there's grouping – given a collection of photos, which sets have the same people in."
Some of the face recognition services are the same as the ones used by Kinect.
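As a rough illustration of what a developer might do with detection and verification results, the Python sketch below works on an invented response shape. The field names (`rectangle`, `landmarks`, `nose_tip`) and the 0.5 threshold are assumptions, not the documented Project Oxford format.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def summarise_faces(faces: List[Dict]) -> Tuple[int, List[Box]]:
    """Return how many faces were found and where each one is."""
    boxes = []
    for face in faces:
        r = face["rectangle"]  # hypothetical field name
        boxes.append((r["left"], r["top"],
                      r["left"] + r["width"], r["top"] + r["height"]))
    return len(faces), boxes


def same_person(confidence: float, threshold: float = 0.5) -> bool:
    """Turn a verification likelihood into a yes/no answer.
    The threshold is an assumed value that an app would tune itself."""
    return confidence >= threshold


# A made-up detection result containing one face and one landmark.
detected = [{
    "rectangle": {"left": 10, "top": 20, "width": 100, "height": 120},
    "landmarks": {"nose_tip": {"x": 60, "y": 85}},
}]

count, boxes = summarise_faces(detected)
print(count, boxes)
print(same_person(0.83))
```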
The vision APIs include a wide mix of tools "to help describe the content within an image," Gaglon explains. "You can manipulate and work with images, recognise words in a photo. It can scale and crop photos more intelligently, so you can have it crop a photo in different dimensions but keep the most important content of the photo in the frame."
That would come in handy for automatically resizing images so they work on a phone or tablet screen as well as on a larger desktop screen – in action, it looks very like the way Microsoft's Sway authoring service picks which part of a photo to show.
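The article does not say how the service decides what to keep in frame. As a simplified stand-in, the Python sketch below assumes the important content has already been reduced to a single focus point, then finds the largest crop of the requested shape that stays as close to that point as the image edges allow. The function name and parameters are hypothetical.

```python
from typing import Tuple


def crop_around(image_w: int, image_h: int,
                target_w: int, target_h: int,
                focus_x: int, focus_y: int) -> Tuple[int, int, int, int]:
    """Return (left, top, width, height) of the largest crop that has the
    target aspect ratio and is centred as near the focus point as possible."""
    # Scale the target shape up until it touches an image edge.
    scale = min(image_w / target_w, image_h / target_h)
    crop_w = round(target_w * scale)
    crop_h = round(target_h * scale)
    # Centre on the focus point, then clamp so the crop stays inside the image.
    left = min(max(focus_x - crop_w // 2, 0), image_w - crop_w)
    top = min(max(focus_y - crop_h // 2, 0), image_h - crop_h)
    return left, top, crop_w, crop_h


# A 1600x900 landscape photo cropped square, with the subject near the right edge.
print(crop_around(1600, 900, 300, 300, focus_x=1400, focus_y=450))
```

The same photo could be passed through with several target shapes (phone, tablet, desktop) to get one crop per screen.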
"The image analysing service helps you describe an image; whether it's clipart or not, whether it's a colour photo or not, whether it's adult content or not." The vision services can also categorise images, stating whether you're looking at a building or a flower or someone swimming – if a picture shows buildings or streets, the service will say the most likely category is a cityscape. "That's some of the same technology that's used in the OneDrive photo tagging," says Gaglon, noting that many of the vision APIs are services Bing uses for image search.
Getting that to work involves some ground-breaking machine learning research. "One of the things the vision APIs make use of is whole image categorisation, and MSR recently published some results where the team was the first to surpass human image recognition performance on the ImageNet benchmark," he mentions.
You can sign up for and start using those three aforementioned services today, but the fourth, LUIS – Language Understanding Intelligent Service – is in private preview. LUIS takes short phrases, like things people type into a search engine, and tells you what they're really asking – so it's not just matching words, it's trying to actually understand them. "If the text snippet is 'tell me news about flight delays', LUIS comes back and says 'the topic is flight delays and the intent is find news'," Gaglon explains.
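To show what consuming such an answer could look like, here is a small Python sketch that reads the topic and intent out of a JSON reply. The JSON layout, the `FindNews` intent name and the `Topic` entity type are invented for this example; LUIS was in private preview at the time and its real response format is not described here.

```python
import json
from typing import List, Tuple

# A made-up reply for the flight delays example (not a real LUIS response).
SAMPLE_REPLY = """
{
  "query": "tell me news about flight delays",
  "intents": [
    {"intent": "FindNews", "score": 0.92},
    {"intent": "BookFlight", "score": 0.05}
  ],
  "entities": [
    {"entity": "flight delays", "type": "Topic"}
  ]
}
"""


def interpret(reply_text: str) -> Tuple[str, List[str]]:
    """Return the highest-scoring intent and any topics found in the reply."""
    reply = json.loads(reply_text)
    top_intent = max(reply["intents"], key=lambda item: item["score"])
    topics = [e["entity"] for e in reply["entities"] if e["type"] == "Topic"]
    return top_intent["intent"], topics


intent, topics = interpret(SAMPLE_REPLY)
print(intent, topics)
```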
The breakthrough here is that instead of having to be an expert in natural language processing and build a model of all the phrases people could use to ask for news on specific topics, developers can use LUIS as a model building service. "Building a model is easy if you only need to label a bunch of instances by hand, but what about when you start getting hundreds, thousands or tens of thousands of utterances?"
LUIS has some readymade models for date and time, numbers, temperatures, distances and an encyclopaedia of common places and things; those come from Bing. It also does 'active learning', creating a list of the phrases whose labelling would improve the service the most.
Generally it's less automatic than the other APIs – what LUIS gives you is an interactive labelling environment where you can quickly label short snippets of text, to create a system that can learn from a few phrases that you've labelled to handle a lot more commands. It only takes a few examples to create a system that can understand the intent behind questions users are typing in, which could be a way to create much smarter search tools and personal assistants.
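The article does not say which strategy LUIS uses to build its active learning list. One common approach is uncertainty sampling: ask a human to label the phrases the current model is least sure about. The Python sketch below illustrates that idea with made-up intent names and scores.

```python
from typing import Dict, List


def most_informative(predictions: Dict[str, Dict[str, float]], k: int = 3) -> List[str]:
    """Return the k utterances whose top two intent scores are closest,
    i.e. the ones the model finds hardest to tell apart."""
    def margin(scores: Dict[str, float]) -> float:
        top_two = sorted(scores.values(), reverse=True)[:2]
        runner_up = top_two[1] if len(top_two) > 1 else 0.0
        return top_two[0] - runner_up

    return sorted(predictions, key=lambda u: margin(predictions[u]))[:k]


# Hypothetical model output: utterance -> intent -> score.
predictions = {
    "tell me news about flight delays": {"FindNews": 0.95, "BookFlight": 0.05},
    "delays at heathrow": {"FindNews": 0.52, "BookFlight": 0.48},
    "flights to paris": {"FindNews": 0.30, "BookFlight": 0.70},
}

# The two phrases the model is least certain about come first.
print(most_informative(predictions, k=2))
```

Labelling those uncertain phrases first is what lets a handful of examples go a long way.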
Why is Microsoft making its machine learning services available for free like this? What you're getting here is a beta (or a private preview) of services Microsoft will probably offer in pay-for products later on (perhaps as part of its existing Azure Machine Learning service).
Getting developers to use Microsoft machine learning will help raise the profile of Redmond's other machine learning tools. "A lot of these services are machine learning models from research and investments we've made that have been used in Microsoft apps," explains Gaglon. "Now we're interested in exposing them to the developer community, to build on top of what the community comes up with."
It's also useful for Microsoft to collect more samples to try out its machine learning on (although unless there's a way to indicate which results were right and which were off target, it won't improve the algorithms much). The Cortana team is particularly keen to get voice samples for a much wider range of queries than the most common searches, and in a wider range of accents.
At this point, Gaglon says, Microsoft is mostly talking to developers on the MSDN forums. "We're evaluating how to have a feedback loop from developers. We want to know what's working and what's not."