Keeping spam out of your inbox isn't easy – around 99% of all the email messages sent are spam and even if only a few get through, they're annoying and can be dangerous. Those iTunes receipts and PayPal account warnings you weren't expecting are often trying to slip malware onto your system.
But legitimate email you don't actually need to read is also a huge drag on productivity. Whether it's newsletters you didn't sign up for (often known as bacn, pronounced 'bacon', because they're tastier than spam) or the endless discussion about where to have the team lunch this week, irrelevant email is one reason so many employees would rather use social media or messaging services.
Cutting through the Clutter
To stop people abandoning email in frustration, Microsoft launched a new feature called Clutter as part of its Office 365 cloud service. Clutter uses machine learning to simulate your world, and what you care about, and files away the messages it thinks you don't need to read.
"We're trying to address the biggest pain point in email, which is information overload," John Winn of Microsoft Research in Cambridge told us. "This causes the most pain, because when you have so many emails coming in to your inbox it's difficult to focus on the high priority email – the ones that you need to take action on."
The problem is that, unlike spam, clutter is personal: what's interesting to you might be clutter for me.
"What's spam for you is spam for me," Winn points out. "Clutter is a very personalised model; it's about learning which emails are low priority and which are high priority for you, because the same message can be high priority for one person and low priority for another person."
"We learn from the signals you provide just by working normally; what messages do you read, what do you reply to, which do you forward? Which are likely to be of interest, actionable or high priority? And messages that aren't any of those can be removed and placed in a 'next door' inbox you can look at any time you want."
Clutter doesn't need a special client the way Gmail's Inbox does, because the filtering happens in Exchange itself. You can look at your normal email and your Clutter-filed mail in any mail client, because Clutter is just a normal folder.
Clutter gets information from the Office Graph as well as from the email message itself. From the message, it looks at who sent it, whether it was sent just to you or you're only on the CC list, and the words in the body and subject. The Office Graph tells it how you treated the email: whether you replied to the message or forwarded it, whether you deleted it, moved it to another folder or marked it as unread, and how long you spent reading it.
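To make the idea concrete, here is a minimal sketch of how message signals and past behaviour could feed a per-user prediction. It is not Microsoft's code: the feature names, the ActionPredictor class and the choice of a simple logistic model are all assumptions made for illustration, and Clutter's real model is considerably richer.

```python
import math

def extract_features(msg, user):
    """Turn a message into a set of binary feature names (all names are hypothetical)."""
    feats = {"bias", "sender=" + msg["sender"]}
    if msg["to"] == [user]:
        feats.add("sent_only_to_me")
    if user in msg.get("cc", []):
        feats.add("cc_only")
    for word in (msg["subject"] + " " + msg["body"]).lower().split():
        feats.add("word=" + word)
    return feats

class ActionPredictor:
    """Per-user logistic model estimating P(user acts on a message | features)."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}
        self.learning_rate = learning_rate

    def predict(self, feats):
        score = sum(self.weights.get(f, 0.0) for f in feats)
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, feats, acted):
        # One gradient step towards the behaviour that was actually observed.
        error = (1.0 if acted else 0.0) - self.predict(feats)
        for f in feats:
            self.weights[f] = self.weights.get(f, 0.0) + self.learning_rate * error

# Made-up history: the user always acts on mail from the boss, never on the digest.
me = "me@example.com"
boss = {"sender": "boss@example.com", "to": [me], "cc": [],
        "subject": "budget review", "body": "please reply today"}
news = {"sender": "news@example.com", "to": ["list@example.com"], "cc": [],
        "subject": "weekly digest", "body": "top stories inside"}

model = ActionPredictor()
for _ in range(20):
    model.update(extract_features(boss, me), acted=True)
    model.update(extract_features(news, me), acted=False)
```

After training on this toy history, the model gives the boss's message a higher probability of being acted on than the digest. A message scoring below some threshold would be the one filed in the Clutter folder.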
"We match those signals from the email and your behaviour and we take a fresh email and say, 'what is the likely behaviour? Will you read it urgently or ignore it?'" explained Winn.
Cleverly, Clutter takes into account that you don't always treat email the way you mean to. For a start, it waits up to a week before saying you've ignored a message (because you might just be busier than usual), but if you take an action like replying or deleting, it can take that into account straight away. If you're on holiday and you're treating email differently – not replying to messages that would otherwise be urgent – it models that as well.
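The waiting rule can be sketched in a few lines. This is only an illustration under stated assumptions: the seven-day window comes from the "up to a week" figure above, while the action names and the training_label function are hypothetical, and the holiday behaviour is not modelled here.

```python
from datetime import datetime, timedelta

IGNORE_WINDOW = timedelta(days=7)  # "up to a week", as described in the article
EXPLICIT_ACTIONS = ("replied", "forwarded", "deleted", "moved")  # hypothetical names

def training_label(received_at, actions, now):
    """Return the explicit action taken, 'ignored', or None if it is too early to say.

    actions is the list of things the user has done with the message so far,
    in the order they happened.
    """
    for action in actions:
        if action in EXPLICIT_ACTIONS:
            return action  # explicit actions count straight away
    if now - received_at >= IGNORE_WINDOW:
        return "ignored"  # nothing happened for a whole week
    return None  # the user may just be busier than usual, so keep waiting

received = datetime(2015, 1, 5, 9, 0)
print(training_label(received, [], received + timedelta(days=2)))
print(training_label(received, ["deleted"], received + timedelta(hours=1)))
print(training_label(received, [], received + timedelta(days=8)))
```

The design point is the asymmetry: doing something is evidence immediately, while doing nothing only becomes evidence once enough time has passed.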
Bring the (label) noise
There's also a concept called 'label noise'. This pertains to "when your actual behaviour differs from your ideal behaviour – when the action that you should take or maybe the action you intended to take isn't what you do," Winn explained. "You intend to reply but you don't; or you reply to another message from the same person instead, because it's the most recent mail from them. We explicitly model those behaviours."
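One common way to express label noise, shown here as a rough sketch rather than as the model Clutter actually uses, is to treat the intended action as hidden and the observed action as a noisy copy of it. The two rates below (how often an intended reply never gets sent, and how often a reply is sent to a message that didn't really call for one) are invented numbers, and the function names are hypothetical.

```python
def observed_reply_probability(p_intended, miss_rate=0.2, stray_rate=0.05):
    """Probability of actually seeing a reply, given the chance one was intended."""
    return p_intended * (1.0 - miss_rate) + (1.0 - p_intended) * stray_rate

def posterior_intended(p_intended, replied, miss_rate=0.2, stray_rate=0.05):
    """Bayes update: how likely a reply was intended, given what the user actually did."""
    if replied:
        like_yes, like_no = 1.0 - miss_rate, stray_rate
    else:
        like_yes, like_no = miss_rate, 1.0 - stray_rate
    evidence = p_intended * like_yes + (1.0 - p_intended) * like_no
    return p_intended * like_yes / evidence

# A message the model is 90% sure deserved a reply, but none was sent.
belief_after_no_reply = posterior_intended(0.9, replied=False)
```

With these made-up rates, a single missed reply lowers the belief from 0.9 to roughly 0.65 rather than to zero, so one lapse doesn't convince the model that mail from that sender is clutter.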
The idea is to make Clutter accurate and sensitive to the subtle nuances of the way we handle email – and to do that without a mass of complicated machine learning code for all these special cases that would make it hard to maintain.
It works by using all the information from Exchange and the Office Graph to build a probabilistic prediction model that simulates what you'll do when you get a new email. Unlike older systems that keep the whole model in memory, which slows things down, Clutter is built with Microsoft's Infer.NET, a probabilistic programming framework that compiles the model description into inference code. It runs fast enough to handle the petabytes of information in Exchange, and adding the idea of 'label noise' to explain unexpected user behaviour takes only a few lines of code.
This also made it easier for the MSR team to work with the Exchange group. "In the early days we would talk the Exchange team through our program and what assumptions we were making and they could easily see what we were doing. And they'd say 'that's not right! We know users do this in Outlook, not that' and we could quickly go back and modify our model of what a user does," Winn told us.
This approach is one of the reasons that Clutter became a feature when other ideas the researchers had come up with in their four years of working with the Exchange team didn't get anywhere. "We've been exploring a number of different ways that machine learning could work in the inbox," Winn said. "Only with Clutter did we feel we'd got something that can really add value, and not be in some way creepy or have the negativity you can sometimes get when you start applying machine learning to personal email."
We may all complain about email, but people quickly get unhappy if their mail system 'interferes' with their messages and gets it wrong. So far Clutter has been well received, and Winn hopes to extend the approach beyond email.
"The opportunity is very broad. We're looking at other applications in Microsoft products. Some I can't talk about, but there are some already in the Azure ML service and we're actively working with both Exchange on future work with Clutter and with other product teams on using probabilistic predictions in other products."
One possibility is working not just with the structured information in the email header but also the unstructured information in the message itself. Winn calls unstructured text "the last uncomputable data – it isn't easy to compute with, so it tends to just sit there."
The latest version of Infer.NET is better at working with unstructured text, like the content of email or Office documents. That means in the future, Clutter might be able to understand what a mail message or attachment is about when deciding whether you'll be interested in it, which should make its predictions more accurate.