Computers are learning about sarcasm, spotting the difference between English and American, and getting to grips with previously undecipherable data, writes Charles Orton-Jones
Computers love numbers. They love spreadsheets. They love finite concepts which can be reduced to ones and zeros, and crunched by their microprocessors. What gives computers a nervous breakdown is fuzziness.
Ask a computer to look at a picture and identify whether the animal portrayed is a cat or a dog and you’re in for trouble. You’d be better off asking a five-year-old child. Is a joke funny? Is that man over there old or young? These are deep waters for the cyberbrain.
Alas, our world is not primarily composed of finite numbers which can be easily transposed onto a spreadsheet. It’s a messy place – a world of unstructured data. It is this world that lies at the new frontier of big data.
Unstructured data can take many forms. Photographs, video, text and speech are obvious examples. These forms are too nuanced, too ambiguous, too human for machines to easily interpret. The sequence of letters in a sentence can be recorded by a machine, but to decode the meaning of that sentence is a phenomenal challenge.
Unstructured data can include any data for which there is no existing structure to help machines navigate it. Medical records, the weather, the movement of people around a shopping mall, traffic jams – these are all potential sources of unstructured data. The concept extends to digital data which has simply not been categorised yet and which lies in an unexamined jumble.
Currently the hottest repository of unstructured data is Twitter. Its messages are short enough to be approachable – easier than having a crack at The Times leader column – and are produced in such abundance that the methodologies of big data can be applied to them.
Firms such as Air France and Accenture employ social media analytics firm Spotter to monitor Twitter. Spotter does more than merely watch for keywords. Chief executive Ana Athayde says: “Our R&D focuses on designing algorithms to capture and analyse information from all sources – data mining, text mining, semantics, linguistics, syntax – so that sentiment analysis is more than simply negative/positive scoring.
“For each sector we fine-tune our algorithms with specific linguistic dictionaries in order to get the best results on sentiment analysis. What is challenging in sentiment analysis is deciphering the way people talk about an issue as there are big differences depending on their age, culture, language and the sources they use.”
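Spotter’s algorithms are proprietary, but the basic idea Ms Athayde describes – layering a sector-specific dictionary on top of a generic sentiment lexicon – can be sketched in a few lines of Python. The words, weights and sector here are invented purely for illustration; real systems use far richer linguistic models than a word-by-word lookup.

```python
# Illustrative sketch of lexicon-based sentiment scoring with a
# sector-specific dictionary layered over a generic one.
# All words and weights below are invented for illustration.

GENERIC_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}

# In the airline sector, words like "delayed" and "cancelled" carry
# strong negative sentiment a generic dictionary would miss.
AIRLINE_LEXICON = {"delayed": -2.0, "cancelled": -3.0, "upgrade": 1.5}

def sentiment(text: str, sector_lexicon: dict) -> float:
    """Sum per-word scores; sector entries override generic ones."""
    lexicon = {**GENERIC_LEXICON, **sector_lexicon}
    return sum(lexicon.get(word, 0.0) for word in text.lower().split())

print(sentiment("flight cancelled awful service", AIRLINE_LEXICON))  # -5.0
```

A generic lexicon alone would score “flight cancelled awful service” at -2.0, missing the sector-specific weight of “cancelled” – which is the gap the fine-tuned dictionaries are meant to close.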
The data from Twitter, along with data from LinkedIn, Instagram, Facebook and YouTube, can be converted into an easy-to-read dashboard. For marketers the service provides a window into an otherwise unmanageable morass of information. For example, Air France can identify specific consumer desires or dissatisfaction with a stage of the booking process.
The technique of semantic text reading is still developing. Some social media firms prefer to use dozens of humans to read and assess tweets and Facebook posts on the grounds that machines are still unable to decipher sarcasm, jokes and pop-culture references.
So how reliable is machine interpretation of language? George K. Thiruvathukal, director of the Centre for Textual Studies and professor of computer science at Loyola University, is regarded as one of the world’s authorities in this area. “How well can computers understand natural language? The answer is very well,” he says, citing the success of the Natural Language Toolkit, an open-source project. “Anyone can download it and it allows you to break language down into its syntactic structure.”
Even obscure cultural references are being addressed. “Facebook and Google are tuning their algorithms to be culturally relevant. They are starting to understand the difference between American and British idioms,” says Professor Thiruvathukal.
And what of other realms of unstructured data? How about something really tricky, such as the movement of people in a shopping area? Kevin Curran at the University of Ulster says there is a plethora of methods for tracking this, including work by Nokia, Path Intelligence, NextNav and GeLo.
Marketers ought to love this sort of data, especially when mixed with other data such as the user’s ID. “A vendor could present the mobile user with a promotion on a specific product when it is right in front of the user and the offer could be targeted to shoppers based on past purchases or other factors. When the customer reached the checkout stand, the discount could be applied automatically,” says Dr Curran.
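Dr Curran’s scenario – an offer triggered when a shopper stands near a product related to their past purchases, then honoured at the till – reduces to a geofencing check plus a lookup against purchase history. The shop layout, customer data, radius and discount code below are all invented for illustration.

```python
# Hypothetical sketch of proximity-triggered promotions: positions
# are (x, y) coordinates in metres on a shop floor. All data invented.
import math

PRODUCT_LOCATIONS = {"espresso_machine": (12.0, 4.0), "headphones": (3.0, 9.0)}
PURCHASE_HISTORY = {"customer_42": {"coffee_beans", "milk_frother"}}
RELATED_PURCHASES = {"espresso_machine": {"coffee_beans", "milk_frother"}}

def nearby_offer(customer, position, radius=2.0):
    """Return a discount code if the shopper is within `radius` metres
    of a product related to their past purchases, else None."""
    history = PURCHASE_HISTORY.get(customer, set())
    for product, location in PRODUCT_LOCATIONS.items():
        close_enough = math.dist(position, location) <= radius
        if close_enough and RELATED_PURCHASES.get(product, set()) & history:
            return f"10%-off:{product}"
    return None

# The shopper's phone reports a position near the espresso machines;
# the returned code would be applied automatically at checkout.
print(nearby_offer("customer_42", (11.5, 4.5)))  # 10%-off:espresso_machine
```

The hard part in practice is not this lookup but the indoor positioning itself – precisely the unstructured-data problem the Nokia, Path Intelligence, NextNav and GeLo systems mentioned above set out to solve.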
Facebook is making strides in decoding the content of photographs – identifying faces and objects. Video data can be decoded using the same methods.
The progress in mapping unstructured data means vast new tranches of data can be added into the big data mix. It is an accelerating process. For example, Apple has been granted a patent to collect data on body temperature and heart rate through earbuds. Google is scooping up speech recognition patents so it can improve its structuring of voice data.
The end-game is a world in which almost no data is unstructured. Every interaction, no matter how arcane, will be available for input into a big data algorithm.