Seeing big data through the cloud

The International Data Corporation (IDC) predicts worldwide revenues for public cloud services will nudge $73 billion (£47 billion) by 2015. In Gartner’s view, the market will be worth $150 billion (£98 billion) by 2014. Forrester Research estimates it will hit $241 billion (£157 billion) by 2020.

But while the numbers differ, analysts are in widespread agreement about the direction of travel. Not only is the market for public cloud computing services booming, it is doing so at precisely the moment when big data is growing exponentially – resulting in a compelling collision of platforms and infrastructures.

However, beyond the out-of-the-ballpark predictions (after all, this is an industry in which hype is hardly unknown), what are the payoffs and risks for businesses considering cloud-based data storage?

According to Andrew Greenway, Accenture’s global cloud computing programme lead, a characteristic of big data is that business requirements for information from the data tend to change quickly. “Therefore, when dealing with big data projects, it can be a costly and risky task to build the infrastructure required yourself,” he says.


“Flexible service provision, through the cloud, allows the business to pay for what they use, when they want to use it. By using cloud services, organisations do not have to build clusters of storage, which risk becoming an under-utilised investment as projects end and data requirements shift.

“Many organisations are taking a hybrid approach including cloud and on-premises technology. Bearing in mind the complexity and required integration between data sources and the new technologies coming onto the market, it’s important to choose carefully the right solutions, otherwise it is easy to spend a lot on data warehouses that don’t then deliver the flexibility the business demands.”

Donald Farmer, vice president of product management, QlikView, says the challenges with big data and cloud storage are threefold. “First, there isn’t really one thing called ‘the cloud’. If you have data from many sources, they are typically spread through many different clouds and the challenge is how you manage the complexity of multiple clouds,” he says.

“Second, we often talk about ‘the three Vs of big data’ – velocity, variety and volume. There’s a fourth V, too – vagueness. People really don’t know what they want to do with the data or how they go about finding the slice of data that’s relevant to their particular business problem.

“Third is the distinction between data that is born in the cloud and data that is moved there for storage. That’s a shift from keeping your data on premises to keeping it ‘on promises’. And that, psychologically, feels more dangerous.

“The opportunities are really considerable. The biggest one is the freeing up of resources. Data storage is simply running out of control and it’s becoming a tremendous challenge to business. Cloud storage offers a solution to that. The opportunities around providing applications in the cloud, which are simpler to administer, are also very attractive.”

Tim Moreton, chief executive of Acunu, says the very low cost of scalability is really important for fast-growing businesses. Streaming video service Netflix is a great example, he says. “They have stated that they would not have been able to hit the subscriber numbers they have built in the last three years without using the cloud for all of their storage, processing and distribution of online videos,” says Mr Moreton.

“The reason for that is they would not have had the money to build data-centre capacity fast enough. So for organisations which are growing very quickly, with data at the heart of their business models, the cloud is really important.

“The biggest challenge is that it is very expensive to get your data out, once you’ve got it in. So for big organisations that means you are tying yourself to a future in which you process your data in the cloud.

“While there are benefits, it also means that some of those protections, such as security and regulatory requirements, can get circumvented. But if you look at the availability of a service, like Amazon’s and the other big cloud-hosting providers, they are probably far more reliable than most organisations’ data centres.”

Jim Dietz, product marketing manager at Teradata, says the challenges with big data, in the areas of volume and velocity, are availability and security. “When you want to do important things with data, to be able to analyse it in detail and get real business value out of it, you need to do it quickly,” he says. “That’s especially true of web-click data.

“Having speed that’s consistent in the public cloud is oftentimes hard to do. We’re seeing people handling big data for business intelligence and analytics more in the private cloud, either on premises or in controlled premises, where they know they can guarantee the availability, speed of access and the security of the data.

“Of the opportunities, one is that business analysis becomes much more affordable in a cloud environment, because now you are able to take the resources of that infrastructure and share it among a large number of business functions and processes so that the cost per analysis goes down.”

FOCUS

What is dirty data? And why it matters

When analysing data, you’re trying to identify patterns and relationships. Any “noise”, anything that obscures the information you want, can loosely be described as “dirty data”.

Angela Eager, TechMarketView research director, explains: “Dirty data is becoming more of a problem because not only do we have a greater volume of data, but we are also drawing in more data from unstructured sources, which are inherently dirty, such as data from social media, sentiment analysis, GPS, RFID and smart metering.

“Traditionally, you were referring to things like duplicated records, but that’s not appropriate in today’s information world. Take duplicate data into a website-visits environment, for example, then one person making multiple visits is valuable information.

“Combating dirty data is all about understanding the context using art backed by science. There are a lot of tools emerging, such as sentiment and text analysis tools, that can look out for keywords, perhaps used in the same phrase, and which use background algorithms to determine whether the relationship between those words is significant.”