Big data or big statistics?

Is statistical analysis a science and big data more of an art? Miya Knights poses the question and reviews the development of analytical software

Organisations often use statistics to define and understand so much of their activity, from the descriptive, such as the gap between planned and actual performance, to the diagnostic, which can help understand why such a gap exists.

The value of these kinds of analyses is well proven within the enterprise reporting structures used across many finance, supply chain and customer-facing operations. But the sheer volume, velocity and variety of data collected nowadays have led to a new trend in IT known as big data.

The more we rely on digital technologies enabled by internet-connected systems, from our work PCs to our smartphones and the likes of Twitter, Facebook and online shopping, the more data organisations have at their disposal to analyse. Even so-called machine-to-machine (M2M) data, where equipment sensors may be used to communicate with diagnostic or maintenance systems for example, is adding to unprecedented data volumes.

Frank Buytendijk, vice president of information innovation research for analyst firm Gartner, says the speed at which this data is created and its diversity are driving the big data analysis trend. “The commoditisation of lower-cost data storage means handling massive data volumes is the least of the challenges,” he says. “It is the fact that this data is generated in real time and comes from many different sources, like smartphones or social networks, which creates the real complexity.”

In order to manage data in these dimensions, Adrian Simpson, chief innovation officer at technology firm SAP, believes statistical analysis could be seen more as a science, while big data “is more of an art”. “Big data brings together a number of different, already existing technologies and disciplines, while statisticians are looking for exactness and definition,” he says. “Big data analysis looks more for trends and patterns, and to predict outcomes based on those trends and patterns.”

Big data brings together a number of different, already existing technologies and disciplines, while statisticians are looking for exactness and definition

Where traditional statistical approaches are used for descriptive or diagnostic analysis, big data analysis harnesses data volumes across a variety of sources to produce predictive or prescriptive analyses. Mr Simpson explains: “We’re seeing a situation where organisations historically put in siloed IT systems to solve particular problems. But they are now demanding a joined-up view.” He says big data approaches have grown out of the need to model, predict and act on “what if?” scenarios that take into account myriad sources of both structured and unstructured data.

In response, SAP has developed new appliance software using in-memory computing techniques to aggregate and analyse big data at speed. Citing its work with T-Mobile USA as an example, Mr Simpson says: “We first started working with them using HANA [appliance software] to clean up their data, but now they are experimenting with real-time marketing. The idea is to identify the risk of customer churn by analysing patterns in data to make a judgment on the fly about how likely a customer is to leave and offer them incentives to stay.

“It’s something they could’ve done before with their existing systems, but the direct effect this approach is having on revenue would’ve been lost in the cost of doing so with traditional technologies and methods, and would’ve taken a year, whereas with HANA the project was up and running in three months.”
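In essence, the kind of on-the-fly judgment Mr Simpson describes reduces to scoring each customer against a model trained on historical behaviour and triggering an offer when the risk crosses a threshold. The sketch below illustrates this with a simple logistic model; the feature names, weights and threshold are hypothetical, not details of T-Mobile USA's actual HANA deployment.

```python
import math

# Hypothetical model weights, standing in for a model trained on
# historical usage patterns; these are illustrative values only.
WEIGHTS = {"dropped_calls": 0.8, "support_tickets": 0.5, "months_since_upgrade": 0.1}
BIAS = -3.0

def churn_risk(customer: dict) -> float:
    """Return a 0-1 churn probability from a logistic model."""
    score = BIAS + sum(WEIGHTS[k] * customer.get(k, 0) for k in WEIGHTS)
    return 1 / (1 + math.exp(-score))

def incentive_for(customer: dict):
    """Offer a retention incentive only when churn risk is high."""
    return "loyalty_discount" if churn_risk(customer) > 0.5 else None

# A customer showing the warning signs the model looks for
at_risk = {"dropped_calls": 4, "support_tickets": 3, "months_since_upgrade": 18}
print(incentive_for(at_risk))  # → loyalty_discount
```

The in-memory approach matters because this scoring must run against live interaction data for every customer as events arrive, rather than against a batch report produced overnight.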

Another IT giant throwing its weight fully behind the trend to harness big data for business insight and action is IBM. Stephen Mills, big data consulting lead for IBM in the UK, says: “Multiple, different sources of data need to be integrated and analysed.” Large data sets collated using Extract, Transform and Load (ETL) data warehousing tools – or, in the case of more unstructured data, a message broker – can then be processed where the data actually resides: across clusters of commodity servers managed using the open source Apache Hadoop software project, for example, or using in-memory or streaming processing techniques.
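The idea of processing data where it resides is what Hadoop's map-and-reduce pattern provides: each node runs the analysis over its own local partition of the data, and only the partial results are shuffled and combined. The toy sketch below simulates that pattern in plain Python; it is purely illustrative and not the Hadoop API itself.

```python
from collections import Counter
from functools import reduce

# Data already partitioned across (simulated) cluster nodes, so each
# node works only on what it stores locally.
partitions = [
    ["error", "ok", "error"],
    ["ok", "ok", "error"],
]

def map_phase(partition):
    """Each node counts events in its own local data."""
    return Counter(partition)

def reduce_phase(a, b):
    """Partial per-node counts are merged into a cluster-wide result."""
    return a + b

totals = reduce(reduce_phase, (map_phase(p) for p in partitions))
print(totals["error"])  # → 3
```

Moving the small per-node summaries across the network, rather than the raw data itself, is what makes this approach economical on clusters of commodity servers.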

Mr Mills adds: “Hadoop has a number of tools alongside it, whether that’s Pig, Hive or MapReduce as just a few examples, which allow access to and interrogation of that data using analytical models before the outcomes are fed into downstream systems, such as campaign management or marketing automation platforms, to serve up the best offer to a particular customer, for instance. The real change taking place is the ability to use these existing and new analytical tools within a more efficient platform for storing and accessing big data.”
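Tools such as Hive give analysts SQL-like access to data held in Hadoop, with the aggregated results then fed to downstream systems of the kind Mr Mills mentions. As a stand-in, the sketch below runs an equivalent aggregation with Python's built-in sqlite3; the table and columns are hypothetical, but a Hive query over Hadoop data would read much the same.

```python
import sqlite3

# Hypothetical customer-interaction data, standing in for records
# stored across a Hadoop cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (customer TEXT, channel TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [("alice", "web", 120.0), ("alice", "store", 80.0), ("bob", "web", 35.0)],
)

# Aggregate per-customer spend, the sort of result a campaign
# management platform might consume to pick the best offer.
rows = conn.execute(
    "SELECT customer, SUM(spend) FROM interactions "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # → [('alice', 200.0), ('bob', 35.0)]
```

As Mr Mills notes, the query language and the analytical models are largely familiar; the change is the platform underneath, which lets the same interrogation run over far larger and more varied data.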