What do you need to be a big data power-user? Thankfully, the answers are clearer than they were even a year ago. The industry is consolidating and tweaking tools so that even the most nervous of firms can start dabbling with big data. Charles Orton-Jones details what you need to know.
Big data presents big challenges. The influx of data will flood your storage capacity. The data processing tools you need will be new and scary. And there’s the actual task of analysing the data to find commercially valuable insights.
Most firms aren’t up to it. Figures from Hitachi Data Systems reveal 46 per cent of UK firms can’t do data mining because they have the wrong IT. Some 75 per cent lack the understanding, expertise and logistics to do it – which is terrible.
The conundrum for firms handling oodles of data is how to store it. The old-fashioned method was to use traditional “spinning platter” hard disk drives. These are slow, but store a lot for not much money. If speed is a priority, firms with a bit more cash can move to solid state drives (SSD). These are ten to a hundred times faster, but more expensive. And for firms that really want to turbo-charge their set-up, there’s “in-memory” computing. This keeps data in RAM, which can be 10,000 times faster than the old methods.
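To put those multipliers in perspective, here is a rough back-of-the-envelope sketch in Python. The one-hour baseline for scanning a data set on hard disk is purely an assumption for illustration, as is reading the 10,000 figure as a best case measured against hard disk; the speed-up ranges themselves are the ones quoted above.

```python
# Speed-ups relative to a spinning hard disk, using the ranges quoted in
# the article. The baseline scan time is a made-up figure for illustration.
HDD_SCAN_SECONDS = 3600.0  # hypothetical: one hour to scan a data set on HDD

SPEEDUP_VS_HDD = {
    "hard disk": 1,
    "SSD (low end)": 10,
    "SSD (high end)": 100,
    "in-memory (best case)": 10_000,
}


def scan_time_seconds(tier: str) -> float:
    """Estimated time to scan the same data set on the given storage tier."""
    return HDD_SCAN_SECONDS / SPEEDUP_VS_HDD[tier]


for tier in SPEEDUP_VS_HDD:
    print(f"{tier:>22}: {scan_time_seconds(tier):10.2f} s")
```

On these assumed numbers, a job that takes an hour on hard disk would take minutes on SSD and well under a second in memory, which is why the faster tiers command their price.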
So what’s the best for you? Industry consensus is endorsing a mix of methods. Mark Whitby, vice-president of storage at disk maker Seagate, says: “The most useful method of storage for enterprise is a new tiered model. This model will utilise a more efficient capacity-tier based on pure object storage at the drive level, and above this a combination of high-performance hard disk drives, solid state hybrid and SSD will be used. This SSHD hybrid technology has been used successfully in laptops and desktop computers for years, but today it is just beginning to be considered for enterprise-scale data centres.”
And in-memory computing? The German national football team uses SAP’s Hana in-memory system to monitor training sessions, since ten players with three balls can generate seven million data points. The approach gives coaches instant access to data. Does it work? Germany’s World Cup win certainly gave in-memory computing just the sort of publicity it deserves.
Talk to a big data consultant and you can time how long it takes before they start to waffle about Hadoop. In fact, big data and Hadoop are so interlinked that many people assume they are inseparable. For newcomers, Hadoop is a way to store and process data in separate chunks spread across clusters of machines. This makes it easy to manage and easy to grow, and so ideal for handling huge data reservoirs.
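For the curious, the chunk-by-chunk idea can be sketched in a few lines of Python. This is a toy, single-machine imitation of the split, process and combine pattern, not Hadoop's actual API; the function names, the sample records and the chunk size are all invented for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Cut the data set into fixed-size chunks, as a cluster would across nodes."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Count words within one chunk only; each chunk is handled independently."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Merge two partial results into one."""
    return left + right


def word_count(records, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    partials = [map_chunk(chunk) for chunk in chunks]  # would run in parallel on a cluster
    return reduce(reduce_counts, partials, Counter())


# Hypothetical sample data
logs = ["big data", "big storage", "data data", "fast storage", "big"]
totals = word_count(logs)
print(totals)
```

Because each chunk is counted on its own, adding more data simply means adding more chunks and more machines to work on them, which is the property that makes the approach easy to grow.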
So what’s new? The whisper on the street is that Hadoop is starting to look its age. Here’s Sanjay Joshi, global big data boss at Indian giant Tech Mahindra: “Hadoop has been around for almost a decade and, when it comes to technology, that is a long time – and there seem to be some limitations. New solutions are coming up, such as Apache Storm and Spark, which seem to be able to process data faster and are much better when it comes to real-time processing of streaming data.”
Clients are already moving, including Quantcast, a world leader in measuring TV and advertising audiences, logging a trillion data records each month. Quantcast R&D boss Jim Kelly says: “Back in 2006, Quantcast was a small startup with a modest data set and a budget to match. To manage this amount of data we were running open source Hadoop software. Very quickly we found ourselves pushing its limits so we began innovating and developing our own solution. The company invested years of development to create an equitable alternative called Quantcast File System (QFS), which we rely on internally for storage and released as an open source project in 2012. This is the ticket forward into super-scale data management.”
In ye olde days of big data, the only folk who could do the analysis were highly paid data scientists. They baffled and amazed onlookers with their coding expertise. No longer. Today it is possible for ordinary staff to look for valuable patterns in data by using simple graphical interfaces. The most popular are Tableau, which uses drag-and-drop charts, QlikView, SAS Visual Analytics and Opinurate.
Karaoke chain Lucky Voice, founded by Martha Lane Fox, uses Tableau for planning. Staff can see forward booking figures, what food and drinks clients prefer, which songs are the most popular and which contribute to repeat bookings. The Tableau format eschews coding altogether. Users just grab and move data lists to create new charts.
Walmart uses the Neo4j graph database to identify products suitable for cross-selling and up-selling promotions. It is simple enough for non-technically minded staff to use. The big data software makers are aware of this trend and are now co-operating to ensure their systems mesh easily. For example, Logi Analytics, maker of a visualisation tool, has recently partnered with HP’s Vertica analytics platform, with the stated goal of ending dependency on data scientists. Connectivity between the two is automatic. Data scientists are barely needed. This ought to be what other visualisation tools aspire to.
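To give a flavour of how a graph of products can surface cross-selling ideas, here is a minimal Python sketch. It does not use Neo4j and is not Walmart's method; it simply links products that appear in the same basket and ranks the strongest links. The baskets, product names and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_co_purchase_graph(baskets):
    """Link every pair of products bought together; edge weight = times co-purchased."""
    graph = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def cross_sell_candidates(graph, product, top_n=2):
    """Return the products most often bought alongside the given one."""
    neighbours = graph.get(product, {})
    # Strongest link first; ties broken alphabetically for a stable order.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical shopping baskets
baskets = [
    ["nappies", "wipes", "beer"],
    ["nappies", "wipes"],
    ["beer", "crisps"],
    ["nappies", "beer"],
]
graph = build_co_purchase_graph(baskets)
print(cross_sell_candidates(graph, "nappies"))  # ['beer', 'wipes']
```

A dedicated graph database does the same kind of neighbour lookup at far greater scale, and a visual front end can hide the query entirely from the member of staff asking the question.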
The ultimate user-interface for processing? Netflix uses a recommendation service based on user data. Customers are presented with titles they might like and can scroll through the possibilities. This is pure big data analysis by the users, not that they realise it. There’s no reason corporate big data can’t be this smooth.