What is a data lakehouse and do you need one?
Welcome to the data lakehouse. Combining a data lake’s flexibility with a data warehouse’s management features, it offers all mod cons. Will every business come to need one?
The inexorable rise of data has changed our world and the way we do business. Data-driven insights, enabled by the sheer volume of information generated by everyday users, are revolutionising corporate decision-making. But working out how best to handle all this material remains a challenge for senior executives.
The latest term to have become a buzzword for business leaders is the ‘data lakehouse’. But what does it mean, exactly, and how can a company determine whether it needs one?
The emergence of the data lakehouse is the result of a range of developments in the field. As organisations used more and more data in their everyday operations, they started placing it into data warehouses – centralised management systems that store the material in a way that makes it easy to interrogate. But doing that soon became too restrictive for some companies, which instead pooled their data into so-called lakes: vast, unorganised gatherings of material in its native format, waiting to be analysed.
“These approaches have been used for years, driven by organisations’ increasing reliance on data insights to increase profitability, uncover new opportunities and detect issues,” explains Jitesh Ghai, chief product officer at US software firm Informatica.
Yet both approaches have run into a scalability problem. The increasing number of data sources feeding into the smooth running of a business began to test the limits of firms’ technical capabilities. The weakness of data lakes became so significant that they were rechristened data swamps by some in the industry – a testament to the muddy challenge of dredging up the insights lurking within them.
The data lakehouse is the logical solution to this problem. Ghai calls it an attempt to gain control of free data, enabling users to dump unstructured material into their systems. It’s an advancement on warehouses, which require the data put into them to be carefully processed and structured beforehand.
“Lakehouses merge elements of data warehouses and data lakes in one platform,” he says. “This model promises the best of both worlds by blending technologies for analytics and decision-making with those for data science and exploration.”
Leila Seith Hassan, head of data science and analytics at Digitas UK, observes that a lakehouse combines the high-speed performance of a warehouse with the scale enabled by lake technology. “If you think about how the world has evolved in the past 20 years and the explosion in the volume of data that has become available, data warehouses have had some limitations about what you could do with them at speed,” she says.
A lakehouse enables unstructured data to be analysed more quickly – potentially creating linkages that would previously not have been seen. It not only provides a performance improvement; it also avoids some of the data management pitfalls that have blighted businesses’ ways of working in the past.
One of the key problems facing companies in the pre-lakehouse world was that their data was often contained in several differently structured systems. In order to analyse this material, they would first have to transfer it into a single location and restructure it – processes that could make the underlying data unreliable.
“In a lakehouse, processing becomes much faster, as the system helps to organise big data more effectively,” Ghai says. “It also unifies data, bringing it all under one system, which eliminates redundancies and allows for better data management.”
A lakehouse contains some structure, but not so much that it becomes impossible for users to glean new insights from organic happenstance, Seith Hassan adds. “It gives analysts and non-technical data practitioners the ability to access data and do things such as using it for decision-making without having to extract it and put it somewhere else first.”
Has a lakehouse become a business essential and, if so, is there an ideal time to implement such a system? Companies need to conduct a cost-benefit analysis of what implementing a data lakehouse could do for their operations and decide carefully whether it’s worth their while.
“The good thing is that the way to answer this question isn’t new: work out what your requirements and use cases are for it,” says Seith Hassan, who can foresee a world in which most organisations would benefit from using a lakehouse. But she acknowledges that “the technology will cost quite a bit of money and resources to get it up and running. If it doesn’t do something for you now and potentially in the future, you shouldn’t move forward with it.”
James Corcoran is senior vice-president of engineering at streaming analytics platform KX. His company adopted a lakehouse after recognising that shoehorning different data from a multitude of sources into a single model wasn’t working.
He suggests that businesses consider investing in a lakehouse as soon as they start experiencing intractable problems with organisational silos that prevent data from being joined together logically.
But Corcoran recommends holding back until the ultimate goal of such an investment is clear. “Maybe you’re developing or launching a new product, or you’re struggling to get the insights you need to be competitive with an existing one,” he says. “Tying it to some sort of business outcome is really, really key.”
The process of planning, developing and deploying a data lakehouse is neither straightforward nor cheap, but the potential rewards are great. Data lakehouses hold huge promise for the future, according to Corcoran.
They will enable a business to “update the heart of its strategy, its data and its decision-making process”, he says. “You’ll be able to move a lot faster than your competitors.”