Mastering the data supply chain will unlock AI value

Organisations around the world are racing to deploy artificial intelligence (AI) technologies to drive efficiencies and improve decision-making. Though the mathematics behind AI has largely existed for some time, it is the combination of this foundational science with the cheap computing power and storage that has emerged in recent years, in concert with the explosion of data at companies’ disposal, which has enabled huge advancements in AI capabilities.

Navigating the hype, however, requires discipline, and turning AI algorithms into successful production systems is a lot harder than many organisations realise. The high-tech companies that do it really well, such as Amazon, Google and Tencent, have systems and production lines, almost like a factory, that manage this with expert staff. They have extensive expertise in putting the parts together and assembling and maintaining them.

For typical companies where this is not a core competency, however, attempting to build, integrate and maintain their own processing infrastructure and data stores to support a full suite of analytics, including machine-learning, modelling and simulation, while aspiring towards AI, is destined to fail.

Many select a well-intentioned programme or product, get some sample data that is manually cleaned, write an algorithm that happens to work well enough and then think it warrants rolling out across the business. They soon discover none of what they did is viable and they don’t have the skill base to manage and scale their own infrastructure, build an inventory of data assets or come up with their own data registry.

The business problem is lost and they’ve wasted their time on something that’s not differentiating for them; deep learners are particularly susceptible to this wishful thinking.

“It’s a total distraction and often the wrong people build it, so then we see use-cases where companies get to moderate or even minuscule scale and realise they didn’t know how to architect distributed systems correctly and now they have to start again,” says Jason Crabtree, co-founder and chief executive of QOMPLX, whose unified analytics infrastructure platform makes it faster and easier for organisations to integrate disparate data sources and make better decisions at scale.

“We’ve seen a lot of major financial institutions, in particular, thinking they could build all this infrastructure and capability in-house, but then rapidly have to turn to specialised partners or major cloud providers because they realise it’s too complex, it’s not cost effective and it’s not their core competency.”

The first rule of having a good AI system has nothing to do with AI: it’s managing data effectively. If a company can’t do that well, it won’t even be able to validate that the information it’s going to use in its model is of an acceptable quality or standard. With a poorly managed data supply chain, it will often find some of its data is suddenly not available and its mathematical techniques can’t deal with the missing information.

A more nuanced problem is when the right data shows up, but many of the fields are not usable because of a data quality issue. Just as a manufacturer runs into major problems if there are faulty parts in its supply chain, an insurance underwriter working from AI models with bad data is only going to get bad decisions out of them. Setting up filters that catch and apply quality control to the data ultimately going into the algorithms is vital.
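As a rough illustration of what such a filter might look like, the sketch below runs each incoming record through a list of checks and separates usable records from rejected ones, keeping the reason for each rejection. It is a minimal sketch and not a description of any particular product: the field names (policy_id, insured_value), the range limits and the function names are all hypothetical.

```python
from typing import Any, Callable, Optional

Record = dict[str, Any]
# A check returns None when the record passes, or a reason string when it fails.
Check = Callable[[Record], Optional[str]]


def require_fields(*names: str) -> Check:
    """Build a check that fails when any named field is absent or empty."""
    def check(record: Record) -> Optional[str]:
        missing = [n for n in names if record.get(n) in (None, "")]
        return f"missing fields: {', '.join(missing)}" if missing else None
    return check


def in_range(name: str, low: float, high: float) -> Check:
    """Build a check that fails when a numeric field falls outside [low, high]."""
    def check(record: Record) -> Optional[str]:
        value = record.get(name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return f"{name} is not numeric"
        if not low <= value <= high:
            return f"{name}={value} is outside the allowed range"
        return None
    return check


def quality_gate(records: list[Record], checks: list[Check]):
    """Split records into (accepted, rejected); rejected items carry their reasons."""
    accepted, rejected = [], []
    for record in records:
        reasons = [r for r in (c(record) for c in checks) if r is not None]
        if reasons:
            rejected.append((record, reasons))
        else:
            accepted.append(record)
    return accepted, rejected


# Hypothetical underwriting records: one clean, one with an empty ID, one with a bad value.
sample = [
    {"policy_id": "P1", "insured_value": 250000.0},
    {"policy_id": "", "insured_value": 100.0},
    {"policy_id": "P3", "insured_value": -5},
]
checks = [require_fields("policy_id", "insured_value"),
          in_range("insured_value", 0, 1_000_000_000)]
accepted, rejected = quality_gate(sample, checks)
```

Keeping the rejected records alongside their reasons, rather than silently dropping them, gives the people upstream something concrete to fix.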

“One of the best places to use AI is in the information cleaning, extraction and schematisation space, helping sanitise the data so users can spend their time working on high-quality data,” says Crabtree.

“Savvy organisations will set up more advanced data processing and pipelines that help manage the flow of raw material, and organise, clean and structure it into the right places. By doing that, worst case they vastly improve the efficacy of the people working on the data science side and best case they can now start to do things like automated model-tuning and training, which can help further scale their use of AI.

“If they don’t get the data supply chain right, they’re deluding themselves and essentially trying to apply advanced techniques over the top of bad data.”

QOMPLX works with enterprises to get their data supply chain in order and to integrate their many point solutions. It focuses on large-scale data processing and a lot of streaming data processing for mission-critical and high-performance use-cases, such as securing some of the world’s largest companies against cyberattacks, including validating that users are who they claim to be by securing enterprise authentication in Active Directory and Kerberos.

To do this, the company built a streaming-focused analytics infrastructure that is able to collect, move, aggregate and process data around the globe. That broad infrastructure is now also applied to non-security use-cases, focusing on data risk attributes and helping with cleaning and scaling this information.

Crabtree, who previously served as a special adviser to senior leaders in the US Department of Defense cyber community, recognised that when most people talk about AI and machine-learning, they’re actually just talking about retrospective models. By essentially driving while looking in the rear-view mirror, companies end up searching for “god-like algorithms” where they take data along with a statistical or machine-learning technique, often incorrectly labelled as AI, and end up with an overfitting problem.

As soon as there is a material change that means the conditions no longer match those the model was trained under, companies can make really bad decisions really quickly. This is why all modelling at scale requires a data supply chain to ensure the data is real, valid, and formatted and structured correctly, as well as constant checking of whether circumstances have changed.
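One simple way to picture the "constant checking" described above is to compare recent values of an input against the values the model was trained on, and raise a flag when they have moved too far apart. The sketch below measures how far the recent mean has shifted from the training mean, in units of standard error. The function names and the threshold of 3.0 are assumptions chosen for illustration; real monitoring would typically track many inputs and use more robust tests.

```python
import statistics


def mean_shift_score(training: list[float], recent: list[float]) -> float:
    """Distance of the recent mean from the training mean, in standard errors.

    Needs at least two training values and one recent value.
    """
    mu = statistics.fmean(training)
    sigma = statistics.stdev(training)
    recent_mean = statistics.fmean(recent)
    if sigma == 0:
        # Training data never varied, so any difference at all is a change.
        return 0.0 if recent_mean == mu else float("inf")
    standard_error = sigma / len(recent) ** 0.5
    return abs(recent_mean - mu) / standard_error


def conditions_changed(training: list[float], recent: list[float],
                       threshold: float = 3.0) -> bool:
    """True when recent data no longer resembles what the model was trained on."""
    return mean_shift_score(training, recent) > threshold


# Illustrative numbers only.
training_values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
print(conditions_changed(training_values, [10, 9, 11, 10]))   # similar to training
print(conditions_changed(training_values, [15, 16, 14, 15]))  # clearly shifted
```

When the flag is raised, the sensible response is usually to stop trusting the model's output for that input until someone has looked at what changed.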

Savvy companies can then switch algorithms focused on very specific areas. They run one algorithm when the market’s really good, for example, and another one when it’s volatile, and they blend together multiple models.
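To make the idea of switching and blending concrete, the sketch below weights two toy forecasting models according to how volatile recent returns have been: in calm conditions the first model dominates, in stressed conditions the second takes over, and in between the two are mixed. Both models, the volatility thresholds and all names here are hypothetical stand-ins, not the approach any specific firm uses.

```python
import statistics
from typing import Sequence


def calm_model(returns: Sequence[float]) -> float:
    """Hypothetical calm-market model: average of the five most recent returns."""
    return statistics.fmean(returns[-5:])


def volatile_model(returns: Sequence[float]) -> float:
    """Hypothetical volatile-market model: median of the whole window."""
    return statistics.median(returns)


def volatile_weight(returns: Sequence[float],
                    calm_vol: float, stressed_vol: float) -> float:
    """Weight in [0, 1] for the volatile model, rising with observed volatility."""
    vol = statistics.pstdev(returns)
    raw = (vol - calm_vol) / (stressed_vol - calm_vol)
    return min(1.0, max(0.0, raw))


def blended_forecast(returns: Sequence[float],
                     calm_vol: float = 0.01, stressed_vol: float = 0.03) -> float:
    """Blend the two models; the thresholds are illustrative assumptions."""
    w = volatile_weight(returns, calm_vol, stressed_vol)
    return (1 - w) * calm_model(returns) + w * volatile_model(returns)


quiet_market = [0.001] * 10
choppy_market = [0.05, -0.06, 0.04, -0.05, 0.06, -0.04]
print(blended_forecast(quiet_market))   # driven entirely by calm_model
print(blended_forecast(choppy_market))  # driven entirely by volatile_model
```

A gradual weight, rather than a hard switch, avoids the forecast jumping abruptly when volatility hovers around a single cut-off.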

“This is the ‘no god algorithms’ principle we advocate,” says Crabtree. “A blended approach is key to navigating real-world volatility and uncertainty, rather than blindly using overfit and often improperly applied machine-learning techniques in production use-cases without adequate attention to the details.

“We blend retrospective models, including statistical and machine-learning, with generative models, such as the agent-based models that allow users to watch what happens in an experiment, and then we look at how they differ. A continuous cycle of learning between top-down and bottom-up modelling allows companies to perform better. We encourage the enterprises we work with to start simple by collecting, ingesting, centralising and structuring their information. Once that data is listed in an appropriate data store, they can work on more advanced modelling, including machine-learning and moving gradually towards AI.”
