The secret of seamless automated data integration


Data-driven: how businesses can overcome replication and integration challenges

Replication and integration issues can prevent businesses from getting good insights from their data, but a clear idea of their needs – and the right tools and governance strategy – can help them tackle any issues

Businesses need good data for everything from customer relationship management to product development. But integrating data from multiple fragmented sources can be a challenging task – particularly as data volumes are rapidly increasing. Indeed, even replicating data to ensure its availability for analytics or disaster recovery isn’t as straightforward as it might seem.

“The central challenge lies in the fact that databases are not static,” says Patrick Bangert, SVP of data, analytics and AI at Searce. “In fact, common enterprise databases will receive updates frequently throughout the day. Copies must be kept mutually up to date, which is especially difficult in non-cloud environments where handshaking and automatic updating is complex.”

Alongside the technical challenges associated with keeping data consistent across replicas, such as monitoring for latency and accuracy issues, there are a host of issues around security and compliance – including data privacy.

“A geographical spread of data creates a geographical spread of data regulations,” says Julie Smith, director of data and analytics at Alation, “and data professionals implementing and managing database replication may be restricted…by the compliance challenges of moving data across borders with differing data laws.”

Indeed, security issues are one of the biggest challenges firms may face when it comes to database replication. “Without consistency and visibility for security teams, replicated databases have the potential to ‘slip through the cracks’ – lacking the same security controls as the original database,” says Anders Blair, senior software engineer at Expel.

Jeremiah Morrow, director of product management at Dremio, also highlights the resources needed to successfully manage data replication. “Supervision of updates, monitoring, and regular testing are necessary for the data replication process, which requires IT involvement. Due to the global scarcity of skilled technical labour and IT departments being stretched thin, assigning team members to this task may be challenging.”

When resource shortages are compounded by weak disaster recovery processes, it’s a recipe for data replication problems. “Being able to failover successfully to a replicated database not only takes defined standard operating procedures…but practice from the various teams involved, such as security and compliance,” says Blair.

On the plus side, cloud services have made it easier to scale database replication and reduced the cost. “However, new issues have arisen such as vendor lock-in and the introduction of third-party security risks,” says Steve Whiteley, head of data at CTI Digital.

Avoiding bottlenecks

High-volume data integration, which involves consolidating data from various sources into a unified view, is increasingly common in today's data-driven business environments. However, this process can be fraught with challenges. 

Poor data quality and integrity can quickly become a serious problem when data is being integrated rapidly at scale, for example. “In the rush to integrate vast amounts of data, inconsistencies, duplicates, or errors can creep in, leading to unreliable data sets,” says Peter Wood, CTO at Spectrum Search. “This not only affects the accuracy of business insights but also can have regulatory implications, especially for businesses dealing with sensitive information.”

Data bottlenecks are another issue. “…[T]he sheer volume of data overwhelms the integration tools, resulting in delayed or failed data processing,” says Wood. “This can directly impact decision-making processes, as businesses rely on timely data to inform their strategies.”

These inefficiencies can be costly. “The resources spent on rectifying data errors, along with potential regulatory fines and the loss of business opportunities due to inaccurate data, can quickly add up,” Wood explains. “Inefficient data integration processes also divert resources from core business activities, affecting overall productivity and profitability.”

The right solution

Thankfully, there are solutions to many of the issues around data replication and integration. But before looking to implement them, organisations first need to fully understand their data requirements.

“What is the goal for the replication? What data is needed? What are the regulatory and compliance requirements?” says Smith. “A data catalog can prove invaluable at this preliminary stage in order to help businesses understand what kind of data they are dealing with and what their requirements will be.”

Real-time data replication or high-volume integration may be an expensive overkill for some organisations, for instance. “If the priority is disaster recovery then slower processes which focus on integrity may be preferable,” says Whiteley. “Where speed is emphasised, for example synchronising customer records or inventory, then high-speed change data capture should be the focus.”
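Production change data capture tools typically read the database's transaction log, but the core idea can be illustrated with a simpler high-water-mark poll. The sketch below is a minimal Python/SQLite illustration only; the `users` table, its columns, and the timestamp-polling approach are all assumptions for demonstration, not any vendor's actual mechanism.

```python
import sqlite3

def sync_changes(source, target, table, watermark):
    """Copy rows changed since `watermark` (a high-water-mark timestamp).

    Minimal change-data-capture sketch: real CDC tools read the
    database's transaction log rather than polling a timestamp column.
    """
    rows = source.execute(
        f"SELECT id, name, updated_at FROM {table} WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    for row in rows:
        # Upsert, so re-running the same sync never duplicates rows.
        target.execute(
            f"INSERT INTO {table} (id, name, updated_at) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET name = excluded.name, "
            "updated_at = excluded.updated_at",
            row,
        )
    target.commit()
    # Advance the watermark to the newest change copied this pass.
    return max((r[2] for r in rows), default=watermark)
```

Each run copies only rows modified since the last watermark, which is what keeps high-speed synchronisation of customer records or inventory cheap relative to full-table reloads.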

Once organisations understand their requirements, there will be multiple replication and integration strategies available to them – and selecting the right one will come down to each organisation's own bespoke requirements. “Whatever solution is employed, organisations must ensure it is monitored with reconciliations occurring to capture errors and data inconsistencies as soon as possible,” adds Smith.
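The reconciliations Smith describes can be sketched in a few lines: compare the source and replica by row count and per-row checksum to surface missing rows, extra rows, and rows that have drifted out of sync. This is a minimal Python/SQLite illustration with hypothetical table and column names, not a production reconciliation tool.

```python
import hashlib
import sqlite3

def reconcile(source, replica, table, key, columns):
    """Compare a source table with its replica and report discrepancies.

    Row counts catch missing or extra rows; per-row hashes catch rows
    whose contents have drifted between the two copies.
    """
    cols = ", ".join([key] + columns)

    def snapshot(conn):
        rows = conn.execute(f"SELECT {cols} FROM {table} ORDER BY {key}")
        # Map each primary key to a hash of the remaining columns.
        return {
            row[0]: hashlib.sha256(repr(row[1:]).encode()).hexdigest()
            for row in rows
        }

    src, rep = snapshot(source), snapshot(replica)
    return {
        "missing_in_replica": sorted(src.keys() - rep.keys()),
        "unexpected_in_replica": sorted(rep.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & rep.keys() if src[k] != rep[k]),
    }
```

Running a check like this on a schedule is one way to "capture errors and data inconsistencies as soon as possible" rather than discovering them in a downstream report.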

Automation and AI can also help to enhance the efficiency and accuracy of data replication and integration. “Automation minimises human error and speeds up repetitive tasks, while AI and machine learning algorithms can predict and resolve inconsistencies in data sets,” says Wood.

Investing in scalable and flexible IT infrastructure is also crucial. “This means choosing technologies and platforms that can handle increased data loads and integrate seamlessly with various data sources,” says Wood. “Cloud-based solutions are particularly effective, offering scalability and the ability to handle large data volumes efficiently.”

When optimised data replication and integration is achieved, it provides multiple advantages. “It can improve data architecture through increased resilience, as well as improving the ability for geographically spread workforces to seamlessly share data,” says Smith. “In a world becoming more and more driven by data, optimising integration and replication also improves the capability for real-time analytics with less fear of impacting ‘business as usual’, which in turn unlocks even greater data-driven power.”

How the five pillars of seamless data integration support Customer 360

The smooth flow of data from multiple sources to one destination is essential for achieving a complete view of the customer, and it cannot be achieved without strong support in five key areas

Data insights are the fuel for Customer 360: the unified view of customers across all channels that is vital for delivering highly personalised experiences and improved engagement. 

This comprehensive view of the customer relies upon the capture of every interaction a business has with them. But this data also needs to be centralised, accessible and reliable to be of real value to decision-makers. 

Ensuring the smooth, rapid flow of all this data from source to one centralised destination is a deceptively complex engineering problem. Do-it-yourself data pipelines, for instance, can make Customer 360 a complex undertaking that demands considerable investment in time, labour and money.

Fivetran’s automated data integration platform frees data teams from building and maintaining these pipelines, making it easier for businesses to achieve Customer 360. But they should also aim to strengthen the following five pillars of their data foundation:

1. Reliability

Today, 44% of data engineers’ time is spent maintaining data pipelines – time that could be better spent on more strategic work. The figure nevertheless illustrates just how important data reliability is.

A managed platform gives businesses the ability to monitor and adapt to changes in their data foundation without the need for this level of manual intervention. Proactive monitoring of API breakages makes it easier to handle modifications to APIs, for example, and idempotence (the ability to perform the same operation multiple times on the same data and always get the same result) prevents the creation of duplicate data when data syncs fail. 
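Idempotence as described here is commonly achieved with keyed upserts: replaying a batch after a failed sync leaves the target in exactly the same state as applying it once. The SQLite sketch below is illustrative only; the `events` table and batch format are assumptions, not Fivetran's actual implementation.

```python
import sqlite3

def apply_batch(conn, batch):
    """Apply a sync batch idempotently.

    Upserting on the primary key means replaying the same batch after a
    failed or interrupted sync cannot create duplicate rows: the second
    application simply overwrites each row with identical values.
    """
    conn.executemany(
        "INSERT INTO events (event_id, payload) VALUES (?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET payload = excluded.payload",
        batch,
    )
    conn.commit()
```

Because the operation is safe to repeat, a pipeline can simply retry a failed sync from the start instead of trying to work out exactly which rows made it through.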

This means that decision-makers can always have total confidence in their data. “Trust in data is critical when you're going on a data transformation project, which a lot of our customers are doing right now,” says Dominic Orsini, lead solution architect at Fivetran. “If you see a report and you don't trust the data, you're not going to use it. So reliability is key to having a successful data foundation.”

2. Control and governance

A data catalog (an inventory of data assets in the organisation) is an essential first step toward good controls and governance. “Yes, they’re hard to implement,” Orsini says. “But they result in huge cost savings further down the line.”

That’s because it’s not uncommon for two data pipelines to be performing the same function, for example, or for two reports to be providing the same information. Often, this is because different teams have deployed their own data tools and systems. “The data catalog allows you to assess all of that,” says Orsini.

A data catalog also provides security and legal teams with visibility for audits. But it must be combined with granular access control. “Access control is key, and it has to start from the beginning as well,” says Orsini. “It allows you to scale, but it also allows you to control what's happening within your data foundation.”

3. Observability

Metadata is king when it comes to the observability of your data platform. Ideally, it should be automatically sent to the data catalog, streamlining audits of data access and handling. Orsini adds: “This also allows you to see what's changed – if, say, a pipeline is damaged, what report or which team is affected.”

Monitoring and alerting related to data integration and status is, of course, a major component of strong observability. But the core data team should also have the necessary tools to know what individual teams are doing with the organisation’s data, ideally in real time.

This level of observability is critical for empowering teams to configure and deploy their own data connectors, as well as effective auditability. “Observe your platform, see what people are doing, see what's changed, who has access to what, whether someone has added a column to a pipeline,” says Orsini. “An observability tool that works well with the other tools in your data stack allows you to do that in one big dashboard.”

4. Scalability

Once you’ve got the previous three pillars in place, you need the ability to scale your data foundation without destabilising them. 

Automation and standardisation are key here – particularly in terms of onboarding. In other words, if a new user joins the organisation, they should automatically have access to all the data used by their team. Streamlining data access in this way means new users don’t have to raise a support ticket every time they want access to a certain tool. Likewise, the deactivation of users across all platforms should be consolidated and automated.

Workflow templates are another key component of scalability. By restricting what data, dashboards and management systems developers can access and work with, for example, workflows allow organisations to onboard new hires quickly but in a controlled way. Orsini adds: “All of that can be done in code, so you can scale without having to monitor everything that everyone is doing through a UI.”
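The onboarding and offboarding automation described above can be sketched as access-as-code: team-to-dataset grants live in version-controlled configuration, so a new joiner's access is derived from their team rather than from support tickets. Everything below (team names, dataset names, the helper functions) is a hypothetical illustration, not any specific IAM product.

```python
# Hypothetical access-as-code sketch: grants are declared once in config,
# reviewed like any other code change, and applied automatically.
TEAM_DATASETS = {
    "marketing": {"campaigns", "customer_segments"},
    "finance": {"ledger", "invoices"},
}

def datasets_for(user_teams):
    """Union of datasets a new user should see, derived from their teams."""
    grants = set()
    for team in user_teams:
        grants |= TEAM_DATASETS.get(team, set())
    return grants

def deactivate(user, platforms):
    """Revoke a leaver everywhere in one consolidated pass.

    Each entry stands in for a real revocation call to that platform.
    """
    return [f"revoked {user} on {platform}" for platform in platforms]
```

The design choice is that the config, not a UI, is the source of truth: scaling to hundreds of users means editing a mapping, and audits read the same file.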

5. Expertise

Expertise is the crucial fifth pillar of a solid data foundation. To ensure you have the skills you need to achieve initiatives like Customer 360, first complete a skills assessment of your current team and then determine where there are gaps – i.e. where you need training, where you need to hire, and whether you need new software services to get your team up and running on a particular tool.

Expertise also needs to go hand in hand with a data culture: a shared set of beliefs and behaviours that values and prioritises the use of data to improve decision-making and business performance. Orsini also recommends having data champions in teams across the business. 

“They can help to sell the idea of prioritising the use of data, help do the training, and develop your understanding of problems,” Orsini says. “By coming together and reporting back to the business, they can also help it to track the level of trust in the data foundation.”

Customer 360 achieved

In a digital ecosystem where end users now expect an Amazon-level of personalisation at every touch point, a data foundation that’s supported by these five pillars should enable data usage at scale, helping anyone in a company wield the power of data to improve customer outcomes and deliver that all-important Customer 360 view.

Commercial Feature

How can data in legacy enterprise systems be managed?

Centralising data within legacy enterprises can free up resources for tech teams to focus on productive technologies such as artificial intelligence

Legacy enterprises often have highly complex and vast data systems that pose substantial challenges for them to manage.

Due to their size, legacy companies often have a very large number of databases to draw from.

Taylor Brown, co-founder and COO of Fivetran, said: “The majority of our enterprise customers are looking for access to all their on-premises databases. The challenge with that is some of these companies have between 30,000 and 50,000 databases across their organisations.”

Regulations, including the General Data Protection Regulation (GDPR), also present additional obstacles as companies must comply with the limitations around which datasets can be centralised.

Brown said: “It’s a complicated task for a central IT organisation within a large company to know what can be centralised and what cannot. For companies that fall under the regulations, we connect all the various data sources in an extremely standardised way so it’s the same every single time.”

Fivetran, which has worked with large data-driven companies including Lufthansa, Nando’s and Morgan Stanley, also offers controls for IT teams that automatically detect data that falls under Principal Adverse Impact (PAI) disclosures or Personally Identifiable Information (PII) data. 

The controls allow IT teams to set up policies that prohibit moving these specific datasets, which cannot be centralised under the regulations.

Brown said: “When companies need to pull in the PII data, there are a few different options. The first one, which is the one companies usually start with, is just to block the PII data to find out what’s in all their databases.

“Once they know that, they can move it into privacy-specific data marts within each of the data warehouses. Or we can hash it, which means for example we could hash email addresses, which would turn the emails into something unreadable if they appear on your screen, and then you can un-hash them if you need access to that information further down the line.”

The importance of connectors

Gaining access to their various datasets and increasing reliability and speed are typically the top priorities for companies.

Brown said: “Customers always want better connectors because the faster we can move the data for customers, the more opportunities there will be to work with that data downstream. You can go beyond daily or hourly reporting to triggering additional workflows that [automatically] happen when the data moves in.”

Last year, Fivetran added 20 new connectors to its product line-up, and this year it will add 300. In 2024, the company plans to expand further and add another 1,000.

Back to basics for AI integration

Data centralisation is crucial to taking advantage of the progress being made in generative artificial intelligence (AI). Companies that start with solid data foundations and have strong reporting standards will be better placed to implement AI-based models.

Brown said: “Companies are worried about falling behind with AI but many of them still haven’t got the basics right – of centralising their data and using it for reporting. Once you do that you can build models from there, but I’ve seen companies centralising their data in a non-sophisticated way that makes it much harder to work with.

“My advice would be to take the time to build a cloud data platform or a data lake and set up a process for moving the data into that in a reliable and dependable way. Don’t try to get ahead of yourself without having the right infrastructure in place.”

What are the benefits of data replication and integration?

Working with Fivetran to sync its external data has allowed Monzo to better understand its customers and more effectively target its communications. Sharing this at Big Data LDN, Monzo explores its improved data replication and integration

Online bank Monzo uses data replication and integration in its marketing and customer engagement to understand the best ways to interact with customers. The challenger bank uses a third-party platform to manage the data it has gathered from its users and identify shared attributes. It then uses data integration specialist Fivetran to synchronise the data.

Klara Raic, analytics engineering manager at Monzo, said: “Talking about money can be an uncomfortable subject for some people, so it really is important that we have the right attributes in our platform to allow us to build the right kind of campaigns and talk to customers about the products they want to hear about.”

Monzo contacts customers about new and existing products through emails or push notifications in its app. It gathers data from these campaigns on how users interacted with messages, including how many customers opened the email, followed an embedded link or opened the notification.

Raic said: “It’s basically plug and play. Fivetran connects everything, and we can select all the categories of information that we want and do some easy transformation. After that, the data comes back to us and we combine it with all the other data points from the app so we can understand both immediately.”

Customer data can be used to understand both the short-term and long-term impacts of a marketing campaign. It provides information on immediate engagement, such as click-through rates, as well as the effect on future app behaviour.

Raic said: “If we prompt a customer to do some budgeting or assess their spending, we can look long-term to see how that affects behaviour on the app and if the marketing communication helped them achieve their financial goals.”

The data from customer communications ranges from simple information on how many people opened an email sent by the bank to complex inputs such as whether the email was sent at the best time of day to optimise engagement.

Pan Hu, senior data scientist at Monzo, said: “Our customer communications campaign reporting benefits from the seamless integration facilitated by Fivetran data connectors. It enables us to easily extract data from our third-party platform, including campaign dimensions such as names, tags and message variants. We also sync message engagement metrics for channels that are native to the platform such as push notifications and emails.”

The bank uses the attributes identified by Fivetran to create a dimension table that stores all the information from all the campaigns.

How is the data used?

Monzo uses an uplift model to predict how a customer is likely to respond to a marketing action, helping the bank identify which customers are most likely to respond only after a marketing message, a voucher or a discount.

Customers are split into four groups: people who will always buy a product, regardless of marketing, people who will never buy it, regardless of marketing, those who would only buy something after receiving a marketing action and those who would be less likely to buy a product due to a marketing action.

The bank uses this model to boost the effectiveness of its campaigns across multiple products, while complying with contact rules to avoid sending customers too many messages. 

The bank built specific uplift models with different scores for each user and a constant that it can adjust to perform controlled experiments. 
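Monzo's actual models are not detailed here, but the core uplift calculation can be sketched simply: estimate the conversion rate among customers who received a message, subtract the rate in a randomised held-out control group, and do so per segment. The record format below is an assumption for illustration.

```python
from collections import defaultdict

def uplift_by_segment(records):
    """Estimate uplift per customer segment from randomised campaign data.

    Each record is (segment, treated, converted). Uplift is the conversion
    rate among customers who received the message minus the rate in the
    control group: a positive score flags the 'persuadables' worth
    contacting, a negative one the customers best left alone.
    """
    # counts[segment][treated] holds [conversions, total customers].
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, converted in records:
        counts[segment][treated][0] += int(converted)
        counts[segment][treated][1] += 1
    uplift = {}
    for segment, groups in counts.items():
        rates = {t: (c / n if n else 0.0) for t, (c, n) in groups.items()}
        uplift[segment] = rates[True] - rates[False]
    return uplift
```

Ranking segments (or, with a richer model, individual customers) by this score is what lets a campaign target those likely to respond only because of the message, while respecting contact limits for everyone else.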

By comparing customers who received no marketing communications with both those who were sent communications at random and those who were sent communications optimised by the model, the bank achieved improvements in campaign performance of up to 200%.

Valeria Cortez, senior data scientist at Monzo, said: “We were able to connect millions of data points, from our individual customers, between our third-party platform and Monzo using Fivetran to not only get data from messages received but also use all the information to build uplift models. By understanding the user preferences for different products, we can achieve much better marketing message results.”

The benefits of seamless automated data integration

Moving huge volumes of data poses myriad challenges but making the change need not be problematic and can reap many benefits for businesses

Duncan Jefferies is a freelance journalist and copywriter specialising in digital culture, technology and innovation. His work has been published by The Guardian, Independent Voices and How We Get To Next.
Joe McGrath is a financial journalist and editorial director of Rhotic Media. He has written for Bloomberg, Financial Times and Dow Jones, and was previously asset management editor at Financial News.