It is increasingly necessary for companies to have the same data available in multiple systems. In this context, data virtualization has become a hot topic; over the last two years, we have even come across the term in tenders. However, it does not succeed in every situation where it is requested. In our experience, a more suitable alternative to data virtualization is often data replication. It may not seem obviously better at first glance, but data replication has a number of features working in its favor.
In 2021, Varada published the results of a survey of more than 130 data experts in the USA on the state of data virtualization:
- 63% of experts consider virtualization too expensive
- 33% of experts say that there is actually no business case for virtualization
- 47% of experts think that virtualization does not perform well enough
So what then?
When should you use data virtualization and when data replication?
Both approaches address the same problem: how to ensure that a user or application can access data from multiple systems.
Data virtualization is a quick solution for browsing and connecting data from several different systems at the same time. Its main advantage is the ability to query numerous systems directly, without using an intermediary. Another benefit may be that you can query anything at all that comes to mind. All the data are available. You can search for new information and mine insights from diverse areas.
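As an illustration, the idea behind virtualization can be sketched with SQLite's `ATTACH` mechanism: a single connection answers a federated query across two live "source systems" directly, without copying any rows. The database names, tables, and data below are invented for the example; a real virtualization layer would be a dedicated engine, not SQLite.

```python
import sqlite3

# Two independent "source systems", simulated here as named in-memory
# SQLite databases (tables and rows are invented for the example).
crm = sqlite3.connect("file:crm?mode=memory&cache=shared", uri=True)
erp = sqlite3.connect("file:erp?mode=memory&cache=shared", uri=True)

crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
crm.commit()

erp.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
erp.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 99.0), (11, 2, 42.5)])
erp.commit()

# The "virtualization layer": one connection that attaches both sources
# and answers a federated query directly, without copying any data.
hub = sqlite3.connect("file:crm?mode=memory&cache=shared", uri=True)
hub.execute("ATTACH DATABASE 'file:erp?mode=memory&cache=shared' AS erp")
rows = hub.execute(
    "SELECT c.name, o.amount"
    " FROM customers c JOIN erp.orders o ON o.customer_id = c.id"
    " ORDER BY c.id"
).fetchall()
print(rows)  # [('Alice', 99.0), ('Bob', 42.5)]
```

Note that every such query runs against the live sources, which is exactly why virtualization puts a load on the source systems.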
Data replication is an efficient solution for providing the same data to several systems simultaneously. It is suitable wherever you do not need to have absolutely all the data, but only those that have the potential for further use, especially in analytics and reporting.
The main advantages of data replication are its reliability and the availability of replicated data.
Virtualization vs. Replication
| Virtualization | Replication |
| --- | --- |
| I want to work quickly | I want to work efficiently |
| I just want to look at the data | I want to know the history of the data |
| I am not sure which data I want to use | I know which data I want to use |
| The output does not require the data structure to be modified | I need to modify/transform the source data for the output |
| The data/view is provided on a one-time basis | The data/view is available repeatedly |
Let us focus on three main criteria that can tell you when it is better to virtualize and when to replicate.
Criterion no. 1 – Data volume/size
Data virtualization places a burden primarily on the source systems where the data are stored. If we choose to virtualize large volumes of data, we must have sufficient computing resources at our disposal. In practice, this has proven difficult.
In contrast, data replication is computationally resource-efficient because it can operate in incremental mode. In other words, it collects only new and changed data.
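A minimal sketch of watermark-based incremental replication, assuming the source rows carry an `updated_at` change marker (the table, column names, and timestamps are illustrative; real setups may use a version column or a change-data-capture log instead):

```python
import sqlite3

# Source and target systems, simulated as in-memory SQLite databases.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Alice", "2024-01-01T10:00:00"),
    (2, "Bob",   "2024-01-02T09:30:00"),
])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")

def replicate_increment(src, dst, watermark):
    """Copy only rows changed since `watermark`; return the new watermark."""
    rows = src.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    dst.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)"
        " ON CONFLICT(id) DO UPDATE SET name=excluded.name, updated_at=excluded.updated_at",
        rows,
    )
    dst.commit()
    return max((r[2] for r in rows), default=watermark)

wm = replicate_increment(source, target, "")   # initial load: 2 rows
source.execute("INSERT INTO customers VALUES (3, 'Carol', '2024-01-03T08:00:00')")
wm = replicate_increment(source, target, wm)   # incremental run: 1 new row
print(target.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # 3
```

Only rows newer than the stored watermark are transferred, so repeated runs touch the source briefly instead of re-reading everything.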
A practical example:
As soon as you are transferring more than about 1 TB of data, for example, virtualization "stops working". This is not to say it is impossible, but such large volumes usually do not fit in RAM during virtualization, which makes it problematic to retrieve them all and present them to users.
The solution is to increase the memory, but this also raises the cost of virtualization. It is precisely the price of RAM today that is blocking development and thus restricting user experience.
Replication, unlike virtualization, makes the most of disk or cloud storage space: the data are stored there in a data warehouse or on a data platform. In addition, the cost of data storage has decreased dramatically in recent years.
Criterion no. 2 – Data transformation complexity
Data virtualization runs into trouble when you need the data in a different format than in the source systems and the data need to be transformed in a complex way. The more transformations are undertaken, the more constrained the source systems become.
In the case of data replication, however, you do not have to deal with this issue because the source data are transferred without being modified; the transformations are carried out in the target system. The latter is specifically designed to handle complex transformations.
A practical example:
When you need to anonymize information, i.e., mask the original values and replace them with others, you are dealing with a fairly expensive computational operation that strains the RAM during virtualization. When anonymizing sensitive data, this operation is performed disproportionately often. This is why replication is the best option for creating an analytics environment.
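A minimal masking sketch, assuming a keyed-hash scheme: each sensitive value is replaced by a salted SHA-256 digest, so the column stays consistent across rows (and remains joinable) while the original cannot be read back. The salt and the sample rows are invented for the example.

```python
import hashlib

# Illustrative secret salt -- in practice this would come from a secrets store.
SALT = b"replace-with-a-secret-salt"

def mask(value: str) -> str:
    """Replace a sensitive value with a truncated salted SHA-256 digest."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

rows = [("Alice", "alice@example.com"), ("Bob", "bob@example.com")]
masked = [(name, mask(email)) for name, email in rows]
print(masked)
```

In a replication pipeline this masking runs once per row at load time; under virtualization the same hashing would be recomputed on every query, which is exactly the repeated strain described above.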
Criterion no. 3 – Number of users
Another case where data virtualization is not the best choice is when a large number of users are working with the system or application at the same time. When they all “request” virtualized data from multiple systems simultaneously, it quickly places a significant load on the source systems. This is true even if they are dealing with a small volume of data with minimal transformations.
If you use data replication here, the number of queries to the source systems is minimized. Rather, users work with the data in the target systems, which are typically much faster and more powerful. This may be because they were designed later than the source systems and tend to be built on more modern and robust technologies.
A practical example:
Using virtualization to make all customer operations available to all the employees in all the branches of one of the largest Czech banks would place such a burden on the source systems that it would simply not be possible without first replicating the data to a central data warehouse.
Virtualization or replication according to a data specialist
What do two data specialists, whose job is frequently to prepare new views of data, gain from virtualization and what from replication?
1. A Reporting Specialist needs to prepare a new reporting dashboard that combines data from multiple systems.
The Reporting Specialist knows which data they need, so:
- for a one-off report, we use data virtualization to provide them with the source data, to which the final report is connected.
- for a recurring report, we use data replication to move the source data to the reporting platform, where the final report is prepared. The report can then be updated regularly as new data arrive.
2. A Data Scientist needs to prepare data for a new analytical model.
The Data Scientist creates a model
- for the first time – usually not knowing which data will be relevant to the model, or all the data sources, and thus needing to explore and assess them. Consequently, we decide to virtualize the data. This means the data will immediately be available for modelling in their entirety. The Data Scientist can then choose among the data, vary the requirements, and specify the need for data transformations.
- repeatedly – modifying or improving it, so knowing which data are necessary. We recommend replicating the data to the analytical platform. The Data Scientist can then transform, modify and analyze them at will, as well as prepare the final model.
Conclusion: When should you use virtualization and when replication?
Data virtualization is useful when you need to access data quickly and do not know which data you will need, how many there will be or how they should look. In such cases, virtualization is the perfect tool to facilitate finding the right solution. On the other hand, data virtualization is unsuitable if you want to work with the entire data history, if you want to load huge data sets, or if you want to carry out a large number of data transformations and modifications.
Data replication is recommended when you need to have “important data” available to a large number of users or applications at the same time, when you need to work with the data repeatedly, or when you need to transform the data in complex ways. Data replication is inappropriate if you only want to look at the data.