Pavel Kysela for LUPA.CZ: The Big Data market will develop in the shape of a “W”
05. 02. 2019
Reading time: 10 minutes
Offloading is an extract, load, transform (ETL) process for the transfer of a huge amount of data from an enterprise data warehouse (EDW) to big data platforms. It is one of the key steps that has to be overcome by companies wanting to take advantage of Big Data and advanced analytics.
As the amount of data increases, so does the need to process or store big data. As traditional data warehouses can be inefficient for some types of data, specific large data platforms can be used as an alternative. This typically applies to log, sensor or signal data, which is generated in large volumes and which gains value only through processing and interconnection with other data, such as the device or customer profile. But how to ensure that the linking of large data and (traditional) profile data is always up-to-date and complete?
One option is to regularly move key data from traditional data warehouses (EDW) to a big data platform. The following text discusses aspects of ensuring regular EDW data transfer.
The integration of many data sources into a single corporate environment (enterprise data warehouse, or EDW) is generally achieved using so-called ETL processes. These include not only the retrieval of data from data sources (data extraction), but also transformation (normalisation, denormalisation, structure changes) and the uploading of data (loading) to EDW, where it is further processed, analysed and reported.
The aim of ETL processes is to provide the user with an appropriate view of the data generated by different systems and to provide them with the necessary information as a basis for decision-making.
With the acquisition of a Big Data (BD) platform, or sometimes a Big Data Warehouse (BDW), the IT architecture and data transfer principle become slightly more complex. The new platform must connect with both systems that generate large volumes of data, and with the EDW, in which reliable, high-quality information from the company's business is usually located.
And it is in connection with BDW fulfilment that a new concept - "offloading" - appears within ETL processes. This term refers to an ETL process whose task it is to perform BDW. Offloading is the transfer of massive quantities of data from ERW to BDW. Sometimes this task is also referred to as data mirroring.
Data from an enterprise data warehouse is generally of high cost and quality, as companies have often invested considerable resources and efforts to obtain it. The benefits of EDW are not inconsiderable, and therefore it is advisable to take full advantage of them in connection with BDW. These consist mainly of:
Companies such as banks or financial institutions, or retail or telecommunications companies, often find that their EDW is not sufficient for one of the above benefits and acquires a BDW. This not only provides all the above benefits, but also brings new ones: it has much greater scalability (cheaper storage, parallel data processing, and supports a full range of analytics tools, not just SQL).
Although recent trends are leading large businesses to acquire BDWs, their internal IT departments often do not know exactly what such a platform could be used for and how it can be deployed in the infrastructure architecture.
One motivation may be to “alleviate” the growing demands on an EDW. In this case, they must take the key architectural decision as to whether to “migrate” an EDW to BDW, or whether the company should extend an existing EDW to a BDW environment. In the first case, the company is faced with a very radical and demanding migration process, and in the second case, it will need to resolve the "offloading" of data when extending an EDW to a BDW environment.
Fig. 1: Integration of data from different systems into EDW
Offloading is a way of offering data available in EDW in a BDW environment, too. Its goal is to ensure that EDW data is available in the BDW for its interconnection and allows new or richer insights into the company's core business.
Offloading therefore has to do the following:
When transferring data between EDW and BDW, the transfer method must be determined. In this respect, two transfer methods can be identified.
The first is an indirect transfer method, in which the EDW regularly prepares extracts (batches) for a selected server, from where they are then transmitted to the BDW via STFP or another transfer method. The BDW processes these extracts and prepares them for final storage. An indirect method is preferred if the quantity of data is not too large or if it is not possible to utilise a second transfer method for security reasons.
The second transfer method can be referred to as direct, whereby a direct connection is made between the EDW and BDW and a technology that is capable of establishing a connection and transmitting data directly without the need to create an 'extract' is selected. In this case, the technologies most often selected are those native to the BDW, such as Apache Sqoop, Apache Spark or Apache Kafka, or technologies that are suitable for the chosen EDW environment.
A further architectural challenge is determining when and which data to transmit so as to keep it up-to-date.
In terms of transfer scheduling, it is important that the updating of data from the ETL source to the EDW does not overlap with offloading.
Tables can be transferred in two basic modes. The first mode is full, which represents the transfer of a whole table (used for small tables up to 100,000 rows), while the second type of transfer can be referred to as incremental, which is the transfer of selected data only (ideal for large tables over 100,000 rows).
For incremental transfer, it is important to note that it is most often done in the forward (i.e. new data only) or recent mode (new data and updates of recent transfers), but should support the possibility of updating historical data
Fig. 2: Data transfer from EDW to BDW
When transferring data it is necessary to ensure its consistency, i.e. pay attention to data types and avoid the risk of loss of accuracy. Therefore an output layer is created on the EDW side for the modification of data prior to transfer to a transfer-friendly structure.
On the other hand, on the BDW side a landing layer is created. Data is ‘landed’ on this layer then stored in a compatible structure. The data from the landing layer in the BDW is then transferred to the destination layer, where it is accessible to the general user.
The transfer between the source and the output layer is provided on the EDW side. In the EDW, this transfer is handled by storing existing data or by looking at source data. In both cases it is necessary to convert data types (to compatible data types). The EDW output layer then binds so-called offload jobs, which ensure data transfer to the BDW landing layer.
Transfer between the landing layer and the target layer is facilitated by so-called transformation commands. These are used to remove invalid characters (some EDWs often generate non-compatible characters in their outputs) and to convert data types (from compatible to target types).
These steps can be executed using Apache Hive, Apache Impala, and Apache Kafka, Apache Spark or other native BDW technologies.
The means and level of metadata synchronisation play an important role in maintaining data consistency. Their structures on the BDW can be:
The metadata check may take place once per day (if changes in the EDW are scheduled regularly) or before every transfer (if the EDW allows ad hoc changes to the structures). In any case, the structure change in the EDW should be reflected in both the output layer (in the EDW) and in the landing and destination layer (in the BDW).
The output layer can be fixed (re-saves the modified data) or in the form of views (when the data is only converted into compatible data types, or into reduced data for transfer). The final variant then depends solely on the EDW options.
A scheduler that the company already has and uses to manage data transfer and EDW processing can be used for data transfer management. It is followed by offloading, which begins to transfer the already processed data. If this is not possible, a custom scheduler or custom API for the receipt of data transfer requests can be created for offloading. Finally, the traceability of all actions taken must be considered, which necessitates the provision of a location for the storage of metadata from offloading, and from which it will be available for audit.
Metadata includes information on both the structures of the transferred tables and the actual data transfers, as well as an overview of the transfer parameter settings. Traditional relational databases (MariaDB or Postgre) can serve this purpose. A set of dashboards that provide the IT infrastructure with information on ongoing and completed transfers can be created above it.
Offloading as an ETL process should be not only unidirectional, but also support the possibility of returning data (e.g. interesting results from data analysis) from the BDW to the EDW. It is then necessary to reverse the output layer on the BDW and the landing layer on the EDW.
In order to ensure quality and timely data on both platforms, end users must have the choice of using the EDW or the BDW to work with the data. It is important that “traditional” EDW data is the same at both locations.
The transfer of user rights must be adequately addressed in order allow end users freedom in their decision-making. Within the EDW, rights are often addressed through a user rights administration system (LDAP, AD, etc.), from where user rights are generated for the EDW.
The same needs to be done for the BDW. However, as the BDW supports a greater number of technologies, this task is somewhat more complex than for the EDW.
The creation of a high-quality database is a real challenge in companies with large data. In most cases, IT must find answers to a number of conceptual issues. The company's strategic decisions about its direction and need for innovation must often be taken into account.
The linking of traditional EDWs (or even several EDWs) to BDWs can rely on the offloading concept, even with respect to moving data outside the company to cloud technologies. However, the complexity of data services also places increased demands on internal IT operations. Therefore, the sophisticated automation of data transfers from platform to platform through sophisticated offloading can make things easier for the IT department, while at the same time bringing a greater degree of comfort, quality and range of data for use in business. Data must deliver key information to businesses and promote competitiveness, responsiveness and innovation.
An enterprise data warehouse (EDW) that contains 10,000 tables (total 200-500 GB) must be updated daily. There is no external metadata source for tables and the only source is the EDW.
In terms of data distribution, the ratio of small and large tables is 9:1. The EDW permits transfer for eight hours per day. A direct connection can be used for transfer.
This connection creates both so-called robust connections, which can transfer any table, and fast connections, which can transfer only small tables. The company has a scheduler and wants to use it to manage all transfers.
If a direct method is chosen for BDW transfer, a single, highly robust connection, which can transmit any table within one to ten minutes, is created during the permitted eight hours. The advantage of this transfer is that it is highly economical as it creates only a small burden on the EDW.
However, assuming that the average transfer time of one table is two minutes, the total transfer would last approximately 13 days. The disadvantage of this solution is therefore obvious.
Alternatively, the task can be performed by creating 100 parallel threads in order to transfer the tables for greater speed. The entire transfer would be completed in three and a half hours, thus meeting the time requirement. The disadvantage, however, is that such a transfer would lead to instability of the EDW system, since the EDW would have to dedicate all of its resources, most commonly RAM and CPU, to it, and these may not be available (they must still provide/store data from their source systems).
In order to resolve this, it is necessary first to clarify how the transfer will take place. Because the company has a scheduler, the transfer will be activated when the data is ready for transfer (the output layer is filled).
In terms of transfer, an alternative is often sought between the number of connections and the RAM and CPU requirements in the EDW. In this case, for example, 1-2 robust transfers for large tables and 10-20 fast connections can be selected to ensure the transfer of all small tables.
Final processing takes place as soon as the data is available on the landing layer. Due to the use of a large number of tables, tables can be merged into so-called download processes for transfer management; these are then configurable in the scheduler.
Metadata synchronisation between the EDW and the BDW must be ensured in order to perform this transfer automatically. For metadata synchronisation, a variant of EDW structure synchronisation with BDW is chosen, that is, the structure of every table from the EDW is generated and used in the BDW for both the landing layer and the target layer.
Finally, it is necessary to ensure that all transfers are documented in detail, because offloading generates a huge quantity of logs. For this purpose it is advisable to use a relational database.
Tables are created to log transfer metadata, table metadata (store table structures in time for audit purposes), and store transfer settings.
This database must be synchronised with the scheduler and all transfers are executed based on its settings. Finally, a set of dashboards can be created over this database to ensure the monitoring and audit of transfers.