8 tips for building mature Big Data platforms

Companies that base their business on data generate more revenue than those that rely on intuition or other approaches in their decision-making. Don’t risk falling behind.

Building a Big Data platform takes a couple of months, but it takes far longer for the platform to fulfil its function and fully mature. We have seen this more than once in real implementations in banks, telecoms and manufacturing companies. Although we built the data platform for one bank in 2015, it has still not reached maturity. There is still much to improve and develop, and more and more data are still being added.

Big Data projects tend to be complex and can be realized using various technologies and architectures. So how do you go about making your efforts a success?

1. Break down data silos

We often come across a lack of willingness to share data with anyone else in the organization. In large businesses, an ownership mentality persists and data silos endure. While people are aware that they need data, they do not want to share them.

Business tip: The transformation to a data-based company is rooted in users’ heads. It cannot be achieved without their motivation, ideally encouraged by the company’s leadership, and without clearly stated benefits. One Czech bank has managed to break down its silos thanks primarily to a committed Chief Data Officer. Do not hesitate to win management over to your side.

Technology tip: Data platforms (whether on-premise or in the cloud) make it possible to share data and thus break down data silos. You can automate data transfers to the platforms from individual departments (for example, replicating them using Adoki). Because the data are copied rather than moved, your originals remain intact, while other departments can work with the copies as necessary.
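
To illustrate the idea of replication that leaves the source untouched, here is a minimal sketch. It assumes that a department delivers its data as files in a directory and that the platform has a landing directory; it does not describe how Adoki or any other product works. The function names `sha256_of` and `replicate` and the directory layout are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source_dir: Path, landing_dir: Path) -> List[Path]:
    """Copy new or changed files to the landing area; never modify the source."""
    copied = []
    for source in sorted(source_dir.rglob("*")):
        if not source.is_file():
            continue
        target = landing_dir / source.relative_to(source_dir)
        if target.exists() and sha256_of(target) == sha256_of(source):
            continue  # identical copy already present, nothing to do
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if sha256_of(target) != sha256_of(source):
            raise IOError(f"Checksum mismatch after copying {source}")
        copied.append(target)
    return copied
```

Running `replicate` a second time on unchanged data copies nothing, which is what makes it safe to schedule. The source directory is only ever read.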

2. Create a Data Lake  

Many of our customers are building a Data Lake or a Big Data platform to get away from the traditional approach that leaves them with data scattered across a large number of silos, in databases, file systems, applications and other places. Data Lakes have one irrefutable advantage: they can hold huge volumes of data, both structured and unstructured. You can store not just text or numerical data but also images, sounds, videos, and perform interesting analytics on top of them.

Data Lakes, such as Hadoop or similar cloud-based systems, help solve the problem of separated data and enable cooperation. Unfortunately, this can cause new problems.

Business tip: Think carefully about which data you might need and for what purpose. To that end, do not hesitate to reach out to other departments to discuss their needs, plans and visions. Not everyone may see things the same way as you. The Data Lake should contain what is necessary. A lack of communication and willingness to share blocks an organization’s ability to gain real value from data.

3. Think about Data Governance 

From the very beginning, it is worth focusing on the data-related topics that typically cause difficulties: data consistency, cleanliness, traceability and clarity, including what information the data contain and where they come from. We often work in environments that are missing data catalogs and business glossaries. A centralized data catalog should offer an overview of the data portfolio while simultaneously supporting how access rights are granted and managed. That way, business users can see what data the company has and know how to access what they need, so they can make good decisions and build innovation on solid foundations.

Business tip: According to Gartner, one of the top-rated Data Governance platforms is the Czech Ataccama. It uses both automation and an innovative approach to data management – the so-called Data Quality Fabric, which Gartner ranked among the “Top 10 data and analytics trends for 2021”.
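
The two roles of a centralized catalog described above, giving an overview of the data portfolio and supporting access rights, can be sketched in a few lines. This is only an illustration under simple assumptions, not the design of Ataccama or any other governance platform. The class names, field names and role names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class CatalogEntry:
    """One dataset in the catalog: what it is, where it comes from, who may use it."""
    name: str
    description: str
    source_system: str
    owner: str
    allowed_roles: FrozenSet[str] = field(default_factory=frozenset)


class DataCatalog:
    """In-memory catalog; a real one would be backed by a governance tool."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[CatalogEntry]:
        """Find datasets whose name or description mentions the keyword."""
        needle = keyword.lower()
        return [
            entry
            for entry in self._entries.values()
            if needle in entry.name.lower() or needle in entry.description.lower()
        ]

    def can_access(self, dataset: str, role: str) -> bool:
        """Unknown datasets are denied by default."""
        entry = self._entries.get(dataset)
        return entry is not None and role in entry.allowed_roles
```

Keeping the description, the source system and the allowed roles in one record means that a business user can both discover a dataset and learn whether they may use it from a single place.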

4. Facilitate data democratization

Although our customers are working extremely hard to democratize user access as part of the overall modernization of the data ecosystem, there are still barriers standing in their way – mainly to do with data security and GDPR. Nevertheless, different user groups, including data analysts, data engineers and data scientists, can work with larger volumes of data and have a variety of options for how to leverage them.

Technology tip: The data may be available, but not all of them need to be available to everyone. You can manage and access data at various levels: directly on the data platform, but also through the user interface, for example in Power BI or Tableau.

5. Consider anonymization  

In order to protect customer privacy, it is essential to work correctly with anonymized data. Thus, a procedure should be developed to ensure this – securely, automatically and reliably. Today, a secondary data environment, containing solely anonymized data, is a necessity.  

Business tip: For data scientists, we often create a data environment where advanced statistical and machine-learning models can be produced purely on top of anonymized data. Various anonymization frameworks and scenarios can be used for such environments, but it is important to make sure the data are always fresh.
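
One building block of such an automated procedure might look like the sketch below. It replaces direct identifiers with keyed hashes, so the same customer always maps to the same token and the data can still be joined. Strictly speaking this is pseudonymization, which on its own is usually not enough for full anonymization; it is shown only as an example of one step. The field names, the key and the function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical column names used only for this illustration.
SENSITIVE_FIELDS = ("customer_id", "email")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest, so equal inputs give equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict) -> dict:
    """Return a copy of the record with sensitive fields replaced by tokens."""
    cleaned = dict(record)  # the original record is left untouched
    for field_name in SENSITIVE_FIELDS:
        if cleaned.get(field_name) is not None:
            cleaned[field_name] = pseudonymize(str(cleaned[field_name]))
    return cleaned
```

Because the function is deterministic and cheap, it can run on every load into the secondary environment, which helps keep the anonymized data as fresh as the source.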

6. Leverage and integrate various data sources

When provisioning data to a data platform, take care not to establish a data swamp. A data swamp forms when data are integrated and then sit for several months without any specific analytical or business purpose. Consequently, we recommend building the Data Lake incrementally, according to business specifications. It should, in the first place, provide results in one specific area. Only then should you start adding more and more sources in accordance with use cases.

Business tip: When provisioning data sources to a central repository, we often find that companies have two types of data: their own and data from external sources. If you have a Data Lake, you can acquire these “foreign” data once and then pass them on to other teams for further processing.
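
The data swamp warning above can be turned into a simple regular check: list every dataset that has been in the lake for a while and is still not tied to any use case. The sketch below assumes you record the ingestion date and the linked use case for each dataset. The names `Dataset` and `swamp_candidates` and the 90-day threshold are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, List, NamedTuple, Optional


class Dataset(NamedTuple):
    name: str
    ingested_on: date
    use_case: Optional[str]  # None means no business purpose has been recorded


def swamp_candidates(
    datasets: Iterable[Dataset], today: date, max_idle_days: int = 90
) -> List[str]:
    """Return names of datasets older than the threshold with no use case."""
    cutoff = today - timedelta(days=max_idle_days)
    return [
        dataset.name
        for dataset in datasets
        if dataset.use_case is None and dataset.ingested_on <= cutoff
    ]
```

A list like this is a prompt for a conversation with the data owner, not an automatic deletion rule.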

7. Plan for a big data platform from the start

A Big Data project need not be enormous from the beginning. However, it is still a good idea to design the solution architecture so that it can be scaled and extended easily in the future. Keep in mind the advantages and disadvantages of on-premise and cloud solutions. And do not forget to evaluate your needs regularly. For example, how quickly are you able to respond to your company’s growing volumes of data and a rising number of users? Thanks to the cloud, which eliminates the need to invest in hardware, organizations can increase computing power from one day to the next. On the other hand, pushing petabytes of data through your company’s Internet connection may not always be easy.

Business tip: It is better to start with a smaller project that has a higher chance of success and then build on it over time.

Technology tip: Think ahead for scalability, extensibility, capacity and cost.
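
To put a number on the point about pushing petabytes through an Internet connection, a back-of-the-envelope calculation is enough. The sketch below assumes decimal units (1 PB = 10^15 bytes) and a constant usable share of the link, which is a simplification. The function name and the default efficiency of 0.8 are hypothetical.

```python
def transfer_days(
    data_bytes: float, link_bits_per_second: float, efficiency: float = 0.8
) -> float:
    """Estimate how many days it takes to push data through a network link.

    efficiency is the assumed usable share of the nominal bandwidth (0 to 1].
    """
    if data_bytes < 0 or link_bits_per_second <= 0:
        raise ValueError("data size must be >= 0 and bandwidth must be > 0")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the interval (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds in a day


PETABYTE = 1e15  # bytes, decimal definition
GIGABIT_PER_SECOND = 1e9  # bits per second

days = transfer_days(PETABYTE, GIGABIT_PER_SECOND)
print(f"1 PB over 1 Gbit/s: about {days:.0f} days")
```

Under these assumptions, one petabyte over a 1 Gbit/s line takes roughly 116 days, so an initial bulk load of that size has to be planned rather than improvised.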

8. Avoid a lot of problems by having a strategy

When it comes to building a data ecosystem, it definitely pays to have a strategic plan that is supported by a strong enterprise architect and covered from a financial perspective. This will help you avoid a lot of inconveniences and issues, though there will always be obstacles along the way. On a number of our Big Data projects, we have used a functional strategy to effectively remove barriers that would otherwise prevent business users from accessing the data. Even though we are working in an agile world, strategy and phased development are still imperative.

Business tip: A well-developed, pragmatic strategy eliminates unpleasant surprises on your way to building your ideal data company.

Author: Dagmar Bínová has dedicated her entire professional career to data, analysis and business use. She leads the Big Data and Data Science team at Adastra.

Source: the article was published in IT Systems 9/2022