Data Scalability and Protection Are Key in Handling Petabytes From Our Self-Driving Vehicle Tests (Jan Zahradník, Valeo)

In a world where data volumes are measured in exabytes, Valeo is striving to become a leading player in the autonomous vehicles sector. "Data collection and analysis are crucial for our testing processes. We need to manage huge volumes and ensure their security," says Jan Zahradník, ADAS/AD SysVal Toolchain & Big Data Manager at Valeo, which develops systems for autonomous cars and has a research center in the Czech Republic.


  • What technologies and methods does Valeo use for data collection and analysis from test drives?
  • What challenges arise from sharing and processing vast volumes of data in practice?
  • How does the company ensure the security and protection of the collected data?
  • What does Jan Zahradník predict for the future of autonomous vehicles?

Ivana Karhanová: Test drives of autonomous cars generate gigabytes of data per second. A typical project generates petabytes of data, which must be transported to Prague for Valeo’s teams to not only review but also to replay and simulate individual drives using data from dozens of sensors and LiDARs. How do you build a data platform when you face the prospect of having exabytes of data (10^18 bytes)? Our guest in the studio is Jan Zahradník, ADAS/AD SysVal Toolchain & Big Data Manager at Valeo, which develops systems for autonomous cars and has a research center in the Czech Republic. Hello.

Jan Zahradník: Hello.

Ivana Karhanová: Please, let’s start by explaining how data collection works for autonomous cars on a test track.

Jan Zahradník: We use modified versions of sensors for data collection. These sensors are based on microwave radars at 24, 77, and 79 GHz. LiDARs and various cameras are placed both under the front windshield and as 360-degree cameras around the vehicle. We collect the lowest digital level of data from these sensors. For cameras, we record the value of each pixel in sync with time. We also add information from extra sensors that are more expensive and more accurate than those a production vehicle needs for its functions. We use them as a so-called “ground truth”: they let us measure reality more accurately and compare it with our own sensors.

Ivana Karhanová: But these are not in the cars that you sell, right?

Jan Zahradník: Exactly, because that would make the system much more expensive.

Ivana Karhanová: When a car goes around the test track, experiencing various situations, how much data does that generate?

Jan Zahradník: We test under various conditions. This includes about ten different enclosed testing areas that Valeo has around the world. One of the largest is in the Czech Republic in Milovice. There, we can test on a three-kilometer track at speeds of nearly 200 kilometers per hour. We test various scenarios there, also using control robots to simulate dangerous situations. Most of the data is collected in the real world, under conditions where the systems will be used by end customers. For instance, traffic infrastructure or driver behavior varies in different countries. There, we collect statistical data to discover situations that are, let’s say, unplanned.

Ivana Karhanová: And from one such test drive, you have petabytes of data?

Jan Zahradník: Yes. Once we gather all the necessary data, we create a statistical model, which is a combination of conditions under which data needs to be recorded. These include winter/summer conditions, weather, and urban/rural settings, and each combination of categories requires a certain amount of data. We record it with modified cars, numbering in the tens. These cars drive under the required conditions and record data daily on external discs with capacities in the higher tens of terabytes. One disc holds around eighty terabytes, and on some projects we can fill it in a day. Then we start to figure out how to get the data to where we will work with it. One of the largest places where we work with it is Prague.

Ivana Karhanová: How do you physically get the data to Prague?

Jan Zahradník: The most common method so far is that we physically send the discs by mail. Unfortunately, this involves delays and things like customs, as we send from different parts of the world and within various trade zones.

Ivana Karhanová: Okay, so the discs arrive in Prague by car or plane. What then?

Jan Zahradník: Then we take the data from the discs and start copying them to data storage.

Ivana Karhanová: So CTRL+C/CTRL+V, and a message pops up that the recording will take 3 years? 🙂

Jan Zahradník: Fortunately, it goes relatively quickly. Because we need daily continuity of data, our infrastructure is deliberately oversized so that it can process each new batch that arrives daily on discs from cars driving around the world. We can empty a disc in about eighteen to twenty hours. The remaining discs are kept in reserve and are then sent back out into the world.
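
As a rough back-of-the-envelope check of these figures, emptying a disc of around eighty terabytes in eighteen to twenty hours implies a sustained copy rate of a little over one gigabyte per second. The sketch below assumes decimal terabytes (10^12 bytes) and a completely full disc; neither assumption comes from the interview, and the function name is illustrative only.

```python
# Back-of-the-envelope: sustained copy rate needed to empty one disc.
# Assumptions (not from the interview): decimal units (1 TB = 10**12 bytes)
# and a completely full 80 TB disc.

TB = 10**12  # bytes in a decimal terabyte


def required_rate_gb_per_s(disc_bytes: int, hours: float) -> float:
    """Return the average rate in GB/s needed to copy disc_bytes in the given hours."""
    seconds = hours * 3600
    return disc_bytes / seconds / 10**9


if __name__ == "__main__":
    for hours in (18, 20):
        rate = required_rate_gb_per_s(80 * TB, hours)
        print(f"{hours} h -> {rate:.2f} GB/s")
```

Under these assumptions the rate falls between roughly 1.1 and 1.2 GB/s per disc, sustained for most of a day.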

Ivana Karhanová: With this data, you then simulate further drives, so you not only need to get it to one place but also to another place to simulate further processes on them?

Jan Zahradník: Once we get the data into the data center, one of the first things we do is verify its quality. We need to check whether it contains all the sensors, whether any are missing, whether they are mutually synchronized, whether they were set correctly, and whether the cameras were pointing correctly at the sky. Then we start processing the data and watch how well we are filling the various combinations of the statistical model. Of course, some situations end up recorded better than others. Therefore, we have to send the test vehicles to wherever we can get the data combination we still need.
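
The kind of ingest check described here (are all sensors present, is any stream empty, are the streams synchronized) could be sketched roughly as follows. This is only an illustration: the sensor names, the `SensorStream` structure, and the synchronization tolerance are hypothetical and do not describe Valeo's actual toolchain.

```python
# Illustrative ingest check; sensor names and the tolerance are hypothetical.
from dataclasses import dataclass

EXPECTED_SENSORS = {"front_camera", "radar_77ghz", "lidar"}  # hypothetical names
MAX_START_SKEW_S = 0.05  # hypothetical tolerance between sensor start times


@dataclass
class SensorStream:
    name: str
    start_time_s: float  # timestamp of the first sample, in seconds
    sample_count: int


def check_recording(streams: list[SensorStream]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    present = {s.name for s in streams}
    # Report every expected sensor that has no stream in the recording.
    for missing in sorted(EXPECTED_SENSORS - present):
        problems.append(f"missing sensor: {missing}")
    # Report streams that exist but contain no samples.
    for s in streams:
        if s.sample_count == 0:
            problems.append(f"empty stream: {s.name}")
    # Crude synchronization check: all streams should start at nearly the same time.
    if streams:
        starts = [s.start_time_s for s in streams]
        skew = max(starts) - min(starts)
        if skew > MAX_START_SKEW_S:
            problems.append(f"start times differ by {skew:.3f} s")
    return problems
```

A recording that returns an empty list would move on to processing; anything else would be flagged before it counts toward the statistical model.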

Ivana Karhanová: So, for example, send them to serpentines, potholed roads, and mountains?

Jan Zahradník: Yes, if that’s part of what the function needs to know, and under those conditions, it needs to work.
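
Tracking how well the recorded drives fill the combinations of the statistical model, and therefore where vehicles still need to be sent, might look something like the sketch below. The categories, the target of ten hours per combination, and the use of hours as the unit are all made up for illustration.

```python
# Illustrative coverage tracking; categories, targets, and hours are made up.
from collections import Counter
from itertools import product

SEASONS = ("winter", "summer")
WEATHER = ("dry", "rain")
SETTINGS = ("urban", "rural")
TARGET_HOURS = 10.0  # hypothetical target per combination


def coverage_gaps(
    recorded: list[tuple[str, str, str, float]],
) -> dict[tuple[str, str, str], float]:
    """Map each under-recorded combination to the hours still missing."""
    hours = Counter()
    # Sum the recorded hours per (season, weather, setting) combination.
    for season, weather, setting, h in recorded:
        hours[(season, weather, setting)] += h
    gaps = {}
    # Walk every combination in the model, including ones never recorded.
    for combo in product(SEASONS, WEATHER, SETTINGS):
        missing = TARGET_HOURS - hours[combo]
        if missing > 0:
            gaps[combo] = missing
    return gaps
```

The combinations left in the result are the ones for which test vehicles would still have to be dispatched.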

Ivana Karhanová: I imagine it quite simply, but in reality, it’s a huge amount of data. How do you ensure that everything works for you? I mean calculations and everything you do with the data.

Jan Zahradník: We don’t solve extremely complicated tasks, but we do perform them on huge volumes of data. This means we work closely with the IT infrastructure and with how the many different types of computer networks can transfer data. We have to optimize the architecture so that it can handle the load.

Ivana Karhanová: When we move on to the development of the next level of autonomous cars, where do you expect data demands to increase?

Jan Zahradník: That’s a difficult question. There are two aspects that need to be addressed – technology and cost. We can’t build infrastructure that’s a million times bigger. We have to innovate methodology and combine the real world with the virtual. We are able to simulate extreme situations in the virtual environment and look for the limits where the system fails, without using data from the real world.

Ivana Karhanová: Does that mean you’re running the test drive again?

Jan Zahradník: Yes, and we have two ways of doing that. In reality, we don’t have any physical vehicle in the data center. We have a combination of electronics and sensors from the vehicle, connected together, but it’s modified differently. For example, the cameras don’t actually see anything: we project the image that came on the discs to them. We run the same ride again, exactly as it happened. We call this open loop or playback, where we play recorded data into the sensors with a newer version of embedded software. Then we have another type of validation, closed loop, where the algorithms affect what happens. That requires broader modeling of sensor behavior, the vehicle, and the environment in which the models are simulated.

Ivana Karhanová: Does that mean the software reacts differently to the same data?

Jan Zahradník: Yes, in the second method, which we call closed loop, the world we drive in is virtual, offering almost infinite possibilities of parameters we need. We don’t use data recorded in the real world.
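
The difference between the two validation modes can be illustrated with a toy example. In open loop, recorded samples are replayed and the software's outputs never change what it sees next; in closed loop, each output feeds back into a simulated world. The controller, the one-dimensional vehicle model, and every number below are hypothetical, and real closed-loop setups model sensors, vehicle, and environment far more broadly.

```python
# Toy contrast between open-loop replay and closed-loop simulation.
# The controller, the vehicle model, and all numbers are hypothetical.

def controller(distance_m: float) -> float:
    """Software under test: return a deceleration in m/s^2 (0.0 means no braking)."""
    return 6.0 if distance_m < 40.0 else 0.0


def open_loop(recorded_distances_m: list[float]) -> list[float]:
    """Replay recorded samples; the outputs are logged but never change the inputs."""
    return [controller(d) for d in recorded_distances_m]


def closed_loop(distance_m: float, speed_mps: float,
                dt: float = 0.1, steps: int = 100) -> float:
    """Simulate a stationary obstacle ahead; each output changes the next state.

    Returns the remaining distance when the vehicle stops or the steps run out.
    """
    for _ in range(steps):
        decel = controller(distance_m)
        speed_mps = max(0.0, speed_mps - decel * dt)  # braking lowers the speed
        distance_m -= speed_mps * dt                  # the vehicle closes the gap
        if speed_mps == 0.0 or distance_m <= 0.0:
            break
    return distance_m
```

For example, `open_loop([100.0, 30.0])` simply reports what the software would have commanded at each recorded sample, while `closed_loop(50.0, 20.0)` shows whether the same software actually stops the simulated vehicle short of the obstacle.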

Ivana Karhanová: Speaking of the fact that the data has to travel by mail, what’s stopping you from using other methods to transfer it?

Jan Zahradník: That’s an interesting question. We are currently discussing with several cloud providers and telecommunications operators the possibility of getting data online to key data centers. We don’t have a solution yet, and there are several reasons, including technological ones.

Ivana Karhanová: Does that mean network throughput?

Jan Zahradník: Yes, network throughput. The throughput might allow it, but there’s also the question of cost, especially when using cloud services. We process most of the data physically in Prague, because we need to connect our electronics. Connecting these things to the cloud is not simple, and besides, in Prague we are limited by the available options for connecting to the cloud.
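
To give a sense of the throughput involved, the sketch below estimates the average link rate needed to upload one day's recordings within 24 hours instead of shipping discs. It assumes decimal units, one full eighty-terabyte disc per car per day, and a hypothetical fleet of ten cars; the interview only says the cars number in the tens.

```python
# Rough estimate of the network rate needed to upload recordings instead of shipping discs.
# Assumptions (not from the interview): decimal units and a hypothetical fleet of 10 cars,
# each filling one 80 TB disc per day.

def required_gbit_per_s(bytes_per_day: float) -> float:
    """Average link rate in Gbit/s needed to move bytes_per_day within 24 hours."""
    return bytes_per_day * 8 / 86_400 / 10**9


per_car = required_gbit_per_s(80 * 10**12)
fleet = required_gbit_per_s(10 * 80 * 10**12)
print(f"one car: {per_car:.1f} Gbit/s, ten cars: {fleet:.1f} Gbit/s")
```

Under these assumptions a single car already needs several gigabits per second of sustained uplink, and the requirement scales linearly with the fleet.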

Ivana Karhanová: The amount of data when simulating further drives would be huge, right?

Jan Zahradník: Yes, and the fees for transferring data out of the cloud are high. We are discussing various models with providers, because we are not the only customer with similar needs.

Ivana Karhanová: And do they have a solution for you?

Jan Zahradník: They do, and it’s a very current topic.

Ivana Karhanová: Let’s take an example where I work as a big data manager in the banking sector with tens of millions of records a day, and they are small records. In contrast, you work with a huge amount of more comprehensive data. How is the data architecture built for such a solution?

Jan Zahradník: I joined the company in 2013. At that time, we were already developing various platforms for product validation. We consolidated multiple independently developed solutions into a unified system. This integration involved forming a team that included external IT experts, as our primary expertise at Valeo was focused on algorithms and embedded software. For designing our architecture, we chose to collaborate with Adastra.

Ivana Karhanová: Are you considering future scalability and the growth of data volumes?

Jan Zahradník: Yes, scalability was a key requirement. We needed the ability to scale both to additional or faster on-premise infrastructure as well as to leverage public cloud services. We aim to integrate these services into our platform, while maintaining all data recorded in our system, ensuring we have instances of that data in specific clouds.

Ivana Karhanová: So, essentially, you’re addressing storage and computational power separately?

Jan Zahradník: Correct. The primary services we utilize are data storage and computing clusters. Depending on the situation, we might need just processors or graphic cards. Whether it’s data storage or computing nodes, we can utilize them either on-premises or at various Valeo locations where we have existing cloud infrastructure. Our decision is based on where the data is located.

Ivana Karhanová: Beyond cost, what are the main challenges you face in data processing and management?

Jan Zahradník: Data protection and security are significant concerns. We are very cautious about safeguarding our data and preventing breaches. Before using any open-source tool, we meticulously review it and examine the supplier’s processes to avoid the risk of malicious code. It comes down to our internal processes and how swiftly we can adopt new technologies from a security and risk analysis perspective.

Ivana Karhanová: When do you realistically think we can advance to the next level of autonomous vehicles? In five, ten, twenty years?

Jan Zahradník: I can be quite precise here. We had two projects. One, a Mercedes S-Class with a combination of sensors from Valeo, enabled level 3 autonomous driving, which is conditional automation. It operates only in certain areas in Germany, at up to sixty kilometers per hour, mainly in traffic jams. The previous project was for Honda in Japan, a similar system operating under specific conditions at speeds of up to sixty kilometers per hour.

Ivana Karhanová: Thank you for the discussion.

Jan Zahradník: Thank you for the invitation.

The interview is based on a podcast (in Czech) and the text has been edited and shortened using ChatGPT. 
