Application monitoring can pinpoint problems that occur only occasionally, for example in banking applications. "Thanks to it, we can see what is happening inside the running application, where, and how long it takes. We can link that to log metrics, infrastructure metrics, cloud metrics, and everything else we find in modern IT today. But most importantly, we can go down to the code level, which then helps developers fix the problem," explains Jiří Kurejko, an application monitoring expert at Adastra.
- When to deploy application monitoring?
- Why is it worthwhile for banks to implement application monitoring before launching new internet banking?
- How do we reduce the number of complaints to IT about the functioning of systems in companies?
- Why do some customers have technical problems and others are unaffected?
Listen to the podcast (CZ)
Read the podcast as an interview
Ivana Karhanová: A real example from practice. Every day at 10:30, the internal system in one Czech bank slowed down so much that the back office could not work properly. A team of the most senior people had been racking their brains for several months, but they hadn’t come up with anything. The mysterious 10:30 problem continued. The back office could go for coffee for a few dozen minutes each day. Sitting with me in the studio today is the man who solved the problem: Jiří Kurejko, application monitoring expert at Adastra. Hi, welcome to the studio.
Jiří Kurejko: Iva, thanks for having me.
Ivana Karhanová: What was the problem?
Jiří Kurejko: I would like to clarify that I didn’t solve it by myself; I worked on it together with a team of people. The problem was that some logs got crossed, to put it non-technically. The application was logging its own activity at runtime, and the log writes got tangled up with each other, so it was not easy for my colleagues in that bank to find the problem.
Ivana Karhanová: OK, but why couldn’t they, over several months, break down those logs, go through them, and figure out where it was blocking?
Jiří Kurejko: They didn’t see that deeply into it because the monitoring world we already have today is so complex. The systems are so big that just analyzing all this from the logs is not enough. Sometimes the information is not there at all, or it’s buried under a lot of other information, and getting to it and linking everything together is difficult.
Ivana Karhanová: And how did you and your team solve that? What happened when you suddenly saw what was wrong?
Jiří Kurejko: We deployed Dynatrace on their application. It is an application monitoring solution that can break problems in applications down to the code level. With it, we can see what is happening inside the running application, where, and how long it takes. We can tie that in with log metrics, infrastructure metrics, cloud metrics, and everything else that we find in modern IT today. But most importantly, we can go down to the code level, which in the end helps developers fix the problem.
Ivana Karhanová: Let’s describe what was going on in the bank in layman’s terms. What was causing everyone to put their feet up at 10:30 because they had nothing to work on?
Jiří Kurejko: Thousands of people work in that system. They are given tasks in the back office and have to complete them within a certain time. Then they get another task, and the system logs those tasks, so that when a task is completed, the number of the task and the number of the person who completed it are logged somewhere. People started working at eight or nine o’clock, and the volume of work gradually increased. By 10 o’clock or 10:30, the log writes started blocking each other in the event logging system. The application was logging one task and couldn’t log the other tasks in the queue. They locked each other out, and the app started slowing down. The queues of tasks waiting to be logged eventually caused the crash.
Ivana Karhanová: How hard was it then to fix the problem?
Jiří Kurejko: It was terribly simple. It was enough to think for two hours about how the logging was written and to do it asynchronously rather than synchronously. It sounds very simple when you say it, but when there are thousands of users in the system, dozens of servers, and connections to dozens of other banking systems, it is very difficult to find the needle in the haystack. You can search for days, weeks, or months when a lot of systems are interconnected.
Ivana Karhanová: I did mention that they had been looking for the cause of this problem for months. How long does it then take for application monitoring to show which piece of code is causing the error?
Jiří Kurejko: We spent a couple of days with the team analyzing the problem, talking about when it happens, why, and whether there’s a trigger. We installed Dynatrace on the application. Once it was installed, and once we understood the problem, it took a couple of hours to help them uncover it and show how those transactions were gradually slowing down and how the time spent in the locks and in the logging was increasing.
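The interview does not show the bank's actual code, so the following is only a minimal sketch of the general idea of moving from synchronous to asynchronous logging, using Python's standard `logging.handlers.QueueHandler` and `QueueListener`. The logger name `backoffice.tasks`, the helper `build_async_logger`, and the in-memory `ListHandler` are hypothetical names invented for illustration.

```python
import logging
import logging.handlers
import queue


def build_async_logger(name, target_handler):
    """Return (logger, listener). Log calls only enqueue a record;
    a background thread owned by the listener does the slow write."""
    log_queue = queue.Queue(-1)  # unbounded queue between app and writer
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example isolated from the root logger
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    return logger, listener


class ListHandler(logging.Handler):
    """Hypothetical stand-in for the real (slow) log destination."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


sink = ListHandler()
task_log, listener = build_async_logger("backoffice.tasks", sink)

# The request threads return immediately after enqueueing.
for task_id in range(3):
    task_log.info("task %d completed by user %d", task_id, 100 + task_id)

listener.stop()  # drains the queue and joins the writer thread
```

With this arrangement, worker threads no longer wait on each other for the log destination; only the single listener thread touches it.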
Ivana Karhanová: Let’s stay with another example from banking. In another bank where you were deploying application monitoring, the purpose was to detect critical points that could have compromised the live operation of the new internet banking. What is the strength of application monitoring in this particular case?
Jiří Kurejko: At first glance, application monitoring is mainly for production monitoring, but this is a common misconception that we are trying to dispel. If we detect the problem with application monitoring during development or testing, it is cheaper for the company. Fixing a problem in production costs a completely different amount of money than fixing it in test. The bank we worked with was very aware of that. They had worked with the supplier of the new online banking system for a long time, of course, and they had worked with us for some time as well. So we knew the systems and were able to agree to help debug it before it went into production. They wanted to make sure that the launch of the new online banking, into which they were putting a lot of money, marketing, and investment, was as smooth as possible.
Ivana Karhanová: We also talked about the fact that if we deploy it on a specific product, we are able to tell quite easily where the specific fault is. And the actual supplier and the client don’t have to argue about where to look for that bug.
Jiří Kurejko: That’s right. Suppliers don’t always know exactly where the problem is. If they don’t know, and they don’t have people who can look into it because they’re on other projects or busy, then they will often say:
- The problem is not with us.
- It’s in the customer’s infrastructure.
- It’s just somewhere else.
The systems are already so complex, and often there is not just one supplier: there is an internal IT supplier, another external supplier, and a cloud provider. They all have to participate in the operation of the application. It is not easy to agree on the relationships, who is responsible for what, SLAs, and things like that. And then measuring it is another fun thing.
Ivana Karhanová: When you put application monitoring in there, if I understand correctly, application monitoring can show already in the test quite accurately where the whole application is maybe stuck.
Jiří Kurejko: Absolutely. We work with customers who regularly test their applications. It is, of course, practical to test the application before every release, before it goes live. Customers then write automated tests that run regularly, so that they have a comparison with what was happening before. We are there to help them identify very quickly and accurately where the application has weaknesses, compare that to previous releases, and automatically give feedback to the developer: Hey, you’re calling the database twice as much here as before. If there were 20 database calls in the previous release and suddenly there are 50, then when you deploy that to a live system with thousands of those calls and tens of thousands or millions of customers at once, it can bring the database down. We help customers see even the things happening in the background and detect, in regression testing, changes that get into a release unintentionally. Not because somebody wants to break it, but because maybe the business has a requirement that leads to more calls somewhere in the application.
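A release-to-release comparison like the one described can be sketched in a few lines. This is not how Dynatrace implements it; it is a hedged illustration that assumes the database call counts per transaction have already been collected during test runs. The function `find_call_regressions`, the transaction names, and the 1.5x threshold are all hypothetical.

```python
def find_call_regressions(baseline, candidate, max_ratio=1.5):
    """Return transactions whose database call count grew by more than
    max_ratio between the baseline release and the candidate release."""
    regressions = {}
    for transaction, new_calls in candidate.items():
        old_calls = baseline.get(transaction)
        if not old_calls:
            continue  # new transaction (or zero baseline): nothing to compare
        ratio = new_calls / old_calls
        if ratio > max_ratio:
            regressions[transaction] = (old_calls, new_calls, ratio)
    return regressions


# Hypothetical per-transaction database call counts from two test runs.
previous_release = {"login": 20, "list_accounts": 8}
new_release = {"login": 50, "list_accounts": 8, "export_pdf": 3}

report = find_call_regressions(previous_release, new_release)
for name, (old, new, ratio) in report.items():
    print(f"{name}: {old} -> {new} database calls ({ratio:.1f}x)")
```

Run as part of a regression suite, a check of this kind flags the jump from 20 to 50 calls before the release reaches production.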
Ivana Karhanová: On the other hand, if we’re able to pinpoint a piece of code that needs to be fixed or needs to be looked at, then basically at that point – in a good way – there’s no need to argue between the supplier and the client or between the suppliers about who should do it because it’s very clear who owns that piece of code.
Jiří Kurejko: That’s right, that’s a big saving. Customers say it saves them 80 to 90 percent of the time that would normally be spent troubleshooting and finding root causes. So Dynatrace saves them a great deal of time. It’s also important to remember that troubleshooting is not done by junior people in companies but by the most senior people, who are the most expensive and could be writing new code, innovating, and automating instead. They are also the ones most frustrated by troubleshooting, that is, digging into logs and metrics. They just don’t like doing it. Running applications is generally a very stressful area, let’s face it. It means running 24/7, being on call, being woken up at 4 o’clock in the morning, and having to fix it within five minutes, or it’s a mess. And then having to go and dig into the logs tends to be frustrating for a lot of those people.
Ivana Karhanová: Let’s go from banking just a little bit to the insurance industry. You’ve dealt with a lot of cases here as well. Some of them have a common denominator: their internal systems tend to be slow, which of course, blocks the business, and then the business can complain that the applications don’t work as they should. That means that the performance in that business can’t be as high as expected. How is this handled by default?
Jiří Kurejko: The standard way of dealing with this is to hold meetings at higher and higher levels of management of the individual departments, that is, business and IT. They each bring their charts and look at what’s wrong and why we’re not selling. At that point, IT has to defend itself against the accusation that the business is not selling as much as it could because the IT systems are not working as expected.
Ivana Karhanová: But what can IT do about it?
Jiří Kurejko: At that point IT needs data that it typically doesn’t measure: the experience of real users. Everyone understands that we need to measure how internet banking performs; with internal systems, people sometimes say, well, they’re our employees, they’ll wait, or they’ll write it down on a piece of paper somewhere. But of course, the business has changed a lot in the insurance industry. Not everything is handled in the branch anymore. A lot of insurance companies sell through intermediaries and brokers, who have a whole range of offers they can sell. So if one of those systems doesn’t work, they just go to another one and sell the policy through another insurance company. That means the responsiveness of those systems and the customer experience no longer matter just for internal staff but for the whole sales chain.
Ivana Karhanová: On the other hand, IT must get a lot of requests from specific people, like "this is not working for me right now," "I can’t open or close this," "this is not saving." I assume that employees write to support when something doesn’t work.
Jiří Kurejko: Yes, they write to support. They contact the helpdesk when something doesn’t work for them. A lot of it is subjective, of course. If it’s not measured, somebody might think that the system is slow in the morning, so they announce it to all their colleagues in the office, and all of a sudden, it’s slow. So there are a lot of impressions.
Ivana Karhanová: You’re able to read that from the data.
Jiří Kurejko: We can read exactly how fast or slow the system is for each of those users, in which location, from which device, what connectivity they have, and whether they are on a fast or slow connection. We’re also able to tell which users are using, say, an outdated browser, which is why it’s slow for them, and that it would probably be a good idea to recommend that they get a newer one. So we’re able to tell those things from the data, and that also addresses subjectivity. If the system is working just as fast for some people as for others, but subjectively they think it’s slow, we can pick that up and work with it. The customer can then work with their team, educate them, and so on.
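To make the idea of reading this from the data concrete, here is a small sketch that groups real-user measurements by one attribute and compares typical load times. The sample records, field names (`browser`, `location`, `load_ms`), and the helper `median_load_time_by` are invented for illustration and do not reflect any Dynatrace data model.

```python
from collections import defaultdict
from statistics import median


def median_load_time_by(samples, key):
    """Group page-load samples by one attribute (browser, location, ...)
    and return the median load time in milliseconds for each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample["load_ms"])
    return {value: median(times) for value, times in groups.items()}


# Hypothetical real-user measurements.
samples = [
    {"user": "a", "browser": "current", "location": "Prague", "load_ms": 800},
    {"user": "b", "browser": "current", "location": "Brno", "load_ms": 900},
    {"user": "c", "browser": "outdated", "location": "Prague", "load_ms": 4200},
    {"user": "d", "browser": "outdated", "location": "Brno", "load_ms": 3800},
    {"user": "e", "browser": "current", "location": "Prague", "load_ms": 1000},
]

by_browser = median_load_time_by(samples, "browser")
by_location = median_load_time_by(samples, "location")
```

In this made-up data set, the split by browser separates slow and fast users far more clearly than the split by location, which is the kind of evidence that turns "it feels slow" into a specific recommendation.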
Ivana Karhanová: You also said that after you deployed application monitoring in this insurance company, the drop in incidents was a whopping 90 percent. Why is it that incidents have almost disappeared?
Jiří Kurejko: Partly, of course, it was because the application was optimized after we deployed Dynatrace. It got quite a bit faster and more stable. It was also communicated to people that a new system had been deployed that monitors how the application behaves, so if the application is slow for them, they no longer have to complain, because IT sees it and starts addressing it proactively. Before that, a lot of complaints were of the kind: I’d better file it so IT knows about it, even though they won’t solve it anyway because they don’t know where the problem is. So there were a lot of established patterns. People also sometimes complained just because there had been a new release and they assumed it was slowing things down. So there were a lot of those impressions and assumptions.
Ivana Karhanová: They were just burdening the helpdesk without being able to do anything about it effectively.
Jiří Kurejko: Yes. That is why the number of complaints dropped so dramatically, by 80 to 90 percent.
Ivana Karhanová: Let’s take another example, which is also common in the IT world. At a mobile operator, some customers complained that they couldn’t log in to the web self-service, but, according to IT, everything worked perfectly fine. One of the typical phenomena is when IT sees that everything is OK on their dashboards, yet the other party complains. What to do about it?
Jiří Kurejko: Of course, this is determined by what is measured on the dashboards. If only the servers are measured, for example the CPU, and not the real user experience, everything may show green, but IT will not see the problem because the response time the user gets in the application is not measured. All the databases and servers light up green, but if the system is not being monitored at the application level, you may not see the problem.
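The gap between an infrastructure-only dashboard and the user's view can be illustrated with a toy check. Everything here is an assumption made for the example: the function `dashboard_status`, the CPU limit of 80 percent, the 3-second "slow" threshold, the 2 percent tolerance, and the sample response times.

```python
def dashboard_status(cpu_percent, response_times_ms,
                     cpu_limit=80.0, slow_ms=3000, max_slow_share=0.02):
    """Evaluate an infrastructure signal and a user-facing signal separately.
    The user signal is the share of requests slower than slow_ms."""
    infra_ok = cpu_percent < cpu_limit
    if response_times_ms:
        slow = sum(1 for t in response_times_ms if t > slow_ms)
        slow_share = slow / len(response_times_ms)
    else:
        slow_share = 0.0  # no traffic measured, nothing to flag
    user_ok = slow_share <= max_slow_share
    return {
        "infrastructure": "green" if infra_ok else "red",
        "user_experience": "green" if user_ok else "red",
        "slow_share": slow_share,
    }


# Hypothetical data: servers are idle, but 5 of 100 logins take 25 seconds.
status = dashboard_status(cpu_percent=35.0,
                          response_times_ms=[400] * 95 + [25000] * 5)
```

A dashboard that shows only the first signal stays green here, while the second one exposes the minority of users who are effectively locked out.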
Ivana Karhanová: So what was the cause of some customers’ self-service not really working there?
Jiří Kurejko: There were many causes, ranging from suboptimal calls within the code to product or business decisions. For example, some customers had many contracts and many numbers attached to them, and all their contracts, invoices, and numbers were listed immediately upon logging in, which took quite a long time for those customers.
Ivana Karhanová: And maybe it wasn’t even necessary, right?
Jiří Kurejko: Yes, maybe it was enough to move the loading to a separate tab, so that a customer who is interested in the breakdown can click through to it. So some of it was quite elementary from the user’s point of view. But there were also, for example, poorly optimized database queries or poorly optimized application code that had been in place for a long time, and nobody had considered it a possible cause of the problem.
Ivana Karhanová: We should mention that those customers were basically about five percent of the total number, which means relatively few. Still, their user experience was very bad at that point and could have affected others.
Jiří Kurejko: Absolutely. It didn’t affect all customers at once; it wasn’t a huge outage that would bring down the whole system. Our IT colleagues, of course, know about such outages and deal with them. This hit an individual customer from time to time, which is why it was so difficult to work out why it was happening to this customer and not to other, similar customers. Is it just that customer’s connection, because he’s on a mobile EDGE connection somewhere at a cottage and that’s why it loads slowly for him? Is it because he has a lot of contracts, or an old browser? Or is it something we have written in the app for this particular customer, something in his contract? So again, IT had to look into hundreds of thousands of possible causes. They told us that before we started working with them, it took on the order of weeks to figure out these problems. Every help desk ticket from a customer who called in to say that logging in took twenty or thirty seconds was again dealt with for weeks by fairly senior developers and senior people on the vendor side. They needed to sort it out because the business was, understandably, insisting that this was not acceptable. After we installed Dynatrace and got it up and running on that application, the IT colleagues who know the application intimately were immediately able to see where the root cause was and what to tell the vendor to fix. The contractors in particular were very happy that they no longer had to spend long days and weeks looking for needles in a haystack. Instead, they were able to fix the problem within a few hours and get back to working on new features of the application.
Ivana Karhanová: Let’s describe here the solution that we’ve been talking about all along. How does application monitoring work, in this case, specifically Dynatrace?
Jiří Kurejko: Dynatrace is a software platform that collects a lot of data using agents installed in the applications that we want to monitor. There is a lot of talk now about the concept of observability, or visibility or insight into applications. So I collect many metrics from the applications, and then I look at how each application is performing based on those metrics and in context. And because there’s an awful lot of data and metrics flowing through modern applications, there needs to be quite a lot of automation on top, so that I don’t have to reconfigure the monitoring with every change to the application. It adapts itself and learns what’s good and what’s not so good in those applications. On top of that runs a set of algorithms based on experience with running such applications. Dynatrace has been in this field for decades, and it’s Dynatrace’s core business, so it has quite a bit of know-how on how to track these things, evaluate them, and determine what’s right and what’s wrong. So the system works on a combination of know-how and algorithms that help simplify and speed up the work.
Ivana Karhanová: What kind of customers typically find it worthwhile to deploy this software solution?
Jiří Kurejko: Typically, it’s worthwhile for customers with a mission-critical application. Most of them are larger companies. Our customers tend to be larger enterprises, big banks, insurance companies, and telco operators, but not only them, because nowadays many companies from other industries also have business-critical applications. So typically, the application has to be critical to the business of that company. And ideally, the cost of the application not working should be quantified: how much money it costs us, how much it costs us reputationally, which is harder to quantify, and how much money it costs the business if it doesn’t work for an hour or two.
Ivana Karhanová: What kind of investment are we talking about then?
Jiří Kurejko: Dynatrace is licensed through subscription licenses. It is in the order of hundreds of millions of crowns a year, depending on the size of the application and the infrastructure. But we can certainly offer volume discounts and things like that.
Ivana Karhanová: How long does it take you at the customer to get it up and running in a form that the customer can adopt?
Jiří Kurejko: It takes a matter of days. Long gone are the days when we used to spend months or years with these solutions, actually setting up the monitoring of these systems. There is no time for that anymore. And the way Dynatrace is automated, it takes us a couple of hours or a couple of days to do the installation. Then we spend a lot of time creating some dashboards, analyzing metrics, outputs that come from Dynatrace, and a lot of time integrating into the customer’s processes and systems so that it’s completely seamless so that Dynatrace fits into that organization and can start adding value immediately.
Ivana Karhanová: Can customers manage the system themselves, or do they still need your team to do that?
Jiří Kurejko: Customers usually manage it themselves, and we are in that role of a friend on the phone. When they need to consult with someone about something or configure something, they turn to us for advice or help. So we’re more in a consultant role than running the monitoring at the customer’s site. They usually do it themselves.
Ivana Karhanová: I can think of a logical thing: if you deploy another application or platform to monitor those applications, doesn’t it slow them down?
Jiří Kurejko: No. First of all, the technology has been in development for some time. Secondly, it is a major goal of Dynatrace as a company that the software should not affect the application we want to monitor, because that would go directly against the mission Dynatrace has with the software, which is to make those applications run faster and more stably. Of course, it works on the principle of agents, and those agents need some computing power, but on large systems and huge servers, which is where most of those key applications run, it’s within one percent of the performance. So the server virtually doesn’t notice that the agent is running there. Customers have, of course, tested in great detail several times how bulletproof we are on this. The result has always been that they couldn’t tell that the agent was running. On the graphs they then plotted in different systems, they could not determine where the agent was running and where it wasn’t. They found that the slowdown was what we say, or maybe even lower.
Ivana Karhanová: We’ve mentioned user experience, the customer experience of the end users of the applications, several times. But you think corporations should also consider developer experience. Why?
Jiří Kurejko: Nowadays, when there is a significant shortage of capacity and of IT experts on the market, there is a great effort to retain experienced key people. And there is nothing worse than when experienced developers who know the systems, know how to develop them, manage them, and so on, get frustrated with manual tasks that can be automated. That usually frustrates those people a lot. With Dynatrace, you can automate a lot of tasks related to, for example, traffic analysis and troubleshooting, but also some proactive application maintenance. Experienced senior developers then have a lot more time to innovate, improve automation, and so on, which they usually enjoy a lot more than endlessly digging through logs and maybe not even finding the root cause because it simply isn’t logged there.
Ivana Karhanová: That was Jiří Kurejko, an expert in application monitoring at Adastra. Thanks for talking to us today and for coming by the studio. See you again sometime.
Jiří Kurejko: Thanks for the nice chat.