Data science is one of the most interesting fields in IT today. It attracts students of various study programs focused on data, statistics, or Machine Learning (ML). However, many aspiring data scientists have a somewhat distorted view of data science. They imagine that, in practice, they will simply be assigned a task and all they need to do is complete it. In other words, they think they’re getting back into the same groove as when they were tackling student projects – they’ll take the data they need, apply a set of advanced algorithms, pick the best one, and the task will be done. But this isn’t the case. Our colleague from the Big Data team, data scientist Oleg Masajlo, shares the most common mistakes made by junior data scientists in particular. In addition, he offers practical recommendations on how to avoid them.
As data science is an interdisciplinary field, let’s look at what areas it covers and what skills a good data scientist should have.
Each of these areas contains hidden pitfalls. Practice helps when it comes to dealing with them, but sometimes good advice can be useful to correct misconceptions and false expectations early on.
Business knowledge here means a basic understanding of the area in which the customer operates. Of course, this doesn’t mean you have to understand the issues and processes in detail. Rather, it’s about the data scientist’s competence in clarifying the project brief, understanding which problem the customer needs to solve, and explaining how data science can help them do that. This is closely related to the art of communication. The latter is, above all, the ability to find common ground and mutual understanding, whether with the project sponsor or with your colleagues on the customer’s side. A data scientist doesn’t need to be capable of delivering brilliant presentations to hundreds of people. Rather, it’s important that they’re unafraid to ask questions and able to listen carefully to the answers. If you can find that mutual understanding, you’ll have no difficulty defining the brief, specifying the objectives, assigning roles in the newly formed team, or using appropriate content when presenting the outcomes of your work.
As much as the importance of this soft skill is emphasized in the literature and in project methodologies, it’s often neglected in practice. Less-experienced data scientists tend to focus on the problem itself rather than on communication and cooperation. Yet poor communication can hurt a project at many points. For example, it can easily happen that you leave a meeting unsure of what exactly you’re supposed to do. You think to yourself, “I’m not going to be the idiot who asks the same question twice,” and you get straight to work based on your own assumptions. At the next status meeting, after a week of work, you find out that you didn’t do the right thing. Always ask about any specifics you’re unsure of. There’s no such thing as a silly question.
Problems can arise not only in outward communication but also inward, within the team. For example, roles and responsibilities may not be clearly established at the beginning of a project. This lack of communication can then manifest itself in something as trivial as replying to a customer’s email. The situation can be likened to a game of volleyball in which everyone watches the ball fly, not playing but waiting, thinking the next person will step up. It’s the same with email: when no one responds, it can cause unnecessary complications, from upsetting the client to delaying the project.
Communication pitfalls also await at the end of a project, when you need to show the results of your work. If you don’t present them in a way that makes sense to the client, the client may not accept them at all. The form and the vocabulary you use play a big part. A manager isn’t a data scientist, so you need to convey the outputs in their language, often in the form of pictures, specific conclusions, and practical recommendations. Otherwise, it can easily happen that they simply do not understand your work or its results.
COMMUNICATION PITFALL: Is this really what the client wants?
On one project, we had to develop an application that let a client track whether orders actually arrived on time. Among other requirements, there was one that sounded exactly like a data science assignment: the application had to be able to predict possible delivery delays. That’s a fairly complicated prediction task involving a whole range of variables, including traffic along the route and the weather. We spent a whole week searching for data sources that could help deliver this requirement. Unfortunately, we then discovered that we had underestimated the importance of initial clarification: formally confirming the brief with the client hadn’t been enough. By the time the project finally began, the original requirements were out of date, and the client was satisfied with reporting output that didn’t include any predictions. The time spent on data exploration could have been used for more productive work.
LESSONS LEARNED: If I don’t know, don’t understand, or am not sure, I ask, and I communicate with anyone and everyone. That’s why it’s great that working on projects is a team effort: you’re never on your own!
Many juniors imagine that the focal point of analytics projects is to create a mathematical model that processes data and returns the desired result. The model itself should be as advanced as possible, perhaps a neural network.
This is partly true because you can’t get to the result without an algorithm. In practice, however, preparing the algorithm itself is often a matter of a few lines of code, preceded by a lot of work, especially with the data.
In the beginning, you spend hours trying to get into the problem you’re going to solve. Then you look for data, because you rarely get them nicely grouped together in one place. Once you’ve gathered them all, the next stage is understanding the data and then preparing them for modeling. This phase actually requires the most time, as client data are often incomplete or arrive in different formats. You also need to reshape the data into a different form than they’re currently in, either to reflect the business side of things or to make them suitable for the algorithm to process. Then comes the modeling phase, which is often the icing on the cake. Finally, there’s space to try out those cool algorithms you’ve learned about at school, at work, or just through the grapevine. The truth is, however, that on many projects the “ordinary”, decades-tested statistics and machine learning algorithms taught in school are enough. The last stage is evaluation, where you put together the results of your work and where it becomes clear whether the project has met the desired goal.
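The imbalance between preparation and modeling can be seen even in a toy example. The following is a minimal sketch, not taken from any of the projects described here: the records, the column names, and the helpers `parse_date`, `parse_number`, `prepare`, and `fit_line` are all invented for illustration. Most of the code unifies formats and drops incomplete rows, while the “model” is an ordinary least-squares line in three lines.

```python
from datetime import datetime
from statistics import mean

# Hypothetical raw records from two source systems (invented data):
# mixed date formats, decimal commas, and one missing value.
raw_rows = [
    {"order_date": "2023-01-05", "distance_km": "12.5", "delivery_hours": "3.0"},
    {"order_date": "06.01.2023", "distance_km": "40,0", "delivery_hours": "8,5"},
    {"order_date": "2023-01-07", "distance_km": "", "delivery_hours": "5.0"},
    {"order_date": "08.01.2023", "distance_km": "25,0", "delivery_hours": "5,5"},
]


def parse_date(text):
    """Try each known date format in turn; return None if none matches."""
    for fmt in ("%Y-%m-%d", "%d.%m.%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def parse_number(text):
    """Accept decimal points and decimal commas; return None for blanks."""
    text = text.strip().replace(",", ".")
    return float(text) if text else None


def prepare(rows):
    """Data preparation: unify formats and drop incomplete records."""
    clean = []
    for row in rows:
        record = {
            "order_date": parse_date(row["order_date"]),
            "distance_km": parse_number(row["distance_km"]),
            "delivery_hours": parse_number(row["delivery_hours"]),
        }
        if all(value is not None for value in record.values()):
            clean.append(record)
    return clean


def fit_line(xs, ys):
    """Modeling: ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


clean_rows = prepare(raw_rows)
slope, intercept = fit_line(
    [r["distance_km"] for r in clean_rows],
    [r["delivery_hours"] for r in clean_rows],
)
print(f"{len(clean_rows)} usable rows, slope={slope:.2f}, intercept={intercept:.2f}")
```

In a real project the preparation part would be far longer still, since it also covers joining sources and agreeing with the business on what each field means.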
In addition to all of the above, there’s another project area that’s important but often neglected, namely administration. Throughout the project, a bunch of documents are created that should not be forgotten, from meeting minutes, statements of work, budget tracking, and regular progress reports to the actual documentation for the data sources, the model, etc. The project manager should take care of a lot of these deliverables, but unfortunately not every project has a project manager. Then this administrative burden falls on you.
ADMINISTRATIVE PITFALL: Do you have the documentation? And can I see it?
It was precisely this point, administration, that caused us great complications during the delivery of one of our projects. For the claims process, we implemented a model to evaluate the validity of a customer’s complaint. The solution delivered the expected predictions, was technically and mathematically sound, and the business customer was satisfied. But we underestimated the administrative complexity involved in deploying it to production. The IT department requested 15 different documents from us, ranging from a project plan to a risk assessment to a verification of the licensing requirements for the software being used. Not only did the documents need to be prepared, but they also had to be approved by the relevant departments. As a result, our initially seamless project dragged on for weeks, and it was finally accepted almost a month late.
LESSONS LEARNED: Never underestimate proper preparation. Check your assumptions. It will make your job much easier, and help you deliver the project on time and, most importantly, to the customer’s satisfaction.
The most difficult part of a data scientist's job is often deployment, the process of making your program, application or model a functional part of the customer’s infrastructure. The word “functional” is important here.
It often happens that a passionate data scientist does a perfect job of creating a model. But the problem is that they did that work in a closed bubble that wasn't embedded in the real environment in which the client operates.
This is a particularly tough nut to crack for new data scientists, as there may be obstacles that they don’t usually encounter in their school projects. Data formats change suddenly, the security department rejects the software you’re using, a change in priorities indefinitely postpones the use of your work, and much more. Without practice, it’s hard to imagine all the possible errors that can occur during the transition to real-world use. Again, early communication with other team members and the client helps here.
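One of these obstacles, a data format that changes without warning, can at least be made visible early. The sketch below is only an illustration under assumed names: the schema in `EXPECTED_COLUMNS` and the function `validate_batch` are hypothetical and not part of any project described here. It checks each incoming record against the columns and types the model was built on and reports every mismatch instead of failing silently.

```python
# Hypothetical schema the model was trained on (names and types are assumptions).
EXPECTED_COLUMNS = {"customer_id": int, "amount": float, "country": str}


def validate_batch(rows, expected=EXPECTED_COLUMNS):
    """Return a list of readable problems; an empty list means the batch looks as expected."""
    problems = []
    for index, row in enumerate(rows):
        missing = expected.keys() - row.keys()
        extra = row.keys() - expected.keys()
        if missing:
            problems.append(f"row {index}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"row {index}: unexpected columns {sorted(extra)}")
        for name, expected_type in expected.items():
            if name in row and not isinstance(row[name], expected_type):
                problems.append(
                    f"row {index}: {name} is {type(row[name]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


# A batch in the agreed format, and one after an unannounced change upstream.
good_batch = [{"customer_id": 1, "amount": 10.0, "country": "CZ"}]
changed_batch = [{"customer_id": "1", "amount": 10.0, "ctry": "CZ"}]

for problem in validate_batch(changed_batch):
    print(problem)
```

A check like this does not replace talking to the people who own the source system, but it turns a silent change into an explicit message that can be discussed with them.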
REALITY PITFALL: Neglected inputs can trip you up
Take one of our marketing projects as an example. We had to develop a so-called propensity-to-buy model, a machine learning model that estimates how suitable a customer is for an offer of a specific type of product. Selecting customers for a campaign is influenced not only by the calculated probability, but also by several other parameters (consent to be approached, contact policy, time since last approach, etc.) that we did not take into account when evaluating the success of the model. As a result, the model initially appeared good, but the implemented campaign did not achieve the expected success.
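The gap between the evaluated model and the real campaign can be illustrated with a small sketch. Everything in it is invented for illustration: the `Customer` fields, the 30-day rule in `MIN_DAYS_BETWEEN_CONTACTS`, the metric `top_k_hit_rate`, and the six toy customers. It compares the hit rate among the top-scored customers when the evaluation ignores the selection rules with the hit rate once only customers who may actually be approached are considered.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    score: float             # model's predicted probability of buying
    has_consent: bool        # consent to be approached
    days_since_contact: int  # time since last approach
    bought: bool             # observed outcome (toy data)


MIN_DAYS_BETWEEN_CONTACTS = 30  # hypothetical contact-policy rule


def is_eligible(customer):
    """Apply the campaign rules that sit outside the model."""
    return customer.has_consent and customer.days_since_contact >= MIN_DAYS_BETWEEN_CONTACTS


def top_k_hit_rate(customers, k):
    """Share of buyers among the k highest-scored customers."""
    selected = sorted(customers, key=lambda c: c.score, reverse=True)[:k]
    return sum(c.bought for c in selected) / len(selected)


customers = [
    Customer(0.95, False, 90, True),   # no consent
    Customer(0.90, True, 5, True),     # contacted too recently
    Customer(0.85, True, 60, True),
    Customer(0.60, True, 45, False),
    Customer(0.40, True, 120, False),
    Customer(0.20, True, 200, False),
]

naive = top_k_hit_rate(customers, k=3)
realistic = top_k_hit_rate([c for c in customers if is_eligible(c)], k=3)
print(f"ignoring campaign rules: {naive:.2f}, eligible customers only: {realistic:.2f}")
```

In this toy data the best-scored customers are exactly the ones the campaign cannot reach, so the evaluation that ignores the rules looks much better than what the campaign could deliver.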
Are you intrigued or even surprised by all the things we address, consider, and process on our projects? All the examples are based on our real practice and demonstrate that data science is not just about data, models, and programming. We take them as valuable lessons that we draw on in other projects and that shape our best practices and our know-how.
And what does all this imply for those of you who are considering a career in data science or are already setting out on one?
Even if you can code perfectly and have a PhD in statistics, this alone does not guarantee that you can successfully complete a data science project. Always keep in mind that data science is a broad and complex discipline, and it’s up to you how you tackle the task at hand.
Would you like to help our clients turn their data into gold by implementing solutions that will improve their business? Take a look at our open positions and send us your CV.