ML and AI at Scale – WHY?
It's time to talk about MLOps. Why should you invest in data science development even without instant gratification? Well, maybe precisely for that reason.
Written by — Joonas Kivinen, ML ENGINEER
At the time of writing, GenAI has been buzzing for a while, increasing interest and opening new business opportunities in the data science domain. Now is a good time to step off the hype train for a minute and think about the practical challenges that many organizations still face.
Starting with the basics
Large-scale data science has become a lot like software development, so it feels natural for many to apply the same well-established techniques and principles to data work. Understandably so. However, not all organizations, and especially not all data departments, have a history in that domain, and DevOps culture is not widely adopted in the industry. Many data science practitioners do not have an extensive software development background either, the author of this text included. So let’s start with the basics real quick.
AWS defines DevOps as the combination of cultural philosophies, practices, and tools that increase an organization’s ability to deliver applications and services at high velocity. Sounds pretty easy to just implement in big-scale data work. Then again, though we are writing code and usually running the stuff in the same environments, data science has some unique aspects that make it a little different.
That's why there is MLOps, Machine Learning Operations.
The yet-another-Ops fatigue is real, but the existence of the word MLOps is justified. A narrow definition of MLOps could be rather technical and include the things that MLOps tools and services are designed to tackle: developing, deploying, and monitoring ML models. A broader definition brings in DevOps, the developer’s experience, and all the challenges that are part of data science work on a daily basis. Personally, I prefer the latter definition: the path from an idea to production includes many steps and can require skills in, for example, data science, data engineering, cloud, data, DW, CI/CD, API, and backend development, along with good team conventions and business knowledge. Focusing solely on MLOps tools can already ease the process a lot, but things can go wrong if they are not built on solid foundations.
To put it bluntly, this blog aims to justify investments in data science development even when there is no instant gratification. Instead, there is long-term business value. Data science is often a domain where projects come and go, and if there is no big-picture planning, maintenance becomes a burden, let alone scaling things. The larger the data science team, the number of projects, or the scope of a typical project, the larger the returns. This blog will concentrate on the why side of things. The how will be a separate post in the near future, so remember to follow our socials so you don't miss it!
Sidenote: For lack of a better word, data science is used here to cover ML, AI, data science, and so on: everything that could fall to someone working with the title Data Scientist.
How is data science different?
Some of the distinctive aspects that characterize machine learning are:
- Uncertainty and working with best guesses: Is there a business case, and if there is, what is the best model to tackle it? Some say that we are always working with the next best solution since the best one is not implemented yet. Sometimes the questions are tricky to define, let alone answer. Answering "which products should we recommend to this customer?" is quite different from answering "which products has this customer bought most often?"
- Models tend to decay over time: Models are trained at a certain point in time and tend to lose accuracy as time goes by, so we want monitoring and metrics for that. Sometimes an external shock can make a model irrelevant at once. Just think how many forecasting models went in the trash when COVID-19 hit. On the plus side, easy improvements are also available if, for example, new data suddenly becomes available.
- Experimental nature of the workflow: Because of the things mentioned above, the whole workflow is quite experimental. From the outside, it can seem that nothing is happening, but there is a constant improvement process going on even when we don't achieve any real results right away.
- Sometimes extensive computational needs: Training and scoring might need extensive computational resources and time, which makes the feedback loop quite long.
- Stuff tends to pile up: Partly because of the things above and all the hype in recent years, my gut feeling is that, compared to other types of projects, quite a few POCs in the data science world end up in a semi-production kind of state. Quite often a model and the things around it can be thought of as an independent microservice, and if there are many of those, it gets tricky to keep everything under control.
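To make the point about model decay a bit more concrete, here is a minimal sketch of the kind of check a monitoring job could run. It assumes we can label predictions as correct or wrong after the fact; the names check_decay, DecayReport, and the max_drop threshold are made up for this illustration and do not come from any particular MLOps tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DecayReport:
    baseline_accuracy: float
    recent_accuracy: float
    decayed: bool


def check_decay(baseline_hits, recent_hits, max_drop=0.05):
    """Compare recent accuracy with the accuracy measured at training time.

    baseline_hits and recent_hits are sequences of 1 (correct prediction)
    and 0 (wrong prediction). max_drop is a hypothetical tolerance that
    would have to be chosen per use case.
    """
    if not baseline_hits or not recent_hits:
        raise ValueError("both windows need at least one observation")
    baseline = mean(baseline_hits)
    recent = mean(recent_hits)
    # Flag the model when accuracy has dropped by more than the tolerance.
    return DecayReport(baseline, recent, (baseline - recent) > max_drop)


# Example: accuracy at training time vs. accuracy over the latest predictions.
report = check_decay([1, 1, 1, 1, 0], [1, 0, 1, 0, 0])
if report.decayed:
    print("Model accuracy has dropped, consider retraining")
```

In practice the right metric and threshold depend on the problem, and an external shock would more likely show up as a sudden drop than as slow drift.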
In a typical data science workflow, we first try to validate and define a business case. If there is one, we probably start with something simple, explore, and iterate until someone decides the solution is a good enough MVP. Then someone probably (and hopefully) wants to use the end product, so we put the code somewhere to run. After that, depending on the use case, we start working on a new project. Alternatively, if the use case at hand is a serious one, we keep iterating, exploring, and trying to improve the model.
In the case of a customer-facing solution, we want real feedback to see whether our model is working. In other words, we need a feedback loop. If we have multiple models, we need the ability to do A/B testing to see which one works best. At some point, we might want to test something completely new, like replacing our simple model with a fancy neural network. At all times, we still want to monitor not only the technical side but also the performance of our model. Depending on the problem, all this can take anything from weeks to months. It could also be that we leave our model running for a while and come back to it later when new data or a new algorithm is available, or when something else needs to be changed.
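As a rough illustration of the A/B testing mentioned above, the sketch below splits users between two models deterministically and compares their conversion rates with a standard two-proportion z-test. The function names, the variant labels, and the idea of measuring success as a conversion are all assumptions made for the example, not a description of any specific setup.

```python
import hashlib
import math


def assign_variant(user_id: str, variants=("model_a", "model_b")) -> str:
    """Map a user to a variant so the same user always gets the same model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def two_proportion_z(success_a, total_a, success_b, total_b):
    """z-score for the difference in conversion rate between B and A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0  # no variation at all, nothing to compare
    return (p_b - p_a) / se


# Example: 100/1000 conversions for model A vs. 150/1000 for model B.
z = two_proportion_z(100, 1000, 150, 1000)
print(assign_variant("user-123"), round(z, 2))
```

An absolute z-score above roughly 1.96 corresponds to the common 5% significance level, but how long to run the test and what counts as a success are decisions for the business case at hand.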
Why invest in data science development?
Do any of these sound appealing?
- Easier experimentation, research, and evaluation of business cases
- Faster time to market and increased development velocity
- More reliable and predictable products in production
- Happier data science team that can concentrate on what they do best
Great, because that’s what we want to achieve with MLOps. It all pretty much comes down to the last bullet, though: If the data science team is happy and the developer experience is good, many of the other things are probably at least on a decent level.
Now let's think about what might happen if we do everything in the wild without any forward thinking. Let's assume that we start from scratch with a fixed-size data science team, an empty cloud account, an empty code base, and so on. As seen from the workflow above, there are quite a few steps from an initial business idea to production. Those steps also tend to repeat heavily from model to model. The actual model code, the data science part, is usually just a small part of the end-to-end pipeline. Now let's say that some years have passed and we have done some projects, each of which we started separately. Things have begun to accumulate and there’s a slightly different pipeline for every project. The codebase and the (cloud) environments are a mess and there are ten different deployment pipelines for ten models.
The harsh reality is that projects need to be maintained and that things do and will break every now and then. Do we want to end up in a situation where most of our time is spent just keeping our current projects running? The tricky part is that things accumulate rather slowly, and it's hard to pinpoint the moment when we went past the critical point. Even if our workflow and processes were perfect, some accumulation is always going to happen as time goes by. In a perfect world, the overall complexity, meaning everything that a data science team needs to maintain, would grow only by our model-specific code and other model-specific things with each new project. The opposite is the scenario described above, where the whole process is also reinvented for every project. Luckily, many of the steps in the development workflow can be abstracted and automated. Even if we don't have anything accumulated yet, we want the process to be smooth so that if there is a potential business case, we don't have to spend weeks just to get started.
As hinted above, a lot comes down to the developer experience. If the workflow is smooth and the annoying things are easy, the overall end product is usually going to be in good shape. Data scientists are also a scarce resource, and many can be very picky about their workplaces, so why would they work for an organization where their work is pretty much just pain? Another good proxy is outsourcing: if a project is outsourced, do we have any development guidelines to follow, or is the job just to get something done and then get out? It would be nice if our own data science team could easily take over and make any updates needed in the future.
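One way to picture the "only model-specific code grows" idea is a shared pipeline runner that every project plugs into. The sketch below is purely illustrative: ModelProject, run_pipeline, and the min_score gate are hypothetical names, and a real setup would add things like data validation, a proper holdout set, and deployment steps.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ModelProject:
    """The model-specific parts that each new project has to provide."""
    name: str
    load_data: Callable[[], Any]
    train: Callable[[Any], Any]
    evaluate: Callable[[Any, Any], float]


def run_pipeline(project: ModelProject, min_score: float = 0.0) -> dict:
    """Shared steps that are written once and reused by every project."""
    data = project.load_data()
    model = project.train(data)
    # For brevity the model is scored on the same data it was trained on;
    # a real evaluate step should use held-out data.
    score = project.evaluate(model, data)
    return {"project": project.name, "score": score, "deploy": score >= min_score}


# Example project: "predict the mean", scored by how close the mean is to 2.0.
demo = ModelProject(
    name="demo",
    load_data=lambda: [1.0, 2.0, 3.0],
    train=lambda data: sum(data) / len(data),
    evaluate=lambda model, data: 1.0 - abs(model - 2.0),
)
print(run_pipeline(demo, min_score=0.5))
```

With a structure like this, an eleventh model means one more ModelProject definition rather than an eleventh deployment pipeline.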
This blog gives some ideas on why it is important to think ahead when doing data science. The good thing is that in general, things tend to get easier as more and more aspects get abstracted and industry standards develop. All of the major cloud providers have tools and services for MLOps on top of other commercial and open-source solutions.
As we learned here, though, it is more than just installing some tools; it all starts with the fundamentals. This does not mean that you should stop everything and start building a platform to tackle all of the possible use cases. It is, however, good to have some long-term goals in mind and then build towards those step by step. At some point, you might even realize you have a platform in the making, or at least that you’re spending less time on the annoying stuff and more on the things that matter. If there is an opportunity, moving fast is great, but in the long run, sustainability should be the priority.
Now we know the why but the big question of how still remains. More about that in part two so let us know if you wanna get a heads-up when that comes.