ML and AI at Scale – HOW?
It's time to talk about MLOps again. The second part of the blog series focuses on how you should approach MLOps in practice to do it right.
Written by Joonas Kivinen, ML Engineer
This is a follow-up post to ML AND AI AT SCALE – WHY? and I recommend reading that one first. As a quick recap, the first part was all about why to invest in data science development. Here we try to answer the question of how.
As the previous blog explains, it takes a wide skill set to get a product running end-to-end, and many of those skills are not in the traditional data science skill set. Newish roles like Machine Learning Engineer and Platform Engineer have popped up to bridge the gap. Roles are roles and titles are titles, but the new ones have not just dropped from the sky without meaning; rather, they reflect how the industry has changed and where it is going. Having a backend developer and other traditional software development roles close to data science is also very beneficial. Some successful organizations even have teams whose only responsibility is to improve the developer experience and make other teams more productive.
Many MLOps-related blogs present a hierarchy pyramid. This one is no exception. There is a certain priority of things, and it all starts from the nontechnical side, followed by data and technical fundamentals. All the fancy stuff and everything else comes after that. This doesn't, however, mean that you have to fully complete the previous steps before progressing further. The world is changing all the time, and iterative development and learning along the way usually lead to the best end result.
Even though some concrete things are suggested, this is not a definitive or technical "MLOps: how to?" guide. We want to look at data science as a whole. It can sometimes be hard to point out where data engineering ends and data science starts, or where data science ends and the business side starts; they are all connected. On the other hand, while many good ideas are universal, organizations are different and use different tools, so there is clearly no one-size-fits-all way of doing things. It is important to consider whether the current way of working is sustainable and whether the resources match the ambitions of your organization.
Sidenote: As in the previous post, for lack of a better word, data science is used here to cover everything from ML and AI to data science and so on, in other words, everything that could fall to someone working with the title Data Scientist.
Starting with the basics
Data science needs to have the resources to be successful in order to get anything bigger done. That's just a fact, especially if the ambitions are high. To have these resources, data science needs to be acknowledged inside the organization, and not only in marketing speak. As mentioned above, it takes more than one data scientist to get something working in the end. Having end-to-end responsibility is preferred, but that might be difficult if the required skills are not there. In the long run, it is better to offer guidance and coaching, and to abstract away complexity, so that the data science team can manage things on its own rather than relying on someone's help for the same issues all the time. Ownership and the ability to make decisions are also needed for end-to-end responsibility to work. In any case, there are going to be some responsibility boundaries, and it is important that those are clear.
All work is pretty much done in teams nowadays. Team members come and go, and some might be external. When working as a team, it is important to have a shared understanding of how things are done and how the team works. Creating some development guidelines is highly recommended. To share knowledge and have everyone on board, work should be visible. Code reviews are highly recommended. We want to avoid the situation where one superstar developer does everything and, when that person leaves, the whole team is in trouble. If bigger decisions need to be made, e.g. on architecture, it is important to have a shared vision of the direction. The worst-case scenario is that two or more team members are pushing in different directions. Even if the daily work is done in sprints, it is important to think a little ahead about what might be coming in the future.
Time to talk data
Not much has been said about data so far and it is a big topic of its own. This is not a DataOps blog or intro so we will not go deep into that but some big-picture things are good to keep in mind.
The AI Hierarchy of Needs is still very relevant, and it is impossible to do any successful data science without data being in good shape. Too often, data science also ends up being a second-class data citizen, meaning that the data is there but it isn't easily accessible for data science needs. There might be a DW built strictly for reporting, some sort of data lake containing all sorts of stuff, or sometimes maybe not even that. Developing anything on top of random data dumps or broken data is neither pleasant nor sustainable in the long run, and can, and usually will, lead to some shady things. If things are not smooth, people often find a way around and end up doing their own hacks, and as time goes by, more hacks are done on top of the previous hacks. Once production stuff ends up being built on top of those, it is a real pain to start fixing things. The moral of the story is that data science's data needs have to be addressed, and when changes are made, data science has to be in the loop.
Also, as a part of the data science workflow, there is usually a need for some data prep, storing experiments, intermediate and output results as well as getting feedback data from a model. The list goes on. Then there is a thing called a feature store to store precalculated features that models use for training and inference. We want things to be easy and ideally the data science team should be able to manage that part on their own. That might require some setup and guidance but after that, the team should be able to play around without interrupting someone all the time.
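To make the feature store idea concrete, here is a minimal in-memory sketch in plain Python. Real feature stores (e.g. Feast or the managed cloud offerings) add storage backends, point-in-time correctness, and serving infrastructure; all names here are illustrative.

```python
class FeatureStore:
    """Toy in-memory feature store: the same precalculated features
    are served to both the training and the inference code paths."""

    def __init__(self):
        self._features = {}  # (entity_id, feature_name) -> value

    def write(self, entity_id, features):
        # Store precalculated features, e.g. from a batch data-prep job.
        for name, value in features.items():
            self._features[(entity_id, name)] = value

    def read(self, entity_id, feature_names):
        # Training and inference both read through this one interface,
        # so feature logic is never duplicated between the two.
        return {name: self._features[(entity_id, name)] for name in feature_names}

store = FeatureStore()
store.write("customer_42", {"avg_order_value": 57.0, "orders_30d": 3})
print(store.read("customer_42", ["avg_order_value", "orders_30d"]))
```

The point of the single read interface is consistency: a feature computed one way for training and another way for inference is a classic source of silent model degradation.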
Read more about how to do data engineering right in The Data Engineering Manifesto.
Technical foundations
By technical foundations, we mean the development, test, and runtime (cloud) environments, the codebase, and the CI/CD that ties those together.
A famous saying goes that everyone has a development environment and the lucky ones have a separate production environment. I hope that you belong to the latter category. For the development and test environments to be of any use, they have to mimic the production environment. The way to achieve that is Infrastructure as Code (IaC). If you are not familiar with the term, AWS defines IaC as "the ability to provision and support your computing infrastructure using code instead of manual processes and settings". When doing IaC, there should be one common template that all environments use, with the environment and other possible differences parameterized. If you are using one cloud account or project for all data science work, I recommend separating your common core environment components into a separate stack and keeping that as minimal as possible.
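The one-template, parameterized-environments idea can be sketched in plain Python. A real setup would use a proper IaC tool (CloudFormation, Terraform, CDK, and so on); the template keys and parameter values below are made up for illustration.

```python
# One common template shared by all environments; only the
# environment-specific values are parameterized.
TEMPLATE = {
    "bucket_name": "{env}-ds-artifacts",
    "instance_type": "{instance_type}",
    "alerting_enabled": "{alerting}",
}

ENV_PARAMS = {
    "dev":  {"env": "dev",  "instance_type": "small", "alerting": "false"},
    "prod": {"env": "prod", "instance_type": "large", "alerting": "true"},
}

def render(env):
    """Render the shared template for one environment.

    Because every environment is produced from the same template,
    dev and test genuinely mimic production in structure."""
    params = ENV_PARAMS[env]
    return {key: value.format(**params) for key, value in TEMPLATE.items()}

print(render("dev"))
print(render("prod"))
```

The design point is that differences between environments are data (parameters), never separate hand-maintained copies of the infrastructure definition.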
By codebase in this context, we mean the codebase quality and practices in general. Are there any? There is no definitive answer to what the best way to arrange and organize everything is. Some thought should go into it, though, just to avoid a situation where after some years, everything is just a big mess.
CI/CD (Continuous Integration / Continuous Delivery) ties the codebase and the environments together. If that is not done well and the process is not smooth and easy, we again end up in a situation where people will find a way around and do some shady things. If a deadline is approaching and it is hard to go through the official channels, it is tempting to take a shortcut and click things in the cloud console. If something unexpected then happens, that is not a human error but a bad process.
When the fundamentals are right, it is possible to develop step by step and do small releases. The opposite is the whole thing developed as a works-on-my-machine kind of product and then hacked to work in the production environment. Before putting our code into production, we want to test that it works as intended. Being able to run the code in a production-like test environment is a minimum requirement. Doing this manually is better than nothing; automated tests as a part of a CI/CD pipeline would be even better. That takes a little effort, but if the IaC is done right, it is doable. Unit testing takes even less effort, but sadly it is not a standard in data science. Unit testing also tends to make the code cleaner.
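Unit testing in data science can be very lightweight. Below is a hypothetical feature-engineering function with a plain pytest-style test; the function and its buckets are invented for illustration. Tests like this run in milliseconds in CI and double as documentation of intent.

```python
def bucket_age(age):
    """Map a raw age into a coarse feature bucket used by a model."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

def test_bucket_age():
    # Exercise each bucket and the boundary values.
    assert bucket_age(5) == "minor"
    assert bucket_age(17) == "minor"
    assert bucket_age(30) == "adult"
    assert bucket_age(65) == "senior"

test_bucket_age()
print("tests passed")
```

A test runner like pytest would discover and run `test_bucket_age` automatically; calling it directly here just shows that the assertions hold.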
Scaling data science
Once we have the fundamentals right, we are ready to really start scaling things. In short, that means that we have more models in production and development without compromising the overall reliability and maintainability. To do that we want to automate and standardize everything we can and make it so that starting something new is easy. We also need monitoring and alerts to catch if something is not working. On top of technical monitoring, we want to know how our model is performing and if there are many models, which is performing the best.
For automation and standardization, we need templates. If a new project is something like a Python container running in the cloud, getting started shouldn't be much more than cloning a template, putting your code into the src (or whatever) folder, adding some custom configs, and running git push. Another clear upside of templates is that they speed up development and ease the documentation burden. Not every project has to strictly follow a template, and in those that don't, the deviations are clearly visible and there is most likely a good reason for them.
One thing we often do is write the same code over and over again, so we might want to package that. There is a caveat, though. While modifying templates is possible and usually easy, that is not the case with packaged code, so some careful thinking is required. Things like reading data, outputting data, and logging we might always want to do in the same way, but it can be hard to draw the line. Packaging code is still usually worth the effort, but it has to be done right. For Python projects, for example, a private PyPI or similar is pretty much needed in order to have proper versioning. This prevents us from ending up in dependency hell and breaking things here and there.
To get things to the next level, there is the model registry. It is a central storage system for versioning models and their related information. The registry not only stores the models but also keeps essential metadata about the data, training process, and parameters, establishing a clear lineage for each model. A model registry also makes model performance monitoring possible.
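A minimal in-memory sketch shows what a registry fundamentally stores; production tools such as MLflow's Model Registry add stages, access control, and persistence. The field names below are illustrative.

```python
class ModelRegistry:
    """Toy model registry: versioned models plus the metadata
    needed for lineage (training data and parameters)."""

    def __init__(self):
        self._versions = {}  # model_name -> list of version entries

    def register(self, name, model, training_data, params):
        entries = self._versions.setdefault(name, [])
        version = len(entries) + 1
        entries.append({
            "version": version,
            "model": model,
            "training_data": training_data,  # lineage: what it was trained on
            "params": params,                # lineage: how it was trained
        })
        return version

    def latest(self, name):
        return self._versions[name][-1]

registry = ModelRegistry()
v = registry.register("churn", model="<model object>",
                      training_data="customers_train_2024_01",
                      params={"max_depth": 6})
print(v, registry.latest("churn")["params"])
```

With every version's training data and parameters recorded, performance monitoring can attribute a metric change to a specific model version rather than to a vague "the model".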
Some extra points
These points are debatable and depend on the context and business case, but they are also things that I have found to be useful. The first is not to tie a model too deeply into the business end product. The logic is that if we do so, changing or updating the model might become difficult. It gets even trickier if another business wants to use the same model but there is already some business logic built in. The alternative is to let the model be a model and add or enrich business logic later on.
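The separation can be sketched like this: the model returns a raw score, and business rules are layered on afterwards, so another business unit can reuse the bare model with its own rules. The scoring logic and rule names are invented placeholders.

```python
def model_predict(features):
    """The model stays just a model: it returns a raw score and
    knows nothing about any single business's rules.
    (Hypothetical scoring logic standing in for a real model.)"""
    return 0.8 if features.get("orders_30d", 0) > 2 else 0.2

def apply_business_rules(score, customer):
    """Business logic lives in its own layer. Swapping the model,
    or reusing it elsewhere, does not touch these rules."""
    if customer.get("opted_out"):
        return "no_offer"
    return "send_offer" if score > 0.5 else "no_offer"

customer = {"orders_30d": 4, "opted_out": False}
score = model_predict(customer)
print(apply_business_rules(score, customer))
```

If the opt-out check lived inside the model, retraining it or lending it to another team would drag that rule along; keeping the layers apart avoids exactly that coupling.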
The second thing is about organizing the codebase and is also a bit connected to the above. It is good to have all code related to a model, and only to that model, in the same place: one repository or one folder, depending on how the codebase is organized. The opposite would be, for example, a technical split with SQL in one place and Python code in another. If the model becomes irrelevant, it is easy to delete everything at once. Workflow and IaC also become smoother since we don't have to hop around.
The third thing is aiming to have everything that is connected in one pipeline. The opposite would be a job that, after finishing, silently triggers another job. Tracking down an error in such a setup is really painful. The whole pipeline can be event-driven, but then the monitoring has to be top-notch.
All this sounds like a lot. However, the idea is not to get everything done at once but to proceed in small incremental steps. The main takeaway is to think a little ahead: think about the roles and responsibilities and about ways to make things easier in the long run. Once the fundamentals are set, everything else becomes so much easier.
If you now want to read about something more concrete, How To Get Started with Vertex AI & MLOps would be a good option. If you want to tackle the topic head-on with us, hit the button below.