Data science projects are like any other software project: they need regular maintenance, enhancements, and improvements after the first production deployment. But when it comes to putting ML models into production, companies are already struggling, let alone keeping up with regular maintenance. As per one report, only 22% of the companies that ran ML projects were able to deliver them to production successfully. That number is poor, and to make it worse, 43% of data scientists find it challenging to scale their data science projects to their company’s needs. This means that data scientists are unable to deliver new improvements consistently and efficiently enough to meet the growing requirements of their projects.
What Does Scalability Mean in a Data Science Project?
Scalability in a data science project can mean different things in different contexts:
- How quickly and efficiently the data scientists on your team can train their ML models on the server to deliver regular changes to production. Training huge ML models (such as deep neural networks) may take days, so the team needs to devise an effective training strategy and use VMs with the right CPU/GPU and memory for the model size and training data volume. Otherwise, training itself becomes a bottleneck and the time to production grows, making it challenging to scale.
- How efficiently your ML model can serve an increasing number of requests in production. The deployment architecture should be robust enough to scale easily from a few thousand serving requests to millions if needed. If little thought goes into the deployment architecture in the early days, it can become a nightmare later when the inference load increases.
- How efficiently you can collect data from various sources into your data lake for training your model. This process is known as ETL (Extract, Transform, Load) and is the prerequisite for bringing data to the ML model. It is the entry point of any data science project, so you should have the right infrastructure to create a data lake and an ETL process that can scale with the growing demand and complexity of the project. (Scaling the ETL process is a data engineering topic, so the remainder of this article is limited to the first two points, which are specific to data science.)
Challenges of Scaling Data Science Projects
Generally, data scientists are not well versed in the IT and infrastructure aspects of a project. This can lead them to make poor choices that create bottlenecks. For example, they may train their ML models on a VM with a low configuration, which needlessly prolongs the training process. Similarly, they may inadvertently train their models on a VM with more resources than were really required, blocking it from other data scientists who actually need a high-resource VM for training. Sometimes the data and the ML model parameters are so large that the traditional way of training on a single VM becomes quite cumbersome.
A survey revealed that 38% of data scientists admitted that they lack the skills to deploy ML models. Given this, it is too much to ask them to design a deployment architecture that can scale with load. But is the data scientist the only poor soul responsible for everything?
The term "data science project" can mislead you into thinking that data scientists are the only people responsible for running the show. In reality, delivering an end-to-end data science project is supposed to be a collaborative effort between different teams: Data Engineers (for data collection), Data Scientists (for ML model creation and training), and IT and Operations (for deployment). Unfortunately, these teams often work in silos, and the resulting lack of collaboration makes it difficult to deliver regular changes to production efficiently enough to meet project demands.
How to Scale Data Science Projects?
There are many technical, architectural, and process improvements that can be incorporated into a data science project at the very beginning to ensure it can be scaled easily as the project matures. We have listed three strategies that can help you create a scalable data science project:
1. Automatic Deployment and Resource Management
We need a separation of concerns: data scientists should do what they do best, which is creating and training models, while the complexity of resource management is abstracted away from them yet remains readily available. This can be achieved through automatic deployment, automatic allocation of resources to the ML models, and auto-scaling as needed.
The first step is to containerize the ML models. Containerization packages the ML model together with all its environmental dependencies in a container, which brings consistency in deployment across training, testing, and production environments. Docker is the most popular containerization technology. When a team of multiple data scientists deploys containerized models for training, testing, or production serving on a cluster, Kubernetes can be used to allocate resources to these requests automatically, auto-scale the resources, and orchestrate multiple containers if needed. So, if the load of model serving requests increases, Kubernetes can manage it for you.
In fact, Kubernetes can be scaled easily to run hundreds or thousands of containers, and such is its market demand that popular cloud providers like AWS, GCP, and Azure all offer Kubernetes as a service on their platforms. In spite of this flexibility, Kubernetes can still be difficult for data scientists to work with, which is why Google engineers created Kubeflow to provide a layer of abstraction on top of Kubernetes for deploying ML workflows.
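To make the auto-scaling idea concrete, here is a minimal sketch of the proportional rule that the Kubernetes Horizontal Pod Autoscaler documentation describes: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The function name, the CPU figures, and the replica bounds are assumptions made for illustration. The real autoscaler also applies a tolerance band and stabilization windows, which this sketch leaves out.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Scale the replica count in proportion to observed vs. target load.

    Illustrative only: min_replicas and max_replicas are hypothetical bounds.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Proportional rule: more load per replica than the target means more replicas.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# Four model-serving pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90.0, 60.0))
```

The same rule scales in when the load drops, which is how resources get released for other workloads on the cluster.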
2. Distributed Machine Learning
When working with huge amounts of data or with ML models that have hundreds of thousands of parameters, training the model on a single machine can become a computational challenge. One option is to scale vertically by adding more RAM, but that is not a sustainable solution. This is where distributed machine learning is useful: it trains models with a degree of parallelism and scales up the training process.
There can be two strategies for distributed machine learning:
- Model Parallel: This approach is ideal when the model is very large, so its weights are distributed across multiple machines for computation.
- Data Parallel: In this approach, each distributed node holds its own copy of the ML model and its weights. It is useful for processing huge amounts of data in parallel across these nodes.
For distributed computation of this nature, Hadoop and Spark have traditionally been the go-to choices, and they even support machine learning libraries. But with its growing popularity, Kubernetes can now also be used to orchestrate a distributed machine learning system to achieve scalability.
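The data-parallel strategy can be sketched in a few lines of plain Python. This is a single-process simulation, not a real distributed setup: each "worker" receives an equal shard of the data, computes a gradient using its own copy of the weight, and the gradients are averaged before the shared weight is updated. The one-parameter model y = w * x, the toy data, the learning rate, and all function names are assumptions chosen for illustration.

```python
def gradient(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def split(items, num_workers):
    """Split a list into num_workers equal contiguous shards."""
    if len(items) % num_workers != 0:
        raise ValueError("this sketch assumes the data divides evenly")
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_step(w, xs, ys, num_workers, lr):
    """One training step: per-shard gradients, averaged, then a shared update."""
    shards = zip(split(xs, num_workers), split(ys, num_workers))
    # Each worker would run this on its own node; here they run in sequence.
    grads = [gradient(w, shard_x, shard_y) for shard_x, shard_y in shards]
    averaged = sum(grads) / num_workers
    return w - lr * averaged


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2 * x
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, xs, ys, num_workers=2, lr=0.01)
print(round(w, 3))
```

With equal-sized shards, the averaged gradient is the same as the gradient computed over the full dataset, so the parallel version follows the same path as single-machine training while splitting the work across nodes.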
3. MLOps Pipeline
There are many steps involved in an end-to-end data science project: data collection, ETL, model creation, model training, testing, and production deployment. Since multiple teams are responsible for different steps and most of the process is manual, it makes sense to automate everything end to end in a pipeline. MLOps is a methodology, borrowed from the DevOps practices of traditional software projects, for creating CI/CD pipelines for data science projects. You can develop your own custom pipeline using Kubeflow, or you can use the pipeline-as-a-service offerings of providers such as AWS, GCP, and Azure.
Automating everything end to end not only eliminates manual touchpoints but also gives the data science project room to scale. Data scientists no longer need to worry about who on the data engineering team will help them procure data or who on the IT team will assist with deployment. The pipeline takes care of everything, so they can focus on building models and pushing them through the CI/CD pipeline. This helps reduce the time to production and lets the project scale with requirements.
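As a rough illustration of what such a pipeline automates, the sketch below chains hypothetical stages (extract, train, evaluate, deploy) so that each one hands its output to the next with no manual step in between. It does not use Kubeflow or any cloud service API. The stage names, the toy data, the `max_mse` quality gate, and the `run_pipeline` helper are all assumptions made for this example.

```python
def extract(ctx):
    """Stand-in for data collection and ETL: load (x, y) training pairs."""
    ctx["raw"] = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
    return ctx


def train(ctx):
    """Fit y = w * x by least squares through the origin."""
    data = ctx["raw"]
    ctx["w"] = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
    return ctx


def evaluate(ctx):
    """Record the mean squared error of the fitted model."""
    data = ctx["raw"]
    ctx["mse"] = sum((ctx["w"] * x - y) ** 2 for x, y in data) / len(data)
    return ctx


def deploy(ctx):
    """Quality gate: only mark the model deployed if the error is acceptable."""
    if ctx["mse"] > ctx["max_mse"]:
        raise RuntimeError("model rejected: error above threshold")
    ctx["deployed"] = True
    return ctx


STAGES = [("extract", extract), ("train", train),
          ("evaluate", evaluate), ("deploy", deploy)]


def run_pipeline(stages, ctx):
    """Run every stage in order, logging each completed stage name."""
    for name, stage in stages:
        ctx = stage(ctx)
        ctx.setdefault("log", []).append(name)
    return ctx


result = run_pipeline(STAGES, {"max_mse": 0.1})
print(result["log"], result["deployed"])
```

In a real MLOps setup, each stage would typically run as its own containerized job triggered by the CI/CD system, but the control flow is the same: a failed gate stops the run before anything reaches production.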
It is quite evident from the current state of data science that scalability is one of the main challenges companies are facing. The real underlying issue is not a lack of technologies to help you scale a project, but a lack of awareness and skills. Here we discussed the challenges of scaling data science projects and some strategies for addressing them, and we hope you will be able to apply them to your own project as well.