One of the primary reasons data science projects fail is the lack of collaboration between data scientists, IT, and operations teams. This creates a bottleneck in moving a model hypothesis from experimentation to production quickly, or in iterating through many versions of a model in production. To deal with this, the industry borrowed the idea of DevOps from traditional software projects and repurposed it for data science and machine learning projects under the name MLOps.
Although MLOps has helped streamline the process to a great extent, it is not enough to ensure the success of a data science project unless the project is adequately managed. Compared to managers of software projects, data science project managers are still working without standard project management KPIs, and before they realize it, their plans are spiralling out of control towards failure.
Let us first understand the current issues and challenges in data science project management.
Challenges of Data Science Project Management
Software project managers have plenty of proven metrics and KPIs to track the progress and outcome of their projects. Data science managers cannot simply reuse these traditional KPIs, and this section discusses why.
Each phase of a software project has clear expectations and outcomes, e.g. the requirement phase gives concrete input to the design phase, the design phase provides explicit information to developers, and developers create code that complies with the requirements. This helps to time-box the phases and track them accordingly.
On the other hand, a data science project evolves gradually from an initial set of hypotheses about the problem statement. This may lead to unexpected insights that raise more questions than answers and require more effort to address. Because of this experimental and non-linear path, you cannot blindly use traditional software project KPIs to gauge the progress of your data science project. Hence it becomes challenging to track whether your project is moving in the right direction or deviating from its initial goals.
The main artefact in traditional software projects is the code produced by developers, and managers use metrics like Lines of Code, Function Points, and Story Points to estimate the effort of the work and gauge the productivity of the team.
However, in data science projects, the main tangible deliverables are the ML models, and during the experimentation phase data scientists usually create and discard many models until they arrive at the best-performing one. Discarding a model does not mean the data scientist is unproductive; it is just how they work. But a prolonged wait for a good ML model may also be a sign of a lack of skill or productivity. This makes it tricky to estimate and track productivity, both at the team and the individual level.
Lack of Clear Vision
Usually, in a software project, a clear thought process, purpose, and line of communication run from the business and leadership down to the software project team. Different stakeholders review the progress of the project with much more clarity at various junctures to ensure nothing is going off track.
However, when it comes to data science, many organizations started investing in data science projects under peer pressure, simply because there was a big buzz around it and everyone was doing data science. In various companies, data science budgets were approved in haste, and teams were set up overnight without a crystal-clear vision or purpose. The lack of clear problem statements and goals meant that no one knew how to effectively measure the success or failure of individual phases or of the project as a whole.
Data Science Project KPIs
Indeed, we cannot simply reuse all the KPIs of software projects, but we can tailor them to data science projects, similar to how MLOps was spun off from DevOps. Here we list some useful KPIs that you can leverage for your own data science projects.
1. Clear Goal and Vision
A well-articulated goal is very important for measuring the success of the overall project. The goal should not be ambiguous; instead, it should be quantifiable and measurable so that you can track the project’s progress against it.
An example of an ambiguous goal from the business is – “We want to prevent our customers from leaving our service”. You cannot validate your project outcome against this goal.
Rather, a quantifiable goal looks like this – “We want to reduce customer churn rate by 10% in the next financial year”. This has a measurable target and a specific timeline against which you can drive your data science project.
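A quantifiable goal like the one above can be checked mechanically. The following is a minimal sketch, assuming the "10%" is a relative reduction against a baseline churn rate (the goal could also be read as ten percentage points, which would need a different formula). The function names and the sample rates are hypothetical.

```python
def relative_reduction(baseline_rate: float, current_rate: float) -> float:
    """Fraction by which the churn rate fell relative to the baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - current_rate) / baseline_rate


def goal_met(baseline_rate: float, current_rate: float, target: float = 0.10) -> bool:
    """True when churn fell by at least `target` (assumed relative, 10% by default)."""
    return relative_reduction(baseline_rate, current_rate) >= target


# Hypothetical figures: churn was 20% last year and is 17% this year.
print(goal_met(baseline_rate=0.20, current_rate=0.17))
```

Writing the goal as a function forces the business and the team to agree on the baseline and on what "10%" means before the project starts.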
2. User Stories Delivered
We can also define clear goals and timelines for the smaller phases of the data science project. To do this, you can adopt the agile methodology: create user stories for your project, assign them to team members, and schedule them in sprints of 2–3 weeks. Below are some possible examples of well-defined stories for a typical data science project.
● Story 1 — “Collect data from the MongoDB and Oracle database and create a combined dataset.”
● Story 2 — “Perform detailed exploratory data analysis and prepare a report.”
● Story 3 — “Create a hypothesis model with a base accuracy of at least 60%.”
● Story 4 — “Prepare a business report of our insights with visualizations.”
Notice that all these stories have well-defined, tangible expected deliverables, like “dataset”, “report”, “model”, and “business report”.
By tracking the user stories, you can arrive at the following two metrics:
- The outcome of each story at the end of the sprint is determined by whether the deliverable was completed successfully, and this is very useful for gauging the productivity of an individual. If an individual fails to deliver stories in too many sprints, then you should take appropriate action.
- The success of each sprint can be determined by how many stories were closed successfully. This helps you understand the overall productivity of the team and the health of the whole project. If you see back-to-back sprints with few successful stories, it is a sign that things are not under control, and you need to evaluate the gap.
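The two story-based metrics above can be derived from the same list of story records. This is a rough sketch, assuming each story is recorded with its sprint number, its owner, and whether it was delivered. The `Story` class, the `completion_rates` helper, and the sample data are all hypothetical, not part of any agile tool's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Story:
    sprint: int       # sprint number the story was planned in
    owner: str        # team member the story was assigned to
    delivered: bool   # whether the deliverable was completed in that sprint


def completion_rates(stories, key):
    """Share of delivered stories, grouped by whatever `key` extracts."""
    totals = defaultdict(int)
    done = defaultdict(int)
    for story in stories:
        group = key(story)
        totals[group] += 1
        done[group] += int(story.delivered)
    return {group: done[group] / totals[group] for group in totals}


# Hypothetical sprint log.
stories = [
    Story(sprint=1, owner="asha", delivered=True),
    Story(sprint=1, owner="ben", delivered=False),
    Story(sprint=2, owner="asha", delivered=True),
    Story(sprint=2, owner="ben", delivered=False),
]

by_sprint = completion_rates(stories, key=lambda s: s.sprint)  # team and project health
by_owner = completion_rates(stories, key=lambda s: s.owner)    # individual productivity
print(by_sprint)
print(by_owner)
```

Grouping by sprint gives the team-level view and grouping by owner gives the individual view, so one record format covers both metrics.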
3. Reusable Artefacts
Reusability is always desired in software projects. Certain artefacts can be created with reusability in mind, which not only improves productivity on the current project but can also benefit other projects. Similarly, if you can leverage reusable artefacts in your data science project, you can save a lot of time.
In a data science project, you can create reusable artefacts like data scraping or collection tools, frameworks, ML models, etc. To give a perspective, TensorFlow and PyTorch were created by Google and Facebook because they wanted frameworks that could be reused internally across projects, and these were later open-sourced for reuse by the broader community. The number of such artefacts you produce or reuse is a useful metric to indicate the productivity yielded by the project.
4. No. of Production Deployments
No matter how many experiments and POCs you run to create a machine learning model, the effort cannot be justified unless you are able to deploy models in production. And once a model is deployed, it rarely performs perfectly; hence multiple iterations and enhancements of models are needed in production.
Some projects deploy after each sprint or on a predefined cycle, but the idea is to deploy smaller changes to production quite often. If the number of production deployments over a period of time is low, it indicates that you take a long time to deliver an idea into production. It is then time to identify the bottleneck in the end-to-end process or in the MLOps pipeline.
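Deployment frequency is straightforward to compute once deployment dates are logged. The sketch below assumes you can export a list of production deployment dates from your pipeline. The function names, the threshold of two deployments per month, and the sample dates are illustrative assumptions, not a recommended standard.

```python
from collections import Counter
from datetime import date


def deployments_per_month(deploy_dates):
    """Count production deployments per (year, month)."""
    return Counter((d.year, d.month) for d in deploy_dates)


def low_months(counts, months, minimum=2):
    """Months in `months` with fewer than `minimum` deployments (assumed threshold)."""
    return [m for m in months if counts.get(m, 0) < minimum]


# Hypothetical deployment log for the first quarter of 2024.
deploy_dates = [date(2024, 1, 5), date(2024, 1, 19), date(2024, 3, 2)]
counts = deployments_per_month(deploy_dates)
quarter = [(2024, 1), (2024, 2), (2024, 3)]
print(low_months(counts, quarter))
```

Passing the list of months explicitly matters here: a month with no deployments at all never appears in the counts, yet it is exactly the month you want flagged.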
5. Actionable Insights Delivered
The key outputs of data science projects to the business are actionable insights from their advanced analytics or machine learning model. The actionable insights are usually different kinds of business optimization suggestions to improve processes like operations, sales, inventory, etc.
An efficient data science project should produce many actionable insights over a period of time. This can be tracked on either a monthly or a quarterly basis and is a crucial KPI to highlight how much business value your project is providing. If few insights are produced over a period, then you should check the other KPIs listed above to identify the issue.
6. Return on Investment (ROI)
When a company invests in a data science project, it ultimately boils down to whether the project can help the company maximize revenue or minimize loss. This is the pinnacle of success for a data science project. How much your project was able to give back to the company on top of its investment is known as ROI, and it is the ultimate KPI for you to keep an eye on.
If, even after many months or years, the data science project is not moving towards break-even for the organization, then it is worth reassessing the project from top to bottom. On the other hand, if you could deliver a significant ROI to the company, then congratulations: your data science project is a success.
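As a rough illustration, ROI can be computed with the common formula of net gain divided by cost. This is a sketch under that assumption; the article does not prescribe a formula, and how you attribute revenue gained or loss avoided to the project is a business decision outside this code. The function name and the figures are hypothetical.

```python
def roi(total_gain: float, total_cost: float) -> float:
    """Net gain as a fraction of cost; 0.0 is break-even, negative is a loss."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_gain - total_cost) / total_cost


# Hypothetical figures: 100,000 invested, 150,000 attributed to the project.
print(f"ROI: {roi(total_gain=150_000, total_cost=100_000):.0%}")
```

A value that stays below zero period after period is the signal described above that the project is not moving towards break-even.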
Currently, there are no standard, proven data science project management KPIs. But with the passing of time and lessons learned from project failures, the industry will eventually produce a robust project management framework, just as the software industry matured over the years. Meanwhile, in this article, we borrowed KPI ideas from software projects and showed how you could leverage them for your data science project.