Data Science Projects

How do they differ from usual engineering projects?

In my experience, data science projects are different from "ordinary" engineering projects (if such a thing exists). The main difference is that data science projects have to deal with a higher degree of uncertainty and risk.

Can we get the data? What is the data quality like? Will we actually find a way to achieve the level of accuracy in our ML models that is required to make this viable? And so on.

Practically, this means that the work of a data scientist is often more open-ended and research-y. It is often hard to say how long it is going to take. Sometimes it is even hard to say whether it will work at all. That's not an easy thing to say, or to hear, when you depend on a team to do some work.

The work of a data scientist is highly iterative.

There are several different aspects you need to figure out. What is the best way to formulate the task as an ML problem? What data do you use? Which method? After each iteration you check whether the results look good; if not, you think of something else to try and iterate again.

Managing Risks and Time Boxing

So what's the best way to organize the work of a data scientist into a data science project? I think the two key ingredients are managing risk and time boxing.

Organize a data science project so that you are aware of the biggest risks, and order the work to reduce those risks as quickly as possible. The idea is to figure out first what could endanger the project, and then find the most efficient way to test whether that is actually the case. And of course, at some point you will start working towards the solution.

Typical examples of risks in data science projects are:

  • Do we have the data?

  • Can we get the data?

  • Does the data have the right quality?

  • Does the data have the kind of information we're looking for?

  • Is there an ML method that can give us a solution that's as good as we need it to be?

  • Is that method fast enough to be trained with the data we have?

  • Is the method fast enough, and does it use an acceptable amount of resources, when making predictions for our use cases?

  • Is our infrastructure able to support the method or will we have to build something from scratch?

  • Do we know how to use the methods and technologies needed?

  • Does the approach make sense for the product we're building?

And so on. Any of these can kill the project. If the data is not available, you don't have to think about how to make predictions fast. If there is no method that can solve the problem, you don't have to worry about getting the data.
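
To make this ordering concrete, it can help to score each risk roughly by how badly it would hurt the project and by how cheap it is to test. Below is a minimal sketch in Python; the specific risks, the scores, and the ordering rule are illustrative assumptions, not a fixed methodology.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    impact: int        # 1-5: how badly would this hurt the project?
    cost_to_test: int  # 1-5: how expensive is it to find out?

# Illustrative scores -- in practice you would estimate these with your team.
risks = [
    Risk("data not available", impact=5, cost_to_test=1),
    Risk("data quality too low", impact=4, cost_to_test=2),
    Risk("no ML method accurate enough", impact=5, cost_to_test=3),
    Risk("inference too slow in production", impact=3, cost_to_test=2),
]

# Check high-impact, cheap-to-test risks first.
for risk in sorted(risks, key=lambda r: (-r.impact, r.cost_to_test)):
    print(f"{risk.name} (impact={risk.impact}, cost to test={risk.cost_to_test})")
```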

Once you have your risks defined, you can start asking yourself what you could do about each of them.

If you're not sure whether there is data, you could make a list of people to ask for the data, or look into the data lake to check whether the data is already there.

If you are not sure whether the data contains the information you need, you can take a sample of the data set and run a small-scale experiment with a method that has worked elsewhere (see the sketch after these examples).

Or if you are unsure whether the problem can be solved by existing ML methods, you can do some literature research to see whether other people have solved such a problem already.
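
As an illustration of such a small-scale experiment, the sketch below trains a simple classifier on a sample of the data and compares it against always predicting the majority class. The file name, the "label" column, and the assumption of numeric features are all hypothetical; the point is only to check whether the data contains any signal before committing to the full project.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical: a sample of the data with numeric features and a binary "label" column.
df = pd.read_csv("sample.csv")
X, y = df.drop(columns=["label"]), df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Naive baseline: always predict the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# A simple method that has worked elsewhere on similar problems.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("majority baseline:", accuracy_score(y_test, dummy.predict(X_test)))
print("simple model:", accuracy_score(y_test, model.predict(X_test)))
# If the simple model barely beats the baseline, the data may not contain
# the signal you need -- better to find out now than after months of work.
```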

Once you know this, you time box your next step (for example, to two weeks) and ask what you could do that would help to get more clarity on one of the biggest risks. The time box is important because many of these questions are open-ended: you could spend the next year trying out ML methods, but the goal should be to get just enough information to decide whether to move forward or not.

Managing Engineering Projects

Now let's compare this with ways to run bigger engineering projects. I think it is important to look at this, because this is the approach that people who are not experienced with ML will probably assume.

Engineering projects have their own challenges, of course, but these tend to be more about complexity and can often be solved by "thinking it through" (which can be hard, though). How exactly are we going to interface with that other service? How can we deal with all that traffic? I'm not saying these questions are easy to solve. Sometimes they are very hard, especially if the organization is large and the systems are complex. But for data science (and sometimes engineering, too), thinking it through is not enough. You need to do some work to figure out how to solve the problem.

There is one kind of uncertainty in engineering, though, and that's "building the wrong thing." Iterative approaches like agile software development were designed around short iterations precisely to deal with the risk of spending a lot of time on something that doesn't solve the original problem.

For bigger projects, companies often have a process that begins with more experienced people getting together to do the "thinking through" part. For example, you start with an idea of what needs to happen. An Amazon-style "working backwards" process describes the end result from a customer perspective. This helps to focus on the why, and also not to get distracted by thinking too much about the solution. The next step is often to have other, more technical people do the "thinking through" part to come up with a plan that could work, and if that looks good, you finally begin the actual work.

This sounds a lot like the dreaded waterfall model, but I think it is not exactly the same. Instead of doing a lot of detailed planning and then only "executing," you should take the original plan just as one version of what could work, and still use a more agile approach to create the solution, focusing on delivering increments of work quickly.

Sometimes you need to accept such a plan, because of the complexity of the work that needs to be done, even if you are aware that you may spend a long time building potentially the wrong thing.

So, What Are the Differences?

The main difference is that data science projects might spend much more time on exploratory work. It is not uncommon to spend even the majority of your time trying out ideas and experimenting.

Many people will find this quite surprising. How can it be that even highly paid experts cannot guarantee that it will work? Do they really know what they are doing? I think they do, but knowing what you're doing means something different here: it means having the experience to know what to focus on first, and how to deal with new information, continually reacting and improving.

Data science projects are not the only kind of work with a strong research aspect. Developing a new product can also be essentially a research activity. You cannot explain upfront all the steps that need to be done. A lot of experimentation is required, and in some cases, if you're doing something completely new, there is also the possibility that it will be a total failure.

In my view, seeing these differences clearly, and communicating them to your colleagues, is important to manage expectations and collaborate better.

We've only covered the very high-level question of how to decide what to do in which order. There are many more topics, like how to create data-driven products, how to move technology from experimentation to production, and so on, which will be covered elsewhere.

(c) 2021 by Mikio L. Braun
