The Pareto Principle — Spending Time and Energy Effectively as a Data Scientist

A small portion of effort causes the majority of payoff

The Pareto Principle states that, for a wide variety of situations, about 80% of the outcome is caused by around 20% of causes.

It turns out this is widely applicable, both to how you look at data, and how you think about projects.


Examples include:

  • ~80% of bug reports are often caused by ~20% of bugs
  • ~80% of healthcare costs are often caused by ~20% of patients
  • ~80% of user interactions often come from ~20% of users
  • ~80% of the value of a project often comes from the first 20% of effort

Whether the exact percentages end up actually being 80% and 20% isn’t important — there are instances where the discrepancy will be more or less extreme. The important thing to keep in mind is that it’s very common for a small portion of causes to drive most of the impact. This should be a guiding principle both in thinking about how you’re building your machine learning models and how you’re spending your time.
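As a quick sanity check on your own data, you can sort causes by impact and see how few of them it takes to cover 80% of the outcome. Here's a minimal Python sketch using made-up bug-report counts (the numbers are purely illustrative):

    import numpy as np

    # Hypothetical counts of reports attributed to each bug (made-up numbers)
    reports_per_bug = np.array([120, 95, 60, 30, 12, 8, 5, 4, 3, 2, 2, 1, 1, 1, 1])

    # Sort causes from largest to smallest and compute the cumulative share of the outcome
    sorted_counts = np.sort(reports_per_bug)[::-1]
    cumulative_share = np.cumsum(sorted_counts) / sorted_counts.sum()

    # Smallest number of bugs that explains at least 80% of the reports
    n_top = int(np.searchsorted(cumulative_share, 0.80) + 1)
    print(f"{n_top} of {len(sorted_counts)} bugs (~{n_top / len(sorted_counts):.0%}) "
          f"account for ~{cumulative_share[n_top - 1]:.0%} of reports")

With these toy numbers, about a quarter of the bugs explain nearly 90% of the reports, which is the kind of lopsided split the principle describes.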

A power distribution, demonstrating how often the top of the distribution has an outsized impact. Picture by Hay Kranen / PD in Wikimedia

The Pareto Principle and Building Machine Learning Models

Most of the performance of machine learning models usually comes from a small amount of effort. If you want to maximize your impact as a data scientist, it might be best to create many minimum viable products rather than trying to make one perfect product. If you get most of the value with 20% of the effort, it can be better to spend the first 20% of effort on a bunch of different projects and have a wider impact rather than spending all your time perfecting one project. Let’s look at a couple of concrete places this may come up.

Features

Often very few features do the bulk of the work for the model. There is an inclination, especially among less experienced data scientists, to seek out any possibly related data to throw into a machine learning model. While it’s a good idea to be inclusive with data, if it takes real work to wrangle and add additional features, you should question whether it’s worth it. You may have already added the most impactful features and be getting to the point of diminishing returns.

Usually it’s a couple of the most directly related features that completely dictate the performance of your model.

Take for example the idea of predicting who is most likely to keep a library book past due. You could use all kinds of information about the type of book, perhaps the neighborhood the borrower lives in, and come up with a bunch of creative features to add to the model based on the book borrowed. But almost certainly, the dominating features will all involve one thing: The borrower’s history of returning books. If they’ve returned books on time in the past, they’ll be less likely to keep one past due. If they’ve ever returned something past due, they’re more likely to be past due next time.

This is important because often it’s the simplest, most accessible features that are the most impactful. Depending on the use case, spending a lot of time tracking down the data needed for additional features may not add much business value.
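If you want a rough read on whether a handful of features is carrying the model, feature importances from a quick baseline can tell you. The sketch below uses a random forest on synthetic data; the column names and the data-generating process are hypothetical stand-ins for the library example, not a real dataset.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    n = 5_000

    # Hypothetical borrower features; only the return history actually drives the label here
    X = pd.DataFrame({
        "past_late_returns": rng.poisson(1.0, n),
        "books_borrowed_last_year": rng.poisson(10, n),
        "book_page_count": rng.integers(80, 900, n),
        "neighborhood_code": rng.integers(0, 50, n),
    })
    p_late = 1 / (1 + np.exp(-(X["past_late_returns"] - 1.5)))
    y = rng.binomial(1, p_late)

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # How concentrated is the importance in the top feature?
    importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
    print(importances)
    print(f"Top feature captures ~{importances.iloc[0] / importances.sum():.0%} of total importance")

On real data the split won’t be this clean, but a quick check like this can tell you whether it’s worth chasing more features at all.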

Tuning a model

Hyperparameter tuning is great — with a bit of computational power, you can often get some “free” extra performance out of a model.

However, it’s easy to spend too much time on it. Whenever I’ve seen a student learn hyperparameter tuning, they’re usually disappointed that, after all that time and energy, it only had a tiny impact on performance. Usually that’s because the defaults are already sensible: at this point, most packages ship with reasonable default hyperparameters and work well out of the box for most applications.

Hyperparameter tuning might eke out a bit more performance (or a lot, if the defaults happen to be a poor fit for your particular problem), but you quickly reach the point of diminishing returns. The Pareto Principle again: the first bit of your effort will usually get you most of the way there.

You can quickly get a model up and running with a simple hyperparameter sweep. It’s great if you have a pipeline that can run a large sweep and sort out the problem for you, but if you lack the computational resources to easily set-and-forget a grid search, it may be worth just moving on. If the model is especially important, or if you have some downtime later, you can always come back and throw more hyperparameter tuning at the problem. But it rarely makes sense to spend weeks or months tuning a simple model when you could move on to the next thing.
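One way to keep tuning cheap is a small, time-boxed randomized search rather than an exhaustive grid. The sketch below assumes a scikit-learn style workflow; the synthetic data and the parameter ranges are illustrative, not recommendations.

    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    # Synthetic data as a stand-in for a real problem
    X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # A deliberately small sweep: a couple of dozen candidates, not an exhaustive grid
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={
            "n_estimators": randint(100, 500),
            "max_depth": [None, 5, 10, 20],
            "min_samples_leaf": randint(1, 10),
        },
        n_iter=20,
        cv=3,
        random_state=0,
    ).fit(X_train, y_train)

    print(f"Default hyperparameters: {baseline.score(X_test, y_test):.3f}")
    print(f"After a small sweep:     {search.score(X_test, y_test):.3f}")

If the gap between the two numbers is tiny, that’s your cue to move on; if it’s large, the defaults were a poor fit and a longer sweep might be justified.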

Conclusion

The Pareto Principle is an important guiding principle: often, in data science and in life, a minority of the effort will get the majority of the outcome. Your time as a data scientist is valuable — make sure you’re using it wisely. Sometimes, a smart form of laziness is the best way to get things done.
