R² is a popular statistic for evaluating linear regression models, but it can be easily misapplied, as I discovered.

Data science is an iterative process, which is especially true for new practitioners. We can spend weeks on a project, only to later discover a fundamental flaw in the analysis. This post explores a basic mistake I made on a project, working through how it happened and why it matters.

My capstone project was a predictive analytics tool that utilized historical sales data and weather data to forecast nightly and weekly sales for an award-winning restaurant in Brooklyn. Historical sales data were pulled directly from the restaurant’s point of sale system (two separate systems that had to be aggregated, actually), and while the project was fairly straightforward, certain elements did present challenges. Holidays, closed days, outside seating that was only open on a seasonal basis and was subject to unexpected closures, room rental fees, inaccurate guest count data, and special events all had to be understood and handled appropriately.

One decision that I had to make was how to deal with holidays / closed days, when sales were $0. I considered two options: (1) Include an encoded boolean “Closed” feature, or (2) Drop closed days entirely from the dataset (a 3rd option would have been to treat the days as outliers — more on this later). While I was experimenting with the two options, I noticed that regardless of which regression model I used, the R² value varied significantly depending on how I treated the closed days.
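
In pandas terms, the two options look roughly like this (a minimal sketch; the DataFrame, column names, and values below are only illustrative, not the project's actual schema):

import pandas as pd

# Illustrative data only; the real data came from the restaurant's POS systems
df = pd.DataFrame({
    'date': pd.to_datetime(['2018-07-03', '2018-07-04', '2018-07-05']),
    'sales': [15200.0, 0.0, 14100.0],   # the $0 row is a closed day
})

# Option 1: keep closed days and add an encoded boolean "Closed" feature
df['closed'] = (df['sales'] == 0).astype(int)

# Option 2: drop closed days from the dataset entirely
df_open_only = df[df['sales'] > 0].copy()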

I ended up using the boolean “Closed” feature because it resulted in a higher R² value, but that decision has never really sat well with me. The purpose of this post is to revisit that decision and understand why it was the wrong one.

Setup

To provide some context, the following graph shows nightly average sales in the restaurant’s dining room, bar, and private dining room areas, excluding the outside area.

Nightly Average Sales, Jan-2017 — June-2019

The restaurant averaged indoor sales of $14,575, with a standard deviation of $1,832 over the period January 2017 through June 2019 (this is a particularly successful restaurant, I should note).

Comparing Approaches

The following table shows how three evaluation metrics, R², Root Mean Square Error (RMSE), and Mean Absolute Error (MAE), differed between the two options for handling closed days:

So, while RMSE and MAE stayed fairly consistent, R² fell significantly when I dropped the closed-day observations. Why is this? Time to unpack R² and better understand what’s happening.
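
For reference, here is a sketch of how the three metrics can be computed with scikit-learn (y_test and y_hat here are placeholder arrays standing in for the actual and predicted nightly sales):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder arrays standing in for actual and predicted nightly sales
y_test = np.array([14500.0, 15800.0, 13900.0, 0.0])
y_hat = np.array([14200.0, 15100.0, 14400.0, 150.0])

r2 = r2_score(y_test, y_hat)
rmse = np.sqrt(mean_squared_error(y_test, y_hat))   # Root Mean Square Error
mae = mean_absolute_error(y_test, y_hat)            # Mean Absolute Error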

Before we move on, it’s instructive to look at the residuals under both scenarios — this will provide some intuition into what is happening here:

Residuals with an Encoded, Boolean Closed Feature:

Residuals when Closed Days are Dropped:

R² represents the proportion of the total variation in a dependent variable (the target variable, in this case, sales) that is explained by the variation in the independent variables (our feature variables, such as day-of-week, month, sales trend, and temperature).

One of the simplest expressions for R² in a modeling context is the following:
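R² = 1 − (Residual Sum of Squares / Total Sum of Squares)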

The Residual Sum of Squares is the sum of the squared errors, the differences between each actual data point and the value predicted by the model. The Total Sum of Squares is the sum of the squared deviations of each target observation from the mean of all target observations.

The code to calculate R² looks like the following (y_hat is the predicted target variable — this particular code block calculates the test set’s R²):

import numpy as np

sum_squares_residual = np.sum((y_test - y_hat) ** 2)           # squared prediction errors
sum_squares_total = np.sum((y_test - np.mean(y_test)) ** 2)    # squared deviations from the mean
r_squared = 1 - sum_squares_residual / sum_squares_total

We’re beginning to see why including the closed days was a mistake, but let’s take the next step in the calculation to really understand it:

Boolean Closed Feature:
sum_squares_residual: 225_779_046
sum_squares_total: 921_523_793
sum_squares_residual / sum_squares_total = 0.245
r_squared = 0.755

Dropped Closed Days:
sum_squares_residual: 223_557_101
sum_squares_total: 484_108_603
sum_squares_residual / sum_squares_total = 0.462
r_squared = 0.538

Look at the difference in the denominator! The RSS is actually lower when the closed days are dropped, which on its own would indicate a higher R², but the denominator (TSS) is significantly lower, leading to a higher RSS / TSS ratio (and thus a lower R²).

So what’s going on?

Since I included the data points from closed days and specifically excluded them from my outlier adjustment, for each of the 12 closed data points in the dataset the R² formula subtracts $0 from $14,575 (the mean for the entire dataset; in practice the train R² uses the training mean and the test R² uses the test mean) and then squares the result ($14,575² is $212,430,625!). This is entirely artificial variance that was unnecessary to include, but it did have the side benefit of boosting my R² value. A closed day does not represent natural variance: we know why sales are $0 on those days and can predict $0 sales for them with 100% accuracy. They should have been dropped.
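
To see the effect in isolation, here is a small sketch with made-up nightly sales figures, showing how just a couple of $0 closed days inflates the Total Sum of Squares (the R² denominator):

import numpy as np

def total_sum_of_squares(y):
    return np.sum((y - np.mean(y)) ** 2)

# Made-up nightly sales for illustration only
open_nights = np.array([14200.0, 15300.0, 13800.0, 14900.0, 15100.0])
with_closed_days = np.append(open_nights, [0.0, 0.0])   # add two $0 closed days

print(total_sum_of_squares(open_nights))        # variance from genuine night-to-night fluctuation
print(total_sum_of_squares(with_closed_days))   # hugely inflated by the artificial $0 points

Nothing about the model's predictions improves when those $0 nights are included; the denominator simply gets bigger, which is exactly what inflated my R².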

Lessons

My first mistake was how I treated closed days; it would have made more sense to drop them from the dataset. And as I dug into R², I also realized that Root Mean Squared Error or Mean Absolute Error would have been better model evaluation metrics for this particular application (and by the time you read this, my project will reflect as much). If you really want to dive into the many reasons R² is a flawed metric, please see these lecture notes from Cosma Shalizi at Carnegie Mellon.

This exercise also further illustrates the importance of properly handling outliers: many of the statistical assumptions that machine learning is built on simply don't hold if you don't account for them properly. These closed days were effectively outliers, and engineering a feature for them was the wrong approach.

Lastly, and most importantly, this is why understanding statistics and the underlying math is so important for Data Science. This was a simple mistake on my part, and theoretically a forgivable one given that I'm just starting out in the field, but it was also highly instructive. Nobody becomes an expert overnight, and when something doesn't look right, you can learn a lot from digging into why rather than just assuming that your model knows best.