
Whiskey Dataset ~ K-Means Clustering, Logistic Regression & EDA

“Always carry a flagon of whisky in case of snakebite, and furthermore, always carry a small snake.” ~ W.C. Fields

The project’s domain relies on one of the most popular liquors in the world — Whiskey: a dark spirit made from a great variety of grains, distilled throughout the world and coming in quite a number of styles (Irish, Scotch, Bourbon etc.) [1]. Scotland, Ireland, Canada & Japan are among the famous exporters and, on an international scale, the global production almost reaches the level of $95m in revenue [2].

The main scope hereof is to introduce, in a… ‘companionable’ way, how helpful Clustering Algorithms can prove to be anytime we need to find patterns in a (large) dataset. Clustering might actually be considered a powerful expansion of standard Exploratory Data Analysis (EDA), which is often very beneficial to try before using Supervised Machine Learning (ML) models. A predictive case of the latter (Logistic Regression) is also implemented at the end.

Concept

After successfully tuning a music playlist ‘the-Python-way’, the Data Corp I work for accepted a new project: assisting a renowned Whiskey Vendor to diversify. That is, to bring to light which whiskey varieties sell best and, with that in mind, make the appropriate mergers/acquisitions to boost sales accordingly. The main handicap, though, is that the Vendor does not possess any Sales data from the competitors (aka the prospective acquisition targets). But:

How about using whiskey-related data with all the available attributes (i.e. age, taste, type, price and so on), categorising the bottles in a meaningful way for the Vendor and, finally, guiding them on which specific bottles they should invest in?

In order to better communicate the outcomes, a number of assumptions were made:

#1: To define an adequate set of data for our analysis, I used a pertinent dataset from Kaggle — a remarkable source for almost any kind of data.

#2: The liquor attributes I used are the name, category, rating & description of the whiskies, while also engineering a couple of new ones (see Section 2).

#3: In lieu of any sales-related data, the only way to accomplish our mission is to uncover potential ‘underlying’ patterns that may lead the Vendor to increase the sales volume, artfully. That is, preserve the merchantable variety and not just sell the most expensive or highly-rated bottles.

Modus Operandi

  1. Set up the environment to run the code.
  2. Perform EDA using Numpy, Pandas & a number of additional Python libraries.
  3. Reveal additional data patterns, by fitting a K-Means Clustering algorithm to the dataset.
  4. Using the now labeled dataset (clusters = labels), implement a multiclass classification technique — Logistic Regression — to make predictions on new listings (whiskies).

1. Set Up

In this section, we are going to set up the environment needed to apply the analysis techniques.

  • Install Jupyter Notebook — an open-source web application used to create/share documents containing live code, equations, visualizations and narrative text. You may follow the steps here.
  • Install Requests / BeautifulSoup — Python libraries for addressing the API and pulling data out of HTML and XML files, respectively. You can either use a CLI (Command Line Interface) or a Jupyter notebook to run the following commands:
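The original snippet is not embedded in this copy; a typical invocation, assuming the standard PyPI package names, would be:

```python
# In a Jupyter cell — drop the leading '!' when running from a plain CLI
!pip install requests
!pip install beautifulsoup4
```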
  • Import the necessary libraries:
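The import cell is likewise missing here; a plausible reconstruction, based on the libraries referenced throughout the article, is:

```python
# Data wrangling & visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data collection (see above)
import requests
from bs4 import BeautifulSoup

# Modelling (Sections 4 & 5)
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from kneed import KneeLocator
```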

2. Data Cleaning

The very first step is to read the data into a DataFrame object, df_init.
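A minimal sketch of this step (the CSV file name is an assumption):

```python
import pandas as pd

# The Kaggle dataset ships as a single CSV file
df_init = pd.read_csv('whiskey_data.csv')
df_init.head()
```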

Preview of the ‘df_init’ Dataset

Next, we take a series of necessary actions to get it ready for further analysis (a rough code sketch follows the list). Most importantly, we:

  • Check for nulls (no nulls) / Drop redundant columns (Unnamed: and currency).
  • Convert price to float type.
  • Check for duplicates concerning the name column and replace those listings with their mean values [category, rating, price].
  • Extract new features age and alcohol from the name column, using specific RegEx (regular expressions).
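The exact cleaning code lives on the repo; the following is only a hedged reconstruction (the ‘Unnamed: 0’ label and both RegEx patterns are assumptions):

```python
df = df_init.copy()

# Drop the redundant columns
df = df.drop(columns=['Unnamed: 0', 'currency'])

# Convert price to float, stripping any thousands separators first
df['price'] = df['price'].astype(str).str.replace(',', '').astype(float)

# Collapse duplicate names into a single listing with mean numeric values
df = df.groupby('name', as_index=False).agg(
    {'category': 'first', 'rating': 'mean', 'price': 'mean', 'description': 'first'})

# Extract age (e.g. '12 year old') and alcohol (e.g. '43%') from the name
df['age'] = df['name'].str.extract(r'(\d+)\s*[Yy]ear', expand=False).astype(float)
df['alcohol'] = df['name'].str.extract(r'(\d+(?:\.\d+)?)\s*%', expand=False).astype(float)
```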
Preview of the ‘df’ Dataset

3. EDA

To reduce clutter, I do not include the full data visualisation code herein (only short sketches of the key steps); the complete version is available on the GitHub repository.

I. Apparent Insights 📕

First things first, we inspect the dataset to confirm any obvious inferences. The variables (excl. description) are quantitative, while also belonging to the ratio scale of measurement. Therefore, a Box-and-Whisker plot can effectively depict each feature’s individual distribution and, along with a table of descriptive statistics (via the pandas.DataFrame.describe method), provide us with good visual intuition about the proportion of values that fall within each quartile (price has a large value range, thus it is plotted separately).
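A brief sketch of how such a view can be produced (showmeans renders the Mean markers):

```python
import matplotlib.pyplot as plt

num_cols = ['rating', 'age', 'alcohol', 'price']

# Descriptive statistics table
print(df[num_cols].describe())

# Box plots; price gets its own axis due to its large value range
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df[['rating', 'age', 'alcohol']].plot(kind='box', ax=ax1, showmeans=True)
df[['price']].plot(kind='box', ax=ax2, showmeans=True)
plt.show()
```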

‘df’ Descriptive Statistics & Box Plots
  • The rating variable starts at 70% (quite skewed, thus there are no bad whiskeys, as quoted!) and the Mean review is around 87%.
  • As expected (for a whiskey), the age and the alcohol start at 3 years old and 40%, respectively, with the former averaging 20 years old.
  • Taking the price feature into account, the average (yellow ▲) Scotch goes for $700, while the Median (red line) is at $108. This is a clear indication that the distribution is right-skewed. Yet, what stands out is the range: from $10 to $157,000!

As per Assumption #3 (see above), no Sales data are available and, inevitably, we have to work with the existent features. We build up to the following ‘mechanism’:

The Vendor is interested in increasing the profit margin, thus widening the [price-cost] gap. Besides being right-skewed, the price itself cannot be used as a decision factor for the Vendor, since selling extra expensive bottles also means extra procurement cost.

Finding #1: We should not follow the price as a decision factor. Instead, we have to capture any interrelation among the features that may reveal which force drives the sales (and profit) upwards.

II. Deep Insights 📙

By inspecting the variables pairwise, we may capture useful relationships between them. A Scatter Plot Matrix is capable of visualising bivariate relationships between combinations of variables.
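Both views can be rendered with pandas alone (a sketch under the same column-name assumptions as before):

```python
import pandas as pd
import matplotlib.pyplot as plt

num_cols = ['rating', 'age', 'alcohol', 'price']

# Bivariate relationships for every pair of quantitative features
pd.plotting.scatter_matrix(df[num_cols], figsize=(8, 8), diagonal='hist')
plt.show()

# Pearson's correlation coefficients
print(df[num_cols].corr(method='pearson'))
```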

‘df’ Scatter Matrix
Pearson’s Correlation Coefficient Table

Interpreting the matrix together with Pearson’s correlation coefficients, the most notable findings we get are the following:

  • The most exciting finding is that a good rating does not necessarily come with a high price (cor = 0.12). In other words, there is much potential to enjoy Scotch of high quality while spending fewer dollars — a clear bargain.
  • There is, also, a decent relationship between the age and rating (cor = 0.32).
  • The highest interrelation (cor = 0.33) is noted between the price and age; the extra expensive bottles belong to the mature liquors. But, evidently, if the Vendor opts to acquire a firm which produces extra mature whiskies, they will conceivably end up selling extremely expensive liquors, biasing their variety distribution that way:
```
186    60000.0
739    60000.0
182    60000.0
82     30000.0
699    27620.0
29     26650.0
102    25899.0
397    25899.0
103    25899.0
816    25899.0
Name: price, dtype: float64
```
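For reference, a listing like the above comes straight from sorting the price column in descending order:

```python
df['price'].sort_values(ascending=False).head(10)
```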

So, apart from excluding price (Finding #1), we should rule out the age and alcohol, too.

Finding #2: Consequently, we should concentrate on the rating feature, which is expected to have a great effect on the Vendor’s profit (more popular whiskies mean higher sales).

III. Deeper Insights 📒

Seeking a pattern that may enlighten the Vendor on which bottles to promote the most (in order to get higher profit), we are going to analyse the ‘kava’ (Greek slang for the liquor stash) by the category attribute, while also ‘illuminating’ the rating feature.
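A sketch of the grouping (the horizontal orientation of the box plot is an assumption):

```python
import matplotlib.pyplot as plt

# Average of each numeric feature per whiskey category
print(df.groupby('category')[['rating', 'price', 'age', 'alcohol']].mean())

# Distribution of ratings per category
df.boxplot(column='rating', by='category', vert=False, figsize=(8, 4))
plt.show()
```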

Average Features, grouped by `category`
`rating` Box Plot grouped by `category`

The Blended Malt takes the lead in rating, with the simple Blended coming next — the former’s Mean is higher by 0.23% (88.11 vs 87.88). It is noteworthy that the Blended Malt's Median lies quite above its Mean, hence more than 50% of the bottles are rated above the average (88%). This is a decent insight we may provide the Vendor with…

Finding #3: The Vendor may choose to boost the sales of the Blended Malts. That way, they may achieve bigger sales due to the popularity of this whiskey type and, as a result, enjoy higher profits.

⚠️ But, we still violate Assumption #3 (preserve the whiskies’ variety) — a problem we haven’t yet tackled properly. So, instead of recommending Blended bottles only, and in an effort to guarantee variety to the customers, we proceed to a new, more comprehensive way of grouping the data: clustering.

4. K-Means Clustering

In Unsupervised Learning, we find patterns in data, as opposed to Supervised ML, where we make predictions. Within this context, Clustering Algorithms are capable of grouping together similar rows when more than one (invisible) group may exist. These groups form the clusters we may look at to start better understanding the structure of the data. K-Means Clustering is a popular centroid-based clustering algorithm.

The k refers to the # of clusters we want to segment our data into and must be declared upfront. Among the available approaches to estimate it, we are going to use the Elbow Method, according to which:

  1. We run the algorithm for various k (here from 1–10).
  2. For each iteration, we compute the Within-Cluster Sum of Squares (WCSS) and store it in a respective list.
  3. We plot the WCSS ~ #clusters (k) relationship.
  4. We locate the ‘elbow’ — the area after which the line declines more smoothly (the open-source Python library kneed is useful for that).

After switching to a new dataframe free of nulls (df_no_nulls), comprised of 1,355 whiskies, we plot the WCSS line.
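A hedged sketch of the whole procedure (the feature set [rating, alcohol, age] and the absence of scaling are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from kneed import KneeLocator

# Features used for clustering
X = df_no_nulls[['rating', 'alcohol', 'age']]

# Steps 1-2: run K-Means for k = 1..10, storing each WCSS (scikit-learn's inertia_)
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

# Step 4: locate the elbow programmatically
knee = KneeLocator(range(1, 11), wcss, curve='convex', direction='decreasing')
print(knee.elbow)  # 3, per the article's plot

# Step 3: plot the WCSS ~ k relationship
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('# clusters (k)')
plt.ylabel('WCSS')
plt.show()
```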

Within-Cluster Sum of Squares (WCSS) ~ Clusters Plot

The optimal number of clusters is 3 and we are ready to implement the K-Means clustering.
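For instance (cluster numbering can vary from run to run unless a random_state is pinned):

```python
from sklearn.cluster import KMeans

X = df_no_nulls[['rating', 'alcohol', 'age']]

# Fit the final model with the optimal k and label every bottle
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df_no_nulls['cluster'] = kmeans.fit_predict(X)

print(df_no_nulls['cluster'].value_counts())
```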

The model showcased the following clusters and {num of bottles}, respectively: Cluster 0 {# 353}, Cluster 1 {# 363} & Cluster 2 {# 639}. Inspecting the pertinent Box Plots along with those of Section 3, we deduce the following…

`rating` Box Plot grouped by `cluster`
(Section #3) `rating` Box Plot grouped by `category`

IV. Top Insights 📗

✔️ Clustering reveals a clearer indication of which whiskey types foster the rating (and sales, as well). See how the new Clusters distinguish themselves, compared to the categories of the previous analysis.
✔️ Cluster #1 is way better in terms of rating. Not only does its Mean get ahead of the rest, but its Median is located rightmost, meaning that at least half of the cluster’s bottles overtake even that remarkable value (89.33%).
✔️ At the same time, Cluster #1 includes Single Malt Scotch {#321}, Blended Scotch Whisky {#33} and Blended Malt Scotch Whisky {#9} — thus, variety guaranteed!
✔️ The analysis hitherto takes into account more features (rating, alcohol, age) than the previously attempted one (rating only), proving the point that clustering promotes a more comprehensive separation of the data, drawing on signals from more components.

Bringing all of them into the same space, we can visualise a 3D Scatter Plot. Actually, we are only marginally able to do so, because for >3 variables a Dimensionality Reduction process (like PCA) would have to be executed first.
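A sketch of the plot (the axis assignment and the colormap are assumptions):

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection='3d')
ax.scatter(df_no_nulls['age'], df_no_nulls['rating'], df_no_nulls['alcohol'],
           c=df_no_nulls['cluster'], cmap='brg')
ax.set_xlabel('age')
ax.set_ylabel('rating')
ax.set_zlabel('alcohol')
plt.show()
```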

`cluster` 3D Scatter Plot

It is quite prominent that Cluster #1 (red) ‘occupies’ the upper area of the y-axis (rating). Additionally, should we normalise the data, we can render a Radar Plot. This one can further assist the interpretation of the results, by illustrating how the red polygon ‘conquers’ almost all of the features.
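One way to build it (min-max normalising the per-cluster feature means is an assumption about the original normalisation):

```python
import numpy as np
import matplotlib.pyplot as plt

# Min-max normalise the per-cluster feature means to [0, 1]
means = df_no_nulls.groupby('cluster')[['rating', 'alcohol', 'age']].mean()
norm = (means - means.min()) / (means.max() - means.min())

labels = list(norm.columns)
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close each polygon

ax = plt.subplot(polar=True)
for cluster, row in norm.iterrows():
    values = row.tolist() + row.tolist()[:1]
    ax.plot(angles, values, label=f'Cluster {cluster}')
    ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.legend(loc='upper right')
plt.show()
```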

Features Radar Chart grouped by `cluster`

5. Logistic Regression

Supervised ML picks up the torch, now that our dataset is labelled (clusters). We aim at training a model that, given the 3 features [rating, alcohol, age] of a new listing (whiskey), will predict its label (aka cluster). That way, the Vendor will be capable of directly categorising it and deducing whether or not it has ‘potential’ to be commercialised!

A multiclass classification problem like this one can be tackled with a Logistic Regression model [3], after a couple of tweaks are first applied. Particularly, we will implement the one-versus-all method, where we iteratively choose a single category (i.e. Cluster #1) as the Positive case and group the rest as the Negative case. Each of the resulting models is a binary classifier that returns a probability in [0, 1]. When applying the ensemble to new data, we choose the label corresponding to the model that predicted the highest probability.
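scikit-learn wires the one-versus-all machinery up for us; a minimal sketch, assuming an 80/20 train/test split and made-up feature values for the new listing:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = df_no_nulls[['rating', 'alcohol', 'age']]
y = df_no_nulls['cluster']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# One-versus-rest: one binary model per cluster; at prediction time the
# label of the model with the highest probability wins
clf = LogisticRegression(multi_class='ovr', max_iter=1000)
clf.fit(X_train, y_train)

print(accuracy_score(y_test, clf.predict(X_test)))  # the article reports 98.9%

# Categorise a brand-new listing (feature values made up for illustration)
new_whiskey = pd.DataFrame([[90.0, 46.0, 12.0]], columns=['rating', 'alcohol', 'age'])
print(clf.predict(new_whiskey))
```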

An accuracy of 98.9% was achieved, meaning that almost 99 out of 100 new liquor entries may be successfully categorised into 1 of the 3 clusters we developed.

Conclusion

Finally, we reached our destination: from the apparent to the well-rounded insights on what ‘type’ of whiskies may foster the Vendor’s entrepreneurship.

Starting from a Kaggle dataset, we gradually progressed from plain EDA to an Unsupervised ML model (K-Means). That way, we revealed insightful patterns, which helped us better identify what it really takes to boost the Whiskey Sales, artfully. Finally, we fitted a Logistic Regression model on the labelled dataset, predicting with high accuracy the Cluster into which a new bottle may be registered.


I dedicate this post to my good friends Kostas and Panos; we used to taste new malts on Fridays, before the pandemic… Actually, beyond the quantitative ‘realm’, it’s not the price but the good company that makes whiskey tastier… So, cheers to our next ‘review session’, folks!

Thank you for reading. The repo is flavoured and ready to run 🥃 !

References

[1] https://www.thespruceeats.com/history-of-whisky-1807685

[2] https://www.statista.com/outlook/10020100/100/whisky/worldwide#market-globalRevenue

[3] https://machinelearningmastery.com/logistic-regression-for-machine-learning/

