The Oxford Online Dictionary defines the word ‘churn’ as to make somebody feel upset or emotionally confused [1]. In business we know churn as the loss of customers, but the dictionary meaning fits there as well: a customer who feels upset, or confused by the many alternatives on offer, may churn. So the question is: can data help us understand whether a customer is getting angry or confused?

Users who churn probably make up only a small proportion of all users, which makes the problem challenging. Still, the data is a great resource, even though we can never know exactly what a person experiences in real life.

In the Sparkify problem, user logs are provided and we try to predict which users will churn. The data covers 64 days and contains approximately 280k logs for about 300 users. 52 of the users have a churn log. PySpark was used for cleaning, modeling and so on; the matplotlib and pandas libraries were used only for plotting.

The main objective of the project was to predict churn. That is an open-ended statement, however, so a clear definition of the objective is required.

Data Exploration and Data Cleaning

After the usual steps, such as setting up the Spark session and loading the data, the analysis started. Only one dataframe is available for the project, and it has the columns below:

  • artist: string (nullable = true)
  • auth: string (nullable = true)
  • firstName: string (nullable = true)
  • gender: string (nullable = true)
  • itemInSession: long (nullable = true)
  • lastName: string (nullable = true)
  • length: double (nullable = true)
  • level: string (nullable = true)
  • location: string (nullable = true)
  • method: string (nullable = true)
  • page: string (nullable = true)
  • registration: long (nullable = true)
  • sessionId: long (nullable = true)
  • song: string (nullable = true)
  • status: long (nullable = true)
  • ts: long (nullable = true)
  • userAgent: string (nullable = true)
  • userId: string (nullable = true)

Sparkify is a music platform, and the column names and first few rows show that this is a log table. As stated above, the objective is to identify users who show signs of churning. For that purpose, we should separate churned from non-churned users and investigate how their behavior differs. In other words, the main focus should be on the users.

Therefore, rows with a null user id were removed after inspecting those logs. Such rows cover only seven pages, none of which represents an activity like listening to music. Judging by their names, these pages probably do not require authentication, so we can conclude that these rows belong to users who had not signed in yet or had just signed out.

  • Home
  • About
  • Submit Registration
  • Login
  • Register
  • Help
  • Error

The page feature shows what users do, so it forms the main part of our model. Every action is recorded as a page, and the pages are listed below. Looking at the pages users visit, we can see that Cancellation Confirmation means churn, since no log follows it.

Table 1: pages


Churned users: Users who have a Cancellation Confirmation page log.
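The definition above can be sketched in plain Python, independent of the PySpark code used in the project. The sample `logs` rows are made up for illustration, and the `NextSong` page name is an assumption, since the page table is not reproduced here; only the `userId` and `page` columns and the Cancellation Confirmation page come from the dataset description.

```python
# Hypothetical sample of log rows; real rows carry all the columns listed earlier.
logs = [
    {"userId": "10", "page": "NextSong"},
    {"userId": "10", "page": "Cancellation Confirmation"},
    {"userId": "11", "page": "NextSong"},
    {"userId": None, "page": "Home"},  # visitor without a user id, dropped below
]

# Keep only rows that belong to an identified user.
user_logs = [row for row in logs if row["userId"]]

# A user counts as churned if any of their rows is a Cancellation Confirmation.
churned_users = {
    row["userId"]
    for row in user_logs
    if row["page"] == "Cancellation Confirmation"
}

# One label per user: 1 for churned, 0 otherwise.
all_users = {row["userId"] for row in user_logs}
labels = {user: int(user in churned_users) for user in sorted(all_users)}
```

Here `labels` maps each remaining user id to a binary churn flag, which is the shape a classifier's target column needs.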

Some features of data (after cleaning):

  • Number of rows: ~278k
  • Number of days: 64 days (from 2018-10-01 to 2018-12-03)
  • Number of users: 225
  • Number of churn actions: 52

The dataset covers only 64 days, so it is hard to compare and evaluate how users' behavior changes on the way to cancellation. However, we have 173 non-churned and 52 churned users, which seems enough to compare the two groups.

When we look at the hourly activity of users, no difference in trend is observed between the two groups. The application is generally busiest during the evening rush hours. The plot is provided below.

Some other features were also analyzed, and a few of them appear to differ for churning users. However, another issue arises here. Since churned users are a very small part of all users, the behavior of non-churned users varies widely. For instance, the plot below shows the number of days users were online in the last 30 days. The mean is clearly higher for churned users, but there are still many non-churned users who were online almost every day.

This causes problems in almost all statistics. Therefore, the number of logs per user was collected. As can be seen below, many users have almost no logs, while a few users generate an extremely high number of logs. In my opinion, both cases should be eliminated: very low counts do not yield useful statistics, while very high counts can follow trends different from the general population and distort the models.

After some visualizations, a methodology was needed. For example, I first wanted to see how the last-week trends of churned users differ from those of users who have not churned. So I extracted the last week of the whole dataset, but then realized that this data is meaningless for churned users, because they may already have churned before the last week. This would make it easy to identify churned users, but it is unrealistic, since the model would be looking backward rather than forward.

Then I improved my strategy and extracted, for churned users, their own last week of data. However, this was also misleading. First of all, this approach cannot make use of all the data. The main problem that weakens the model is that while data is available over long periods for non-churned users, the information on churned users can be relatively small.

Afterwards, I decided that the period should be the same for all users. Moreover, choosing an origin and treating it as the present time makes the model more realistic. Looking at the data for some period of time and trying to predict whether a user will leave within the following period provides deeper insight.

The page column is very useful for analyzing the users. From this column we can calculate the number of songs a user listens to, the number of errors encountered, likes/dislikes and so on. Another feature is the number of home visits, which helps to calculate the average number of activities (e.g. number of songs) in a session.

Features and Feature Engineering

onlineDays: Number of days on which the user uses the application. The assumption was that users who tend to churn have fewer active days. However, the plot below shows that this does not hold for ten-day periods. This analysis suggests that churned users may be giving the application a last chance, so their active days might actually be increasing.

likes/dislikes: Number of thumbs up/down. The situation is the same as in the online-days analysis: churned users show a stronger tendency to like or dislike songs. These features can be helpful in our models.

listenedSongs: Number of songs that users have listened to.

HomeVisits: How many times users come back to the home page. Again, users who enjoy the application more may visit the home page more often. Inactive non-churned users are probably skewing the statistics.

Errors: Users can run into errors, so this data is also kept. If users get more errors, they may give up using the application. The graph below partly confirms the assumption, but the same problem seen with the other features remains.

differentArtists: How many different artists users have listened to during the period. This feature is included in feature engineering but not in the model, because it does not make a remarkable difference and is highly correlated with the number of songs listened to.

totalLogs: As the name suggests, the number of logs that a user generated.

likeRatio/dislikeRatio: The ratios of likes and dislikes to the total number of songs listened to.

songRatio: Ratio of songs among all logs. Since this is a music application, this ratio can indicate how well the application is serving its purpose.

errorRatio: Ratio of errors among all logs. A ratio can reveal a difference that raw counts hide. For instance, 20 error logs might be negligible for very active users but disruptive for less active ones.

After evaluating the different features and comparing the scores, some features were selected to get better models and avoid overfitting.
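The ratio features described above can be illustrated with a small plain-Python sketch. The helper names `ratio` and `add_ratios` and the two sample users are hypothetical; the feature names follow the list above, and returning 0.0 for users without songs or logs is an assumption, not necessarily what the project does.

```python
def ratio(part, whole):
    """Return part / whole, or 0.0 when the denominator is zero (assumed convention)."""
    return part / whole if whole else 0.0


def add_ratios(counts):
    """Extend a dict of per-user counts with the four ratio features."""
    features = dict(counts)
    features["likeRatio"] = ratio(counts["likes"], counts["listenedSongs"])
    features["dislikeRatio"] = ratio(counts["dislikes"], counts["listenedSongs"])
    features["songRatio"] = ratio(counts["listenedSongs"], counts["totalLogs"])
    features["errorRatio"] = ratio(counts["Errors"], counts["totalLogs"])
    return features


# Two made-up users with the same number of errors but very different activity.
active = add_ratios({"likes": 50, "dislikes": 5, "listenedSongs": 1500,
                     "Errors": 20, "totalLogs": 2000})
casual = add_ratios({"likes": 2, "dislikes": 1, "listenedSongs": 60,
                     "Errors": 20, "totalLogs": 100})
```

Both users hit 20 errors, yet `errorRatio` is twenty times higher for the casual user, which is the effect the raw `Errors` count cannot show.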

Table 2: Features

A crucial step is to state the problem clearly and make the approach robust. The model needs features and an output, each computed over its own configurable period. I created a training period and a validation period of 20 and 10 days respectively, but these can still be changed.

Picture 1: Time-dependent Feature Engineering Methodology

The features are extracted as feature columns, and the output is the ifChurn column. The main point is that, if we are predicting for today, we analyze the data of the last 20 days and try to predict whether the user will churn in the next 5 days.
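A minimal plain-Python sketch of this windowing idea follows. It is not the PySpark code used in the project: the function name `build_examples`, the constant `CANCEL`, the sample events and the choice to skip users who cancelled before the origin are all assumptions made for illustration, and timestamps are assumed to be already converted to `datetime` objects.

```python
from collections import Counter
from datetime import datetime, timedelta

CANCEL = "Cancellation Confirmation"


def build_examples(events, origin, observe_days=20, predict_days=5):
    """Build one feature row per user from (user_id, timestamp, page) tuples.

    Features come from [origin - observe_days, origin);
    the ifChurn label comes from [origin, origin + predict_days).
    """
    start = origin - timedelta(days=observe_days)
    end = origin + timedelta(days=predict_days)
    counts = {}           # user -> Counter of pages seen in the observation window
    churned = set()       # users who cancel inside the prediction window
    already_gone = set()  # users who cancelled before the origin (skipped)
    for user, ts, page in events:
        if page == CANCEL:
            if ts < origin:
                already_gone.add(user)
            elif ts < end:
                churned.add(user)
        elif start <= ts < origin:
            counts.setdefault(user, Counter())[page] += 1
    return {
        user: {
            "totalLogs": sum(pages.values()),
            "HomeVisits": pages["Home"],
            "Errors": pages["Error"],
            "ifChurn": int(user in churned),
        }
        for user, pages in counts.items()
        if user not in already_gone
    }


# Made-up events: "a" churns after the origin, "b" stays, "c" left before the origin.
origin = datetime(2018, 10, 21)
events = [
    ("a", datetime(2018, 10, 5), "Home"),
    ("a", datetime(2018, 10, 10), "Error"),
    ("a", datetime(2018, 10, 23), CANCEL),
    ("b", datetime(2018, 10, 15), "Home"),
    ("b", datetime(2018, 10, 22), "Home"),
    ("c", datetime(2018, 10, 2), "Home"),
    ("c", datetime(2018, 10, 10), CANCEL),
]
examples = build_examples(events, origin)
```

Moving `origin` through the dataset yields several such snapshots, and every user is described by a window of the same length regardless of when they churned.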


Before machine learning algorithms can be applied, the data must be split for validation. Two methods can be used. In the first, earlier data is used for training and the last time interval becomes the test data. This method is more realistic, since it respects time. It has drawbacks, though: the split cannot be randomized, and since churned users form only a small percentage of the data, we end up trying to predict the same few users. A lucky or unlucky guess on those few users can therefore distort the result.

The second method is to pool all the data and split it randomly into train and test parts. This method is better in terms of randomness but less realistic, since we may end up predicting past churns from future data. Both approaches were tried in this project; unfortunately, neither made a difference.

PySpark supports building pipelines and provides different machine learning algorithms.
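The two splitting strategies can be contrasted in a short plain-Python sketch. The function names `time_split` and `random_split`, the `origin` field, the 30% test fraction and the seed are illustrative assumptions, not the settings used in the project.

```python
import random


def time_split(examples, cutoff):
    """Train on snapshots taken before the cutoff, test on the later ones."""
    train = [ex for ex in examples if ex["origin"] < cutoff]
    test = [ex for ex in examples if ex["origin"] >= cutoff]
    return train, test


def random_split(examples, test_fraction=0.3, seed=42):
    """Shuffle all snapshots together and cut off a random test part."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up snapshots, identified by the day index of their origin.
examples = [{"origin": day, "ifChurn": 0} for day in range(10)]

time_train, time_test = time_split(examples, cutoff=7)
rand_train, rand_test = random_split(examples)
```

With `time_split`, every test snapshot lies after every training snapshot; with `random_split`, a test snapshot can precede training snapshots, which is the loss of realism described above.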

The pipeline is built in three steps:

  • VectorAssembler: This step takes the feature columns from the dataset and combines them into a vector, which is what the model is trained on.
  • Indexer: This step creates the label column. The model tries to predict this label.
  • Machine Learning Algorithm: The last step is the machine learning model, which trains on and transforms the data. Three algorithms were used in the project:
  1. Logistic Regression
  2. Decision Tree Classifier
  3. Random Forest Classifier

The model can be built with different time periods, such as a 20-day training period and a 5-day validation period. After these models are built, evaluation starts.

Evaluation and Conclusion

All models are evaluated and compared via the F1 score. Since we are trying to identify churned users, F1 is one of the best metrics for us. Churned users make up a very small part of the data, so measuring success by how well we guess non-churned users is misleading. For example, if 5% of the users churn and we simply guess that no user will churn, without any model, we easily reach 95% accuracy. The F1 score, however, takes into account only users who churned or who were predicted to churn. It is built from two ratios:

ratio1: ratio of correctly predicted churned users among all users who actually churned.

ratio2: ratio of correctly predicted churned users among all users who are predicted to churn.

F1 = 2 * ((1/ratio1)+(1/ratio2))^-1
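The formula above and the 5% example can be checked with a few lines of plain Python. The function names and the sample predictions are made up for illustration; returning 0.0 when no churned user is predicted correctly is the usual convention and is assumed here.

```python
def accuracy(actual, predicted):
    """Share of users whose label was guessed correctly."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


def f1_score(actual, predicted):
    """F1 for the churn class (label 1), following the formula above."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    if true_pos == 0:
        return 0.0  # nothing predicted correctly: both ratios are zero or undefined
    ratio1 = true_pos / sum(actual)     # share of actual churners that were found
    ratio2 = true_pos / sum(predicted)  # share of predicted churners that are real
    return 2 / (1 / ratio1 + 1 / ratio2)


# 100 made-up users, 5 of whom churn.
actual = [1] * 5 + [0] * 95

# Guessing "nobody churns" looks accurate but finds no churner.
nobody = [0] * 100
baseline_accuracy = accuracy(actual, nobody)
baseline_f1 = f1_score(actual, nobody)

# A model that flags 4 users, 3 of them correctly.
some = [1, 1, 1, 0, 0, 1] + [0] * 94
model_f1 = f1_score(actual, some)
```

The baseline reaches an accuracy of 0.95 with an F1 of 0, while the second set of predictions, with ratio1 = 3/5 and ratio2 = 3/4, gets an F1 of 2/3.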

All models score 0. We can conclude that the features need to be improved. We had already extracted more features than we used, but the other features were not satisfying either.

Table 3: F1 Scores of Models with Different Time Periods

When we compare the predicted churn with the real churn, the rows below are returned. As can be seen, there are very few churned users, so it becomes very hard to predict them. When we analyze the probabilities, the following points stand out:

  • The dataset should be enlarged to train the models better.
  • Some users neither use the application regularly nor leave it. These users dominate the data. To handle this, edge cases can be eliminated separately for each period.
  • In the small dataset, most churned users cancel their membership at the very beginning of the period. As a result, there is not enough data for these users.
  • The features are weak. Even given the obstacles to building a healthy model, a score of zero is heartbreaking with so many engineered features.

Even though none of the models gives a satisfying result, we have worked through a real-world problem, and it is sometimes very hard to predict such small subgroups. Moreover, the features need to be improved. Although the time periods are separated logically, the features carry no time weighting; users' most recent activities could be given more weight in the model.

Since we could not get any score other than zero, no cross-validation was applied. However, if better features can be found, cross-validation will also help to improve the results.

[1] (Definition of churn verb from the Oxford Advanced Learner’s Dictionary, 2020)