Customer Churn Prediction with Spark

Predicting music streaming service user churn with exploratory analysis and machine learning

Edited by Peter Le


Sparkify is a digital music service similar to Spotify and Pandora. Users can either listen to music for free or buy a subscription. To identify users who are likely to churn, we first perform an exploratory analysis to glean insights from the data set and identify key features of interest. We then experiment with several algorithms (logistic regression, random forest, gradient boosting, and decision tree) in the Spark ML library and select the best model based on two key evaluation metrics, F1 score and accuracy.

The classification models are evaluated using two standard metrics for binary classification: F1 score and accuracy. The F1 score is given greater weight in interpretation because the classes are imbalanced (significantly fewer customers churn than don’t). Accuracy is only a reliable guide when the classes are roughly balanced.
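
To make the point about imbalance concrete, here is a minimal sketch in plain Python. It scores a hypothetical baseline that predicts "no churn" for every user, using the 52 churned users out of 225 reported in the exploratory analysis. The helper names are illustrative, and the sketch assumes the F1 score is computed for the churn class only; how the scores reported later are averaged across classes is not stated.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 (churn) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    # With no positive labels and no positive predictions, return 0 by convention.
    return 2 * tp / denominator if denominator else 0.0


# 52 churned users (label 1) and 173 retained users (label 0).
y_true = [1] * 52 + [0] * 173
# Hypothetical baseline: predict "no churn" for everyone.
always_no_churn = [0] * 225

baseline_accuracy = accuracy(y_true, always_no_churn)
baseline_f1 = f1_score(y_true, always_no_churn)
print(f"accuracy={baseline_accuracy:.3f}, f1={baseline_f1:.3f}")
# prints: accuracy=0.769, f1=0.000
```

The baseline never identifies a single churned user, yet it is right about 77% of the time. Its churn-class F1 score of 0 exposes the problem that accuracy hides.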

The expectation is that some of these features will reveal a substantial difference between customers who churn and those who don’t. Understanding how much each feature contributes to predictive performance, and acting on those features, could greatly benefit the service.


The data from Sparkify consists of user events: every interaction a user has with the application is logged. Each time a user visits the home page, listens to a song, moves to the next song, gives a song a thumbs up, and so on, a corresponding event is recorded.

Exploratory Analysis

What is the churn rate of Sparkify?

Figure 1. Distribution of Users by Churn Type

The figure above shows that, out of 225 total users, 52 were identified as churned, which is approximately 23% of the user base.

Does gender affect the churn rate?

Figure 2. Distribution of Churn per gender type

The figure above illustrates churn by gender. The dataset has more male users (~54% male, ~46% female), so it’s no surprise that more male users churn in absolute terms. The churn rate is also noticeably higher for males than for females (26% vs 19%).

What is the page distribution for user activity?

Figure 3. Page distribution for user activity

Many users visit the ‘Next Song’ page, which is beneficial for a music streaming business. ‘Thumbs Up’ is another important page: it suggests that users like the songs played and enjoy the app. Frequent visits to ‘Home’ may indicate regular activity in the app.

What is the page distribution for churn?

Figure 4. Distribution of page and churn

Next, we explore how page visits are distributed between churned and non-churned users. Pages such as ‘Next Song’, ‘Thumbs Up’, ‘Add Friend’, and ‘Add to Playlist’ have a higher proportion of non-churn users. The number of times a user visits these pages may therefore help indicate whether that user is likely to churn.

Does the type of user device influence the churn rate?

Figure 5. Churn per device

The majority of users access the service from Windows or Mac, which also account for the most churned customers. The churn rate for Windows users is 18.5%, slightly higher than the 18.1% for Mac users. Devices such as X11 and iPhone have a much smaller user base and therefore fewer churned users.

Does user location affect the churn rate?

Figure 6. Churn vs total user in location

The locations with the most total users and the most churned users are ‘Los Angeles-Long Beach-Anaheim, CA’, ‘New York-Newark-Jersey City, NY-NJ-PA’, and ‘Phoenix-Mesa-Scottsdale, AZ’. Users are scattered widely, and almost all locations have very few users.

Feature Engineering

Following the exploratory analysis, a Spark data frame is created with 10 features:

  1. Gender: Gender of the user (Binary)
  2. Churn: Whether the user visited the ‘Cancel Confirmation’ page, which defines churn (Binary)
  3. Level: Latest level of the user, paid or free (Binary)
  4. Length: Total length of songs the user listened to (Float)
  5. Average Session Duration: Average duration of the user’s sessions (Float)
  6. Page: Number of visits per page, one feature per page: Add Friend, Add to Playlist, Downgrade, Home, etc. (Integer)
  7. Time Since Registration: Time elapsed since the user registered (Integer)
  8. Sessions: Total number of sessions (Integer)
  9. Songs: Total number of songs played (Integer)
  10. User Agent: Device/agent used by the user (Integer)
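
The step from raw events to one row per user can be sketched in plain Python. This is only an illustration of the aggregation idea, not the Spark code used in the project. The event field names (`userId`, `sessionId`, `page`, `length`), the sample records, and the choice to count ‘Next Song’ events as songs played are all assumptions.

```python
from collections import Counter, defaultdict

# Page name that defines churn, as given in the feature list above.
CHURN_PAGE = "Cancel Confirmation"

# Hypothetical event log; field names and values are made up for illustration.
events = [
    {"userId": "1", "sessionId": 10, "page": "Home", "length": None},
    {"userId": "1", "sessionId": 10, "page": "Next Song", "length": 200.0},
    {"userId": "1", "sessionId": 10, "page": "Thumbs Up", "length": None},
    {"userId": "1", "sessionId": 11, "page": "Next Song", "length": 180.5},
    {"userId": "2", "sessionId": 50, "page": "Next Song", "length": 240.0},
    {"userId": "2", "sessionId": 50, "page": CHURN_PAGE, "length": None},
]


def build_user_features(events):
    """Aggregate an event log into one feature dictionary per user."""
    page_visits = defaultdict(Counter)   # userId -> page -> visit count
    sessions = defaultdict(set)          # userId -> distinct session ids
    total_length = defaultdict(float)    # userId -> summed song length

    for event in events:
        uid = event["userId"]
        page_visits[uid][event["page"]] += 1
        sessions[uid].add(event["sessionId"])
        total_length[uid] += event.get("length") or 0.0

    features = {}
    for uid, visits in page_visits.items():
        features[uid] = {
            "churn": int(visits[CHURN_PAGE] > 0),
            "songs": visits["Next Song"],  # assumption: one song per event
            "sessions": len(sessions[uid]),
            "length": total_length[uid],
            "page_thumbs_up": visits["Thumbs Up"],
        }
    return features


user_features = build_user_features(events)
for uid, row in sorted(user_features.items()):
    print(uid, row)
```

In this toy log, user 2 reaches the cancellation page and is labeled as churned, while user 1 is not.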

The data frame is split into 70% for training and 30% for testing.
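
As a rough illustration of the split, the sketch below shuffles 225 placeholder user IDs with a fixed seed and cuts the list at 70%. The function name, the seed, and the use of integer IDs are assumptions made for the example; the project itself performs the split on a Spark data frame.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly, then cut them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder IDs standing in for the 225 users in the dataset.
user_ids = list(range(225))
train_ids, test_ids = split_train_test(user_ids)
print(len(train_ids), len(test_ids))
# prints: 157 68
```

Spark’s `DataFrame.randomSplit` assigns each row to a split at random according to the weights, so with that method the sizes are only approximately 70/30 rather than an exact cut like this one.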

Model Building


Figure 7. Metric scores for classifiers

With default parameters, gradient boosting has the highest F1 score of all the classifiers, and random forest comes second.

Model Tuning

Since gradient boosting and random forest scored the highest, we tune them to further improve performance. We also include logistic regression and decision tree for experimentation.

Logistic Regression Parameters: regParam [0.1, 0.01], fitIntercept [True, False]

Random Forest Parameters: impurity [‘entropy’, ‘gini’], maxDepth [2, 4, 6, 8]

Gradient Boosting Parameters: maxDepth [2, 4, 6, 8, 10]

Decision Tree Parameters: impurity [‘entropy’, ‘gini’], maxDepth [2, 4, 6, 8]
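
To get a feel for the size of each search, the sketch below expands the grids listed above into individual parameter combinations and counts the model fits needed under the 2-fold cross-validation used in this project. The dictionary layout and the `expand_grid` helper are illustrative assumptions, not part of the Spark API.

```python
from itertools import product

# Parameter grids as listed above, keyed by an illustrative model name.
param_grids = {
    "logistic_regression": {"regParam": [0.1, 0.01], "fitIntercept": [True, False]},
    "random_forest": {"impurity": ["entropy", "gini"], "maxDepth": [2, 4, 6, 8]},
    "gradient_boosting": {"maxDepth": [2, 4, 6, 8, 10]},
    "decision_tree": {"impurity": ["entropy", "gini"], "maxDepth": [2, 4, 6, 8]},
}

NUM_FOLDS = 2  # 2-fold cross-validation


def expand_grid(grid):
    """Return every combination of parameter values as a list of dictionaries."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


grid_sizes = {model: len(expand_grid(grid)) for model, grid in param_grids.items()}
for model, size in grid_sizes.items():
    print(f"{model}: {size} combinations, {size * NUM_FOLDS} cross-validation fits")
# prints:
# logistic_regression: 4 combinations, 8 cross-validation fits
# random_forest: 8 combinations, 16 cross-validation fits
# gradient_boosting: 5 combinations, 10 cross-validation fits
# decision_tree: 8 combinations, 16 cross-validation fits
```

The grids are small, which keeps tuning cheap, but it also means only a narrow slice of each model’s parameter space is explored.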

Figure 8. Metric scores for tuned classifiers

After hyperparameter tuning, the F1 score and accuracy of gradient boosting increased only slightly (~0.001%), while random forest improved by approximately 4.4%, logistic regression by about 3.2%, and decision tree by about 5.3%.

Feature Importance

Figure 9. Feature importance for logistic regression model

The top 5 important features are:

1. page_settings
2. page_home
3. page_thumbs_up
4. total_time
5. page_error

These features have the greatest influence on the model’s churn predictions.

Model Evaluation and Validation

Based on the results in Figure 8, gradient boosting is the best model, with the highest F1 score of 0.9915. Gradient boosting builds trees one at a time, so each new tree helps correct errors made by the previously trained trees, and the model becomes more expressive with each tree added. During hyperparameter tuning, the best parameters were selected by grid search over maxDepth [2, 4, 6, 8, 10] for gradient boosting, and over impurity [‘entropy’, ‘gini’] and maxDepth [2, 4, 6, 8] for random forest.

This makes sense, as increasing the depth and number of trees tends to improve model performance, although deeper trees also raise the risk of overfitting. Using 2-fold cross-validation during tuning helps reduce that risk.
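
For readers unfamiliar with k-fold cross-validation, here is a generic sketch of how rows can be divided so that each fold serves once as validation data while the rest is used for training. It is a plain-Python illustration under assumed names and an assumed seed, not Spark’s own implementation.

```python
import random


def k_fold_splits(n_rows, k=2, seed=0):
    """Return a list of (training_indices, validation_indices), one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


# Toy example with 10 rows and 2 folds.
splits = k_fold_splits(n_rows=10, k=2)
for training, validation in splits:
    print(f"train on {len(training)} rows, validate on {len(validation)} rows")
```

With two folds, every candidate parameter setting is trained twice, each time on half of the training data and scored on the other half. The two scores are then averaged to compare settings.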


Overall, the project was successful, as all the questions posed were answered. The exploratory analysis revealed trends and features hypothesized to influence churn, and the strong performance of the main models suggests that the selected features are significant predictors of churn: the F1 score is 0.9915 for gradient boosting and 0.9871 for random forest.

The F1 score is the key metric for selecting the best model: a high F1 score requires both few false positives and few false negatives, which keeps business costs down. It gives equal weight to precision and recall, making it a more robust measure than accuracy on imbalanced data.
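
For reference, the standard definitions behind this metric are shown below, where TP, FP, and FN denote true positives, false positives, and false negatives for the churn class. Treating churn as the positive class is an assumption here.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, so a model cannot score well by excelling at only one of them.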


The objective was to predict which users are likely to churn from the digital music service Sparkify. Any user who decided to cancel their subscription was identified as a ‘churned’ user.

Key steps:

  1. Exploratory analysis was performed to gain a deeper understanding of the distribution of users by features such as gender, page activity, device, and location.
  2. Insights from the analysis, together with other hypothesized key features, guided feature engineering, producing a data frame ready for modeling.
  3. The data was split into training (70%) and test (30%) datasets for model building and evaluation.
  4. Machine learning pipelines were built for several classifiers, with grid search used to tune hyperparameters and further optimize results.
  5. Gradient boosting was selected as the best model based on its F1 score on the test dataset, after tuning with 2-fold cross-validation.

Given the strong model performance, I wouldn’t try any other model. Rather, I would extend the existing analysis, for example by adding a feature ranking to gain insight into how important each feature is for predictive performance. The feature importance scores returned by the model can help identify the causes of churn, and Sparkify can then work on addressing them. With a larger dataset, there would be several things to revisit, such as which features to engineer and which models and hyperparameters to tune in training.


There are a few potential improvements for the future:

  1. Collect more user data.
    We could create additional metrics, such as the number of times a user logged in per month and the number of times a user upgraded or downgraded their service, and add demographic information to improve the accuracy of the model’s predictions.
  2. Try other models and experiments.
    XGBoost and LightGBM could be good supervised learning approaches to try here. Another option is to run A/B tests to decide which action to take.
  3. Build a recommendation engine.
    With the additional data mentioned above, we could build a recommendation engine using collaborative filtering, which identifies similarity between users based on the songs, artists, and genres they enjoy, and provide personalized song and artist recommendations to improve the user experience.

The source code for this project is available in my GitHub repository.



