Churn Analysis Using Banking Dataset¶

Comments and Introductory Remarks¶

This case study/project uses the 'Bank Turnover Dataset' which can be found at https://www.kaggle.com/datasets/barelydedicated/bank-customer-churn-modeling

The objective of the study is to showcase how machine learning models are optimized and chosen to solve churn problems, which apply to every business and are especially important to marketing and sales teams.

We will look at Logistic Regression and Random Forests Classifiers to examine which one performs better and how we optimize their performance.

In the end, we will look to make a conclusion about model fit, showcase how the model can actually be used in practice (with test data, but the same principle applies to new data), and make comments about the case study as a whole and what factors would need to be changed or added in order to make further improvements.

The dataset itself is fairly straightforward and the only thing I bothered to do prior to inputting into Python (other than inspecting the data beforehand for errors and cleanliness), was to get rid of the 'RowNumber' column in advance. Other columns are dropped as part of prep in Python.

Importing Packages, Setting up Environment, and Exploring Data¶

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import os
In [3]:
print(os.getcwd())
path = "F:\[Personal]\Data Analytics Portfolio\Churn Analysis\churn"
os.chdir(path)
print(os.getcwd())
C:\Users\Tyler
F:\[Personal]\Data Analytics Portfolio\Churn Analysis\churn
In [41]:
# disable warnings
import warnings
warnings.filterwarnings('ignore')
In [4]:
bank = pd.read_csv(r'F:\[Personal]\Data Analytics Portfolio\Churn Analysis\churn\bank.csv')
In [5]:
print(bank.head())
print(bank.dtypes)
bank.shape
   CustomerId   Surname  CreditScore Geography  Gender  Age  Tenure  \
0    15634602  Hargrave          619    France  Female   42       2   
1    15647311      Hill          608     Spain  Female   41       1   
2    15619304      Onio          502    France  Female   42       8   
3    15701354      Boni          699    France  Female   39       1   
4    15737888  Mitchell          850     Spain  Female   43       2   

     Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  \
0       0.00              1          1               1        101348.88   
1   83807.86              1          0               1        112542.58   
2  159660.80              3          1               0        113931.57   
3       0.00              2          0               0         93826.63   
4  125510.82              1          1               1         79084.10   

   Exited  
0       1  
1       0  
2       1  
3       0  
4       0  
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object
Out[5]:
(10000, 13)
In [6]:
bank.isna().sum()
Out[6]:
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
In [7]:
bank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       10000 non-null  int64  
 1   Surname          10000 non-null  object 
 2   CreditScore      10000 non-null  int64  
 3   Geography        10000 non-null  object 
 4   Gender           10000 non-null  object 
 5   Age              10000 non-null  int64  
 6   Tenure           10000 non-null  int64  
 7   Balance          10000 non-null  float64
 8   NumOfProducts    10000 non-null  int64  
 9   HasCrCard        10000 non-null  int64  
 10  IsActiveMember   10000 non-null  int64  
 11  EstimatedSalary  10000 non-null  float64
 12  Exited           10000 non-null  int64  
dtypes: float64(2), int64(8), object(3)
memory usage: 1015.8+ KB

We're going to use the corr() output to see if there are any columns we need to drop on the basis of their correlation. Basically, if similar columns are overly correlated, they don't add anything to the machine learning models and therefore can be dropped.

In [42]:
bank.corr()
Out[42]:
CustomerId CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
CustomerId 1.000000 0.005308 0.009497 -0.014883 -0.012419 0.016972 -0.014025 0.001665 0.015271 -0.006248
CreditScore 0.005308 1.000000 -0.003965 0.000842 0.006268 0.012238 -0.005458 0.025651 -0.001384 -0.027094
Age 0.009497 -0.003965 1.000000 -0.009997 0.028308 -0.030680 -0.011721 0.085472 -0.007201 0.285323
Tenure -0.014883 0.000842 -0.009997 1.000000 -0.012254 0.013444 0.022583 -0.028362 0.007784 -0.014001
Balance -0.012419 0.006268 0.028308 -0.012254 1.000000 -0.304180 -0.014858 -0.010084 0.012797 0.118533
NumOfProducts 0.016972 0.012238 -0.030680 0.013444 -0.304180 1.000000 0.003183 0.009612 0.014204 -0.047820
HasCrCard -0.014025 -0.005458 -0.011721 0.022583 -0.014858 0.003183 1.000000 -0.011866 -0.009933 -0.007138
IsActiveMember 0.001665 0.025651 0.085472 -0.028362 -0.010084 0.009612 -0.011866 1.000000 -0.011421 -0.156128
EstimatedSalary 0.015271 -0.001384 -0.007201 0.007784 0.012797 0.014204 -0.009933 -0.011421 1.000000 0.012097
Exited -0.006248 -0.027094 0.285323 -0.014001 0.118533 -0.047820 -0.007138 -0.156128 0.012097 1.000000

Nothing is worth dropping on the merits of corr() alone.

In [9]:
bank.columns
Out[9]:
Index(['CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age',
       'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
       'EstimatedSalary', 'Exited'],
      dtype='object')

Converting Categorical Columns to Numeric Columns¶

Keep only the columns which affect customer churn, drop the others as they are unnecessary in the model. This might seem counter-intuitive to sales people but we need to form a model and use it and are therefore not focused on the individual customer information present in the dataset. (ie. this needs to work for "any" customer)

In [10]:
dataset = bank.drop(['CustomerId', 'Surname'], axis = 1)

Since Machine Learning algorithms work best with numerical data, we need to do something about the data which is in 'Geography' and 'Gender' columns. First, let's isolate these columns from the dataset.

In [11]:
dataset = dataset.drop(['Geography', 'Gender'], axis = 1)

We need to one-hot encode the Geography column (the gender column will be binary so we can simply use a 1 or a 0 as needed).

In [12]:
Geography = pd.get_dummies(bank.Geography).iloc[:,1:]
Gender = pd.get_dummies(bank.Gender).iloc[:,1:]
In [13]:
dataset = pd.concat([dataset, Geography, Gender], axis = 1)
In [14]:
dataset.head()
Out[14]:
CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Germany Spain Male
0 619 42 2 0.00 1 1 1 101348.88 1 0 0 0
1 608 41 1 83807.86 1 0 1 112542.58 0 0 1 0
2 502 42 8 159660.80 3 1 0 113931.57 1 0 0 0
3 699 39 1 0.00 2 0 0 93826.63 0 0 0 0
4 850 43 2 125510.82 1 1 1 79084.10 0 0 1 0

Here, we verify that our alterations to the data worked and made sense.

Data Preprocessing¶

Data is ready for use but we still need to isolate the variable that we're predicting from the dataset (churn).

In [15]:
X = dataset.drop(['Exited'], axis = 1)
y = dataset['Exited']
In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

Logistic Regression¶

Begin by importing the necessary models. We perform a grid search in order to find the best hyperparameters to use with the model itself. If you don't do this, you're simply guessing and using trial/error which is inefficient.

In [17]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

# Define the logistic regression model
logreg = LogisticRegression(fit_intercept=False, solver='liblinear')

# Define the grid of hyperparameters to search over
param_grid = {'C': np.logspace(-5, 5, 11)}

# Define the scorer to use for evaluating the models
scorer = make_scorer(f1_score)

# Perform a grid search over the hyperparameters using cross-validation
grid_search = GridSearchCV(logreg, param_grid=param_grid, cv=5, scoring=scorer)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
Best parameters: {'C': 0.001}
Best score: 0.07083600332803872

With our best parameters in hand, we instantiate our model using them before fitting it to the training data.

In [29]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(fit_intercept = False, C = 0.001, solver = 'liblinear')
logreg.fit(X_train, y_train)
Out[29]:
LogisticRegression(C=0.001, fit_intercept=False, solver='liblinear')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(C=0.001, fit_intercept=False, solver='liblinear')
In [30]:
y_pred = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)

Find residual differences between train data and predicted train data, and figure out the amount of times our model was correct.

In [31]:
residuals = np.abs(y_train, y_pred)
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize = True))
0    6356
1    1644
Name: Exited, dtype: int64
0    0.7945
1    0.2055
Name: Exited, dtype: float64

80% isn't bad for a first try!

Now, we will look at the common key metrics for these types of models and ascertain if there are any issues.

In [32]:
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

precision_train = precision_score(y_train, y_pred)
precision_test = precision_score(y_test, y_hat_test)

recall_train = recall_score(y_train, y_pred)
recall_test = recall_score(y_test, y_hat_test)

accuracy_train = accuracy_score(y_train, y_pred)
accuracy_test = accuracy_score(y_test, y_hat_test)

f1_train = f1_score(y_train, y_pred)
f1_test = f1_score(y_test, y_hat_test)
In [33]:
print(precision_train)
print(precision_test)

print(recall_train)
print(recall_test)

print(accuracy_train)
print(accuracy_test)

print(f1_train)
print(f1_test)
1.0
0.47540983606557374
1.0
0.0737913486005089
1.0
0.802
1.0
0.1277533039647577

With weak scores throughout the the '1.0' values for each training score, this indicates that the model is overfitted. With that said, we optimized parameters as best we could and the problem is the low amount of data in the set. This is a systemic issue for this case study because it is meant to be carried out on a home computer.

In [23]:
logreg.predict_proba(X_test)[:,1]
Out[23]:
array([0.22214244, 0.20307728, 0.24135997, ..., 0.15939585, 0.09307679,
       0.25307851])
In [24]:
y_pred_prob = logreg.predict_proba(X_test)[:,1]

Proba values are the practical application of the model - they are the actual preictions and are fairly weak for the most part.

Next, we examine the Receiver Operating Characteristic (ROC) curve to visualize the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR). TPR is the sensitivity or recall of the modeal and shows how many cases are properly identified as positives by the model, while the FPR represents the incorrect positive labels.

The Area Under the Curve (AUC), represents the accuracy of the model in the sense that it ranges from 0 to 1, where 0.5 is random guessing and 1.0 is perfect discrimination.

In [25]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot([0,1], [0,1], 'k--')
plt.show()

Here we see a fairly weak AUC which correlates with our comments above.

In [40]:
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, y_pred_prob)
print("AUC: {:.3f}".format(auc))
AUC: 0.675

Ultimately, the linear regression model overfits the data, despite using optimal parameters. The issue here is that there isn't a huge amount of data from which to train the model. A different model choice is probably better.

Random Forests¶

To begin, we take very similar steps as above. For some expanded learning and for curiosity, let's also use some generic parameters first to show how the better fitted paramaeters may vary in the end.

In [26]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 200, random_state = 42)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
In [27]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
              precision    recall  f1-score   support

           0       0.88      0.97      0.92      1607
           1       0.78      0.48      0.59       393

    accuracy                           0.87      2000
   macro avg       0.83      0.72      0.76      2000
weighted avg       0.86      0.87      0.86      2000

0.8705

Linear Regression was ~80% above compared to this. This is decent but let's try to refine it. Grid search is a good option to refine the parameters because, although it is computationally expensive for larger datasets, this dataset is smaller, so a more robust method ought to work fine in this case.

In [34]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Create a Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define the grid search parameters
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a grid search object
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 5, n_jobs = -1, verbose = 2)

# Fit the grid search object to the data
grid_search.fit(X_train, y_train)

# Print the best parameters found by the grid search
print(grid_search.best_params_)
Fitting 5 folds for each of 81 candidates, totalling 405 fits
{'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 300}

Run the model again with these enhanced parameters to see if there is any difference.

In [35]:
classifier = RandomForestClassifier(n_estimators = 300, max_depth = 30, min_samples_leaf = 1, min_samples_split = 10, random_state = 42)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
In [36]:
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
              precision    recall  f1-score   support

           0       0.88      0.96      0.92      1607
           1       0.75      0.47      0.58       393

    accuracy                           0.86      2000
   macro avg       0.82      0.72      0.75      2000
weighted avg       0.86      0.86      0.85      2000

0.865

Nearly identical. This means that there isn't much we can do to enhance these models on this size of dataset and the best thing we can do is to choose the best model for the problem. The model is now trained.

In [43]:
# predict probabilities of churn for new customers
y_prob = classifier.predict_proba(X_test)

# extract the probabilities for class 1 (churn) only
churn_probabilities = y_prob[:,1]

# predict churn for a new customer
new_customer_data = [2, 0, 35, 2, 50000, 2, 1, 1, 1, 2, 50000]
churn_probability = classifier.predict_proba([new_customer_data])[0, 1]
if churn_probability >= 0.5:
    print("New customer is likely to churn with probability", churn_probability)
else:
    print("New customer is unlikely to churn with probability", 1 - churn_probability)
New customer is likely to churn with probability 0.6174396429450608

This is, again, a fairly weak outcome although the model choice is better and the parameters have been tuned as best they can be.

Visualizing Top 10 Features for Predicting Churn¶

In [28]:
feature_importances = pd.Series(classifier.feature_importances_, index = X.columns)
feature_importances.nlargest(10).plot(kind = 'barh')
Out[28]:
<Axes: >

Which Model Performed Better?¶

In the end, both models were weak when it came to predicting the probability of customer churn, but with the benefit of appropriate parameter choices and optimizing for as much as we can given the small dataset, there isn't anything we could reasonably do to improve performance short of giving the model more data to work with.

Regardless, this was an excellent exercise to show how this might be applied to bigger datasets (and all done using computational power that could be expected of a home PC). The process and the steps are fully laid out and established.

In other words: This is a sample of what I can do with machine learning and greater input will lead to greater output.

To conclude the case study, the Random Forests model was better than the Logistic Regression model, which seemed insufficient to capture the greater detail in model that the Forests model did.

Comparing two is great for a case study and other models could be similarly compared although it may not be worth doing in this case unless the dataset was expanded.