A/B Testing

Introductory Remarks & Comments

The data used for this study comes from the following source: https://github.com/etomaa/A-B-Testing/blob/master/data/Website%20Results.csv

It is straightforward and clean. We will encode the 'True' and 'False' values as '1' and '0' below to make things a bit easier on ourselves.
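The encoding works because Python already treats booleans as integers in arithmetic, which is also what makes the conversion-rate sums below behave. A minimal illustration:

```python
# Python treats True as 1 and False as 0 in arithmetic,
# so a conversion rate is just the mean of a boolean list
converted = [True, False, False, True, False]
rate = sum(converted) / len(converted)
print(rate)  # → 0.4
```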

The data is described as having been collected from two websites, which are tagged as variants A and B.

The main points of interest here are the variant, whether or not there was a conversion, and the revenue of that conversion.

Data Exploration and Preparation

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import os
In [2]:
print(os.getcwd())
path = r"F:\[Personal]\Data Analytics Portfolio\A-B Testing"
os.chdir(path)
print(os.getcwd())
C:\Users\Tyler
F:\[Personal]\Data Analytics Portfolio\A-B Testing
In [3]:
ab = pd.read_csv(r'F:\[Personal]\Data Analytics Portfolio\A-B Testing\Website Results.txt', sep = ',', header = 0)
In [12]:
ab.head()
Out[12]:
variant converted length_of_stay revenue
0 A False 0 0.0
1 A False 0 0.0
2 A False 0 0.0
3 A False 0 0.0
4 A False 0 0.0
In [13]:
ab.shape
Out[13]:
(1451, 4)
In [5]:
# encode converted column as binary, 1 = converted
# (read_csv parses True/False as booleans, so replace() with string keys
# would be a no-op; cast the dtype explicitly instead)
ab['converted'] = ab['converted'].astype(int)

# calculate baseline conversion rate
conversion_rate = (ab[ab['variant'] == 'A']['converted'].sum()/ ab[ab['variant'] == 'A']['converted'].count())*100
print(conversion_rate)
2.7739251040221915

In this dataset, we already have data for both variants. Let's calculate what the sample size ought to be and compare against what is in our data.

In [8]:
print(ab[ab['variant'] == 'A']['converted'].count())
print(ab[ab['variant'] == 'B']['converted'].count())
721
730

Generating "Ideal" Sample Size

We define a function for the minimum required sample size per variant and use it to calculate the sample size needed, assuming a 10% relative uplift in conversion rate.

In [17]:
def get_sample_size(z_alpha, z_beta, p1, p2):
    # z_alpha should already be the two-sided critical value (1.96 for alpha = 0.05)
    n = ((z_alpha + z_beta)**2 * (p1*(1-p1) + p2*(1-p2))) / ((p1-p2)**2)
    return n

# conversion_rate is a percentage, so convert it to a proportion first
p1 = conversion_rate / 100
sample_size = get_sample_size(1.96, 0.84, p1, p1 * 1.1)
print(round(sample_size))
57620

Our per-variant sample sizes (721 and 730) fall far short of the roughly 57,600 observations needed to detect a 10% relative uplift at 80% power, so this test is underpowered for an effect that small; a low baseline conversion rate combined with a modest relative uplift demands a very large sample. As we will see, the observed uplift is considerably larger than 10%, which is why the test still reaches significance, but results from a sample this size should be interpreted with caution.
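As a sanity check on this relationship, here is a self-contained version of the standard two-proportion sample-size formula (assuming z_alpha/2 = 1.96 and z_beta = 0.84, i.e. 95% confidence and 80% power) applied to a hypothetical 5% baseline rate:

```python
def required_n(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Minimum per-variant sample size for a two-proportion z-test."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2

# even a 5% baseline with a 10% relative uplift (5% -> 5.5%)
# needs roughly 31,000 users per arm
print(round(required_n(0.05, 0.055)))  # → 31195
```

Note how quickly the requirement falls as the expected uplift grows: the denominator scales with the squared difference in rates.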

Interpreting & Analyzing A/B Test Results

The first thing we want to do is compute our p-value: the probability of observing a difference at least as extreme as the one we observed, assuming the null hypothesis is true. A low p-value means strong evidence against the null hypothesis, i.e., evidence that the variant, and not random chance, produced the change we are observing. If the change in conversion rate is statistically significant, we reject the null hypothesis (that there is no difference between the variants).
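To make the mechanics concrete before running them on our data, here is a minimal two-sided p-value computed from a z-score using only the standard library (the normal CDF expressed through the error function); the familiar critical value z = 1.96 should give p ≈ 0.05:

```python
import math

def normal_cdf(x):
    # Phi(x) written in terms of the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_sided_p(z):
    # probability of a result at least as extreme as |z| under the null
    return 2 * normal_cdf(-abs(z))

print(round(two_sided_p(1.96), 3))  # → 0.05
```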

In [18]:
from scipy import stats

# calculate conversion rates and sample sizes for both groups
control_conv = ab[ab['variant'] == 'A']['converted'].sum() / ab[ab['variant'] == 'A']['converted'].count()
test_conv = ab[ab['variant'] == 'B']['converted'].sum() / ab[ab['variant'] == 'B']['converted'].count()
control_size = ab[ab['variant'] == 'A']['converted'].count()
test_size = ab[ab['variant'] == 'B']['converted'].count()

def get_pvalue(control_conv, test_conv, control_size, test_size):
    # two-sided z-test for the difference of two proportions
    lift = -abs(test_conv - control_conv)
    scale_one = control_conv * (1 - control_conv) / control_size
    scale_two = test_conv * (1 - test_conv) / test_size
    scale_val = (scale_one + scale_two)**0.5
    p_value = 2 * stats.norm.cdf(lift, loc=0, scale=scale_val)
    return p_value

# calculate p-value
p_value = get_pvalue(control_conv, test_conv, control_size, test_size)
print(round(p_value, 4))
0.024

A p-value of about 0.02 falls below the conventional 0.05 threshold, so we have statistically significant evidence against the null hypothesis. Next, we will calculate the confidence interval for the lift.

In [23]:
def get_ci(test_conv, control_conv, test_size, control_size, ci):
    # normal-approximation confidence interval for the difference of two proportions
    sd = (test_conv * (1 - test_conv) / test_size + control_conv * (1 - control_conv) / control_size)**0.5
    lift = test_conv - control_conv

    val = stats.norm.isf((1 - ci) / 2)
    lwr_bnd = lift - val * sd
    upr_bnd = lift + val * sd

    return (lwr_bnd, upr_bnd)

# calculate the confidence interval with ci = 0.95
ci = get_ci(test_conv, control_conv, test_size, control_size, 0.95)
print(tuple(round(x, 4) for x in ci))
(0.003, 0.0429)
In [24]:
print(round(control_conv, 4))
print(round(test_conv, 4))
0.0277
0.0507

When formalizing reporting, it is good practice to report the following metrics for both the test and control groups:

  • Sample size = 721 control, 730 test
  • Run time (which we unfortunately do not have)
  • Mean =
  • Variance =
  • Estimated lift =
  • Confidence Interval = +/-

We also note that these figures are significant at the 95% confidence level, which is a standard confidence level but could be adjusted as needed.
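To illustrate how the critical value scales with the chosen confidence level, here is a small standard-library sketch equivalent to the `stats.norm.isf((1 - ci) / 2)` call used above:

```python
from statistics import NormalDist

def critical_value(ci):
    # two-sided critical z-score for a given confidence level
    return NormalDist().inv_cdf(1 - (1 - ci) / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(critical_value(level), 3))
# 0.90 → 1.645, 0.95 → 1.96, 0.99 → 2.576
```

Widening the confidence level widens the interval: a 99% CI on the same data would be about 31% wider than the 95% CI.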

Visualizing our Results

In [47]:
mean_con = ab[ab['variant'] == 'A']['revenue'].mean()
print(mean_con)
9.102149791955616
In [48]:
mean_test = ab[ab['variant'] == 'B']['revenue'].mean()
print(mean_test)
337518048237.48334
In [57]:
import matplotlib.pyplot as plt
import seaborn as sns

# extract data for control and test groups
var = ab[ab.variant == 'B']
con = ab[ab.variant == 'A']

# create a histogram for the revenue distribution
sns.histplot(var['revenue'], color='green', alpha=0.8, bins=10, label='Test')
sns.histplot(con['revenue'], color='blue', alpha=0.8, bins=10, label='Control')
plt.legend(loc='upper right')
plt.xlabel('Revenue')
plt.ylabel('Count')
plt.title('Distribution of Revenue by Group')
plt.show()

# create a boxplot for the revenue distribution
sns.boxplot(x='variant', y='revenue', data=ab)
plt.xlabel('Group')
plt.ylabel('Revenue')
plt.title('Distribution of Revenue by Group')
plt.show()

The test group dwarfs the control group in this histogram, but only because the revenue distribution is extremely skewed: at least one enormous revenue value in the test group pulls its mean (and variance) to astronomical levels, so treating revenue as approximately normal is uninformative here. The box plot is the more useful view, since medians and quartiles are far less sensitive to outliers than means are.

The conversion-rate analysis above indicates that the test treatment outperformed the control at a statistically significant level. The revenue comparison, however, should be treated cautiously: the difference in mean revenue is driven almost entirely by extreme outliers rather than by a broad shift in the distribution.
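As a toy illustration (made-up numbers, not from this dataset) of why the median is the safer summary when a distribution contains an extreme value:

```python
from statistics import mean, median

# hypothetical revenue samples: one extreme outlier in the test group
control = [0, 0, 12, 9, 0, 14]
test = [0, 0, 11, 10, 0, 3_000_000]

print(mean(control), median(control))  # mean and median broadly agree
print(mean(test), median(test))        # mean explodes, median stays grounded
```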

This was a great exercise in establishing sample size for an A/B test and then analyzing the results. One improvement would be a far larger dataset, which would give the test adequate power and make the revenue comparison less sensitive to outliers, allowing it to be analyzed in a parametric fashion.