The data used for this study comes from the following source: https://github.com/etomaa/A-B-Testing/blob/master/data/Website%20Results.csv
It is straightforward and clean. We will encode the 'True' and 'False' values as '1' and '0' below to make things a bit easier on ourselves.
The data is described as having been collected from two websites, which are tagged as variants A and B.
The main points of interest here are the variant, whether or not there was a conversion, and the revenue of that conversion.
import pandas as pd
import seaborn as sns
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import os
print(os.getcwd())
path = r"F:\[Personal]\Data Analytics Portfolio\A-B Testing"
os.chdir(path)
print(os.getcwd())
C:\Users\Tyler
F:\[Personal]\Data Analytics Portfolio\A-B Testing
ab = pd.read_csv(r'F:\[Personal]\Data Analytics Portfolio\A-B Testing\Website Results.txt', sep = ',', header = 0)
ab.head()
| | variant | converted | length_of_stay | revenue |
|---|---|---|---|---|
| 0 | A | False | 0 | 0.0 |
| 1 | A | False | 0 | 0.0 |
| 2 | A | False | 0 | 0.0 |
| 3 | A | False | 0 | 0.0 |
| 4 | A | False | 0 | 0.0 |
ab.shape
(1451, 4)
# encode converted column to be binary, 1 = converted; pandas parses the
# True/False values as booleans, so a simple integer cast does the job
ab['converted'] = ab['converted'].astype(int)
# calculate baseline conversion rate
conversion_rate = (ab[ab['variant'] == 'A']['converted'].sum()/ ab[ab['variant'] == 'A']['converted'].count())*100
print(conversion_rate)
2.7739251040221915
In this dataset, we already have data for both variants. Let's calculate what the sample size ought to be and compare against what is in our data.
print(ab[ab['variant'] == 'A']['converted'].count())
print(ab[ab['variant'] == 'B']['converted'].count())
721
730
We define a function for the minimum required sample size per group and use it to calculate what is needed assuming a relative uplift of 10% in conversion rate. Note that the z-values supplied are already the critical values themselves (1.96 for a two-sided alpha of 0.05, 0.84 for 80% power), and the conversion rates must be proportions rather than percentages.
def get_sample_size(z_alpha, z_beta, p1, p2):
    # z_alpha is the two-sided critical value itself (1.96 for alpha = 0.05),
    # so it is not divided by 2 here
    n = ((z_alpha + z_beta)**2 * (p1*(1-p1) + p2*(1-p2))) / ((p1-p2)**2)
    return n
baseline = conversion_rate / 100  # convert the percentage back to a proportion
sample_size = get_sample_size(1.96, 0.84, baseline, baseline * 1.1)
print(round(sample_size))
57620
Our sample of roughly 730 per variant falls far short of the ~57,600 observations per group this calculation requires, so the study is underpowered to detect a 10% relative uplift in conversion. A 10% relative increase is a conservative assumption, however; the uplift actually observed in this data turns out to be far larger, which is why the test below can still reach significance despite the modest sample.
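To see how sensitive the requirement is to the assumed uplift, here is a quick standard-library sketch. The function name `required_n` and the baseline of ~2.77% are illustrative choices; the formula mirrors the standard two-proportion sample-size calculation.

```python
# Sketch: required sample size per group as the assumed relative uplift grows.
# z_alpha is the two-sided critical value for alpha = 0.05; z_beta is the
# value for 80% power.
def required_n(p1, relative_uplift, z_alpha=1.96, z_beta=0.84):
    p2 = p1 * (1 + relative_uplift)
    return ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2

baseline = 0.0277  # approximate control conversion rate from above
for uplift in (0.10, 0.25, 0.50, 1.00):
    print(f"{uplift:.0%} uplift -> about {required_n(baseline, uplift):,.0f} per group")
```

The required sample shrinks roughly with the square of the effect size, which is why small assumed uplifts demand enormous samples.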
The first thing we want to do is compute our p-value: the probability of observing a result at least as extreme as the one we observed, assuming the null hypothesis is true. A low p-value means there is strong evidence against the null hypothesis (i.e. the variant, and not chance, produced the change we are observing). If the change in conversion rate is statistically significant, we reject the null hypothesis that there is no difference between the variants.
# calculate p-value; we know the control and test group sizes from above, and
# we restate the conversion rates here as proportions (without the *100)
from scipy import stats
# restating calculations for convenience
control_conv = ab[ab['variant'] == 'A']['converted'].sum() / ab[ab['variant'] == 'A']['converted'].count()
test_conv = ab[ab['variant'] == 'B']['converted'].sum() / ab[ab['variant'] == 'B']['converted'].count()
control_size = ab[ab['variant'] == 'A']['converted'].count()
test_size = ab[ab['variant'] == 'B']['converted'].count()
def get_pvalue(control_conv, test_conv, control_size, test_size):
    # two-sided z-test on the difference of two proportions (unpooled variance)
    lift = -abs(test_conv - control_conv)
    scale_one = control_conv * (1 - control_conv) / control_size
    scale_two = test_conv * (1 - test_conv) / test_size
    scale_val = (scale_one + scale_two)**0.5
    p_value = 2 * stats.norm.cdf(lift, loc=0, scale=scale_val)
    return p_value
# calculate p-value
p_value = get_pvalue(control_conv, test_conv, control_size, test_size)
print(round(p_value, 4))
0.024
This p-value falls below the conventional 0.05 threshold, indicating moderately strong evidence against the null hypothesis. Next, we will calculate the confidence intervals.
def get_ci(test_conv, control_conv, test_size, control_size, ci):
    sd = (test_conv * (1 - test_conv) / test_size + control_conv * (1 - control_conv) / control_size)**0.5
    lift = test_conv - control_conv
    val = stats.norm.isf((1 - ci) / 2)
    lwr_bnd = lift - val * sd
    upr_bnd = lift + val * sd
    return (lwr_bnd, upr_bnd)
# calculate cis with ci = 0.95
ci = get_ci(test_conv, control_conv, test_size, control_size, 0.95)
print(f"({ci[0]:.4f}, {ci[1]:.4f})")
(0.0030, 0.0429)
print(round(control_conv, 4))
print(round(test_conv, 4))
0.0277
0.0507
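As a sanity check, the same p-value and interval can be reproduced with only the standard library. The conversion counts here (20 of 721 in control, 37 of 730 in test) are inferred from the figures above:

```python
import math
from statistics import NormalDist

# Stdlib cross-check of the two-proportion z-test and its 95% interval;
# counts inferred from the rates and group sizes reported above.
x1, n1, x2, n2 = 20, 721, 37, 730
p1, p2 = x1 / n1, x2 / n2
sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = abs(p2 - p1) / sd
p_value = 2 * NormalDist().cdf(-z)   # two-sided p-value
crit = NormalDist().inv_cdf(0.975)   # two-sided 95% critical value (~1.96)
lower = (p2 - p1) - crit * sd
upper = (p2 - p1) + crit * sd
print(round(p_value, 4), round(lower, 4), round(upper, 4))
```

Because the interval excludes zero, it agrees with the p-value that the difference is significant at the 95% level.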
When formalizing reporting, it is a good idea to report the conversion rate for each group, the absolute and relative lift, and the confidence interval around the lift.
We also note that these figures are significant at the 95% confidence level, which is a standard confidence level but could be adjusted as needed.
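A minimal sketch of such a summary, using the conversion counts from above (20 of 721 in control, 37 of 730 in test); the layout is just one hypothetical format:

```python
# Hypothetical reporting summary; counts taken from the analysis above.
control_conversions, control_size = 20, 721
test_conversions, test_size = 37, 730
control_rate = control_conversions / control_size
test_rate = test_conversions / test_size
lift = test_rate - control_rate
print(f"Control conversion rate: {control_rate:.2%}")
print(f"Test conversion rate:    {test_rate:.2%}")
print(f"Absolute lift:           {lift:.2%}")
print(f"Relative lift:           {lift / control_rate:.1%}")
```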
mean_con = ab[ab['variant'] == 'A']['revenue'].mean()
print(mean_con)
9.102149791955616
mean_test = ab[ab['variant'] == 'B']['revenue'].mean()
print(mean_test)
337518048237.48334
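A mean of roughly 3.4e11 for the test group is a red flag: a revenue column like this is almost certainly dominated by one or more extreme outliers (or a data-entry error), and the mean is not a robust summary. A toy illustration with made-up numbers shows how a single extreme value moves the mean but not the median:

```python
import statistics

# Made-up numbers: nine zero-revenue visits plus one extreme value.
revenues = [0.0] * 9 + [1_000_000.0]
print(statistics.mean(revenues))    # the outlier drags the mean up to 100000.0
print(statistics.median(revenues))  # the median stays at 0.0
```

Reporting the median (or a trimmed mean) alongside the mean makes this kind of distortion visible.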
# extract data for control and test groups
var = ab[ab.variant == 'B']
con = ab[ab.variant == 'A']
# create a histogram for the revenue distribution
sns.histplot(var['revenue'], color='green', alpha=0.8, bins=10, label='Test')
sns.histplot(con['revenue'], color='blue', alpha=0.8, bins=10, label='Control')
plt.legend(loc='upper right')
plt.xlabel('Revenue')
plt.ylabel('Count')
plt.title('Distribution of Revenue by Group')
plt.show()
# create a boxplot for the revenue distribution
sns.boxplot(x='variant', y='revenue', data=ab)
plt.xlabel('Group')
plt.ylabel('Revenue')
plt.title('Distribution of Revenue by Group')
plt.show()
Suffice it to say, the test results dwarf the control results in this histogram. The revenue distributions are extremely skewed: most visitors generate zero revenue, while a handful of extreme values in the test group dominate the mean, so a normal distribution is a poor model and an ineffective basis for visualization. The box plot makes the skew explicit, and it points toward nonparametric (rank-based) tests as the appropriate way to formally compare revenue between the two independent groups.
It is clear from these results that the test treatment had a substantial, statistically significant effect on conversion compared to the control group. The revenue comparison should be interpreted with more caution: the test group's enormous mean is driven by a few extreme values, which is why it was impossible to visualize with conventional means (a normal distribution).
This was a great exercise in establishing test size for an A/B test and then analyzing the results. One source of improvement would be a far larger dataset, which would counter the difficulties we faced in visualizing the revenue results and comparing them in a parametric fashion.
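As a follow-up (not part of the original analysis), a rank-based test such as Mann-Whitney U sidesteps the outlier problem when comparing revenue between two independent groups, since it depends only on ranks. A from-scratch sketch of the U statistic, with made-up samples:

```python
# From-scratch Mann-Whitney U statistic (mid-ranks for ties); the sample
# values below are made up purely for illustration.
def mann_whitney_u(a, b):
    combined = sorted([(v, "a") for v in a] + [(v, "b") for v in b])
    values = [v for v, _ in combined]
    ranks = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j for tied values
        for k in range(i, j):
            ranks[k] = mid_rank
        i = j
    rank_sum_a = sum(ranks[k] for k, (_, g) in enumerate(combined) if g == "a")
    return rank_sum_a - len(a) * (len(a) + 1) / 2

print(mann_whitney_u([0, 0, 1, 2], [0, 3, 4, 5]))
```

In practice one would reach for `scipy.stats.mannwhitneyu` rather than hand-rolling this, but the sketch shows why extreme revenue values cannot distort the statistic.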