You Are Most Likely Doing A/B Testing Wrong!

With the introduction of enterprise optimization platforms, it has never been easier for organizations to build and run optimization programs. Before tools like Adobe Target, Optimizely, and Google Optimize, when a marketing team wanted to run a test, they’d have to venture down into the basement to convince one of the stats geeks to help them out. Now they have the power in their hands to design and launch tests, all from the comforts of the 7th Floor.

However, this newfound comfort came with a hidden danger: by removing much of the statistical rigor that historically went into designing A/B testing activities, technical marketers, optimization managers, and analytics implementors fell victim to P-Hacking.


So what exactly is P-Hacking? P-Hacking is manipulating data or analyses to artificially get significant p-values.

Ok, so why is P-Hacking potentially so dangerous? Well, before we answer that question, let’s first understand a bit more about how popular A/B testing solutions work.

Fine, so let’s start with the value that is at the heart of all this controversy: what is a p-value? A p-value, or ‘probability value’, helps you determine the significance of your results. Most of the popular enterprise A/B testing platforms use what is called “Null Hypothesis Significance Testing,” in which the p-value is used to decide whether we should Reject or Fail to Reject the Null hypothesis (in A/B testing terms, the hypothesis that the alternative performs no differently than the Control). This binary decision process leads to four possible scenarios:

                          Null is True                      Null is False
  Fail to Reject Null     Correct (no winner declared)      Type II Error (false negative)
  Reject Null             Type I Error (false positive)     Correct (“your variation is a winner!”)

Out of these four scenarios, marketers who expect to see a relationship are hoping for the bottom-right result: this is our “hey, your alternative variation is a winner!” scenario. The top-left is also a valid conclusion, albeit one most marketers don’t hope to see, in which we correctly fail to reject a Null that is true.

The other two scenarios are the “bad” ones: Type I Errors, sometimes called “false positives,” where the Null is True but we incorrectly reject it, and Type II Errors, or “false negatives,” where the Null is False but we incorrectly fail to reject it.

I’m going to use Adobe Target as my example, not because it’s an outlier (it functions similarly to the rest of the popular enterprise solutions) but because it’s the solution I have the most experience with.

In Adobe Target, there is a built-in “Alpha,” or Significance Level, of 0.05: the probability of making a Type I error. It comes from the 95% Confidence Level that most organizations use (alpha = 1 - 0.95 = 0.05) to determine that we have reached “statistical significance” and can officially call a winner. Here is where the p-value comes into play: if we get a p-value greater than 0.05, the course of action is to fail to reject the Null, because we haven’t reached the statistical significance threshold.

NOTE: Testing at a 95% confidence level means you have a 5% chance of detecting a statistically significant lift even when there is no real difference between the offers.
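To make the reject / fail-to-reject decision concrete, here is a minimal Python sketch of one common way such a test is computed: a two-proportion z-test. The visitor and conversion counts are made-up illustrative numbers, and this is not Adobe Target’s internal implementation, just the textbook calculation.

```python
# A minimal sketch of Null Hypothesis Significance Testing for an A/B test:
# a two-proportion z-test. The visitor and conversion counts are invented
# purely for illustration; they are not from any real test.
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the two-sided p-value for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under the Null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))                        # two-sided p-value

ALPHA = 0.05                                          # 95% confidence level
p_value = two_proportion_z_test(conv_a=520, n_a=10_000, conv_b=585, n_b=10_000)

if p_value < ALPHA:
    print(f"p = {p_value:.4f} -> reject the Null (the variation 'wins')")
else:
    print(f"p = {p_value:.4f} -> fail to reject the Null")
```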

We now know that we have a 5% chance of generating a Type I error with our A/B test: rejecting the control and adopting the alternative even though there really is no difference between the two. What we may not know is that this number can be greatly inflated by how we design our A/B test and analysis plan. With every additional alternative version AND with every segment that is applied to the A/B test results, we increase the likelihood of at least one Type I, or False Positive, Error. This combined probability is called the “Family Wise Error Rate.”

The Family Wise Error Rate can become a problem even when we are running an A/B test with just a Control and one Alternate Version, depending on how we do our analysis. Let’s say we take the test results and then segment them into many different buckets: California vs. Non-California, Male vs. Female, Millennials vs. Non-Millennials. Each of these segments is essentially another comparison and must be accounted for in the Family Wise Error Rate, because each one inflates our Type I errors.

Take an A/B test with a Control and one Alternate Version. Our Type I Error Rate is calculated as 1 - (Confidence Level ^ Number of Comparisons), or 1 - (0.95^1) ≈ 0.05, which means that 5% of the time we will get a statistically significant result just by chance.

Now let’s look at what happens when we overlay each of the segments above. The overall comparison plus the three segment splits gives four comparisons, so our Type I Error Rate is now 1 - (0.95^4) ≈ 0.185, which means that 18.5% of the time we will get at least one statistically significant result just by chance.

NOTE: You can account for Family Wise Errors with a Bonferroni Correction when you are designing your analysis plan.
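To put numbers to the Family Wise Error Rate and the Bonferroni Correction, here is a short Python sketch that simply reproduces the arithmetic above. It is my own illustration, not something built into any testing tool.

```python
# Family Wise Error Rate and the Bonferroni correction: a small sketch that
# reproduces the arithmetic above.

ALPHA = 0.05  # per-comparison significance level at a 95% confidence level

def family_wise_error_rate(alpha: float, comparisons: int) -> float:
    """Probability of at least one false positive across all comparisons."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_alpha(alpha: float, comparisons: int) -> float:
    """Per-comparison threshold that keeps the family-wise rate near alpha."""
    return alpha / comparisons

print(family_wise_error_rate(ALPHA, 1))   # ~0.05  -> one comparison
print(family_wise_error_rate(ALPHA, 4))   # ~0.185 -> overall result + 3 segment splits
print(bonferroni_alpha(ALPHA, 4))         # 0.0125 -> require p < 0.0125 per comparison
```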


Ok, let’s get back to P-Hacking and dive into why it is happening. P-Hacking happens when analyses are chosen based on what makes the p-value significant, not on what makes the best analysis plan. Let me repeat that, because it is the most important part of this article: P-Hacking happens when analyses are chosen based on what makes the p-value significant, not on what makes the best analysis plan.

Before we go any further, let me state that I don’t believe P-Hacking in the digital analytics industry is widely happening because of malicious analysts. More often it is the result of analysts who lack experience, combined with popular A/B testing solutions that make P-Hacking easy to do without realizing it. So if analysts aren’t intentionally being malicious when concluding A/B testing results, what is driving it?

Limited Statistical Ability

The statistical experience of the analyst often comes into play, especially in the digital analytics industry, an industry that produces a tremendous amount of helpful content on configuring and deploying MarTech solutions but very little on applied statistics and analysis frameworks. Many of our analysts simply have not gained the statistical experience needed to identify when P-Hacking is happening, so it happens right before their eyes and no one notices or challenges the conclusions.

Enterprise A/B Testing Solutions

Many of the world’s most popular A/B testing platforms have gradually reduced statistical rigor over the years as a direct response to buyer demands. While many of these vendors recognize the need for proper test planning and statistical rigor, they often hide that recommendation in the small print, somewhere in their vast library of documentation. Combine an A/B testing solution that incentivizes users to end tests prematurely with an inexperienced analyst, and P-Hacking is inevitable.

Pleasing the Boss

How many optimization managers have been given quarterly goals like “we will run X number of tests per quarter”? These goals, which are often tied to financial bonuses and promotion possibilities, are often reached by P-Hacking. Again, I don’t believe the analyst is purposefully P-Hacking to get a bonus, but because their goals are based on the volume of tests, tests are often poorly designed and stopped before the proper time horizon, which results in unintentional P-Hacking.
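To see how stopping early drives unintentional P-Hacking, here is a small simulation sketch of an A/A test (two identical versions, so the Null is true) that is peeked at once a day and stopped the first time p < 0.05. The traffic and conversion numbers are invented for illustration; the point is that the false positive rate lands well above the nominal 5%.

```python
# Simulation sketch: "peeking" at an A/A test (no real difference between
# versions) once a day and stopping the first time p < 0.05. The traffic and
# conversion-rate numbers are invented for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeked_aa_test(days=30, visitors_per_day=500, rate=0.05, alpha=0.05):
    """Return True if the test is (wrongly) called significant at any daily peek."""
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(days):
        conv_a += rng.binomial(visitors_per_day, rate)
        conv_b += rng.binomial(visitors_per_day, rate)
        n_a += visitors_per_day
        n_b += visitors_per_day
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        if se == 0:
            continue
        z = (conv_b / n_b - conv_a / n_a) / se
        if 2 * norm.sf(abs(z)) < alpha:
            return True   # stopped early and declared a "winner" by chance
    return False

false_positives = sum(peeked_aa_test() for _ in range(2_000))
print(f"False positive rate with daily peeking: {false_positives / 2_000:.1%}")
# Typically lands well above the nominal 5% false positive rate.
```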


Great, what does all of this mean for my A/B testing practice? 

  1. When we fail to properly design an analysis plan, we increase the likelihood of Type I and Type II errors.
  2. Slicing the results into new segments after the test has concluded is effectively adding new variations, which inflates your Type I Error Rate via the Family Wise Error Rate described above.
  3. In a UI such as Adobe Target, the reporting interface will show you a winning version (Green Star) once n > 30 AND the p-value is < 0.05, regardless of your analysis plan.
  4. In a UI such as Adobe Target, the reporting interface will show a winning badge and a green “WINNING VERSION” banner once the p-value is < 0.05 and there is a significant number of conversions (Adobe does not define what this calculation is; you just have to trust them), regardless of your analysis plan, often resulting in tests being stopped, and winners crowned, prematurely.

This means that your favorite A/B testing tool could be leading you into committing P-Hacking without you even realizing it.

At best, the result of this rampant P-Hacking is that we as analysts are providing insights and recommendations that cannot be backed up scientifically. At worst, we are collectively costing companies hundreds of millions of dollars by deploying “winning variations,” winners selected by manipulating data or analyses to artificially get significant p-values, that either have no real impact or, worse, perform significantly worse than the control.

So, how do we solve this?

We can solve this by doing right by our companies, our clients, and our industry. And by doing right, I mean three things:

  1. Committing to building an A/B testing culture that is rooted in statistical rigor and ethics.
  2. Committing to investing time upfront to properly design your A/B testing analysis plan for EVERY test you plan to deploy into the market.
  3. Committing to sizing your test population and time horizon BEFORE you launch a test into market AND committing not to conclude a test before your defined time horizon has been met.


Building Statistical Rigor

There are lots of great resources out there, both online and offline, for learning the basic statistics you need to run a proper A/B testing practice within your organization. If you have budget, prioritize providing statistical training for your team. If you are tasked with building and analyzing A/B tests and feel like a better grasp of statistics would make you a better A/B tester, make the case to your boss for how important this investment is to both you and your organization.

Properly Design Your Analysis Plan

Invest the time upfront to really think through your plan. What is our hypothesis? How many alternatives will we want to test? What segments will we need to use in evaluating the test? All too often, the allure of building a new version on the fly in your favorite A/B testing solution causes people to simply skip right over this step without giving it a second thought. Don’t.
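One practical way to force yourself through those questions is to write the plan down as a small, structured artifact before you touch the testing tool. The sketch below is just one hypothetical way to capture such a plan in Python; the field names, example values, and the comparison-counting rule are my own assumptions, not an industry standard.

```python
# A sketch of a pre-registered analysis plan, written down BEFORE the test is
# built. Field names and example values are illustrative only.
from dataclasses import dataclass, field

@dataclass
class AnalysisPlan:
    hypothesis: str                        # what we expect and why
    primary_metric: str                    # the one metric that decides the test
    alternatives: int                      # number of non-control versions
    segments: list[str] = field(default_factory=list)  # planned post-test cuts
    confidence_level: float = 0.95
    minimum_detectable_lift: float = 0.05  # smallest lift worth detecting
    sample_size_per_version: int = 0       # filled in from a sizing calculator
    time_horizon_days: int = 0             # do not call a winner before this

    @property
    def comparisons(self) -> int:
        # each alternative is compared to control overall and once per planned segment
        return self.alternatives * (1 + len(self.segments))

    @property
    def bonferroni_alpha(self) -> float:
        return (1 - self.confidence_level) / self.comparisons

plan = AnalysisPlan(
    hypothesis="A shorter checkout form will lift completion rate",
    primary_metric="checkout_completion_rate",
    alternatives=1,
    segments=["California vs. Non-California", "Millennials vs. Non-Millennials"],
)
print(plan.comparisons, round(plan.bonferroni_alpha, 4))  # 3 comparisons, p < 0.0167
```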

Size Your Test

There are many online calculators that you can use to help size your test population and time horizons. Adobe has a great calculator that I strongly suggest you use if you are an Adobe Target shop.

[Screenshot: the Adobe Target Sample Size Calculator]

This calculator will help you determine your sample size and time horizon, i.e. HOW LONG DO I NEED TO RUN THIS TEST?
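If you would rather script this step than use a calculator UI, the standard normal-approximation formula for comparing two conversion rates is straightforward to implement. This is a sketch with made-up baseline, lift, and traffic numbers, not a replacement for your vendor’s calculator.

```python
# Sample-size sketch for a conversion-rate A/B test, using the standard
# normal-approximation formula for comparing two proportions. The baseline
# rate, target lift, and daily traffic are invented example numbers.
from math import ceil
from scipy.stats import norm

def sample_size_per_version(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed in EACH version to detect `relative_lift` over `baseline`."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)          # two-sided test at 95% confidence
    z_beta = norm.ppf(power)                   # 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n = sample_size_per_version(baseline=0.05, relative_lift=0.10)
daily_visitors_per_version = 2_000
print(f"{n:,} visitors per version")
print(f"~{ceil(n / daily_visitors_per_version)} days before you can call the test")
```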


CONCLUSION: I hope that I have given you something important to think about and that you are now motivated to evaluate how you are doing A/B testing today. Can we at least commit to thinking a little deeper and being a bit more deliberate in how we collectively do optimization testing?

Finally, I don’t claim to be an expert in this space by any means. If I have misstated a conclusion about a particular statistical method or metric, please correct me; I want to learn. If I have misrepresented in any way how enterprise A/B testing tools work, Adobe Target specifically, please correct me.


The more we know, the more valuable we become. Never stop learning!


Jason Thompson is the co-founder and CEO of 33 Sticks. In addition to being an amateur chef and bass player, he can also eat large amounts of sushi. As an analytics and optimization expert, he brings over 15 years of data experience, going back to being part of the original team at Omniture.

