A/B Testing - How do you know that your results are statistically valid?

Posted by: Jay Miller
Friday, July 8, 2011
An often-recommended best practice with email marketing, and online marketing in general, is to conduct frequent A/B testing.  ExactTarget recently added this feature to our application to make it easy for marketers to conduct simple A/B subject line and date/time tests.

However, an often-overlooked but critical part of testing is setting up a good test plan for evaluating the results, so you can tell whether they are statistically significant or just due to random chance.  The Get Elastic Ecommerce Blog recently had a good post about how to validate test data when doing webpage testing.  This post is for email marketers who want to apply a bit of science to email testing.

So let's say that you come up with a new great subject line that you want to test against your usual one, and you get these results:
  • Subject line A ("control" or "champion"):  20% open rate
  • Subject line B ("challenger"):  22% open rate
Great!  We have our winner: a 10% relative improvement (2 percentage points).  Subject line B is obviously the better one, and if we send that one to the rest of the list, then we will achieve our goal of world domination.

But wait; a 10% improvement is great, but how do we know it's not just due to random chance?  Well, if we don't do a little statistics, then we don't know for sure, and we're just practicing faith-based marketing.  Many times in testing tutorials, you will see people recommend the use of a "rule of thumb", like "take 10% of your list for the test, and send the winner to the rest".  I've actually seen this published in serious email marketing books, but rules of thumb aren't good enough to actually tell us if our results are real.
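
To get a feel for how much random chance alone can move an open rate, here is a minimal sketch in Python. It assumes both subject lines truly have the same 20% open rate and uses a normal approximation to estimate how often the two groups would still differ by 2 points or more. The function names and the sample sizes are illustrative choices of mine, not taken from any of the tools mentioned in this post.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def chance_of_gap(true_rate, n_per_group, gap):
    """Approximate probability that two groups with the SAME true open rate
    show observed open rates at least `gap` apart, in either direction.

    Uses the normal approximation to the difference of two proportions.
    """
    # Standard error of the difference between two equal-sized groups
    se = sqrt(2.0 * true_rate * (1.0 - true_rate) / n_per_group)
    z = gap / se
    return 2.0 * (1.0 - normal_cdf(z))


if __name__ == "__main__":
    # Hypothetical test-group sizes, each receiving one subject line
    for n in (500, 2000, 10000):
        p = chance_of_gap(0.20, n, 0.02)
        print(f"{n:>6} per group: chance of a 2-point gap = {p:.3f}")
```

With a few hundred recipients per group, a 2-point gap between two identical subject lines is quite common; with 10,000 per group it almost never happens.  That is why a fixed rule of thumb about list percentage can't tell you whether your result is real.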

The details of setting up valid scientific experiments can get kind of complicated, but for our purposes of email marketing, we can simplify it to a few variables that are easy to use:
  1. How big is the effect you're trying to measure?
    In other words, are you trying to increase open rate from 20% to 22%, or from 10% to 20%?  (a 10% relative increase vs. a 100% relative increase)
  2. How large should the sample size be?
    To measure smaller effects, we need a larger sample size to be sure of our results, and vice versa. 
  3. How confident do you want to be in your result?
    Since online marketing is relatively inexpensive and safe (compared to say, pharma testing), we can be OK with a 90% or 95% confidence level.  Loosely speaking, this means that if the two subject lines really performed the same, a difference this large would show up by random chance only 10% or 5% of the time, so it's very likely that our improvement is worth mailing to the rest of the list.
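
The three variables above can be tied together in a rough sample-size estimate.  The sketch below uses the standard textbook formula for comparing two proportions; it is not necessarily what any particular online calculator does.  Note one assumption I'm adding that isn't discussed in this post: statistical power (the chance of detecting the effect if it is real), set here to the conventional 80%.  The function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist  # standard library, Python 3.8+


def sample_size_per_group(p_control, p_challenger, confidence=0.95, power=0.80):
    """Approximate recipients needed PER subject line to detect the
    difference between two open rates with a two-sided test.

    `power=0.80` is an assumed convention, not something from the post.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_challenger) / 2.0

    # Variability if there is no real difference (pooled rate)
    null_term = z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
    # Variability if the difference is real
    alt_term = z_beta * sqrt(
        p_control * (1.0 - p_control) + p_challenger * (1.0 - p_challenger)
    )
    return ceil((null_term + alt_term) ** 2 / (p_control - p_challenger) ** 2)


if __name__ == "__main__":
    print("20% vs 22%:", sample_size_per_group(0.20, 0.22), "per group")
    print("10% vs 20%:", sample_size_per_group(0.10, 0.20), "per group")
```

Under these assumptions, detecting a move from 20% to 22% takes several thousand recipients per group, while a jump from 10% to 20% needs only a couple of hundred.  Smaller effects need much bigger samples.
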
So What Do I Do Now?
  1. Before you run your test, determine the sample size you will need to use to get a valid result. (Tools provided below.)  In general, the more variation in your usual results, and the smaller the effect you're trying to measure, the larger the sample size will have to be.  If you think about it, this makes sense:  if you are asking people if they like chocolate, you don't have to poll many people before you can conclude that most people like chocolate.  However, if you're testing two different kinds of chocolate against each other (a chocolate Pepsi Challenge, if you will), then you might have to get a few more responses before you can conclude A is better than B.
  2. Run the test.
  3. After you run the test, plug the results into a spreadsheet to determine the statistical significance.
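
If you'd rather script step 3 than use a spreadsheet, here is a minimal sketch of a two-proportion z-test, a common way to compare response rates.  It reports "confidence" in the loose sense used in this post (1 minus the two-sided p-value).  The function name and the example counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist  # standard library, Python 3.8+


def ab_confidence(opens_a, sent_a, opens_b, sent_b):
    """Return (lift, confidence) for challenger B versus control A.

    lift       = relative change in open rate, e.g. 0.10 for +10%
    confidence = 1 - two-sided p-value from a pooled two-proportion z-test
    """
    rate_a = opens_a / sent_a
    rate_b = opens_b / sent_b

    # Pooled open rate, assuming no real difference between A and B
    pooled = (opens_a + opens_b) / (sent_a + sent_b)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / sent_a + 1.0 / sent_b))

    z = (rate_b - rate_a) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    lift = (rate_b - rate_a) / rate_a
    return lift, 1.0 - p_value


if __name__ == "__main__":
    # Same 20% vs 22% open rates at two hypothetical test sizes
    for sent in (1000, 10000):
        lift, conf = ab_confidence(int(0.20 * sent), sent, int(0.22 * sent), sent)
        print(f"{sent:>6} per group: lift = {lift:.0%}, confidence = {conf:.1%}")
```

The same 20% vs. 22% result lands well short of 90% confidence with 1,000 recipients per group, but clears 95% comfortably with 10,000 per group.
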
Evaluating the Results:
  • 90% Confidence or less:  The result could easily be due to chance, so don't act on it yet.  Either increase the size of the sample, or make a bigger change to the "challenger" (i.e., write a better subject line to test).
  • 95% Confidence (or greater):  Go with the challenger.
Tools!!
If you've read this far, you deserve a cookie, but instead, I'm going to give you three online calculator tools that you can use to help you set up your tests and evaluate the results.  Of the many websites I researched, the LucidView site had the best overall explanation of using statistics for marketing purposes, where we're mostly doing response rate testing, like open rates (did they open or not, as opposed to numerical measurement analysis like blood pressure variation due to an experimental drug).

If you're still reading, you must really like stats or something, so go get a book and dive in.  If you want something kind of fun to read, I highly recommend the Cartoon Guide to Statistics.  It sounds ridiculous, but it's really good at explaining graphically all of those concepts like p-values, means, and medians that you have forgotten in all the years since you took that one course in business statistics (Minitab anyone?).
 
