
Savvy marketers and kaizen masters test everything. They test email. They test web pages. They test content. They test load times. They test delivery methods, audiences, and clickable buttons in every color of the rainbow – but even seasoned marketers can stumble into these testing traps.
Pursuing a flawed hypothesis
The first step in A/B testing (or multivariate testing, for that matter) is to formulate the hypothesis you intend to test. This naturally precedes everything else, including the design of the test itself. A flawed hypothesis – one that is incomplete, poorly formulated or irrelevant – dooms the test.
Begin by looking at website analytics to identify a problem that must be solved. This reduces the chance of conducting an irrelevant test. Heat maps illustrate how users interact with pages and can reveal issues you might not notice otherwise. It’s also helpful to gather input from user tests that may point to blatant issues not easily detected by analytics. Similar information can be gleaned from customer feedback you’re probably already collecting.
Another way to produce a flawed hypothesis is to chase a prebaked one. Just because someone else changed the font on their signup page and saw a 20% increase in conversions, it doesn’t mean you should formulate a hypothesis based on their experience. Your website is not their website; their audience is not your audience; their entire market environment is likely to differ from yours.
Dismissing statistical significance
Once a problem is identified, create a hypothesis to test a single variable that’s likely to improve performance by a statistically significant degree over the control. You don’t know this up-front, of course, but you must start somewhere. The key question you must answer is, “did the variable change the outcome, or was the difference in the outcome due to chance?” The significance level is the risk you accept of concluding a difference exists when in fact it does not. Its complement is the confidence level – a measure of how sure you can be that the observed difference reflects your change rather than chance.
“True” and “false” are binary: a tested premise that reaches a 98% confidence level isn’t “98% true” – it means you can be 98% confident the observed effect isn’t due to chance. For a variable to be considered influential in an outcome, look for a confidence level of 95% or greater, which corresponds to a significance level of 5% or less.
It’s a good idea to think about statistical significance when you design a test, not just afterward, to get a handle on what to look for in the results, the size of the population you should test, and the power or “sensitivity” of the test. Power depends heavily on sample size: the larger the sample, the smaller the performance difference that can be detected as statistically significant.
For example, 500 views of Version B of a page may yield 120 clicks, outperforming Version A’s 100 clicks by 20% – but with the small sample size, it won’t register as statistically significant. Double the sample sizes and resulting conversions, however, and you can be 95% certain the 20% lift is due to a change you implemented on Version B.
AB Testguide provides a helpful calculator for pre-test and post-test analyses.
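If you’d rather check the math yourself than rely on a calculator, here’s a minimal Python sketch of the kind of two-proportion test behind numbers like these. The function name is mine and the figures simply mirror the hypothetical example above; it reports one-sided confidence that Version B beats Version A, which is how many A/B calculators express “significance.”

```python
# A minimal sketch of a one-sided two-proportion z-test using only the
# Python standard library. The function name and the click counts are
# illustrative; they mirror the hypothetical example above.
from statistics import NormalDist


def confidence_of_lift(clicks_a, views_a, clicks_b, views_b):
    """One-sided confidence that Version B genuinely outperforms Version A
    (the way many A/B calculators report 'significance')."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = (pooled * (1 - pooled) * (1 / views_a + 1 / views_b)) ** 0.5
    z = (p_b - p_a) / se
    return NormalDist().cdf(z)


# 120 vs. 100 clicks on 500 views each: roughly 94% confidence, short of 95%
print(round(confidence_of_lift(100, 500, 120, 500), 3))
# Double the views and the clicks: confidence climbs to roughly 98%
print(round(confidence_of_lift(200, 1000, 240, 1000), 3))
```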
There may be cases where it makes sense to lower the significance threshold. In other words, don’t ignore economic significance by focusing solely on statistical significance. A statistically significant result doesn’t prove a proposed change has practical value; likewise, a proposed change might have a great financial impact despite a lack of statistical significance.
Say, for example, you operate the fictional website malibumansions.us, where every sale is worth millions but sales volume is very low. Because volume is low, you’re restricted to a fairly small sample size of 500 for each of the two house-description page versions you’ll test.
Your test shows that 500 views of Version A of the page resulted in 100 clicks while 500 views of Version B resulted in 117 clicks. This fails the typical statistical significance test – it isn’t a big enough performance gain to indicate with 95% certainty that the extra clicks weren’t simply random chance. You can, however, be 90% certain the clicks were due to changes you made to Version B. Since you’re very interested in the 17 extra clicks, shouldn’t you consider a lower confidence threshold? Generally speaking, if the variable is likely to contribute to outsized gains, it may be considered worthy of pursuit even when the confidence level is less than 95%.
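To put a number on that, here’s the same hypothetical confidence_of_lift sketch from above applied to the malibumansions.us figures:

```python
# Reusing the hypothetical confidence_of_lift() sketch from above with the
# malibumansions.us figures: 117 vs. 100 clicks on 500 views each.
confidence = confidence_of_lift(100, 500, 117, 500)
print(round(confidence, 2))  # roughly 0.90 – below 95%, but hard to ignore at millions per sale
```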
Unintentionally conducting a multivariate test
Eager to deliver results to your fast-paced organization, you may decide to run multiple simultaneous tests, or to test many elements of a page at once. This practice will almost certainly cloud the test’s outcome and make it impossible to determine what factor influenced the results most.
It’s also true that the more elements you test, the larger the sample you need to reach statistical significance – which normally translates into longer testing times. As part of your testing plan, prioritize which elements will be tested first and keep tests simple so you can move on to the next test on the list.
Even as you move down the prioritization ladder, keep past test results and protocols at hand. It’s good practice to review your priorities periodically, and you may wish to revisit a previous test that raised pertinent observations worth testing on their own merit. Similarly, there may be something to learn from even inconclusive tests. For example, is there another variable at play that you’ve overlooked?
Samples that are too small or too large
Conduct A/B testing on appropriate sample sizes. Samples that are too small won’t produce the data needed to achieve statistical significance, while samples that are too large can mask segmentation differences. Optimizely offers a free tool to determine what sample size you should aim for.
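If you’re curious what such a calculator is doing under the hood, here’s a rough sketch assuming a two-proportion test; the baseline conversion rate, expected lift, significance and power values are placeholders to swap for your own.

```python
# A rough sample-size sketch (standard library only) for a two-proportion
# test. The defaults of 5% significance and 80% power are conventional
# starting points, not recommendations specific to your site.
from statistics import NormalDist


def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variant to detect a given
    relative lift over the baseline conversion rate."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return round((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Example: a 20% baseline conversion rate and a 20% relative lift to detect
print(visitors_per_variant(baseline=0.20, relative_lift=0.20))  # about 1,680 per variant
```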
Similarly, tests should run long enough to achieve statistical significance, but not much longer. Version A of a page may beat Version B early in a test yet lose in the end, so don’t call a winner prematurely. Running a test too long, on the other hand, wastes time and begins to introduce outside variables into the outcome, since conditions inside and outside the website environment keep changing as time passes. A rule of thumb: shoot for a two-week test, maximum.
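Translating a required sample into a test length is just division by the traffic you expect to send into the test; the daily-visitor figure below is made up for illustration.

```python
import math

# Hypothetical figures: the per-variant sample from the sketch above, for
# two variants, against made-up daily traffic entering the test.
visitors_needed = 2 * 1680
daily_test_traffic = 400
days_to_run = math.ceil(visitors_needed / daily_test_traffic)
print(days_to_run)  # 9 days here; if this lands well beyond two weeks, rethink the test
```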
It may seem obvious, but you should also conduct tests on both versions simultaneously. This mitigates the influence of outside variables on results.
Stopping testing
Each A/B test should expand upon the results of preceding tests. Evaluate test results to gain insights that help you plan more productive future tests. Don’t stop testing after a success, either. Remember, testing is integral to kaizen (continuous improvement), and that’s the real goal: continually improving website performance.
The opposite extreme of suspending tests is to test everything, all the time, with no testing plan. Remember, you’re not in business to test hypotheses. Your goal is to give stakeholders relevant information to guide their decisions. A/B testing can help, but testing itself is not the goal. Providing actionable, insightful and impactful information should always be your first concern and it should drive your testing plan.

Sidestep Content Testing Traps was written by me, Greg Norton – also known as webzenkai. I’ve got more than two decades’ experience building effective websites and powerful email campaigns that yield results. Feel free to contact me regarding this article or anything else you find on this website.