Thursday, July 05, 2012

Puzzling outcomes in A/B testing

A fun upcoming KDD 2012 paper out of Microsoft, "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained" (PDF), has a lot of great insights into A/B testing and the real issues you hit with it in practice. It's a light and easy read, definitely worthwhile.

Selected excerpts:
We present ... puzzling outcomes of controlled experiments that we analyzed deeply to understand and explain ... [requiring] months to properly analyze and get to the often surprising root cause ... It [was] not uncommon to see experiments that impact annual revenue by millions of dollars ... Reversing a single incorrect decision based on the results of an experiment can fund a whole team of analysts.

When Bing had a bug in an experiment, which resulted in very poor results being shown to users, two key organizational metrics improved significantly: distinct queries per user went up over 10%, and revenue per user went up over 30%! .... Degrading algorithmic results shown on a search engine result page gives users an obviously worse search experience but causes users to click more on ads, whose relative relevance increases, which increases short-term revenue ... [This shows] it's critical to understand that long-term goals do not always align with short-term metrics.

A piece of code was added, such that when a user clicked on a search result, additional JavaScript was executed ... This slowed down the user experience slightly, yet the experiment showed that users were clicking more! Why would that be? .... The "success" of getting users to click more was not real, but rather an instrumentation difference. Chrome, Firefox, and Safari are aggressive about terminating requests on navigation away from the current page and a non-negligible percentage of clickbeacons never make it to the server. This is especially true for the Safari browser, where losses are sometimes over 50%.
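That click-beacon loss is easy to see in a toy simulation. The loss rates below (30% vs. 20%) are made-up numbers for illustration, not figures from the paper; the point is just that if the treatment's extra JavaScript delays navigation, more beacons survive long enough to reach the server, and measured clicks go up even when true clicks are identical:

```python
import random

random.seed(0)

def measured_clicks(true_clicks, beacon_loss_rate):
    """Count clicks whose tracking beacon actually reaches the server.

    Every true click fires a beacon, but a fraction is lost when the
    browser tears down the page before the request completes.
    """
    return sum(1 for _ in range(true_clicks)
               if random.random() > beacon_loss_rate)

true_clicks = 100_000  # identical user behavior in both arms
control = measured_clicks(true_clicks, beacon_loss_rate=0.30)
# Extra JavaScript delays navigation, so fewer beacons are lost:
treatment = measured_clicks(true_clicks, beacon_loss_rate=0.20)

lift = (treatment - control) / control
print(f"control={control} treatment={treatment} lift={lift:+.1%}")
```

The treatment shows a double-digit "lift" that is purely an instrumentation difference.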

Primacy effect occurs when you change the navigation on a web site, and experienced users may be less efficient until they get used to the new navigation, thus giving an inherent advantage to the Control. Conversely, when a new design or feature is introduced, some users will investigate the new feature, click everywhere, and thus introduce a "novelty" bias that dies quickly if the feature is not truly useful.
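Those primacy and novelty effects are one reason short experiments can mislead. A toy model (entirely made-up numbers, not from the paper): suppose the true lift is zero but a novelty bump decays with a three-day half-life. The average lift you measure then depends heavily on how long you run:

```python
def measured_lift(day, true_lift=0.00, novelty=0.05, half_life=3.0):
    """Daily measured lift = true effect + a novelty bump that
    decays with a fixed half-life (all numbers illustrative)."""
    return true_lift + novelty * 0.5 ** (day / half_life)

week1 = sum(measured_lift(d) for d in range(7)) / 7
month = sum(measured_lift(d) for d in range(28)) / 28
print(f"avg lift, first week: {week1:.3f}; first month: {month:.3f}")
```

Stop after a week and the dying novelty bump looks like a real win; run a month and it mostly washes out.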

For some metrics like Sessions/user, the confidence interval width does not change much over time. When looking for effects on such metrics, we must run the experiments with more users per day in the Treatment and Control.
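The sessions-per-user point follows from the standard error: the confidence interval half-width for a difference in means shrinks like 1/sqrt(n), so it is more users, not more days, that narrows it. A rough sketch, with an assumed (purely illustrative) standard deviation:

```python
import math

def ci_half_width(std_dev, n_per_arm, z=1.96):
    """Approximate 95% CI half-width for the difference in means
    between Treatment and Control, each with n_per_arm users."""
    return z * std_dev * math.sqrt(2.0 / n_per_arm)

std_dev = 4.0  # assumed std dev of sessions/user (illustrative)
for n in (10_000, 40_000, 160_000):
    print(f"n per arm = {n:>7}: +/- {ci_half_width(std_dev, n):.4f}")
```

Quadrupling the users only halves the interval, which is why detecting small effects on such metrics takes so much traffic.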

The statistical theory of controlled experiments is well understood, but the devil is in the details and the difference between theory and practice is greater in practice than in theory ... It's easy to generate p-values and beautiful 3D graphs of trends over time. But the real challenge is in understanding when the results are invalid, not at the sixth decimal place, but before the decimal point, or even at the plus/minus for the percent effect ... Generating numbers is easy; generating numbers you should trust is hard!
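"Generating numbers you should trust is hard" is why the experimentation literature recommends A/A tests as a sanity check: split traffic with no actual change, and about 5% of metrics should still come out "significant" at the 5% level. If you see much more than that, it is the pipeline, not the product, that moved. A minimal simulation (illustrative numbers):

```python
import math
import random
import statistics

random.seed(1)

def aa_test_is_significant(n=2000):
    """One A/A test: draw two samples from the SAME distribution,
    then run a z-test on the difference in means at the 5% level."""
    a = [random.gauss(10, 4) for _ in range(n)]
    b = [random.gauss(10, 4) for _ in range(n)]
    se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return abs(z) > 1.96

trials = 500
false_positives = sum(aa_test_is_significant() for _ in range(trials))
print(f"{false_positives}/{trials} A/A tests came out 'significant'")
```

Roughly one in twenty "wins" here is pure noise, with no product change at all.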

Love the example of short-term metrics improving when a bug accidentally hurt search result quality (which caused people to click on ads rather than search results). That reminds me of a problem we had at Amazon where pop-up ads won A/B tests. Sadly, pop-up ads stayed up for months until we could show they were hurting long-term customer happiness (and revenue) even though they raised revenue in the very short term, and finally we were able to take them down.

The whole paper is a great read. The authors have deep experience with A/B testing in practice and with all the problems you encounter along the way. Definitely good to learn from their experience.