The real problem with split testing is the power it gives you to focus on tiny details. You'd better be sure you're in the right forest before you start worrying about individual trees, let alone the leaves on them. Because split testing does deliver actual, measurable improvement, it can reinforce and encourage you in focusing on the wrong problem. Just because something is measurable does not make it meaningful.
Which of several AdWords ads converts best - simple and clear-cut, right? But what if your ad writer is not as good as you think? You could perhaps have bumped click-through by far more by hiring someone entirely different. Are you going to split test ad copywriters?
Similarly, when multivariate testing on your site, you need to be sure your options are meaningful in terms of copy, design, usability and so on. If the design doesn't convert well in the first place, what's the point of messing about with colour or layout options a, b and c?
Sample size and distribution: You really need to run the test for over a week - Monday traffic is often not the same as Sunday's. If you have a seasonal business, March might not obey the same rules as August. You also need enough hits to each variation for the result to be statistically worthwhile - for a new site that could be near impossible.
- Does the small variation in conversion mean anything at all in the context of the hits you have?
- Would the same test run three months later give entirely different results because your business has a seasonal profile, and your visitors are now peak-season ones rather than the off-season visitors of three months back?
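The "does a small variation mean anything?" question has a standard answer: a two-proportion z-test. Here's a minimal sketch using only the Python standard library; the traffic and conversion numbers are entirely made up for illustration, but they show how an apparently decent lift on a small site can easily be noise.

```python
import math

def conversion_significance(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is the difference between two observed
    conversion rates statistically distinguishable from noise?
    Returns (z, p_value); a p_value above ~0.05 means you cannot
    reasonably rule out chance."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical new-site numbers: 30 vs 39 conversions on ~1,000 hits each.
# That is a 30% relative "lift", yet the p-value is far above 0.05.
z, p = conversion_significance(30, 1000, 39, 1000)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Run it and the p-value comes out well above 0.05: with only a thousand hits per variation, even a 3.0% vs 3.9% split is not evidence of anything. That's the "near impossible for a new site" problem in one calculation.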
...and what if you A/B test the old site against the new? Is the result meaningful? Is the differential caused by the inevitable resistance to change, or by uncertainty in finding things in the new design, or by genuine problems with the new design relative to the old? At what point can you be sure? Really?
Are you actually going to scrap your $xx,xxx site redesign if the split test says so, or are you going to fiddle with the details whilst basically remaining committed to the junk that is the "improved" design?
OK, there have been plenty of instances of terrible redesigns out on the web, but let me give you just two. (I've no idea whether either of these was split tested at all.)
http://lifehacker.com: Months after the redesign I still hate their new active layout. It's still harder to find things than it ever used to be, it's still inconsistent, and it still returns search results in the tiny right-hand column - which is not the one with the scroll bar. Result? I visit only when someone else hands me a link that interests me. No amount of split or multivariate testing is going to resolve core, broken usability.
http://trustedreviews.com: I visited very regularly: aside from trusting the content, the design was fast and light, displaying a summary that let you rapidly scan for new updates. Then came a redesign based on WordPress. Result? (Ignoring the sea of orange.) Horribly slow comparative load times, and it's no longer possible to take in what used to be on the homepage without multiple clicks and slow page loads. Some useful things the old site did simply can't be done any more, plus many other minor usability issues. They can A/B the life out of it now, and a good chunk of their old user base will never notice because they're now on different sites. I've not visited for months.
Split test those meaningfully.
Don't get me wrong: Google Website Optimizer is a great tool, but it's one of the very last you should reach for. The final layer of polish, used when something actually needs polishing.
Google split test a lot - and they never get their new layouts and designs wrong, do they? Obligatory link to the Google 41-shades-of-blue testing story. The rollover pulldown menu Google now have in place of the discreet black bar with all the links seems like a big leap backwards to me usability-wise, but I've no doubt the A/B testing makes me "wrong".
Here, have Jeff Atwood's take on it.