Evaluation: Beyond Lab Studies

From CS260 Fall 2011

Bjoern's Slides


Extra Materials

Discussant's Materials

Reading Responses

Hanzhong (Ayden) Ye - 10/10/2011 14:23:29

The readings on evaluation continue this week, now moving into more practical territory. The first article discusses the specific process of running controlled experiments on the web. Many examples have shown that the web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, which embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. With the technology now available on the web, it is becoming common practice to learn directly from Internet audiences instead of relying on what the authors call ‘HiPPOs’. The authors also discuss randomization and hashing techniques, which turn out not to be as simple in practice as is often assumed. They also point out that much more can be done with the data once it is acquired than drawing simple conclusions, such as data mining. I totally agree with the authors’ point that companies can accelerate innovation through experimentation on the web.

The second article is more theoretical and discusses many problems with, and criteria for, evaluating new UI systems. Although the author acknowledges that the development of user interface systems has languished with the stability of desktop computing, he believes future systems that are off-the-desktop, nomadic or physical in nature will involve new devices and new software systems for creating interactive applications, which will require much more work on UI evaluation. The author believes that although simple usability testing is useful, it is not adequate for evaluating complex systems, which require more complex processes and criteria. He offers a variety of alternative standards to help compare and evaluate new UI designs. However, I believe the paper could be better if it described more substantial examples and evaluation processes.

Steve Rubin - 10/11/2011 11:02:34

The two papers focused on A/B testing on the web, and user interface framework evaluations. The first paper's main idea was that the web affords companies the ability to remove "intuition" and, implicitly, a more refined "taste" from the design process, and instead lets them make interface/feature decisions based on data. The second paper discussed the key ideas to keep in mind when attempting to build a useful user interface framework (from a research perspective).

Kohavi et al.'s "Practical Guide to Controlled Experiments on the Web" is a good introduction to A/B testing. Their logic is straightforward, and they even provide a glossary to help the non-statisticians engage with their work. As a scientist, it's comforting to know that data is more useful than one expert's opinion. I did have a few issues with their paper, though. First, they are assuming that your web site is large enough to run tests in the tens of thousands of users. If you are operating an online store for a niche market, you may not be able to run high-powered tests. They do not offer an alternative for smaller companies. Second, while A/B testing can give you information about whether B is better than A, it cannot tell you if B is, in fact, good. By growing a site from test-based improvements, a company may never realize that there is something fundamentally wrong with their site.
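To make the sample-size concern concrete, here is a minimal sketch using the common rule of thumb n = 16σ²/Δ² per variant (roughly 80% power at a 95% confidence level). The function name and the store's numbers (5% baseline conversion, 5% relative change to detect) are illustrative assumptions, not figures from any real site.

```python
import math

def users_per_variant(variance: float, min_detectable_delta: float) -> int:
    """Rule-of-thumb users needed per variant: n = 16 * sigma^2 / delta^2.

    Approximates 80% power at a 95% confidence level.
    """
    return math.ceil(16 * variance / min_detectable_delta ** 2)

# Hypothetical niche store (assumed numbers, for illustration only).
baseline_conversion = 0.05                                   # 5% of visitors buy
variance = baseline_conversion * (1 - baseline_conversion)   # Bernoulli variance
delta = baseline_conversion * 0.05                           # detect a 5% relative change

print(users_per_variant(variance, delta))
```

Under these assumptions the requirement lands above a hundred thousand users per variant, which illustrates why a low-traffic site may struggle to run high-powered tests.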

Olsen's paper, "Evaluating User Interface Systems Research," gives several principles for evaluating a UI framework (and, implicitly, for designing a good UI framework). I thought his most important point was his emphasis on expressivity in design. Expressivity will definitely speed up the adoption of a new UI system. I also agree that you should make sure your contribution is "important," but it's interesting that he summarily rejects statistical significance as a measure of importance. His logic here is that if you need to dig in the data to learn if something is important, then it's not a big enough deal. This is contrary to what the first paper suggested. While Kohavi et al. claimed that features should be added or modified only when they offer statistically significant improvements, Olsen claims that there ought to be some strong intuition as to why something is important.

Valkyrie Savage - 10/11/2011 11:03:47

Main idea:

It’s hard but important to get the evaluation of a task right. “Evaluation” is an idea that encompasses everything from the actual way in which data is gathered to the statement of impact of the work to the consideration of future work.


The Olsen paper (is BYU a big HCI research university, btw? I hadn’t known that before.) describes how impact of a study can be evaluated, in particular how the ideas shown to be “good” by some metric in a study can change the possibilities for future studies. It seems to be a pretty relevant mental model to consider when doing HCI work. I approve highly of his remarks about the underlying hardware issues that have long dogged and defined HCI as a field: why haven’t we separated ourselves from this yet? Everyone is concerned about doing more faster, and really at some point this becomes “fast enough.” Compare this to the second paper’s mention about how 500ms difference in latency can change a website’s conversion rate by something like a few percent. I am in favor of Olsen’s approach, personally: I think that easing startup time for the developer is a key goal, particularly in a world where everyone fancies himself a developer. This actually reminds me of a conversation I had with my partner about the most recent project (which we implemented in Processing): on the topic of Processing as a language, his comment was, “What, so they let anyone program now?” I wonder where the point will be when startup costs are so low that such is really the case. Olsen remarks on Apple’s SDKs for interfaces and their plethora of themed widgets, and truly such things have expanded the number of people who can create professional-looking products that match with other products. At what point is this a bad thing? Well, as per basic prototyping ideas, creating something which looks finished but isn’t can frustrate and confuse users: setting the bar low for creation of UIs should come only when accompanied by setting the bar low for implementation of functionality.

I found it interesting that only the final consideration contained in the Olsen paper is whether important progress was made. One could argue that the entire paper is defining a concept of “important”, but it seems as though this question should be foremost in the researcher’s mind during the entire process.

The other paper about web experiments and HiPPOs (I rather liked that acronym, I must admit, and I hadn’t heard it before) extols, mainly, the powers of data. It cries for us to listen to data, not the Highest Paid Person’s Opinion for design, and cites a few examples of companies that have used this to their benefit. It’s easy in software to listen to data, particularly in this age where most products are websites and changes can happen just as fast as you wish them to. In hardware it’s more difficult: we all had to listen to Steve Jobs (arguably the Highest Paid Person) for ages before it was obvious the stretch of his ideas. This paper really stands in direct contrast to e.g. the scientific revolutions discourse: it’s hard to say which one is right(er), but as an American deeply invested in the cult of the individual I’m obligated to say that I like the sound of the one man fighting against the odds to get what he wants and inspire the people. Ultimately the two approaches may not be at odds, but it takes some time for people to realize what it is that they want (damn customers) and to vote for it with their pocketbooks. Data and design aren’t always side-by-side.

I guess there’s more to the paper, though, in that its main focus seems to be on crowds. All of the things that they mention in the area of “do this to make your study better” seem reasonable. It’s definitely a positive that they try to avoid asking users what they like and instead to observe what they do and return to. It seems practical.

Cheng Lu - 10/11/2011 17:50:06

The first paper, “Practical Guide to Controlled Experiments on the Web”, provides a practical guide to conducting online experiments, where end-users can help guide the development of features. The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. The paper provides several examples of controlled experiments with surprising results. It reviews the important ingredients of running controlled experiments, and discusses their limitations (both technical and organizational). It focuses on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. The authors also describe common architectures for experimentation systems and analyze their advantages and disadvantages. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on the authors' extensive practical experience with multiple systems and organizations, this paper shares key lessons that will help practitioners in running trustworthy controlled experiments.

The second paper, “Evaluating User Interface Systems Research”, presents a set of criteria for evaluating new UI systems work and explores the problems with evaluating systems work. The development of user interface systems has languished with the stability of desktop computing. Future systems, however, that are off-the-desktop, nomadic or physical in nature will involve new devices and new software systems for creating interactive applications. Simple usability testing is not adequate for evaluating complex systems. User interface technology, like any other science, moves forward based on the ability to evaluate new improvements to ensure that progress is being made. However, simple metrics can produce simplistic progress that is not necessarily meaningful. Complex systems generally do not yield to simple controlled experimentation. This is mostly due to the fact that good systems deal in complexity and complexity confounds controlled experimentation. This paper shows a variety of alternative standards by which complex systems can be compared and evaluated. These criteria are not novel but recently have been out of favor. The author suggests that people must avoid the trap of only creating what a usability test can measure. They must also avoid the trap of requiring new systems to meet all of the proposed criteria, which would recreate the fatal flaw fallacy.

Laura Devendorf - 10/11/2011 18:06:26

Olsen's paper provides a number of alternate measures for the success of a new user-interface, system or tool and argues that trendy usability testing strategies are limited to simplistic problems. The Practical Guide to Controlled Experiments on the web discussed how interface designers can evaluate varying designs on the site's existing users.

I enjoyed Olsen's piece and I thought it brought up a number of interesting points. I was particularly interested in the section on empowering new design participants and how the ability to do so can be a useful measure. I also appreciated the examples provided alongside the explanations, in that they more clearly, and sometimes obviously, outline how and when each measure could be used. I think that this paper will be something that I refer back to in order to support my own work. As a criticism, I could see how some of the ideas he presented would benefit many researchers, but could also be used as a crutch for groups to make claims about a system that may be simplistic and have no data backing them up. I am curious to know if paper reviewers value these measures and if they carry the same weight as usability statistics.

The practical evaluations paper hit home for me in many ways since I used to be in charge of designing and managing an e-commerce store. I began to wonder if this method would be useful for the store I ran. We had a relatively small customer base, insignificant compared to Amazon, and our customer base was also very sensitive to changes on the website. If they were used to it functioning one way, a change would be largely disruptive and could result in lost sales. The paper presented a number of measures and guides to prevent lost sales, but I think they are very heavily linked to the size of the company and the extent to which its customers will adapt to the change. It would have been great to reference the lessons learned from similar experiments, such as the coupon example in the introduction.

Amanda Ren - 10/11/2011 20:26:19

The Kohavi paper talks about the benefits and limitations of doing controlled experiments on the web.

The paper is important because conducting controlled experiments is beneficial in determining what the end user/customer wants in a product. As they point out, it is not always the "highest paid person" that makes the right decisions on product changes. They noted two examples with the FootCare website and the MS Office help articles. Even with such trivial changes, revenue was greatly affected. I found these examples interesting because I actually thought the opposite of the results - that the changes were more beneficial. Controlled experiments do have their limitations though - I would not have figured out the reasons for the results had they not been given in the article. This paper is relevant to today's technology with the competitiveness and popularity of web applications - such as social networking sites - and as the paper said, even a small decrease in speed will result in less revenue.

The Olsen paper describes new criteria for evaluating new UI systems.

This paper is important because it states that we now need a new way of evaluating new UI systems. Much of this is because this part of research has slowed down due to the stability of the three platforms. Olsen argues for the need for new criteria because people are moving from the desktop to their mobile phones. However, before a new UI evaluation can be made, we have to avoid three traps: usability, finding flaws, and legacy code. This is relevant to today's technologies because we do need to consider new UI systems as people are becoming increasingly dependent on their tablets and smartphones.

Hong Wu - 10/11/2011 23:00:30

Main Idea:

The two papers discuss methods for testing user interfaces.


“Evaluating User Interface Systems Research” focuses on user interface systems. The paper covers the value of evaluating user interfaces, flawed methods of evaluating them, and the different aspects on which evaluation should focus.

“Practical Guide to Controlled Experiments on the Web” first claims that customers, rather than some outside consultant, are the people developers should turn to. The paper emphasizes the importance of controlled experiments. In fact, controlled experiments were already being used in the medical field over a hundred years ago.

I appreciate that “Practical Guide to Controlled Experiments on the Web” also talks about the shortcomings of controlled experiments and offers some guidelines, such as “one change each time”. Websites have more flexibility than the medical field because experiments are cheap and carry no risk of fatality.

Viraj Kulkarni - 10/11/2011 23:26:55

The first paper, 'Practical Guide to Controlled Experiments on the Web', is exactly about what the title says. The paper sets guidelines that can be followed in conducting online experiments where end users can provide feedback that can be used to direct development of features. The authors also make the point that data matters more than opinions! They cite two examples - the scurvy example and the Amazon shopping cart recommendation example. There is a slight difference in these two examples however. In the scurvy case, the captain observed that sailors in the Mediterranean region who consumed citrus fruits didn't get scurvy, which led him to believe that eating citrus fruits prevents scurvy. He carried out the experiment, which resulted in the data that confirmed his belief. In this case, the order was observation->hypothesis->experiment->confirmation. In the other case however, the order was hypothesis->experiment->confirmation. There was no observation here. The belief or the hypothesis came from nowhere. Both are forms of creativity and innovation. The above point may not be very relevant to the remainder of the paper, but it kinda caught my attention. The paper is a compilation of techniques that can be employed to conduct user studies. It also contains some recommendations and personal opinions of the authors.

The second paper, 'Evaluating User Interface Systems Research', talks about how the rapid progress in computer technology is affecting changes in user interfaces and how researchers should adapt to these changes by coming up with new techniques to evaluate these newer user interfaces. The author claims that today's user interface systems and the platforms they build upon are rooted in a different technological era, and the assumptions made then are simply not applicable now. User expectations, system performance, and hardware characteristics have all changed, but the way we think about user interfaces has not kept up with the change. One of the reasons for the lack of progress in UI research is that we don't have effective metrics to evaluate new ideas. The author talks about a few pitfalls in evaluating UIs and provides alternative parameters which can be used to compare and contrast systems.

Yin-Chia Yeh - 10/12/2011 0:34:03

The Controlled Experiment paper provides a practical guide to performing controlled experiments on the Internet. It advocates that end users, rather than managers, should decide whether a feature is valuable. The evaluating UI systems paper reviews ways of evaluating UI systems besides usability tests. It also proposes that people should evaluate a system according to the system’s STU – situation, task, and user.

The Controlled Experiment paper is a really useful paper for novices who want to learn how to run controlled experiments on the web. I like the concept that features don’t have to be fully designed before implementation because you can run experiments to help you improve iteratively. Another interesting point is the day-of-the-week effect. I wonder if there are other similar effects, such as a holiday effect or a breaking-news effect. Also interesting is how to design a good OEC. It seems to be a very general problem that applies beyond web software. Another course lecturer of mine mentioned something very similar -- “People in the field of optimization are highly capable of solving any objective function, but the performance of our algorithm really depends on if the objective function is meaningful.”

I have one question about the evaluating UI systems paper. It seems to me that the author believes that the decline of UI systems research has something to do with the fatal flaw fallacy. I am a bit skeptical of that because I think people should be able to focus on the intended problem as long as the scope of the research problem is clearly stated. If that is not the case, I guess it could be similar to the way high-fidelity prototypes distract people from the real question. On the other hand, I strongly agree that researchers should not be limited to research that is easily measurable and that the selection of a research problem should be based on its importance. I have another question, similar to my discussion of the Controlled Experiment paper. This paper mentions that people generally will not adopt a new technique that is not two times better than the current one. The question is how we define two times better when there is no easy qualitative measurement.

Derrick Coetzee - 10/12/2011 0:48:46

Today's readings were both focused on evaluating user interfaces, one very specifically relating to web applications and remote experimental techniques, and one relating broadly to systems user interfaces.

Kohavi et al's "Practical Guide to Controlled Experiments on the Web" is effective at its primary task, which is instructing companies on how to perform simple web experiments to increase sales, and how to avoid common errors such as not assigning enough users to the test group or using a metric with high standard error. It is accessible to readers with limited statistical background. The technique is exciting in its ability to be deployed to a large test group cheaply and automatically.

At the same time, the extremely limited scope of the work is its primary fault. It does not consider - or even reference - more advanced statistical techniques, nor does it consider examples outside the narrow area of "how to make more money." For example, a news website may consider it part of their mission to inform the public, and so might want to design an OEC that measures how well readers are learning the facts presented (how this would be measured is unclear). I'm also highly suspicious at their claim of finding a correlation (however complex) between outputs in SHA256, and think an element of chance may be involved - I would like to see details of this experiment.

The use of "A/A tests" to test the system is an intriguingly useful idea. One interesting extension would be to consider "simulated A/A tests", where a large user transaction log is replayed over and over with different random assignments of users to groups. This allows a much larger number of A/A tests to be performed with the same data, and allows them to be conducted on past data recorded before the system was available (this would be valuable for example for testing the minor variants of the system produced after bugfixes).
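The "simulated A/A test" idea can be sketched as follows. This is only an illustration under stated assumptions: the log is synthetic (a real one would come from recorded transactions), the function name is hypothetical, and a simple two-sample z-test stands in for whatever analysis a real experimentation system applies.

```python
import random
from statistics import NormalDist, mean, variance

def aa_false_positive_rate(outcomes, runs=200, alpha=0.05, seed=0):
    """Replay one log of per-user outcomes under many random 50/50 splits.

    Both groups see the same experience, so the fraction of splits flagged
    as 'significant' should sit near alpha if the setup is sound.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    flagged = 0
    for _ in range(runs):
        a, b = [], []
        for outcome in outcomes:
            # Fresh random assignment of each logged user on every replay.
            (a if rng.random() < 0.5 else b).append(outcome)
        std_err = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
        if std_err > 0 and abs(mean(a) - mean(b)) / std_err > z_crit:
            flagged += 1
    return flagged / runs

# Synthetic stand-in for a transaction log: 1 = converted, 0 = did not.
log_rng = random.Random(42)
log = [1 if log_rng.random() < 0.05 else 0 for _ in range(2000)]

print(aa_false_positive_rate(log))
```

A rate far above the chosen alpha would suggest a problem in the assignment or analysis code rather than a real effect, since no treatment was applied.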

Dan Olsen's "Evaluating User Interface Systems Research," also from 2007, laments the stagnation of systems user interfaces like windowing managers, and provides a framework/toolkit for designing and evaluating new ones.

My biggest frustration with this work is that although it aims to be progressive and fight UI stagnation, its criteria for an effective toolkit design are themselves steeped in conservative assumptions, such as the assumption that large-scale long-term user testing is prohibitively expensive (which overlooks techniques like crowdsourcing or the A/B testing in Kohavi's work), and the assumption that only "important" groups or tasks are worthy of building systems for (virtually any motivated group can achieve effective contributions to society, and increasingly specialization of software to small groups is an evident trend). Moreover, it overlooks numerous extant novel system UI designs, such as those used on gaming consoles, on touchscreen phones, and on embedded devices such as microwaves that are fundamentally different from the WIMP model of desktop PCs.

Yun Jin - 10/12/2011 1:26:48

Reading response of Practical Guide to Controlled Experiments on the Web: This paper provides a practical guide to conducting online experiments, where end-users can help guide the development of features. It provides several examples of controlled experiments with surprising results. What’s more, the paper reviews the important ingredients of running controlled experiments and discusses their limitations (both technical and organizational). It focuses on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. It describes common architectures for experimentation systems and analyzes their advantages and disadvantages. Finally, it evaluates randomization and hashing techniques, which turn out not to be as simple in practice as is often assumed. Despite the significant advantages that controlled experiments provide in terms of causality, they have limitations that need to be understood: quantitative metrics but no explanations, short-term vs. long-term effects, primacy and newness effects, features that must be implemented, consistency, parallel experiments, and launch events and media announcements.

Reading response of Evaluating User Interface Systems Research: This paper presents some problems with evaluating systems work and a set of criteria for evaluating new UI systems work. This paper shows a variety of alternative standards by which complex systems can be compared and evaluated. These criteria are not novel but recently have been out of favor. Researchers must avoid the trap of only creating what a usability test can measure. They must also avoid the trap of requiring new systems to meet all of the evaluations described in the paper. This would recreate the fatal flaw fallacy.

Apoorva Sacdev - 10/12/2011 1:30:11

This week's reading includes two papers: one by Dan R. Olsen on evaluating user interface systems research, and the other by Ron Kohavi et al. describing how to conduct controlled experiments on the Web.

The first paper (Kohavi et al.) was interesting as it showed that theoretical predictions about certain UI aspects can sometimes prove to be very different from the results obtained through user studies. The examples they gave of Doctor FootCare and Amazon substantiated this point. More than just playing with the aesthetics, the fact that the placement of certain UI elements in specific places can influence people's decision making in more subtle ways than imaginable in theory is a good motivation to perform controlled experiments. It would have been better if the authors had mentioned how to deal with the limitations of controlled experiments on the web. For instance, to deal with the primacy and newness effects, how does one make sure that the OEC test is only performed on new users? The paper also seems to imply that the experiments can only provide quantitative results; however, one could always combine the experiment with a short survey at the end to get the reasons behind user choices. Further, if the experiment is run for a long time to overcome the day-of-the-week effect and uses ramp-up, how does one make sure that cookies in the browser are not deleted and that the study results are still consistent?

I felt some of the things mentioned in the Olsen paper were pretty obvious. The idea of building UI toolkits which allow for faster creation of multiple UI designs seems effective. I wish Olsen had presented more quantitative methods of evaluation in his paper besides stating that for an interface to be effective and important it should be useful to a large population and lead to a substantial improvement compared to the existing interface. He just mentions the problem of measuring “improvement” for interfaces that are completely new but doesn't provide a convincing alternative. Overall, I thought the first paper was better than the second paper in the presentation of its material and backing up its claims.

Suryaveer Singh Lodha - 10/12/2011 3:55:01

Guided Controlled Experiments: The paper guides readers on how to conduct online experiments. Because the data-driven model of the web allows retrieval of rich user data, experimentation on new features is easy. The authors give a very good overview of hypothesis testing and statistical analysis. They talk about important methods of lowering variance, e.g. choosing a binary variable that answers "yes/no". I found the "day of week" effect very interesting and intriguing. A product's use may be associated not only with the day of the week but also with the time of day (e.g. just before or after lunch, people may not be able to focus well).

Evaluating user interface systems research: This paper describes the importance of evaluating new user interface systems and some methods of doing so. As UI technologies change frequently, it is a must to find apt metrics to evaluate and ensure progress. The idea that problems such as a steep learning curve can be overcome if the reward is high enough was interesting. The author states that the lack of true innovation in UI frameworks is reaching a point where it may keep us from thinking about the full design space of interfaces with new (non-desktop) devices. He proposes a set of new criteria for evaluating user interface toolkits as a means of producing good designs. Olsen's points suggest a somewhat dramatic shift in the mindset of user interface designers as they prepare to accommodate an entirely new class of devices. It's unclear what the impact on the field of interface design would be if these concepts were to be adopted, but they do provide an interesting alternative view to the process that's in place today. I doubt that the current system will (or should) be replaced, but Olsen's developer-centric view could provide an interesting complementary perspective that may open up some doors beyond the walls of today's UI toolkits. The portion regarding the analysis of the importance of a situation, task, and user seemed interesting!

Galen Panger - 10/12/2011 3:55:44

I really appreciated the "Practical Guide to Controlled Experiments on the Web.” I was especially exposed to the importance (and excitement for the results) of split A/B testing in my former job at Google (though, as a public policy team member, I never did any of these myself); and the benefits of running experiments on live users are obvious. Lessons learned from experience are not obvious, however. A few of the lessons struck me as particularly helpful: (1) that factorial designs are indeed complicated, but luckily interactions are less frequent than people assume; (2) that you should run A/A tests to estimate variance, make sure users are split appropriately, and that non-significant results occur with frequency according to the confidence level; (3) the recommendation to mine the data further than the OEC; and (4) various practical recommendations, such as running the experiment for a while (1-4 weeks) but not too long (because of “cookie churn”).

I felt more ambivalent about the “Evaluating user interface systems research” article. The most useful aspect, in my opinion, was the basic framework laid out for anyone attempting to “pioneer” a new interface toolkit. It’s important to focus on the fundamentals - i.e. what’s going to make your new interactions and tools important. Generality, solving problems previously not solved, expressiveness (flexibility/leverage/match), etc. I thought were all important ideas for someone who feels challenged by the dominance of a current toolkit or interface. But otherwise, I thought the article was a bit flat—lots of claims, not a lot of support (and virtually no references).

In particular, I wasn’t convinced by Olsen’s arguments about the “usability trap” and the high costs of user testing. Seems to me that you would want to bear the price of extensive user testing for a new toolkit—after all, the point is to save others a bunch of work down the line, is it not? Set standards that lead to good usability outcomes—that’s the point, no? Elsewhere, Olsen says abandoning legacy code is “the price of progress,” but so too, in my view, is convincing evidence of usability. By the way, he also says that it’s not likely 10 programmers over 6 months could achieve statistical significance—which may be true for small and relatively minor effects—but elsewhere his standard for success is a 100% improvement. That’s probably enough for statistical significance.

Allie - 10/12/2011 8:15:08

In "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO", researchers Kohavi et al. argue that the HiPPO, or Highest Paid Person's Opinion, should be forgone in favor of the customers' opinion as measured through controlled experiments. The paper introduces terms for controlled experimentation and result analysis: OEC (overall evaluation criterion), factor, variant, experimental unit, null hypothesis, confidence level, power, A/A test, standard deviation, and standard error. In the implementation architecture, the randomization algorithms are pseudorandom with caching and hash and partition; the assignment methods are traffic splitting, server-side selection, and client-side selection.
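To make the hash-and-partition idea concrete, here is a minimal sketch of how a user might be assigned to a variant without storing any state. The function name, the choice of SHA-256, and the 10,000-bucket granularity are assumptions made for illustration; they are not taken from the paper.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, treatment_pct: float = 50.0) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    Hypothetical helper: hashing the experiment id together with the user id
    means the same user always sees the same variant within one experiment,
    while different experiments split users independently.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % 10000  # bucket in the range 0..9999
    # treatment_pct is a percentage, so 50.0 covers buckets 0..4999
    return "treatment" if bucket < treatment_pct * 100 else "control"


if __name__ == "__main__":
    for uid in ["alice", "bob", "carol"]:
        print(uid, assign_variant(uid, "checkout-button-test"))
```

Because the assignment is a pure function of the two ids, no cache or database lookup is needed to keep a returning user in the same variant, which seems to be the main appeal of this approach over pseudorandom assignment with caching.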

The researchers found that factors such as the day of the week and speed affected experiment results. Nonetheless, controlled experiments are an effective, systematic way of data mining. I think the paper is really solid, and its methods are a quantitative heuristic of the kind introduced by Mcguire in the Evaluation and Methods and Techniques paper set.

In "Evaluating User Interface Systems Research", Olsen asks the question: "If not usability how do we evaluate systems?" He then introduces "reduce development viscosity", "least resistance to good solutions", "low skill barriers", "power in common infrastructure", and "enabling scale" as the value added by UI system architectures.

STUs, or situations, tasks, and users, form a framework for evaluating the quality of a system innovation. The larger and more diverse the STU context is, the stronger the claim to the importance of a solution. Expressive leverage is where a designer can express more with less; expressive match is how closely the means of expressing design choices fit the problem being solved, and improving that match improves UI tools. Inductive combination, simplifying interconnection, and ease of combination are ways of combining pieces of a design to create a more powerful whole.

The Olsen paper falls more on the qualitative heuristic side of the Mcguire paper. It approaches UI evaluation from a different perspective than the Kohavi et al. paper. The criteria are more fluid and less quantifiable, but more interesting and subjective.

Manas Mittal - 10/12/2011 8:16:12

Kohavi et al.'s paper is interesting in that it comes not from a research lab but from the trenches (of Microsoft). This is the real form of user testing (A/B tests), and this is what we should be doing.

In the past, I worked at 2 companies - a well known product company and a well-known web company. The product company had very little A/B testing, but very very very extensive user testing. The website had more A/B testing, but still not that much.

The primary reason is that it is so expensive to do A/B testing for anything other than the most trivial design decisions (not that these design decisions are unimportant). It's so difficult and expensive to build an alternative version of the website that differs in functionality. We should think of tools to make A/B testing simpler.

Vinson Chuong - 10/12/2011 8:46:23

Kohavi, Henne, and Sommerfield's "Practical Guide to Controlled Experiments on the Web" offers an analysis of the advantages and drawbacks of conducting large scale controlled experiments on web-enabled applications. Olsen's "Evaluating User Interface Systems Research" surveys alternative methods for evaluating user interfaces (in the context of UI systems and toolkits) which cannot be evaluated easily via standard means.

Kohavi, Henne, and Sommerfield generalize the concept of controlled experiments in laboratories to take advantage of the nature of web-enabled applications. Applications that are deployed completely or partially on the web can be updated and re-deployed quickly and easily, making it easy to perform controlled tests using the entire user base as subjects. One of the most common methods is A/B testing on web apps, where small changes are evaluated against the original for statistically significant changes in metrics like conversion rate or some other measure of success. When I worked with Zazzle, a design-your-own-product-and-sell-it platform, almost every modification to the website was A/B tested against the original for change in conversion rate. Ideas weren't selected for final deployment on the basis of intuition or opinion; there was a clear build-and-test mentality. One of the main limitations of this type of testing that Kohavi, Henne, and Sommerfield discuss and that I've seen is that the results are a change (or no change) in some metrics; there is no explanation about what causes the change and no feedback to inform the design process. Despite their limitations, performing such tests is easy and cheap. Perhaps they can be used to filter potential variants for further, more detailed user testing.
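As a rough illustration of what "statistically significant change in conversion rate" can mean in practice, the sketch below runs a two-sided two-proportion z-test using only the standard library. The function name and the visitor counts are hypothetical, and a real experimentation platform may well use a different test.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Compare the conversion rates of control (A) and treatment (B).

    Returns (z, p_value) for a two-sided test under the pooled-variance
    normal approximation, which assumes reasonably large samples.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so nothing to detect
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts: 500 of 10,000 control visitors convert vs. 600 of 10,000 in treatment
    z, p = two_proportion_z_test(500, 10000, 600, 10000)
    print(f"z = {z:.2f}, p = {p:.4f}, significant at 95%: {p < 0.05}")
```

Note that a test like this only says whether the metric moved; as discussed above, it gives no explanation of why.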

Olsen discusses the difficulty in evaluating UI systems and toolkits by standard means like usability testing, and surveys alternative evaluation methods. He contends that current common evaluation methods present a barrier to acceptance of new UI systems, that UI systems are often too complex and the expertise of their target users too high for usability testing, and that new UI systems are often dismissed due to missing functionality. He goes on to discuss how researchers can support claims of "importance", of solving "problem[s] not previously solved", of "generality", of "reduc[ing] solution viscosity", of "empowering new design participants", and of "power in combination". I feel that the arguments presented in the first part of his paper have little to do with the discussion in the second part, as most of his suggestions involve performing what are essentially usability tests with focused sets of "situations, tasks, and users". Regardless, the suggestions he raises seem to be useful for evaluating complex systems in general and can be independent of the motivation that he presents.

Alex Chung - 10/12/2011 8:49:31

Summary: Usability testing can quickly answer any design question without spending a fortune. Also, experimentation can drive innovation because customer feedback gives designers insight into what customers want.

Positive: The clarification of various statistical methods. This article has done an excellent job of providing a 360 view of usability testing from designing to implementing the experiments. It gives the readers just enough detail and reference without making it sound boring.

Positive: Web testing is a great way to test your prototype. It is cheap and fast. For example, we did in-person usability testing with 8 users on the Kinect assignment and it took 3-4 hours to collect data. Testing takes a tremendous amount of resources. Thus I would consider using the web for future testing.

Positive: The sample size section is great. I’ve always been a little confused about determining the sample size when weighed against the desired error rate.
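For reference, the trade-off between sample size and detectable effect can be sketched with the commonly cited rule of thumb n ≈ 16σ²/Δ² (roughly 95% confidence and 80% power), where σ is the standard deviation of the metric and Δ is the smallest absolute difference worth detecting. The function name and the input numbers below are illustrative assumptions, not a definitive calculation.

```python
import math


def users_per_variant(sigma: float, delta: float) -> int:
    """Approximate number of users needed in EACH variant.

    Uses the rule of thumb n = 16 * sigma^2 / delta^2, which corresponds
    to roughly 95% confidence and 80% power for a two-variant comparison.
    sigma: standard deviation of the metric per user.
    delta: minimum absolute difference between variants to detect.
    """
    if sigma < 0 or delta <= 0:
        raise ValueError("sigma must be non-negative and delta must be positive")
    return math.ceil(16 * sigma ** 2 / delta ** 2)


if __name__ == "__main__":
    # Illustrative inputs: per-user revenue with sigma = 30, detecting a 0.1875 shift
    print(users_per_variant(sigma=30.0, delta=0.1875))   # -> 409600
    # Halving the detectable difference quadruples the required sample
    print(users_per_variant(sigma=30.0, delta=0.09375))  # -> 1638400
```

The quadratic dependence on Δ is the key point: detecting small effects gets expensive quickly, which is why a lower-variance metric can shorten an experiment considerably.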

Negative: On the other hand, web testing does not give you the contextual feedback and insight that come from interacting with the testers during or after the experiment. Some people might not be eloquent with words but could be great conversationalists.

Negative: The paper's explanation of randomness was a little confusing. It described how correlation could occur if the random algorithm is not perfect. Yet shouldn't the tester design the randomness into the experiment rather than rely on a computer algorithm to do so? After all, I’ve learned from cryptography that computers cannot achieve true randomness yet.

I had a chance to interview leaders in UI design and development at Yahoo, and they strongly believe in test-driven assessment of new products. They let the audience decide, through click analysis, whether a product is worth future investment.

Yet I thought the uncontrolled experiments were very unscientific, and there is a possibility of misinterpreting the data. Without an understanding of the user’s intent, it is very difficult to make a judgment call on error or success rates.

Peggy Chi - 10/12/2011 8:55:40

These two papers discussed very interesting issues, including those we had touched on in Monday's class. Kohavi et al. shared their practical experiences and observations on controlled experiments run on the web, as an alternative to relying on the HiPPO (Highest Paid Person's Opinion). Olsen suggested different evaluation standards for "complex" systems beyond traditional personal computers to meet current technology needs.

I especially liked the first paper because it gave a clear overview supported by many cases and examples. The A/B test answered my concerns about evaluation: in the past, when conducting experiments with invited participants, it was usually relatively difficult to separate them into control and treatment groups. Instead, we gave the same group of users the treatment after testing the original system. However, this method might let participants "learn" from their experience with the existing system and perform better with the new features. For example, last year when I evaluated my project on a chat system showing real-time related photos based on the chat content, I should have separated users into two groups with 1) random suggestions from the system and 2) suggestions tailored by our design. If users were asked to use the random version first, they might gain confidence and have a better discussion later on. In the end I only asked users to use the new system and observed whether they considered the new feature, which does not seem scientific enough. Collecting ground truth also helps define what the difference is between the new system and existing ones.

However, I'm still not exactly clear about experimental design: for studies involving humans, especially in a complex environment such as outdoors (say, a mobile phone test), can we eliminate all the factors that might affect the results? When we can't observe users in a controlled room, how do we collect information about additional unexpected events? Or can we successfully ask users to report them?

Rohan Nagesh - 10/12/2011 8:59:44

The first paper, "Practical Guide to Controlled Experiments on the Web," discusses the need for iterating quickly and leveraging the Web as a platform for organizations to run controlled experiments and determine which feature set or product its users prefer. The second paper, "Evaluating User Interface Systems Research," calls for shifting our evaluation criteria away from traditional usability metrics such as number of errors and time to complete a task and moving towards more complex, nebulous criteria to pair with the more complex UI systems of today.

I absolutely agree with the first paper's emphasis on using rapid feedback from these controlled Web experiments. In today's competitive landscape, it is imperative for companies to stay ahead and push out new features or iterations of their product. I enjoyed reading about the HiPPO acronym (highest paid person's opinion), which from my limited experience at work is quite true. The biggest lesson I took from this paper was that an organization must instill a culture of data trumping intuition to succeed and iterate quickly. Additionally, it's important to set an OEC (overall evaluation criterion) upfront to ensure everyone's on the same page throughout the experiment and that the experiment is well designed and has the power to provide conclusive results one way or the other.

Regarding the second paper, I found the author's proposed new set of metrics to be a bit abstract and nebulous. This may have been on purpose given that he wants to tailor them to more complex UI systems and move away from the simplistic usability metrics we had been using. However, many of the metrics he proposed aren't really all that measurable, in the sense that it's difficult to put a number for instance to "Expressive Match." All in all, I agree with the author's main sentiment but would have liked some measurable metrics and evaluation criteria.

Shiry Ginosar - 10/12/2011 9:03:13

There is an interesting tension between these two papers. While the Kohavi paper hails the use of A/B controlled testing for websites, Olsen goes against the tendencies to use simplistic metrics in controlled experiments designed to measure complex systems.

While I generally think that A/B testing is a great tool for evaluating the business success of online sites which have clear monetary objectives, it is indeed not clear that an improvement on one of the defined OECs always improves the actual user's experience. Additionally, the paradigm introduces issues with testing long-term effects of the changes made, and cannot easily be used to test systems which are not web based.

While Olsen may have a good point in his paper, the presentation of his ideas is less than ideal. He does not provide a concise summary of his main point, nor does he use his own theoretical ideas in practice to demonstrate how they can be used to evaluate a system that he has built from start to finish. These points take away from the readability and the applicability of his paper.

Jason Toy - 10/12/2011 9:03:46

Evaluating User Interface Systems Research

"Evaluating User Interface Systems Research" is about the lack of usability testing for complex systems. It gives various reasons or questions to judge UI toolkits: what it determines to be the building block for new usability testing.

This paper presents a new framework for judging whether we are building improved user interface systems. It provides questions such as "can your solution scale up?" to critique new systems. At the same time, it acknowledges that a new system cannot answer every critique perfectly, but rather progress is made if you can answer more questions than your predecessors. This is similar to the ideas in "The Structure of Scientific Revolutions", where the author describes a scientific paradigm only needing to answer questions better than the previous paradigm. In both cases, the remaining questions which are not answered satisfactorily are filled in later on. This framework could influence future research with its focus on building toolkits, whether for one of the three stable platforms, or for something independent.

I like the paper's push for incremental improvements and progress as a measure of success because giving users options is an end goal that would push for improvements from the three stable platforms, if nothing else. I also think the paper does a good job reminding us of the importance of UIs in different scenarios. There is a balance between UIs as a general form and the specific needs of different tasks. At the same time, the paper's complaint about the outdated assumptions that are built into the functionality of existing systems scares me. While it is true in certain cases, the blatant disregard for resources also sounds like an excuse for writing poor or inefficient code, something that already can be seen in code written by a generation who never had to deal with limited amounts of memory, etc.

Practical Guide to Controlled Experiments on the Web

"Practical Guide to Controlled Experiments on the Web" is about the procedures and pitfalls of how to do online experiments.

This paper provides a new framework or set of guidelines for running experiments, in the hope of standardizing them, reducing costs, and eventually pushing for changes that are driven by data rather than by the opinions of executives. One of these ideas in particular is relevant to industry today: automated ramp-up and abort. In the past, to avoid massive failures on its website, LinkedIn actually rolled out new releases to a small subset of users. If there were no problems with the rollout, the number of users with access to the new release increased until everyone was updated to the new version. Something interesting about this paper that might influence how experimentation is done on the internet is the fact that many of these users do not know that they are being experimented on. In many of the papers we have read, this is not the case, as users know they are part of a study. But the internet affords this new dynamic, which allows a large number of people to react in a setting more natural to them.
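The ramp-up-and-abort idea can be sketched in a few lines. This is only an illustration under stated assumptions: the stage percentages and the `is_healthy` callback (standing in for whatever check compares the treatment's metrics against control at a given exposure level) are hypothetical and not taken from the paper or from any real deployment system.

```python
from typing import Callable, Sequence


def ramp_up(stages: Sequence[float], is_healthy: Callable[[float], bool]) -> float:
    """Walk through increasing exposure percentages, aborting on the first failure.

    stages: exposure levels to try in order, e.g. [1, 5, 25, 50].
    is_healthy: hypothetical callback that returns False when the treatment
        looks significantly worse than control at the given exposure level.
    Returns the final exposure percentage, or 0.0 if the rollout was aborted.
    """
    for pct in stages:
        if not is_healthy(pct):
            return 0.0  # abort: send everyone back to the control version
    return stages[-1] if stages else 0.0


if __name__ == "__main__":
    # A rollout where every check passes reaches the last stage
    print(ramp_up([1, 5, 25, 50], lambda pct: True))
    # A rollout whose check fails at 25% exposure is aborted
    print(ramp_up([1, 5, 25, 50], lambda pct: pct < 25))
```

The point of the small early stages is that a badly broken treatment is seen by only a small fraction of users before the check fails.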

One thing the paper does well is to come up with examples that really get the reader to think, and to question why any of these decisions are made by executives rather than by data (as the executives themselves might have been stunned by the results). In addition, the paper does a good job of reminding us that it is important not just to determine whether one result is better than another in terms of clicks, etc., but to have a good set of success metrics defined before the experiment starts. At the same time, I question the rigidity of the experiment framework the authors have devised. Is it really necessary to enforce seemingly arbitrary rules such as 50% of the entire user base being an experimental group? It is true that you would have more results, but how would this scale to a website like Google? Could they afford to have millions of people using an experimental version? In addition, the paper does not give enough emphasis to "paper prototyping" or smaller experiments before rolling a change out to the world, which I find necessary if an experiment on such a scale is to be done.