Evaluation I: Methods and Techniques
- 1 Lecture Slides
- 2 Extra Materials
- 3 Discussant's Slides and Materials
- 4 Reading Responses
- 4.1 Airi Lampinen - 10/3/2010 21:55:09
- 4.2 Charlie Hsu - 10/5/2010 13:39:07
- 4.3 Dan Lynch - 10/5/2010 16:08:38
- 4.4 Luke Segars - 10/5/2010 17:22:09
- 4.5 Luke Segars - 10/5/2010 17:57:54
- 4.6 Drew Fisher - 10/5/2010 18:06:13
- 4.7 Anand Kulkarni - 10/5/2010 18:07:33
- 4.8 David Wong - 10/5/2010 18:10:43
- 4.9 Bryan Trinh - 10/5/2010 18:14:28
- 4.10 Pablo Paredes - 10/5/2010 18:19:56
- 4.11 Krishna - 10/5/2010 18:22:50
- 4.12 Aaron Hong - 10/5/2010 18:34:52
- 4.13 Brandon Liu - 10/5/2010 18:36:38
- 4.14 Thejo Kote - 10/5/2010 18:39:04
- 4.15 Shaon Barman - 10/5/2010 18:46:27
- 4.16 Linsey Hansen - 10/5/2010 18:49:08
- 4.17 Siamak Faridani - 10/5/2010 18:50:05
- 4.18 Thomas Schluchter - 10/5/2010 18:53:14
- 4.19 Arpad Kovacs - 10/5/2010 18:56:25
- 4.20 Matthew Can - 10/5/2010 18:56:53
- 4.21 Kenzan boo - 10/5/2010 19:04:58
- 4.22 Richard Shin - 10/5/2010 20:45:08
Discussant's Slides and Materials
Airi Lampinen - 10/3/2010 21:55:09
In their "Practical Guide to Controlled Experiments on the Web," Kohavi, Henne, and Sommerfield argue for the benefits of controlled experiments over guesswork. They state that the web provides great opportunities for quickly evaluating ideas, and they discuss controlled experiments in detail, noting the many names the technique goes by: randomized experiments, A/B tests, split tests, Control/Treatment tests, and parallel flights. I find the authors' observations about HiPPOs (Highest Paid Person's Opinions) fairly obvious. That listening to end-users is more useful than deferring to gatekeepers is hardly rocket science. It would have been more interesting to hear how this could actually be done effectively in an organization.
The authors make a stronger case in showing how hard it is to predict the success of new designs, and thus in justifying that experiments are needed to reach good outcomes. However, the discussion remains on a fairly basic level. The best things I got out of this were a couple of usable quotations, such as David Kelley's remark that "enlightened trial and error outperforms the planning of flawless intellect".
The second text, McGrath's "Methodology Matters: Doing Research in the Behavioral and Social Sciences," gives an introduction to what it means to "do research". According to the author, research in the behavioral and social sciences always involves combining three sets of things: some content of interest, some ideas that give meaning to that content, and some techniques or procedures by means of which those ideas and content can be studied.
McGrath's text was interesting reading in light of the discussion we had in class about what it is we do when we do research, as the author is trying to answer that very question. For him, doing research is simply a process in which some set of theoretical and empirical tools is used in a systematic way to increase understanding of some set of phenomena or events. While this definition holds true of many types of research, the social and behavioral sciences are special in that they focus on human systems of all sizes and the by-products of their actions.
I found it interesting how McGrath argued for the need to combine different types of methodologies and conceptual choices in order to reach a more encompassing understanding. I would like to agree with him wholly on this point, since the benefits of multidisciplinary work can be manifold. Yet I think McGrath was being optimistic (or perhaps positivist) in stating that combining different approaches can compensate for the weaknesses of each individual approach. Fitting different approaches together is difficult in reality, and a puzzle doesn't necessarily become whole just by adding more pieces to build from. Still, McGrath makes a good point in stating that "it is only by accumulating evidence, over studies done so that they involve different -complementary- methodological strengths and weaknesses, that we can begin to consider the evidence as credible, as probably true, as a body of empirically-based knowledge".
Charlie Hsu - 10/5/2010 13:39:07
This reading focuses on the nature of research, with an emphasis on the methodological domain: how research is actually conducted. McGrath writes that the substance of research in the behavioral and social sciences is "actors behaving towards objects in context". We can study the concepts behind those actors as well, in the conceptual domain. But the emphasis of the reading is methodology: how do we perform research? Ultimately, the reading goes through a large set of research methods: field, experimental, and respondent strategies, comparison techniques, randomization, sampling, and more. However, the final conclusion is that each method has a limitation: it is impossible to maximize all three goals of research (generalizability, precision, and realism) at once. Thus, research must use many methods that reinforce each other's findings.
I found the emphasis on the inherent weaknesses of each research strategy (field, experimental, and respondent) insightful and logical. The subsequent need to combine strategies to verify meaningful results was also a logical extension, and looking back on the user studies and design process in CS160, we used a broad set of these strategies to ensure the validity of our hypotheses. We used field strategies in Contextual Inquiry to observe our target group in its "natural" setting. We reinforced our findings there with respondent strategies, asking direct questions of our target group and test users. We performed experimental strategies, bringing in users to execute a set of tasks on mock prototypes in controlled environments. We even did some theorizing on general variables that represented the needs and desires of our test population. Certainly, the combination of all of these methods made our results much more believable... but only because they all provided the same conclusions where results overlapped.
Another point that resonated with me was the analysis of the different types of records that actors leave during a study. It seems intuitive that each of these has strengths and weaknesses. During the user studies in my CS160 group, we took the route of simply recording as much as possible: we had multiple notetakers logging observations for each user, self-reports from post-experiment debriefs, and video observation. However, we did not think of any way to collect trace measures or archival records, and I struggle to come up with a concrete example of collecting data in that category in the context of an HCI experiment. The ethical concerns also brought back memories of having to ask study participants to agree to being filmed; does the pressure of being filmed while performing a task skew results in any particular way?
Controlled Experiments on the Web
This paper served as a guide to conducting online experiments. Using the data-driven model of the web, which allows developers to quickly push changes and receive rich user data in return, experimentation on new features is easy both on the implementation and observation side. The paper describes some basic hypothesis testing and statistical analysis, offers some lessons and limitations taken from the method, and briefly describes some implementation concerns with A/B testing.
On a personal note, I found this paper to be extremely interesting and insightful, and I am very grateful to have been introduced to it!
I found that the overview of hypothesis testing and statistical analysis was well contained and distilled to the bare minimum needed for these sorts of A/B tests on web interfaces. All the important statistical factors for hypothesis testing were clearly and concisely laid out. Furthermore, we were also exposed to important methods of lowering variance: by choosing a binary variable such as "user converted: yes/no", we can reduce the variance, thus reducing the number of samples we need to achieve the desired power. By contrast, metrics that range over small integers, and especially over large real values, have progressively higher variance.
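For a binary conversion metric, the comparison the paper describes reduces to a two-proportion test. A minimal sketch of that idea (the function name and the example numbers are mine, not the paper's):

```python
from math import sqrt

def ab_conversion_test(conv_a, n_a, conv_b, n_b, z=1.96):
    """Difference in conversion rates between control (A) and treatment (B),
    with a normal-approximation 95% confidence interval.

    For a binary metric the per-sample variance is p*(1-p), which is why
    binary outcomes need fewer users than high-variance real-valued ones.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = ab_conversion_test(500, 10_000, 560, 10_000)
print(f"lift = {diff:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```

If the confidence interval excludes zero, the observed difference is statistically significant at the 5% level.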
I also felt that many of the cautions and limitations the paper pointed out were insightful and important. Day-of-week effects are something I have experienced firsthand; at one of my previous jobs, the product's use was heavily dependent not only on the day of the week, but also the time of day! The product was used by working professionals, so weekend use was always relatively stagnant compared to Monday mornings. A/A tests for software/hardware migrations and speed tests within A/B testing were also important tools for pushing new updates to the system. I also found the ramp-up strategy extremely logical, and realized that the entire software development world essentially uses it to some degree (alpha and beta releases to increasing numbers of people).
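The ramp-up strategy can be sketched as a rising exposure threshold over stable user buckets, so early testers never lose the feature as the rollout widens. The schedule and names below are hypothetical, not taken from the paper:

```python
import hashlib

# Hypothetical schedule: fraction of users exposed on each day of the rollout.
RAMP_SCHEDULE = [0.01, 0.05, 0.20, 0.50]  # day 0..3, then hold at 50/50

def exposed(user_id: str, day: int) -> bool:
    """True if this user sees the new version on the given rollout day.

    Each user hashes to a stable bucket in [0, 1); raising the threshold
    over time only adds users, so nobody flips back to the old version.
    """
    pct = RAMP_SCHEDULE[min(day, len(RAMP_SCHEDULE) - 1)]
    bucket = int(hashlib.sha1(user_id.encode()).hexdigest(), 16) % 10**6 / 10**6
    return bucket < pct
```

Because the bucket is deterministic and the threshold only grows, exposure is monotone: anyone exposed on day d stays exposed on day d+1, which makes it safe to abort the ramp-up early if metrics degrade.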
Dan Lynch - 10/5/2010 16:08:38
This paper discusses designing human-subject studies, as well as executing and evaluating them. Validation and correlation are also discussed.
This is an important topic because much of HCI research is related to the behavioral and social sciences---sciences that are innately hard to quantify. Many systems have been created without necessarily doing this type of research, but they probably also lack usability. A user-centered approach is important in the MTV generation that we live in today.
Additionally, making sure that our research data is valid is of the utmost importance. It's too often that we hear statistics and take them for granted---how can we validate them? This reading provided a very structured methodology for user studies and for validating the data aggregated from them.
Guide to Controlled Experiments
This article discusses how to run controlled experiments on the web! What a task. I found it striking that when the Doctor FootCare website added coupon code entry, it lost 90% of its revenue, even though the company thought it had upgraded the site. This falls under results and return on investment, which is a very important topic of interest! At any major corporation, if you didn't take these types of interactions into account, would you still have a job? The developer at Doctor FootCare probably doesn't. The idea of this paper is that you can run usability studies, take this information, interpret it, and then make a valid choice about how to develop an interface.
Luke Segars - 10/5/2010 17:22:09
Practical Guide to Controlled Experiments
This paper explored a more quantitative topic than our other recent papers and explained some powerful concepts that could be used in almost any sort of online research. The simple A/B test, something I'd only casually heard about in the past, is an outstandingly powerful technique that is both simple and descriptive. The paper discusses reasonably complex procedures and experiment design without bogging readers down in formulas or specialist lingo, and I came out of it feeling like I really learned something useful.

One well-known strength of online systems is that updates are easy to distribute; in fact, distribution is automatic for all users due to the online (browser-based) nature of the product. This can be both a good and a bad thing, since releasing new "improvements" can either bolster a site's usefulness or wreck its usability. In an age of abundant web applications, a single usability failure can mean the loss of a chunk of users. These well-known facts of the digital age are what give the A/B test described in this paper so much strength. Being able to "partially" roll out a change not only provides some damage control but also allows designers to collect meaningful samples of real users' opinions about a particular change. If the reviews gravitate in a negative direction, it is easy to reverse the rollout and adjust the new release before attempting another test.

The A/B approach works for a large variety of site elements (UI, backend algorithms), but I suspect there are some situations that may be difficult to test this way. Mass communication has the chance of ruining your experiment if users start posting about new features to forums, distorting users' knowledge of the "new" and "old" versions of the site.
Perhaps most significantly, A/B tests (and probably most other statistical tests) aren't particularly telling for smaller communities or online shops. Requiring hundreds of thousands of samples for a particular test, as the example in the paper did, may not be feasible on a short timeline in many environments. Ultimately, I found this paper to be great. Some sort of scientific approach is very important for extracting "rules" from design, and equally important to the business world for practically releasing new and exciting features into the constantly evolving online ecosystem.
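This feasibility concern is easy to quantify with the rule of thumb the paper gives for sample size per variant, n = 16σ²/Δ² (roughly 80% power at 95% confidence), where σ² = p(1−p) for a conversion metric. The function name and numbers below are illustrative:

```python
def samples_needed(p: float, rel_effect: float) -> int:
    """Rule-of-thumb users per variant: n = 16 * sigma^2 / delta^2,
    where sigma^2 = p * (1 - p) for a conversion metric and delta is
    the absolute change we want to detect."""
    delta = p * rel_effect  # absolute effect size
    return round(16 * p * (1 - p) / delta ** 2)

# Detecting a 5% relative lift on a 5% conversion rate:
print(samples_needed(0.05, 0.05))  # 121600 users per variant
```

A small shop with a few thousand visitors a week would need months to run that single test, which is exactly the limitation of small communities.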
Luke Segars - 10/5/2010 17:57:54
Methodology Matters
This paper outlines the concept of scientific research and how it can be undertaken in a nondeterministic field such as the social sciences. The author starts by describing the shared goals of all scientific research before moving into study design techniques. Although the paper was generally helpful, I found that it took longer than necessary to get its message across and that its descriptions of several topics were fairly unclear.
The author succeeded in providing an overview of the purpose and the various components of designing a social science experiment. In particular, I found the "Techniques for Manipulating Variables" section both interesting and useful. A number of principles in this area were unclear to me (coming from a quantitative background) before reading, but I felt significantly better informed afterward. The "Research Methods as Opportunities and Limitations" section also nicely summarizes some really interesting perspectives from the rest of the paper on the nature of research experiments -- in particular, the relatively few high-level differences between social science research and quantitative work.
It's interesting to consider the internet as a mechanism for social experiments in the framework provided by the paper (quadrant two in particular). In many ways the internet provides a hybrid between two strategies, lab testing and experimental simulation, that can acquire a totally different type of data (along with totally different types of problems).
Drew Fisher - 10/5/2010 18:06:13
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO - Kohavi, Henne, and Sommerfield
This paper, published in the ACM conference on Knowledge Discovery and Data Mining, describes and defends guidelines for performing statistically valid user studies on the internet and how to run such studies most effectively. They make a strong case for ignoring expert opinion and instead using actual data from real users.
This is a feasible approach, provided that the implementation of the study follows sound statistical principles and does not overlook some real-world effects. Issues they described include failing to adequately plan experimental tests and sample size, day-of-week phenomena, interacting variables, and so forth, along with recommendations for how to overcome these issues.
The paper was somewhat lengthy in its discussion of statistics, but I recognize that computer science tends not to use much in the way of statistical methods for benchmarks, since we can measure many of our tasks precisely instead. On the other hand, since people are much more random than systems, the proper application of statistics to real experiments on real people produces very convincing results - more so than an expert opinion.
I can't help but feel that this paper covers a lot of material that is already well-understood by the design community and the social science community. This makes me wonder how much novel material this contribution offers - is this sort of cross-topic paper typical in academia?
Methodology Matters: Doing Research in the Behavioral and Social Sciences - McGrath
The interesting contribution of this paper is its discussion of the competing goals of Generalizability, Precision, and Realism. Since no experiment can maximize all three, it makes sense to design experiments with a knowledge of what each is likely to contribute and which areas it may not apply to. The key point is that because no single experiment can cover all three goals, multiple different experiments should be undertaken. By analyzing whether they corroborate one another on all accounts, or whether limitations in the experimental methods yield different results, we can reevaluate the experimental methods as well as obtain more convincing results.
This paper, too, spent a lot of time on basic statistics, a field of mathematics that has been around since the 17th century. Seriously, McGrath: way too much discussion of what probability values and correlations mean - refer the reader to any of numerous statistics texts and get to your novel key points.
Anand Kulkarni - 10/5/2010 18:07:33
Controlled Experiments on the Web
This KDD paper explains principles and best practices for running user experiments on the web, with a focus on usability and feature testing for online applications and e-commerce.
The paper is a comprehensive survey of techniques and best practices rather than a collection of significantly new ideas. Its primary contributions include a review of statistical techniques as applied to online settings, a discussion of the randomization and assignment components of an experiment, and broader advice for implementation. It goes without saying that controlled user testing is tremendously valuable in HCI research; in some ways it is the principal way to demonstrate that one approach is better than another, and the paper serves an important role in bringing new researchers up to speed on how to carry out these tests. The discussion in section 5.1.3 (should experiments test single or multiple factors?) is particularly helpful because it challenges the conventional wisdom that feature testing should be done one feature at a time. The authors don't identify whether they consider their treatment of the topic comprehensive or whether outside resources should be consulted before running an experiment; this is a modest criticism.
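The randomization and assignment component the paper discusses is commonly implemented with deterministic hashing, so that a user always lands in the same variant across visits. A minimal sketch under that assumption (the function name and salting scheme are my own, not the authors'):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 0.5) -> str:
    """Deterministically assign a user to control or treatment.

    Hashing the user ID (rather than storing assignments) keeps a user's
    variant stable across visits; salting with the experiment name keeps
    assignments across different experiments uncorrelated.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_pct else "control"

print(assign_variant("user-42", "checkout-redesign"))
```

Because the hash is uniform, a 0.5 threshold yields the 50/50 split assumed throughout the paper's statistics.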
I like that the authors open almost immediately with discussion of cases where controlled user testing proved extremely valuable; the fact that Amazon's recommendation system was costing the company money by going unimplemented was a great way to motivate the paper's recommendations. The two motivating examples in section 2 also do a good job of motivating the rest of the paper. The results of section 3 are presented as a set of recommendations rather than lengthy analysis or extended argument, which is actually an advantage for an instructional paper. One criticism is that the summary section at the end is not strong; it falls into the common trap of concluding by restating the paper's main points rather than giving new information or pointers to additional resources.
Methodology Matters
The author provides a comprehensive review of how social science research is conducted.
Much of the content has value for HCI research, since HCI necessarily overlaps with social science research in its techniques and limitations. The discussion of the strengths and weaknesses of different measures is useful, as is the early discussion of the different kinds of experiments that can be carried out. The discussion of how to design experiments is also pragmatic, though presented somewhat haphazardly. I also appreciate the emphasis placed on checking validity; this is often a lacking or problematic aspect of many studies and important to emphasize. Certainly, it seems likely this would be a useful resource when designing social science research experiments; the main problem is that the topics chosen for discussion appear to have been selected somewhat erratically. Though I'm sure this wasn't the case, a guiding thread throughout the paper seems to be missing.
I wish the paper had been shorter; since the author's goals were pedagogical, these social science techniques could have been summarized more succinctly. The block presentation of unbroken text is not an effective way to convey information. I do appreciate that the author was comprehensive and provided references to more substantial texts from time to time. The arguments are supported largely by references to other resources, which is suitable for the author's purpose of summarizing the techniques of others. At times, I wish the author had stated more plainly whether specific methodologies were disputed.
David Wong - 10/5/2010 18:10:43
1) The "Methodology Matters" paper thoroughly discusses experiment design and the pros and cons of different types of experiment setup, methodology of measurement, choosing and allocating participants for an experiment, and validating results. The "Practical Guide" paper described important aspects of running controlled experiments on the web, which include methods for calculating sample size, randomization and assignment algorithms, and potential pitfalls and lessons learned.
2) The "Methodology Matters" paper offers a good, comprehensive perspective that is important to take when designing experiments. As it is catered towards the social and behavioral sciences, it is quite relevant to the field of HCI. While the article states that "the study of human-computer interaction has become a viable science with a cumulative body of credibly interpretable evidence", the paper does not go into more depth to qualify that statement. Regardless, since HCI is definitely a behavioral science, it is important to keep the points from the paper in mind while designing experiments. One idea that was particularly obvious, but one that I had never consciously considered, is that every experiment design and methodology has its inherent weaknesses and strengths, and a good experiment needs to combine different styles of design and methodology to avoid confounded results. While this is quite common-sense, as is a lot of the material in the paper, I still think the points it makes are very important to consider.
The "Practical Guide" paper offers an interesting perspective on running controlled experiments on the web. While its approach is rather specific, i.e. partitioning 50% of users to control and 50% to treatment, it offers some good advice on designing online controlled experiments. For the HCI community, the paper is certainly useful, but it focuses on businesses and assumes that the websites employing these controlled experiments already have high traffic. The advice prescribed in this paper coincides for the most part with the points made in the "Methodology Matters" paper, except that it rather leniently attributes a causal relationship to an A/B testing experiment without fully qualifying it.
3) The "Methodology Matters" paper offers a relatively sound argument. While some of the points made are quite general, as a high-level description of experiment design, it is quite good. The problem is well-motivated and offers a sound perspective on how to design an experiment. The paper was solely qualitative and offered no experiments or quantitative data that could be confounded.
The "Practical Guide" paper offers a sound argument for the evidence it presented, although to what extent it can generalize to all online controlled experiments is questionable. Insights from the "Methodology Matters" paper can offer some perspective on this issue. The "Practical Guide" paper describes a new type of experiment not described in the "Methodology Matters" paper. More specifically, it is the cross between a field experiment and a controlled experiment. As the medium is the Internet, this is possible and it offers some interesting implications on how to design online experiments. As such, the research paper is well motivated as the field of online controlled studies is still relatively new.
Bryan Trinh - 10/5/2010 18:14:28
Methodology Matters: Doing Research in the Behavioral and Social Sciences
Joseph McGrath gives an overview of the landscape of different research methods in the behavioral sciences. His main idea is that there are a number of different methods for collecting data on human beings doing a particular task in a particular context, and that being aware of the trade-offs of these methods is paramount in designing the correct experiments for the job.
The study of a particular human interaction is not complete with just one method. Instead, many studies of various kinds need to be conducted to address the limitations of the methods used. Arriving at the same result with a different method further validates the result, whereas a conflicting result calls the original conclusions into question. In either case, the new information is useful and should not be viewed as wasted effort.
After reading the paper, it becomes clear that there exist trade-offs for each of the research methods and that more methods are generally better. But when do you reach the point of diminishing returns? This question is especially pressing if the research is done in a fast-paced work environment. Research needs to be done in a timely manner to meet the bottom line of generating revenue. In that case it doesn't matter as much that the data is completely valid; it just matters that the data is practically useful.
Market research is a common industry function that is often done in haste, from the collection side through to the analysis side. Conclusions can never be inconclusive; instead, the most likely answer is forced.
I think part of the reason why Fac
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO
This paper gives a practical approach to running controlled experiments on different computer interfaces. Along with the pragmatics of interface-testing experiments, the authors briefly cover the relevant statistics.
More interestingly, the authors give a list of insights they have arrived at through their own work. For instance, they found that a feature will sometimes be observed to be losing an A/B test simply because of the time it takes to load. This introduces a variable into the test that initially wasn't a cause for concern in my mind. They also recommended the continual use of A/A tests to verify that no external factors beyond the controlled ones affect the A/B test. Initially this seemed redundant to me, but if the cost of running these tests is low, why not? The focus on practical application was insightful.
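The value of A/A tests is easy to see in simulation: when both groups draw from an identical population, a test at the 5% significance level will still flag a "difference" about 5% of the time, and a rate far from that signals a broken experiment pipeline. A rough sketch (all names and parameters are mine, not from the paper):

```python
import random

def aa_false_positive_rate(runs=2000, n=1000, p=0.10, z=1.96, seed=1):
    """Simulate repeated A/A tests on a binary metric.

    Both arms convert at the same rate p, so every 'significant' result
    is a false positive; with a 1.96 z-threshold we expect roughly 5%.
    """
    rng = random.Random(seed)
    false_pos = 0
    for _ in range(runs):
        a = sum(rng.random() < p for _ in range(n))  # conversions in arm A
        b = sum(rng.random() < p for _ in range(n))  # conversions in arm B
        pooled = (a + b) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        if se > 0 and abs(a / n - b / n) / se > z:
            false_pos += 1
    return false_pos / runs

print(aa_false_positive_rate())  # close to 0.05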
The ease of testing different features does lend itself to very poor quality assurance, though. Facebook is the best example of this. They are continually testing various features of their website, which frankly makes for a very inconsistent experience. The cost of disappointing the user base just does not matter to them at this point; no matter how many mistakes they make, users will continue to use the service. The authors claim that this type of testing creates rapid innovation, but speed does not mean innovation. Developers can invariably get stuck in the idea that they can throw any feature at an A/B test without thinking much about the design. The cost of testing a feature on a user group is in many cases not high enough to push developers to design carefully. The A/B test then simply becomes a black box that tells you which feature performs better.
Pablo Paredes - 10/5/2010 18:19:56
Summary for McGrath, Joseph - Methodology Matters: Doing Research in the Behavioral and Social Sciences
The main point of the reading is the importance of methodology in behavioral research. The author describes how the vision of research in layers of domains (substantive, conceptual, and methodological) should be dissected into the different elements, relations, and embedding systems that compose a research effort. Alternation and combination among these domains are useful and many times necessary.
Methods can be seen as the tools a researcher has to perform a research experiment; however, every method has specific tasks at which it excels, and therefore limitations. So the use of several methods is necessary, and it should be done with "patterned diversity" (i.e., methods complementing each other). In the end, no matter which methods are used, consistency and convergence of evidence should occur across studies.
Although there are many important definitions and concepts related to the research process itself, such as measurement, design, and manipulation, I want to emphasize two that I consider of great importance: research strategy and validation. Research experiments should try to maximize three features: generalizability, precision, and realism. However, there is no way to maximize all three at once. This is where strategy plays an important role in defining which types of strategies (field, experimental, respondent, theoretical) to follow. As in any other aspect of social research, there is a need to balance and combine these strategies to find a successful path to appropriate results. I personally believe strategies always have to be formulated with a clear end goal in mind, but they must remain flexible and adaptable. I think developing the experience to understand how best to evolve research along different paths requires not only trained individuals, but groups of individuals who can bring complementary expertise to a study.
The second aspect, validity, is clearly defined as something to be sought in the data, in the questions, and in the generalizability of the experiment. Validity, to me, is a concept that requires careful analysis, as we are dealing with families of probable solutions and limits of tolerance, as well as interpretation and inference processes, all of which leave social studies perpetually open to criticism. I believe that beyond the hard-core validation schemes presented, other forms of validation, such as public acceptance expressed via mass media, policy, or other channels, also have a role in determining the angle of validation and consensus.
Summary for Kohavi, R., Henne, R., and Sommerfield, D. - Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO
I like the clever phrase "The difference between practice and theory in practice is larger than the difference between theory and practice in theory"... I honestly think there are good insights here, but my comments are more about the amusement I feel at seeing how much reinventing of the wheel occurs across fields. Many other disciplines, especially marketing and business, have been dealing with customers for centuries! If there is a need to better understand how to focus solutions in an ROI setting, many of the lessons (re)learned in this paper, such as seasonality and peak demand times, could have been quickly inferred from the literature of those disciplines.
People who live the art of sales and marketing understand that the voice of the customer stands above all other factors and is predominant in making decisions about any massively adopted system, commercial or not. Therefore I believe the paper could perhaps have been titled differently and focused mostly on methodology for improving already well-understood evaluations, rather than on opposing the view of the customer to the HiPPO (Highest Paid Person's Opinion).
With that disclaimer, I believe it is important to observe that the whole process of analyzing online consumption basically boils down to three components: analysis; trust and execution; and culture and business. These are pretty much the same elements that make up any market research study, but done at the "speed of light" and in large volumes (the dream of a market researcher). The trust element probably takes on a different dimension due to the speed and size of the experiment, but all in all there is a strong parallel with marketing, the art/science of understanding customers.
Krishna - 10/5/2010 18:22:50
The author says that doing research typically involves content - which I understand as the problem domain - a set of ideas or hypotheses, and a set of techniques to study the ideas or verify the hypotheses. He abstracts these sets into three domains: Substantive, Conceptual, and Methodological. He says research deals with (questioning, explaining, understanding, ...) the relations between elements within a context, and these elements and relations take different forms in each of these domains. In the methodological domain - the set from which a researcher draws techniques or strategies to study or verify - the elements are methods, the various modes of treating the phenomena or their properties under study; examples include measuring, manipulating, controlling, and distributing impact (by randomization). Relations, in this domain, become comparison techniques - for example, techniques to assess the association or correlation between measures.
These heavy abstractions and definitions allow the author to make many intuitive arguments. He says all methods are limited in one way or another and can be used to study only a finite set of phenomena; they are thus "bounded opportunities to gain knowledge". More importantly, the confidence we can have in results obtained by applying the methods is contingent on the methods themselves - determined by their applicability, for example. He argues that this leads to an inherent problem in doing empirical studies: researchers cannot use a single method and draw conclusions, as that single method may not be applicable at all for that study. He suggests using multiple, carefully selected, orthogonal methods and making judgements based on the collection of outcomes. He extends this argument to the other domains defined in his earlier abstractions, arguing that there is no single 'universal' method or strategy or comparison technique applicable to all problems or to the study of any possible concept.
Researchers, the author says, tend to try to maximize generalizability, realism, and precision all at the same time. He argues that this is not possible and that there are trade-offs - for example, a carefully controlled study, while improving precision, will reduce realism; it will also narrow the range of participants and thus reduce generalizability. He outlines eight dominant, distinct strategies used by researchers in the social and behavioral sciences and shows how each strategy optimizes generalizability, realism, precision, or some combination, but 'not all' - the key insight being that there is a continuum along these three dimensions. The author notes that any study can measure, compare, and validate from only a finite, limited number of observations, which may be only a subset of a larger set, so there is always the possibility of missing potentially important observations. He says random sampling may help but adds that there is no guarantee a random sample will fully represent the population - in the next few sections he outlines the importance of sampling methods, statistical inference, and confidence intervals.
It is difficult to summarize this paper; the author outlines possible pitfalls in doing empirical studies and suggests techniques and solutions. The section on manipulating independent variables under various other constraints was important and instructive, at least to me, having no prior background in the area. In short: a didactic, long, but reasonable read and a definite future reference.
Practical Guide To Controlled Experiments
The paper is about controlled experiments, their usefulness, and the opportunity the web provides for conducting them. The authors describe controlled experiments as follows: users are randomly assigned to one of two variants, Control or Treatment; the variants may differ, for example, only in the availability of a product feature whose efficacy is under question; then any difference in the data and measurements collected on the behavior of users in the two groups will directly reflect the effect of the feature. The authors give various cases where such experiments have been hugely successful in determining the effects of new, experimental features.
The key questions are 1) how to select a random population and assign users, 2) how to analyze the data, i.e., what metrics should be used to compare the two sets, and 3) where such experiments will fail. Answers to 1 and 2 come largely from statistical machinery and tools - hypothesis testing using confidence levels, standard error, significance testing, etc. Interestingly, the authors also suggest using machine learning techniques - clustering users within the sets and measuring mean distances, for example.
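To make question 2 concrete, here is a minimal sketch of the kind of significance test the paper's statistical machinery boils down to for conversion-rate metrics: a two-proportion z-test comparing Control against Treatment. The function name and the sample numbers are illustrative, not from the paper.

```python
from math import sqrt
from statistics import NormalDist

def ab_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is the Treatment conversion rate
    significantly different from the Control rate?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided
    return z, p_value

# Control: 1000 users, 100 conversions; Treatment: 1000 users, 130 conversions
z, p = ab_z_test(100, 1000, 130, 1000)
print(round(z, 2), round(p, 4))
```

With these illustrative numbers the lift is significant at the conventional 95% confidence level the authors use (p below 0.05).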
I like their idea of gradually increasing the number of users exposed to the new, experimental feature (the treatment), as this enables teams to quickly identify problems and, more importantly, exposes fewer users to possibly bad ideas or implementations.
The limitations can be summarized as: the inherently quantitative nature of these experiments - lots of numbers without intuition about user behavior or context; newness bias - users may take time to adjust to a new feature, affecting the results; and coming up with good statistical measures - as the previous paper noted, all measurement and comparison methods have their strengths and weaknesses, and choosing the right ones is hard.
Aaron Hong - 10/5/2010 18:34:52
In "Practical Guide to Controlled Experiments on the Web", Kohavi et al. go over user testing on the web with much practical advice. They cover A/B tests, which expose test subjects to two variants (one control and one treatment group). They then give advice on how to apply the technique in the real world.
What's most interesting is that I didn't realize they were talking about testing on their real websites, with users who aren't expecting to be test subjects. Ultimately, although users may notice consistency issues, the chances are low that experimentation will adversely affect the user experience or the website in general. I thought this was a pretty smart way of gathering data and evaluating new features - no need to appeal to intuition and reasoning. However, once the results are in, hopefully our reasoning can help explain why the changes make a statistical difference. The only issue is that this can't easily be applied to research, because it works precisely in the realm where users aren't expecting to be tested on - they are just visiting a useful/favorite website.
"Methodology Matters" by McGrath talks about how to do research in the behavioral and social sciences. He outlines three domains: Substantive, Conceptual, and Methodological. The Substantive domain is that "from which we draw contents that seem worthy of our study and attention"; the Conceptual is that "from which we draw ideas that seem likely to give meaning to our results"; the Methodological mainly means the techniques by which we can conduct research. It's an interesting way to think about it. The conclusion is that every method has its limitations, and that research is a challenge of getting the right balance in order to gain real knowledge.
Brandon Liu - 10/5/2010 18:36:38
"Methodology Matters: Doing Research in the Behavioral and Social Sciences"
This is an overview of basic experimental methodology in social sciences. It breaks domains into the substantive, the conceptual, and the methodological. Each of these relate respectively to what is being studied; the possible explanations; and the techniques to study them. The paper presents a diagram showing the space of possible behavioral experiments.
The paper then describes the pros and cons of different recording techniques, such as self-reporting, trace measures, and observations. Finally, the paper describes techniques for influencing an experiment, including careful selection of participants, direct intervention, and induction of some state upon participants.
Although the paper didn't directly address HCI, I felt that the discussion of internal vs. external vs. construct validity was especially relevant in the context of the other reading for this class. A/B testing is an approach made possible in HCI; this kind of technique is available in very few other fields of research. In A/B testing, there is no strong theoretical model to guide choices - instead, finding an optimum is a brute-force search. The distinction between the kinds of validity is important when considering the results of A/B tests, since it is hard to draw conclusions from a process without a theoretical model.
"Practical Guide to Controlled Experiments On the Web"
This was one of my favorite papers so far. Much of the paper was spent explaining and justifying A/B testing, which I was familiar with already. The paper has a brief description of how to implement A/B testing. In practice, practitioners of A/B testing rarely run into issues with randomization. The way I've usually seen it done is by hashing a unique ID and storing it in a database. Modern web frameworks have easy support for traffic splitting and tracking state within user sessions.
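The hash-the-unique-ID approach mentioned above can be sketched in a few lines. This is an illustrative implementation, not any particular framework's API; the function name, the experiment name, and the choice of MD5 (which the Kohavi et al. paper recommends for its uniformity) are assumptions for the example.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically assign a user to Control or Treatment by hashing
    the user ID together with an experiment name, so that assignments are
    stable per user but uncorrelated across experiments."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                      # bucket in 0..99
    return "treatment" if bucket < treatment_pct else "control"

# The same user always sees the same variant of a given experiment,
# so no per-user state needs to be stored at all.
print(assign_variant("user-42", "new-checkout"))
```

Because the assignment is a pure function of the ID, it can be recomputed on any server, which is one reason hashing tends to beat database lookups for traffic splitting.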
One research direction to build on is automated A/B testing frameworks. These practices are in use in web development, both for determining layouts and interfaces and for determining the 'copy' of websites - that is, which sentence is most persuasive in getting a user to sign up. This could be integrated with a service like Soylent to find the best-performing copy on a website.
In the context of HCI, A/B testing is controversial. I've had experience with companies that are completely data-driven in their product decisions, and the end result is a product which can drive a lot of traffic but is not intuitively pleasant to use. A/B testing has a bad reputation among web designers, and the exact reasons for this are non-obvious. The most famous case is that of Google's lead visual designer leaving because of excessive use of split testing. While I agree with his sentiment that "minuscule design decisions" such as choosing between 41 shades of blue are frivolous, I see testing as a way to reduce the need for small decision-making and free up time for larger ideas and directions.
Thejo Kote - 10/5/2010 18:39:04
Practical guide to controlled experiments on the Web:
Kohavi et al. describe their process of conducting experiments to measure the effectiveness of changes made to web services. They provide a guide to building an experimentation framework and share their experience conducting A/B tests at some of the largest properties on the web. They also cover the basic statistics involved in conducting these experiments.
After providing a primer on controlled experiments in general, the authors address the unique aspects of an online environment. They suggest an implementation architecture, with attention to the randomization algorithm used to select the treatment group and the assignment of application changes to those users. The most interesting section to me was the one in which they share the lessons learned from their experience. Specific advice, like the influence of day-of-week effects and the role of a data-driven culture in innovation, was helpful.
With respect to A/B testing, I've always wondered about the cold start problem. When a new web service launches, I don't know how useful these techniques are because of the small sample sizes. So, they are probably useful in enabling incremental improvements once a service has shown its value.
In this chapter McGrath stresses the importance of methodology in conducting research. He focuses on the ways in which a science gathers and analyzes information - the method - and describes the challenges involved in bringing rigor to the task. He points out that every method, no matter which, is flawed, but in different ways. It is up to the researcher to pick and choose the right set of methods so that their advantages and disadvantages complement each other and provide consistent results across studies or methods.
An interesting observation by McGrath concerns the choice of research setting. He says that a researcher can never achieve generalizability, precision, and realism at the same time; there is a trade-off based on the study at hand. He provides an overview of the strategies involved - field studies, experimental studies, respondent-based studies, and theoretical studies - and how the choice of strategy decides the scope of the outcome.
McGrath provides an overview of the basic comparison, randomization, and sampling techniques that a researcher needs to be aware of. He also discusses the concept of validity of results, the different types of measures that are possible, and their pros and cons. In short, this is a high-level overview which addresses the major topics in the conduct and evaluation of research, with an emphasis on social science research.
Shaon Barman - 10/5/2010 18:46:27
In Methodology Matters, the author outlines the steps needed to conduct a meaningful study in the social sciences.
Although the paper is aimed at the social sciences, many of the topics brought up apply to science in general. The author does a good job of presenting experimental studies in the social sciences and showing how many of the goals do not work well together. The three desiderata of any evidence are generalizability, precision, and realism. The author then discusses the statistical foundations of the experiment: whether or not the experimental results support the claim with some degree of confidence. It seems important to keep this in mind when designing experiments. But it seems that in research, especially in CS, finding statistically significant results is not always deemed necessary to "prove a result". Running enough trials to get a statistically significant result is time-consuming and uses a lot of resources. Instead, smaller user studies are conducted and analyzed in greater depth. Finally, the author discusses different measures and how to manipulate variables.
Overall, the paper does a good job in discussing how to construct experiments and validate the results. Humans are complex and it is difficult to isolate one variable in an experiment. I think one way of making the concepts in the paper more concrete would be an outline of a real experiment, detailing each of the steps and the thoughts that went behind each design decision.
In Listen to Your Customers not to the HiPPO, the authors make a case for A-B testing and provide practical information for those wanting to implement it.
I enjoyed the mix of statistical background and real-world considerations in this paper. Not having done this type of work before, I did not know much about the statistics behind A-B testing. The biggest take-away for me was that when conducting one of these tests, it's important to decide beforehand what effect the test is trying to detect. This then affects how many trials of the experiment need to be run, the ramp-up time, etc. Also, I was surprised to see how isolating the target audience could reduce the number of users needed in a test from millions to thousands. The practical information was also useful, such as the randomization techniques and running tests for a whole week instead of 5 days. One way this method might be improved would be to embed the webpage with code that tracks some of the user's activities (such as time on the page). This data could then be correlated with the A-B test results to provide more of an explanation of why a feature did or did not perform well.
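The link between the effect you decide to detect and the number of users required can be sketched with the standard rule of thumb for sample size (roughly 16·σ²/Δ² per variant at α = 0.05 and 80% power). The function and the example rates below are illustrative assumptions, not figures from the paper.

```python
def sample_size_per_variant(base_rate: float, min_effect: float) -> int:
    """Rough rule of thumb (alpha = 0.05, power = 0.8) for a
    conversion-rate A/B test: n ~ 16 * sigma^2 / delta^2 per variant,
    where sigma^2 = p(1-p) and delta is the absolute effect to detect."""
    variance = base_rate * (1 - base_rate)
    return int(16 * variance / min_effect ** 2)

# Detecting a 5% relative lift on a 5% base rate needs far more users
# than detecting the same relative lift on a 20% base rate:
print(sample_size_per_variant(0.05, 0.0025))   # 5% base, 0.25% absolute lift
print(sample_size_per_variant(0.20, 0.01))     # 20% base, 1% absolute lift
```

This is why narrowing the test to the affected audience (raising the base rate of the metric among included users) can shrink the required sample from millions to thousands.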
Linsey Hansen - 10/5/2010 18:49:08
In their article, Kohavi, Henne, and Sommerfield cover the importance of using experimental data over the preferences of HiPPOs, since the mind of the customer generally works far differently than any analysis might predict.
As opposed to just doing homogeneous user testing (where every user is subject to the same conditions), having both a control and an experimental group makes sense. In some cases, when asked for feedback on how they "liked" an interface and how "usable" they thought it was (at least if the user is asked for feedback), users may very well have found it nice, but would not really want to use some of the features themselves. In the case of just analyzing raw data, with no user feedback, results might look good, but if there are no results from the same week/time in a control condition, there is no guarantee that they imply any improvement. Then there are also cases where there is an improvement, but the reasons are either not entirely known to begin with or remain unknown, such as the two examples with Microsoft and Doctor FootCare, because predicting a user's interpretation of a button or a couple of lines of text is probably not an easy task.
While the authors did mention it, I feel like they did not place enough emphasis on how familiarity with an interface can skew results. If someone is a master of an interface, a week would probably not be enough to master a new one, so any results gained will not be that good. Using people who are not experienced with a website could remedy this, but with most popular websites nowadays, that could be hard. In the Doctor FootCare example, while the coupon code is something I definitely think about, I feel like part of the problem might have been familiarity, plus the fact that the "newer" interface seemed to have a lot more information on it, which could have been annoying.
Following a similar theme, McGrath covers the importance of the research methods used in an experiment, and how using the wrong method can lead to improperly skewed data.
Compared to "average" tests, McGrath suggests ways for the experimenter to hold as much control as possible, which includes controlling the subject's feelings, work environment, and social environment. I feel like some of these can be a bit unnecessary in most usability tests, because you generally want to test the user in their normal work environment, as opposed to simulating your own. I suppose one could try to simulate the user's work environment, but it would probably just be easier to bring the test to the user in that case (at least if the interface being tested is mostly software). In a way, I guess that usability researchers who go to the user are employing these methods, since they technically do exercise some control by restricting the user to their work environment.
Siamak Faridani - 10/5/2010 18:50:05
Methodology Matters: Doing Research in the Behavioral and Social Sciences
To me this article is about outlining the steps for using the scientific method in the domain of the social sciences. An interesting analogy to the general scientific method can be made: the scientific method is based on observation, hypothesis, experimentation, theory, and improvement; similarly, the author talks about content, ideas, and techniques. The author provides high-level guidelines for designing and conducting research and experiments in the social sciences.
Importance and shortcomings of methods: The different methods drawn from the methodological domain each have shortcomings and flaws, but methods can be combined in a study to widen the information being collected and enhance our understanding of the evidence. In these hybrid studies the strengths of one method can offset the weaknesses of others. Consistent and converging evidence across studies with different methods results in credible empirical knowledge and expands our understanding.
Relation to Normal Science and Pasteur's Quadrant: In the Research Strategies section the author points out that research should be generalizable, precise, and realistic. Similar to Stokes's model, Figure 2 places each scientific task in a 2D space whose axes are Abstract/Concrete and Obtrusive/Unobtrusive. I find this diagram in line with Kuhn's and Stokes's articles; for example, Formal Theory can be considered normal science, and as we move away from the center we may see paradigm shifts happening.
Study design: Study design might be the most important part of the article. The author talks about the importance of proper design for a study, what it means to have 'valid results', and how we can work towards getting valid results - for instance, why we should conduct experiments in random order. Unfortunately the author does not dive into the statistical details of experiment design or sampling.
The article is interesting, but the author does not provide a practical approach to experiment design. In particular, examples are missing, which may be why this article is paired with Ron Kohavi's article.
The second article was more practical and very interesting. The authors provide a number of examples of how we can conduct user studies on live websites. I especially think it can be a very helpful article for startups that iterate quickly. It also seems that these concepts have been put to use by many other companies, including Google (Google has been testing Google Instant the same way).
One thing that I believe the authors really miss is that they look at all users as one group. This averaging may distort the conclusions drawn. For example, the audience of a website might be 75% male and 25% female, and the new feature (treatment) might totally alienate the female users while still increasing the overall ROI (the treatment might be a feature that males like but females would hate). The authors do not comment on how we can correctly infer outcomes for different demographic groups. Some aspects of HCI research are also related to human factors, and I am not sure we can trust the Internet to provide us with correct information. This might be a great method for companies to conduct research, but it should be used with caution for controlled experiments.
Thomas Schluchter - 10/5/2010 18:53:14
The chapter reviews approaches to the design of social science research with emphasis on methodological questions. It does a good job of explaining the epistemological limits of research, and outlines strategies to contain the problems that arise from these limits.
Overall, I find the part most relevant that deals with the weaknesses of methods. Probably one of the most crucial lessons for researchers is to be very clear about the degree to which their findings have anything to do with the reality they are trying to probe. Without a clear understanding of methodological limitations, sweeping generalizations (especially on the basis of quantitative data) will lead to a false sense of security of knowledge.
In physics, we have long accepted that the use of instruments to observe natural phenomena introduces all kinds of ambiguity, loss of precision and may even change the observed phenomenon itself. In the social sciences, the instruments we use are even less reliable: they are based on human thought and language, and sometimes, as in ethnographic research, the instrument itself is human. The existence of social science goes to show that human thought, language and humans themselves are extremely difficult to understand -- what does that say about our instruments?
For research in HCI, all of the limitations that are outlined in McGrath's chapter apply. But the fact that this type of research frequently involves computer systems confers some methodological advantages. It is, for example, much easier to log trace measures from an interaction with a computer system than to rummage through people's trash for whisky bottles. Because one part of the equation is usually deterministic, the interactions of humans and computers should be easier to measure reliably than interactions between humans.
Arpad Kovacs - 10/5/2010 18:56:25
The Guide to Controlled Experiments paper provides an introduction to performing A/B testing in a web environment, evaluates the efficacy of various architectures and approaches for performing the experiments, and discusses how readers can apply these findings to construct more effective experiments and improve ROI while avoiding pitfalls. Performing a controlled experiment consists of randomly splitting live users into a Control group (presented with an unmodified version of an interface) and a Treatment group (exposed to a new version that differs by a factor), then collecting metrics of interest (quantified via the Overall Evaluation Criterion) and performing statistical analysis on the accumulated data. The paper proceeds to identify how a controlled experiment in a web environment differs from traditional laboratory-based A/B testing. Electronic testing appears more flexible and convenient for running multiple tests (e.g., it is easy to vary the Control:Treatment group-size ratio, automate testing, and check that software migrations preserve the null hypothesis and do not degrade user experience); however, it is harder to draw conclusions (due to the lack of user comments and the difficulty of accounting for long-term effects and learning curves), and it requires some effort to set up initially.
The main contribution of the paper is analyzing the randomization and assignment steps of testing in detail, and discussing the lessons learned from a wide variety of previous experiments. The most flexible and accurate solution appears to be using the hash and partition scheme that stores a MD5 hash of the user in a database or cookie, and then performing server-side selection to apply factors of interest to the user interface. The main challenges of deploying these automated A/B tests will be maintaining high performance and cross-browser/cross-platform usability, and predetermining evaluation criteria to ensure that the benefit of running the experiments outweighs the costs. Also, it seems that a very important concern in these experiments is reducing variance by controlling minimum sample size, Control:Treatment ratio, and accounting for confounding factors.
I think that controlled experiments are generally a good idea, since they allow one to quantifiably measure the impact of modifications at little or no cost. However, when attempting to make sense of multiple independent experiments, there is also the risk that each studied factor reaches a local maximum, while still culminating in a suboptimal user experience when the results are combined. For example, Google is famous for A/B testing the minutiae of its search interface, but sometimes this results in a hodgepodge of optimizations that show no holistic coherence or consistency (see http://xhtml.net/documents/images/new-google-page.png). Ultimately, I think that letting end-users guide the development of features via the outcomes of these tests will promote significant and cost-effective interface improvements, as long as there is a human evaluator in the loop who checks that the proposed changes make sense from a holistic viewpoint.
BTW, I imagine that the reason why the .NET string hashing function performed so poorly (failing to distribute the outputs of similar inputs uniformly, resulting in correlations between experiments) is that perhaps Microsoft used a universal one-way hashing function that only guarantees universality, rather than the stronger pairwise-independence property, and is thus vulnerable to related-key attack. A modern cryptographic hash function such as IBM Fugue, or other SHA-3 contenders should do much better.
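Whatever the cryptographic explanation, the practical property the paper cares about is easy to sanity-check empirically: hash a large batch of similar, sequential IDs and verify the buckets come out uniform. The sketch below is an illustrative test harness (the function names and the chi-square threshold are assumptions, not from the paper), using MD5, the one hash the authors found to distribute well.

```python
import hashlib

def md5_bucket(s: str) -> int:
    """Interpret the MD5 digest of a string as a large integer."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def bucket_counts(hash_fn, n_users=100_000, n_buckets=100):
    """Count how many synthetic, highly similar user IDs land in each
    bucket; a good hash spreads them uniformly despite the similarity."""
    counts = [0] * n_buckets
    for i in range(n_users):
        counts[hash_fn(f"user-{i}") % n_buckets] += 1
    return counts

counts = bucket_counts(md5_bucket)
expected = 100_000 / 100
chi2 = sum((c - expected) ** 2 / expected for c in counts)
print(chi2)   # for a uniform hash this statistic hovers near df = 99
```

A hash with the kind of correlation problems described above would blow this chi-square statistic far past its degrees of freedom.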
The McGrath Methodology Matters chapter discusses how research consists of bringing together interesting content, meaningful ideas, and techniques/procedures through which the ideas and content can be studied; these three sets form the Substantive, Conceptual, and Methodological domains, respectively. He constructs a strategy circumplex, in which he classifies the eight possible research strategies into four quadrants, then proceeds to survey the various approaches to conducting research, highlighting their defining characteristics and weaknesses. The first quadrant consists of Field strategies, such as the ethnographic approaches and case studies we studied last time, which involve precise observation of naturally occurring behaviors and emphasize realism above all else. The second quadrant is concerned with controlled laboratory experiments, which aim to gain precision while retaining some realism, even though they occur in artificial environments/situations. The respondent strategies of quadrant 3 require the researcher to sample the population for responses to specific questions or stimuli; since this approach nullifies the context, it sacrifices realism for either generalizability (sample survey) or precision (judgement study). Finally, quadrant 4 formulates general theories or computer simulations, which lack any empirical observations but attempt to predict future behavior. McGrath continues by discussing statistical measures, as well as validity of findings, and finally wraps up with an analysis of the various classes of measures and manipulation techniques.
The main point of the paper appears to be that there exist many methods for performing research in the behavioral and social sciences, but each of them has its particular flaws and limitations. The solution to this problem is to use multiple methods, which offset each others' drawbacks. Researchers should attempt to maximize generalizability, precision, and realism; however, these criteria are somewhat mutually exclusive, so it is impossible to achieve all three goals simultaneously. I found the division into Substantive/Conceptual/Methodological domains a bit cumbersome and not very useful, but otherwise I think this paper provides a solid framework for understanding the available selection of research methodologies, their respective strengths and weaknesses, and how they complement each other.
Matthew Can - 10/5/2010 18:56:53
In this reading, McGrath presents a comprehensive analysis of the issues involved in research methodology in the social and behavioral sciences. He spends much of the text categorizing several research strategies, providing a detailed characterization of each one's benefits and drawbacks. Among the other main highlights, McGrath touches on the subjects of data comparison and research validity.
McGrath views research methods as tools with bounded opportunities. Each method provides opportunities for gaining knowledge, but that knowledge comes with the limits of the method by which it was obtained. The solution to this dilemma is to employ multiple methods, carefully chosen so that the advantages of one offset the weaknesses of another. Even better, if the outcomes of multiple methods are consistent, this adds credibility to the evidence obtained by each individual method. McGrath stresses that this is not the obligation of any single researcher but rather something that the field as a whole should strive for.
Something I liked is that McGrath breaks down the goals of collecting research evidence into 3 desirable criteria: Generalizability, Precision, and Realism. However, these three criteria cannot be examined separately because of the effect they have on each other. In particular, as the researcher attempts to maximize one, he will decrease one or both of the others; all three cannot be maximized. With this basis, McGrath maps various research strategies onto two dimensions, abstract-concrete and obtrusive-unobtrusive. I thought this quadrant model was useful because it helped me visualize the space of research strategies. The model facilitates the analysis of the benefits and drawbacks to the strategies.
As part of a comprehensive coverage of research methodology, I would expect McGrath to thoroughly discuss issues of data comparison and experiment design. Personally, though, I thought some of the material on correlation, randomization, and sampling was too elementary (for example, explaining what strong, weak, positive, and negative correlations are). I don't think it was necessary for him to explain that randomization is a strategy not guaranteed to produce an equal distribution, or that as the sample size increases the random procedure is more likely to approach the (ideal) limiting distribution.
Practical Guide to Controlled Experiments on the Web
In this paper, the authors present a guide for how to conduct successful controlled experiments on the Web. The authors demonstrate the benefit, in terms of information learned and ROI, of controlled experiments. They describe some details and issues that are important in running controlled experiments. In addition, the paper presents general architectures for controlled experimentation.
This paper contributes to HCI by providing guidelines on how to run controlled experiments on the web. This is of significant interest to HCI researchers because it is an efficient and cheap way to iterate on UI design. The paper's description of controlled experiments and significance testing was straightforward and probably well known. What I found more interesting were some of the limitations the authors discussed. For example, the authors raised the concern that the experiments only capture short-term effects as opposed to long-term effects, which highlights the difficulty of developing a good OEC. I also liked that they discussed the potential biases introduced by the primacy and newness effects, because those are not immediately obvious limitations.
What I found to be the greatest benefit of reading this paper was the analysis of the lessons the authors learned. Though it was by no means exhaustive, it certainly was a good list of rules of thumb regarding controlled experiments. One example I liked was the suggestion to continuously run A/A tests as a way to verify that the experiment framework operates correctly. From the business perspective, I think the best take-away is the importance of a data-driven culture for web-based companies.
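The A/A-test rule of thumb is easy to make concrete: run the usual significance test on two identically treated groups and confirm that it flags a "difference" only about as often as the chosen significance level allows. A minimal sketch in Python, using simulated (not real) conversion data and a standard two-proportion z-test:

```python
import math
import random

def z_test_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# A/A test: both "variants" are identical (5% true conversion rate),
# so runs with p < 0.05 should occur only about 5% of the time.
# A much higher rate would signal a broken experiment framework.
random.seed(0)
runs, false_positives = 200, 0
for _ in range(runs):
    a = sum(random.random() < 0.05 for _ in range(5000))  # group A conversions
    b = sum(random.random() < 0.05 for _ in range(5000))  # group B conversions
    if z_test_two_proportions(a, 5000, b, 5000) < 0.05:
        false_positives += 1
print(f"A/A false-positive rate: {false_positives / runs:.3f}")
```

If the observed false-positive rate drifts far from the significance level, something upstream (randomization, logging, bucketing) is broken, which is exactly the diagnostic value the authors claim for continuous A/A testing.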
Kenzan boo - 10/5/2010 19:04:58
Practical Guide to Controlled Experiments on the Web
The article details an approach to doing controlled experiments on the web. It describes the measures used, hypothesis testing, the extension of controlled experiments into the online space along with its limitations, and tips for actual deployment. The part on the challenges of deploying such a system was particularly interesting. The authors suggest hash algorithms for assigning users, although most do not generate enough randomness; MD5 is the exception. The challenge for most of these systems is that they are spread over hundreds of servers in different parts of the world, making it difficult to quickly deploy changes to a website that is being served to people in both NY and CA. State can be maintained with a cookie, but doing this quickly is difficult.
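The appeal of MD5 here can be illustrated with a small sketch: hashing a user ID together with an experiment name yields a deterministic, well-spread bucket assignment that requires no coordination between server farms, since every server computes the same answer independently. (The function and experiment names below are hypothetical, not from the paper.)

```python
import hashlib

def assign_variant(user_id: str, experiment: str, n_variants: int = 2) -> int:
    """Deterministically bucket a user with an MD5 hash: every server
    computes the same assignment without any shared state."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_variants

# The same user always gets the same variant for a given experiment...
assert assign_variant("user-42", "checkout") == assign_variant("user-42", "checkout")

# ...and because MD5 output is well distributed, the buckets come out
# close to 50/50 over many users.
counts = [0, 0]
for i in range(10000):
    counts[assign_variant(f"user-{i}", "checkout")] += 1
print(counts)
```

Keying the hash on both the experiment name and the user ID also keeps assignments across concurrent experiments uncorrelated, which weaker hash functions (the paper's complaint) fail to guarantee.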
Many of these problems have already been solved by companies like Omniture or Coremetrics. Having worked directly with both of these companies professionally, I found using their solutions much preferable to implementing any of this oneself. These services offer a web interface for making small adjustments and serve a different front end to each unique user, so customers do not have to deal with the issues of architecture or deployment.
One issue that I feel may still be a challenge to address is the usage of multiple devices. Many people have several computers and a smartphone with which they interface with a site. Having a different UI on each would be extremely inconsistent and confusing to the user.
Methodology Matters: Doing Research in the Behavioral and Social Sciences. The author discussed the different research methods and their strengths and weaknesses. He also noted that we have to pay close attention to the environment and to small factors that can have a huge impact on a study. These factors, such as the time of day, may not be part of the study itself but are extrinsic to it. This was also discussed in the other article as the day-of-week problem.
Richard Shin - 10/5/2010 20:45:08
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO
This paper describes methods for conducting controlled experiments in the context of the web, for objectively making design decisions that maximize some desired metric. When making changes to a web site, even simple changes can have unintuitive consequences for important metrics, such as the 'conversion rate' (the percentage of visitors who complete a purchase from an online store). By changing the web site in small ways for a randomly chosen subset of visitors, and measuring the differences in various metrics for the experimental group compared to the control, a causal relationship between the change to the web site and the change in metrics can be established.
Mainly, this paper didn't seem to describe any particularly novel ideas, frameworks, or implementations, but rather explained and summarized how web site builders can conduct controlled experiments on their sites so that they can make informed, data-backed decisions about changes. The most interesting part, I thought, was using statistics to analyze the results and determine whether they are significant. Also, I had never heard of 'A/A testing', where the experimental group receives no changes but still has its metrics measured separately; by analyzing the results, one can see whether the testing framework works correctly and randomly assigns users to groups.
I do wish, however, that the paper went beyond split testing, where only a single change is tested at a time. Since testing each configuration takes time, testing one change at a time would take too long when many changes need to be tried. I believe there are methods for making sense of results when multiple variables are tested simultaneously, but none were mentioned. Overall, the paper seemed a bit simplistic and rather like a tutorial compared to other papers we have read, mostly discussing how to carry out these well-known experiments rather than describing something new, such as an experiment framework.
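Multivariable (factorial) designs of the kind I am wishing for do exist, and the hashing approach the paper describes extends naturally to them: assigning each factor with its own independent hash makes the factor assignments uncorrelated, so every combination of levels occurs with equal probability and main effects can be read off by averaging the metric over each level of a factor. A sketch under assumed, hypothetical factor names:

```python
import hashlib

# Hypothetical factors; the paper itself stops at single-variable tests.
FACTORS = {"button_color": ["blue", "green"], "headline": ["A", "B"]}

def assign(user_id: str) -> dict:
    """Assign each factor with its own hash, so factor assignments are
    independent and every combination of levels is equally likely."""
    variant = {}
    for factor, levels in FACTORS.items():
        h = int(hashlib.md5(f"{factor}:{user_id}".encode()).hexdigest(), 16)
        variant[factor] = levels[h % len(levels)]
    return variant

# Across many users, all four factor combinations show up, so a single
# experiment covers what would otherwise take several sequential split tests.
combos = {tuple(assign(f"user-{i}").values()) for i in range(1000)}
print(sorted(combos))
```

This only covers assignment; analyzing interactions between factors needs more care (e.g. ANOVA), which is presumably why the paper left multivariable testing out of scope.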
Methodology Matters: Doing Research in the Behavioral and Social Sciences
This paper explains how to design and conduct experiments in order to produce research results in the behavioral and social sciences. The author categorizes the inputs of research into the 'substantive' (from which things to study are chosen), 'conceptual' (abstractions for interpreting the results), and 'methodological' (techniques for conducting the research) domains, and discusses the concerns of each domain. The paper also covers some general concerns of research, such as methods, strategies, measurements, and data analysis.
Like the other paper, this one also seemed mostly a 'how-to' guide rather than the discussion of a novel finding. Nevertheless, the lessons it contains are helpful; as the paper notes, a firm methodology is crucial for properly interpreting the data that has been obtained, because one knows how it was recorded. The various methods and techniques that can be used in a research study each have their own strengths and weaknesses, so it helps to know them in order to choose the ones that fit the study's purposes.
The topics discussed in the paper, however, felt somewhat ad hoc. While each one was explained in detail, there wasn't much explanation tying the different parts of the paper together. I think the advice and techniques described would have been more helpful to those doing behavioral or social science research if, for example, the influence of the three research domains on variable selection had been explained.