Evaluation II: Perspectives

Lecture Slides

File:Cs260-slides-12-eval2.pdf

Extra Materials

Discussant's Slides and Materials

File:Cs260-EvaluationII.pdf

Reading Responses

Dan Lynch - 10/9/2010 14:55:40

Evaluating User Interface Systems Research

This paper was intriguing: the thought that our progress in UI is impeded by the mouse-and-keyboard paradigm is one to seriously consider. The author also claims that, because of the stable platforms we develop on these days, programmers are less skilled at specific kinds of UI systems development, such as window managers.

What can we do? This paper serves as an important call to programmers like ourselves. It also justifies, in my case, doing many things from scratch, such as GUI programming and window managers. It seems that we should be trying out new UI systems and architectures by exploring ideas rather than relying on prebuilt libraries, such as the ones that come with Processing. By using them we limit ourselves to the set of ideas associated with those tools and with the library's place in the technology timeline. If they were developed during the mouse-and-click era, then they may be irrelevant for the future.

Crowdsourcing User Studies With Mechanical Turk

This paper makes the claim that user studies are the pillar and indicator of success in design. The problem is, how can you get user studies with a large enough sample size? Answer: Amazon's Mechanical Turk.

To test the validity of the Turkers, the research group compared the aggregate judgement of the crowd against that of Wikipedia admins when rating the quality of Wikipedia articles. The first result was not flattering for the Turk approach: they found a correlation coefficient of only 0.5. However, the group ran a second experiment with a slightly changed methodology. The idea was to make submitting an invalid response as hard as submitting a legitimate one. This raised the coefficient to 0.66.
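
As a concrete illustration of the kind of comparison described above, here is a minimal Python sketch (not code or data from the paper; the per-article ratings are invented placeholders):

```python
# Hypothetical illustration: correlate aggregated Turker ratings with expert
# ratings for a set of articles. All values are made-up placeholders.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    norm_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (norm_x * norm_y)

# Hypothetical per-article scores: median Turker rating vs. admin rating.
turker_medians = [4.0, 2.5, 3.0, 5.0, 1.5, 3.5]
admin_ratings = [4.0, 3.0, 2.5, 4.5, 2.0, 4.0]
print(round(pearson(turker_medians, admin_ratings), 2))
```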

This is important because it shows that even if you fail at one attempt, you should try again: there may just be particularities of the Turk marketplace to work around when collecting useful data. This paper demonstrates useful techniques for gathering this aggregate data.

I also wonder how this applies to more general user studies. This was, after all, only one task: evaluating a Wikipedia article. Generalizing to other tasks may prove more difficult, but the ideas of this paper can most definitely serve as general principles.


Kurtis Heimerl - 10/9/2010 18:26:29

Evaluating User Interface Systems Research

This paper details the problems with evaluating user interface systems research. It also has an average of .375 references per page.

I'm unsure of how I feel about this work. Their point is clear and well argued: it's hard to evaluate these systems. What they didn't do was explain why these systems should be built despite this fact. This is primarily due to the lack of citations. They do not list interface systems that came from academia. They do not discuss what the evaluations of these systems were. Instead, they just give us a laundry list of possible ways to evaluate interface systems.

However, one bit intrigued me. My colleagues and I have been trying to push a paper about network APIs to SIGCOMM for a couple of years now. The networking community has pushed back at times, signaling that API design is not exactly what networking researchers are generally interested in. Should I have been pushing to UIST instead? API design is a common theme throughout this paper, but not something I traditionally understood to be part of HCI. Toolkits, sure. But HTTP? TCP? I'll probably bring this up in class.

Crowdsourcing User Studies on Mechanical Turk

This paper gives a description of a user study run on Mechanical Turk, and then attempts to improve that user study, providing a set of heuristics for crowdsourcing user studies.

Given the quick turnaround and low cost, why didn't they do a more formulaic, deep user study? I have no idea what I'm supposed to take out of this paper. Their task in no way indicates that this is a usable platform for user studies (not that it isn't). I feel like the result is equivalent to asking PhDs in math to do basic addition, asking turkers to do the same task, and then asserting that turkers are just as good at math. There was no real discussion of the task itself.

To resolve their cheating issue, they made a wide variety of changes without demonstrating which changes were valuable and which were not. This should have been trivial for them: it's literally one small code change and 24 hours.

It's frustrating because I want answers on how to reduce cheating on Turk, and they didn't provide them. They met the lowest bar for publication. This is not what notes are for; they are for small contributions, not incomplete research. I've seen further work, and I'm still unsatisfied. Oh well.


Charlie Hsu - 10/9/2010 21:40:19

Evaluating User Interface Systems Research

This paper describes the importance of evaluating new user interface systems and some methods of doing so. Olsen argues that new UI systems architectures are forces for change, possibly lowering skill barriers to entry and reducing development time, as well as offering new spaces in which to explore good solutions. Olsen then describes some common evaluation errors with UI systems architectures, followed by evaluation techniques and metrics: importance, solving new problems, generality, reducing "solution viscosity", empowering new design participants, power in combination, and scalability.

One characteristic that immediately struck me about Olsen's metrics was that many of them could generalize to concepts I've learnt before in CS160 and earlier this semester. In reducing solution viscosity, the idea of flexibility was the same as decreasing the gulf of evaluation for UI systems designers. Expressive leverage was the same as decreasing the gulf of execution, and expressive match was the same as decreasing semantic distance for developers. It may be important to realize that UI systems architectures are themselves user interfaces for UI system design, and that many of the general usability concepts apply for them as well.

However, there are some unique concepts to UI system design toolkits that Olsen offered metrics for evaluating. Empowering new design participants is a relatively unique and extremely important goal to UI system design: enabling less technically proficient artists and designers to create technically complex user interfaces certainly seems like a primary goal for a well-designed UI toolkit. Scalability and power in combination are also general systems design principles that apply to UI systems design toolkits as well.


Crowdsourcing User Studies with Mechanical Turk

This article described the use of Mechanical Turk in performing user studies. Mechanical Turk, like other micro-task markets, has excellent potential to quickly collect large amounts of user data at once, whereas traditional user testing is normally expensive and time-consuming. However, the quality of data returned by Mechanical Turk can be suspect if the experiment is not designed well. The paper offers some insights on how to formulate tasks that maximize the usefulness of returned data.

By using explicitly verifiable questions as part of the task, task authors can verify that substantial effort was put into the task, and gamed answers can be eliminated through the verifiability of the question answer. This requires the ability to pull unique elements out of user study material and formulate them into verifiable checks: quantitative checks may be the most straightforward, but I found their idea of using qualitative checks, such as a quick and easy crosscheck of common tags for Wikipedia articles, to be insightful and effective. For user interface design, it would be easy to provide injected "nuggets" to verify test-user effort in the interface: simply add checkpoints in the execution of the interface where information is provided to the test-user to verify effort.

Gaming the system is an important consideration to keep in mind, since gamed data is wasted time and money for the developers. The authors proposed using task duration and repeated substrings in comments as ways to address the problem. There are many more possible ways to check for gaming: pretest qualification on Mechanical Turk can limit the sort of user that actually gets to test, and perhaps an expanded interface for tester profiling and experimenter feedback on user data could separate the good Turkers from the rest.
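
As a rough sketch of the duration and repeated-text checks mentioned above (a hypothetical illustration with made-up field names, thresholds, and data, not the authors' code):

```python
# Hypothetical sketch: flag responses completed implausibly fast or reusing
# the same free-text comment verbatim. Field names/thresholds are assumptions.
from collections import Counter

def flag_suspect_responses(responses, min_seconds=30):
    """responses: list of dicts with 'duration_sec' and 'comment' keys."""
    counts = Counter(r["comment"].strip().lower() for r in responses)
    suspects = []
    for r in responses:
        too_fast = r["duration_sec"] < min_seconds
        duplicated = counts[r["comment"].strip().lower()] > 1
        if too_fast or duplicated:
            suspects.append(r)
    return suspects

# Example with made-up data: the first response is too fast, and the first
# and third share identical comment text, so two responses get flagged.
data = [
    {"duration_sec": 12, "comment": "good article"},
    {"duration_sec": 240, "comment": "Needs more citations in the history section."},
    {"duration_sec": 200, "comment": "good article"},
]
print(len(flag_suspect_responses(data)))  # -> 2
```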


Shaon Barman - 10/9/2010 23:14:39

Evaluating User Interface Systems Research

Since user interface technologies are always changing, it's important to find suitable metrics to ensure progress. The author discusses such metrics, along with common pitfalls that can occur during evaluation.

It is difficult to predict how user interface technologies will be used by people, and even harder to predict the effects of future hardware, software, and changes in lifestyle. I liked the section on traps. When evaluating a system, it's easy to generalize that the system won't work because of a particular failure, or that the learning curve is too high. Both of these problems can be overcome if the reward is high enough. Using the STU (situations, tasks, users) framing seems like a good way to evaluate, and provides a way of comparing competing user interfaces for a certain task. All of the dimensions used to judge a system provide ways to illustrate its usefulness.

One aspect this paper fails to discuss is evaluating a user interface based on how it affects social interactions outside the interface itself. Many of the examples provided deal with single-user systems, such as widget languages, but the author does not provide any examples dealing with multi-user systems or web systems. These types of systems do not seem to fit well with any of the evaluations discussed in the paper. When dealing with multi-user environments, the complexity of experiments increases and it becomes even more important to use an evaluation that does not focus on the failures of the system.

Crowdsourcing User Studies with Mechanical Turk

The authors compare a crowdsourced version of rating Wikipedia articles against experts who rated the same articles. They found that without the appropriate checks, users would game the system, which resulted in poor correlation. With additional checks to ensure that the responses had some credibility, they found a much higher correlation.

The key takeaway seems to be that any Mechanical Turk task must include some question that can be automatically verified. This verification can be done by comparing the answer to other users' answers or by using already-known test cases. By using this verification (along with randomization), there is a high probability that the user is not gaming the system. In Soylent, this pattern is taken a step further by verifying the answers as a separate Mechanical Turk task. In any system where the ultimate reward is monetary, people will try to take advantage of the system. In prestige-based systems, this problem does not seem to occur. It would be interesting to see whether a combination of prestige and monetary rewards would affect the quality of answers.
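
A minimal sketch of the "automatically verifiable question" idea (an illustration under assumed question ids and values, not the paper's implementation): embed questions whose answers are already known, such as the number of images in the article, and reject submissions that miss them.

```python
# Hypothetical gold-question check: question ids, expected values, and the
# tolerance are all illustrative assumptions.
def passes_gold_checks(answers, gold, tolerance=0):
    """answers and gold: dicts mapping question id -> numeric answer."""
    for qid, expected in gold.items():
        given = answers.get(qid)
        if given is None or abs(given - expected) > tolerance:
            return False
    return True

gold = {"num_images": 3, "num_references": 17, "num_sections": 6}
submission = {"num_images": 3, "num_references": 17, "num_sections": 5,
              "quality_rating": 4}
print(passes_gold_checks(submission, gold))  # False: the section count is off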


Thejo Kote - 10/10/2010 10:08:32

Evaluating User Interface Systems Research:

In this paper Olsen provides a methodology for evaluating UI systems research. He first provides a motivation for why this is necessary now. He argues that UI systems have now more or less been standardized across the Windows, Mac and Linux platforms to use the mouse and keyboard paradigm and that sufficient research is not conducted in that space. But, since many of the assumptions made regarding available memory, processing power and other limitations are no longer valid, he argues that it is time for serious UI systems research to make a comeback.

His core approach to evaluation is to focus on situations, tasks and users since any interactive technology is framed in that context. He suggests that new UI systems be evaluated in terms of generality, whether it is new or important and how expressive and empowering it is.

Crowdsourcing user studies with Mechanical Turk:

In this paper Kittur et al. share their experience conducting user tests through Amazon's Mechanical Turk service. They provide an overview of the advantages and limitations of this approach and the results of their experiment. While Mechanical Turk has the advantage of providing almost on-demand access to people willing to undertake a user testing task for payment, it also has the limitations of gaming of the system and difficulty in determining the ecological validity of the experiment.

For their experiment, the authors asked Mechanical Turk participants for quantitative and qualitative feedback on a set of Wikipedia articles. In the first experiment they found widespread gaming. In a modified experiment, they were able to get better results, which were comparable to the quality of the output of Wikipedia admins. Their main finding was that it is very important to bake elements into the task that ensure a genuine response. This prevents gaming.

My takeaway was that using Mechanical Turk may be a viable option for some kinds of user tests, but not others. A researcher should carefully design the tests to ensure that they are difficult to game.


Krishna - 10/10/2010 10:41:26

Evaluating User Interface Systems

The primary argument made by the author is that progress in user interface research has been hampered by the lack of appropriate criteria to evaluate new ideas. The author gives many relevant reasons for why he thinks there has been no progress in user interface research: assumptions about the limitations and capabilities of system hardware and software are no longer valid; users are more computer literate, and their exposure to and comfort with graphical user interfaces have increased greatly; and current architectures do not seem to support the diverse, mobile, and collaborative demands of today's users.

He argues that usability testing requires certain assumptions that are difficult to meet when evaluating new UI toolkits and architectures: comparing two architectures is hard and biased, as in most cases users are already familiar with the architecture being tested against; usability testing assumes standard tasks, whereas most tasks are complex and not atomic enough to make meaningful statistical comparisons; and it is economically infeasible to test two different architectures. In the next few sections, the author provides a set of criteria for evaluating new UI toolkits.

He argues that new toolkits must demonstrate the importance of the problem they are trying to solve, and that this importance must be measured from a users, tasks, and situations perspective. This seems to lean towards a rather subjective view of importance(?); a stronger criterion would be whether users think the new features are important given who they are and their situations and contexts. His other criteria include that the toolkit should be flexible enough to adapt to and facilitate rapid design changes, and that the toolkit should accomplish more while expressing less; one way to achieve the latter is by eliminating choices users must make to reach their goals. Another criterion is how well the toolkit lets designers express their design choices 'directly', for example a color picker versus expressing a color as a hexadecimal number.

These arguments and criteria may make sense for graphical design toolkits, but I am not sure how they extend to the general case. What about toolkits for manipulating music? There, the problem of letting users 'express less' becomes harder to solve and to evaluate: what someone considers an appropriate music recommendation might not generalize, and features such as inductive combination are even harder to achieve, since there is no general agreement on what the primal music component is, let alone on how to combine such components. How we should take this subjectivity into account when evaluating toolkits and frameworks for general design is an interesting question.

Crowdsourcing User Studies using Mechanical Turk

User studies are important and illuminating, but expensive. This paper is about using Amazon's Mechanical Turk to conduct quality user studies in an economically feasible way. The authors mention many issues in using Mechanical Turk for user studies: it is suited to simple, short tasks with verifiable answers, which is contrary to user evaluation experiments where participants are typically asked to judge something through ratings and opinions and there are no correct answers. There is also the issue of generalizability due to the unavailability of demographic and expertise information about the participants.

Their first experiment shows that asking only open-ended questions is a bad idea: it allows users to provide random, nonsensical answers, and worse, it is difficult to automatically weed out such answers and users. In their second experiment, in addition to the open-ended questions the Turkers were asked to answer verifiable questions (e.g., the number of images in the article); this ensured that they had to spend time reading the article, resulting in better-quality responses.

The key takeaway is to have verifiable tasks and strategies that ensure effort from the Turkers; experiments should make it just as time-consuming for Turkers to enter random, nonsensical answers as valid ones. The authors haven't addressed the issue of generalizability. I am left wondering whether it is possible, and 'legal', to craft questions that infer demographic and expertise information in an implicit way. As the authors mention, lack of control over participant assignment and environment can be a serious holdup for certain classes of user studies and technologies.



Luke Segars - 10/10/2010 13:42:44

Crowdsourcing User Studies with Mechanical Turk


This paper discusses the possibility of using distributed crowds to run user studies. This approach offers a number of direct advantages, such as reduced financial cost per participant, a wider demographic and geographic reach of participants, and far larger sample sizes. Nevertheless, there are some very significant problems with running these studies that put their scientific validity on the line. Kittur et al. examine some of these downsides and attempt to mitigate them to make the use of Mechanical Turk more feasible for user studies.


Completing a complex distributed task is still somewhat of a stab in the dark for Turk users. The sudden availability of micro-task markets reminds me of the recent emergence of mainstream multi-core processors: they promise fantastic speedups *if* -- and this is a big if -- we learn how to reorganize our tasks so that they can be handled as a set of smaller, quicker jobs that sum to the same result. This isn't always just a matter of saying the same thing in a different way. It can very often mean reevaluating what the actual purpose behind the task is and determining how that goal can be reached through a different, more “parallelizable” means. That's a tough problem, but probably more so because it requires us to change the way we've been thinking about task generation than because it cannot be done.


The promise of distributed user studies holds great appeal for designers. User study processes are plagued by being both expensive and time-consuming, but they are nevertheless an important part of the product development and research life cycle. The idea of reaching an audience of over one hundred thousand potential testers is a dream come true for designers in any field.

There are several additional factors that come up as uncertainties in the transition from traditional user testing to a Turk testing model. The first is how much payment affects the quality of responses in a Turk task. The researchers in this particular paper paid $0.05 for someone to read an article and answer 15 questions about it. They managed to get over 100 responses, but I wonder if the quality could have been improved further by increasing the price (even to a still-modest $0.25). This would still provide dramatic cost savings compared to standard user testing but might attract a larger (and perhaps more demographically diverse) audience.

Secondly, it would be interesting to see whether the phrasing of the task changes the number and overall quality of responses. For example, providing a more concise description of the task may attract more users, and phrasing a question more politely may encourage additional Turkers to complete the task (at no additional financial cost).

Ultimately, running user studies this way strikes me as a terrific idea that needs to be better understood before it can be put into practice. Like developing new software for multi-core processors, the benefits are obvious but the path to the solution is somewhat unclear. Given time and enough interest, a set of “best practices” for micro-task based user studies could totally transform the design industry and research world.


Airi Lampinen - 10/10/2010 13:57:04

Olsen's UIST article "Evaluating User Interface Systems Research" focuses on how to "evaluate new improvements to ensure that progress is being made" in the field of user interface systems research. The article explores new ways for evaluating interactive software architectures but also looks into ways in which misapplied evaluation methods can cause damage. Olsen discusses in further detail three examples of such damage done by misapplied methods: the usability trap, the fatal flaw fallacy and legacy code.

Without really being able to judge Olsen's claim that the development of UI systems has languished due to the stability of desktop computing, I agree with the statement that future systems might be something remarkably different (off-the-desktop, nomadic, etc.) and hence their development may require different evaluation methods, too. Olsen also makes a strong point of showing the necessity of considering issues at the system level, which remain beyond the reach of traditional usability studies. Simple usability testing can be great for some purposes, but it is most certainly not a fix for everything.

The second paper, Kittur, Chi and Suh's CHI paper "Crowdsourcing User Studies With Mechanical Turk", looks into micro-task markets such as Amazon's Mechanical Turk and considers their potential for conducting user study tasks. The authors consider a wide variety of tasks ranging from surveys to rapid prototyping to quantitative performance measures. The paper also presents two experiments that try to pin down the conditions under which crowdsourcing user studies with Mechanical Turk can work well.

While some fairly simple parts of these studies surely can be outsourced to Turkers, I have difficulty believing such a procedure could truly replace more comprehensive user studies that allow for a discussion between the researchers and the participants of a study. Luckily, there is no need to pick one way or the other, so looking into ways of applying crowdsourcing with Mechanical Turk is certainly worthwhile. The idea of having "hundreds of users recruited for highly interactive tasks for marginal costs within a timeframe of days or even minutes" is something few researchers doing user studies would not love. Yet it is clear that even this will not be a straightforward thing to do, and hence I found the paper a valuable initial effort towards understanding under what conditions such crowdsourcing can work, how tasks need to be outlined to ensure fruitful outcomes and discourage malicious gaming, and so on. However, as the authors themselves state, further work is needed to understand what kinds of experiments can be successfully conducted on Mechanical Turk or other micro-task markets and to determine effective techniques for promoting useful user participation.


Thomas Schluchter - 10/10/2010 14:10:36

Evaluating User Interface Systems Research

The paper addresses the question of how novel UI system paradigms can be validated through testing, and posits that current approaches fall short. Specifically, the author argues that usability testing covers only what is currently known to work and will tend to suppress new developments.

I find the basic argument of the article very convincing: As the development of hardware progresses, we will not be able to realize breakthroughs in UI development without abandoning (for an exploratory phase) the evaluation criteria that apply to the current paradigms. This echoes Kuhn's distinction between normal science and the revolutions that cause paradigms to shift.

The technologies that have emerged from research in the past decades increasingly pervade the social space (beginning with personal computing, via mobile, to ubiquitous computing). As systems are deployed into these complex environments, which resemble less and less the controlled setting of a 'user' in front of a workstation, the demands on UI design will continue to increase. To reduce the viscosity of the development process, frameworks that increase the number of iterations within a given development time frame are key.

The way Olsen systematically constructs an analytical frame for new toolkits and systems reminded me of Shneiderman's piece on Direct Manipulation. In the sections on Expressive Match and Expressive Leverage, it seems like an updated version of that paper for the age of post-mouse-and-keyboard interactions from the perspective of the developer of these systems.

Crowdsourcing User Studies With Mechanical Turk

The authors report on a study to compare the quality of ratings sourced remotely on Mechanical Turk and through a traditional lab setup with experts.

It appears reasonable to broaden the base of test subjects at a very low cost as long as the task is very carefully tailored to the specific properties of a micro-market platform. It would have been interesting to see work on a hardcore usability study that involves less qualitative judgment and emphasizes measurable completion of a task. In theory, a post-test feedback form with added monetary incentives might help to gain some insight beyond the numbers if one were to undertake such a study.

As the work on Soylent has shown, it is possible to ensure the quality of responses through a design of process (Find-Fix-Verify) rather than loading the isolated task itself with quality assurance measures. From that vantage point, I would extend the authors' argument that crowdsourcing user studies can be adapted to a variety of purposes by design. Where Mechanical Turk will fall short is with tasks that require prolonged attention and the completion of which yields observational insights that can only be realized in a co-located setting.

The argument in the conclusion about the limitations on ecological validity doesn't really worry me. Designers and researchers who want to test web-based products through Mechanical Turk will have people respond in an environment that is much truer to their actual work context than is possible in a lab environment. And after all, the deployment of systems with distributed access outside of organizations always faces the issue of not being able to control for context.


Luke Segars - 10/10/2010 15:14:33

Evaluating User Interface Systems Research

This paper makes some observations about the state of user interface research and how it has narrowed in scope since the emergence of the three stable platforms of today (Windows, Mac, Linux). The author states that the lack of true innovation in UI frameworks is reaching a point where it may keep us from thinking about the full design space of interfaces for new (non-desktop) devices. He proposes a set of new criteria for evaluating user interface toolkits as a means of producing good designs.

The high-level claims that Olsen makes seem both legitimate and important for the community to discuss. Almost all user interfaces today are based on the same (relatively small) set of widgets presented on a “desktop.” It is likely going to be difficult to break this paradigm with users who have never seen anything else, but the emergence of new classes of devices like mobile phones and collaborative CSCW devices provides opportunities for serious innovation beyond the restrictions of a particular toolkit. The abstraction and simplification provided by these toolkits requires that fundamental assumptions be made about the work being done (such as Olsen's example of a point-based input device) in order to create effective frameworks. Some of these assumptions are now being exposed as new devices emerge that extend beyond the standard restrictions of a “computer interface.”

At the same time, it is unclear whether Olsen's metrics will provide a framework that can be more effective than the preexisting user-centric approach. Perhaps the most alarming (or innovative) thing about Olsen's proposed system is the focus on developers instead of users. He argues that this allows for rapid prototyping and accommodation of user requests, but I'm wary of moving away from user-centric metrics for the obvious reason of abandoning the group that the object is being designed for. He draws a number of parallels between programming languages and UI toolkits, and his metrics seem to mirror those of the programming language community. Making rapid prototyping easier could lead to a surge of innovation in user interfaces and could shift the field away from a pipelined iterative process to more of a shotgun spread of possibilities to determine what users find best.

Olsen's points suggest a somewhat dramatic shift in the mindsets of user interface designers as they prepare to accommodate an entirely new class of devices. It's unclear what the impact on the field of interface design would be if these concepts were adopted, but they do provide an interesting alternative view to the process that's in place today. I doubt that the current system will (or should) be replaced, but Olsen's developer-centric view could provide an interesting complementary perspective that may open up some doors beyond the walls of today's UI toolkits.


Bryan Trinh - 10/10/2010 15:55:21

Evaluating User Interfaces

In this paper Olsen provides alternative ways to evaluate new UI systems that are better suited to the complexities of modern systems. The motivation for this work comes from the growing awareness of computing paradigms among the populace. The author argues that it would be silly to evaluate today's UI systems against the assumptions of a person still living in the 70's. A framework for directing evaluation methods is then presented.

It seems like many of his ideas are not really his at all, but simply ideas that preceded him, repackaged in different ways. For example, the "expressive match" concept looks an awful lot like the match in the gulf of evaluation and gulf of execution. That is just one example, though; the paper is full of concepts that we have read about in previous papers, just repackaged.

Perhaps researchers have not evaluated their UIs in the same holistic way that the author prescribes, but that does not decrease its usefulness. I also don't believe that the other evaluations he describes were things that weren't practiced before. All of these considerations appear quite obvious from my point of view, and I wouldn't call myself an expert in the matter. This paper seems more like a plea to the larger HCI community to include these other metrics in their papers than a description of a fundamental fallacy in the way people evaluate computing systems.

Crowdsourcing User Studies With Mechanical Turk

In this paper the authors present a way to perform discount usability studies using Mechanical Turk. They run some tests using this concept and provide some insights into issues they ran into.

This paper was a good blend of the two papers that we have read recently on these topics. Was it new and novel? Maybe not, but it was interesting to read about what they did anyway. It does bring one thing to light, though: the construction of the test significantly affects the accuracy of the answers. The creators of Mechanical Turk questions need to be very careful to word things correctly to elicit the best responses.


Pablo Paredes - 10/10/2010 16:05:27

Summary for Kittur, A., Chi, E., Suh, B. – Crowdsourcing User Studies With Mechanical Turk

The paper explores the possibility of using Mechanical Turk (MT) as a way to run user studies. It states that the main disadvantages user studies currently face are high costs and low participation. It proposes that MT could be used to reduce these disadvantages, but also mentions some issues inherent to this mechanism: the difficulty of distinguishing malicious answers from genuine ones, and the lack of demographic and experience information about the population being tested.

The conclusions of the paper are drawn from a single user study that tested quality assessment of Wikipedia articles against expert administrators' ratings. The overall result shows that if no control is made to avoid malicious answers, the study is rendered useless. On the other hand, when controls can be embedded to avoid malicious answers, the results are encouraging, showing a good correlation with the expert admins' ratings.

Overall I find the paper to be an excessively shallow analysis of this interesting topic. The paper claims issues with the ROI of user studies, but it does not quantify cost, low participation, or a composite ROI metric. Furthermore, the study infers that other types of activities, such as prototype testing or user measurements, could be performed using this methodology. I find this statement also shallow and based on a flawed induction, on top of the difficulty of accounting for consistency in the study (which is mentioned as a subject requiring further exploration).

Another conclusion mentioned is the verification that users do indeed cheat. This statement is so intuitive that it makes me wonder whether this conclusion was introduced just to generate text to fill out the paper... People cheat... We all know that!

One interesting proposition is to do further pre-processing of the experiment on the experimenter's side, thereby adding more programming time. I believe this tradeoff is worth analyzing... If ROI is a key metric, then planning and pre-processing the experiment (which is itself a cost associated with the overall experimentation process) takes on a higher relevance.

Another interesting proposition is to define tasks in such a way that even malicious responders would take the same amount of time to fulfill the task, thereby stopping them from cheating. Although this seems a good approach, I believe it is not always attainable... Let's imagine testing a drawing tool... How can you add enough checks to make a person perform a complex task without hindering the creativity the tool is meant to support? So choosing the task should be the first step before trying to design the experiment, and, preliminarily, it seems that one initial consideration should be that the task is somewhat decomposable into quantitative components.

Additionally, the study suggests a way to screen participants based on their responses, such as by counting repeated verbatim text, and proposes that these pre-tests be used to exclude users from future tests. Again, here there is the risk of false negatives, especially if the questions are highly correlated. So a more important decision in the experiment should be the way the questions are formulated, before spending resources discarding potentially flawed answers. Additionally, we do not know the users, so screening them could tacitly select a specific population. The issue is not how to qualify individuals, but how to make individuals collaborate accurately. Again this delves into task selection and question formulation, rather than data analysis.

In summary, I do not see how this study added anything new to the notion that MT can be used for many things. Without this study we could anyway have argued that some design tasks could be accomplished with MT. The study should have better defined the types of tasks by understanding their nature, and should have defined ROI metrics as well as parameters of evaluation...


Summary for Olsen Jr. D. – Evaluating User Interface Systems Research

In this paper, the author describes a series of parameters to promote a new paradigm of UI research, such as: reducing development viscosity, offering least resistance to good solutions, lowering skill barriers, power in common infrastructure, and enabling scale. He describes the “usability trap” as a way of actually hindering effective and novel research, by judging new research in terms of quick usability (by users already familiar with some way of doing UI) and by promoting standardized tasks, which focuses on the technique rather than on new approaches to tasks or user expertise. The scalability issue, rooted in the need for statistics that are consistent and replicable for every test, likewise hinders approaches that cannot be easily measured. As the author describes, the tendency of researchers to hold that “If it can't be measured it is not research” reduces research to the study of the trivially measurable. Furthermore, this approach leads to an egocentric focus on the measurable because it is easier to publish.

Another couple of issues dragging research toward merely incremental (not disruptive) work are flaw analysis and the value placed on code reusability: focusing research on refining current issues (incrementally) and discarding novelty through exhaustive flaw analysis.

His response to this “comfortable” incremental research approach is simple and goes back to basics: he suggests that we focus on Situations, Tasks and Users, rather than on methods to evaluate the quality of a system. He proposes “importance” as a parameter for evaluating research, measured as the breadth and depth of its impact. Another metric is the viscosity of a solution, defined as the resistance a task faces as it grows in importance (as defined earlier). He proposes three ways to reduce viscosity: flexibility, expressive leverage and expressive match.

He warns of the potential risk of discarding the flexibility of a novel UI by not observing the simple yet powerful insight that originated the design, focusing instead on the obvious implications (which are easy to follow only in light of the insight). Additionally, he describes expressive leverage and expressive match as ways to bring UI development to other communities with different skills, and in ways that better match their design/development skills. One clear case is allowing artists to easily develop UIs without needing complex programming skills or computer architecture concepts.

Finally, the author argues for placing the design process in a broader integration context, i.e., not only defining one solution to a specific task, but also considering the possibility of integrating several solutions into a common platform, one that has a high degree of importance.

Overall, I find this an honest paper that confronts the need to publish versus the actual need to make progress in UI design. It touches a very relevant subject, the evaluation of UI, and it clearly marks the end of a desktop-oriented methodology and the beginning of a more distributed/mobile systems approach. However, some qualitative metrics, such as importance, should be further explored and defined in ways that can be assessed clearly and that will not invite unnecessary rebuttals from advocates of a more traditional quantitative approach. I believe that further evolution of the author's views toward a well-structured qualitative and quantitative approach could lead to very interesting results in terms of new ways of embracing disruptive UI design, especially in the increasingly ubiquitous computing paradigm that we are living in.


Anand Kulkarni - 10/10/2010 16:47:25

Crowdsourcing User Studies on Mechanical Turk


The authors present results from two experiments carrying out user studies research with users recruited from Mechanical Turk and draw conclusions about how these studies can best be carried out.

The core contribution here is identifying the fact that user studies can be crowdsourced on Mechanical Turk. As far as I can tell, this seems to be the first paper discussing this practice, which is now commonplace, which would make it a tremendously useful contribution. Companies are springing up with the goal of helping researchers carry out user studies and similar research on crowdsourcing platforms, so this is a useful contribution. I like that the authors discussed the importance of careful task design to prevent malicious users from manipulating the outcomes, a phenomenon that is still the focus of much research. I also like that the authors proposed multiple verification mechanisms, which reduces the need for redundancy in carrying out experiments. It would have been useful to discuss some specifics related to how much users should be paid for completing work.

The evaluation used by the authors looks good to me. For the first experiment, the authors compared the ratings provided by users against the ratings provided by experts and used standard significance statistics to determine the correlation. Because they were exploring how task design affected the quality of results in user studies, this was an appropriate analysis. For the second experiment, the authors put a simple verification mechanism in place and carried out the same statistical analysis. Because this was at the time an original practice, its inclusion was useful and compelling. The only limitation is the small sample size; given the low cost of Mechanical Turk, the authors could easily have gotten a hundred or more responses.


Evaluating User Interface Systems Research


The author discusses how to evaluate modern and future user interfaces.

The core contribution is a set of standards to consider when evaluating new user interfaces. I like the suggestion to move beyond superficial examinations of usability (what he calls the "usability trap"). Generality, originality, and importance are all obvious metrics, but it's good that the author lists them because they're central to UI evaluation (in my view). What he calls "viscosity" is an interesting and original way to evaluate interfaces, in my view -- this seems like a useful technique to apply. Last, empowering new participants is an often-overlooked component of UI research, so it's good that he mentioned it. The contribution could be stronger if he mentioned more clearly what quantitative metrics can be applied to gauge, say, generality, rather than just mentioning some examples of things that are not generalizable.

The material is presented as a semi-structured list of possible metrics. The author does a good job of mentioning the positive and negative aspects of many of the metrics, and he also provides several different standards, which makes for a fairly strong paper. I also appreciate that the author took the time to motivate the difficulty with simply reusing previous standards for UI evaluation. The use of arbitrary technical symbols (U) and (S) at times is a poor practice the author could have avoided. I wish the author had made a stronger case as to whether the list of criteria he gave was comprehensive; this is a risk of papers that purport to provide new qualitative metrics.


Matthew Chan - 10/10/2010 17:55:19

Evaluating User Interface Systems Research

This paper is about exploring novel ways of evaluating new user interface systems now that we are moving beyond the desktop. Methods from the past were understandable since everything was based on 2 or 3 desktop platforms, but the advent of cellphones and PDAs requires something else.

This paper is pretty important since it highlights something many of us never considered much: the criteria we use to evaluate UIs. In fact, the criteria are based on many assumptions (i.e., the desktop) and on the time period when Apple, Windows, and Unix were the 3 major players. Some of the evaluation errors are the usability trap, the fatal flaw fallacy, and legacy code. No results, techniques, or methodologies were presented, but the paper does provoke serious thought about current UI evaluation criteria and possible alternatives.

This paper definitely relates to today's technologies since we've moved far beyond the desktop to multi-touch devices, cell phones, and who knows what might come up next. Whatever they may be, using the old criteria to judge their UIs is not enough, and this paper provides a wonderful set of directions to consider. Moreover, this paper relates to my field of work because I'm very interested in UI.

Crowdsourcing User Studies With Mechanical Turk

When it comes to running user studies and gathering data, researchers face the challenges of time, costs, sample size, etc. By using Amazon's Mechanical Turk, researchers may be able to shave off time and cost while still getting quality data.

This paper is very important because, as mentioned, no system can survive or move forward without user studies to explore its efficacy. The meat of the paper lies in the two experiments the authors conducted, asking Turkers to evaluate the quality of Wikipedia articles on a Likert scale. The second experiment was similar, except Turkers had to answer a few questions before getting to the Likert ratings. The experiments also explored how the answers compared to those of the Wikipedia experts who monitor the site for quality. In the end, the authors conclude that using Turkers works well, but the design of the experiment must be given a lot of weight.

This paper is quite important when it comes to using Turkers for user studies, but I suspect there are times when Turkers can't simply replace a user walking into a lab to use a system. This work does not relate to mine, and for some reason, I just don't find Amazon's Mechanical Turk too interesting...


Matthew Can - 10/10/2010 18:20:58

Evaluating User Interface Systems Research

In this paper, Olsen argues that usability testing is not the right approach to evaluating complex and often novel UI systems. Instead, he lays out a set of criteria that are based on claims to the value of UI systems and toolkits. This is the paper’s contribution to HCI: a different and better approach to evaluating UI systems.

Olsen presents three common errors resulting from traditional evaluation methods: the usability trap, which is the issue that usability metrics don’t make sense for evaluating UI systems architectures because the problem does not meet the assumptions of usability testing; the fatal flaw fallacy, which states that it does not make sense to evaluate new systems by searching for fatal flaws because, by definition, a new system is going to omit important features; the concern over legacy code, which impedes UI systems progress.

What I like about Olsen’s approach to evaluating UI systems is that his criteria are rooted in the goals of those systems. For example, a good UI toolkit will have a high level of general applicability so as to be useful for solving many kinds of tasks for many kinds of user groups. So, as one way to evaluate a system, we can examine its level of generality. As another goal, UI systems should allow designers to iterate quickly. This can be accomplished by increasing flexibility, expressive leverage, and expressive match. The latter two can be thought of as ways of reducing the semantic and articulatory distances of the gulf of execution.

This paper makes a valid point that traditional evaluation criteria are inadequate or inappropriate for evaluating new UI systems, but it seems to me that the proposed criteria, however sensible they may be, can be difficult to validate. This is the biggest drawback to using these criteria to evaluate research. As an example, Olsen claims that a system should demonstrate importance. Moreover, the importance should not have to be established through data analysis. It should be significant enough to be evident. The lack of rigor in this approach is unsatisfactory to me and it seems to reduce UI systems research to something less than research. It sidesteps the validation problem by claiming that the improvement afforded by a new system is so clearly important that validation is unnecessary.


Crowdsourcing User Studies with Mechanical Turk

This paper examines the use of micro-task markets for collecting measurements for user studies. The authors conducted two experiments on Mechanical Turk, with different outcomes leading to recommendations on how to design effective micro-tasks.

The importance of this paper to HCI is laid out in the introduction. User studies are important to successful design. Unfortunately, they are costly in terms of time and money. Micro-task markets are promising because they offer the potential to conduct user studies with many participants, rapidly, and at little monetary cost. This paper’s contribution is that it gives some insight into how to conduct successful user studies in a micro-task market.

Although this paper did not come close to answering the question of how to conduct a full blown user study of a prototype using Mechanical Turk (which I would have loved to see), it did provide suggestions for how to design Mechanical Turk tasks to gather useful user measurements. Incorporating explicitly verifiable questions into the task is a good way to make users process the content of the task and to discourage invalid responses. There isn’t much substance to the paper beyond those lessons learned. As stated near the conclusion, further research is necessary to understand what kinds of tasks are well suited to micro-task markets.


Linsey Hansen - 10/10/2010 18:46:41

Evaluating User Interface Systems Research

In this article, Olsen discusses the current methods of user interface evaluation and how they need to be altered to work with the user interfaces of today. He also briefly covers how current methods are being misused due to some people still holding onto old assumptions and beliefs.

One thing that bothered me is how Olsen implies the field of HCI to be more stagnant than it once was. While I am not positive how true this is, I feel like the field has definitely not died off as much as Olsen implies. While it is true that pre-80s researchers may have had plenty of great ideas and tried many new things (most of which either didn't work or lacked the necessary technology), modern researchers, who are now less limited by technology, still come up with plenty of fresh ideas for HCI. While discussing the errors with current methods, Olsen goes on to mention that current system models limit the creation of new UIs, and while I definitely agree with this, it is not as if people are not trying to modify the system models and/or create toolkits to support a wider range of possible activities.

One thing I found interesting in the article is the portion regarding the analysis of the importance of a situation, task, and user. While I know that it is good practice to evaluate the user, the tasks a user might be doing, the situations that a user will be in, and the frequency with which the user might be doing a task, I never considered evaluating these things based on importance.

Crowdsourcing User Studies with Mechanical Turk

In their paper, Kittur, Chi, and Suh evaluate the types of situations where Mechanical Turk can be used to aid with user studies. They run two experiments to investigate the correlation between Mechanical Turk ratings and those of hired professionals, and find that while Mechanical Turk can generate similar results for less money, Turking requires experimenters to take great caution in creating their tasks.

One of their primary findings was that while not all Turkers try to “game” the system, a few do, and those few can have a large, negative impact on test results. To prevent this, experimenters need to be careful to create questions with verifiable, quantitative answers that require the user to actually look at the content specific to the article, so that generic responses cannot be submitted. Personally, I never thought most of this would be necessary, since users who post invalid answers just do not get paid; but since the experimenter is actually required to go over these answers, having a plethora of incorrect ones will definitely waste the experimenter's time. Another solution would be to have a second round of Turkers go through and weed out nondescript responses, but that would not only be expensive, it would also be susceptible to the same issues of “gaming.”
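
A rough sketch of that second-round filtering idea (purely hypothetical; the vote data and field names are assumptions, and this is not something the paper implements):

```python
# Hypothetical second-pass filter: a later batch of workers votes on whether
# each free-text response looks genuine, and only majority-approved responses
# are kept. The vote data here is made up.
def filter_by_votes(responses, votes, min_fraction=0.5):
    """votes: dict mapping response id -> list of booleans (True = looks valid)."""
    kept = []
    for r in responses:
        ballots = votes.get(r["id"], [])
        if ballots and sum(ballots) / len(ballots) > min_fraction:
            kept.append(r)
    return kept

responses = [{"id": 1, "text": "good"},
             {"id": 2, "text": "The lead section omits the controversy."}]
votes = {1: [False, False, True], 2: [True, True, False]}
print([r["id"] for r in filter_by_votes(responses, votes)])  # -> [2]
```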

At the end of the article, it is also suggested that Turkers be posed several pre-questions, where the pre-questions are facts with definite answers that require some level of competence to answer. Or there could even be “quiz” questions that require users to look up specific parts of the article before answering the more research-relevant ones. I think both of these are good ideas for weeding out “gamers”, since a potential gamer will either be forced to read parts of the article, and thus be encouraged to answer the more significant questions with more accuracy, or can simply be weeded out at a glance.


David Wong - 10/10/2010 18:47:19

1) The "Evaluating User Interface Systems Research" paper discussed the importance of creating new UI interfaces, common follies in evaluating those systems, and several approaches to evaluating a new UI system. The "Crowdsourcing User Studies with Mechanical Turk" paper discussed whether user studies could be conducted on Mechanical Turk and the validity of those experiments.

2) The "Evaluating User Interface Systems Research" paper offered insight on why new UI interfaces are important. Personally, I believe that the time is near for new, intuitive, UI interfaces that transcend the mouse and the keyboard. Furthermore, I liked how they stated that usability, the fatal flaw, and legacy code were bad ways to evaluate a new UI system. While those claims may be a bit obvious, I think it is important to keep them in mind when developing a new UI system. Lastly, the paper discusses several ways to evaluate new UI systems. I felt that the description of these methods was a bit convoluted, but the ideas were sound. The expressive match criteria sounds very much like direct manipulation.

The "Crowdsourcing User Studies with Mechanical Turk" offered some value in running user studies on Mechanical Turk, but nothing outside the obvious. The paper is a better proof-of-concept paper with a small lesson learned at the end, not anything spectacular. While it is valuable to know someone else has done this and to read about their experiment, the paper only offers the idea that you should design the experiement to force the user to behave honestly. This is important not only for user studies ran on Mechanical Turk, but for user studies done in real life.

3) The "Evaluating User Interface Systems Research" paper had a sound argument in describing the need for new UI designs. Also, the problem is well-motivated as new systems are being developed, such as the iPad or increasingly sophisticated mobile devices, that require new UI design. If the same innovation could be applied to the desktop, there is a lot of value to be gained. As for the evaluation techniques proposed in the paper, they all sound quite general and do not have any concrete evidence to prove their added-value.

The "Crowdsourcing User Studies with Mechanical Turk" paper had a decent methodology in its experiments. The idea that these studies were run on Mechanical Turk already attests to the invalidity of the user sample. However, if we only consider the outcome of the experiements, the paper does soundly illustrate that the design of an experiment is important, especially if you're running an experiment on Mechanical Turk.


Kenzan boo - 10/10/2010 18:47:39

Crowdsourcing User Studies on Mechanical Turk

The article provides an example of a study that was run on Amazon's Mechanical Turk using two differing experiments with the same goal. The result was that using a more quantifiable and exact measure, like counting how many references are in the article, made results much more accurate and deterred gaming. The idea was to make providing malicious answers harder than providing valid answers. These ideas about making Mechanical Turk tasks easily quantifiable will be very useful for user studies in this class. There needs to be motivation for the user to actually provide a valid answer. Having an open-ended text field asking what the user thinks could be improved will easily lead to someone just reading the first line and coming up with an irrelevant answer. The true value of crowdsourcing, especially in HCI, comes from the fact that hundreds of people can evaluate a UI faster and more cheaply than 5 hired professionals. Depending on the desired results, using crowdsourcing can be a much better alternative because it gives us a huge range of evaluators. The Wikipedia article evaluations showed that even with a difficult task like reading through an article, the collective knowledge of many novices was significantly correlated with the opinions of experts.

Evaluating User Interface Systems Research

The article describes UI systems research, how it has changed over the years, and why it is still very much needed. Some ideals are: reducing UI viscosity (i.e., allowing more changes/solutions), lowering the skill barrier required, offering the least resistance to good solutions, and power in common infrastructure. Power in common infrastructure is very important because the user is ultimately limited by hardware, e.g., whether they have a smartphone, a pen tablet, or an iPad-like device. The UI must try to conform to as much common hardware as possible to get a good user base.


Aditi Muralidharan - 10/10/2010 18:55:37

Evaluation II

In "Evaluating user interface systems research" Dan Olsen points out the difficulty of testing complex user interface systems, such as new interface design toolkits, because they are often not amenable to controlled user testing with walk-up-and-use users who are equally unfamiliar with other benchmarks. As an alternative he suggest that the STU context (situations, tasks, and users) could help frame the evaluation of complex systems.

The second paper, "Crowdsourcing user studies with Mechanical Turk" moves In a completely different direction, away from the subjective consideration of a new system's importance, novelty, and generality that Olsen suggests. The authors suggest using Mechanical Turk to crowdsource the measurement of user performance (speed and accuracy) on micro-tasks. They find, unsuprisingly, that this platform is not suited for gathering subjective opinions.

I found the Olsen paper insightful, especially in the context of my interest in search user interfaces for linked data sets. Linked data is unfamiliar to the public, so new interfaces to it are difficult to compare fairly with familiar search user interfaces.




Siamak Faridani - 10/10/2010 18:58:44

The author of the first paper, "Evaluating User Interface Systems Research," builds a framework for critically analyzing research in interface systems. He starts from the fact that the field of HCI research has passed its first wave of chaos and experimentation, and now everyone is using one of the three major platforms (Mac, Linux, or Windows). Each of these operating systems has gone through UI and interaction design, so any researcher using one of these platforms is bound by the standards of that specific platform. This change requires re-evaluating our performance metrics, and the author seeks to build his framework to make sure that research in interface systems leads to significant improvements over older systems and not just small incremental ones.

In the first section he talks about why it is important to continue systems research. Computation nowadays happens everywhere, and our computing platforms change continuously. This change requires new interactions and even new theories and approaches. The author also looks at the benefits of UI systems research. He highlights that new architectures will lead to more effective UIs and will make it easier for others to build creative tools on top of the new architecture. Systems also make it easier for developers to understand standards and best practices (in contrast to using only written manuals). These toolkits make it easier for newcomers to enter the field and make an impact or extend it. And finally, they make it easier for application developers and designers to think about scale.

One aspect of the paper that I truly enjoyed is where he talks about the usability trap. HCI researchers have complained that current user studies may not be sufficient to determine the effectiveness of a UI, and in this paper he points out three faulty assumptions in UI evaluation that may lead to good user-study results but still fail in the final product.

In the last pages of the paper he highlights the axes along which performance should be measured. For example, the work should simply be important; the scuba diving example here was brilliant. It should address a novel problem, something that is not already solved, or provide a solution that outperforms a former solution by a large margin. It should be generalizable and flexible, usable in many different situations; solutions that are tightly constrained to one specific problem are less valuable. It should benefit UI designers, be able to fit into a larger tool set, and work with other toolboxes. It should also integrate well and talk to other tools through standard protocols. And finally, it should be scalable.

The other paper is about using Mechanical Turk to perform user studies. My short experience with the CHI community shows that it is obsessed with a number of things, among which I would rank the quality of controlled experiments the highest (after, obviously, the p-value in hypothesis testing). I have seen reviewers reject a paper simply because the study was done online and no in-lab control group was used.

This paper provides an interesting view into that problem. The authors design two sets of experiments that are equivalent in terms of scientific outcome but structured differently. In the second experiment the authors embedded elements to make sure that the user is not just clicking through the questions and is actually providing valuable information.

They show that the second experiment was closer to the control group (in this case Wikipedia admins, who might not be experts either). They also use interesting variables; for example, they use the median completion time and show that it is significantly larger in the second experiment.

To me the second paper is very interesting research, but there is a paradox here. Let's assume you have designed an experiment to be run on MTurk and you have included many questions to make sure that users are not submitting useless data. In this case, how can we make sure that the results are consistent with the case where we run an in-lab experiment? I believe an answer to this question might be a good contribution to CHI.


Richard Shin - 10/10/2010 19:00:21

Evaluating User Interface Systems Research

This paper introduces methodologies for evaluating work in user interface systems. The author notes that recently, the rate of research into new user interface architectures has declined, due to stabilization into a few platforms, lack of expertise among researchers, and lack of a way to evaluate new work. However, the invalidity of old assumptions and the development of new form factors, input devices, and interactive platforms are creating a fertile environment for more research in this field. The paper then presents pitfalls in, and methods for, UI system evaluation.

This paper doesn't seem quite like anything we've read before, in that it directly discusses specific evaluation techniques, independent of interaction techniques. Other related papers have mostly discussed the topic of evaluation in HCI in much more general terms, or evaluated specific systems. I agree with the paper's premise that we are seeing a resurgence in the need for new user interface systems, given, for example, the growing popularity of smartphones over the last few years, which require entirely new ways of thinking about how to build user interfaces, and correspondingly new user interface systems to support that work. I feel that works like this paper could guide the evolution of such systems and ensure that their creators and miscellaneous critics properly evaluate them.

I'm not as convinced, however, about the specific applicability of all the methods presented in the paper. In essence, they don't seem to have received the kind of evaluation that they are promoting, in terms of how valid or useful they may be. It would have been helpful to have a more systematic set of examples, I think, rather than a few disconnected anecdotes.

Crowdsourcing User Studies With Mechanical Turk

This paper discusses how to design 'micro-tasks', such as those traded on Amazon's Mechanical Turk, for the purpose of conducting user studies cheaply with a large and potentially diverse target group. The authors conducted two experiments in which they offered two slightly different tasks to Mechanical Turk users, asking them to rate Wikipedia articles. They discuss the shortcomings of their first experiment in collecting useful data, and note how the modifications in their task design led to much better results in the second trial, generalizing them into lessons for future researchers using Mechanical Turk.

I felt that this paper was very similar to the Soylent paper in its overarching goals, and was reminded of it continuously while reading. Although this paper predates the Soylent one by two years, the methods that each employed to gather useful, high-quality data seemed rather different. The authors of this paper asked Mechanical Turk users in the revised experiment to collect objective, quantitative metrics about the articles in addition to the subjective article rating, since then the users would be forced to actually read the articles in question, in which case they might as well answer the desired question sincerely. The Soylent authors instead pitted Mechanical Turk users against each other, paying users to verify other users' work. In any case, these techniques seem largely orthogonal to each other, and future research could combine them as needed.

I was unsure about the paper's premise of evaluating the usefulness of conducting user studies through Mechanical Turk, though, given what it discussed. I didn't really see how evaluating the quality of Wikipedia articles is like taking part in a user study as part of a design process, or how the techniques used to improve the quality of the answers here would be applicable to improving the quality of Mechanical Turk-collected user study data. I would have appreciated more discussion into these aspects by the authors.


Arpad Kovacs - 10/10/2010 19:02:16

The Crowdsourcing User Studies With Mechanical Turk paper is concerned with performing large numbers of randomly sampled user studies at low cost using micro-task markets such as Amazon Mechanical Turk. The article focuses on two similar experiments, which asked turkers to rate the quality of Wikipedia articles and offer suggestions for improvement. The researchers modified the questions between the two experiments in order to discover how various factors affect the degree of user effort/involvement and the quality of answers provided by turkers.

The main contribution of the paper is its design recommendations, specifically: 1) Ask quantifiable, verifiable questions to keep users honest. 2) Formulate tasks that take the same amount of effort to perform correctly as it takes to provide a wrong answer. 3) Use multiple criteria for detecting suspicious answers. I think that much of this is common sense, since most people are lazy and will only put in the minimum amount of effort required to obtain the reward. Therefore, if there is an opportunity to cut corners without any adverse consequence (for themselves), then many unethical people will do so. Unfortunately for usability studies, many questions are subjective in nature (e.g., how hard was it to use the interface?) and would thus be hard to quantify and verify automatically as stated in point #1, so in this case recommendation #2 would be more applicable. For example, on a short-response question, the program could require users to spend at least 30 seconds writing, require a minimum 50-word response, and filter duplicate copy-paste answers; since the user cannot take any shortcuts, maybe they will actually respond to the question.
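A minimal sketch of those effort checks (my own illustration, not from the paper; the field names and thresholds are made up):

 MIN_SECONDS = 30   # minimum time spent writing
 MIN_WORDS = 50     # minimum length of the free-text response
 
 def filter_responses(responses):
     """Drop answers that were too quick, too short, or copy-paste duplicates."""
     seen = set()
     kept = []
     for r in responses:
         text = r["free_response"].strip().lower()
         if r["elapsed_seconds"] < MIN_SECONDS:
             continue   # answered too quickly to have done the task
         if len(text.split()) < MIN_WORDS:
             continue   # too short to be a substantive response
         if text in seen:
             continue   # duplicate of an earlier submission
         seen.add(text)
         kept.append(r)
     return kept
 
 responses = [
     {"elapsed_seconds": 45, "free_response": "The navigation menu is confusing because " * 15},
     {"elapsed_seconds": 4,  "free_response": "looks good"},
 ]
 print(len(filter_responses(responses)))   # 1: only the first entry passes all three checks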

I found it interesting that only a small percentage of users gamed the system; however, these users seemed to be the most active (probably because they could spend less time per question than honest users). I think that the ultimate solution to the issue of ensuring consistently high-quality answers is to reward turkers based on the quality of answers rather than the quantity. For example, the submitted answers could be reviewed by a second level of turkers, who reward the top 10 answers with 1 point and the worst 3 answers with -1 point, with points convertible into dollar amounts at some prespecified exchange rate. Since the payoff is now directly dependent on the quality of answers, turkers will compete to score within the top 10 bracket, while lazy turkers will be discouraged from writing bogus answers because of the associated penalty. Such a system would reward the best performers while filtering out the riff-raff, providing a higher signal-to-noise ratio.
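A quick sketch of how such a quality-based payout might work (everything here, including the point values and exchange rate, is hypothetical):

 EXCHANGE_RATE = 0.25   # dollars per point; an arbitrary made-up rate
 
 def score_round(ranked_answer_ids):
     """ranked_answer_ids is ordered best-first by the second level of reviewing turkers."""
     points = {a: 0 for a in ranked_answer_ids}
     for a in ranked_answer_ids[:10]:    # top 10 answers earn a point
         points[a] += 1
     for a in ranked_answer_ids[-3:]:    # worst 3 answers lose a point
         points[a] -= 1
     return points
 
 def payouts(points):
     # Negative balances simply earn nothing rather than charging the worker.
     return {a: max(p, 0) * EXCHANGE_RATE for a, p in points.items()}
 
 ranking = ["answer%d" % i for i in range(20)]       # 20 hypothetical answers, best first
 print(payouts(score_round(ranking))["answer0"])     # 0.25: a top-10 answer earns a point
 print(payouts(score_round(ranking))["answer19"])    # 0.0: a bottom-3 answer is penalized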


The Evaluating User Interface Systems Research paper serves as a rallying cry for renewed innovation that moves beyond the entrenched display, mouse, and keyboard interface paradigm. Olsen claims that a new system architecture/toolkit is the solution, since it would simplify and increase the efficacy of the development process, lower barriers to entry, and create a common, standardized infrastructure, and thus result in economies of scale. The author begins his discussion of the analysis criteria by identifying pitfalls that should be avoided, e.g., assuming that users can walk up to a system and just start using it, or comparing the new system to an existing one.

The most valuable part of the paper is the provided list of desirable characteristics which a new candidate system architecture should possess. The interface system should provide solutions to as many unsolved problems as possible, and provide significant (>100%) performance gains in accomplishing existing important problems. The system should also be flexible and expressive, in effect narrowing the gulf of execution between the designer's intentions and the actual implementation steps as much as possible by providing an intuitive set of solution options for a given design problem. Finally, the architecture should consist of modular building blocks that can be combined, scaled, and linked for easy communication and integration, analogous to the Unix pipe system.

I disagree with the paper's opening premise that user interface development is languishing. First of all, in the past three or so years there has been an explosion of innovation in the mobile device UX space, ranging from new multitouch smartphone interfaces (iOS, Android, WebOS) to windowing systems that have been optimized for small-screen tablets and netbooks (MeeGo, OLPC Sugar, custom Linux interfaces for the Eee PC). Instead, I would argue that due to user demands for consistency (e.g., people complaining about the Office 2007 ribbon interface) and security concerns (good luck getting users to install your unsigned shell extension), innovation has shifted from the big two desktop systems (Windows and Mac OS X) to various embedded Linux distributions, which are more easily customizable (open source ftw) and can be tailored to the affordances of a particular device without carrying the burdens of backwards compatibility ("legacy code") or catering to prior expectations.

Another criticism: as the paper notes, it would be ideal to test a new interface on someone who has no experience with existing interfaces; but good luck finding such a person. Computers have been in common use for some 20 years now, and are an essential part of modern business and life; you would probably have to go to a developing country to find a person who has never used a computer.

Otherwise, I think that the goals of the ideal user interface architecture advocated by the paper are quite rational; however, it is unlikely that a revolutionary break with the past, like going from the command line to the graphical desktop, will happen again. Instead, I think the more likely course is that these ideas will be gradually integrated into existing systems.


Drew Fisher - 10/10/2010 19:09:47

Crowdsourcing User Studies with Mechanical Turk

The key contribution of this paper was to show that user studies can be performed in crowdsourced environments, under certain conditions. As in "Designing Games with a Purpose," it is crucial that the least-effort approach for the user is also the one in which the user makes a good-faith effort to perform your task. This can be done by requiring participants to answer factual questions that are easily verified, as well as requiring summaries of what they were supposed to read.

This holds promise, as running studies on MTurk could prove cheaper and easier as time goes on, enabling new research methods.


Evaluating User Interface Systems Research

This paper discusses difficulties in fairly and fully evaluating UI systems and toolkits. A key point raised is that many usability tests are on very short-term issues, and to properly evaluate a system may be prohibitively expensive.

To measure the value of a UI system, the author suggests that we view systems in terms of Situations, Tasks, and Users, and weigh the product of the levels of importance of each when determining the value of a system.
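As a small illustration of that weighting (my own sketch; the paper does not prescribe numeric scales, so the 0-to-1 scores below are made up):

 def stu_value(situation, task, user):
     """Rough value of a system as the product of the importance of its situations, tasks, and users."""
     return situation * task * user
 
 # A niche situation (0.3) with critical tasks (0.9) for a broad user group (0.8):
 print(stu_value(0.3, 0.9, 0.8))   # 0.216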

After that, the author takes more of a shotgun approach, discussing things to think about when evaluating UI systems, their value, and the progress they enable. In particular, evaluators should be aware of how the tools may integrate into existing solutions, benefiting from their entrenched value. I found this to be in stark contrast with the earlier statement about avoiding the need to support legacy code. Curious.


Aaron Hong - 10/10/2010 19:17:01

In "Crowdsourcing User Studies with Mechanical Turk" by Kittur et al.they discuss using Mechanical Turk (aka the micro-task market) in collecting user input for experiments. In this case they ran two studies on "rating Wikipedia articles," which were essentially identical except the second one was improved crafted questions to get higher quality data. In the first case they had a major issue with people trying to "game" the system. They came up with 3 ways to improve the quality: (1) verifiable questions, (2) making the "gaming" take as much effort as coming up with the right answer, (3) detecting suspect answers.

I think what's interesting is that most people know that small, easy, and fairly objective questions are to some degree reliable to run on MTurk, but in this paper the authors were able to show that they can collect fairly accurate responses on a subjective task. The only thing is that I would like to know the variance in the quality of the articles that were chosen, and also how useful some of the subjective answers were even though they were "valid." Much more can be done on the subject, especially since this experiment was fairly small and narrow.

In "Evaluating User Interface Systems Research" by Dan Olsen talks about the current state of UI research and how it depends too much on simple metrics, which would only produce simplistic progress. He points out the usability trap in which we produce things in which only usability tests can measure (which he claims is too narrow due to issues such as scalability) and other issues with current UI research. He advocates a variety of "alternative standards by which complex systems can be compared and evaluated." There are many barriers to making progress (such as if I did "this" then I wouldn't be published because it doesn't follow the UI norm) and I agree that measure of whether something is valuable or good research should not be based on narrow metrics. However, when we evaluate these more complex systems we need to be clear (he uses an STU evaluation), demonstrable (to some degree), and verifiable--we don't want to fall into subjectivity and pure rhetoric.