- 1 Reading Responses
- 1.1 Chulki Lee
- 1.2 Kristal Curtis
- 1.3 Dave Rolnitzky
- 1.4 Philipp Gutheim
- 1.5 Sally Ahn
- 1.6 Beth Trushkowsky
- 1.7 Siamak Faridani
- 1.8 Kurtis Heimerl
The Online Laboratory: Conducting Experiments in a Real Labor Market discussed the validity of using crowdsourcing markets for experiments, especially economics experiments. The paper covered various issues of internal and external validity. In several places, for example in Section 5.1, "Representativeness", the paper seemed to defend the use of a "real labor market" for experiments on the grounds that physical laboratory experiments have the same problems. However, I think this misses a significant difference between them. For example, it would be more difficult to bound the significance of results in online experiments than in physical experiments, because online experiments offer less control over choosing and filtering participants, as the paper itself also discussed.
The second paper, Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design, investigated the viability of using MTurk for perception experiments. The authors replicated previous studies on MTurk and showed that the results matched. This comparison method is very different from the first paper's. I think the former is more theoretical and thus more appropriate for testing experimental validity.
Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design: I've been wondering about the adoption of crowdsourcing for user studies, so it was interesting to read this work. I wonder if this is making its way into the social sciences as well. I also wonder if the authors had to fill out any paperwork for human-subject experimentation. I was actually somewhat surprised that they claimed to replicate the results of the studies they mentioned, as their results seemed pretty far off (Figures 4 and 8). I was very interested in the section about samplers and streakers, and I wonder how much impact the streakers have on the bias of the sample obtained; it seems that this could be quite problematic. I'd like to see an experiment comparing HIT results where in one case you control the contribution of each Turker and in another case you let each Turker contribute as much as they'd like. They also raised the question of experimental repeatability. It'd be interesting to explore that as well.
The Online Laboratory: Conducting Experiments in a Real Labor Market: This paper was a very interesting foray into experimentation on MTurk by economists. I found section 4, where the authors explore factors that make online experiments more difficult, particularly interesting. The authors first touch on the importance of independent observations; however, they only mention the possibility of users having multiple accounts as a way that independence might be violated. Independence can also be violated if there is a shared bias among the Turkers, which is likely among people from similar geographical locations, socioeconomic levels, etc. They touch on this in section 4.2, when they say that stratified sampling could be used to divide the experimental population into treatment and control groups. I think it'd be very interesting to explore this further.
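A minimal sketch of what such stratified assignment might look like, where workers are balanced across treatment and control within each stratum (the `country` field and the grouping key are illustrative assumptions, not from the paper):

```python
# Sketch: stratified random assignment of workers to treatment and control,
# stratifying on a shared attribute (here, country) so that any bias shared
# within a stratum is balanced across the two groups.
import random
from collections import defaultdict

def stratified_assign(workers, key, seed=0):
    """Split workers into treatment/control, balancing within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for w in workers:
        strata[key(w)].append(w)        # group workers by stratum
    treatment, control = [], []
    for group in strata.values():
        rng.shuffle(group)              # randomize within the stratum
        half = len(group) // 2
        treatment.extend(group[:half])  # first half -> treatment
        control.extend(group[half:])    # rest -> control
    return treatment, control
```

With 4 US and 4 Indian workers, each group ends up with exactly 2 of each, so geography cannot confound the comparison.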
The Online Laboratory: Conducting Experiments in a Real Labor Market
The paper argues that the emergence of online labor markets has addressed the problems of recruitment/payment and the assurance of internal validity. The paper identifies the major differences and challenges of online experiments compared with traditional laboratory experiments, and seeks to answer why these online experiments may or may not work. The authors do this through the reproduction of a series of classic experiments. These replications suggest that "…online experiments can be an appropriate tool for exploring human behavior, and merit a place in the experimentalist’s toolkit alongside traditional offline methods, at least for certain research questions."
I found their recruitment method of stratifying based on arrival time (arrival time has a strong relationship to demographic characteristics) reasonable, though I'm not totally sold that it creates a diverse pool of participants. As in a lab experiment, the authors could have chosen a more explicit way to ensure a more diverse pool of participants. Also, the authors do address self-selection bias, but state this is no different from a laboratory situation: "That is not a concern, as those people are exactly analogous to those who view an announcement for an offline, traditional experiment but don’t participate." This might be true, but unlike in a lab experiment, workers on Mechanical Turk can see the entire experiment and then easily opt out. I believe this would be much less likely in a lab setting, where participants have to invest time and effort to physically show up to find out the details of an experiment. The authors concede later in the paper that representativeness is a challenge with online experiments done on MTurk: "However, even if subjects 'look like' some population of interest in terms of observable characteristics, some degree of self-selection of participation is unavoidable."
Overall, I found this article to be a good analysis of the value of conducting online experiments, and a good guide to which experiments are best conducted online vs. in the laboratory. Conducting traditional "laboratory" experiments on MTurk was a good way to make an almost apples-to-apples comparison (even though there were some slight tweaks on MTurk, like wage differences compared with the lab).
Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design
The paper assesses the feasibility of using MTurk to evaluate visualizations. The authors replicated previous lab studies, demonstrated the use of crowdsourcing to generate new perception results, analyzed the performance and cost of MTurk, and provided recommendations. The authors find that crowdsourced experiments can contribute new insights for visualization design.
The most valuable part of the article, as in the Horton et al. article, is the demonstration that crowdsourcing can be a viable platform for experimentation and may have advantages over experimentation in the lab. The authors also had some good recommendations for how to improve the quality of responses. However, I found the authors' recommendations about visualization design to be really narrowly focused and specific, and I'm not sure how much value some of these pixel-level findings will have as the tools used to view and create these visualizations continue to change rapidly. For example: "Our chart height and gridline spacing experiment suggests optimized parameters for displaying charts on the web: gridlines should be spaced at least 8 pixels apart and increasing chart heights beyond 80 pixels provides little accuracy benefit on a 0-100 scale."
The Online Laboratory
The Online Laboratory is a piece that investigates whether online experiments on MTurk can be as valid as offline/physical laboratory experiments while providing the benefits of being less expensive and less (logistically) complex to conduct. The results of the paper are quite interesting. In fact, I was conducting a survey on MTurk with around 550 participants and was wondering to what extent the findings are valid. I noticed how the authors assign/recruit workers to the survey: by arrival time. Both intuitively and based on my own survey, workers from different geographies are more likely to complete a HIT during particular time frames. Indian workers are typically active when it is night in the US, since Indian Standard Time is IST = UTC+5:30 while PDT = UTC-7.
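As a quick illustration of this arrival-time/geography relationship, here is a small sketch of my own (not from the paper) that converts a HIT's posting time in UTC into local clock time for the two regions, using the standard-library time zone database:

```python
# Sketch: why a HIT's posting time correlates with worker geography.
# The same UTC instant falls at very different local clock times in
# India and on the US Pacific coast.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def local_hours(utc_dt):
    """Return the local hour in Kolkata and Los Angeles for a UTC datetime."""
    ist = utc_dt.astimezone(ZoneInfo("Asia/Kolkata"))         # UTC+5:30
    pac = utc_dt.astimezone(ZoneInfo("America/Los_Angeles"))  # UTC-7 or -8
    return ist.hour, pac.hour

# A HIT posted at 18:00 UTC is late evening in India but mid-morning
# on the US Pacific coast, so the worker pools barely overlap.
```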
Crowdsourcing Graphical Perception
The Crowdsourcing Graphical Perception paper analyzes whether MTurk can be a valid method for evaluating visualizations. Generally, I believe it is not much of a surprise that turkers can be leveraged to evaluate visualizations. Visualizations are supposed to be understandable by everyone, and hence, regardless of whether the evaluator is a turker or an "offline participant," both should be able to give valuable feedback. However, something potentially important is the international/inter-cultural background of the turkers. It is unclear whether international groups of participants are usually used for "non-MTurk" visualization evaluation. If not, this should be addressed when publishing HITs on MTurk, since turkers can have a diverse set of cultural backgrounds (assuming that cultural background impacts the evaluation).
The Crowdsourcing Graphical Perception paper evaluates the reliability of perception user tests conducted through a crowdsourcing platform like Mechanical Turk rather than in a laboratory setting, by replicating previous experiments and examining a few new ones. Their replications yield results similar to the older laboratory experiments, leading the authors to conclude that crowdsourcing is a viable method for gathering experimental results for perception-related tasks. However, the authors also note some factors that are difficult or impossible to control in crowdsourcing, such as the variety of display devices and their settings. They mention that they considered gathering each worker's display configuration information, but decided such information was unreliable because other factors, like the viewing angle of LCD monitors, can drastically alter the display's contrast. Instead, they estimate the monitor gamma from operating system information. This was confusing to me, because it doesn't address the viewing-angle problem and yields less accurate display configuration information. The authors point out that the loss of control over some of these factors may actually be a more accurate representation of the variance in the user population, but subsuming this variance into the experiments themselves would make it difficult to develop new theories, since we would lose information on the factors that could have led to the results.
The Online Laboratory evaluates the validity of more general experiments conducted with online users. The authors also replicate previous experiments to show that MTurk workers provide data similar to laboratory experiments. It seems that most challenges for crowdsourced experiments involve internal validity. For example, guaranteeing unique and independent observations is harder online than when the subjects are physically present, and experimenters must check for multiple accounts. The authors mention alternative labor markets where users are non-anonymous, and I would be interested to see how replicated experiments on those platforms compare with MTurk. Although the authors state that multiple accounts are rare, I am curious whether the demographics of alternative platforms differ significantly. The authors address randomizing assignments as well, which I found interesting. I'm not sure how effective their time-based blocking design would be, since the demographics within a single timezone can vary greatly. The SUTVA problem sounds harder to control. The authors' advice here is to run experiments quickly (before workers can start discussions about them), to "keep them unremarkable," and to periodically check the message boards. The first two pieces of advice put a limitation on the experiment design that may not be acceptable in all cases. As for the last, how can experimenters be sure they haven't missed a damaging discussion? The authors claim that the "natural mode" of such conversations is public discussion forums, but I am not convinced that such discussions do not occur elsewhere.
The goal of "Crowdsourcing graphical perception" was to demonstrate that Mechanical Turk (or similar services) is as effective for performing visualization design experiments as in-person laboratory studies. In general I thought the paper was a little anticlimactic, particularly the set of experiments that simply replicated the laboratory results... it wasn't surprising that human beings act like human beings. Perhaps I would have appreciated it more if doubts about why mturk would be so different had been emphasized more. I also thought the setup for each of the experiments wasn't very detailed, particularly for people not familiar with the original experiments. Regardless, a few interesting points came out of the paper. One issue the authors faced was data quality: they note that 10% of responses for an experiment were garbage when they weren't using qualification questions. I had always leaned away from qualification questions because I thought they might deter turkers from working on the HITs, but maybe that's not true, or it's worth the cost anyway? I also thought they had an "interesting" way of determining outliers, as responses that deviated from the true answer by more than 40% (see results for experiment 1B). I hope to see more techniques that are less arbitrary and/or more generalizable across question types. I also liked how the authors highlighted the variability in the subject pool. While they didn't mention whether that variability affected their results, I think it'd be good for us to reason about when or if this variability matters.
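One less arbitrary alternative to a fixed 40% cutoff would be a robust rule based on the median absolute deviation (MAD), which adapts to the spread of each question's responses. A minimal sketch (the k=3 threshold is a common convention and my own assumption, not from the paper):

```python
# Sketch: flag responses as outliers when they lie more than k median
# absolute deviations (MAD) from the sample median. Unlike a fixed
# percentage band, the cutoff scales with each question's own spread.
def mad_outliers(values, k=3.0):
    """Return indices of values more than k MADs from the median."""
    s = sorted(values)
    n = len(s)
    median = (s[n // 2] + s[(n - 1) // 2]) / 2
    devs = sorted(abs(v - median) for v in values)
    mad = (devs[n // 2] + devs[(n - 1) // 2]) / 2
    if mad == 0:  # degenerate case: most responses identical
        return [i for i, v in enumerate(values) if v != median]
    return [i for i, v in enumerate(values) if abs(v - median) / mad > k]
```

For example, in the responses `[10, 11, 9, 10, 10, 50]` only the `50` is flagged, regardless of what the "true" answer is.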
"The Online Laboratory" did a better job of discussing why online markets like mturk might be more challenging for experimentalists: e.g., workers being able to see the specifics of a task before deciding to do it makes coping with attrition more difficult, as does ensuring that workers aren't colluding when they shouldn't be. This paper is a great resource for understanding the implications of moving from offline to online experiments, from task design to interpreting results. I thought the notion of trust---workers trusting experimenters to pay them---was interesting, and I'd like to see more experiments there, perhaps correlating trust with worker effort or using it to detect question ambiguity.
The authors claim that running experiments on Mechanical Turk is not only cheap and affordable but also provides reliable results. To test this claim I designed a very simple experiment on MTurk: Turkers are asked to provide a four-digit random number. It seems that Turkers failed to provide proper four-digit random numbers. I haven't tested it thoroughly, but since the numbers are independently provided I was expecting a roughly uniform distribution, while the distribution of the numbers that I have is heavily biased towards the two ends of the spectrum.
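One way to make the "heavily biased" impression testable is a chi-square goodness-of-fit statistic against a uniform distribution over the four-digit range. A rough sketch (the bin count and range are my own choices, not part of the experiment as described):

```python
# Sketch: chi-square goodness-of-fit statistic for worker-supplied
# four-digit numbers against a uniform distribution. Large values
# indicate the responses are far from uniform.
from collections import Counter

def chi_square_uniform(samples, bins=10, lo=1000, hi=10000):
    """Bin the numbers and compare observed counts to a uniform expectation."""
    width = (hi - lo) / bins
    counts = Counter(min(int((s - lo) // width), bins - 1) for s in samples)
    expected = len(samples) / bins
    return sum((counts.get(b, 0) - expected) ** 2 / expected
               for b in range(bins))
```

Perfectly uniform data gives a statistic of 0; the statistic grows as mass piles up in a few bins, and it can be compared against a chi-square distribution with `bins - 1` degrees of freedom to get a p-value.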
I think the authors fail to point out that people are terrible at certain tasks (like replicating an RNG algorithm), and in addition I strongly believe people behave differently when they are in a lab compared to when they are in the convenience of their own homes.
I personally found the second paper (Heer et al.) much more interesting: they show that they were able not only to replicate older experiments successfully but also to perform new experiments at a fraction of what they might pay for in-lab experiments (with only 15% of the budget of an in-lab experiment). They don't extend their results from the field of HCI to every kind of experiment, and that's what I enjoyed about their work. They found a problem for which crowdsourcing provides good results, and they show that those results are consistent with in-lab experiments. Perhaps we can add visualization validation to the list of jobs shown to work on MTurk, in addition to database cleanup, OCR, and business location validation.
The Online Laboratory: Conducting Experiments in a Real Labor Market
Econ papers are great. Also, they backronymed CAPTCHA? That's amazing. The paper itself is nice, thorough in that way that economists can be thorough. While recreating classic experiments validates mechanical turk experiments in some way, it certainly doesn't pin the issue down.
I liked that they looked at a lot of the ethical issues surrounding this new experimental methodology. There were good "turf wars" about deception, which I love. God damn sociologists ruining it for everyone. Econ is one of the fields with a great amount of self-awareness about its methods, with centuries of debate on how to appropriately measure various metrics.
As with the previous study, recreating a few classic experiments does not prove that a new experimental paradigm is valid. This is the induction problem, and N=2 is a particularly bad example of it. However, I don't have any particular argument for why turkers would be incapable of doing this task; they are just people. I think the fact that there's theoretical and experimental backing indicates that this is probably a fine way to go.
I wish people would take a more aggressive role when doing studies like this: try to demonstrate that it is possible (via likely mistakes) to fail at recreating classic studies. Things that should generalize across all humans should work in both methods. The trick is to recognize the differences and common failures, and prepare for those. Maybe that doesn't get into CHI, I dunno.