Surveys and Taxonomies
- 1 Readings
- 2 Discussant's Slides and Materials
- 3 Reading Responses
- 3.1 Kristal Curtis
- 3.2 Chulki Lee
- 3.3 Kuang
- 3.4 Sally Ahn
- 3.5 Wes W.
- 3.6 Dave Rolnitzky
- 3.7 Beth Trushkowsky
- 3.8 Nicholas Kong
- 3.9 Anand Kulkarni
- 3.10 Prayag Narula
- 3.11 James Cook
- 3.12 Manas Mittal
- 3.13 James O'Shea
- 3.14 Philipp Gutheim
- 3.15 Kurtis Heimerl
- 4 Discussion Summary and Conclusion
- Paper 1: Human Computation: A Survey and Taxonomy of a Growing Field. Alexander J. Quinn, Benjamin B. Bederson. CHI 2011.
- Paper 2: Analyzing the Mechanical Turk Marketplace. Panagiotis G. Ipeirotis, ACM XRDS, December 2010.
Discussant's Slides and Materials
Ipeirotis's paper was meant to give a feel for the size and makeup of the Mechanical Turk marketplace, in terms of the number of requesters/workers, the number of tasks present and the rates of posting/completion, and the types of tasks available.
The Quinn et al. paper was a more general look at the entire field of human computation, defining it, explaining what it does/does not include, and giving a taxonomy of the area's many projects.
The Ipeirotis paper is a great introduction to Mechanical Turk. As he points out early in the paper, people often have a lot of the same questions, and this paper is a great summary of the current state of the platform. He made good use of approximations and did a great job explaining the limits of his estimates. I liked his use of statistical and queueing models to provide a rough feel for the marketplace dynamics. He also pointed out some research questions of interest, such as how to design and price tasks to reach certain completion time/cost goals. He also pointed out that the sorting criteria by which workers find HITs have a detrimental effect on the predictability of the marketplace, suggesting that other sorting criteria would be in order. I'm sure that the AWS folks are taking this into consideration. I have run some experiments on MTurk, and I agree that more input during HIT design would be helpful. When I posted my own HITs, I went through a round of trial HITs and feedback from Turkers on the Turker Nation blog. It would be great to streamline this process a bit, possibly involving some of the best Turkers.
The Quinn et al. paper zoomed out, focusing not just on MTurk but on a wide range of systems/projects that involve people and arguing for a taxonomy of all such efforts. The authors first argue for the boundaries of human computation, stating that the key feature is that a person initiates a task, and some combination of computers and people respond to the task. These boundaries were probably the most interesting/thought-provoking part of the paper for me. I agree that having a sense for the scope of the field is important, but I didn't immediately agree with all their boundary decisions. For example, Wikipedia seems to be a tricky case, since while sometimes people do indeed decide which articles to create (as the authors state), other times, Wikipedia users can request an article. In that case, the system does seem to fall within the umbrella of human computation. I guess this brings up a question regarding whether the sets of requesters and workers must be disjoint. Otherwise, I thought that the work was helpful for understanding what's going on in the field. I was also interested in some of the citations under the "Quality Control" heading, as quality seems to be one of the most important issues related to this sort of problem.
What kind of marketplace is crowdsourcing creating? Is this marketplace really different from others? The latter article raised several questions across areas, such as how cost can (or should) be determined, what kinds of jobs would fit in the market, and how we can estimate the completion time of tasks.
It is obvious that both the social and technological aspects of the marketplace should be considered to answer these questions. For example, why do users participate in the market? It seems that most people are working there to earn a little money in their free time, but some people might treat the market as a serious workplace - as discussed in class. If that happens, what kinds of problems would we have? A different incentive system would be required, and a "learning system" might be needed to guarantee quality.
Currently, the AMT marketplace is quite narrow - perhaps because it is mainly for tasks that computers cannot do well, like image or character recognition. The first paper showed that there are many dimensions along which "crowdsourcing genres" vary, such as goals, incentives, and aggregation methods. Is it possible, technically and theoretically, to build a general crowdsourcing platform?
Two overview papers.
I felt that Panos's paper was nice but would have been much better had it broken tasks down by type and done the analysis that way. Also, I don't trust the volume data if they are only crawling every hour: it misses short-term tasks with lifespans under an hour. As well, it's subject to the SEO tricks that the VizWiz authors write about.
I really appreciated this taxonomy and its citations. One nit is that the Openings for Growth section is clunky, and one probably shouldn't do research this way.
The Quinn-Bederson paper provides a nice overview of existing systems that turn to humans for computing what computers struggle with today. It provides a classification for such systems and argues that this can promote research by encouraging novel combinations of the components the taxonomy identifies. Their paper provides a helpful context in which to analyze existing and new ideas for using human computation, but I am inclined to agree with Kuang that its focus should not be on identifying new research; "problems" identified in this manner may be contrived and not as meaningful.
Ipeirotis's paper analyzes AMT as a marketplace rather than as a collective intelligence system. By recording statistical data from AMT activity, it reveals important factors that arise in designing a crowdsourcing system. One such factor is the predictability of completion time for tasks, since increasing this reduces risk for requesters, thus bringing in more requesters and providing more tasks for more participants. Ipeirotis's suggestion of a framework that automates setting task design parameters seems like a good idea to me. For example, using each worker's past activity to suggest more tasks, rather than simply randomizing, may be a better solution to the problem of "forgotten" tasks that contribute to the heavy-tailed distribution of task completion times.
Both pieces provide some useful context for thinking about crowdsourcing and human computation.
The Quinn-Bederson paper outlines some useful dimensions across which we can compare existing human computation systems. I also think their characterization of the relationships and overlap between Collective Intelligence, Crowdsourcing, and Human Computation is reasonable, if still a little ambiguous (their hand-waving when discussing things like Wikipedia doesn't help). However, I find their assertion that Data Mining does not overlap any of their other categories a little problematic. My sense is that data mining is orthogonal to this taxonomy and that one could easily imagine employing human computation or crowdsourcing techniques as part of a larger data mining operation - for example, to handle classification, de-duplication, or quality control subtasks. Similarly, their assertion that "data mining…does not encompass the collection of data, whereas [human computation] necessarily does" rings hollow.
The Ipeirotis article provides a lot of useful data for characterizing the kinds of tasks posted on Turk and their throughput within the system. I do question the decision to poll only at hourly intervals (although I suspect this was due to technical limitations). This probably underrepresents whole categories of small-batch or one-off tasks that can easily come and go within an hour. Also, the observation that "forgotten" tasks can take arbitrarily long to complete adds an interesting air of urgency to the process of designing Turk tasks. I'd been wondering why the same task can sometimes take a few hours and sometimes take many times that, and this provides some intuition as to why.
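The "forgotten task" effect can be illustrated with a toy simulation (this is not from the paper; the 5% forgotten fraction, the exponential rate, and the Pareto tail are invented for illustration): even a small share of abandoned tasks is enough to make the overall completion-time distribution heavy-tailed.

```python
import random

random.seed(0)  # deterministic for reproducibility

def completion_time(p_forgotten=0.05):
    """Toy model: most tasks finish quickly (exponential, mean 1 hour),
    but a small fraction are 'forgotten' and only complete after a
    heavy-tailed (Pareto) delay. All parameters are illustrative."""
    if random.random() < p_forgotten:
        return random.paretovariate(1.1)  # heavy-tailed delay, in hours
    return random.expovariate(1.0)        # typical completion, in hours

times = sorted(completion_time() for _ in range(10_000))
median = times[len(times) // 2]
p99 = times[int(len(times) * 0.99)]
print(f"median: {median:.2f}h  99th percentile: {p99:.1f}h")
```

Under these invented parameters the 99th percentile lands many times above the median, matching the intuition that a handful of forgotten HITs dominates the tail of observed completion times.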
The Ipeirotis paper was a good overview of Mechanical Turk, and it answers many questions that people interested in Turk and similar crowdsourced marketplaces would find interesting. The value of this article isn't so much that it explains Mechanical Turk specifically, but that it suggests possible future directions for crowdsourcing research and provides evidence for a lot of presumed assumptions (e.g., workers are largely in the U.S. and India, the long tail of the market, etc.). I found it interesting personally because of some related research last semester.
The Quinn-Bederson paper was helpful in thinking about the greater context of human computation and how crowdsourcing fits (or doesn't fit) into this area of research. I could see this paper being important primarily for helping researchers discover new areas of interest, or even for helping an entrepreneur think of interesting new products or services. Personally, I felt that using the chart of new dimension pairs could be an interesting exercise in generating some of these ideas. None of this is particularly ground-breaking, and much of it is left up to the discretion of the authors. I also felt that the classification system was, by itself, pretty rigid. For all of the paper's talk about the importance of defining human computation as directed by a computational system or process, I found it interesting that the authors used their own judgment (and the judgment of their peers as to what's important or most relevant, a fact that manifests itself in the form of citations) in constructing their taxonomy.
"Analyzing the AMT Marketplace" describes analysis of over a year's worth of HITs on Amazon Mechanical Turk (AMT), including types of tasks, price per HIT, etc. The paper highlights that among the top requesters are several mediator services, which suggests that many users prefer an abstraction layer on top of AMT that hides worker quality issues and programming. The paper mentions an open problem of how to [automatically] price tasks, particularly how to price a task such that it completes in a target time. It seems that AMT provides no guarantees against HIT starvation, and I wonder how that may impact applications that try to use AMT as one step of a processing pipeline.
"Human Computation: A Survey and Taxonomy..." defines and compares related systems that involve humans: human computation, crowdsourcing, social computing, collective intelligence, and data mining. It provides a history for each term, and describes how each does or does not overlap with human computation. I wasn't quite convinced that data mining does not overlap with human computation, as it seems that humans could be tasked with finding patterns in data. The paper does a good job of classifying existing human computation applications on various dimensions (motivation, aggregation, etc.), although the process order category seemed a bit forced.
Ipeirotis used hourly scrapes of available HITs on MTurk to characterize the requester and worker trends on MTurk. He found requesters posted transcription and classification tasks most commonly, with the exception of Dolores Labs and Smartsheet.com, who post HITs on behalf of clients. A particularly interesting insight is that most workers find tasks via the most recently posted list, which has implications for task completion times.
Quinn and Bederson present a framework in which to think about human computation systems, and the relationship between human computation, crowdsourcing, and social computing. Like others commenting here, I was not convinced by their treatment of data mining as separate from human computation, since human computation seems to me to be a means to an end. I also think that placing crowdsourcing inside the bubble of collective intelligence doesn't encapsulate other, current uses of crowdsourcing. For example, in experimental studies, crowdsourcing is used as a fast and cheap (although biased) population sampling method, whose purpose may not be related at all to collective intelligence. Psychophysical experiments fall under this category.
I appreciate that Alex and Ben took the time to categorize the various types of problem-solving and collective intelligence. It's often difficult to distinguish between crowdsourcing vs. human computation vs. other kinds of collective intelligence when discussing these technologies, so a consistent taxonomy is an excellent contribution. I'd like to see these classifications used more frequently by researchers when describing their own work.
There are a couple of omissions in the discussion, however. Where do wisdom-of-the-crowd applications like collaborative filtering fit in? These are partly algorithmic but not human-driven, not quite data mining, and not obviously human computation. The authors also seem to omit the important _application_ dimension for distinguishing between kinds of human computation. This dimension makes a substantial difference across systems, and is in some ways more important than the execution order of the system. Real-time interfaced applications ("Wizard of Turk") like Soylent are substantially different from data annotation tasks and content creation tasks; this may even be useful to think about as a proxy for a task's cognitive complexity. Possibly the number of applications within human computation was not high at the time the paper was written.
Panos's predictive model is not the one I would have used, though I understand the choice of a simpler model given that all such models are imperfect. Also, despite having played with modeling completion times on Turk myself, and having advocated for people working on this problem, I question the research value of modeling these times.
Microtask markets aren't sufficiently commonplace for these models to generalize much beyond Turk, so the only remaining goal is to be able to predict or optimize completion times directly. But this can be accomplished through better means than theoretical modeling and optimization.
For example, Chilton, Miller, Horton et al. found that Turkers complete tasks that come up at the top of searches first, suggesting that worker job choice is dominated by the interface presenting them with work. If you wanted to optimize work times, you could fairly quickly build a better interface for searching for tasks. Systems like QuikTurKit also address this question in another way, by building a better way to recruit workers. Finally, you can solve the problem by experimental methods -- especially if you have a dataset like Panos's.
Quinn & Bederson’s paper on a taxonomy of Human Computation defines a framework for classifying different Human Computation applications. It does a very thorough review of the literature available on Crowdsourcing, Human Computation, and Social Computing. Despite the authors’ insistence on the rigid differences between these three, the survey is largely biased towards the different crowdsourcing platforms available today, which is understandable given that crowdsourcing is the least studied of the three. The paper has the potential to become a ‘classic paper’ because of its thoroughness and rich set of references. Though the authors do a good job of coming up with a rigid classification system, some of their claims seem misplaced; e.g., the authors’ insistence that Wikipedia is not a Human Computation system seems hollow given that Wikipedia can be thought of as an IR system.
Ipeirotis’s paper is an analysis of the current state of affairs of Amazon’s Mechanical Turk system. It uses statistical analysis to provide an overview of the platform, especially in terms of job requesters. The size of the market appears to be really small according to the paper, and one wonders whether a market worth half a million dollars in total is worth researching. But given the long tail of requesters, the general interest seems high. The insights into developing a prediction model for job completion time are especially interesting. The difficulty of developing such a model begs the question of whether it would be better to use a (less flexible) assignment-based model to assign jobs to workers, rather than the currently used selection model.
Human Computation: A Survey and Taxonomy of a Growing Field
This is a survey of the growing field of human computation; it lists, analyses and classifies the existing systems, and suggests future human computation projects.
This paper is important as a catalogue of the current state of human computation, and to unite a field by telling its different subfields about each other. It also suggests a method for inventing new human computation projects.
The paper classifies human computation systems along six dimensions: how the humans are motivated; how quality is controlled; how the human work is aggregated; what human skills are used; in which chronological order the worker, requester and computer are involved; and the number of tasks issued per request. It suggests exploring this six-dimensional space to find new ideas.
Analyzing the Amazon Mechanical Turk Marketplace
This article seeks rigorous answers to several practical questions about using Amazon's Mechanical Turk which only had anecdotal answers before.
The article lists four questions which are relevant to people who submit HITs to Amazon's Mechanical Turk, and describes an experiment in which data were gathered in a systematic way to answer those questions. The import of this article is that it gives the first careful answers to these questions.
The article presents the main purposes for which people use MTurk and the distribution of payments given for these tasks. It suggests further work in analysing the payments, to advise future HIT submitters. It looks at completion times to estimate that the hourly wage of a worker is $4.80.
The Ipeirotis paper is great - it talks about real metrics (a $4.80 hourly wage), identifies some core problem areas, and suggests some models that might be useful. One interesting thing to think about: how much of this is specific to Amazon, and how much might apply to the broader area of crowdsourcing (actually, are they the same right now)? I liked the 'call for action' with regard to building models to characterize question asking. It seems that it is the "Wild West" in terms of asking and answering questions, and there are no models (and few heuristics).
The Quinn et al. paper is a meta-research paper: a very synthetic, deliberate attempt to come up with a weak taxonomy/vocabulary. I disagree with this translation of commonplace verbiage into scientific taxonomy. Data mining seems like a pretty random addition to the paper. The authors define a set of dimensions and find an example of each. It would be more useful to take instances (say, the ESP Game) and map out their dimensions, instead of giving single examples. I was put off by the introduction - the link between Turing's and Licklider's work is tenuous at best.
The Quinn & Bederson paper surveys past human computation literature and presents a taxonomy of human computation. I found the most useful contribution to be the lists (and citations) of incentives and quality control strategies; I have found these two facets to be among the more challenging aspects of crowdsourcing studies. I was hoping the authors would have delved into the psychology literature more. I imagine there is a rich literature on experiments involving human subjects and how to reliably collect and characterize data.
The Ipeirotis article provides an overview of Amazon's Mechanical Turk marketplace using descriptive statistics on some of the basic aspects of the system (pricing, duration of HITs, etc.). This is useful information when designing studies for Mechanical Turk, although it falls short of providing general rules to predict how tasks will be performed. In the end, I think most requesters will still rely on trial and error to figure out how to design a study. Also, it seems like the marketplace may be changing rapidly, so I'm not sure how long these stats will remain valid. I agree with previous posters that it would have been better to collect data more frequently than once an hour.
The Quinn et al. paper provides a framework - a consistent vocabulary of terms and distinctions - for synthesizing and differentiating the body of work around Human Computation. In the first part, the paper provides definitions for the related and overlapping fields of Human Computation, Crowdsourcing, Social Computing, and Collective Intelligence. This seems useful for future research. However, the second part, in which the authors present a set of six dimensions for analyzing a system, is less solid. The dimensions Motivation, Human Skill, Quality Control, and Aggregation strike the reader as straightforward and reasonable. The other two dimensions, Process Order and Task-Request Cardinality, are rather weak and seem to have been derived in order to sufficiently distinguish existing systems.
Ipeirotis provides in his paper an interesting survey of Amazon's Mechanical Turk, answering common questions that pop up in discussion groups. Based on this information, one can argue that Amazon has failed to establish a sustainable crowdsourcing platform. The daily volume of HITs is about $1,000; since Amazon takes about 1% of the revenue, that comes to roughly $3,500 p.a. for Amazon. Also, the distribution of requesters is strongly Zipfian: the top 1% of requesters account for 50% of the dollar-weighted tasks. This leaves workers, mostly low-income US citizens or Indian workers, with an approximate hourly wage of $5.
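The concentration of spend among top requesters follows directly from a Zipf-like distribution. A minimal sketch (the requester count and exponent are assumptions chosen for illustration, not figures from the paper) shows how the top 1% of requesters can end up with about half the dollar volume:

```python
def zipf_shares(n_requesters, exponent=1.0):
    """Each requester's share of total spend under an assumed Zipf law:
    the rank-k requester spends proportionally to 1 / k**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_requesters + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative assumptions: 10,000 requesters, exponent 1.0.
shares = zipf_shares(10_000)
top_1_percent = sum(shares[:100])  # spend share of the top 1% of requesters
print(f"Top 1% of requesters account for {top_1_percent:.0%} of spend")
```

With these assumed parameters the top 1% hold roughly half of total spend, in line with the 50% figure cited above.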
Human Computation: A Survey and Taxonomy of a Growing Field - I was extremely skeptical of this paper before I started reading it. It's a very CHI work, a munging of the existing work on crowdsourcing. However, by the end, I felt as though I had learned a lot. Working on Umati, it's important for us to understand the space and where we fit. I think they missed one of our dimensions, targeting: for some systems, it's better to get a wide range of possible participants. Simple tasks fit this model; complicated tasks do not. I like this broader view of our work, really adding new dimensions to the area.
Analyzing the Mechanical Turk Marketplace - This work is somewhat aged, almost two years old! That is particularly problematic in such a dynamic, immature marketplace; the dynamics of both requesters and workers are in constant flux. However, going to mturk-tracker.com, not much seems to have changed. This makes me even more skeptical of the market, in all honesty. I know they didn't do a deep analysis of the workers, but mturk-tracker has no analysis at all. I guess I'm meandering now, but it's extremely strange that the requesting market hasn't changed much in the last year and a half, given that the worker market has changed dramatically. What does that mean?