Crowd Programming Tools

From CrowdsourcingSeminar


Paper 1: TurKit: Human Computation Algorithms on Mechanical Turk. Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. UIST 2010.
Paper 2: Generalized task markets for human and machine computation. Dafna Shahaf and Eric Horvitz. 2010.

Discussant's Slides and Materials

TurKit discussion slides (Kristal Curtis)

Reading Responses

Kristal Curtis


In this work, the authors present a toolkit intended to enable users to incorporate human computation into their programs in sophisticated ways. Rather than the usual approach, where requesters create large batches of independent tasks, TurKit enables users to create workflows containing tasks that build on other tasks. The authors present the crash-and-rerun programming model as one of their key contributions. I was surprised that they didn't compare their approach to caching or checkpointing, since that is basically what they are doing. Also, I initially thought the main point of crash-and-rerun was that you could keep incremental results even if you have an error downstream. However, they didn't emphasize this; instead, they emphasized that even a correct program would crash while waiting for HITs to return. This angle seems less important; it seems that you could get significant gains in efficiency by using asynchronously managed futures instead. I think the argument could be improved if they drew an analogy between their work and dealing with the memory hierarchy (local accesses cheap, remote accesses expensive) and supporting long-running computations (several Turk jobs, wanting to keep results obtained prior to an error). The online version of their toolkit looks really easy to use, and I'm definitely interested in checking it out. I also really like the idea of getting users to build off the work of others. The fuzzy text transcription was amazing -- it's very unlikely that any single person, no matter how well-intentioned, could perform the task, but just a few Turkers together were able to do a great job. Finally, I thought it was great that they got users from outside their lab.

Generalized Task Markets for Human and Machine Computation

In this paper, the authors explore how to combine machines and people to efficiently complete machine translation tasks. They claim that their approach is generalizable to many other domains as well. I liked their ideas about choosing coalitions from a set of available agents based on the requirements of a task and the competencies of the individual agents. One thing I found confusing about this, however, is that they talked about allowing coalitions to collaborate on low-level tasks, while all their examples/figures indicated that low-level tasks were assigned to individual agents. I assume that they are just stating that this could be possible in the future, but their explanation of how to choose a coalition for a task (i.e., based on the sum of the abilities of agents belonging to that coalition) seems very unintuitive. I wish they had explained this a bit more. I was also confused about how they obtained the scenarios for a given high-level task. In the beginning of the work (Figure 1), they claimed that task decomposition was part of their system; however, later on (Figure 5), they indicated that task writers had to provide the scenario (though they did use their algorithms to tweak it a bit). It seemed that most of the algorithmic work they presented in section 3 was overkill for the experiments that they actually did. I wish they had made it clearer that a lot of their work was vision, with a very simplified practical component. They also didn't really explore the pricing dimension, given that they used Microsoft employees as their crowd. For example, intuitively, experts should be costlier to employ than novices; however, they didn't incorporate this into their experimental section. I also thought the game element was somewhat tangential. Another idea from the work that I found interesting is the assertion that monolingual people combined with machine translation might be able to approach the translation quality of bilingual people. 
The question of when expert crowds are necessary seems to be a recurring one, and this work seemed to hypothesize that you could approximate an expert crowd with hybrid machine/novice human systems.

Sally Ahn

The TurKit paper introduces the iterative work paradigm for crowdsourcing tasks and a toolkit for implementing it on Mechanical Turk. The authors make the observation that, prior to their contribution, most tasks on Mechanical Turk were independent tasks, despite the significant improvements in quality that can be attained through iteration. The authors discuss the reasoning behind their carefully designed iterative tasks, such as requiring matching votes before approving an improvement task, to minimize cost. We've seen other applications of this approach in previous papers like Soylent, and I think all of these examples show that iterating is a great way to achieve higher quality work on Mechanical Turk. As the authors note, the first task often has a great impact on the subsequent improvement tasks, and moreover, much of the hard work is often completed by the first worker. I think this suggests that the initial task should be considered separately from the following improvement tasks. I would be interested to see if fine-tuning this separation (e.g. increasing pay for the first task, since it creates strong influence and/or involves more work) yields better or quicker results. I've actually worked with the TurKit toolkit for the last assignment, and I found the crash-and-rerun functionality, as well as the other Mechanical Turk utility functions, very practical and simple to use.
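
The iterate-then-vote pattern described above can be sketched in a few lines. This is an illustrative Python sketch, not TurKit code (which is JavaScript); the worker and voter functions are stand-ins for real improvement and comparison HITs.

```python
def vote_keep(old, new, voters, needed=2):
    """Keep `new` only if at least `needed` voters agree it beats `old`."""
    approvals = sum(1 for v in voters if v(old, new))
    return approvals >= needed

def iterative_task(seed, workers, voters):
    best = seed
    for worker in workers:
        candidate = worker(best)              # one improvement HIT
        if vote_keep(best, candidate, voters):  # matching comparison votes
            best = candidate
    return best

# Toy stand-ins: workers append detail; voters prefer longer descriptions.
workers = [lambda t: t + " on a beach", lambda t: t + " at sunset"]
voters = [lambda old, new: len(new) > len(old)] * 3
result = iterative_task("A dog", workers, voters)
```

The matching-votes gate is what keeps a bad "improvement" from overwriting good earlier work, at the cost of a few extra cheap voting HITs per round.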

The Generalized Task Markets paper addresses the problem of breaking down general tasks into a combination of human and machine tasks. They state the problem as a "coalition problem" in which groups of agents (human workers or computing machines, each with their own abilities/skills) are assigned low-level tasks whose completion enables a high-level task while maximizing quality. I liked that the authors provide a framework for theoretical analysis of this problem, but I think more evidence for the robustness of this framework on general tasks is needed. Despite the paper's title, the authors provide just one example task, translation, and how an arbitrary task would be broken down into appropriate subtasks is not clear. Although the authors touch upon this in the last section, I think there should be a deeper discussion of this challenge in the context of this problem.

Wesley Willett

TurKit's crash-and-rerun programming model is an interesting one, and I can see the appeal of having a system that is robust to crashes over the course of a Turk run. It seems to me that this paradigm is fine when authoring applications that collect data and perform computations on Turk, but that this strategy inherently limits TurKit's potential in any sort of interactive application.

I also imagine that debugging could get really ugly, however, especially if changes are made to an application that has already been executed. The authors suggest doing this to retroactively debug programs (although not necessarily to alter their function). When looking at a value of a variable stored in the database, it seems difficult to tell when or under what program conditions it was saved. Multiple results from the same Turk tasks could have had dramatically different prompts, interface conditions, times of day, etc. and it doesn't seem like there's any way to see what particular version of the program was running when a bad result was generated. Then again, maybe editing the script between reruns is just bad programming practice.

I liked their use of case studies to demonstrate the utility of the system. There's nothing like four or five (likely) publishable examples to help make the case that your tool is useful.

The Generalized Task Markets paper frames the process of dividing work amongst workers as a resource allocation task and describes planning strategies for carrying out this division of labor. I think this would have been more convincing had they presented multiple example scenarios in which this kind of abstraction would work - the case they implemented seems like a fairly straightforward one and it's not clear how this will map to other problem domains. In general, their prototype 'Lingua Mechanica' application seemed somewhat divorced from the rest of the contributions of the paper. The Popfly-style authoring environment and the set of five different translation games they describe don't seem to have much relationship to the concepts at the core of this paper.

Nicholas Kong

TurKit The authors present a toolkit for incorporating human computation (in this case, MTurk tasks) into algorithms via JavaScript. They do this via a crash-and-rerun model, where the program crashes while waiting for MTurk tasks to complete, and re-runs itself periodically to check for task completion. The results of previous HITs are cached so as to not be re-run. This is a nice example of how to encapsulate human computation as a set of function calls, so as to maintain the programmer's mental model. The crash-and-rerun model seems well suited for small, well-defined tasks focused on Mechanical Turk, but as the authors point out, there is an issue with scalability. This toolkit is a great way to quickly post complicated MTurk tasks, but to place it in a larger program where human computation isn't the focus might be problematic. That being said, the toolkit design, GUI, and case studies are convincing and illustrate the benefit of this work well.
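
The caching-and-crashing loop just described can be illustrated with a small Python sketch. All names here (`Trace`, `once`, `Crash`, `post_hit`) are hypothetical stand-ins, not TurKit's actual JavaScript API: completed side effects are recorded in a persistent trace, reruns replay the trace instead of re-executing, and the script deliberately "crashes" while a HIT is still pending.

```python
class Crash(Exception):
    """Raised to abort the script; the whole script is simply rerun later."""

class Trace:
    def __init__(self):
        self.db = []     # persistent record of completed side effects
        self.pos = 0     # replay cursor for the current run

    def begin_run(self):
        self.pos = 0     # each rerun replays the trace from the start

    def once(self, action):
        """Execute `action` at most once across all reruns."""
        if self.pos < len(self.db):
            result = self.db[self.pos]   # already recorded: replay it
        else:
            result = action()            # new work: execute and record
            self.db.append(result)
        self.pos += 1
        return result

posts = []
def post_hit():
    posts.append("posted")   # stand-in for posting a real MTurk HIT
    return "hit-42"

hit_results = {}             # stand-in for MTurk's server-side state

def run_script(trace):
    trace.begin_run()
    hit_id = trace.once(post_hit)        # recorded once, replayed on reruns
    answer = hit_results.get(hit_id)
    if answer is None:
        raise Crash("HIT pending; rerun later")
    return answer

trace = Trace()
try:
    run_script(trace)                    # first run: posts the HIT, then crashes
except Crash:
    pass
hit_results["hit-42"] = "transcribed text"   # a worker completes the HIT
answer = run_script(trace)               # rerun: replays the post, finds the result
```

The key property is visible in the rerun: `post_hit` executes only once no matter how many times the script restarts, which is what makes blindly rerunning the whole program safe.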

Generalized task markets The authors of this paper define "generalized task markets" and analyze the efficiency of such markets. The problem they analyze is the following: given a task and a set of computation units, either machine or human, split the task into subtasks and optimally assign them to the resources given the resources' abilities and costs. They then analyze their models using a prototype language translation tool, Lingua Mechanica. While the problem they posed and analyzed is interesting from a theoretical standpoint, it would seem to me to be very difficult to estimate the costs in practice. Getting accurate metrics of "ability" and "cost" (time or money) would seem near impossible in certain task markets, such as MTurk, although they do say these weights could be learned over time. Also, their prototype seemed rather divorced from their analysis. The experiments they reported mostly dealt with ways to assess participant ability, rather than task assignment.

Beth Trushkowsky

TurKit is a toolkit to more easily program iterative tasks using Mechanical Turk, with HITs automatically posted as part of the process; the paper describes examples of various iterative tasks. One of the big takeaways from this paper (besides the amount of effort people will put forth for very little money) is the effectiveness of iterative improvement and worker-evaluated improvement, particularly for subjective tasks like composing image descriptions. As a proof of concept, they show how they used TurKit to implement quicksort for various items. This raises the question again of the most appropriate uses of human workers and identifying when they are better than machines; I certainly wouldn't use TurKit to sort numbers. I think it's just a question of understanding when a particular sort algorithm's comparator would benefit from human computation. I also wondered if subjective sorting (like t-shirt ranking) will yield different results depending on the "algorithm" used or when it was run. The TurKit trace API was interesting as well; it highlighted some of the issues and workarounds involved in treating people like compute cycles.
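
The point about the comparator is easy to make concrete: quicksort only needs a pairwise "which comes first?" oracle, so the `prefer` function below could just as well be a HIT. This is an illustrative Python sketch, not TurKit code.

```python
def quicksort(items, prefer):
    """Sort so that prefer(a, b) is True when a should come before b."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    before = [x for x in rest if prefer(x, pivot)]      # each call = one comparison
    after = [x for x in rest if not prefer(x, pivot)]
    return quicksort(before, prefer) + [pivot] + quicksort(after, prefer)

# Numeric comparator for illustration; for t-shirt ranking it would instead
# post a "which design do you like better?" HIT and wait for the answer.
ordered = quicksort([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a < b)
```

With a subjective human comparator the oracle is noisy and not necessarily transitive, which is exactly why the pivot choices and run timing could change the final ordering.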

"Generalized Task Markets" describes a hybrid system for employing both machine and human effort: a given problem is divided into subproblems which are then assigned to appropriate workers (human or otherwise). The authors specifically apply their technique to machine translation tasks, although they assert that it is generalizable (it would have been nice to see another example application). A few points of the paper were confusing, for example how to determine a coalition's composite abilities when identifying a coalition with maximum utility for a particular task, or how potential translation scenarios (DAGs) are automatically enumerated. In general, I appreciated the paper's attempt to algorithmically determine an optimal scenario for a particular high-level task, but I wonder how meaningful that is when the decision is based on loosely defined constraints like "ability" vectors. The prototype section describes how the authors mapped the ability vectors to equivalence classes, and I wasn't clear on whether that affects the aforementioned algorithms (as well as their generality assertion). I expected the experiments section to have more examples of the cost tradeoffs, rather than just Microsoft employees. The section on games seemed a bit of a tangent, but reminded me of our latest class discussion on how to define what a game is (the "moon climb" game challenges a player to match a sentence to its best translation. I wrote a game like this once--it was a multiple choice test). To summarize, this paper is a good discussion starter for how to mix human and machine effort, particularly for choosing between different human-machine pipelines for accomplishing the same task.

Dave Rolnitzky

Generalized task markets for Human and Machine Computation

This article discusses the challenges and opportunities of developing "generalized task markets", with a focus on human-computer task markets that mesh human expertise and machine intelligence to solve problems. Specifically, the authors address combining humans and machines to complete translation tasks.

The most interesting aspect of this article for me was the brief exploration of game mechanics applied to human-computer translation. The authors briefly discussed which games were most engaging, but didn't spend time defining engagement, explaining why players enjoyed these challenges, or exploring why participants were motivated to join these games in the first place. And they really didn't tie it into the rest of the paper very well. This is an area of interest for me, and I would have liked them to expand a bit here.

I found it interesting that they used Microsoft employees as their testing sample, and I have a feeling this introduced some bias into their results (since this is an educated population with much different motivations than the general population). Overall the article touched on a number of areas, but I felt that they left quite a bit out. Perhaps that's the focus of their future research. Nothing stood out as particularly novel, though I think that the future research they outline (especially using motor skills to solve new classes of problems) is quite interesting and novel.


This paper introduces a programming model, called the "crash-and-rerun" model, which allows a program to be executed many times. This model makes TurKit--a toolkit for writing human computation algorithms--possible.

The paper mentions a number of interesting applications of TurKit for real-world use, using iterations of human computation to achieve certain tasks. This iterative process seems like a great way to improve the accuracy and efficiency of MTurk tasks. I believe some people in class already used this and had a decent experience with their assignments. There is a clear hole in the MTurk service, and the paper illustrates, through these examples and its description of the crash-and-rerun model, that TurKit provides some useful functionality to fill that hole.

Siamak Faridani

Both papers provide interesting research questions. The TurKit paper explains the framework that was developed for TurKit. Their main contribution is abstracting away the complexity of Mechanical Turk and providing a programmer-friendly interface to MTurk that developers can use to quickly send jobs to the MTurk marketplace and validate the results. The authors explain the crash-and-rerun model, which achieves fault tolerance by storing the history of states in a database. While very interesting, I am not sure it can be considered a major contribution of the paper. I personally found the whole TurKit script API more appealing than the crash-and-rerun model, although from other comments on the wiki it seems that the crash-and-rerun model has served people well. The other thing that was not very clear to me was how one can go back in the history of HITs and correct HIT results. For example, suppose I have received contaminated results (that actually passed the MTurk QA), but on manually examining my HIT results I realized that some are useless; I am wondering whether TurKit is capable of correcting those HIT results.

The Generalized Task Markets paper describes a non-symmetric market of workers with different levels of ability and defines a model to break the work down into smaller pieces and send them to individual work units in the market (either humans or machines). The task breakdown is done by a planner; after the results are received, they are evaluated, and this evaluation is used to update the planner's learning model. The Lingua Mechanica project is used as an example of how this works in practice.

Manas Mittal

Generalized Task Markets for Human and Machine Computation The GTM paper's main contribution is a polynomial-time algorithm for solving (non-optimally, but better than greedy) the task allocation problem in a Mechanical Turk-like market. A task is divided and shipped off to different coalitions of workers/computers. The task division is done by the task designer. The paper assumes that the capabilities of the task performers are modular and additive - an assumption that I think restricts the application space. I liked the paper in that it demonstrates how Mechanical Turk-like markets are such a rich area for potential research contributions. This paper, for example, combines aspects of theory, AI, and HCI. In terms of the evaluation, I would have liked to see some more discussion of the utility optimization, in particular whether there was any dominant factor (like time/speed/latency) that the users specified. The paper was quite interesting - it made me think of the tradeoffs between human and computer computation.
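
The additive-ability assumption can be illustrated with a brute-force toy. This is a hypothetical Python sketch with made-up agents and numbers, far simpler than the paper's planner: a coalition's ability is the elementwise sum of its members' ability vectors, and we pick the cheapest coalition whose summed abilities meet the task's requirement vector.

```python
from itertools import combinations

def coalition_ability(members):
    """Elementwise sum of member ability vectors (the additive assumption)."""
    dims = len(members[0][1])
    return [sum(m[1][d] for m in members) for d in range(dims)]

def cheapest_coalition(agents, requirement):
    """Exhaustively find the lowest-cost coalition meeting the requirement."""
    best = None
    for r in range(1, len(agents) + 1):
        for combo in combinations(agents, r):
            ability = coalition_ability(list(combo))
            if all(a >= req for a, req in zip(ability, requirement)):
                cost = sum(m[2] for m in combo)
                if best is None or cost < best[1]:
                    best = ([m[0] for m in combo], cost)
    return best

# Hypothetical agents: (name, [fluency, domain knowledge], cost)
agents = [("MT engine", [2, 1], 1),
          ("novice", [1, 1], 2),
          ("expert", [3, 3], 6)]
pick = cheapest_coalition(agents, requirement=[3, 2])
```

In this toy the machine plus the novice meet the requirement more cheaply than the lone expert, which echoes the paper's hypothesis about hybrid machine/novice pipelines; it also shows why the assumption is restrictive, since summing vectors ignores any interaction between coalition members.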


The key contribution in this paper is the "crash and rerun" programming model, i.e., selective caching of expensive tasks and re-execution of inexpensive tasks. I think this is a very useful model, and its use goes beyond crowdsourced systems to generic expensive/inexpensive tasks. In real life, I have used the once() method and have found it to be useful, but occasionally buggy. For example, the authors suggest clearing the cache when the input params change. As a programmer, it is easy to forget that and hard to debug. It would be good to have some form of static checks when keying off the once() method. As a researcher, I also liked that the authors shared the implementation details - I can use/build upon those.
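
One way to reduce that stale-cache pitfall would be to key the memo on the call's inputs rather than on position in the trace, so a changed parameter misses the cache automatically. The sketch below is a hypothetical Python illustration of that idea, not how TurKit's once() actually works (TurKit replays by position in the recorded trace).

```python
import json

class InputKeyedOnce:
    def __init__(self):
        self.memo = {}   # would be persisted to disk in a real system

    def once(self, fn, *args):
        """Run fn(*args) at most once per distinct (function, inputs) pair."""
        key = (fn.__name__, json.dumps(args, sort_keys=True))
        if key not in self.memo:
            self.memo[key] = fn(*args)
        return self.memo[key]

runs = []
def expensive(prompt):
    runs.append(prompt)              # stand-in for posting a costly HIT
    return f"result for {prompt!r}"

t = InputKeyedOnce()
a = t.once(expensive, "describe image 1")   # executes
b = t.once(expensive, "describe image 1")   # cache hit, no re-execution
c = t.once(expensive, "describe image 2")   # changed input -> executes again
```

The trade-off is that input keying changes the semantics: two deliberate calls with identical inputs would collapse into one, which trace-position replay handles correctly.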

Yaron Singer

Generalized Task Markets for Human and Machine Computation: This paper provides a framework for studying a broad class of crowdsourcing problems. In their work, the authors model crowdsourcing tasks in a manner that allows theoretical analysis of their computational constraints. First, the authors show that several classes of tasks are computationally hard, though they provide reasonable conditions under which approximations can be obtained. To complement their theoretical analysis, the authors evaluate the different approximation approaches using various crowdsourcing techniques.

This work seems to provide a rigorous model that captures the main tradeoffs that are inherent in planning crowdsourcing tasks, and provides a common language to think and talk about crowdsourcing in a more general sense. The next step would be to explore the incentive issues under this framework and understand their limitations.

TurKit: This paper presents an API and programming model for programming tasks on MTurk. The authors show how their crash-and-rerun framework can be applied to various crowdsourcing tasks with low overhead. It seems like the crash-and-rerun model is designed for tasks that have interdependencies. It would be interesting to explore how this model compares to simpler solutions.

Philipp Gutheim


TurKit presents a toolkit for prototyping and exploring human computation in MTurk in an algorithmic manner. Its contribution is a "crash and rerun" model that enables developers to integrate HIT postings in their code without running the risk of reposting HITs after a reboot or unintentionally posting HITs when testing or fixing code. In addition, they provide an API and a GUI for the model. I think the model is a nice idea to take care of expensive "MTurk calls". However, I'd be particularly interested in the question of how this model can be applied during development time. I believe the greatest share of unintentional or multiple MTurk API calls happen during development time. Maybe the MTurk sandbox model could be integrated into it and random or predefined responses could be returned.

Generalized Task Markets

This paper proposes an abstraction method that generalizes tasks into markets consisting of humans and computer algorithms capable of solving problems. The paper focuses on the learning, reasoning, and optimization challenges that occur when abstracting between a set of tasks/problems and their solutions (produced by humans or computers). Even for a single-scenario task with a single subtask, making the assignment decision is NP-hard, requiring a polynomial-time approximation. Isn't that a huge hammer for a problem that does not really exist to the extent they illustrate it? It is not really the case that there are thousands of tasks, constantly changing, with developers in unison searching for an automated solution. I believe developers know pretty well at which point in time humans need to be involved in the process. In that sense, bigger challenges, such as the uncertainty of task completion, are left unanswered.

Chulki Lee

In the first paper, TurKit is based on the crash-and-rerun programming model. It is very interesting that this model could be applied to other, similar situations with many long-latency, low-cost, and ambiguous subtasks. For example, some sensor networks may be such a case.

The second paper suggests generalized task markets and the problems within them. I'm curious about how the interaction between the planning and (de)composition stages can be formulated. In the paper, they are quite distinct from each other; for example, the planning component does not adjust scenarios itself.

Prayag Narula


TurKit presents an engineering framework that can be used to prototype applications that use MTurk. It was a very enjoyable paper to read. The crash-and-rerun model is an interesting take on modeling MTurk tasks. Though it seems more like a hack, from the text it seems to work fairly well, at least for prototyping. The authors acknowledge that it sacrifices scalability for programmer usability, which is important especially when it comes to Mechanical Turk, considering that the native MTurk APIs are just horrendous to use. I haven't used TurKit's API, but I hope it's good.

Generalized Task Markets The Generalized Task Markets paper provides a theoretical framework for the main trade-offs of crowdsourcing. Their polynomial-time algorithm for task distribution is a very interesting approach to solving a hard problem.

Kurtis Heimerl

TurKit TurKit has a number of things going for it. Firstly, that's a fucking fantastic name. Really top notch. Secondly, I love how practical this problem is. I'm unsure of the research value of this toolkit; they certainly seemed to be reaching by calling checkpointing "crash-and-rerun", but it's clearly useful. There's a lot that can be done using the system that is quite impractical without it. The questions they were trying to resolve were top-rate HCI problems, even if the toolkit isn't. It's hard to hate on something so damned useful, though. In my head, it almost read like a story: you can see the foibles they discovered and surmounted, perhaps suboptimally.

Generalized Task Markets Oh man, this is super cute. I love it. I did some work as an undergrad in markets, and I think there's a lot of potential value there. There's a key shift that happens in some of the work, and happened here. It's a fine assumption to believe that everyone has a utility function, they likely do. You should NEVER assume you can know it. It's complex, multi-variate, and depends on shit like the bus driver's watch. Any model that isn't a user expressing their functions, but instead guessing and using that guess to make predictions, is inherently too simplistic.

The goal is awesome though, and this is fine theory work. Modeling and mixing humans and users in this dynamic, rational manner is a great ideal. If only reality was as easily modeled as maths.