Web-Scale Interaction II: Crowdsourcing

From CS260Wiki

Bjoern's Slides


Extra Materials

To learn more about collaborative filtering and recommender systems, I suggest:
[http://oreilly.com/catalog/9780596529321 Toby Segaran. Programming Collective Intelligence. O'Reilly, 2007.]

A survey of Mechanical Turk demographics:
J. Ross, L. Irani, M.S. Silberman, A. Zaldivar, B. Tomlinson. 2010. Who are the Crowdworkers? Shifting Demographics in Mechanical Turk. In Extended Abstracts of CHI 2010 (alt.chi), Apr. 10-15, 2010. Atlanta, GA.

Discussant's Slides and Materials



Survey on Distributed Human Computation

A Taxonomy of Distributed Human Computation

Who uses Mechanical Turk?

The New Demographics of Mechanical Turk

How big is the MTurk market? Where can I get datasets to analyze?

(The short answer: $10K of tasks per day.)

Panos Ipeirotis's MTurk Dataset

Other Resources

Luis von Ahn's comprehensive talk on Human Computation at Google

A package for iterative work with Mechanical Turk: TurKit by Greg Little et al. at MIT

NSF's Jeannette Wing's Five Deep Questions in Computing, including formalizing a theory of human computation

An excellent list of recent papers on Mechanical Turk and crowdsourcing from Deepak Ganesan at UMass Amherst

Proceedings of HCOMP, the workshop on human computation

Reading Responses

Aditi Muralidharan - 9/26/2010 14:15:28

At first, I was skeptical of the idea of games that actually serve a useful purpose behind the scenes, having played the ESP game and Peek-a-boom. Then I spent one hour playing Verbosity like an addict (when I was supposed to be reading the second paper) and changed my mind. It was so much fun, and as a researcher in natural language processing, I realized that the data could be put to interesting use constructing those ontologies and folksonomies that are all the rage these days.

The "design principles" they describe seem sound, but vague and hard to execute for any particular instance: they seem to have been able to get it right for Verbosity, but Peek-a-boom is a disaster because it feels too difficult. They seem to have ignored at least two factors in their describer-matcher setup: perceived level of guessability and ambiguity, and the skill level of the players themselves. Nevertheless, I agree with the entire premise, and with their other design principles (probabilistic answer correctness guarantee), especially now that semi-supervised/adaptive machine learning techniques are starting to become faster.

I've previously enjoyed learning about David Karger/Rob Miller's work on sloppy programming interfaces, so it was nice to find the second paper, on Soylent, a fun read. The most interesting part of the article for me was the tiny paragraph in which they sidestepped the issue of ownership and authorship, which I feel would be a larger concern to any potential users of Soylent than they make it out to be. I would have liked to see some users' opinions on those issues.

Another point I have is that the visualizations for Shortn are very nice - so nice, in fact, that I wonder whether this particular MTurk task would really be necessary if MS Word had a "shortening" mode focused solely on indicating "filler words" (easily detected using standard authorship statistics) as a starting point, and let the author start from there. It might be an interesting benchmark to compare results/user satisfaction against.

Although it was beyond the scope of this proof-of-concept-like paper, I would have liked to see some more discussion on particular UI challenges they faced for these word proces

Matthew Chan - 9/26/2010 14:39:43

Soylent: A Word Processor with a Crowd Inside

In Soylent, we learn about a new architectural and interaction pattern for crowdsourcing and for integrating crowds into user interfaces. For word processing in particular, Soylent uses Amazon's Mechanical Turk platform to improve written documents, enabling users to shorten a paragraph or two or to correct grammatical errors. However, unlike Microsoft Word's "Fragmentation Error: Consider Revising," Soylent uses a Find-Fix-Verify architecture that includes suggestions/alternatives for improvement. The Find-Fix-Verify pattern was the most interesting aspect of the paper. Having identified the Lazy Turker and the Eager Beaver, FFV splits the task such that one group of Turks must first find any potential patches of errors, the next group must fix the errors, and a third group (for the sake of quality control) must vote on the quality of the suggestions. This paper is fairly important because it ventures into the realm of using crowdsourcing to pick up the slack where AI algorithms fail. The results were pretty impressive, especially considering that we know Turks/humans were the ones who contributed suggestions/fixes. My only concern is about the two populations of Turks; from a linguistic point of view, English from South Africa, England, Hong Kong, and the USA has lots of subtle differences. Since the workers are mostly in India and the USA, having the USA Turks proofread a paper written by an American might be favored, but this is all speculation. The contributions were straight-up technical, supported by user studies in which students and secretaries used The Human Macro. This paper relates to today's technologies in many ways because society as a whole writes so much when creating articles, entries, books, etc. Simplifying the task of correcting errors would be tremendous, and time would not be lost trying to fix our own work over a long period. However, this paper does not relate to my area of work.
As for blind spots, I don't see anything wrong. Perhaps Soylent can venture into Keynote and PowerPoint as well.

Designing Games With a Purpose

In this article, we're presented with a spectacular and novel idea of fusing games/entertainment with crowdsourcing to make AI algorithms more accurate. What I liked most about this feat is the focus on games, fun, and enjoyment, and harnessing that to train machine learning algorithms for tasks such as identifying images. Because of these novel ideas, this paper seems very important, since the authors have found a way to tap human resources for free via entertainment. The interfaces of games such as Verbosity and the ESP Game were pretty neat too, and the way the games are played is appealing. The results/techniques/methodologies weren't covered much. The article focuses on the success of the three general game templates it describes, so we don't know what kinds of evolution the games underwent. Furthermore, the authors have given the games lots of thought, such as using randomization to ensure that users don't taint the IDs of images with the letter "a." The paper relates very much to today's technologies because of the vast number of hours we spend on games, e.g. Facebook's Farmville or Mafia Wars. It's very true that humans seek to be entertained and are not necessarily motivated by money, as Amazon Turks are. This paper relates to my field of work, especially game design. We don't do anything involving AI, but the game design aspect is very interesting and gives ideas for multiplayer and single-player modes (e.g. pre-recording a game and having the odd one out play against that). Blind spots in the paper could involve social networks. Sure, it might be a great and entertaining game, but sometimes it's also fun because players/friends can collude and play together. The security risk is understandable, but if it were not a problem, combining social networks with social games could increase the number of games played and thus the training data for AI algorithms!

Airi Lampinen - 9/26/2010 16:03:07

Von Ahn and Dabbish outline in their article "Designing games with a purpose" a new class of games, games with a purpose (GWAPs). As people play GWAPs, they perform, as a side effect, tasks that computers are currently unable to perform, such as tagging pictures. The authors present GWAPs as a way to profit from the fact that people enjoy playing games - thanks to games of this type, tedious tasks get done while the "workers" have fun.

Put simply, the idea is to turn computational problems into GWAPs. Games have to be designed in a way that encourages computation and is likely to lead to a correct output. The authors explore examples of successful GWAP game structures. These include output-agreement games, inversion-problem games, and input-agreement games.

The article also includes a discussion of ways to increase player enjoyment. According to the authors, simple features such as different types of high score lists can be very effective in motivating participation. The continuous provision of sufficient challenges does indeed make sense from a psychological point of view, as suitable challenges and direct feedback have been identified as major factors that facilitate the achievement of flow.

Finally, the authors discuss ways to prevent players from gaming the system. This is a valid concern, even if I believe it is less crucial for entertaining games (which are likely to be played out of internal motivation) than it would be if the authors were discussing instances where money or other tangible benefits were on the line (and hence players would be more likely to be in it for external reasons).

The second article, Bernstein et al.'s "Soylent: A Word Processor with a Crowd Inside," discusses the possibility of integrating crowdsourced human contributions directly into user interfaces, in this case the word processing software Word. Here, too, the goal is to get human contributions to problems that are not easily solved by AI. While automatic text editing tools point out typos and grammatical mistakes, they are not very effective in identifying higher-level problems in texts.

The authors present a word processing interface called Soylent that enables calls directly from the word processor to Mechanical Turk workers. This way, tasks such as shortening, proofreading, and otherwise editing parts of a document can be crowdsourced on demand. For a minimal cost, authors can save their own time for higher-level issues while letting Mechanical Turk workers take care of tedious tasks such as correcting a list of references. However, for the system to be effective, it is extremely important that the input from crowdsourcing fixes what was requested without breaking something else in the process. While the authors address this issue, it seems that ensuring adequate quality of input for something as delicate as text can be very difficult.

The paper contributes a crowd programming pattern called Find-Fix-Verify. The pattern is designed to control costs and ensure correctness by splitting a task into smaller phases, thereby diminishing the power an individual Mechanical Turk worker has over a piece of text that is being edited. This is deemed necessary to avoid problems caused by both too-lazy and too-eager workers. When discussing collective action, this is an often neglected point of view - many studies are overly worried about freeriding and lazy gaming without looking at the other end of the spectrum: the problems that "too" motivated participants can cause on the level of outcomes and systemic sustainability.
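The three-phase split described above can be sketched in code. This is only an illustration under stated assumptions, not Soylent's implementation: crowd workers are simulated by plain Python callables, flagged spans are assumed not to overlap, and the names `find_fix_verify` and `min_agreement` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the text


def find_fix_verify(
    text: str,
    finders: List[Callable[[str], List[Span]]],
    fixers: List[Callable[[str], str]],
    verifiers: List[Callable[[str, List[str]], str]],
    min_agreement: int = 2,
) -> str:
    """Toy Find-Fix-Verify: each phase is handled by a separate group of workers."""
    # Find: keep only spans flagged independently by enough workers, so that
    # neither a lazy worker (flags nothing) nor an eager one (flags too much)
    # decides alone what gets edited.
    votes = Counter(span for finder in finders for span in set(finder(text)))
    patches = sorted(span for span, n in votes.items() if n >= min_agreement)

    # Apply patches right to left so earlier offsets stay valid (assumes no overlap).
    for start, end in reversed(patches):
        original = text[start:end]
        # Fix: a second group proposes rewrites of just this patch.
        candidates = sorted({fix(original) for fix in fixers})
        # Verify: a third group votes; the most popular candidate wins.
        ballots = Counter(verify(original, candidates) for verify in verifiers)
        winner, _ = ballots.most_common(1)[0]
        text = text[:start] + winner + text[end:]
    return text


# Simulated workers (all hypothetical): one lazy finder, one eager finder.
finders = [lambda t: [(5, 10)], lambda t: [(5, 10), (0, 4)], lambda t: []]
fixers = [lambda s: "is", lambda s: s, lambda s: "was"]
verifiers = [lambda orig, cands: min(cands, key=len)] * 3

print(find_fix_verify("This is is a test.", finders, fixers, verifiers))
# → This is a test.
```

In this toy run the eager finder's extra span gets only one vote and is dropped, and the fixer who returned the text unchanged is outvoted in the Verify phase.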

Thejo Kote - 9/26/2010 16:22:23

Designing games with a purpose:

In this paper Von Ahn and Dabbish describe their work in building what they call "games with a purpose". Since many tasks are easy for humans to perform but hard for AI systems, they propose the use of human effort in the form of playing games. By designing games that are enjoyable, they show that it is possible to solve hard AI and machine learning problems.

They focus on games which make use of large networked systems like the internet. They provide three templates for game design (output-agreement games, inversion-problem games, and input-agreement games) and examples of games that they have designed in each case. Player enjoyment is an important factor in any game, and Von Ahn and Dabbish suggest a number of techniques based on traditional game mechanics like timed response, skill levels and score keeping. They also address challenges in maintaining output accuracy and techniques to avoid collusion and suspicious behaviour by players.

This has turned out to be a successful technique for the authors and has resulted in some very impressive projects like reCAPTCHA. Of course, the technique is limited to a certain class of problems and I wish they had discussed the limitations in more depth.

Krishna - 9/26/2010 16:30:20

Games with a Purpose

Humans are significantly better than computers at solving certain tasks, for example finding spatial and temporal patterns. Also, the success of many important AI applications like natural language understanding, common sense reasoning, etc. depends on the availability of large amounts of tagged or labeled data. An interesting research question is how to design tools and processes that incentivize and motivate human involvement in these tasks. The paper is about Games with a Purpose (GWP), computer games whose outputs, or in general any interaction information, can be used towards solving computational problems. It is well known that humans spend a considerable amount of time playing computer games. The motivation thus is to come up with gaming solutions that can implicitly make use of this time towards useful purposes while still providing the needed entertainment.

Though the idea seems to be simple enough, designing such games is not trivial and needs considerable thought. The authors provide a general framework that helps in designing and evaluating such games. The framework aims to transform a computational problem, tagging objects in an image for example, into a game with rules that not only encourage user involvement but also promote correctness of the output. The authors suggest three game templates: output-agreement games, inversion-problem games, and input-agreement games. These templates have been inferred from successful GWPs and provide us with a general, simple set of rules and strategies. It could be argued that these strategies tend to produce games that require users to use their intellectual and problem solving skills more explicitly than, say, racing games or first-person combat games. The latter types of games tend to be more entertaining.

The authors suggest standard game design strategies like scores, leader boards, randomness and timed response that may help in increasing player enjoyment. Towards ensuring output correctness, the authors suggest strategies like randomly matching players - this strategy reduces the probability of two players sabotaging the game's purpose. They also suggest using taboo lists - such lists can be used to control the vocabulary used by the user in a labeling process. The authors suggest three simple metrics for evaluating such games. The throughput measures the average number of problem instances solved per human hour. To measure the enjoyment level offered by the game, they suggest measuring the total amount of time spent by each user in the game and then computing the average (the average lifetime play, ALP). They also suggest combining these two measures. I am not sure how these measures should be interpreted when there are only a few players.

My criticism of the paper is that users playing GWPs are not completely unaware that they are participating in a larger purposeful cause. I believe this may affect the overall entertainment value and motivation. Once humans are aware of an implicit social cause, their inclination to participate is more a direct function of their willingness to contribute than of a desire to be entertained. There is also the issue of creating a critical mass of players for the game to be successful.

I was unable to access the other paper; I couldn't find it online either (obviously). I did read a short report on it on the MIT UROP website. It is an interesting idea. An immediate thought that occurred to me is that most jobs submitted to MTurk require limited cognitive and intellectual resources, so I am not sure how tasks like proofreading and editing can be outsourced this way.
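As a rough illustration of the three metrics, here is a minimal sketch. It assumes the combined measure is the product of throughput and ALP (what the paper calls expected contribution); the function names and all numbers below are hypothetical, not figures from the paper.

```python
from typing import List


def throughput(problems_solved: int, human_hours: float) -> float:
    """Average number of problem instances solved per human-hour of play."""
    return problems_solved / human_hours


def average_lifetime_play(hours_per_player: List[float]) -> float:
    """ALP: total time each player has spent in the game, averaged over players."""
    return sum(hours_per_player) / len(hours_per_player)


def expected_contribution(tp: float, alp: float) -> float:
    """Problem instances a single player is expected to solve over their lifetime."""
    return tp * alp


# Hypothetical numbers for illustration only.
tp = throughput(problems_solved=1200, human_hours=6.0)   # 200.0 per human-hour
alp = average_lifetime_play([0.5, 1.0, 3.0])             # 1.5 hours
print(expected_contribution(tp, alp))                     # → 300.0
```

With only three players, as here, a single heavy player (3.0 hours) dominates the average, which is one concrete form of the small-sample concern raised above.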

Shaon Barman - 9/26/2010 16:39:52

Designing Games with a Purpose

The article discusses how to use games to get a large crowd of people to perform a desired task. The key insight is that instead of asking a person to directly perform the task, the task is hidden in a goal of the game.

I liked the breakdown of tasks into the three different types of games. By formalizing the notions of the games, they study the merits and pitfalls of each type. In the increasing player enjoyment section, the suggested techniques seem like they would work for most games and are not particularly aimed at GWAPs. Most of these techniques are aimed at bringing out the competitive side of people. The section on output accuracy was informative and provided good ways to ensure quality results. I assumed this paper was written for a general audience and was therefore lacking numerical results.

I felt the authors missed some important points in GWAP evaluation. They evaluate a specific game by a combination of throughput and average lifetime play. But this does not take into account how popular the game is (and how social interactions affect throughput/lifetime play) or the accuracy of the information. They do discuss verifying the results of the GWAP against paid subjects, but there seem to be many variables which affect accuracy (such as player testing and repetition). Overall, GWAPs seem like a good way to get monotonous tasks done cheaply. By leveraging the competitive nature of people and the internet, GWAPs are able to do tasks that could not have been done before. It would be interesting to see if GWAPs could be generalized to other types of game play, such as FPS or strategy games. This would require more clever embeddings, but could lead to a great increase in use.

Soylent: A Word Processor with a Crowd Inside

The paper introduces the Find-Fix-Verify process which allows better results for tasks using Mechanical Turk.

The three programs introduced in this paper, at first, do not seem like good matches for Mechanical Turk. They require users to understand a large amount of data and provide insightful feedback. What the paper proposes is a process which decomposes the task. This decomposition creates better results.

The results showed a proof of concept (since the sample sizes were quite small) and showed that the technique could be used. At first, it would seem like decomposing the task of shortening a paragraph would be inefficient, since a user identifying a location to edit probably already has an idea of what the edit should be. It would be interesting to see how the Find-Fix-Verify technique compares to a Fix-Verify process in terms of cost and quality. That would show whether the decomposition of tasks is necessary for quality results.

My main criticism of this work is that the amounts paid for each job seem too high for the system to be widely deployed. Writing in a clear, concise manner is a task which requires a certain level of education. The authors say that paying less only slows down the process and does not affect quality. But would the educational background needed affect the quality? (Assuming more highly educated people would not work for so little.) The process also exposes whatever is being written to a large audience, and therefore would not work with confidential information.

Bryan Trinh - 9/26/2010 17:13:10

Soylent: A Word Processor With a Crowd Inside

In this paper the authors implement and discuss a word processor that integrates crowd sourcing and artificial intelligence. By creating a word processing system that is able to interface with other human beings to act as the proof readers, they are able to smartly create corrections that an AI system would not be able to produce.

This paper addresses one of the biggest problems in crowdsourced data, the reliability and quality of produced content, by implementing a design pattern dubbed Find-Fix-Verify. The simulation of the real world deployment verified the performance of the design pattern and gave them the ability to postulate on the completion time of the Mechanical Turk tasks once the service grows larger. They also included the monetary costs of these interactions on Mechanical Turk, which could lead to an analysis of the economics of crowdsourced systems.

This is the first I have seen of a program that integrates crowd sourced data directly into the user interface of creation. It is a very interesting paradigm of peer creation that I can see being used in a number of other creative endeavors. If these types of programs evolve, it'd be very interesting to look at the economics of this type of ecosystem of creation. Is it sustainable for the proof readers? Can proof readers create a career out of something like this, or is it just for personal amusement?

Domain-specific writing is a problem that is not fully explored in this paper. Is it possible to properly connect the author to proofreaders in the correct domain of expertise? A possible approach to this problem is to allow the author to tag the document, akin to tagging a blog post for SEO purposes. These tags are then used to connect the author to the relevant people on Mechanical Turk. Relevancy can be determined by, as was mentioned in the paper, past performance in a particular domain, or by a symmetric tagging of the proofreader's bio. It might be interesting to see whether or not such a system would increase the accuracy of corrections for domain-specific writing styles.

Just as a side note, I loved how the conclusion elegantly placed a lasting real life example of the system at work.

Designing Games with a Purpose

In this paper Luis von Ahn and Laura Dabbish explore the concept of using games to source information from human beings for the purpose of training AI. They establish design principles for further exploration of this concept of crowdsourcing. They approach the problem of validity by creating a checking system, in which a matching response from two independent players counts as a valid piece of data.
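The agreement check described above can be sketched as a small function. This is a simplified illustration, not the ESP Game's actual logic: it ignores timing and the order in which the two players type, and the name `agreed_label` and the taboo-word handling are assumptions made here.

```python
from typing import Iterable, List, Optional


def agreed_label(
    guesses_a: List[str],
    guesses_b: List[str],
    taboo: Iterable[str] = (),
) -> Optional[str]:
    """Return the first of player A's guesses that player B also made.

    Taboo words are never accepted, which pushes players past the most
    obvious labels. Returns None when the two players never agree.
    """
    def norm(word: str) -> str:
        return word.strip().lower()

    taboo_words = {norm(w) for w in taboo}
    seen_b = {norm(w) for w in guesses_b} - taboo_words
    for guess in guesses_a:
        guess = norm(guess)
        if guess not in taboo_words and guess in seen_b:
            return guess  # two independent players agreed: keep as a label
    return None


a = ["dog", "grass", "puppy"]
b = ["animal", "Puppy", "dog"]
print(agreed_label(a, b))                  # → dog
print(agreed_label(a, b, taboo=["dog"]))   # → puppy
```

Because a label is only recorded when both players produce it independently, a single careless or malicious player cannot inject data alone.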

Although this template provides a valuable method of leveraging human intellectual capabilities, the set of problems it can solve is very limited. This type of game works really well for questions that have an objectively correct answer, but for anything that has a number of different opinions, this model would produce a very small amount of data simply because there will be too many conflicts in the answers provided by players. This however is precisely what is needed for training data for an AI algorithm, objective data.

It would be a challenging, if not impossible, task to move this type of model out of the realm of menial tasks that don't take much creativity. But, for the purposes of training data for an AI algorithm, this is perfect.

Charlie Hsu - 9/26/2010 17:50:19

Designing Games With A Purpose

This paper described games with a purpose (GWAP), which are videogames in which user input doubles as a computation, classification, or some other sort of work that a computer is unable to perform. People will theoretically play these because they are games, and meant to be fun, but they will also inadvertently contribute by generating data for machine learning algorithms. The paper discusses three models of GWAP: output-agreement games, inversion problem games, and input-agreement games. The paper also discusses a few strategies to maximize player enjoyment and verify output accuracy.

I found that my personal experience with horrifyingly addictive online Flash games has given me a few suggestions that I would add to the section on increasing player enjoyment (or addiction, thereby increasing ALP and expected contribution!). One of the most addictive metagame elements that has taken off recently in videogaming is the concept of "achievements": publicly viewable badges that are earned by completing some task in the game, whether challenging, arbitrary, or unique. One example of this in the ESP Game might be an achievement awarded to both players if they successfully match 50 images within the time limit, or match 10 images in a row on their first guess. These relatively innocuous achievements inspire surprising devotion in completionist gamers; in a videogame I play, some servers have been set up for players to "farm" achievements by deliberately colluding with other like-minded players to set up achievement scenarios. Another important hook is a low barrier to entry: the ESP Game on www.gwap.com was right to offer a guest account option to enable immediate play, rather than forcing a curious user to register and log in first. Another addictive hook is to offer "skills" that make the task easier (the more you play, the easier it gets!) and to let the player decide which skills to acquire using points earned. "Skills" in the ESP Game might be to force a higher difficulty (more taboo words) in exchange for a higher score, or to extend the amount of time given.

One particularly striking thought that came to mind while reading this paper was the possibility of integrating these sorts of GWAP into already existing MMORPGs (massively multiplayer online role playing games). These MMORPGs often have vibrant economies and various ways of generating in-game currency; why not have one of these GWAP be one of those methods of earning in-game cash? The incentive of exploring a world explicitly designed solely for player enjoyment is a powerful tool to encourage user participation in the GWAP (One virtual world, for example, uses games designed by advertisers to award currency to players; the P (purpose) here is commercial rather than data collection).

Introducing the cooperative/competitive paradigm by pairing people together to play the game is a great way to mask the fact that these GWAP are really just brute classification work. The social dynamic created fuels the entertainment factor, and unlike Mechanical Turk where the incentive is economic and has a cost, the social dynamic is freely generated. Other ways of generating this sort of crowdsourced classification information include the methods Gmail uses with its spam filter and Facebook uses with its recommendation engine. Users are provided a useful service, and receive a better experience by sharing some information with the service (e.g. this email is spam, I don't actually know this person).

Soylent: A Word Processor with a Crowd Inside

This paper described Soylent, an enhancement built onto Microsoft Word that allows the user to call on crowdsourced Mechanical Turk workers to shorten, proofread, and perform repetitive "human-scriptable" tasks on documents. The three tasks are described, challenges in programming with crowd workers are explored, and the "Find-Fix-Verify" pattern of utilizing crowd workers is introduced as a solution. Soylent is then evaluated on its three tasks for monetary cost, time cost, and accuracy/quality of work.

I feel that one of the paper's strongest contributions to HCI is the way they address quality of crowd-sourced contributions. Although "Find-Fix-Verify" is designed for document composition work and fixing existing problems, the model of identifying a problem, fixing it, and verifying it, all in separate stages, modularizes the work and makes each task simpler and more efficient. Fixing identified problems is far easier to do by itself than searching for all problems to fix and fixing them. Fixing atomized problems also combats the "Lazy Turker" described in the paper. Users focused on identifying problems need not be distracted by attempting to fix them either, making their work more efficient as well. I feel that for any sort of crowdsourced "fixing" work, like fixing a document, fixing tags on files and data, fixing code for optimization, etc., the "Find-Fix-Verify" model is a great way to optimize crowd work for accuracy.

The discussion on cost efficiency is interesting, but I felt I had some unaddressed questions of my own afterwards. Mechanical Turk offers an interface for screening potential workers before they are deemed "qualified" to perform a task: what if a Soylent user could specify a higher price for a premium, highly qualified user? This might be a way to address the problem of certain users not having necessary domain knowledge, or rewarding those with general writing skills such as knowledge of style. Different cost levels also might help in prioritizing material for time efficiency: users could offer a higher rate for content feedback needed immediately, and a lower rate on less important items.

I also came up with some quick suggested solutions for problems mentioned with Soylent. Users could mark high-importance sections or sections with a specific style for use in Shortn, indicating to crowd workers that they should avoid changing those marked sections. This also applies to proofreading (see the example on "GUIs").

David Wong - 9/26/2010 18:02:47

1) The "Designing Games With a Purpose" paper discusses general formats for creating games with a purpose (GWAP). The paper also discusses other design elements that can enhance a GWAP as well as ways to evaluate the success of a GWAP.

2) The "Designing Games With a Purpose" paper analyzes a new trend of games that use humans to help solve tasks that are difficult for computers to solve. While creating GWAPs is beneficial for those trying to create a large dataset to train computer algorithms, the paper's only significant contribution to the HCI literature is the brief discussion of game design enhancements, such as using time intervals, a scoring system, and leaderboards. Those, however, are not novel ideas and have been studied in much more depth elsewhere. The paper merely acknowledges that humans spend a lot of time playing games and that researchers can leverage that time to generate datasets for training machine learning algorithms. At best, the discussion of making a game entertaining while obtaining a valuable result may be the most useful to HCI, but it is not discussed in that much detail.

3) The "Designing Games With a Purpose" paper addresses a specific instance of a well motivated problem: how do we effectively create and utilize peer production systems? The framework they provide is well supported empirically through various real-life systems. However, the framework they provide is quite elementary and has a lot of room for future research. It is, though, a good point to start from and a nice guideline for anyone who may want to design a GWAP.

Drew Fisher - 9/26/2010 18:03:45

Designing Games With A Purpose - Luis Von Ahn, Laura Dabbish

This paper discusses three models for creating games that aid in computer classification of information, some design suggestions for increasing user play time, and metrics for evaluating these "games with a purpose."

Von Ahn makes a case for newer metrics measuring the value of a game, coalescing them into "expected contribution." This metric (by design) ignores the social ramifications of gaming, which I found shortsighted, particularly in this day of massively multiplayer online games and games with social connections. I'd suggest a metric that also included some measure of how many people would be expected to play the game (although I recognize that such a model is likely impossible to produce). It seems to me that even if expected contribution is high, it can easily be outweighed in value by reaching an order of magnitude more users. I only wish all the Farmville players would put their efforts toward a game with a purpose, like one of these classifiers or fold.it.

Soylent: A Word Processor with a Crowd Inside - Bernstein et al.

This paper's main idea is making it easy, integrated, and fast for people preparing documents to get automated paid human review, corrections, and improvements, since AI currently is inadequate for the task.

Soylent provides three main tools, each powered by Mechanical Turk: Shortn, a tool helping authors get their point across in fewer words (particularly to comply with maximum length rules); Crowdproof, a tool providing proofreading suggestions; and The Human Macro, a tool enabling quick dispatch of common user-customized tasks.

Soylent employs techniques to produce correct results; this paper discusses in particular the Find-Fix-Verify pattern. This pattern helps compensate for particular personalities in the Mechanical Turk community, as well as promote better coverage and evaluation of result quality.

The primary drawbacks to this approach to document editing are that the service is not free, which means that it has to be explicitly requested by the user, and that the user has to wait some time between when he/she requests feedback on the paragraph and when edits arrive, possibly losing context in the process.

I find it interesting that when using Soylent, the author somewhat steps back and becomes a manager for the document, dispatching work and collecting contributions from the crowd. I haven't thought about document editing in this manner before, but I wonder how many other tasks can be handled well by this dispatch-response technique for what is in essence hierarchical collaborative work.

Anand Kulkarni - 9/26/2010 18:11:32

Designing Games with a Purpose

This paper presents a mechanism for recruiting human participants on the web to solve problems that are too difficult for autonomous algorithms to solve.

The core contribution of this paper is the idea that games can be used to recruit humans -- for free -- to solve problems via the web that are too difficult for autonomous algorithms. These problems include, notably, the problem of collecting tags for the purposes of image recognition and search. The paper introduces some methods for quality control to ensure that malicious or incorrect contributors are filtered out, by designing games carefully so that participants can only earn points for correct answers. This is a great approach, and it should be applicable to a wide range of problems. The only limitation is that the authors are concerned primarily with offline data sets and training sets for online algorithms; they don't consider the possibility that these games could be used to solve problems in real time, which could replace autonomous algorithms entirely in some settings.

The validation here is excellent; the authors have been able to tag a very large body of data on the web, and they have been careful to argue their strategies result in data that is free of errors. The authors suggest that there is no generic strategy to convert AI problems into games; however, minigames, for example, would be one such strategy.

Soylent

This paper presents a new word processing plugin using Mechanical Turk as a backend to carry out difficult tasks in a cost-effective manner.

The two core contributions here are, first, the idea that you can use Mechanical Turk in a "wizard of Turk" style to build interfaces with impressively strong AI capabilities, and second, a Find-Fix-Verify model for dividing work among Turkers so that no one person has excessive power to modify the document and so that work is carried out correctly. This technique is likely to have a major impact on a wide variety of software tools, including things like Photoshop and other offline applications. The application in Word is an interesting test case, but the methods described are much more general. The Human Macro and Shortn tools in particular seem like they could plausibly be incorporated into a future commercial version of Word.

The authors carried out experiments with each of the three tools they built and report the costs and times required for each. These experiments are convincing, but ultimately less important than whether the system actually works as described. The authors discuss several different categories of Turkers, some excessively eager and some inadequate, and provide strategies to control for each. The third form of validation offered was self-explanatory and outstanding: the conclusion is passed through the authors' Shortn tool, and we can see and verify its effects ourselves. My only suggestion is that the authors could have discussed more about how to reduce the time needed, as this is important for using such interfaces.

Richard Shin - 9/26/2010 18:18:40

Designing games with a purpose

This paper describes a class of games, called 'games with a purpose', designed to produce useful data as they are played. Many tasks that need to be performed en masse (e.g., labeling images with words) are (as yet) difficult for computers, but easy for humans. By structuring these tasks in such a way that people want to perform them, with the introduction of game-like elements such as competition and challenge, the collective brainpower of the masses can be harnessed effectively.

I thought that the approach described in this paper contributes to the field by introducing the notion of human psychology in order to get users to do things that they wouldn't have done otherwise. What we have studied previously in this class focused mostly on how to allow users of a computer-based system to most effectively convey their intentions to it, assuming that they want to in the first place. This approach removes that assumption and instead creates psychological rewards and other incentives to make users want to interact with the system to provide useful data. The games also have been carefully structured to incentivize users to provide accurate data; if users were rewarded simply for providing words to match images, they might be compelled to enter random words to achieve the rewards, rather than ensuring that their input is accurate or thoughtful.
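To make the incentive argument concrete, here is a minimal sketch of an output-agreement round, the mechanism behind image-labeling games of this kind. It is an illustration only, not the authors' implementation: the function names, the point value, and the sample guesses are all assumptions. Two players independently type labels for the same image, and points are awarded only when a label appears in both lists, so random input earns nothing.

```python
def output_agreement_round(guesses_a, guesses_b):
    """Return the first label typed by player A that player B also typed.

    Returns None when the two players never agree. Labels are compared
    case-insensitively after trimming whitespace.
    """
    seen_b = {g.strip().lower() for g in guesses_b}
    for guess in guesses_a:
        label = guess.strip().lower()
        if label in seen_b:
            return label
    return None


def score(match, points_per_match=100):
    """Award points only when the players agreed (point value is hypothetical)."""
    return points_per_match if match is not None else 0


# Thoughtful players describing the same image tend to converge on a label.
honest = output_agreement_round(["dog", "Puppy", "grass"], ["animal", "puppy"])

# A player entering random strings almost never matches an honest partner.
random_input = output_agreement_round(["asdf", "qwerty"], ["animal", "puppy"])

print(honest, score(honest))
print(random_input, score(random_input))
```

Because the partner is a stranger who sees only the shared image, the easiest way to earn points is to type what the image actually shows, which is the property the paragraph above describes.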

It seems hard to tell, however, how applicable this method can be to other similar tasks. Perhaps it would be better to generalize the notion to collecting data from activities that have meaningful, rather than artificial (e.g., scores, a top-10 listing, etc.), rewards, such as getting better movie recommendations by rating the movies that you have already watched (which also provides general data about the movie's quality). In general, I was a bit skeptical about how much people would want to play these games. Certainly, they would be more interesting than performing the non-game versions of the same tasks, but the games still seem mostly interesting as a novelty rather than as a truly enjoyable activity (at least in my experience having tried these types of games in the past). It seems relevant to compare how these games would perform compared to more traditional ones, and further evaluate how any attractive characteristics of traditional games can be transferred to GWAPs.

There might also be unconsidered ethical implications of structuring what are essentially menial tasks as games by exploiting human psychology. Games such as World of Warcraft have been criticized for being designed explicitly to be addictive, using similar principles as GWAPs. Perhaps, if GWAPs become more popular, they would become subject to similar debates.

Soylent: A Word Processor with a Crowd Inside

This paper describes an extension to Microsoft Word that allows for convenient paid crowdsourcing of common, yet repetitive or boring, text composition tasks, through Amazon's Mechanical Turk service. By using Mechanical Turk workers who each perform very small tasks on the text, the system enables human-powered summarization, proofreading, and arbitrary text manipulation tasks (the paper provided citation formatting or finding relevant pictures as examples) with relatively low latency (a few minutes, depending on how much the user is willing to pay) and high quality, as if a computer had been performing the same tasks.

The idea that these tasks can be offloaded to other people is not new. Instead, the fundamental value of this paper seemed to be a system for extracting meaningful, high-quality data from a large number of inexpensive contributions of varying quality. Hiring a trained proofreader would be much more expensive than this system while producing results that are probably not much better, and using Mechanical Turk naively would require a lot of overhead on the part of the user in evaluating the quality of the results returned. The 'find-fix-verify' model presented in the paper seems to have effectively addressed these problems, in essence by averaging contributions and having the workers evaluate the other workers' work.

However, it seems implausible to me that this system, as described, could garner wide adoption. Since payment is required, however small it may be, the system introduces psychological roadblocks for the user in deciding whether paying for each use is worthwhile, which might be avoidable with some kind of flat-fee subscription system. Also, the user would be paying for results of unknown quality. I think this service could become much more viable with a specialized crowdsourcing system, in place of Mechanical Turk, that better fits this system's characteristics, and I would have appreciated greater discussion of how the crowdsourcing system could be changed in order to meet the application's needs.

Brandon Liu - 9/26/2010 18:45:04

“Designing Games With A Purpose”

This was a very well thought out paper. It avoids any discussion of what exactly makes games enjoyable and instead focuses on how they can be used efficiently for researchers. It has a clear description of three models of Games with a Purpose: output agreement games, inversion problem games, and input-agreement games.

A point that I liked was how the authors incorporated the player’s desires into the game. They mentioned how in an inversion problem game, the describer wanted to see the guesser’s output, so they introduced social interaction in ‘hot and cold’.

One aspect of GWAP that I would have liked to see discussed more is the nature of the participants. What distinguishes long term GWAP players from players who come across the game for a short period due to a link from a popular blog? Another interesting question is how a user’s individual performance changes with regards to how long their session is. Also, ‘Griefer’ players may have different objectives than anyone else.

The article also compares game players to paid people. It wasn’t clear whether these were people paid to tag the data, or paid to play the game. A possibility is that the output from volunteer game players is actually higher than that of paid participants. A question that thus should be explored is whether people paid to play the game produce better results than both volunteer players and paid, but non-gaming participants.

“Soylent: A Word Processor with a Crowd Inside”

I was really excited to read this paper, hoping to see the line “Soylent is people”. I was satisfied after the first page. The rest of the paper is good too, though.

The authors describe Shortn, Crowdproof, and Human Macros, three features plugged into Word. Shortn uses paid workers to shorten sentences. Crowdproof fulfills the task of current spellcheckers/proofreading plugins, but uses humans. Human Macros are more ambitious in that they can be loosely structured tasks. In a sense, both Shortn and Crowdproof are subsets of the functionality that Human Macros deliver.

One point I liked in the paper was the analysis of Lazy Turkers and Eager Beavers. The observation that both groups behave as they do in order to signal the completion of the work was surprising. The paper spent some time discussing the average time it took to get a response back, but there could have been more discussion of the quality of the responses as a function of delay time. If Lazy Turkers and Eager Beavers produce poor results, it may be that the best results from the system come in the middle of the time delay.

Something I would have liked to see is a deeper explanation of the motivation model for both the writer and the Turkers. For example, a user of the Human Macro wanted to convert all his verbs to past tense. He then encountered a worker who missed converting one verb. It seems like errors like this are common, so the writer would need to verify for himself that all the verbs were changed. The amount of time saved becomes much smaller when the writer has to verify every single piece, especially since the writer of the paper is the one who risks rejection or embarrassment from mistakes in his paper, while the worker is only trying to make a few cents.

The main value of the paper was describing Find-Fix-Verify. This could apply to other domains such as photo manipulation. For example, removing an object from a picture using the Clone Brush in Photoshop could be crowdsourced, but the results would run through this process. This would be a better application since photo editing is time intensive (for example, images for a PR or a news firm). It would be doubly applicable since one would only have to send to Turk a piece of the image, and not risk any theft of intellectual property.

Linsey Hansen - 9/26/2010 18:54:40


In Soylent: A Word Processor with a Crowd Inside, the authors talk about the Soylent word processing interface, which is unique in that it allows direct human contribution to the user's task via crowdsourcing. With crowdsourcing, the user has the option to ask other people for help in editing and proofreading their article.

I see several blind spots with this process. The first is that this would probably cost way too much to be feasible. I remember that Amazon tried something similar with the Kindle, where it allowed users to ask questions and a team of real people would answer them. Amazon eventually dropped this feature because, even while the answerers were swamped with work, the cost proved to be too great. While Soylent's method does charge for its services, the Turkers will still need to be paid during downtime at first, so getting enough funding to move over the initial bump could prove to be difficult. Another problem is that the user will still need to wait, often for a significant amount of time. Though the article did say that the service will gradually become faster, the fact is that there will always be lag, and the user will not even know for sure how correct the results will be upon receiving their altered paper.

The concept of being able to ask real people to proofread or edit your article is really great, since machine learning is in no way close enough to understanding human language to the point where it can get the concept behind a paper and edit it accordingly. There are already peer editing websites, where users can post papers and others can go through and edit them (though this is normally in plain text without any fancy tools or visualizations), but this is definitely not professional, and can often involve a lot of spam. Plus there are few people who want to read over someone's 20+ page essay and grade it well. Perhaps if Soylent were to offer some sort of peer-to-peer web service, where volunteers can complete and vote for the tasks described, that would be better (assuming that the volunteers had some sort of reward/reputation system).

Designing Games with a purpose

In their article, von Ahn and Dabbish introduce the idea of creating “games with a purpose” (GWAPs), which are games that allow humans to perform tasks computers cannot currently complete. The data from these games can then be gathered and used in machine learning.

Similar to a previous article, the authors believe that normal people can be used to complete tasks that are normally simple yet tedious for professionals, but then von Ahn and Dabbish take it a bit farther by turning these tasks into engaging games. While similar research has been done, in areas such as making work more fun, raising money for charity, and using people to directly train computers, this method is unique in that it benefits both the research and the participants almost equally. The fact that it uses games and encourages correctness in the games is also great, because it discourages people from creating erroneous data on purpose.

After giving all of the games on gwap.com a few tries, I did notice some blind spots in this technique, or at least a potential problem that was not strongly addressed. During my experience, I noticed that a lot of the words/images are reused rather frequently. If the same words were reused every couple of hours or so, it wouldn't be a problem, but I only played each game about three times and encountered at least 4 duplicates (some in the same session) per game. The problem this created is that the people I played with tended to just spam keywords instead of using my actual clues or the video, which not only impairs the accuracy of the data but also makes the game less fun. Considering that there is probably TONS of data available to Google and whoever else creates GWAPs, I would imagine that having more varied data would be easy to do, and it should definitely be done. One other problem I see is the reward system: while points are fine for some people, I feel like giving people something to spend those points on would be even better (and these things do not even need to have any sort of monetary value).

Kenzan boo - 9/26/2010 18:56:24

Designing Games with a Purpose

The article provides examples of how we can harness people’s inherent desire to be entertained for something useful, like training artificial intelligence. The concept is an amazing one, if only we could build it into the common games that many people already play. This reminds me of puzzle games that I saw played as a child, like Neopets, where players (my little sister, for one) would solve small puzzles in a limited time in return for in-game points. Many other games fit this category of having people solve small puzzles. People try their hardest to solve them because the game mechanics make the game fun. Instead of having to be paid to do something, users are self-motivated to play the game for fun. This is a big win for both the user and the developer who needs to train an AI. One of the main challenges still seems to be how to integrate these AI problems into a fun and attractive game. One possibility is, with the help of game developers, to integrate these puzzles into games that are already fun. Games like Zynga’s FarmVille or Mafia Wars could integrate these puzzles in return for game points while keeping the game fun. E.g., the user could figure out which image is a tomato or which is a piggy, and compete with their friends to try to match piggy images.

Soylent: A word processor with a crowd inside.

The idea of crowdsourcing proofreading is great, but it has many problems when compared to an intelligent paid proofreader. The benefit is that it is much cheaper, but crowd input cannot quite compare to a good writer or proofreader. As shown in the paper, the crowd is fairly efficient at catching blatant mistakes but often confounds the meaning of the whole paragraph or misses certain critical elements. However, to the extent it works, it is far better than any AI program employed to do something like proofread or shorten a paragraph while keeping its meaning. For the price a user pays, it's much better and much faster than paying a reader to proofread. This allows typical writers who do not have access to employed proofreaders to have more eyes look over their work.

The premise of these two articles is that there is a surplus of human labor and time out there at a very, very cheap cost. These are two examples of how computer science can harness this vast resource and organize it into usable work. It's an amazing idea, and as more and more people put information onto a digital medium, we need to find ways of organizing that into something meaningful.

Thomas Schluchter - 9/26/2010 18:58:53


Arpad Kovacs - 9/26/2010 18:59:22

The Soylent paper describes a word processor that relies on crowdsourcing via Amazon Mechanical Turk to make texts more concise, proofread and correct spelling and grammatical mistakes, and automate mundane tasks using human rather than artificial intelligence. Soylent is intended to be an improvement over the built-in grammar and spell-checking functionality of Microsoft Word, whose spell checker often yields false positives, and whose grammar checker fails to catch many obvious (to humans) mistakes.

The main contribution of the paper is the find-fix-verify paradigm of crowd programming, which distributes the tasks among sets of workers who individually propose contributions, but use a group voting mechanism to decide which contributions are integrated into the final paper. The advantage of this approach is that it does not require any training of the editors; rather workers can be hired on demand to solve specific tasks. Unlike traditional outsourcing which may rely only on one person, the distributed nature of this technique also means that one single editor cannot exert excessive influence on the outcome of the paper. The paper's find-fix-verify approach separates and parallelizes tasks into 3 stages, in order to rein in excessively enthusiastic "eager beaver" editors, while at the same time motivating "lazy turkers" to contribute more.

While this approach may be useful in technical writing or nonfiction, where the clarity and readability of the text is of the highest importance, I do not think that it would be applicable to fiction or poetry, which requires a much more individualistic personal style. Strong, forceful passages may end up being edited into more conventional and "safer" phrases, since some turkers may disagree with controversial claims, excessively clever language, or purposely ambiguous/sly metaphors. In effect, the risk is "edit by committee", where the group settles on the lowest common denominator, and in the process strips away much of the original authors' or individual contributors' brilliance. Nevertheless, I find the approach to be intriguing, and probably much more cost-effective than a traditional dedicated editor; perhaps the best approach is running the original text through Soylent, and then having the editor choose which changes to keep.

PS: I like the name of the program; nice sci-fi allusion.

The von Ahn and Dabbish article describes GWAPs (Games With a Purpose), in which people playing video games for fun help train AI algorithms and solve problems that computers cannot, as a side benefit of the games' main entertainment purpose. The beginning of the article chronicles several examples of such games, and the catalysts and trends that are making GWAPs possible, namely: wider access to the Internet and other networks, the proliferation of problems which are computationally hard but easy for humans, and increasing amounts of time spent on computers for entertainment purposes. However, I think that the main contribution of the article is the classification of GWAPs into the following categories:

1) Output agreement games 2) Inversion problem games 3) Input agreement games

I found the part of the article describing design challenges that are specific to integrating games with computation to be the most useful portion. In particular, I thought that using timekeeping, randomness, scorekeeping, leveling-up, and high-score lists to increase player motivation/enjoyment and keep people of all skill levels and experience coming back to the game were good ideas. I also found the section on maintaining output accuracy, which aims to ensure that players do not maliciously game the system itself, to be useful, but it seems that this portion was lacking prescriptive advice other than random matching and sanity checks.

I think that GWAP is an idea with great potential. Not only is it a clever way of making people who play games contribute something useful to society, but I also think that it could spawn new business models that rely on gamer productivity, rather than (or more likely in conjunction with) advertising for revenue. I think that the game-play mechanisms of points/stars/etc is a very powerful motivator for many people; just think of how much progress we could make if we could figure out a way to harness all of the World of Warcraft addicts out there to do something useful.

Pablo Paredes - 9/26/2010 19:03:07

Summary for Bernstein. M. et al. – Soylent: A Word Processor with a Crowd Inside

The paper describes an example of integrating crowdsourcing with front end interfaces to support complex AI tasks where supposedly humans can deliver better results.

The method described is called Find-Fix-Verify, which breaks a task into smaller review stages that progressively refine it. Three groups of Turk workers (from Amazon's Mechanical Turk) are used: the first finds the problems, the second provides solutions, and the third does quality assurance. Three specific tools were implemented using this pattern: a tool for shortening long paragraphs, a proofreading tool that suggests alternative content, and a tool for open-ended tasks (such as finding pictures, quotations, or other open tasks).
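The three stages can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the agreement threshold of two workers, and the simulated worker responses are all hypothetical, and real worker input would come from Mechanical Turk tasks and not from hard-coded lists.

```python
from collections import Counter


def find_stage(marks_per_worker, min_agreement=2):
    """Find: keep only text spans flagged independently by enough workers."""
    counts = Counter(span for marks in marks_per_worker for span in set(marks))
    return sorted(span for span, n in counts.items() if n >= min_agreement)


def verify_stage(candidates, votes_per_worker):
    """Verify: return the candidate fix with the most votes (ties: earliest)."""
    tally = Counter(v for votes in votes_per_worker for v in votes if v in candidates)
    return max(candidates, key=lambda c: tally[c])


# Find: each worker marks (start, end) character spans that need attention.
find_marks = [
    [(0, 12)],            # worker A
    [(0, 12), (30, 41)],  # worker B flags an extra span nobody else flags
    [(0, 12)],            # worker C
    [],                   # worker D flags nothing
]
patches = find_stage(find_marks)

# Fix: a second group proposes rewrites for each agreed-upon span.
fix_candidates = {
    (0, 12): ["shorter phrase", "short phrase", "totally different text"],
}

# Verify: a third group votes on the proposed rewrites.
verify_votes = {
    (0, 12): [
        ["shorter phrase"],
        ["shorter phrase", "short phrase"],
        ["short phrase"],
        ["shorter phrase"],
    ],
}

result = {
    span: verify_stage(fix_candidates[span], verify_votes[span]) for span in patches
}
print(result)
```

Requiring agreement in the Find stage and a separate vote in the Verify stage is what limits how much any single worker can change the document.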

It is interesting to observe that the expected accuracy is around 70%, due to a series of issues related to the workers’ backgrounds, abilities, and other factors. A couple of well-defined groups that introduce errors are the so-called Lazy Turkers and Eager Beavers, who provide either too little or too much content (respectively) to solve a problem. It is also interesting to observe that although some AI editing tools produce lower-quality results overall, they do catch some things that are not apparent to humans, such as minor problems embedded in larger problems, and so improve the overall result. This suggests that a mix of crowdsourcing with traditional AI could be a better solution.

Additionally, it is interesting to observe that the technique can be applied iteratively to improve results from previous stages. However, given the current size of Mechanical Turk, the expected solution time for each stage is 18.5 minutes and the cost is around $5 USD, which makes the approach costly and time consuming.

The authors claim that the expected growth of the Mechanical Turk network will reduce the times by about an order of magnitude and will also bring some cost savings. I do not see clearly how the second will be attained, as the minimum cost is dictated not by the network size but by the size of the task.

The options described for implementing mini Mechanical Turks for corporations run counter to the idea of having a large body of people doing very small tasks. I do not see that as a solution; I would have preferred that the discussion consider other solutions to the 30% error rate, for example: filters to better match expertise with the core subject; differentiated pricing based on quality; and some pre-processing by the author to avoid purely stylistic edits.

Summary for Von Ahn, L. and Dabbish, L. – Designing Games With a Purpose

The paper describes the use of Games With A Purpose (GWAPs) to solve paradigms of computation that are extremely hard for AI techniques to solve.

The interesting approach is that the task is performed by humans with a high probability of success, relying neither on payment nor on altruism but rather on a more intrinsic human desire: the desire to be entertained.

One important description from the authors is that it is not enough to put a (game-like) interface to an activity, but that the activity itself must be integrated in the game itself. There must be interplay between activity and the game.

Several techniques (used in traditional games) help maintain a high level of accuracy and engagement in the activity, for example: points to link the winning condition with the effort; well-specified and challenging goals that reflect in higher levels of motivation; skill-level differentiation to maintain interest along the time, among others.

Three types of games are specified: output-agreement games, inversion-problem games, and input-agreement games. Each has a different structure, but all of them are defined for two people, which makes them collaborative in nature. Extending them to a larger audience changes the dynamics to those of a competitive game, which in turn changes the expected outcomes; it should be noted that both competitive and collaborative approaches have pros and cons.

Finally, the authors conclude that the real measure of utility of a GWAP is a combination of throughput (measured as the ability to solve tasks over time) and its enjoyability (accounted for as the average time that players remain interested). I believe the way output is measured, by comparing it to that of paid subjects, is efficient, as long as the task can be accurately delimited in a timeframe.
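The paper combines these two quantities by multiplying them: expected contribution is throughput times average lifetime play (ALP). The short calculation below is a worked illustration of that product; the function name and all of the numbers are hypothetical and are not taken from the paper.

```python
def expected_contribution(throughput_per_hour, alp_hours):
    """Expected problem instances solved per player over their whole time playing.

    throughput_per_hour: problem instances solved per human-hour of play.
    alp_hours: average lifetime play, i.e. total hours an average player spends.
    """
    return throughput_per_hour * alp_hours


# Hypothetical game A: efficient per hour, but players lose interest quickly.
game_a = expected_contribution(throughput_per_hour=200.0, alp_hours=1.5)

# Hypothetical game B: slower per hour, but players keep coming back.
game_b = expected_contribution(throughput_per_hour=50.0, alp_hours=4.0)

print(game_a, game_b)
```

With these made-up figures, game A yields 300 solved instances per player and game B yields 200, so a fourfold advantage in throughput is partly, but not entirely, offset by the shorter time players stay engaged.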

I found this paper extremely interesting, and I see great potential in exploring other settings where people could accomplish (serious) tasks that are mixed with enjoyability and challenge (i.e., games). I think the authors could have provided a better description of other ways to apply GWAPs beyond AI-related tasks, such as discussing or evaluating complex social issues, or other qualitative studies.

Matthew Can - 9/26/2010 19:09:22

Designing Games with a Purpose

In this paper, the authors present a set of principles for how to design and evaluate games with a purpose (GWAP), a class of games in which people perform tasks that computers cannot, all as a side effect of playing the game.

The main contribution of this paper is to define the class of games dubbed GWAP, to describe the characteristics of such games, to understand the factors that lead to a successful GWAP, and to provide metrics for measuring success. It opens the door for further research into GWAPs.

I liked the paper’s focus on why GWAPs are better than other crowdsourcing approaches in terms of accuracy and user commitment. Crowdsourced contributions are susceptible to errors (from both incompetent users and malicious ones). The game templates presented in the paper, along with additional countermeasures, substantially mitigate the accuracy problem by filtering out errors and inhibiting collusion. As for user commitment, whereas other approaches rely on altruism (may be hard to come by) or financial incentives (may be hard to sustain), GWAPs are based on the notion of entertaining people. The authors provide several suggestions for how to increase user enjoyment by incorporating specific game mechanics.

My biggest concern, something the authors also acknowledge, is that the application of GWAPs may be limited to a small class of problems. For example, the GWAPs presented in the paper all solved problems by creating input-output mappings. Some problems might not fit into that framework, and even those that do may still not be suitable for GWAPs (what about problems that require domain knowledge?). It seems to me like this is an interesting area for further research.


Soylent: A Word Processor with a Crowd Inside

This paper describes Soylent, a word processing interface that utilizes the crowdsourcing power of Mechanical Turk to perform text shortening, proofreading, and arbitrary word processing tasks.

The biggest contribution of this paper appears to be the concept of integrating crowdsourced contributions into user interfaces. Furthermore, as one instance of that, the authors implement a system, Soylent, that applies the concept to word processing. They provide a meaningful analysis of the problems encountered in building such a system, as well as the user benefits and the costs involved (time and money). Additionally, they describe a crowd programming pattern, “Find-Fix-Verify”, that is meant to improve the quality of crowdsourced contributions and that has applications beyond just the kinds of systems considered in this paper.

What I liked most about this paper was the analysis of the challenges in programming with crowd workers. I have been quite skeptical of the quality of work produced by crowd workers, so I was interested in the kinds of problems the authors faced when implementing this system. It seems to me that their “Find-Fix-Verify” pattern is a novel contribution toward solving some of these problems. In particular, I would have liked more discussion on the separation of the find and fix stages of the process, two tasks that are closely related but that the authors chose to decouple.

Aaron Hong - 9/26/2010 19:23:02

In the article "Designing Games with a Purpose," Luis von Ahn and Laura Dabbish chronicle their efforts to develop some templates and actual implementations for Games with a Purpose (GWAP). They say that people already spend so much time playing games that it is valuable to create games that harness that effort and time. In their paper they discuss 3 general designs: output-agreement games, inversion-problem games, and input-agreement games. They conclude that much work can be done, especially with games that require creativity and diversity of answers.
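To make the first of these designs concrete, here is a minimal sketch of one output-agreement round, in the spirit of the ESP Game: two players see the same input and the round succeeds as soon as one of them types a label the other has already typed. The function names, the turn-by-turn simulation, and the optional taboo list are illustrative assumptions, not the authors' implementation.

```python
from itertools import zip_longest


def normalize(word):
    """Lowercase and trim a guess so that 'Dog ' and 'dog' count as the same label."""
    return word.strip().lower()


def output_agreement_round(guesses_a, guesses_b, taboo=()):
    """Simulate one round of a hypothetical output-agreement game.

    guesses_a and guesses_b are the labels each player types, in order.
    Returns the first label that both players have entered, or None if
    they never agree. Labels in `taboo` are ignored (an assumption modeled
    on the ESP Game's taboo words).
    """
    blocked = {normalize(t) for t in taboo}
    seen_a, seen_b = set(), set()
    # Interleave the two players' guesses, one turn at a time.
    for a, b in zip_longest(guesses_a, guesses_b):
        if a is not None:
            a = normalize(a)
            if a not in blocked:
                seen_a.add(a)
                if a in seen_b:
                    return a
        if b is not None:
            b = normalize(b)
            if b not in blocked:
                seen_b.add(b)
                if b in seen_a:
                    return b
    return None


if __name__ == "__main__":
    label = output_agreement_round(
        ["dog", "puppy", "grass"],
        ["animal", "pet", "dog"],
    )
    print(label)
```

Because neither player can see the other's guesses, an agreed label is unlikely to be random noise, which is the property that makes the agreed output usable as data.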

I remember reading this article for CS 160 and we actually got to create games with a purpose. It is true, we should make things that are useful and practical fun and entertaining. One of the most important aspects of this, as the article notes, is that what you are trying to accomplish really needs to be integrated into the game itself; it cannot just be tacked on after the fact. This rings true for me since I worked at LeapFrog, an educational toy company, over the summer. There I saw how some games were really designed with the learning mechanism tightly integrated in and some that were not--it was hard to see from my small slice of time spent there whether one was more successful than the other, but I know in the long run it does matter.

Finally, to some degree I think it is a sad fact that "by age 21, the average American has spent more than 10,000 hours playing such games—equivalent to five years of working a full-time job 40 hours per week." Instead of creating games that mask the results of the computation players are putting in (i.e. only rewarding them with entertainment), we should create GWAPs that are fun and entertaining, but that benefit not just those doing computation or machine learning, but directly the participants themselves.

The paper "Soylent: A Word Processor with a Crowd Inside" by Bernstein et al. (including our own Bjoern) talks about a word processor (MS Word) that uses Amazon's Mechanical Turk to crowdsource complex tasks that artificial intelligence and regular people are just not enough for. What I thought was particularly fun was the name. "Soylent is people" plays off Soylent Green--I just hope more people got a chuckle out of that. The paper mentions the high variance that comes from people like the Lazy Turker or the Eager Beaver and different ways to counter that. I do think it would be useful for general word processing in the pre-work world, i.e. school and non-critical documents. A college student does not care so much about ownership of his essay, but a businessman sure would. There is much to be explored for on-demand editing in the working world.

Kurtis Heimerl - 9/26/2010 19:52:58

Sorry for not turning this in earlier, I thought I had hit submit but forgot to!

Designing games with a purpose. This paper details how to use games to crowdsource for training AI and generally solving difficult problems.

This is almost a seminal work now, referenced widely beyond computer science. I remain bearish on its value. Primarily, the metrics for success are wildly vague: "defining a game as successful if enough human hours are spent playing it". This is not sufficient, for a variety of reasons. Firstly, Wikipedia falls pretty strongly into this idiom. People have spent a long time building that database for social rewards, much like the "rankings" they discuss later. Another counterpoint is the proliferation of "Zynga" games: rote repetition for the same social rewards. I don't doubt that Mechanical Turk + Zynga-like social networking would create an absolutely massive amount of work, but it's neither a game nor enjoyable. It's just playing on aspects of human personality. There's a reason all of their games mimic existing social games and require (sorta) multiple users.

I suppose my broader point is that we're just giving social rewards rather than monetary rewards and this is not a large enough distinction to separate this methodology from something like Turk. If you could build a system that was more like a real game, not "bite-sized" but something people would play without social rewards, then I'd be more convinced.

However, I don't want to be too negative. It's an interesting avenue and not something I want to dismiss out of hand. This is a totally reasonable first step, and the results are strong. As an avid gamer though, I do not feel like this is a scalable, long term way to resolve these tasks.

Soylent: A Word Processor with a Crowd Inside. This paper details Soylent, a system integrating Turk tasks into a word processor.

I didn't actually read this, as I didn't have the username/password needed. However, I've heard a great deal about this work. I can't say too much without reading it, but I can say that my primary concern is one of analysis. I don't think a "proof-of-concept" paper is meaningful for this idea. This is not a complicated idea, and I get nothing out of knowing it could be implemented. I want to know the tradeoffs involved (latency vs. price, for example) and what users would think of such a system.

I suppose my point is that building this system is not the hard part.

Dan Lynch - 9/26/2010 21:35:10


This article presents an authoring tool by the name of Soylent. This tool can be used to help authors edit their work in real-time, whether the author needs to change the tense of a paragraph or simply cut some work out to shorten their writing. In addition to these types of edits, the software also supports citation searches.

Note that Soylent is not just software! It's also people! Just as in the 1700s a Turk was a person who pretended to be a machine, Amazon has Mechanical Turks who are just people helping collaboratively in the background while an author types into their Microsoft Word document.

Three main services are offered. The first is Shortn, which is used to cut text down to 85% of its original length without losing meaning. The second is Crowdproof, which is used to detect spelling and grammar errors that Microsoft Word misses. The third, the Human Macro, is sort of what it sounds like: using a human to do a task, such as finding citations.

A major problem is, how to verify that these humans won’t also make errors? This is ameliorated with the Find-Fix-Verify pattern, where the tasks are delegated to groups who use a voting system to complete tasks.
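A rough sketch of how the three stages could fit together is given below. It simulates worker responses with plain lists rather than calling Mechanical Turk; the function names, the data layout, and the `min_agreement` threshold are all assumptions made for illustration, not the paper's actual code or parameters.

```python
from collections import Counter


def find_stage(flags_per_worker, min_agreement=2):
    """Find: keep only the spans flagged independently by enough workers.

    flags_per_worker is a list with one list of span ids per worker.
    min_agreement is a hypothetical threshold chosen for this sketch.
    """
    counts = Counter(span for flags in flags_per_worker for span in set(flags))
    return sorted(span for span, n in counts.items() if n >= min_agreement)


def verify_stage(candidates, votes):
    """Verify: return the candidate fix with the most votes, or None.

    Votes for anything that is not a listed candidate are discarded.
    """
    tally = Counter(v for v in votes if v in candidates)
    if not tally:
        return None
    return tally.most_common(1)[0][0]


def find_fix_verify(flags_per_worker, fixes_by_span, votes_by_span, min_agreement=2):
    """Run Find, then pick among the Fix candidates using the Verify votes."""
    result = {}
    for span in find_stage(flags_per_worker, min_agreement):
        candidates = fixes_by_span.get(span, [])  # Fix: rewrites proposed by a second group
        result[span] = verify_stage(candidates, votes_by_span.get(span, []))
    return result


if __name__ == "__main__":
    flags = [["s1", "s3"], ["s1"], ["s2", "s1"], ["s3"]]
    fixes = {"s1": ["fix a", "fix b"], "s3": ["fix c"]}
    votes = {"s1": ["fix b", "fix b", "fix a"], "s3": ["fix c", "fix c"]}
    print(find_fix_verify(flags, fixes, votes))
```

Splitting the work this way means no single worker both chooses what to change and approves the change, so one careless or overeager contributor cannot push an edit through alone.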

This is, in my opinion, a very important topic of discussion because it cuts across boundaries that computers cannot, and neither can a single person. It's the fusion of the two that realizes the power in this system. The fact that a human has on demand a team of other humans through an anonymous computer network to collaborate with is a very ground-breaking technology indeed. This has many implications socially, legally, and so on. Other issues are privacy, legal ownership of text, and domain knowledge. What if a piece of writing is confidential? Whose words are these anyways? And what if the workers don’t understand the context?

On another note, I am not sure how I really feel about the implementation. If you go to the actual website, you will find HITs, human intelligence tasks, which you yourself can become a Turk and work on. These jobs range from $0.02 to a maximum of $10.00 (during my search). One example was to write an 840-word essay on a particular piece of writing for $3.40. Is this legit? Seems like a modern sweatshop. However, aside from the social implications, this could be a profound technology if leveraged correctly. I think it also needs a little more publicity and use.

Something that I think they have missed is, why not focus on other areas for Turking? For example, there is physics, math, and other areas besides writing that could use a collaborative system! This calls for possibly a new research question and maybe even implementation...


What a great idea! Let me pull out the quote that made me really pay attention to this idea: “What if this time and energy were also channeled toward solving computational problems and training AI algorithms?” Imagine.

Parts of this article discuss projects that have already taken place to utilize man-hours to do the work of labelling and categorizing images. For example, “200,000 players had contributed more than 50 million labels”, which is very useful for Google image searches. It is definitely interesting to see that they can trick people into doing work!

This is, however, a VERY important issue. Collaboration in large numbers like this gets work done! If they can really optimize these games, imagine how much more relevant our data searches will be in the future, from text to images to videos. A large portion of the Internet is stratified using computer vision techniques and text-scraping algorithms, and often these are not enough for the job. It's definitely something that humans can do easily, and what better way to use the network?

Luke Segars - 9/26/2010 11:05:05

Soylent: A Word Processor with a Crowd Inside

This paper describes Soylent, an extension to Microsoft Word that provides distributed human input towards the task of proofreading documents. The paper discusses three integrated tools, all based around simple but computationally challenging word processing tasks, that have been built using Amazon's Mechanical Turk, a service that pays individuals for doing series of small tasks that are provided to them electronically.

Soylent reports reasonably high, although well under professional, proofreading accuracy rates at a reasonably minor financial cost (more on this later). Given the difficulty of natural language processing and context detection, it is clear that human peer review is often still superior to its computational equivalent.

There are a number of distinct disadvantages to this approach to proofreading that may make it less desirable to potential users. The first and most obvious one is that it costs money to do. While it seems that the cost may be somewhat trivial to university professors or graduate students, it is unclear whether a middle school or high school student would want to use it over the course of their education due to accumulating costs (not to mention the possible need for multiple iterations per paper). It is also often psychologically difficult for people to begin paying for something they've assumed to be free in the past (peer proofreading). As the level of complexity of the piece being edited increases, it may also become more difficult for the workers to understand the purpose of the paper if the topic is somewhat advanced or abstract. A draft of a research publication may, for example, be difficult to correct if workers are unfamiliar with the terminology of the field. Similarly, it appears that much of the data provided to Turkers is provided out of context, with only a single paragraph transmitted out of the bulk of the entire paper. This seems like it may make proofreading more difficult – personally, it seems like proofreading without knowing the context of the rest of the paper would be more difficult for me to accomplish successfully.

Nevertheless, it is very possible that there is a niche of users who would benefit hugely from the availability of a crowd in their word processor. There are a number of situations where I could personally benefit from such a tool if I were willing to spend some money. I'm wondering now whether it would be possible to involve GWAPs in the proofreading process in order to remove the cost and increase potential accuracy (through repetition) with the tradeoff of a slower feedback loop.

Games with a Purpose

The Games with a Purpose initiative is an excellent example of the larger modern movement to direct human cognitive power towards solving important problems and otherwise benefiting society. Von Ahn is a significant contributor in this field and several of his other ideas have become omnipresent in the online world. Games with a Purpose describes a set of principles that can be used to create fun games that can be used to train machine learning algorithms towards various goals. Nevertheless, the techniques described in this paper apply to a very small subset of games, specifically the casual game industry, which may limit its ability to capture the thousands of hours spent playing games as representable knowledge.

The idea of Games with a Purpose is, I believe, ingenious. It is a well-known fact that humans enjoy being entertained, and in the digital age a very common form of entertainment is video gaming. Since the internet became widespread, a number of these games have taken advantage of the social interactions provided by a networked game to make their games more fun for players. Very often players participate with people they have never met (outside of the game). The games that von Ahn describes use the power of randomly distributed social interaction to generate data that has a very high probability of being correct. A previous invention of his, reCAPTCHA, uses this same concept to digitize book text.

Games with a Purpose presents a more general question that is important for us to confront in an age where the times are truly different: is it still necessary for there to be a clear distinction between “work” and “play?” As von Ahn mentions, a number of educational curricula and online communities are beginning to use game mechanics to increase user participation in doing their work for them. Online environments such as World of Warcraft have had tremendous success in using game mechanics to get users to spend countless hours attempting incredibly difficult or complex tasks by simply offering them a virtual “badge” for their character's profile. If people are willing to work for virtual rewards instead of money, is it possible that GWAPs are heralding a much larger paradigm shift than we are aware of?

I don't think that the answer to this question is immediately obvious. Von Ahn's games still apply to a very small subset of the gaming spectrum and widespread participation may be limited without a broader marketing campaign or generalization into other types of games. Perhaps one of the biggest elements currently left out of von Ahn's formula is the ability to gain social reputation with other players. Each of the games described in the GWAP context requires that players be anonymous to each other, unable to exchange any sort of information other than the desired content of the game designers. Many of the most successful games today take an opposite approach: give the players a specific online personality that they are responsible for that other players will also know them as. This gives the player a sense of responsibility for developing their own reputation but can only be achieved at a very elementary level if players are required to be kept anonymous.