Applications in HCI
- 1 Readings
- 2 Discussant's Slides and Materials
- 3 Main Discussion Points during Lecture
- 4 Reading Responses
- 4.1 Nicholas Kong
- 4.2 Kristal Curtis
- 4.3 Wesley Willett
- 4.4 James O'Shea
- 4.5 Manas Mittal
- 4.6 Kurtis Heimerl
- 4.7 Beth Trushkowsky
- 4.8 Travis Yoo
- 4.9 Anand Kulkarni
- 4.10 David Rolnitzky
- 4.11 Ariel Chait
- 4.12 James Cook
- 4.13 Siamak Faridani 01:40, 7 February 2011 (PST)
- 4.14 Sally Ahn
- 4.15 Prayag Narula
- 4.16 Philipp Gutheim
- 4.17 Chulki Lee
- 5 Authors' Replies to Student Comments
Readings
- Paper 1: VizWiz: nearly real-time answers to visual questions. Jeffrey P. Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C. Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, and Tom Yeh. UIST 2010.
- Paper 2: Soylent: a word processor with a crowd inside. Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, and Katrina Panovich. UIST 2010.
Discussant's Slides and Materials
- Mechanical Serfdom Is Just That. Rachael King, BusinessWeek, Feb 1, 2011.
- Breaking Monotony with Meaning. Lukas Biewald, CrowdFlower blog.
Main Discussion Points during Lecture
- There are a lot of incentives to code something together (Class)
- Results from a survey n=136 (Philipp Gutheim):
- Indian workers earn around $1.30 per hour and work on average 4.7 hours per day. They source HITs by type (50%), newest (39%) and highest payment (11%).
- US workers earn around $1.99 per hour and work on average 2.2 hours per day. They source HITs by type (72%), newest (11%) and highest payment (17%).
- “Payment becomes less important when the system becomes iterative. You get a few people pretty quickly regardless what you pay. Often you just need a few people for a task so maybe payment is not as important.” (Michael Bernstein)
- The number of HITs in a group affects the response time more than anything else (Reynold Xin)
- Motivation is a concern in crowdsourcing. Feedback is a simple way to address it (e.g. simple UX enhancements when selecting a HIT).
- There is a trade-off between the quality of a response and privacy concerns about the amount of information you provide. (Pablo Paredes)
- That said, how sharp this trade-off is depends entirely on the mechanism you use to work around it (Tap & Chuang).
- Also, people may set aside privacy concerns in exchange for significant gains in value.
Reading Responses
These papers both explore issues with incorporating crowdsourcing into end-user applications by implementing two prototypes.
Soylent is a Microsoft Word plugin that allows users to crowdsource certain editing tasks to Mechanical Turk. It describes shortening, proofreading, and custom macro services. The main issue the prototype tackles is that of assuring quality. The paper states a useful error metric (about 30% of the output is unsatisfactory) and introduces the Find-Fix-Verify programming pattern to address this problem. I'd like to see further exploration of some of the crowdsourcing parameters, for example investigating how the number of Turkers at each stage affects result quality.
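To make the stage structure concrete, here is a minimal sketch of Find-Fix-Verify with the number of workers per stage left as a free parameter, which is the knob I would like to see explored. This is not the authors' implementation: the workers are stand-in Python functions (a real system would post HITs), all names are hypothetical, and the majority rule in the Verify stage is my own simplification. The 20% agreement threshold for Find is the figure the paper uses.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-ins for crowd workers; a real system would post HITs instead.
FindWorker = Callable[[str], List[str]]    # returns the spans a worker flags
FixWorker = Callable[[str], str]           # returns one rewrite of a flagged span
VerifyWorker = Callable[[str, str], bool]  # approves or rejects a rewrite


def find_fix_verify(
    text: str,
    finders: Sequence[FindWorker],
    fixers: Sequence[FixWorker],
    verifiers: Sequence[VerifyWorker],
    agree_frac: float = 0.2,
) -> Dict[str, List[str]]:
    """Return, for each agreed-upon span, the rewrites that survive verification."""
    # Find: count each worker's flags once; keep spans that enough workers agree on.
    votes = Counter(span for worker in finders for span in set(worker(text)))
    patches = [s for s, n in votes.items() if n / len(finders) >= agree_frac]

    results: Dict[str, List[str]] = {}
    for span in patches:
        # Fix: a separate group proposes rewrites (duplicates removed, order kept).
        candidates = list(dict.fromkeys(worker(span) for worker in fixers))
        # Verify: a third group votes; keep rewrites approved by a strict majority
        # (assumption: the paper's exact filtering rule may differ).
        results[span] = [
            c for c in candidates
            if sum(v(span, c) for v in verifiers) * 2 > len(verifiers)
        ]
    return results


if __name__ == "__main__":
    finders = [lambda t: ["teh"], lambda t: ["teh", "vary"], lambda t: []]
    fixers = [lambda s: {"teh": "the", "vary": "very"}.get(s, s), lambda s: s.upper()]
    verifiers = [lambda span, c: c.islower() and c != span] * 3
    print(find_fix_verify("teh cat is vary good", finders, fixers, verifiers))
```

Varying the lengths of `finders`, `fixers`, and `verifiers` (and `agree_frac`) against a set of known errors would be one way to run the experiment suggested above.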
VizWiz is a mobile application for blind people that allows them to ask questions about their surroundings (e.g., identifying types of cereal). In this application, latency is the major concern. The authors address this by keeping workers answering old questions, so they are ready to receive a new request when it comes in.
One issue that any crowdsourced system will have is scalability. If these types of systems gain wider use, I wonder if the population of Turkers will grow fast enough to be able to support them.
Soylent: This paper is about an MS Word add-on for editing/proofreading text powered by crowdsourcing. This is a great paper, since it is a nice application of crowdsourcing that also touches on some interesting research issues, such as (1) hybrid computer-human systems, (2) time to answer vs. cost/quality, and (3) the need for guidance on how to set costs, number of workers per stage, etc. I'm interested in exploring some of these issues. I was very impressed by how the authors were able to produce useful results from unreliable Turker output. Producing reliable results from noisy input and unreliable components is a strong theme in distributed systems, and I'm happy to see it here as well. I like the Find-Fix-Verify pattern, but I wish the authors had done more to explain which existing patterns are similar and why they fall short. This seems an important omission in the related work, since the authors claim that the pattern is one of their main contributions. I also wish the authors had provided more discussion of why they chose the various constants mentioned throughout the paper (e.g., the price paid to workers at each of the FFV stages and the number of workers per stage).
VizWiz: This paper is about a question answering service for blind people that uses Mechanical Turk along with some AI techniques like speech recognition and computer vision. To me, this work mostly made the case that there's a great need for cameras that assist blind people in taking photographs. The usefulness of the app was severely limited by the poor photos that the blind users often took. I was disappointed that the authors didn't try a more iterative model where the Turkers could look at the photo and provide cues to the blind person about how to take a more useful one.
I was also surprised that the authors didn't try to verify the responses. I understand that latency was a key concern, but surely providing incorrect/spammy answers back to the blind person is less desirable than providing an answer in which you have more confidence after some small delay? In the AMP Lab, one of our goals is to develop "error bars on everything." One way to balance the need for reliability with the need for real-time performance is to provide the answers as quickly as they come back but to also tag each answer with an error bar. As more answers are received, you can adapt the answer and reduce the error bar, indicating that your confidence has increased.
I was a bit confused by the authors' explanation of how they maintained a worker pool -- it sounded like they were just keeping the workers busy answering questions to which they no longer needed an answer just to keep them around. Surely there must be a better way to keep the workers engaged? Maybe they could have a separate "low priority" service which costs less and whose tasks will only be given to workers when they are idle. I was intrigued by their comments about how a pool of workers could be shared among many users of VizWiz. As I believe that the crowd and the cloud have a lot of parallels, this reminds me of Heroku, which hosts Ruby on Rails apps on EC2. Heroku minimizes its costs by multiplexing many users on the same pool of EC2 instances. You could also imagine a scaling service for the VizWiz crowd that would adapt the size of the worker pool to the VizWiz demand.
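The "error bars on everything" idea for VizWiz answers could look roughly like the sketch below: report the most common answer so far, together with an interval that tightens as more workers agree. The choice of a Wilson score interval on the fraction of workers giving the leading answer is my own assumption, as are the function name and the simple lowercase/strip normalization; nothing here comes from the paper.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def answer_with_error_bar(answers: Sequence[str], z: float = 1.96) -> Tuple[str, float, float]:
    """Return (leading answer, lower bound, upper bound) for the answers seen so far.

    The bounds are a Wilson score interval on the share of workers who gave
    the leading answer (an assumed, illustrative choice of error bar).
    """
    if not answers:
        raise ValueError("need at least one answer")
    # Naive normalization so "Cheerios" and "cheerios " count as the same answer.
    counts = Counter(a.strip().lower() for a in answers)
    top, k = counts.most_common(1)[0]
    n = len(answers)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return top, max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    stream = ["Cheerios", "cheerios", "Corn Flakes", "Cheerios", "cheerios"]
    # Re-report after every new answer, as a real-time client might.
    for i in range(1, len(stream) + 1):
        print(i, answer_with_error_bar(stream[:i]))
```

The first answer can go back to the user immediately with a wide interval, and later answers either narrow it or flip the leading answer, which is the behavior described above.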
Each of these papers describes a nicely-designed application that demonstrates the utility of the crowd for a particular domain, and, in the process, answers an interesting technical or research question (Soylent addresses concerns over data quality via their Find-Fix-Verify pattern, while VizWiz deals with latency).
Soylent is a particularly nice illustration of how the crowd might be embedded into existing applications, and I think the Find-Fix-Verify design pattern provides a nice, systematic way of breaking down particular kinds of tasks to address quality concerns. I suspect that we're going to be seeing a lot more work on design patterns for the crowd in the near future - and that's a good thing - systematic approaches for dealing with these sorts of common problems are a must.
That said, I do worry that the efficiency and cost-effectiveness of Soylent for a task like proofreading is debatable. Shortening or editing individual paragraphs in their examples involves an incredibly large number of workers and takes a lot of time, and the costs add up very quickly. The authors note that the allocation of tasks and the amount paid for them could be tweaked in the future to improve these results, but it's not clear to me that requiring inputs from 30-50 workers per paragraph is ever going to be reasonable. I still think crowdsourced proofing is probably viable - maybe even using this model - but the granularity of the tasks and the experience level of the workers probably need to change. Using a more specialized crowd with trained editors who identify, fix, and proof many edits at the paragraph or page level might provide higher quality results at a reasonable price. This seems like a case where they're segmenting the tasks too much.
Also, the eat-your-own-dogfood conclusion was cute…I mean, they pretty much had to.
VizWiz's most useful contribution seems to be their "quikTurkit" model, in which they maintain a pool of workers who are waiting to respond to new prompts with minimal delay. This represents an interesting shift from a pay-by-task model to a pay-by-time one. My sense is that this shift entails a change in how worker contributions are reviewed and how the quality of work is judged. Rather than simply considering response quality, more general measures of worker productivity (throughput, downtime, etc.) seem necessary, and more discussion of this would have been interesting to see.
Also, I have some issues with the cost estimates given in the paper. While I understand the authors' assertion that they maintained a pool of workers capable of answering many more questions than were actually answered, and that the cost/answer would theoretically decrease as the number of users increased, their cost estimates from their pilot studies seem artificially low. By my understanding, maintaining a pool of 4+ workers at all times means that every hour of uptime requires a minimum of 4 worker-hours. This means that their stated cost of $4.67/hour averages out to a maximum hourly wage of about $1.17/hour for the workers ($4.67 / 4) - not a living wage in many locales.
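The wage bound above is just a division, but writing it out makes the assumptions visible. The sketch below assumes the entire hourly cost is paid out to workers (no platform fee) and that the pool never drops below the stated size; the function name is hypothetical.

```python
def max_hourly_wage(pool_cost_per_hour: float, pool_size: int) -> float:
    """Upper bound on what each worker can earn per hour.

    Assumes all spending goes to workers (platform fees ignored) and that
    pool_size workers are present for the whole hour.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return pool_cost_per_hour / pool_size


if __name__ == "__main__":
    # $4.67/hour spread over a pool of at least 4 workers.
    print(f"${max_hourly_wage(4.67, 4):.2f} per worker-hour at most")
```

Any fee taken by the platform, or any moment with more than 4 workers in the pool, only pushes the per-worker figure lower.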
VizWiz presents a mobile-device application to help blind users answer visual questions in nearly real-time. I think the main contribution is their quikTurkit approach to maintaining a source of available workers to be used when questions arise. It is a clever way to solve the latency problem using Amazon's Mechanical Turk, but it seems somewhat of a waste to have workers answering other questions until real queries come up. Given that latency is a general problem with MTurk, I wouldn't be surprised if new crowdsourcing platforms were developed to address this issue directly, maybe using social networks. I imagine there are more efficient ways to maintain a pool of workers and simply alert them to questions as they are asked. One option would be to pay workers a low rate for time available, and then a higher rate for each question answered. This isn't possible with MTurk, but it could easily be employed in a new crowdsourcing platform.
Soylent describes a word-processing application which uses crowdsourcing to proofread and edit sections of text. As previously noted, the Find-Fix-Verify approach seems like a nice way to deal with quality-control issues. Having said that, I'm still not sure if proofreading and editing text is the best choice for a crowdsourcing application. It seems like quality could vary significantly depending on which 5 or 6 workers you get to suggest rewrites. Having more workers helps with this, but then you start to question whether it is an efficient or cost-effective approach. I wonder if we'll start to see crowdsourcing options using pools of experts specializing in one type of task rather than jacks-of-all-trades. For example, the quality might be much higher if the pool of workers all had writing degrees.
VizWiz: Nearly Real-Time Answers to Visual Questions
The authors demonstrate mechanisms for handling the crowd workforce so as to minimize time to answer (quikTurkit). They instantiate these mechanisms in VizWiz, an iPhone app for blind users. VizWiz illustrates that some tasks are still best done by humans, and how Mechanical Turk can be used intelligently to optimize a desired characteristic (minimizing time to answer). It also illustrates "hacks" akin to busy polling to keep the crowd workforce engaged. It's a cool, useful tool that I might need (and use).
One question is how elastic the workforce availability is, and how much is available anyway. Will it be possible to model the crowd's behavior so as to provide predictable QoS assurances? Another question is how much of this is a function of Amazon Mechanical Turk, and how much is a function of the general idea of crowdsourcing. Are we studying the semantics of Mechanical Turk as opposed to the general idea of crowdsourcing? This paper illustrates that there might be different ways to optimize different question-related metrics, and I am thinking about research ideas along these lines.
Soylent is a Microsoft Word plugin that instantiates the Find-Fix-Verify (FFV) paradigm. The authors argue that this paradigm enables more complex tasks to be accomplished on Mechanical Turk despite Turk's lazy and unreliable workforce. The authors also demonstrate an example of deep integration of behind-the-scenes crowdsourcing into existing software (Microsoft Word).
Soylent demonstrates the FFV paradigm in the context of word processing. For FFV to be a generic model, the paper would benefit from a discussion of the generality and applicability of such a paradigm, i.e., what is the class of problems that it can be applied to? I question the argument for separating the Find and Fix stages. Is it fundamentally different from "Decreasing Work"? Lazy workers will continue to randomly mark sections and randomly change/rephrase work. It would be good to see some form of "Systems Thinking" discussion around this. For evaluation, I was wondering why the authors never considered using a human professional proofreader (or even a quick self-proofreading). Will Soylent's results be significantly different in price and quality from using a cheap professional proofreader in India? Also, the authors' statement that wait times will drop since "it is important to remember that the service will continue to grow" is an unqualified forward-looking statement not backed by any data. Finally, the "Soylent-ed" conclusion discards the phrase "That use independent agreement and voting", which is a key point. Will crowdsourced systems generate lowest-common-denominator results?
VizWiz: nearly real-time answers to visual questions
I'm really disappointed in this paper. It seems like such a mishmash of random work, each piece incredibly poorly evaluated (3 users for VizWiz 2.0!). Is it a design study? You really had to write about the fact that blind users have a problem taking pictures? On what planet is it a "correct answer" when a Turker accurately points out that there's nothing in the picture? This seemed to avoid all of the hard problems (blind people taking pictures accurately) while tackling the easy ones (lowering the message latency). That's a bit unfair; maybe those were both hard questions. I just know that throwing money at it or selecting a specialist population will lower latency. I didn't need that evaluated. Clever enough idea, I suppose, but it completely seems like one of those "one-offs" HCI is known for.
Soylent: a word processor with a crowd inside
Ah, I missed this in 260. More interesting work than the VizWiz paper, with some structural work included. I'm happy with it. The discussion hits the important problems with the work, which I think should have been resolved before publication. Right now, this work primarily shows that yes, you can insert MTurk into random apps. It also provides an unevaluated pattern for crowdsourcing work. The real issue, to me, is: "is this a viable model for crowdsourcing to actually be useful?" Those are the latency and cost issues that were ducked in the discussion. That's what we're trying to cover in Umati, and I'm happy to be able to do that.
"VizWiz" describes how to use turkers to answer visual questions for the blind in real time. The best takeaway from this paper is their use of lead time to recruit turkers in order to reduce overall latency for the user. It would be nice in the future if latency estimates and guarantees were directly provided by human computation frameworks. I liked the techniques for "better turker use" (i.e. not wasting their time on bad tasks), namely, automatically recognizing dark or blurry images before they are sent. In general, we should remember we don't need turkers for *everything* and should continue to explore the best ways to combine human work and machine work. I also liked another technique for reducing answer latency: sending redundant HITs. Lastly, LocateIt demonstrates a specific instance of what would be cool to explore further: question rewriting!
"Soylent" uses turkers to help with complex writing tasks like shortening or proofreading a selection of text. A key takeaway is their approach to quality control, Find-Fix-Verify, which separates locating corrections from performing those corrections. One small note: it wasn't clear if the Shortn interface shows multiple same-length versions of an excerpt, in case there are quality differences that the user might want to choose between. In general, Soylent suffers from long wait times, and would benefit from an approach similar to the one used in VizWiz to reduce waiting. The authors choose to more or less write off wait time, saying that in the future there will be more workers. However, in the future there might also be more tasks to choose from if the number of requesters increases as well.
Both applications are very interesting and practical, and I would love to see how they will be further developed in the future.
Soylent provides a valuable programming pattern to improve worker quality: the Find-Fix-Verify approach. This could be applied in many other HCI crowdsourcing applications. However, although it increases quality, it also increases latency and cost. It would be an interesting experiment to find the critical point in the trade-off between those elements: quality, latency (# of workers?), and cost.
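One rough way to look for that critical point is to model the simplest form of redundancy, a majority vote among n workers, and tabulate accuracy against cost. This is a deliberately simplified stand-in for Find-Fix-Verify, not the pattern itself: it assumes each worker is independently right with the same probability p, answers are simply right or wrong, and cost grows linearly with the number of workers. The function names and the example numbers are hypothetical.

```python
from math import comb
from typing import List, Sequence, Tuple


def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent workers is correct."""
    if n <= 0 or n % 2 == 0:
        raise ValueError("use a positive odd number of workers to avoid ties")
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def tradeoff_table(
    p: float, price_per_answer: float, sizes: Sequence[int] = (1, 3, 5, 7, 9)
) -> List[Tuple[int, float, float]]:
    """Rows of (workers, probability the majority is right, total cost)."""
    return [(n, majority_correct(p, n), n * price_per_answer) for n in sizes]


if __name__ == "__main__":
    # Assumed numbers: workers are right 70% of the time, $0.05 per answer.
    for n, quality, cost in tradeoff_table(0.7, 0.05):
        print(f"{n} workers: quality={quality:.3f}, cost=${cost:.2f}")
```

Under this model each extra pair of workers buys less accuracy than the previous pair while costing the same, so the "critical point" depends on how much a wrong answer costs the user. Latency is not modeled here.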
The Find-Fix-Verify pattern could also be applied in VizWiz, although latency is a big issue there. For photos that must not be misidentified, the approach might be worth using even though it adds some latency.
A question that I'd like to ask is "What would be the best crowd programming pattern for applications where the latency is the major concern?"
VizWiz and Soylent are the first applications to demonstrate use of crowdsourcing as a synchronous online tool rather than an offline data annotation / data processing service. This plays strongly into the earliest visions of what Mechanical Turk would be -- a "collaborative human interpreter" that could provide information on demand to algorithms and software tools; unfortunately, this vision was seemingly forgotten in the early years of Turk in favor of data annotation / offline processing tasks.
How to analyze system costs on Turk
I'm impressed that VizWiz managed to get a pool of 10 workers for $5 / hr. This is a result we shouldn't be afraid to cite widely when proposing and designing systems on Turk. There's a strong case here that systems powered by human workers on Turk can consistently be made inexpensive, regardless of how much a specific implementation might cost. It'd be nice if we could evaluate systems that use Turk without worrying as much about the actual cost of a specific implementation, and instead look at the number of queries made to workers or the number of worker-minutes requested. After all, it's been demonstrated by now that Turk workers can be obtained inexpensively, so whether it's actually been done during a project is just an implementation detail, not a property necessary for evaluation.
Both papers evaluate the cost of their systems by comparing them to non-crowdsourced solutions for the problems. This is a reasonable practice, and especially effective in domains like VizWiz where there are expensive existing tools for the problem. This is harder, however, when there are no existing software tools that solve the problem, as in the case of Soylent. One other idea for how to talk about this cost is by drawing analogies between the cost of running human computation systems and other kinds of foundational advances in computing architectures, rather than comparing it to standard software costs -- for example, today quantum computers offer a qualitatively more powerful computing architecture, but cost quite a lot to operate -- certainly cost more than $5/hour.
The real contribution of VizWiz is quikTurkit (though the specific application itself is also cool). This is a much better solution to get real-time answers from Turk than developing optimization models, and I look forward to using this in my own research and seeing it used much more widely. I do wish they hadn't built it on top of TurKit, though! I know of only one other application "in the wild" using this kind of approach (IQ Engines, out of Berkeley), though I can easily think of many more great applications I'd love to see built. Anyone know of anything else in the works?
I notice that they implemented some great hacks on Turk to ensure they show up at the top of the search lists, posting 64 times as many HITs as they actually need (and presumably not letting workers submit any after the maximum number of workers are in the system). Tools using Turk should mimic this strategy; it's fairly easy. Of course, it won't work once everyone's doing it, but for the time being it's a great idea.
Soylent is a real engineering and interface feat, and the main results speak for themselves. In my view, the big crowdsourcing contribution here is the find-fix-verify model. It advocates separating out each of the main components of a general task and assigning them to a different worker to improve accuracy. Solid best practice. Can we find a similar model to improve consistency rather than accuracy?
Last question for the authors -- what's the plan to commercialize these tools?
The paper discusses Soylent, a word processing interface that integrates crowdsourcing to improve the quality (grammar, length, and formatting) of written documents. I thought the idea was really innovative. Beyond word processing, one could imagine that the ideas presented here could be integrated into a variety of software (e.g. spreadsheets, presentation software, etc.). I like how they thought about a couple of personas for Turkers -- the Eager Beaver and the Lazy Turker -- as extreme examples of the types of challenges that Soylent might face from Turkers, and then designed some aspects of the interface to address this. For example, splitting up Find and Fix was a really interesting approach to mitigate the issues that both of these personas might present.
However, overall I'm quite skeptical that most Turkers would be knowledgeable and proficient enough with grammar and sentence structure to be able to offer an expert level of feedback. The authors cite a previous paper that talks about how Turk has 2 major populations: well-educated Americans and well-educated workers from India. However, given the limited confidence I would have in even native-speaking, educated Americans being able to edit papers, I'm even less confident about non-native English speakers. I can see this system working for simple papers, but it would be difficult to expect anything above high school or low-level college writing to benefit from it. For example, the authors noted the difficulties in fixing parallelism issues. For the Find-Fix-Verify pattern, I was also surprised that the authors set 20% as the level of agreement for a patch to be selected for the Fix part of the algorithm. This seems much too low to me. The authors point out many of the challenges to Soylent, and I can appreciate that they really addressed some of these challenges.
VizWiz is a project aimed at enabling blind people to recruit remote sighted workers to help them with visual problems in nearly real-time. It is a very interesting paper with clear applications to some much-needed and useful future research projects -- this type of readily available, crowdsourced info could be applied to many problems. The two issues that I think currently prevent this from being an overwhelming success are the difficulty users have taking good photos, and the response time from the pool of workers. The authors pointed out both of these issues as being problematic. I do think that if there were a way to combine this with a user's social network (it was mentioned in the article that users wanted to receive more detail and even a real conversation), the service could be much more effective. The network of individuals would be much more likely to respond quickly and accurately than random strangers on Mechanical Turk. Sort of an OnStar system for everyday tasks.
Soylent: This paper discusses interface outsourcing, i.e. integrating the crowd as a function of an interface, specifically for a word processing application in this context. The features are Shortn, Crowdproof and Human Macro for pruning, proofreading and custom tasks respectively. They try to get around the quality problem by breaking up tasks into Find-Fix-Verify, with agreement among multiple workers at each stage providing extra accuracy. Although this increases latency and cost, it seems like a necessary trade-off for any kind of real quality control. The idea that 'wait time' will eventually be phased out by a mass of workers ready to complete any task in an instant might be problematic. In order to reach such a critical mass of workers, there would also have to be a critical mass of decently paying work, and presumably more competition on wages for instantaneous results.
This paper talks about a product that grants a blind person access to visual information by outsourcing recognition to the crowd via a camera-equipped mobile phone. To address the need for a quick turnaround, they introduce quikTurkit. By keeping workers busy on other tasks in between recognition tasks, it ensures they are available at the moment they are needed. The costs of keeping a pool of workers around aren't exactly clear. Are they paid for the in-between tasks that have no use? And if the pool is to be shared by multiple users, will $4.67 an hour really be enough to keep a large enough pool?
VizWiz: Nearly Real-time Answers to Visual Questions
This paper describes a tool that uses MTurk to answer visual questions for blind people, and discusses the cost of keeping a pool of workers ready to answer questions quickly.
It sounds like VizWiz has the potential to be very useful and cost-effective, providing functionality which was previously not always conveniently available to blind users. It's very much a work in progress: the UI apparently forces the user to wait around until a response comes back, and the response times themselves leave room for improvement.
quikTurkit is important in itself, as it seems to be the first attempt to use MTurk to get near-real-time results. I'd be interested to see how much the response time would improve if many requests were entering the system every minute.
I wonder if the MTurk workers could have provided useful comments about the system.
Soylent: A Word Processor with a Crowd Inside
Soylent lets users create copy-editing tasks for MTurkers with a UI integrated into their word processor. It is relatively quick and has a system for quality control.
This project has the best name ever. Also, it sounds pretty useful to me. Proofreading, parallelized, done in the background, and without needing to corner idle friends? Sign me up!
They should have done more work to evaluate their system: for example, asking a group of users to use their system to accomplish practical tasks, and asking a control group to do the same thing with conventional grammar and spelling checkers, and trying to determine if their software had any positive effect, for example on productivity. Also, I would have liked to see experimental results supporting their claim that separating find and fix really helps in proofreading.
Siamak Faridani 01:40, 7 February 2011 (PST)
Both of the papers talk about how crowdsourcing can be embedded in HCI applications. They provide interesting techniques and best practices. For example, the idea of keeping Turkers on call for tasks seems interesting. It would also be interesting to verify their accuracy while they are waiting on the system: VizWiz or Soylent could send test jobs to workers and see if they are submitting honest responses. One question I have is why they do not use a private cloud of Turkers, and why they insist on using MTurk. The other question concerns the reward for each HIT, which seems to be a dominating factor for the completion time. I was hoping that they would have more discussion of the optimal pricing policy. I understand that the quality of work is not correlated with the reward, but the reward can determine how fast you get your results back (especially in Soylent). Also, the questions of optimal reward policy and completion time calculation seem to be closely related.
The idea of shortening the conclusion with Soylent was brilliant.
The focus of these papers is on minimizing latency and maximizing quality through carefully designed HCI applications.
VizWiz introduces crowdsourcing as an alternative method to computer vision for helping blind people gain visual information. One of the greatest drawbacks in human-powered services is latency, and the authors focus on how they try to solve this with quikTurkit and other heuristics. I wondered whether redundant work was really necessary to reduce latency; since the authors list lower expense as one of the advantages of VizWiz compared to automatic services, this seems like a question worth investigating. I would imagine that a training session or even a game to play during the wait would be a more cost-effective way of keeping workers around than paying them for unneeded work. As mentioned in the paper, it seems that a bottleneck occurs when the worker cannot answer the question immediately from the photo, and this is a source of frustration because restarting is much faster with automatic processes. A possible way to make such communication more efficient might be to add buttons on the crowd worker's interface that instantaneously send common feedback messages. The other approach also mentioned in the paper would be to add software support that helps the blind users take better photos, but the results from VizWiz:LocateIt reveal that this may be much more challenging. The quikTurkit framework enables "nearly real time" answers from Mechanical Turk. I think the difference between "nearly real time" and "real time" is still significant for most applications, and it would be interesting to explore scenarios for which the pros of a "nearly real time" crowdsourced response outweigh the inevitable (albeit minimized) latency.
Soylent tackles the problem of quality assurance in crowdsourcing. The paper describes the meticulous design needed to achieve this, as well as important experiments and data that analyze the core challenges of enlisting Mechanical Turkers to produce reliable results. It identifies Lazy Turkers, Eager Beavers (cute), and error introduction as key problems and presents the Find-Fix-Verify pattern to address these issues. As the authors suggest, Soylent's achievement indicates the applicability of crowd work to complex editing tasks. It would indeed be interesting to see further work along these lines. Document and image editing seem like appropriate realms for gathering input from a crowd, but the idea of crowd-produced programming seems a little far-fetched and even risky considering the potential disasters even the tiniest bugs can cause. However, I would probably have had similar doubts about complex document-editing prior to reading this paper, so it's an interesting idea and my mind is open to it.
VizWiz: VizWiz was a great read and an impressive piece of work. quikTurkit was the most important contribution of the paper. The impact of maintaining the worker pool on answer time was a revelation (though it did not come as a surprise). The economics of such a design requires a separate study, but my gut feeling is that it would scale very well with time. I believe that the design works much better with a larger pool of users. The authors shouldn't read too much into the $5/month figure given to them by the users. Users generally don't know how much value a service has for them until after a longish time interval. I felt that the authors packed a little too much into the paper, which made them skim over some of the details. I felt that LocateIt was another very important contribution and would have liked to read more about it.
Soylent: Soylent's most important contribution was the Find-Fix-Verify model, which provides a design pattern for researchers building crowdsourced applications. Though it is a fairly well-known technique among crowdsourcers, it is good for someone to codify it as a pattern.
In terms of the system itself, I would have liked the authors to contrast the results with results obtained from professional copywriters in terms of time spent waiting and cost. The presentation style of the conclusion was a good touch.
The VizWiz paper presents a talking application for blind mobile phone users that offers a new way to get answers to visual questions in nearly real time. Particularly interesting about this research is the quikTurkit concept: an idea to shorten latency and improve the user experience by keeping workers in a constant "loop" of (useless) work so that incoming queries get answered faster. This is an important tradeoff because it increases the cost of the service significantly. To avoid that, the service could partner with a requester who has similar tasks and push the time-sensitive queries to those workers when needed.
The Soylent paper presents a word processing interface that brings MTurk workers into MS Word on demand to help with complex writing tasks such as paragraph shortening and error prevention. It consists of three parts: Shortn, Crowdproof, and the Human Macro. Interesting here is the potential for information loss caused by specialized terminology and lack of context. For example, the first worker could misinterpret a paragraph (e.g., about a movie) and change the title and the focus (e.g., to a similarly titled book). The following workers cannot identify the mistake because the first worker's suggestion is consistent in itself. Unless the original author provides enough context, this can turn out to be a rare but significant issue.
VizWiz demonstrated an assistive use of crowdsourcing markets. It used an iterative process to improve image recognition, but what if a user talked with a group of workers? That may raise additional issues such as session management and worker allocation, but I think a stronger feeling of a direct relationship may lead to more satisfaction in this use case and encourage more use.
Soylent showed how writing support from crowds can be integrated into a program. The Find-Fix-Verify pattern and the Human Macro are especially interesting. For example, they could be applied to other forms of interaction between users, not only in a crowdsourcing market. In addition, I think more feedback and an incentive/guidance system would improve satisfaction with the results.
Authors' Replies to Student Comments
Michael Bernstein (Soylent)
Optimal Pricing. Some people claimed that pricing seems to be the primary driver of completion time. Writ large, this is likely true -- see Mason & Watts HCOMP '09 for the most rigorous study we have so far. But, when we're talking about interactive applications, payment may actually be less important. The reason: for many interactive systems, we care most about getting a small number of answers quickly, not answers to 10,000 questions. I've done some unpublished studies of payment on Soylent completion time, and we see that the first few folks arrive no matter how much you pay. Here's a graphic showing what happens in Crowdproof: http://dl.dropbox.com/u/2398832/crowdproof-wait.png Notice that there are a few people who accept the task in under two minutes no matter how much I offer. Paying more just speeds up the latecomers. For further evidence of this, see Chilton et al., "Task Search in a Human Computation Market", which suggests that lots of Turkers use the Most Recent sorting to find tasks, regardless of price.
Evaluation. There were a lot of different suggestions for how to extend the evaluation section of the paper. All of these make sense, and some of them we are trying: combining Find+Fix into a single step and seeing how much of a difference there is, comparisons to professional proofreaders or editors, comparisons to conventional tools, and comparisons to other patterns (e.g., Iterate and Vote). Another one that nobody mentioned but we worry about is convergence: the paper makes no guarantee that Crowdproof won't just take your money if you pass it an already-perfect paragraph. Perhaps the takeaway here is that we are still evolving a set of metrics for how to evaluate crowd-powered interfaces. Which are the most important questions to answer, first? Which are second-order effects? What does the class think?