Labor Issues and Incentives

From CrowdsourcingSeminar

Reading Responses

Sally Ahn

Designing Incentives for Inexpert Human Raters: In this paper, Shaw et al. conduct a study on how various social and financial motivational frameworks affect the quality of content analysis tasks on Mechanical Turk. Their results show that the two conditions that produce a significant, positive effect on quality are the punishment-for-disagreement condition and the "Bayesian Truth Serum." I found this result somewhat surprising and very interesting, especially because the various reward and punishment conditions appeared to be quite similar. The authors' theory that framing the condition as a punishment rather than a reward creates a greater impact on the workers due to the possibility of rejected work (and subsequent banning) makes a lot of sense. I think this shows that even subtle differences in the wording of a task can yield significant improvements in quality. Their explanation for BTS (creating confusion, in addition to cognitive demand, among the workers about how they would be evaluated) also sounds plausible; why confusion should act as a motivator is an interesting psychological question. Overall, I thought these experiments were well designed and their analysis was quite thorough.
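For context, the "Bayesian Truth Serum" asks each worker both for an answer and for a prediction of how often others will give each answer, and rewards answers that turn out to be more common than collectively predicted. A minimal pure-Python sketch of Prelec-style scoring follows; the function name, smoothing constant, and example data are illustrative, not taken from the paper:

```python
from math import exp, log

def bts_scores(answers, predictions, alpha=1.0, eps=1e-9):
    """Prelec-style Bayesian Truth Serum scores (sketch).

    answers[i]     -- index of the option respondent i endorsed
    predictions[i] -- respondent i's predicted frequency of each option
    """
    n, k = len(answers), len(predictions[0])
    # Empirical endorsement frequencies, lightly smoothed to avoid log(0).
    x_bar = [(sum(1 for a in answers if a == j) + eps) / n for j in range(k)]
    # Geometric mean of the predicted frequencies for each option.
    y_bar = [exp(sum(log(max(p[j], eps)) for p in predictions) / n)
             for j in range(k)]
    scores = []
    for i in range(n):
        # "Surprisingly common" bonus for the endorsed option.
        info = log(x_bar[answers[i]] / y_bar[answers[i]])
        # Penalty for predictions that stray from the empirical frequencies.
        pred = alpha * sum(
            x_bar[j] * log(max(predictions[i][j], eps) / x_bar[j])
            for j in range(k)
        )
        scores.append(info + pred)
    return scores
```

Workers whose answer is more common than the crowd predicted score highest, which is what makes truthful reporting a good strategy even without any ground truth.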

Sellers’ problems in human computation markets: This paper changes the perspective of crowdsourcing research by shifting the focus from the requesters' desire for cost-effective, high-quality work to the workers' problems with unscrupulous requesters on Mechanical Turk. The authors investigate this problem by reading turkers' messages on Turker Nation, many of which lament the abundance of spam/scam tasks and the lack of control the workers have over how their work is evaluated (which directly affects whether they get paid). I think some of the suggestions collected for a hypothetical "Turkers' Bill of Rights" are reasonable and may effectively diminish some of these problems. For example, enforcing a low baseline pay for all completed tasks sounds reasonable to me. This doesn't exactly solve any of the problems: a worker could still be unfairly denied the rest of his pay, and an unscrupulous worker would still receive some payment. Still, I think it would discourage the massive growth of spam/scam tasks, and the requester could still maintain control over quality through his power to reject work (even though a poor worker receives the baseline wage, such workers will accrue bad ratings, and the previous paper showed that most workers are very concerned about the possibility of rejection).

Philipp Gutheim

The "Sellers' problems..." paper addresses a valid concern: the dependency and increasing exploitation of MTurk workers by scammers and spammers. Although the paper addresses an important issue, I do not feel comfortable writing a serious comment on a paper whose contribution starts with anecdotal storytelling à la "and then spamkid94 goes 'XYZ'" and ends with a set of open questions. I am well aware that this research area, as the paper points out, has hardly been tackled at all. However, I believe there are more rigorous, academic ways to approach a new research topic.

The paper "Designing Incentives for Inexpert Human Raters" investigates how the quality of results differs with respect to financial and social incentives for the workers. As one of its findings, the paper points out that an international set of workers has significant implications. This stems, for instance, from inter-cultural differences as well as contextual ones. For example, US workers might have better context for tagging an image than Indian workers (hypothesis: more US workers use Facebook, read about it in blogs, the media, etc.).

Travis Yoo

Designing Incentives for Inexpert Human Raters: the paper conducts a very interesting experiment that treated workers differently and presents an in-depth analysis. A similar example in the real world would be using different terms and conditions in contracts between employees and employers. In the real world, work contracts involving simple tasks usually include punishment conditions (such as "I'll deduct 10% from the total bonus if you break more than 5 dishes!"), and it is known that contracts with such conditions are generally more effective at making workers perform better. The paper's results are consistent with these real-world findings.

In designing different incentives, I think referring to various social experiments in behavioral economics (especially ones related to incentives) would be interesting. There may be unpredictable differences that only appear on AMT; however, considering the characteristics of the AMT marketplace, I think incentives that work in real-world markets would generally work on AMT and in other human computation markets as well.

Sellers' problems in human computation markets: the paper addresses issues related to the sellers (workers) on AMT. The paper identifies the problem of uncertainty associated with HIT payment and argues that it stems mainly from the prevalence of requesters with malicious intentions. The authors attribute the cause of the problem to AMT's design decisions, which successfully attracted workers to the marketplace but attracted scammers at the same time.

This problem mainly stems from the socio-technical gap that we haven't closed yet (and perhaps never will) in such systems, and the problem's nature suggests that it will not be definitively solved any time soon. As in other research areas in HCI, understanding workers' behavior and the connection between workers and the system will be critical to solving this problem. As the authors say, looking into studies already conducted on other systems (such as the low quality of online Q&A or email scamming) would also be useful for making progress on this problem.

Kurtis Heimerl

Designing incentives for inexpert human raters Interesting paper. The basic gist is that they present a variety of incentive mechanisms to the users and measure their relative effectiveness. I had some qualms about the methodology, probably because I was recently deeply embroiled in writing a Turk paper of my own. Some were minor (failing to reject a null does not suggest anything), some were major (chi-squared is generally not appropriate for discrete distributions without more analysis; maybe they did that analysis), and even more were probably because I don’t know what I’m doing.
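One standard caveat along these lines: Pearson's chi-squared statistic relies on an approximation that breaks down when expected cell counts are small (the usual rule of thumb is at least 5 per cell). A minimal sketch of the statistic with that sanity check, using made-up counts rather than anything from the paper:

```python
def chi2_stat(observed, expected):
    """Pearson's chi-squared statistic with the textbook small-count check."""
    if any(e < 5 for e in expected):
        # The chi-squared approximation is unreliable when any expected
        # count falls below ~5; an exact test is preferable in that case.
        raise ValueError("expected counts too small for chi-squared")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical 2-category example: 12 agreements and 8 disagreements
# observed, against an expected 10/10 split.
stat = chi2_stat([12, 8], [10, 10])  # (2**2)/10 + (2**2)/10 = 0.8
```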

The result is interesting enough that none of those qualms really matter. First off, the variety of incentive methods was very broad and pretty awesome. They found two incentive methods worked the best: punishment for bad answers and a complicated betting scheme. I didn’t understand the latter, and I can’t imagine the turkers did either. This was discussed in the text at points: simply making the turkers confused about payment likely improved the results. Yet it was not only significant, it was the best. Does this mean the results are questionable? Science should say yes; data without a hypothesis doesn’t say much. The fact that cheap talk-surveillance did the worst is equally surprising. I mean, that’s the canonical method, right?

Given this, though, I’m pretty surprised there weren’t more positive results. I mean, plenty of the instructions were confusing and didn’t have any positive impact. My rambling can end now, but this was interesting.

Sellers' problems in human computation markets This whole paper contradicts what I found in my interview with an Indian BPO operator. Basically, he argued that while scams were prevalent, they actually paid. This was the ONLY way to make money on turk, in fact. SEO tasks paid very well and he took them very seriously. Those guys also made no use of these toolkits, and didn’t seem particularly interested in using them either.

Shouldn’t we be asking the users what they want? Not just those that self-selected onto turkernation and turkopticon either. That’s more along the line of Bill’s work, I guess.

Nicholas Kong

Designing Incentives for Inexpert Human Raters In this paper, Shaw et al. explore a number of social and financial incentives for MTurk using a content analysis task. The most interesting part of this paper for me was the wealth of sociological insight and theory the authors brought to bear on their experiment design. The sample size was also impressively large, especially for an HCI paper (but maybe not so much for a sociology paper).

The authors found that residence in India and web skills had the strongest effect on performance, stronger than any of the other treatments. I wonder if re-running this experiment limiting the subject pool to US-only residents would help tease out some of the subtler effects between the treatments. I also think it would be interesting to approach the problem in the other way: find what motivations work well with in-person teams, and try and recreate such environments in MTurk.

Sellers' problems in human computation markets This paper is a short call-to-action to HCI researchers to consider crowdsourcing systems from the workers' perspective. The authors use anecdotes from Turker forums to describe some of the problems Turkers face, such as delay to payment and difficulties in distinguishing fraudulent HITs. The questions they raise point to potential new directions in crowdsourcing research, but they don't really go into the issues in much detail.

Beth Trushkowsky

"Seller's" problems raised a lot of interesting issues about the trials and tribulations of people who sell their labor in human computation markets, in particular on Mechanical Turk. Many of these problems we've already discussed, and I've been working on some of them myself, e.g., paying on time, not rejecting labor unfairly, having clear task instructions. I was surprised the "HITs you should NOT do" Turker Nation forum post said to avoid surveys and quizzes, as those tools are used a lot by researchers who want to understand the labor pool and how heterogeneity affects quality and experiment repeatability. More surprising, however, is that there are so many HITs from requestors that violate Amazon's terms of use (and the rules of the "do not do" list). As a requestor, I'd expect the crowdsourcing platform (i.e., AMT) to enforce its terms of use and penalize those requestors; this problem shouldn't be as rampant as it is.

"Designing Incentives" describes a series of experiments where the authors used many different incentive methods for a Mechanical Turk content analysis task. I really appreciate their "kitchen sink" approach (i.e., trying fourteen methods). I was disappointed that many of the methods didn't improve quality significantly, but I thought it was interesting that the "punishment-disagreement" condition worked while the "reward-agreement" condition didn't. I agree with the authors' analysis that the punishment version implies future consequences for turkers, since they can be banned from the site after many rejections. I think the difference is also related to the micro-task nature of the labor: a reward on one particular task is never worth the risk of never being able to work on many more tasks. All in all, an interesting paper; however, I wonder how much of the results were influenced by the type of task (content analysis). A next step would be trying the same kitchen sink approach on other task types.

Wesley Willett

Designing Incentives for Inexpert Human Raters describes a series of experiments in which the authors tested the impact of a number of different incentive strategies on Turk workers. The number of strategies tested and the sample size used in the experiment are both pleasantly large. However, their results, even with such a large population, are not particularly clear. In fact, the results seem pretty noisy, with many treatments giving good results in some tasks and very poor results in others in an unsystematic way. At best, the differences between the treatments were subtle. One potential problem is that the content analysis questions they asked (particularly the "content rank/rate" and "user rank/rate" tasks) were extremely subjective. This means that even well-intentioned workers could easily give a bad result, and it helps explain why workers performed very poorly on average. I suspect they would have gotten more discriminating results if they had tested prompts with more objective responses. These particular prompts may also explain the success of the BTS treatment, which gave better results on the rank/rate tasks in particular. This makes me wonder if it gave better results on the ranking tasks because it set up a sort of Keynesian beauty contest that encouraged participants to pick the perceived majority value judgement rather than their own.

Additionally, I was disappointed to see that the authors looked only at the impact of each strategy on a single task/worker. I suspect that the efficacy of some of these strategies will vary dramatically over time, depending on how strictly penalties and rewards are enforced or how well social connections are maintained. For example, the effectiveness of the betting, tournament, and BTS schemes should vary as workers get a feel for how much their performance actually impacts their result or payout. Likewise, social incentives should presumably decay if the humanization and social connections are not maintained.

Sellers problems in human computation markets, meanwhile, gives a high-level survey and discussion of the issues facing workers on Mechanical Turk. The authors raise a lot of interesting and relevant points, and the open questions they pose are general, but the work is very AMT-specific, and I think it leans too heavily on an analysis of a single (albeit prominent) forum thread. Assuming we are interested in getting into the nitty-gritty details of Turk (as opposed to looking at a broader range of crowdsourcing platforms), I'd be more interested in seeing quantitative data that gives a sense of what portion of the work on Turk is actually disingenuous. That said, as a short survey of the worker-centric issues facing work and research on Turk, I think it does a pretty good job.

Yaron Singer

Thanks, everyone, for the great responses. See you in the discussion today.

Prayag Narula

Sellers problems in human computation markets I really liked the sellers paper, based simply on the fact that it has a lot of qualitative data, and I don’t agree with some of the points raised about the statistical validity of the findings. It is about the power-turkers who want to help newbies avoid the fraudulent ‘buyers’. The data is very useful for researchers, and new platform builders can use it as guidance on how to be fair towards the turkers.

Designing Incentives for Inexpert Human Raters The paper discusses various incentive methods and their effectiveness in improving the quality of work on crowdsourcing platforms. I have two major gripes about the paper: 1. The incentives they discuss are very economic in nature. From my own research, tasks that allow for personal development are incentive enough for people to accept even lower wages; something like this is not touched upon in the paper at all. I think economic incentives alone are too narrow to motivate workers, especially in a case like MTurk where the payment is very low. 2. The incentives the researchers tested were fairly complicated. Workers who are short on time would probably skip a lot of the instructions, and that seems to be what happened based on the results: it appears a lot of the workers simply ignored the instructions while doing the tasks. Failing to reject the null here probably doesn’t prove much either way.

Siamak Faridani

I enjoyed both papers, especially the one on sellers' problems by Lilly Irani. It was an example of an interesting research question that actually makes an impact in the real world and makes things a little easier for workers on AMT.

The first paper, by Shaw et al., had some issues with its hypothesis testing. They have 14 treatment groups, and 2 of the treatments were shown to be significant. This setting really reminds me of XKCD's comic about green jelly beans causing acne. Their number of treatment classes is large, and I think they would need to increase the number of participants dramatically (and correct for multiple comparisons) to get meaningful results from that experiment.
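The jelly-bean worry can be made concrete: at a 0.05 threshold, testing 14 treatments independently is expected to produce about 0.7 spurious "significant" results by chance alone. A minimal sketch of a Bonferroni correction; the p-values below are made up purely for illustration:

```python
def bonferroni(p_values, alpha=0.05):
    """Return which hypotheses survive a Bonferroni correction.

    Dividing the significance threshold by the number of tests keeps
    the family-wise false-positive rate at or below alpha.
    """
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Made-up p-values for 14 treatments; only the two smallest survive
# the corrected threshold of 0.05 / 14 ≈ 0.0036.
p_vals = [0.03, 0.20, 0.45, 0.04, 0.60, 0.75, 0.12,
          0.33, 0.002, 0.50, 0.81, 0.09, 0.26, 0.001]
survivors = bonferroni(p_vals)
```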

Chulki Lee

Designing Incentives for Inexpert Human Raters presents a controlled experiment comparing the effects of 14 different incentive systems. I like the broad coverage of incentive systems. I think the differing effects of the incentive systems might be explained better by appealing to certain characteristics of MTurk, such as the lack of a (good) reputation or trust system.

Sellers’ problems in human computation markets identified various issues in MTurk, particularly from the sellers' perspective. I totally agree with the argument that the "bad" phenomena do not just exist in human computation markets but are spread across the Internet, and that at the same time these markets have characteristics distinct from other spaces on the Internet. I am just curious how they can be distinguished from other information exchange systems: are they more like an online version of labor markets, or another online information exchange system with financial components? I think most papers we have discussed until now focused on the former perspective, and this paper gives us an opportunity to think about the latter view.

Kristal Curtis

Designing Incentives for Inexpert Human Raters:

In this work, the authors investigate various incentives and determine how they impact answer quality. It was very interesting that the Bayesian Truth Serum approach worked the best. I'd be interested in applying this to a subjective task to see if it would work just as well. I was concerned that they computed ground truth by finding consensus among the research assistants, who were very likely a homogeneous group and therefore more susceptible to groupthink. I would be interested to see whether the results would hold if the ground truth came from a more heterogeneous, and presumably more robust, group.

Sellers' problems in human computation markets:

In this work, the authors explored various factors that contribute to a worsened experience among Turkers. It's interesting that they focused on Turkers' problems, whereas most work in this area looks at requestors' problems (e.g., quality, throughput, spam). I agree that HIT payment can be very delayed, based on my experience doing the homework assignment for this class, where we had to perform work as Turkers. It would be interesting to study whether Turkers' trust in a particular requestor leads to better result quality.