Systems for Crowdsourcing
- 1 Readings
- 2 Discussant's Slides and Materials
- 3 Reading Responses
- Paper 1: CrowdDB: Answering Impossible Queries. Michael Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, Reynold Xin
- Paper 2: CrowdSearch: Exploiting Crowds for Accurate Real-time Image Search on Mobile Phones. Tingxin Yan, Vikas Kumar, Deepak Ganesan, Mobicom 2010
Discussant's Slides and Materials

Reading Responses
The CrowdSearch paper suggests interesting techniques for optimizing crowdsourcing services and for applying them efficiently in mobile applications. Its main claim to fame, though, is its ability to solve the notoriously hard problem of image search, which seems a bit much to hope for. The hard problem in image search is answering search queries where the input is an image and returning a proper list of suggestions. Since CrowdSearch's product uses a SEARCH ENGINE to perform this task, it seems more like an optimization layer than an image search engine. It's not really clear how this product is unique to image search, as opposed to, say, regular web search, which could also be optimized using a human in the loop (to select the most relevant result, for example). Furthermore, it's not clear why this product should be specific to mobile applications. The component I found most compelling was the delay prediction model, which I'm sure could be useful for many crowd applications.
The CrowdDB paper presents the design and implementation of a database system that uses crowdsourcing to process "difficult" queries. I think this paper is an interesting proof of concept and illustrates a nice approach to combining database queries with the crowd. However, I believe the paper leaves some open questions. One of the paper's benefits is the ability to perform cost-based optimization. While that might be nice, I wonder whether the costs wouldn't explode when implemented at a large scale. Along the same line, I believe that timeliness and the (total or current) number of available workers might not be sufficient for a service to process queries in a real-world application. In a nutshell: two use cases were presented, but a real-world context was missing.
The CrowdSearch paper presents an accurate image search system for mobile phones as a response to inadequate technological solutions. The paper describes a quite well-developed and mature prototype. However, I have two concerns and a question: a use-case concern, a user-interface concern, and an open question in the reasoning. Firstly, the researchers present an elaborate solution to a technology problem, but even though they've developed an actual iPhone application, it is unclear to me in what situations actual users would use such an application and whether they would be willing to pay for the service. Secondly, a lot of work is done in the back end to minimize the time to show results, but the user interface gives no idea of what price would be appropriate (and what impact it would have on the time-to-complete). Thirdly, I can't follow the argumentation around the trade-offs. The paper points out three trade-offs in particular: accuracy of search results, (monetary) cost, and battery power consumption. There are also two categories of "sub-trade-offs": the crowd side (delay, accuracy, and cost) and the back-end side (energy, delay, and accuracy). In particular, the reasoning for the crowdsourcing is that multiple validation would increase accuracy ("[...] can reduce error [...]") and reduce delays. This is confusing because, intuitively, multiple validation increases the delay / time-to-complete, since another worker needs to look over the result.
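The accuracy side of that trade-off can at least be made concrete. As a rough sketch (my own illustration, not the paper's model): if each validator is independently correct with probability p, the chance that a majority of k validators is correct is a binomial tail, so extra validations do buy accuracy, but each one is an extra task for a worker, which is exactly why the delay claim is confusing unless the validations run in parallel.

```python
from math import comb

def majority_accuracy(p, k):
    """Probability that a strict majority of k independent validators
    (each correct with probability p) gives the right answer.
    Assumes an odd k so there are no ties."""
    need = k // 2 + 1  # smallest strict majority
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(need, k + 1))

# With 80%-accurate workers, accuracy rises with more validators:
# majority_accuracy(0.8, 1) -> 0.8
# majority_accuracy(0.8, 3) -> 0.896
# majority_accuracy(0.8, 5) -> 0.94208
```

The numbers make the trade-off visible: going from 1 to 5 validators raises accuracy from 80% to roughly 94%, at five times the cost.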
The CrowdSearch paper suggested several models to predict delay and algorithms to optimize delay and cost. In the experiments, they evaluated their models and the effectiveness of the system in terms of recall, monetary cost, and energy consumption. I think the paper covered many issues well, from low-level to high-level problems, except the crowd itself. For example, how could the performance of individuals in the crowd be improved? Are workers on AMT utilized as well as they could be? It is more than just an interface question: what is their motivation? We need to listen to them.
CrowdDB is more interesting and more focused on job management. It considers individuals as computing components (which is not new) and interprets crowdsourcing by analogy with database systems. I think it is meaningful as a way of mapping crowdsourcing work onto traditional computing tasks. Still, it seems hard to cover every need and provide a single abstract layer.
CrowdDB: In this work, the authors describe how they have extended traditional databases to solicit and include crowd work in answering queries. This paper is a great example of how a hybrid system can be produced by leveraging the somewhat disparate abilities of machines and people: they use machines to analyze lots of data, but they leverage people's abilities to gather data and provide subjective judgments. I really liked their section on crowd microbenchmarks; they provided a bunch of interesting insights beyond what Ipeirotis published in his survey of MTurk. I'm also very interested in hybrid systems. I am most intrigued by the last sentence in their conclusion, where they suggest that in some cases you might prefer to solicit information from the crowd even though it is also available online. This dovetails with a comment I heard recently from a CrowdFlower employee, who said that some crowd programmers prefer to use the crowd to do things that we already know how to do with computers. I think crowd/cloud scheduling is an interesting open question.
CrowdSearch: The authors of this paper produced an image search system which uses a combination of human input via MTurk and computerized image search. This work is important in that, like CrowdDB, it explores what can be done with the combination of humans and machines. I was really impressed by their models of the Turkers' behavior. I also liked that the authors provide some simple, easily understandable knobs to the user (price, deadline) so that the system feels less opaque. I'm very interested in modeling and would like to use something similar in my own work.
The CrowdDB paper presents the design of CrowdDB, a relational database system used to answer queries that are impossible to answer without the help of humans on Mechanical Turk. The authors discuss two cases where this is needed: unknown or incomplete data, and subjective comparisons.
Overall, I thought the idea of CrowdDB was quite interesting, and it really has the potential to solve many of the issues with retrieving information from standard relational databases. While their experiments confirmed a lot of assumptions about the Mechanical Turk platform specifically (e.g., higher rewards result in faster responses; reward matters more in smaller communities of workers that are competing for tasks), the experiments would have to be repeated a number of times over different day-parts, days of the week, or months, though the authors did acknowledge these concerns in the paper. I'd also be curious to see whether any of these findings could be extrapolated to crowdsourcing platforms beyond MTurk. In general, though, I felt that CrowdDB shows good promise in helping to solve information-retrieval issues with databases.
This paper details CrowdSearch, an image search system for mobile phones that combines automated image search with real-time human validation of the search results. The authors claim 95% precision, monetary savings in comparison with non-adaptive schemes, only a 5-6% delay, and savings in energy consumption.
My general thought when reading this paper is that I'm not sure how useful this is for people. How often are people really doing intense image search on their phones? For me, it's pretty rare that I do this task on my mobile device. Furthermore, the authors state that "The explosive growth of camera-equipped phones makes it crucial to design precise image search techniques." I'm not sure how critical this is; phone resolutions will likely increase substantially over the next few years, making the main benefits of CrowdSearch less useful.
But the biggest issue I have with CrowdSearch is the cost and delay of image search. The authors suggest that a notification could be sent to the user while they are doing other tasks so that they don't have to just wait, but I find that, in general, immediate results (or as immediate as possible given the constraints of the network) are what matter most for mobile tasks like search.
CrowdDB is an interesting example of abstractions and primitives that extend SQL and databases to incorporate new crowdsourced operations. The paper also describes real-world experiments conducted on MTurk, assessing response time, monetary cost, result quality, and the way jobs are posted. I like the special DB-like constructs introduced; it reminds me of how computer scientists used to write explicitly parallel code in the 1970s. The authors also describe how web interfaces that solicit data and resources from the crowd are generated automatically.
In terms of open questions, I would challenge the authors to think vertically up the stack. To me, the main benefit of incorporating crowdsourcing at the DB level is this: it gives programmers an easy way to use the familiar business logic <--> DB connectivity (ODBC-like) and transparently use crowdsourcing without having to learn new things. However, given the latency of a typical crowdsourced answer, how would ODBC-like constructs behave? What modifications would the programmer have to make in the way they think about DBs?
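One way to square ODBC-style connectivity with minutes-long crowd latency would be to make crowd-backed queries return a handle instead of blocking a synchronous cursor. A minimal sketch of the idea (all names here are hypothetical illustrations, not CrowdDB's actual API):

```python
from concurrent.futures import ThreadPoolExecutor, Future

class CrowdConnection:
    """Hypothetical connection wrapper: crowd-backed queries return a
    Future immediately instead of holding the caller for minutes."""

    def __init__(self, backend, workers=4):
        self._backend = backend                  # callable: sql -> rows
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def execute_async(self, sql) -> Future:
        # The caller polls or attaches a callback, rather than keeping
        # a synchronous ODBC-style cursor open while workers respond.
        return self._pool.submit(self._backend, sql)

# Usage with a stand-in "crowd" backend:
def fake_crowd_backend(sql):
    return [("IBM", "Big Blue")]  # pretend a worker answered

conn = CrowdConnection(fake_crowd_backend)
future = conn.execute_async("SELECT * FROM companies")
rows = future.result(timeout=5)   # or future.add_done_callback(...)
```

The programmer-facing change is exactly the one the question above hints at: queries become asynchronous jobs, and application logic has to be structured around eventual results rather than immediate cursors.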
I also liked the graphs (the results for the experiments). I am interested in building simple models, and, in fact, in questioning whether there can be generalized, simple models for Mechanical Turk.
CrowdSearch was a very interesting paper in that it takes a stab at a simple mathematical model for delay prediction. I am wondering how generic this model is. Also, the authors say that it's 60% better to post 5 questions at a penny each as opposed to 1 question at 5 cents. That is surprising, and I'd like more details (when they posted the task, whether they restricted it to people in the States, etc.). Also, the authors mention that certain jobs (like tagging) are unpopular. Is the price-latency trade-off the same for unpopular tasks? I want to understand the model better and use it in my research (an A/B testing framework).
The CrowdDB paper is framed as a way of answering questions outside the scope of the database and as a way of handling dirty data. However, I'm not sure the toy examples presented in the paper really illustrate a realistic use case. Few scenarios in which someone would typically execute a relational database query over a small set of data are delay-tolerant or error-tolerant enough for this to seem like a viable option. Instead, the concepts in the paper seem more useful to me as a methodology for using the crowd to collect or refactor structured data. It provides a reasonable way for developers or database users to specify a schema using familiar RDBMS syntax and semantics and populate it by executing queries. In general, it's an interesting and different systems metaphor for thinking about the crowd as computation. If TurKit is the crowd as processor/state machine, and CrowdDB is the crowd as relational database, are there other systems metaphors like these that might be productive?
I do have one big question regarding the system. The example prompts shown in the examples in the paper are devoid of any information that would provide additional context for the task - yet much more context (beyond just the database query) is necessary for a worker to produce a reasonable answer to most questions of this form. For example, if I'm entering a professor or department - does this need to correspond to a particular university? If I'm asked whether "IBM == Big Blue" - is the equality operator asking if the two strings are literal matches? If the two are the same legal entity? If they're used to refer to the same company in everyday speech? Based on the description of the experiments later in the study, I expect that more detailed prompts were given to the Turkers. However, generating more detailed prompts and providing additional context for workers seems difficult to do automatically given only a schema. I'd be interested to hear how this was handled in the prototype system and what the implications are for doing this at a larger scale.
The CrowdSearch paper was quite a bit drier, and didn't raise as many questions for me. It's interesting to see more elaborate modeling and optimization applied to the problem of task allocation. If this sort of microtask-oriented crowdsourcing is going to take off in the business world or if it's going to scale well, an awful lot of people will be fixated on optimizing these tradeoffs in the next few years.
CrowdDB is a relational database that allows users to use the crowd to populate tables and run queries over those tables. I liked the idea of inserting the crowd into a query plan, but I did have some questions about the way tasks are presented to workers. In particular, it seems like it would be easy to post impossible tasks if schemas aren't complete. For example, if a query were run over a professors table that included a department but not an institution, the results would likely be unusable.
On another note, the costs (both financial and temporal) of executing an inappropriate query, for example one that returns more records than necessary, are much greater when it comes to the crowd. Unless the user knows the schemas well, it's not clear whether a given query will access the crowd or not. I know it's not part of what the system is exploring, but perhaps previews of crowd-query plans, in particular the interfaces, might be useful.
I would be interested to see these experiments run on more difficult queries, such as relevant image finding, or answers that are naturally ambiguous. The study was run on a fairly straightforward task. Error rates might be a lot higher on more esoteric tasks.
Finally, I agree with the approach to be lenient to workers and give out bonuses for good work. Robert Kosara, in the visualization community, has come to the same conclusion with his MTurk experiments.
CrowdSearch is a system for image search that uses MTurk workers to validate the results of automatic image queries. I think it was a good idea to keep the task simple for Turkers. There's more incentive to pick it up and it's just as easy to answer in good faith as in not, which reduces bad-faith results.
However, while there's a good contribution in modeling the delays, I have serious doubts that this model can be transferred to more complicated tasks, such as those of Soylent or even CrowdDB. Their tasks are very simple and require a single button press. It's unlikely they have many Turkers who abandon tasks, for example, and responses take negligible time. I would expect that more complicated models are necessary for more complicated systems.
CrowdSearch is a mobile application for performing image search by combining machine and human effort. I liked how it used machines and humans in a pipeline: machines crunch a bunch of data, and the crowd (via AMT) performs verification to produce the final results for each query. Crowdsourcing research should continue to include an understanding of how and when to use human effort; we've seen several examples already showing that people are good at verification tasks. I also liked the idea of using a query's deadline to reason about how many concurrent tasks to post to AMT, which affects total cost. Also, the authors briefly mentioned that a query could include GPS information or tags; I wonder how the crowd would take advantage of such information in their verification task.
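That deadline-driven reasoning can be sketched with a very simple model (my own simplification, not the paper's actual predictor): if a single validation's completion time is exponential with mean m seconds, then k parallel postings all miss a deadline T with probability exp(-kT/m), so we can pick the cheapest k that meets a target confidence.

```python
import math

def tasks_for_deadline(mean_delay, deadline, target=0.9):
    """Smallest number of parallel postings k such that at least one
    completes by the deadline with probability >= target, assuming
    i.i.d. exponential completion times (a strong simplification).
    Solves 1 - exp(-k * deadline / mean_delay) >= target for k."""
    k = math.log(1 - target) / (-deadline / mean_delay)
    return math.ceil(k)

# Mean worker response of 5 minutes, 2-minute deadline, 90% confidence:
# tasks_for_deadline(300, 120) -> 6 parallel postings
```

This is where the cost coupling comes from: tightening the deadline from 5 minutes to 2 doubles the number of postings (3 to 6) and therefore the price of the query.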
CrowdDB is a database system that integrates human input to answer a range of queries not possible with a traditional database. The authors take an "operator-granularity" approach in which some database operators are crowd-based while others are machine-based (perhaps using cloud computing). This hybrid technique should allow both crowd and cloud to do what they each do best; the authors point out that just because humans can do quicksort doesn't mean they should. I thought the comparison between crowd and cloud was interesting as well, in particular how individuals develop affinities for certain types of tasks or requesters. Finally, the GoodEnough operator is of particular interest to me, especially finding techniques other than a majority vote to assure quality control.
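One alternative to a fixed majority vote is an early-stopping quorum: stop collecting answers as soon as any one answer has enough agreeing votes, instead of always paying for a fixed number of assignments. A sketch of that idea (my own illustration; CrowdDB's actual GoodEnough semantics may differ):

```python
from collections import Counter

def good_enough(answers, quorum=2):
    """Consume worker answers lazily and return the first answer that
    reaches `quorum` agreeing votes. With agreeable workers this uses
    fewer assignments than always collecting a fixed majority."""
    votes = Counter()
    for answer in answers:
        votes[answer] += 1
        if votes[answer] >= quorum:
            return answer
    # no quorum reached: fall back to the plurality answer, if any
    return votes.most_common(1)[0][0] if votes else None

# good_enough(["Big Blue", "IBM", "Big Blue"], quorum=2) stops after
# the third answer and returns "Big Blue".
```

The design choice is the interesting part: a quorum trades a bounded worst case (you might still need many answers on contentious items) for a cheap common case on easy items.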
CrowdDB extends SQL to use microworkers to do things computers can't. It seems to be a library that is waiting for an application. The paper didn't seem to describe any situation where it makes practical sense to use CrowdDB like a database (especially given the time it takes for the crowd to complete the HITs required by a query) but maybe it is better to think of this simply as a new wrapper around MTurk's API. Like the other MTurk-using papers we've seen, this paper includes some advice on using MTurk and its pool of workers efficiently.
CrowdSearch improves computer image search results by asking microworkers to filter them. It seems like a convenient app to have on my phone, if the delay could be reduced significantly. The paper's main contribution seems to be a new way to adaptively decide when to ask more questions. The binary yes-no tree model is interesting, but I wonder if there's any point in paying attention to the order, or if a simpler model would produce better results.
CrowdDB describes a system for integrating human computation into database queries by hiring workers from Mechanical Turk. It provides a detailed discussion of both the motivation and the method for such a system. The system allows humans to provide answers when a computer might not, because humans are not limited by the closed-world assumption and can also make subjective decisions. I think it's an impressive feat to integrate human computation as seamlessly as CrowdDB does. Automatic user interface generation plays an important role in ensuring this, and its principles provided some insight into designing effective tasks for Turkers in general.
CrowdSearch enables human workers to help with image search on mobile phones. Thus, it is smaller in scope than the CrowdDB paper which addresses general database queries. Its focus is on minimizing cost and optimizing for speed. While I agree with some posts that image search isn't a primary task people perform with mobile phones, I found their delay prediction model and transformation of image search into validation tasks pretty interesting.
Both of the papers were extremely interesting, although I am not sure about the scalability of these applications. In the CrowdSearch paper's delay prediction model, the authors assume exponential arrival times without any justification. The only justification they provide is the CCDF diagram in Figure 3. I am extremely doubtful that the arrival rate is homogeneous, and the proper test to justify their model would have been a Kolmogorov-Smirnov test rather than just a CCDF diagram. Even a probability density function would be more informative than their diagram; almost every distribution looks the same once you start looking at the cumulative distribution function. This model obviously makes everything easier, but it is not clear how accurate it is.
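For what it's worth, the KS statistic this calls for is easy to compute directly. A minimal sketch (my own, not the paper's code) that fits an exponential rate by maximum likelihood and measures the worst gap between the empirical and fitted CDFs:

```python
import math

def ks_statistic_exponential(inter_arrivals):
    """One-sample Kolmogorov-Smirnov statistic against an exponential fit:
    D = max |F_empirical(x) - F_fitted(x)| over the sample.
    Caveat: because the rate is estimated from the same data, standard KS
    critical values are optimistic; a Lilliefors-style correction would be
    needed for a proper hypothesis test."""
    xs = sorted(inter_arrivals)
    n = len(xs)
    rate = n / sum(xs)                          # MLE for the exponential rate
    d = 0.0
    for i, x in enumerate(xs):
        f = 1.0 - math.exp(-rate * x)           # fitted exponential CDF
        # the empirical CDF jumps from i/n to (i+1)/n at x
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d
```

A large D on the observed inter-arrival times would quantify exactly the doubt expressed above, instead of eyeballing a CCDF plot.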
Additionally, I am not sure arrival times can reasonably be considered independent (as assumed in the paper). MTurk arrivals seem to show non-homogeneous behavior, and if that's true, then independence is violated. HIT completions affect other predictors that influence arrivals; for example, the number of available HITs changes after one arrival, which seems likely to change the next, so the independence assumption is not close to reality.
The other issue I might raise for both papers is a measure of confidence in the results. Perhaps, in addition to the row count of traditional SQL, the authors could implement a measure that tells the requester how confident the system is in the results. That might be an interesting addition to both systems.
CrowdSearch: CrowdSearch uses crowdsourcing to improve image-based search on cellphones. The biggest contribution of the paper seems to be the delay prediction model. I believe this delay prediction model is more accurate than a queuing model, but it still needs a lot of work. I also have doubts about the claim that DelayPredict and ResultPredict are independent: long delays might occur because the image is unclear, which would also affect the quality of the results. One criticism I would make is that the paper doesn't take into consideration the information-retrieval models of online search. It would be interesting if the paper took into account that searching is an iterative process for a user, who makes multiple queries one after another until she finds what she is looking for or gives up.
CrowdDB: CrowdDB uses a hybrid of human and computer processing to solve queries that are impossible for traditional DB systems. It's very interesting work, and I am personally interested in this direction of research. Especially interesting is the use of the StopAfter and GoodEnough operators.
CrowdDB represents workers as relational operators to enable a SQL query to return results involving new data and subjective judgement. It's a very cool new approach for RDBMSs; for instance, it shows that entity resolution works. A tiny complaint about the experiments: I don't think there was really any doubt that this would work. Relatedly, regarding the statically chosen task parameters (10 comparisons per HIT, 3 assignments each, at 1 cent): smarter task batching would have been very interesting.
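Under those static parameters, the cost of a comparison workload is easy to work out, which is exactly why the batching choice matters. A quick sketch (my arithmetic, using only the numbers quoted above):

```python
import math

def comparison_cost(num_comparisons, per_hit=10, assignments=3, price=0.01):
    """Total dollar cost under the static parameters quoted in the review:
    10 comparisons per HIT, 3 assignments per HIT, 1 cent per assignment."""
    hits = math.ceil(num_comparisons / per_hit)
    return hits * assignments * price

# 1,000 pairwise comparisons -> 100 HITs * 3 assignments * $0.01 = $3.00
```

Any adaptive scheme that varies `per_hit` or `assignments` per item (say, fewer redundant assignments on easy comparisons) moves one of these three factors, so the potential savings are directly multiplicative.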
The discussion about the operator level of granularity used to engage human workers made me think more broadly about how a DBMS should or can engage crowds. First, I don't think the choice is really between query- and operator-level granularity. I think the case that must first be made is whether entity resolution (data cleaning) or getting new information (data collection) needs an RDBMS interface at all. I can't think of why, but I would have liked to be educated on that. Second, I'm thinking about the large run-time differences between an RDBMS and human operators. In this scenario, the computer can afford to spend an order of magnitude more cycles optimizing not only inter-human actions but also intra-human actions. I would have liked to see the latter considered. Maybe that leads to a third, alternative approach?
CrowdSearch is about doing image search on a mobile device under money and latency constraints. What's interesting is the latency prediction model. The charts are very pretty. But as a user, I think I'd be pretty happy with the top hits on Google, rather than waiting at all...