Intelligent User Interfaces

From CS260Wiki

Lecture Slides


Extra Materials

Shneiderman, B. and Maes, P. 1997. Direct manipulation vs. interface agents. interactions 4, 6 (Nov. 1997), 42-61.

Krzysztof Z. Gajos, Jacob O. Wobbrock, and Daniel S. Weld. Improving the performance of motor-impaired users with automatically-generated, ability-based interfaces. In CHI '08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, pages 1257-1266, New York, NY, USA, 2008. ACM.

Discussant's Slides and Materials

Reading Responses

Bryan Trinh - 10/24/2010 14:04:32

Principles of Mixed-Initiative User Interfaces

This paper introduces mixed-initiative user interfaces: essentially a mix of direct manipulation with an automated recommendation engine. The recommendation engine is intended to reduce the time needed to complete tasks relevant to the user, and it adapts over time to suit the needs and habits of a particular user.

By enumerating a set of factors necessary for the development of mixed-initiative interfaces, the author gives UI designers a well-thought-out checklist against which to judge their own designs. The only thing I would add to the list is that such a system takes considerably longer to develop, and developers might be better off creating a UI that can be navigated with less error.

Providing recommendations that evolve over time also prevents the construction of chunking patterns. This is especially true for expert users, who use the system with high enough articulatory proficiency that automation is not needed. In my own experience, these sorts of systems are very useful when first learning a program: they serve as a way to explore the capabilities of the application. Once I was aware of the common functionality, I found faster ways to execute the tasks I intended.

Sikuli: Using GUI Screenshots for Search and Automation

This paper presents an image-based approach to searching digital documents and scripting user-interface actions. By building a library that uses OpenCV image recognition as a document search tool, the authors were able to build a rich set of programming tools and applications.

This system changes the way one can think about and articulate code. The semantic distance is reduced to the point where most non-programmers can provide a good guess as to what each function is doing. Their screenshot automation system is also extremely extensible, creating value for a number of different types of users. It would be useful for UI programmers who need to test edge cases for various button combinations.

One thing they did not sufficiently address is the limitations of using image recognition. After looking at some of the demos online, one notices the sluggishness of the image recognition functions. Because image recognition is so CPU-intensive and requires much more time to execute, there is an inherent trade-off in choosing this system over a text-based one. Essentially, we trade human cognition time for computer processing time.
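
The cost asymmetry described above is visible even in a naive template-matching sketch: locating a w×h pattern in a W×H screenshot costs O(W·H·w·h) pixel comparisons, whereas a text lookup is essentially constant-time. A toy pure-Python illustration (grayscale images as nested lists; this is not Sikuli's actual matcher, which uses OpenCV):

```python
def match_template(image, pattern):
    """Return the (row, col) of the best match by sum of squared
    differences. The four nested loops make this O(W*H*w*h), which
    is why pixel-based search is so much slower than a text lookup."""
    H, W = len(image), len(image[0])
    h, w = len(pattern), len(pattern[0])
    best, best_pos = None, None
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            ssd = sum((image[r + i][c + j] - pattern[i][j]) ** 2
                      for i in range(h) for j in range(w))
            if best is None or ssd < best:
                best, best_pos = ssd, (r, c)
    return best_pos

# A 4x4 "screenshot" containing the 2x2 "icon" at row 1, col 2:
screen = [[0, 0, 0, 0],
          [0, 0, 9, 8],
          [0, 0, 7, 9],
          [0, 0, 0, 0]]
icon = [[9, 8],
        [7, 9]]
print(match_template(screen, icon))  # -> (1, 2)
```

Real matchers use optimizations such as normalized cross-correlation and image pyramids, but the fundamental work scales with the number of pixels, not the number of characters in a query.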

Kurtis Heimerl - 10/25/2010 13:30:16

Principles of Mixed-Initiative User Interfaces

This paper provides a set of principles for enabling mixed-initiative interfaces: interfaces that mix automation and direct user interaction.

I wish I could ack Bill Gates. All of my current papers just thank Telemundo, for totally valid reasons. Anyhow, I've been involved in some IUI work, mostly Kuang's Usher project. The real difficulty with these interfaces, as far as I can see, is that all analysis has to be longer-term, because there's a terrible feedback loop in the statistics. Some interfaces seem reasonable until they've been used for a few months: users optimize their behavior for the system, quickly skewing the statistics and eroding the value. That's what I remember, anyhow.

This paper was good; there's not a lot to say. I'm not sure about their 12 points; that's a lot of points for an introduction. It probably makes more sense in the context of the research community. The idea that there were people arguing "no, we must not automate anything! Let the user do everything directly!" versus "the optimal system just guesses what the user wants; they need only hit the power button!" is silly. Of course these approaches support each other: guess, and ask when you don't trust your guess. Anyhow...

Sikuli: Using GUI Screenshots for Search and Automation

This paper describes Sikuli, a system for using screenshots to search (instead of keywords) and to automate GUI tasks.

The first half of this paper was rubbish. As far as I can tell, they demonstrated that highlighting an area is faster than typing. I don't think speed is a reasonable metric for this work, and I'm very unclear on whether they allowed copy-paste. Either of those issues throws the whole search-engine comparison into question.

The automation engine, though, is badass. I think they chose the wrong language: classical (e.g., Python) programmers generally won't need to do this sort of automation. However, for people with limited computing skills, this could be a godsend. I know of plenty of users in developing regions who could get a lot of utility out of basic UI automation: automating installations, data entry, and so on. I want less of an action-oriented script though, which they discussed a little bit. I want the UI to be reorganized by the script, perhaps placing data-entry items in the same location to ease entry. A toolkit to do that would be great.

Linsey Hansen - 10/26/2010 7:56:50


In the first paper, the authors describe a new method for searching and creating action scripts by using screenshots of GUI objects. This visual method is called Sikuli.

One thing I might have missed is whether people are able to include words with their screenshot search. While taking a screenshot might be easier when you have no idea what something is, and therefore want as much information on it as you can get, I would think that words would be necessary if you want to do a specific thing with it.

I do not believe I had ever heard of the concept of visual words until I read this article, and they seem really neat. However, I am not clear on whether the "word" is equivalent to the colors of a pixel, or whether there is some other sort of encoding. Regardless, I wonder if it would be easier to give GUI objects some sort of extension that shares information about the object, so that selecting one just copies/pastes that information and it can be searched more easily; that seems like it might be easier on a computer than image processing. This would also help with the problem of customized themes among users, so that they could share scripts more easily. At the same time, though, the image-processing approach is easier because it does not require people to add anything extra to an interface.

Principles of Mixed-Initiative User Interfaces

In this paper the author, Horvitz, discusses how to create "mixed-initiative" interfaces, where the user is able both to directly manipulate parts of the interface and to use automated services.

In the LookOut example, I felt that what was being described is identical to the "rules" that can now be set in most programs (especially mail clients), where the user just selects an event in some level of detail and then specifies actions to be carried out by the computer; in Horvitz's model this is equivalent to specifying the user's "goal" for the event.

One blind spot (or at least what I feel was a blind spot) regarding LookOut was the author's assumption that the email program needs to search for any possible sequence of words that could imply a meeting. Considering that nowadays we can just create meeting attachments to send in an email, trying to find arbitrary words seems silly. It would make more sense to have a set syntax for meetings: a user could even have a tool called "create meeting" that adds a properly formatted item near the end of the email, and LookOut could then just search for items with that syntax.
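
The set-syntax idea proposed above could be as simple as a fixed, machine-readable line that a hypothetical "create meeting" tool appends to the email; the mail client then needs only a regular expression, not free-text inference. A sketch (the `MEETING:` marker format is invented here for illustration):

```python
import re

# Hypothetical marker line that a "create meeting" tool might
# append near the end of an email body.
MEETING_RE = re.compile(
    r"^MEETING:\s*(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2})\s+(?P<title>.+)$",
    re.MULTILINE,
)

email_body = """Hi all,

Let's sync on the demo next week.

MEETING: 2010-11-03 14:30 CS260 project sync
"""

m = MEETING_RE.search(email_body)
if m:
    # Deterministic extraction -- no probabilistic goal inference needed.
    print(m.group("date"), m.group("time"), m.group("title"))
    # -> 2010-11-03 14:30 CS260 project sync
```

The trade-off is adoption: a regex only works if every sender's client emits the agreed syntax, which is exactly why Horvitz's system falls back to probabilistic inference over arbitrary prose.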

Siamak Faridani - 10/26/2010 9:47:15

Principles of Mixed-Initiative User Interfaces

This paper starts by pointing to the idea of direct manipulation from the earlier paper we read. Using the principles he has studied, Horvitz designs and implements a system for scheduling meetings, seeking to combine former research on direct manipulation with interface agents and automated reasoning.

Horvitz points out a number of factors in integrating automation and direct manipulation. In brief, they are as follows: (a) the automated service should provide value over alternatives that use only direct manipulation; (b) it should be aware of uncertainties in user goals; (c) it should be aware of the state of the user's attention; (d) it should be able to find the ideal action; (e) if it is not clear about the user's intention, it should be able to engage in a dialogue with the user; (f) the user should be able to bring the automated agent into the loop or take it out; (g) it should be able to alert users to poor guesses; (h) it should be able to adjust its precision to help the user when inference goes wrong; (i) it should understand that the user may want to extend the task beyond what the agent has suggested; (j) it should follow social expectations and norms; (k) it should keep a log of recent actions; and (l) it should expand its capabilities as it observes user behavior.

The details of the LookOut system reminded me of how Gmail and Google Calendar work together; it seems that at least Google has put this research to good use. For example, they have implemented a text-parsing algorithm that infers the date and time of a meeting from the text. It is not clear to me how they run their inference algorithm. Horvitz mentions that LookOut assigns a probability to the user's goal, but not what kind of probability. It seems he explored a number of alternatives, for example a naive Bayes classifier and a support vector machine, and that the final system uses an SVM. What he does tell us is that the system has three modes of action: (1) do not engage, (2) make a suggestion, and (3) go ahead and invoke the action. LookOut selects among these options based on the probability thresholds it calculates by equating equations 2 and 3. The author also does not clearly describe how the learning process happens.
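
The three-mode behavior described above amounts to two probability thresholds: below the lower one the agent stays quiet, between them it makes a suggestion, and above the upper one it acts autonomously. In Horvitz's model the thresholds fall where the expected utilities of adjacent actions cross; the numeric values below are illustrative assumptions, not the paper's:

```python
def lookout_action(p_goal, p_suggest=0.3, p_act=0.8):
    """Map the inferred probability that the user has the goal
    (e.g., wants to schedule a meeting) to one of LookOut's three
    modes. The threshold values here are made up for the sketch;
    in the paper they come from equating expected utilities of
    adjacent actions."""
    if p_goal < p_suggest:
        return "do not engage"
    elif p_goal < p_act:
        return "make a suggestion"
    else:
        return "invoke the action"

for p in (0.1, 0.5, 0.95):
    print(p, "->", lookout_action(p))
```

Because the thresholds depend on the relative costs of false alarms versus missed opportunities, a more intrusive action (acting autonomously) rightly demands a much higher probability than merely suggesting.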

The article is a combination of theory and practice: Horvitz starts by highlighting factors that should be included in a mixed-initiative system and builds a system around them. There is no analysis showing that the system is efficient, and the article reads like show and tell. I am still wondering what percentage of the system actually made it into Microsoft Outlook, and I wish the author could have provided more details about his machine learning models.

Sikuli Paper

It was interesting to see a paper about Sikuli among our readings. Sikuli is an interesting system that I use almost every day for a number of tasks, and I didn't know it was presented at CHI. I just use it because it is a very efficient way of automating everyday tasks, mainly to keep my AirBears connection alive. The problem with AirBears is that it logs you out after a certain number of hours, so a Sikuli script can log into AirBears every couple of hours and keep it connected.

Sikuli is implemented in Jython (Python running on the JVM), and their Linux implementation is still buggy. They use libraries like OpenCV to perform the image analysis. Most of the examples in the paper are also available on their website, and I wonder why they have not updated them, although people have built interesting demos using Sikuli. I specifically like the one that plays Bejeweled.

Sikuli's main idea is to combine visual and textual directives to provide a rich scripting platform for automating tasks. They even follow the same idea in the paper itself, which combines images and inline text to communicate with the reader. In terms of technology it is much simpler than the LookOut tool: there is no learning component or SVM inference, but the rich Python programming platform means anyone can implement their own inference tools. There is nothing to the decision making beyond setting a threshold for pattern matching, and their user study seems to be there to convince CHI reviewers that they have done their homework.
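
The threshold-based decision mentioned above amounts to accepting a candidate match only when its similarity score clears a cutoff. A minimal sketch (the normalized scores and the 0.7 default are assumptions for illustration, not Sikuli's actual internals):

```python
def best_match(scores, threshold=0.7):
    """Given candidate match locations mapped to normalized
    similarity scores in [0, 1], return the best-scoring location
    that clears the threshold, or None if nothing does."""
    hits = [(score, loc) for loc, score in scores.items()
            if score >= threshold]
    return max(hits)[1] if hits else None

# Scores a template matcher might assign to screen locations:
candidates = {(10, 20): 0.92, (40, 5): 0.71, (0, 0): 0.30}
print(best_match(candidates))        # -> (10, 20)
print(best_match(candidates, 0.95))  # -> None
```

Picking the threshold is the whole "inference": too low and the script clicks the wrong widget, too high and it fails on minor rendering differences.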

Dan Lynch - 10/26/2010 15:33:46

Principles of Mixed-Initiative User Interfaces

This paper frames and characterizes the debate between using automated services, or interface "agents," and using direct manipulation to access information and explicitly invoke a service. The paper defines a framework with which to weigh the costs and benefits of each, and how each affects the user, based on the user's intentions and the probability that the right choice was made. In the end, they came up with a well-defined algorithm with which to characterize the optimization of interactivity using these agents.

This is important because these interfaces are ubiquitous today. For example, the iPhone's autocorrection drives me nuts! I can't tell you how often it attempts to correct something that does not need correcting. However, there are times when it's beneficial. For example, Xcode offers code completion, where types and variables are suggested; but there are times when it autocompletes when you don't want it to, causing errors.

I think it comes down to the fact that people will adapt and eventually learn how to use these interfaces to their benefit, even if at first they were an impediment. In the long run, they can be beneficial.


Sikuli is a system that uses GUI screenshots to automate tasks on a personal computer. Developed at MIT, the system can essentially execute code using visual variables supplied by a user. However, it is limited to what is visible on the screen; no hidden items can be manipulated.

This idea is important only in certain contexts. To limit something to its visual appearance is to put too many eggs in the syntax basket; the semantics are completely left out. A Sikuli script may not survive a different desktop theme, and certainly not an operating-system update that changes the visuals. The concept is cool, but perhaps not too useful.

Airi Lampinen - 10/26/2010 16:38:05

Horvitz's paper is an ambitious attempt to bridge research on developing new metaphors and tools that enhance the potential for direct manipulation and research focused on developing interface agents that provide automation. The paper looks for synergies between these two areas of investigation, illustrating key ideas with a scheduling and meeting-management system. Horvitz favors coupling automated services with direct manipulation. This approach makes sense, since the problems of purely automation-based solutions seem to run deep; on the other hand, as we spend more and more time interacting with computers and other devices, automating tasks that do not necessitate our attention and time makes perfect sense. What I find really interesting in this topic is finding the right balance between the two, especially when the likely mistaken inferences of any automation process are taken into account: what would the optimal solution (in terms of user satisfaction) look like? Even when automation works effectively, there are moments in IT use where people want to be in control and choose for themselves. Catering to these moments while automating the ones where we really just want the computer to take care of the issue is a tricky challenge.

Yeh, Chang & Miller describe in their paper on Sikuli a way of using screenshots both for search and for automation. Besides explaining how Sikuli works, they present results from a user study indicating that searching by screenshot can be easy to learn as well as faster than specifying keywords. Furthermore, the authors illustrate the automation aspects of their work with a number of examples of tasks that are suitable for scripting, such as map navigation and bus tracking.

The paper is an interesting example of how interfaces could be developed to better account for the fact that people are not purely verbal beings, but recall and understand things in terms of pictures and visual representations as well. After all, humans have been interacting with pictures and symbols, and using them to interact with one another, for much longer than we have been able to work with textual information.

I was left wondering who the target user group for Sikuli is. One could expect that offering it to expert computer scientists might be hard: when the relevant expertise is there, text-based commands are often (or are at least perceived to be) more effective than manipulating a graphical user interface. The benefits for experts may also be questionable: one of the biggest advantages of screenshot search seems to be the ability to find information about something whose accurate name one does not know, and experts are likely to have mastered the correct terminology of the GUI they are working with, whereas this may not be true of end users in general. Still, based on the description, the system seems somewhat complex for lay users.

Shaon Barman - 10/26/2010 16:55:58

Principles of Mixed-Initiative User Interfaces

The author of this paper explores how automation and prediction affect the user's experience and productivity.

This work provides a theoretical framework for how automating some part of the user interface, based on predicting a user's goal, affects the user. The model uses the utility of the prediction to find a cutoff probability that maximizes the expected utility of showing or not showing the automated tool. One thing lacking in this paper was user studies and data showing how users liked or disliked the tool. Utility is a good theoretical concept but is difficult to measure in real life. It would have been nice if they had bolstered their argument with data showing how well the tool predicted scheduling decisions, or even tested what false-positive rate users were willing to tolerate (through something like Wizard-of-Oz testing). I would predict users would tolerate very few false positives, even if the disruption were minimal, because it is out of the user's control.

It seems such a tool would be more successful today if it were built into the cloud, as with Gmail and Google Calendar. The tool could train on each individual by looking at how they use the two products and correlating calendar events with email. By training on the large data set of Gmail users, it would probably be much more accurate. Overall, the research raises important questions about how users are affected by tool automation and how to model these interactions.

Sikuli: Using GUI Screenshots for Search and Automation

The authors create a visual search engine to search documentation about GUI elements and create a scripting language to perform automation scripts.

Overall, I really liked the techniques in this paper. I have recently been doing a lot of GUI programming with the Java Swing package, and it is difficult to get started unless one is already familiar with the toolkit. Using visual search, a programmer can easily learn about GUI programming by seeing existing GUIs and figuring out how they work. But the evaluation seems like it could have been expanded. It appears the participants were just asked to find articles related to the dialog box, not asked a particular question. Since the visual search uses more information than the text search, it seems intuitive that the visual search would surface articles more quickly; what is more interesting is whether that information is useful. Also, the experimental setup is prone to bias, since the participants were asked to compare the two types of search and knew that the visual search was made by the authors.

The scripting language also seemed very useful. Creating UI tests is quite difficult, and much testing is done with little automation. These scripts could be used to test programs and websites for correctness as well as usability (for example, how many motions it takes to find a particular button). One evaluation criterion lacking in the paper is how fast the visual scripting language is and whether it would scale to running such tests; it isn't quite clear how long the image-recognition step takes.

Drew Fisher - 10/26/2010 16:58:14

Sikuli: Using GUI Screenshots for Search and Automation

Sikuli is a system that leverages the use of GUI screenshots to empower search and scripting. While this is an interesting concept, I'm concerned by a couple of things:

  1. The methodology for evaluating Sikuli against keyword searches was poor:
    • Every test provided was known to have a solution easily found by Sikuli in the searched set. This is unlikely to be true in general for software problems.
    • The search provider now has to index image contents to do image comparisons, effectively requiring crawling and building a reverse image search akin to TinEye or Google Goggles. While possible, it's hard to say this is much less work than requiring a11y (accessibility) text labels, which are a fundamental part of many software toolkits today.
  2. Most people probably don't want to learn a programming language to accomplish tasks.
    • For most of the example applications, pure Python, JavaScript, or a bash script would have been simpler to write to get the job done, and would thus have appealed more to the people who are willing to program.

Thus, my take: it's a neat concept, but I don't see any real improvement from the system.

Principles of Mixed-Initiative User Interfaces

The list of "critical factors" at the beginning feels like a laundry list, rather than a particularly well-thought-out set of guidelines. I think most of them can be summed up by combining the Principle of Least Astonishment with "When in doubt, don't do anything."

Further, users don't want to "engage in rich dialog" with their computer. They want to get things done, and have the computer get in the way as little as possible. Joel Spolsky puts it pretty well, and given the commercial failure of Clippy, anthropomorphizing the computer has been shown to be the Wrong Approach™.

That said, the concepts discussed in this paper are still quite valuable, even if Microsoft got the implementation wrong. Making these automations easy for the user to kick off or approve seems to be an approach that people find quite valuable. A good example of a subset of LookOut is Gmail's "add to my calendar" option, which likely performs a very similar bit of natural-language processing. In this fashion, it's quite easy for the user to say "yes" with a single click, without having to engage in an inefficient dialog with the computer.

I think a key point the authors missed is that by letting the system default to "don't act" while presenting the user with the option to "act" in a nonmodal, noninterruptive manner, we can achieve the benefits of the "dialog" case without incurring the costs of the "no action" case.

David Wong - 10/26/2010 17:23:49

1) The "Principles of Mixed-Initiative User Interfaces" paper discusses the idea of using both direct manipulation and automation in user interfaces. It proposes several critical factors when designing these types of interfaces and demonstrates the LookOut system as an instance of a mixed-initiative interface. The "Sikuli" paper discusses their new visual search interface and scripting capabilities. It goes over the architecture of the system, a user study, and sample scripting applications.

2) The "Principles of Mixed-Initiative User Interfaces" paper doesn't contribute anything new to the HCI community, but it does highlight an important perspective: that direct manipulation and automation can go hand in hand. The paper was written in '99, so back then it probably had more influence. The LookOut system it describes sounds like the way Google Calendar parses emails for possible items to place on a calendar, a feature I use a lot. As another proof-of-concept paper meant to illustrate an idea, it was sufficient to get its point across.

The "Sikuli" paper takes a novel approach to search, and I believe it makes a substantial contribution to the HCI literature. While previous work had made advances in this area, the Sikuli system illustrated the value of visual search. I think that in conjunction with textual search there is a lot of potential for improving search. It captures the idea of more direct manipulation in defining a user's search query. The visual scripting idea is also interesting and inspiring, and the vision techniques they implement are well thought out and sophisticated. More work in this field could inspire a new generation of search.

3) The "Principles of Mixed-Initiative User Interfaces" paper addresses a problem that was probably novel at the time. As such, it took a good approach in establishing the needed features of a mixed-initiative system and qualifying that with their own system as an example. They claimed to have studies that backed their set of principles, but never gave the actual data. As such, their argument is conceptually sound, but whether it stands in practice is another question.

The "Sikuli" paper presents a strong case for the potential of visual search and visual scripting. The user study they conducted was quite small (only 12 participants), but it demonstrated that a system like Sikuli could work. Their visual-scripting samples, although limited to the visible portion of the screen, demonstrated potential scripting power. They acknowledged their limitations, and if they were to address them, the system would be more convincing. Altogether, the development of Sikuli was well motivated by previous research and the past limitations of computer vision. Their analysis of the system was thorough and well thought out.

Pablo Paredes - 10/26/2010 17:31:06

Summary for Horvitz, E. - Principles of Mixed-Initiative User Interfaces

The paper describes a series of twelve principles that would help guide an HCI designer toward a fair combination of automated agents and direct manipulation.

I believe many of the principles are just special cases of previous ones. Minimizing the cost of poor guesses and agent-user result refinement sound more like corollaries of the principle of working under uncertainty, while others, such as maintaining a working memory of interactions, tracking the status of user attention, and learning by observing, are to me necessary steps for inference in light of costs, benefits, and uncertainties.

I believe the attempt to unite a couple of themes being discussed by the HCI community responds more to a need to publish a compendium tying two tendencies together than to a focus on solving real problems. At the core of HCI is making interactive experiences adequate for completing tasks in a given context. Whether to use agents or enable direct manipulation sounds to me more like a problem of tool choice than of defining a design paradigm that captures the real problem and determines the best use of any available tool, based on principles attached to the problem definition rather than to the solution.

I believe this is one example of finding solutions to the problem of selecting solutions, rather than focusing on the problem, or defining the right problem, which would lead to adequate solutions. Probably this type of analysis was relevant 12 years ago, and perhaps even now, but it is just not aligned with my perspective of finding the right problems first, rather than nice solutions to weakly defined problems.

Summary for Yeh, T., Chang, T., Miller, R. - Sikuli: Using GUI Screenshots for Search and Automation

This paper describes a tool for using images as part of search queries, rather than only word tags. The paper describes the system architecture, made of three components (a screenshot search engine, a UI for querying, and a UI for adding screenshots), as well as a scripting language that benefits from fuzzy matching algorithms, which provide flexibility in the types of tasks to be performed (find, patterns, regions, actions, visual dictionary, editor). The system shows that matching using image components can be more efficient than using text alone.

Although I find the idea promising, and there could be interesting applications for illiterate people, I personally did not see annotation queries (the base case) as a big, serious problem, and I find the extended examples somewhat uninteresting. Additionally, it is not evident that the performance test actually demonstrated advantages: they did not explain what types of queries were requested. I can imagine that finding, say, blue icons against an ill-defined background could be hard to describe with text, but I question whether this type of problem in a simple screenshot is ever a big issue for the person interacting.

I really wish the authors had focused on some other problems this tool could solve, such as teaching children or supporting illiterate people. Showing that they can use images to find other images in a still set of pixels, and presenting that as a big change in the search paradigm, demanded some more creativity beyond just describing an annotation problem.

Charlie Hsu - 10/26/2010 18:03:09

Principles of Mixed-Initiative User Interfaces

This paper challenges the traditional debate between direct manipulation interfaces and interface agents that provide automation, and attempts to couple the two focuses into a "mixed-initiative user interface". The paper provides some principles for designing mixed-initiative UIs, and then demonstrates an implementation of one as an automated scheduling enhancement to Microsoft Outlook.

I felt that some of the principles outlined by the paper were not exclusively "mixed-initiative" principles. Some were clearly problems that automated interface agents need to deal with anyway: minimizing the cost of poor guesses, considering uncertainty, etc. However, since direct manipulation is now considered an alternate avenue of user input, the cost/benefit model of taking an automated action has new avenues to explore. Direct manipulation offers new information for the interface to learn from, and it offers the user a chance to refine results, creating new actions for the interface to take besides simply performing an automated one (e.g., querying the user, or showing a preview of the action).

I found many of the design choices in LookOut interesting and well chosen. Degrading the interface's goal when not enough information is available is a great idea. Allowing the user to preview a decision via mouse hover is another great decision, eliminating the possibility of costly backtracking. The text-to-speech implementation of a metaphorical "butler" was also interesting, but I wish the paper had explored more of when it is a good idea to query the user (when the mail window is active? when sound is turned off? in a private environment?).

Many of the considerations in this paper brought to mind Gmail's integration with Google Calendar. I have found that Google Calendar does not offer as many previews of automated actions as would be helpful, and thus creates costly backtracking problems when adding new events. Though Gmail offers an easy way to add events to Google Calendar after parsing email text, changing those events before they are added isn't possible, and by this paper's analysis it should be.

Sikuli: Using GUI Screenshots for Search and Automation

This paper describes Sikuli, which allows users to use GUI screenshots for search and visual scripting. The paper introduces the technology, with implementation details and usage examples. The paper also contributes a user study verifying the hypotheses that screenshots are faster than keywords for formulating queries about GUI elements, and that screenshots return equally relevant results.

Sikuli's visual scripting API was particularly impressive to me. I found that using GUI screenshots was a very strong abstraction that could mask application development and GUI structure from a user simply looking to increase his control over a machine. Using Python's relatively high-level syntax helps make this sort of superuser capability much more accessible to non-programmers. In the context of contributions to human-computer interaction, GUI screenshots as a new input structure for scripting certainly bridge the gulf of execution, by removing the need for application-development knowledge in application scripting as well as lowering articulatory distance.

I was a little disappointed with the user study. I felt that it would have been good to see some examples of the dialog boxes shown to the users. Judging relevance of search results is a highly subjective task, and in the case of Sikuli, search of GUI elements seems to be used for looking up icons in relatively expert applications. What sort of search result relevance were Craigslist recruits asked to judge? (hopefully not what the "Dodge/Burn" Photoshop buttons do…). Showing what the dialog boxes prompted and some examples of user input would have been much more insightful.

On a more meta note, I found the introduction to the paper very clear and well organized. I particularly liked how the authors clearly defined the contributions made in the paper: both the script and search technologies, the user studies, and examples for integration. I also felt the introduction provided strong motivating examples, referencing the use of direct visual references in the real world.

Thejo Kote - 10/26/2010 18:04:08

Principles of mixed initiative user interfaces:

In this paper, Horvitz observes that when the paper was written, HCI researchers broadly fell into two camps - those who believed in the development of user "agents" which exhibit intelligence and perform tasks on behalf of a user and those who focused on new conventions in the direct manipulation of user interfaces. He argues that the right approach is a judicious mix of the two and presents his approach through a description of the LookOut system, an add-on to Microsoft Outlook.

Horvitz lays out a number of principles for the effective integration of agents and direct manipulation interfaces. They include significant value addition, considering uncertainty about the user's goals, and a class of principles related to getting the timing of agent actions right. He describes the implementation of the LookOut system, which incorporates these principles, and the probabilistic models used to achieve them.
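Horvitz's cost-benefit reasoning about when an agent should act can be sketched as a small expected-utility calculation. This is only an illustrative sketch, not the paper's actual model: the action names and utility values below are invented, and a real system would tune them from user feedback:

```python
def expected_utility(p_goal, u_if_goal, u_if_not_goal):
    """Expected utility of an action, given P(user actually has the goal)."""
    return p_goal * u_if_goal + (1 - p_goal) * u_if_not_goal

def choose_action(p_goal):
    """Pick the agent behavior with the highest expected utility.

    Illustrative utilities: automating correctly is most valuable, but
    automating wrongly is costly; a dialog is a safe middle ground;
    doing nothing is neutral.
    """
    actions = {
        "do nothing": expected_utility(p_goal, 0.0, 0.0),
        "ask user":   expected_utility(p_goal, 0.6, -0.1),
        "automate":   expected_utility(p_goal, 1.0, -1.0),
    }
    return max(actions, key=actions.get)
```

With these numbers the agent stays quiet at low confidence, opens a dialog at intermediate confidence, and acts autonomously only when it is quite sure — the same thresholded behavior LookOut exhibits.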

The use of probabilistic models has obviously seen many useful applications. Specifically, user-agent tools have been very successful in cases like spam detection, which is pretty much an automation of human action. But I don't think any system has seen a lot of success in providing a mixed-initiative user interface like the one envisioned by Horvitz in LookOut. That does not take anything away from his framework, which I think provides a good set of principles to follow.


Sikuli: Using GUI screenshots for search and automation:

In this paper, Yeh and co-authors present a screenshot-based search and automation system. Visual search allows a user to submit a part of the screen being viewed to the system, which then returns relevant results. Relevance is determined by the visual characteristics of the image itself, text in the image extracted via OCR, and text around the image. They describe the information retrieval approaches adopted in the system and test their hypothesis that Sikuli search is faster than traditional search, which involves coming up with the right keywords - not always an easy task.

The second part of the system allows automation through the incorporation of images in the script. A user can specify script behaviour based on screenshot elements. The system uses computer vision techniques to match relevant areas on the screen. Their examples demonstrate why it is better than traditional macro-based systems, which depend on the position of objects on the screen.
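The core idea of matching a screenshot pattern against the current screen can be sketched in plain Python. Sikuli itself uses fuzzy computer-vision matching (template correlation and SIFT features) on real pixel data; the exact-match helpers below are invented for illustration only, treating the screen as a small 2D grid of pixel values:

```python
def find_pattern(screen, pattern):
    """Return (row, col) of the top-left corner where `pattern`
    exactly matches a region of `screen`, or None if absent."""
    ph, pw = len(pattern), len(pattern[0])
    sh, sw = len(screen), len(screen[0])
    for r in range(sh - ph + 1):
        for c in range(sw - pw + 1):
            if all(screen[r + i][c + j] == pattern[i][j]
                   for i in range(ph) for j in range(pw)):
                return (r, c)
    return None

def click_center(screen, pattern):
    """Mimic the spirit of Sikuli's click(image): locate the pattern
    and return the coordinates a script would click."""
    hit = find_pattern(screen, pattern)
    if hit is None:
        return None
    r, c = hit
    return (r + len(pattern) // 2, c + len(pattern[0]) // 2)
```

A script built on this only has to say, in effect, "click whatever looks like this image", without knowing anything about the application's internal widget hierarchy or absolute positions.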

I thought this was a very innovative application. With a suitable library of scripts and sufficient effort to package the entire solution so that it is easily installable, I could easily see this becoming popular, given how popular macro applications already are.

Luke Segars - 10/26/2010 18:22:49

Principles of Mixed-Initiative User Interfaces

This paper describes an attempt to merge two major branches of UI research (direct manipulation and automated services) into a single tool called LookOut. The author's description seems to focus heavily on the AI / automated-services component, but an argument is made, and a "best practices" list presented, for producing UIs that merge components of both paths. Merging these two main paths seems both an intelligent and a useful idea for making a significant change to today's interfaces. Both fields have shown promise in very particular situations, but neither seems to be gaining much ground alone.

Considering the significant progress that has been made in both artificial intelligence and sensing in the past couple of decades, it is exciting to think about the possibility of tools like LookOut that can streamline your day-to-day routines. Horvitz and his team have clearly put a lot of thought into the automated decision making that the system performs, in an attempt to keep another Clippy incident from resurfacing.

While the merging of the two fields may hold promise for the future, I am hesitant to put a lot of support behind the idea of automatic decision making; there have simply been too many failures that fall well on the strong side of the obnoxious-and-rarely-useful spectrum. The author seems to have considered a number of conservative thresholds and heuristics to keep the user from being annoyed, and it is interesting to see some of the techniques suggested (timing the agent's lifespan based on the confidence of a decision's probability was really cool), but I still don't have confidence that a digital assistant would be more useful than annoying. Horvitz himself states it as one of his design principles: agents must "match social expectations for a benevolent assistant." I'm not sure that we're there yet.

Arpad Kovacs - 10/26/2010 18:23:36

This paper introduces the LookOut scheduling and meeting management system for the Microsoft Outlook email client as an example of how to combine automated services with direct manipulation and probabilistic models. Unlike fully-automated systems, this combination of direct manipulation and automation will prompt the user in case of uncertainty, and therefore will become more useful by reducing errors at minimal cost (distractions) to the user, and learning from its mistakes. The system scans the user's email and guesses meeting times/dates based on the content of the message; however, LookOut prompts the user before adding the appointment to the calendar, and allows the user to edit its guesses directly. The most innovative part of the paper was the hands-free mode, which uses text-to-speech and voice recognition technology, as well as the probabilistic learning model, which prompts the user at different levels depending on the degree of certainty/accuracy of the proposed calendar entry.

Overall I think that the main insight of the paper is that error-prone automation is worse than no automation at all, and thus graceful degradation in case of uncertainty is the optimal solution. I also think that adaptive automation, which takes context and degree of certainty into account, is an excellent idea; in particular I liked how, if Outlook cannot determine a specific appointment time from the message, it will estimate a range of times and prompt the user instead. However, I think that the animated helper agent is a bad idea; it reminds me of the "Clippy" paperclip from older versions of Microsoft Word, which attempts to make the software more anthropomorphic and "friendly", but as far as I am concerned, it is only a distraction. I think that a better interface would be for LookOut to simply highlight detected appointment dates/times in the message, and on right-clicking the detected dates/times bring up the calendar with automatically populated fields and ask for confirmation. This would allow the user to select only certain appointments to add (for example, an email may contain multiple dates/times, but what if only one of them is relevant to the user?). In addition, user-initiated automation would also enable the system to ignore certain irrelevant messages (for example, a highly automated system would be vulnerable to a spammer who decides to fill up the user's calendar with advertisements). So in summary, I think that this system makes a good attempt at making automation more useful, but I would only use such a system if it is highly configurable and acts more like a macro (the user initiates the task to be accelerated, and can monitor its actions), rather than running on its own with the potential to be exploited.

Sikuli is a visual search and automation system that allows the user to write simple scripts based on screenshots and simple interaction actions.

The main contribution of the paper is an existence proof: it is possible to build a screenshot-based interface for searching and scripting that is quick and intuitive to use. The search system uses the SIFT feature descriptor, as well as 3-grams yielded by optical character recognition to create a summary of the user-specified screenshot and the strings it contains, and compares it to a database of screenshots from computing books.
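The 3-gram trick is what makes the OCR side of this pipeline robust: even if a few characters are misrecognized, most of a word's 3-grams still match. As a rough sketch (using Dice-coefficient similarity, which is my assumption; the paper's actual indexing and weighting may differ):

```python
def trigrams(text):
    """Character 3-grams of a string, lowercased."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def similarity(query, candidate):
    """Dice coefficient over 3-gram sets: 1.0 = identical, 0.0 = disjoint."""
    q, c = trigrams(query), trigrams(candidate)
    if not q or not c:
        return 0.0
    return 2 * len(q & c) / (len(q) + len(c))
```

An OCR misread such as "resoluticn" still scores high against "resolution", while an unrelated label scores near zero, which is why noisy screenshot text can still retrieve the right dialog.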

I really like the idea behind this system, and would like to try it out in my free time (fyi, it is available for download). I am surprised that it is so accurate, since I was expecting a much higher rate of false positives. Most dialog boxes have a very consistent layout and standardized UI widgets, the actual setting values (like screen resolution) change frequently, and labels are duplicated between dialog boxes, all of which I thought would throw off the SIFT/OCR matching. I was also surprised that the 50k images from just 102 books are sufficient to cover 70% of their test dataset. It is likely that the authors designed and tested this system only for widely-used core applications (e.g. Windows Explorer, Office, Photoshop). However, in my experience the built-in documentation for such applications is already quite comprehensive and up-to-date.

Instead, I think that this screenshot-based system would be ideal for looking for help on popular open-source projects (e.g. Blender, GIMP), which often have poorly documented built-in help and generally place a lower emphasis on usability. Admittedly, solving that problem would require a much larger database of images, but in my opinion it would provide a much more useful contribution, since here is a domain where users absolutely need to search online help, and would greatly benefit from faster screenshot-based search.

The automation component also looks very cool, although I think that it makes many assumptions (e.g. that the button/action I am looking for is visible and not occluded by another window, or minimized). I myself would probably not use it, since text-based programming languages and APIs behave in a more deterministic manner and have higher performance. However, this could be very useful to people who do not know how to program, and just want to perform a mechanical task faster.

I found it interesting that the authors initially hypothesize that screenshot-based search would be faster, but the paper does not contain any quantitative data to back up this claim. Perhaps the unfamiliarity of the new interface slowed users down, although the authors did observe that users' query speeds improved with practice.

Aaron Hong - 10/26/2010 18:28:25

In "Principles of Mixed-Initiative User Interfaces" by Eric Horvitz of Microsoft Research, he talks about the "power of direct manipulation and potentially valuable automated reasoning." He does this through his discussion of "LookOut," an extension to Microsoft Outlook which automatically tries to schedule events in an unobtrusive and helpful way.

Granted, I think that with some good machine learning and some good design, these kinds of mixed-initiative user interfaces can be pretty useful. They can perform high-level tasks without engaging the user, essentially being a very helpful assistant. However, whether computers can ever do that reliably is a big question. It's funny how Horvitz even said "An agent should be endowed with tasteful default behaviors and courtesies that match social expectations for a benevolent assistant." I think only so much can be learned and designed into a system like this. Ultimately, what is most useful about computers and technology in general is providing an algorithmic function that performs its task well and reliably, rather than a heuristic one. We can do the reasoning. The challenge, then, is how do we make these powerful tools known (when often they are so hidden in the interface)?

In "Sikuli: Using GUI Screenshots for Search and Automation," Yeh et al. talk about helping users learn and annotate GUIs through a screenshot interface in place of text queries. They also develop a visual scripting API as a result, which can be used to drive macro-like scripts.

I think this is a useful idea, and since it works with screenshots (aka bitmap images), the resulting scripting language is general enough to work with any image. The only thing is that it does not exploit the structure underlying the GUI toolkits, since it works at the bitmap level. Theme variations would throw it off, and so would occlusion. Due to its generality (not needing access to the underlying toolkit information), it is limited in achieving the desktop search, annotation, and help it sets out to provide. However, its usefulness is broader now--the API can be adapted for general image search.

Brandon Liu - 10/26/2010 18:34:50

Mixed-Initiative User Interfaces

I was following along pretty well on this paper until it got to the screenshots of the Genie. After that, I couldn't help but consider the paper in a bad light, because of the utter failure that is Clippy. I found the 'butler analogy' to be somewhat flawed, since the value added by LookOut is never more than what the individual would have achieved otherwise. What really needed to happen in this paper was a justification of a 'tool that enhances users' abilities to directly manipulate objects' - this is stated as if it were an absolute positive, while there is not really a discussion of the tradeoffs between more efficient operation and the uncertainty and stress added to the system by popup choices and inference.

Some of the inferences that LookOut makes are problematic. For example, it describes how the user's wait time can be taken in as a feature for inference. However, this assumes that users are completely engaged in the email task, which my intuition says is true only a small percentage of the time. Another suggested feature was using "Umm" and "Uhh" utterances to signal to the system that the user was pondering. This is another huge assumption about the user's goals, to the point where I felt the concept was abusing its knowledge of the world.

It’s funny to consider some of these ideas in the context of two systems, Microsoft’s Clippy and Google’s GMail. Google Calendar suggests you add an event if it detects so in an email message, and the Priority Inbox sorts emails by looking at the content. How these differ from Clippy is that no explicit prompt is presented; instead, the option is presented in an area of the screen that would be blank or unused otherwise.


Sikuli: Using GUI Screenshots for Search and Automation

One issue I had with the paper's evaluation method is how it evaluated searching. It evaluated performance in terms of coverage, recall and precision. What it didn't consider is how the task of search is done as a whole, rather than as one individual query. For example, searching for help is an iterative process where one query is made and then refined. For this screenshot system, the cost of doing an iterative search is very high, since the user must go back and drag another screenshot, while for keyword searches they can add or change a few terms.

Overall, I really liked the idea in the scripting part of the paper of bringing the automation interface directly to what the user perceived; the visual interface. I see this as applying more to novice users and one-time users rather than real end-user scripting, though. I think this approach makes too many assumptions about the quality of GUIs in general; from the application knowledge to what is being presented through the GUI, some domain information is inevitably lost; thus, direct scripting on screenshots works without any knowledge of the underlying model itself. This comes up in web application testing frameworks often, where a testing suite has to be completely rewritten because of a small visual change, even though none of the logic changed.

Anand Kulkarni - 10/26/2010 18:38:34


Yeh et al.: Sikuli

The authors present a new tool for visual search and scripting.

The paper is chock full of awesome innovations. I like the idea of indexing visual screenshots containing text by OCR; this is a unique and useful method. I love the concept of incorporating images into a visual scripting language, since it makes writing image processing programs much easier and lets us operate at a more abstract level. I wish the authors had explained more clearly the pattern-matching of the non-OCR visual features they chose to use; this is itself an interesting technical contribution that could have been made clearer.

Like many papers in HCI, this paper faces the challenge of providing quantitative means for verifying the effectiveness of what is obviously an interesting and innovative system. In all honesty I find this practice in HCI to be somewhat weak -- it forces evaluations that are a bit contrived at times, in cases where the concept is suitably interesting to stand on its own. The authors' experimental validation of visual search is good, since they demonstrate quantitatively that this framework is an improvement. The use of a Likert scale is, as usual, a silly choice; the 1-point improvement offered by their software is within the reported margin of error and undervalues their system. I like that the authors demonstrated their effectiveness in contexts other than their own experimental setup by connecting their work to existing systems.

I particularly like the notion of baby-based verification employed by the authors; I hope to see this methodology applied more frequently throughout the HCI literature.

Horvitz: Mixed-Initiative UI

Horvitz presents LookOut, an automated agent that makes use of ideas in direct manipulation, and generalizes from this system to broader examples of UI design.

I like the foundational idea of LookOut as a mixed-initiative agent; there are several important principles here that can be used to generate new kinds of agents that are more helpful. I appreciate the probabilistic analysis the agent carries out when deciding when to intervene; however, a quantitative examination of its effectiveness is really important here to prove that it's not simply another Microsoft Clippy. I think there's a very natural extension using Mechanical Turk to improve the appropriateness of the agent's choice to intervene, as well as the apparent intelligence it exhibits. I found the early connection to direct manipulation the author makes somewhat inappropriate here -- improved agent automation is a better description of the paper's contribution.

The real difficulty with this paper is that it claims to be presenting general ideas about mixed-initiative user interfaces, when in fact these are better described as features and innovations in LookOut itself. Where's an example of another UI embodying these principles? The author doesn't provide a quantitative validation of LookOut's effectiveness, but gets away with it because the paper isn't about LookOut - it's purportedly about general design principles. LookOut is certainly a great embodiment of the principles that the author suggests, but in order to be an effective validation of these ideas the paper should discuss many other examples, not just one! I like that the author took pains to discuss the many different ways that LookOut embodies these principles, and I also like the quantitative examination of LookOut's threshold. However, to be totally honest, if I were a reviewer I would reject this paper as written; LookOut is an outstanding contribution on its own, but it deserves its own paper with proper analysis; it can't serve as the sole example for any discussion of UI principles.

Luke Segars - 10/26/2010 18:45:04

Sikuli: Using GUI Screenshots for Search and Automation

The paper presents the idea of a graphical search engine based around graphical user interfaces. The system, called Sikuli, is able to gather help material about particular UI elements and maintain it in an up-to-date index instead of relying on a built-in help feature. This is an intriguing idea that offers a lot of promise for complex systems like Photoshop and Blender. Manufacturer-produced help documents can be incredibly lacking, and the amazing success of the internet means that many solutions are available immediately from others who have had the same problem. Getting help on a particular tool or function would lower the learning curve for more complex programs and improve the usability of these systems. Arguments could be made, however, that an 'undo' button gives users the same ability to learn by trial-by-fire instead of through reading tutorials.

I can imagine a couple of circumstances that might be really hard to support technically. The first of these is when users want to select a particularly small element (a button) in an interface. Although these are primarily marked with icons, the region selection itself may be difficult for a vision algorithm to support. Additionally, a large majority of icons are repeated from application to application, such as the "save" disk and "open file" folder. The image capture engine will therefore have to take into account which program the screenshot is coming from, not only how it looks at a higher level. In addition, many questions about user interfaces come up because people don't know how to get to the right UI element in the first place. "How do I...?" questions could not be addressed easily by this system because there isn't any indication of what the user is trying to do graphically represented on the screen.
Also, using this sort of technology for automatically answering dialog boxes seems to be both a roundabout approach to the problem and a potential risk of overautomation. Even if a user did want the computer to decide whether certain files should be downloaded or run, it doesn't seem like it should be necessary to analyze the visual dialog to determine the best action. A more efficient (and possibly more accurate) approach would somehow access the text directly instead of trying to extract it from the image.

Matthew Can - 10/26/2010 18:55:33

Principles of Mixed-Initiative User Interfaces

In this paper, Horvitz lays out twelve principles for designing mixed-initiative interfaces that integrate automated services with direct manipulation. He demonstrates these principles on LookOut, an interface that adds automated services to Outlook’s calendaring system.

I thought Horvitz’s principles were well thought out. Automation cannot simply be tacked onto an existing interface. It has to be integrated effectively. For example, I liked the suggestion of soliciting additional information from the user to resolve ambiguity. This is important because such automated systems often make decisions under uncertainty. I also liked the author’s emphasis on continued learning by observing. It is certainly challenging, if not impossible, to design automated systems with optimally tuned parameters for everyone. This is why it is important to give the system the ability to learn from the user’s behavior. As an example, Gmail does this well with its priority inbox, where people clearly have different ideas of what constitutes an important email.

I wish the paper would have provided more rationale for the social-agent modality, perhaps even providing results that argue for it. It’s not clear that it does anything to improve the user’s experience. Moreover, interacting with the agent through speech is slower than mouse and keyboard interaction, and it creates more ambiguity for the system.

An interesting concept in this paper is that the value of alerts can be enhanced by building models of user attention. One idea is to defer dialog and actions until the user is most likely ready to receive them. I would have liked to see results on how this benefits users. It seems that users might find this timed interaction unpredictable, making it difficult for them to build a mental model of the system behavior.


This paper presents Sikuli, a system that takes a visual approach to search and automation of GUIs. The system allows users to search for information on GUI elements by taking screenshots of them. It also supports automation through visual scripting, meaning that users provide screenshots of GUI elements in their scripts to execute actions on those elements.

The idea of using screenshots for searching GUI elements online is well motivated. First, the help documentation that comes with an application is usually insufficient. And as the authors state, it can be difficult to formulate the right query to provide to a keyword search engine. The paper’s description of the screenshot search was straightforward. For me, reading the results was more educational. I thought that screenshot queries would return less relevant search results than keyword queries, but the paper’s user study refutes that.

One potential problem with screenshot search is that it always gives the same search results for a given screenshot of a GUI element. In reality, though, people might be searching for a GUI element for different reasons. Perhaps future research could examine how to augment screenshot queries with keyword queries.

Aditi Muralidharan - 10/26/2010 18:58:22

In the first paper, Horvitz presents the principles of mixed-initiative user interfaces. The bulk of the paper is the example, LookOut, a Clippy-like "helpful" add-on to Outlook that automatically detects meeting-coordination language in emails and attempts to automatically add a meeting to the user's calendar.

It is hard to take this paper's principles seriously because there is no mention of a user study, and because the principles described are so high-level that they would be difficult to apply to any particular task. Especially in light of the unobtrusive but useful sidebars that modern calendar+email applications have, I think there is more to mixed-initiative UIs than the principles that produced the LookOut genie.

This mixed-initiative research is otherwise relevant to my research on UIs for investigative exploration of text collections (for journalists, literary scholars, historians, etc.). How far should the system go in attempting to find related or relevant information? Should it reinterpret your question to better fit the data? There is a degree of automation necessary in these systems because we assume that the user is actually unfamiliar with most of the text.

The second paper, on referring to GUI elements by screenshots, takes direct manipulation into a realm that was previously governed by automated interface agents: asking for help with a GUI. The authors also present a visual scripting API that opens up a lot of new scripting possibilities. The idea is to take selecting by dragging over an area to the next level - modern OSes only support dragging to select icons, but this makes it possible to (at least) search and script over any GUI element, just by taking a screenshot.

It wasn't clear to me what the scope of applications for query-by-screenshot is outside of searching for help on the web, but this is a really new technology, so that isn't surprising. This isn't directly relevant to my research.

Thomas Schluchter - 10/26/2010 19:00:00

Sikuli

The paper presents a visual method of specifying search queries and building potentially complex scripts embedded in GUI systems based on screenshots. Through various image recognition techniques, Sikuli is able to 'make sense' of screenshots and to treat the image data as input to further processes.

The main value of this paper to me is the insight that formulating complex expressions (either search queries or a series of steps in an automated process) benefits from a visual as opposed to a text-based approach. It seems that for all the direct manipulation in the interfaces we are used to, this problem domain still relies on a model that requires the user to translate a lot.

In the case of search, this is aggravated because the corpus against which search operates is entirely unknown. The user is forced to put the search problem in the terms of the system without knowing them. While this is true of any search problem in general, UIs as a search space present an added problem: The designers implement a conceptual model that is not as open to socially mediated reasoning as many other things in the world that can be searched for. Without an understanding of this conceptual model, searching means frustrating guesswork. Accordingly, the paper's contribution to this space is interesting.

I'm less sure about the approach to scripting. Using visual inputs for automated (and potentially destructive) action seems like introducing risk through ambiguity. GUI systems are forced to reduce the variability of their visual representations to avoid information overload. Thus, the meaning of icons, buttons or other elements might change depending on context. If that is so, there is an obvious tradeoff: In order to specify enough context for a scripted command, the scripting language needs to have expressive power. This complicates its use. Reading the examples of the Python-based construction of nested conditions for execution of a script, I had my doubts that an end-user would be able to understand this intuitively. What makes it complicated is the introduction of something that the screenshot approach to selection tried to mitigate: abstraction.

Query formulation and script assembly are distinct activities, and I wonder whether Sikuli is equally well equipped to make both of them a lot easier.

Mixed-Initiative UIs

The paper attempts to reconcile two branches of UI research that have long existed in separation: research on direct manipulation interfaces and research on intelligent agent-based systems. The argument is that a mixed approach can lead to increased efficiencies if the parameters for the synthesis are chosen correctly.

I found this paper fascinating; in many discussions about how realistic the vision of a semantic web is, the autonomous agents that were promised to us in the late 90s are the one thing that has drawn the most ridicule from the detractors. And indeed the discussion has suffered from an unwarranted optimism as to the capabilities of systems to make autonomous decisions based on information accessible to them. In my opinion this paper shows a better way to think about system agency: as a collaboration between the user and the system in which the system is calibrated to take a conservative approach to its reasoning and be as transparent as possible about the limitations.

Especially valuable are the design principles for mixed-initiative systems, which outline how rigorously such systems have to be modeled to support user behavior rather than supplant it. One of the key points in this respect seems to be that the machine learning driving these systems can extend over a long period of time, making software truly personal.

The one thing that I found distinctly '90s about this paper was the claim that system agency needs to be literally represented by an embodied virtual agent. It brings back memories of the notorious paper clip in Microsoft Office that terrorized knowledge workers with its infantile appearance. I wonder whether the visual form of the assistive function needs to rely on the butler metaphor, or whether an abstract representation would work just as well without being so whimsical.

Richard Shin - 10/26/2010 19:10:53

Principles of Mixed-Initiative User Interfaces

This paper describes and argues for user interfaces that combine both direct manipulation and intelligent, automated agents. Interface agents attempt to sense the user's intent and desires, rather than waiting for the user to explicitly indicate them, and automatically take actions that they believe the user will want. In this paper, the author seeks to combine this new technique with existing work in direct-manipulation interfaces, to create a UI that combines the advantages of both. The paper presents a plug-in to Microsoft Outlook called LookOut, which uses many of the techniques developed in the paper.

The idea of intelligent agents in user interfaces seems interesting enough, and we have seen some prominent uses of them, such as Microsoft's Office Assistant ('Clippy'), an animated avatar that offered suggestions to Office users as they went about their work. By using probabilistic inference over a large amount of evidence gained from sensors (contents of e-mails, speech recognition, etc.), and utility theory to compute whether to take some action based on the probability that the system's belief about the user's intent is true, LookOut strives to automate calendaring and scheduling activities. Compared to direct manipulation, intelligent agents seem to make better use of a computer's natural competencies, rather than seeking to (sometimes unnecessarily or counterproductively) emulate the real world.
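The utility-theoretic decision described above can be sketched roughly as follows. This is a minimal illustration of the general idea (act autonomously only when the expected utility of acting, given the inferred probability of the user's goal, beats dialog and inaction), with made-up utility values that are not taken from LookOut:

```python
# Hedged sketch of a utility-theoretic action policy: given P(user has the
# goal), pick whichever of {act, ask, do nothing} maximizes expected utility.
# All utility numbers below are illustrative assumptions, not from the paper.

def expected_utility(p_goal, u_if_goal, u_if_no_goal):
    """Expected utility of an option given the probability of the goal."""
    return p_goal * u_if_goal + (1 - p_goal) * u_if_no_goal

def decide(p_goal):
    # Acting correctly helps a lot (+1.0); acting when unwanted is costly (-0.5).
    act = expected_utility(p_goal, 1.0, -0.5)
    # Engaging in dialog is a middle ground: smaller gain, smaller cost.
    ask = expected_utility(p_goal, 0.8, -0.2)
    nothing = 0.0
    options = [("act", act), ("ask", ask), ("nothing", nothing)]
    return max(options, key=lambda pair: pair[1])[0]

print(decide(0.9))  # high confidence: autonomous action wins
print(decide(0.5))  # moderate confidence: dialog wins
print(decide(0.1))  # low confidence: doing nothing wins
```

With these numbers, the policy shifts from inaction through dialog to autonomous action as the inferred probability of the user's goal rises, which is the qualitative behavior the paper argues for.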

I was a bit confused, though, over the paper's thesis of combining direct-manipulation interfaces with intelligent agents; were agents ever sufficiently powerful or accurate to deploy on their own? Also, I didn't feel that this paper was sufficiently convincing about the benefits of intelligent agents, especially given the negative experiences I have had with them in the past. Generally, unless agents are very accurate, they could cause a lot of unintended actions or prompts, or otherwise be useless because they do too little. In any case, I'm not convinced that dialog is the best metaphor for user interaction; real dialog between people tends to be synchronous and to require unwavering attention from both parties, which don't seem like desirable properties of a user interface.

Sikuli: Using GUI Screenshots for Search and Automation

This paper presents a system, named Sikuli, which uses screenshots of GUIs for finding documentation about them, or to automate actions based on image recognition. Based on the insight that people prefer tangible visual references in real-world interactions, Sikuli attempts to bring the same paradigm to the computer by allowing users to get information about, e.g., a dialog box by taking a screenshot of it and having it be analyzed. Sikuli also includes a scripting system that works at the screenshot level and allows users to express commands like "click on all copies of this icon and drag the selection to that icon".
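A command like "drag every copy of this icon onto that icon" might look roughly like the sketch below. The image names are hypothetical, and the `find_all`/`drag_drop` functions are stubs standing in for Sikuli's screenshot-matching calls, so the sketch runs standalone rather than being a real Sikuli script:

```python
# Toy stand-in for Sikuli-style visual scripting: the "screen" is a plain
# list, and matching an image means comparing names rather than pixels.
# Everything here is an illustrative stub, not Sikuli's actual API.

screen = ["doc.png", "doc.png", "trash.png", "app.png"]  # pretend screen contents

def find_all(image):
    """Stub for screenshot matching: return positions where `image` appears."""
    return [i for i, item in enumerate(screen) if item == image]

def drag_drop(source_index, target_image):
    """Stub: 'drag' the matched element onto the first match of target_image."""
    target = find_all(target_image)[0]
    return (source_index, target)

# The paper's example, "drag every copy of this icon onto that icon":
moves = [drag_drop(i, "trash.png") for i in find_all("doc.png")]
print(moves)  # each doc icon paired with the trash icon's position
```

The point of the sketch is the shape of the script: targets are specified by what they look like (here, by name as a proxy for pixels), not by programmatic handles.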

These tasks have already been implemented before, of course: text search and macro languages. The distinction, then, is Sikuli's focus on visual images rather than text or programmatic interfaces. For screenshot search, the authors integrated a variety of techniques, such as visual features and OCR, to make images searchable and indexable, and they demonstrate in the paper that their method produces more precise and relevant results than simple text search. Their visual scripting work enables automation similar to macro tools or GUI scripting languages, but using screenshots instead. This enables more intuitive scripting of user interfaces that can carry out instructions almost exactly the way people already use computers graphically, with greater compatibility and reliability than existing systems that require special support.

However, similar to the other paper, it didn't seem that the technologies available were necessarily up to par for providing an optimal user experience. For example, in visual scripting, the user needed to manually specify the similarity threshold used when searching for images, in order to avoid false positives. I actually tried this system a few months ago, and it didn't seem particularly fast or reliable either. Additionally, I'm not sure that screenshots are the best level of abstraction for implementing systems like this; by reverting to bitmap images, the system throws away a lot of data that is otherwise available from the operating system and UI toolkits. While screenshots work more universally, I think the authors should have explored in more detail using this kind of extra data to further their goals of automation and search.
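The similarity-threshold issue can be illustrated with a toy matcher. This is not Sikuli's or OpenCV's actual matching code; it just compares a small "template" against windows of a larger "screen" (both plain lists rather than real images) and keeps matches above a user-chosen threshold:

```python
# Minimal illustration of why a similarity threshold matters in
# screenshot-based matching: a strict threshold misses near-identical
# renderings, while a loose one admits false positives.

def similarity(window, template):
    """Fraction of 'pixels' that match exactly (a toy metric)."""
    same = sum(1 for a, b in zip(window, template) if a == b)
    return same / len(template)

def find_matches(screen, template, threshold):
    """Return every window position whose similarity meets the threshold."""
    n = len(template)
    return [i for i in range(len(screen) - n + 1)
            if similarity(screen[i:i + n], template) >= threshold]

screen = [0, 9, 9, 8, 0, 9, 9, 9]
template = [9, 9, 9]

print(find_matches(screen, template, 1.0))  # exact matches only
print(find_matches(screen, template, 0.6))  # looser threshold admits near-misses
```

Choosing the threshold automatically is hard precisely because the right trade-off between these two failure modes depends on the icons and screen in question, which is presumably why Sikuli left it to the user.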

Kenzan Boo - 10/26/2010 23:48:53

Principles of Mixed-Initiative User-Interfaces

The paper describes a middle ground, an ideal point, between automated program intuition and allowing users to richly describe and manipulate what they want to do. Some of the key issues with developing automated services alongside direct manipulation are developing automation that adds significant value, and considering uncertainty and what to do with that uncertainty.

The example they used is LookOut, which had a magic genie that seems very obtrusive.

This reminds me of Apple's implementation of their mail client. It would automatically highlight in blue the key phrases that would allow you to create a calendar event around them or call someone's number. This captures intention without being too obtrusive. Although it would sometimes make a mistake in parsing, the system allowed users to edit the entry as needed. It has a lot of the features pointed out in this article, which was written many years before their implementation.

Sikuli: Using GUI Screenshots for Search and Automation

Sikuli is a digital-image-based way of searching for help on a user interface.

It is great and very intuitive for users to point to a button with an image and ask, "What does this do?" What would be even better is to avoid the image capture entirely and go straight to the object in the code, along with a visual representation, so that users get feedback about what they clicked on.

This is much better than submitting a word to describe a UI object, and is also much more precise.

Also, given a limited set of possible UI elements, it would be fairly simple with modern image processing to figure out what the image is, even across different screen resolutions, etc. This is also helpful for developers answering questions: screenshots are always easier than trying to describe the screen.