Temporal Interactions II: Audio and Video


Lecture Slides

File:Cs260-slides-18-time2.pdf

Extra Materials

Discussant's Slides and Materials

File:Cs260-discussion-18-time2.pdf

Reading Responses

Airi Lampinen - 10/29/2010 13:59:21

Stifelman et al.'s "Audio Notebook" describes a prototype notebook that combines paper-and-pen note taking with structured speech recording. The authors present a user study in which a total of four students and two journalists used the Audio Notebook in a real use context over an extended period of time.

The paper takes further the notion of augmented reality, that is, the paradigm of enhancing the physical world with digital properties instead of replacing it with a completely digital/virtual solution. The instance that the authors look at is perhaps especially suitable for this, as taking notes with paper and pen seems to be an interface that people are so accustomed to, and so fond of, that abandoning it for a fully digital interaction would not be beneficial. The work is also interesting in that it proposes a way to utilize audio recordings more effectively, as the Audio Notebook allows for browsing them.

Beyond the use cases of students reviewing course material before an exam and journalists writing up articles based on interviews, I could imagine that the Audio Notebook would interest many researchers, especially anthropologists who write extensive field notes. This use case would have been an interesting complement to the study, because recording interactions in field conditions poses a number of challenges related to the consent of the people observed, and so on. On the other hand, in many cases where research interviews are recorded, transcribing them takes up excessive resources and, even more problematically, much of the tone is lost in the process. Hence, the Audio Notebook or a similar solution could be a very valuable contribution to the interviewing parts of the research community.

In their paper "Video Object Annotation, Navigation, and Composition," Goldman et al. discuss the use of tracked 2D object motion to enable novel approaches to interacting with video. The paper covers video annotation, video navigation, and video-to-still composition, treating each topic with some technical detail.

The presented informal evaluation shows that while the proposed solution has some advantages in comparison to currently used solutions, there are still a number of problems to overcome before the technique would be effective and reliable enough for real use. However, the authors conclude with the conviction that "interactive video object manipulation can become an important tool for augmenting video as an informational and interactive medium".

Overall, I did not get a lot out of reading this paper, even though I agree that finding better ways to use and work with video is a worthy goal. The proposed annotation and navigation techniques could, however, be useful in facilitating the work of ethnographers, who nowadays often have large quantities of video recordings available but not necessarily good tools for analyzing them in a rigorous way.


Thomas Schluchter - 11/1/2010 22:46:50

===The Audio Notebook===

The paper describes a system for augmenting audio recordings with simultaneously taken hand-written notes for easier extraction of relevant content. After the system's initial evaluation, the authors added speech processing capabilities that make navigation in the audio through the written notes more effective.

It's fascinating once again to see an early research precursor to a product that came out on the mass market maybe two years ago. The main difference between the modern devices and the Audio Notebook as described in the paper is a reversal of roles: the intelligent part of the system is no longer the notebook but the pen itself. This makes the product easier to use (no unwieldy tablet to carry around), yet it effectively preserves the interaction model of the Audio Notebook. This is merely speculation, but it's interesting to think about whether this reversal of roles might be due to the researcher thinking from a metaphorical perspective (augmenting paper as the information-storing medium) whereas the product designer thinks primarily from a usability perspective (how to make this as compact as possible?).

One of the things I liked best about the paper is the section where the researchers explain how they came up with the "snap-to-grid" function of the notebook. I think this is not only a very ingenious feature, but it also demonstrates the power of translational thinking as mentioned above. Identifying the pauses and breaks in speech, as well as the common signals for a conversational turn, as the structuring 'whitespace' of language is an original idea that leads to a highly useful application.

The evaluation strategy also demands respect. Conducting a longitudinal study is always a tough decision in terms of resources, but in this case it was absolutely necessary. Precisely because the system changes how we approach extremely familiar tasks, it is crucial to go beyond the "wow" effect of a prototype demo session and deploy the system in the field. I think the results of the study show very well that the underlying idea, through its flexibility and simplicity, fits as many uses as pen and paper do.


Charlie Hsu - 11/2/2010 10:28:52

The Audio Notebook

This paper describes the Audio Notebook, a notebook digitally enhanced to record pen strokes and sync them with audio recordings. The notebook can also play back audio, indexed through both a scrollbar and the notes themselves. The authors conduct four user studies, observing students and professionals using the Audio Notebook in their normal activities. Some extra features were explored, such as a listening-to-writing offset, an audio snap-to-grid that detects contiguous phrases in audio, and topic suggestions for chunks of audio delimited by cue words.

This paper is heavily related to the subject of my partner's and my research project for the class; our project essentially attempts to provide much of the same functionality in the video domain, targeted at HCI designers using video to perform field studies and ethnography. Many of the issues they discovered are also issues we are trying to deal with in our project. One of the first ideas we had was indeed to implement a "recording-to-writing offset," much like the paper's "listening-to-writing offset," since it takes time to cognitively recognize an action of importance and begin to transcribe it. However, we run into the same problem: there is no guarantee of finding a coherent starting point when backing up by a fixed amount. Even worse, it seems far more difficult, and out of the scope of our knowledge, to develop a "video snap-to-grid" that could figure out contiguous chunks of video / detect new actions. It is good validation, though, to see that others have struggled with our problems; we hope that the nature of our interface (one video recorder, one notetaker) will allow the notetaker to devote more attention than normal to events.
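To make the offset idea concrete, here is a minimal sketch (our own illustration, not the paper's implementation) of backing up playback by a fixed listening-to-writing offset and then snapping to the nearest preceding phrase start. The `phrase_starts` input is assumed to come from some upstream pause-detection pass; all names here are hypothetical.

```python
import bisect

def playback_start(note_time_s, phrase_starts, offset_s=3.0):
    """Map a pen stroke's timestamp to a playback start point.

    note_time_s:   time (s) at which the stroke was made
    phrase_starts: sorted list of detected phrase-start times (s)
    offset_s:      fixed listening-to-writing offset to back up by
    """
    target = max(0.0, note_time_s - offset_s)
    # Snap to the nearest phrase start at or before the target, so
    # playback begins on a coherent phrase boundary rather than mid-word.
    i = bisect.bisect_right(phrase_starts, target) - 1
    return phrase_starts[i] if i >= 0 else 0.0

# A stroke made at t=65s, with phrases detected at 0, 12, 58, 70s:
print(playback_start(65.0, [0.0, 12.0, 58.0, 70.0]))  # -> 58.0
```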

We also hope to implement symbols that encode metadata for tagging video, further reducing the cognitive load on the users. Judging by Student 2's change in notetaking habits when using the Audio Notebook, we infer that the ability to easily recall relevant portions of video later will reduce the need and desire to transcribe copiously at capture time. Student 2 simply kept track of "higher level concepts," while Reporter 1 took it even further, simply placing star symbols next to important areas. We hope to expand on this, allowing users to place symbols such as stars, squiggles, etc. to indicate especially important or unimportant areas, or perhaps even to use an abbreviated symbol as a tag for something else (e.g., a triangle could mean "the engineering team might be interested in this").

Some final design decisions that the Audio Notebook made mirrored our initial design decisions as well, a comforting reassurance. The Audio Notebook decided to act as a digital enhancement to existing interfaces of pen and paper, like the DigitalDesk. We intend to do the same thing, preserving the traditional video camera and notetaker model of an ethnographic study, with the notetaker able to continue using pen and paper. The Audio Notebook also worked in two directions: allowing the user to skip to audio segments based on notes, and allowing the user to view notes associated with audio that might be playing back. We intend to do the same with our video interface.


Video Object Annotation, Navigation, and Composition

This paper explores the use of tracked 2D object motion in videos to enable three tasks: video annotation by associating graphical objects with moving ones on screen, video navigation through direct manipulation of objects on screen, and still picture composition using rearranged portions of video. The paper describes the algorithms for pre-processing the video, choosing video objects, direct manipulation, and more. A brief informal evaluation is done, and limitations are discussed.

I feel that the direct manipulation video navigation interface was gimmicky, and I was unable to come up with an impressive example of when it would be useful. Video navigation through direct manipulation requires advance knowledge of how the objects in the video move! The most convincing example I came up with was for time-lapsed/time-manipulated videos, where objects may not be moving at their normal speeds (e.g., the sun moving across the sky in a sped-up video). However, this embodies both an extremely predictable moving object and a modified sense of time that would already interfere with normal human prediction of where and how quickly a predictable object would move.

The video annotation work seemed to be a useful contribution; many of the technologies they implemented open the door to future applications making use of their model. Detecting occlusion and the different methods of linking text to video objects could have all sorts of extra features built onto them, for example, automatic surveillance via detecting occlusion of key areas on a security camera feed, or a richer viewing experience for translated foreign films, where things like street signs in the scenery could be translated and annotated in a more natural fashion, attached to the object itself and moving with it. However, I have to point out something that amused me in their discussion of video hyperlinks: they claim hyperlinks are useful because they give the user access to more information on demand without occluding the screen, but the Internet Explorer window that pops up on a click does exactly that: it occludes the screen!


Shaon Barman - 11/2/2010 15:20:12

The Audio Notebook

The Audio Notebook attempts to augment a physical notebook with voice recording and allows the user to index into the voice recording through the text (by linking the recording time to the written text).

The main contribution of this paper is a technique to timestamp written text so that the corresponding audio can be accessed easily. The authors evaluated this design with two students and two reporters, who all used the tool for different purposes. One feature I especially liked was the audio snap-to-grid detection. I have had many instances where I repeatedly backed up because I could not find the right start location. This feature might be overshadowed, though, by a more precise visual interface that showed the user the start of a phrase. Overall, the Audio Notebook seems like a practical tool with many purposes.
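For intuition, here is a toy version of the kind of pause detection that could mark phrase starts. This is only a sketch under assumed parameters (16 kHz mono samples as a NumPy array, an arbitrary energy threshold); the paper's acoustic structuring uses richer cues such as pitch and energy.

```python
import numpy as np

def phrase_starts(samples, sr=16000, frame_s=0.05,
                  min_pause_s=0.5, energy_thresh=0.01):
    """Rough pause-based phrase-start detector: returns times (s)
    where speech resumes after a sufficiently long quiet stretch."""
    frame = int(sr * frame_s)
    n = len(samples) // frame
    # Per-frame RMS energy over non-overlapping frames.
    rms = np.sqrt(np.mean(
        samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    silent = rms < energy_thresh

    starts, run = [0.0], 0
    for i, is_silent in enumerate(silent):
        if is_silent:
            run += 1
        else:
            if run * frame_s >= min_pause_s:
                starts.append(i * frame_s)
            run = 0
    return starts
```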

My first reaction when reading this paper was that someone must have, or should have, created an app just like this for the iPad. I have had many moments in class where I drifted off, and a tool like this would be invaluable when trying to decipher the scribbles I wrote. Additionally, it seems like this basic concept can be expanded to many more areas. I took a Chinese class where I had to learn the pronunciation and strokes for many characters. If my teacher had used such a notebook to create a document that linked the characters with their pronunciation, it would have greatly simplified learning each sound. This basic concept of linking text with speech could also be applied to typed input. With the hardware already available, such an interface could easily be implemented and used.

Video Object Annotation, Navigation, and Composition

The authors create a direct manipulation interface to allow augmenting video with annotations and basic editing.

Although the paper itself is a bit confusing, the video shows a tool that is both powerful and easy to use. The tool tracks areas with similar motion and allows the user to directly annotate these areas (with the annotations automatically following them). I have not done much video editing, but because of the number of dimensions in a video, it seems difficult to specify such interactions. The tool abstracts these dimensions away and allows the user to edit an object just as if the object were real. I like the use of starburst arms to show the temporal location of similar frames. It is a very intuitive metaphor that encompasses both the action and time.

The main critique seems to be the amount of preprocessing this technique takes: 5 minutes per frame. This seems excessively high, but because this is a prototype, it should be possible to optimize.


Anand Kulkarni - 11/2/2010 15:50:58

Audio Notebook


The authors present a new device allowing users to review and annotate audio notes.

I like the domain; this is a problem that has been faced by anyone doing user studies. I particularly liked the snap-to-grid functionality, which is a novel contribution that suggests some interesting technical problems in OCR+speech recognition. I also like the fact that users don't have to explicitly index their data. I think the interface could have used some work, though. There's not a strong reason to use an interface biased towards the physical (a digitizing tablet and stylus) rather than a purely digital one on a laptop.

The evaluation was a field study following several users of the system over a five-month period. This seems a reasonable method of evaluation; no obvious quantitative methodology comes to mind here. I like that the authors examined workers in their intended use cases (students and reporters) and took extensive feedback from them both during and after these periods. I also like that they analyzed specific examples of content produced by the users, although not particularly rigorously. I wish they had attempted to generalize more, or to incorporate the strategies these users adopted as much as possible.


Video Object Annotation, Navigation, and Composition


The authors present a novel interface for annotating video streams.

This tool has contributions and applications in several domains; video annotation will become increasingly important as YouTube and its relatives become a frequent means of media dissemination online. I like that annotations morph over time as a video progresses and is edited. I also like that the technology hides the complexity of tagging and identifying video objects from the user. Last, I like the ease of video recomposition the authors present. I wish the authors had discussed more complex applications in (for example) object tracking and automatic recomposition, for instance, writing programs or scripts around these activities; these are interesting extensions.

The evaluation is informal, and the authors argue their system is largely novel enough not to need one. However, it's good that the authors presented at least an informal evaluation using a few storyboarding attempts. The experiment using After Effects as a control is also good, since it provides a clear demonstration that the system is better. I would have liked to see more discussion of the technical details of how the features were accomplished; these were only touched on, perhaps due to space, or perhaps because their inclusion would have required more technical evaluation.


David Wong - 11/2/2010 17:23:57

1) The "Audio Notebook" paper discusses a system that helps users index an audio recording from a lecture or class by the user's writing. There are additional features they implemented, such as phrase recognition and topic identification, to make the usage of the system more refined. The "Video Object Annotation" paper discusses a new system where a user can annotate a video with several different graphical objects, navigate a video with a direct manipuation of objects in the video, and create new image stills by dragging objects in a video. The paper goes into detail about the implementation and also discusses an informal evaluation and limitations.

2) The "Audio Notebook" paper proposes an interesting system that seems like it has a lot of potential value to add to students and reporters. It relates to the papers we read about the DigitalDesk as it tries to augment both the physical with the computational. The system is cool--I would use it. However, I don't know if it would be economically feasible to implement, especially nowadays with the iPad and other tablets becoming widely used. Nevertheless, for a paper written 9 years ago, I think it was on the right track. To juxtapose audio and the user structured text, the system clearly improves the reviewing and learning process. I think that this concept applied to current day products like the iPad or tablets could be very effective.

I think that the "Video Object Annotation" paper offers a lot to the HCI community. While there has been extensive related work stated in the paper, their approach seems quite novel. In the paper, they state:

"Although they work remarkably well, we do not consider our tracking and grouping algorithms to be a central contribution of this work. However, we find it notable that so much interaction is enabled by such a straightforward preprocess. We have no doubt that future developments in computer vision will improve upon our results, and we hope that researchers will consider user interfaces such as ours to be an important new motivation for such algorithms."

Accordingly, I believe that their work has really illustrated the possibilities that are currently available in video interaction software. Their work also inspires the design of new products that can build upon it and advance the state of the art. One significant shortcoming of their system, however, is the massive amount of time needed for preprocessing, along with its current limitations, such as complete occlusion. Nevertheless, I think that they've created a great first prototype that can be a strong basis for future work.

3) The "Audio Notebook" paper's argument that their system is valid is convincing. Although their field studies were small, I believe that they clearly illustrated the potential benefit that their system has. I thought it was particularly convincing for them to include a user who was unwilling to use the system at first, but then realized its value after converting back their original methods--the reporter and the tape recorder. Although I don't think a Audio Notebook system would be realized today, I believe it's concept can be effectively applied to existing tablet products in the market.

The "Video Object Annotation" paper addresses an area of software that hasn't seen strong innovation since video editing software. I believe that they are engaging research in an area that needs it. I think they clearly, however informally, illustrate this with their informal evaluation. They demonstrated that their system could perform an equivalent task to After Effects with much less effort. Albeit that this was a singular example and there are probably many situations where their system would fail, which they do note in their limitations section, I think this is sufficient motivation to further research in this area.


Matthew Chan - 11/2/2010 17:27:26

The Audio Notebook: Paper and Pen Interaction with Structured Speech

The Audio Notebook is a novel tool for the many people who divide their attention between listening to others (e.g., in a lecture or interview) and writing down notes. This tool allows the user to continue writing notes while recording audio, and allows the user to quickly jump to the parts of the audio that map to the notes. More impressively, the tool builds upon standard pen and paper instead of an LCD screen, as mentioned.

I'm sure many other posts mention how this recalls a product called LiveScribe, a pen that can take notes and record; however, users have to take notes on a special type of paper produced by the company. This paper is pretty important because it offers a new tool that simplifies two tasks that many of us do daily. The methodology was to loan the devices out to a few students and two reporters over the course of a semester. The authors then interviewed the participants about their experiences with the devices and analyzed whether their note-taking habits were altered or enhanced. The authors noted that reporters wasted a lot of time with regular tape recorders as opposed to the Audio Notebook, and that students had an easier time recollecting lecture notes about key words or diagrams.

This paper relates to today's technologies because of LiveScribe itself and how it is currently marketed and sold to consumers. I've seen many peers use it as well; the only drawback is that the pen looks a bit bulky. A blind spot that I see is how we might be able to merge this with mobile devices. The iPhone and Android already offer apps that act as tape recorders, but what if the student is typing up notes while recording? In this instance, there is no mapping from the notes to the audio timeline.

Video Object Annotation, Navigation, and Composition

This paper explores a new way of building upon video annotation, since there is very little work in this area and the most well-known example is the tool used in football replays, where commentators can draw on the video. This work is important because it has the potential to impact the way users make films or movies. With websites like YouTube, users are making lots of videos. Moreover, governments around the world have extensive surveillance in airports, cities, subway systems, etc. This would be a valuable tool for making notes, tracking objects, and so on, and it relates in many ways to today's technologies, since cameras are ubiquitous and standard in cell phones.

The applications consist of word bubbles, graffiti, scribbles, path arrows, marking occlusions, and video hyperlinks. The paper was very technical in its explanations of how particles are grouped together, etc. The evaluation compared the system to Adobe After Effects: one novice user took about 20 minutes to complete a certain task in After Effects, while another user, given a one-minute introduction to the new video annotation tool, took only 30 seconds to complete the same task. One blind spot I see is the same as with the Audio Notebook: mobile devices and cell phones. When the iPhone 4 was released, Apple also announced the iMovie app so users can edit clips on their phone. Perhaps once computation costs go down further, video annotation, navigation, and composition could become standard features on these devices.


Linsey Hansen - 11/2/2010 17:53:39

In the Video Object Annotation, Navigation, and Composition article, the authors describe a new method for tracking 2D object motion. Their method tracks particles (points) in the video to form objects, and then allows interactions tied to those objects.

At the end of their paper, the authors mention that preprocessing takes 5 minutes per frame for a 720 × 480 pixel input. While they do acknowledge this as a downside, I would go so far as to say it is a rather large downside for more professional use. Considering that most real-time films are 24 fps, one second of film would take 2 hours to preprocess; that is better than Pixar, but people using this for movies aren't necessarily going for a Pixar-level film. I feel like the primary application for something like this would be, as mentioned, for sports teams to do breakdowns of a game; while it would be cool for sports announcers to do it as it happens, having something to look over a week later would be nice too. I suppose there are also some film projects that might want to use something like this for special effects, or people might want to use it for some quick effect that lasts even a minute, and that seems fine. However, I feel that the people who would make the most use of this would be YouTubers, do-it-yourself demonstrators, or just people making fun family videos, and I doubt these non-professionals will have the resources to turn this into a quick project, even for a 5- or 10-minute film, and even if they choose to preprocess only one minute of it.
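Spelling out that arithmetic (using the paper's reported figure of 5 minutes per frame and assuming 24 fps footage):

```python
minutes_per_frame = 5
fps = 24
# Preprocessing cost for one second of footage, in hours:
hours_per_video_second = minutes_per_frame * fps / 60
print(hours_per_video_second)  # -> 2.0
```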

Perhaps I missed it, but the main thing that bothered me is how prone the particle-tracking method might be to error. For instance, how high-quality does a video need to be for it to work well? Even if the video is of decent quality, will factors such as lighting mess with the particle tracking (or will really dark spots often create ambiguity in objects)? Also, in the event that this method proves prone to error during more "normal" use (not in experimental settings), will there be a way for the user to "debug" the problem, or will it just not work for that user's video? While the idea is definitely neat, and it will probably work on most professional material, I feel that someone will definitely need to look into these issues so that the method can be more reliable - or maybe the authors did address this somewhere and I missed it.

The Audio Notebook

This is neat because it is very similar to LiveScribe pens. In fact, LiveScribe pens were probably built off of this, considering how similar they are. The thing I prefer about the pens over what is described in the article is that the pens have the audio interface built into the paper, while this device required a separate interface object for that.

The most useful feature, I think, is the ability to sync the audio with a place in the notes, because that is the main reason why saving audio, or even watching webcasts, has been difficult for me in the past. There is the small problem that it might be hard to sync to some place if you didn't take notes on it (I am still somewhat confused as to how Student 2 was able to do their notes), because you would probably need to search through quite a bit of excess information. This also has a great advantage over computer-typed (and recorded) notes because it allows you to draw diagrams and such with more ease (aside from tablet-like laptops, but those are pricey).


Siamak Faridani - 11/2/2010 17:59:20

These two publications are mainly about techniques that we can use to manipulate and interact with video and audio. Both groups of authors go on to develop hardware and software packages around their ideas and provide them as proofs of concept.

The Audio Notebook's authors start by observing that classical note taking requires users to divide their attention between the speaker and the process of note taking. As a solution, they suggest the Audio Notebook, a technique that blends the note-taking activity with listening to the speaker. It also enriches the learning process by incorporating audio into the notes, providing playback capability to students and journalists when they review their notes.

In addition to the user annotating and syncing the notes with the audio, they also use audio processing to select and sync the voice. We can look at this technology as augmenting the actual notes with audio data by mapping time into space (audio times to locations in the notebook and on the audio scrollbar). The system uses a regular pen and notebook, which are familiar to users.

The paper might be the grandfather of LiveScribe pens. There are a lot of similarities: for example, they both use real notebooks as opposed to LCD touch screens, and they both try to sync audio and notes.

One question that I have about this research (and also about LiveScribe pens) is that I tend to write slowly and am always behind, so I wonder how we could modify these devices for people like me. Perhaps a time shift in the audio would resolve the problem, but it is not really a linear transformation: sometimes I am behind, so the voice and notes are out of sync, and sometimes I catch up and am in sync with the speaker, so we need a nonlinear mapping from time to space.
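One way to get such a nonlinear mapping, sketched here purely as an illustration: let the user (or some heuristic) supply a few sync anchors pairing writing time with audio time, and interpolate piecewise-linearly between them, so the lag can grow and shrink. All names and numbers are hypothetical.

```python
import numpy as np

def stroke_to_audio_time(stroke_time_s, anchors):
    """Piecewise-linear mapping from writing time to audio time.

    anchors: list of (stroke_time, audio_time) sync points, e.g.
             corrections made where a note lagged the speech.
    """
    ts, ta = zip(*sorted(anchors))
    return float(np.interp(stroke_time_s, ts, ta))

# Writer fell 8 s behind by t=60 but caught up again by t=120:
anchors = [(0, 0), (60, 52), (120, 120)]
print(stroke_to_audio_time(90, anchors))  # -> 86.0
```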

The Video Object Annotation, Navigation and Composition authors start by pointing out that video is hard to interact with because it is hard to represent in a non-sequential order; it is hard to navigate and browse a video. In this paper they suggest three interfaces for working with, manipulating, annotating, and navigating video. They develop an annotation tool in which text on an annotated object moves as the object itself moves, giving the impression that the annotation is attached to the object. They also use the idea of direct manipulation for video navigation, so a user can simply grab an object and use it to scroll through the frames. And finally, they use a static image to scroll through a video.

Their image processing is twofold: first they analyze each particle in the video, and then they assign particles to objects. They use particle video to do the tracking, and using particles allows them to do object selection even with sloppy mouse strokes. The underlying algorithm allows the authors to develop interesting techniques in their tool. For example, they have graffiti and word balloon tools that move with the object, and path arrows are also a result of the particle tracking. They implement some of their ideas in Adobe After Effects as well as in their own prototype.

Some of the other techniques included in this paper seem minor, though; for example, hyperlinking does not seem to be that significant a contribution.


Pablo Paredes - 11/2/2010 18:18:53

Summary for Goldman, D., Gonterman, C., Curless, B., Salesin, D., Seitz, S. - Video Object Annotation, Navigation and Composition

The paper describes a series of algorithms and techniques that make it easy for users to perform three video tasks: annotation, navigation, and image composition. All of these basic operations allow users to easily attach tags to, or select, moving objects.

The main advantage of the system is its ability to hide the complexity of the algorithms and allow the user to think in terms of high-level goals, such as placements, rather than segmentations or other low-level parameters. Their particle tracking algorithm is well encapsulated under simple interface commands such as zone selection, annotation boxes, word balloons, path arrows, video hyperlinks, and occlusions.

As the authors describe, their main contributions are not in the field of video manipulation itself, as they have not defined higher levels of automatic segmentation or introduced enhanced computer vision techniques. In particular, drawbacks of the system are its processing time and the need to define the number of elements a priori. Their main contribution, however, is in the very simple interaction techniques that allow the user to be very creative and to quickly implement desired operations. Their implementation of hyperlinked video is very interesting and presents good options for interactive TV and other traditional video interfaces. This paper made me think again about how much I admire simplicity, and how complex it is to reach simplicity, especially when our modern consumer brains always push us to believe that more is better.


Summary for Stifelman, L., Arons, B., Schmandt, C., The Audio Notebook

This paper describes another example of enhanced reality (as opposed to digitized reality). It allows a user to take written notes on paper while recording the lecturer. This provides a great tool to enhance the level of interaction between the student and his or her environment while taking notes. The system enables several novel tasks such as rapid audio skimming, filling in gaps, and reviewing.

The notion of enhanced reality is clearly present, as the traditional note-taking task is not altered at all while additional information is provided. The user does not need to keep track of any new mental models beyond turning on the recorder. The overall system is very unobtrusive to the user and, again, is simple to use.

I find this idea powerful and interesting from several angles: 1) the notion of simplicity is of great fascination to me, and it should always be the final goal of a designer; 2) the notion of enhanced reality is a humble approach that recognizes the implicit intelligence, embedded through years of usage, in low-tech tools, while the added high tech opens new horizons for the technology instead of replacing it; 3) the notion of freedom of expression, which gives the user a completely familiar environment, helps the adoption of technology in many ways, rather than generating additional gaps between digital workers and more traditional ones. I like all attempts to go back to basics and to use technology as an enhancer of reality rather than a changer of reality (although I am not at all opposed to the notion of progress and the creation of new tools).


Thejo Kote - 11/2/2010 18:29:36

The Audio Notebook:

In this paper, Stifelman and co-authors present a note-taking system which incorporates audio and written text. They observe that taking handwritten notes while trying to follow a lecture or conversation, and handling recorded audio without an easy way to extract useful information, are both suboptimal ways of taking notes. They propose a system which combines a user's note-taking activity with audio recordings to improve the process.

One of the main features of the system is support for taking notes on paper, which is familiar to most users. It allows users to write down their notes while recording audio at the same time. By detecting the position at which the user writes on paper and correlating it with the audio, the system is able to provide both spatial and temporal context when the user later tries to retrieve information from the notes. The authors describe the results of their field study and the improvements they made to the system based on the results. The main addition is "acoustic structuring" to help the user more easily find the relevant portions of audio.

I thought this was a very interesting solution to a common task. With the availability of tablet devices now, I can easily see a system like this being built feasibly. Adding other layers of information to the temporal scale, like video in a classroom setting, would make it more compelling. With improvements in speech and handwriting recognition technologies, it should be possible to create a system that does not depend on custom hardware.


Video object annotation, navigation and composition:

In this paper, Goldman and co-authors present a number of approaches to interacting with video. They track 2D object motion to allow easy annotation, navigation by direct manipulation, and creation of composite images from multiple frames of a video. The system they present moves away from the traditional frames-and-timelines approach to working with video and supports more direct, mouse-based interaction techniques.

The system depends on an expensive pre-processing step which identifies movement of objects in the video and tracks them using a particle video approach. This pre-processing enables all the applications they present. This paper also presents a "starburst" widget which provides a better visualization of the range of motion that is possible with a mouse.

Computer vision and video manipulation is something that I don't know much (anything) about. But, it was interesting to read about how the interaction paradigm was made more direct and the clear benefits it brought about. The composition of an ideal image from multiple frames was especially cool.


Luke Segars - 11/2/2010 18:31:19

The Audio Notebook

This paper describes an extension to a standard paper-filled notebook that provides voice recording, audio navigation, and phrase detection. The device seems to be helpful, and the users who tested it, although few in number, gave positive reviews of its usefulness.

Perhaps the most critical feature of this device is the ability to listen to audio at an accelerated rate. Listening to information a second time at the original speed would be time-consuming and inefficient, as well as annoying for the user. The Audio Notebook provides both accelerated replay speeds and track navigation, making it significantly easier to find what you're looking for in the audio track. There would seem to be a significant number of applications for this technology, including education, recording staff meetings (or any other kind of meeting) in business, conducting interviews, and creating field notes.

High levels of usability separate this device from other similar versions. The ability to move freely within the audio stream is a significant improvement over slow, uncontrollable playback. Although the majority of the technology is straightforward, the Audio Notebook, like the bubble cursor, addresses a challenge that a user group faces without requiring a technological breakthrough. This seems to be a trend in much of the work in HCI, and it suggests that creativity is sometimes more important than a supercharged technical brain.

The small number of user tests makes the positive results seem unfounded. This study could easily have been conducted with a small group of friends who provided positive feedback for social reasons more than functional ones. Larger user tests would be far more convincing and might reveal additional use cases for this device. It would be interesting to see whether users' study habits change after using the Audio Notebook for extended periods of time. For example, students may not pay as much attention in class, or may review more efficiently, once they can hear the lecture multiple times.


Luke Segars - 11/2/2010 18:31:39

Video Object Annotation, Navigation, and Composition

This paper describes a technique for using pseudo-direct-manipulation controls to manipulate and pan through video. I found this paper to be very informative for technical reasons (the particle-based object tracking in particular), but I also found the "Applications" section to be a bit weak and unfounded. For certain applications, such as providing replays in sports, it does seem like it'd be a more natural way of traversing the video stream, given a reduction in the significant preprocessing time. Although the technique itself is interesting and probably worthwhile, this seems like a technology that may not find many practical applications before it's available for people to play with.

You could conceivably think of some interesting (although not necessarily valuable) uses of this tool, such as the creation of dynamic movies that responded to user input or other conditions. This technique wouldn't necessarily involve a human participant but could visually map actions to characters based on their position or behaviors through a scene. It also seems like there would be a decent number of ways to integrate this sort of technology into CSCW applications. On the other hand, the paper mentions video editing as a possible application, but I don't see how this would be superior to the standard linear slider for that purpose.

The contributions to HCI here are probably more in the theoretical realm than in the actual product that Goldman et al. developed. Video is becoming a more significant part of our daily lives and means of communicating information than ever before. Any person with an internet connection and a video camera can share a video with the rest of the world, and many of these users do not have the expertise to perform expert video editing. We do need a new method of editing video with a lower barrier to entry, although I'm not convinced that this method is more intuitive for amateurs than the linear model. It would have been interesting to see a user study with users who were unfamiliar with any sort of video editing and how helpful they found this approach versus a standard linear model.


Krishna - 11/2/2010 18:46:53

Audio Notebook

The authors describe the Audio Notebook, a system that provides structured access to audio through paper and pen. The key idea is to map a user's spatial interactions on notebook paper to time segments in the audio. While a user is taking notes, the system records the audio and keeps track of correlations between the current page, the spatial locations where the user is writing on that page, and the audio timeline. By touching locations on a page, users can directly seek to the relevant portions of the audio associated with the page. A page detection system ensures the audio timeline is appropriately adjusted when the user changes pages. The interface also provides an audio scrollbar that can be used to seek to an arbitrary time in the audio relevant to the page.
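As an illustration of the core data structure (hypothetical code, not from the paper): a per-page log of strokes with their positions and recording times, so that a pen tap can be resolved to the nearest stroke and thus to an audio time.

```python
from collections import defaultdict

class StrokeIndex:
    """Maps pen strokes to audio times, organized per notebook page."""

    def __init__(self):
        self.pages = defaultdict(list)  # page -> [(x, y, t), ...]

    def add_stroke(self, page, x, y, t):
        """Log a stroke at position (x, y) made at recording time t."""
        self.pages[page].append((x, y, t))

    def seek(self, page, x, y):
        """Return the audio time of the stroke nearest a pen tap."""
        strokes = self.pages[page]
        if not strokes:
            return None
        _, _, t = min(strokes,
                      key=lambda s: (s[0] - x) ** 2 + (s[1] - y) ** 2)
        return t
```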

The authors also discuss improvements made to this system by using acoustic information to further structure the audio. Using acoustic features, the system structures the audio around phrase beginnings and endings and topic introductions. The former was done so that, when seeking to arbitrary positions in the audio, the system adjusts to align with phrase boundaries; the latter so that users can skip to the audio segments that introduce topics, using information from pitch, energy, and pauses.

A very creative piece of engineering. I particularly liked the fact that multiple revisions and features were made based on user studies. A nice thing about the system is that errors in the detection algorithms can always be corrected and adjustments made by the user. This is probably an important criterion to keep in mind when engineering intelligent user interfaces: allow users to handle false positives and negatives.

Video Object Annotation, Navigation, and Composition

The central idea is to use sophisticated computer vision techniques to directly manipulate video. In contrast to the current way of manipulating video, which is primarily based on timeline navigation, the system allows users to manipulate videos in a variety of interesting ways.

Users can annotate objects in the video, and these annotations move when the objects move; users can seek through video by dragging objects in it; and users can create a variety of stills by mashing up different frames of the video. Their demo blew me away.

The central idea in their computer vision techniques is the notion of "particle grouping," where particles (which I have understood as small groups of pixels) are grouped using some optimization criterion. These groupings are then associated with user-defined regions, and the motion of a region is tracked by tracking the motion of the particles within it. As they say, the "only" limitation is that it takes 5 minutes to process a single frame.
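A toy rendition of the grouping idea (my own sketch, not the paper's optimization): treat each particle as an (x, y) track across frames and greedily merge particles whose frame-to-frame motion stays similar.

```python
import numpy as np

def group_particles(tracks, tol=2.0):
    """Greedy motion-based grouping of particle tracks.

    tracks: array (num_particles, num_frames, 2) of (x, y) positions.
    Particles are grouped when their per-frame motion vectors differ
    by less than `tol` pixels on average.
    """
    motion = np.diff(tracks, axis=1)   # frame-to-frame displacements
    n = len(tracks)
    labels = -np.ones(n, dtype=int)
    next_label = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = next_label
        for j in range(i + 1, n):
            if labels[j] < 0:
                gap = np.linalg.norm(motion[i] - motion[j], axis=1)
                if gap.mean() < tol:
                    labels[j] = next_label
        next_label += 1
    return labels
```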


Bryan Trinh - 11/2/2010 18:58:27

The Audio Notebook

This paper introduces the implementation and usage of a system of note taking that makes use of spatially indexed audio recordings.

Like many of the other user interfaces that we have discussed in class, the Audio Notebook follows the theme of digitally augmenting our physical world instead of creating objects in a digital world. By making use of existing, natural human affordances, users can learn a new interface much more quickly. The Audio Notebook seemed pretty successful in this regard, but one question that I had after reading this paper is whether users are actually willing to learn a new system. A few of my friends who were given the LiveScribe Pen, a pen that also supports spatial sound navigation, found that they would rather take notes traditionally, without the audio recording.

An interesting realization from the user study is that the audio notebook fundamentally changed the ways that the users took notes. The audio notebook didn't just add more data to existing note taking, it changed the process altogether. The user restructured note taking to holistically capture all the relevant data in the most efficient way. This is similar to how we now all restructure the way that we communicate to better utilize email and text.

Video Object Annotation, Navigation, and Composition

This paper introduces a suite of video interaction techniques that use 2D object tracking technology. It enables the user to manipulate or mark moving objects in videos through a direct-manipulation style of interface.

I would have been very impressed if I hadn't already read about similar work when visiting Prof. Agrawala's website in the past. It's interesting to get a high-level idea of how such systems work, though, and at the very least this paper serves as a good example of how exploring new computational technologies can open up a whole new space of human-computer interactions.

It would be interesting to see if they could produce a similar application scaled down for mobile phone use, essentially trading accuracy for speed and computation time. As a prototyping tool, I can imagine how a direct manipulation interface for objects in video would be immensely useful.


Drew Fisher - 11/2/2010 18:58:57

Video Object Annotation, Navigation, and Composition

The primary contribution of this paper is a direct-manipulation approach to video editing and playback. According to their results, this approach is both more intuitive and more efficient for users performing tasks like video annotation than current software offerings.

I am concerned by the performance limitations mentioned in section 5.1 - the authors admit that a single standard-definition video frame requires up to 5 minutes of preprocessing. That's 5 × 60 × 24 = 7200 times as long as real time (5 minutes per frame, 60 seconds per minute, 24 frames per second). This suggests that while the approach might be useful for real footage, it would need significant algorithmic improvements (or a sizable compute cluster) to see use by consumers or industry. In addition, there are limits to the technique's efficacy that restrict which clips it will work on - expanding the useful range of the technique will also prove valuable. These seem to be limits of the current implementation, though, not of the idea itself.

The spatial inertia/kinetic movement feature felt like a gimmick brought on by the recent explosion in multitouch/kinetic interfaces, and I couldn't see that actually being useful in any video editing or exploration activity.


The Audio Notebook - Paper and Pen Interaction with Structured Speech

The primary contribution of this paper is implicitly connecting written notes with the audio recorded as the notes were being taken. Connecting these allows our minds to more easily navigate notes from the past by giving a more holistic copy of the experience, as well as by letting us use the intuitive connections our brains make between sound and movement.

I particularly liked the "snap-to-phrase" technique employed - the system's awareness of logical units, rather than strictly physical ones, can be used to help infer what the user really wanted. This helps take into account the lag between when audio is heard and when the notetaker's brain processes it, makes sense of it, and converts it into words to write down on paper.

The broader-impact focus of this paper is to think about making tools to augment reality, rather than to replace it.

Sidenote: this paper takes a more ethnographic approach to its research subject than I have otherwise seen in CHI.


Arpad Kovacs - 11/2/2010 19:00:06

The Audio Notebook combines a digital audio recorder and a paper notebook in order to augment and synchronize a user's notes with the original audio from a lecture/interview/encounter, and thus allows easy navigation and search of the data. The device structures the audio according to both the user's notetaking and the speaker's tone of voice, thus allowing a mapping from time in the audio domain to space in the writing domain, which can be skimmed more easily.

I think that the most interesting part of the system is the topic suggestion capability. This is excellent for strengthening the mapping between notes and voice. Otherwise, I do not think that the system in its current state provides much improvement over just adding the corresponding timestamp in the audio track to each page of notes. I think a useful addition to the system would be voice recognition of the original audio stream, which would allow not just skimming of the user's notes but also lookup of direct quotes from the original audio. In order to keep the notes, textual transcription, and original audio track synchronized, the paper notepad would have to be replaced with an electronic one that could turn the scrollbar of the Audio Notebook into a karaoke-style timeline, like the one used by Google Voice for playing back voicemail messages. In summary, I think that substituting the paper notepad with an electronic Wacom tablet or similar device that can provide additional feedback (e.g., highlighting items by timestamp to show direct links between the spatial and temporal domains) and the advantages of digital media (e.g., text search, copy-paste) would have made for a more compelling device by making the mapping between notes, transcription, and audio more direct and easier to access and navigate.


The second paper describes how 2D computer vision can be used to add annotations, navigation, and image composition functionality to video streams via direct manipulation. The advantage of this approach is that it allows users to interact continuously with "objects" in the video, rather than enforcing the traditional constraints of timelines and frames. The system first performs a pre-processing phase where it places and tracks particles over time, and then aggregates these points into object groups which the user can manipulate. The system then attempts to recover the motion of objects, but does not try to put precise constraints on their boundaries, since that is not necessary for the user to directly manipulate the objects.

This is one of the most interesting papers I have read this semester, and has a lot of potential. By allowing the user to interact with objects, this system significantly reduces the gulf of execution as well as the time to perform a given video editing task, since he/she can just drag and drop the object, or directly write on the video, rather than edit a collection of frames. Unfortunately the preprocessing stage is very slow, which may make the system unsuitable for immediate deployment, but this can probably be improved with more work.

I thought that the handling of full occlusion of dragged objects, by giving the virtual sliders inertia, was clever - a neat psychological hack for a technical problem. I also thought that the perspective deformation of the graffiti based on anchor tracks was a nice feature. Overall, this is a very impressive system; I would really like to try it soon!


Richard Shin - 11/2/2010 19:00:09

The Audio Notebook: Paper and Pen Interaction with Structured Speech

This paper describes a system that allows associating recorded audio with notes written on a paper pad. As written notes often lack the detail of what is said, yet audio recordings are frustrating to use when finding a portion of interest (requiring scrubbing or fast-forwarding backward and forward), combining the two allows the note-taker to preserve details that written notes alone would have lost, while enabling easy navigation of the audio recording through the written text. By placing a suitably coded notebook over a digitizing tablet, the device can record sounds as well as what is written in the notepad (and on which page). The authors present the results of user studies that they performed, as well as 'audio snap-to-grid' and 'topic suggestions' for improving the match between text and audio.

This paper follows previous attempts that we have read about for augmenting paper; specifically, the DigitalDesk and the Designers' Outpost. While those were large-scale systems using cameras, this instead provides a much smaller system which does not augment the paper visually, but only adds audio annotations. Overall, the idea in the paper seems appealing for anyone that needs to take notes while also recording the audio; I know that this kind of system is now actually available commercially (although I don't recall the name), albeit placing all the intelligence in the pen and requiring specially-patterned paper rather than providing a large device for placing the paper notepad into. Surely, this system preserves many of paper's advantages while adding new features to it.

I'm not sure how suitable augmenting paper was for a system like this, however, compared to just a touchscreen-based device. It seemed that, in order to enable augmentation, a bulky device was needed to keep track of pen strokes; the notepad needs to encode the page numbers specially for the device, so that it can keep track of which page is visible when tracking the writing. Also, while the authors tout the benefits of paper for note-taking applications, chiefly the ability to easily remove sheets from the notebook, it seemed unclear whether the augmentations that this system provides would still be available after doing so.

More fundamentally, I think that audio would often correspond poorly to what the user writes in the notebook. People can rarely transcribe what others say in real-time, or copy from a blackboard immediately; some delay is always present, which may also vary over time. While the paper addresses this partially with phrase detection, if the audio is more than one phrase apart from the text, then the correspondence between text and audio could easily become largely useless.

Video Object Annotation, Navigation, and Composition

This paper explores the use of object and feature tracking in video for matching annotations to objects, or to enable direct-manipulation navigation of the video. The authors note that timelines provide a poor interface for many types of interactions with video, as it remains unaware of the higher-level components of the video such as which objects appear and where. The paper proposes particle tracking and grouping as a new primitive that would provide information about how objects move and deform. Using this information, the authors demonstrate object-aware video annotations, video navigation by dragging objects, and manipulating specific objects in a video to create a composite still.

The method that the authors present for tracking objects within a video seems to me the primary contribution of the paper, more so than the applications built upon it that are also presented. I can imagine many more applications that could be built with this awareness of motion, particularly in video editing, an area that the authors do not explore; consider the ability to independently adjust the speed at which separate objects move, for example. Object tracking could be combined with visual search, so that users could learn more about things that appear in a video simply by clicking on them. Although I personally haven't seen real-world uses of this technology, it seems widely applicable to many users of video.

Of course, both for applications presented in the paper and for any future applications that might arise, accuracy of the tracking is paramount. Unfortunately, they don't really seem to discuss how well their system works in that regard; while the authors informally evaluate the usability implications of the applications that they present, they don't seem to have evaluated the accuracy or robustness of their system at tracking motion and grouping points together into coherent objects.


Matthew Can - 11/2/2010 19:01:58

The Audio Notebook

This paper presents the Audio Notebook, a system for taking handwritten notes and capturing audio for later review. It is particularly novel because it synchronizes the notes with the audio, making it easy to review the two in concert.

In contrast to tape recorders, the Audio Notebook provides a fluid interface for navigating audio recordings. What makes this possible is that the audio is structured, both by the user’s notetaking and the audio’s acoustic properties. For example, by pointing the pen at a handwritten note, the user can index into the audio at the time when the note was taken. In other words, the system maps time onto space. This interaction technique is one contribution of the paper. More generally, the paper contributes to HCI because it demonstrates a successful system that uses computation to augment rather than replace the status quo, in this case pen and paper for notetaking.

One thing I really liked about this work was the longitudinal field study. Unlike the laboratory user studies in other papers we have read, this study has strong ecological validity, and it is particularly suitable to studying this system. The participants used the system for real tasks, in their chosen manner.

In addition, I liked the concept of topic suggestions, especially because they add structure to the audio recordings in the absence of user-created structure. I thought it was fascinating that the topic suggestions were predicted from acoustic features alone, but this approach seems quite error-prone. The system allows users to adjust the tradeoff between false positives and false negatives, but it should really employ more sophisticated NLP methods. And, it would have been nice if the authors presented precision and recall figures on how well their system extracts topics from a recording.

Overall, I thought the Audio Notebook was well thought out, with carefully polished interaction details like the audio snap-to-grid. But more importantly, I liked the authors’ approach to addressing the problem of effective notetaking: augment, don’t replace. The best thing about this system is that it is perfectly compatible with the typical notetaking method; the second reporter in the field study is a good example. Like the DigitalDesk, it integrates the advantages of the physical and digital worlds.


Video Object Annotation, Navigation, and Composition

This paper explores how 2D object tracking in video can be used to build a better interface for video annotation, navigation, and composition. The interface allows users to create annotations that transform along with the objects in the video. Users can directly manipulate video objects to navigate to a point in the video when the object is near the specified position.
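The navigation idea reduces, at its core, to an inverse lookup: given where the user has dragged an object, find the frame in which the object was nearest that spot. A naive sketch (mine, not the paper's; their system constrains the search to stay temporally local so playback doesn't jump discontinuously):

```python
import numpy as np

def frame_for_drag(object_track, pointer_xy):
    """Return the frame whose object position best matches the pointer.

    object_track: array (num_frames, 2) of the object's centroid per
                  frame, as recovered by the tracking preprocess.
    pointer_xy:   (x, y) position the user has dragged the object to.
    A real implementation would search only near the current frame.
    """
    d = np.linalg.norm(object_track - np.asarray(pointer_xy), axis=1)
    return int(np.argmin(d))
```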

What I liked most about this work was the object-associated annotations. This is a wonderful, practical application that comes out of thinking about video editing in terms of higher level concepts like objects, rather than frames and timelines. I thought the path arrow annotation in particular was impressive.

As for the direct manipulation video navigation, I found the concept very interesting. For some applications like reanimation, it seems better than the standard timeline or slider interaction technique. But for typical video viewing, it does not seem well suited, especially because of discontinuities. It would be interesting to take a cue from the Audio Notebook paper and automatically structure a video stream into suggested scenes that index the video, marking the standard slider widget with these suggestions. Although this is not a direct manipulation approach, it could nonetheless enhance video navigation, especially for movie viewers.