Audio/Video

From CS260 Fall 2011

Bjoern's Slides

media:cs260-17-audio.pdf

Extra Materials

Discussant's Materials

Link to presentation: Audio/Video Manipulation

Link to video of Paper:

LiveScribe Demo:

Another interesting application I found while looking at video manipulation was a paper called "Unstructured video-based rendering: interactive exploration of casually captured videos" by Luca Ballan et al.

Reading Responses

Valkyrie Savage - 10/29/2011 12:34:06

Main idea:

The current ways that we have of interacting with auditory and visual inputs (that is, linearly in time) could be improved: having the option to segment these stimuli in a visual way that is not linear can help us increase our understanding of them.

Reactions:

Before I say anything about the two papers individually, I would like to say that there isn’t anything I like better than using technology to improve upon nature. Our senses only allow us to experience real-world things in a unidirectional linear fashion: moving forward at the same pace as time. The options of fast-forward and rewind changed this. Now, adding semantic meaning on top of those same options will again improve us over our natural senses. Hurray, technology!

The Goldman, et al., paper about video object annotation and navigation dealt with how one might wish to use objects to track through a video, e.g. following a football player as he moves through a play, regardless of pauses he may make at certain parts of his destined arc. Parts of this paper seemed like really great ideas (the ability to compose a shot of a group of people from a video of them, for instance), but I'm not sure that this method was the right way to approach the problems. I do like the addition of meaning based on objects flowing as a way of navigation, and that security camera example was a good one. Advertising-motivated annotations make me a little sad, but I guess we all know that that's what our world is eventually coming to as consumers are willing to face more and more ads in order to pay for fewer and fewer services. I was a bit taken aback when they just "mentioned" late in the paper that their preprocessing took 5 minutes per frame. I probably wouldn't have been so excited about it if I'd known that earlier, but anyway... it was an interesting paper. I think there are more applications of the idea that I might be excited about (for instance, if they modeled the interactions from the second paper, and we were able to jump to points where a new character was introduced in the scene, etc., although I understand that their system is not yet robust to knowing characters from one occurrence to the next).

The other paper, by Stifelman, et al., was equally interesting. Their longitudinal study didn't seem so longitudinal; the reporters each only used the device twice, once for writing and once for reading, so I'm uncertain how the authors thought they could call that a "chance to get familiar with the device", as it sounds mostly like the kind of study that could have been done in a lab in an hour. I was happy that at least one of the people seemed to hate the device, as I hope that helped the authors work on new techniques to make it nicer for everyone. It seemed like a good opportunity for them since he wanted to use the new system alongside the old-school tape-recorder system: as we discussed, having multiple options usually gives one a better chance of providing helpful feedback.


Hanzhong (Ayden) Ye - 10/30/2011 23:34:00

I think audio and video are interesting topics in Human-Computer Interaction because of the rich information they carry and the many possibilities for manipulation. In the first paper, the authors illustrate a new device called the audio notebook, which addresses the problem a listener usually experiences when attempting to capture information presented during a lecture, meeting, or interview. As a combination of a digital audio recorder and a traditional paper notebook, the device enables a range of usage styles, from detailed review to high-speed skimming. The design is smart in that it combines the interface of a traditional notebook with the recording and audio-navigation process. However, with the appearance of more and more advanced mobile devices such as tablet PCs, I think the device will be replaced by applications developed on these devices.

The second paper, by Adobe and researchers at the University of Washington, is a more interesting one because it covers the topic of video manipulation broadly. The authors explore the use of tracked 2D object motion to enable many novel approaches to interacting with videos. They have developed novel ways to add moving annotations, navigate video directly by manipulating objects on screen, and even create an image composite from multiple frames. Although I am not quite clear about the computer vision techniques used in their implementation, I like their final results a lot, and I believe their work can be widely employed in a variety of applications including film and video editing, visual tagging, and authoring rich media. I especially like the path-arrow generation feature, and I think it has great potential to be widely used in live sports broadcasts and many other areas of video editing.


Shiry Ginosar - 10/31/2011 19:53:19

The Audio Notebook presents a system that allows a note taker to capture and review the audio of a lecture or interview in synchrony with written notes. The Video Object Annotation paper describes a system that allows users to perform various manipulations of video all relying on a preprocessing step where objects in the video are motion tracked.

As a student, the problem of taking notes during a lecture is acutely familiar. A delicate balance must be maintained between following the class discussion and capturing highlights and details in writing for later review. The authors of the Audio Notebook paper seem to have designed the dream system to address this problem. By allowing the audio track to be queried based on a specific drawing or note captured during the lecture, one can easily retrieve the larger context of anything that was missed in real time while taking advantage of one's intimate familiarity with the spatial arrangement of one's notes. I especially enjoyed the study described in this paper as this sort of application seems to lend itself well to a longer term study rather than a one-off lab test. All in all I really have no negative remarks. I kind of want one of these to get me through grad school!

The interaction described in the Video Object Annotation paper was harder for me to relate to, as I know nothing of video editing and annotation tools, nor do I have the need to use them. While the applications described seemed interesting, the interaction descriptions seemed long and complicated despite the fact that they were all accompanied by descriptive images. This is the kind of paper that could really have benefited from a video online, which I could not find. Moreover, the study described in this paper was more of a proof of concept than a study. Though the interactions presented are unique to this system, the paper did not give me a sense of how usable they are for the various goals described, and as mentioned, this gap was hard for me to fill since the domain is outside of my familiarity zone.


Yun Jin - 11/1/2011 14:15:51

The first paper introduces the general idea of an epistemic action and discusses its role in Tetris, a real-time, interactive video game. Epistemic actions (physical actions that make mental computation easier, faster, or more reliable) are external actions that an agent performs to change his or her own computational state. More precisely, they use the term epistemic action to designate a physical action whose primary function is to improve cognition by reducing the memory involved in mental computation (space complexity), reducing the number of steps involved in mental computation (time complexity), or reducing the probability of error of mental computation (unreliability).

The second paper draws on theories of embodiment from psychology, sociology, and philosophy, synthesizing five themes the authors believe are particularly salient for designing and evaluating interactive systems: thinking through doing, performance, visibility, risk, and thick practice. They introduce aspects of human embodied engagement in the world with the goal of inspiring new interaction design approaches and evaluations that better integrate the physical and computational worlds. The first theme, thinking through doing, describes how thought (mind) and action (body) are deeply integrated and how they co-produce learning and reasoning. The second, performance, describes the rich actions our bodies are capable of, and how physical action can be both faster and more nuanced than symbolic cognition. The first two themes primarily address individual corporeality; the next two are primarily concerned with social affordances. Visibility describes the role of artifacts in collaboration and cooperation. Risk explores how the uncertainty and risk of physical co-presence shapes interpersonal and human-computer interactions. The final theme, thickness of practice, suggests that because the pursuit of digital verisimilitude is more difficult than it might seem, embodied interaction is a more prudent path.


Steve Rubin - 11/1/2011 16:19:53

The two papers for this class focused on two applications using audio and video. The first, "The Audio Notebook" described a device that augmented note-taking with audio recordings. The second showed novel interaction techniques for working with video, including annotation (speech bubbles, graffiti) and directly manipulated navigation. While the first paper offered a highly adaptable system that allowed the user relative freedom in exploration, the second paper had more specific goals.

Accordingly, I think this is a good time to think about the conceptual differences between these two approaches. I was excited when I read "The Audio Notebook" because it is an open-ended system. The authors did not prescribe a use for it, but showed through their user studies that people generally found it to be an effective tool. They did not stress a quantification of its effectiveness, but rather showed through example how the device could be useful in many ways. I think one hallmark of good research is that the motivation is more or less obvious; I didn't find myself needing to be convinced of anything here. Normally when I read papers that lack a solid evaluation, I start to wonder about the credibility of their assertions and the validity of their motivation.

The second paper presented interesting ideas as well, although from a much narrower focus. Its primary contribution was a suite of tools for working with video at a very high level. They were essentially automating and streamlining things that could be done with traditional video post-processing tools. By automating these features, they hoped to offer these tools to users for communication-related activities in addition to standard video production work. Some of the applications in this paper seemed novel in a silly way. The speech bubble and graffiti techniques did not seem very useful, but served more as a proof of concept. Navigation via direct manipulation, however, seems very useful. Because it represents a significant paradigm shift in how we navigate video, I would like to have seen some discussion on how users can effectively leverage this technique. Otherwise users may find themselves progressing through the video more or less linearly anyway.

While the first paper needed no motivation (for me), the second would have benefitted from a more focused agenda. I don't think they needed to discuss every single thing they implemented, and should have focused on a few of the examples instead.


Laura Devendorf - 11/1/2011 16:23:14

Video Object Annotation, Navigation, Composition describes and defends novel methods for interacting with and editing videos. The Audio Notebook discusses a method for simultaneously capturing audio and text notes in a notebook and discusses both storage and retrieval techniques.

This paper presents a number of interesting, useful concepts. The system seems highly coupled to the vision algorithm they employ, and I can't be entirely confident in its usefulness without actually using it, having a user study to support it, or the computer vision knowledge to validate their algorithmic choices. I also thought the authors were contradicting themselves by saying "we believe the tools we have demonstrated are largely unique to our system" in the first paragraph of their informal evaluation when, in the Introduction, they gave several examples of similar projects and how they were going to improve on them. The applications were less interesting and novel than their methods for producing such interactions.

I enjoyed the audio notebook paper, especially the interaction techniques they described for skimming audio. Yes, their prototype is bulky and far from robust by today's standards, but I think it presents a great conceptual idea and has effectively been recreated by Livescribe systems. While having notes in addition to audio is an interesting concept, I began to wonder how it would augment the note-taking process. Would the user's notes function more like an annotation of the audio?



Amanda Ren - 11/1/2011 19:48:34

The Stifelman paper introduces the Audio Notebook, which combines digital recording with the familiar pen-and-paper note-taking technique.

The paper is important because it addresses a current problem many people face (taking structured notes through a lecture or presentation) and presents a solution that doesn't necessarily change how people are used to doing things (taking notes with pen and paper) but seeks to improve their experience. I thought the paper was well written, and I appreciated that they had an extensive user study. I found it interesting how one of the students changed the way they took notes with the tool, and how one of the reporters could have saved a lot of time had he just used the tool instead of transcribing his video recording. This relates to today's technology because we have already seen something implemented that is very similar to what is described in this paper. Livescribe allows users to use a smartpen and digital paper to review audio associated with notes recorded at a specific time. Although I have never actually used this technology, I'm wondering how good the audio quality is, especially given that in lecture, you could have either background noise (echoes in the lecture hall) or be too far from the speaker.

The Goldman paper describes a series of techniques for interacting with video.

The paper was fairly technical. It describes a tool that uses object motion to simplify the tasks of video annotation (through speech bubbles, path arrows, or video graffiti), video navigation, and video-to-still composition. Their system does all tracking and grouping before the user interacts with the system, allowing the user to think about their high-level goals. They also performed an informal user study that showed their system resulted in both less time and fewer mouse clicks compared to the After Effects program. The user study, however, only consisted of a novice and an advanced user; perhaps it could have been more extensive. The one downside to their system is the length of time they need to preprocess a video. This paper relates to today's technologies because, with the popularity of YouTube and better-quality video recorders being present on mobile phones, we need a fast and easy way to edit these videos.


Viraj Kulkarni - 11/1/2011 20:26:08

'The audio notebook: paper and pen interaction with structured speech' proposes a system which couples notes handwritten on physical paper with the audio recorded while the notes were being taken. The idea is attractive and has several use cases. However, two things bother me about this approach. First, the authors mention that every pen stroke is linked to a different part of the audio recording, providing a fine-grained mapping between notes and audio. Usually, when I take notes, I don't write down something for every sentence I hear. Instead, I summarize a dozen sentences I hear and write about them AFTER I finish hearing those sentences. There is a necessary lag between when I hear something and when I write it down. I don't think this system handles this very well. Secondly, why not embed the recording technology in the pen itself? The paper was published in 2001, and pens which record strokes might not have been available then.
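To make the mapping concrete, here is a minimal sketch of the kind of stroke-to-audio index the paper describes, with a hypothetical, user-adjustable writing-lag offset added to address the delay mentioned above; the class name, parameters, and the lag mechanism are my assumptions, not details from the paper.

```python
import bisect


class StrokeAudioIndex:
    """Sketch: map pen strokes to points in an audio recording.

    Assumes strokes and audio share one clock. `writing_lag_s` is a
    hypothetical offset that backs playback up a bit to compensate for
    the delay between hearing something and writing it down.
    """

    def __init__(self, writing_lag_s=5.0):
        self.writing_lag_s = writing_lag_s
        self.strokes = []  # (timestamp_s, page, x, y), kept sorted by time

    def record_stroke(self, timestamp_s, page, x, y):
        bisect.insort(self.strokes, (timestamp_s, page, x, y))

    def playback_time_for(self, page, x, y, radius=10.0):
        """Return the audio time for the stroke nearest (x, y) on `page`."""
        candidates = [
            (abs(sx - x) + abs(sy - y), t)
            for (t, p, sx, sy) in self.strokes
            if p == page and abs(sx - x) <= radius and abs(sy - y) <= radius
        ]
        if not candidates:
            return None
        _, t = min(candidates)
        # Back up by the writing lag so playback starts before the note was written.
        return max(0.0, t - self.writing_lag_s)
```

A richer version of this idea could let the user tune `writing_lag_s` per session, which is one way to accommodate note-takers who summarize after the fact.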

'Video Object Annotation, Navigation, and Composition' presents a system for interactively associating annotations with objects in a video, navigating through a video, and composing new still frames from a video. With increasing availability of disk storage and internet bandwidth and decreasing costs of video cameras, video is becoming an increasingly important part of our lives. The goal of their system is to allow manipulation of videos at a level higher than frames and timelines. The system relies on heavy preprocessing of the video to identify object motion, which is then used to perform the above-mentioned tasks. Although the paper and the system developed do not go a very long way in higher-level video manipulation, it's definitely a welcome step in that direction.


Peggy Chi - 11/2/2011 0:08:25

Audio and video are two kinds of multimedia that are difficult to manipulate; how do we design new interactive ways to not only improve navigation efficiency but also provide better experiences? The Audio Notebook was an advanced work that applied everyday physical objects such as pen and paper to play back audio recordings linked to paper notes. Goldman et al. presented an innovative technique that enabled intuitive video object interactions.

Both of these papers strongly address my interests, and of the two I especially love the video paper because it shows a new perspective of interacting with videos at a different level: the objects. I still remember, from years ago, how impressive the demo video was that showed users navigating through time by dragging the objects. It breaks the limitation of the timeline in favor of more content-based interaction. The Audio Notebook reminds me of many current iPad apps such as SoundNote (http://soundnote.com/), AudioNote (http://luminantsoftware.com/iphone/audionote.html), and Notes Plus (http://notesplusapp.com/). Though the authors emphasized the great features of everyday objects like paper and pen, current technology somehow pushes the touchscreen experience closer and closer to the physical world we are familiar with. All in all, these papers demonstrated great examples of how digital data can be interacted with by direct manipulation. However, now that the technology has enabled us to focus on the content itself, I wonder if this will bring up new challenges of automation, capturing, or other issues. For example, more techniques such as face recognition, emotion detection, and context-aware content could be considered to enhance higher-level interactions. What are other opportunities?


Hong Wu - 11/2/2011 0:17:55

Main idea:

Both papers associate a traditional device (pen and paper) with digital media (audio or video).

Details:

"The Audio Notebook" presented a system which combines paper writing with audio. The flipping of pages is associated with the segmentation of the audio. By doing this, the user does not need to segment the audio manually. It is also easier for the user to track down and find the associated audio content from the notes. The paper did not mention what would happen if the user went back to a previous page while recording was going on. Only three people were interviewed in this experiment, which is too few.

“Video Object Annotation, Navigation, and Composition” showed a system to add graphical annotations to video, to navigate video by applying screen-space motion of objects, and to compose new still frames from video. The preprocessing for video interaction is one of the most important contributions of the paper.

Associating the pen with video and audio is a good direction. It helps people deal with the digital world in a familiar way. If a device that records video and audio can be made as thin and light as paper, it may replace paper.


Alex Chung - 11/2/2011 1:12:12

The Audio Notebook combines the advantages of physical writing on paper along with the digital recording of audio that is time-coded with the markings on paper.

While some might argue that typing is faster and more readable, it is much easier to sketch graphs and drawings on paper. Switching between the modes of characters and shapes on a computer system often requires the user to make an explicit gesture to do so. Such action usually diverts the note taker's attention and disrupts the flow of writing. Besides allowing users to perform the task without obstruction, the digital recording lets them use the physical writing as markers to review the audio recording. Overall, there is less resistance to adoption because the technology enhances the task without asking for more effort from the user.

As a happy customer of the Livescribe Echo smartpen, this is another reminder of how technology goes through many iterations of research and development before emerging as a mainstream product. The current version mounts a small optical sensor to "watch" where the pen has been on specially printed paper in order to digitize the writing. At the same time, the timestamp of the audio recording is linked to where the pen has been on the page. This provides precise control of the audio recording by each pen stroke.

However, the smartpen only outlines the audio into segments if the user makes the effort to take notes while recording. Otherwise, it is no different from a regular voice recorder. Actually, it could be worse, because a regular recorder can be placed at the front of the lecture hall while you sit at the very back. The smartpen does not work very well if there is a distance between the speaker and the note taker.

The second paper presents an application that affords the manipulation of a single object in a video, with the effect automatically applied to the same object throughout the clip. For example, a user can paint a string of text on a moving car in a single frame. Then the same string of text remains on the vehicle throughout the processed video.

Interaction with motion graphics is certainly a new paradigm in the field of HCI. The paper contributed many new features that allow simple-to-use interaction techniques for manipulating individual objects in the video. For example, a user can independently alter the motions of three individuals within the frame and then merge them into a single snapshot. Something similar can be seen in the Windows 7 commercial where the mom takes a video of her family and selects the perfect headshot for each member of the family photo.

However, I don't see enough new interaction techniques being introduced in this paper. Many interactions are similar to the use of Photoshop for editing pictures. On the other hand, the smudge feature is an interesting new one, where they can paint a giant arrow to show where the object has been and what direction it is going next. The recognition of users' motion can also be significant to the HCI community when it comes to usability studies that require video coding. Instead of watching the video, the program can automate the process.
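As a rough illustration (not the paper's actual algorithm), a path arrow like the one described above could be derived directly from the tracking data by sampling an object's per-frame positions into a polyline and drawing an arrowhead at the end; the function name and data layout here are assumptions.

```python
def path_arrow_points(track, start_frame, end_frame, step=5):
    """Sketch: build a polyline for a path arrow from per-frame positions.

    `track` maps frame index -> (x, y) position of the tracked object.
    A renderer would stroke the returned points as a curve and draw an
    arrowhead at the last point to show the direction of travel.
    """
    frames = [f for f in range(start_frame, end_frame + 1, step) if f in track]
    points = [track[f] for f in frames]
    # Always include the final tracked position so the arrowhead lands there.
    if end_frame in track and (not points or points[-1] != track[end_frame]):
        points.append(track[end_frame])
    return points
```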


Ali Sinan Koksal - 11/2/2011 1:30:12

The Audio Notebook is a system aiming to help and augment the task of taking notes using paper and pen. Notes taken on a notebook are synchronized with the simultaneous audio recording, and each annotation on paper becomes an index into the corresponding point of time in the audio recording. Moreover, another structuring technique is implemented to ease further navigation: detecting beginnings of sentences and potential changes of topics.

This work has clearly been very influential and successfully commercialized (the Livescribe smartpen is one instance). The idea of augmenting the familiar pen and paper with audio navigation capabilities reminds me of Weiser's view of not creating a virtual reality but enhancing our everyday life using technology. Such a method for structuring speech may well be useful in structuring continuous video recordings too, if the problems of battery life can be overcome, as we discussed during the last lecture. The fact that, this time, annotation is made implicit through note-taking is a great way of encouraging its adoption, without incurring additional effort from the user.

The authors emphasize greatly the role of the audio scrollbar on the tablet itself, which doesn't seem to be there in the current commercial systems. In my opinion, the tablet is too bulky to carry around all the time, and moving to a notebook-only approach with the help of a camera on the pen itself was a useful move.

The evaluation and the lessons taken from it were meanwhile interesting to read. They certainly motivated very different modes of use of the audio notebook.

The second paper is an impressive work on raising the level of abstraction in video editing, from timelines and frames to moving objects. There are different uses of the system that leverage 2D object tracking in videos: annotations can be tied to objects that move, objects can themselves act as virtual sliders that let users navigate through time by indicating object positions, and a technique for compositing images using a drag-and-drop metaphor is presented.

This work seems to brilliantly integrate direct manipulation interface concepts into video editing by allowing users to navigate through time by manipulating objects in the video itself. However, the performance of the system at the time of publication does not seem to enable such a DMI approach: a preprocessing cost of 5 minutes per frame makes it too slow to be responsive enough for use as a direct manipulation interface.

The ambiguity that arises when moving an object that has passed through certain points more than once could be explored a little more. It would be interesting to discuss ways of presenting the different possible points in time that correspond to a given location, therefore allowing users to explore parts of the video without getting stuck at certain points.


Galen Panger - 11/2/2011 1:42:08

The Audio Notebook is obviously quite similar to the present-day Livescribe pens, though the technology is stored in the writing surface rather than in the pen. This allows the Audio Notebook to do a few things that are (I believe) unique relative to the present-day implementation, including visually indicating important areas (high-level topic breaks) in the recording and allowing easy scrubbing through the audio in the (presumably) portable context. I assume all of this functionality could be provided through Livescribe's software interface, but I don't know whether it actually does this, and of course a laptop or other device would be required.

The Video Object piece involves tracking the movement of objects, which are detected through particle clustering. Though pre-processing is required, identifying objects in the video allows for video search and navigation, animated object annotation (the annotations follow objects), and limited construction of images from multiple video frames.

Both papers pointed toward success in their evaluations; the Video Object piece achieved an order-of-magnitude improvement in a preliminary expert user test, which is very impressive. The Audio Notebook results were not as clear-cut, but the reporter's experience locating important quotes was the clearest indication of the utility of these kinds of note-taking and recording devices. For students, the results are more ambiguous because they may spend more time with the material given that the recordings behind their notes can be called up. But this may be a good thing; and there's an indication that the audio wouldn't be used if it couldn't be easily skimmed or navigated.

I don’t have experience with the LiveScribe pens (the Audio Notebook analogue), though I can imagine that they would be quite helpful. That said, with respect to the utility of LiveScribe pens, I don’t really take notes anymore given that most courses that I’ve recently taken post slides online. And some courses even post video recordings of lectures where the content can be easily scanned and salient parts of the recording reviewed. So the LiveScribe pens may be made unnecessary by the increased use of slides (so why take notes?) and the increased use of lecture video recording (which can be easily visually scanned for content).


Manas Mittal - 11/2/2011 1:50:03

The Audio Notebook synchronizes recorded audio with the user's written notes. By combining the two disparate interface modalities, the interface enables easy retrieval and skimming of information. The Goldman et al. system enables video viewing and navigation by direct manipulation at an object-level granularity.

These papers are interesting for two reasons. First, they present interesting examples of using multiple modalities for a task that can be performed with an individual modality but is better performed with multiple modalities. Can we generalize this: is it commonly true? Second, these papers present direct manipulation interfaces. Is a direct manipulation interface always better?

I've had a long-standing idea to build the inverse of the audio notebook as a way to enable users to navigate phone trees (i.e., the automated "press 1 for X, press 2 for Y" systems). Currently, automated phone trees are presented as linear audio output. What would happen if we complemented this audio output with textual options displayed on the cell-phone menu that a user could choose from? This is now possible since many users already have an internet-enabled smartphone which can handle both audio and data in parallel.


Donghyuk Jung - 11/2/2011 2:39:22

The Audio Notebook

In this paper, the authors presented 'The Audio Notebook', a combination of a digital audio recorder and a paper notebook in a single device. They also conducted field studies in order to find out how people use it.

Pros: I think this project is a good example of combining the analog and digital worlds by compensating for the disadvantages of each. In particular, the additional functionalities (phrase detection and topic suggestions) are appropriate, as they are designed to reduce the gap between the two worlds.

Cons: They conducted user studies, but they mainly focused on how people use the device. They should have shown some usability tests demonstrating how it improved the note-taking experience in terms of performance or psychological satisfaction. Maybe an A/B test or ANOVA would be a good way to get statistical results comparing it with other tools (analog vs. digital, or audio recorder and notebook vs. Audio Notebook).

Video Object Annotation, Navigation, and Composition

In this paper, the authors presented “a system for interactively associating graphical annotations to independently moving video objects, navigating through video using the screen-space motion of objects in the scene, and composing new still frames from video input using a drag-and-drop metaphor.”

Pros: They developed new approaches for computing the motion of points and objects in a video clip, and interactive systems that utilize this data to visually annotate independently moving objects in the video. The interface they used to capture objects reduced task time significantly compared with indirect manipulation using frames and timelines.

Cons: As they mentioned in the paper, their implementation requires preprocessing, so it cannot instantly replay edited clips the way this kind of application needs to. Additionally, if users need to capture a fast-moving object or a very small object on the screen, it will be a very hard task for them.

Video Link for This Paper: http://vimeo.com/2345579


Derrick Coetzee - 11/2/2011 4:48:11

Today's readings dealt with user interfaces for audio and video content.

Stifelman et al's "The Audio Notebook," published in 2001 and since realized in part as a commercial product called Livescribe, is a system that allows users to effectively browse a long audio recording by leveraging notes written during the recording. By pointing at the notes, the user can return to the point in the audio where that note was written. A field study demonstrated a variety of useful applications of the technique.

One obvious limitation of the device is that, being limited to audio, it eschews visual cues of the speaker such as body language, whiteboard diagrams and writing, and slideshows. Provided that the speaker is being concurrently recorded on video, it seems like it would be a trivial matter to match up timestamps and browse clips of videos using the same technology.

A more subtle problem is that users are forced into a linear writing scheme where they can only write things related to what was just said by the speaker. They cannot expand upon previous notes with new insight or lag behind the speaker without mis-indexing. An interface that gives expert users more manual control over the writing/speech association could be valuable.

Acoustic structure, where structure is imposed based on characteristics of the speaker such as intonation, is a more heuristic technique that did not find its way into Livescribe. It can be important in regions where the user took few notes, particularly for cases where the user could not be present at all, but is difficult to perform accurately for such an underspecified problem. Incorporating video for additional cues could be valuable; alternatively, a means could be provided for users to go back and "fill in more notes" for audio regions that are too large.

Because the system is paper-based, it cannot sync in the opposite direction: show notes as the corresponding audio is played. This would be especially valuable for users who write nonlinearly, jumping between pages.

Finally, methodologically, the work focused on the invention and field study but performed no experiments. Without careful controls, it is difficult to eliminate confounding factors such as the novelty of the system on result metrics like time saved by using the system. Follow-up work in this area remains important.

Dan Goldman et al.'s 2008 work provides a framework for annotating and navigating using moving objects in a video. Using well-known computer vision techniques for object segmentation, they could track the motion of objects over time, then use novel UI mechanisms that are very simple and accessible to novices to attach annotations to moving objects in the video, and navigate through the video spatially by moving an object to a specific location where it appeared during some frame. Even for expert users, the system proved vastly more efficient at the specific tasks it enabled.
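At their core, the two interactions described above reduce to simple lookups over the precomputed track data. The sketch below is my own illustration under that assumption (a `track` mapping from frame index to object position), not the paper's implementation.

```python
def annotation_position(track, frame, anchor_offset=(0, -40)):
    """Sketch: an annotation attached to a tracked object simply follows the
    object's position each frame, at a fixed offset (e.g., a speech bubble
    floating above it)."""
    x, y = track[frame]
    dx, dy = anchor_offset
    return (x + dx, y + dy)


def frame_for_drag(track, target_xy):
    """Sketch of object-based scrubbing: when the user drags the object to a
    screen location, jump to the frame in which the tracked object was
    closest to that location."""
    tx, ty = target_xy
    return min(track, key=lambda f: (track[f][0] - tx) ** 2 + (track[f][1] - ty) ** 2)
```

The ambiguity noted above shows up directly in `frame_for_drag`: if the object visited the same location in several frames, the minimum is arbitrary among them, which is exactly the case the paper would need a disambiguation strategy for.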

The primary limitation is preprocessing time; 5 minutes per frame amounts to days for a show or film. This might be acceptable if not for the fact that many people who wish to annotate or navigate videos are also actively editing the videos, including actions like adding or removing content, reshooting/reanimating bad scenes, and so on; and the scheme does not appear to provide for incremental update of the segmentation data. It is also very limited in functionality compared to complete professional film editing products. In particular, there is little recourse when its heuristics fail: there is no way to "fine tune" object selections, nor to manually track fully occluded objects.

One interesting application that I'd like to see this technology applied to is more usable closed captioning technology, where speech could be attached directly to the characters producing it, avoiding the current awkward association by name.


Suryaveer Singh Lodha - 11/2/2011 7:38:36

The Audio Notebook

The paper describes in detail an Audio Notebook, which records pen strokes digitally and syncs them with audio recordings. There are features such as easy audio playback, indexing through a scrollbar, etc. For the user studies, the authors observed how students and professionals use the Audio Notebook in their daily lives, instead of just using it in a lab environment, which I think was nice and seemed more along the lines of beta testing a product, iterating over an existing product to make it better. The authors explored features such as the listening-to-writing offset, audio snap-to-grid to detect contiguous parts of the audio, etc. I found the audio snap-to-grid feature very interesting because it deals with data search at a higher level, which I think is one of the most important actions we as users perform on any data set. I think that the fact they tried to incorporate search in audio data by annotation was pretty nice; however, I'm not sure how optimal that is, or if that is the right way to do it. I'll be interested in learning about other approaches researchers have taken in this field to make the search experience better in audio media.
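For intuition, snap-to-grid can be thought of as rounding a requested playback position to a detected phrase boundary. The sketch below is a simplified illustration under the assumption that phrase starts have already been detected (the paper's acoustic detection step is not shown), and backing up to the preceding phrase start is my reading of the intended behavior.

```python
import bisect


def snap_to_phrase_start(phrase_starts_s, requested_s):
    """Sketch of 'audio snap-to-grid': instead of starting playback exactly
    where the user pointed, back up to the most recent detected phrase start
    so playback does not begin mid-phrase.

    `phrase_starts_s` is a sorted list of phrase-start times in seconds.
    """
    i = bisect.bisect_right(phrase_starts_s, requested_s)
    if i == 0:
        return requested_s  # no earlier phrase start known; play from here
    return phrase_starts_s[i - 1]
```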

Video Object Annotation, Navigation and Composition

The paper explores novel approaches to interacting with videos, including moving annotations, object tracking, and developing a composite image by using multiple frames from a video. I find compositing using multiple frames of a video very interesting, and something I can see myself using often. I also like the interface, and how it makes certain tasks which seem easy to formulate in the brain actually easy while doing them on a computer. For example, when someone asks me to add graffiti to a moving object, I would think: let's just add a layer of graffiti on top and make it a part of the moving object, so that they move together. This is easy to formulate, but difficult to implement. The paper is able to implement this quite easily. The idea of particle tracking and grouping sounds good when we have well-defined regions in the video or a good high-quality video, but I'm not sure how well it will perform in a complex video where boundaries may not be as well defined. I think looking at crowdsourcing for video annotation, where the software by itself cannot figure things out, might be an interesting area to explore. We have already seen in earlier papers (for example, Soylent's crowdsourced word processing) that we can have pseudo-real-time results. The reason I think crowdsourcing would be an interesting approach is that it is relatively cheap, and it's a lot simpler and faster for humans to annotate objects even when the video is not so clear or there is huge motion blur. Obviously it won't be real time as of now, but maybe in the future, as we have better and more responsive crowdsourcing platforms, it might be!


Jason Toy - 11/2/2011 8:28:03

The Audio Notebook

"The Audio Notebook" is about a new system that addresses the problem listeners have of divided attention when attempting to capture information during a lecture, meeting, etc.

The goal of the Audio Notebook is to retain the original audio of speakers while allowing a listener to quickly and easily access portions of interest. This results in changes in the listener's note-taking strategies. For example, one student began to take less detailed notes, instead writing about key topics, which allowed her to concentrate on the discussion taking place. The Audio Notebook is related to the Digital Desk paper we read previously, which projected a computer display onto paper documents. In both cases, the creators are trying to integrate the physical and virtual workspace. The Audio Notebook specifically tries to augment the familiar physical objects of paper and pen rather than try to replace them, for example, with a tablet. In terms of real world products, the Audio Notebook is similar to the Livescribe pen.

The authors of the paper did a good job noting both qualitative (feedback and change in note-taking) and quantitative (time to transcribe) factors in real-life settings. They also managed to incorporate the feedback, such as snapping to the nearest phrase on playback. I do, however, question the use of their time in looking into automatically detecting sections of notes. It doesn't seem particularly accurate, and when a user still has to guess a number of sections to separate the notes into, why shouldn't the user do this task themselves? They could do this easily by making some kind of mark, a star, or maybe a line between sections, a task which doesn't even have to be done while listening to the speaker.

Video Object Annotation, Navigation, and Composition

This paper presents a framework to enable users to interact with video by creating moving annotations, creating composites, and video navigation. While the paper does discuss use of its framework to do navigation and composition, I found annotation to be the most polarizing and intriguing of the topics, and thus will focus my discussion on this topic.

The use of annotations and hyperlinks could possibly change the way we look at video today. Interest for annotations could be found in the audience of "subbers", people who create subtitles for foreign film and videos. Given video today, subtitles are text lines that show up on the bottom of the screen as long as a person is talking. Sometimes this can be confusing as a viewer may not be able to immediately attribute a line of text to the speaker. Using speech bubbles may alleviate this problem. Another possibility for annotations and hyperlinks is for advertisement purposes. The advertisements you see on the side of a basketball court for example could be hyperlinked to bring you to the website of that company. Any type of ad can be placed as an annotation, allowing advertisers a new realm to explore. This could lead to a new cost structure where free video or shows would have advertisements embedded in them, while if a user paid for his or her favorite television show, it would come advertisement free. Another use of the hyperlinks could be similar to interactive video done today on youtube. Some youtube videos have links in them that allow a user to do a "Choose Your Own Adventure". Similar to the genre of books with the same title (http://en.wikipedia.org/wiki/Choose_Your_Own_Adventure), if you click on a door, your character will walk through the door, etc.

The authors of the paper have a well motivated problem: exploring the use of interactivity when dealing with video. They do a good job discussing limitations of the system, including the long pre-processing time and imperfections with the cost function. However, the paper fails to evaluate the use cases of these methods. Given its current state, where a user of Adobe After Effects requires up to 4 minutes to add one piece of graffiti, it seems illogical for the average user or viewer of video, who would not have similar skills, to ever do the same. Even for professional use, this may have limited uses. For example, in an operating room, there could be a couple dozen notations at any given point or time. I also question "entertainment" users as a possible audience for this application as the paper describes. Even if I was interested in investing the time to annotate a video, it seems that for a large population of viewers, this would not be of interest to them. Who would be my audience? Would I spend 10 hours annotating a video just so my best friend can watch it? In the case of "subbers", where a small group's annotations could reach out to a large group of viewers, and thus be cost efficient, there are still problems. A critical mass of users have to use this application before spending the time and effort to subtitle through means of word balloons or other annotations makes sense. Regarding entertainment users' adoption, the paper fails to evaluate the reaction of users to this system. Would they find the addition of graffiti or word bubbles annoying, intrusive, or out of place?


Vinson Chuong - 11/2/2011 8:48:03

Stifelman, Arons, and Schmandt's "The Audio Notebook" offers an interesting way to index and assign semantics to audio recordings by coupling them with handwritten notes. Goldman, Gonterman, Curless, Salesin, and Seitz's "Video Object Annotation, Navigation, and Composition" offers a system which abstracts a video into a set of moving objects and discusses various possible interactions using those objects.

In class, we've been seeing various ways of collecting raw information during the performance of a task, inferring the semantics of that information, and then presenting that information to the user in a meaningful way. In the Zoetrope, Chronicles, and photo manipulation tutorials papers, this inference is done automatically. Depending on how specific or generalized tasks can be, these systems may not have the "inferential power" to return really useful information. "The Audio Notebook" bypasses this problem by piggybacking on the semantics assigned by the user during note taking. The more structured those notes are, the more structure can be assigned to the audio. I find this to be a particularly clever approach that can be applied in other places as well. Take tutorial generation as an example. What if the user (the person generating the tutorial) gave verbal commentary of what he was doing and why? His actions would impose a useful structure on the commentary, perhaps enough structure to apply useful annotations to each step of the resulting tutorial.

The advantage of a useful abstraction layer is that, relative to the raw bits, you have more structure to work with. Objects and interactions can be described more concisely and meaningfully. You might be able to compose or act on objects in interesting ways. That is, if the abstraction is a good one. In the video object paper, the authors explore the benefits of abstracting the pixels of the frames of video into large moving objects. They demonstrate novel interactions like annotating objects (creating new objects and attaching objects to one another), navigating the video along an object's trajectory, and selecting a set of frames to build a composite image. One major drawback of this system is that applying the abstraction in a pre-process takes a very long time. To me, this functionality seems most useful when done in real time. For example, say I'm recording a clip for a screencast and I want to see what object annotations would look like so I can decide whether or not to recapture the clip. Or say I'm skimming through a bunch of just-recorded raw footage, looking for a good frame where the subject is postured in a specific way, and trying to find out whether or not I need to record more footage if there isn't such a frame. The system presented essentially says: this one video is important, and I should spend time abstracting out the moving objects to allow future interactions. This conflicts with the point of view: I want to quickly figure out if I've recorded what I need, and if not, I need to record more footage and go through the process again. I believe that this system is useful only if the piece of footage you want to look at is very important, important enough to invest the time in applying the abstraction.


Allie - 11/2/2011 8:52:28

In "Video Object Annotation, Navigation, and Composition", Goldman et all discuss technologies surrounding 2D object video motion tracking. This is done off-line preprocess that enable users to to interactively manipulaet the video, including annotation of speech and thought, video graffiti, path arrows, video hyperlinks, and schematic storyboards. This is particularly useful in sports broadcasting. The technology allows the user to think in terms of high-level goals such as placements of objects and types / content of their annotations, rather than low-level details of tracking and segmentation.

In "The Audio Notebook: Paper and Pen Interaction with Structured Speech", Stifelman, Arons, and Schmandt introduce the Audio Notebook, a combination of a digital audio recorder and paper notebook. In so doing, the researchers synchronize the user's handwritten notes with a digital audio recording, where time is mapped to space. The Audio Notebook augments, rather than replace the paper and pen. The paper takes us back to earlier in the semester with DigitalDesk, which attempted to integrate the computer and actual desktop. The Audio Notebook takes advantage of the user's natural activity to index an audio recording for later retrieval; where handwritten notes and page turns serve as indices into an audio recording. Listeners are then able to devote more attention to the talker.

I wish both papers talked more about the cognitive science behind whether these technologies are in fact, effective.


Yin-Chia Yeh - 11/2/2011 8:53:09

These two papers are about audio/video interaction techniques. The audio notebook paper augments a paper notebook with a recording device and takes advantage of the user's note-taking behavior as an index into the recorded audio. The video object annotation, navigation, and composition paper presents new ways of interacting with video clips via objects, which are extracted by preprocessing the video. The most interesting part of the audio notebook paper is how different people use the audio notebook. S1 uses it only as an augmented tool for reviewing purposes and does not change her note-taking behavior. On the other hand, S2 changes note-taking habits to adapt to the audio notebook. It would be interesting to also test how the audio notebook affects students' understanding of the lecture before and after using it. One question I have in mind about the audio notebook is whether it works well for people who usually don't take a lot of notes. If the user has to take more notes to index the audio, that seems to conflict a little with the original intention of the paper, which is not requiring people to take detailed notes. Another question is on the other side of the spectrum: for users who need highly detailed information, such as R2 in the paper, could speech recognition software be a better solution than the audio notebook? I am not familiar with the performance of state-of-the-art speech recognition software; if it does not work well for general speech, the audio notebook is still valuable.

For the video object paper, I like the navigation through direct manipulation more than the annotation part, since it provides a new way of navigating video instead of traditional timeline navigation. The downside of this paper, as mentioned by the authors, is that it is still subject to the performance of video object tracking algorithms and currently requires preprocessing. It would be interesting to see how these interaction techniques work if we substituted the preprocessed video object tracking algorithm with a real-time algorithm that runs faster but is not as precise. On the other hand, though navigation is more interesting to me, it seems annotation has more immediately useful applications. One possible improvement would be, instead of running preprocessing, to run real-time tracking only on the user's region of interest.


Rohan Nagesh - 11/2/2011 9:01:26

The first paper "The Audio Notebook" discusses a novel (for 2001) method of interaction designed to free notetakers from simultaneously having to take notes and understand what a lecturer is saying. The second paper "Video Object Annotation, Navigation, and Composition" discusses a video pre-processing approach to enable annotations and navigation on moving rather than static objects.

In the first paper, I had to put myself in the 2001 frame of reference to avoid passing judgment too hastily. For the time, this was definitely a remarkable idea. I liked the fact that the authors kept the metaphor of paper and pen in their design decisions, as I do think this is the most familiar method of note-taking for people. I didn't like the fact that all their interactions (starting, pausing, etc.) were done by touching the pen to the actual notebook rather than just using the pen itself. All in all, it was an idea ahead of its time, and I appreciate that.

In the second paper, I found myself unsure how the experience of moving annotations would feel. Would they follow a particular object through space and time? What would happen when many annotations or scribbles had been made on the same video? These were challenging questions that led me to believe a splicing of static photographs that have been annotated and moved into a slideshow might be easier to understand.


Sally Ahn - 11/2/2011 9:03:44

Stifelman et al. introduce the idea of combining a digital audio recorder and a physical paper notepad. They present various novel interaction techniques for navigating audio through an LED-lit scrollbar. The second paper, "Video Object Annotation, Navigation, and Composition" by Goldman et al., provides a novel interface for manipulating video in the spatial domain, rather than the temporal domain of frames.

One discussion I found interesting in the audio notebook paper was their discussion of choosing paper notepads over LCD displays. In other words, they chose to "augment" the paper notebook. This made me wonder whether tablets would ever really replace paper notebooks. I doubt they will, for the very reasons the authors list, and I wonder whether users would react differently to "audio tablets." I also wondered why this product hasn't ever become mainstream despite its claimed usefulness. One possibility is that this form of interaction neglects one key advantage (at least in my view) of audio: it is hands-free. By coupling audio navigation with paper-related interactions, that advantage is lost. Another thought I had regarding this idea was the high availability of webcasts today; webcasts eliminated the need to record audio, but I think a useful area to explore might be how to link segments of the webcasts to online notes, course syllabi, and collaborative docs.