Context area exam: Bradley Rhodes

Question 2

Question: In part based on the above, consider three versions of Remembrance Agent style applications.

on a desktop or conventional laptop computer, typing into a text editor, with optional windows available to show texts which are correlated with current user input.
a wearable computer with visual head-mounted display and twiddler-like user input.
(somewhat fanciful) a similar wearable computer which is also listening on a microphone, doing somewhat faulty word recognition on the speech it hears, and matching texts against that input. During ordinary conversation, the user is NOT using the twiddler, but the head-mounted display and twiddler are available for user interaction.

For each of these three scenarios, discuss the role of the auditory channel as a user interaction enhancement. Consider issues such as the task, the user's focus of attention, and social issues.

At issue with all of these are advantages and disadvantages of audio and how these features mesh with each particular task. In general, audio IO has these relevant features:

Long-range: audio output can work at a distance, and doesn't require the receiver to be focusing their gaze (ears?) in any one direction.
Non-visual/manual: audio output is a separate channel from video, and so will cause less crosstalk with visual primary tasks. For input, audio is separate from manual manipulation, and so is useful when one's hands are tied up in the primary task.
Harder to shut out: audio output is harder to shut out than video. One can't just close one's ears.
Multiple recipients: audio output is good for reaching many people at once.
Social/privacy issues: Audio output and input both can annoy others in earshot who aren't using the interface. It can also give private information to people in the area who perhaps shouldn't hear. This is especially a problem when multiple people are using audio interfaces in a constrained area.
Lack of localization: sound is harder to localize than video, so it is harder to focus an audio "annotation" on a single physical location or on a specific object. (This is the corollary to the "long range" point given above.)
Ephemerality: Audio output is ephemeral -- after it is spoken it cannot be re-read, but must be replayed.
Serial: Audio is serial in nature, so it is harder to scan than visual media.

For the desktop version of the RA hooked up to a word processor, the fact that audio is long range is not useful. When using the word processor, you are right there with the computer, so it needn't be long distance. When not using the computer, the text being typed isn't changing so the RA wouldn't have anything to say anyway. (One could imagine an RA hooked up to the network such that new information might come in, but that isn't the application that was asked about here.)

The primary task is visual, so presumably there would be less crosstalk with audio alerts. Audio can be processed more in parallel with reading or writing than written text can. However, writing at a word processor is more interruptible than many other tasks (such as driving). Furthermore, because we have control of the full computer screen, we have more ability to make visual annotations that interfere little with the current task (e.g. annotations can be near the text being annotated, but in a clearly designated area with clear physical properties to distinguish them from the text). These features of the task domain make the limited crosstalk of audio a less compelling reason to use it, though the advantage is still present. On the input side, we know the user will have their hands available, since that is how the word processor is already used. One might envision a system, however, where a speech interface is used to bring up suggestions while the user is still typing and working with the word processor. Even then, the audio interface should be only one of several possible methods of interaction. Bruce Tognazzini's "Starfire" video shows good use of audio in such a setting.

The fact that audio is harder to shut out would be good for critical alarms that the user can be expected to want. Eudora's audio hail when you have new email is one such example. However, most of the time remembrance agent suggestions aren't expected to be of vital importance. Indeed, even if only the most relevant suggestions were shown, it is hard to envision a back-end that could produce high-quality hits all the time except in very specific domains.

In a word processor application there is presumably only one user, so the fact that audio output can have multiple recipients is not a feature. Indeed, the social and privacy aspects of audio can be a problem, especially if audio alerts frequently occur.

The lack of localization in audio greatly limits its use in this setting. For example, if audio were used to speak or announce suggestions, it would best be used for global annotations (annotations about the general thing being typed) rather than annotations about specific paragraphs, names, or sentences. This point was brought out in the experiments done by Chalfonte in "Expressive Richness," though I disagree with her explanation. Chalfonte claimed that people used audio annotation for general comments and text for specific comments because audio is more expressive, easier to use, and generally better, and that therefore audio is used for harder concepts and not for more trivial local annotations like correcting spelling and grammar. She does not account for the fact that if audio is generally better, then it presumably should be used for such local changes as well. A more likely explanation for her data is that audio is useful for general comments because it is more expressive and easier to use for complex ideas, but that text is better for local annotations because it can easily be identified (in fact, must be identified) with a physical location. Thus correcting a spelling error is easier with text because you can make the annotation right on the misspelled word itself, while audio is harder to associate with that word. Similar design constraints apply to audio for RAs. Audio is better for annotations of a general nature like those shown in the Emacs version, but worse for local annotations like those shown in the Margin Notes system or in Sara Elo's PLUM system.

The ephemerality and serial nature of audio are a problem for annotations as well, because users are forced to listen to the audio right then and there, or not at all. Since the user might not yet have reached a stopping point in a thought or action in their primary task, the audio needs to be simple enough that it does not distract. If the presented audio is a speech message, it must be short enough not to be distracting, as in the first stages of a ramping interface; otherwise it needs to be an ambient type of audio that has very low information content and plays quietly and continually in the background. Longer and more complex sequences of audio are harder to scan and harder to choose to attend to, and therefore should only be used at later stages of a ramping interface.

After going through the features of audio, it seems clear that audio should be used only sparingly for the desktop version of the RA. To avoid bothering others, alerts should be played only occasionally, and only when the annotation is deemed especially useful. This suggests one of two designs. The first is a special application where all annotations are rare but useful, so that the occurrence of any suggestion warrants an audio signal. The second is two-tiered annotations, where normal annotations have no audio signal but especially good ones get one, possibly chosen from several audio signals that indicate the kind of information contained.
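The two-tiered scheme can be sketched in a few lines. This is only an illustration, not code from any RA implementation: the relevance score, the 0.9 threshold, the suggestion kinds, and the cue names are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values: neither the threshold nor the cue names come from the RA.
AUDIO_THRESHOLD = 0.9
CUES = {"email": "chime", "note": "click"}
DEFAULT_CUE = "tone"


@dataclass
class Suggestion:
    summary: str
    relevance: float  # assumed score in [0, 1] from the retrieval back-end
    kind: str         # assumed label for the kind of information


def audio_cue_for(s: Suggestion, threshold: float = AUDIO_THRESHOLD) -> Optional[str]:
    """Return a cue name for especially good suggestions, or None for silent ones."""
    if s.relevance < threshold:
        return None  # normal tier: shown visually, no sound
    # Top tier: the cue itself indicates the kind of information.
    return CUES.get(s.kind, DEFAULT_CUE)
```

With a high threshold, most suggestions stay silent, which keeps alerts rare for both the user and anyone else in earshot.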

For a wearable, most of the above analysis still holds, though the importance of the features changes. Long range is still not an issue, since the wearable is with you at all times. The fact that audio won't interfere with visual primary tasks might be good or bad, depending on the expected application for the wearable in question. If the expected domain is visual, then audio annotations might be good. Backseat Driver and the museum tour-guide applications are both good examples of such domains. If, on the other hand, the wearable is primarily used in audio-intensive environments such as conversations or classrooms, then using the heads-up display is better. It should also be noted that heads-up displays have much less screen real estate than desktop machines, so any use of a different medium can help conserve that screen space.
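The rule of thumb in this paragraph, to put suggestions on whichever channel the primary task is not using, can be written down as a minimal sketch. The names below are hypothetical and the two-way split is a simplification; a real system would also have to weigh screen space and social setting.

```python
from enum import Enum


class Modality(Enum):
    VISUAL = "visual"
    AUDITORY = "auditory"


def suggestion_channel(primary_task: Modality) -> Modality:
    """Pick the output channel that does not compete with the primary task."""
    if primary_task is Modality.VISUAL:
        return Modality.AUDITORY  # e.g. driving or a museum tour: speak the suggestion
    return Modality.VISUAL        # e.g. conversation or classroom: use the display
```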

The same issues hold true for audio input. If the wearer is not always expected to be able to type on the twiddler (e.g. an annotation might occur when the twiddler is off the hand, or when the user's hands are busy), then audio input might be good. On the other hand, if the user is with other people or is likely to be in a conversation, then audio input is socially unacceptable. Audio commands might be used to bring up the fuller text of a suggestion (moving up the ramping interface), or might be one of several possible interfaces for getting suggestions, letting users choose the method that best fits their current situation. Note that if the wearable only has the twiddler for input (i.e. no passive sensors such as GPS), then it is likely that the user has the twiddler in hand and is using it whenever an annotation is available, simply because at other times the RA has no changing input that would cause it to make a suggestion. Wearables and RAs with passive sensors start to make things much more interesting, because then a suggestion might become available when the user is not even actively engaged with the computer.

Multiple recipients is possibly less of an issue with wearables, because a wearable user can be expected to wear headphones (something a desktop user cannot be expected to do). However, as mentioned in the previous question regarding cell-phones, it might be important to convey enough social cues to indicate that a suggestion is being conveyed to the wearer. Whether this is actually necessary depends on the detail of the suggestion, and whether the wearer might need to act on the suggestion. For example, I don't need other people to know if I get zephyrs, since I don't act on them immediately anyway. However, if I am likely to need to answer a phone call or leave in a hurry, I want people to know that.

Finally, the Eavesdropper system is actually a subset of the wearable application described above. In this system, we add the constraints that the domain is already audio-based (conversation), and that the user is not ordinarily using the twiddler to get more information about an annotation. From personal experience, it is likely that not much information can be attended to and processed while in a conversation unless more than two people are involved. In one-on-one conversation, it is all too clear that the wearer is slightly distracted by other information when they process more than a few words' worth. For this reason, I would design the system to use passive audio input (speech recognition on the current conversation), but would not use audio output at all, to avoid crosstalk with the conversation at hand. I would also design it so that the information coming up on the display is likely to be enough to jog the user's memory, and would expect only rare twiddler interactions where the user wants more information. I expect that in those cases the user would excuse themselves and explain what they are looking up. The database would probably hold personal information rather than more general information, so that there are memories for the small amount of information given in the first stage of the ramping interface to jog.