Audio Interfaces for Online Environments

This is an excerpt from a longer paper I wrote.

Introduction

“It was shallow thinking to maintain that numbers and charts were the cold compression of unruly human energies, every sort of yearning and midnight sweat reduced to lucid units in the financial markets. In fact data itself was soulful and glowing, a dynamic aspect of the life process. This was the eloquence of alphabets and numeric systems, now fully realized in electronic form, in the zero-oneness of the world, the digital imperative that defined every breath of the planet’s living billions. Here was the heave of the biosphere. Our bodies and oceans were here, knowable and whole.” Don DeLillo, Cosmopolis: A Novel (Scribner, 2003).



The World Wide Web is unnaturally quiet ... and disquietly unnatural.

Hearing is one of our most sophisticated senses. From the time we are babies our entire world is filled with sounds designed to stimulate our behavior, and we grow to expect pleasure or annoyance as we are introduced to surprising new sounds as well as established ones.

Sound has a variety of forms - voice, music, effects, the sounds of nature, other forms of communication - and these can be incredibly rich, complex, and subtle.

It is the primary means by which most of us receive data, information, and knowledge.

Sound is very much a part of our lives, and yet the devices and “virtual” spaces in which we spend an ever-increasing amount of time are silent. Most of the sound that we experience on a day-to-day basis goes unnoticed.

Sound can be distinguished from noise by the simple fact that sound can provide information.

Sound answers questions and supports the activities of tasks, so sound is inherently useful. Consider the information provided by the click when the bolt on a door slides open, the sound of a zipper when you close a pair of pants, the whistle of a kettle when the water has finished boiling, the sound of a river moving in the distance, the sound of liquid boiling, of food frying, of people talking in the distance. In the workplace there are the sounds of keys being pressed on a computer keyboard.

Natural sound is as essential as visual information because sound tells us about things we can’t see, and it does so while our eyes are occupied elsewhere. Natural sounds reflect the complex interaction of natural objects: the way one part moves against another, the material of which the parts are made. Sounds are generated when materials interact, and the sound tells us whether they are hitting, sliding, breaking, or bouncing. Sounds differ according to the characteristics of the objects, how fast things are moving, and how far they are from us (Gaver 1997).

There has been a great deal of research and practitioner-level discussion about the need for, and the rewards of, human-centered design for web sites. User-centered design, and its overused code phrase "user friendliness", has been called an essential, sustainable advantage (Garrett 2002). Industry leaders like Jakob Nielsen have led the charge for greater standardization of the web interface. Successful online interfaces like Amazon.com's have been copied and reused across a multitude of sites. But with all the work being done to further the cause of user-centered design, a fundamental aspect of human interaction seems to be missing. Most interface design done for the World Wide Web today utilizes only one of our senses: vision. If we are to elevate the art of interface design, instead of simply refining its current state, we must take into account the other essential sense, our sense of hearing.

Background

“The conceptual failings of the mid 80’s resulted from an inability or unwillingness to see the power of the desktop metaphor. The failings of the present day come from taking that metaphor too literally.” (Johnson 1997)

“My main complaint (about modern interfaces) is that metaphor is a poor metaphor for what needs to be done. At PARC we coined the phrase ‘user illusion’ to describe what we were about when designing user interfaces. There are clear connotations to the stage, theatrics, and magic, all of which give much stronger hints as to the direction to be followed… should we transfer the paper metaphor so perfectly that the screen is as hard as paper to erase and change, clearly not.” - Alan Kay

“Web browsers, the primary tool for accessing the World Wide Web, use the page metaphor, which is appropriate for browsing static text with hyperlinks. This is the task that browsers were designed for.

As the Web expanded into transaction systems and applications, the page metaphor has been mixed with application metaphors. This has created confusing environments for users.” (Fellenz, Parkkinen, Shubin 1998)

How can sound improve online interfaces? Is mapping audio to interfaces useful? Even though we have two primary senses, hearing and vision, most interfaces today are almost exclusively visual. Because vision and hearing are fundamentally different, there are distinct advantages to incorporating audio into the user interfaces of online environments.

There is magic when you use computers; there is magic when you open a web browser and interact with its user interface. This magic derives from the fact that computers aren’t tied to the old analog world of objects. Computers can mimic that world, of course, but they are capable of adopting new identities and performing tasks that have no real-world equivalent whatsoever. People who love computers and love the web get hooked on this magic, on this range of possibilities. They don’t get hooked on computers because their machines remind them of real-world tools, or of a toaster. They get hooked because their machines do things they never thought possible. Interface design should reflect this newness, this range of possibility. Audio interfaces can help raise interfaces to these new levels.

Doug Engelbart is considered the father of the modern interface. Haunted by the image of Vannevar Bush's Memex machine, described in Bush's seminal essay 'As We May Think', Engelbart went on to create what we could call the first bitmapped visual interface, which contained a number of breakthrough innovations, including the principle of direct manipulation: one thing on the screen represents another, and users have some control over that represented thing. An important part of this for him was that you were given an illusion, an actual illusion, that things were happening. Creating illusion is a key part of interface design.

What exactly is an interface? In the simplest sense, an interface is the software that shapes the interaction between user and computer. The interface serves as a kind of translator mediating between the two parties, making one sensible to the other. In other words, the relationship governed by the interface is a semantic one, characterized by meaning and expression rather than physical force. So for computers to create this illusion, this aforementioned magic, they must represent themselves in a language that the user understands. Sound and music are something that people can interpret fairly well; even if people don’t have the vocabulary to express their understanding, they understand sound, and it is a fair assumption that people can differentiate between sounds if the sounds are different enough. Experiments by Brewster (1998) support this, though there does not appear to be a great deal of experimental work in the field of auditory interfaces. Still, some general traits characterize sounds that make useful auditory interface material, and there is research suggesting approaches for their use in software-based interfaces. Obviously, in order to construct a good interface, a great deal of specific research and testing is necessary, and considerable creativity is needed to design a useful and aesthetic auditory interface.

The interfaces of computers, mobile phones, and other machines present information primarily by visual means, and while displaying content visually is generally accepted as convenient, there are crucial limitations to using this approach alone. There are two central issues in the visual display of information: first, the need to display three or more dimensions of information on two-dimensional displays; and second, the available resolution, evident in the fact that display methods tend to be evaluated on the basis of resolution (Tufte 1983). The size of the display thus plays an important role in the amount of information that can be conveyed.

People want the smallest and lightest electronic devices, and one efficient way of decreasing the size of a portable device is to decrease the size of its display. Many laptops from Sony and other manufacturers gain their portability by dramatically reducing the size of their LCD screens. In some electronic devices, one way to shrink the device is to remove the screen or make it so small that it becomes an inessential part of information delivery. This is one instance where it is possible to deliver information aurally instead of visually. Auditory interaction is necessary, for example, when people want to communicate with computers through a telephone. According to Brewster (1998), telephone-based interfaces are becoming increasingly important for human-machine communication; they are commonly used for booking tickets or paying utility bills. And the mobile phone, an example that will be used more than once, has perhaps one of the smallest visual displays, yet a great deal of information is delivered to its user aurally.

Another issue with visual displays is that the user must focus on the display to obtain the information. This is of particular interest when designing task-based interfaces, like an e-commerce shopping cart, where the user might be distracted while trying to complete the task, depending on her environment at any given time. Auditory feedback enables the user to look away from the device, or from the website, while she is using it. In the case of a website, the user can conveniently do more than one thing at a time: while completing the shopping-cart task she can also talk on the telephone, look after her children, or grab a cup of coffee while the transaction completes. Auditory feedback can also be a useful alternative to visual feedback: in an online narrative, if you want people to focus on the text, you could convey a sense of calm or of nature through sound without resorting to an elaborate visual interface.

Obviously, you cannot design for every possible usage scenario. Mobile devices are not the focus of this paper, but they exemplify some of the resistance to adding sound to interfaces: many people consider sound from a computer to be equivalent to noise. There has been a great deal of public backlash against mobile devices that relay necessary auditory information but in doing so create noise. You must also account for the occasions when visual feedback is required because a noisy room does not allow for the use of sound. Still, it seems that auditory feedback, despite some negative reactions, has benefits and has been overlooked in many machine interfaces.
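
As a concrete illustration of this kind of eyes-free feedback, the following sketch plays a short confirmation tone when a long-running web request completes. It is a minimal sketch only, written in TypeScript against the browser's standard Web Audio API; the checkout endpoint, tone frequency, and duration are illustrative assumptions, not part of any cited work.

    // Minimal sketch: an audible confirmation for a long-running web task,
    // so the user need not watch the screen while it completes.
    const audioCtx = new AudioContext();

    function playConfirmationTone(frequencyHz = 660, durationSec = 0.25): void {
      const osc = audioCtx.createOscillator();
      const gain = audioCtx.createGain();
      osc.frequency.value = frequencyHz;            // a short, mid-range tone
      gain.gain.setValueAtTime(0.2, audioCtx.currentTime);
      gain.gain.exponentialRampToValueAtTime(       // fade out to avoid a click
        0.001, audioCtx.currentTime + durationSec);
      osc.connect(gain).connect(audioCtx.destination);
      osc.start();
      osc.stop(audioCtx.currentTime + durationSec);
    }

    // Hypothetical checkout call; the endpoint is illustrative only.
    async function submitOrder(cart: object): Promise<void> {
      await fetch("/api/checkout", { method: "POST", body: JSON.stringify(cart) });
      playConfirmationTone();   // success is audible even if the user looks away
    }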

Vision and hearing are our primary senses for obtaining information. Hearing has often been seen as the secondary sense, as in many situations hearing simply tells us where to turn our eyes (Gaver 1997). It is important to note, however, that sound is a unique medium which provides information that vision cannot.

Some of the ways that sound is unique are as follows:

A drawback is that one cannot turn away from sounds, nor can one close one's ears to unpleasant ones.

Hearing and vision complement one another quite naturally in our physical environment, as they do in multimedia, film, and other created environments. Sound design is, in fact, a well-developed practice in film: within thirty years of the introduction of the soundtrack to cinema, sound had become an essential element of our understanding of film. Compared to film, much of the use of sound in interface design is very elementary; multimedia titles and games are perhaps the notable exceptions. People may prefer to communicate face to face, but other forms of communication like instant messaging are proving quite popular in the workplace, and despite increases in the complexity of the conversation, people do not appear to switch communication modes (Isaacs 2003). Instant messaging mirrors much of what is practiced in computer interfaces today: little sound reinforcement, with sounds commonly limited to various types of signals - buzzes, beeps, and general alert sounds used to indicate alarms, message arrival, and extreme events. Other signals provide simple feedback when you have succeeded at certain tasks: buttons pressed to turn on the computer, a message successfully sent, or, in the case of the Mac OS, a window successfully hidden.

There are also events that are not normally associated with sound. Brewster (1998) has investigated the use of sound as a provider of navigational cues in menu hierarchies, structures common to computers, websites, mobile phones, and telephone-based interfaces. Of particular interest, Gaver (1997) claims that another little-explored area is the use of sound for communicating ongoing processes. Auditory interfaces would grow extensively if ongoing processes were to generate sound; such sounds would need to be discreet and naturalistic. The sounds employed by instant messaging applications are not naturalistic and as such reveal little hidden information; the sound is as annoying as it is informative (Norman 1990). Skillful design of these interfaces can avoid this.
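
To make the idea of sonifying an ongoing process concrete, here is a minimal sketch, again in TypeScript with the Web Audio API. A quiet, continuous tone tracks the progress of a hypothetical long-running task, rising gently in pitch and falling silent on completion; the volume, frequency range, and progress callback are all assumptions for illustration.

    // Sketch: a discreet, continuous sound for an ongoing process
    // (e.g. a long upload), rather than a one-off alert at the end.
    const processCtx = new AudioContext();
    const processOsc = processCtx.createOscillator();
    const processGain = processCtx.createGain();
    processGain.gain.value = 0.05;     // deliberately quiet and unobtrusive
    processOsc.connect(processGain).connect(processCtx.destination);
    processOsc.start();

    // Call as progress updates arrive, with fraction in the range 0.0 to 1.0.
    function onProgress(fraction: number): void {
      // Map progress onto a gentle pitch rise from 220 Hz to 440 Hz.
      processOsc.frequency.value = 220 + 220 * fraction;
      if (fraction >= 1) {
        processOsc.stop();             // silence itself signals completion
      }
    }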

Approaches to the design of sounds

Barrass (1997) describes seven approaches to the design of sounds that support information processing activities: syntactic, semantic, pragmatic, perceptual, task-oriented, connotative, and device-oriented.

"The syntactic approach focuses on the organization of auditory elements into more complex messages. The semantic approach focuses on the metaphorical meaning of the sound. The pragmatic method focuses on the psychoacoustic discrimination of the sounds. The perceptual method focuses on the significance of the relations between the sounds. The task-oriented method designs the sounds for a particular purpose. The connotative method is concerned with the cultural and aesthetic implications of the sound. The device-oriented method focuses on the transportability of the sounds between different devices, and the optimization of the sounds for a specific device" (Barrass 1997).

There are many ways to map sounds to information, and many factors to consider. Rather than detail Barrass's seven approaches, Gaver's (1997) examination of two different strategies for the use of sound, which covers much the same ground in simplified form, may be better suited to this paper. One possibility he discusses is to base the interface on sounds whose origins are analogous to what they represent. These are characterized as auditory icons and are based on everyday sounds. The other strategy, contrary to the first, is to use sounds that are arbitrarily linked to what they represent; "earcons" is the commonly held term for these symbolic sounds. An earcon is a musical sound that can be created from any sound source. A compromise between the iconic and symbolic strategies produces metaphoric, or "representational" (Brewster 1998), sounds, which share an abstract feature with what they represent. An example of a sound metaphor would be using a high-pitched sound to represent a high spatial position on the screen. Using low sound pressure to represent a far distance would be an iconic relationship, as this is what happens in the real world. The boundaries between these three types can be difficult to distinguish.
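
The pitch-for-height metaphor above is simple enough to sketch. The following TypeScript fragment plays a brief cue whose frequency is mapped from an element's vertical position on screen; the frequency range and cue length are illustrative assumptions.

    // Sketch of a metaphoric mapping: higher on screen = higher pitch.
    function playPositionCue(y: number, screenHeight: number): void {
      const ctx = new AudioContext();
      const osc = ctx.createOscillator();
      // y = 0 is the top of the screen, so invert it to make "up" high-pitched.
      const relativeHeight = 1 - y / screenHeight;
      osc.frequency.value = 200 + 600 * relativeHeight;   // 200-800 Hz range
      osc.connect(ctx.destination);
      osc.start();
      osc.stop(ctx.currentTime + 0.15);                   // a brief cue, not a drone
    }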

There are a number of advantages to using auditory icons in interfaces. Gaver (1997) states that iconic sounds are easily learned and remembered, though they may not be entirely intuitive. When combining auditory and visual interfaces, it is ideal to have both kinds of icons rely on the same analogy, thereby producing the most comprehensible hybrids. In his Sonic Finder for Macintosh computers, Gaver (1989) grouped auditory icons into "feedback families". This interface used "wooden" sounds to indicate interaction with text files: selecting a text file makes a tapping sound, erasing it creates a splintering sound, and so on. Another possibility introduced in the Sonic Finder was the parameterising of auditory icons, where one sound quality or parameter corresponds to a feature of the selected object. In the Sonic Finder, the pitch of the text-file sound is parameterized to indicate size, so that selecting a large text file results in a tapping sound with a lower pitch and selecting a small text file results in a higher pitch.
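
A parameterized auditory icon of this kind can be sketched in a few lines. The fragment below is in the same vein as the Sonic Finder but not taken from it: one recorded "tap" sample is pitched down for larger files by scaling its playback rate. The sample URL and the size range are hypothetical.

    // Sketch: one "tap" sample, pitched down as file size grows.
    const iconCtx = new AudioContext();

    async function playTapForFile(sizeBytes: number): Promise<void> {
      const response = await fetch("/sounds/tap.wav");   // hypothetical sample
      const buffer = await iconCtx.decodeAudioData(await response.arrayBuffer());
      const source = iconCtx.createBufferSource();
      source.buffer = buffer;
      // Larger file -> lower pitch: scale the playback rate from 1.5 down
      // to 0.5 across an assumed 0-10 MB range.
      const t = Math.min(sizeBytes / (10 * 1024 * 1024), 1);
      source.playbackRate.value = 1.5 - t;
      source.connect(iconCtx.destination);
      source.start();
    }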

A possible problem with auditory icons is that it may prove difficult to find suitable iconic sounds for all events in an interface, as some events may not correspond to any sound-producing event in the real world, and compromises would produce inconsistency. Norman (1990) stresses the importance of developing a conceptual model that is understandable to the user; a well-designed interface should require as few interpretation strategies as possible. Another problem is that the most realistic mappings might not be suitable: the user might confuse the interface's sounds with the everyday sounds around her. And in large interface sets it might prove challenging to find enough varied sounds that do not interfere with one another or with the environment in which they will be used.

Symbolic sounds are arbitrarily mapped to what they represent and do not have to be limited by any similarities. Every earcon can be designed with emphasis on its aesthetic qualities, which is difficult when developing auditory icons. Since symbolic sounds can be freely designed, they open up greater possibilities for musical interfaces that may be more pleasant and less tiring. As well, music provides a sophisticated system for manipulating groups of sounds (Gaver 1997). When designing a musical interface, complex information can be conveyed by sounds that are parameterized in many dimensions.

While the approaches to mapping information to sounds outlined above may be adequate, at least in the case of symbolic sounds, Barrass (1997) has proposed a more comprehensive method, which builds on Scaletti's definition of sonification.

One Method

Stephen Barrass’s method for designing audio interfaces builds upon Scaletti’s working definition of sonification, which reads as follows:

A mapping of numerically represented relations in some domain under study to relations in an acoustic domain for the purpose of interpreting, understanding, or communicating relations in the domain under study [Scaletti C. (1994)].

On this foundation, and on a definition of auditory information design as "the design of sounds to support an information processing activity" (Barrass 1997), Barrass builds his approach to auditory information design, an approach suitable for task-based online interfaces.

An approach to auditory information design

The approach to auditory information design proposed here builds on Scaletti’s definition of sonification. The approach focuses on the design of sounds to support an information processing activity. The approach has two parts that hinge on information:

    1. Requirements: analysis of the information requirements of an activity
    2. Representation: design of an auditory representation of the information requirements

Requirements

There are a variety of methods for analysing information. Task analysis is a method developed in Human-Computer Interaction (HCI) design to analyse the information required to manipulate events, modes, objects and other aspects of user interfaces [Kaplan B. and Goodsen J. (1995)]. This form of analysis is particularly concerned with actions that occur in sequence and in parallel, and with feedback of the current state of the interface. Data characterisation is a method developed in scientific visualisation to describe the relations contained in data sets [Robertson P.K. (1991)]. This analysis addresses concerns about the validity and faithfulness of a representation that is to be re-represented in some other form.

A combined task analysis and data characterisation can define a mapping from data relations to information that is useful in a task. This mapping may involve transforming or selecting parts of the data set, for example highlighting a region. The combination of task analysis and data characterisation has been demonstrated in a system for designing colour representations called PRAVDA [Bergman L.D., Rogowitz B.E. and Treinish L.A. (1995)], and in an automatic graphing tool called BOZ [Casner S. (1992)]. These tools operate on the task and data descriptors with a rule base that selects a representation scheme. However, there is a problem: the addition of a new task to the system requires new rules to be formulated for every type of data. As the number of tasks and data types increases there will be a combinatorial explosion of rules to cope with each special case.

This problem can be addressed by introducing an explicit description of information requirements that separates the task and data, but is a function of these influences. When a new task type is added to the system it is only necessary to add a new rule to map that task type to the information requirements. The information requirements are the pivotal point of contact between the analysis and the realisation of the design. Barrass calls this the TaDa approach because the design is focused on information that is useful in a Task and true to the Data.
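
The separation Barrass describes can be made concrete with a small sketch. In the TypeScript below, each task type carries one rule that yields its information requirements, and the data characterisation is supplied independently, so adding a task type adds one rule rather than one rule per data type. The type names and rules are invented for illustration and are not taken from the thesis.

    // Sketch: task rules and data characterisation kept separate, meeting
    // only in the information requirements, as in the TaDa approach.
    type TaskType = "monitor" | "compare" | "locate";
    type DataOrganisation = "ordered" | "categorical";

    interface InfoRequirement {
      continuous: boolean;                       // does the task need ongoing updates?
      relation: "order" | "identity" | "position";
    }

    // One rule per task type; no rule mentions a data type.
    const taskRules: Record<TaskType, InfoRequirement> = {
      monitor: { continuous: true,  relation: "order" },
      compare: { continuous: false, relation: "order" },
      locate:  { continuous: false, relation: "position" },
    };

    // The data characterisation is combined at the end, not in the rules.
    function informationRequirements(task: TaskType, data: DataOrganisation) {
      return { ...taskRules[task], data };
    }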

Representation

Once the information requirements of the activity have been analysed, we are in a position to design a representation of that information. This is the stage where the designer maps the required information into sounds in which a person can hear that information. Listening is a complex process, described in terms of “innate primitive” and “learnt schema” levels of perception in Bregman’s theory of auditory scene analysis. The TaDa approach addresses these two levels with different design methods: a case-based method for schema design, and a rule-based method for primitive design. The case-based method draws on familiar auditory scenes that have an information structure capable of representing the required information. For example, the sound of a kettle coming to the boil may be a familiar schema for an auditory display of temperature levels in a turbine boiler. The rule-based method aligns auditory structure with information structure in accordance with principles from graphic design and models from psychoacoustics. Examples of these rules: if the information is ordered then the sounds should be ordered; if the information is categorical then the sounds should be categorical; equal differences along a continuum should sound equal to the listener; and zero values should sound like they are zero.
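
These rules lend themselves to a small worked example. The sketch below maps an ordered data value to pitch in semitone steps: because perceived pitch is roughly logarithmic in frequency, equal data differences produce roughly equal perceived pitch differences, and a zero value is rendered as silence. The base frequency and two-octave range are assumptions for illustration.

    // Sketch of the rule-based mapping: ordered data -> ordered pitch,
    // equal steps sound equal, zero sounds like zero (silence).
    function valueToFrequency(value: number, maxValue: number): number | null {
      if (value === 0) return null;            // zero should sound like zero
      const semitoneRange = 24;                // two octaves above the base tone
      const steps = (value / maxValue) * semitoneRange;
      return 220 * Math.pow(2, steps / 12);    // logarithmic, so equal steps sound equal
    }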

The final stage in the design process is to produce the specified sounds on an auditory display device. It is called an auditory display device because the sound specifications are perceptual (auditory) rather than device-specific (acoustic), so the display can be transported to other devices. The display device may be a compound of hardware, software, synthesis algorithms, samples, and audio components such as amplifiers, speakers, or headphones. The reproduction of perceptual specifications on a device requires a measurement of the mapping from perceptual coordinates to control parameters, device capabilities, and audio ranges. There is no point specifying sounds that cannot be produced by the display, so knowledge of the device characteristics is vital in the design process.
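
A last small sketch shows what respecting device characteristics might look like in code: a perceptually specified frequency is clamped to the range a particular device can reproduce before it is played. The capability figures for the small speaker are illustrative assumptions.

    // Sketch: clamp a specified frequency to what the device can produce.
    interface DeviceCapabilities {
      minFrequencyHz: number;    // lowest frequency the device can reproduce
      maxFrequencyHz: number;    // highest frequency the device can reproduce
    }

    const smallPhoneSpeaker: DeviceCapabilities = {
      minFrequencyHz: 300,
      maxFrequencyHz: 8000,
    };

    function realisableFrequency(specHz: number, device: DeviceCapabilities): number {
      // There is no point specifying sounds the display cannot produce.
      return Math.min(Math.max(specHz, device.minFrequencyHz), device.maxFrequencyHz);
    }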

TaDa tiles

The TaDa approach is summarised by an arrangement of six tiles organised into two trapezoidal shapes, shown in Figure 3-1. The upper trapezoid is the requirements part of the approach; the lower trapezoid is the representation part. The component tiles are facets of the design process, and common edges are connections between these facets.


Figure 3-1

The requirements trapezoid consists of the Task analysis (Ta) tile, the Information requirements (Ireq) tile, and the Data characterisation (Da) tile. The Ta and Da tiles sit upon the Ireq tile to indicate that both of these facets are necessary for the analysis of information requirements. The representation trapezoid consists of the Person (P) tile, the Information representation (Irep) tile, and the Display device (D) tile. The Irep tile sits upon the P and D tiles to show that the design depends critically on the types of information relations that can be heard by a human listener, and the types of sounds that can be produced by a particular auditory display device. The core of the TaDa approach is the central diamond where the trapezoids connect; this is where the information requirements (Ireq) are met by the information representation (Irep). The TaDa approach moves through the phases of requirements analysis, design, and representation as shown in Figure 3-2. The phases generally move from left to right and top to bottom through the TaDa tile arrangement, though it is expected that certain facets and connections will be revisited and improved during the process, which is not intended to be strictly linear. (Barrass 1997)


Figure 3-2



References

  1. Barrass, Stephen. 1997. Auditory Information Design. Ph.D. thesis, Australian National University.
  2. Brewster, Stephen A. 1998. “Using Non-Speech Sounds to Provide Navigation Cues.” ACM Transactions on Computer-Human Interaction 5:3: 224–259.
  3. Fellenz, Parkkinen, Shubin. 1998. Web Navigation: Resolving Conflicts between the Desktop and the Web.
  4. Garrett, Jesse James. 2002. The Elements of User Experience. Indianapolis: New Riders.
  5. Gaver, William W. 1989. “The Sonic Finder.” Human-Computer Interaction 4:1. Elsevier Science.
  6. Gaver, William W. 1997. “Auditory Interfaces.” In Handbook of Human-Computer Interaction, 2nd ed., 1003–1041.
  7. Isaacs, Ellen. 2003. “A Closer Look at Our Common Wisdom.” ACM Queue 1:8.
  8. Johnson, Steven. 1997. Interface Culture. San Francisco: Harper.
  9. Kivy, Peter. 1984. “Representation and Expression in Music.” In Sound and Semblance. Princeton University Press.
  10. Nielsen, Jakob. Alertbox (http://www.useit.com/alertbox/).
  11. Norman, Donald A. 1990. The Design of Everyday Things. New York: Doubleday.
  12. Tufte, E.R. 1983. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
