ContextSonification

An article, posted more than 19 years ago, filed in usability, sound, emma, graduation, sonification, effects, interaction, audio, design & invisible.

ContextSonification is another way of looking at adding sound to your computer environment. This was my graduation thesis for the European Media Master of Arts (EMMA) programme.

Preface

The first version of ContextSonifier was created in 2004 by Maarten Brouwers. It was the result of his Master's thesis, 'Context Sonification'. You can download his thesis here or just continue reading it in HTML format.

The whole process was supervised by: Martin Lacet (thesis), Janine (project), Tom (project) and David Garcia (project), and the thesis was first published in Hilversum, August 2004.

When humans communicate with one another, visual, auditory and/or haptic cues are used. A person who communicates with someone else directly conveys more than a plain message: they also convey a state of being, a soul. Somehow mechanical devices and environments are also able to communicate this 'state of being', communicating in more than just true/false messages. To give some examples: when a car runs out of gasoline it starts sputtering, at first a little, then more, until you really know that its fuel has almost run out. And while never designed to match a human-like metaphor, it is recognizable. Or an office that tends to get dirtier towards the end of the week: it is something that gives an environment atmosphere, and may tell you that the weekend is near.

Digital systems, however, lack this soul or atmosphere. Sound might be able to resolve some of this lack. Sound can communicate very directly and intensely without human receivers actually being aware of it. Adding an auditory atmosphere to a computer may make working more pleasurable and, foremost, more natural. That is what I would like to accomplish: taking away at least that threshold of working with digital black boxes. Whether one wants to accept this artificial soul or atmosphere is difficult to predict. But that is not the primary question for now. It could, however, show in the enthusiasm when the product is ready, or at least explain some of it.

Software

This thesis was part of the exegesis programme of the HKU Master of Arts programme. Together with this thesis I developed a piece of software, in an attempt to realise my ideas. I have, however, ceased developing ContextSonifier due to time constraints and a lack of knowledge about implementing it in a resource-friendly way. One of the disadvantages was my choice to rely heavily on 'fake' 3D sound, which was a bad design decision anyway. 3D sound does not really add much when relying on just two PC speakers (headphones would make the situation a little better) and I should have acknowledged that from the start...

Introduction

Auditory Interfaces

This thesis is about auditory interfaces. An auditory interface is an interface which you hear, as opposed to a graphical interface which you see. Current operating systems already give some auditory feedback, yet to most people these sounds are rather disturbing, probably because they convey little or no information and are too obtrusive. On the other hand, we are not bothered by 'interface' sounds when we are crossing a crowded street; auditory information then becomes very valuable to us and might even save us. Sounds may also make us feel secure: hearing our colleagues work comforts us more than a completely silent office. And finally, music lovers may already know that sound can lead our emotions. This introduction will first discuss some basic considerations when it comes to auditory interfaces, followed by a general outline for this thesis.

Types of communication

Auditory interfaces are still a broad area of research. Therefore I will narrow my thesis down. I would like to divide computer-to-user communication, or feedback, into four categories: events, activities, ongoing processes, and environmental communication.

Events are forms of feedback from the computer to the user that are not directly initiated by the user but occur suddenly, at a single point in space and time. Examples are error messages, notifications and warnings.

Activities are, in contrast to events, initiated by users. Examples are dragging, clicking and scrolling. The duration (in time) of this sort of feedback equals that of the actual activity.

Ongoing processes are processes that are running in the background most of the time. Ongoing processes may also be a result of activities, like processor load (internal). Other examples are stock information (external), network traffic and time itself.

The last type of communication to be introduced is environmental communication. These may, just as ongoing processes, be initiated by users, but are not processes or activities. They may behave as landmarks and can give an indication of location and/or context. Examples are overall design (the look of websites) and reflections (hall reverberation, wood reverberation, ...).

This thesis will concentrate on the last two types of communication: 'ongoing processes' and 'environmental communication'.

Sonifying or visualising?

Visual communication is very different from aural communication. Both have their advantages and disadvantages. Gaver (Gaver, 1989) summarised the properties of sound compared to visuals by stating that while sound exists in time and over space, vision exists in space and over time. This rule can also be recognised in the guideline by Deatherage (1972)1, cited by Buxton (Buxton, 1995). This seems a useful guideline to start with when it comes to the choice between non-speech audio and displaying information visually.

Use auditory presentation if:

  1. The message is simple.
  2. The message is short.
  3. The message will not be referred to later.
  4. The message deals with events in time.
  5. The message calls for immediate action.
  6. The visual system of the person is overburdened.
  7. The receiving location is too bright or dark adaptation integrity is necessary.
  8. The person's job requires him to move about continually.

Use visual presentation if:

  1. The message is complex.
  2. The message is long.
  3. The message will be referred to later.
  4. The message deals with location in space.
  5. The message does not call for immediate action.
  6. The auditory system of the person is overburdened.
  7. The receiving location is too noisy.
  8. The person's job allows him to remain in one position.

Table 1: Auditory compared to visual presentation, Deatherage (1972)

How useful is this table for this research? Deatherage states that the message should be simple and short. This does not go without explanation. Explaining complex concepts aurally is admittedly difficult. Complex multi-dimensional information, however, has been presented both aurally and visually in tests, and the outcomes do not dismiss aural messages in favour of visual ones.

Buxton, Gaver and Bly (Buxton et al., 1994) mention work by Bly on the sonification of six-dimensional data generated by battlefield simulations. This may be considered complex data, as analysts had difficulty recognizing differences between the simulations when they were presented visually. She experimented with 'battlefield songs': test subjects to whom the data was presented aurally scored 64.5% correct, against 62% for those to whom it was presented visually. The combination of both channels resulted in an even larger percentage (69%) of correct identifications.

Bly's research is not the only research in this field, and other researchers found similar results, showing that auralization is at least 'an interesting twist on visualization'. The point made here is that the 'simple' mentioned by Deatherage should not refer to the difficulty of presenting something on screen, but to the difficulty of explaining a concept.

Points 3 and 4 of Deatherage's table can be considered a consequence of the fact that sound exists in time and moves on. Visually presented material can be referred to back and forth without any difficulty; compare this to seeking a phrase in a long audio interview. Yet sound exists over space and is able to notify you even when you are not looking at your speakers (or monitor).

The fifth point refers to the fact that sound can be very obtrusive. Yet others argue that listeners are well capable of pushing slightly changing or unchanging sounds into the background (see the sub-chapter Streaming in 'Design Approaches' for more information). Both are true, and it is important to know the strengths of both channels. Messages presented on screen can also be very obtrusive; they may block access to other interface elements entirely. Sound can be ignored: one can, for example, simply turn down the volume.

The sixth point supports my argument, as I believe our visual system is overburdened. Some may counter that in office situations, for example, there is already more noise than that produced by the computer, such as environmental noise. In the chapter 'Human Factors' I will therefore discuss stress caused by noise and other factors to be dealt with when designing an auditory interface.

Point seven (use sound when 'The receiving location is too bright or dark adaptation integrity is necessary.') will not be discussed in this thesis. This relates to being less capable of receiving via one channel or the other, whether because of visual or auditory impairment, or because of environmental factors. In the section 'Visually impaired' of this chapter I will explain why I will not address the use of sound for cases where a visual display does not suit the perceiver.

The last point, eight, reveals the most obvious use of sound in everyday life: look here! Yet I believe that less moment-driven events can also be monitored using sound for presentation. Take, for example, an imaginary oven. It takes a while before an oven has warmed up. What if you could hear how hot it was? Then you would be able to pace your work and not be surprised by a sudden alarm notifying you that the oven is at temperature.

Visually impaired

A lot of literature on sonified interaction is about making the visual interface more accessible to the visually impaired. Sound, however, is not always the ideal channel for all types of messages, as explained when discussing the table by Deatherage. Every channel, whether it be sound, vision or taste, has its own strengths and weaknesses. And therein lies the problem of simply adding techniques developed for the visually impaired to the interface used by the visually non-impaired. Techniques developed for the visually impaired tend to focus on sonifying what is important (and moreover what is already on screen, e.g. (Edwards et al., 1993)), while this research will focus on sonifying background information. The goal of this thesis is to explore how to make the best use of the capabilities of sound.

Summary and outline

This research is directed towards information types that are ongoing and time-based. I compared the use of sound in our daily life to how sound could be used in an interface to similar effect.

I have discussed a guideline by Deatherage, which was our starting point for discussing the basic capabilities of sound. Information that is simple in concept and time-based (in contrast to location-based) works better on the auditory channel than on the visual one. Tasks where the person cannot (keep) focus on (some part of) the screen can be made easier by adding sound.

Though visually impaired people may not be able to focus on a screen (at all), the research aimed at creating a better interface for the visually impaired does not coincide with my primary research goal. This type of research is often aimed at translating complex visual messages into, often even more complex, auditory messages.

What simple messages are currently visualized that would better be sonified? Or does such data not yet exist, although it could increase usability? And of what use will it be for the average user? My premise is that sonification of this information could help users stick to their workflow without being bothered too much by context. I believe that working with a computer can become more task-oriented and less like an opened 'Swiss Army knife'2 where, while using the screwdriver, your hand may be cut open by an unintentionally opened blade. Context may be processed in a more unconscious manner. In the next chapters the search for a suitable communication language will take place. The question I want to answer with this thesis is: how should an auditory interface be designed, exploiting the strengths of the auditory channel, to create a better awareness of overall processes?

In the next chapter I will start by introducing basic acoustic terminology and streaming theory. In chapter three I will look at previous work. This work may inspire us and teach us different approaches from which the auralization problem can be tackled. It may also take away some of the skepticism that seems to exist when talking about 'adding sound to your computer'. In chapter four I will discuss the human factors involved and undertake a search for constraints. Chapter five will discuss approaches for design and give valuable input for the development of a communication language through sound. In the final chapter before the conclusion I will attempt to construct such a communication language.

All chapters will share the common lay-out of several sub-chapters of theory followed by a summary and conclusions.

Understanding sound

This chapter is split into two parts. The first part discusses acoustics. Those who are familiar with acoustic terminology may skip this part, though it contains some remarks on the use of particular properties of sound for carrying data. The second part of this chapter discusses streaming. In this part I will discuss how different sound streams can still be distinguished, even when played simultaneously.

Acoustics

Acoustics is the science of sound, dealing with the creation and physics of sound. I believe that having basic knowledge about the properties of sound is important for the development of auditory interfaces. This sub-chapter is a summary of different sources. The basis for this summary comes from Buxton et al. (Buxton et al., 1994)3, and the information has been compared with and complemented by Hartmann (Hartmann, 1998) and Bregman (Bregman, 1990).

Properties of sound

Pitch

Pitch is the sensation most directly associated with the physical property of frequency, yet pitch is not a single perceptual dimension. Roughly speaking, the pitch of a sine tone is a function of its frequency, the pitch of a harmonic4 tone is a function of the (estimated) fundamental frequency, and the pitch of an inharmonic tone, or noise, is a function of its amplitude-weighted average frequency (brightness).

There is a disadvantage to using pitch as a data carrier: people are bad at making absolute judgements about pitch. To overcome this, when implementing pitch variations as an absolute data carrier, designers could add a reference tone for orientation. Relative mapping, however, can work, as shown by Walker et al. (Walker et al., 2000). They describe a study in which temperature, pressure, velocity and size estimations were made for sounds of different frequencies.
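To make this concrete, here is a minimal Python sketch of such a relative pitch mapping (not taken from ContextSonifier; the 200-800 Hz range, the logarithmic interpolation and the reference value are assumptions for illustration):

import numpy as np

SAMPLE_RATE = 44100  # Hz

def value_to_frequency(value, low=200.0, high=800.0):
    # Map a normalized data value (0..1) onto an assumed 200-800 Hz range.
    # Interpolating on a logarithmic scale makes equal data steps correspond
    # to roughly equal perceived pitch steps.
    value = min(max(value, 0.0), 1.0)
    return low * (high / low) ** value

def sine_tone(frequency, duration=0.5, amplitude=0.3):
    # Render a plain sine tone as a float array.
    t = np.arange(int(SAMPLE_RATE * duration)) / SAMPLE_RATE
    return amplitude * np.sin(2 * np.pi * frequency * t)

# Play a fixed reference tone first, then the data tone, so the listener
# can judge the data value relative to the reference.
reference = sine_tone(value_to_frequency(0.5))
data_tone = sine_tone(value_to_frequency(0.8))
signal = np.concatenate([reference, np.zeros(2000), data_tone])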

Loudness

Loudness is often confused with intensity or with amplitude. Loudness, however, refers to how the sound is perceived by the human ear. The loudness level of a sound is determined by comparing it for equal loudness with a 1000 Hz tone at a reference pressure of 0.0002 dyne/cm2, heard binaurally in a sound field. The unit of loudness level is known as the 'phon'. It is numerically equal to the SPL of the equally loud 1000 Hz tone, but varies with frequency.

The formula for loudness is: L = k · I^{0.3}, where L is the loudness and I the intensity.

Fletcher-Munson (equal-loudness) curves can be used to translate loudness to intensity at different frequencies.

Using loudness as a data carrier is difficult. To start with, the same intensity yields different loudness values at different frequencies. Next, loudness is affected by bandwidth: within a critical band, energy is summed; outside it, loudness is summed. Research has shown that critical bands play an important role when it comes to loudness. When the energy of a single tone is distributed over several tones differing no more than 20% in frequency, this turns out to have no effect on the loudness (Plomp, 1998).

Another problem with loudness is similar to that noted with regard to pitch: people are very bad at making absolute judgements about it. The facts that loudness depends on duration (short sounds seem louder) and that perceived loudness is affected by position only make it more difficult to map data to loudness.

The above only underlines the need for proper amplitude-to-loudness conversion functions. Only when proper amplitude-to-loudness conversion is possible may the use of loudness as a way of conveying information succeed.
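As a rough illustration of such a conversion, the sketch below applies the power law quoted above; the constant k, and the neglect of frequency dependence (Fletcher-Munson) and duration effects, are simplifying assumptions:

def intensity_to_loudness(intensity, k=1.0):
    # Power law from the text: L = k * I**0.3.
    return k * intensity ** 0.3

def amplitude_to_loudness(amplitude, k=1.0):
    # Intensity is proportional to amplitude squared,
    # so loudness grows roughly with amplitude**0.6.
    return intensity_to_loudness(amplitude ** 2, k)

# Doubling the amplitude raises loudness by a factor of ~1.5, not 2:
print(amplitude_to_loudness(2.0) / amplitude_to_loudness(1.0))  # ~1.52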

Time

Another parameter is time. Actually it consists of several parameters: that of duration, attack time and rhythm. Sounds may vary internally or may be part of a melody.

For duration: the just perceivable difference in duration between two sounds is a change of about 10% for sounds lasting longer than 50 ms, and proportional to the square root of the duration for sounds shorter than 50 ms. These numbers are independent of loudness and bandwidth.

Attack is the way a sound starts: how long it takes for its total energy to build up. For attack it is important to know that the perceived onset of a sound depends on the slope of its attack.

An interesting aspect of rhythm is that the parameter duration affects it. With sounds of the same length, onsets every 4 time units sound like a regular rhythm. When longer sounds are alternated with short sounds, the longer sounds are perceived as a little off-beat (after the beat). This was discovered by Terhardt (1978)5.

In the same study in which Walker et al. (Walker et al., 2000) investigated pitch as a data carrier, tempo was also tested as a data carrier, with similar success. Relative mapping is possible.

Micro variations & Spectral fusion

Micro variations make sounds stand out. Examples of micro-variations are tremolo (variation in amplitude) and vibrato (or micro-modulation, variation in pitch). When attempting to create different streaming sounds, one is advised to make use of such variations: sounds with the same micro variations tend to group together. Grouping of spectral components is also called spectral fusion.

Some count micro variations and spectral fusion as part of timbre.

Timbre

Timbre is still a vague term, as it is a multidimensional parameter. Various researchers have studied what timbre actually is6. Mainly, there are three different parameters of timbre to distinguish: spectral properties, change in time, and position on the noise-tone continuum.

Spectral properties

The spectrum of a sound (spectral energy distribution, or brightness). Grouping might be affected by the pattern of the intensities of the various harmonics (Bregman, 1990), yet there are two types to distinguish:

  • peaks at exactly the same frequencies (in nature: the vocal tract of the same talker)

  • proportional intensities: when doubling the fundamental, all peaks in the spectrum would also be at double the frequency (in nature: the properties of the two vibrating bodies are similar, rather than the properties of the resonators that the sounds have passed through)

Changing tones

With 'changing tones' I refer, for example, to the relative amount of high-frequency energy in the attack. Yet very little is known about the importance of these changes within the tone when it comes to streaming. First, in nature there is no single tone which is static. Second, the dynamic behaviour of the separate harmonics can be far more important to timbre than the actual harmonics themselves.

Noise-tone continuum

Sounds can be more or less noisy, or actually be noise, or noise-less. The frequency components of noise constantly change in amplitude and phase, whereas those of tones are fairly constant (theoretical pure tones are constant).

Critical bands

Critical bands are frequency regions within which sound energies interact, which can affect their loudness. This means that auditory interface designers should consider the effect of critical bands when designing an audible ambience with well-separated streams. Sounds in the same critical band are mainly responsible for masking.

CB = 25 + 75 · (1 + 1.4 f^{2})^{0.69}, with f the centre frequency in kHz and CB the critical bandwidth in Hz.

Critical bands may also help explain consonance and dissonance (our judgement of intervals). Sounds within a critical bandwidth are often considered sounding dissonant.
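As an illustration, the approximation above can be turned into a quick check of whether two tones are likely to interact; this sketch is mine, not part of the thesis software:

def critical_bandwidth(freq_hz):
    # Critical bandwidth (Hz) around a centre frequency, using the
    # approximation quoted in the text: CB = 25 + 75 * (1 + 1.4 * f_kHz**2) ** 0.69
    f_khz = freq_hz / 1000.0
    return 25 + 75 * (1 + 1.4 * f_khz ** 2) ** 0.69

def share_critical_band(freq_a, freq_b):
    # Rough check whether two tones fall inside one critical band
    # (and are therefore likely to mask each other or sound dissonant).
    centre = (freq_a + freq_b) / 2.0
    return abs(freq_a - freq_b) < critical_bandwidth(centre)

print(critical_bandwidth(1000))         # roughly 160 Hz around 1 kHz
print(share_critical_band(1000, 1100))  # True: likely to interact
print(share_critical_band(1000, 1500))  # False: well separated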

Just noticeable difference (JND)

A just-noticeable difference (JND) is the smallest detectable change in a quantity. It is important to be aware of this when it comes to data auralization. Actual values of the JND depend on the experimental techniques used and are not guaranteed to hold in non-laboratory environments. Testing is required when relying heavily on small changes in sound.

For expressing the JND there is Weber's law7, though this is an idealised law which holds for broad-band audible stimuli but does not work for narrow-band ones. The law says that ΔI / I is a constant, or that the just noticeable increment in intensity is a fixed percentage of the intensity.

Yet, to have some rough estimations: The Just Noticeable Difference (JND) for pitch discrimination is about 1/30 of the pitch and the JND for intensity is about 1 dB. The Just Noticeable Difference for perceiving changes in duration between 2 sounds is a change of about 10% for sounds longer than 50msec; and proportional to the square root of the duration for sounds shorter than 50ms.
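A designer could turn these rough estimates into a quick check, as in the sketch below (the thresholds come from the figures above; the example values are mine):

def pitch_change_noticeable(f_old_hz, f_new_hz):
    # JND for pitch is roughly 1/30 of the pitch (from the text).
    return abs(f_new_hz - f_old_hz) > f_old_hz / 30.0

def level_change_noticeable(db_old, db_new):
    # JND for intensity is roughly 1 dB (from the text).
    return abs(db_new - db_old) > 1.0

# A step from 440 Hz to 450 Hz (~2.3%) stays below the ~3.3% pitch JND,
# so a sonification relying on it would probably go unnoticed.
print(pitch_change_noticeable(440, 450))    # False
print(pitch_change_noticeable(440, 460))    # True (~4.5%)
print(level_change_noticeable(60.0, 60.5))  # False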

Psychoacoustic illusions

Infinite Glissando

The infinite glissando is a psychoacoustic effect. Buxton (Buxton et al., 1994) describes it as follows:

The effect consists of a set of coherent sine waves, each separated by an octave. The entire set of waves is raised or lowered in pitch over time. The loudness of the sine waves varies, depending on their pitch. The amplitude/frequency function is a bell curve. Tones are loudest in the middle of the audible range, and softest at the extremes. Consequently, tones fade in and out gradually at the extremes. Since they are octaves (therefore harmonic) and coherent (therefore fused), the overall effect is of a single rising complex tone.

A classic example of this effect is the Shepard tone (Shepard, 1964)8.
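For illustration, the sketch below renders such an endlessly rising glissando along the lines of Buxton's description; the duration, the number of octave components and the base frequency are assumed values, not taken from the thesis:

import numpy as np

SAMPLE_RATE = 44100
DURATION = 10.0   # seconds
N_OCTAVES = 8     # number of octave-spaced components
BASE_FREQ = 27.5  # lowest component at t = 0 (assumed)

def shepard_glissando():
    # Octave-spaced sine components rise together while a bell-shaped
    # amplitude curve over (log-)frequency fades components in at the
    # bottom and out at the top, as described in the quoted passage.
    t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE
    signal = np.zeros_like(t)
    for k in range(N_OCTAVES):
        octave_pos = (k + t / DURATION) % N_OCTAVES   # position in octaves
        freq = BASE_FREQ * 2.0 ** octave_pos
        amp = np.exp(-0.5 * ((octave_pos - N_OCTAVES / 2) / (N_OCTAVES / 5)) ** 2)
        phase = 2 * np.pi * np.cumsum(freq) / SAMPLE_RATE  # integrate frequency
        signal += amp * np.sin(phase)
    return signal / N_OCTAVES

tone = shepard_glissando()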

Spatial hearing

Much could be written about spatial hearing and how it works. I will not discuss all techniques in great detail; currently there are (hardware and software) modules that will position any sound in space, taking care of the related Head-Related Transfer Functions (HRTFs) and so on. What has to be noted is that true spatial hearing on stereo computer speakers is difficult to achieve. HRTFs are psychoacoustic 'tricks' that in ideal conditions can create the effect of spatial sound. The 'head-related' part, however, already indicates that the effect even depends on the shape of a listener's head.

In addition, Plomp (Plomp, 1998) remarks that spatial hearing is part of a bigger whole of sensory orientation, to which our eyes, our balance system, our muscles and the tactile sensitivity of our skin also contribute. The fact that movements of our head do not disrupt this orientation, but rather support it, is the result of a very complex process to which all the named sensory organs contribute.

Streaming

Auditory streaming is the ability to separate different audio sources. Knowledge about streaming is important for the design of an auditory interface: if sequences of sound break into undesirable streams, or fail to be heard as separate streams, the intended message may be lost. Streaming is determined by frequency separation, tempo (the faster, the stronger), timbre and common fate. But first, a few words on masking, the phenomenon of one sound rendering another inaudible.

Masking

Masking occurs when one sound prevents another sound from being heard. Masking can take place when two tones are played simultaneously, but it may also happen to tones shortly heard after each other, without a pause.

The first parameter which affects simultaneous masking is loudness. Loudness is context dependent, and sounds may mask each other. Secondly, masking also depends on frequency: higher-frequency sounds are masked more than lower ones. Also, sounds within the same critical band mask each other sooner. To prevent masking of a particular sound, that sound should be complex (of high bandwidth). Even a 12 dB(A) louder, yet narrow-band, sound cannot prevent a high-bandwidth sound from being perceived as loud as it would be without any simultaneously played sounds (Plomp, 1998). Heavy usage of complex sounds should be avoided, however, as complex sounds mask other sounds very well (maybe too well). Good sound design takes masking into account, and it is possible to predict masking with great accuracy.

Furthermore, there is also masking over (short periods of) time: sounds can be masked by other, louder sounds that come before (forward masking) or after (backward masking) them. The effective duration is usually less than 100 msec (Warren, 1999). To prevent masking, a sound should have a high bandwidth.

Unmasking

Unmasking is the term related to recreating signals which have been masked by (a) louder signal(s).

Our everyday world is a noisy place. Important signals are often accompanied by noise of greater intensity (Warren, 1999). If humans could only hear the loudest sounds played, hearing would not be of much value to us. Yet our hearing system is capable of attending to even fainter sounds, through different mechanisms.

The term 'masking' seems to insinuate that one signal is perceptually obliterated by another. Unmasking suggests that it is possible to 'draw aside a curtain of noise' and hear the masked signal. Yet the masked signal is no longer present. Through perceptual synthesis of the contextually appropriate sound, however, humans are able to perceptually restore some of the masked signal.

Perceptual restoration can take place when portions of the signal are present at all times. It can also take place when the masking sound is capable of masking all spectral regions but is of rather short duration. The latter process is called temporal induction. An interesting study related to this effect is that of Van Noorden9, described by Warren. Van Noorden discovered that when faint 40-millisecond tone bursts were alternated with louder 40 ms tone bursts, with short silent gaps separating the successive bursts, listeners could hear the fainter tone burst not only when it actually was present, but also whenever the louder burst occurred. The result was that listeners heard a doubling of the actual rate of the fainter bursts.

Warren concludes that the existence of this infra-pitch demonstrates that temporal mechanisms are capable of sustaining an illusory persistence of repetition, which may suggest that tonal continuity is not solely explained in terms of spectral mechanisms. Warren was able to confirm this in later research.

Frequency separation

Frequency separation is part of the primitive process of sound analysis (Bregman, 1990). When listeners are focussing, the frequency separation of the high and low tones only needs to exceed some small amount (a few semitones in the case of two alternating tones) before the target sequence can be followed by attention.

Illustration 2: Crossing of tone sequences (Van Noorden, 1975)

There is, however, an interesting effect when it comes to frequency separation (Plomp, 1998). Van Noorden (1975)9 crossed a sequence of single tones rising in frequency with a sequence of single tones decreasing in frequency. See Illustration 2 above.

When listening to this, the two sequences are not heard as crossing, which one would visually expect. Rather, they seem to bounce off each other. Similar effects occur when continuous tones are used (Bregman, 1990). The effect does not appear when the two sequences have clearly distinct timbres.

Temporal separation

Temporal separation occurs when a sequence of tones jumps rapidly up and down between different frequency regions (Bregman, 1990). If the alternation is fast enough and the frequency separation great enough, listeners will not experience a single stream of tones alternating in pitch, but will perceive two streams of tones. The degree to which the effect occurs depends on the focus of the listener: when listeners are focussing, the frequency separation of the high from the low tones needs only exceed some small amount (a few semitones in the case of two alternating tones) before the target sequence can be followed by attention.

It should furthermore be remarked that comparison of timing is more difficult between streams than within a stream.
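To hear the effect described above, one could generate an alternating two-tone sequence and vary the frequency separation and tempo, as in this illustrative sketch (the frequencies and durations are arbitrary choices, not values from the literature):

import numpy as np

SAMPLE_RATE = 44100

def tone(freq, dur):
    t = np.arange(int(SAMPLE_RATE * dur)) / SAMPLE_RATE
    # Short ramps at both ends avoid clicks at the tone boundaries.
    env = np.minimum(1.0, np.minimum(t, dur - t) / 0.005)
    return 0.3 * env * np.sin(2 * np.pi * freq * t)

def alternating_sequence(low_hz, high_hz, tone_dur, repeats=20):
    # Alternate a low and a high tone. With a small frequency separation and
    # a slow tempo this tends to be heard as one stream jumping up and down;
    # with a large separation and a fast tempo it tends to split into two
    # parallel streams (temporal separation).
    pair = np.concatenate([tone(low_hz, tone_dur), tone(high_hz, tone_dur)])
    return np.tile(pair, repeats)

one_stream = alternating_sequence(400, 420, tone_dur=0.25)    # likely fused
two_streams = alternating_sequence(400, 1000, tone_dur=0.08)  # likely split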

Changing sounds

Another interesting aspect of sounds is that changing sounds are more noticeable than static sounds. A good example is the air-conditioner: when you enter a room where it is turned on, you probably won't notice the noise it produces until it is suddenly switched off.

Though static sounds move to the background, that does not prevent them from being informative. One could still use them.

Separation on timbre

If tones are more complex, timbre plays a part in the separation process (Bregman, 1990). It appears as though there is a unit-forming process that is sensitive to discontinuities in the sounds. Unit boundaries are created when discontinuities occur. Sounds that fit within these boundaries are grouped, sounds that do not are separated.

Units can occur at different time scales and smaller units can be embedded in larger ones. When the sequence is accelerated, the chance that smaller units are embedded in larger ones increases. Units can form groups with other similar ones. Similarity is determined by analyses that are applied to the units once they are formed.

Does this sound abstract? Here is an example: suppose there is a glide in frequency, bounded by a rise and a fall in intensity. Between these boundaries, the change in frequency may be measured by the auditory system and assigned to the unit as one of its properties. This frequency gliding unit will prefer to group with other ones whose frequency change has the same slope and are in the same frequency region.

Separation on Intensity

Loud sounds will tend to group with other loud ones, as softer sounds will with other soft sounds. However, Bregman (Bregman, 1990) doubts whether loudness makes primitive separation possible. The results obtained with tones differing in loudness raise the question of whether grouping based on loudness is the result of primitive scene analysis or of some schema-governed process of selection. The resolution of this issue may be as follows: tones that differ only in loudness may not have a tendency to segregate from one another, but when there are also other differences between the sounds, the loudness differences may strengthen the segregation. Again, this is still not completely understood, and research is required to resolve the question.

Direction based separation

Location in space is also a similarity that affects the grouping of tones, and it has turned out to be a uniquely powerful way of separating units. It should be remarked, however, that humans, though we may use direction, are quite capable of segregating more than one stream of sound coming from a single point in space (Bregman, 1990). When sounds come from different directions, however, primitive scene analysis tends to segregate the different streams.

Summary and conclusions

I started by looking at the acoustics of sound and analyzing its properties, followed by a discussion of what critical bands and just noticeable differences are, and ended this sub-chapter with a discussion of psychoacoustic illusions. I showed parameters which could be used to carry data, yet also noted difficulties.

Carrying data on single properties of sound is possible, yet I believe it is difficult to retrieve real meaning when single properties are parameterized. Even when such a relationship may be analogical, I believe such a mapping would be too abstract and illogical.

The psychoacoustic illusions noted, however, offer more perspective. The technique of the infinite glissando can be applied to processes with no real end and/or maximum value. Second, spatial hearing seems to me an interesting aspect of sound to work with. I noted the remark by Plomp that spatial hearing is really spatial awareness, to which not only hearing but the whole of the sensory organs contribute. When visuals are used to complement the mentioned HRTF functions, the effect of spatial hearing could possibly be established more easily.

Streaming is important for multichannel sound. A good knowledge of streaming can prevent masking and keep information from getting lost. In interactive situations it will therefore probably be no luxury to plan frequency regions, timbre properties, etc. If not, chances increase that, for example, sound streams with similar timbre cross in frequency, resulting in the anomalies noted by Plomp.

Case studies

Though sonification is not commonplace in commercially available interfaces, there are some applications out there that are interesting for this research. In this chapter I will discuss in chronological order some related cases: Sonic Finder, Audio Windows, Environmental Audio Reminders, ShareMon, OutToLunch, SoundShark and Dolphin. These cases will be summarized and so will their outcomes.

In the second part of this chapter the relevance for this study will be discussed. These products never made it to the public, so I have not been able to test any of these projects, though parts of SonicFinder in particular have made it into the mainstream. I will nevertheless attempt to review the cases on relevance, with the help of primary and secondary literature.

Cases

Sonic Finder

Related researchers: Gaver, W., 1989 (Gaver, 1989)

Description: SonicFinder inspired the project part of this thesis: integration into a popular OS. It is also a popular example of the possibilities of adding sound to an interface. Gaver used a strategy called 'everyday listening', in which information about computer events is presented by analogy with everyday events. He calls these sound elements auditory icons. Sounds were foremost used to sonify interaction, or what I called 'activities' in the introduction.

Outcome: SonicFinder is still a very often discussed case of a sonified interface. SonicFinder, however, has never been tested scientifically, and for technical reasons it didn't work well on all machines. Nonetheless, Gaver claims (in (Buxton et al., 1994)) that 'there are a number of people, including myself, who use it as their standard interface as of this writing, more than a year after it was developed'.

He also notes that users complain of missing it when they are using a quiet Finder. At least for them the addition of sounds is valuable, and for this reason alone the SonicFinder must be counted as a success.

Audio Windows

Related researchers: Ludwig, L.F., Pincever, N., Cohen, M. (Ludwig et al., 1990)

Description: Sound events produced by different (layered) windows on a desktop are placed in an acoustic 3D world. Interesting was their approach of adding 'focus' and 'context', or as they call it, hierarchy, to the different windows. They used sound processing techniques from the music, radio and film industries: self animation (exciters, etc.), distortion (non-linear wave shaping; amplitude distortion), thickening (a chorused 'doubling' effect via pitch-shifted signals), peaking (linear band-emphasis filtering), distancing (reverberation & echo), muffling (linear low-pass filtering) and thinning (linear high-pass filtering).

Self animation was used to make the sound more lively by accentuating frequency variations. They compared it to the way stones create a more eye-catching pattern in a shallow creek. Distortion produced a 'strained' sound, but it reduces intelligibility and increases listener fatigue.

Thickening decreases intelligibility slightly but makes a 'fuller' sound. Peaking can be used to boost amplitudes, e.g. in the 1 kilohertz range (where most speech phoneme information is carried), but has no other good use according to Ludwig et al. (Ludwig et al., 1990).

Distancing gives a fuller, mystical sound, but reduces intelligibility. Muffling can create the impression of distance or confinement, but reduces intelligibility even more.

Outcome: I have not been able to find any published outcomes.

Environmental Audio Reminders (EAR)

Related researchers: Gaver, W. (in 1991)

Description: Environmental Audio Reminders (EAR) is about transmitting short auditory cues to people's offices in order to inform them of upcoming or ongoing events. The system was designed to support casual awareness of colleagues, indicate informal communication opportunities, and signal formal events, announcing events in the workplace unobtrusively, without interrupting normal workplace activities. The sounds used were very literal: when mail arrived, for example, the sound of a stack of papers falling was heard.

Outcome: The use of literal sounds has a drawback according to Barber (Barber, 1998). The sound of falling objects does not have pleasant associations and could be experienced as a warning that something bad has happened. Another drawback of real-life sounds is that, if the sounds are too realistic, users will be very confused as to their source. (This outcome has been derived from secondary literature.)

ShareMon

Related researchers: Cohen, J. (Cohen, 1993)

Description: ShareMon10 utilizes non-speech auditory cues to notify users of file sharing. For example, when a user accesses a file on your machine, you might hear the sound of a drawer opening or the sound of the Star Trek transporter energizing.

Outcome: Cohen says that ShareMon users preferred to be notified by sounds, or by sounds combined with other modalities. The users of ShareMon found sounds less disruptive than information presented via other channels. Furthermore, users preferred audio above all other modalities, even when graphics or text-to-speech would have been more informative.

OutToLunch

Related researchers: Cohen, J. (in 1994) (information derived from Albers (Albers, 1996))

Description: The OutToLunch system attempted to recreate an atmosphere of group awareness in which individuals felt that their co-workers were nearby even if these co-workers were physically dispersed or isolated. By playing pre-recorded keystrokes and mouse movements, OutToLunch used the sounds of keys clicking and of mice rolling to create the sensation of group activity.

Outcome: Though I have not been able to find any related writing on this case, Leplâtre (Leplâtre et al., 2004) notes that Gaver11 cites it as a rare example of an application with quality sound design.

SoundShark / ARKola

Related researchers: Gaver, W., Smith, R., O'Shea, T. (Gaver et al., 1990), (Gaver et al., 1991)

Description: SoundShark makes use of auditory feedback like the SonicFinder, but also sonifies ongoing processes, which indicate the nature of the objects and the continuing activity around them. It was designed as an extension of Shared ARK (Alternate Reality Kit), a test lab for collaborative working. Testing of SoundShark was done with the 'ARKola bottling plant': nine machines had to be controlled, up to twelve sounds played simultaneously, and not all machines could be shown on one display.

Outcome:

Their observations indicated the effectiveness of the sound in two broad areas: first, it helped people keep track of many ongoing processes, and secondly, it helped people collaborate.

Without sound, people often overlooked machines that were broken or that were not receiving enough supplies; with sound these problems were indicated either by the machine's sound ceasing (which was often ineffective) or by the various alert sounds.

They also note that people can hear the plant as an integrated complex, where sounds merge to produce an auditory texture, like the sound of an automobile does: participants seemed to be sensitive to the overall texture of the factory sound, referring to 'the factory' more often than they did without sound.

When sound was added collaboration between colleagues increased: The ability to provide foreground information visually and background information using sound seemed to allow people to concentrate on their own tasks, while coordinating with their partners about theirs.

Furthermore the addition of sound increased the participants' engagement. In sum, the ARKola study indicated that auditory icons could be useful in helping people collaborate on a difficult task involving a large-scale complex system, and that the addition of sounds increased their enjoyment as well.

Dolphin

Related researchers: David K. McGookin, Stephen A. Brewster (McGookin et al., 2002)

Description: A multi-modal focus-and-context technique developed for PDAs, using a visual display to present the focus and aural cues (spatialized earcons) to place it in context.

Outcome: There was no significant difference in either the accuracy or speed of navigation between the two conditions. The two conditions in this test were purely visual presentation versus aurally augmented presentation. The researchers believe that they did not yet have enough knowledge about creating spatialized audio spaces populated with structured audio.

Summary and conclusions

Adding sound is not the holy grail to solve all usability problems. It may however solve some of them.

The first application discussed, SonicFinder, uses sampled auditory icons of everyday sounds. These icons could be heard as a consequence of direct manipulation. No scientific testing was involved, yet Gaver states that 'users complain of missing it (sound, ed.) when they use a quiet Finder'. The technique of everyday listening used for the creation of these icons will be discussed later. The further relevance for our research ends, however, with the fact that the sounds heard are a consequence of direct manipulation; because of that, mistaking auditory icons for real-world events may occur less often.

In EAR, Gaver used manipulated sound effects. Barber (Barber, 1998) states that the literal sound effects used in EAR might have a drawback: the associations may be wrong, and the sounds could be mistaken for real-world sound. Possibly sound effects need to be (more) abstract (maybe like the synthesized sounds in Gaver's SonicFinder), or have no relation to any real-world event at all (e.g. earcons). The advantage of non-sampled sounds is that their parameters are more open to control.

Abstraction of sound is to be found in Dolphin. Dolphin used earcons: small, informative tone sequences with a high degree of abstraction. Are musical motifs the solution then? Gaver (cited in (Leplâtre et al., 2004)) concludes, after testing ARKola and further studies12, that musical phrases may be hard to integrate into a working environment. Everyday sounds that integrate well with the everyday environment are less annoying than music. Furthermore, when tunes are repeated, people tend to find them annoying. Looking at the pros and cons together, it looks like a battle between two camps. Later in this thesis we will see that both are currently accepted methods for auralization.

ShareMon is interesting because it sonifies context. Though users may want to know how network traffic is behaving, they don't want to be bothered with it constantly. Context is also being sonified in another study by Cohen, OutToLunch, yet I have not been able to obtain any test results.

What SoundShark proved is that sound can really help to manage information, even when it is off screen. Second, it stimulated collaboration between operators, who discussed problems together. A third advantage is that operators felt more engaged.

Though Dolphin did not prove sound to be a better channel for presenting information, the mere fact that there was no significant difference in either the accuracy or speed of navigation between the two conditions is hopeful.

The last four case studies (ShareMon, OutToLunch, SoundShark and Dolphin) are good examples where context is sonified rather than direct manipulation. Some information is not important enough to be confronted with all the time in great detail (as ShareMon showed). What sonification can accomplish is a greater awareness of all processes, which OutToLunch tried to accomplish for co-workers in different physical locations, and Shared ARK for processes that are difficult to manage visually. Shared ARK even accomplished better cooperation between colleagues and created an atmosphere in which operators felt more engaged.

Dolphin, however, did not prove sound to be a better channel for presenting context, but at least the outcomes were not in favour of the visual channel either. The researchers think this was caused by the limited understanding of spatialized earcons.

Finally, the paper about Audio Windows gives us an interesting view of the sound-editing techniques used in film, music and radio production. Though I have not been able to find any testing material, the simple, yet perhaps culturally dependent, tricks used to add hierarchy to simultaneously played sounds are inspiring. Such techniques are also well applicable when homogeneity is pursued, which often happens during the mastering process of a record. I will get back to this in chapter 5.

Satisfying the user

Today interface designers should be designing for users, not for computers. Before developing, or adopting, a new auditory language, auditory interface designers should know what the constraints are. What are the human factors to deal with? In this chapter I will try to find an answer to this question.

Usability

Usability is the term used to describe the property of a product that relates to the design of the interface. ISO13 defines the abstract term usability as a container for three separate units: efficiency, effectiveness and satisfaction, which are in turn rather abstract terms themselves.

Van Welie, Van der Veer and Eliëns (Welie et al., 1999) developed a layered model of usability for the convenience of the designer or the evaluators. The upper layer, 'Usability', is based upon the ISO definition. What is interesting about this model is that it incorporates both the academic ISO definition of the term usability and a more practically applicable set of usage indicators (inspired by work of Nielsen and Shneiderman). Below that level they have incorporated 'means' which are related to these usage indicators.

This chapter will mainly focus on tackling the satisfaction pillar of usability within the context of sound. Increasing the efficiency and effectiveness will not be neglected in this document, but will not be discussed in this chapter.

What is satisfaction? WordNet14, an online lexical reference system whose 'design is inspired by current psycholinguistic theories of human lexical memory', returns the following descriptions of the term 'satisfaction':

  1. the contentment you feel when you have done something right; the chef tasted the sauce with great satisfaction

  2. state of being gratified; great satisfaction; dull repetitious work gives no gratification; to my immense gratification he arrived on time

  3. compensation for a wrong; we were unable to get satisfaction from the local store

  4. act of fulfilling a desire or need or appetite; the satisfaction of their demand for better services

Webster's 1913 Dictionary returns:

  1. The act of satisfying, or the state of being satisfied; gratification of desire; contentment in possession and enjoyment; repose of mind resulting from compliance with its desires or demands.

  2. Settlement of a claim, due, or demand; payment; indemnification; adequate compensation.

  3. That which satisfies or gratifies; atonement.

Satisfaction in interface design is something of the chef who tasted the sauce with great satisfaction and the act of fulfilling a desire or need or appetite. But also contentment in possession and foremost enjoyment plays a role.

Satisfaction is one of the greatest difficulties faced when designing an auditory interface. The auditory channel is very sensitive to nuisance. Especially when one talks about adding more sound to the interface, people get skeptical and often react with 'The first thing I do is turn off the operating system's sounds' or similar remarks.

In the ISO standard, the subjective measure of 'satisfaction' is described as the comfort and acceptability of use by end users. That seems something to strive for.

Lindgaard and Dudek (Lindgaard et al., 2003) tried to answer the question 'What is this evasive beast we call user satisfaction?'. Their research suggests that it is a complex construct comprising several affective components as well as a concern for usability. Satisfaction may be the aspect which foremost shapes the overall 'feel' a user has when working with an interface. Still, satisfaction remains difficult to distinguish from overall usability and is still being researched by Lindgaard et al.

Based purely on the model by Welie et al., an understanding of the user model and the task model is needed before assumptions can be made about the means by which the user can be satisfied. Understanding the user is about knowing the user's limitations; the task model is about knowing the user's tasks. The accent of this chapter will be on knowing the user.

In this chapter I will discuss the problems faced and some of the solutions to these problems, in order to achieve more satisfying communication. The outline of this chapter is as follows: first cognitive overload, followed by the noise surrounding us in our daily life, then an overview of how sound is judged in experimental settings, after which I will finish with more on aesthetics.

Cognitive overload

In the introduction I already mentioned the overburdening of the visual channel with information as a problem of concern for the interaction designer. This overburdening is part of what is called cognitive overload. But what is it really? Kirsh (Kirsh, 2000) explains that cognitive overload concerns information overload, but also the problem of the increasing number of decisions which have to be made, the increased frequency of interruptions people face, and the increased need for time management in everyday activity.

Besides the fact that knowledge workers are confronted with ever more information, whether because they think they need it or because others supply them with it, there are, according to Kirsh, two other major causes of cognitive overload: interruption and multi-tasking, and the two are related.

As a solution, Kirsh mentions changing the physical layout into activity spaces dedicated to the task people are engaged in, which prevents switching between programs. Secondly, methods, algorithms and practices can be changed to reduce overload: better techniques for conducting meetings, personal time management, recording results, accessing corporate memory, dealing with interruptions and coordinating activity at both the individual and the group level.

Fussell et al. (Fussell et al., 1998) mention trade-offs between overload and awareness. Using passive communication techniques (they mention e-mail and faxes, as opposed to communication types like telephone calls), users can maintain awareness without increasing the overload too much.

Noise nuisance in daily life

As I have mentioned before, sound can be very annoying. For a sonified interface to be effective, it must not be annoying. For a basic introduction to annoyance I have mainly used research directed at noise disturbance in our everyday world (traffic noise, for example), as there is relatively little research on this subject with regard to interface sounds.

Factors affecting nuisance

Loudness

Often annoyance has to do with the loudness. Loud music which makes a conversation impossible is annoying. Loud noise which makes it hard to concentrate is annoying. I will list some research related to environmental noise and some notes about laws related to sound pressure levels.

Though one might think that people living in cities would be less bothered by environmental sounds, this is not true according to a Dutch TNO report (Passchier-Vermeer et al., 2001). Demographic factors have little influence when the nuisance threshold is measured; this is confirmed by several other investigations (Miedema et al., 2003). Age, however, does have an influence: the amount of nuisance at a certain sound level increases after the age of 18 (which was the lower limit in the research mentioned) and decreases after the age of 60.

Passchier-Vermeer also mentions a sound pressure level of 42 dB(A) above which sounds produced by traffic start causing severe nuisance. This is similar to the day-night average (DNL) noise level mentioned by the Health Council of the Netherlands, according to Miedema. Miedema adds, however, that the observational threshold for environmental impulse noise and low-flying aircraft is even 12 dB lower. This may have to do with the semantics surrounding the sound.

Law

The Dutch Ministry of Social Affairs and Employment insists on a maximum sound pressure level of 80 dB(A) over an 8-hour shift. In relation to living areas, Dutch law prescribes that the maximum sound pressure level produced by industry and traffic may not exceed 55 dB(A), with an exception for railway traffic, which has a maximum of 70 dB(A).

Volume of speech

Both the Passchier-Vermeer report and the research by Miedema (Miedema, 2001) state that when environmental noise exceeds 40 dB(A), people having a conversation over a distance of 1 m gradually start to talk louder than average to obtain a greater signal-to-noise ratio. The average conversation level is 55 dB(A). A non-native speaker will desire an even greater signal-to-noise ratio.

Frequency response

dB(A) is not the same as dB. dB(A) stands for A-weighted decibels: a weighted measure (the A-weighting) adjusted to the listener's experience of loudness. The dB(A) scale incorporates the fact that the human ear is less capable of hearing tones of low and very high frequencies. At 1000 Hz, levels measured in dB(A) are equal to the same value in dB.
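For reference, the sketch below computes the standard A-weighting correction in dB at a given frequency; this is general acoustics knowledge (the IEC A-weighting curve), not something taken from the thesis:

import math

def a_weighting_db(freq_hz):
    # Relative response of the A-weighting filter in dB: around 0 dB at
    # 1000 Hz and strongly negative at low frequencies, reflecting the
    # ear's reduced sensitivity there.
    f2 = freq_hz ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.0

for f in (100, 1000, 4000):
    print(f, round(a_weighting_db(f), 1))  # ~ -19.1, 0.0, +1.0 dB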

Non-acoustic factors

There are also non-acoustic factors that play a role in experiencing nuisance. Miedema et al. (Miedema et al., 2003) cite Langdon15, who found that noise sensitivity has a strong relation with general environmental dissatisfaction. This is important for my study: semantics matter. Furthermore, anxiety plays a role (Zimmer and Ellermeier16). Fear of the sound's source, the knowledge that others could have reduced the nuisance, or the fact that the sound is the result of a new situation or new technology could all cause a raised sensitivity to noise.

Consequences

Health factors

Besides the fact that an overload of sound can cause hearing damage, there are also psychological factors that can harm one's health.

One example is somatic stress phenomena: stress phenomena which occur after enduring exposure to noise-related nuisance, resulting in bodily changes such as hypertension (high blood pressure) and ischemic heart disease (like myocardial infarction and angina pectoris). The chance of such phenomena is raised when the noise intensity exceeds 70 dB(A).

But functional effects may also occur, like having difficulty concentrating.

Concentration

Being able to concentrate on a subject can be very important. Sounds that distract are disturbing and may make one lose concentration. Attention is used for processing environmental information (whether visual, auditory or otherwise), creating a mental representation, concentrating on a subject and (re)directing attention.

Irritating sounds are sounds that demand attention without the recipient being willing to give it. Sounds which do (or seem to) contain data are very difficult to ignore.

The level of nuisance also relates to the task performed at a certain moment. When the task makes heavy use of a person's short-term memory, one will be annoyed sooner. Whether consciously or unconsciously, noise troubles thinking; people tend to delay tasks which demand much of their memory capacity.

Aesthetics

Aesthetics is the theory behind taste; the science of the beautiful. However Lindgaard and Dudek (Lindgaard et al., 2003) wisely state that, even while they have agreed on associating aesthetics to beauty, even 'beauty' has at least five clearly distinguishable meanings in philosophy. Beauty however is often linked to a sense of order. Lindgaard and Whitfield17 comment sceptically that such meaning doesn't make sense when being applied 'chiefly to women and weather'18.

Aesthetics seems to play a role in the sense of satisfaction experienced by the user of a system. This is also recognized by the Auditory Display community (Leplâtre et al., 2004). The next step after experimental applications of sonified interfaces is making sonified interfaces aesthetically pleasing.

Several types of research have been directed at how sound is experienced. I will start off by discussing research on urgency, followed by a discussion of what makes sounds attractive. A third section attends to the experience of music.

Urgency

Urgent sounds are often the opposite of aesthetically pleasing. Interesting research has been done by Edworthy et al.19. This research has been discussed by several researchers, foremost in the document by Buxton et al. (Buxton et al., 1994). For each of Edworthy's experiments, several acoustic factors were varied and the subjects were asked to rate the urgency of the resulting sounds. The factors being varied were divided into two groups: individual sounds (or pulses) and melodies (or bursts). Subjects seemed very consistent in judging urgency, which implies both that it is a psychologically real construct and that acoustic factors have consistent effects on it.

| Parameter | Most urgent | (intermediate) | Least urgent |
| --- | --- | --- | --- |
| Individual tones | | | |
| Fundamental freq. | 530 Hz | - | 150 Hz |
| Amp. envelope | ~rectangular | slow onset | slow offset |
| Harmonic series | random | 10% irregular > 50% irregular | harmonic |
| Delayed harmonics | no delay | - | delayed harmonics |
| Sequences | | | |
| Speed | fast | moderate | slow |
| Rhythm | regular | - | syncopated |
| Number of units | 4 | 2 | 1 |
| Speed change | speeding | regular | slowing |
| Pitch range | large | small | moderate |
| Pitch contour | random | - | down/up |
| Musical structure | atonal | unresolved | resolved |

Table 2: Acoustic parameters affecting urgency (Edworthy et al., 1991)

Buxton et al. continue:

it should be noted, however, that the relative effects of the factors was not tested. (...) Nonetheless, these results provide extremely useful information for creating sounds and sequences that are more or less urgent (and thus annoying).

A paragraph later they write that these results have been derived from experimental studies, which may differ from real world experience, where for example repetition takes place.
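To make the table a little more tangible, the sketch below synthesizes two pulse sequences using a few of the parameters above (fundamental frequency, onset, repetition speed and number of units). It is only an illustrative sketch assuming NumPy, not a reconstruction of Edworthy's actual stimuli:

    import numpy as np

    SR = 44100  # sample rate in Hz

    def pulse(freq, dur, attack):
        """One tone with a few harmonics and a given attack time (seconds)."""
        t = np.arange(int(SR * dur)) / SR
        tone = sum(np.sin(2 * np.pi * freq * (k + 1) * t) / (k + 1) for k in range(4))
        env = np.minimum(1.0, t / max(attack, 1e-4))       # onset (attack)
        env = env * np.minimum(1.0, (dur - t) / 0.005)     # tiny offset to avoid clicks
        return tone * env

    def burst(freq, dur, attack, gap, n):
        """n pulses separated by 'gap' seconds of silence."""
        silence = np.zeros(int(SR * gap))
        return np.concatenate([np.concatenate((pulse(freq, dur, attack), silence)) for _ in range(n)])

    # 'Urgent' per Table 2: higher fundamental, near-rectangular onset, fast repetition, 4 units.
    urgent = burst(freq=530, dur=0.1, attack=0.001, gap=0.05, n=4)
    # 'Calm': lower fundamental, slow onset, slow repetition, a single unit.
    calm = burst(freq=150, dur=0.4, attack=0.15, gap=0.3, n=1)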

Appreciation

What an interface should try to accomplish is being appreciated by its users. How can this be achieved? Nicol et al. (Nicol et al., 2004) mention the design guidelines of Audio Aura20:

Background sounds are designed to avoid sharp attacks, high volume levels, and a substantial frequency content in the same range as the human voice (200-2000Hz)

Nicol continues that this guideline is followed by the note that current audio design often induces this alarm response, intentionally or otherwise. That is exactly why these paragraphs are here.

Buxton et al. (Buxton et al., 1994) mention several factors that play a role in the appreciation of the sounds heard: complexity, semantics and obtrusiveness.

Complexity

Moderately complex sounds tend to be less annoying than very simple or very complicated ones. However, the perceived complexity of a sound decreases with familiarity (this has been confirmed by both experimental and theoretical studies of emotion and aesthetics). A frequently heard sound may thus be allowed to be more complex.

Gaver suggests that slightly varying the sounds may be the middle way: when changes are kept small, effective variations can be made without reducing the sounds' identifiability. To decrease the overall complexity, many long sounds can be shortened considerably and thus be made to convey information concisely. Leplâtre and McGregor (Leplâtre et al., 2004) agree:

In brief, sonic density refers to the perceived density of a sound. The contributing factors are duration, intensity, spectrum, number of instruments etc. (...) annoyance is best avoided by limiting the density of the sounds.
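Gaver's suggestion of small variations is simple to sketch. The snippet below is an illustrative example assuming NumPy (it is not taken from any cited toolkit); it nudges the pitch and level of a stored waveform within narrow bounds, so that repeated playback stays recognizable yet less monotonous:

    import random
    import numpy as np

    def vary(sound, max_pitch=0.03, max_gain=0.1):
        """Return a slightly varied copy of a mono waveform: pitch within about
        +/-3% and level within +/-10%, small enough to keep it recognizable."""
        factor = 1.0 + random.uniform(-max_pitch, max_pitch)
        # Naive resampling: reading the waveform a little faster or slower shifts the pitch.
        positions = np.arange(0, len(sound) - 1, factor)
        varied = np.interp(positions, np.arange(len(sound)), sound)
        return varied * (1.0 + random.uniform(-max_gain, max_gain))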

Semantics

The semantics of a sound also matter. This has already been mentioned when discussing the non-acoustic factors. Another example is that highway sounds are more annoying than bird sounds. There is no theoretical support for this, but it has been verified in various studies.

Clarity versus obtrusiveness

Buxton et al. also note the tension between clarity and obtrusiveness. Metallic sounds, for example, sound obtrusive; low-pass filtering and (slightly) slowing down the attack may help. A similar point of concern is noted by Leplâtre et al. (Leplâtre et al., 2004), who refer to the balance between homogeneity and maximizing the differences between sounds to make them more distinguishable:

Designers often need to maximise the differences between sounds to make them more easily distinguishable. This compromises the homogeneity of the auditory interface and hence its overall aesthetics.

Temporal envelope

Leplâtre and McGregor (Leplâtre et al., 2004) add one more property affecting aesthetics, namely the temporal envelope. They state that sounds used in auditory interfaces must often be brief and interruptible. That is why they recommend placing the information in the onset of the sound. To soften the transitions they recommend the use of fade-ins and fade-outs.
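A minimal sketch of such an envelope, again assuming NumPy and a mono waveform, leaves the information-carrying onset almost untouched and fades the tail out so the sound can be interrupted gracefully:

    import numpy as np

    def shape(sound, sr=44100, fade_in=0.005, fade_out=0.05):
        """Apply a very short fade-in and a longer fade-out, so the sound does not
        click when it starts or when it is cut short; the onset, where the
        information sits, is hardly affected."""
        out = sound.astype(float).copy()
        n_in, n_out = int(sr * fade_in), int(sr * fade_out)
        if n_in:
            out[:n_in] *= np.linspace(0.0, 1.0, n_in)
        if n_out:
            out[-n_out:] *= np.linspace(1.0, 0.0, n_out)
        return out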

Musical experience

Other interesting research has been performed in relation to music, specifically the rhythms and intervals used. This research may be especially important when earcons (tonal sequences conveying information) are discussed.

Smeijsters (Smeijsters, 1987) summed up some of the research on the 'how'. He notes research by Gundlach, reviewed by Rösing21, which relates to the experience of rhythm in European music (translated from Dutch):

| Type | Experienced as |
| --- | --- |
| Stiff rhythm | fanciful, nervous |
| Irregular rhythm | fine, sensitive, stately, illustrious, dark |
| Little irregular rhythm | vivid, daring, fanciful, sparkling |
| Regular rhythm | sparkling, vivid, daring, happy |
| Many unisons and major/minor seconds | nervous, sad, clumsy |
| Many thirds | triumphal |
| Many larger intervals | glad, renowned, pleasant |

Table 3: Judgement of rhythm and interval parameters of European music

The problem with this research is that it focuses on communicating emotions, and musical affects that stimulate emotions may not be culture-independent.

Repetition

When information is conveyed musically there is another factor to take into account. Gaver and Mandler22 found that people tend to find the repetition of tunes annoying. Gaver23 also noted that musical phrases may be hard to integrate into a working environment.

Leplâtre and Brewster (Leplâtre et al., 1999), however, mention research by Brewster (1996)24 showing that when sounds are structured without involving musical functions, they are not found to be annoying at all.

Summary and conclusions

In this chapter I have tried to find the constraints to deal with when designing for humans. As I want to take the auditory interface to a next level, I need to address the need for satisfaction, which has been recognized as one of the three pillars of usability.

First I discussed the problems which are already out there, facing people in general and law-makers in particular in daily life. Of most interest was the figure of roughly 40 dB(A). This is the level at which people start to get (severely) annoyed, but also the level above which people start to talk louder. I believe that this must be the maximum level of sound generated by the computer. When the sound is very complex, even lower intensities are desirable.

Next was a discussion of research related to the experience of sounds. We should not forget that the results presented may not work out the same in practice, yet they may give a reasonable starting point for satisfying sound design. The research in this area does not yet supersede the qualities of a good sound designer's intuition.

I followed with a discussion of aesthetics, or the search for what is beautiful. People tend to appreciate order. Sounds should not be over-complex, yet very simple, possibly too synthetic sounds aren't appreciated either. There is also a balance between having very distinct sounds and having an aesthetically unpleasing soundscape of randomly shaped sounds.

When sounds are used musically, Gaver warns that repetition of sounds may turn out to be annoying.

Finally, sound itself should be experienced as friendly. Presentation matters. People should be attracted to the phenomenon of a sound-producing computer. This is interesting not for the sound designer, but for marketing. Though good marketing cannot make a bad product succeed, I believe people need to be convinced of the positive side of sound produced by a computer. The current misuse of interface sounds has led to a very negative attitude towards them.

Towards Design

Several cases have been covered and the basic constraints have been outlined. The next task is answering the question: how could one design a sonified interface? What approaches could be used in shaping a model (tackling efficiency and effectiveness)? And how have these already been applied in current auditory interfaces? Finally, the question will be raised of how to make it sound like a coherent whole.

Mental models

Sophisticated system processes need to be mapped onto more understandable concepts. A coherent model that supports the creation of mental models in the user's mind can surely contribute to making systems understandable. A mental model is a term that describes how systems are understood by different people, primarily consisting of (i) the way users conceptualize and understand the system and (ii) the way designers conceptualize and view the system (Preece et al., 1994). A model can be used to define mappings.

Theories on mental models

Metaphors

One type of mapping often used to present difficult systems in a more user-friendly way is the metaphor. Inspired by researchers like Lakoff and Johnson (Lakoff et al., 1980), who made us understand that we learn to comprehend and communicate concepts through metaphors, interface designers started to build interfaces that did the understanding for us. Time, and research, has however proven that using metaphors often leads to inconsistent behaviour. Evangelists like Don Norman (Norman, D. 1998) have sworn off metaphors and plead for other types of mappings.

Gaver does not dismiss the use of metaphors completely (Gaver, 1995). He embraces the complexity created by the mixed approaches. It allows a simpler, easier to conceptualize mapping. He ends with: For understanding interfaces, it is important that we untangle our metaphorical web. For designing, it is important we do not.

Models, beyond metaphors

Payne (Payne, 2003) states that there is not just one theory of mental models. On the contrary, he thinks there are several independent theoretical commitments made under the umbrella of 'mental models'. He distinguishes four established ideas of what mental models are: mental models as theories (the model people have of how a system works), mental models as problem spaces (studying the way people solve problems), mental models as homomorphisms (a special kind of representation), and mental models of representations (or situation models, as in language). He then follows with two ideas of his own, more or less based upon the four theories mentioned: mental representations of representational artifacts, and mental models as computationally equivalent to external representations.

  1. Mental representations of representational artefacts

Users form representations of what they see or read. They try to understand what is perceived and try to apply this to their needs. Yet this understanding of the device space may not hold up when applied in a goal space. He suggests that either conceptual instructions may fill this gap, or the interface should be redesigned to make the appropriate device space visible.

An example may clarify this: Payne observed novice users trying to use the copy command. He taught them the copy operation of selecting the string, selecting copy from the edit menu, pointing the cursor at the new position and pasting. Yet when the users were asked to duplicate the string at two different locations, many of the participants performed the whole copy-paste sequence twice, while a second paste would have sufficed. Payne therefore suggested renaming Copy to Store so that more novice users would be inclined to construct such an account.

  2. Mental models as computationally equivalent to external representations

When discussing mental models as computationally equivalent to external representations, Payne starts by stating that the importance of structure sharing is controversial. He therefore suggests distinguishing between the computational and informational equivalence of representations. It is possible to have informationally equivalent representations that may or may not be computationally equivalent, meaning that one may or may not be as cost-effective as the other. This points to empirical consequences of a representation (together with the processes that operate upon it) that go beyond mere informational content.

He therefore suggests using task-relative versions of the concepts of informational and computational equivalence: representations are informationally equivalent, with respect to a set of tasks, if they allow the same tasks to be performed. He later follows with an example where he wonders, after mentioning supporting research by Thorndyke and Hayes-Roth25, whether all mental models might be defined as mental representations that are computationally equivalent to external, diagrammatic representations. He reasons as follows: a cognitive map can usually be defined as a mental representation which in certain cases is computationally equivalent to an external map. Furthermore, the ordinary definition of a cognitive map is closely related to the analog-representation definition of a mental model. The way I understand this is that if we can offer a diagrammatic representation of how a task can be solved, this can be used to stimulate the creation of computationally equivalent mental models. Note that such diagrammatic representations cannot be shaped to conform to every task. After stating his claim, he notes that he is not able to defend it, but the claim does not seem implausible to him. Neither does it to me.

Short reflection on models

I believe we see a coherent, mutually contributing set of perspectives. Even Gaver's seemingly controversial point of view can be reconciled with Payne's observation: Gaver's different metaphors can be seen as different external mappings, each suited to a different task. However, Gaver's thinking feels a bit loose-ended, which worries me with regard to consistency and therefore aesthetics.

Payne's ideas show the problem of creating models where representational artifacts (like language or metaphors) are used. However, he also remarks that the artifacts may be further explained through the interface itself. Secondly, it shows that interface designers, when explaining a model, should consider the task: Payne thinks different tasks need different explanations of how the interface works.

Should metaphors be used? Yes, when it comes to explaining the interface for a certain task; no, when it comes to the often multi-task-oriented interface itself.

Meaning

Meaning in sound

Sounds may have a narrative, carry data or have no function at all. Sounds may be synthesized without any relation to natural sounds, or they may be exactly such natural sounds, according to Emmerson (Emmerson, 1986) in his search for the relation between language and electro-acoustic music. Translated to interface design this means that sounds may explain themselves (metaphorical or iconic use of sound), carry data (symbolic use) or have no meaning (being there for aesthetic reasons, or for no reason at all).

Gaver (in (Buxton et al., 1994)) has also been searching for meaning in sound. He was one of the first to actively research the psychology of everyday listening, inspired by the ecological approach to perception developed by Gibson26. He argues that primitive analysis only plays a small role in understanding our experience of the everyday world; instead, he argues, perception depends on learning, memory and problem-solving. Gaver asks the question: what do we hear? There are sources creating sounds in a particular way. Understanding how physical objects produce certain sounds may help us create another type of platform for the creation of auditory cues.

Meaning in music

The fact that music is capable of expressing something has repeatedly been doubted. On one side there are composers and critics who think, and want, music to expressively portray the passions, writes Smeijsters (Smeijsters, 1987) on music and language.

Meyer (Meyer, 1956) named the two camps 'absolutists' and 'referentialists'. Absolutists insist that musical meaning lies exclusively within the context of the work itself. Referentialists believe that music, in addition to these abstract, intellectual meanings, also communicates meanings which in some way refer to the extra-musical world of concepts, actions, emotional states and character.

Leplâtre and Brewster (Leplâtre et al., 1999) even argue that there is room for a third level, which they call the contextual level, where meaning is derived from context. We can now distinguish the following levels of the user's musical understanding:

  • The intellectual level; meaning lying strictly in the musical work of art

  • The referential level; meaning is related to extra-musical world of concepts, actions, emotional states and character

  • The contextual level (introduced by Leplâtre and Brewster); music takes its actual informational meaning, specific to a situation of interaction in an auditory display.

Mapping the auditory display

Leplâtre and Brewster (Leplâtre et al., 1999) mention Kramer (1994), who envisages auditory display in terms of two broad categories (or two broad types of mapping): analogical representation, in which the representation has an intrinsic relationship with the data being represented (a good example is the Geiger counter), and symbolic representation, in which information is represented in discrete elements that have no intrinsic relationship with the elements of what is being represented.

The use of sound found in commercial interfaces is symbolic, often in the manner of true/false states. However, there may also be a more analog symbolic representation, where analog values are transformed into discrete values.

Gaver suggests another categorization (Buxton et al., 1994). He distinguishes three different types of mapping: symbolic, metaphorical and iconic. These are distinguished by the degree to which they are arbitrary or lawful. Symbolic mappings are completely arbitrary, and thus may be difficult to understand without learning. Iconic mappings on the other hand resemble reality, both functionally and visually (or aurally), which makes them familiar to beginning users. A disadvantage, not noted by Gaver in the same document, is the difficulty that arises when more complicated functionality, not seen in reality, is added.

Visual or aural presentation

Currently, mainly visual presentation is used. Designers are confronted with ever more information to be displayed on a relatively small display. Auralization instead of visualization may help reduce the weight on the visual channel, possibly even reducing the effects of cognitive overload (see chapter IV). But how do we divide this information? One approach, already seen when discussing some cases, is focus and context.

Focus and Context

Focus and context are terms describing the importance of the information available to the user. Focus is of most importance to the user and thus should be presented in more detail than contextual information. Context should not bother users too much.

At the computer science department of the University of Glasgow, Stephen Brewster and his research group investigate augmented communication. One concept tested is a multi-modal focus and context design based upon the Bifocal Lens concept (McGookin et al., 2002). This concept, developed by Spence & Apperly27, makes a strict distinction between focus and context. McGookin et al. chose this approach in favour of the Fisheye model developed by Furnas28, where this distinction is not so obvious. The strict distinction matches the strict distinction between the aural and visual channel.

Types of non-speech audio

Auditory displays can be envisaged in terms of two broad categories: analogical and symbolic information representation29. Analogical representation is one in which there is an immediate and intrinsic correspondence between the sort of structure represented and the representation medium. A Geiger counter is a good example.

Symbolic representation involves an amalgamation of the information represented into discrete elements and the establishment of a relationship between information conveying elements that do not reflect intrinsic relationships between elements of what is being represented.
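A small sketch may illustrate the analogical category. In the spirit of the Geiger counter, the example below is purely hypothetical: read_value and click stand in for a real data source and a real playback routine, and the mapping simply turns a value between 0 and 1 into the density of random clicks:

    import random
    import time

    def geiger(read_value, click, duration=10.0):
        """Analogical mapping in the spirit of a Geiger counter: the higher the
        value returned by read_value() (between 0 and 1), the denser the clicks.
        read_value and click are placeholders for a real data source and a real
        sound playback routine."""
        end = time.time() + duration
        while time.time() < end:
            rate = 1.0 + 30.0 * read_value()    # expected clicks per second
            if random.random() < rate / 100.0:  # the loop polls at roughly 100 Hz
                click()
            time.sleep(0.01)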

Earcons

Earcons are abstract, structured musical tone sequences. Earcons can be combined, transformed and can inherit properties from other earcons, constituting an auditory language of representation.

Earcons are not based upon any real-world model and are therefore symbolic. Illustration 4 (an example of an earcon hierarchy showing sounds that could be used to represent errors; Buxton et al., 1994) may explain how they work.
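Since the illustration itself cannot be reproduced here, a small data-structure sketch may serve a similar purpose. The family below is hypothetical (the timbre and notes are not taken from Buxton et al.), but it shows the essential earcon idea: children inherit the rhythm and timbre that identify the family and change only the melody:

    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class Earcon:
        timbre: str      # e.g. a General MIDI instrument name
        rhythm: tuple    # note durations in beats
        notes: tuple     # MIDI note numbers
        tempo: int = 120

    # A hypothetical family of error earcons: each child keeps the parent's timbre
    # and rhythm (the family identity) and changes only the melody.
    error         = Earcon(timbre="marimba", rhythm=(0.5, 0.5, 1.0), notes=(60, 60, 60))
    file_error    = replace(error, notes=(60, 64, 67))   # inherits timbre and rhythm
    network_error = replace(error, notes=(60, 59, 55))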

Since musical sentences are perceived as high-level objects, Leplâtre and Brewster conclude, it is probable that the development of music cognition of global pieces of music should help musical interface designers to exploit music for its global meaning rather than for its structure.

Auditory Icons

Auditory icons are natural, often everyday sounds that can be used to represent actions and objects within an interface (Gaver, 1993). Yet they are not just entertaining; they convey information about events in computer systems. Using an 'ecological listening' approach, Gaver suggests that people don't listen to the properties of a sound, but to its source.

Sound effects

Cohen (in (Buxton et al., 1994)) points out that genre sounds can be successfully incorporated into auditory interfaces, yet they might be difficult to interpret for people not familiar with, for example, sci-fi. A solution is to generate the sounds especially for the task, taking physical laws into account.

Another type of mapping may be borrowed from film production. Many of the sounds we hear when watching a film were not there on the set, nor would they occur in a typical setting in the 'real' world. Yet by introducing these sounds a richer atmosphere can be created, enveloping the user.

Auditory clichés

Auditory clichés are sounds which are arbitrarily related to their meaning, even in the everyday world. They are so firmly embedded in the culture that they can be used as everyday sounds. An example is the use of a telephone bell for incoming messages. Some of these sounds, however, may be culture-dependent.

Combined approach

Nicol et al. (Nicol et al., 2004) mention a third approach: the combined approach. When combined, the interface designer can make use of both the complex timbral manipulation of auditory icons and the complex combination and musical manipulation of earcons. Therefore, they state, a much richer design space becomes available.

Relevance

Both earcons and auditory icons are effective at communicating information in sound, though there is more formal evidence for earcons due to more basic research (Brewster, 2003). Auditory icons, however, have proven to work effectively in various tested systems. More detailed research would provide better guidance for the creation of auditory icons.

The advantages and disadvantages of both are obvious: as earcons are abstract, they are more difficult to learn, but better applicable when there is no real-world equivalent. For auditory icons it's the other way around.

Brewster, an advocate of earcons, notes that auditory icons may lose their advantage when the natural context is lost and their meaning has to be learned. He reinforces his argument by stating that the learning factor is not a huge disadvantage for earcons, as research has shown that little training is needed if the sounds are well designed and structured.

It seems earcons and auditory icons will never marry. But probably earcons and auditory icons are not as far apart as they might appear: the timbre of an earcon, for example, could also be recognized as meaningful. That is why the third approach sounds so appealing. Only aesthetics may be harmed when both types of sounds are used without care. Learnability might not even be that important, as sounds often react to actions, whether direct or indirect; the context may therefore help in understanding the message.

Mastering coherence

How does one design a coherent soundscape? Coherence improves aesthetics, which adds to satisfaction. In chapter 3 I have already discussed five factors that play a role in appreciation: complexity, semantics, clarity, repetition and temporal envelope. This sub-chapter is aimed at the final mastering process. During this process, complexity and temporal envelope can still be tackled.

Music- and film-production techniques

The quest for coherence is not a new problem. To me, it seems an obvious choice to look at music- and film-production techniques. Music and film sound are mastered after they are mixed and before they are duplicated for distribution. I will not discuss the mastering process in great detail; mastering is a profession in its own right (Raad, 1998). Yet applying some mastering techniques may help in obtaining a more pleasurable interface.

Filtering

Filtering is used to brighten up the sound or to remove irritating peaks in certain frequency bands. What processing is needed really depends on the source material.

Compression

Compression is a technique used to decrease amplitude differences in a sound. Some audio drivers on the Windows30 operating system already include the ability to compress the overall output. One typical application could be marketed for working at night: neighbours don't have to suffer a restless sleep because of short bursts of loud sound.
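A very reduced sketch of such a compressor, assuming NumPy and a waveform scaled between -1 and 1, could look as follows; it is only meant to show the principle of reducing amplitude differences, not any particular driver's implementation:

    import numpy as np

    def compress(sound, threshold=0.3, ratio=4.0):
        """Very simple sample-level compressor: everything above the threshold
        (on an amplitude scale of 0 to 1) is reduced by the given ratio, which
        decreases the difference between quiet and loud passages."""
        magnitude = np.abs(sound)
        over = np.maximum(magnitude - threshold, 0.0)
        gain = (threshold + over / ratio) / np.maximum(magnitude, 1e-9)
        return sound * np.minimum(gain, 1.0)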

Effects

A final step may be adding effects. This is like applying make-up: it should be used with great caution. Typical effects added at this stage are room simulation and width enhancement.

Standardized design tools

Another way of ensuring a coherent sound design is to create the auditory interface with a specialized tool. Such a tool could standardize transformations, taking into account theories on streaming as well as coherence. The main goal of such a standardized approach is, of course, to take away the need for exhaustive testing and tweaking of the sounds used. I have taken a quick look at two design proposals for such tools: one purely MIDI-based and one using timbre spaces.

Brewster (Brewster, 1996), in the report in which he suggests a design for a MIDI-based toolkit, explains the need for such a toolkit:

  • To simplify the implementation of applications that include sound in their interfaces

  • To allow designers who are not sound experts to create sonically-enhanced interfaces

  • To make sure that the sounds added are effective and enhance the user's interaction with the computer

  • To make sure the sounds are used in a clear and consistent way across the interface

The advantage of using MIDI is that it is a standardized and very lightweight protocol. The disadvantages, however, are the impossibility of changing timbre properties, of positioning in space (with the exception of left/right panning) and of having a reliable set of sounds (the MIDI sounds depend on the type and manufacturer of the sound card; there is no way to know exactly which sound the end-user will hear).
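To illustrate how lightweight such a MIDI representation is, the following sketch writes a three-note earcon to a standard MIDI file. It assumes the Python mido library and is not Brewster's toolkit; note that the timbre actually heard still depends on the receiving synthesizer, which is exactly the disadvantage mentioned above:

    from mido import Message, MidiFile, MidiTrack

    def write_earcon(path, notes, program=12, ticks=240):
        """Write a short sequence of notes as a standard MIDI file."""
        mid = MidiFile()
        track = MidiTrack()
        mid.tracks.append(track)
        track.append(Message('program_change', program=program, time=0))
        for note in notes:
            track.append(Message('note_on', note=note, velocity=80, time=0))
            track.append(Message('note_off', note=note, velocity=80, time=ticks))
        mid.save(path)

    write_earcon('file_error.mid', notes=(60, 64, 67))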

Another approach is the use of timbre spaces, as suggested by Nicol et al. A timbre space is a representation method for sound introduced by Grey31. According to Nicol it is a representation that is flexible enough to allow designers a variety of ways to adapt it. The idea behind the use of timbre spaces in toolkits for interactive sound is that real-time FM-synthesized sounds can easily be manipulated without using much CPU power. Though the sounds are synthesized, it is possible, through various conversions, to use waveform sounds within the system after they have been converted to a function that can be synthesized (with loss of quality, of course). Within a timbre space, sounds can easily be adjusted and can make use of mapping techniques currently developed for earcons as well as mapping techniques developed for auditory icons.
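A single FM operator pair already gives an impression of how cheaply such timbres can be generated and varied. The sketch below is only an illustration assuming NumPy (it is not Nicol's toolkit or Grey's actual timbre space); it moves through a very small 'timbre space' by changing nothing but the modulation index:

    import numpy as np

    def fm_tone(carrier=440.0, ratio=2.0, index=3.0, dur=0.5, sr=44100):
        """One FM operator pair: a modulator at carrier*ratio Hz modulates the
        carrier's phase with the given modulation index. Varying ratio and index
        moves the sound through a (very small) timbre space at little CPU cost."""
        t = np.arange(int(sr * dur)) / sr
        modulator = np.sin(2 * np.pi * carrier * ratio * t)
        return np.sin(2 * np.pi * carrier * t + index * modulator)

    bright = fm_tone(index=6.0)   # richer spectrum
    dull   = fm_tone(index=0.5)   # close to a pure sine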

Evaluation

Evaluation is about finding out what users want and what problems they experience (Preece et al., 1994). It is about understanding how the product is used in the real world. It's about trying to get to the best interface. It's about checking whether it is usable, and finally it may deal with standards. Nielsen (Nielsen, 2003) distinguishes the following steps for evaluation (I have omitted the irrelevant steps):

  • Field study

  • Paper prototyping

  • Refinement of design ideas through multiple iterations

  • Testing of final design

Observing users

When there is a product or a prototype, maybe just a paper prototype, users' usage can be observed. The goal of this observation is discovering flaws in the design. It seems easy, but it is difficult to be truly objective in what you tend to observe. Therefore users may be asked to think aloud or to report their experience after conducting the test.

Users' opinions

Preece states that checking users' opinions at various stages of the design is essential, and can save a lot of time by ensuring that unusable, unnecessary or unattractive features are avoided. The goal is to find out what the users' attitudes towards the design are. Do they match the designer's idea?

She mentions several methods: structured interviews, flexible interviews and semi-structured interviews. She also notes some variations: prompted interviews (also referred to as paper prototyping (Nielsen, 2003b)) and card sorting.

Structured interviews have a predetermined structure and can be used to derive statistics, a quantitative approach. This type of interview does not allow exploration of individual attitudes, which flexible interviews do allow. Flexible interviews are all about creating a friendly atmosphere in which all the participants' opinions on the design can be discussed. A third interview type is the semi-structured interview, of which prompted interviews are an example. It can be relatively free, but the interviewer may show a design proposal and ask what the participant thinks of it, or ask some questions in a much stricter fashion for easier comparison.

Card sorting is another technique used to find out how people think about certain matters. It is a generative method for determining the user's model of, for example, how information is structured.

Evaluation of finalized projects

In alarm systems, testing and validation of sonified interfaces may be critical (Buxton, 1995). Too often sounds are hard to distinguish on their own or are masked by other environmental sounds. While sound can be a valuable channel of communication, bad design may undo these advantages.

Buxton mentions research by Nielsen and Schaefer (1993)32, an interesting study showing that sound is not always the remedy for usability problems. Non-speech audio in a computer paint program, tested on users between 70 and 75 years old, was found not to contribute anything positive and merely confused some of the users.

Summary and conclusions

The use of sound in interfaces needs serious thought about developing the right model. The goal is to have the user understand this model, or at least the part of the model that he or she needs for the task. Therefore some kind of mapping is needed. Mapping means using artifacts to explain the model, and here the problem of mapping arises. We have concluded that mapping should be directed towards the task the user is trying to pursue. In this way the chance of a conflicting mental model in the user's mind will decrease. This is what went wrong with using metaphors to describe the working of a user's computer: people got wrong assumptions about how the system works, resulting in an inconsistent, and thus difficult to understand, system.

A few approaches were considered. One approach searches for extra-musical meaning in music. Some believe this is impossible; others believe it is possible when the context is known. This is how Brewster and his group think earcons add additional meaning to current interfaces. I believe the latter sounds very plausible and is comparable with the way music is used in films: sound can strengthen the perception.

Another approach is reached through everyday listening. In the real world, objects make sounds when you tap on them, drop them, move them, and so on. Research has been, and is being, done on synthesizing these effects, which could lead to virtual objects that make sounds as they would under natural conditions, following physical laws. This is not to be confused with adding a metaphor: as long as the virtual objects have no relation with any real-world object, it is only about adding a sense of logic to the system, which could, for example, be extended with the use of haptic devices.

But what information should be sonified? When studying the properties of sound (first chapter) and examining related case studies (in the second chapter), context was already noted as a good candidate for sonification, and after closer examination in this chapter this was confirmed. The distinction between focus and context suits both the visual and the auditory platform well. The visual channel is well capable of displaying in high detail, which makes it ideal for displaying focus. The aural channel does not have this resolution, yet its properties, like existing throughout space, make it suit context well.

What followed was a discussion of types of non-speech audio. Earcons and auditory icons, however, are about events, not about atmosphere, which is what I intended to research. There are two approaches to overcoming this difference: many events together make an atmosphere, or the mapping philosophy could be used.

In the real world many events make up the atmosphere. An immense number of factors play a role in the audible atmosphere. Simulating this is very intensive, but may lead to an ideal soundscape. Another approach is a more everyday-listening approach: instead of listening to how events sound, one could listen to the effect rooms have on sounding objects in a space. Great halls have long reverberations; sounds in a separate room may sound muffled.

The next point of concern in this chapter was coherence, the importance of which was already noted in chapter 3. In this chapter, techniques developed by the film and music industries for creating coherence in their auditory landscapes were examined. They could provide a more pleasing soundscape which, thanks to compression, is also less annoying to co-workers or neighbours. Second was striving towards coherence through the use of standardized design tools. Such standardized design tools can surely contribute to the popularity of auditory display and help it become mainstream, yet the suggested methods are currently less applicable to the situation I am striving for.

This chapter ended with a note on evaluation. Making an interface more coherent could result in a loss of signal: the differences between the signals may become less obvious. Regarding my project, I will not be able to test the end product. We have noted, however, that preliminary testing is also of great value. The complication is that the prototype does not involve real-time interaction and, furthermore, comparable software is scarce. What can be done, however, is testing the user's model. It has been suggested that this can be done through card sorting. What can also be done is testing the user's understanding of the representations used. Both are examples of prompted research.

Conclusion and recommendations

Introduction

There is still a lot of room for exploration when it comes to the development of aural interfaces. Some of this exploration has only recently become possible, since a lack of processing power was slowing progress. These technical limitations have largely disappeared, partly due to increasing demands for ever more realistic sounds during game-play, partly due to increasing raw processing power. That is why this research can take place: as an attempt to come up with an alternative approach to the problems that come with a system that gets more complex every day.

The complexity is partly due to the limited use of the channels available to represent information. The real world is multi-modal and has multiple ways of presenting itself, so why not give interface designers access to channels such as sound to prevent a cognitive overload of the visual channel? Some representations are just not suited for visual display, and explaining such models visually would only increase the visual overload.

Understanding of a model can take place unconsciously. When walking through a building you are not constantly drawing a conceptual map of the building, yet after you have walked through the building often enough you are capable of planning routes just as you would without knowing the building but with a real map in hand33.

So here we are now: having basic knowledge of sound, knowing how it has been used, having learned from previous lessons, knowing how people can be pleased, and having the knowledge to design a new representation where such a representation is needed. It is time to come to a recommendation, answering the main question: how should an auditory interface be designed, exploiting the strengths of the auditory channel, to create a better awareness of overall processes?

Design recommendation

Balancing visual and aural presentation

As a new output channel has been introduced, we need to balance the information over the visual and aural channels. This means that information should be redistributed. Currently, visual information is used to display the whole continuum from context to focus, and some auditory feedback is given when the user is to be alerted (when the screen demands focus).

Illustration 5: Focus and context redivided

What has been suggested in previous chapters is the use of the visual channel for presenting focus and the aural channel for presenting context.

I have tried to capture the suggested new balance (with unbroken lines) in Illustration 5: Focus and context redivided. The balance implies that the visual channel should not present context, and that sound should not be used to present focus. The latter is quite different from how sound is used today, where sounds notify users of occurring events. This is also captured in the diagram.

Visual representation currently captures focus very well. Context, however, has never been handled sufficiently where interface designers adhered to a purely visual approach; I think the designs of current operating systems are proof of that. One cause of this is the visual channel's inability to present information that deals with time in an effective way (see the table 'Auditory compared to visual presentation', Deatherage (1972)). Sonification may make presenting this contextual information possible, as the audible channel is much more capable of presenting time-based events.

This intention, however, does not yet give us a decent guideline for which events should be sonified. So the question is: which events should be sonified, in what detail, why and how? The next sub-chapters will give attention to the answers to this question.

Events activities and processes

In the introduction I distinguished four different types of data which can be made perceivable: data about events, data about activities (user invoked), ongoing processes and environmental processes.

The types of sounds associated with these events, activities, ongoing processes and environmental processes are, respectively: short sounds, medium-length sounds, ongoing sounds and effects (affecting all sounds). A table with example events follows.

Type of data, with the associated type of sound and example events:

Data about events (short sounds):

  • Incoming messages (mail, instant messaging, RSS feed)

  • Time-coupled (calendar event, hour change)

  • Error messages

Data about activities, user invoked (medium-length sounds):

  • Clicking (OS 9 defines 4 types of sonified button interaction: press, release, enter, exit)

  • Dragging

  • Scrolling (OS 9 also defines: end of track, and no scrolling)

  • Typing sounds

Ongoing processes (ongoing sounds):

  • File related (copying, moving, downloading)

  • Network traffic (incoming, outgoing)

  • Our world (contacts on IM (online, busy), level of news updates, (forecasted) weather, stock data, pending messages (RSS, mail, IM))

  • Time (clock sounds, calendar)

  • System (processor activity, HDD usage, RAM usage, software updates available)

Environmental processes, the result of different events (effects affecting overall output):

  • System (number of processes, programs running)

  • Time (part of day, including pauses)

  • Workspace

Table 4: Data types, related events and types of sound
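The table could translate into a very simple configuration structure. The sketch below is purely illustrative (the event names and categories are hypothetical, not ContextSonifier's actual configuration), but it shows how a lookup from data type to sound type might be organized:

    # Illustrative only: the names below are hypothetical, not ContextSonifier's
    # actual configuration.
    SOUND_MAPPING = {
        "event":       {"sound": "short",   "examples": ["mail_received", "calendar_alarm", "error"]},
        "activity":    {"sound": "medium",  "examples": ["button_press", "drag", "scroll"]},
        "process":     {"sound": "ongoing", "examples": ["file_copy", "network_traffic", "cpu_load"]},
        "environment": {"sound": "effect",  "examples": ["time_of_day", "number_of_programs"]},
    }

    def sound_type(data_type):
        """Look up which kind of sound a given type of data maps to."""
        return SOUND_MAPPING[data_type]["sound"]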

Event sounds are commonplace in current interfaces, and are often eventually turned off, as many of these sounds are designed to be obtrusive and the user has to pay attention to them. Activity sounds are not really common; however, this feedback can be really valuable, especially for more difficult tasks like drag-and-drop activities (Brewster, 1998). Ongoing sounds can be found in (experimental) data visualization, though some more advanced users claim to listen to the speed of the computer's fan (which may be considered an ongoing sound). Environmental audio, finally, can only be found in games: when a different environment or space is entered, the acoustics change with it.

The first type of information is directed at where the user's focus should be. For the second type, activities, the audible channel may have the function of adding the direct context surrounding the user's focus. The third and fourth types of information are almost always part of the user's context. Of course a user may want to know the progress of, for example, a download, but that is not the main task he or she is working on, and it may as well be monitored unconsciously. I will concentrate on creating an atmosphere of ongoing processes and environmental processes: information that has time as a dimension.

But why add this type of context? I believe that adding such contextual information can improve understanding through awareness and thus overall usability. Currently, users' models are not always consistent (as the test will show). I think that when the underlying model of a system such as a computer is communicated better, through contextualization, understanding will increase.

The mental model

A mental model is a model of how people think something works. This model can be designed to take away the barriers that still exist when it comes to understanding the true electronic model of complicated systems like computers. I have discussed this in greater detail in chapter V. As sound is not commonplace in current interfaces, there is still much room for the development of new models which allow for a new type of representation. I believe the following constraints are important:

  1. It should be coherent

  2. It should be simple (sound is weak at delivering complex concepts)

  3. It should match the physical appearance of a computer on the desk (not interfering with existing models or reality)

  4. It should match the reality of how things work (it should not interfere with the technical reality)

I will now discuss all four constraints in greater detail.

Coherent

When designing the parameters that should add to the model, I think coherence is important. Furthermore, by making sure that the model does not conflict with, in particular, the technical reality, the model should not break (in the sense of leading to conflicts) in the future.

Simple

I have noted in the introduction that though sound is able to represent complex data it should not be used to communicate complex concepts, or models.

Match current presentation

Using metaphors

Because I also wanted the system to be simple to understand, I looked, among other things, at the common language used to explain phenomena, often through metaphors (Lakoff et al., 1980). The Internet is thought of as the web or the highway, or is displayed as a cloud. The computer's CPU is the driving force behind all the 'magic', like a motor, etc. While being fully aware of the problems with using metaphors for explaining the interface, I used established metaphors to shape the model. In this way, the metaphors already settled in everyday explanation can easily explain the behaviour of the new model.

Match technical reality

Furthermore, it should be true to the techniques used. It should not cover all technical details, but it should make people more aware of how the computer works. The right feedback will make them understand the processes better and will enable them to adjust their way of working. When a heavy program starts, the user should hear or feel the computer's need for processor power to accomplish the task of starting that program, just as the driver of a car feels when the car has trouble driving through mud. When users copy something to the clipboard (and I refer here to the discussion of the copy-paste sequence in chapter V), they should be aware that something stays stored, somewhere.

Matching the physical world

I believe positioning in space is important for understanding: a location which, where possible, matches reality and/or the established metaphors when these refer to a location. Placing concepts at a fixed location in space should make them less vague. It should be possible to refer to objects in terms such as: what does the engine sound like? Does it sound hollow, or not? Do you hear people around you? Is there still stuff moving upwards?

Introduction of the basic model

Illustration 6: Visualization of the suggested model

The model I have developed, matching the constraints mentioned, consists of four spaces: the space above the user, the space below the user (the foundation), the room the user is in, and the area outside this room where other people are. I have tried to visualize the model in Illustration 6: Visualization of the suggested model. To clarify it further, some examples follow:

The Internet or network: a floating cloud above us (I will not refer to it as a cloud, it may as well be a hovering highway).

Files can be downloaded or uploaded to this network (Note the coincidence).

Co-workers are not part of this Internet. They are on the same level as the user, yet they do not share the same space, or room. Therefore they should not be as prominent as things that happen in the same room.

The user's room (the computer space) is powered by a motor (the CPU), which forms the foundation of the working of the machine. Within the user's personal space, or room, local processes take place: copying and moving files, a clock, a personal agenda, etc.

There is an emphasis on spatial location. This is deliberate. Sound exists in space, not just in the small gap between two stereo speakers, which is often only as wide as the monitor. I believe that, in combination with psycho-acoustic effects, a sound field surrounding the user can be created, in which at least moving and approaching objects are perceivable as separate entities.
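As an illustration of this emphasis on spatial location, the four spaces of the model could be expressed as rough listener-relative positions. The values below are only indicative and do not come from ContextSonifier:

    # The four spaces as rough listener-relative positions (azimuth and elevation
    # in degrees). The exact values are only indicative, not taken from ContextSonifier.
    MODEL_POSITIONS = {
        "network":    {"elevation":  60, "azimuth":   0},   # the 'cloud' above the user
        "foundation": {"elevation": -60, "azimuth":   0},   # the CPU 'motor' below the user
        "room":       {"elevation":   0, "azimuth":  45},   # local processes around the user
        "colleagues": {"elevation":   0, "azimuth": 135},   # other people, outside the room
    }

    def position_of(source):
        """Return the nominal position for a sound source in the model."""
        return MODEL_POSITIONS[source]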

A more definite model can be found in the appendix 'From model to sound'.

Review

Weaknesses

First of all, I cannot guarantee that contextualizing information as described above really contributes to overall usability. I am not aware of any similar approaches that use sound to address the problem of a lacking overview, trying to tackle the representation of such fundamental processes in interface design. I believe the sound channel is so different from the visual channel that any comparison would be a dead end. I have argued, however, that adding contextual information may improve understanding and thus overall usability.

Another argument may be that spatiality does not work for interfaces; at least it does not work, in my opinion, for 2D visual displays. I believe, however, that the aural approach is considerably different from the visual approach. Furthermore, there is no attempt to navigate through this space; the third dimension is only added for informational purposes.

A more practical problem is 3D audio. HRTFs, as pointed out in chapter 3, are just simulations of a model head processing sound in such a manner that it sounds spatial to our binaural hearing system. Therefore one should not rely too heavily on exact positioning. Relative positioning, however, can be used, in terms of approaching and leaving. Testing will provide more information on the reliability of the HRTFs.

Strengths

The very simple model developed makes the basic concepts of the system clearer to all types of users. It may create an awareness of what is actually happening. Though the information in the model could be mapped visually, that would have added a lot of additional visual information, which would only add to the current load of information displayed visually. The audible channel suits this type of information much better. First, the model is simple in concept, a precondition already noted by Deatherage (see the introduction). Second, it does not need thorough understanding, as it deals with context. Yet this context may, after a phase of learning, lead to a much better understanding of what is happening. A mental model can be created that can be consistently applied to all computers, because the model is consistent with current interface design models, with the common models described in end-user documentation and with technical reality.

Testing

In chapter V I gave an explanation of the testing methods available. We already stated that testing of the actual product, ContextSonifier, cannot be part of this thesis as the product is still in an unstable development phase. What I have tried is to get answers to some of the unanswered problems stated in previous paragraphs through a type of prompted research. Due to time constraints, this prompted research had a structured and rigid (multiple-choice) form. I am aware of the problems these structured interviews have, yet I believe that the group tested was large enough (59 people) to draw relevant conclusions.

The test consisted of two parts. The first part was an experimental attempt to get an idea of users' current models through a card-sorting type of approach. In the second part the qualities of the 3D engine used for the project were tested (in a more 'paper'-prototyping type of manner). Most important for this thesis is the first part of the test. The primary goal was to test whether my ideas of how people work with and look at the computer were correct.

The data from the test will not steer the sound design itself, yet it will be considered valuable input. This input will not be used to make objects sound like a literal representation; it could, however, contribute to the design of the virtual audible objects. Using this approach I will not make any references to real-world objects from which users might reason about the system's model. Of course the system may be explained through metaphors, but that is a whole different story.

I asked participants to compare system-related matters to more everyday, understandable concepts. This was an attempt to get an idea of how they looked at computer-related phenomena. The results will be used to compare the model with the current user's model. The aims of this test are to discover gaps in the model and to find out where the model really needs to be introduced to the end-user. The test was split into four categories: My Computer, Files and programs, The Internet and Time.

The test results were divided over beginning, medium and advanced users. Where experience seemed to matter, medium users were true to their type and showed results balancing between those of beginning and advanced users.

My computer

Most people seem to describe their computer as an advanced machine, though more experienced users tend to refer to it as a collection of circuit boards. The CPU is considered the brain by most inexperienced users, while more advanced users like to see it as a calculator. RAM is the notebook of this brain.

I had hoped that the CPU would be referred to as a working motor and RAM as fresh air (the respondents mostly answered brain and notebook respectively). Maybe a motor sounds less sophisticated, while a CPU is considered really sophisticated, and it is not possible to save something in fresh air? Maybe a motor is too heavy and clumsy for most users, and a more subtle sound should be used to indicate the CPU.

RAM should still most resemble the notion of fresh air, or open space. This belief is backed up when I look at what programs are (most respondents referred to them as physical objects, like books or windows, as we will see later). These are, reasoning through my model, placed in a space filled with air. The sounds these books or windows make should, however, be more electrical, metallic or mechanical than paper-like or glassy.

Most users consider their monitor mostly as just an output device, though beginning users more often refer to it as a television.

Materials

The material a computer is made of is electricity or metal, according to almost all respondents. The same goes for the CPU and the RAM memory (though the answers 'plastic' and 'signs' weren't unpopular for RAM either).

Files and programs

Programs are books according to many beginning users, or windows according to more advanced users. Files are collections of bytes according to most of the respondents; probably files are rather abstract to people. The second most common answer was paper. Both programs and files are made of signs and electricity.

That files were often seen as a collection of bytes is also striking. I would have thought that more users would have given a more metaphorical answer like 'piece of paper'. Yet I can also agree that files are rather abstract. There may not be any real physical object to which we can really compare files. Files can be anything: text documents (which, yes, could be made of metaphorical paper), but also movies or pieces of music. How files should sound is another question, still unanswered because of the rather vague reference to a collection of bytes. Because how should bytes sound? Maybe a bit electric? I think I have found an object here that can best be sonified using abstract earcons.

Actions

Copying is an action of reading and writing, or stamping, though I have to remark that these answers don't really stand out. On moving there was more agreement; it was cut and paste for the beginners, and a take-and-drop action for the more advanced users.

Both actions refer not to moving but to jumping. When copying takes place, the copy is built up elsewhere. When moving takes place, it is simultaneously built up and broken down. This is also more analogous to how files are handled on the hard drive (though move actions are often just an adjustment of the master table).

The Internet

Internet pages are more often seen as pieces of paper by beginning users, though they are actually also files. Maybe this is because they often only contain text and images and their design still resembles print design a lot. Advanced users, however, often see them as junctions in the web or highway. This is not the intended view, however; I think this is not how these phenomena should be analysed.

The Internet itself is seen as the web by beginning users, and as a library by more advanced users, located around them (most respondents) or at their computer (beginning users). I wonder whether beginning users find it hard to find good information on the Internet. Personally, I would have imagined it as a cloud hovering above us... but it may as well be a hovering library, with loose papers connected by wires. It does not harm the model how it is imagined; only cloudy sounds (whatever those may be ;)) may not be the most appropriate.

Actions

Downloading is seen as copying from the Internet or as receiving. Both seem logical and are technically correct. These downloads come from above and land in most of the participants' imaginary computers. Uploading is often seen as handing out or, again, as copying (answered mainly by advanced users), yet the movement of uploads is often described as moving stuff to one's own computer (which is a misconception). Maybe the exact meanings of the words are confused by file-sharing software, where other users are uploading while the user is downloading something. This is an example where a program like ContextSonifier could increase understanding and thus usability.

E-mails are made of signs and electricity, just like files (which is technically correct). When arriving, e-mails drop down into the in-box.

Friends and contacts

Instant-messaging messages are made of electricity and signs. What they actually are cannot be made out from the answers. Friends and colleagues who may send such messages stand at equal height in reputation, mostly physically surrounding the respondents. This is also how they are often placed in their imagination. This corresponds with my model.

Time

Advanced users imagine time flying, while beginning users imagine time going by. Maybe this difference in speed reflects the fact that advanced computer users are more stressed?

On the matter of relaxing, most respondents found babbling (chatting) most typically relaxing, followed by drinking coffee or tea and watching TV.

Finally, the use of a calendar. I thought the primary goal was to use it as a reminder of appointments not to forget. Apparently many people use it primarily as their planner! This will probably not affect the design of ContextSonifier, however, as planning is probably a task of focus; a task that can much better be executed in the visual domain.

Summary and conclusions

This thesis has been the theoretical backing of a program called ContextSonifier, which will hopefully be finished just after the appearance of this thesis. This chapter's goal was to outline the design preconditions: what should it do, and how? There were several questions to be answered: What information should be sonified in real situations? Why should this information be sonified? How could we sonify it? And would such sonification correspond with the average user?

The next question was: what is context? I used the division already suggested in the introduction to get an idea of what the context is that is important to sonify. I concluded that especially ongoing processes and environmental processes are part of the user's context. I believe this distinction is even stricter than the distinction between focus and context. Choosing the right channel for display is much easier with this stricter distinction, and the question of what context actually is doesn't have to be settled. Feedback during navigation, for example, could also be considered context, as it is information which is not essential but is helpful. Yet I have dismissed this type of feedback in this research because it is not a process with time as a dimension.

Contribution

What I believe to have contributed is the development of a new model that can be interactively perceived as an (indirect) result of normal, currently available operations, and the notion that mastering processes can also be used to aesthetically improve auditory interfaces.

The model I developed currently faces no real objections other than the fact that it is purely based on reasoning. The model is summarized in Illustration 6 (Visualization of the suggested model). By putting some of my underlying assumptions to the test I was able to fine-tune the model, especially how it should sound. I needed to know how others perceived the digital reality. Using this data I will be able to construct a better recognizable new 'reality', in a similar way to how Gaver applied his Everyday Listening technique (for more information see Buxton et al., 1994). The definitive results can be found in the appendix.

The second contribution lies in the search for coherence. Using mastering techniques developed for audio post-production in more traditional media, coherence of sound design can also be pursued in interactive situations (though real fine-tuning of filters and compressors may not be possible). Coherence leads to a more aesthetically pleasing interface, which contributes to satisfaction; next to efficiency and effectiveness one of the three pillars of what is called usability.
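
As a rough sketch of what such a shared mastering chain could look like in code (Python with numpy; the one-pole low-pass filter and the crude compressor below are simplifications of mine, not production-quality audio processing), every interface sound would pass through the same two stages before playback:

    import numpy as np

    def one_pole_lowpass(samples, alpha=0.2):
        """Very simple one-pole low-pass filter; a higher alpha lets more highs through."""
        out = np.empty_like(samples, dtype=float)
        prev = 0.0
        for i, x in enumerate(samples):
            prev = prev + alpha * (x - prev)
            out[i] = prev
        return out

    def compress(samples, threshold=0.5, ratio=4.0):
        """Reduce peaks above the threshold by the given ratio (a crude compressor)."""
        out = samples.copy()
        over = np.abs(out) > threshold
        out[over] = np.sign(out[over]) * (threshold + (np.abs(out[over]) - threshold) / ratio)
        return out

    def master(samples):
        """Shared 'mastering' chain applied to every interface sound for coherence."""
        return compress(one_pole_lowpass(samples))

Because all sounds share the same chain, they inherit a common tonal colour and dynamic range, which is exactly what the coherence argument asks for.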

Striving for a coherent interface may also tackle one of the biggest objections to sonically enhanced interfaces, namely annoyance. Annoyance will be greatly reduced if auditory interface designers adhere to the design consequences summarized in the summary of chapter IV. Furthermore, the soundscape should be adjustable to meet the demands of the environment. For example, at an office, co-workers don't need to be sonified (unless all workers are in separate cubicles).
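
A simple way to make the soundscape adjustable is a per-environment profile. The sketch below (Python; the profile names and channels are hypothetical examples, not ContextSonifier's actual configuration) illustrates the idea:

    # Hypothetical per-environment configuration: which contextual channels to sonify.
    SOUNDSCAPE_PROFILES = {
        "home":        {"co_workers": True,  "email": True, "weather": True},
        "open_office": {"co_workers": False, "email": True, "weather": False},  # colleagues already audible
        "cubicles":    {"co_workers": True,  "email": True, "weather": False},
    }

    def channels_for(environment):
        """Return the set of channels that should be sonified in this environment."""
        profile = SOUNDSCAPE_PROFILES.get(environment, SOUNDSCAPE_PROFILES["home"])
        return {name for name, enabled in profile.items() if enabled}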

I believe that sonified contextual processes can contribute to an improved understanding of complex systems.

Further research

This thesis aimed to address problems related to the increasing complexity of systems and of our lives with them. Future research on this program may result in valuable data for further improvement of the use of the aural channel.

Our distinction between events, activities and processes may also open the way to other channels of computer-to-human communication, such as haptic devices like force-feedback mice. I believe such devices are best suited to present the second group of computer-to-human communication, yet further research should find out.

Bibliography

Albers, 1996: Albers, M. C., Auditory Cues for Browsing, Surfing, and Navigating, 1996, http://icad.org/websiteV2.0/Conferences/ICAD96 (10-05-2004)

Barber, 1998: Barber, D.A., Audio Feedback in the Computer Software Interface, 1998, http://hexidecibel.org/resources/ ()

Bregman, 1990: Bregman, A.S., Auditory Scene Analysis, 1990

Brewster, 1996: Brewster, S.A., The design of a sonically-enhanced interface toolkit, 1996

Brewster, 1998: Brewster, S.A., Sonically-enhanced drag and drop, 1998

Brewster, 2003: Brewster, S.A, Non-speech auditory output, 2003

Buxton et al., 1991: Buxton, W., Gaver, W. & Bly, S., The Use of Non-Speech Audio at the Interface, 1991

Buxton et al., 1994: Buxton, W., Gaver, W. & Bly, S., Auditory Interfaces: The Use of Non-Speech Audio at the Interface, 1994, http://www.billbuxton.com/Audio.TOC.html (14-04-2004)

Buxton, 1995: Buxton, W., Speech, Language & Audition, 1995, http://www.billbuxton.com/MKaudio.html (4-04-2004)

Cohen, 1993: Cohen, J., "Kirk here:" Using Genre Sounds To Monitor Background Activity, 1993

Edwards et al., 1993: Edwards, A., Edwards A., Mynatt E., Enabling Technology for Users with Special Needs, 1993

Emmerson, 1986: Emmerson, S. (editor), The Language of Electroacoustic Music, 1986

Fussel et al., 1998: Fussel, S., Kraut, R., Lerch, F., Scherlis, W., McNally, M., Cadiz, J., Coordination, overload and team performance, 1998

Gaver et al., 1990: Gaver, W., Smith, R., Auditory icons in large-scale collaborative environments, 1990

Gaver et al., 1991: Gaver, W.W., Smith, R.B., O'Shea, T., Effective sounds in complex systems: the ARKOLA simulation, 1991

Gaver, 1989: Gaver, W., The SonicFinder: An interface that uses auditory icons, 1989

Gaver, 1993: Gaver, W.W., Synthesizing Auditory Icons, 1993

Gaver, 1995: Gaver, W.W., Oh what a tangled web we weave: metaphor and mapping in graphical interface, 1995

Hartmann, 1998: Hartmann, W. M., Signals, sound and sensation, 1998 (rev. 2000)

Kirsh, 2000: Kirsh, D., A Few Thoughts on Cognitive Overload, 2000

Lakoff et al., 1980: Lakoff, G., Johnson, M., Metaphors We Live By, 1980, 2003

Leplâtre et al., 1999: Leplâtre, G., Brewster, S.A., Perspectives on the Design of Musical Auditory Interfaces, 1999

Leplâtre et al., 2004: Leplâtre, G., McGregor, I., How to Tackle Auditory Interface Aesthetics? Discussion and Case Study, 2004

Lindgaard et al., 2003: Lindgaard, G., Dudek, C., What is this evasive beast we call user satisfaction?, 2003

Ludwig et al., 1990: Ludwig, L. F., Pincever, N., Cohen, M., Extending the Notion of a Window System to Audio, 1990

McGookin et al., 2002: McGookin, D.K., Brewster, S.A., Dolphin: The design and initial evaluation of multimodal focus and context, 2002

Meyer, 1956: Meyer, L.B., Emotion and meaning in music, 1956

Miedema et al, 2003: Miedema, H. M. E., Vos, H., Noise sensitivity and reactions to noise and other environmental conditions, 2003

Miedema, 2001: Miedema, H.M.E., Noise & Health: How does noise affect us?, 2001

Mills, 2003: Anderson, M., Acoustics for Musicians and Recording Engineers (slides), 2003, http://www.ece.utexas.edu/~nodog/main/main.html (09-06-2004)

Nicol et al., 2004: Nicol, C., Brewster, S.A. and Gray, P.D., A system for manipulating auditory interfaces using timbre spaces, 2004

Nielsen, 2003: Nielsen, J., Usability 101: Introduction to Usability, 2003, http://www.useit.com/alertbox/20030825.html (02-08-2004)

Nielsen, 2003b: Nielsen, J., Paper Prototyping: Getting User Data Before You Code, 2003, http://www.useit.com/alertbox/20030414.html (02-08-2004)

Norman, D. 1998: Norman, D., The Invisible Computer, 1998

Passchier-Vermeer et al., 2001: Passchier-Vermeer, W., Kluizenaar, Y. de, Steenbekkers, J.H.M., et al., Milieu en gezondheid 2001: Overzicht van risico's, doelen en beleid, 2001

Payne, 2003: Payne, S.J., Users' Mental Models, 2003

Plomp, 1998: Plomp, R., Hoe wij horen, 1998

Preece et al., 1994: Preece J., Rogers Y., Sharp H., Benyon D., Holland S., Carey C., Human-Computer Interaction, 1994

Raad, 1998: Raad, C., Het grote opnameboek, 1998

Smeijsters, 1987: Smeijsters, H., Muziek en psyche: Thema's met variaties uit de Muziekpsychologie, 1987

Walker et al., 2000: Walker, B. N., Kramer, G., Lane, D. M., Psychophysical scaling of sonification mappings, 2000

Warren, 1999: Warren, R. M., Auditory Perception, 1999

Welie et al., 1999: Welie, M. van, Veer, G. C. van der, Eliëns, A., Breaking down Usability, 1999

WikipediaJND: Community, Just noticeable difference, 2004, en.wikipedia.org/wiki/Just_noticeable_difference (17-05-2004)

Appendices

From model to sound

Object: File

  Process: Copy
  Model: One location sends duplicates of itself to another location within a person's personal space (PPS), where it is built up.
  Sound: Gaver used the sound of a filling bottle. To mark the copying of a file, two sources are heard: one emptying source (which does not actually seem to empty when copying) and one filling source. My idea is that the spectral properties of filling are right to suggest a process, but that earcon-style sounds may be used to indicate that it is a file.

  Process: Download
  Model: A location outside the PPS sends parts of itself to another location within the PPS, where it is built up.
  Sound: Similar to copy, yet the process sound is muffled (muffling is the standard edit on sounds from outside the PPS).

  Process: Upload
  Model: Reverse of download.
  Sound: Reverse of download.

  Process: Move
  Model: One location sends itself to another location within a person's personal space (PPS), where it is built up. Moving is like pushing something.
  Sound: The sound should move within the PPS to another place, with the spectral variation of Gaver's bottle.

Object: Network traffic

  Process: Download
  Model: Data moves upwards from the PPS to outside.
  Sound: The data sound moves up and becomes less clear; more data means a broader stream.

  Process: Upload
  Model: Reverse of download.
  Sound: The data sound moves down and becomes clearer; more data means a broader stream.

Object: Our world

  Process: Co-workers
  Model: More people outside the PPS.
  Sound: More (muffled) sounds of working or babbling people (according to their status).

  Process: New news
  Model: News awaits just outside the PPS.
  Sound: Telex sounds, muffled.

  Process: Awaiting e-mails
  Model: E-mails await just outside the PPS.
  Sound: A pile of papers being counted? Muffled, of course.

  Process: Stock data
  Model: Outside the PPS stocks go up and down.
  Sound: A continuing sound with the acoustics of money moves up/down in frequency, muffled.

  Process: Weather
  Model: Outside the PPS it rains, or...
  Sound: Weather sounds, muffled.

Object: Time

  Process: Clock
  Model: Clock inside the PPS.
  Sound: Ticking, and cuckooing every hour.

  Process: Calendar
  Model: Events appear inside the PPS.
  Sound: Associated sounds fade in, from muffled and soft to clear, well audible sounds.

  Process: Part of day
  Model: The PPS changes during the day.
  Sound: During pauses longer reverberation, smoothing the complete soundscape.

Object: System

  Process: Processor activity
  Model: The 'motor' (users often referred to it as a brain) works harder or less hard.
  Sound: A subtle (still 'engine'-like?) sound that works harder or less hard, muffled.

  Process: RAM usage
  Model: Fresh air in the room.
  Sound: Slight EQ-ing: more highs when more RAM is available, fewer when less is available.

  Process: Nr. of processes running
  Model: The room gets filled with objects.
  Sound: Less reverberation as the space gets fuller (more running processes).
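
To make the System rows of this table concrete, here is a minimal sketch, assuming Python with the psutil library for reading system metrics; the parameter names (engine_volume, high_shelf_gain_db, reverb_amount) and the scaling factors are hypothetical choices for illustration, not part of ContextSonifier:

    # Sketch: map system metrics to sound parameters following the table above.
    import psutil

    def system_sound_parameters():
        cpu = psutil.cpu_percent(interval=0.5) / 100.0               # 0.0 .. 1.0
        mem = psutil.virtual_memory()
        ram_free = mem.available / mem.total                          # 0.0 .. 1.0
        n_processes = len(psutil.pids())

        return {
            # Processor activity: the subtle 'engine' works harder or less hard.
            "engine_volume": 0.1 + 0.4 * cpu,
            "engine_playback_rate": 0.8 + 0.4 * cpu,
            # RAM usage: more highs ('fresh air') when more RAM is available.
            "high_shelf_gain_db": -6.0 + 12.0 * ram_free,
            # Number of processes: less reverberation as the room fills up.
            "reverb_amount": max(0.1, 1.0 - n_processes / 400.0),
        }

    if __name__ == "__main__":
        print(system_sound_parameters())

An audio engine polling this function a few times per second could then adjust the 'engine' loop, the high-shelf EQ and the room reverberation accordingly.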

Glossary

Absolute threshold
The minimum detectable level of a sound in the absence of any other external sounds.
Amplitude
The difference between the maximum displacement of a wave and the point of no displacement, or the null point. The common symbol for amplitude is a.
Audiogram
A graph showing absolute threshold for pure tones as a function of frequency. It is usually plotted as hearing loss in dB as a function of frequency, with increasing loss plotted in the downward direction
Auditory Icons
everyday sounds mapped to computer events by analogy with everyday sound producing events.
Attack
The first part of the sound of a note. In a synthesizer envelope, the attack segment is the segment during which the envelope rises from its initial value (usually zero) to the attack level (often the maximum level for the envelope) at a rate determined by the attack time parameter.
Bandwidth
A measure of the information capacity in the frequency domain. The greater the bandwidth, the more information it can carry.
Binaural
A situation involving listening with two ears
Mental Model
a term that describes how systems are understood by different people, primarily consisting of (i) the way users conceptualize and understand the system and (ii) the way designers conceptualize and view the system
Context
The rest of the information to be displayed. In order to allow all of the required information to be displayed this information is displayed in much less detail than the focus.
Decibel (dB)
Unit, one-tenth of a Bel, for logarithmic ratios (to the base of 10) of intensity, power, pressure, etc. Its reference base must always be given; e.g., re 0.0002 dynes/cm², or referred to in some manner as SPL (sound pressure level), SL (sensation level), etc. A decibel is approximately equal to a just noticeable difference (JND) in constant changes of intensity.
Diotic
A situation in which the sounds reaching the two ears are the same
Earcons
brief structured musical motifs conveying information, first introduced by Blattner
Envelope
The envelope of any function is the smooth curve passing through the peaks of the function.
Focus
That part of the information space that is of most interest to the user. This part is presented in maximum detail.
Filter
A device which modifies the frequency spectrum of a signal, usually while it is in electrical form
Frequency (f, Hertz (Hz))
Measure for the number of events or cycles that occur in a time period, usually one second. Frequency is measured in Hertz, which are the number of cycles per second. (e.g., Humans can experience sound from 20 Hz to over 20,000 Hz.).
Frequency threshold curve
See Tuning curve
Fundamental frequency
The fundamental frequency of a periodic sound is the frequency of that sinusoidal component of the sound that has the same period as the periodic sound
Harmonic
a component of a complex tone whose frequency is an integral multiple of the fundamental frequency of the complex
Hertz (Hz)
Unit used for cycles per second - named in honor of the German physicist, Heinrich Hertz, 1857-1894.
Infratones
Periodic sounds; their corresponding sensory attribute is called infrapitch (Warren, 1999).
Intensity (I, watts per square meter (W/m²))
Measure of the sound power transmitted through a given area in a sound field. The term is also used as a generic name for any quantity relating to amount of sound, such as power or energy, although this is not technically correct.
Intrasound relationships
These are relationships established among the parameters of individual sounds themselves. For example, a message or datum may be encoded by establishing a particular relationship among the pitch, timbre and duration of the sound.
Intersound relationships
These are relationships that are established between and among sounds. It is through the pattern of sounds that meaning is conveyed. A simple example would be assigning specific meaning to a particular motif.
Inverse Square Law
In a free field, sound intensity is inversely proportional to the square of the distance from the source.
Iterance
General term encompassing the perceptual attributes of both pitch and infrapitch.
Just noticeable difference (JND)
Also known as difference limen (DL) or differential threshold. The smallest change in frequency or intensity that can be recognized. As the smallest recognizable change in intensity, it approximates the decibel. Note that JND is generally measured in ideal listening circumstances. In practice, therefore, the effective JND is even larger.
Level (L, decibels (dB))
Measure. The level of a sound is specified in dB in relation to some reference level. See Sensation level and Sound pressure level.
Linear
A linear system is a system which satisfies the conditions of superposition and homogeneity.
Loudness (L, sones)
Measure for the subjective sensation of the effect of amplitude or intensity. It is determined partly by the number of auditory nerve fibres activated by the sound wave and partly by the number of impulses carried by each fibre. The unit of measurement of subjective loudness is the "sone". S. S. Stevens (1972) came up with the following formula for loudness after an extensive review of previous research: L = k·I^0.33.
Loudness Level (LL, phon)
Measure. The loudness level of a sound is determined by comparison for equal loudness with a 1000 Hz tone re 0.0002 dyne/cm² when heard binaurally in a sound field. The unit of loudness level is known as the "phon". It is numerically equal to the SPL of the 1000 Hz tone, but varies with frequency.
Masking
Masking is the amount (or the process) by which the threshold of audibility for one sound is raised by the presence of another (masking) sound.
Microvariations
Microvariations such as tremolo and vibrato make sounds stand out; sounds from the same source should have a certain coherence in these.
Modulation
Refers to a periodic change in a particular dimension of a stimulus.
Monaural
Situation in which sounds are presented to one ear only.
Noise
Noise in general refers to any unwanted sound. White noise is a sound whose power per unit bandwidth is constant, on average, over the range of audible frequencies. It usually has a normal (Gaussian) distribution of instantaneous amplitudes.
Octave
An octave is the interval between two tones when their frequencies are in the ratio 2:1.
Partial
A partial is any sinusoidal frequency component in a complex tone. It may or may not be a harmonic.
Pascal (Pa)
Unit of pressure; 1 Pa = 1 N/m².
Period (T, seconds(s))
The period of a periodic function is the smallest time interval over which the function repeats itself.
Phase (φ, radians or degrees)
The phase of a periodic waveform is the fractional part of a period through which the waveform has advanced, measured from some arbitrary point in time.
Phon
Unit of loudness level.
Pitch
A psychoacoustic phenomenon that is closely related to but not synonymous with frequency. Pitch is the subjective property that lets us compare whether one sound seems "higher" or "lower" than another. The pitch of a sound can be ambiguous or ill-defined. ANSI recommends that the pitch of any sound be described in terms of the frequency of a sinusoidal tone judged to have the same pitch.
Pitch, Absolute
The ability to identify unerringly the fundamental frequency of a tone that is heard.
Power (P, Watts (W))
Measure. Energy (Joules) per second
Pure tone
A sound wave whose instantaneous pressure variation as a function of time is a sinusoidal function, also called a simple tone
Sensation level (L, decibels (dB))
This is the level of a sound in decibels relative to the threshold level for that sound for the individual listener.
Simple tone
see pure tone.
Sine wave, Sinusoidal vibration
A waveform whose variation as a function of time is a sine function. This is the function relating the sine of an angle to the size of the angle.
Sone (sone)
Unit of loudness. A 1000 Hz tone at 40 dB corresponds to 1 sone.
Sonification
The use of non-speech audio to convey information; more specifically sonification is the transformation of data relations into perceived relations in an acoustic signal for the purposes of facilitating communication or interpretation
Sound
The stimulation of the auditory mechanism by air waves or vibrations, or the sensation resulting from it.
Sound pressure level
This is the level of a sound in decibels relative to an internationally defined reference level. The latter corresponds to an intensity of 10^-12 W/m², which is equivalent to a sound pressure of 20 µPa.
Spectrum
The spectrum of a sound wave is the distribution in frequency of the magnitudes (and sometimes the phases) of the components of the wave.
Timbre
that attribute of auditory sensation in terms of which a listener can judge that two sounds, similarly presented and having the same loudness and pitch, are dissimilar. Put more simply, it relates to the quality of a sound.
Tone
1. A tone is a sound wave capable of evoking an auditory sensation having pitch. 2. A sound sensation having pitch.
Uncomfortable loudness or uncomfortable loudness level
UCL or ULL. The intensity level at which a tone or sound subjectively becomes uncomfortably loud.
Waveform
a term used to describe the form or shape of a wave. It may be represented graphically by plotting instantaneous amplitude or pressure as a function of time.

Footnotes

1 Deatherage, B. H. (1972). Auditory and Other Sensory Forms of Information Presentation. In H. P. Van Cott & R. G. Kinkade (Eds), Human Engineering Guide to Equipment Design (Revised Edition).

2 This is a reference to Don Norman's comparison of a computer with a 'Swiss Army Knife' (Norman, D. 1998)

3 This is an updated version of the similarly named tutorial which appeared at the CHI of 1991, (Buxton et al., 1991)

4 See for the definition of harmonic and more definitions the glossary.

5 Terhardt, E. (1978). Psychoacoustic evaluation of musical sounds. Perception & Psychophysics, 23(1), 483-492.

6 You might want to visit this site for more information: http://www.zainea.com/timbre.htm

7 For more information there is the wikipedia entry of 'Just Noticeable Difference' (WikipediaJND)

8 Shepard, R. N., Circularity in judgements of relative pitch. Journal of the Acoustical Society of America, 36, 2346-2353 (1964)

9 Van Noorden, L. P. A. S. (1975), Temporal coherence in the perception of tone sequences, Ph.D. thesis, Eindhoven University of Technology.

10 A similarly named project is also available for windows, yet it has only log-in sonification: http://www.frivol.net/wololo/sharemon/sharemon.html

11 W. Gaver, Auditory Interfaces, in Handbook of Human-Computer Interaction, M. G. Helander, T. K. Landauer, and P. Prabhu, Eds. Amsterdam: Elsevier Science, 1997

12 W. Gaver and G. Mandler, Play it again, Sam: on liking music, Cognition and Emotion, vol. 1, pp. 259-282, 1987

13 ISO (1996), ISO 9241-10, Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs): Dialogue Principles

14 WordNet: http://wordnet.princeton.edu/

15 Langdon, F. J. (1976), Noise nuisance caused by road traffic in residential areas: part III J. Sound Vib. 49, 241-256.

16 Zimmer, K., and Ellermeier, W. (1998), Konstruktion und Evaluation eines Fragebogens zur Erfassung der individuellen Lärmempfindlichkeit, Diagnostica 44(1), 1120.

17 Mentioned in (Lindgaard et al., 2003): Lindgaard, G. and Whitfield, A., 2001. Proceedings, Affective Human Factors Design, Singapore, pp. 373-378.

18 The author would like to mention that this scepticism may not be in place, as facial appreciation, for example, does seem to relate to certain regularities (though I'm not able to recall any specific research), or order if you will.

19 Edworthy, J., Loxley, S., & Dennis, I. (1991). Improving auditory warning design: Relationships between warning sound parameters and perceived urgency. Human Factors, 33(2), 205-231.

20 Mynatt, E. D., Back, M., Want, R., Baer, M., and Ellis, Jason, B. (1998). Designing audio aura. In CHI'98. ACM.

21 Gundlach, R.H., Factors determining the characterization of musical phrases. In: American Journal of Psychology 47 (1935), 624-643; Rösing, H. (Hrsg.), Rezeptionsforschung in der Musikwissenschaft, Darmstadt, 1983

22 William W. Gaver, George Mandler. Play it again sam: On liking music. Cognition & Emotion, 1(3):299-322, 1987.

23 Gaver. W., Auditory Interfaces, in Handbook of human-Computer Interaction, M. G. Helander, T. K. Landauer, and P. Prabhu, Eds. Amsterdam: Elsevier Science, 1997.

24 I have not been able to check this research; the bibliography entries do not lead to it.

25 Thorndyke, P.W., Hayes-Roth, B. (1982). Differences in spatial knowledge acquired from maps and navigation. Cognitive Psychology, 14, 560-589

26 Gibson, J. J. (1979). The ecological approach to visual perception . New York: Houghton Mifflin.

27 R. Spence and M. D. Apperley, 'Database Navigation: An office environment for the professional', Behaviour and Information Technology, vol. 1, pp. 43-54, 1982.

28 G. W. Furnas, 'Generalized Fisheye Views', presented at CHI'86, Boston, MA, 1986, pp. 16-23.

29 Kramer, G., An introduction to auditory display. In: Auditory Display, Gregory Kramer (Ed.), Addison-Wesley, 1994

30 To my knowledge, commercially: the nVidia nForce audio drivers (on-board sound)

31 Grey, J. (1977). Timbre discrimination in musical patterns. Journal of the Acoustical Society of America, 64:467-72

32 Nielsen, J. & Schaefer, L. (1993). Sound Effects as an Interface Element for Older Users, Behaviour & Information Technology, 12(4), 208-215.

33 This is a reference to the research by Thorndyke and Hayes-Roth (see Mental Models in chapter V)
