David Huron
Music Perception, Vol. 19, No. 1 (2001) pp. 1-64.
The traditional rules of voice-leading in Western music are explained using experimentally established perceptual principles. Six core principles are shown to account for the majority of voice-leading rules given in historical and contemporary music theory tracts. These principles are treated in a manner akin to axioms in a formal system from which the traditional rules of voice-leading are derived. Non-traditional rules arising from the derivation are shown to predict formerly unnoticed aspects of voice-leading practice.
In addition to the core perceptual principles, several auxiliary principles are described. These auxiliary principles are occasionally linked to voice-leading practice and may be regarded as compositional "options" that shape the music-making in perceptually unique ways. It is suggested that these auxiliary principles distinguish different types of part-writing, such as polyphony, homophony, and close-harmony.
A theory is proposed to account for the aesthetic origin of voice-leading practices.
Abstract
PART I: PERCEPTUAL PRINCIPLES AND WESTERN VOICE-LEADING
Introduction
Plan
Rules of Voice-leading Reviewed
PART II: DERIVATION OF THE RULES OF VOICE-LEADING
Perceptual Principles
1. Toneness Principle
2. Principle of Temporal Continuity
Temporal Continuity and Musical Practice
3. Minimum Masking Principle
4. Tonal Fusion Principle
Sensory Dissonance, Tonal Fusion, and Musical Practice
5. Pitch Proximity Principle
Musical Terminology: Types of Harmonic Intervals
Pitch Proximity and Music
6. Pitch Co-Modulation Principle
Types of Melodic Intervals -- The Fission Boundary
Melodic Motion -- The Temporal Coherence Boundary
Derivations from Multiple Principles
PART III: PERCEPTUAL PRINCIPLES AND MUSICAL GENRES
7. The Principle of Onset Synchrony
Whence Homophony?
8. The Principle of Limited Density
9. The Principle of Timbral Differentiation
10. The Source Location Principle
Auxiliary Principles and Musical Textures
References
Conclusion
Aesthetic Origins
Footnotes
In the training of Western musicians, the canon of harmony and voice-leading rules has been considered one of the essential foundations of the art-music craft. Of course not all composers accede to the norms of traditional harmony or voice-leading. Nor should they. Rules ought to be followed only when the composer agrees that the goal or goals embodied by the rules form worthy musical objectives, and when the rules themselves provide an effective means by which the goal(s) may be achieved. We can, therefore, ask three questions of any procedural rule: (1) What goal is served by following the rule? (2) Is the goal worthwhile? and (3) Is the rule an effective way of achieving the purported goal?
The practice of Western harmony and voice-leading has been the subject of extensive theoretical attention and elucidation (Aldwell & Schachter, 1989; Berardi, 1681; Fétis, 1840; Fux, 1725; Hindemith, 1944; Horwood, 1948; Keys, 1961; Morris, 1946; Parncutt, 1989; Piston, 1978; Rameau, 1722; Riemann, 1903; Schenker, 1906; Schoenberg, 1911/1978; Stainer, 1878; and many others [1]). Over the centuries, theorists have generated a wealth of lucid insights pertaining to harmony and voice-leading. Earlier theorists tended to view the voice-leading canon as a set of fixed or inviolable universals. More recently, theorists have suggested that the voice-leading canon can be usefully regarded as descriptive of a particular, historically encapsulated, musical convention. A number of twentieth-century theorists have endeavored to build harmonic and structural theories on the foundation of voice-leading.
Traditionally, the so-called "rules of harmony" are divided into two broad groups: (1) the rules of harmonic progression, and (2) the rules of voice-leading. The first group of rules pertains to the choice of chords, including the overall harmonic plan of a work, the placement and formation of cadences, and the moment-to-moment succession of individual chords. The second group of rules pertains to the manner in which individual parts or voices move from tone to tone in successive sonorities. The term "voice-leading" originates from the German Stimmführung, and refers to the direction of movement for tones within a single part or voice. A number of theorists have suggested that the principal purpose of voice-leading is to create perceptually independent musical lines. This article develops a detailed exposition in support of this view.
In this article, the focus is exclusively on the practice of voice-leading. No attempt will be made here to account for the rules of harmonic progression. The goal is to explain voice-leading practice using perceptual principles, predominantly principles associated with the theory of auditory stream segregation (Bregman, 1990; McAdams & Bregman, 1979; van Noorden, 1975; Wright, 1986; Wright & Bregman, 1987). As will be seen, this approach provides an especially strong account of Western voice-leading practice. Indeed, in the middle section of this article, a derivation of the traditional rules of voice-leading will be given. At the end of the article, a cognitive theory of the aesthetic origins of voice-leading will be proposed.
Understood in its broader sense, "voice-leading" also connotes the dynamic quality of tones leading somewhere. A full account of voice-leading would entail a psychological explanation of how melodic expectations and implications arise, and would also explain the phenomenal experiences of tension and resolution. These issues have been addressed periodically throughout the history of music theory, and remain the subject of continuing investigation and theorizing (e.g., Narmour, 1991, 1992). Many fundamental issues remain unresolved (see von Hippel & Huron, 2000), and so it would appear to be premature to attempt to explain the expectational aspects of tone successions. Consequently, the focus in this article will be limited to the conventional rules of voice-leading.
The purpose of this article is to address the three questions posed above, namely to identify the goals of voice-leading, to show how following the traditional voice-leading rules contributes to the achievement of these goals, and to propose a cognitive explanation for why the goals might be deemed worthwhile in the first place.
Many of the points demonstrated in this work will seem trivial or intuitively obvious to the experienced musician or music theorist. It is an unfortunate consequence of the pursuit of rigor that an analysis can take an inordinate amount of time to arrive at a conclusion that is utterly expected. As in scientific and philosophical endeavors generally, attempts to explicate "common sense" will have an air of futility to those for whom the origins of intuition are not problematic. However, such detailed studies can unravel the specific origins from which intuitions arise, contribute to a more precise formulation of existing knowledge, and expose inconsistencies worthy of further study.
My presentation will proceed according to the following plan. First, the principal rules of traditional voice-leading will be reviewed. Following this review, ten principles of auditory perception will be described and their musical implications identified. The perceptual principles described will initially be limited to a core set of six principles that are most pertinent to understanding voice-leading. These principles can be treated in a manner akin to axioms in a formal system from which a set of propositions can be derived.
In Part II, we will attempt to derive the traditional rules of voice-leading. In the process of the derivation, several novel rules will arise that are not normally found in theoretical writings on voice-leading. These novel rules can be regarded as theory-derived predictions. For these novel rules, we will examine compositional practice to determine whether composers typically write in a manner consistent with these additional rules. To foreshadow the results, we will see that voice-leading practices are indeed consistent with several non-traditional rules predicted by the perceptual principles.
Part III presents four additional perceptual principles that are occasionally linked to the practice of Western voice-leading. In essence, these auxiliary principles constitute "options" that shape the music-making in perceptually unique ways. Composers can elect to include or exclude one or more of these auxiliary principles depending on the implied perceptual goal. Once again, these principles can be treated in a manner akin to axioms in formal logic and selectively added to the core group of six central principles. Depending on the choice of auxiliary principles, we will see that alternative voice-leading systems arise. It will be argued that the choice of such auxiliary principles is one of the hallmarks of musical genres; these auxiliary principles shed light on such matters as the distinction between homophonic and polyphonic voice-leading, and such unique genres as close-harmony (such as the voice-leading found in barbershop quartets).
In the concluding section a psychological account will be advanced whose goal is to explain why conventional voice-leading might be experienced by many listeners as pleasing. A number of testable predictions arise from this account. The article closes by identifying a number of unresolved issues and posing questions for further research.
Note that the purpose of this article is not somehow to "vindicate" or otherwise act as an apologist for the traditional rules of Western harmony. Nor is the intent to restrict in any way the creative enterprise of musical composition. If a composer chooses a particular goal (either implicitly or explicitly), then there are frequently natural consequences that may constrain the music-making in such a way as to make the goal achievable. Musically pertinent goals might include social, political, historical, formal, perceptual, emotional, cognitive, and/or other objectives. The purpose of this article is merely to identify and clarify some of the perceptual and cognitive aspects that shape music-making. In pursuing this analysis, the reader should not assume that other musically-pertinent goals are somehow less important or irrelevant in understanding music.
We may begin by reviewing briefly the traditional rules of voice-leading. Of course there is no such thing as "the" rules of voice-leading. Theoretical and pedagogical treatises on harmony differ in detail. Some texts identify a handful of voice-leading rules whereas other texts define dozens of rules and their exceptions. Different pedagogical works emphasize 16th-century modal counterpoint (e.g., Gauldin, 1985; Jeppesen, 1939/1963), 18th-century Bach-style counterpoint (e.g., Benjamin, 1986; Parks, 1984; Trythall, 1993), or harmony (e.g., Piston, 1978; Aldwell & Schachter, 1989). Despite this variety, there is a core group of some dozen-odd rules that appears in nearly every pedagogical work on voice-leading. These core voice-leading rules include the following:
Note that several traditional voice-leading rules are not included in the above summary. Most notably, the innumerable rules pertaining to chordal tone doubling (e.g., avoid doubling chromatic or leading-tones) are absent. Also missing are the rules prohibiting movement by augmented intervals and the rule prohibiting false relations. No attempt will be made here to account for these latter rules.
Over the centuries, a number of theorists have explicitly suggested that voice-leading is the art of creating independent parts or voices. In the ensuing discussion we will take these suggestions literally, and link voice-leading practice to perceptual research concerning the formation of auditory images and independent auditory streams.
Having briefly reviewed the voice-leading canon, we can now identify and define the pertinent perceptual principles. In summary, the following six principles will be discussed: (1) toneness, (2) temporal continuity, (3) minimum masking, (4) tonal fusion, (5) pitch proximity, and (6) pitch co-modulation. Later, we will examine four auxiliary principles that contribute to the differentiation of musical genres: (7) onset synchrony, (8) limited density, (9) timbral differentiation, and (10) source location.
Sounds may be regarded as evoking perceptual "images." An auditory image may be defined as the subjective experience of a singular sonic entity or acoustic object. Such images are often (although not always) linked to the recognition of the sound's source or acoustic origin. Hence certain bands of frequencies undergoing a shift of phase may evoke the image of a passing automobile.
Not all sounds evoke equally clear mental images. In the first instance, aperiodic noises evoke more diffuse images than do pure tones. Even when noise bands are widely separated and masking is absent, listeners are much less adept at identifying the number of noise bands present in a sound field than an equivalent number of pure tones -- provided the tones are not harmonically related (see below).
In the second instance, certain sets of pure tones may coalesce to form a single auditory image -- as in the case of the perception of a complex tone. For complex tones, the perceptual image is strongly associated with the pitch evoked by a set of spectral components. One of the best demonstrations of this association is to be found in the phenomenon of residue pitch (Schouten, 1940). In the case of a residue pitch, a complex harmonic tone whose fundamental is absent nevertheless evokes a single notable pitch corresponding roughly to the missing fundamental.
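The arithmetic behind the "missing fundamental" can be illustrated with a minimal sketch. It assumes idealized, exactly harmonic partials given as whole numbers of hertz; the function name `implied_fundamental` and the example frequencies are hypothetical and chosen only for illustration. The greatest common divisor is used here as a stand-in for the frequency of which all partials are integer multiples. It is not offered as a model of how the auditory system arrives at a residue pitch, which tolerates mistuning and corresponds only roughly to the missing fundamental.

```python
from functools import reduce
from math import gcd


def implied_fundamental(partials_hz):
    """Return the largest frequency (Hz) of which every partial is an integer multiple.

    Assumes the partials are exact integers; real tones are never this tidy.
    """
    return reduce(gcd, partials_hz)


# Harmonics 2, 3 and 4 of a 200 Hz tone; the 200 Hz component itself is absent.
partials = [400, 600, 800]
print(implied_fundamental(partials))  # 200
```

Although no energy is present at 200 Hz in this example, 200 Hz is the only fundamental whose harmonic series contains all three partials as adjacent members.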
In the third instance, tones having inharmonic partials tend to evoke more diffuse auditory images. Inharmonic partials are more likely to be resolved as independent tones, and so the spectral components are less apt to cohere and be perceived as a single sound. In a related way, inharmonic tones also tend to evoke competing pitch perceptions, such as in the case of bells. The least ambiguous pitches are evoked by complex tones whose partials most closely approximate a harmonic series.
In general, pitched sounds are easier to identify as independent sound sources, and sounds that produce the strongest and least ambiguous pitches produce the clearest auditory images. Pitch provides a convenient "hanger" on which to hang a set of spectral components and to attribute them to a single acoustic source. Although pitch is typically regarded as a subjective impression of highness or lowness, the more fundamental feature of the phenomenon of pitch is that it provides a useful internal subjective label -- a perceptual handle -- that represents a collection of partials likely to have been evoked by a single physical source. Inharmonic tones and noises can also evoke auditory images, but they are typically more diffuse or ambiguous.
Psychoacousticians have used the term "tonality" to refer to the clarity of pitch perceptions (ANSI, 1973); however, this choice of terms is musically unfortunate. Inspired by research on pitch strength or pitch salience carried out by Terhardt and others, Parncutt (1989) proposed the term tonalness; we will use the less ambiguous term toneness to refer to the clarity of pitch perceptions. For pure tones, toneness is known to change with frequency. Frequencies above about 5,000 Hz tend to sound like indistinct "sizzles" devoid of pitch (Attneave & Olson, 1971; Ohgushi & Hato, 1989; Semal & Demany, 1990). [2] Similarly, very low frequencies sound like diffuse rumblings. Frequencies in the middle range of hearing give rise to the most well-defined pitch sensations -- that is, they exhibit high toneness.
The clarity of pitch perceptions has been simulated systematically in a model of pitch formulated by Terhardt, Stoll and Seewann (1982a, 1982b). For both pure and complex tones, the model calculates a pitch weight, which may be regarded as an index of the pitch's clarity, and therefore, a measure of toneness. For pitches evoked by pure tones (so-called "spectral pitches"), sensitivity is most acute in the spectral dominance region -- a broad region centered near 700 Hz. Pitches evoked by complex tones (so-called "virtual pitches") typically show the greatest pitch weight when the evoked or actual fundamental lies in a broad region centered near 300 Hz -- roughly D4 immediately above middle C (Terhardt, Stoll, Schermbach, & Parncutt, 1986).
Figure 1 shows changes of pitch weight versus pitch for several sorts of complex tones (see also Huron & Parncutt, 1992). The pitch weights are calculated according to the Terhardt-Stoll-Seewann model. The solid curve shows the calculated pitch weight of the most prominent pitch for sawtooth tones having a 60 dB SPL fundamental and ranging from C1 to C7. The dotted curves show changes of calculated pitch weight for recorded tones spanning the entire ranges for several orchestral instruments including harp, violin, flute, trumpet, violoncello, and contrabassoon. Although spectral content influences virtual pitch weight, Figure 1 shows that the region of maximum pitch weight for complex tones remains quite stable. Notice that complex tones having a pitch weight greater than one unit on Terhardt's scale range between about E2 and G5 (or about 80 to 800 hertz). Note that this range coincides very well with the range spanned by the bass and treble staves in Western music.
Fig. 1. Changes of maximum pitch weight versus pitch for complex tones from various natural and artificial sources; calculated according to the method described in Terhardt, Stoll and Seewann (1982a, 1982b). Solid line: pitch weight for tones from C1 to C7 having a sawtooth waveform (all fundamentals at 60 dB SPL). Dotted lines: pitch weights for recorded tones spanning the entire ranges for harp, violin, flute, trumpet, violoncello, and contrabassoon. Pitch weights were calculated for each tone where the most intense partial was set at 60 dB. For tones below C6 virtual pitch weight is greater than spectral pitch weight; above C6, spectral pitch predominates -- hence the abrupt change in slope. The figure shows that changes of spectral content have little effect on the region of maximum pitch weight. Note that tones having a pitch weight greater than one unit range between about E2 and G5 -- corresponding to the pitch range spanned by the treble and bass staves. Recorded tones from Opolko and Wapnick (1989). Spectral analyses kindly provided by Gregory Sandell (Sandell, 1991a).
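The correspondence between pitch names and frequencies quoted above (E2 and G5 at about 80 and 800 hertz; D4 near 300 Hz) can be checked with a short sketch. It assumes twelve-tone equal temperament with A4 tuned to 440 Hz, a tuning reference that the text does not state; the helper `pitch_to_hz` is a hypothetical name introduced for illustration.

```python
# Semitone offsets of the natural notes above C within one octave.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def pitch_to_hz(name, a4=440.0):
    """Convert a pitch name such as 'E2', 'D#4' or 'Bb3' to hertz.

    Assumes equal temperament and the given A4 reference (an assumption,
    not a value taken from the article).
    """
    semitone = NOTE_OFFSETS[name[0]]
    rest = name[1:]
    if rest.startswith("#"):
        semitone += 1
        rest = rest[1:]
    elif rest.startswith("b"):
        semitone -= 1
        rest = rest[1:]
    octave = int(rest)
    midi = 12 * (octave + 1) + semitone  # MIDI convention: C4 = 60, A4 = 69
    return a4 * 2 ** ((midi - 69) / 12)


for pitch in ["E2", "F2", "C4", "D4", "G5"]:
    print(pitch, round(pitch_to_hz(pitch), 1))
```

Under these assumptions E2 falls a little above 82 Hz and G5 a little below 784 Hz, in line with the rounded figures of 80 and 800 hertz given in the text.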
The origin of the point of maximum virtual pitch weight remains unknown. Terhardt has suggested that pitch perception is a learned phenomenon arising primarily from exposure to harmonic complex tones produced typically by the human voice. The point of maximum virtual pitch weight is similar to the mean fundamental frequency for women's and children's voices. The question of whether the ear has adapted to the voice or the voice has adapted to the ear is difficult to answer. For our purposes, we might merely note that voice and ear appear to be well co-adapted with respect to pitch.
In Huron and Parncutt (1992), we calculated the average notated pitch in a large sample of notes drawn from various musical works, including a large diverse sample of Western instrumental music, as well as non-Western works including multi-part Korean and Sino-Japanese instrumental works. The average pitch in this sample was found to lie near D#4 -- a little more than a semitone above the center of typical maxima for virtual pitch weight. This coincidence is especially evident in Figure 2 where the average notated pitch is plotted with respect to three scales: frequency, log frequency, and virtual pitch weight. The virtual pitch weight "scale" shown in Figure 2 was created by integrating the curve for the sawtooth wave plotted in Figure 1 -- that is, spreading the area under this curve evenly along the horizontal axis. For each of the three scales shown in Figure 2, the lowest (F2) and highest (G5) pitches used in typical voice-leading are also plotted. As can be seen, musical practice tends to span precisely the region of maximum virtual pitch weight. This relationship is consistent with the view that clear auditory images are sought in music-making.
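The construction of the virtual pitch weight "scale" by integration can be sketched as follows. This is only an illustration of the general technique (cumulative trapezoidal integration, normalized so that the full range maps onto 0 to 1). The function name `cumulative_scale` is hypothetical, and the weights below are made-up placeholder values, not the pitch weights plotted in Figure 1.

```python
def cumulative_scale(xs, weights):
    """Map each x to the fraction of the area under the weight curve lying to its left.

    Uses the trapezoid rule between successive sample points.
    """
    areas = [0.0]
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        areas.append(areas[-1] + 0.5 * (weights[i] + weights[i - 1]) * width)
    total = areas[-1]
    return [a / total for a in areas]


# Placeholder curve: positions in semitones and invented weights that merely
# rise to a peak in the middle of the range (NOT the values from Figure 1).
semitones = [0, 12, 24, 36, 48, 60, 72]
toy_weights = [0.2, 0.8, 1.6, 2.0, 1.4, 0.6, 0.1]

for x, s in zip(semitones, cumulative_scale(semitones, toy_weights)):
    print(x, round(s, 3))
```

On a scale built this way, equal steps correspond to equal areas under the curve, so regions of high weight are stretched and regions of low weight are compressed. That is why a range that is narrow in frequency can occupy most of such a scale.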
Once again, the causal relationship is difficult to establish. Musical practice may have adapted to human hearing, or musical practice may have contributed to the shaping of human pitch sensitivities. Whatever the causal relationship, we can note that musical practice and human hearing appear to be well co-adapted. In short, "middle C" truly is near the middle of something. Although the spectral dominance region is centered more than an octave away, middle C is very close to the center of the region of virtual pitch sensitivity for complex tones. Moreover, the typical range for voice-leading (F2-G5) spans the greater part of the range where virtual pitch weight is high. By contrast, Figure 2 shows that the pitch range for music-making constitutes only a small subset of the available linear and log frequency ranges.
Fig. 2. Relationship of pitch to frequency, log frequency, and virtual pitch weight scales. Each line plots F2 (bottom of the bass staff), C4 (middle C), and G5 (top of the treble staff) with respect to frequency, log frequency, and virtual pitch weight (VPW)-scale. Lines are drawn to scale, with the left and right ends corresponding to 30 hertz and 15,000 hertz respectively. The solid dot indicates the mean pitch (approx. D#4) of a cross-cultural sample of notated music determined by Huron and Parncutt (1992). The VPW scale was determined by integrating the area under the curve for the sawtooth wave plotted in Figure 1. Musical practice conforms best to the VPW scale with middle C positioned near the center of the region of greatest pitch sensitivity for complex tones.
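How small a portion of the linear and log frequency axes is occupied by the F2-G5 range can be estimated with a brief calculation. The sketch below uses the 30 hertz and 15,000 hertz endpoints given in the caption; the note frequencies are rounded equal-tempered values with A4 = 440 Hz, which is an assumption not stated in the text, and the function names are hypothetical.

```python
from math import log

LOW_HZ, HIGH_HZ = 30.0, 15000.0  # endpoints of the lines in Figure 2

# Rounded equal-tempered frequencies, assuming A4 = 440 Hz.
NOTES_HZ = {"F2": 87.31, "C4": 261.63, "G5": 783.99}


def linear_position(f_hz):
    """Fractional position of f_hz along a linear frequency axis (0 = left end, 1 = right end)."""
    return (f_hz - LOW_HZ) / (HIGH_HZ - LOW_HZ)


def log_position(f_hz):
    """Fractional position of f_hz along a logarithmic frequency axis."""
    return log(f_hz / LOW_HZ) / log(HIGH_HZ / LOW_HZ)


for name, f_hz in NOTES_HZ.items():
    print(name, round(linear_position(f_hz), 3), round(log_position(f_hz), 3))

linear_span = linear_position(NOTES_HZ["G5"]) - linear_position(NOTES_HZ["F2"])
log_span = log_position(NOTES_HZ["G5"]) - log_position(NOTES_HZ["F2"])
print(round(linear_span, 3), round(log_span, 3))
```

Under these assumptions the F2-G5 range covers roughly one twentieth of the linear axis and about a third of the log axis, with middle C well to the left of center on both.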
Drawing on the extant research concerning pitch perception, we might formulate the following principle:
1. Toneness Principle. Strong auditory images are evoked when tones exhibit a high degree of toneness. A useful measure of toneness is provided by virtual pitch weight. Tones having the highest virtual pitch weights are harmonic complex tones centered in the region between F2 and G5. Tones having inharmonic partials produce competing virtual pitch perceptions, and so evoke more diffuse auditory images.
A second factor influencing the vividness of auditory images is their continuity. In 1971, Bregman and Campbell coined the term auditory stream to denote the perceptual experience of a single coherent sound activity that maintains its singleness and continuity with respect to time. A number of factors are known to influence the perception of auditory continuity. The most obvious factor is temporal continuity, in which sound energy is maintained over a period of time.
Auditory images may be evoked by either real (sensory) or imagined (purely mental) processes. Two examples of purely mental auditory images can be found in echoic memory and auditory induction (Houtgast, 1971, 1972, 1973; Thurlow, 1957; Warren, Obusek & Ackroff, 1972). [3] In the case of auditory induction, Warren et al. generated stimuli in which intermittent faint sounds were alternated with louder sounds. The faint and loud sounds were contiguous, but not overlapping. Nevertheless, the faint sounds tend to be perceived as a continuous background tone against which the loud sounds are heard to pulse. In this case, there is a mental disposition to continue a sound image even when it is physically absent.
Warren et al. were able to explain the origin of this phenomenon by showing that the frequency/intensity thresholds for auditory induction coincide closely with the thresholds for auditory masking. In other words, auditory induction mentally reinstates sounds which the listener would expect to be masked -- even when the sounds are truly absent. With pure tones, robust auditory induction effects can be achieved for durations of up to 300 msec. In the case of noise bands, auditory induction may be achieved for durations of 20 seconds or more.
Auditory induction may be viewed as an involuntary form of auditory imagination. Fortunately, auditory induction is a subjective phenomenon that readily admits to empirical investigation and measurement. Other pertinent subjective experiences are not so easily studied. Listeners (musicians especially) are also cognizant of the existence of voluntary auditory imagination, in which the listener is able to form or sustain a purely mental image of some sound, such as the imagined sound of a timpani roll or the sound of a bubbling brook.
Although imagined sounds may be quite striking, in general, imagined sounds are significantly less vivid than actual sound stimuli. Moreover, even in the absence of sound, recently heard sounds are more vivid than less recently heard sounds. In short, auditory images have a tendency to linger beyond the physical cessation of the stimulus. In the case of real sounds, the evoked auditory images decay through the inexorable degradation of the short-term auditory store -- what Neisser (1967) dubbed "echoic memory." In general, the longer a sound stimulus is absent, the less vivid is its evoked image.
A number of experiments have attempted to estimate the duration of echoic memory. These measures range from less than a second (Treisman & Howarth, 1959) to less than 5 seconds (Glucksberg & Cowen, 1970). Typical measures lie near one second in duration (Crowder, 1969; Guttman & Julesz, 1963; Rostron, 1974; Treisman, 1964; Treisman & Rostron, 1972). Kubovy and Howard (1976) have concluded that the lower bound for the half-life of echoic memory is about 1 second. Using very short tones (40 msec duration), van Noorden (1975; p.29) found that the sense of temporal continuation of events degrades gradually as the inter-onset interval between tones increases beyond 800 msec.
On the basis of this research, we may conclude that vivid auditory images are evoked best by sounds that are either continuous, or broken only by very brief interruptions. [4] In short, sustained and recurring sound events are better able to maintain auditory images than brief intermittent stimuli.
Drawing on the above literature, we might formulate the following principle:
2. Principle of Temporal Continuity. In order to evoke strong auditory streams, use continuous or recurring rather than brief or intermittent sound sources. Intermittent sounds should be separated by no more than roughly 800 milliseconds of silence in order to assure the perception of continuity.
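As a rough illustration, the 800 millisecond guideline can be expressed as a check on the silences between successive tones in a single line. The representation of notes as (onset, duration) pairs in seconds and the function names are hypothetical choices made for this sketch; the threshold is taken from the principle as stated, and is approximate rather than a sharp perceptual boundary.

```python
CONTINUITY_LIMIT_S = 0.8  # roughly 800 milliseconds, per the principle above


def silent_gaps(notes):
    """Return the silence (in seconds) between each note and the next.

    `notes` is a list of (onset_s, duration_s) pairs sorted by onset.
    Overlapping or contiguous notes yield a gap of 0.0.
    """
    gaps = []
    for (onset, duration), (next_onset, _) in zip(notes, notes[1:]):
        gaps.append(max(0.0, next_onset - (onset + duration)))
    return gaps


def continuity_breaks(notes, limit_s=CONTINUITY_LIMIT_S):
    """Return the indices of notes followed by a silence longer than limit_s."""
    return [i for i, gap in enumerate(silent_gaps(notes)) if gap > limit_s]


# A hypothetical four-note line: the third note is followed by 1.1 s of silence.
line = [(0.0, 0.5), (0.5, 0.5), (1.2, 0.3), (2.6, 0.5)]
print(continuity_breaks(line))  # [2]
```

Only the long rest after the third note exceeds the limit; the short 0.2 second gap before it does not.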
The musical implications of this principle are readily apparent. In brief, the implications are evident (1) in the types of sounds commonly used in music-making, (2) in the manner by which sound events succeed one another, (3) in the use of physical damping for unduly long sounds, and (4) in the differing musical treatment of sustained versus intermittent sounds.
In the first instance, the principle of temporal continuity is evident in the types of sounds commonly used in music-making. Compared with most natural sound-producing objects, musical instruments are typically constructed so as to maximize the duration of the sounds produced -- that is, to enhance the period of sustain. Most of the instruments of the Western orchestra, for example, are either blown or rubbed -- modes of excitation that tend to evoke relatively long-lasting or continuous sounds. Even in the case of percussion instruments (such as the piano or the vibraphone), the history of the development of these instruments has shown a marked trend toward extending the resonant durations of the sounds produced. For example, the history of pianoforte construction has been marked by the continual increase in string tension, resulting in tones of longer duration. Similar historical developments can be traced for such instruments as timpani, gongs, and marimbas. In some cases, musicians have gone to extraordinary lengths in order to maintain a continuous sound output. In the case of wind instruments, the disruptions necessitated by breathing have been overcome by such devices as bagpipes, mechanical blowers (as in pipe organs), and devices that respond to both "inhaling" and "exhaling" (e.g., the accordion, and, to a lesser extent, the harmonica). In addition, in some cultures, performance practices exist whose sole purpose is to maintain an uninterrupted sound. Most notable is the practice of "cyclic breathing" -- as used, for example, in the Arab shawm (or zurna) and the north Australian didjeridu. For stringed instruments, bowed modes of excitation have been widespread. In the extreme, continuous bowing mechanisms have been devised, such as that used in the hurdy-gurdy. In the case of the guitar, solid-body construction and controlled electronic feedback have become popular methods of increasing the sustain of plucked strings.
Apart from the continuous character of most musical sounds, composers tend to assemble successions of tones in a way that suggests a persistent or ongoing existence. A notable (if seemingly trivial) fact is that the majority of notated tones in music are followed immediately by a subsequent tone. For example, 93 percent of all tones in vocal melodies by Stephen Foster and Franz Schubert are followed by another tone. The corresponding percentage for instrumental melodies exceeds 98 percent. The exceptions to this observation are themselves telling. Successions of tones in vocal music, for example, are periodically interrupted by rest periods that allow the singer(s) time to breathe. In most of the world's music, the duration of these rest periods is just sufficient to allow enough time to inhale (i.e., about 1 or 2 seconds). (Longer rests typically occur only in the case of accompanied vocal music.) Moreover, when music-making does not involve the lungs, pitch successions tend to have even shorter and fewer interruptions. Once again, as a general observation, we can note that in most of the world's music-making, there is a marked tendency to maintain a more or less continuous succession of acoustic events.
In tandem with efforts to maintain continuous sound outputs, musical practices also reveal efforts to terminate overlapping resonances. When musicians connect successions of pitches, problems can arise when each pitch is produced by a different vibrator. The dampers of the piano, for example, are used to truncate each tone's normal decay. Typical piano performance finds the dampers terminating one tone at the same moment that the hammer engages the next tone. This practice is not limited to the piano, nor is it limited to Western music. Guitar players frequently use the palm of the hand to dampen one or more strings -- and so enhance the sense of melodic continuation when switching from string to string. In the case of Indonesian gamelan music, proper performance practice requires the left hand to dampen the current resonator while the right hand strikes the subsequent resonator -- a technique known as tutupan. In all of these examples, the damping of physical vibrators is consistent with the goal of maintaining the illusion of a single continuous acoustic activity.
Of course music-making also entails the use of brief sounds, such as sounds arising from various percussion instruments like the wood block. However, musicians tend to treat such brief sounds differently. Brief sounds tend not to be used to construct "lines of sound" -- such as melodies; instead these sounds are typically used intermittently. Those percussion instruments having the shortest tone durations are least apt to be used for melodic purposes. When instruments having rapid decays are used to perform melodies, they often employ tremolo or multiple repeated attacks, as in music for marimba or steel drums. When brief tones are produced by non-percussion instruments (e.g., staccato), there is a marked tendency to increase the rate of successive tones. Contiguous staccato notes are seldom separated by more than 1 sec. of silence. In such cases, echoic memory serves to sustain the illusion of an uninterrupted line of sound.
As a general observation, we can note that in most of the world's music-making there is a marked preference for some sort of continuous sound activity; moreover, when instrumental sounds are brief in duration, such sounds are often assigned to non-melodic musical tasks -- even when the tones produced evoke clear pitches.
In his Nobel-prize-winning research, Georg von Békésy showed that different frequencies produce different points of maximum displacement on the basilar membrane of the cochlea (Békésy, 1943/1949, 1960). Specifically, low frequencies cause the greatest displacement of the membrane near the apex of the cochlea, whereas high frequencies produce maximum displacements toward the oval window. Békésy, and later Skarstein (Kringlebotn, et al., 1979), carefully mapped this relationship by measuring the distance (in millimeters from the stapes) of the point of maximum displacement for a given frequency input. [5] This correspondence between input frequency and place of maximum displacement of the basilar membrane is referred to as a tonotopic mapping or cochlear map.
Subsequent work by Harvey Fletcher linked the frequency-place coordinates of the Békésy cochlear map to experimental data from frequency discrimination and masking experiments (Fletcher, 1953; pp. 168-175). Fletcher showed that there is a close correspondence between distances along the basilar membrane and regions of masking. In pursuing this research, Fletcher defined a hypothetical entity dubbed the critical band to denote frequency-domain regions of roughly equivalent or proportional behavior (Fletcher, 1940). Subsequent research by Zwicker and others established critical bandwidths as an empirical rather than hypothetical construct. Most notably, Zwicker, Flottorp, and Stevens (1957) showed that distance along the basilar membrane accounts for changes in cumulative loudness as a function of the overall frequency spread of several tones or a band of noise (see also Scharf, 1961).
Greenwood (1961b, 1990) extended Fletcher's work by comparing psychoacoustic measures of critical bandwidth with the frequency-place coordinates of the Békésy-Skarstein cochlear map. Greenwood showed that there is a linear relationship, with one critical bandwidth being roughly equivalent to the distance of 1.0 millimeter on the basilar membrane.
Greenwood (1961b) went on to suggest that tonotopic effects might also account for an aspect of dissonance perception now referred to as sensory dissonance. Conceptually, the perception of consonance or dissonance is thought to be influenced by both cultural (learned) and sensory (innate) factors. Although little experimental research has addressed the cultural aspects of consonance or dissonance (Cazden, 1945), considerable research has established that sensory dissonance is intimately related to the cochlear map. Greenwood tested his hypothesis by comparing perceptual data collected by Mayer (1894) against the critical bandwidth/cochlear map. Mayer had collected experimental data for pure tones in which listeners were instructed to identify the smallest possible interval free of roughness or dissonance. For pure tones, this interval is not constant with respect to log frequency. Greenwood showed that there is an excellent fit between Mayer's (1894) measures and the critical bandwidth. [6]
The width of critical bands is roughly intermediate between a linear frequency scale and a logarithmic frequency scale for tones below about 400 Hz. For higher frequencies the width becomes almost completely logarithmic. When measured in terms of hertz, critical bandwidths increase as frequency is increased; when measured in terms of semitones (log frequency), critical bandwidths decrease as frequency is increased. Figure 3 shows the approximate spacing of critical band distances using standard musical notation (i.e., log frequency). Each note represents a pure tone; inter-note distances have been calculated according to the equivalent rectangular bandwidth-rate (ERB) scale devised by Moore and Glasberg (1983; revised Glasberg & Moore, 1990).
Fig. 3. Approximate size of critical bands represented using musical notation. Successive notes are separated by approximately one critical bandwidth = roughly one millimeter separation along the basilar membrane. Notated pitches represent pure tones rather than complex tones. Calculated according to the revised equivalent rectangular bandwidth (ERB) (Glasberg & Moore, 1990).
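The opposite trends in hertz and in semitones can be checked numerically. The following is a minimal sketch, assuming the commonly cited Glasberg and Moore (1990) approximation ERB(f) = 24.7(4.37f/1000 + 1) Hz as a stand-in for critical bandwidth. The function names are illustrative and are not taken from the original study.

```python
import math

def erb_hz(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore, 1990 approximation)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_semitones(f_hz):
    """Width in semitones of one ERB centred (linearly) on f_hz.

    Only meaningful for audible frequencies, where f_hz exceeds half the bandwidth.
    """
    half = erb_hz(f_hz) / 2.0
    return 12.0 * math.log2((f_hz + half) / (f_hz - half))

# Bandwidth grows in hertz but shrinks in semitones as frequency rises.
for f in (110.0, 220.0, 440.0, 880.0, 1760.0):
    print(f"{f:7.1f} Hz  ERB = {erb_hz(f):6.1f} Hz  = {erb_semitones(f):4.2f} semitones")
```

Under this approximation, a band centred on a low pitch spans several semitones, whereas a band centred two or three octaves higher spans considerably fewer, which is the pattern of narrowing note spacing that Figure 3 depicts.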
Plomp and Levelt (1965) extended Greenwood's work linking the perception of sensory dissonance (which they dubbed "tonal consonance") to the critical band -- and hence to the mechanics of the basilar membrane. Further work by Plomp and Steeneken (1968) replicated the dissonance/roughness hypothesis using more contemporary perceptual data. Plomp and Levelt estimated that pure tones produce maximum sensory dissonance when they are separated by about 25% of a critical bandwidth. However, their estimate was based on a critical bandwidth that is now considered to be excessively large, especially below about 500 Hz. Greenwood (1991) has estimated that maximum dissonance arises when pure tones are separated by about 30-40% of a critical bandwidth. For frequency separations greater than a critical band, no sensory dissonance arises between two pure tones. These findings were replicated by Kameoka and Kuriyagawa (1969a, 1969b) and by Iyer, Aarden, Hoglund, and Huron (1999).
In musical contexts, however, pure tones are almost never used; complex tones containing several harmonic components predominate. For two complex tones, each consisting of (say) 10 significant harmonics, the overall perceived sensory dissonance will depend on the aggregate interaction of all 20 pure tone components. The tonotopic theory of sensory dissonance explains why the interval of a major third sounds smooth in the middle and upper registers, but sounds gruff when played in the bass region.
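The register effect for the major third can be illustrated with a rough calculation. The sketch below is not the model used in any of the cited studies. It assumes (1) the Glasberg and Moore (1990) ERB approximation as a stand-in for critical bandwidth, (2) a hypothetical polynomial roughness kernel that is zero at unison, peaks at 25% of a critical bandwidth (Plomp and Levelt's estimate), and vanishes beyond one critical bandwidth, and (3) harmonic amplitudes falling as 1/n, with each pair weighted by the product of its amplitudes.

```python
from itertools import combinations

def erb_hz(f_hz):
    """Critical-bandwidth stand-in: Glasberg & Moore (1990) ERB approximation."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def kernel(x):
    """Assumed roughness for a separation of x critical bandwidths.

    x * (1 - x)**3 peaks at x = 0.25; it is scaled so that the peak equals 1.
    """
    if x >= 1.0:
        return 0.0
    return x * (1.0 - x) ** 3 / (0.25 * 0.75 ** 3)

def pair_roughness(f1, f2):
    """Roughness contributed by two pure-tone components."""
    return kernel(abs(f2 - f1) / erb_hz((f1 + f2) / 2.0))

def complex_tone(f0, n_harmonics=10):
    """Harmonic partials as (frequency, amplitude) pairs; 1/n amplitudes are assumed."""
    return [(f0 * n, 1.0 / n) for n in range(1, n_harmonics + 1)]

def aggregate_roughness(f0_low, semitones):
    """Sum the kernel over every pair among the 20 partials of two complex tones."""
    f0_high = f0_low * 2.0 ** (semitones / 12.0)
    partials = complex_tone(f0_low) + complex_tone(f0_high)
    return sum(a1 * a2 * pair_roughness(f1, f2)
               for (f1, a1), (f2, a2) in combinations(partials, 2))

bass_third = aggregate_roughness(65.41, 4)     # equal-tempered major third on C2
treble_third = aggregate_roughness(523.25, 4)  # the same interval on C5
print(f"C2 major third: {bass_third:.2f}   C5 major third: {treble_third:.2f}")
```

With these assumptions the same notated interval accumulates more roughness in the bass, because many more pairs of partials fall within a single critical band there. The absolute values carry no perceptual meaning; only the comparison between registers is of interest.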
Musicians tend to speak of the relative consonance or dissonance of various interval sizes (such as the dissonance of a minor seventh or the consonance of a major sixth). Although the pitch distance between two complex tones affects the perceived sensory dissonance, the effect of interval size on dissonance is indirect. The spectral content of these tones, the amplitudes of the component partials, and the distribution of partials with respect to the critical bandwidth are the most formative factors determining sensory dissonance. Note that the characterizations of relative dissonance for fixed interval sizes given by musicians may also reflect cultural (learned) factors that have received little overt empirical attention (Cazden, 1945).
Plomp and Levelt hypothesized that in the writing of chords, composers would typically endeavor to maintain roughly equivalent amounts of sensory dissonance throughout the span of the chord. That is, they hypothesized that composers would typically avoid chords that produced disproportionately more dissonance in one or another region of the chord. A given sonority might be highly dissonant or consonant overall, but Plomp and Levelt supposed that the dissonance would typically be distributed homogeneously across a given sonority.
An alternative interpretation of this prediction may be offered without appealing to the concept of dissonance. Spectral components that lie within a critical band of each other cause mutual masking which reduces the capacity of the auditory system to resolve or apprehend all of the sounds present. If composers were disposed to reduce the capacity for mutual masking, then it would be appropriate to space chordal tones in such a way that roughly equivalent amounts of spectral energy would fall in each critical band. Since critical bandwidths span more semitones in the low region than in the higher region, the notated pitches within a chord would show a distinctive distribution were this hypothesis true.
Plomp and Levelt carried out a study of chordal-tone spacing in two musical works: the third movement from J.S. Bach's Trio Sonata No. 2 for organ (BWV 526) [7] and the third movement from Dvorak's String Quartet Op. 51 in E-flat major. Their analyses demonstrated an apparent consistency between the composers' arrangements of vertical sonorities and the tonotopic map -- thus implying that critical bands significantly influence the vertical spacing of chordal tones.
In a replication study, Huron and Sellmer (1992) showed that Plomp and Levelt's demonstration was confounded by an unfortunate artifact, and that their results could not be used to support their conclusion. However, using a more sophisticated inferential approach and a much larger musical sample, we went on to provide an alternative demonstration that confirmed Plomp and Levelt's original hypothesis. The effect of spectral spread on the spacing of chordal tones is not easily summarized; however, Figure 4 can be used to illustrate the phenomenon. Figure 4 shows the average spacing of notated (complex) tones for sonorities having various bass pitches from C4 to C2. For example, the first notated sonority in Figure 4 shows the average tenor, alto, and soprano pitches for a large sample of four-note sonorities having C4 as the bass pitch. (The specific sonorities notated in Figure 4 should not be interpreted literally; only the approximate spacing of voices is of interest.) As the tessitura of the chord descends, the pitch separation between the lower notes in the sonority tends to become larger. This is consistent with efforts to distribute spectral components in a roughly uniform manner across the basilar membrane. Such a pattern is consistent with both the goals of minimizing sensory dissonance and minimizing auditory masking.
Fig. 4. Average spacing of tones for sonorities having various bass pitches from C4 to C2. Calculated from a large sample (>10,000) of four-note sonorities extracted from Haydn string quartets and Bach keyboard works (Haydn and Bach samples equally weighted). Bass pitches are fixed. For each bass pitch, the average tenor, alto, and soprano pitches are plotted to the nearest semitone. (Readers should not be distracted by the specific sonorities notated; only the approximate spacing of voices is of interest.) Note the wider spacing between the lower voices for chords having a low mean tessitura. Notated pitches represent complex tones rather than pure tones.
As pointed out by theorist Walter Piston (1978), the spacing between chordal tones is not related to the spacing of partials in the harmonic series (as formerly supposed). The harmonic series exhibits the same interval spacing whatever the register of a chord. Yet composed chords show systematic interval changes with respect to register.
While acknowledging that the extant research is consistent with both the goals of minimizing sensory dissonance and minimizing auditory masking, for the purposes of voice-leading, we will formulate the relevant musical principle in terms of masking:
3. Minimum Masking Principle. In order to minimize auditory masking within some vertical sonority, approximately equivalent amounts of spectral energy should fall in each critical band. For typical complex harmonic tones, this generally means that simultaneously sounding notes should be more widely spaced as the register descends.
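The consequence for chord spacing can be sketched by expressing the gaps between chord tones in critical-band units. The example below is a simplification under stated assumptions: it uses the Glasberg and Moore (1990) ERB-rate formula as the critical-band scale, considers only the fundamentals (not the full spectra) of the tones, and takes pitches as MIDI note numbers with A4 = 440 Hz. The chords are illustrative and are not drawn from the sampled repertoire.

```python
import math

def erb_rate(f_hz):
    """ERB-rate scale (Glasberg & Moore, 1990): position in critical-band units."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def midi_to_hz(note):
    """Equal-tempered frequency for a MIDI note number (A4 = 69 = 440 Hz assumed)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def band_gaps(midi_notes):
    """Critical-band distance between the fundamentals of adjacent chord tones."""
    rates = [erb_rate(midi_to_hz(n)) for n in sorted(midi_notes)]
    return [round(hi - lo, 2) for lo, hi in zip(rates, rates[1:])]

close_low = band_gaps([36, 40, 43, 48])    # C2 E2 G2 C3, close position
open_low = band_gaps([36, 43, 52, 60])     # C2 G2 E3 C4, wider at the bottom
close_high = band_gaps([60, 64, 67, 72])   # C4 E4 G4 C5, close position
print(close_low, open_low, close_high)
```

On this scale every adjacent pair in the low close-position chord lies less than one critical band apart, whereas the same voicing two octaves higher is spread more generously. Widening the lower intervals of the low chord moves its gaps back toward those of the higher chord.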
Tonal fusion is the tendency for some concurrent sound combinations to cohere into a single sound image. Tonal fusion arises most commonly when the auditory system interprets certain frequency combinations as comprising partials of a single complex tone (DeWitt & Crowder, 1987). Two factors are known to affect tonal fusion: (1) the frequency ratio of the component tones, and (2) their spectral content. Tonal fusion is most probable when the combined spectral content conforms to a single hypothetical harmonic series. This occurs most commonly when the frequencies of the component tones are related by simple integer ratios.
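The link between simple integer ratios and a shared harmonic series can be made concrete by counting coincident partials. The sketch below is a crude arithmetic proxy and not a perceptual model of fusion: it assumes just-intonation ratios, twelve partials per tone, and exact coincidence of frequencies. The function name is hypothetical.

```python
from fractions import Fraction

def shared_partials(ratio, n_partials=12):
    """Count partials of the lower tone that coincide exactly with partials of
    the upper tone, for a just-intonation frequency ratio upper/lower."""
    ratio = Fraction(ratio)
    # Frequencies are expressed as multiples of the lower tone's fundamental.
    lower = {Fraction(m) for m in range(1, n_partials + 1)}
    upper = {ratio * n for n in range(1, n_partials + 1)}
    return len(lower & upper)

intervals = {
    "unison": Fraction(1, 1),
    "octave": Fraction(2, 1),
    "perfect fifth": Fraction(3, 2),
    "major third": Fraction(5, 4),
    "minor second": Fraction(16, 15),
}
for name, ratio in intervals.items():
    print(f"{name:14s} {shared_partials(ratio)} of 12 partials coincide")
```

The simpler the ratio, the more partials of the two tones fall on the same frequencies, and the more readily the combined spectrum can be read as a single harmonic series.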
The pitch interval that most encourages tonal fusion is the aptly named unison. The second most fused interval is the octave, whereas the third most fused interval is the perfect fifth (Stumpf, 1890; DeWitt & Crowder, 1987). Following Stumpf, many music researchers have assumed that tonal fusion and tonal consonance are the same phenomenon, and that both arise from simple integer frequency ratios. However, the extant psychoacoustic research does not support Stumpf's view. Bregman (1990) has noted that the confusion arises from conflating "smooth sounding" with "sounding as one." As we have seen, work by Greenwood (1961a, 1990, 1991), Plomp and Levelt (1965), Kameoka and Kuriyagawa (1969a, 1969b), and Iyer, Aarden, Hoglund and Huron (1999) implicates critical band distances in the perception of tonal consonance or sensory dissonance. This work shows that sensory dissonance is only indirectly related to harmonicity or tonal fusion.
Whether or not tonal fusion is a musically desirable phenomenon depends on the music-perceptual goal. In Huron (1991b), it was shown that in the polyphonic writing of J.S. Bach, tonally fused harmonic intervals are avoided in proportion to the strength with which each interval promotes tonal fusion. That is, unisons occur less frequently than octaves, which occur less frequently than perfect fifths, which occur less frequently than other intervals. Of course concurrent octaves and concurrent fifths occur regularly in music, but (remarkably) they occur less frequently in polyphonic music than they would in a purely random juxtaposition of voices.
Note that this observation is independent of the avoidance of parallel unisons, fifths, or octaves. As simple static harmonic intervals, these intervals are actively avoided in Bach's polyphonic works. Considering the importance of octaves and fifths in the formation of common chords, their active avoidance is a remarkable feat (see Huron, 1991b).
In light of the research on tonal fusion, we may formulate the following principle:
4. Tonal Fusion Principle. The perceptual independence of concurrent tones is weakened when their pitch relations promote tonal fusion. Intervals that promote tonal fusion include (in decreasing order): unisons, octaves, perfect fifths, ... Where the goal is the perceptual independence of concurrent sounds, intervals ought to be shunned in direct proportion to the degree to which they promote tonal fusion.
Before continuing with the fifth perceptual principle, we might briefly consider the relationship between sensory dissonance and tonal fusion -- in light of musical practice. As we have seen, sensory dissonance depends on more than just the pitch interval separating two tones. It also depends on the spectral content of the tones, as well as their tessitura. Nevertheless, pitch distance retains an indirect influence on sensory dissonance, and it is therefore possible to calculate the average dissonance for intervals of various sizes -- such as the average dissonance of "a major sixth" (see Huron, 1994).
The solid line in Figure 5 shows the aggregate consonance values for musical intervals up to the size of an octave from perceptual experiments by Kaestner (1909). The bars show the interval prevalence in the upper two voices of J.S. Bach's three-part Sinfonias. [8] There is a fairly close fit between these two sets of data, suggesting that Bach tends to use various intervals in inverse proportion to their degree of sensory dissonance. There are a few notable discrepancies, however. The unison and octave intervals occur relatively infrequently in the Bach sample, although the sensory dissonance for these intervals is very low. In addition, the minor and major thirds appear to be "switched": the consonance data would predict that major thirds would be more prevalent than minor thirds.
Fig. 5. Comparison of sensory consonance for complex tones (line) from Kaestner (1909) with interval prevalence (bars) in the upper two voices of J.S. Bach's three-part Sinfonias (BWVs 787-801). Notice especially the discrepancies for P1 and P8. Reproduced from Huron (1991b).
The origin of the discrepancies for the major and minor thirds is not known. It is possible that the relative prevalence of minor thirds may be an artifact of the interval content of common chord types. Major and minor triads both contain one major third and one minor third each. However, both the dominant seventh chord and the minor-minor-seventh chord contain one major third and two minor thirds. In addition, the diminished triad (two minor thirds) is more commonly used than the augmented triad (two major thirds). Hence, the increased prevalence of minor thirds compared with major thirds may simply reflect the interval content of common chords. The relative prevalences of the major and minor sixth intervals (inversions of thirds) are also consistent with this suggestion.
The discrepancies for P1 and P8 in Figure 5 are highly suggestive in light of the tonal fusion principle. We might suppose that the reason Bach avoided unisons and octaves is in order to prevent inadvertent tonal fusion of the concurrent parts. In Huron (1991b) this hypothesis was tested by calculating a series of correlations. When perfect intervals are excluded from consideration, the correlation between Bach's interval preference and the sensory dissonance Z-scores is -0.85. [9] Conversely, if we exclude all intervals apart from the perfect intervals, the correlation between Bach's interval preference and the tonal fusion data is -0.82. Calculating the multiple regression for both factors, Huron (1991b) found an R² of 0.88 -- indicating that nearly 90 percent of the variance in Bach's interval preference can be attributed to the twin compositional goals of the pursuit of tonal consonance and the avoidance of tonal fusion. The multiple regression analysis also suggested that Bach pursues both of these goals with approximately equal resolve. Bach preferred intervals in inverse proportion to the degree to which they promote sensory dissonance, and in inverse proportion to the degree to which they promote tonal fusion. It would appear that Bach was eager to produce a sound that is "smooth" without the danger of "sounding as one." In Part III we will consider possible aesthetic motivations for this practice.
The experimental results pertaining to sensory dissonance and tonal fusion may be used to illuminate traditional musical terminology. Music theorists traditionally distinguish three classes of harmonic intervals: perfect consonances (such as perfect unisons, octaves, fourths, and fifths), imperfect consonances (such as major and minor thirds and sixths), and dissonances (such as major and minor seconds and sevenths, and tritones). These interval types can be classified according to the criteria of sensory dissonance and tonal fusion. Perfect consonances typically exhibit low sensory dissonance and high tonal fusion. Imperfect consonances have low sensory dissonance and comparatively low tonal fusion. Dissonances exhibit high sensory dissonance and low tonal fusion. (There are no equally-tempered intervals that exhibit high sensory dissonance and high tonal fusion, although the effect can be generated using grossly mistuned unisons, octaves, or fifths.) It would appear that the twin phenomena of sensory dissonance and tonal fusion provide a plausible account for both the traditional theoretical distinctions, as well as Bach's compositional practice.
In 1950, Miller and Heise observed that alternating pitches (such as trills) produce two different perceptual effects depending on the pitch distance separating the tones and their speed of alternation (Miller & Heise, 1950; Heise & Miller, 1951). When the tones are close with respect to pitch, quick alternations evoke a sort of "undulating effect" -- like a single wavering line. However, when the pitch separation is larger, the perceptual effect becomes one of two "beeping" tones of static pitch. Musicians recognize this phenomenon as that of pseudo-polyphony or compound melodic line, in which a single sequence of pitches nevertheless evokes a sort of "yodelling" effect.
Miller and Heise's observations were replicated and extended by a number of researchers including Bozzi and Vicario (1960), Vicario (1960), Schouten (1962), Norman (1967), Dowling (1967), van Noorden (1971a, 1971b), and Bregman and Campbell (1971). (Several of these researchers worked independently, without knowledge of previously existing work.) Of these pioneering efforts, the most significant works are those of Dowling (1967) and van Noorden (1975). Over the past three decades, however, the most sustained and significant research effort has been that of Albert Bregman (1990).
In 1975 van Noorden mapped the effects of tempo and pitch separation on stream integration and segregation. Figure 6 summarizes van Noorden's experimental results. When the tempo is slow and/or the pitches have close proximity, the resulting sequence is always perceived as a single stream. This area is indicated in Figure 6 as region 1 -- below the so-called fission boundary (lower line). Conversely, when the pitch distances are large and/or the tempo is fast, two streams are always perceived. This condition is indicated in Figure 6 as region 2 -- to the left of the temporal coherence boundary. Van Noorden also identified an intervening grey region, where listeners may hear either one or two streams depending on the context and the listener's disposition. Notice that the slope of the fission boundary is much shallower than the slope of the temporal coherence boundary. We will return to consider these different slopes later.
Fig. 6. Influence of interval size and tempo on stream fusion and segregation (van Noorden, 1975; p.15). Upper curve: temporal coherence boundary. Lower curve: fission boundary. In Region 1 the listener necessarily hears one stream (small interval sizes and slow tempos). In Region 2 the listener necessarily hears two streams (large interval sizes and fast tempos).
The importance of pitch proximity in stream organization is supported by a wealth of further experimental evidence. Schouten (1962) observed that temporal relationships are more accurately perceived within streams than between streams -- when streams are distinguished by pitch proximity alone. This work was replicated and extended by Norman (1967), Bregman and Campbell (1971), and Fitzgibbons, Pollatsek and Thomas (1974).
In addition, Bregman and his colleagues have assembled strong evidence showing the pre-eminence of pitch proximity over pitch trajectory in the continuation of auditory streams. In sequences of pitches, listeners tend not to "extrapolate" future pitches according to contour trajectories. Rather they "interpolate" pitches between pitches deemed to be in the same stream (Steiger & Bregman, 1981; Tougas and Bregman, 1985; Ciocca & Bregman, 1987; summarized by Bregman, 1990; pp.417-442). Similar effects have been observed in speech perception (Darwin and Gardner, 1986; Pattison, Gardner & Darwin, 1986).
Further experimental work has demonstrated the perceptual difficulty of tracking auditory streams that cross with respect to pitch. Using pairs of well-known melodies, Dowling (1973) carried out experiments in which successive notes of the two melodies were interleaved. The first note of melody `A' was followed by the first note of melody `B,' followed by the second note of melody `A' followed by the second note of melody `B' etc. Dowling found that the ability of listeners to identify the melodies was highly sensitive to the pitch overlap of the two melodies. Dowling found that as the melodies were transposed so that their mean pitches diverged, recognition scores increased. The greatest increase in recognition scores occurred when the transpositions removed all pitch overlap between the concurrent melodies.
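The construction of such interleaved stimuli can be sketched as follows. This is an illustrative reconstruction and not Dowling's actual procedure or materials: melodies are assumed to be lists of MIDI note numbers, the two short melodies are invented placeholders, and the overlap measure (the number of semitone positions shared by the two pitch ranges) is a hypothetical convenience.

```python
def interleave(melody_a, melody_b):
    """Alternate the notes of two melodies: A1, B1, A2, B2, ...

    Stops at the end of the shorter melody.
    """
    return [note for pair in zip(melody_a, melody_b) for note in pair]

def pitch_overlap(melody_a, melody_b):
    """Semitone positions shared by the two pitch ranges (0 when disjoint)."""
    shared_top = min(max(melody_a), max(melody_b))
    shared_bottom = max(min(melody_a), min(melody_b))
    return max(0, shared_top - shared_bottom + 1)

def transposition_to_separate(melody_a, melody_b):
    """Smallest upward transposition of melody B (in semitones) that removes all overlap."""
    return max(0, max(melody_a) - min(melody_b) + 1)

# Placeholder melodies (MIDI note numbers), invented for illustration only.
melody_a = [60, 62, 64, 62, 60, 64, 67, 64]
melody_b = [62, 65, 64, 62, 69, 67, 65, 62]

shift = transposition_to_separate(melody_a, melody_b)
melody_b_shifted = [note + shift for note in melody_b]
print(interleave(melody_a, melody_b))
print(pitch_overlap(melody_a, melody_b), pitch_overlap(melody_a, melody_b_shifted))
```

Transposing one melody by the computed amount leaves the interleaved sequence with two disjoint pitch ranges, the condition under which recognition improved most.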
Deutsch (1975) and van Noorden (1975) found that, for tones having identical timbres, concurrent ascending and descending tone sequences are perceived to switch direction at the point where their trajectories cross. That is, listeners are disposed to hear a "bounced" percept in preference to the crossing of auditory streams. Figure 7 illustrates two possible perceptions of intersecting pitch trajectories. Although the crossed trajectories represent a simpler Gestalt figure, the bounced perception is much more common -- at least when the trajectories are constructed using discrete pitch categories (as in musical scales).
Fig. 7. Schematic illustration of two possible perceptions of intersecting pitch trajectories. "Bounced" perceptions (right) are more common for stimuli consisting of discrete pitch sequences, when the timbres are identical.
In summary, at least four empirical phenomena point to the importance of pitch proximity in helping to determine the perceptual segregation of auditory streams: (1) the fission of monophonic pitch sequences into pseudo-polyphonic percepts described by Miller and Heise and others, (2) the discovery of information-processing degradations in cross-stream temporal tasks, as found by Schouten, Norman, Bregman and Campbell, and Fitzgibbons, Pollatsek and Thomas, (3) the perceptual difficulty of tracking auditory streams that cross with respect to pitch described by Dowling, Deutsch, and van Noorden, and (4) the pre-eminence of pitch proximity over pitch trajectory in the continuation of auditory streams demonstrated by Bregman et al. Stream segregation is thus strongly dependent upon the proximity of successive pitches.
On the basis of extensive empirical evidence, we can note the following principle:
5. Pitch Proximity Principle. The coherence of an auditory stream is maintained by close pitch proximity in successive tones within the stream. Pitch-based streaming is assured when pitch movement is within van Noorden's "fission boundary" (normally 2 semitones or less for tones less than 700 ms in duration). When pitch distances are large, it may be possible to maintain the perception of a single stream by reducing the tempo.
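As an illustration, the figures quoted in the principle can be applied mechanically to a notated line. The sketch below assumes pitches given as MIDI note numbers and durations in milliseconds; the function name, the labels, and the sample line are hypothetical. It treats the 2-semitone and 700 ms values as fixed thresholds, which is a simplification of a boundary that in fact varies with tempo.

```python
def classify_transitions(midi_pitches, durations_ms, max_step=2, limit_ms=700):
    """Label each melodic transition relative to the quoted fission-boundary figure."""
    labels = []
    for i in range(len(midi_pitches) - 1):
        interval = abs(midi_pitches[i + 1] - midi_pitches[i])
        if interval <= max_step:
            labels.append("within")       # single stream assured
        elif durations_ms[i] < limit_ms:
            labels.append("beyond")       # coherence no longer guaranteed
        else:
            labels.append("unspecified")  # the quoted figure does not cover long tones
    return labels

# Invented example line: C4 D4 F4 E4 B4, with one long tone before the leap.
pitches = [60, 62, 65, 64, 71]
durations = [300, 300, 300, 800, 300]
print(classify_transitions(pitches, durations))
```

A transition labelled "beyond" is not necessarily heard as a break in the stream; it only falls outside the region in which coherence is assured.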
If a musical melody is a kind of auditory stream (Dowling, 1967), then we might expect sequences of pitches to conform to the pitch proximity principle. Specifically, we would expect the majority of pitch sequences to fall below van Noorden's fission boundary, and certainly to fall below the temporal coherence boundary.
The evidence concerning the prevalence of pitch proximity in music is very extensive. Several authors have observed the pervasive use of small intervals in the construction of melodies, including Ortmann (1926), Merriam, Whinery and Fred (1956), and Dowling (1967). Figure 8 plots further data showing the distribution of interval sizes using samples of music from a number of cultures: American, Chinese, English, German, Hasidic, Japanese, and sub-Saharan African (Pondo, Venda, Xhosa, and Zulu). In general, the results affirm the preponderance of small intervals.
Fig. 8. Frequency of occurrence of melodic intervals in notated sources for folk and popular melodies from ten cultures (n=181). African sample includes Pondo, Venda, Xhosa, and Zulu works. N.B. Interval sizes only roughly correspond to equally-tempered semitones.
Apart from melodies, musicians know that the pitch interval between successive notes is a crucial determinant in the construction of pseudo-polyphonic passages. Dowling (1967) carried out a study of a number of Baroque solo works in order to determine the degree to which pseudo-polyphony is correlated with the use of large intervals. In measurements of interval sizes in passages deemed pseudo-polyphonic by two independent auditors, Dowling found the passages to contain intervals markedly larger than the "trill threshold" measured by Miller and Heise -- a threshold similar to van Noorden's fission boundary. In a sample of pseudo-polyphonic passages by Telemann, Dowling noted that Telemann never uses intervals less than the Miller and Heise trill threshold.
As further evidence of the primacy of small intervals in non-pseudo-polyphonic lines, Dowling examined the interval preferences of listeners. Through a series of stochastically-generated stimuli, Dowling found that listeners prefer melodies employing the smaller interval sizes. A study by Carlsen (1981), moreover, demonstrated that musicians show a marked perceptual expectancy for proximate pitch contours. Studying musicians from three different countries (Hungary, Germany, U.S.A.), Carlsen discovered that although there are significant differences between musicians in their melodic expectancies, in all cases neighboring pitch continuations of melodic stimuli are strongly favoured.
Further evidence showing the consistency between musical practice and empirical research regarding pitch proximity is evident in the case of part-crossing. Notice that, except in the case of unisons, the crossing of parts with respect to pitch always violates the pitch proximity principle. No matter how the pitches are arranged, the aggregate pitch distance is always lowest when the upper voice consists of the upper pitches, and the lower voice consists of the lower pitches. In a study of part-crossing in polyphonic music, Huron (1991a) showed that J.S. Bach avoids part-crossing, and that he becomes most vigilant to avoid part-crossing when the number of concurrent parts is three or more.
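The claim about aggregate pitch distance can be verified by brute force for two voices. The following sketch assumes pitches expressed as MIDI note numbers and compares the total movement of an uncrossed assignment with that of a crossed one over many randomly drawn pairs of dyads. The function names are illustrative.

```python
import random

def aggregate_distance(first, second):
    """Total semitone movement when first[i] moves to second[i] in each voice."""
    return sum(abs(b - a) for a, b in zip(first, second))

def crossing_never_helps(trials=10_000, seed=0):
    """Check that the uncrossed assignment never moves farther than the crossed one."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Two successive dyads of distinct pitches, each sorted as (lower, upper).
        low1, high1 = sorted(rng.sample(range(36, 85), 2))
        low2, high2 = sorted(rng.sample(range(36, 85), 2))
        uncrossed = aggregate_distance((low1, high1), (low2, high2))
        crossed = aggregate_distance((low1, high1), (high2, low2))
        if uncrossed > crossed:
            return False
    return True

# C4-E4 moving to D4-F4: parts kept in order versus parts exchanged.
print(aggregate_distance((60, 64), (62, 65)), aggregate_distance((60, 64), (65, 62)))
print(crossing_never_helps())
```

The random check is only a demonstration; the underlying result holds in general, since pairing sorted pitches with sorted pitches minimizes the sum of absolute differences.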
In summary, at least five empirical observations point to the importance of pitch proximity in musical organization: (1) the preponderance of small pitch intervals in non-pseudo-polyphonic melodies observed by Ortmann, Merriam et al., and Dowling, (2) the reciprocal prevalence of large pitch intervals in pseudo-polyphonic passages found by Dowling, (3) the auditory preference for small intervals in melodies found by Dowling, (4) the perceptual expectation for small intervals in continuations of melodic contours found by Carlsen, and (5) the avoidance of part-crossing in polyphonic music measured by Huron. Musical practice thus exhibits a notable consistency with the pitch proximity principle.
Once again, we might pause and consider the relationship between the perceptual evidence concerning pitch proximity and traditional musical terminology. Recall van Noorden's fission and temporal coherence boundaries shown in Figure 6. Not all regions in Figure 6 are equally pertinent to musical practice. Figure 9 shows distributions for note durations for samples of instrumental and vocal lines. Only durations for contiguous notated pitches have been included in this sample of 13,178 notes. Notes immediately prior to rests and at the ends of works have been eliminated from the sample in order to avoid the effect of final lengthening and to make the data consistent with the pitch alternation stimuli used to generate Figure 6.
A mean distribution is indicated by the solid line. In general, Figure 9 shows that the majority of musical tones are relatively short; only 3 percent of tones are longer than 1 second in duration. Eighty-five percent of tones are shorter than 500 msec; however, it is rare for tones to be less than 150 msec in duration. The median note durations for the samples plotted in Figure 9 are 0.22 seconds for instrumental and 0.38 seconds for vocal works.
Fig. 9. Distribution of note durations in 52 instrumental and vocal works. Dotted line: note durations for the combined upper and lower voices from J.S. Bach's two-part Inventions (BWVs 772-786). Dashed line: note durations in 38 songs (vocal lines only) by Stephen Foster. Solid line: mean distribution for both samples (equally weighted). Note durations were determined from notated scores using tempi measured from commercially available sound recordings. Graphs are plotted using bin sizes of 100 msec, centered at the positions plotted.
This distribution suggests that the majority of tone durations in music are too long to be in danger of crossing the temporal coherence boundary -- that is, where two streams are necessarily perceived. Notice that compared with the temporal coherence boundary, the fission boundary is more nearly horizontal (refer to Fig. 6). The boundary is almost flat for tone durations up to about 400 or 500 msec after which the boundary shows a positive slope. Figure 9 tells us that the majority of musical tones have durations commensurate with the flat portion of the fission boundary. This flat boundary indicates that intervals smaller than a certain size guarantee the perception of a single line of sound. In the region of greatest musical activity, the fission boundary approximates a constant pitch distance of about 1 semitone.
As in the case of harmonic intervals, music theorists traditionally distinguish two main classes of melodic intervals: conjunct or step motions, and disjunct or leap motions. In Western music, the dividing line between step and leap motion is traditionally placed between a major second and a minor third; that is, a major second (2 semitones) is considered a conjunct or step motion, whereas a minor third (3 semitones) is considered the smallest disjunct or leap motion. In other cultures, such as those that employ the common pentatonic scale, the maximum "step" size is roughly 3 semitones. It is plausible that the fission boundary in effect identifies a psychoacoustic basis for the distinction between conjunct and disjunct melodic motions. What theorists call conjunct intervals are virtually guaranteed to evoke the perception of stream continuation.
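The step/leap distinction described above can be written out as a small sketch. It assumes pitches are given as MIDI note numbers (so that differences are in semitones) and uses the Western dividing line stated in the text; the function names and the threshold constant are illustrative and not part of the original article.

```python
# Largest interval (in semitones) treated as a step in Western theory:
# a major second. A minor third (3 semitones) is the smallest leap.
STEP_MAX_SEMITONES = 2


def classify_melodic_interval(semitones: int) -> str:
    """Label a melodic interval as a repetition, a step, or a leap."""
    size = abs(semitones)
    if size == 0:
        return "repetition"
    return "step" if size <= STEP_MAX_SEMITONES else "leap"


def classify_melody(midi_pitches):
    """Classify every successive interval in a list of MIDI pitches."""
    return [
        classify_melodic_interval(later - earlier)
        for earlier, later in zip(midi_pitches, midi_pitches[1:])
    ]


# Hypothetical pitch sequence: C4 D4 E4 G4 G4 C4
print(classify_melody([60, 62, 64, 67, 67, 60]))
```

For a pentatonic repertoire of the kind mentioned above, the same sketch could be reused with the threshold raised to 3 semitones.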
Unlike the fission boundary, the temporal coherence boundary plotted in Figure 6 shows a marked positive slope. This means that, in order for a sequence of pitches to break apart into two streams, tempo is the predominant factor -- although pitch interval continues to play a significant role. If the temporal coherence boundary influences musical organization, we might expect to see trade-offs between pitch interval and instantaneous tempo. For example, we might predict that large pitch leaps would be associated with tones of long duration.
Van Noorden (1975, p.48) and Shepard (1981, p.319) independently drew attention to the similarity between Miller and Heise's trill results and Körte's third law of apparent motion in vision (Körte, 1915). Körte carried out a number of experiments using two lamps that could be alternately switched on-and-off. Körte found that the sense of apparent motion depends on the distance separating the two lamps and their speed of switching. If the lamps are placed farther apart, then the rate of switching must be reduced in order to maintain a sense of apparent motion between the lamps. If the switching rate is too fast, or the lamps are placed especially far apart, then the viewer sees two independent flickering lights with no sense of intervening motion. A possible explanation for this loss of apparent motion is that it is implausible for a single real-world object to move in accordance with the presumed trajectory.
The parallel between Körte's results and Miller and Heise's trills is obvious and direct. It would seem that the sense of continuation between two tones is an auditory analog to apparent motion in vision. Note that both Körte's results and Miller and Heise's results pertain to perception. Research by Huron and Mondor (1994) suggests that the cognitive basis for this parallel can be attributed to the real-world generation or production of motion.
In Huron and Mondor (1994), it was argued that both Körte's third law of apparent motion and Miller and Heise's results arise from the kinematic principle known as Fitts' law (Fitts, 1954). Fitts' law applies to all muscle-powered or autonomous movement. The law is best illustrated by an example (refer to Figure 10). Imagine that you are asked to alternate the point of a stylus as rapidly as possible back-and-forth between two circular targets. Fitts' law states that the speed with which you are able to accomplish this task increases with the size of the targets and decreases with the distance separating them. Faster speeds are achieved when the targets are large and close together -- as in the lower pair of circles in Figure 10.
Fig. 10. Two pairs of circular targets illustrating Fitts' law. The subject is asked to alternate the point of a stylus back-and-forth as rapidly as possible between two targets. The minimum duration of movement between targets depends on the distance separating the targets as well as target size. Hence, it is possible to move more rapidly between the lower pair of targets. Fitts' law applies to all muscle motions, including the motions of the vocal muscles. Musically, the distance separating the targets can be regarded as the pitch distance between two tones, whereas the size of the targets represents pitch accuracy or intonation. Fitts' law predicts that if the intonation remains fixed, then vocalists will be unable to execute wide intervals as rapidly as for small intervals.
Because Fitts' law applies to all muscular motion, and because vocal production and instrumental performance involve the use of muscles, Fitts' law also constrains the generation or production of sound. Imagine, for example, that the circular targets in Figure 10 are arranged vertically rather than horizontally. From the point of view of vocal production, the distance separating the targets may be regarded as the pitch distance between two tones. The size of the targets represents the pitch accuracy or intonation. Fitts' law tells us that, if the intonation remains fixed, then vocalists will be unable to execute wide intervals as rapidly as for small intervals.
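For reference, the form of Fitts' law usually cited from Fitts (1954) is logarithmic. The symbols below are introduced here for illustration and do not appear in the article itself: MT is the movement time between targets, D is the distance between target centers, W is the target width, and a and b are empirically fitted constants.

```latex
MT = a + b \,\log_2\!\left(\frac{2D}{W}\right)
```

Under the musical reading given above, D would correspond to the pitch distance between two tones and W to the tolerated intonation range, so that for a fixed W a wider interval implies a longer minimum time per tone.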
In Huron and Mondor (1994), both auditory streaming and melodic practice were examined in light of Fitts' law. In the first instance, two perceptual experiments showed that increasing the variability of pitches (the auditory equivalent of increased target size) caused a reduction in the fission of auditory streams for rapid pitch alternations -- as predicted by Fitts' law. With regard to the perception of melody, Huron and Mondor noted that empirical observations of preferred performance practice (Sundberg, Askenfelt & Frydén, 1983) are consistent with Fitts' law. When executing wide pitch leaps, listeners prefer the antecedent tone of the leap to be extended in duration. In addition, a study of the melodic contours in a cross-cultural sample of several thousand melodies showed them to be consistent with Fitts' law. Specifically, as the interval size increases, there is a marked tendency for the antecedent and consequent pitches to be longer in notated or performed duration. This phenomenon accords with van Noorden's temporal coherence boundary. In short, when large pitch leaps are in danger of evoking stream segregation, the instantaneous tempo for the leap is reduced. This relationship between pitch interval and interval-duration is readily apparent in common melodies such as My Bonnie Lies Over the Ocean or Somewhere Over the Rainbow. Large pitch intervals tend to be formed using tones of longer duration -- a phenomenon we might dub "leap-lengthening."
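A tabulation of the kind that would reveal leap-lengthening can be sketched as follows. This is only an illustration of the idea, not the procedure used by Huron and Mondor: the data representation (MIDI pitch paired with duration in seconds), the function name, and the toy melody are all assumptions, and the durations are invented values rather than measurements.

```python
from collections import defaultdict


def mean_duration_by_interval(notes):
    """Map each absolute interval size (semitones) to the mean duration
    of the tones forming it.

    notes: list of (midi_pitch, duration_seconds) tuples.
    For each interval, the antecedent and consequent durations are averaged.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (pitch_a, dur_a), (pitch_b, dur_b) in zip(notes, notes[1:]):
        size = abs(pitch_b - pitch_a)
        totals[size] += (dur_a + dur_b) / 2.0
        counts[size] += 1
    return {size: totals[size] / counts[size] for size in sorted(totals)}


# Invented toy melody (NOT data from any study): a leap of 8 semitones
# is formed by two longer tones.
toy_melody = [(60, 0.25), (62, 0.25), (64, 0.5), (72, 0.5), (71, 0.25), (69, 0.25)]
result = mean_duration_by_interval(toy_melody)
print(result)
```

If leap-lengthening holds in a real corpus, the mean durations would be expected to rise with interval size; the toy melody is merely constructed to show that pattern.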
It would appear that both Körte's third law of apparent motion and the temporal coherence boundary in auditory streaming can be attributed to the same cognitive heuristic: Fitts' law. The brain is able to perceive apparent motion only if the visual evidence is consistent with how motion occurs in the real world. Similarly, an auditory stream is most apt to be perceived when the pitch-time trajectory conforms to how real (mechanical or physiological) sound sources behave. In short, it may be that Fitts' law provides the origin for the common musical metaphor of "melodic motion" (see also Gjerdingen, 1994).
Further evidence in support of a mechanical or physiological origin for pitch proximity has been assembled by von Hippel (2000). An analysis of melodies from cultures spanning four continents shows that melodic contours can be modeled very well as arising from two constraints -- dubbed ranginess and mobility. Such a model of melodic contour is able to account for a number of melodic phenomena, notably the observation that large leaps tend to be followed by a reversal in melodic direction.
As early as 1863, Helmholtz suggested that similar pitch motion contributes to the perceptual fusion of concurrently-sounded tones. In recent decades, Helmholtz's suggestion has received considerable empirical confirmation and extension. Chowning (1980) vividly demonstrated how coordinated frequency modulations would cause computer-generated voices to fuse, whereas miscoordinated modulations would cause the sounds to segregate. (This phenomenon was used in Chowning's 1979 composition, Phone.) Bregman and Doehring (1984) experimentally showed that tonal fusion is enhanced significantly when two tones are modulated by correlated changes of log-frequency. Indeed, the degree of tonal fusion is greater for tones of changing pitch than for tones of static pitch (McAdams, 1989).
In an extensive series of experiments, McAdams (1982, 1984) demonstrated that co-modulations of frequency that preserve the frequency ratios of partials promote tonal fusion. McAdams also showed that positively-correlated pitch motions that are not precise with respect to log-frequency tend to contribute to tonal fusion as well. In other words, tonal fusion is most salient when co-modulation is precise with respect to log-frequency and the frequencies of the two tones are harmonically related. Tonal fusion is next most salient when co-modulation is precise with respect to log-frequency and the frequencies of the two tones are not harmonically related. Finally, of these three cases, tonal fusion is least salient when co-modulation is positively correlated, but not precise with respect to log-frequency. [10]
Apart from the empirical evidence, the pitch co-modulation principle is evident in musical practice as well. In traditional music theory, theorists distinguish two types of positively-correlated pitch motion: similar motion and parallel motion. These types of positively-correlated pitch motions might be collectively referred to as semblant motions. In Huron (1989a), it was shown that polyphonic composers (not surprisingly) avoid semblant pitch motions -- both parallel and similar contrapuntal motions. Moreover, it was shown that parallel pitch motions are avoided more than similar motions. Finally, it was shown that parallel motions are most avoided in the case of intervals that tend most to promote tonal fusion: unisons, octaves, and perfect fifths in particular. Once again, both traditional musical terminology and musical practice are consistent with the perceptual evidence.
Drawing on this research, we might formulate the following principle:
6. Pitch Co-modulation Principle. The perceptual union of concurrent tones is encouraged when pitch motions are positively correlated. Perceptual fusion is most enhanced when the correlation is precise with respect to log frequency.
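The categories of contrapuntal motion discussed above can be sketched in code. The sketch assumes MIDI note numbers, so that equal semitone displacements correspond to motion that is precise with respect to log-frequency. Note that this makes "parallel" stricter than in traditional theory, where parallel motion is usually judged by generic interval size (for example, a major third moving to a minor third still counts as parallel thirds). The function name is illustrative.

```python
def classify_motion(upper_from, upper_to, lower_from, lower_to):
    """Classify the motion between two voices across two successive sonorities.

    All arguments are MIDI note numbers. "parallel" here means both voices
    move by exactly the same number of semitones (an assumption of this
    sketch); "similar" means same direction but different displacement.
    """
    upper_move = upper_to - upper_from
    lower_move = lower_to - lower_from
    if upper_move == 0 and lower_move == 0:
        return "static"
    if upper_move == 0 or lower_move == 0:
        return "oblique"
    if (upper_move > 0) != (lower_move > 0):
        return "contrary"
    return "parallel" if upper_move == lower_move else "similar"


# Both voices rise by a whole tone: positively correlated and precise
# with respect to log-frequency.
print(classify_motion(67, 69, 60, 62))
```

In the terminology of the text, "parallel" and "similar" together make up the semblant motions.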
For the music theorist, the observations made in Part I are highly suggestive concerning the origin of the traditional rules of voice-leading. It is appropriate, however, to make the relationship between the perceptual principles and the voice-leading rules explicit. In this section, we will attempt to derive the voice-leading rules from the six principles described above. The derivation given here is best characterized as heuristic rather than formal. The purpose of this exercise is to clarify the logic, to avoid glossing over details, to more easily expose inconsistencies, and to make it easier to see unanticipated repercussions.
In the ensuing derivation, the following abbreviations are used:
| G | goal |
| A | empirical axiom (i.e., an experimentally-determined fact) |
| C | corollary |
| D | derived musical rule or heuristic (traditional: commonly stated in music theory sources) |
| [D] | derived musical rule or heuristic (non-traditional) |
In formal logic, an axiom is defined as a basic proposition (a statement) that is asserted without proof. However, in this case, each of the axioms presented is supported by the experimental research outlined in Part I. Thus, we might refer to these axioms as "empirical axioms."
First, we can define the goal of voice-leading:
We may begin by introducing our first empirical axiom: the toneness principle:
- [D1.] Toneness Rule. Voice-leading should employ tones that evoke strong, unique pitch sensations. This is best achieved using harmonic complex tones.
- [C2.] Voice-leading is less effective using noises or tones with inharmonic partials.
In light of these non-traditional rules, a few pertinent observations may be made about musical practice. First, most of the world's music-making relies on instruments that produce tones exhibiting harmonic spectra. Instruments that produce inharmonic spectra (e.g., drums) are less likely to be used for voice-leading or melodic purposes. Those percussion instruments that are used melodically (e.g., glockenspiels, marimbas, gamelans, steel drums, etc.) are more apt to produce pitched tones. Carillonneurs have long noted that carillons are ill-suited to contrapuntal writing (Price, 1984; p.310).
Continuing with the toneness principle:
- D2. Registral Compass Rule. Voice-leading is best practiced in the region between F2 and G5, roughly centered near D4.
Adding the principle of temporal continuity:
- [D3.] Sustained Tones Rule. In general, effective voice-leading is best assured by employing sustained tones in close succession, with few silent gaps or interruptions.
Adding the minimum masking principle:
- D4. Chord Spacing Rule. In general, chordal tones should be spaced with wider intervals between the lower voices.
- [D5.] Tessitura-Sensitive Spacing Rule. It is more important to have large intervals separating the lower voices in the case of sonorities that are lower in overall pitch.
Recall that Huron and Sellmer (1992) showed that musical practice is indeed consistent with this latter rule -- as predicted by Plomp and Levelt (1965).
Adding the principle of tonal fusion:
- D6. Avoid Unisons Rule. Avoid shared pitches between voices.
- [D7.] Avoid Octaves Rule. Avoid the interval of an octave between two concurrent voices.
- [D8.] Avoid Perfect Fifths Rule. Avoid the interval of a perfect fifth between two concurrent voices.
In general,
- [D9.] Avoid Tonal Fusion Rule. Avoid unisons more than octaves, and octaves more than perfect fifths, and perfect fifths more than other intervals.
Recall that in Huron (1991b), Bach's polyphonic practice was shown to be consistent with the above non-traditional rules. As simple static intervals, unisons, octaves and fifths occur less frequently than would occur in a random juxtaposition of parts.
Turning now to the pitch proximity principle:
- D10. Common Tone Rule. Pitch-classes common to successive sonorities are best retained as a single pitch that remains in the same voice.
- D11. Conjunct Movement Rule. If a voice cannot retain the same pitch, it should preferably move by step.
- C3. Avoid Leaps Rule. Avoid wide pitch leaps.
(This last rule is merely a corollary of the Common Tone and Conjunct Movement rules.) When pitch movements exceed the fission boundary, effective voice-leading is best maintained when the temporal coherence boundary is not also exceeded:
- [D12.] Leap-Lengthening Rule. Where wide leaps are unavoidable, use long durations for either one or both of the tones forming the leap.
Leap-lengthening (consistent with Fitts' law) has been observed in a cross-cultural sample of melodies by Huron and Mondor.
Bregman (1990) has characterized streaming as a competition between possible alternative organizations. It is not simply a matter that two successive pitches need to be relatively close together in order to form a stream. The pitches must be closer than other possible pitch-time traces. That is, previous pitches compete for subsequent pitches:
- D13. Nearest Chordal Tone Rule. Parts should connect to the nearest chordal tone in the next sonority.
- D14. Part-Crossing Rule. Avoid the crossing of parts with respect to pitch.
In a study of part-crossing in 105 works by J.S. Bach, Huron (1991a) showed that of three competing interpretations of the injunction against part-crossing, actual musical practice was most consistent with the principle of pitch proximity.
- D15. Pitch Overlapping Rule. Avoid "overlapped" parts in which a pitch in an ostensibly lower voice is higher than the subsequent pitch in an ostensibly higher voice.
Once again, failure to obey this rule will conflict with the pitch proximity principle. Each pitch might be likened to a magnet, attracting the nearest subsequent pitch as its partner in the stream. [11]
Note that a simple general heuristic for maintaining within-voice pitch proximity and avoiding proximity between voices is to place each voice or part in a unique pitch region or tessitura. As long as voices or parts remain in their own pitch "territory" there is little possibility of part-crossing, overlapping, or other proximity-related problems. There may be purely idiomatic reasons to restrict an instrument or voice to a given range, but there are also good perceptual reasons for such restrictions -- provided the music-perceptual goal is to achieve optimum stream segregation. It is not surprising that in traditional harmony, parts are normally referred to by the names of their corresponding tessituras: soprano, alto, etc.
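The part-crossing and pitch-overlapping rules lend themselves to a mechanical check. The following is a minimal sketch, assuming two voices given as equal-length lists of MIDI note numbers, one pitch per sonority. The function names are illustrative. The overlap check also flags the mirror case (an upper-voice pitch lower than the subsequent lower-voice pitch), which is an assumption added here and goes beyond the wording of rule D15.

```python
def find_crossings(upper, lower):
    """Return indices of sonorities where the nominally upper voice
    sounds below the nominally lower voice."""
    return [i for i, (u, l) in enumerate(zip(upper, lower)) if u < l]


def find_overlaps(upper, lower):
    """Return indices i such that moving from sonority i to i + 1
    produces an overlap between the two voices.

    An overlap is flagged when the lower voice's pitch at i is higher than
    the upper voice's pitch at i + 1 (rule D15), or, as an added assumption,
    when the upper voice's pitch at i is lower than the lower voice's
    pitch at i + 1.
    """
    hits = []
    for i in range(min(len(upper), len(lower)) - 1):
        if lower[i] > upper[i + 1] or upper[i] < lower[i + 1]:
            hits.append(i)
    return hits


# Hypothetical two-voice fragment: no crossing, but the upper voice's second
# pitch (G4 = 67) falls below the lower voice's first pitch (A4 = 69).
upper_voice = [72, 67, 71]
lower_voice = [69, 64, 65]
print(find_crossings(upper_voice, lower_voice))
print(find_overlaps(upper_voice, lower_voice))
```

Keeping each voice within its own tessitura, as suggested above, would make both lists empty by construction.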
Adding now the pitch co-modulation principle:
- [D16.] Semblant Motion Rule. Avoid similar or parallel pitch motion between concurrent voices.
Huron (1989a) showed that polyphonic composers do indeed tend to avoid positively correlated pitch motions. Moreover, parallel motion tends to be avoided for all interval sizes -- not just in the case of tonally fused intervals such as unisons, octaves, and perfect fifths. In addition, parallel motion is avoided more than similar motion.
- [D17.] Parallel Motion Rule. Avoid parallel motion more than similar motion.
Each of the preceding voice-leading rules is deduced from a single perceptual principle. Where two or more principles pertain to the same goal, then transgressing more than one such principle will have more onerous consequences than transgressing only one of the principles. Conversely, the consequences of transgressing a single such principle may be minimized by ensuring complete conformity with the remaining principles -- provided all principles pertain to the same goal. The six principles outlined above all pertain to the optimization of auditory images and streaming.
There are other principles that relate to the optimization of auditory streaming, but we will defer deriving the associated rules until Part III. Even in the case of the six stream-related principles identified above, 720 (6 factorial) propositions arise from permuting all the principles. We will present only some of these derivations.
If we link the tonal fusion and pitch proximity principles:
- [D18.] Oblique Approach to Fused Intervals Rule. When approaching unisons, octaves, or fifths, it is best to retain the same pitch in one of the voices (i.e., approach by oblique motion).
- [D19.] Avoid Disjunct Approach to Fused Intervals Rule. If it is not possible to approach unisons, octaves and fifths by retaining the same pitch (oblique motion), step motion should be used.
- [D20.] Avoid Semblant Approach between Fused Intervals Rule. Avoid similar pitch motion in which the voices employ unisons, octaves, or perfect fifths. (For example, when both parts ascend beginning an octave apart, and end a fifth apart.)
Joining the tonal fusion, pitch proximity, and pitch co-modulation principles, we find that:
- D21. Parallel Unisons, Octaves, and Fifths Rule. Avoid parallel unisons, octaves, or fifths.
- D22. Exposed Intervals Rule. When approaching unisons, octaves, or fifths, by similar motion, at least one of the voices should move by step.
Note that in traditional treatises on voice-leading most theorists restrict the exposed intervals rule to the case of approaching an octave (fifteenth, etc.) -- i.e., "exposed octaves." Only a minority of theorists extend this injunction to approaching perfect fifths (twelfths, etc.) as well. Since octaves engender tonal fusion more easily than perfect fifths, the perceptual principles account for why there would be greater unanimity of musical opinion against exposed octaves than for exposed fifths.
Observe that the principles outlined here are not able to account for the fact that most theorists restrict the exposed intervals rule to voice-leading involving the bass and soprano voices. Huron (1989b) has shown that outer voices are more easily perceived than inner voices; hence one might argue that voice-leading transgressions in outer voices are more noticeable. However, one might just as convincingly argue that since the streaming of the inner voices is more apt to be confounded, it is more important that inner voices conform to the voice-leading principles.
The exposed octaves rule is especially revealing. In essence, the exposed octaves rule says: "If you are going to violate principle A, and if you are going to violate principle B, be certain not to violate principle C as well." The exposed octaves rule inextricably links three perceptual principles together and logically implies that all three principles share some sort of common goal. Specifically, the exposed octaves rule confirms that the principle of tonal fusion, the pitch proximity principle, and the pitch co-modulation principle all contribute to the achievement of the same goal -- namely, the optimization of stream segregation.
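The way the exposed intervals rule combines three conditions can be made concrete with a short sketch. It assumes MIDI note numbers with the upper voice at or above the lower voice, treats compound intervals (twelfths, fifteenths) as equivalent to fifths and octaves by reducing modulo 12, and takes "step" to mean 2 semitones or fewer. The names and return labels are illustrative, and the check covers only rules D21 and D22.

```python
# Interval classes (semitones modulo 12) treated as tonally fused:
# 0 = unison/octave and compounds, 7 = perfect fifth and compounds.
FUSED_INTERVAL_CLASSES = {0, 7}


def forms_fused_interval(upper, lower):
    """True if the two pitches form a unison, octave, or perfect fifth."""
    return (upper - lower) % 12 in FUSED_INTERVAL_CLASSES


def check_approach(upper_from, upper_to, lower_from, lower_to):
    """Return "parallel", "exposed", or "ok" for one two-voice progression."""
    upper_move = upper_to - upper_from
    lower_move = lower_to - lower_from
    # Co-modulation: both voices move, and in the same direction.
    same_direction = (
        upper_move != 0 and lower_move != 0 and (upper_move > 0) == (lower_move > 0)
    )
    # Tonal fusion: the interval arrived at is a unison, octave, or fifth.
    if not same_direction or not forms_fused_interval(upper_to, lower_to):
        return "ok"
    if upper_move == lower_move:
        return "parallel"  # D21: the fused interval is carried along unchanged
    # Pitch proximity: at least one voice moving by step excuses the approach.
    if abs(upper_move) <= 2 or abs(lower_move) <= 2:
        return "ok"
    return "exposed"  # D22: all three principles are violated at once


# E4 -> G4 over G3 -> C4: similar motion by leap into a perfect fifth.
print(check_approach(64, 67, 55, 60))
```

The three tests inside `check_approach` correspond to the three principles that, according to the text, the exposed octaves rule links together.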
As in any deductive system, changing one or two axioms will result in a different set of derivations. As in the case of non-Euclidean geometry, it is theoretically possible for composers to devise different voice-leading systems by adding, eliminating, or modifying one or more of the axioms. Changing the axioms will constrain the canon in distinctive ways, and hence result in different "musics."
In the remainder of this article, we consider four examples of auxiliary perceptual principles. When these auxiliary principles are added to the six core principles, a number of additional propositions arise. As noted in the introduction, these auxiliary principles in effect constitute voice-leading "options" that will shape the music-making in perceptually distinctive ways. At the end of our discussion we will identify how these auxiliary principles may contribute to the differentiation of musical genres.
The four auxiliary principles discussed in the remainder of this article are (7) onset synchrony, (8) limited density, (9) timbral differentiation, and (10) source location.
Sounds are more apt to be perceived as components of a single auditory image when they are coordinated in time. That is, concurrent tones are much more apt to be interpreted by the auditory system as constituents of a single complex sound event when the tones are temporally aligned. Since judgments about sounds tend to be made in the first few hundred milliseconds, the most important aspect of temporal coordination is the synchronization of sound onsets. Sounds whose onsets are uncoordinated in time are likely to be perceived as distinct or separate events (Bregman & Pinker, 1978). Onset synchronization may be regarded as a special case of amplitude co-modulation. Apart from the crude coordination of tone onsets, the importance of correlated changes of amplitude has been empirically demonstrated for much more subtle amplitude deviations. For example, Bregman, Abramson, Doehring, and Darwin (1985) have demonstrated that the co-evolution of amplitude envelopes contributes to the perception of tonal fusion.
The concept of synchronization raises the question of what is meant by "at the same time." Under ideal conditions, it is possible for listeners to distinguish the order of two clicks separated by as little as 20 msec (Hirsh, 1959). However, in more realistic listening conditions, onset differences can be substantially greater and yet retain the impression of a single onset. Initial transients for natural complex tones may be spread over 80 msec or more, even for tones with rapid attack envelopes (Saldanha & Corso, 1964; Strong & Clark, 1967). Many sounds have much more gradual envelopes, and in the case of echoes, more than 100 msec between repetitions may be required in order for successive sounds to be perceived as distinct events (Beranek, 1954).
In the case of the synchronization of note onsets in music, Rasch (1978, 1979, 1988) has collected a wealth of pertinent data. Ensemble performances of nominally concurrent notes show that onsets are typically spread over a range of 30 to 50 msec. Moreover, in experiments with quasi-simultaneous musical tones, Rasch found that the effect of increasing asynchrony is initially to add "transparency" to multi-tone stimuli -- rather than to evoke separate sound images (Rasch, 1988; p.80). Onset asynchronies have to be comparatively large before separate sound events are perceived.
Although musical notation provides only a crude indication of the temporal aspects of real performances, Rasch's work suggests that synchronously and asynchronously notated events are likely to be perceived as such. Of course tones that are notated as having concurrent onsets are by no means performed precisely together. But in typical performances, such tones are obviously more likely to promote the perception of tonal fusion than tones that are notated as having asynchronous onsets. A difference of a sixteenth-duration at a tempo of 80 quarter-notes per minute produces a nominal asynchrony of 187 msec, a time delay that is more than sufficient to elicit the perception of separate events. In short, onsets that are deemed "asynchronous" according to musical notation are unlikely to be perceived as synchronous, whereas onsets that are deemed "synchronous" according to musical notation may well be perceived as being synchronous. Thus musical notation provides a useful (although not infallible) indication of perceived onset synchronies and asynchronies in composed music.
Drawing on this research, we might formulate the following principle:
7. Onset Synchrony Principle. If a composer intends to write music in which the parts have a high degree of perceptual independence, then synchronous note onsets ought to be avoided. Onsets of nominally distinct sounds should be separated by 100 milliseconds or more.
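The arithmetic behind the nominal asynchrony figure quoted above can be checked with a few lines of code. The function names are illustrative; the 100 msec threshold is taken from the statement of the principle.

```python
# Minimum onset separation named in the onset synchrony principle.
SEGREGATION_THRESHOLD_MS = 100.0


def note_duration_ms(quarter_notes_per_minute, fraction_of_quarter):
    """Nominal duration in milliseconds of a note value expressed as a
    fraction of a quarter note (a sixteenth is 0.25)."""
    return 60_000.0 / quarter_notes_per_minute * fraction_of_quarter


def fastest_tempo_for_offset(fraction_of_quarter, threshold_ms=SEGREGATION_THRESHOLD_MS):
    """Fastest tempo (quarter notes per minute) at which the given note
    value still lasts at least threshold_ms."""
    return 60_000.0 * fraction_of_quarter / threshold_ms


sixteenth_at_80 = note_duration_ms(80, 0.25)
print(sixteenth_at_80)                  # 187.5
print(fastest_tempo_for_offset(0.25))   # 150.0
```

A sixteenth-note offset at 80 quarter-notes per minute therefore comes to 187.5 msec nominally, and by the same arithmetic a sixteenth-note offset remains at or above 100 msec for tempi up to 150 quarter-notes per minute.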
In light of the onset synchrony principle, musical practice is equivocal. Some musics conform to this principle, whereas other musics do not. Rasch (1981) carried out an analysis of vocal works by Praetorius and showed that onset asynchrony between the parts increases as the number of voices is increased. Similarly, Huron (1993a) showed that J.S. Bach minimizes the degree of onset synchrony in his polyphonic works. Figure 11 shows onset synchrony autophases for a random selection of 10 of Bach's 15 two-part keyboard Inventions (BWVs 772-786). A mean onset synchrony function is also plotted (solid line). The values in the horizontal center of the graph indicate the degree of onset synchrony for the work; all other values show the degree of onset synchrony for control conditions in which the parts have been shifted with respect to each other, measure by measure. The dips at zero degrees are indicative of Bach's efforts to avoid synchronous note onsets between the parts.
Fig. 11. Onset synchrony autophases for a random selection of 10 of Bach's 15 two-part keyboard Inventions (BWVs 772-786); see Huron (1993a). Values plotted at zero degrees indicate the proportion of onset synchrony for the actual works. All other phase values indicate the proportion of onset synchrony for re-arranged music -- controlling for duration, rhythmic order, and meter. The dips at zero degrees are consistent with the hypothesis that Bach avoids synchronous note onsets between the parts. Solid line plots a mean onset synchrony function for all data.
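The general idea of an autophase (measuring onset synchrony for the actual alignment and then for measure-by-measure shifts of one part against the other) can be illustrated with a toy sketch. This is not the measurement procedure used in the studies cited: the synchrony measure below (shared onset positions divided by all distinct onset positions), the cyclic rotation of measures, the data layout, and the two-part example are all assumptions made for illustration.

```python
def onset_synchrony(part_a, part_b):
    """Proportion of onset moments shared by both parts.

    Each part is a list of measures; each measure is a set of onset
    positions within the measure (e.g., counted in eighth notes).
    """
    shared = 0
    total = 0
    for onsets_a, onsets_b in zip(part_a, part_b):
        shared += len(onsets_a & onsets_b)
        total += len(onsets_a | onsets_b)
    return shared / total if total else 0.0


def autophase(part_a, part_b):
    """Onset synchrony for every cyclic measure-by-measure shift of part_b.

    Index 0 is the actual alignment; the other indices are control
    conditions with part_b rotated by that many measures.
    """
    return [
        onset_synchrony(part_a, part_b[shift:] + part_b[:shift])
        for shift in range(len(part_b))
    ]


# Invented four-measure example with complementary rhythms (NOT taken from
# any actual work): the parts never attack together when correctly aligned.
part_a = [{0, 2, 4, 6}, {0, 1, 2, 3}, {0, 4}, {0, 2, 4, 6}]
part_b = [{1, 3, 5, 7}, {4, 5, 6, 7}, {2, 6}, {1, 3, 5, 7}]
curve = autophase(part_a, part_b)
print(curve)
```

In this constructed example the value at shift 0 is the minimum of the curve, which is the kind of dip the figure reports for the Inventions; a homophonic texture would instead be expected to peak at shift 0.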
By contrast, in non-polyphonic works, composers routinely contradict this principle. Figure 12 compares onset autophases for two different works. The first graph shows an autophase for the three-part Sinfonia No. 1 (BWV 787). The second graph shows an autophase for the four-part hymn Adeste Fideles. The Y-values at (0,0) in each X-Z plane indicate the degree of onset synchrony for the work. All other values show the degree of onset synchrony for control conditions in which the parts have been systematically shifted with respect to each other, measure by measure. Since there are more than two parts, the resulting autophases are multi-dimensional rather than two-dimensional. The four-part Adeste Fideles has been simplified to three dimensions by phase-shifting only two of the three remaining voices with respect to a stationary soprano voice. As can be seen, in the Sinfonia No. 1 there is a statistically significant dip at (0,0) consistent with the hypothesis that onset synchrony is actively avoided. In Adeste Fideles by contrast, there is a significant peak at (0,0) (at the back of the graph) consistent with the hypothesis that onset synchrony is actively sought. The graphs formally demonstrate that any realignment of the parts in Bach's Sinfonia No. 1 would result in greater onset synchrony, whereas any realignment of the parts in Adeste Fideles would result in less onset synchrony.
Fig. 12. Three-dimensional onset synchrony autophases comparing exemplar polyphonic and homophonic works. (a) three-part Sinfonia No. 1 (BWV 787) by J.S. Bach; (b) four-part hymn Adeste Fideles. The vertical axis indicates the onset synchrony measured according to "Method 3" described in Huron (1989a; pp.122-123). The horizontal axes indicate measure-by-measure shifts of two of the voices with respect to the remaining voice(s). Figure 12a shows a marked dip at the origin (front), whereas Figure 12b shows a marked peak at the origin (back). The horizontal axes in the two graphs have been reversed in order to maintain visual clarity. The graphs formally demonstrate that any realignment of the parts in Bach's Sinfonia No. 1 would result in greater onset synchrony, whereas any realignment of the parts in Adeste Fideles would result in less onset synchrony.
Musicians recognize this difference as one of musical texture. The first work is polyphonic in texture whereas the second work is homophonic in texture. In a study comparing works of different texture, Huron (1989c) examined 143 works by several composers that were classified a priori according to musical texture. Discriminant analysis [12] was then employed in order to discern what factors distinguish the various types of textures. Of six factors studied, it was found that the most important discriminator between homophonic and polyphonic music is the degree of onset synchrony.
As we have noted, an auxiliary principle (such as the onset synchrony principle) can provide a supplementary axiom to the core principles, and allow further voice-leading rules to be derived. Once again, space limitations prevent us from considering all of the interactions. However, consider the joining of the onset synchrony principle with the principle of tonal fusion. In perceptual experiments, Vos (1995) has shown that there is an interaction between onset synchrony and tonal fusion in the formation of auditory streams.
- [D23.] Asynchronous Preparation of Tonal Fusion Rule. When approaching unisons, octaves, or fifths, avoid synchronous note onsets.
Figure 13 shows the results from an unpublished study of the interaction between onset synchrony and tonal fusion for polyphonic works (fugues) and homophonic works (chorales) by J.S. Bach. In both homophonic and polyphonic repertoires, dissonant intervals are "prepared" via asynchronous onsets. Imperfect intervals in both repertoires show a tendency to favor synchronous note onsets. However, for perfect intervals, the majority of occurrences are asynchronous in the polyphonic repertoire, whereas the majority are synchronous in the homophonic repertoire. Moreover, for the perfect intervals alone, there is a positive correlation between the degree to which an interval promotes tonal fusion and the likelihood of asynchronous occurrence. In short, the above non-traditional rule is obeyed by Bach -- but only in his polyphonic writing.
Fig. 13. Results of a comparative study of tonally fused intervals in polyphonic and (predominantly) homophonic repertoires by J.S. Bach. Only perfect harmonic intervals are plotted (i.e., fourths, fifths, octaves, twelfths, etc.). As predicted, in polyphonic textures, most perfect intervals are approached with asynchronous onsets (one note sounding before the onset of the second note of the interval). In the more homophonic chorale repertoire, most perfect intervals are formed synchronously; that is, both notes tend to begin sounding at the same moment. By contrast, the two repertoires show no differences in their approach to imperfect and dissonant intervals.
The existence of homophonic music raises an interesting perceptual puzzle. On the one hand, most homophonic writing obeys all of the traditional voice-leading rules outlined in Part I of this article. (Indeed, most four-part harmony is written within a homophonic texture.) This suggests that homophonic voice-leading is motivated (at least in part) by the goal of stream segregation -- that is, the creation of perceptually independent voices. However, because onset asynchrony also contributes to the segregation of auditory streams, why would homophonic music not also follow this principle? In short, why does homophonic music exist? Why isn't all multipart music polyphonic in texture?
The preference for synchronous onsets may originate in any of a number of music-related goals, including social, historical, stylistic, aesthetic, formal, perceptual, emotional, pedagogical, or other goals that the composer may be pursuing. If we wish to posit a perceptual reason to account for this apparent anomaly, we must assume the existence of some additional perceptual goal that has a higher priority than the goal of stream segregation. More precisely, this hypothetical perceptual goal must relate to some aspect of temporal organization in music. Two plausible goals come to mind.
In the first instance, the texts in vocal music are better understood when all voices utter the same syllables concurrently. Accordingly, we might predict that vocal music is less likely to be polyphonic than purely instrumental music. Moreover, when vocal music is truly polyphonic, it may be better to use melismatic writing and reserve syllable onsets for occasional moments of coincident onsets between several parts. In short, homophonic music may have arisen from the concurrent pursuit of two perceptual goals: the optimization of auditory streaming, and the preservation of the intelligibility of the text. Given these two goals, it would be appropriate to abandon the pursuit of onset asynchrony in deference to the second goal.
Another plausible account of the origin of homophony may relate to rhythmic musical goals. Although virtually all polyphonic music is composed within a metric context, the rhythmic effect of polyphony is typically that of an ongoing "braid" of rhythmic figures. The rhythmic strength associated with marches and dances (pavans, gigues, minuets, etc.) cannot be achieved without some sort of rhythmic uniformity. Bach's three-part Sinfonia No. 1 referred to in connection with Figure 12 appears to have little of the rhythmic drive evident in Adeste Fideles. In short, homophonic music may have arisen from the concurrent pursuit of two goals: the optimization of auditory streaming, and rhythmic uniformity. As in the case of the "text-setting" hypothesis, this hypothesis generates a prediction -- namely, that works (or musical forms) that are more rhythmic in character (e.g., cakewalks, sarabandes, waltzes) will be less polyphonic than works following less rhythmic forms.
Listeners do not have an unbounded capacity to track multiple concurrent lines of sound. Huron (1989b) found that as the number of concurrent voices in a polyphonic texture increases, expert listeners are both slower to recognize the addition of new voices, and more prone to underestimate the number of voices present. Figure 14 plots experimental data from five expert musicians. The solid columns show mean estimation errors for polyphonic textures of varying textural densities; the shaded columns indicate mean error rates for single-voice entries. For polyphonic textures employing relatively homogeneous timbres, the accuracy of identifying the number of concurrent voices drops markedly at the point where a three-voice texture is augmented to four voices. Beyond three voices, tracking confusions are commonplace. In a study of isolated chords, Parncutt (1993) found similar results. When asked to count the number of tones in chords constructed using tones with octave-spaced partials (Shepard tones), Parncutt found that listeners make significant errors (underestimation) once the number of chromas exceeds three.
Fig. 14. Voice-tracking errors while listening to polyphonic music. Solid columns: mean estimation errors for textural density (no. of polyphonic voices); data for 130 trials from five expert musician subjects. Shaded columns: unrecognized single-voice entries in polyphonic listening; data for 263 trials from five expert musician subjects (Huron, 1989b). The data show that when listening to polyphonic textures employing relatively homogeneous timbres, tracking confusions are common when more than three voices are present.
Limitations in the perception of auditory numerosity may be symptomatic of a broader perceptual limitation, since the results of these studies parallel the results of similar studies in vision. A variety of counting and estimation tasks show a marked increase in confusion when the number of visual items exceeds three (Estes & Combes, 1966; Schaeffer, Eggleston & Scott, 1974; Gelman & Tucker, 1975). Descoeudres (1921) characterized this as the un, deux, trois, beaucoup phenomenon. In a study of infant perceptions of numerosity, Strauss and Curtis (1981) found evidence of this phenomenon in infants less than a year in age. Using a dishabituation paradigm, Strauss and Curtis showed that infants are readily able to discriminate two from three visual items. However, performance degrades when discriminating three from four items, and reaches chance performance when discriminating four from five items. The work of Strauss and Curtis is especially significant since pre-verbal infants can be expected to possess no explicit knowledge of counting. This implies that the perceptual confusion arising for visual and auditory fields containing more than three items is a low-level constraint that is not mediated by cognitive skills in counting.
These results are consistent with reports of musical experience offered by musicians themselves. Mursell (1937) reported a conversation with Edwin Stringham in which Stringham noted that "no human being, however talented and well trained, can hear and clearly follow more than three simultaneous lines of polyphony." A similar claim was made by composer Paul Hindemith (1944).
Drawing on this research, we might formulate the following principle:
8. Principle of Limited Density. If a composer intends to write music in which independent parts are easily distinguished, then the number of concurrent voices or parts ought to be kept to three or fewer.
Of course, most harmony writing consists of four or more parts. Indeed, most harmony texts are based exclusively on four-part writing. So once again, the question that arises regarding this principle is why it is so rarely followed.
To begin, we must recognize that a surprising amount of music actually conforms to the above principle of limited density. Consider, for example, the density of polyphonic writing in the music of J.S. Bach. Figure 15 reproduces data from Huron (1989a) showing the estimated number of auditory streams versus the nominal number of voices or parts for a large body of polyphonic music by Bach. Estimates of the number of auditory streams are calculated according to a method described and tested experimentally in Huron (1989a). The dotted line indicates a relation of unity slope. If an N-part work truly maintains a mean textural density of N auditory streams, then the plotted value for the work would be expected to lie near the unity slope (dotted line). In order for a value to lie above the dotted line, the work must contain parts that manifest a considerable degree of pseudo-polyphonic activity and must also refrain from much voice-retirement (idle parts). Conversely, in order for a value to lie below the dotted line, the work must contain significant periods of voice retirement and refrain from pseudo-polyphonic writing. Figure 15 shows that as the number of nominal voices increases, Bach gradually changes his compositional strategy. For works employing just two parts, Bach endeavors to keep the parts active (few rests of short duration) and to boost the textural density through pseudo-polyphonic writing. For works having four or more nominal voices, Bach reverses this strategy. He avoids writing pseudo-polyphonic lines and retires voices from the texture for longer periods of time. Figure 15 reveals a change of strategy consistent with a preferred textural density of three auditory streams. In light of the extant research, it would appear that Bach tends to maximize the number of auditory streams, while simultaneously endeavoring to avoid exceeding the listener's ability to track the concurrent parts.
Fig. 15. Relationship between nominal number of voices or parts and mean number of estimated auditory streams in 108 polyphonic works by J.S. Bach. Mean auditory streams for each work calculated using the stream-latency method described and tested in Huron (1989a, chapter 14). Bars indicate data ranges; plotted points indicate mean values for each repertoire. Dotted line indicates a relation of unity slope, where an N-voice polyphonic work maintains an average of N auditory streams. The graph shows evidence consistent with a preferred textural density of three auditory streams.
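The comparison against the unity-slope line can be made concrete with a rough sketch. It is emphatically not the stream-latency method of Huron (1989a), which is not described here. As a stand-in, we assume textural density is approximated by the mean number of sounding voices, which captures voice retirement but ignores pseudo-polyphony and so can never exceed the nominal number of parts. The function names, the tolerance value, and the sample data are all hypothetical.

```python
def mean_sounding_voices(parts, total_duration):
    """Time-weighted mean number of sounding voices.

    Each part is a list of (onset, duration) notes that do not
    overlap within the part; rests are simply absent.
    """
    sounding_time = sum(duration for part in parts for _, duration in part)
    return sounding_time / total_duration


def relation_to_unity(nominal_voices, estimated_density, tolerance=0.25):
    """Report whether a work lies above, below, or near the unity line."""
    difference = estimated_density - nominal_voices
    if difference > tolerance:
        return "above"
    if difference < -tolerance:
        return "below"
    return "near"


# Hypothetical four-part passage lasting 8 beats: two parts sound
# throughout, while the other two are retired for long stretches.
parts = [
    [(0, 8)],
    [(0, 8)],
    [(0, 4)],
    [(6, 2)],
]
density = mean_sounding_voices(parts, total_duration=8)
position = relation_to_unity(nominal_voices=4, estimated_density=density)
```

Under these assumptions the nominally four-part passage has a mean density of 2.75 sounding voices and falls below the unity line, which is the pattern Figure 15 reports for Bach's works with four or more nominal voices.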
As for works that regularly employ four or more concurrent parts, once again, the notion of musical genre may be helpful in understanding divergent practices. In contrast to Bach's polyphonic practice, musical works consisting of four or more concurrent parts are more likely to be perceived as homophonic in texture. By employing such large numbers of concurrent parts, homophonic textures may tend to obscure the component tones and encourage synthetic perceptions such as chords.
The influence of timbre (spectral content) on the formation of auditory streams was demonstrated by van Noorden (1975) and Wessel (1979), and tested experimentally by McAdams and Bregman (1979), Bregman, Liao, and Levitan (1990), Hartmann and Johnson (1991) and Gregory (1994). Wessel created stimuli consisting of a repeated sequence of three rising pitches (see Figure 16). The tones were generated so that successive tones alternated between two contrasting timbres. Notice that although the pitch sequence is rising, each of the "timbre sequences" produces a descending sequence.
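The structure of this stimulus can be checked with a small sketch. The specific pitches (C4, E4, G4 as MIDI numbers) and the timbre labels "A" and "B" are assumptions made for illustration; Wessel's actual stimulus parameters are not given here.

```python
def wessel_sequence(pitches=(60, 64, 67), timbres=("A", "B"), n_tones=12):
    """Repeat a rising three-pitch figure while alternating two timbres."""
    return [
        (pitches[i % len(pitches)], timbres[i % len(timbres)])
        for i in range(n_tones)
    ]


def timbre_stream(sequence, timbre):
    """Extract the pitches carried by a single timbre."""
    return [pitch for pitch, label in sequence if label == timbre]


sequence = wessel_sequence()
stream_a = timbre_stream(sequence, "A")  # [60, 67, 64, 60, 67, 64]
stream_b = timbre_stream(sequence, "B")  # [64, 60, 67, 64, 60, 67]
```

Because the pitch figure has period three and the timbres have period two, each timbre samples every second tone. Read on its own, each timbre stream cycles through high, middle, low (67, 64, 60), so it descends even though the full sequence rises.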
Fig. 16. Schematic illustration of Wessel's illusion (Wessel, 1979). A sequence of three rising pitches is constructed using two contrasting timbres (marked with open and closed noteheads). As the tempo of