in a given language is essentially moving one's body appropriately so that--according to the conventions of your language--your message is encoded in variations in air pressure triggered by those movements. These movements can be described from several interesting perspectives or levels--the neuromotor (brain) activity governing the movements, the body (vocal tract, thorax) movements themselves, the variations in air pressure (waveforms) those movements produce, and the transfer of energy from one "object" to another (resonance). The bottom line is that my speech movements cause your body to move (resonate) systematically so that your brain can interpret those movements. Of course these interpretations correspond to phonological, morphological, syntactic, semantic, and pragmatic level conventions in our language.
The purpose of these notes is to acquaint the reader with some of the very basic vocabulary involved in describing these waveforms. Terms in boldface are particularly important and are listed in the glossary at the end of this Appendix (not completed, 10/02). I will return to some basic speech movements -- voicing of the larynx -- in section 11 below.
Conversely, the process of perceiving speech is a process of sensing the movements of the speaker. Listeners perceive that movement through its effect on the air pressure at their ears; comprehension is an interaction of their perception of movement and their knowledge of the language. It is this knowledge that gives the movements meaning. Obviously, perception of the speaker's movements is a necessary but not sufficient component of comprehension.
Moving anything is work, and work requires energy. Thus in moving the air as we speak, we are working: we must move our bodies enough to set the air moving. Speaking loudly demands more energy than whispering, since the systematic disturbance of air must carry a greater distance.
Dropping a book on the floor creates a disturbance that we can hear: as the book hits, it compresses (squeezes) the air molecules between itself and the floor, momentarily increasing the pressure around the impact. These compressed molecules trigger a ripple of air movement as each molecule moves just enough to jostle its neighbor, which in turn moves its neighbor, creating a pressure wave that reaches our ears. These movements set our ears in motion, which in turn transmit nerve impulses corresponding to those movements to our brains. Our ears are extremely sensitive to very subtle vibrations or variations in air pressure. Within certain limits, the loudness of a sound corresponds to the extent of the air pressure disturbance--greater deviations from the background ambient pressure being louder. Our ears essentially serve as filters, selectively admitting some types of pressure waves over others.
Sound, then, is the listener's response over time to variation in air pressure at the listener's ear. These sound "waves" can vary both in how much the pressure changes from time to time (amplitude) and in how rapidly these changes occur (frequency). The falling-book pressure wave is similar to the ripple caused by a stone in a pond, with some interesting differences. In both cases energy is transmitted to another "object"--water in the pond, air in the room. In both cases a wave is created as the energy imparted to the object moves through the object's molecules. However, the wave moves more rapidly through the water, since the water molecules do not compress and can transfer energy more efficiently. In contrast, the air molecules are compressible, and the local pressure changes as the energy wave passes through, first rising above the average pressure (compression), then falling below it (rarefaction). The remainder of these notes expands on the physical characteristics of vocally produced pressure waves and some useful visual representations (graphs) of such waves.
The energy imparted by movements of the speech apparatus moves through air at a speed of about 1100 feet per second. The exact speed depends on various features of the medium, including its temperature, density, and humidity. For example, you might visualize the pressure wave from the dropped book as a pressure "spike" moving toward you; not until it reached your ears would you hear it. (See Figure x.1) Of course if you are nearby--say ten feet away--the wave will move so rapidly that you might think the "bang" is simultaneous with the book hitting the floor, when it actually took about 0.01 second to reach you. It is only when long distances are involved that we can easily sense the time it takes a sound wave to travel. For example, in an echo situation we create a sound and it travels to a wall and bounces back to us; or during a thunderstorm we see a lightning flash and count the seconds until the thunder, which gives us an idea of how far the lightning was from us.
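These travel times are easy to check with a little arithmetic. Here is a minimal sketch, assuming the text's round figure of 1100 feet per second; the function name and sample distances are my own, purely for illustration:

```python
# Speed of sound in room-temperature air, as given in the text (approximate).
SPEED_OF_SOUND_FT_PER_S = 1100.0

def travel_time(distance_ft):
    """Seconds for a pressure wave to cover distance_ft in air."""
    return distance_ft / SPEED_OF_SOUND_FT_PER_S

print(round(travel_time(10), 3))    # book dropped ten feet away: about 0.009 s
print(round(travel_time(1100), 1))  # a source 1100 feet away: 1.0 s
```

Since a mile is 5280 feet, the wave takes nearly five seconds per mile--which is why counting seconds between the lightning flash and the thunder gives a fair distance estimate.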
Speech is radiated from the open mouth and nasal passages as a variable air pressure wave. The shape of this wave at any moment depends on the immediately preceding movements of the articulators. If airflow is coming from both nose and mouth, you will hear the resultant interaction of the two sources. If you place a microphone a few feet away from your mouth, you can record and subsequently "see" the shape of the waveforms as they impact your ear.
A visual display plotting pressure changes over time is known as a time pressure wave or oscillogram. Air pressure may be measured in absolute units, e.g. pounds per square inch, or in terms of the power required to create those changes, e.g. watts per square meter. Normal atmospheric pressure is about 15 pounds per square inch. Commonly a relative measure, the deciBel, is used. The pressure is indicated on the y-axis and time on the x-axis. Here are several example plots.
First, let us actually look at the pressure wave created as I drop my two pound or so tax guide on the floor--a modest "kaboom" as it hits. You can see in Figure A.1 below that before the "drop" the pressure is relatively constant. At the moment of "drop"--the arrow--there is a brief period of irregular pressure peaks (compression) and valleys (rarefaction) lasting 25 to 30 milliseconds and then a rapid decline in the disturbances. Within 50 milliseconds the disturbances are inaudible--which is to say they are not reliably discriminable from the normal random variations in air pressure.
Figure A.1 Visual display of 200 msec air pressure disturbance caused
by dropping my book on the floor.
A tuning fork is a tool traditionally used to tune pianos. Typically made of solid metal, sets of forks were available that would vibrate at the frequency of every note on a piano.
When a fork is struck, the two prongs vibrate back and forth at a fixed rate, gradually declining in the amplitude of each vibration as the initial energy is expended. Unlike the dropped book or the stone in a pond, which produce only one spike or ripple in pressure, the fork produces a series of waves as it moves back and forth. Each wave or cycle consists of a high pressure compression followed by a low pressure rarefaction. For a good fork, the time for any high-low cycle is a constant, known as the period of the vibration. The fork serves as a standard against which to compare the note played by any piano key. It turns out our ears are very sensitive to very slight differences ("beats") in the pitch or frequency between fork and piano key vibrations. Any differences can be eliminated by adjusting the piano string tension.
I have an old tuning fork that is stamped 256 cycles per second, C. When I hit it with a pencil, it vibrates somewhat frantically at first, then settles into repeated vibration at that note, gradually expending its energy as it works moving the air around it (a damped wave).
Figure x.x.x Brief fragment of 256 cycle/sec tuning fork waveform
Figure x.3 electronically generated waveform (256 cycles per second)
Figure x.2 A brief fragment of saying "ahhh"
(to be added but I bet it looks like the "sh" in "shoe")
Figure x.4. A 175 millisecond fragment of saying "shoe"
I talk and my body, the air around me, and you move!
Imagine striking a piano key. Your finger imparts energy to the key, which in turn strikes a taut wire string that vibrates on impact. The resulting sound may be quite different depending on the length, mass, and tension of the string. Similarly, the same force exerted blowing into a flute or recorder will sound different depending on where your fingers are placed; different fingering leads to differing amounts of air vibrating. The point is that the same energy can result in quite a different frequency note.
Until fairly recently, recordings of sound were made by essentially building a "model" of the air pressure wave. Variations in pressure had their analog in the variations made in the wax or plastic of records. The sounds were reconstituted as the vibrating phonograph needle moved through the grooves carved into the record. Tape recorders also directly model the pressure variations in terms of magnetic flux on the tape. These methods are called analog recordings or representations, since the form of the representation is a close analog or replica of the original pressure wave. Recently computers have been employed in a quite different form of representation of sound--a numerical or digital representation in which variations in pressure are represented by numbers.
Imagine a meter monitoring the variation in air pressure as you talk. Periodically, say every millisecond, you write down the pressure indicated on the meter. This would yield a set of points that would--after connecting the points--approximate the original waveform. The quality of the approximation would be determined largely by several factors, including the complexity and regularity of the waveform and the precision of the meter. Speech, which contains both complex periodic (regular) segments and noisy, aperiodic segments, would demand more samples than a pure tone to provide a good approximation. Similarly, a meter that could reliably distinguish among 16 levels of pressure at each sample could give a better approximation than a meter with only 8 levels.
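The meter thought experiment can be sketched in a few lines of code. This is only an illustration, not any particular device: it samples a pure tone and rounds each reading to the nearest of a fixed number of pressure levels. The 100 Hz tone, the 1000-per-second rate, and the level counts are all invented for the example:

```python
import math

def sample_and_quantize(freq_hz, sample_rate_hz, n_levels, duration_s=0.01):
    """Read the 'meter' every 1/sample_rate_hz seconds, rounding each
    pressure reading (here a sine between -1 and 1) to the nearest of
    n_levels equally spaced levels."""
    step = 2.0 / (n_levels - 1)          # spacing between adjacent levels
    samples = []
    for i in range(int(sample_rate_hz * duration_s)):
        pressure = math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
        samples.append(round(pressure / step) * step)
    return samples

coarse = sample_and_quantize(100, 1000, 8)   # an 8-level meter
fine = sample_and_quantize(100, 1000, 16)    # a 16-level meter
```

Comparing either list of readings against the true sine values shows the 16-level meter tracking the waveform more closely, which is the point made above.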
Today devices known as analog-to-digital (A-to-D) converters serve as pressure "meters" that record numbers corresponding to pressure intensity at a given frequency or sampling rate. Unlike the analog representations, this digital representation is not continuous and is in no way an iconic replica of the original wave. The waveform is reconstituted electronically (D-to-A conversion) by pulsing a speaker with an intensity proportional to each number, at the same rate that the samples were taken originally. Continuity is restored by the momentum of the moving speaker.
A continuous waveform can thus be approximated by sampling the pressure periodically. A relatively small set of numbers can represent the entire waveform, similar to the way that a well-chosen sample of voters can predict the outcome of an election with a hundred million voters. In practice, a sample of speech taken each millisecond by my "pressure meter" would probably result in very poor quality reproduction.
Currently many A-to-D converters allow sampling at rates from about 5000 to 44,000 samples per second (5000 to 44,000 Hertz). Compact discs are recorded at a rate near 44,000 samples per second. Mathematically it can be demonstrated that one must sample at twice the highest frequency to be reproduced; this minimum rate is known as the Nyquist rate. Thus in order to reproduce sounds at frequencies over 10,000 Hz, one must sample at a rate of over 20,000 Hertz. This explains why a rate of 1000 samples per second would give poor quality reproduction, since components of the waveform over 500 Hertz would be lost. It also explains why compact discs must be recorded at more than a 40,000 Hertz rate in order to reproduce the highest frequency audible components of music--the upper limits of human hearing are around 20,000 Hertz, though we are much more sensitive to sounds most commonly produced in our speech (1000 Hz to 4000 Hz).
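The sampling-rate arithmetic above can be written out directly (a sketch; the function names are mine):

```python
def nyquist_rate(max_freq_hz):
    """Minimum sampling rate needed to capture components up to max_freq_hz."""
    return 2 * max_freq_hz

def highest_recoverable(sample_rate_hz):
    """Highest frequency component a given sampling rate can reproduce."""
    return sample_rate_hz / 2

print(nyquist_rate(20000))        # 40000 -- why CDs sample above 40,000 Hz
print(highest_recoverable(1000))  # 500.0 -- why 1000 samples/s sounds poor
```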
Computers and speech waveform digitizers have revolutionized the study of spoken language. My old Macintosh, for example, can sample at rates from 5000 to 44,000 Hertz; each sample can take on any of 2^16 (65,536) gradations. One can imagine the enormous quantities of data recorded even for a few minutes of sound and, consequently, how much computer memory is needed to record digitized sound.
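A back-of-envelope calculation makes the memory demands concrete. Here I assume CD-style parameters--44,100 samples per second, 16 bits (2 bytes) per sample, a single channel--which are my illustrative choices, not figures from the text:

```python
def bytes_needed(seconds, sample_rate=44100, bytes_per_sample=2, channels=1):
    """Storage required for uncompressed digitized sound."""
    return seconds * sample_rate * bytes_per_sample * channels

print(bytes_needed(60))   # 5292000 -- over five megabytes per minute, mono
```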
The pitch we hear is determined by the rate at which the vocal folds vibrate--the fundamental frequency, or F0. There are substantial differences among males, females, and infants, determined by the length, mass, and related stiffness or rigidity and tension of the vocal folds. F0 is also under considerable CNS control, as singing demonstrates. Your F0 is due to the rapid opening and closing of the vocal folds letting pressurized puffs of air up into the oral cavity. This is a very complex interaction of muscular tension, the inherent rigidity and elasticity of the vocal folds, and the rate of flow of air upward from the lungs.
Figure 1. Mean fundamental frequency plotted as a function of age of subject. Data for males and females are equal up to age eleven.
(data from Kent, 1976) Some male fundamentals may continue to drop through puberty to as low as 100 Hz or lower.
3.2.2. larynx--a cartilage tube with connecting ligaments and membranes, vocal folds--two movable bands of muscular tissue, and the glottis (which is the opening between the folds)
3.2.3. supralaryngeal (glottal) vocal tract (above the larynx/glottis)
details from film
nasal and oral cavity
tongue and velum
pharynx (pharyngeal tube)
You should be able to informally say how you produce a given English speech sound in terms of the operation of the major articulators. Always remember you must have sufficient air, and control of the velum. Can you write a "recipe" for saying "stern"?
3.1. Apparently the movable tissue evolved originally as a valve to prevent
fluids from entering lungs of amphibians. It still serves to keep fluids &
food out of our lungs. Importantly, it also serves to pressurize our lungs
during muscular effort, keeping the body rigid during, e.g., lifting or hitting.
Grunts on tennis courts probably are not just irritating noise--they may
increase the power of the strokes.
3.2. Our F0 reflects muscular tension--hence emotion.
3.3. In addition to normal breathing control, there is language control of breathing in the human CNS.
3.4. Humans in contrast to other primates have a long pharyngeal tube (recall film & video).
This is due to a lengthening of the neck, in effect a lowering of the larynx. Contrary to the usual neotenous relation of humans to chimps, newborn humans are more like adult chimps in this respect than adult humans are. As a result, mature humans have a greater range of possible vowel production than either infants or any other primate. See the diagram in my notes in the history section p.18. One result of this "dropping" of the larynx is that human adults are much more vulnerable to choking on food "going down the wrong pipe" than other primates and human infants. (see video)
duration, intensity, frequency
milliseconds, decibels (dB), Hertz (Hz or cps)
Plots changes in air-pressure as a function of time.
A spectrum is just a histogram of the amount of energy present at various frequencies for a given time, e.g. for a few milliseconds or eternally! Above is an example of a very short term spectrum computed repeatedly while I said "John." The plural of spectrum is spectra.
A spectrogram is a plot of time × frequency × intensity, where intensity is encoded by the darkness/lightness of the bands at each frequency indicated on the y-axis.
 Human sign language of course uses a
different medium but many of the same principles hold -- sign is
manual-arm-facial movement generated by the brain in dynamic fashion.
Movements are interpreted by the visual system.
Readers may vaguely recall their high school physics lessons about work. The "amount" of work (W) done is the force (F) required to move the "object," e.g. a mass of air, times the distance (D) it moves: W = FD. The force, as Isaac Newton tells us, is the mass of the object times its acceleration (F = MA). Force, appropriately, is measured in units called newtons. Work (W) is commonly measured in units called joules. Force (F) in physics books is treated as if it is applied to a point on an object; the concept of pressure is the amount of force applied over some unit area (e.g. newtons per square meter, or dynes per sq. cm.).
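These definitions translate directly into code. The metric units and the sample numbers below are mine, purely for illustration:

```python
def force_newtons(mass_kg, accel_m_per_s2):
    """F = MA: force in newtons."""
    return mass_kg * accel_m_per_s2

def work_joules(force_n, distance_m):
    """W = FD: work in joules."""
    return force_n * distance_m

# A roughly one-kilogram book falling under gravity (9.8 m/s^2) through one meter:
f = force_newtons(1.0, 9.8)
print(work_joules(f, 1.0))   # 9.8 joules
```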
How loud this book will sound depends on the energy it had as it hit the floor, determined by the height from which it dropped, and, importantly, the sensitivity of our ears. See the notes below.
It is important, though difficult, to keep the perceptual (psychological) effects of stimuli distinct from those stimuli themselves. The field of psychophysics provides much data on these issues, e.g. the relation between the physical energy (intensity) in an air pressure wave, its frequency, and perceived qualities of loudness and pitch. How "loud" we judge a sound to be depends complexly on its intensity and frequency, as well as what we have just immediately experienced. See Warren (1982) for a good survey.
Sound travels through air at 1130 feet per second at 20 degrees C but only 1087 fps at 0 degrees. Of course in a different medium the speed will be different--in water, for aquatic mammals, sound travels four times faster than in air. The blind mole rat, a very unusual mammal, is a solitary subterranean rodent that communicates after leaving its mother using "seismic signals"--banging the roof of its tunnels with its flattened head! This creates a long distance signal with energy concentrated in the range of 150-250 Hz that travels through earth very rapidly (Rado, Wollberg, and Terkel, 1991). In steel, as in a railroad track, sound travels at over 16,000 feet per second. Keeping one's ear to the ground or rail thus does have a virtue!
A decibel is a measure named for Alexander Graham Bell, designed to conveniently index pressure changes or intensity in terms of how loud they might sound to a normal human ear. The idea is to form a ratio of the particular sound intensity to the minimal intensity we can notice on average. Thus a sound at the threshold of detection would form a ratio of one (1/1). But since our ears have an extremely great range, it is more convenient to take the logarithm of this ratio. This not only reduces the range of units but provides a convenient approximation to the quality of loudness. Sounds around the threshold of hearing are assigned zero deciBels, while sounds that really hurt our ears will be above 100 deciBels. The resulting logarithmic scale more or less accurately reproduces the psychological effect that doubling energy does not double loudness.
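The log-of-a-ratio idea can be sketched numerically. This uses the standard modern convention for intensity (10 × log10 of the intensity ratio), which I am assuming here rather than quoting from any historical source:

```python
import math

def intensity_db(intensity, reference):
    """Decibels for an intensity relative to a reference (e.g. threshold of hearing)."""
    return 10 * math.log10(intensity / reference)

print(intensity_db(1, 1))            # 0.0 -- a ratio of one: the threshold itself
print(intensity_db(10**10, 1))       # 100.0 -- ten billion times threshold energy
print(round(intensity_db(2, 1), 1))  # 3.0 -- doubling energy adds only ~3 dB
```

The last line is the numerical face of the point above: doubling the energy moves the scale by only about 3 dB, far from doubling the perceived loudness.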
These waveforms were made using Macrecorder and then plotted using Signalize or SoundEdit. Then they were saved as paint files and imported into the text. See below.
The last time I had a piano tuned, the tuner used an electronic tone generator. Apparently tuning forks are obsolete.
Pierce (1983, p.22) cites Samuel Pepys' diary entry of August 8, 1666, that Robert Hooke, an early student of vibratory objects, could tell how many strokes a fly's wing made "by the note that it answers to in musique during their flying." The actual process of tuning a piano is more complex than I have suggested, but the idea is simple enough. See Pierce, pp. 62-71 for a discussion of scales and beats.
One older yet very accessible source of information on the physics of sound and recording is Backus (1969). Information about digital recording and processing is scattered about in many instrument and software manuals.
This is not to say it would be unintelligible; speech is surprisingly resistant to degradation due to its redundancy. See Chapter 2.