The following retrospective summaries are presented by Martin Rothenberg to
fill in some contextual factors that may not be evident in reading the publications
themselves. To emphasize that the viewpoints are those of the writer, the summaries
are written in the first person and not in the third person usually used for
technical papers. References can be found in the papers themselves, which are
available at www.rothenberg.org.
Note: These contextual summaries may be augmented or revised from time to time,
as indicated by the current revision date in the title.
I. The Breath-Stream Dynamics of Simple-Released-Plosive Production.
Bibliotheca Phonetica VI, S. Karger AG, Basel, Switzerland (1968).
The monograph The Breath-Stream Dynamics of Simple-Released-Plosive Production
(BSD) is a slightly edited version of my doctoral dissertation from the University
of Michigan (Ann Arbor) with the same title, published in 1966. BSD presents
a physiologically based dynamic model for a broad class of stop consonants occurring
in spoken natural languages, in particular simple released plosives. This model
was proposed as an alternative to acoustic models that were widely discussed
in the preceding 15 or 20 years, after the introduction of the sound spectrogram
(frequency vs. time plot, with intensity indicated by darkness or other visual
feature at each frequency-time coordinate) greatly simplified measuring acoustic
parameters. The trend toward acoustic modeling was also encouraged by research
in speech synthesis based on acoustic parameters. The model described in BSD
may be considered an extension of the physiological modeling proposed by Stetson
and others in the late 1930s and early 40s.
The acoustic models current in the early 1960s described stop consonants largely
in terms of features visible on a sound spectrogram, such as presence or absence
of a "voice bar" during a presumed occlusion interval and a "voice
onset time" (duration of the interval from the acoustic impulse presumed
to mark the release instant to the onset of quasi-periodic acoustic energy,
if present).
The model presented in BSD, on the other hand, describes stop consonants in
terms of physiological gestures. Mathematical models are described that relate
these gestures to the aerodynamic patterns that determine the acoustic energy
produced. It is assumed that for two stops to be phonemically distinct in a
particular language dialect, they must consistently result in perceptually distinct
acoustic patterns. However, it is also postulated that the most natural and
explanatory characterization is in terms of the underlying gestures and not
the resulting acoustic parameters. Thus, a physiological model should identify
just those degrees of freedom of the speech production apparatus that are consistently
producible and result in acoustic patterns perceptually distinct enough to be
used linguistically. This also implies that a production differentiation allowed
by the model should be learnable by almost all speakers of a language dialect.
Under the model, gestures are differentiated as being either unidirectional
(from one state to another different state, as closed to open) or cyclic (returning
to the approximate original state), since the aerodynamic consequences could
be different for each class. According to the model, the primary gestures determining
the classification of a particular production may occur at (1) the thoracic/abdominal
level (to produce or modify the subglottal air pressure), (2) the laryngeal
level (resulting primarily in an abduction or adduction of the vocal folds),
(3) the velar level (to seal/unseal the velopharyngeal port in order to allow
or prevent a buildup of air pressure above the vocal folds), and (4) the articulatory
level (resulting in a momentary occlusion of the vocal tract at some point).
In the model presented, such gestures can be either ballistic (maximally fast,
as limited by both tissue mass and muscle contraction times) or controlled.
Wherever possible, the dynamic constraints determining the speed of ballistic
unidirectional and cyclic gestures at each of the above levels were estimated
from the physiological literature at that time or measured experimentally.
Importantly, the various gestures can vary in their timing with respect to one
another to determine the phonetic category of the consonant produced. For example,
an unvoiced stop produced in an intervocalic position is invariably produced
with a cyclic opening (abductory) laryngeal gesture, from voiced to open (or
breathy-voiced) to voiced. In English phonology, this laryngeal gesture is normally
produced with a timing that synchronizes the initial phase of the glottal opening
movement with the onset of the contact phase of the closing articulatory gesture.
Because of the dynamic constraints inherent in the cyclic laryngeal gesture,
there is generally a period following the onset of the release during which
the articulators are open but the glottis has not yet closed: the period of
aspiration. However, if the laryngeal gesture is made earlier with respect
to the period of articulatory closure, the result is a "pre-aspirated"
stop, such as those reported in dialects of Icelandic. If the laryngeal gesture
is made later with respect to the articulatory closure, the result is a "voiced-aspirated"
stop, such as those found in some languages spoken in India.
In addition to the four primary gestures described above, there are a number
of other gestures that can modify the consonant produced sufficiently to be
phonetically significant in some language dialects. Those additional gestures
mentioned in BSD are a gesture of medial compression of the vocal folds and
a gesture of downward vertical laryngeal movement. Some evidence is presented
that a gesture of medial vocal fold compression, which would suppress vocal
fold vibration in the period of occlusion, is used in certain stop consonants
in Korean termed "unvoiced-tense" in the literature. Computations
are also presented to show that a purposeful downward movement of the larynx
could modify the supraglottal air pressure during the period of occlusion by
a piston action, so as to significantly and perceptibly augment the strength
and duration of voicing during the occlusion of a voiced stop. (A gesture raising
the larynx is known to be used in certain languages to increase the supraglottal
pressure in a class of stop consonants termed "ejectives". However,
ejectives were not among the class of stop consonants considered in BSD.)
In summary, BSD explores the hypothesis that the range of simple released plosives
to be found in spoken natural languages can be explained and understood by characterizing
such consonants in terms of a set of closely synchronized physiological gestures.
**********
II. The Glottal Volume Velocity Waveform during Loose and
Tight Voiced Glottal Adjustments.
Proceedings of the Seventh International Congress of Phonetic Sciences, held
at the University of Montreal and McGill University, 22-28 August 1971; edited
by André Rigault and René Charbonneau, Mouton, The Hague - Paris.
(1972)
This paper presented some of the first experimental results using the circumferentially
vented (CV) pneumotachograph mask developed by the author for low distortion
speech airflow measurements and the inverse filtering of oral airflow. In the
monograph BSD (see above), the significance of laryngeal abductory gestures
in speech production was explored in some detail. However, the role of laryngeal
adductory gestures was treated only cursorily. In this paper, using the new
CV mask, the author derives estimates for the dynamic constraints in both abductory
and adductory gestures and presents examples illustrating how such gestures
can be linguistically significant (independent of the stop consonant context
assumed in BSD). In effect, the purpose of this paper was to use some of the
new tools developed at the Syracuse University Speech Research laboratory to
fill some of the gaps left open in BSD. [A more complete description of the
theory and practice of the inverse filtering of oral airflow and measurements
of its validity were submitted the same year to the Journal of the Acoustical
Society of America and published in 1973. See III below.]
An important theoretical result that can be derived from the examples given
in the paper is the conclusion that an adductory gesture can lead to at least
three types of vocal fold vibratory behavior and their acoustic correlates,
namely a cessation of vibration, a sharp reduction in the frequency of the vibration
with possibly irregular intervals, or a bistable vibratory pattern, depending
on factors not well under the speaker's control. In addition, an adductory gesture
of minimal extent, as in rapid speech, may only be evidenced by a reduction of
the amplitude of spectral components at or near the fundamental frequency and
not any of the above behaviors.
The often-noted occurrence of a cessation of vocal fold vibration (the first
possible vocal fold behavior noted above) has led to the classification of an
adductory gesture during voiced speech with the vocal tract not occluded as
a "glottal stop", even though no actual cessation may occur in many
instances. Likewise, though an abductory gesture, as in the English /h/, can
often result in a cessation of voicing and acoustic turbulence, it is shown
in the paper that a brief abductory gesture may be voiced throughout, with the
identifying acoustic characteristic being a change in the spectral characteristics
of the glottal airflow waveform. The invariant is not so much the acoustic result
as the presence of the adductory or abductory gesture. These results can be
taken as further support for the hypothesis presented in BSD that the natural
linguistic classification of consonants should depend on the underlying gestures
and not on the particular acoustic manifestation.
**********
III. A New Inverse-Filtering Technique for Deriving the Glottal
Airflow Waveform during Voicing.
J. Acoustical Soc. of Amer. 53, 1632-1645, June 1973.
This paper represented a radical change in the direction of my research from
the direction in BSD and the above paper. The change was from linguistic models
at the level of phonetics and phonology to a study of the human voice source.
The change resulted directly from my success in almost inadvertently developing
a new tool for voice research, the circumferentially vented (CV) mask (and a
little from my curiosity as to why I was such a poor singer).
The CV mask was originally developed for research on consonants; especially
for recording the gross changes in airflow resulting from vocal fold abductory
and adductory movements (see II above), including the post-release aspiration
airflow. For these purposes, a response time of 2 or 3 ms would have been adequate,
since the movements in question took place in at least 10 times that duration.
After testing a number of approaches, it was decided to adapt the wire-screen
pneumotachograph mask used for respiratory measurements. In addition to the
response time requirements, the primary design criteria were low speech distortion
and acceptably low back-pressure caused by the mask's flow resistance.
The standard respiratory masks having hard walls and flow measured at a single
centrally located outlet, by acting as an acoustic extension of the vocal tract,
strongly distorted the formant structure when used during speech or singing.
(At the time this is written, however, they are still marketed by at least one
manufacturer for speech measurements. See the Glottal Enterprises website for
sound samples from various masks.) However, after a series of improvements in
mask design and careful measurements of speech distortion, response time and
response linearity, it was found that a response time of as little as ½
ms could be attained, with a distortion and muffling of the speech that was
acceptable for most speech research. (Later improvements brought the response
time down closer to ¼ ms.)
The response time requirement was met by finding a differential pressure measurement
system with a faster response time than the transducers used for respiratory
work that would still be sensitive enough to measure pressures less than a tenth
of the subglottal pressure (to keep mask back-pressure low). In addition, to
keep the response time low, the sensors had to be coupled to the mask without
the tubing used in respiratory applications. The reduction in speech distortion
was accomplished by reconfiguring the mask so as to vent the mask via wire screen
distributed around its circumference, much closer to the mouth, instead of at
one centrally located outlet.
In making speech airflow measurements with these new masks, it became apparent
that the masks were able to resolve at least coarsely the airflow variation
within each glottal cycle, since the response time attained was usually less
than 1/10 of the vocal fold vibratory period for both male and lower pitched
female speaking voices. Thus the temptation arose to turn to looking at the
variation of airflow within the glottal cycle, and from there it was only a
short step to attempt inverse filtering the oral air flow to obtain the waveform
of the airflow through the glottis.
Inverse filtering of a microphone signal had been performed previously and showed
the approximate glottal waveform, but now we were able to add a flow scale and
track how the waveform varied with the average airflow during an ab/adductory
gesture. The recordings showed that the waveform and resulting spectrum varied
greatly during abduction. These results were used later during a stay at the
Royal Institute of Technology, where, working with Rolf Carlson, Bjorn Granström
and Jan Lindqvist-Gauffin, the first speech synthesizer to vary its source spectrum
in a natural manner during an abductory or adductory gesture was implemented.
This resulted in more natural unvoiced consonants and glottal stops. (Previous
synthesizers included formant transitions and fading in of noise to simulate
aspiration, but did not include the important transitions in source spectrum.)
One other important (to us) outcome was that because an ab/adductory tremor is
such an important part of laughing, we developed the first speech synthesizer
that could laugh! (The method we used is described in our paper A Three Parameter
Model for the Glottal Source, in Speech Communication 2, Almqvist and Wiksell,
Stockholm, 235-243, 1975.)
Extensive experience in inverse filtering also made clear that there was a characteristic
shape to the glottal airflow waveform during non-breathy voicing that was independent
of the shape of the glottal area waveform. This shape of a typical glottal airflow
waveform was characterized by a gradual flow onset and a sharp cessation of
flow that generated much more second and third formant energy than predicted
by the then-current glottal flow models published by Flanagan. This shaping
of the glottal flow pulse was discussed in the paper, and the possibility of source-tract
acoustic interaction was included as a causative factor (in addition to adding first
formant oscillations to the waveform as some of the formant energy was absorbed
by the glottis during the open phase of the cycle).
Another novel feature of this paper is the method used to estimate subglottal
pressure, developed after exploring the difficulties in implementing more intrusive
methods. The aerodynamic model presented in BSD makes clear that if the glottis
is not sealed and there is a complete supraglottal articulatory closure, the
pressure behind the closure, the intraoral pressure, will rise to approximate
the subglottal pressure in just a few milliseconds, at most. Thus if the consonant
/p/ is pronounced between two vowels, to assure an open glottis, the intraoral
pressure during the closure for the /p/ could be taken as a fairly accurate
measure of subglottal pressure during the adjoining vowels. A sequence of repeated
syllables /b/ vowel /p/ was used for the subglottal pressure measurements in
the paper and the speaking rate kept high enough to prevent a purposeful variation
of subglottal pressure with each syllable. (Though other researchers have subsequently
used syllables /p/ vowel /p/, the syllable /b/ vowel /p/ is preferable because
there is no decrease in subglottal pressure during the release, as would be
caused by aspiration in the initial /p/.) This method is now commonly used for
voice measurements.
A subsequent letter published in the Journal of Speech and Hearing Disorders
(No. 47, 218-224, 1982) cautions that for the method to be effective, two conditions
should be met. First, a syllable rate of at least 2 per second should be used, to keep
the subglottal pressure relatively constant during each syllable by discouraging
the use of a pulse of subglottal pressure for each syllable. Second, the system
recording intraoral pressure should have a response time of no more than 30 ms.
There are also a
number of contexts that can be used to discourage the use of a pulse of subglottal
pressure for each syllable. One good method is to use a sequence of four or
five syllables, with a stress on the last syllable. The first and last syllables
are then eliminated from the measurements.
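As a rough illustration, the measurement protocol described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name, the pressure units and the sample values are hypothetical, and it assumes that the peak intraoral pressure during each /p/ closure has already been extracted from the recording.

```python
def estimate_subglottal_pressure(closure_peaks, duration_s, min_rate=2.0):
    """Estimate subglottal pressure from a repeated-syllable sequence.

    closure_peaks: peak intraoral pressure during each /p/ closure,
                   one value per syllable, in order (hypothetical units: cm H2O).
    duration_s:    total duration of the syllable sequence in seconds.
    min_rate:      minimum syllable rate (syllables per second) from the text.
    """
    n = len(closure_peaks)
    if n < 3:
        raise ValueError("need at least three syllables to drop first and last")
    if n / duration_s < min_rate:
        raise ValueError("syllable rate is below the recommended minimum")
    # Discard the first and last syllables, then average the remainder.
    inner = closure_peaks[1:-1]
    return sum(inner) / len(inner)


# Hypothetical five-syllable sequence spoken in 2 seconds (2.5 syllables/s).
peaks = [9.0, 7.0, 7.2, 6.8, 8.5]
print(estimate_subglottal_pressure(peaks, duration_s=2.0))
```

The rate check only enforces the 2-per-second guideline; it does not verify the 30 ms response-time requirement, which is a property of the recording system rather than of the data.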
**********
IV. (with S. Zahorian) Nonlinear Inverse Filtering Technique
for Estimating the Glottal Area Waveform.
J. Acoust. Soc. of Amer., Vol. 61, pp. 1063-1071 (1977).
and
V. Acoustic Interaction Between the Glottal Source and the
Vocal Tract. In Vocal Fold Physiology.
K.N. Stevens and M. Hirano, Eds., Univ. of Tokyo Press, 305-328 (1980).
The paper Nonlinear Inverse Filtering Technique for Estimating the Glottal
Area Waveform (NIF) was essentially an effort to bridge the gap between the
glottal airflow waveform, as obtained by standard inverse filter techniques,
and the waveform of the glottal area. This gap can consist of at least two factors,
namely, oscillations in the waveform at the frequency of the first formant and
a tilt to the right of the pulse of glottal airflow that results in a slowing
of the increase in airflow and a more abrupt closing phase. Though this asymmetry
in the pulse of glottal airflow had been long noted in the literature, it had
never been reported in glottal area waveforms. It follows from the principles
of Fourier analysis that this abrupt termination of the airflow pulse creates
most of the acoustic energy in the voice at the frequencies of the second and
higher formants. Thus it is most pronounced in acoustically strong voices. (A
related contributory factor is that in non-breathy voice the resulting pressure
pulse generated by the glottal closing is followed by a period of glottal closure,
or near closure, so that little of the energy produced is absorbed by the glottis,
especially if the closed quotient is high.)
An inverse filter has a transfer characteristic the inverse of that of the vocal
tract with the glottis closed. This means that when the glottis is not closed
the inverse filter is not tuned to the actual resonances of the vocal tract,
and so there will not be a complete cancellation of the instantaneous formant
energy. Thus, for example, during the open phase of the glottal cycle there
can be some formant oscillation visible on the properly inverse filtered waveform.
This makes sense acoustically, since this energy can be considered the supraglottal
energy generated during the glottal closed phase being absorbed by the open
glottis, and would undoubtedly be found at the glottis if one could actually
measure the airflow at that point.
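The idea of an inverse filter can be illustrated with a minimal digital sketch in Python. This is not the analog circuitry used in the original work: it assumes a single formant modeled as a two-pole digital resonator, and the sample rate, formant frequency, bandwidth and triangular "glottal" pulse are all hypothetical values chosen for illustration. A correctly tuned inverse filter recovers the source exactly, while a mistuned one (standing in for the resonance shift when the glottis is open) leaves residual formant ripple on the waveform.

```python
import math


def formant_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Return (a1, a2) of a two-pole resonance at the given formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    theta = 2.0 * math.pi * freq_hz / sample_rate_hz
    return 2.0 * r * math.cos(theta), -r * r


def resonator(x, a1, a2):
    """All-pole formant filter: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn + a1 * y1 + a2 * y2)
    return y


def inverse_filter(y, a1, a2):
    """Matching all-zero filter: x[n] = y[n] - a1*y[n-1] - a2*y[n-2]."""
    x = []
    for n, yn in enumerate(y):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        x.append(yn - a1 * y1 - a2 * y2)
    return x


# Hypothetical source: three skewed triangular flow pulses with closed phases.
pulse = [i / 20 for i in range(20)] + [1 - i / 10 for i in range(10)] + [0.0] * 30
source = pulse * 3

a1, a2 = formant_coefficients(500.0, 80.0, 10000.0)
mouth_flow = resonator(source, a1, a2)

recovered = inverse_filter(mouth_flow, a1, a2)
mistuned = inverse_filter(mouth_flow, *formant_coefficients(560.0, 80.0, 10000.0))

print("max error, tuned:   ", max(abs(r - s) for r, s in zip(recovered, source)))
print("max error, mistuned:", max(abs(m - s) for m, s in zip(mistuned, source)))
```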
In NIF, an inverse filter is implemented in which the formant frequencies and
damping coefficients are varied dynamically during the glottal cycle to roughly
track the actual instantaneous vocal tract resonances. In this way it was hoped
to obtain a waveform that more closely represents the glottal area. Some success
was reported, but the resulting waveform, though less asymmetrical than the
normal inverse filtered airflow, still showed an asymmetry, with a rapid closing
phase that we considered not likely to be present in glottal area.
In what, in retrospect, may have been the most important contribution of NIF,
the possibility was explored that the asymmetry of the glottal airflow pulse
is caused by an acoustic interaction between the variation in glottal area and
the inertance of the column of air in the vocal tract, especially that part
of the tract closest to the glottis. This would mean that previously proposed
differential equations relating area and airflow at the glottis were greatly
in error, since this potential factor was not modeled. To support this hypothesis,
a simple electrical analog was implemented consisting of a time-varying resistance
(simulating the time-varying glottal flow resistance related to the instantaneous
glottal area) and an inductor (simulating the vocal tract inertance). It was
observed that the resulting electrical current (the glottal airflow) had a waveshape
much more like glottal airflow pulses obtained by inverse filtering that we
had observed over the years for many speakers in modal register.
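The behavior of such an analog can be reproduced numerically. The sketch below assumes the simplest reading of the circuit described above: a constant driving pressure P across a time-varying conductance g(t) in series with an inertance L, so that L di/dt + i/g(t) = P. The triangular conductance pulse, the normalized units and the inertance value of 0.3 are assumptions made for illustration, not values from the paper. With zero inertance the flow pulse is symmetric; with inertance the peak is delayed and the closing phase becomes much steeper than the opening phase.

```python
def simulate_flow(inertance, n_steps=2000, pressure=1.0, g_max=1.0):
    """Solve L*di/dt + i/g(t) = P over one open phase by implicit Euler.

    Time is normalized so the glottis opens at t=0 and closes at t=1.
    g(t) is a symmetric triangular conductance pulse (an assumed shape).
    Returns the flow samples i[0] .. i[n_steps].
    """
    dt = 1.0 / n_steps
    flow = [0.0]
    for n in range(1, n_steps + 1):
        t = n / n_steps
        g = g_max * (1.0 - abs(2.0 * t - 1.0))  # rises 0 -> g_max, falls back to 0
        if g <= 0.0:
            flow.append(0.0)            # closed glottis: no flow
        elif inertance == 0.0:
            flow.append(pressure * g)   # purely resistive: flow follows conductance
        else:
            a = dt / inertance
            # Implicit Euler step: i_new * (1 + a/g) = i_old + a*P
            flow.append((flow[-1] + a * pressure) / (1.0 + a / g))
    return flow


def peak_time(flow):
    """Normalized time at which the flow reaches its maximum."""
    return flow.index(max(flow)) / (len(flow) - 1)


no_inertance = simulate_flow(0.0)
with_inertance = simulate_flow(0.3)

print("peak time without inertance:", peak_time(no_inertance))
print("peak time with inertance:   ", round(peak_time(with_inertance), 3))
```

An implicit step is used because the equation becomes stiff as the conductance approaches zero near closure; an explicit step would need a far smaller time increment to remain stable there.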
However, the results of an electrical simulation would not be as satisfactory
as a mathematical expression representing the solution to the differential equation
characterizing the posited source-tract interaction, if such an expression can
be found. The solution to a differential equation representing the source-tract
acoustic interaction was presented at a conference in January of 1980 in a paper
entitled Acoustic Interaction Between the Glottal Source and the Vocal Tract
(AI) and published in the proceedings of the conference. The solution to the
differential equation, when plotted for various values of vocal tract inertance,
pulse duty cycle and simulated breathiness (incomplete vocal fold closure),
exhibited even more closely than did the electrical simulation the characteristics
observed in inverse filtered airflow waveforms for numerous speakers in the modal
or normal speaking voice register.
In both the NIF and AI papers, attempts were made to estimate the crucial parameter
representing vocal tract inertance in the differential equation from approximate
glottal and vocal tract dimensions. However, these efforts yielded values of
inertance too low to fully explain the interaction seen in the waveforms of
strong voices. In addition, this parameter was not easily estimated from acoustic
measurements made at the lips, such as vocal tract formants. This conclusion
is not difficult to visualize. The effective inertia of the flow in a channel
stems from the velocity of the flow and therefore increases with a decrease
in the diameter (and increase in length) of the channel. Thus the most obvious
source of the additional inertance required to affect source-tract interaction
to the degree seen in strong voices would be a constriction immediately above
(or theoretically within or below) the glottis. A vocal tract constriction immediately
above the glottis (or a component coming from a constricted air jet as airflow
exits the glottis; see the Note below) would thus increase the flow inertance seen
by the glottis and therefore the source-tract interaction. However, a constriction
immediately above a closed or almost closed glottis would have little effect
on parameters of vocal tract acoustics that can be measured at the lips, such
as the formant frequencies, since there is little oscillatory airflow near the
high glottal impedance. (Constrictions closer to the lips, on the other hand,
have a great effect on the formant frequencies.) Though there have been some
simulation studies of the aerodynamics near the glottis, this important factor
in voice quality, and how it varies between speakers, has not yet been adequately
tied down.
Note: The dynamic properties of a jet of air near the glottis, that is, the
reaction of the jet to a rapid change of airflow or pressure, are not considered
in linear acoustics. Mathematical analysis of jets, eddies, turbulence and other
nonlinear phenomena is much more complex than the analysis of linear acoustic
waves used, for example, to compute formants.
It may be interesting to those familiar with automobile ignition systems that
the principle that a rapid change of airflow through an inertance can generate a
strong pressure peak has an analog in ignition systems. The differential equation
is similar to that proposed for the voice. To produce the high voltage peaks
needed for the engine spark plugs, the flow of electrical current through an
inductor (sometimes referred to as the "spark coil" or just the coil)
is interrupted each time a high voltage pulse is needed by a spark plug. Until
electronic ignition was introduced, this interruption was by means of a mechanical
switching (separating the "points" in the distributor), though now
the high voltage pulse generation is done with semiconductor electronics. In
the vocal tract, the analogy to separating the points to stop the electrical
current is the closing of the glottis to stop the airflow. Thus, a strong voice
in the modal register can be thought of as creating 100 to 200 acoustic 'sparks'
per second.
I consider the discovery of the principle of glottal source interaction with
the inertive component of the vocal tract impedance, as detailed in NIF (with
Steven Zahorian) and AI, to be my most important single contribution to voice
research. It contradicted previous mathematical models of source-tract interaction
and added an important component to concepts in linear modeling, which partially
explained the acoustically strong modal or chest voice only in terms of resonances
in the vocal tract, such as the so-called singer's formant, and the duty cycle
of the vibratory pattern, sometimes referred to in terms of a closed quotient
or open quotient.
Though the tools developed in the process of discovery, such as the CV mask
and airflow inverse filtering, may have a value in their own right, as for clinical
or linguistic measurements, for me it was the theoretical advance they brought
about that was of greatest value. After using these tools for many years, I
came to recognize that there was a pattern in the inverse filtered waveforms
that was relatively independent of the shape of the glottal area waveform. I
looked for a system characteristic that explained this pattern. I found this
patterning analogous to the well known exponentially damped sinusoidal response
of a linear system to a perturbation of rather arbitrary waveform. In linear
system theory or the study of linear differential equations, the exponentially
damped sinusoids are referred to as comprising the natural response of the system.
This analogy suggests that the theoretical waveforms shown in AI should be considered
the natural response of the glottal-supraglottal system, when there is a glottal
open period followed by a significant period of glottal closure and a high inertive
component to the acoustic vocal tract impedance at the glottis.
**********
VI. Measurement of Airflow in Speech.
J. of Speech and Hear. Res. 20, 155-166 (1977).
The paper Measurement of Airflow in Speech (MAS) documents certain advances
in the development of CV mask technology for measuring airflow in speech and
in applications to voice and speech research. Most significantly, it documents
that the response time of the CV mask had been reduced from the ½ ms
reported previously to approximately ¼ ms. When a CV mask is used to
estimate the glottal airflow waveform by inverse filtering, this reduction of
response time approximately doubles the range of F0 (voice fundamental frequency)
over which it can be used. Assuming that a good representation of the glottal
waveform requires a system response time of no more than 1/20 of the glottal
period, and a reasonable representation requires a response time of no more
than 1/10 of the period, a mask with a response time of ¼ ms could be
used for a good representation at values of F0 up to 200 Hz and at values of
F0 up to 400 Hz for a reasonable representation. [In later work with the soprano
singing voice, a smaller mask covering only the mouth was used to reduce the
response time even further, to increase the F0 range.]
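The arithmetic behind these F0 limits can be checked with a short Python sketch. The function and parameter names are hypothetical; the rule itself (glottal period at least 20 or 10 times the response time) is the one stated above.

```python
def max_f0_hz(response_time_ms, min_period_ratio):
    """Highest F0 (Hz) whose glottal period is still at least
    `min_period_ratio` times the system response time."""
    shortest_period_ms = response_time_ms * min_period_ratio
    return 1000.0 / shortest_period_ms


for response_time_ms in (0.5, 0.25):
    good = max_f0_hz(response_time_ms, 20)        # response time <= 1/20 period
    reasonable = max_f0_hz(response_time_ms, 10)  # response time <= 1/10 period
    print(f"{response_time_ms} ms: good up to {good} Hz, "
          f"reasonable up to {reasonable} Hz")
```

Halving the response time from 0.5 ms to 0.25 ms doubles both limits, from 100 and 200 Hz to 200 and 400 Hz.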
One difficulty with CV mask design at higher values of F0 was the necessity of recording
the differential pressure across the wire screen. This is the same as saying
that the sound pressure immediately outside the screen must be subtracted from
the pressure inside the mask. It is shown in MAS that in the absence of a true
differential pressure signal, a relatively simple correction can be made to
the inside pressure to emulate the subtraction of outside pressure.
It was also shown theoretically and experimentally that in applications in which
only a rough approximation of the shape of the glottal airflow waveform is required,
and an open vowel is used, an appropriately designed low-pass filter can often
substitute for an inverse filter tuned to the specific vocal tract resonances.
Among other applications explored in MAS, it was shown that the CV mask could
portray accurately the airflow pattern in the period of aspiration of an unvoiced
released stop, and, in conjunction with a simultaneous measurement of the intraoral
pressure waveform, the variation of the conductance of the articulatory constriction
as the constriction opens. This 'conductance' (using a linear system term advisedly)
describes quite well the patterning of separation during the aspiration interval.
[Note that this was the type of application that the CV mask was designed for
many years previously, before I moved to voice research.]
Also shown in the paper is that the mask signal yields a voice representation
very resistant to ambient noise and from which F0 traces could be derived more
accurately and reliably than from the radiated acoustic pressure (microphone)
signal.
**********
VII. The Voice Source in Singing.
in Research Aspects of Singing, publications issued by the Royal Swedish
Academy of Music, no. 13, 13-33 (1981).
The Voice Source in Singing (VSS) is one of a number of papers published during
and just after a year I spent at the Speech Transmission Laboratory at the Swedish
Royal Institute of Technology, where, with a number of colleagues there, I explored
the ramifications of the new view of the voice source in speech and singing
afforded by the interactive model of the voice presented in previous papers.
VSS was a tutorial presentation of the new theory in which the implications
for the professional singing voice were explored.
**********
VIII. An Interactive Model for the Voice Source.
In Vocal Fold Physiology: Contemporary Research and Clinical Issues, D.M. Bless
and J.H. Abbs, eds., College Hill Press, San Diego, 155-165 (1983). (Proceedings
of the Vocal Fold Physiology Conference, Univ. of Wisconsin-Madison, May 31-June
4, 1981.)
This paper (IMVS) explores a number of issues concerning the previously proposed interactive source-tract model, such as where in the vocal tract the inertive component of the vocal tract impedance must be to explain the glottal flow waveforms obtained by inverse filtering, the effect on the glottal flow waveform of the glottal conductance being flow dependent (and not dependent only on glottal dimensions) over some portion of the vocal fold vibratory cycle, and certain variations in the vibratory pattern of the vocal folds. To provide a basis for investigating the important question of why some individuals are blessed with an acoustically strong voice, and others (including me) are not, a parametric model is proposed that relates the vocal fold vibratory pattern to an inertive component of the vocal tract impedance at the larynx. As for the source of the strong voice, my best guess was, and still is, that the stronger than average source-tract interaction that is associated with a strong voice stems from a higher than average acoustic inertance component immediately above the glottis, combined with a more abrupt glottal closing phase. The more abrupt glottal closing phase appears to be associated with a parallel vocal fold geometry during the closing phase, as compared to a more gradual or 'zipper-like' closing. This conclusion has been reinforced for me by a comparison of electroglottograph waveforms for speakers with weaker and stronger voices. (I began to use the electroglottograph shortly before my stay in Sweden, using the version designed by Adrian Fourcin, which reduced the noise enough compared to previous units to allow reliable recording from most subjects. After my stay in Sweden, the use of electroglottography to augment airflow inverse filtering became a regular part of our research protocol.)
**********
IX. Source-Tract Acoustic Interaction and Voice Quality.
in the Transcripts of the Twelfth Symposium: Care of the Professional Voice,
The Juilliard School, New York City, June 6-10, 1983.
In view of the potential importance of the postulated interaction of the pattern of variation in glottal conductance in voicing with an inertive component of the vocal tract impedance, it was decided to attempt to validate the theory by varying the vocal tract inertance and observing the inverse filtered airflow to look for a resultant change in the flow waveform. A helium-oxygen mixture, which is less dense than air, was used to reduce the inertive component of the vocal tract impedance. The resulting waveform changes were those that would be predicted by the theory.
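As a rough illustration of why a lighter gas lowers the inertance: for a uniform tube, acoustic inertance is density times length divided by cross-sectional area, so it scales directly with gas density. The sketch below is not taken from the paper. It assumes an 80/20 helium-oxygen mixture, dry gases at the same temperature and pressure as air, and a hypothetical tube 2 cm long with an area of 1 cm²; the function names are likewise illustrative.

```python
# Illustrative only: the mixture ratio and tube dimensions are assumptions,
# not values reported in the paper.
MOLAR_MASS_AIR = 28.97   # g/mol, dry air
MOLAR_MASS_HE = 4.0026   # g/mol
MOLAR_MASS_O2 = 31.998   # g/mol
AIR_DENSITY = 1.2        # kg/m^3, approximate value near room temperature


def mixture_density(he_fraction, air_density=AIR_DENSITY):
    """Density of a He/O2 mixture, scaled from air by the molar-mass ratio."""
    molar_mass = he_fraction * MOLAR_MASS_HE + (1.0 - he_fraction) * MOLAR_MASS_O2
    return air_density * molar_mass / MOLAR_MASS_AIR


def tube_inertance(density, length_m, area_m2):
    """Acoustic inertance of a uniform tube, rho * l / A, in kg/m^4."""
    return density * length_m / area_m2


# Hypothetical short constriction: 2 cm long, 1 cm^2 in area.
inertance_air = tube_inertance(AIR_DENSITY, 0.02, 1.0e-4)
inertance_heliox = tube_inertance(mixture_density(0.8), 0.02, 1.0e-4)
inertance_ratio = inertance_heliox / inertance_air

print(f"inertance in air:    {inertance_air:.1f} kg/m^4")
print(f"inertance in heliox: {inertance_heliox:.1f} kg/m^4")
print(f"ratio:               {inertance_ratio:.3f}")
```

Under these assumptions the inertance falls to roughly one third of its value in air. In a real vocal tract the gas is humid and mixed with residual air, so the actual reduction would be smaller.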
**********
X. Source-Tract Acoustic Interaction in Breathy Voice.
In Vocal Fold Physiology: Biomechanics, Acoustics and Phonatory Control,
I.R. Titze and R.C. Scherer, eds., The Denver Center for the Performing Arts,
Denver, CO, 465-481 (1984).
This paper (BV) would turn out to be my last in the series exploring the acoustic
interaction between the glottal valving of the breath-stream and the inertance
of the vocal tract airflow in or near the glottis. The research reported set
out to study how this interaction affects breathy voice (vocalization with partially
abducted vocal folds), but led me to unexpected results concerning models for
source-tract interaction and ways for measuring the parameters of such models.
The research stemmed from theoretical and experimental work performed in Stockholm
in 1980, especially a chart recording showing simultaneous traces of inverse
filtered glottal airflow, EGG, and a photoglottograph during a sentence containing
a segment of breathy voice and vowels easy to inverse filter. (My recollection
is that this recording was made in cooperation with Peter Kitzing of the University
of Lund, who was kind enough to introduce me to the art of photoglottography.)
The photoglottograph provides a trace roughly proportional to glottal area from
light passed through the glottis.
The general conclusion for breathy voicing is that the acoustic inertance near the glottis acts as a low-pass filter for the glottal source function, reducing the proportion of energy at the second and higher formant frequencies, as compared to the increase in the proportion of higher formant energy in non-breathy voicing. This effect would be perceptually important, since it would emphasize the transitions in source spectra in voiced-to-breathy-voiced (or breathy-voiced-to-voiced) transitions, and should be considered a potential factor in speech recognition and high quality speech synthesis. Such transitions occur whenever unvoiced consonants adjoin a voiced phoneme.
A second conclusion in BV is that simultaneous inverse filtered airflow and EGG waveforms during breathy voice can be used to work backward from measures of the low-pass filter effect of the source-tract interaction to an estimate of the value of acoustic inertance that would cause that interaction. This procedure could be an important tool for exploring the sources of the marked difference between an acoustically strong voice and an acoustically weak voice. Its importance stems from the fact that the acoustic inertance appears to originate in a location in the pharynx that is difficult to access for direct measurements. However, to my knowledge voice research has not exploited this tool subsequent to the publication of BV.
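The idea of working backward from the low-pass effect to an inertance value can be sketched with a deliberately simplified model; this is not the procedure used in BV. Assume a constant driving pressure P, a linear pressure-flow relation, and a glottal conductance G(t) = G0 + g(t) that never reaches zero (as in breathy voice), in series with an inertance L. Linearizing P = L dU/dt + U/G around the mean flow U0 = P G0 gives L G0 du/dt + u = P g(t), a first-order low-pass with corner frequency f_c = 1/(2π L G0). All names and numeric values below are hypothetical.

```python
import math


def mean_conductance(mean_flow_m3s, pressure_pa):
    """Linear model: mean glottal conductance G0 = U0 / P."""
    return mean_flow_m3s / pressure_pa


def inertance_from_cutoff(cutoff_hz, g0):
    """Invert f_c = 1 / (2*pi*L*G0) to obtain L in kg/m^4."""
    return 1.0 / (2.0 * math.pi * cutoff_hz * g0)


def lowpass_gain(freq_hz, inertance, g0):
    """Relative gain of the first-order low-pass at freq_hz (1.0 at DC)."""
    tau = inertance * g0  # time constant L * G0, in seconds
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * freq_hz * tau) ** 2)


# Hypothetical measurements, chosen only to exercise the functions:
# 800 Pa driving pressure, 0.4 litre/s mean flow, 1 kHz observed corner.
g0 = mean_conductance(mean_flow_m3s=4.0e-4, pressure_pa=800.0)
inertance = inertance_from_cutoff(cutoff_hz=1000.0, g0=g0)

print(f"estimated inertance: {inertance:.1f} kg/m^4")
print(f"gain at 3 kHz relative to DC: {lowpass_gain(3000.0, inertance, g0):.3f}")
```

The sketch ignores formant resonances and any flow dependence of the glottal conductance, so it shows only the direction of the inference (measured roll-off to inertance), not a usable measurement method.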
Finally, establishing a quantitative relationship between the low-pass effect
described above and the value of vocal tract inertance required a rather elaborate
mathematical exploration of the various proposed relationships between glottal
airflow and transglottal pressure. A simple linear pressure-flow relationship,
resulting in a well-defined glottal conductance that varied with glottal area,
was used in my original papers to establish the principle of inertance-related
source-tract interaction. However, life in the glottis is not that simple. Fant
and van den Berg have both proposed flow-dependent (or pressure-dependent) glottal
conductance models that yield somewhat different degrees of inertance-related
source-tract interaction than the simple flow-independent model. These various
models and their implications for the study of breathy voice are discussed at
length in the paper.
**********
XI. Cosi' Fan Tutte.
in Vocal Fold Physiology: Laryngeal Function in Phonation and Respiration,
T. Baer, C. Sasaki, and K.S. Harris, eds., College Hill Press, San Diego, 254-263
(1986).
The title Cosi' Fan Tutte (CFT) was an attempt to humorously convey the meaning
"the way the women do it". (If you don't like the title, you can put
the blame on Donald Miller, who suggested it.) In the early 1980s, there were
a number of professional singers who were regular visitors to the Speech Research
Laboratory at Syracuse University. They were trying, with me, to unlock some
of the secrets of the good singing voice. We worked primarily on the male voice
because of the F0 limitations of the CV mask/inverse filtering system developed
at the laboratory.
The sopranos working at the laboratory came to take this as a sign of sexism, albeit inadvertent, and not as an insuperable technical limitation. I was pressed hard by them to balance our research with some work on the female voice. Eventually, I gave in and thought of a straightforward but significant problem that might be approachable. How can a soprano with a very good voice and lots of training produce the ear-piercing and lengthy notes, rich in harmonic structure, near the top of her range, and do this without damaging her vocal cords or running out of breath? If it were simply a matter of tuning the vocal tract to one or another harmonic, the note would tend to be sinusoidal and not rich in harmonics (and thus would not clearly convey a vowel quality), and the breath conservation and vocal cord damage issues would remain.
Fortunately we had among us, in Dolores Leffingwell, a talented soprano who could readily produce notes of this type and was willing to be a subject. To extend the range of the CV mask to the top of Dolores' range, I designed a small mask fitting over only her mouth. When using this mask at lower F0 values at which inverse filtering is relatively easy, the improvement in the details in the resulting waveform indicated that the response time of the mask and transducer was down significantly from the ¼ ms that was verified for a larger mask, though no formal measurements of response time were made. However, the problem remained of how to adjust the parameters of the inverse filter at higher pitches, namely those at which the value of F0 approached the expected value of the first formant. At these pitch levels, even a slight variation in the first formant frequency or damping setting would produce a great change in the waveform, and the normal procedure of eliminating or minimizing formant frequency oscillations occurring during the glottal closed phase did not work, since the closed phase would not contain even a single oscillation at the frequency of the first formant.
To solve this problem, we employed an electroglottograph. The mask signal and simultaneously recorded EGG signal were stored on a two-channel recording system and replayed repetitively, properly time-synchronized. The EGG signal was able to identify unambiguously the glottal closed period. After considerable adjustment of the inverse filter parameters, it was determined that there was only one setting for F1 frequency and damping at which there was a flat period in the waveform near zero airflow that closely coincided with the closed period indicated by the EGG. So with the help of the EGG signal we were able to see the glottal flow waveform!
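The adjustment procedure described above can be sketched as a search over inverse filter settings, scored by how flat and near zero the recovered flow is during the closed period. The example below is an illustration on synthetic data, not the laboratory's actual (analog) inverse filter: it assumes a single digital two-pole formant, a made-up flow pulse, and a known closed-phase mask standing in for the EGG. All names, the sample rate, and the formant values are assumptions.

```python
import numpy as np

FS = 40000                     # sample rate in Hz (assumed)
F0 = 800                       # fundamental in Hz, close to F1 as in the text
TRUE_F1, TRUE_BW = 850, 80     # hypothetical first formant and bandwidth (Hz)


def resonator_coeffs(f1, bw):
    """Coefficients (a1, a2, gain) of a unity-DC-gain two-pole resonator."""
    r = np.exp(-np.pi * bw / FS)
    a1 = -2.0 * r * np.cos(2.0 * np.pi * f1 / FS)
    a2 = r * r
    return a1, a2, 1.0 + a1 + a2


def formant_filter(x, f1, bw):
    """Simulate the tract: y[n] = g*x[n] - a1*y[n-1] - a2*y[n-2]."""
    a1, a2, g = resonator_coeffs(f1, bw)
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = g * x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y


def inverse_filter(y, f1, bw):
    """Cancel one formant: x[n] = (y[n] + a1*y[n-1] + a2*y[n-2]) / g."""
    a1, a2, g = resonator_coeffs(f1, bw)
    x = y.copy()
    x[1:] += a1 * y[:-1]
    x[2:] += a2 * y[:-2]
    return x / g


def closed_phase_rms(estimate, closed_mask):
    """RMS of the recovered flow over the samples marked as closed."""
    return float(np.sqrt(np.mean(estimate[closed_mask] ** 2)))


# Synthetic glottal flow: open for half of each period, closed for the rest.
period = FS // F0
n = np.arange(period * 20)
phase = (n % period) / period
closed = phase >= 0.5          # stand-in for the EGG-derived closed period
flow = np.where(closed, 0.0, np.sin(2.0 * np.pi * phase) ** 2)
mask_signal = formant_filter(flow, TRUE_F1, TRUE_BW)

# Grid search: keep the setting whose closed phase is flattest and nearest zero.
candidates = [(f1, bw) for f1 in range(700, 1001, 50) for bw in range(40, 201, 40)]
best = min(
    candidates,
    key=lambda p: closed_phase_rms(inverse_filter(mask_signal, *p), closed),
)
print("best (F1 Hz, bandwidth Hz):", best)
```

On this noiseless synthetic signal only the true setting cancels the formant during the closed phase, which mirrors the observation that a single F1 and damping setting produced a flat period aligned with the EGG. Real recordings would add noise, higher formants, and mask response effects that this sketch leaves out.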
The resulting waveform was surprising to us at first. It had two peaks and a null near the center of the glottal open period. The interpretation I made was that when the first formant was closely tuned to the F0 and there was a period of glottal closure that was about half the total period, the pressure wave generated by the previous glottal pulse and returning from the lips strongly suppressed the glottal airflow, creating the dip in airflow that we observed.
The result of this pattern was (1) a greatly reduced average airflow, with a resultant reduced drying of the mucous membranes of the vocal folds and increased conservation of lung volume expended, and (2) a glottal waveform with a rich overtone content that would add energy at higher formants and support judgments of vowel quality.
Both of these factors are potentially very important in singing. Consider point number 2. This result says that with the proper vocal fold vibratory pattern, the tuning of F1 to F0 can greatly increase energy at harmonics well above F0 and reduce the F0 component of the glottal wave. This result is not predicted by mathematical models of the singing voice in which source and vocal tract are considered to function separately.
I found point number 1 above easy to accept, since it agreed with another aspect of my experience in technology, similar to the way in which the male type of source-tract interaction in the strong voice was analogous to an automobile ignition system. The potential soprano source-tract interaction at high pitches is apparently analogous to the final amplification stage of a radio transmitter. Just as Dolores' vocal folds were open for only part of the glottal cycle, the final amplifying stage supplies current to the antenna circuit for only part of each cycle at the transmitter's carrier frequency. It is well known in communications technology that under these conditions, when the antenna circuit is tuned accurately to the transmission frequency, there is a drop in the power (electrical current) drawn from the power supply. In fact, when helping maintain a radio relay station on a mountaintop during my military service, I would periodically check for the proper tuning of each transmitter's antenna circuit by observing a meter measuring the electrical current taken by the final amplification stage, and adjusting the antenna circuit to minimize that current. I couldn't help wondering whether the highly trained soprano does the same thing, that is, develop a feel for the amount of breath flow used and adjust the articulators to minimize that flow. The paper also speculates that performance by a soprano at times in which the vocal fold physiology is not providing the correct conditions for this airflow reduction to occur may be harmful to the vocal fold tissues.
**********
XII. The Control Of Airflow During Loud Soprano Singing.
Journal of Voice, Vol. 1 No. 3, 338-351 (1988).
The paper The Control Of Airflow During Loud Soprano Singing (COA), written with colleagues in the Speech Research Laboratory, Donald Miller, Richard Molitor and Dolores Leffingwell, can be seen as returning to the line of research of my doctoral dissertation (BSD, see I. above) after over 20 years in other areas, primarily voice research. (In the research for this paper, the CV mask was used for the purpose for which it was originally intended, the measurement of airflow during consonants.) The problem we considered stemmed from the fact that the subglottal air pressure used by a professional singer (we considered sopranos in this study) was known to reach values four or five times the pressure used in normal speaking. The previous paper (CFT) explored one mechanism for conserving breath volume during the vowel segments with air pressures this high. However, were there also mechanisms for conserving the breath volume during unvoiced consonants occurring between the vowels in the piece being sung? In most such consonants, as pronounced during speech, there is a period during which the vocal tract and glottis are open and the airflow increases.
It was affirmed in my dissertation (BSD) that the response times in the postural muscles controlling subglottal pressure were among the slowest in the body, and therefore it is not likely that the pressure could be reduced abruptly for the consonant and increased abruptly for a succeeding vowel. There are likely to be other compensatory mechanisms that must be learned by the professional singer. In COA we indeed found such mechanisms, and I will leave it to the reader to go to the paper itself to explore what we found.
To my knowledge, there have been no follow-ups to this study nor research on the implications for voice pedagogy. Perhaps this summary will help stimulate such activity.