Fricative consonants are sustained noise sounds produced by the human vocal tract. Unvoiced fricatives result from turbulent air flow, which is produced by forcing air through a constriction. The location of the constriction in the vocal tract determines how spectral energy is distributed. Each unvoiced fricative has a voiced counterpart, which is produced by allowing the vocal cords to vibrate in combination with the turbulent noise. For further background on vocal fricatives I direct you to Robert Mannell's page on Fricatives and James Kirby's PowerPoint, Spectral features of fricatives and stops.
Gunnar Fant's source-filter model still applies to fricative synthesis and indeed provides the
general framework for the so-called subtractive approach to
sound synthesis. Included is the wide range of whispered sounds, formally known as glottal fricatives and represented by the single
h. Whispered sounds can
be synthesized by processing wide-band noise through Instrument #121: Mouth1 and
Instrument #122: Nose1.
However, there is no Mouth1 or Nose1 for fricatives other than
h, at least not to my knowledge.
That is, I am not aware of a single filter bank which can synthesize the full range of fricative sounds simply by swapping parameters.
In part this may be due to the fact that there are several distinct places of excitation for
fricative sounds, and many of these places lie around the lips and front teeth rather than around the
glottis. Issues surrounding the use of filters to produce fricative sounds were explored in the section of
Synthesizing Noise Sounds devoted to Fricative Consonants,
where it was found necessary to employ a different configuration of filters to obtain each of the unvoiced
fricatives listed in Table 5 below.
Table 5 lists the more common fricatives in their unvoiced and voiced variants. The decision as to which fricatives should be included in this table was driven by the availability of spectrum graphs in Wiktor Jassem's 1962 typescript, “The formant patterns of fricative consonants”. As such, Table 5 includes all fricatives employed by English (also Swedish and Polish) plus a few voiced instances which happen to have unvoiced English counterparts. However, Table 5 falls far short of the diversity of fricative sounds employed by human languages worldwide.
|Description||Location of Constriction||Unvoiced IPA||Unvoiced Examples||Voiced IPA||Voiced Examples|
|labiodental||between the teeth and the lips||
|dental||between the tongue and the teeth||
|alveolar||between the tip of the tongue and the alveolar
ridge (the gum line behind the teeth)
|prepalatal||between the top of the tongue and the front of
the hard palate
|palatal||between the top of the tongue and the central
||not used in English||
||not used in English|
|velar||between back of the tongue and the
velum (i.e. soft palate)
||not used in English|
|glottal||transitional state of the glottis||
To produce wide-band noise suitable for synthesizing aspiration through the vocal tract, I use
Instrument #102: Whisper1. Speech synthesis literature from the early days advocates
brown noise for this purpose, but when I tried that it produced a coarse low rumbling
that was hardly suggestive of a whisper. More satisfying results were obtained by processing output from the
Noise unit through FilterBandPass2B,
which implements Victor Lazzarini's 2nd-order Butterworth band-pass filter.
The operation of Instrument #102 is detailed
in Figure 6 (a), while Figure 6 (b) plots the resulting spectrum.
|Figure 6 (a): Instrument #102: Whisper1 realizes the excitation phase for whispered vowels, diphthongs, glides, and liquids.||Figure 6 (b): Spectrum of the wide-band noise emitted by Instrument #102: Whisper1. The peak is located at 7000 Hz.|
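The idea behind Whisper1 (white noise shaped by a 2nd-order band-pass filter) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the filter below uses the widely published "cookbook" biquad band-pass formulas as a stand-in, not the actual FilterBandPass2B code, and the Q value, sample rate, and function names are all hypothetical. Only the 7000 Hz center frequency comes from Figure 6 (b).

```python
import math
import random


def bandpass_coefficients(center_hz, q, sample_rate):
    """Return normalized biquad coefficients (b0, b1, b2, a1, a2) for a
    2nd-order band-pass with unity gain at center_hz (cookbook formulas)."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    return (alpha / a0, 0.0, -alpha / a0,
            -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)


def biquad(samples, coeffs):
    """Filter a sequence of samples with a direct-form-I biquad."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def whisper_noise(n_samples, center_hz=7000.0, q=1.0,
                  sample_rate=44100, seed=0):
    """White noise passed through the band-pass.
    The q and sample_rate defaults are assumptions, not values from the text."""
    rng = random.Random(seed)
    white = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    coeffs = bandpass_coefficients(center_hz, q, sample_rate)
    return biquad(white, coeffs)


# One tenth of a second of band-limited noise at 44.1 kHz.
aspiration = whisper_noise(4410)
```

A low Q keeps the band wide, which is the point here: the noise should still excite formants well below and above the 7000 Hz peak.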
The sources available to me fall short when it comes to procedures for synthesizing voiced fricatives.
Klatt suggests amplitude-modulating the source noise with a sine wave, but
when I tried this the outcome was pretty horrible.
In the end I opted to mix pulse waves with noise in proportions favoring the pulse waves.
This source sound would then be processed through the filter banks used for the corresponding unvoiced fricative.
My solution worked well for some sounds, notably
v, but worked less well for other
sounds such as
Listing 7 presents the fifth-iteration synthesis of “Daisy Bell”. New indications for fricative consonants are color-coded in magenta. The addition of many new note stacks, for the most part employing voice #2, once again forces a resequencing of note IDs.
The h sound happens only once in “Daisy Bell”, and that one occurrence is
with the word “half”, 27 seconds into the rendition. The aspiration of the
vowel is produced by
note #77, whose wide-band noise output from
Instrument #102: Whisper1 is processed through Instrument #121: Mouth1 (
note #80). Notice that the aspiration source initially speaks by itself; the entrance of pulse-wave
sound from Instrument #101: Buzz1 (
note #78) holds off for 100 msec.,
then ramps up gradually over 200 msec.
While Buzz1 ramps up, Whisper1 sustains full amplitude for 150 msec., then ramps down
to zero over 50 msec.
The instrument stacks for voice #1 and voice #2 in
SpeechOrch.xml are designed to accommodate such overlap.
The note stacks for unvoiced fricatives process unpitched noise from Instrument #202: Noise2 Sust
through the phoneme-specific filter banks worked out in Synthesizing Noise Sounds.
Here are examples of note stacks synthesizing unvoiced fricatives. Pay attention to parameter #7 of
the notes invoking Instrument #202. This parameter provides the amplitude scale factor,
which controls the amplitude of the fricative relative to Contour #1: Amplitude.
For sibilant fricatives such as
∫, the scale factor is 0.03125 (1/32) or -15 dB.
For non-sibilant fricatives such as
θ, the scale factor is 0.0078125 (1/128) or -21 dB.
Notes 53-62 at time 17.00 synthesize the
s sound introducing syllable 2 of the word “answer”.
Notes 82-88 at time 29.00 synthesize the
f sound concluding the word “half”.
Notes 195-204 at time 53.50 synthesize the
∫ sound concluding the word “stylish”.
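As a quick check on the decibel figures quoted for the scale factors, the following sketch converts a scale factor to decibels under two conventions. The quoted values (-15 dB for 1/32, -21 dB for 1/128) agree with the 10·log10 form; under the 20·log10 form often applied to amplitude ratios, the same factors would come out near -30 dB and -42 dB. Which convention was intended is not stated, so treat this only as arithmetic; the function names are hypothetical.

```python
import math


def db_10log(scale):
    """Decibels computed as 10*log10(scale); matches the figures in the text."""
    return 10.0 * math.log10(scale)


def db_20log(scale):
    """Decibels computed as 20*log10(scale), the usual amplitude-ratio form."""
    return 20.0 * math.log10(scale)


# Scale factors quoted for unvoiced fricatives.
SCALE_FACTORS = {
    "sibilant (1/32)": 1.0 / 32.0,
    "non-sibilant (1/128)": 1.0 / 128.0,
}

for label, scale in SCALE_FACTORS.items():
    print(f"{label}: {db_10log(scale):.1f} dB (10 log10), "
          f"{db_20log(scale):.1f} dB (20 log10)")
```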
The note stacks for voiced fricatives process pulse waves from Instrument #201: Buzz2 (sometimes combined with Noise2 Sust) through the same filter banks used by their unvoiced counterparts. Here are examples of note stacks synthesizing voiced fricatives:
Notes 2-12 at time 3.00 synthesize the
z sound introducing syllable 2 of the word “daisy”. The filter bank is the same as that used for
s. The amplitude scale factor for Buzz2 is 0.015625 (1/64) or -18 dB. I chose not to include a Noise2 Sust source for z.
Notes 35-42 at time 12.50 synthesize the
v sound concluding the word “give”. The filter bank is the same as that used for
f. The amplitude scale factor for Buzz2 is 0.1015625 (13/128) or -10 dB. The amplitude scale factor for Noise2 Sust is 0.0078125 (1/128) or -21 dB.
Notes 124-130 at time 38.00 synthesize the
ð sound introducing the word “the”. The results are less than satisfactory. The filter bank is the same as that developed for
θ under Synthesizing Noise Sounds. The amplitude scale factor for Buzz2 is 0.125 (1/8) or -9 dB. I chose not to include a Noise2 Sust source for
ð (possibly a mistake).
Notes 211-221 at time 56.00 synthesize the
ʒ sound concluding the word “marriage”. The results are again less than satisfactory. The filter bank is the same as that used for
∫. The amplitude scale factor for Buzz2 is 0.21875 (7/32) or -7 dB. The amplitude scale factor for Noise2 Sust is 0.03125 (1/32) or -15 dB.
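The source mix for a voiced fricative, before it reaches the filter bank, might be sketched as follows. This is a rough illustration rather than the orchestra's actual signal path: the impulse train is a crude stand-in for Buzz2, uniform white noise stands in for Noise2 Sust, and the 220 Hz pitch, sample rate, and function names are assumptions. The two scale factors are the ones quoted above for v.

```python
import random


def pulse_train(n_samples, pitch_hz, sample_rate=44100):
    """Unit impulses spaced one pitch period apart (stand-in for a pulse wave)."""
    period = int(round(sample_rate / pitch_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def voiced_fricative_source(n_samples, pitch_hz, buzz_scale, noise_scale,
                            sample_rate=44100, seed=0):
    """Mix scaled pulses with scaled white noise, favoring the pulses.
    Pass noise_scale=0.0 for phonemes given no noise source (z and ð above)."""
    rng = random.Random(seed)
    buzz = pulse_train(n_samples, pitch_hz, sample_rate)
    return [buzz_scale * b + noise_scale * rng.uniform(-1.0, 1.0)
            for b in buzz]


# Scale factors quoted for v: Buzz2 at 13/128, Noise2 Sust at 1/128.
# The pitch of 220 Hz is arbitrary.
source_v = voiced_fricative_source(4410, 220.0, 13.0 / 128.0, 1.0 / 128.0)
```

The mixed signal would then pass through the same filter bank as the unvoiced counterpart (f in this case), which is omitted here.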
Finally, a point about continuity. If you examine spectrograms of human fricative utterances, for example those provided by
Mannell, the unvoiced and voiced portions
of the utterances are clearly distinguished. This suggests that fricative sounds should be spliced into pitched sounds with little
or no overlap, yet when I tried doing that the result sounded glitchy.
The fricative sounds in Listing 7 are therefore not spliced in, but rather overlapped with ongoing voiced
The second syllable of the word “Daisy” at time 3.0, for example, drops the slur employed in previous iterations
in favor of the voiced consonant
z. Instead, the coding implements the following scheme of overlap:
The z consonant and the
i vowel start simultaneously at time 3.0.
The z consonant ramps up quickly (0.03 seconds, specified in parameter #8 of
note #2) to full volume, holds that volume for 0.22 seconds (parameter #6 of
note #2 is 0.3 seconds; subtracting an attack duration of 0.03 and a release duration of 0.05 gives 0.22), and ramps back down to silence over 0.05 seconds (specified directly in argument #2 of the envelope unit for Instrument #102: Whisper1).
The i vowel ramps up gradually (0.2 seconds, specified as parameter #8 of
note #12), reaching full volume just before the z consonant (
note #2) begins its release. Now, I could instead have held back the start time for
i for a tenth of a second, then ramped the vowel up more quickly. Actually, I tried that first and the results sounded glitchy.
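The overlap timing can be checked with a small envelope sketch. It assumes simple linear attack/hold/release shapes, which may not match the actual envelope units, and the vowel's total duration and release time are hypothetical values chosen only so the example runs. The consonant's numbers (0.3 seconds total, 0.03 attack, 0.05 release) and the vowel's 0.2 second attack come from the description above.

```python
def trapezoid(t, start, duration, attack, release):
    """Linear attack/hold/release envelope value at time t (seconds)."""
    local = t - start
    if local < 0.0 or local > duration:
        return 0.0
    if local < attack:
        return local / attack
    if local > duration - release:
        return (duration - local) / release
    return 1.0


# z consonant: values quoted in the text.
Z_START, Z_DURATION, Z_ATTACK, Z_RELEASE = 3.0, 0.3, 0.03, 0.05
z_hold = Z_DURATION - Z_ATTACK - Z_RELEASE           # about 0.22 seconds
z_release_start = Z_START + Z_DURATION - Z_RELEASE   # about 3.25

# i vowel: the attack is from the text; duration and release are assumed.
I_START, I_DURATION, I_ATTACK, I_RELEASE = 3.0, 1.0, 0.2, 0.1
i_full_volume_at = I_START + I_ATTACK                # about 3.2

# Sample both envelopes every 50 msec across the overlap.
for step in range(8):
    t = 3.0 + 0.05 * step
    z_level = trapezoid(t, Z_START, Z_DURATION, Z_ATTACK, Z_RELEASE)
    i_level = trapezoid(t, I_START, I_DURATION, I_ATTACK, I_RELEASE)
    print(f"t={t:.2f}  z={z_level:.2f}  i={i_level:.2f}")
```

Under these assumptions the vowel reaches full volume about 0.05 seconds before the consonant starts to fade, so the two are never both below full level at once.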
My articulation policy for Iteration #5 has been to overlap fricatives with surrounding vowels, glides, liquids, and nasals so long as both phonemes occur in the same word.
|© Charles Ames||Page created: 2014-02-20||Last updated: 2017-06-12|