You belong on this page if you understand sound-synthesis instruments and notelists, and if you wish to understand how
MUSIC-N style software sound synthesis can emulate human vocal sounds.
The earliest example of computer-synthesized singing known to me is a 1961 rendition of a male human voice singing the chorus of Henry Dacre's
1892 “Daisy Bell” (popularly known as “A Bicycle Built for Two”).
This early example was created at Bell Labs, and the results can be heard on
YouTube. The example has an accompaniment generated by Max Mathews,
no doubt using one of the
MUSIC-N series of programs. However,
MUSIC-N does not seem
to have played a role in the actual speech synthesis.
The goal of this page sequence is to use the Sound engine to re-synthesize “Daisy Bell”. The objective is synthesis informed by acoustics, which means that wherever possible we want the assembly of sound-synthesis components (oscillators, noise generators, filters, etc.) to be explainable in terms of the vocal tract and its resonant cavities. However, this will be possible only up to a point — the later pages on fricatives and plosives rely upon sound spectra which are analytically produced without reference to any physical model. This exercise intentionally excludes techniques such as sampling and linear predictive coding. Although these techniques produce vocal sounds which are much more realistic than those which will be achieved here, using them leaves the user with very little insight into the nature of a sound: its spectral peaks and valleys, and how the sound unfolds over time.
I undertook this exercise as a reality check on the Sound engine.
It should have been straightforward; after all, computers have been doing speech synthesis at least since 1961.
My major professor at UB, Lejaren Hiller, had always maintained that the
MUSIC-N programs (from which the Sound engine descends) were justified to Bell Labs as speech-synthesis platforms.
Plus I had a very promising resource in Dennis H. Klatt's 1980 article “Software for a cascade/parallel formant synthesizer”
(J Acoust Soc Am 67, pp. 971-995), which is available on this site in PDF format.
Klatt's article described solutions to pretty much all of the challenges posed by speech synthesis, and it even provided tables of
specific parameter settings used to produce specific phonemes.
Whether or not Hiller was historically correct about
MUSIC-N, I cannot say.
However, I have personally concluded that
MUSIC-N's note-parameter model
is inadequate for speech synthesis, for several reasons:
In listing the reasons why
MUSIC-N's note-parameter model is inadequate, I have
also suggested ways of enhancing the model so that speech synthesis would be feasible. How my Sound
engine implements these enhancements is detailed elsewhere, but in a nutshell
it involves voices, which provide scope for passing signals between instruments, and contours,
which allow segment-by-segment description of control signals and whose information is accessible to any note —
so long as the note's voice ID matches the contour's voice ID.
Reality check indeed!
Not being affiliated with any academic institution, I have no ready access to a university library or to the archival services employed by academic journals. As such, my sources have been limited to information publicly available on the internet, and to old books I have kicking around. While Klatt's solution did not prove as helpful as I hoped, I found enough information elsewhere to make up the deficit. The resource pages on Speech Acoustics developed by Robert Mannel and Felicity Cox have been particularly helpful to me, and should be considered suggested reading even though the pages are written in Australian. Another useful overview is available in the series of lecture slides prepared by James Kirby, evidently for presentation in Hanoi. I have also had recourse to two period books, J.L. Flanagan's 1977 Speech Analysis Synthesis and Perception and a 1973 collection edited by Minifie, Hixon, and Williams, Normal Aspects of Speech, Hearing, and Language, particularly Minifie's contribution on “Speech Acoustics” (pp. 235-284).
We start with melody and lyrics. Sheet music for “Daisy Bell” is available on the web, for example at www.free-scores.com. We are concerned just with the chorus, ignoring both the verses and the accompaniment.
The ultimate rendition of “Daisy Bell” will be realized over several iterations, with new speech-synthesis techniques being introduced as needed.
When I began this exercise, I initially coded the note lists by hand. As complexity increased owing to the addition of new
phoneme categories, manual preparation became increasingly tedious and less practical. The tedium was compounded
when changes in policy (e.g. how notes should be articulated) or instrument design forced re-coding
of the earlier listings. During the preparation of Iteration #5 (which employs separate notes to shape different spectral features of
fricative noise sounds) the manual coding got to be too much. I had already developed Java procedures to
try out individual words, but from this point on I undertook to write procedures which generated note lists for
entire phrases, with the iteration number as a parameter. By making use of
statements to offset note starting-times, it was possible to generate note lists which could be tested individually, then
pasted into larger iteration lists.
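The phrase-generation scheme just described can be sketched in Java. Everything here is hypothetical illustration — the class and method names (PhraseBuilder, Note, offset, phrase) are mine, and the actual procedures emit Sound-engine note-list text rather than Java objects — but it shows the key idea: word-level lists are built starting at time zero so they can be tested alone, then shifted by an offset and pasted into a larger phrase list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of phrase-level note-list generation with start-time
// offsets. Names and structure are illustrative, not the actual procedures.
public class PhraseBuilder {

    // A minimal note: instrument number, start time, and duration in seconds.
    static final class Note {
        final int instrument;
        final double start;
        final double duration;
        Note(int instrument, double start, double duration) {
            this.instrument = instrument;
            this.start = start;
            this.duration = duration;
        }
    }

    // Shift every note in a word's list by a phrase-level offset, so a word
    // list tested alone at time zero can be placed anywhere in a phrase.
    static List<Note> offset(List<Note> word, double shift) {
        List<Note> out = new ArrayList<>();
        for (Note n : word)
            out.add(new Note(n.instrument, n.start + shift, n.duration));
        return out;
    }

    // Concatenate word lists end-to-end into one phrase list.
    static List<Note> phrase(List<List<Note>> words) {
        List<Note> out = new ArrayList<>();
        double t = 0.0;
        for (List<Note> w : words) {
            double end = 0.0;
            for (Note n : w) end = Math.max(end, n.start + n.duration);
            out.addAll(offset(w, t));
            t += end;
        }
        return out;
    }
}
```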
The orchestra used for the present exercise is implemented in the file SpeechOrch.xml. If you have access to the Sound application, and you intend to replicate the sound-synthesis runs on your own, you should download SpeechOrch.xml into your working directory.
The speech-synthesis orchestra extends many ideas about modular instrument design developed in the page on Synthesizing Noise Sounds. The orchestra defines two voices, with a separate stack of instruments operating within each voice. Both instrument stacks implement versions of the source-filter model of speech production. Voice #1 is for vowels and vowel-like consonants, while voice #2 is for fricatives and other mostly noisy sounds. In either case, it takes several notes to generate a single sound, and information is passed between notes through two intra-voice signals. Signal G1 transmits the audio signal while signal G2 transmits the power envelope.
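The capture-and-restore division of labor carried by signal G2 can be illustrated with a short sketch. This is not the actual instrument code — the windowed-RMS formulation and all names are my assumptions — but it shows why an envelope captured before filtering is worth keeping: filtering alters a signal's power contour, and rescaling by the captured envelope restores it.

```java
// Illustrative sketch (not the actual instruments) of the capture/restore
// idea: measure the RMS power envelope before filtering, then rescale the
// filtered signal so its envelope matches the original.
public class EnvelopeDemo {

    // Windowed RMS envelope of a signal, one value per sample.
    static double[] rmsEnvelope(double[] x, int window) {
        double[] env = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            double sum = 0.0;
            int lo = Math.max(0, i - window + 1);
            for (int j = lo; j <= i; j++) sum += x[j] * x[j];
            env[i] = Math.sqrt(sum / (i - lo + 1));
        }
        return env;
    }

    // Rescale a filtered signal so its RMS envelope matches a captured one.
    static double[] restore(double[] filtered, double[] captured, int window) {
        double[] cur = rmsEnvelope(filtered, window);
        double[] out = new double[filtered.length];
        for (int i = 0; i < filtered.length; i++)
            out[i] = cur[i] > 0.0 ? filtered[i] * captured[i] / cur[i] : 0.0;
        return out;
    }
}
```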
The orchestra is monophonic, which means that although vowel-like and noisy sounds are produced along very different synthesis streams, they are both heard as coming from the same physical location. Keep this in mind should you choose to adapt this orchestra for stereophonic synthesis: The voice entity as implemented within the Sound engine is a construct affecting the scope of contours and signals. It may often equate to a musical “voice”, but it doesn't have to.
If you have access to the Sound application, and want to know intimate details about the speech-synthesis orchestra,
you can always view
SpeechOrch.xml in Sound's Orchestra Editor.
Listing 1 presents the note list header shared by all “Daisy Bell” iterations.
The first text line in the header names the orchestra file, which indicates that the listed notes will be synthesized using
SpeechOrch.xml. You'll need to
adapt this statement to reflect your own working directory. Of the remaining statements,
set rate sets the sampling rate
to 44100, which is the standard for audio CDs and permits synthesis of frequencies across the full range of human hearing, while
set bits sets the ultimate sound quantization level to 16-bits.
The set norm statement causes two passes through
the data. The first pass saves all samples with 32-bit accuracy in a temporary file, while the second pass rescales these samples
to optimize the signal-to-noise ratio.
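The rescaling pass amounts to peak normalization: find the largest sample magnitude from the first pass, then scale everything so that peak just reaches full 16-bit scale. A minimal sketch of that second pass, assuming simple peak normalization (the engine's exact rescaling policy may differ):

```java
// Sketch of the second pass behind "set norm": given full-precision samples
// from pass one, rescale so the peak maps to full 16-bit scale (32767).
public class NormalizeDemo {

    static short[] normalize(double[] samples) {
        double peak = 0.0;
        for (double s : samples) peak = Math.max(peak, Math.abs(s));
        double gain = peak > 0.0 ? 32767.0 / peak : 0.0;
        short[] out = new short[samples.length];
        for (int i = 0; i < samples.length; i++)
            out[i] = (short) Math.round(samples[i] * gain);
        return out;
    }
}
```

Because the gain is chosen after all samples are known, no headroom guesswork is needed and quantization noise is as small as the 16-bit format allows — which is the point of the two-pass design.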
SpeechOrch.xml defines five contours and declares all five accessible both to voice #1 and to voice #2. The five contours are:
All five contours use the exponential calculation mode, which means that transitions from origins to goals proceed along equal-ratio curves.
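Equal-ratio motion means that equal time steps multiply the value by equal factors, so the halfway point is the geometric mean of origin and goal. The standard formulation is sketched below; the Sound engine's internal computation may of course differ in detail.

```java
// Equal-ratio (exponential) interpolation from origin to goal: the value at
// fraction t of the transition (0 <= t <= 1) is origin * (goal/origin)^t.
// Both origin and goal must be nonzero and share the same sign.
public class ContourDemo {
    static double exponential(double origin, double goal, double fraction) {
        return origin * Math.pow(goal / origin, fraction);
    }
}
```

For example, a frequency contour running from 100 Hz. to 400 Hz. passes through 200 Hz. at its midpoint, not 250 Hz. as linear interpolation would give — which is why exponential mode suits perceptual quantities like frequency and amplitude.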
To avoid foldover it is necessary to ensure that no harmonic of any tone exceeds the Nyquist limit. The upper limit for Contour #2: Frequency is calculated to accommodate a pulse waveform whose uppermost harmonic is 16 times the frequency of the waveform's fundamental: (44100 ÷ 2) ÷ 16 ≈ 1378 Hz. Frequencies in “Daisy Bell” range from 146.8 Hz. (D3) to 293.7 Hz. (D4), all well below this calculated limit.
Additional due diligence would verify that the harmonics of any tone used will cover the highest formant regions. A table of vowel formant frequencies will be provided for Iteration #2, but for now it is sufficient to know that the highest F3 value listed in that table is 3079 Hz. The lowest pitch in “Daisy Bell” is 146.8 Hz. (D3), which will produce a highest harmonic at 16 × 146.8 = 2349 Hz.
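The two checks above reduce to a few lines of arithmetic, sketched here (the method names are mine, purely for illustration):

```java
// Arithmetic behind the foldover and formant-coverage checks for a pulse
// waveform carrying a fixed number of harmonics.
public class FoldoverDemo {

    // Highest safe fundamental: Nyquist limit divided by the harmonic count.
    static double maxFundamental(double sampleRate, int harmonics) {
        return (sampleRate / 2.0) / harmonics;
    }

    // Frequency of the uppermost harmonic produced by a given fundamental.
    static double highestHarmonic(double fundamental, int harmonics) {
        return fundamental * harmonics;
    }
}
```

With a 44100 Hz. sampling rate and 16 harmonics the safe fundamental tops out near 1378 Hz., and the lowest “Daisy Bell” pitch of 146.8 Hz. yields an uppermost harmonic near 2349 Hz. — the figures quoted above.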
The stack for voice #1 has four categories of swappable component. It takes at least four notes to make a sound using the voice #1 stack:
Nasal consonants (m, n, ŋ) are handled by Instrument #122: Nose1, which draws its attack duration from note parameter #7, its release duration from note parameter #8, its notch frequency from note parameter #9, and its notch bandwidth from note parameter #10.
Thus to produce a vowel sound, you would use a quartet of notes:
The first note in the quartet will invoke instrument #101 to generate a pitched tone.
The second note in the quartet will invoke instrument #119 to capture the RMS power envelope.
The third note of the quartet will invoke instrument #121 to apply the resonant characteristics of the vocal tract to the pitched tone.
The fourth note of the quartet will invoke instrument #199 to restore the envelope captured by instrument #119.
Instrument #122 always works in conjunction with instrument #121. Thus synthesizing the word
“manor” would require a stack of six notes invoking instrument numbers
101, 119, 121, 122, 122, and 199 respectively. The notes for instruments 101, 119, 121, and 199 would start and end simultaneously,
lasting for the entire duration of the word. The first note for instrument 122 would last for the duration of the
“m” sound, while the second note for instrument 122 would last for the duration of the “n” sound.
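The six-note stack for “manor” can be assembled mechanically once the word duration and the nasal segments are known. The sketch below is hypothetical (the class, method, and field names are mine); only the instrument numbers and their ordering follow the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of assembling the six-note stack for "manor":
// instruments 101, 119, 121, and 199 span the whole word, while two
// instrument-122 notes cover the two nasal segments.
public class ManorStack {

    static final class Note {
        final int instrument;
        final double start;
        final double duration;
        Note(int instrument, double start, double duration) {
            this.instrument = instrument;
            this.start = start;
            this.duration = duration;
        }
    }

    // wordDur: duration of the whole word in seconds; nasals: one
    // {start, duration} pair per nasal segment, each getting its own
    // instrument-122 note. Emits notes in the order 101, 119, 121,
    // 122..., 199 as described above.
    static List<Note> build(double wordDur, double[][] nasals) {
        List<Note> notes = new ArrayList<>();
        for (int instr : new int[] {101, 119, 121})
            notes.add(new Note(instr, 0.0, wordDur));
        for (double[] seg : nasals)
            notes.add(new Note(122, seg[0], seg[1]));
        notes.add(new Note(199, 0.0, wordDur));
        return notes;
    }
}
```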
The stack for voice #2 also has four categories of swappable component.
In this case the design is more directly influenced by Klatt's cascade/parallel formant synthesizer.
It again takes four or (usually) more notes to make a sound using the voice #2 stack:
Next topic: Melody
|© Charles Ames||Page created: 2014-02-20||Last updated: 2015-07-12|