The first attempts to produce human speech by machine were made in the second half of the 18th century. Ch. G. Kratzenstein, professor of physiology in Copenhagen, previously in Halle and Petersburg, succeeded in producing vowels using resonance tubes connected to organ pipes (1773). By that time, Wolfgang von Kempelen had already begun his own experiments, which led him to construct a speaking machine. Von Kempelen was an ingenious person in the service of Empress Maria Theresa in Vienna. He was born in 1734 in Bratislava, then the capital of Hungary, and he died in Vienna in 1804. While he became known for various other feats, his main concern was the study of human speech production, with therapeutic applications in mind. He has been called the first experimental phonetician. In his book Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine (1791) he included a detailed description of his speaking machine so that others could reconstruct and improve it. The six drawings (three plus three) shown below to the right are taken from this book.
Von Kempelen's machine was the first that could produce not only some speech sounds but also whole words and short sentences. According to von Kempelen, one can acquire an admirable facility in playing the machine within three weeks, especially if one chooses Latin, French, or Italian, since German is much more difficult because of its many closed syllables and consonant clusters.
The machine consisted of a bellows that simulated the lungs and was to be operated with the right forearm (uppermost drawing). A counterweight provided for inhalation. The middle and lower drawings show the 'wind box' that was provided with some levers to be actuated with the fingers of the right hand, the 'mouth', made of rubber, and the 'nose' of the machine. The two nostrils had to be covered with two fingers unless a nasal was to be produced. The whole speech production mechanism was enclosed in a box with holes for the hands and additional holes in its cover.
The air flow was conducted into the mouth not only by way of an oscillating reed, but also through a narrow shunting tube. This allowed the air pressure in the mouth cavity to increase when its opening was covered tightly in order to produce unvoiced speech sounds. Driven by a spring, a small auxiliary bellows would then deliver an extra puff of air at the release.
With the left hand, it was also possible to control the resonance properties of the mouth by varied covering of its opening. In this way, some vowels and consonants could be simulated in sufficient approximation. This was not really a simulation of natural articulation, since the shape of the mouth of the machine in itself remained constant. Some vowels and, especially, the consonants [d t g k] could not be simulated in this way, but only feigned, at best. An [l] could be produced by putting the thumb into the mouth.
The function of the vocal cords was simulated by a slamming reed made of ivory (leftmost drawing). Although the effective length of the reed could be varied, this could not be done during speech production, so the machine spoke in a monotone.
Two of the levers to be actuated with the right hand served to produce the fricatives [s] and [ʃ] as well as [z] and [ʒ] by means of separate hissing whistles (right drawing). A third lever produced a rattling [R] by dropping a wire onto the vibrating reed (middle drawing).
The final version of von Kempelen's machine is preserved to this day. It was kept at the k. k. Konservatorium für Musik in Vienna until 1906, when it was donated to the Deutsches Museum (von Meisterwerken der Naturwissenschaft und Technik) in Munich, which had been founded three years before. There, it is exhibited in the department of musical instruments. This machine differs from the one described in the book in having a handle, operated with the palm of the right hand, by which the oscillating length of the reed can be controlled during speech production. This makes it possible to attempt to simulate a natural course of intonation.
On July 8th, 1997, I enjoyed the privilege of being allowed to play von Kempelen's machine. Its voice production mechanism, including the pitch control, was still functional. The voice sounded like that of a child or of an adult speaking quite loudly.
A reconstruction of the machine, demonstrated by Wheatstone (1835) in Dublin, differed from the version described in the book by having a flexible oral cavity and active voicing control, but it lacked the pitch control mechanism included in Kempelen's final version.
In the 19th century, some additional machines of a similar kind were constructed, but there were no really fundamental innovations in the field of speech synthesis. However, the device constructed by Joseph Faber in 1835 can be said to represent some progress in that its speech production mechanism included a model of the tongue and a pharyngeal cavity whose shape could be controlled. It was also suited for the synthesis of singing. Its bellows was operated via a pedal, and otherwise it was controlled via a keyboard.
As late as 1937, R. R. Riesz (USA) constructed a device similar to those mentioned above, but with a vocal tract shape that was close to the natural one.
Although Kempelen had already understood that no more than one vocal tract should be used for the synthesis of continuous speech, devices with separate resonators, each for one particular vowel, were still constructed for certain purposes a hundred years later. Rather charming are the Sirènes à voyelles et résonateurs buccaux of G.R.M. Marage (Paris, 1900).
At the beginning of the 20th century, progress in electrical engineering made it possible to synthesize speech sounds by electrical means. The first device of this kind that attracted the attention of a wider public was the VODER, developed by Homer Dudley and presented at the World's Fair in New York in 1939. However, this device required a very long training time for successful use.
Manually controlled speech synthesizers like that of Kempelen and the VODER served mainly the purpose of entertainment, but they also had a more serious background. Kempelen developed his device in parallel with his investigation of the human speech production mechanism, and Dudley's device was based on his VOCODER (Voice Coder), whose purpose was to reduce the bandwidth necessary for the telephonic transmission of speech, so that a larger number of telephone calls could be transmitted over a given line.
A lamp produces a light ray that is directed radially against a rotating disk with 50 concentric tracks whose transparency varies in a systematic fashion in order to produce 50 partials with a fundamental frequency of 120 Hz. The light is further projected against a spectrogram whose reflectance or transparency (in an alternative mode of operation) corresponds to the sound pressure level of each partial, and it is directed towards a photovoltaic cell by which the variation in light is converted, ultimately, into variations in sound pressure. The spectrogram is moved past the light ray by means of rollers. In this way, one obtains a monotonous speech signal that in other respects can be quite similar to the original speech. Instead of real spectrograms, it is also possible to use artificial spectrograms painted by hand. Perception experiments performed with signals produced in this way yielded a series of new insights into the perceptual role of various details in the spectra of speech sounds.
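The optical principle described above amounts to additive synthesis: 50 sinusoids at multiples of 120 Hz, each weighted by the corresponding spectrogram value at the current moment. The following is only a minimal digital sketch of that idea, not a model of the actual apparatus. The sample rate, the frame duration, the function name `playback`, and the use of linear amplitudes (rather than levels in dB) are all assumptions made for illustration.

```python
import math

F0 = 120.0           # fundamental frequency in Hz, as stated in the text
N_PARTIALS = 50      # one partial per concentric track, as stated in the text
SAMPLE_RATE = 16000  # assumed; must exceed 2 * 50 * 120 Hz = 12 kHz


def playback(spectrogram, frame_dur=0.01):
    """Additive resynthesis from a spectrogram (hypothetical helper).

    spectrogram: list of frames; frame[k] is the assumed linear amplitude
    of partial k + 1, i.e. of the sinusoid at (k + 1) * F0 Hz.
    frame_dur: assumed duration of one frame in seconds.
    Returns a list of samples. The pitch is fixed at F0, so the result
    is monotonous, as noted in the text.
    """
    samples_per_frame = int(round(frame_dur * SAMPLE_RATE))
    out = []
    n = 0  # running sample index, so that phases stay continuous
    for frame in spectrogram:
        for _ in range(samples_per_frame):
            t = n / SAMPLE_RATE
            out.append(sum(
                amp * math.sin(2 * math.pi * F0 * (k + 1) * t)
                for k, amp in enumerate(frame) if amp
            ))
            n += 1
    return out


# Hand-"painted" pattern: only the 5th partial (600 Hz) is present.
frame = [0.0] * N_PARTIALS
frame[4] = 1.0
signal = playback([frame] * 10)
```

A hand-painted spectrogram, as mentioned in the text, corresponds here to simply filling in the frame lists by hand instead of deriving them from recorded speech.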
In the models that have been developed by several researchers since the 1950s, an electric source signal is passed through a filter. The source signal is either a harmonic tone, as in the voiced speech sounds, or an aperiodic noise, as in the unvoiced segments.
The filter serves the purpose of simulating the resonance properties of the vocal tract. Two distinct approaches have been tried. In one of these, articulation is simulated with a large number of circuits connected in cascade. Each of these circuits represents a short section of the vocal tract (5 mm or so), whereby the cross-sectional area of each section is crucial (transmission line analog). The other method uses resonance circuits to simulate each formant, i.e., the resonances of the vocal tract, irrespective of its shape (terminal analog).
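The terminal-analog approach can be illustrated with a small digital sketch: a source (pulse train for voiced sounds, noise for unvoiced ones) is passed through a cascade of second-order resonators, one per formant. This is a digital stand-in for the electrical circuits described above, not a reconstruction of any particular historical synthesizer. The resonator difference equation is the standard two-pole form used in later digital formant synthesizers; the sample rate, the function names, and the formant values are assumptions chosen only for illustration.

```python
import math
import random

SAMPLE_RATE = 10000  # Hz, assumed


def resonator(signal, freq, bandwidth):
    """Two-pole digital resonator with unity gain at 0 Hz.

    Implements y[n] = a*x[n] + b*y[n-1] + c*y[n-2], where the pole
    radius is set by the bandwidth and the pole angle by the
    centre frequency (both in Hz).
    """
    T = 1.0 / SAMPLE_RATE
    c = -math.exp(-2 * math.pi * bandwidth * T)
    b = 2 * math.exp(-math.pi * bandwidth * T) * math.cos(2 * math.pi * freq * T)
    a = 1 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def source(n_samples, f0=None, seed=0):
    """Pulse train at f0 Hz (voiced) or uniform noise if f0 is None (unvoiced)."""
    if f0 is None:
        rng = random.Random(seed)
        return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    period = int(round(SAMPLE_RATE / f0))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def synthesize(n_samples, formants, f0=None):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = source(n_samples, f0)
    for freq, bw in formants:
        signal = resonator(signal, freq, bw)
    return signal


# Rough, illustrative formant values for an [a]-like vowel (assumed).
vowel = synthesize(1000, [(700, 80), (1200, 90), (2500, 120)], f0=100)
```

Note that only formant frequencies and bandwidths appear as parameters; the shape of the vocal tract is never represented, which is what distinguishes this approach from the transmission line analog.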
With the Parametric Artificial Talker (PAT) of Walter Lawrence (1953), it was also possible to produce natural-sounding consonants.
The simplistic idea of concatenating stored words or various shorter segments in order to produce speech was also realized. However, single speech sounds (phones) cannot be successfully concatenated into words and sentences, since the acoustic properties of these minimal distinctive segments of speech vary as a function of their context, and this variation is necessary for intelligibility and naturalness. Better success has been achieved with so-called "diphones", which consist of the second half of one speech sound and the first half of the subsequent one. This results in a large number of elements, which have to be carefully selected. With such methods it is possible to achieve a high degree of naturalness, even without having a complete description and understanding of the acoustics of speech production. However, these methods lack the flexibility of synthesis by rule. Rule-based synthesis does not make use of any stored segments of speech; instead, all the properties of the speech signal result from the application of a set of rules.
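The bookkeeping behind diphone concatenation can be sketched briefly. The example below only shows how a phone sequence maps onto diphone units and why the inventory becomes large; it does not handle audio. The function names, the `_` symbol for silence, and the phone labels are assumptions made for illustration.

```python
def to_diphones(phones, silence="_"):
    """Return the diphone units needed to concatenate a phone sequence.

    Each unit spans the second half of one phone and the first half of
    the next; the utterance is padded with silence at both ends.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def max_inventory(n_phones):
    """Upper bound on phone-to-phone diphones: every ordered pair."""
    return n_phones ** 2


units = to_diphones(["k", "a", "t"])
# units == ["_-k", "k-a", "a-t", "t-_"]
```

Because each unit is cut in the relatively stable middle of a phone, the context-dependent transitions between phones are stored inside the units rather than having to be generated. The price is the inventory size: with, say, 40 phones, up to 1600 phone-to-phone units may be needed, in addition to those involving silence.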
At the present state of the art, the limits of the achievable intelligibility and naturalness of synthetic speech are no longer set by technological factors, but rather by our limited knowledge about the acoustics and the perception of speech. In research, speech synthesis is used to test this knowledge. There are now methods by which an automatic analysis and resynthesis of speech can be performed. In this process, it is possible to manipulate the description of the signal before resynthesis, for instance in order to modify the apparent age of the speaker. The success of such manipulations depends on knowledge of the essential factors. You can listen to such manipulations and judge for yourself how well they succeeded: Manipulations in speaker age and sex (Swedish examples).
Wolfgang von Kempelen (1791) Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine and Le Méchanisme de la parole, suivi de la description d'une machine parlante, Vienna: J.V. Degen. A reprint of the German edition, with an introduction by Herbert E. Brekle and Wolfgang Wildgen (1970), Stuttgart: Frommann-Holzboog. There are also more recent translations into Hungarian and Slovak.
James L. Flanagan (1965) Speech Analysis: Synthesis and Perception, Berlin: Springer.
Jens-Peter Köster (1973) Historische Entwicklung von Syntheseapparaten zur Erzeugung statischer und vokalartiger Signale nebst Untersuchungen zur Synthese deutscher Vokale, Hamburg: H. Buske. (Dissertation, out-of-print.)
Dennis H. Klatt (1987) Review of text-to-speech conversion for English, Journal of the Acoustical Society of America, 82: 737-793.
Joachim Gessinger (1994) Auge & Ohr. Studien zur Erforschung der Sprache am Menschen 1700-1850, Berlin, N.Y.: De Gruyter.
Hartmut Traunmüller | Inst. för lingvistik | Stockholms Universitet | Wolfgang von Kempelen on the Web | Aug. 1997, Jan. 1998, Sept. 2000.