PDF versions of readings will be available on the web site. Recommended text: *Speech Synthesis and Recognition*, Holmes, 2nd ed.
Computer generation and recognition of speech are formidable problems; many approaches have been tried, with only mild success. This is an active area of DSP research, and will undoubtedly remain so for many years to come. You will be very disappointed if you are expecting this section to describe how to build speech synthesis and recognition circuits. Only a brief introduction to the typical approaches can be presented here. Before starting, it should be pointed out that most commercial products that produce human-sounding speech do not synthesize it, but merely play back a digitally recorded segment from a human speaker. This approach has great sound quality, but it is limited to the prerecorded words and phrases.

Nearly all techniques for speech synthesis and recognition are based on the model of human speech production shown in Fig. 22-8.
Most human speech sounds can be classified as either voiced or fricative. Voiced sounds occur when air is forced from the lungs, through the vocal cords, and out of the mouth and/or nose.
The vocal cords are two thin flaps of tissue stretched across the air flow, just behind the Adam's apple.
In response to varying muscle tension, the vocal cords vibrate at frequencies between 50 and 1000 Hz, resulting in periodic puffs of air being injected into the throat. Vowels are an example of voiced sounds. In Fig. 22-8, voiced sounds are represented by the pulse train generator, with the pitch (i.e., the fundamental frequency of the waveform) being an adjustable parameter.

In comparison, fricative sounds originate as random noise, not from vibration of the vocal cords. This occurs when the air flow is nearly blocked by the tongue, lips, and/or teeth, resulting in air turbulence near the constriction. Fricative sounds include: s, f, sh, z, v, and th.
In the model of Fig. 22-8, fricatives are represented by a noise generator.

Both these sound sources are modified by the acoustic cavities formed from the tongue, lips, mouth, throat, and nasal passages. Since sound propagation through these structures is a linear process, it can be represented as a linear filter with an appropriately chosen impulse response. In most cases, a recursive filter is used in the model, with the recursion coefficients specifying the filter's characteristics. Because the acoustic cavities have dimensions of several centimeters, the frequency response is primarily a series of resonances in the kilohertz range. In the jargon of audio processing, these resonance peaks are called the formant frequencies. By changing the relative position of the tongue and lips, the formant frequencies can be changed in both frequency and amplitude.
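To make the model concrete, here is a minimal sketch in Python with NumPy and SciPy. The sampling rate, pitch, and filter coefficients below are invented for illustration, not taken from any measured vocal tract; the point is only the structure: a pulse train or noise generator as excitation, followed by a recursive (IIR) filter standing in for the acoustic cavities.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                                    # sampling rate in Hz (assumed)

def excitation(voiced, pitch_hz, n):
    """Pulse train for voiced sounds, white noise for fricatives."""
    if voiced:
        e = np.zeros(n)
        period = int(fs / pitch_hz)          # samples between glottal pulses
        e[::period] = 1.0
        return e
    return np.random.randn(n)                # air turbulence -> random noise

# Two-pole recursive filter standing in for the vocal tract; these
# made-up coefficients give a single stable resonance near 1 kHz.
b = [1.0]
a = [1.0, -1.0, 0.64]

n = fs // 40                                 # one 25 ms segment
voiced_seg = lfilter(b, a, excitation(True, 120, n))      # vowel-like
fricative_seg = lfilter(b, a, excitation(False, 0, n))    # s-like
```

A more realistic synthesizer would cascade several such two-pole resonators, one per formant, and vary their coefficients from segment to segment.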
Figure 22-9 shows a common way to display speech signals, the voice spectrogram, or voiceprint. The audio signal is broken into short segments, say 2 to 40 milliseconds, and the FFT used to find the frequency spectrum of each segment. These spectra are placed side-by-side, and converted into a grayscale image (low amplitude becomes light, and high amplitude becomes dark). This provides a graphical way of observing how the frequency content of speech changes with time. The segment length is chosen as a tradeoff between frequency resolution (favored by longer segments) and time resolution (favored by shorter segments).
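A bare-bones spectrogram along these lines is easy to compute directly. The sketch below (Python/NumPy; the 20 ms segment length is an arbitrary choice within the stated range) stacks the magnitude spectrum of each segment into a 2-D array that can be rendered as a grayscale image:

```python
import numpy as np

def spectrogram(signal, fs, seg_ms=20):
    """Side-by-side magnitude spectra of short segments of `signal`."""
    seg_len = int(fs * seg_ms / 1000)        # 20 ms -> 160 samples at 8 kHz
    window = np.hanning(seg_len)             # taper to reduce spectral leakage
    columns = []
    for i in range(len(signal) // seg_len):
        seg = signal[i * seg_len:(i + 1) * seg_len] * window
        columns.append(np.abs(np.fft.rfft(seg)))   # one-sided spectrum
    # Rows are frequency bins, columns are time; displayed as an image
    # (dark = high amplitude), this is the voiceprint of Fig. 22-9.
    return np.array(columns).T
```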
As demonstrated by the *a* in rain, voiced sounds have a periodic time domain waveform, shown in (a), and a frequency spectrum that is a series of regularly spaced harmonics, shown in (b). In comparison, the *s* in storm shows that fricatives have a noisy time domain signal, as in (c), and a noisy spectrum, displayed in (d). These spectra also show the shaping by the formant frequencies for both sounds. Also notice that the time-frequency display of the word *rain* looks similar both times it is spoken.

Over a short period, say 25 milliseconds, a speech signal can be approximated by specifying three parameters: (1) the selection of either a periodic or random noise excitation, (2) the frequency of the periodic wave (if used), and (3) the coefficients of the digital filter used to mimic the vocal tract response. Continuous speech can then be synthesized by continually updating these three parameters about 40 times a second. This approach was responsible for one of the early commercial successes of DSP: the Speak & Spell, a widely marketed electronic learning aid for children. The sound quality of this type of speech synthesis is poor, sounding very mechanical and not quite human. However, it requires a very low data rate, typically only a few kbits/sec.

This is also the basis for the linear predictive coding (LPC) method of speech compression. Digitally recorded human speech is broken into short segments, and each is characterized according to the three parameters of the model. This typically requires about a dozen bytes per segment, or 2 to 6 kbytes/sec. The segment information is transmitted or stored as needed, and then reconstructed with the speech synthesizer.
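The analysis half of this scheme can be sketched as follows (Python/SciPy; the predictor order and Hamming window are common but arbitrary choices). Each segment's recursive-filter coefficients come from solving the autocorrelation normal equations of linear prediction; a complete LPC coder would also make the voiced/fricative decision and estimate the pitch for each segment.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(segment, order=10):
    """Fit the recursive-filter coefficients for one speech segment by
    the autocorrelation method of linear prediction."""
    seg = segment * np.hamming(len(segment))                 # taper the segment
    r = np.correlate(seg, seg, mode="full")[len(seg) - 1:]   # autocorrelation
    # Solve the Toeplitz normal equations R a = r, so that x[n] is
    # approximated by sum_k a[k] * x[n-1-k].
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
```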
Speech recognition algorithms take this a step further by trying to recognize patterns in the extracted parameters. This typically involves comparing the segment information with templates of previously stored sounds, in an attempt to identify the spoken words. The problem is, this method does not work very well. It is useful for some applications, but is far below the capabilities of human listeners.
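In its simplest form, this template comparison reduces to a nearest-neighbor search over stored parameter sets. The sketch below (Python/NumPy) uses plain Euclidean distance and assumes every word is padded to the same number of segments, which is a strong simplification; practical systems use dynamic time warping or statistical models to handle words spoken at different speeds.

```python
import numpy as np

def recognize(features, templates):
    """Return the vocabulary word whose stored template is closest to the
    parameters extracted from the spoken word. `features` and every
    template are arrays of the same shape (segments x parameters)."""
    best_word, best_dist = None, np.inf
    for word, template in templates.items():
        dist = np.linalg.norm(features - template)   # Euclidean distance
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word
```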
To understand why speech recognition is so difficult for computers, imagine someone unexpectedly speaking the following sentences:

> The child wore a *spider ring* on Halloween.
> He was an American *spy during* the war.

Even if exactly the same sounds were produced to convey the italicized words, listeners hear the correct words for the context. From your accumulated knowledge about the world, you know that children don't wear secret agents, and people don't become spooky jewelry during wartime. This usually isn't a conscious act, but an inherent part of human hearing.

Most speech recognition algorithms rely only on the sound of the individual words, and not on their context. They attempt to recognize words, but not to understand speech. This places them at a tremendous disadvantage compared to human listeners. Three annoyances are common in speech recognition systems: (1) The recognized speech must have distinct pauses between the words. This eliminates the need for the algorithm to deal with phrases that sound alike, but are composed of different words (i.e., spider ring and spy during). This is slow and awkward for people accustomed to speaking in an overlapping flow. (2) The vocabulary is often limited to only a few hundred words.
This means that the algorithm only has to search a limited set to find the best match. As the vocabulary is made larger, the recognition time and error rate both increase.
(3) The algorithm must be trained on each speaker. This requires each person using the system to speak each word to be recognized, often needing to be repeated five to ten times. This personalized database greatly increases the accuracy of the word recognition, but it is inconvenient and time consuming.

The prize for developing a successful speech recognition technology is enormous. Speech is the quickest and most efficient way for humans to communicate. Speech recognition has the potential of replacing writing, typing, keyboard entry, and the electronic control provided by switches and knobs. It just needs to work a little better to become accepted by the commercial marketplace. Progress in speech recognition will likely come from the areas of artificial intelligence and neural networks as much as through DSP itself.
Don't think of this as a technical difficulty; think of it as a technical opportunity.