Give an Ear to Your Computer, Byte June 1978 © BYTE Publications Inc.

apr 2000


fig 17
fig 1
Figure 1: A time domain voice waveform and its energy. The top trace is the time domain signal for the word "three." The bottom trace is the energy in the above signal computed every 70 ms. Note that the signal before and after the word (arrows mark the beginning and end of the word) is not zero. This is due to background noise picked up by the microphone, in this case computer cooling fans and air conditioning noise.
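The energy trace and the word boundaries in figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the caption does not say how the energy is computed, so the sum of squared samples per frame is assumed here, and the names `frame_energy`, `find_word`, `frame_len` and `threshold` are hypothetical. The threshold has to sit above the background noise level, since the signal is not zero outside the word.

```python
def frame_energy(samples, frame_len):
    """Sum of squared amplitudes for each consecutive, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies


def find_word(energies, threshold):
    """Return (first, last) indices of frames above the threshold, or None.

    The threshold is an assumed, hand-picked value that must exceed the
    energy of the background noise (fans, air conditioning).
    """
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1]


# Toy signal: low-level noise, a louder "word", then noise again.
signal = [0.1] * 4 + [1.0] * 8 + [0.1] * 4
energies = frame_energy(signal, frame_len=4)
print(find_word(energies, threshold=1.0))
```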
fig 2
Figure 2: Sampling in the time domain. Waveform A is a time domain signal. If we sample it at equally spaced intervals we will retain the amplitude of the waveform at the dotted points and the signal will be zero between these points, as shown in waveform B. Although A and B look very different, if the sampling is done with a frequency higher than the Nyquist frequency (see text), both signals will contain exactly the same information.
fig 3
Figure 3: Rectifying the microphone signal at A, we obtain signal B, which contains a slowly varying DC component proportional to the volume of the signal and various high frequency components due to the formants. The low pass filter separates these, and the analog to digital converter (ADC) sees only the volume signal.
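A software analogue of the rectifier and low pass filter in figure 3 might look like the sketch below. It is an assumption-laden stand-in, not the circuit itself: a one-pole digital smoother replaces the analog filter, the 20 Hz default cutoff is borrowed from the figure 14 caption, and the names `envelope`, `sample_rate` and `cutoff_hz` are hypothetical.

```python
import math


def envelope(samples, sample_rate, cutoff_hz=20.0):
    """Full-wave rectify the signal, then smooth it with a one-pole low pass.

    The smoothing coefficient is derived from the assumed cutoff frequency.
    The output follows the slowly varying volume of the input.
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    y = 0.0
    out = []
    for x in samples:
        rectified = abs(x)           # rectifier: signal B in figure 3
        y += alpha * (rectified - y)  # low pass filter keeps the slow part
        out.append(y)
    return out


# A steady 200 Hz tone settles to a nearly constant envelope value.
fs = 8000
tone = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(fs)]
print(round(envelope(tone, fs)[-1], 2))
```

For a steady sine wave the smoothed value settles near the mean of the rectified wave (2/pi times the peak amplitude), with a small residual ripple that a sharper filter would reduce further.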
fig 4
Figure 4: Amplitude envelopes of the words "one," "three," and "zero." Note that "one" has one hump while "three" has two. The smaller one corresponds to "th" and the larger one to "ee." In "zero," "z" is the low amplitude area in the beginning. The dip corresponds to "r" and the humps to "e" and "o." Vowels always have high amplitude because they are produced with the mouth open and with strong excitation.
fig 5
Figure 5: The frequency (logarithmic magnitude) spectrum of a 25 ms segment of speech. It covers the frequencies from 0 to 3.2 kHz. The major peaks correspond to the formant frequencies while the smaller, regularly spaced peaks are harmonics of the glottal frequency. In this case it is obvious that the sound is voiced, both from the harmonics of the glottal frequency, which are highly visible, and from the fact that the lower frequencies have more energy than the higher ones.
fig 6
Figure 6: A typical Sonagram of an utterance. In a Sonagram the dark areas represent high intensity. This example represents the word "machine." The vertical (y) axis is frequency and the horizontal axis is time. The formants are seen as the dark bands that change with time. The large dark area about 1/5 of the way into the pattern corresponds to the sound of the "ch" in "machine."
fig 7
Figure 7: A filter bank feature extractor consists of a number of bandpass filters (BPF1 to BPF(N)) covering the range from about 100 Hz to 10 kHz. The output of each filter is rectified and low pass filtered (LPF). The outputs are then multiplexed and digitized by the analog to digital converter for input into the computer.
fig 8
Figure 8: Using a comparator after the low pass filter in the filter bank simplifies hardware and reduces data rate at the cost of discarding useful information. The thresholds should be adjusted for the voice of each individual speaker. In addition, a good automatic gain control circuit should precede the filter bank to normalize the time domain signal.
fig 9
Figure 9: The differential comparator feature extractor detects which one of two adjacent bandpass filters has the higher energy. The output of each pair of filters is summed together and compared with the sum of the next pair of filters, yielding another, coarser comparison from the frequency viewpoint. In this example, the output of eight filters (we assume that the output of the blocks labeled BPF is the rectified and smoothed output of the bandpass filters) is encoded into a 7 bit digital word. The computer performs pattern recognition on time varying sequences of these 7 bit words.
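The encoding in figure 9 can be mimicked in software as a small comparison tree. The sketch below rests on assumptions: the caption does not spell out the full wiring, so it is assumed here that four adjacent-pair comparisons, two comparisons of pair sums and one final comparison make up the 7 bits (4 + 2 + 1). The bit order, the bit polarity and the name `differential_code` are likewise hypothetical.

```python
def differential_code(energies):
    """Encode band energies as a list of comparison bits.

    The input length is assumed to be a power of two (eight in figure 9).
    At each level, adjacent values are compared (1 if the higher-frequency
    one is larger, else 0) and then summed to form the next, coarser level.
    """
    bits = []
    level = list(energies)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            low, high = level[i], level[i + 1]
            bits.append(1 if high > low else 0)  # comparator output
            next_level.append(low + high)        # summed pair feeds next level
        level = next_level
    return bits


# Eight smoothed filter outputs, lowest frequency band first.
word = differential_code([1, 5, 2, 1, 0, 0, 3, 4])
print(word, len(word))
```

Because only comparisons between bands are kept, the resulting word is largely insensitive to the overall loudness of the speaker, which is presumably part of the appeal of this scheme.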
fig 10
Figure 10: The frequency response of the two filters is designed to separate F1 and F2, when used in a zero crossing based formant detector. The filter characteristics overlap in the region of formant overlap but their slopes are designed to separate the two formants.
fig 11
Figure 11: A zero crossing formant extractor. The zero crossing method of formant extraction uses two special bandpass filters to separate the formants. The output of each filter is passed through a zero crossing detector (a comparator whose threshold is set to zero) that puts out a logical "1" or a "0" depending on whether its input is positive or negative. The output is fed into a counter for each formant and the number of "0" to "1" transitions is counted for 20 ms. Then the counters are read by the computer and reset to start the next 20 ms counting period. An envelope detector (rectifier followed by a low pass filter) feeds a comparator whose threshold is set to detect word beginnings and endings.
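The counting scheme of figure 11 can be imitated on sampled data. The sketch below covers only the comparator and counter for one formant channel and assumes the bandpass filtering has already been done. The names `rising_crossings_per_window` and `count_to_hz` are hypothetical, and dividing the count by the window length to get a frequency estimate is an added illustration rather than something stated in the caption.

```python
import math


def rising_crossings_per_window(samples, sample_rate, window_s=0.020):
    """Count "0" to "1" comparator transitions in consecutive windows.

    The comparator outputs 1 for a positive sample and 0 otherwise. The
    counter is read and reset at the end of every window (20 ms by default).
    """
    window_len = int(round(sample_rate * window_s))
    counts = []
    count = 0
    prev_bit = 1  # so the very first sample cannot register as a transition
    for n, x in enumerate(samples):
        bit = 1 if x > 0 else 0
        if prev_bit == 0 and bit == 1:
            count += 1
        prev_bit = bit
        if (n + 1) % window_len == 0:
            counts.append(count)  # the computer reads the counter...
            count = 0             # ...and resets it for the next window
    return counts


def count_to_hz(count, window_s=0.020):
    """One rising crossing per cycle, so cycles per second = count / window."""
    return count / window_s


# A pure 500 Hz tone sampled at 8 kHz for 0.1 s (five 20 ms windows).
fs = 8000
tone = [math.sin(2.0 * math.pi * 500.0 * n / fs) for n in range(800)]
print(rising_crossings_per_window(tone, fs))
```

With a 20 ms window the resolution of such an estimate is 50 Hz per count, which gives a feel for how coarse the formant values delivered to the computer are.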
fig 12
Figure 12 is a block diagram of a complete filter bank feature extractor. Let us pay close attention to the specifics of its elements.
fig 13
fig 14
Figure 14: A 2 pole 20 Hz low pass filter (LPF) suitable for smoothing rectified speech audio in figure 12.
fig 15
Figure 15: Linear time normalization compresses the stream of input samples to a fixed number of samples (in this case six) by selecting a sample at regular intervals. The resulting fixed length for all words facilitates feature matching for pattern recognition. Various techniques (such as duplicating samples) are used to make the number of inputs fit the selection rule.
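One possible selection rule for the linear time normalization of figure 15 is sketched below. The index formula is an assumption (the caption does not give the exact rule), and `time_normalize` and `target_len` are hypothetical names. With this rule, words shorter than the target length are stretched by duplicating samples, which is one of the techniques the caption mentions.

```python
def time_normalize(samples, target_len=6):
    """Pick target_len samples at regular intervals from the input.

    Output sample i is taken from input position floor(i * n / target_len).
    When the input is shorter than target_len, some samples repeat.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("cannot normalize an empty word")
    return [samples[(i * n) // target_len] for i in range(target_len)]


# A 12-sample word is compressed; a 3-sample word is stretched.
print(time_normalize(list(range(12))))
print(time_normalize([7, 8, 9]))
```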
fig 16
Figure 16: Shifting the unknown left and then right one position helps prevent mismatches due to missing ends or beginnings. In this case the right shift gave a good match. This test has been simplified to one parameter (S) which might be the output of one filter in a filter bank.
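The shifted comparison of figure 16 can be sketched for a single parameter S as follows. The distance measure (mean absolute difference over the overlapping positions) is an assumption, since the caption does not name one, and `shifted_distance` and `best_shift` are hypothetical names.

```python
def shifted_distance(template, unknown, shift):
    """Mean absolute difference between template and a shifted unknown.

    A positive shift moves the unknown to the right, a negative one to the
    left. Only positions where both sequences have a value are compared,
    so the sequences must still overlap after shifting.
    """
    total = 0
    count = 0
    for i, t in enumerate(template):
        j = i - shift
        if 0 <= j < len(unknown):
            total += abs(t - unknown[j])
            count += 1
    return total / count


def best_shift(template, unknown, shifts=(-1, 0, 1)):
    """Try each shift and return the one with the smallest distance."""
    return min(shifts, key=lambda s: shifted_distance(template, unknown, s))


# The unknown has lost its first sample, so a right shift realigns it.
template = [0, 2, 5, 7, 4, 1]
unknown = [2, 5, 7, 4, 1, 0]
print(best_shift(template, unknown))
```

A full recognizer would repeat this over every parameter (for example every filter output) and every stored template, keeping the template with the lowest total distance.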
