Published in Electronics and Computing Monthly, October 1983
Mike Furminger of Nen College
The idea of talking to a computer and then getting an answer in plain English has long been a popular one in science fiction. The reality of talking to a computer, though, is not quite as easy as having a conversation with another person. Speech recognition is, however, being used in modern helicopters and commercial planes, where it enables pilots to call up desired displays on a VDU rather than looking for an instrument amongst the many required by the aircraft.
Alternative Approaches
Computers can understand certain words in one of two ways: either an operator has 'taught' the computer the word beforehand, or the computer analyses each syllable of the sound and tries to match the word against a stored library of sounds. It is possible to buy speech recognition boards which will recognise up to 100 words with 99% reliability; however, these are expensive and unsuitable for home use.
To make a small home computer recognise words is a little tricky, but reasonable results can be obtained if the computer is not expected to discriminate between similar words spoken by the same person. In order to appreciate how machines can understand the spoken word it's necessary to examine the basics of human speech.
In principle spoken words are made up of a series of basic elements. There are 'Voiced' and 'Unvoiced' sounds, i.e. sound which uses the mouth to form the noise and sound which is just blown from the lungs. An example of unvoiced sound is the word 'Oh'. The voiced sounds are very complex but can simply be identified as one of three major groups, i.e. fricative, sibilant and plosive sound. Fricative sound uses the tongue to make the sound as in 'fur', sibilant sound is the 's' sound and plosive sound is a sudden release of air by the mouth as in the word 'pop'. Although the above is not a complete analysis of spoken sound, it is adequate for an introduction.
Analysis Techniques
To analyse these different sounds so that a computer can recognise different words, it's necessary to use some technique which breaks down the complex waveform into a manageable spectrum. If we had a large mainframe computer to hand then we could try, for example, the linear predictive coding (LPC) technique used in the Texas Instruments speech chips. In principle LPC creates a series of speech frames (e.g. 40 per second) in which the frequency, amplitude, duration and sound type are converted to a binary code. This binary code is then saved in a relatively small memory. Recognition would occur if an incoming speech pattern matched a known one.
The techniques of LPC are beyond the scope of a small microcomputer, but simpler approaches to the problem can be used. A spectrum analyser, such as is used to control disco lights, would be a suitable method. The spectrum analyser breaks down a complex spoken word or musical note into its frequency elements. This is similar to the way in which a prism breaks white light into different colours, but instead a sound is broken down into elements like notes on a piano keyboard.
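The prism idea can be tried in software before any hardware is built. The following is only a minimal sketch in Python, not the author's method: it measures how much of a signal lies at each of four test frequencies by correlating it with a sine and cosine at that frequency, which is a crude stand-in for a bank of band-pass filters. The sample rate, the four centre frequencies and the names `band_energy` and `analyse` are all assumptions chosen for illustration.

```python
import math

def band_energy(samples, sample_rate, freq):
    """Energy of the signal at one frequency, found by correlating the
    samples with a cosine and a sine at that frequency."""
    re = im = 0.0
    for n, x in enumerate(samples):
        angle = 2.0 * math.pi * freq * n / sample_rate
        re += x * math.cos(angle)
        im += x * math.sin(angle)
    return (re * re + im * im) / len(samples) ** 2

def analyse(samples, sample_rate, centres):
    """Return one energy value per centre frequency."""
    return [band_energy(samples, sample_rate, f) for f in centres]

# Assumed values for the demonstration only.
SAMPLE_RATE = 8000                 # samples per second
CENTRES = [125, 250, 500, 1000]    # four hypothetical channels, in Hz

# Synthetic input: 0.1 second of a pure 500 Hz tone.
tone = [math.sin(2 * math.pi * 500 * n / SAMPLE_RATE) for n in range(800)]

energies = analyse(tone, SAMPLE_RATE, CENTRES)
loudest = CENTRES[energies.index(max(energies))]
print("Loudest channel:", loudest, "Hz")
```

A real word would spread its energy over several channels at once, and it is the pattern of those channel levels that the rest of the method works with.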
Spectrum Analysers
Many designs for spectrum analysers have been published by various magazines. These are either based on a series of op-amp filters or use a switched capacitor filter such as the MF10. Either approach will work well enough, so if a spectrum analyser is already to hand then this is another use for it.
Figure 1: Block diagram of a Spectrum Analyser. After amplification, the speech signal is divided into four frequency bands
If you have not got a spectrum analyser then building a basic model is not too difficult. With a BBC Model B computer a very simple device could be built using filters feeding each of the 4 ADC channels. Suitable filter frequencies are mentioned later. All other computers will require a multiplexed ADC to accept the signal data.
Figure 2: Each of the block diagram's filter blocks could take the form of an active filter based around an op-amp
Another approach is to use an MF10 controlled with various clock rates to reproduce many filters in one as shown in Fig. 3.
Figure 3: As an alternative to an op-amp filter, this filter is built around a switched capacitor block
Figure 4: Circuit details of the filter of the block diagram of Fig. 3
Sound Experiments
The author used a PET computer fitted with an Eventide 30 1/3 octave filter spectrum analyser, and consequently all examples are quoted for a PET computer; the method described, though, can be transferred to any other machine. The properties of speech will dictate which filters are significant.
Unvoiced sound tends to have a broad spectrum with a maximum around 500 Hz, the exact frequency depending on the physical size of the speaker. Voiced sounds can be identified by their properties: sibilant sound is a high frequency hiss around 8 kHz, plosives are low frequency sounds around 150 Hz, and fricative sound tends to be more difficult to identify but mostly its components are around 1 kHz. There is no useful information to be found below 50 Hz or above 10 kHz. If only a few channels are used then it is a good idea to keep the above points in mind.
A suitable 8 channel octave set of filters would have centre frequencies at 62, 125, 250, 500, 1000, 2000, 4000 and 8000 Hz.
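To see how these eight channels divide up the speech range, the short Python sketch below lists the approximate edges of each band and finds which channel a given frequency falls nearest to. It assumes the usual octave-band convention, in which each band runs from the centre frequency divided by the square root of 2 up to the centre multiplied by the square root of 2; the article itself does not state the filter bandwidths, and the function names are hypothetical.

```python
import math

# Centre frequencies from the text, in Hz.
CENTRES_HZ = [62, 125, 250, 500, 1000, 2000, 4000, 8000]

def octave_band_edges(centre_hz):
    """Lower and upper edges of a one-octave band (assumed convention:
    centre / sqrt(2) up to centre * sqrt(2))."""
    return centre_hz / math.sqrt(2), centre_hz * math.sqrt(2)

def channel_for(freq_hz, centres=CENTRES_HZ):
    """Index of the channel whose centre is nearest on a logarithmic scale."""
    return min(range(len(centres)),
               key=lambda i: abs(math.log2(freq_hz / centres[i])))

for centre in CENTRES_HZ:
    low, high = octave_band_edges(centre)
    print(f"{centre:5d} Hz channel: {low:7.1f} to {high:7.1f} Hz")

# Where the sounds described in the text would mostly appear.
for name, freq in [("plosive", 150), ("fricative", 1000), ("sibilant", 8000)]:
    print(name, "-> channel", channel_for(freq))
```

On this reckoning the plosive, fricative and sibilant energy lands in well separated channels, which is why even a coarse octave set can tell them apart.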
Software Outlines
The software for speech recognition on a microcomputer would first store a copy of a word, then compare any incoming word with the stored copy. The more comparisons that are required the slower will be the response.
The program elements for saving a copy of a word are shown in Fig. 5a.
An example of these procedures on the PET computer is given in Program 1, lines 330 to 570. The program fills an array with the stored data from the spectrum analyser, sorts for peaks and then sorts again into channel order with a bubble sort.
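The sorting step can be pictured with a few lines of Python. This is one possible reading of the procedure, not a transcription of the PET listing: the channel levels are assumed to be already in an array, and a bubble sort ranks the channel numbers from strongest to weakest so that the resulting order can stand as the stored 'copy' of the word. The function name and the sample levels are made up for the illustration.

```python
def channel_order(levels):
    """Return the channel numbers ordered from strongest to weakest
    level, using a bubble sort on the channel indices."""
    order = list(range(len(levels)))
    n = len(order)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            # Swap neighbours if the later channel is the stronger one.
            if levels[order[j]] < levels[order[j + 1]]:
                order[j], order[j + 1] = order[j + 1], order[j]
                swapped = True
        if not swapped:
            break  # already in order, so stop early
    return order

# Hypothetical levels read from an 8 channel analyser (channel 0 = 62 Hz).
levels = [3, 12, 40, 25, 9, 4, 1, 0]
print(channel_order(levels))
```

A bubble sort is slow on long lists, but with only eight channels there is little to gain from anything cleverer.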
Figure 5a (above) shows the program elements for saving a copy of a word. Fig 5b (right) shows the flow diagram of a Spectrum Analyser based on an MF10 Filter
Program 1: This illustrates the way in which words can be saved as in the outline of Fig 5a.
It is a good idea to loop round the recording program 3 times and take the best match. This ensures any odd noises are removed, as in lines 575-630.
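How 'best match' is decided is left open in the text, so the Python sketch below simply assumes one workable rule: each take is stored as a list of channel numbers in order of strength, and the take that agrees in the most positions with the other two is kept. The names `agreement` and `most_consistent`, and the three sample takes, are hypothetical.

```python
def agreement(a, b):
    """Number of positions at which two channel orders hold the same channel."""
    return sum(1 for x, y in zip(a, b) if x == y)

def most_consistent(takes):
    """Return the take that agrees best with all the other takes."""
    def score(i):
        return sum(agreement(takes[i], takes[j])
                   for j in range(len(takes)) if j != i)
    best = max(range(len(takes)), key=score)
    return takes[best]

# Three hypothetical recordings of the same word.
takes = [
    [2, 3, 1, 4, 5, 0, 6, 7],
    [2, 3, 1, 4, 0, 5, 6, 7],
    [6, 2, 3, 1, 4, 5, 0, 7],   # spoiled by a noise in channel 6
]
print(most_consistent(takes))
```

The spoiled third take disagrees with the other two almost everywhere, so it is never the one chosen for storage.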
Program 2: This illustrates the way in which a computer could perform a simple 'sex test' based on the fact that males have peak vocal frequencies of less than 500 Hz while females have higher dominant frequencies
A simple 'sex test' could now be performed since the maximum energy channel would indicate the frequency which is significant to the speaker. Males have peak frequencies of less than 500 Hz, females and children have a higher dominant frequency. This is not foolproof but good fun.
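The test amounts to finding the strongest channel and comparing its centre frequency with 500 Hz. The Python sketch below shows the idea under stated assumptions: the eight octave centre frequencies suggested earlier are used, a peak of exactly 500 Hz is treated as 'higher' (the text does not say which way it should fall), and the function names and sample levels are invented for the example.

```python
# Centre frequencies of the 8 channel octave set, in Hz.
CENTRES_HZ = [62, 125, 250, 500, 1000, 2000, 4000, 8000]

def peak_frequency(levels, centres=CENTRES_HZ):
    """Centre frequency of the channel with the greatest level."""
    peak = max(range(len(levels)), key=lambda i: levels[i])
    return centres[peak]

def guess_speaker(levels, threshold_hz=500):
    """Guess from the peak channel; below the threshold counts as male."""
    if peak_frequency(levels) < threshold_hz:
        return "male"
    return "female or child"

# Two hypothetical sets of channel levels.
print(guess_speaker([3, 12, 40, 25, 9, 4, 1, 0]))   # peak in the 250 Hz channel
print(guess_speaker([1, 2, 5, 20, 45, 10, 3, 0]))   # peak in the 1000 Hz channel
```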
Speech recognition with this system is possible if the order of channels is saved in a file. Then, when an incoming word is to be recognised, the stored files are compared with the channel order of the incoming word. The test can either accept a 50% match as good enough or look for the best match. Neither system is particularly good in the simple form. The problem is that BASIC is slow, and if several matches are made for one word (about 16 per second would be a good idea) then the sort and compare becomes too slow. A job for machine code!
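Both matching rules can be combined in a few lines. The Python sketch below is an illustration only: it assumes each stored word is a list of channel numbers in order of strength, scores an incoming word by the fraction of positions that agree, picks the best scoring word and then rejects it if it falls short of the 50% mark. The vocabulary, the words in it and the function names are all hypothetical.

```python
def match_fraction(stored, incoming):
    """Fraction of positions at which the two channel orders agree."""
    hits = sum(1 for s, i in zip(stored, incoming) if s == i)
    return hits / len(stored)

def recognise(vocabulary, incoming, threshold=0.5):
    """Return the best matching word, or None if even the best
    match falls below the threshold."""
    best_word, best_score = None, 0.0
    for word, stored in vocabulary.items():
        score = match_fraction(stored, incoming)
        if score > best_score:
            best_word, best_score = word, score
    return best_word if best_score >= threshold else None

# Hypothetical stored words, each as a channel order.
vocabulary = {
    "stop": [7, 2, 3, 1, 4, 5, 0, 6],
    "go":   [2, 3, 1, 4, 5, 0, 6, 7],
}

incoming = [2, 3, 1, 4, 0, 5, 6, 7]
print(recognise(vocabulary, incoming))                   # close to "go"
print(recognise(vocabulary, [0, 1, 2, 3, 4, 5, 6, 7]))   # matches nothing well
```

Every extra word in the vocabulary adds another comparison, which is exactly where the speed problem comes from.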
If only simple words or noises are compared then a single sample will work well. To demonstrate this, try out a musical instrument and see if different notes are recognised. A game can be written where words or music drive a car round a track, or the words are used to control a simple robot. This system only allows prerecorded words to be compared, whereas some professional systems attempt to recognise words as they are spoken. The principles of this system are simple. The challenge is for you to improve the system.