Speech / Voice Recognition. An Arduino project, next in the series on FFT and Arduino.
No HMMs, neural networks, or other very popular and “scientific-sounding” theories were considered for this algorithm. Google brings up millions of links on the topic, just ask, but only a few of them are built on a really scientific concept rather than dumb database “sharpening”. I’m not saying they are completely wrong, and I’m not an expert in the field, but they are not smart either. My choice is simple 2D cross-correlation. Basically, the heart of the recognition algorithm is similar to an image-matching program, which works the same way for voice/sound.
To create a spectrogram image, the Arduino continuously monitors the sound level via a microphone and starts capturing data when the VOX threshold is exceeded. Once the input array “X” is filled, the data is passed to the next stage, which calculates the FFT. The same “conveyor belt” works between the FFT and filtering stages: a flag is raised when data is ready and lowered when processing is finished. The only difference is speed. The belt runs faster passing data from the ADC to the FFT, and slower at the filter-correlation stage, which needs 64 regular cycles to complete a spectrogram image in one SuperCycle. The most time-consuming part is the edge enhancement / HPF filtering of the spectrogram. I’m still looking for ways to improve the performance of this stage, as it holds the whole process back from being fully “real time”.
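To make the matching idea concrete, here is a minimal sketch of a 2D correlation score between a captured spectrogram and a stored template. It is an illustration under stated assumptions, not the code from the sketch: the `Spectrogram` struct, the function names `correlateAtShift` and `bestMatch`, the signed 8-bit cell type, the normalization, and the search over a few time shifts are all hypothetical. Only the 16 x 64 image size comes from the article.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Image size from the article: 16 frequency bands x 64 time frames.
constexpr int BANDS  = 16;
constexpr int FRAMES = 64;

// Hypothetical container for one (filtered) spectrogram image.
struct Spectrogram {
    int8_t cell[FRAMES][BANDS];
};

// Normalized correlation of `a` against `b`, with `b` shifted by `shift`
// frames along the time axis. Only the overlapping frames are compared.
// Returns a value in [-1, 1], or 0 if either overlap region is all zeros.
float correlateAtShift(const Spectrogram& a, const Spectrogram& b, int shift) {
    long sumAB = 0, sumAA = 0, sumBB = 0;
    for (int t = 0; t < FRAMES; ++t) {
        const int tb = t + shift;
        if (tb < 0 || tb >= FRAMES) continue;   // outside the overlap
        for (int f = 0; f < BANDS; ++f) {
            const int x = a.cell[t][f];
            const int y = b.cell[tb][f];
            sumAB += x * y;
            sumAA += x * x;
            sumBB += y * y;
        }
    }
    if (sumAA == 0 || sumBB == 0) return 0.0f;
    return static_cast<float>(sumAB) /
           std::sqrt(static_cast<float>(sumAA) * static_cast<float>(sumBB));
}

// Best score over time shifts in [-maxShift, +maxShift] (an assumption:
// this tolerates a word that starts a few frames early or late).
float bestMatch(const Spectrogram& a, const Spectrogram& b, int maxShift) {
    assert(maxShift >= 0 && maxShift < FRAMES);
    float best = -1.0f;
    for (int s = -maxShift; s <= maxShift; ++s) {
        const float c = correlateAtShift(a, b, s);
        if (c > best) best = c;
    }
    return best;
}
```

A recognizer built this way would compare `bestMatch(captured, stored, maxShift)` against an acceptance threshold; the threshold value would have to be found by experiment.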

- 4 kHz sampling rate: 2 kHz voice freq. range;
- 64-point FFT subroutine, 62.5 Hz spectral resolution;
- 16 x 64 Spectrogram Image, around 1 second max voice password;
- duration of the Cross-Correlation < 5 milliseconds;
- duration of the FFT+SQRT+Compression < 4 milliseconds;
- duration of the Edge Enhancement ~ 35 milliseconds.

The main cycle time frame is 16 milliseconds; it is defined by the sampling period times the FFT size: 0.25 ms x 64 = 16 ms. The super-cycle of 1.024 seconds (64 main cycles) is needed only because edge enhancement prevents all processing from completing in less than 16 milliseconds. There are resources left to raise the sampling rate to 8 or even 12 kHz; I just had no time to run experiments to see whether it is beneficial.
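The timing figures above follow directly from the sampling rate and FFT size. The short sketch below works through that arithmetic; the constant and function names are made up for illustration, while the numeric values are the ones quoted in the list.

```cpp
#include <cstdint>

// Values quoted in the article.
constexpr uint32_t SAMPLE_RATE_HZ   = 4000;  // 4 kHz sampling
constexpr uint32_t FFT_SIZE         = 64;    // samples per FFT frame
constexpr uint32_t CYCLES_PER_IMAGE = 64;    // frames per spectrogram image

// Width of one FFT bin in Hz.
constexpr double binWidthHz(uint32_t sampleRateHz, uint32_t fftSize) {
    return static_cast<double>(sampleRateHz) / fftSize;
}

// Time needed to collect one FFT frame, in milliseconds.
constexpr double cycleMs(uint32_t sampleRateHz, uint32_t fftSize) {
    return 1000.0 * fftSize / sampleRateHz;
}

// Time needed to collect a whole spectrogram image, in milliseconds.
constexpr double superCycleMs(uint32_t sampleRateHz, uint32_t fftSize,
                              uint32_t cycles) {
    return cycleMs(sampleRateHz, fftSize) * cycles;
}

static_assert(binWidthHz(SAMPLE_RATE_HZ, FFT_SIZE) == 62.5,
              "62.5 Hz spectral resolution");
static_assert(cycleMs(SAMPLE_RATE_HZ, FFT_SIZE) == 16.0,
              "0.25 ms x 64 = 16 ms main cycle");
static_assert(superCycleMs(SAMPLE_RATE_HZ, FFT_SIZE, CYCLES_PER_IMAGE) == 1024.0,
              "64 x 16 ms = 1.024 s super-cycle");
```

With the quoted durations, FFT (< 4 ms) plus cross-correlation (< 5 ms) fits inside one 16 ms cycle, while edge enhancement (~ 35 ms) does not. The same formula also shows that at 8 kHz the main cycle would shrink to 8 ms.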
A command line interface is built into the software; it controls the “record” and debug “print” functions. There are 7 commands for now:
if (incomingByte == 'x') { // INPUT ADC DATA
if (incomingByte == 'f') { // FFT OUTPUT
if (incomingByte == 's') { // SPECTROGRAM PRE FILTERED
if (incomingByte == 'g') { // SPECTROGRAM POST FILTERED
if (incomingByte == 'r') { // RECORD SPECTROGRAM TO EEPROM
if (incomingByte == 'p') { // PLAY SPECTROGRAM FROM EEPROM
if (incomingByte == 'm') { // FREE MEMORY BYTES
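The seven single-letter commands can also be pictured as one dispatch function. The sketch below is a hypothetical restructuring, not the code from the sketch: the `Command` enum and `parseCommand` are invented names, and only the letters and their meanings come from the list above. On the board, the byte would typically come from `Serial.read()` once `Serial.available()` reports pending input.

```cpp
// Hypothetical names for the seven actions listed in the article.
enum class Command {
    PrintAdcInput,         // 'x'
    PrintFftOutput,        // 'f'
    PrintSpectrogramPre,   // 's'
    PrintSpectrogramPost,  // 'g'
    RecordToEeprom,        // 'r'
    PlayFromEeprom,        // 'p'
    PrintFreeMemory,       // 'm'
    Unknown                // anything else is ignored
};

// Map one received byte to a command.
Command parseCommand(char incomingByte) {
    switch (incomingByte) {
        case 'x': return Command::PrintAdcInput;
        case 'f': return Command::PrintFftOutput;
        case 's': return Command::PrintSpectrogramPre;
        case 'g': return Command::PrintSpectrogramPost;
        case 'r': return Command::RecordToEeprom;
        case 'p': return Command::PlayFromEeprom;
        case 'm': return Command::PrintFreeMemory;
        default:  return Command::Unknown;
    }
}
```

Keeping the parsing separate from the actions makes it possible to test the command table on a PC without the hardware attached.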
The software is written for the ATmega328P microprocessor (Arduino Uno board or similar). For other chips, all referenced registers have to be replaced with the appropriate names for that microprocessor. It compiles on the 022 IDE; there are some conflicts with the 1.0 IDE that I have not yet felt ready to troubleshoot. For a better understanding of the math background, have a look at my previous posts.
Link to download a sketch: Voice_Recognition_24_01
The analog front end is the same one I used in my first project: Color Organ
There is not much that could be improved in this part, and I again used both inputs: the microphone, for tests with my own voice, and the “line” input, for single-tone tests generated by a computer during debugging. The next picture shows the “s” command print-out in the serial monitor window after I pronounced the word “Spectrogram”. Due to the limited size of the window, the data is printed rotated by 90 degrees: left-right is the frequency-band direction, and up-down is time. Lower frequencies (60 Hz) are on the left side and higher ones (2 kHz) on the right. At the same time, the 3D images are rendered at the correct viewing angle.
This is how the spectrogram looks after the “g” command is entered in the serial monitor and the word is spoken right after that:
The next couple of images were created with a single-tone frequency (320 Hz), just to show the “internal properties” of the filtering more clearly; again, the “s” and “g” commands were entered:
Well, since the tone sounds continuously, it shows filtering in one direction only, so it is not the best tutorial on edge-enhancement theory (“home brew” lab limits). At the same time, the last picture shows that each “peak” in the original spectrogram becomes surrounded by smaller negative peaks, resulting in a “0” overall sum over the 3×3 footprint, and consequently over the whole map. In electronics this goes under the name HPF (high-pass filter), and the essence of the process is to remove the DC component and attenuate low frequencies.
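The zero-sum property is easy to check with a small example. The sketch below applies a common 3×3 high-pass kernel (centre 8, neighbours -1) to a tiny map. The kernel weights, the 5 x 5 map size, and the names `Map`, `edgeEnhance` and `mapSum` are assumptions chosen for illustration; the actual coefficients in the sketch may differ, but any kernel whose weights sum to zero behaves the same way.

```cpp
#include <cstdint>

// Small map for illustration only; the article's image is 16 x 64.
constexpr int ROWS = 5;
constexpr int COLS = 5;

struct Map {
    int16_t cell[ROWS][COLS];
};

// Assumed high-pass kernel: the nine weights add up to zero.
constexpr int8_t KERNEL[3][3] = {
    {-1, -1, -1},
    {-1,  8, -1},
    {-1, -1, -1},
};

// Apply the 3x3 kernel to every cell; cells outside the map count as zero.
Map edgeEnhance(const Map& in) {
    Map out{};
    for (int r = 0; r < ROWS; ++r) {
        for (int c = 0; c < COLS; ++c) {
            int acc = 0;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const int rr = r + dr;
                    const int cc = c + dc;
                    if (rr < 0 || rr >= ROWS || cc < 0 || cc >= COLS) continue;
                    acc += KERNEL[dr + 1][dc + 1] * in.cell[rr][cc];
                }
            }
            out.cell[r][c] = static_cast<int16_t>(acc);
        }
    }
    return out;
}

// Sum of all cells in a map.
long mapSum(const Map& m) {
    long sum = 0;
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            sum += m.cell[r][c];
    return sum;
}
```

A single peak of 10 in the middle of an otherwise empty map comes out as 80 at the centre with eight neighbours of -10, so `mapSum` of the result is 0. A constant (DC) input gives 0 at every interior cell; only the border cells are non-zero, because of the zero padding.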
Excellent online book
Short manual:
to be completed later

[original story: coolarduino]