Posts with «fft» label

Voice Pitch Shifting – Scrambler.

After I’ve made astonishing breakthrough in speed of the FFT algorithm based on RADIX-4 version, I decided to create a new project, which would take a full advantage of Radix-4 “rocket science” performance. Reviewing my “Voice Recognition” blog it looks logically to create just exactly opposite application – Voice Scrambler. To make it happened, I need FFT subroutine to be completed twice, in forward and reverse ( iFFT ) direction. Than simply manipulating individual frequencies (bins) position in the array, I could scramble a voice in well known old fashioned manner, inverting the spectrum of the human voice, making it sounds completely non intelligible (alien’s voice).  For Pitch Shifting, bins have to be “progressively” spaced between each other, driving timbre up on the scale. It is possible to lower a Pitch as well, shrinking and over-lap bins, but I made only up shifting part for now.

Making preliminary calculation, I get:   10.1 milliseconds x 2 / 256 samples = 78.9 microseconds / sample.  Or turning up side down,  12.67 kHz sampling frequency. It’s rough estimation, but regular “public phone quality” – 8 kHz sampling rate looks quite achievable.  Next, as always, if CPU is o’k, let check into memory management. Unfortunately, arduino Uno (2 kB) wouldn’t allow to use fft-256, because:  input buffer (256) + output (256) +  fft processing ( real + imaginary) (512), plus multiply by 2 (integer size, 1024 x 2 = 2048) would occupy all available 2 kB. So, I have to decrease fft size on one level down, to fft-128, or even two level down – make fft-64, from original fft-256 (presumably good quality, comparable with MP3 codec).  After completing a couple sketchy tests, fft-64 shows pure quality of speech and was rejected. Can’t say I did thoroughful  research on this matter, may be fft-64 could still be good in other circumstances or with musical material content instead of speech.

O’k, fft-128 – compromise version was selected by God. But my code, published in “RADIX-4″ blog hasn’t included RADIX-8 section, which is required when size of the array isn’t a power of 4. Nothing to do, the only way to bring  Scrambler project to life, is  to write a missing section of the RADIX-8 code… So I did. Timing measurements show, that Radix-4 with new patch is running even faster, than extrapolated from fft-256 down to fft-128 speed of RADIX-4 without it, measurements result: 4.2 milliseconds (compare to extrapolated 4.6 milliseconds). By the way, as I mention in other post about Split-Radix, if it would be my next adventure. It was, I have re-written from “Matters Computational” http://www.jjj.de/  Split-Radix C/C++ code in my tools-box now, which is to my surprise shows practically no difference in speed with Radix-4 algorithm, same  10.1 (fft-256) milliseconds execution time, and loosing ! competition giving 4.6 milliseconds (fft-128).

 
Other things to explain, as Pitch Shifting considered to be most complicated task even for monstrous DSP processors, I’ve skipped two important procedures: windowing on input samples and add-overlap on the output flow, just because no processor’s time left. Luckily, due speech’s spectrum concentration around most noticeable “middle” area, cutting off “windowing” is only slightly increases noise level. On the other hand, missing add-overlap procedure has greater negative impact on the voice quality,  ”robotize” it via parasitic amplitude modulation. When frequency bins re-mapped in the pull, running inverse iFFT doesn’t produce continuity  ”match” with previous block as it should, because input has no a continuity disruption between samples blocks (128 samples, ~16 milliseconds of voice frame).  Well, if I do everything right, there wouldn’t be any fun with arduino, would it be?  Btw, there are two outputs buffer in the software, especially to track this issue. I was thinking to implement some kind of distortion estimator based on this information. Meanwhile, one of them could be removed to save some memory for other part of the program.

Hardware.

Arduino has 10 bits ADC, this is why I decided to build a DAC for audio output based on PWM TIMER0 feature, 5 to 5 bits split equally between pin 5 and 6. One PWM pin could only play 7 bits sound (don’t forget sign bit), which is too low. Weighted 32R-R ladder potentially allows to increase PWM frequency up to 250 kHz or so, and simplify requirements for output filter design. Nevertheless, I left “default” 31 kHz settings for TIMER0, just in case I need more research with 16 bits output later on, All it would takes to switch on “full” 16 bits, just add one 256 k resistor and divide integer in  high and low byte.

Please, be advised, that included scrambler function was not fully tested, as I don’t have second arduino to do a decoding in real time-);  (Probably, I can record a scrambled sound and decode it in the second passage trough the system – things TO DO.)  The same time Pitch Shifting was successfully tested, as you can see in posted video clip. YouTube Video

Summary:

  • Sampling rate 8 kHz. (easily to adjust by TIMER2 variable.)
  • Output rate 8 kHz. (same as sampling, two processes are synchronous.)
  • Delay 16 milliseconds. (1 block, or two blocks 32 milliseconds.)
  • Input/Output resolution: 10 bits.

There are build-in command line interface:

  • if (incomingByte == ‘m’) { // FREE MEMORY BYTES
  • if (incomingByte == ‘x’) { // PRINT OUT INCOMING SAMPLING BUFFER
  • if (incomingByte == ‘s’) { // SWITCHES – SCRAMBLING ON / OFF
  • if (incomingByte == ‘y’) { // PRINT OUT OUTGOING  BUFFER
  • if (incomingByte == ‘f’) { // DATA AFTER FFT, FREQUENCIES BINS
  • if (incomingByte == ‘p’) { // DATA AFTER PITCH SHIFTING – SCRAMBLING
  • if ((incomingByte >= ’0′) && (incomingByte <= ’9′)) { // DIGITS  ”1″ to “9″ – REGULATE MAGNITUDE OF SHIFTING, “0″ – SPECTRUM INVERSION.

Link to download an Arduino sketch: Pitch Shifting – Scrambler 


RADIX-4 FFT (integer math).

Updates on 30 Sept. 2014:

Everything below is correct, and may worth to read. But new  code based on Split Radix Real is published. Faster, lower memory demands, both version for UNO and DUE available as libraries.  New algorithm makes it possible to run FFT_SIZE =  512 on UNO board in less than 9.6 milliseconds.

/**************************************************************************************************************************************

Tweaking the FFT code, that I’ve published earlier in my series of blogs, I hit a “stone wall”. There are nothing could be improved in the “musical note recognition” version of the code, in order to make it faster. At least, nothing w/o completely switching to assembler language, what I’m trying to avoid for now.  I’m sure, it’s the fastest C algorithm. Looking around it didn’t take long to find out that there is other option: change RADIX-2 algorithm for RADIX with higher order, 4, 8, or split-radix approach. Putting split-radix aside, (would it be my next adventure?), RADIX-4 looks promising, with theoretically 1/4 reduction in number of multiplications (what I believe is an “Achilles heel”).

Googling for awhile, I couldn’t find fixed point version in plain C or C++ language. There is TI’s “Autoscaling Radix-4 FFT for MS320C6000TM” application report, which I find useful , but the problem is it’s “bind” with TI microprocessors hardware multiplier, and any attempt to re-write code would, probably, make it’s performance even worse than RADIX-2. Having “tweaking” experience with fix_fft source code from:  http://www.jjj.de/             I decide to follow same path, as I did before, adapting fix_fft for arduino: take their floating point source, disassemble it to the pieces, and than combine all parts back as fixed point or integer math components.    And you know what ? Thanks God, I successed!!!

I decided not all parts to re-assemble back again, this is why fft_size has to be power of 4 ( 16, 64, 256, 1024 etc.). Next, the software is “adjustable” for different level of the optimization. Trade is always the same, accuracy against speed. I’d highlight 3 level at this point:

1. No optimization, all math operation 15-bits.   The slowest version. Not tested at all.

2. Compromise version.  Switches: 12-bits Sine table, regular multiplication (long) right shifted >>12, Half-Scaling in the sum_dif_I (RSL) >>1. Recorded measurements result:  24 milliseconds with N = 256 fft_size.

3. Maximum optimization. Switches: 8-bits Sine table, macro assembler multiplication short cut, no scaling in the core. Timing 10.1 millisecond!!!

Fastest. Best of the Best Ever written FFT code for 8-bit microprocessor.   Enjoy the meal:   https://docs.google.com/open?id=0Bw4tXXvyWtFVMldRT3NFMGNTZVN0Y0d4eVRsenVZdw

Here is slightly modified copy, where I moved sine table from RAM to FLASH memory using progmem utility. For someone, who was curious to find the answer: how much progmem slower compare to access data in the RAM, there is an answer. 10.16 milliseconds become 10.28, or 120 usec slower. Divide by 84 x 6 = 504 number of readings, each progmem costs 0.24 useconds. Its about 4 cycles CPU.

https://docs.google.com/open?id=0Bw4tXXvyWtFVQjZpZkw1c3VUZXlmaF9sOEJwMmpEUQ

Screenshot from the running application, signal generator running on the computer, feeding audio wave to OPA and than analog input 0. Look for hardware setup configuration on the “color organ” blog-post.

BTW, there is one more important thing, I missed to emphasize in my short introductory paragraph, code offers FLEXIBILITY over SNR ratio. Basic FFT algorithm has an intrinsic “build-in” GAIN: G(in) = FFT_SIZE / 2 . (in) stands for intrinsic. That is perfect value for fft_size = 64 ( Gain = 64 / 2 = 32) and arduino (Atmel AtMega328)  10-bit ADC ( max value = 1023 ). FFT output would be 32 x 1023 = 32736, exactly 15 bit + sign. In other words, scaling in the algorithm core doesn’t required at all! That alone improve speed and lower rounding noise error significantly. The same time G(in)  grows too high with FFT_SIZE = 256, when G = 256 / 2 = 128 and output of the FFT would overflow size of 16-bit integer math. But again, scaling don’t have to be 100%, as long as there is a way to keep it in balance with ADC data. In this particular case, with 10-bit ADC, we can keep gain just below 32, it’s not necessary to make it exactly “1″.  For 12-bit ADC upper G limit would be 8, still not “1″. To manipulate the gain, division by 2 (>> 1) in the “sum_dif_I” could be set, to prevent overflow with fft_size > 64. Right shift “gain limiter” creates a square root adjustment, according to new formula: G(rsl) = SQRT (FFT_SIZE) / 4 . (rsl) stands for right-shift-limiter.

  1.  G = 1 for fft_size = 16,
  2.  G = 2 for fft_size = 64,
  3.  G = 4 for fft_size = 256,
  4.  G = 8 for fft_size = 1024.

Summing up, for using RADIX-4 with arduino ADC and FFT_SIZE <= 64, keep division by 2 (>> 1) in the “sum_dif_I” commented out. In any other circumstances, >10 bits external ADC, >64 fft_size, uncomment it.

To be continue…..


RADIX-4 FFT (integer math).

Tweaking the FFT code, that I’ve published earlier in my series of blogs, I hit a “stone wall”. There are nothing could be improved in the “musical note recognition” version of the code, in order to make it faster. At least, nothing w/o completely switching to assembler language, what I’m trying to avoid for now.  I’m sure, it’s the fastest C algorithm. Looking around it didn’t take long to find out that there is other option: change RADIX-2 algorithm for RADIX with higher order, 4, 8, or split-radix approach. Putting split-radix aside, (would it be my next adventure?), RADIX-4 looks promising, with theoretically 1/4 reduction in number of multiplications (what I believe is an “Achilles heel”).

Googling for awhile, I couldn’t find fixed point version in plain C or C++ language. There is TI’s “Autoscaling Radix-4 FFT for MS320C6000TM” application report, which I find useful , but the problem is it’s ”bind” with TI microprocessors hardware multiplier, and any attempt to re-write code would, probably, make it’s performance even worse than RADIX-2. Having “tweaking” experience with fix_fft source code from:  http://www.jjj.de/             I decide to follow same path, as I did before, adapting fix_fft for arduino: take their floating point source, disassemble it to the pieces, and than combine all parts back as fixed point or integer math components.    And you know what ? Thanks God, I successed!!!

I decided not all parts to re-assemble back again, this is why fft_size has to be power of 4 ( 16, 64, 256, 1024 etc.). Next, the software is “adjustable” for different level of the optimization. Trade is always the same, accuracy against speed. I’d highlight 3 level at this point:

1. No optimization, all math operation 15-bits.   The slowest version. Not tested at all.

2. Compromise version.  Switches: 12-bits Sine table, regular multiplication (long) right shifted >>12, Half-Scaling in the sum_dif_I (RSL) >>1. Recorded measurements result:  24 milliseconds with N = 256 fft_size.

3. Maximum optimization. Switches: 8-bits Sine table, macro assembler multiplication short cut, no scaling in the core. Timing 10.1 millisecond!!!

Fastest. Best of the Best Ever written FFT code for 8-bit microprocessor.   Enjoy the meal:   https://docs.google.com/open?id=0Bw4tXXvyWtFVMldRT3NFMGNTZVN0Y0d4eVRsenVZdw

Here is slightly modified copy, where I moved sine table from RAM to FLASH memory using progmem utility. For someone, who was curious to find the answer: how much progmem slower compare to access data in the RAM, there is an answer. 10.16 milliseconds become 10.28, or 120 usec slower. Divide by 84 x 6 = 504 number of readings, each progmem costs 0.24 useconds. Its about 4 cycles CPU.

https://docs.google.com/open?id=0Bw4tXXvyWtFVQjZpZkw1c3VUZXlmaF9sOEJwMmpEUQ

Screenshot from the running application, signal generator running on the computer, feeding audio wave to OPA and than analog input 0. Look for hardware setup configuration on the “color organ” blog-post.

LInk to first version based on RADIX-2 FFT:     LINK

BTW, there is one more important thing, I missed to emphasize in my short introductory paragraph, code offers FLEXIBILITY over SNR ratio. Basic FFT algorithm has an intrinsic “build-in” GAIN: G(in) = FFT_SIZE / 2 . (in) stands for intrinsic. That is perfect value for fft_size = 64 ( Gain = 64 / 2 = 32) and arduino (Atmel AtMega328)  10-bit ADC ( max value = 1023 ). FFT output would be 32 x 1023 = 32736, exactly 15 bit + sign. In other words, scaling in the algorithm core doesn’t required at all! That alone improve speed and lower rounding noise error significantly. The same time G(in)  grows too high with FFT_SIZE = 256, when G = 256 / 2 = 128 and output of the FFT would overflow size of 16-bit integer math. But again, scaling don’t have to be 100%, as long as there is a way to keep it in balance with ADC data. In this particular case, with 10-bit ADC, we can keep gain just below 32, it’s not necessary to make it exactly “1″.  For 12-bit ADC upper G limit would be 8, still not “1″. To manipulate the gain, division by 2 (>> 1) in the “sum_dif_I” could be set, to prevent overflow with fft_size > 64. Right shift “gain limiter” creates a square root adjustment, according to new formula: G(rsl) = SQRT (FFT_SIZE) / 4 . (rsl) stands for right-shift-limiter.

  1.  G = 1 for fft_size = 16,
  2.  G = 2 for fft_size = 64,
  3.  G = 4 for fft_size = 256,
  4.  G = 8 for fft_size = 1024.

Summing up, for using RADIX-4 with arduino ADC and FFT_SIZE <= 64, keep division by 2 (>> 1) in the “sum_dif_I” commented out. In any other circumstances, >10 bits external ADC, >64 fft_size, uncomment it.

To be continue…..


Speech / Voice Recognition. Arduino project, next in a series FFT and Arduino.

 Finally, I’d like to present  the most sophisticated project I’ve done so far, build around the idea turning Arduino board into a DSP.  The results are really impressive for small microprocessor, with low memory size and low MIPS. IMHO, arduino provides better results, than Windows Vista VR system, with 1 GB / 2.2 GHz  hardware, for short one-two words commands, of course.
No HMM, neural networks, or other very popular and “scientifically sounding” theories, were considered to be implemented in the algorithm. Google brings up  millions links on a topic, just ask, but only few of them are designed on really scientific concept, rather than dumb data base “sharpening”. I’m not saying they are completely wrong, and I’m not an expert in the field, but they are not smart ether. My decision is simple 2D cross-correlation. Basically, the heart of the recognition algorithm is similar to an image matching program, which works the same way for voice/sound.  To create a Spectrogram image, arduino is continuously monitoring sound level via microphone, and start capturing data when VOX threshold is exceeded. After input array “X” filled up, data transfered on next level to calculate FFT. The same “conveyor belt” works between FFT and Filtering, flags raised when data is ready, and flags lowered when process finished. The only difference is a speed, conveyor belt is running faster passing data ADC-FFT, and slower at Filter-Correlation stage, as it requires 64 regular cycles to complete spectrogram image in one SuperCycle.  The most time consuming part is Edge Enhancement / HPF Filtering of the spectrogram. I’m still looking around to improve performance of this stage, as it holds all process back from to be fully “Real Time”.
 Specification:
-  4 kHz sampling rate:  2 kHz voice freq. range;
-  64 FFT subroutine,    62.5 Hz spectral resolution;
-  16 x 64 Spectrogram Image, around 1 second max voice password;
-  duration of the Cross-Correlation < 5 milliseconds;
-  duration of the FFT+SQRT+Compression < 4 milliseconds;
-  duration of the Edge Enhancement ~ 35 milliseconds;Main cycle time frame is 16 milliseconds, it’s defined by sampling rate x FFT size, 0.25 x 64 = 64 millisecond. Super-cycle 1.024 is needed only because EE prevents all processes to be completed in less than 16 milliseconds. There is a resources left, to increase sampling up to 8 or even 12 kHz, I just had no time to conduct experiments if it is beneficial.

There is a Command Line Interface, built-in the software, which control “record” and debug “print” functions, 7 commands for now:
if (incomingByte == ‘x’) {           // INPUT ADC DATA
if (incomingByte == ‘f’) {           // FFT OUTPUT
if (incomingByte == ‘s’) {           // SPECROGRAMM PRE  FILTERED
if (incomingByte == ‘g’) {           // SPECROGRAMM POST FILTERED
if (incomingByte == ‘r’) {           // RECORD SPECROGRAMM TO EEPROM
if (incomingByte == ‘p’) {           // PLAY SPECROGRAMM FROM EEPROM
if (incomingByte == ‘m’) {           // FREE MEMORY BYTES

Software is written for AtMega328p microprocessor, Arduino Uno board or similar. For others, all referenced registers has to be replaced with appropriate names for microprocessor.Compiles on 022 IDE, there are some conflicts with 1.0 IDE, that I was not feel myself right to troubleshoot yet. For better understanding some math background, have a look at my previous posts.

Link to download a sketch:   Voice_Recognition_24_01

Analog front-end is the same, as I used in my first project: Color Ogran
There is not much could be improved on this part, and I again used both inputs – from microphone to do tests with my own voice, and also from “line” input, for single tone test generated by computer during debugging. Next picture shows “s” command print-out in the serial monitor window, after I pronounce a word : “Spectrogram” . Due limited size of the window, data printed with 90 degree rotation, left-right is frequencies bands direction, and up-down is time. Lower freq. on left side (60 Hz) and higher (2 kHz) on the right.  The same time 3D images generated in right view angle.

This is how spectrogram looks like after “g” command entered in serial monitor and word sounds just right after that:

Next couple images created with single tone frequency  (320 Hz), just to show more clear “internal properties” of the filtering, again “s” and “g” commands were entered:

Well, as tone sounds continuously, it shows filtering in one direction only, and not the best tutorial on edge-enhancement theory. (“Home brew” lab limits). The same time last picture shows, that each “peek” on the original spectrogram, become surrounded by negative smaller peeks, resulting in “0″ overall sum  on 3×3 foot-print, and consequently on the whole map. In electronics it goes under HPF name, and essence of process is to remove DC component, plus attenuate  Low Frequencies.
Excelent on-line book

Short manual:
to be completed later


Arduino Musical Note Recognition – Pushing the limits.

Second project based on Arduino Uno and FFT code.    (First one : Project 1.)
Short description ( EDITED: Project stopped. I lost my inspiration and working on visual recognition now, I decided to publish a code, so someone else could find it useful and continue research. ).

Main array size is 1024 bytes, real / imaginary 512 / 512, output 256 bins (only half real part, other half is mirror).  Sampling rate 4 kHz, upper note B6 (1975.5 Hz) on yellow line – right side led, lower note C3 (130.8 Hz) on red line – left side led.  Frequency resolution is approximately 7.8 Hz per bin. Processing and sampling are running in parallel. Sampling array size is 256. After data captured (256 x 0.25 msec = 64 msec) they transfered to processing array 1024, missing 256 real samples is “zero padded”, and sampling continue w/o interruption.  I did zero padding on a purpose to get “response time” of led matrix as fast as possible, so real-time 1/16 notes could be visually distinguished the same time,  as they played.

After computational cycle is completed (~36 msec), main program executes “cognitive core function” to differentiate between notes / tones, that have to be displayed on LED matrix.

 In order to minimize error rate of this process, Masking Shadow Theory (MST) was developed. Masking shadow for each note is calculated in several steps and result is compared with notes magnitude in the cycle. If magnitude is less than shadow, than led corresponding to this note wouldn’t  lights up. There are five steps for now, but as I say, work in progress. 
Step 1 (masking shadow 0):  noise floor, which is common for all notes. 
Step 2 (masking shadow 1):  shadow from neighboring notes, that includes 8   notes on left and right side, in inverse proportion to their distances. 
Step 3 (masking shadow 2):  shadow from the note, which is located 1 octave below. Or in other words, cross check if current bins value isn’t second harmonic from sounding note 1 octave below  it. 
Step 4 (masking shadow 3):  similar to step 3, the only difference is, cross check if current bin (note) isn’t third harmonic from sounding note below  it. 
Step 5 (masking shadow 5):  similar to step 3 and 4,  cross check if current note isn’t fifth harmonic from sounding note below  it.

Masking shadow is multiplied by notes specific coefficients after steps 3 – 5, to accommodate significantly richer spectral content for lower octaves.  Formula for calculation of the coefficients, is the trickiest   part of all project. What I’ve discovered, coefficient not just varying between notes / tones, they dynamically varying during “life-time” of the tone.

25 Sept. 2011 

There are two variables have been considered: speed, octave range. Third one, “not technical” – is a price for the project. 
 Octave range is defined by  RAM memory available on chip. 2K on UNO. As maximum processing array size is must be power of 2 ( FFT Radix-2 ), array couldn’t be more than  1024 bytes, next value – 2048 is size of all RAM, that obviously couldn’t be taken. Size of array defines maximum quantity of frequency bins at the output 1024 / 4 = 256. Divided by 4 as there is real / imaginary part, and only half real part is present data. What is interesting, that musical octave is nothing else than doubling of the tones frequency, so it follows binary arithmetic rules… If I will count from high side, the upper octave would occupied half of all 256 bins, from bin 256 to bin 128. Simply because frequency / musical tone spaced logarithmically (LOG_2), and bins spaced equally. Next octave takes half what left over, from bin 128 to bin 64. Third octave 64 to 32, and fourth 32 to 16. Can I go more down ? No. There are 12 notes in each octave. It means, that after fourth octave counting down , I arrived to location where bins and tones spaced almost in sequence, bin 16 – C3, bin 17 – C3#, bin 18 – D3. Well there is four more ( 15, 14, 13, and 12 ), but it doesn’t change much, as it only 1/3 of octave and only would complicate multiplexing LED display.
This is why variable octave range = 4. Summing up, increasing octave range by 1 ( to 5 octaves ) would require double memory size (2K processing array, still possible with chip 4K), by 2 ( to 6 octaves ) – four times more memory (4K processing array).

Arduino mega board has 8K RAM, would it be better to design project with it? In first, it cost more money. In second, it has the same CPU performance, and as you will see below, to have more memory w/o faster CPU doesn’t make any sense. CPU wouldn’t be able to process bigger volume of data in time.

Now lets have a look at speed variable. Following math (and design itself), is greatly depends on it. In order to get at least 1/16 note to be visually “alive”, all cycle ( sampling , pre-processing, FFT, post-processing ) has to be completed for less than 64 msec. My impression is, when timing a little bit longer, LED display looks like it shows something, that was played last Saturday night. Invisible real-time connection between “light” and “music” become broken.
Octave range variable ( 4 octave to be specific ), especially low notes starting from C3, begging for 8 Hz resolution, as it equals to distance between C3 and C3#. And consequently, for 128 msec sampling frame duration. So, there is a contradiction. To solved this , zero padding was introduced, which help to keep sampling window down to 64 msec, the same time frequency resolution not very far from 8 Hz. To make real-time life show, sampling must continue w/o interruption. Even more, it has to be “overlapped”, as pre-processing (windowing) would cut off beginning and ending of the sampling pull. All three other functions ( FFT, pre- and post-processing ) have to go in parallel and must be completed in the same time frame 64 msec. Arduino platform has 8-bit microprocessor, with low horse power engine under hood. This is why 8-bit math was selected instead of 16-bits ( which would save me a lot of troubles ). Troubles, I’m talking about, are very low dynamic range when integer math and 8 – bit comes together in FFT. Integer math, which gives nice time performance, puts really hard constrain on dynamic range, just because it performs “scaling” before and after every “butterfly”. And every scaling procedure brings in rounding error, which grows enormously, as there are 2304 butterfly (9216 round operation) for N = 512 FFT. Special attention must be payed, to keep rounding error under control, the same time not to increase calculation time too much or integer math would not make any sense. What I find out, there is  an excellent algorithm to make “symmetrical” 1/2 bit rounding, but it almost doubles calculation FFT timing, which I obviously, could not afford. So, I choose other path, to increase dynamic range. Compression algorithm on the input data. The easiest way to do it, is “clipping”. Set couple lines in the code:
             if ( x[i] >  127 )  x[i] =  127; 
             if ( x[i] < -127 )  x[i] = -127; 

and all good. Not quite. It will do a great job for any other signal (vibration from accelerometer for example), except music…..
Clipping generates a very high level of harmonics, and ones again , it couldn’t be afford, as it just undermine basic idea of the project – MUSICAL note recognition.
 
 Summary: Scaling extends dynamic range of integer FFT on 24 dB ( 4 bits ).

8- bit FFT dynamic range is +36 dB;
scaling                               +24 dB;
noise                                 -   3 dB;     
———————————————————–
Overall                                 57 dB.   


Link to download a scketch:


Arduino_Musical_Notes_Recognition