RADIX-4 FFT (integer math).

Tweaking the FFT code, that I’ve published earlier in my series of blogs, I hit a “stone wall”. There are nothing could be improved in the “musical note recognition” version of the code, in order to make it faster. At least, nothing w/o completely switching to assembler language, what I’m trying to avoid for now.  I’m sure, it’s the fastest C algorithm. Looking around it didn’t take long to find out that there is other option: change RADIX-2 algorithm for RADIX with higher order, 4, 8, or split-radix approach. Putting split-radix aside, (would it be my next adventure?), RADIX-4 looks promising, with theoretically 1/4 reduction in number of multiplications (what I believe is an “Achilles heel”).

Googling for awhile, I couldn’t find fixed point version in plain C or C++ language. There is TI’s “Autoscaling Radix-4 FFT for MS320C6000TM” application report, which I find useful , but the problem is it’s ”bind” with TI microprocessors hardware multiplier, and any attempt to re-write code would, probably, make it’s performance even worse than RADIX-2. Having “tweaking” experience with fix_fft source code from:  http://www.jjj.de/             I decide to follow same path, as I did before, adapting fix_fft for arduino: take their floating point source, disassemble it to the pieces, and than combine all parts back as fixed point or integer math components.    And you know what ? Thanks God, I successed!!!

I decided not all parts to re-assemble back again, this is why fft_size has to be power of 4 ( 16, 64, 256, 1024 etc.). Next, the software is “adjustable” for different level of the optimization. Trade is always the same, accuracy against speed. I’d highlight 3 level at this point:

1. No optimization, all math operation 15-bits.   The slowest version. Not tested at all.

2. Compromise version.  Switches: 12-bits Sine table, regular multiplication (long) right shifted >>12, Half-Scaling in the sum_dif_I (RSL) >>1. Recorded measurements result:  24 milliseconds with N = 256 fft_size.

3. Maximum optimization. Switches: 8-bits Sine table, macro assembler multiplication short cut, no scaling in the core. Timing 10.1 millisecond!!!

Fastest. Best of the Best Ever written FFT code for 8-bit microprocessor.   Enjoy the meal:   https://docs.google.com/open?id=0Bw4tXXvyWtFVMldRT3NFMGNTZVN0Y0d4eVRsenVZdw

Here is slightly modified copy, where I moved sine table from RAM to FLASH memory using progmem utility. For someone, who was curious to find the answer: how much progmem slower compare to access data in the RAM, there is an answer. 10.16 milliseconds become 10.28, or 120 usec slower. Divide by 84 x 6 = 504 number of readings, each progmem costs 0.24 useconds. Its about 4 cycles CPU.

https://docs.google.com/open?id=0Bw4tXXvyWtFVQjZpZkw1c3VUZXlmaF9sOEJwMmpEUQ

Screenshot from the running application, signal generator running on the computer, feeding audio wave to OPA and than analog input 0. Look for hardware setup configuration on the “color organ” blog-post.

LInk to first version based on RADIX-2 FFT:     LINK

BTW, there is one more important thing, I missed to emphasize in my short introductory paragraph, code offers FLEXIBILITY over SNR ratio. Basic FFT algorithm has an intrinsic “build-in” GAIN: G(in) = FFT_SIZE / 2 . (in) stands for intrinsic. That is perfect value for fft_size = 64 ( Gain = 64 / 2 = 32) and arduino (Atmel AtMega328)  10-bit ADC ( max value = 1023 ). FFT output would be 32 x 1023 = 32736, exactly 15 bit + sign. In other words, scaling in the algorithm core doesn’t required at all! That alone improve speed and lower rounding noise error significantly. The same time G(in)  grows too high with FFT_SIZE = 256, when G = 256 / 2 = 128 and output of the FFT would overflow size of 16-bit integer math. But again, scaling don’t have to be 100%, as long as there is a way to keep it in balance with ADC data. In this particular case, with 10-bit ADC, we can keep gain just below 32, it’s not necessary to make it exactly “1″.  For 12-bit ADC upper G limit would be 8, still not “1″. To manipulate the gain, division by 2 (>> 1) in the “sum_dif_I” could be set, to prevent overflow with fft_size > 64. Right shift “gain limiter” creates a square root adjustment, according to new formula: G(rsl) = SQRT (FFT_SIZE) / 4 . (rsl) stands for right-shift-limiter.

  1.  G = 1 for fft_size = 16,
  2.  G = 2 for fft_size = 64,
  3.  G = 4 for fft_size = 256,
  4.  G = 8 for fft_size = 1024.

Summing up, for using RADIX-4 with arduino ADC and FFT_SIZE <= 64, keep division by 2 (>> 1) in the “sum_dif_I” commented out. In any other circumstances, >10 bits external ADC, >64 fft_size, uncomment it.

To be continue…..


[original story: coolarduino]

RADIX-4 FFT (integer math).

Updates on 30 Sept. 2014:

Everything below is correct, and may worth to read. But new  code based on Split Radix Real is published. Faster, lower memory demands, both version for UNO and DUE available as libraries.  New algorithm makes it possible to run FFT_SIZE =  512 on UNO board in less than 9.6 milliseconds.

/**************************************************************************************************************************************

Tweaking the FFT code, that I’ve published earlier in my series of blogs, I hit a “stone wall”. There are nothing could be improved in the “musical note recognition” version of the code, in order to make it faster. At least, nothing w/o completely switching to assembler language, what I’m trying to avoid for now.  I’m sure, it’s the fastest C algorithm. Looking around it didn’t take long to find out that there is other option: change RADIX-2 algorithm for RADIX with higher order, 4, 8, or split-radix approach. Putting split-radix aside, (would it be my next adventure?), RADIX-4 looks promising, with theoretically 1/4 reduction in number of multiplications (what I believe is an “Achilles heel”).

Googling for awhile, I couldn’t find fixed point version in plain C or C++ language. There is TI’s “Autoscaling Radix-4 FFT for MS320C6000TM” application report, which I find useful , but the problem is it’s “bind” with TI microprocessors hardware multiplier, and any attempt to re-write code would, probably, make it’s performance even worse than RADIX-2. Having “tweaking” experience with fix_fft source code from:  http://www.jjj.de/             I decide to follow same path, as I did before, adapting fix_fft for arduino: take their floating point source, disassemble it to the pieces, and than combine all parts back as fixed point or integer math components.    And you know what ? Thanks God, I successed!!!

I decided not all parts to re-assemble back again, this is why fft_size has to be power of 4 ( 16, 64, 256, 1024 etc.). Next, the software is “adjustable” for different level of the optimization. Trade is always the same, accuracy against speed. I’d highlight 3 level at this point:

1. No optimization, all math operation 15-bits.   The slowest version. Not tested at all.

2. Compromise version.  Switches: 12-bits Sine table, regular multiplication (long) right shifted >>12, Half-Scaling in the sum_dif_I (RSL) >>1. Recorded measurements result:  24 milliseconds with N = 256 fft_size.

3. Maximum optimization. Switches: 8-bits Sine table, macro assembler multiplication short cut, no scaling in the core. Timing 10.1 millisecond!!!

Fastest. Best of the Best Ever written FFT code for 8-bit microprocessor.   Enjoy the meal:   https://docs.google.com/open?id=0Bw4tXXvyWtFVMldRT3NFMGNTZVN0Y0d4eVRsenVZdw

Here is slightly modified copy, where I moved sine table from RAM to FLASH memory using progmem utility. For someone, who was curious to find the answer: how much progmem slower compare to access data in the RAM, there is an answer. 10.16 milliseconds become 10.28, or 120 usec slower. Divide by 84 x 6 = 504 number of readings, each progmem costs 0.24 useconds. Its about 4 cycles CPU.

https://docs.google.com/open?id=0Bw4tXXvyWtFVQjZpZkw1c3VUZXlmaF9sOEJwMmpEUQ

Screenshot from the running application, signal generator running on the computer, feeding audio wave to OPA and than analog input 0. Look for hardware setup configuration on the “color organ” blog-post.

BTW, there is one more important thing, I missed to emphasize in my short introductory paragraph, code offers FLEXIBILITY over SNR ratio. Basic FFT algorithm has an intrinsic “build-in” GAIN: G(in) = FFT_SIZE / 2 . (in) stands for intrinsic. That is perfect value for fft_size = 64 ( Gain = 64 / 2 = 32) and arduino (Atmel AtMega328)  10-bit ADC ( max value = 1023 ). FFT output would be 32 x 1023 = 32736, exactly 15 bit + sign. In other words, scaling in the algorithm core doesn’t required at all! That alone improve speed and lower rounding noise error significantly. The same time G(in)  grows too high with FFT_SIZE = 256, when G = 256 / 2 = 128 and output of the FFT would overflow size of 16-bit integer math. But again, scaling don’t have to be 100%, as long as there is a way to keep it in balance with ADC data. In this particular case, with 10-bit ADC, we can keep gain just below 32, it’s not necessary to make it exactly “1″.  For 12-bit ADC upper G limit would be 8, still not “1″. To manipulate the gain, division by 2 (>> 1) in the “sum_dif_I” could be set, to prevent overflow with fft_size > 64. Right shift “gain limiter” creates a square root adjustment, according to new formula: G(rsl) = SQRT (FFT_SIZE) / 4 . (rsl) stands for right-shift-limiter.

  1.  G = 1 for fft_size = 16,
  2.  G = 2 for fft_size = 64,
  3.  G = 4 for fft_size = 256,
  4.  G = 8 for fft_size = 1024.

Summing up, for using RADIX-4 with arduino ADC and FFT_SIZE <= 64, keep division by 2 (>> 1) in the “sum_dif_I” commented out. In any other circumstances, >10 bits external ADC, >64 fft_size, uncomment it.

To be continue…..


[original story: coolarduino]