Operations efficiency – Memory vs. CPU speed

Ttelmah · Joined: 11 Mar 2010 Posts: 19895

There is a complete answer to your question in fft.c

You need to look at the functions it _loads_. fft.h shows how to return values either using a direct register write, or as a return from a C function. Remember also that you can write into a C variable just as easily as read from it.

viki2000 · Joined: 08 May 2013 Posts: 233

I see now.
There are examples in fft.h and dsp_data_util.c
I will check that over the next few days.

viki2000 · Joined: 08 May 2013 Posts: 233

Is it true that there is a typo on page 103 of the PCD Reference Manual, where the #asm #endasm example is provided?
https://www.ccsinfo.com/downloads/PCDReferenceManual.pdf

Shouldn't it be "," instead of "." on the next line?

PCM programmer · Joined: 06 Sep 2003 Posts: 21708

Yes, it's obviously one of CCS's code example typos.

viki2000 · Joined: 08 May 2013 Posts: 233

OK, then I "stripped out" the Q15sinPI function in assembly from the Microchip fixed-point math library.
I have tested it. It is fast and accurate.
The function is executed for one angle value in 1.85-2us at the 80MHz internal clock of the PIC24HJ64GP202.
The accuracy, measured as the deviation from the floating-point sin() in Excel, is between 0.00958% and 0.0117%, calculated in the same manner as I described in the previous posts. That is much better than CORDIC or a 2nd-order polynomial.
The explanations below are not for experienced programmers, but rather for beginners.
Once again, here is the .lst file generated with XC16:
https://goo.gl/voce0R
The call of Q15sinPI is done like this:

viki2000 · Joined: 08 May 2013 Posts: 233

I tried to contact the author of the article http://www.coranac.com/2009/07/sines/ , but there was no answer, probably because the article is old (from 2009).
Then I decided to do the analysis myself, step by step, starting with the 3rd-order polynomial.
It is not easy, but rather an adventure.
After several trials on paper, I realized that an Excel emulation is easier for concrete situations, because a solution may be reached with many sets of parameters.
I set up the Excel worksheet below:
https://goo.gl/jKhKGe
You cannot open it online because it is big, so it has to be downloaded.
It contains the sine approximation with a 3rd-degree polynomial for different input ranges, and the output is always 0…4095, so 4096 values = 2^12, equivalent to 12 bits.
From a practical point of view I thought the 3rd-order polynomial should have a 12-bit output, because I might later use that function with a 12-bit DAC, which accepts only positive integer values in the range 0…4095.
The calculation sheets inside the Excel file are arranged with calculations for 4 input ranges.
If x is the variable and S3(x) is the function, then I wanted always S3(x)=0…4095 and then x to be:
1) 0…4095
2) 0…8191
3) 0…16383
4) 0…32767
It can obviously be calculated for the range 0…65535 as well, but that was not interesting for me, as you will see below.
The 3rd-order polynomial approximation has the form:
S3(x)=x*(3-x*x)/2
This function can be implemented as it is in C code. If we do that with x in the range 0…16383, then x fits in an int16 variable, and the result S3(x) is 0…4095, so it also fits in an int16 variable. However, the multiplication x*x overflows 16 bits (I tried that and watched it in the SFRs), so both must be declared int32. It does not work with x as int32 and S3(x) as int16.
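A minimal sketch of the overflow described above, assuming a target where int is 16 bits wide, as on PIC24 compilers. The helper names are hypothetical and only illustrate the point: a 16-bit result keeps just the low word of x*x, while widening one operand to 32 bits before the multiplication keeps the full product.

```c
#include <stdint.h>

/* Low word only: what is left of x*x when the result
   is stored in a 16-bit variable. */
uint16_t square_low16(uint16_t x)
{
    return (uint16_t)((uint32_t)x * x);
}

/* Full product: widen one operand to 32 bits before multiplying. */
uint32_t square_full32(uint16_t x)
{
    return (uint32_t)x * x;
}
```

For x = 16383 the full product is 268402689, while the low word alone is 32769, which is useless for the rest of the calculation.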

The above S3(x) function can be transformed as:
S3(x)=x*(3*2^p-x*x/2^r)/2^s
with r=2n-p and s=n+p+1-A

The idea of this transformation is to make every division a division by a power of 2, which is equivalent to a logical shift right. Likewise, multiplication by a power of 2 is equivalent to a logical shift left.

Here A is the power of 2 which gives the output range. In my case the output range is always 0…4095, so A=12, because 2^A=2^12=4096 values.
n is the power of 2 that provides the input range. For example in case 1) above we have n=12 and in the case 3) above we have n=14.
Then we can calculate r and s based only on how we choose p.
That's why I set up the Excel tables, because we can get valid solutions for various p values. It is enough to set A and n, then we play with p for several values. Of course we set up certain constraints, which also limit the value of the chosen p. At the same time we look at the chart to see if the sine generated with our 3rd-order polynomial overlays the sin function from Excel.
We can also set up a column with the error, the deviation between the Excel sin and our generated sine, and then change p until the error is reduced to a minimum. But that is not the only constraint.
I am going to use a PIC24 for the tests, at first with C code and later, for fast calculation, with a bit of ASM. The PIC24 is a 16-bit device: its internal registers are 16 bits wide. Besides that, it offers multiplication of two 16-bit integers, with the 32-bit result placed in 2 successive registers.
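The relation between n, A, p, r and s can be sketched in C as below. This is only an illustration of the formula above, not the code used in the tests: the function name s3_fixed is hypothetical, and it assumes p <= 2n and that n, p and A are chosen so that every intermediate value fits in 32 bits.

```c
#include <stdint.h>

/* S3(x) = x*(3*2^p - x*x/2^r)/2^s  with r = 2n-p, s = n+p+1-A.
   x is in 0 .. 2^n - 1, the result is in 0 .. 2^A - 1.
   Assumption: intermediates stay below 2^32 for the chosen n, p, A. */
uint32_t s3_fixed(uint32_t x, unsigned n, unsigned A, unsigned p)
{
    unsigned r = 2u * n - p;
    unsigned s = n + p + 1u - A;
    uint32_t t = (x * x) >> r;            /* x*x / 2^r          */
    uint32_t u = ((uint32_t)3 << p) - t;  /* 3*2^p - x*x/2^r    */
    return (x * u) >> s;                  /* x*(...) / 2^s      */
}
```

For example, s3_fixed(16383, 14, 12, 14) returns 4095 (full scale) and s3_fixed(8192, 14, 12, 14) returns 2816, which is 0.6875*4096, the value of x*(3-x*x)/2 at x=0.5.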
http://microchipdeveloper.com/dsp0201:multiplier
http://microchipdeveloper.com/dsp0201:multiplication-instructions
Then, in the Excel table, I looked at each operation that had to be executed and tried to keep each result in one register, so in 16 bits, meaning no more than 65535, except for the multiplication.
Why calculate S3(x) for so many x ranges?
1) To check and see the duplicates.
Many neighbouring x values give the same S3(x) value, which is normal because it is not a linear function.
We need at most 4096 different output points of the function, so the input needs at most 4096 different points.
Basically the range up to x=4095 is the same as the range up to x=8191 with STEP 2, up to x=16383 with STEP 4 or up to x=32767 with STEP 8.
This helps when we want fewer input points and still want to sweep the entire output 0…4095.
2) If we can have only 1800 or 1600 points due to the speed limitation of the PIC processing the code, then we can take only some x values with a chosen step.
For example we want 1800 points.
With x=0...4095
4096/1800=2.2, so STEP 2 will provide 2048 values (too many) or STEP 3 will provide 1365 values (too few).
With x=0...8191
8192/1800=4.5, so STEP 4 will provide 2048 values (too many) or STEP 5 will provide 1638 values (too few, but we come close).
With x=0...16383
16384/1800=9.1, so STEP 9 will provide 1820 values, very good.
With x=0...32767
32768/1800=18.2, so STEP 18 will also provide 1820 values, very good.
We may take the range x=0...16383 with STEP 9
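The step arithmetic above can be reproduced with two small helpers. They are hypothetical names written for illustration, and they use plain integer division like the figures above; the exact number of samples can be one higher when the step does not divide the range evenly.

```c
#include <stdint.h>

/* Approximate number of samples when x sweeps `range` input values
   in increments of `step` (integer division, as in the figures above). */
uint16_t points_for_step(uint32_t range, uint16_t step)
{
    return (uint16_t)(range / step);
}

/* First guess for the step: range / target, rounded to nearest. */
uint16_t nearest_step(uint32_t range, uint16_t target)
{
    return (uint16_t)((range + target / 2u) / target);
}
```

With a target of 1800 points, nearest_step(16384, 1800) gives 9 and points_for_step(16384, 9) gives 1820, matching the choice of the range x=0...16383 with STEP 9.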

Another constraint is related to what p, r and s, and especially r and s, look like. In ASM it is easier and faster to shift right by 1 or 16 bits than by 10 or 13 bits.
The idea of providing all these input ranges for x is to compare them and see which range provides better values of r, s and p when we shift bits in ASM.

For example, if I want to use 1800 points/values of x for the output range 0…PI/2 (=0…4095), then the best would be the input range 0…16383 with step 9.
This range has, among several valid sets of coefficients, the values p=14, r=14, s=17, and then S3(x) looks like:
S3(x)=x*(3*2^14-x*x/2^14)/2^17.
But if we choose p=13, then we get r=15 and s=16 and S3(x) looks like:
S3(x)=x*(3*2^13-x*x/2^15)/2^16.
If we compare the result of S3(x) with the set (14, 14, 17) against the result with (13, 15, 16), we see that we get the same result.
The set (13, 15, 16) is preferable because in ASM it is easier to shift right by 16 and 15 than by 14 and 17.
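The comparison of the two coefficient sets can also be done in C instead of Excel. The sketch below is an assumption about how such a check might look (function names are hypothetical), for x in 0…16383 and output 0…4095. Because the intermediate x*x/2^r is truncated differently in the two variants, the results can differ by at most one count.

```c
#include <stdint.h>

/* p=14, r=14, s=17:  x*(3*2^14 - x*x/2^14)/2^17 */
uint16_t s3_p14(uint16_t x)
{
    uint32_t t = ((uint32_t)x * x) >> 14;
    return (uint16_t)(((uint32_t)x * (49152UL - t)) >> 17);
}

/* p=13, r=15, s=16:  x*(3*2^13 - x*x/2^15)/2^16 */
uint16_t s3_p13(uint16_t x)
{
    uint32_t t = ((uint32_t)x * x) >> 15;
    return (uint16_t)(((uint32_t)x * (24576UL - t)) >> 16);
}

/* Largest absolute difference between the two variants
   over the whole input range 0..16383. */
uint16_t s3_max_difference(void)
{
    uint32_t x;
    uint16_t worst = 0;
    for (x = 0; x <= 16383; x++) {
        uint16_t a = s3_p14((uint16_t)x);
        uint16_t b = s3_p13((uint16_t)x);
        uint16_t d = (a > b) ? (uint16_t)(a - b) : (uint16_t)(b - a);
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

In the second variant the final shift by 16 simply means taking the upper word of the 32-bit product, which is why it maps well to the 16-bit registers.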

With PIC24HJ64GP202 at 80MHz, the function:
S3(x)=x*(3*2^13-x*x/2^15)/2^16.
is executed in approx. 4.7us, which is very good.

But looking at the ASM output in the .lst file, it is obviously general-purpose ASM code with too many operations for just a few multiplications and shifts.
I started to look into this manual of ASM instructions for the PIC24:
http://ww1.microchip.com/downloads/en/DeviceDoc/70157F.pdf

Ttelmah · Joined: 11 Mar 2010 Posts: 19895

I can't, it's commercial. Also some parts wouldn't work on currently released chips. (You'll only be able to go to 8Mbps on current chips).

If you talk to CCS, they will send you some updated files on the setup of the DMA on this family (these chips have a later DMA than the standard examples). What they send though still has a couple of faults. As standard they don't offer PING_PONG support in the configuration for the new DMA. To get really high continuous rates you have to enable this.

viki2000 · Joined: 08 May 2013 Posts: 233

Could you at least list the faults you found in the support files you received for setting up the DMA?
A short description of how to enable PING-PONG would be nice.
Does "high continuous rates" mean 16Mbps? And with PING-PONG mode?

I found Microchip forum with a bit more details about DMA SPI possible problems:
http://www.microchip.com/forums/tm.aspx?m=400320&mpage=1&key=%F1%A1%AF%80
http://www.microchip.com/forums/tm.aspx?m=240039&mpage=1&key=%F0%BA%A6%A7
Then some setup example of DMA:
http://courses.ece.msstate.edu/ece3724/main_pic24/docs/sphinx/chap11/adc4simul_dma.html
http://courses.ece.msstate.edu/ece3724/main_pic24/docs/pic24__dma_8h_source.html
https://engineering.purdue.edu/ece477/Archive/2009/Spring/S09-Grp06/Code/PIC/pic24_code_examples/docs/dma__example_8c-source.html
https://github.com/UWARG/PICpilot/wiki/DMA
The DMA manual:
http://ww1.microchip.com/downloads/en/DeviceDoc/70215C.pdf
I will start to read about it.

viki2000 · Joined: 08 May 2013 Posts: 233

Here are a few observations about optimizations that can be made to the code above.
1) First of all, the divide-by-3 operations above should not be used, because they will reduce the resolution of the signal (the step size in mV).

viki2000 · Joined: 08 May 2013 Posts: 233

For anyone who is interested, I tried the 5th-order polynomial approximation as recommended here: http://www.coranac.com/2009/07/sines/
but of course for the PIC24 on 16 bits, not for ARM on 32 bits.

More interesting than the result and the implementation are the steps needed to arrive at a certain format of the polynomial approximation.
In the above link it is left as a challenge for the reader, following the S3(x) example.
We start from here:
S5(z)=1/2 z(π-z^2 [(2π-5)-z^2 (π-3)])
I continue in the following Word file, which preserves the formatting of the polynomials; you have to download it to see them properly:
https://goo.gl/HCc9JT
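As a quick sanity check of the starting formula only (not of the fixed-point derivation in the Word file), S5(z) can be evaluated in floating point and compared with known sine values. This is a sketch under the assumption that z in 0…1 corresponds to the angle 0…PI/2; the names s5_float and s5_error are hypothetical.

```c
#define S5_PI 3.14159265358979323846

/* S5(z) = 1/2 * z * (pi - z^2 * [(2*pi - 5) - z^2 * (pi - 3)])
   approximates sin(pi/2 * z) for z in 0..1. */
double s5_float(double z)
{
    double z2 = z * z;
    return 0.5 * z * (S5_PI - z2 * ((2.0 * S5_PI - 5.0) - z2 * (S5_PI - 3.0)));
}

/* Absolute difference from a known sine value,
   e.g. sin(45 deg) = 0.70710678 at z = 0.5. */
double s5_error(double z, double exact_sine)
{
    double d = s5_float(z) - exact_sine;
    return (d < 0.0) ? -d : d;
}
```

At z=0.5 this gives about 0.7074 against the exact 0.7071, and the formula hits 0 and 1 at the ends of the range.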

The implementation in ASM is here: