Operations efficiency – Memory vs. CPU speed
Ttelmah



Joined: 11 Mar 2010
Posts: 19195

Posted: Sun Jun 18, 2017 1:18 pm

There is a complete answer to your question in fft.c.

You need to look at the functions it _loads_. fft.h shows how to return values either using a direct register write or as a return from a C function. Remember also that you can write into a C variable just as easily as read from it.
viki2000



Joined: 08 May 2013
Posts: 233

Posted: Sun Jun 18, 2017 3:05 pm

I see now.
There are examples in fft.h and dsp_data_util.c.
I will check them in the next few days.
viki2000



Joined: 08 May 2013
Posts: 233

Posted: Tue Jun 20, 2017 2:24 pm

Is there a typo on page 103 of the PCD Reference Manual, where the #asm #endasm example is given?
https://www.ccsinfo.com/downloads/PCDReferenceManual.pdf

Shouldn't there be a "," instead of a "." on the following line?
Code:
MOV W0. _RETURN_

I ask because I get an error when I use the dot, and it compiles correctly when I use a comma.
PCM programmer



Joined: 06 Sep 2003
Posts: 21708

Posted: Tue Jun 20, 2017 4:14 pm

Yes, it's obviously one of CCS's code example typos.
viki2000



Joined: 08 May 2013
Posts: 233

Posted: Wed Jun 21, 2017 2:45 am

OK, then I "stripped out" the assembly of the Q15sinPI function from the Microchip fixed-point math library.
I have tested it. It is fast and accurate.
The function is executed for one angle value in 1.85-2us with the 80MHz internal clock of the PIC24HJ64GP202.
The accuracy, measured as deviation from the float sin() in Excel, is between 0.00958% and 0.0117%, calculated in the same manner as I described in the previous posts. That is a lot better than CORDIC or the 2nd-order polynomial.
The explanations below are not for experienced programmers, but rather for beginners.
Once more, here is the .lst file generated with XC16:
https://goo.gl/voce0R
Q15sinPI is called like this:
Code:
    do {
        Y=_Q15sinPI(X++);   //Computing sine value then incrementing X
 3b4:   80 04 e8       inc.w     w0, w9
 3b6:   8a ff 07       rcall     0x2cc <__Q15sinPI>
 3b8:   00 04 78       mov.w     w0, w8

From this I understood that the value X is passed in register W0. The inc.w instruction stores X+1 in W9 (W0 itself is not changed), and then the __Q15sinPI function is called. The result comes back in W0 and is moved to register W8.
There are 2 ways to implement the ASM code once it is extracted from the .lst file of the original XC16 build.
We can declare the ASM code as a subroutine and use, as its input/output parameters, global variables declared outside the ASM subroutine and outside the main() C code, right at the beginning of the program after the fuses. Or we can declare the ASM code as a function: a signed integer function with a signed integer argument. Then we can declare local variables in main() and call the function by assigning its result to a local variable. I tried both and both approaches work.
The "return" problem: when the ASM code is a subroutine, it may end with a simple "return" instruction. When the ASM code is a function, we use "_RETURN_" to pass the calculated value in register W0 back as the function result.
I used MPLAB X with the CCS plugin, because it allows simulation of the code. I am not aware whether the CCS IDE allows that; as far as I have seen, it only lets me debug in real time using the ICD-U64 adapter with the PIC connected to it, not offline without a real PIC. I then used Step Into (F7) to walk through the code while looking at some variables and SFRs under Watches.

The subroutine approach example:
Code:
#include <24HJ64GP202.h>
#use delay(internal=80MHz)

#FUSES FRC_PLL
#FUSES NOWDT                    //No Watch Dog Timer
#FUSES NOWRTB                   //Boot block not write protected
#FUSES NOBSS                    //No boot segment
#FUSES NORBS                    //No Boot RAM defined
#FUSES NOWRTSS                  //Secure segment not write protected
#FUSES NOSSS                    //No secure segment
#FUSES NORSS                    //No secure segment RAM
#FUSES NOWRT                    //Program memory not write protected
#FUSES NOPROTECT                //Code not protected from reading
#FUSES IESO                     //Internal External Switch Over mode enabled
#FUSES NOOSCIO                  //OSC2 is clock output
#FUSES IOL1WAY                  //Allows only one reconfiguration of peripheral pins
#FUSES CKSFSM                   //Clock Switching is enabled, fail Safe clock monitor is enabled
#FUSES WINDIS                   //Watch Dog Timer in non-Window mode
#FUSES PUT128                   //Power On Reset Timer value 128ms
#FUSES NOALTI2C1                //I2C1 mapped to SDA1/SCL1 pins
#FUSES NOJTAG                   //JTAG disabled

#pin_select U1TX=PIN_B6
#pin_select U1RX=PIN_B7
#use rs232(UART1, BAUD=115200, ERRORS)

signed int an, angle, res;

extern void fast_sin(){

#asm
 mov     w0, w2
 mov     #0x8000, w1
 clr     w0
 cpsne   w2, w1
 return   
 mov     w8, [w15++]
 mov     #0x1, w8
 cpsgt    w2, w0
 mov      #0xffff, w8
 mul.ss   w2, w8, w0
 mov      #0x4001, w2
 mov      #0x8000, w3
 cpslt    w0, w2
 sub      w3, w0, w0
 mov      #0x28bf, w2
 cp       w0, w2
 bra      GE, SinePI_CosCall
 mov      #0x6488, w2
 mul.ss    w0, w2, w2
 mov      #0x1000, w4
 add      w2, w4, w2
 addc     w3, #0x0, w3
 lsr      w2, #0xd, w2
 sl       w3, #0x3, w4
 ior      w2, w4, w0
 mov      w0, w2
 mov      #0x6bb5, w0
 mov      #0x7fff, w1
 cpsne    w2, w1
 bra      L_SIN_PI_RETURN
 mov      #0x944b, w0
 mov      #0x8001, w1
 cpsgt    w2, w1
 bra      L_SIN_PI_RETURN
 mul.ss   w2, w2, w6
 lsr      w6, #0xf, w6
 sl       w7, #0x1, w7
 ior      w6, w7, w3
 sl       w3, #0x1, w3
 mul.su    w2, w3, w4
 mov      #0x5555, w1
 mul.ss    w5, w1, w6
 asr      w7, #0x1, w7
 sub      w2, w7, w0
 mul.su    w5, w3, w4
 mov      #0x4444, w1
 mul.ss    w5, w1, w6
 asr      w7, #0x5, w7
 add      w0, w7, w0
 mul.su    w5, w3, w4
 mov      #0x6807, w1
 mul.ss    w5, w1, w6
 mov      #0x400, w5
 add      w7, w5, w7
 asr      w7, #0xb, w7
 sub      w0, w7, w0

L_SIN_PI_RETURN:
 bra      SIN_PI_END

SinePI_CosCall:
mov      #0x4000, w3
sub      w3, w0, w0
mov      #0x6488, w2
mul.ss   w0, w2, w2
mov      #0x1000, w4
add      w2, w4, w2
addc     w3, #0x0, w3
lsr      w2, #0xd, w2
sl       w3, #0x3, w4
ior      w2, w4, w0
mov      #0xff01, w1
mov      #0xff, w2
cp       w0, w1
bra      LT, SIN_PI_END
cp       w0, w2
bra      GT, L_SIN_PI_Cos_Else
mov      #0x7fff, w0
bra      SIN_PI_END

L_SIN_PI_Cos_Else:
 mov      w0, w2
 mov      #0x8000, w1
 mov      #0x4529, w0
 cpsne    w2, w1
 bra      SIN_PI_END
 mov      #0x7fff, w1
 mov      #0x7fff, w0
 mul.ss   w2, w2, w4
 lsr      w4, #0xf, w4
 sl       w5, #0x1, w5
 ior      w4, w5, w4
 sl       w4, #0x1, w4
 mul.us   w4, w1, w2
 mov      #0x8000, w7
 mul.su   w3, w7, w6
 sub      w0, w7, w0
 mul.us   w4, w3, w2
 mov      #0x5555, w7
 mul.su   w3, w7, w6
 asr      w7, #0x3, w7
 add      w0, w7, w0
 mul.us   w4, w3, w2
 mov      #0x2d83, w7
 mul.su   w3, w7, w6
 asr      w7, #0x7, w7
 sub      w0, w7, w0
 mul.us   w4, w3, w2
 mov      #0xd0, w7
 mul.su   w3, w7, w6
 asr      w7, #0x7, w7
 add      w0, w7, w0

SIN_PI_END:
 mul.ss   w0, w8, w0
 mov      [--w15], w8
 return
#endasm
}

void main(){
  while(TRUE){
    for (angle=-32768;angle<32767;angle++){
       #asm
         MOV angle, W0
         MOV W0, W9
       #endasm
        fast_sin();
       #asm
        MOV W0, res
        MOV W9, W0
       #endasm
       //Send to serial port
       putc(make8(res,1)); //MSB                 
       putc(make8(res,0)); //LSB
    }   
  }
}

The function approach:
Code:
#include <24HJ64GP202.h>
#use delay(internal=80MHz)

#FUSES FRC_PLL
#FUSES NOWDT                    //No Watch Dog Timer
#FUSES NOWRTB                   //Boot block not write protected
#FUSES NOBSS                    //No boot segment
#FUSES NORBS                    //No Boot RAM defined
#FUSES NOWRTSS                  //Secure segment not write protected
#FUSES NOSSS                    //No secure segment
#FUSES NORSS                    //No secure segment RAM
#FUSES NOWRT                    //Program memory not write protected
#FUSES NOPROTECT                //Code not protected from reading
#FUSES IESO                     //Internal External Switch Over mode enabled
#FUSES NOOSCIO                  //OSC2 is clock output
#FUSES IOL1WAY                  //Allows only one reconfiguration of peripheral pins
#FUSES CKSFSM                   //Clock Switching is enabled, fail Safe clock monitor is enabled
#FUSES WINDIS                   //Watch Dog Timer in non-Window mode
#FUSES PUT128                   //Power On Reset Timer value 128ms
#FUSES NOALTI2C1                //I2C1 mapped to SDA1/SCL1 pins
#FUSES NOJTAG                   //JTAG disabled

#pin_select U1TX=PIN_B6
#pin_select U1RX=PIN_B7
#use rs232(UART1, BAUD=115200, ERRORS)

int fast_sin(int an){

#asm
mov      an, w0
mov      w0, w9
mov      w0, w2
mov      #0x8000, w1
clr      w0
cpsne    w2, w1
return   
mov      w8, [w15++]
mov      #0x1, w8
cpsgt    w2, w0
mov      #0xffff, w8
mul.ss   w2, w8, w0
mov      #0x4001, w2
mov      #0x8000, w3
cpslt    w0, w2
sub      w3, w0, w0
mov      #0x28bf, w2
cp       w0, w2
bra      GE, SinePI_CosCall
mov      #0x6488, w2
mul.ss   w0, w2, w2
mov      #0x1000, w4
add      w2, w4, w2
addc     w3, #0x0, w3
lsr      w2, #0xd, w2
sl       w3, #0x3, w4
ior      w2, w4, w0
mov      w0, w2
mov      #0x6bb5, w0
mov      #0x7fff, w1
cpsne    w2, w1
bra      L_SIN_PI_RETURN
mov      #0x944b, w0
mov      #0x8001, w1
cpsgt    w2, w1
bra      L_SIN_PI_RETURN
mul.ss   w2, w2, w6
lsr      w6, #0xf, w6
sl       w7, #0x1, w7
ior      w6, w7, w3
sl       w3, #0x1, w3
mul.su   w2, w3, w4
mov      #0x5555, w1
mul.ss   w5, w1, w6
asr      w7, #0x1, w7
sub      w2, w7, w0
mul.su   w5, w3, w4
mov      #0x4444, w1
mul.ss   w5, w1, w6
asr      w7, #0x5, w7
add      w0, w7, w0
mul.su   w5, w3, w4
mov      #0x6807, w1
mul.ss   w5, w1, w6
mov      #0x400, w5
add      w7, w5, w7
asr      w7, #0xb, w7
sub      w0, w7, w0

L_SIN_PI_RETURN:
bra      SIN_PI_END

SinePI_CosCall:
mov      #0x4000, w3
sub      w3, w0, w0
mov      #0x6488, w2
mul.ss   w0, w2, w2
mov      #0x1000, w4
add      w2, w4, w2
addc     w3, #0x0, w3
lsr      w2, #0xd, w2
sl       w3, #0x3, w4
ior      w2, w4, w0
mov      #0xff01, w1
mov      #0xff, w2
cp       w0, w1
bra      LT, SIN_PI_END
cp       w0, w2
bra      GT, L_SIN_PI_Cos_Else
mov      #0x7fff, w0
bra      SIN_PI_END

L_SIN_PI_Cos_Else:
mov      w0, w2
mov      #0x8000, w1
mov      #0x4529, w0
cpsne    w2, w1
bra      SIN_PI_END
mov      #0x7fff, w1
mov      #0x7fff, w0
mul.ss   w2, w2, w4
lsr      w4, #0xf, w4
sl       w5, #0x1, w5
ior      w4, w5, w4
sl       w4, #0x1, w4
mul.us   w4, w1, w2
mov      #0x8000, w7
mul.su   w3, w7, w6
sub      w0, w7, w0
mul.us   w4, w3, w2
mov      #0x5555, w7
mul.su   w3, w7, w6
asr      w7, #0x3, w7
add      w0, w7, w0
mul.us   w4, w3, w2
mov      #0x2d83, w7
mul.su   w3, w7, w6
asr      w7, #0x7, w7
sub      w0, w7, w0
mul.us   w4, w3, w2
mov      #0xd0, w7
mul.su   w3, w7, w6
asr      w7, #0x7, w7
add      w0, w7, w0

SIN_PI_END:
mul.ss   w0, w8, w0
mov      [--w15], w8
mov      w0, _RETURN_
#endasm
}

void main(){
int angle, res;

  while(TRUE){
    for (angle=-32768;angle<32767;angle++){
        res=fast_sin(angle);
        //Send to serial port
        putc(make8(res,1)); //MSB                 
        putc(make8(res,0)); //LSB
    }   
  }
}

The ASM code may be written in a .h file and included in the main code as a library, just to look nicer.
The function accepts as its argument a signed 16-bit integer, -32768 to +32767, and returns the sine as a number in the same range.
What is the big deal with this ASM Q15sinPI approach?
It is faster and more accurate than any of the other approaches I tried.

There are 2 more challenges worth analyzing:
1) To reverse engineer the equation, the polynomial used by Microchip in the above ASM code.
2) To compare the result with the polynomials described here: http://www.coranac.com/2009/07/sines/
and see whether they are faster and/or have similar or better accuracy, using a C or ASM implementation.
viki2000



Joined: 08 May 2013
Posts: 233

Posted: Tue Jun 27, 2017 7:05 am

I tried to contact the author of the article http://www.coranac.com/2009/07/sines/ , but there was no answer, probably because the article is old (from 2009).
Then I decided to do the analysis myself, step by step, starting with the 3rd-order polynomial.
It is not easy, but rather an adventure.
After several trials on paper, I realized that an Excel emulation is easier for concrete situations, because the solution can be reached with many sets of parameters.
I set up the Excel worksheet below:
https://goo.gl/jKhKGe
You cannot open it online because it is too big, so it must be downloaded.
It contains the sine approximation with a 3rd-degree polynomial for different input ranges. The output is always 0…4095, so 4096 values = 2^12, equivalent to 12 bits.
From a practical point of view, I wanted the 3rd-order polynomial to have a 12-bit output because I might later use the function with a 12-bit DAC, which accepts only positive integer values in the range 0…4095.
The sheets inside the Excel file hold the calculations for 4 input ranges.
If x is the variable and S3(x) is the function, then I always wanted S3(x)=0…4095 and x to be:
1) 0…4095
2) 0…8191
3) 0…16383
4) 0…32767
It can obviously be calculated for the range 0…65535 as well, but that was not interesting for me, as you will see below.
The 3rd-order polynomial approximation has the form:
S3(x)=x*(3-x*x)/2
This function can be implemented as it is in C code. If we do that with x in the range 0…16383, x fits in an int16 variable and the result S3(x), 0…4095, fits in an int16 as well, but the multiplication x*x overflows 16 bits. I tried that and watched the SFRs. So we must declare both as int32; it does not work with x as int32 and S3(x) as int16.
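
The overflow can be reproduced on a PC with a small C sketch. The helper names are hypothetical, and the second function only imitates what is left when the product is kept in 16 bits.

```c
#include <stdint.h>

/* x*x for x in 0..16383 needs up to 28 bits, so widen BEFORE multiplying. */
uint32_t square_u16(uint16_t x)
{
    return (uint32_t)x * (uint32_t)x;
}

/* What survives if the product is stored in 16 bits: only the low word. */
uint16_t square_truncated(uint16_t x)
{
    return (uint16_t)((uint32_t)x * (uint32_t)x);
}
```

For x = 16383 the full product is 268402689, while the 16-bit version keeps only 32769, which is why the C result looks wrong until the operands are 32 bits wide.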

The above S3(x) function can be transformed into:
S3(x)=x*(3*2^p-x*x/2^r)/2^s
with r=2n-p and s=n+p+1-A

The idea of this transformation is to make every division a division by a power of 2, which is equivalent to a logical shift right. Likewise, multiplication by a power of 2 is equivalent to a logical shift left.

Here A is the power of 2 that gives the output range. In my case the output range is always 0…4095, so A=12, because 2^A=2^12=4096 values.
n is the power of 2 that gives the input range. For example, in case 1) above we have n=12 and in case 3) above we have n=14.
Then r and s depend only on how we choose p.
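
As a small illustration of the two relations above, this C sketch computes r and s from n, p and A. The struct and function names are invented for the example.

```c
/* Shift amounts for S3(x) = x*(3*2^p - x*x/2^r)/2^s. */
typedef struct {
    int r;   /* r = 2n - p */
    int s;   /* s = n + p + 1 - A */
} s3_shifts;

s3_shifts s3_shifts_for(int n, int p, int A)
{
    s3_shifts out;
    out.r = 2 * n - p;
    out.s = n + p + 1 - A;
    return out;
}
```

With n=14 and A=12, choosing p=14 gives r=14 and s=17, and choosing p=13 gives r=15 and s=16.
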
That is why I set up the Excel tables: we can get valid solutions for various values of p. It is enough to set A and n and then play with p. Of course, we also set certain constraints that limit the choice of p. At the same time, we look at the chart to see whether the sine generated by our 3rd-order polynomial overlays the sin function from Excel.
We can also set up a column with the error, the deviation between the Excel sin and our generated sine, and then change p until the error is reduced to a minimum. But that is not the only constraint.
I am going to use a PIC24 for the tests, at first with C code and later, for fast calculation, with a bit of ASM. The PIC24 is a 16-bit device; its working registers are 16 bits wide. Besides that, it multiplies two 16-bit integers and puts the 32-bit result in 2 successive registers.
http://microchipdeveloper.com/dsp0201:multiplier
http://microchipdeveloper.com/dsp0201:multiplication-instructions
Then, in the Excel table, I looked at each operation that had to be executed and tried to keep each result in one register, so within 16 bits (at most 65535), except for the multiplication.
Why calculate S3(x) for so many x ranges?
1) To check and see the duplicates.
Many neighbouring x values give the same S3(x) value, which is normal because it is not a linear function.
We need at most 4096 different output points of the function, so at most 4096 different input points are needed.
Basically x=0…4095 is the same as x=0…8191 with STEP 2, x=0…16383 with STEP 4, or x=0…32767 with STEP 8.
This helps when we want fewer input points and still want to sweep the entire output 0…4095.
2) If we can have only 1800 or 1600 points due to the speed limitation of the PIC processing the code, then we can take only some x values with a chosen step.
For example, we want 1800 points.
With x=0...4095
4096/1800=2.3, so STEP 2 will provide 2048 values (too many) or STEP 3 will provide 1365 values (too few).
With x=0...8191
8192/1800=4.6, so STEP 4 will provide 2048 values (too many) or STEP 5 will provide 1638 values (too few, but close).
With x=0...16383
16384/1800=9.1, so STEP 9 will provide 1820 values, very good.
With x=0...32767
32768/1800=18.2, so STEP 18 will also provide 1820 values, very good.
We may take the range x=0...16383 with STEP 9.
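
The step arithmetic above can be written as a one-line C helper. This is only a sketch: the name is made up, and it uses plain integer division like the figures in the list, so counting x=0 as well can add one more sample.

```c
/* Approximate number of samples when stepping through range_size inputs. */
unsigned points_for_step(unsigned range_size, unsigned step)
{
    return range_size / step;   /* integer division, as in the figures above */
}
```

For example, points_for_step(16384, 9) and points_for_step(32768, 18) both give 1820, the closest match to the 1800 points wanted.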

Another constraint is related to how p, r and s, and especially r and s, look. In ASM it is easier and faster to shift right by 1 or 16 bits than by 10 or 13 bits.
The idea of providing all these input ranges for x is to compare them and see which range gives better values of r, s and p for shifting bits in ASM.

For example, if I want to use 1800 points for x over the output range 0…PI/2 (=0…4095), then the best choice is the input range 0…16383 with step 9.
Among several valid sets of coefficients, this range has p=14, r=14, s=17, so S3(x) looks like:
S3(x)=x*(3*2^14-x*x/2^14)/2^17.
But if we choose p=13, then we get r=15 and s=16, and S3(x) looks like:
S3(x)=x*(3*2^13-x*x/2^15)/2^16.
If we compare the result of S3(x) for (14, 14, 17) with the result for (13, 15, 16), we see practically the same result.
The set (13, 15, 16) is preferable because in ASM it is easier to shift right by 16 and 15 than by 14 and 17.
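
The comparison of the two coefficient sets can also be done in C instead of Excel. The sketch below is a plain integer model, assuming x in 0…16383 (n=14) and a 12-bit output; s3_fixed is a hypothetical name.

```c
#include <stdint.h>

/* S3(x) = x*(3*2^p - (x*x >> r)) >> s, for x in 0..16383, output 0..4095. */
uint16_t s3_fixed(uint16_t x, unsigned p, unsigned r, unsigned s)
{
    uint32_t sq = (uint32_t)x * x;                    /* up to 28 bits */
    uint32_t inner = ((uint32_t)3 << p) - (sq >> r);  /* stays positive for these ranges */
    return (uint16_t)(((uint32_t)x * inner) >> s);
}
```

Because the right shifts truncate, the two sets do not have to match bit for bit: in this model s3_fixed(x, 13, 15, 16) is either equal to s3_fixed(x, 14, 14, 17) or larger by one count, over the whole input range.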

With the PIC24HJ64GP202 at 80MHz, the function:
S3(x)=x*(3*2^13-x*x/2^15)/2^16
is executed in approx. 4.7us, which is very good.

But looking at the ASM output in the .lst file, it is obviously general-purpose ASM code with too many operations for just a few multiplications and shifts.
I started to look into this manual of ASM instructions for the PIC24:
http://ww1.microchip.com/downloads/en/DeviceDoc/70157F.pdf
and came up with this code:
Code:
#asm
        //multiply x*x
        MOV x,W0
        MOV x,W1
        MUL.UU W0,W1,W2   

        //shift logical right with 15 (x*x>>15)
        RLNC W2, W2
        AND W2,#0x1,W2
        SL W3, W3
        IOR W3, W2, W0       

        //subtract (3*2^13) - (x*x>>15)
        MOV #0x6000,W1
        SUB W1,W0,W0
       
        //multiply x*((3*2^13) - (x*x>>15)), result in W3
        MOV x,W1
        MUL.UU W0,W1,W2
        MOV W3, S3x
#endasm

This does the job faster and implements the above function.

As I am not an ASM programmer, it took me a few hours of looking into the manual and imagining how the bits move from one register to another.
I also used the MPLAB X Simulator with the CCS compiler to look at the registers.
Here are a few explanations of the code.
Multiplying 2 unsigned integers is simple: we use 2 source registers, and the result goes into 2 successive registers, of which we specify only the first. We load W0 and W1 with x, multiply, and the result is in W2 and W3.
The logical shift right by 15 positions is tricky. My first thought was that the ASM logical-shift-right instruction would have to be called 15 times, and I wanted faster execution.
We have the 32-bit result in W2 and W3, and it must be shifted right by 15 bits. W3 is the most significant register and W2 is the least significant register.
What I do is move 1 bit in W2 from the MSB position to the LSB position by rotating left by 1 position. Then I apply a mask with the AND instruction and the constant 0x1, so all the bits become 0 except the LSB. That happens in W2.
Then I shift the bits in W3 logically left by one position. The LSB of W3 becomes 0.
Then we can overlay W3 and W2 with an OR instruction and get the same result as a shift right by 15 positions, but faster.
Next comes the subtraction of that register from a constant. The constant 0x6000 is 24576=3*2^13.
I checked that the result of this difference is always positive and less than 65536, so that it fits in one 16-bit register.
Then one more multiplication follows.
The final shift right by 16 bits is very nice, because I do not have to do it.
The multiplication result is stored in 2 successive 16-bit registers. Shifting by 16 bits would just mean moving one register into the other, which makes no sense. We simply take the value in W3 as the final result.
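
For readers who prefer C, the register trick can be modelled as below. It is only a sketch of the idea with invented names: hi and lo stand for W3 and W2 after the multiplication, and the result is assumed to fit in 16 bits, as it does for x in 0…16383.

```c
#include <stdint.h>

/* (hi:lo) >> 15 built from the two 16-bit halves of a 32-bit product. */
uint16_t shr15_from_halves(uint16_t hi, uint16_t lo)
{
    uint16_t carry_in = (uint16_t)(lo >> 15);    /* MSB of the low word (RLNC + AND #1) */
    uint16_t high_part = (uint16_t)(hi << 1);    /* SL on the high word */
    return (uint16_t)(high_part | carry_in);     /* IOR of the two */
}

/* (hi:lo) >> 16 is simply the high word; nothing has to be shifted. */
uint16_t shr16_from_halves(uint16_t hi, uint16_t lo)
{
    (void)lo;
    return hi;
}
```

Both helpers give the same values as shifting the full 32-bit product x*x by 15 or 16 bits, which is easy to check in a loop over all x in 0…16383.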

What if we want to change the amplitude?
If the chip is powered from 3V (or 3.3V) and uses that as the reference voltage for the DAC, then the 4096 values span 3V, and 1V is 3 times less. We have to divide the output of the S3(x) function by 3.
Dividing by 3 is not as easy as dividing by 2 with a shift, but it is fun with the help of some examples from the internet:
http://www.microchip.com/forums/m301063.aspx
http://www.hackersdelight.org/divcMore.pdf
I like this suggestion for 8 bits:
Quote:
“When you need to divide by a particular number "n", use a calculator and divide 256/n, drop the fractional part and add one. For n=3, the result is 86. Then multiply this result (86) by the number you want to be divided and shift right one byte.
Example: you want to divide 120 by 3. 256/3 + 1 = 86 (dropping the fractional part). Multiply 120 by 86 = 10320. Shift right 10320 one byte (same as divide by 256) = 40 (dropping the fractional part).”


In my case it is 16 bits. So for x/3 we compute 65536/3+1=21846, then x*21846, and finally we divide by 65536 by shifting right 16 bits: (x*21846)>>16.
But as I mentioned above for the shift right by 16 bits, in this case we just read the content of the most significant register; the result is there.
Basically we only multiply x by 21846 and that is equivalent to dividing by 3: we load x in one register and 21846=0x5556 in another, and the result lands in 2 successive registers. The most significant register of the product holds the result of the division by 3. That is tricky and fun.
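
The same divide-by-3 trick in C, as a minimal sketch (div3_mulhi is a made-up name):

```c
#include <stdint.h>

/* x/3 as the high word of x * 21846, where 21846 = 0x5556 = 65536/3 + 1 (truncated). */
uint16_t div3_mulhi(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 21846u) >> 16);
}
```

In this model the result equals x/3 for every x below 32768, which covers the 0…4095 output used here. For larger 16-bit inputs it can come out one too high, for example at x = 32768.
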
The ASM code looks like:
Code:
#asm
        //multiply x*x
        MOV x,W0
        MOV x,W1
        MUL.UU W0,W1,W2   

        //shift logical right with 15 (x*x>>15)
        RLNC W2, W2
        AND W2,#0x1,W2
        SL W3, W3
        IOR W3, W2, W0       

        //subtract (3*2^13) - (x*x>>15)
        MOV #0x6000,W1
        SUB W1,W0,W0
       
        //multiply x*((3*2^13) - (x*x>>15)), result in W3
        MOV x,W1
        MUL.UU W0,W1,W2
       
        //W3 divided by 3 (trick): the product goes to the pair W5:W4,
        //so the destination is given as the even register W4; result in W5
        MOV #0x5556, W4
        MUL.UU W3,W4,W4

        //get the result from W5 in S3x
        MOV W5, S3x
#endasm

This code executes in 400ns on a real PIC24 with an 80MHz clock and 3Vdc VDD.
I consider it a real improvement in terms of speed.
It produces a quarter of a sine, from 0 to PI/2, with 1800 points.
When I double it for 0…PI, I get a rectified sine wave of 1V, close to 99Hz (it can be tweaked to 100Hz), with 3600 points.
It is very simple and short code, more elegant than a lookup table and a lot faster.
There is also a downside: if the error between a float sine and this integer polynomial approximation is not acceptable, then we have to move to a higher-order polynomial, for example 5th order, which takes a bit more time to execute but approximates the sine more accurately.
I overlaid the signal generated with this 3rd-order polynomial on the sine generated by the oscilloscope itself (it has a signal generator), and visually they look fine. Of course, for the signal error and deviation, the discrete generated values should be analyzed, but I do not need that.
If I have time I will also try to implement the 5th-order polynomial approximation.
I wanted to look at these integer polynomial approximations because they offer high-speed processing in ASM, and to see how far the number of points can be increased compared with the lookup table solution that was suggested from the beginning.
The bottleneck is now the SPI, which needs around 2.64us per transfer. It runs with a 20MHz clock, but spi_xfer() is still slow compared with the 400ns fast sine approximation.
I was thinking that maybe it is time to try DMA as the next level of optimization, but I have no experience with it.


@Ttelmah
You said some days ago that:
Quote:
“I'm running SPI on a PIC24FJ256GB610 at 16Mbps. Using DMA.”


Could you please share the SPI and DMA part of your code?
I would like to start learning and be inspired by it.
Ttelmah



Joined: 11 Mar 2010
Posts: 19195

Posted: Tue Jun 27, 2017 9:20 am

I can't, it's commercial. Also some parts wouldn't work on currently released chips. (You'll only be able to go to 8Mbps on current chips).

If you talk to CCS, they will send you some updated files on the setup of the DMA on this family (these chips have a later DMA than the standard examples). What they send though still has a couple of faults. As standard they don't offer PING_PONG support in the configuration for the new DMA. To get really high continuous rates you have to enable this.
viki2000



Joined: 08 May 2013
Posts: 233

Posted: Tue Jun 27, 2017 2:24 pm

Could you at least list the faults you found in the support files you received for setting up the DMA?
A short description of how to enable PING-PONG would be nice.
Do really high continuous rates mean 16Mbps? And with PING-PONG mode?

I found Microchip forum threads with a bit more detail about possible DMA SPI problems:
http://www.microchip.com/forums/tm.aspx?m=400320&mpage=1&key=%F1%A1%AF%80
http://www.microchip.com/forums/tm.aspx?m=240039&mpage=1&key=%F0%BA%A6%A7
Then some DMA setup examples:
http://courses.ece.msstate.edu/ece3724/main_pic24/docs/sphinx/chap11/adc4simul_dma.html
http://courses.ece.msstate.edu/ece3724/main_pic24/docs/pic24__dma_8h_source.html
https://engineering.purdue.edu/ece477/Archive/2009/Spring/S09-Grp06/Code/PIC/pic24_code_examples/docs/dma__example_8c-source.html
https://github.com/UWARG/PICpilot/wiki/DMA
The DMA manual:
http://ww1.microchip.com/downloads/en/DeviceDoc/70215C.pdf
I will start reading about it.
viki2000



Joined: 08 May 2013
Posts: 233

Posted: Wed Jul 12, 2017 6:26 am

Here are a few observations about optimizations that can be made to the code above.
1) First of all, the divide-by-3 operation above should not be used, because it reduces the resolution of the signal: the step size in mV stays the same while the amplitude shrinks.
Code:
        //W3 divided by 3 (trick): the product goes to the pair W5:W4,
        //so the destination is given as the even register W4; result in W5
        MOV #0x5556, W4
        MUL.UU W3,W4,W4

Reading page 21 of the MCP4921 DAC datasheet, section "6.4.1.1 Decreasing The Output Step Size":
http://ww1.microchip.com/downloads/en/devicedoc/21897b.pdf
we can see that the preferred methods are either to change VREF or to use a voltage divider with 2 resistors at the output. I consider the latter the simplest and best method.
In conclusion, the initial, shorter code should be used:
Code:
#asm
        //multiply x*x
        MOV x,W0
        MOV x,W1
        MUL.UU W0,W1,W2   

        //shift logical right with 15 (x*x>>15)
        RLNC W2, W2
        AND W2,#0x1,W2
        SL W3, W3
        IOR W3, W2, W0       

        //subtract (3*2^13) - (x*x>>15)
        MOV #0x6000,W1
        SUB W1,W0,W0
       
        //multiply x*((3*2^13) - (x*x>>15)), result in W3
        MOV x,W1
        MUL.UU W0,W1,W2
        MOV W3, S3x
#endasm


2) The code above can be shortened even more if we use another trick in place of "//shift logical right with 15 (x*x>>15)".
Instead of shifting right by 15 bits, we shift right by 16 bits (that is, we take only W3) and then multiply by 2 (=shift left 1 bit).
Taking W3 needs no instruction, because the upper half of the product is already in its own register.
So everything is reduced to one multiplication by 2. The only difference from a true shift by 15 is that the top bit of W2 is dropped, so the intermediate value can be smaller by 1.
Code:
#asm
        //multiply x*x, the result is W2 and W3
        //only content of W3 is used
        MOV x,W0
        MOV x,W1
        MUL.UU W0,W1,W2   

        //shift logical right with 15 (x*x>>15)
        //by taking W3 (shift right with 16) and multiplying with 2
        //the result is W4 and W5, only content of W4 is used
        MUL.UU W3,#0x2,W4

        //subtract (3*2^13) - (x*x>>15), the result is in W4
        MOV #0x6000,W6
        SUB W6,W4,W4
       
        //multiply x*((3*2^13) - (x*x>>15)), result in W2 and W3
        //only W3 is used, being equivalent with shift right 16bit
        MUL.UU W1,W4,W2

        //get the result from W3 in S3x
        MOV W3, S3x
#endasm

The code was tested on a real device and works fine.
It executes in 250ns with the 80MHz internal clock of the PIC24HJ64GP202.
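
To see what the shortcut does to the numbers, here is a C model of both variants, written as a sketch with invented function names. It mirrors the register steps of the ASM but is not generated from it.

```c
#include <stdint.h>

/* Reference: S3(x) = x*(3*2^13 - (x*x >> 15)) >> 16, with a true shift by 15. */
uint16_t s3_exact(uint16_t x)
{
    uint32_t sq = (uint32_t)x * x;
    uint32_t inner = 0x6000u - (sq >> 15);
    return (uint16_t)(((uint32_t)x * inner) >> 16);
}

/* Model of the shortened sequence: take the high word of x*x, then double it. */
uint16_t s3_fast(uint16_t x)
{
    uint16_t hi = (uint16_t)(((uint32_t)x * x) >> 16);   /* W3 after the first MUL.UU */
    uint16_t doubled = (uint16_t)(hi * 2u);               /* W4 after MUL.UU W3,#0x2,W4 */
    uint16_t inner = (uint16_t)(0x6000u - doubled);       /* W4 after SUB W6,W4,W4 */
    return (uint16_t)(((uint32_t)x * inner) >> 16);       /* W3 after the last MUL.UU */
}
```

In this model s3_fast(x) is either equal to s3_exact(x) or larger by one count for x in 0…16383. One case to watch is the very top of the range: x = 16383 gives 4096 instead of 4095, which no longer fits in 12 bits. With STEP 9 the largest input is 16380, where the result is 4095.
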
viki2000



Joined: 08 May 2013
Posts: 233

Posted: Wed Jul 12, 2017 7:47 am

For anyone interested, I tried the 5th-order polynomial approximation as recommended here: http://www.coranac.com/2009/07/sines/
but of course for the 16-bit PIC24, not for a 32-bit ARM.

More interesting than the result and the implementation are the steps needed to arrive at a suitable form of the polynomial approximation.
In the above link it is left as a challenge for the reader, following the S3(x) example.
We start from here:
S5(z)=1/2*z*(π-z^2*[(2π-5)-z^2*(π-3)])
and continue in the following Word file, which you have to download to see the polynomials nicely formatted:
https://goo.gl/HCc9JT
The implementation in ASM is here:
Code:
#asm
//multiply x*x and divide by 2^16, the result is in W3
  MOV x,W0
  MOV x,W1
  MUL.UU W0,W1,W2   

//multiply by 9279=0x243F and divide by 2^16, the result is in W7
  MOV #0x243F,W4
  MUL.UU W3,W4,W6   

//subtract from 5256=0x1488, the result is in W6
  MOV #0x1488,W6
  SUB W6,W7,W6

//multiply with x and divide by 2^16, the result is in W9
  MUL.UU W6,W0,W8     
             
//multiply by 2^5=32=0x20, the result is in W10
  MOV #0x20,W5
  MUL.UU W5,W9,W10

//multiply with x and divide by 2^16, the result is in W7
  MUL.UU W0,W10,W6
   
//subtract from 25736=0x6488, the result is in W7
  MOV #0x6488,W6
  SUB W6,W7,W7
 
//multiply with x, which is in W0 or W1 and divide by 2^16, the result is in W3
  MUL.UU W0,W7,W2   

//get the result from W3 in S5x
  MOV W3, S5x
#endasm

The code was tested on a real PIC and takes 425ns with the 80MHz internal clock of the PIC24HJ64GP202.
The 5th-order polynomial approximates the sine better than the 3rd-order one:
https://www.desmos.com/calculator/xxkkb0gmvw
The only big headache with a polynomial approximation is finding the proper form: the constants and the power-of-2 exponents that avoid both overflow and values that become too small after division/shifting right. These constants depend on the input range and output range of the function and must be recalculated for different ranges.
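
The scaling of the constants is easier to follow in a C model of the same steps. This is a sketch under the assumption that x runs over 0…16383 for 0…PI/2 and that the output is 0…4095. The names are invented, and the scale factors in the comments are my reading of the constants 0x243F, 0x1488 and 0x6488.

```c
#include <stdint.h>

/* High 16 bits of an unsigned 16x16 multiply (the upper register after MUL.UU). */
uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Fifth-order sine with z = x/16384: result is about 4096 * S5(z). */
uint16_t s5_fixed(uint16_t x)
{
    uint16_t z2    = mulhi_u16(x, x);           /* z^2, scaled by 2^12 */
    uint16_t c     = mulhi_u16(z2, 0x243F);     /* z^2*(pi-3), scaled by 2^12 */
    uint16_t inner = (uint16_t)(0x1488 - c);    /* (2pi-5) - z^2*(pi-3), scaled by 2^12 */
    uint16_t t     = mulhi_u16(inner, x);       /* times z, scale drops to 2^10 */
    uint16_t t32   = (uint16_t)(t * 32u);       /* rescaled to 2^15 */
    uint16_t u     = mulhi_u16(x, t32);         /* z^2*[...], scaled by 2^13 */
    uint16_t outer = (uint16_t)(0x6488 - u);    /* pi - z^2*[...], scaled by 2^13 */
    return mulhi_u16(x, outer);                 /* 1/2*z*(...), scaled by 2^12 */
}
```

As a spot check, s5_fixed(8192) returns 2897, while 4096*sin(PI/4) is about 2896.3, and s5_fixed(16383) returns 4095.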

The lookup table version was implemented in C, and retrieving one element from the 2K array takes 1.425us.
The polynomial implementation in C is slow, but still better than the default float sin(x) function.
The ASM version of the 3rd-order polynomial takes 250ns to calculate one value.
The ASM version of the 5th-order polynomial takes 425ns to calculate one value.
The Microchip fixed-point library _Q15sinPI takes 1.9us, but probably has better accuracy by using a higher-order polynomial.
The CORDIC approach in C is better than the polynomial in C in both accuracy and speed (depending on the number of loops), but slower than the polynomial in ASM.
When accuracy matters and the number of points is low, the LUT is the best. When the accuracy of the polynomial approximation is acceptable for the application and processing speed is important, the polynomial approximation in ASM is the best. If we add DMA to the polynomial in ASM, then it is a rocket.
This is the comparison that I was looking for from the beginning of my questions.
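
The times above can be turned into instruction cycles with a small helper. This is a sketch with a hypothetical function name, assuming the usual PIC24H relation Fcy = Fosc/2, so 40MHz (25ns per cycle) at an 80MHz clock.

```c
/* Instruction cycles for a measured time, assuming Fcy = Fosc/2. */
unsigned cycles_from_ns(unsigned time_ns, unsigned fosc_mhz)
{
    unsigned fcy_mhz = fosc_mhz / 2u;             /* instruction clock in MHz */
    return (time_ns * fcy_mhz + 500u) / 1000u;    /* rounded to the nearest cycle */
}
```

Under that assumption, 250ns is 10 cycles, 425ns is 17 cycles, the 1.425us table lookup is 57 cycles and the 1.9us _Q15sinPI call is 76 cycles.
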
All times are GMT - 6 Hours