Operations efficiency – Memory vs. CPU speed

viki2000 · Joined: 08 May 2013 Posts: 233

Additional info about CORDIC and implementation in C:
https://en.wikiversity.org/wiki/CORDIC_Hardware_Implementations
https://en.wikibooks.org/wiki/Digital_Circuits/CORDIC
http://www.eetimes.com/document.asp?doc_id=1271838
https://people.sc.fsu.edu/~jburkardt/cpp_src/cordic/cordic.html
http://forums.parallax.com/discussion/88799

I added some text description at the defined numbers in the suggested Cordic code:

RF_Developer · Joined: 07 Feb 2011 Posts: 839

What is the "best" way to calculate sin(x) (or cos(x) or whatever)?

That simple question has many different answers and, as we've tried to explain over and over, what is "best" depends on your application. Perhaps surprisingly, though it really shouldn't be, "best" doesn't often mean "fastest", nor necessarily "most accurate". Also bear in mind that best speed and best accuracy/precision can rarely, if ever, be had at the same time: any choice trades one off against the other.

It is impossible, in general, to compute sin(x) exactly. You will never get a totally precise value of sin(x), because it is an infinite series which cannot be computed in full: it would take infinite time. So all computations, not some or most, but ALL computations of sin(x) are approximations. The "best" from a mathematician's perspective would be the full sine series, which can be written down exactly as an infinite sum, but which cannot be computed exactly.

So what can you compute? Well, you can compute an approximation to sin(x) to any specified finite precision. Computers do it all the time. The questions, as with all practical computations, are: how precise, how long will it take, and what resources are needed?

One thing to get past is that all computations, all approximations, can be no better than +/-one half least significant bit. In floating point that's of the mantissa, but for fixed point it's of all the available bits. Hold that thought...

Apparently, Excel uses an early variant of IEEE754 binary64 (i.e. 64 bit floating point, or double precision float). I can pretty much guarantee that it uses the floating point co-processor hardware integrated in the CPU cores, and that will use a Tchebychev-based polynomial approximation. Excel is no more "right" than any other computation done to the same precision. As I said before, there are reasons why Tchebychev is used: a) it gives known errors, which are chosen to be as close to the available precision of the floating point format as is practical, and b) it is known and proved to converge fastest, i.e. to use the fewest terms to give the required precision. Done right, it is as accurate and fast as any approximation using a floating point format can be. That's not to say it's guaranteed to be done right: Intel have been known to make significant floating point implementation errors...

The CCS routines essentially do the same, but implemented in firmware rather than floating point hardware. On PIC24s, as I described before, there are 32 (float), 48 and 64 (double) bit floating point implementations of sin() and cos(). You can hope, though probably not expect, that the 64 bit double version gives essentially the same result as Excel. The problem is that the floating point hardware in the PC, and hence Excel, probably uses a longer mantissa internally, if only by a few bits, with greater potential precision (but still no better than +/-half a bit, though it's likely to hit that more often) than the C version, which can only have the 53 bit internal precision allowed by the double format.

For most of us, most of the time, the CCS sin() is the "best". Why? Because it's there, is known to work, is easy to use, and we don't need anything faster or more precise. Therefore it's a no-brainer: use it. It saves a lot of development time, and in general, in real projects, that's where most of the cost goes: development. Anything that saves time and money and gets our product to market sooner is a plus. Therefore we'll use the CCS routines more than 99% of the time. They are the "best".

Well, of course they aren't the absolute best in all circumstances. Sometimes, rarely, we might want something faster, in which case a different approximation might be more appropriate, or a pure integer implementation, but those are unlikely to give the same accuracy or precision. Even 32 bit float has a 24 bit mantissa, and if done right will always produce a more precise result than a 16 or 15 bit implementation.

CORDIC offers another route, but you have to remember where it came from and why it was developed. It was an analogue computing thing. Analogue computers, while remarkably powerful in their own way, were great at adding and multiplying by constants (op-amps do that no trouble at all) and were pretty good at integrating and differentiating (hence were great for PID control, which was first developed for analogue computing), but they were poor at multiplying changing values (multiplying by a constant was easy: just set up an amplifier with the right gain and you're there). The CORDIC process was developed as a way of doing trig functions with such analogue computers, which were bad at multiplying. Hence CORDIC essentially uses just additions, with a few constant multiplications thrown in for scaling. It is an iterative process, which is easy enough to implement in analogue by using feedback.

CORDIC is especially useful in some digital applications where hardware multiplication is not available... but it is available on most PICs, and PIC24s have a full 16x16 single cycle hardware multiplier, so multiplications are just as fast as additions, hence CORDIC's unique selling point is nullified. Its iterative approach will always be slower than a well implemented polynomial method, which goes in one pass. CORDIC's time has largely passed.
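To make the "additions and shifts, iterated" idea concrete, here is a minimal sketch of a 16-bit CORDIC rotation, assuming angles and results are stored as integers scaled by 16384. The function and table names are made up for illustration and are not taken from any code posted in this thread; the table entries are atan(2^-k) * 16384, truncated, and 9949 is about 0.60725 * 16384, the usual gain correction.

```c
#include <stdint.h>

#define CORDIC_ITERS 14

/* atan(2^-k) * 16384, truncated, for k = 0..13 (radians scaled by 16384). */
static const int16_t cordic_atan_tab[CORDIC_ITERS] = {
    12867, 7596, 4013, 2037, 1022, 511, 255, 127, 63, 31, 15, 7, 3, 1
};

/* About 0.60725 * 16384: starting x here pre-compensates the CORDIC gain,
   so no multiply is needed at the end. */
#define CORDIC_INV_GAIN 9949

/* Arithmetic shift right that only ever shifts non-negative values, so it
   does not depend on how the compiler treats >> on negative numbers. */
static int32_t asr32(int32_t v, int k)
{
    return (v < 0) ? ~((~v) >> k) : (v >> k);
}

/* theta: angle in radians * 16384, intended range -pi/2..+pi/2.
   Outputs: sine and cosine, each scaled by 16384. */
void cordic_sincos_q14(int16_t theta, int16_t *sin_out, int16_t *cos_out)
{
    int32_t x = CORDIC_INV_GAIN, y = 0, z = theta;

    for (int k = 0; k < CORDIC_ITERS; ++k) {
        int32_t dx = asr32(y, k);   /* y * 2^-k */
        int32_t dy = asr32(x, k);   /* x * 2^-k */
        if (z >= 0) {               /* rotate towards the target angle */
            x -= dx; y += dy; z -= cordic_atan_tab[k];
        } else {                    /* overshot: rotate back */
            x += dx; y -= dy; z += cordic_atan_tab[k];
        }
    }
    *sin_out = (int16_t)y;
    *cos_out = (int16_t)x;
}
```

Each pass adds roughly one bit of angle resolution, which is why the accuracy depends on the iteration count and why the loop cannot be shortened without giving up precision.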

Polynomials can also be computed fairly simply with a bit of reorganisation. Writing them as f*x^5 + e*x^4 + d*x^3... and so on hides the computational short cuts that can be used. You start off with x, multiplying by its constant, then you multiply x by x and by the x^2 constant, then multiply by x again and by the x^3 constant, and so on. Some DSPs have hardware to do this, and can evaluate polynomials very fast indeed. I'm not sure if the dsPICs have it, but it's the sort of thing they might have. Leveraging such hardware can make polynomial evaluation almost as easy and fast as addition. There are almost certainly DDS chips that use such hardware internally to generate their output; in fact, I can see almost no reason to use anything else.
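One common way to do this kind of reorganisation is Horner's scheme, which nests the multiplications so that each coefficient costs one multiply and one add. The sketch below uses double for clarity and a made-up function name; a fixed-point version on a PIC would follow the same pattern with integer multiplies and rescaling shifts.

```c
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) using Horner's scheme:
   one multiply and one add per coefficient, no explicit powers of x. */
double poly_horner(const double *c, size_t n, double x)
{
    double acc = 0.0;
    while (n-- > 0) {
        acc = acc * x + c[n];   /* fold in the next lower coefficient */
    }
    return acc;
}
```

For example, with coefficients {1, 2, 3} and x = 2 this computes (3*2 + 2)*2 + 1 = 17, the same as 1 + 2*2 + 3*2^2. A sine approximation would simply pass in its own fitted coefficients.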

I could go on and on, but I have to go home now. Please, please think again about what you are trying to do. Currently you are bogged down in what is essentially a blind alley, which tends to be what happens if you don't have any particular direction to go in. There is no best way of computing sin(x). There are many best ways, depending on what your requirements and resources are. They are all trade-offs. They are approximations, of varying precision.

viki2000 · Joined: 08 May 2013 Posts: 233

Good thoughts on how to approach the problem in principle.
Let's get specific.
As the reference for accuracy and speed of execution I want to use _Q15sinPI (declared in libq.h; the code is actually in libq-elf.a) from Microchip XC16.
The reason for using fixed point math with 16 bit integers instead of the float sin function is speed, even if we lose accuracy when producing the sine signal. That is why I do not want to use sin() from math.h in CCS.
The max. deviation between the sine calculated with the Excel sin() function and the sine from the _Q15sinPI function in the PIC is 0.00759%, which I find very good, and I do not know what algorithm was implemented to get that.
The polynomial approach proposed by Ttelmah gave me 0.111% max. deviation compared with the sine calculated with the Excel sin() function.
I had not heard of Cordic before and was curious to try it and see what precision it offers for different numbers of iterations.
Maybe there are also other polynomial approaches, of higher order and with better precision, as described here for example: http://www.coranac.com/2009/07/sines/ , but I may try that later.
What I am trying to do is to find out and learn, what would be the best approach, approximation, subroutine, which can offer similar accuracy and speed as _Q15sinPI from XC16 Microchip.
If I use _Q15sinPI from XC16 as the reference, then why do I not stick with XC16 to the end?
Because CCS offers better, nicer subroutines for other functions, especially easy communications setup.
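For readers unfamiliar with the notation, here is a small sketch of what a Q15 number is, assuming the usual convention of one sign bit and 15 fractional bits, so that the stored 16-bit integer is the real value times 32768. The helper names are hypothetical and are not part of libq or CCS.

```c
#include <stdint.h>

/* Convert a real value to Q15 (value * 32768), rounding to nearest and
   saturating at the ends of the int16 range. */
int16_t double_to_q15(double v)
{
    double scaled = v * 32768.0;
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;   /* round half away from zero */
    if (scaled >= 32767.0)  return INT16_MAX; /* +1.0 is not representable */
    if (scaled <= -32768.0) return INT16_MIN;
    return (int16_t)scaled;                   /* cast drops the fraction */
}

/* Convert a Q15 value back to a real number in [-1.0, 1.0). */
double q15_to_double(int16_t q)
{
    return (double)q / 32768.0;
}
```

One step of such a format is 1/32768, about 0.003% of full scale, which gives a sense of scale for the deviations quoted above.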

temtronic · Posted: Tue Jun 13, 2017 12:19 pm

re:
What I am trying to do is to find out and learn, what would be the best approach, approximation, subroutine, which can offer similar accuracy and speed as _Q15sinPI from XC16 Microchip.

If you're dead set on using _Q15sinPI as your 'best' implementation, then simply convert that code into the equivalent in CCS C.

All that's required is the listing of the 'best', then cut code, compile, compare the CCS C listing to the uChip XC16 listing.

This is not hard to do, maybe an hour or two, depending on how well you type..

Jay

viki2000 · Joined: 08 May 2013 Posts: 233

_Q15sinPI is contained in a library, "libq-elf.a", which I cannot see/open. So the only option left is to compile the XC16 project and look into the .lst assembly code, but I had hoped for a C subroutine.

Coming back to the proposed cordic, I am puzzled by the fact that these guys used the same cordic code and got good results:
http://www.stm32duino.com/viewtopic.php?t=1510
I noticed the constant half_pi is defined as "#define half_pi 0x00006487" but not used.
I started debugging the proposed cordic subroutine in an unconventional way.
I "exploded" the subroutine, expanding the iterations into Excel cells. There are errors in calculating the variable "tz" due to the same "d" sign variable. I will try to dig more, but any suggestion is appreciated.
Here is the Excel file with the "exploded" cordic subroutine:
https://goo.gl/mxYWQz

temtronic · Posted: Tue Jun 13, 2017 2:46 pm

Actually, assembly is not that hard to learn. Heck, there are fewer instructions than PICs have 'fuses' these days !!

To get the 'best' performance, you should use assembly anyway. When you become a 'low level' programmer you have more control over how the code is generated. Depending on who wrote the compiler, it might be optimised for speed, memory use or ??? !

Jay

Ttelmah · Joined: 11 Mar 2010 Posts: 19215

Remember this is an int16 representation of a number with decimals: an integer holding 16384 * the value.

So 0.5 PI = 1.5707963
1.5707963 * 16384 = 25735.927

25735 in hex is 0x6487

However, if they were actually using this constant, 0x6488 would really be the more accurate conversion (rounded rather than truncated).

The reason they have this defined is that it is the range limit for the function. Remember there is just one bit in front of the decimal point, so the function is rated to work between +PI/2 and -PI/2.
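The arithmetic above can be checked with a short sketch. The two helper names are made up for illustration; they only show the difference between truncating and rounding when scaling by 16384.

```c
#include <stdint.h>

#define Q14_SCALE 16384.0
#define HALF_PI   1.5707963267948966

/* Truncating conversion: the fractional part of v * 16384 is dropped. */
int16_t to_q14_trunc(double v)
{
    return (int16_t)(v * Q14_SCALE);
}

/* Rounding conversion (non-negative v only): add half an LSB, then drop
   the fraction. */
int16_t to_q14_round(double v)
{
    return (int16_t)(v * Q14_SCALE + 0.5);
}
```

With these, to_q14_trunc(HALF_PI) gives 25735 (0x6487) and to_q14_round(HALF_PI) gives 25736 (0x6488), since 1.5707963 * 16384 = 25735.927.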

viki2000 · Joined: 08 May 2013 Posts: 233

I will come back to these numbers later, but first the idea of looking at the listing file, the assembly of the XC16 project that uses _Q15sinPI.
MPLABX with XC16 provides a listing.disasm view/file after compilation, with the following content:
https://goo.gl/9n9zMu
So I see nothing except “RCALL __Q15sinPI”.
Usually the intermediate files are deleted, but if I set MPLABX + XC16 to generate a .lst file as suggested here: https://www.eevblog.com/forum/microcontrollers/mplabx-xc16-generate-assembly-listing-file-coff-elf-agnostic/
then I get the following:
https://goo.gl/voce0R
Here I can see _Q15sinPI in assembly. The call above is “rcall 0x2cc <__Q15sinPI>” and the function is this:

viki2000 · Joined: 08 May 2013 Posts: 233

I found the errors in the Excel sheet that I tried to use to emulate the cordic subroutine. Some of them were related to cell references and some to my confusion about the “^” symbol, which in C is bitwise XOR and in Excel is x to the power of y. Besides that, I used Excel 2013, which has a bitwise XOR function, and also Excel 2010, which does not, so I had to implement it in VBA. Now I work with an Excel 2010 macro-enabled workbook and the values are fine in the emulation sheet.
The y column gives the sine value and the x column gives the cosine value after k=10 iterations.
I used 50 steps for the angle between 0..PI/2 and calculated the first 5 angles and the last 5 angles, with 10 iterations for each angle.
The file is here: https://goo.gl/7HjqF1
Then I compared with the PIC computations and saw differences, errors on the PIC side.

I need your help to understand what I am overlooking in the PIC, and why I get the following simple computational error.
I have a programmer that I can also use as a debugger: the ICD-U64 from CCS.
I compiled the program and in the Watches list I look at the variables inside the cordic subroutine.
I start with angle 0 and everything matches the Excel calculation up to the 4th iteration (k=3), but when I reach the 5th iteration (k=4) inside the cordic subroutine, I get a calculation error for the tx variable.
Excel tells me that I should get 16372.
The Watches list tells me that tx is 12274.
It is supposed to be the same number or very close.
Then I go to the Eval tab and write “tx” without the quotes and get 12274, and then I write the expression for tx, which is “x-(((y>>k)^d)-d)”, and get 16370, which is acceptable, close to Excel.
Why do “tx” and its equivalent expression “x-(((y>>k)^d)-d)” give different results? How is that possible? It should be the same number. What is happening? What is wrong with the PIC? Or am I overlooking something that I am supposed to know?
There is a big computational difference between 16370 and the wrong value 12274, and I see it in “real time”. The code used is the same as above, except that I used the “dummy” variable to have a clearly defined breakpoint between computations.
How do we approach this kind of problem?
See for yourself:

Ttelmah · Joined: 11 Mar 2010 Posts: 19215

OK.
Now you've still not actually explained why you are so worried about accuracy. Problem is that the half wave waveform you are working with is going to give inherently large inaccuracies at the transition point between the half cycles, unless you have hardware that can handle very high frequencies indeed. The half wave involves having all the even harmonics, with some terms at frequencies way above the fundamental, still having large components. As such the inherent errors from trying to synthesise this are going to be much larger than the errors from the maths....

Which is again back to 'much easier to synthesise a pure sin, and rectify it'.

Now the point about the polynomial synthesis is that it is quick and gives results that are 'well behaved', so giving smooth terms and covering the whole sinusoid. The integer cordic form can be made to give quite good accuracies, but only by going to high numbers of terms, which then slows it down. So we are back to 'why not just use a lookup'...

You said you didn't want to fiddle with having to 'extend' the lookup, i.e. using one table to give all four quadrants, but this has to be done for the cordic synthesis as well (in fact it has to be done for all sinusoidal synthesis algorithms...).

The problem you are having with cordic is because of the 'implementation specific' nature of the >> operator. If you read the C textbooks you will see that for negative signed values some compilers implement this as a logical shift right, while others implement it as a 'mathematical' (arithmetic) shift right, so handling the sign for -ve numbers.
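The difference can be shown with a minimal sketch. The helper names are made up and this is not the routine posted in the thread. The values x = 16321 and y = -775 used below are an assumption, obtained by tracing the usual cordic iteration by hand from angle 0 with a start value of 0x26DD; with them, the two kinds of shift reproduce both numbers reported above for k = 4.

```c
#include <stdint.h>

/* Logical shift right: zeros enter from the left, so a negative value
   turns into a large positive one. */
int16_t shr_logical(int16_t v, unsigned k)
{
    return (int16_t)((uint16_t)v >> k);
}

/* Arithmetic shift right: the sign is preserved. Written so that only
   non-negative values are ever shifted, which makes the result the same
   on every compiler. */
int16_t shr_arith(int16_t v, unsigned k)
{
    return (v < 0) ? (int16_t)~((~v) >> k) : (int16_t)(v >> k);
}
```

For y = -775 and k = 4, shr_arith gives -49, so x - (y >> k) = 16321 + 49 = 16370, while shr_logical gives 4047, so the same expression becomes 16321 - 4047 = 12274. The two results differ by exactly 4096, which is 65536 >> 4.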

So, the 16-bit cordic, coded round this:

temtronic · Posted: Wed Jun 14, 2017 7:06 am

I get the impression that the OP is a 'numbers' guy who needs to find the 'best' solution, when in reality several 'very good - will work fine in the Real World' solutions have been presented.
Everyone who has been using PICs for any time (some of us a 1/4 century) KNOWS that PICs were never designed for floating point calculations. Ideally you'd use another processor or even an FP chip like the PC did and still does.
Makes me wonder if anyone has interfaced an FP chip to a PIC ? Now that would be good for a thesis... 'time/accuracy comparisons of using FP chip with PIC'.

Jay

Ttelmah · Joined: 11 Mar 2010 Posts: 19215

There was a project done with an FP processor years ago, but I don't think the processor used is made any more. However, one company sells a PIC18 programmed with a maths library, for use as a 'subsidiary' maths processor on one particular PIC system. The idea is you offload the maths to this, while your main processor gets on with doing other things. I think Sparkfun still sell this. This though is slow...

The other comment made early on about the serial still applies. As it stands if a bit gets lost, the whole communication could become screwed. If this is what he intends to use in the end, then it needs to be 'rethought'....

viki2000 · Joined: 08 May 2013 Posts: 233

If I understand correctly, the error came from the difference between an arithmetic shift and a logical shift.
https://en.wikipedia.org/wiki/Arithmetic_shift
https://en.wikipedia.org/wiki/Logical_shift

In Excel I used the bitwise XOR function defined in VBA for “^” and a power of 2 for the “>>” shift symbol, which behaves as an arithmetic shift, exactly as the author intended in his original 32-bit and 16-bit code.
In the CCS code the “>>” symbol was interpreted as a logical shift instead of an arithmetic shift.
Was that the problem?
A few more explanations on the subject, which I had not realized was the cause of the problem:
http://programmedlessons.org/AssemblyTutorial/Chapter-14/ass14_13.html
https://www.youtube.com/watch?v=nSKT6Ph8u9Q

I have tested your last proposed code.
The results are here: https://goo.gl/lgnPxX

The error compared with the Excel sin() is 0.0161%, much better than the proposed polynomial code with its 0.111% error.

P.S. I try to learn different approaches and understand the differences between them.

Ttelmah · Joined: 11 Mar 2010 Posts: 19215

It takes twice as long to run though....

viki2000 · Joined: 08 May 2013 Posts: 233

I used the simple serial port communication in a continuous loop only for test purposes, to send the data to the PC for analysis, not for the final implementation.
A bit more info about arithmetic shift right, because it gave me that headache.
Microchip gives the following explanation for their C compilers:
http://microchipdeveloper.com/tls2101:shift-operators
then another discussion on the subject with a slightly different implementation:
http://www.microchip.com/forums/m98041.aspx
similar to the rotate right from here: https://en.wikibooks.org/wiki/C_Programming/Simple_math
One more discussion, from 2012, about similar trouble between XC8 (HITECH C) and MPLAB C18:
http://www.microchip.com/forums/m677639.aspx
“When right shifting a signed integer, HITECH C does sign extension and MPLAB C18 does not. XC8 is based on HITECH C, so it will propagate the sign bit.”

I learned that we always have to pay attention to how the C compiler implements shift right.

If I use your math shift right subroutine, then of course the original code works.