If we need a fast square of an 8 bit number, we just have to create a simple table of squares. As the square of 255 is 65025, we only need 16 bit values for the table. But, if we have to evaluate the square of a 16 bit number, we can’t just use tables, since values would be too many. Is there any way of computing the square of a 16 bit number without resorting on a 16 bit multiplication?
It’s easy to see that a 16 bit number A can be expressed using its high byte and low byte:
A = AH * 256 + AL
where AH is the high byte, AL is the low byte.
A^2 = (AH * 256 + AL)^2
It can be shown that:
(AH * 256 + AL)^2 = 65536 * AH^2 + 512 * AH * AL + AL^2
A^2 = 65536 * AH^2 + 256 * 2 * AH * AL + AL^2
Please note that multiplying by 65536 just requires a two bytes shift. And, multiplying by 256 just requires a one byte shift. Those shifts only require rearranging the bytes, no slow bit shifting instructions are needed. The squares of AH and AL can be taken simply by using a table of squares. The product AH * AL is an 8 bit multiplication that can be done using this fast routine.
So, we only have to multiply by two (using a simple ASL : ROL). For the rest, we just don’t need any bit-shifting multiplication.
The following C64 BASIC program demonstrates this simple technique.
10 rem 16 bit square with 8 bit squares and 8 bit multiply 15 print:input"16 bit number";a 20 if a<0 or a>65535 then print:goto 15 25 ah=int(a/256):al=a-ah*256 30 r1 = 65536*ah*ah+256*2*ah*al+al*al 35 r2 = a*a 40 print "routine result: ";r1 45 print "basic result: ";r2 48 if r1<>r2 then print"error!" 50 goto 15
The 6502 assembly version follows.
Code can be greatly optimized, but it is provided in this very essential form just to demonstrate the algorithm. Possible improvements include testing AL or AH for zero values. This will allow to avoid computing some terms if not needed.
The following BASIC program can be used to test the routine:
10 rem 16 bit fast square test 12 fora=0to65535 20 ah=int(a/256):al=a-ah*256 25 poke700,ah:poke701,al 30 sys8192 35 r1=peek(9992)*16777216+peek(9993)*65536+peek(9994)*256+peek(9995) 40 r2=a*a 42 if r1<>r2 then stop 45 print r1;r2 50 nexta
This test will require some time. Using VICE in warp mode will help.
To do the test, simply load the machine language program, type NEW, then paste the BASIC test program (using VICE). Remember to press the return key to store line 50.
To test the speed of the routine, I have coded a simple program that draws eight parabolic segments. For these tests, it appears that computing 16 bit squares by using this routine is faster than using a standard 16 bit bit-shifting multiply.
The program using the 16 bit squares routine discussed above takes 1.8 seconds. The program using the standard 16 bit multiply routine takes 2.3 seconds instead. Of course, if we have to draw more segments, the speed advantage obtained becomes greater.
The above programs have been coded to be inefficient, as they are intended as benchmarks. For the sake of clarity, the following program draws the above picture faster.