If we need a fast square of an 8 bit number, we just have to create a simple table of squares. As the square of 255 is 65025, we only need 16 bit values for the table. But, if we have to evaluate the square of a 16 bit number, we can’t just use tables, since values would be too many. Is there any way of computing the square of a 16 bit number without resorting on a 16 bit multiplication?

It’s easy to see that a 16 bit number A can be expressed using its high byte and low byte:

A = AH * 256 + AL

where AH is the high byte, AL is the low byte.

So:

A^2 = (AH * 256 + AL)^2

It can be shown that:

(AH * 256 + AL)^2 = 65536 * AH^2 + 512 * AH * AL + AL^2

Thus:

A^2 = 65536 * AH^2 + 256 * 2 * AH * AL + AL^2

Please note that multiplying by 65536 just requires a two bytes shift. And, multiplying by 256 just requires a one byte shift. Those shifts only require rearranging the bytes, no slow bit shifting instructions are needed. The squares of AH and AL can be taken simply by using a table of squares. The product AH * AL is an 8 bit multiplication that can be done using this fast routine.

So, we only have to multiply by two (using a simple ASL : ROL). For the rest, we just don’t need any bit-shifting multiplication.

The following C64 BASIC program demonstrates this simple technique.

```
10 rem 16 bit square with 8 bit squares and 8 bit multiply
15 print:input"16 bit number";a
20 if a<0 or a>65535 then print:goto 15
25 ah=int(a/256):al=a-ah*256
30 r1 = 65536*ah*ah+256*2*ah*al+al*al
35 r2 = a*a
40 print "routine result: ";r1
45 print "basic result: ";r2
48 if r1<>r2 then print"error!"
50 goto 15
```

The 6502 assembly version follows.

**DOWNLOAD: 16 bit square (source code)**

Code can be greatly optimized, but it is provided in this very essential form just to demonstrate the algorithm. Possible improvements include testing AL or AH for zero values. This will allow to avoid computing some terms if not needed.

The following BASIC program can be used to test the routine:

```
10 rem 16 bit fast square test
12 fora=0to65535
20 ah=int(a/256):al=a-ah*256
25 poke700,ah:poke701,al
30 sys8192
35 r1=peek(9992)*16777216+peek(9993)*65536+peek(9994)*256+peek(9995)
40 r2=a*a
42 if r1<>r2 then stop
45 print r1;r2
50 nexta
```

This test will require some time. Using VICE in warp mode will help.

To do the test, simply load the machine language program, type NEW, then paste the BASIC test program (using VICE). Remember to press the return key to store line 50.

To test the speed of the routine, I have coded a simple program that draws eight parabolic segments. For these tests, it appears that computing 16 bit squares by using this routine is faster than using a standard 16 bit bit-shifting multiply.

**DOWNLOAD: parabolic segments using 16 bit squares routine (prg file)**

**DOWNLOAD: parabolic segments using 16 bit squares routine (source code)**

**DOWNLOAD: parabolic segments using 16 bit bit-shifting multiply (prg file)**

**DOWNLOAD: parabolic segments using 16 bit bit-shifting multiply (source code)**

The program using the 16 bit squares routine discussed above takes 1.8 seconds. The program using the standard 16 bit multiply routine takes 2.3 seconds instead. Of course, if we have to draw more segments, the speed advantage obtained becomes greater.

The above programs have been coded to be inefficient, as they are intended as benchmarks. For the sake of clarity, the following program draws the above picture faster.

**DOWNLOAD: faster parabolic segments (prg)**

**DOWNLOAD: faster parabolic segments (source code)**

This one is more techy article then usual, but still great.

I really enjoy reading and playing with stuff & knowledge you’re sharing.

Keep on with great work!

Best regards…

Many thanks Crispin! And thanks for your e-mail too. Have fun with your C64!