# Assembly math: 6502 8 bit fast multiply routine. 16 bit multiply without bit shifting.

On my last post I had presented an 8 bit fast multiply routine. Actually, it was limited to a narrower range of numbers, as it was custom designed for a certain program. Now, we are going to present a general 8 bit fast multiply routine.

; fast multiply 8 bit

multiply_ab_fast

lda #>square_low
sta mod1+2
lda #>square_high
sta mod2+2

clc
lda multiplicand
bcc skip_inc

inc mod1+2
inc mod2+2

skip_inc        tax

sec
lda multiplicand
sbc multiplier
bcs no_diff_fix

sec
lda multiplier
sbc multiplicand

no_diff_fix
tay

sec
mod1            lda square_low,x
sbc square_low,y
sta result+1

mod2            lda square_high, x
sbc square_high, y
sta result

rts


This routine accepts 8 bit numbers in the full range 0…255 (the previous routine was only capable of multiplying numbers in the range 0…127 for one factor, 0…90 for the other one).

This time, square tables must be 16 bit. Thus, the code has been modified so that 16 bit indexing is made possible. This is only required when looking for the term ((a+b)^2)/4, as the unsigned difference |a-b| will always remain in the 8 bit range.

To look for the term ((a+b)^2)/4, the sum a+b is used as an index, ranging from 0 to 510. That sums up to 511 different values. The x index is used along with self-modifying code to reach this range.

I have tested this routine with a BASIC program and with some assembly language programs and it seems to work fine.

The following Commodore 64 code can be assembled with DASM.  Decimal addresses of parameters used are available in the comments, so that it is possible to test this routine. The code also includes the tables of squares (a 1022 bytes table is needed here).

### Performing a 16 bit multiply using an 8 bit multiply routine

Extending the above routine to the 16 bit range is not that easy, as a huge table would be required, eating up all of the memory available. But, we can use the above routine as a part of a 16 bit multiply routine, without the need of any bit-shifting instruction (ASL, ROL, etc.). The result will not be a fast 16 bit routine, but it’s just something that feels quite unusual. However, this 16 bit routine can be adjusted on particular situations, maybe offering better performances than a standard bit-shifting 16 bit multiply routine.

### The principle

Given two 16 bit numbers A and B, it is possible to write them by using their high bytes and low bytes:

A = AH * 256 + AL
B = BH * 256 + BL

with:

AH = INT(A/256)
AL = A-256*AH

BH = INT(B/256)
BL = B-256*BH

Using those expressions, the product A * B can be written as:

A * B = (AH * 256 + AL) * (BH * 256 + BL)

It can be shown that the above equation can be rewritten as:

A * B = 65536*(AH*BH)+256(AH*BL+AL*BH)+AL*BL

So, the product of the two 16 bit numbers A and B can be written as a linear combination of four products, with factors made up of the low bytes and the high bytes of the original 16 bit factors.

This seems to be pretty useless for practical applications. But, this approach does make it possible to use the above 8 bit routine for 16 bit multiplications.

A large number like 65536 seems intimidating, but no multiplication is actually required. Multiplying by 65536 is obtained by two byte shifts. That means, we can multiply by 65536 by just re-arranging the bytes of the number to be multiplied. Likewise, multiplying by 256 it’s just a one byte shift. So, no slow bit-shifting instructions are really required to perform the above multiplications by constants. Those constants are just powers of two, that’s why life is getting so easier here.

### The code

The following code hopefully demonstrates  the above approach. The generic 16 bit routine seems to reach the very same speed of a classic 16 bit multiply routine – with a much longer code, so that’s not a deal. Still, a few testing code can improve its efficiency.

For instance, if we happen to multiply two 8 bit numbers, we can avoid the product 65536*(AH*BH) by testing for zero the high bytes of the 16 bit factors.

Furthermore, if either BL or AL are 0, the product 256*(AH*BL+AL*BH) can be computed faster, as it is possible to avoid at least one call to the 8 bit multiply routine.

Finally, if we only need the two lower bytes of the result (that is, we only need a 16 bit multiplication with 16 bit result), the term 65536*(AH*BH) can be simply ignored.

Source code (16 bit multiply using 8 bit multiply routine)

Prg file

Source code (16 bit multiply with 8 bit routine, optimized, zero check version)

Prg file

The following screenshots show a couple of tests. A simple BASIC program can be written, so that tests are faster and easier.

The following programs (we already know them) have been modified to include the above routines.