On my last post I had presented an 8 bit fast multiply routine. Actually, it was limited to a narrower range of numbers, as it was custom designed for a certain program. Now, we are going to present a general 8 bit fast multiply routine.
; fast multiply 8 bit multiply_ab_fast lda #>square_low sta mod1+2 lda #>square_high sta mod2+2 clc lda multiplicand adc multiplier bcc skip_inc inc mod1+2 inc mod2+2 skip_inc tax sec lda multiplicand sbc multiplier bcs no_diff_fix sec lda multiplier sbc multiplicand no_diff_fix tay sec mod1 lda square_low,x sbc square_low,y sta result+1 mod2 lda square_high, x sbc square_high, y sta result rts
This routine accepts 8 bit numbers in the full range 0…255 (the previous routine was only capable of multiplying numbers in the range 0…127 for one factor, 0…90 for the other one).
This time, square tables must be 16 bit. Thus, the code has been modified so that 16 bit indexing is made possible. This is only required when looking for the term ((a+b)^2)/4, as the unsigned difference |a-b| will always remain in the 8 bit range.
To look for the term ((a+b)^2)/4, the sum a+b is used as an index, ranging from 0 to 510. That sums up to 511 different values. The x index is used along with self-modifying code to reach this range.
I have tested this routine with a BASIC program and with some assembly language programs and it seems to work fine.
The following Commodore 64 code can be assembled with DASM. Decimal addresses of parameters used are available in the comments, so that it is possible to test this routine. The code also includes the tables of squares (a 1022 bytes table is needed here).
Performing a 16 bit multiply using an 8 bit multiply routine
Extending the above routine to the 16 bit range is not that easy, as a huge table would be required, eating up all of the memory available. But, we can use the above routine as a part of a 16 bit multiply routine, without the need of any bit-shifting instruction (ASL, ROL, etc.). The result will not be a fast 16 bit routine, but it’s just something that feels quite unusual. However, this 16 bit routine can be adjusted on particular situations, maybe offering better performances than a standard bit-shifting 16 bit multiply routine.
Given two 16 bit numbers A and B, it is possible to write them by using their high bytes and low bytes:
A = AH * 256 + AL B = BH * 256 + BL with: AH = INT(A/256) AL = A-256*AH BH = INT(B/256) BL = B-256*BH
Using those expressions, the product A * B can be written as:
A * B = (AH * 256 + AL) * (BH * 256 + BL)
It can be shown that the above equation can be rewritten as:
A * B = 65536*(AH*BH)+256(AH*BL+AL*BH)+AL*BL
So, the product of the two 16 bit numbers A and B can be written as a linear combination of four products, with factors made up of the low bytes and the high bytes of the original 16 bit factors.
This seems to be pretty useless for practical applications. But, this approach does make it possible to use the above 8 bit routine for 16 bit multiplications.
A large number like 65536 seems intimidating, but no multiplication is actually required. Multiplying by 65536 is obtained by two byte shifts. That means, we can multiply by 65536 by just re-arranging the bytes of the number to be multiplied. Likewise, multiplying by 256 it’s just a one byte shift. So, no slow bit-shifting instructions are really required to perform the above multiplications by constants. Those constants are just powers of two, that’s why life is getting so easier here.
The following code hopefully demonstrates the above approach. The generic 16 bit routine seems to reach the very same speed of a classic 16 bit multiply routine – with a much longer code, so that’s not a deal. Still, a few testing code can improve its efficiency.
For instance, if we happen to multiply two 8 bit numbers, we can avoid the product 65536*(AH*BH) by testing for zero the high bytes of the 16 bit factors.
Furthermore, if either BL or AL are 0, the product 256*(AH*BL+AL*BH) can be computed faster, as it is possible to avoid at least one call to the 8 bit multiply routine.
Finally, if we only need the two lower bytes of the result (that is, we only need a 16 bit multiplication with 16 bit result), the term 65536*(AH*BH) can be simply ignored.
The following screenshots show a couple of tests. A simple BASIC program can be written, so that tests are faster and easier.
The following programs (we already know them) have been modified to include the above routines.
The fractal program actually makes use of an optimized version, taking into account the fact that the two higher bytes of the 16 bit product are not needed. In this particular case, we actually get a fast 16 bit multiply routine outperforming its classic 16 bit bit-shifting counterpart. The fact that most of the time the factors are actually limited to the 8 bit range also helps.
Instead, in the drawing lines program, we get the very same performance as a standard bit-shifting multiply routine.