Fixed point math can be quite useful for fast computations involving decimal numbers. We have seen some examples in this blog: the line drawing program, the mandelbrot set program and the “Sphere” program. I have had a look at the code of the program Sphere and I have found some ways of optimizing it.
Trigonometric functions values are stored in tables using an integer offset of 128. After multiplication is performed to compute the terms R*sin(angle) and R*cos(angle), the result is divided by the offset.
But, is there a way to quickly divide by 128 in 6502 assembly language? The standard way consists of performing the following instructions:
lsr multiplier ror sum lsr multiplier ror sum lsr multiplier ror sum lsr multiplier ror sum lsr multiplier ror sum lsr multiplier ror sum lsr multiplier ror sum
That simply means dividing by two several times, until dividing by 128 is achieved. The dividend is a 16 bit number. Its high byte is multiplier, its low byte is sum.
There is a faster way however. Instead of dividing by 128, we can divide by 256, then multiply by two. This may seem strange, as dividing by 256 theoretically requires another
lsr/ror operation. But, there is a very fast way of dividing by 256, as the following code shows (code taken from my lines drawing program).
lda sum sta sum+1 lda multiplier+1 sta sum lda multiplier sta multiplier+1 lda #$00 sta multiplier
This piece of code divides by 256 a 32 bit number, made up of the bytes multiplier, multiplier+1, sum, sum+1 (in higher bytes… lower bytes order). As it’s easy too see, dividing by 256 is an 8 bit shift. And an 8 bit shift, in turn, it’s just a byte shift. So, the lowest byte of the dividend is discarded and replaced by the higher byte near it. The process is repeated for the other bytes. The highest byte is replaced by a 0, since an 8 bit shift would leave it empty.
This simple trick actually allows for very fast fixed point math, using an offset value of 256 which can assure high accuracy for most applications.
As we need to divide by 128 in the Sphere program, we have to adapt the principle from the above code. Right at the end of the multiply subroutine of this program, I added the following code (multiplier is the high byte of the 16 bit number to be divided by offset, sum is the low byte):
lda sum sta shift lda multiplier sta sum lda #$00 sta multiplier ;/256 asl shift rol sum rol multiplier ;*2 => /256 * 2 = /128
The code performs a division by 256 followed by a multiplication by 2. That is just the same as dividing by 128, but we gain a speed advantage.
Another byte is introduced (called shift). Before performing the division, we store the lowest byte of the dividend in shift. It is necessary, as otherwise we would miss some digits and loose accuracy.
After division is performed, multiplication by two takes place. With the instruction asl shift, bit 7 of the original low byte is recovered, so that it can be used for the multiplication by two. This way, that digit is preserved and the multiplication is accurate. Asl shift is meant to put this bit in the carry, then rol sum makes use of this bit to compute the multiplication.
Since we have an offset of 128, we can modify the above code so that it is slightly faster. As we don’t really need to shift the byte shift, we can directly store the bit we need in the carry, then retrieve that bit from it.
asl sum ;only one bit must be stored with offset = 128 ;store old bit 7 of sum in the carry lda multiplier sta sum lda #$00 sta multiplier ;/256 rol sum ;the old bit 7 of sum is on the carry rol multiplier ;*2 => /256 * 2 = /128 ;cycles for offset divide with offset 128: 32 ;normal divide by 128 cycles: 84
We must take bit 7 from the low byte before the division takes place to preserve that digit. asl sum just puts that bit in the carry. After the division, rol sum just makes use of that old bit to rol the new value of sum (the value it has after the division), thus allowing to use the old digit in the multiplication, with no accuracy loss. The cycle count (which I hope is right) actually shows the great speed advantage that we can achieve be using this technique.
This way we get a slightly faster program, that completes in 0,5 seconds less than the previous version. That speed advantage, although trivial, does show off the effectiveness of the above trick. Still, since I wanted to increase the speed of the program by a bigger amount, I had to work out something else.
A faster 8-bit multiply
This nice page provides useful information on implementing a fast 8 bit multiply routine. Since the 6502 lacks instructions for multiplications and it lacks internal 16 bit registers as well, if we need a great amount of speed multiplications must be replaced by additions and subtractions somehow.
It can be shown that:
a* b = ((a + b)^2)/4 - ((a - b)^2)/4
That means we can compute the product a*b by doing squares, additions and subtractions. And since squares can be performed by using tables, using this expression we can clearly create faster code.
If we could keep numbers low, squares values could just be hold on one single byte. That way, our new 8-bit multiply routine should be quite fast.
The products to be performed on the Sphere program are:
tc * r1 ts * r2
tc and ts quantities both range from 0 to 127. Both r1 and r2 maximum value is 90.
So, we can just accommodate our multiply routine for these values.
Let’s have a look at the terms in the expression above. Let’s put for example a = tc and b = r1.
a+b max value is 127+90, that is 217. Min value is 0.
|a-b| max value is 127. Please note that we need the module of a-b.
So, we just need a table with the squares of the numbers from 0 to 217, so 218 different values.
The program Sphere uses an integer offset of 128. Since after multiplication the result must be divided by 128, according this integer math squares should be evaluated as follows:
S = n^2/(4 * offset) = n^2/(4*128) = n^2/512
So, we could just put these values on the tables, with no need to perform the division by 512. That approach is correct, but we have an accuracy loss. As a result, the sphere will have empty points.
The fix is easy: we just have to put on the tables the squares divided by 256, then divide the product by two in the code. Now the question is: will those values fit in one byte only? Let’s check.
217^2/256 = 183 (integer value)
So, the answer is yes, and the multiply routine can use 8 bit values in the tables, resulting in faster code than by using 16 bit values.
The fast multiply routine I have come up with follows.
; fast multiply (max 127 * max 90) multiply_ab_fast clc lda multiplicand adc multiplier tax sec lda multiplicand sbc multiplier bpl no_comp_dif eor #$ff clc adc #1 no_comp_dif tay sec lda square,x sbc square,y sta result rts
It just stores the values (a+b) and |a-b| in the index registers, then loads the squares and subtracts them, obtaining the result. In the code that will follow, it is called “sum” as a hangover from previous program versions, but of course it is the result of the product.
The following is the fast version of the Sphere program. The sphere now takes 4,7 seconds to be drawn. The older version of the program took 6,7 seconds instead.
The image generated by this faster program seems even better than the previous version, as the sphere has a better round shape.