Optimizing the assembly program "Sphere" - More tips on fixed point math. A fast 8 bit multiply routine.

Fixed point math can be quite useful for fast computations involving decimal numbers. We have seen some examples in this blog: the line drawing program, the mandelbrot set program and the “Sphere” program. I have had a look at the code of the program Sphere and I have found some ways of optimizing it.

Trigonometric functions values are stored in tables using an integer offset of 128. After multiplication is performed to compute the terms R*sin(angle) and R*cos(angle), the result is divided by the offset.

But, is there a way to quickly divide by 128 in 6502 assembly language? The standard way consists of performing the following instructions:

 lsr multiplier
 ror sum
 lsr multiplier
 ror sum
 lsr multiplier
 ror sum
 lsr multiplier
 ror sum
 lsr multiplier
 ror sum
 lsr multiplier
 ror sum
 lsr multiplier
 ror sum

That simply means dividing by two several times, until dividing by 128 is achieved. The dividend is a 16 bit number. Its high byte is multiplier, its low byte is sum.

There is a faster way however. Instead of dividing by 128, we can divide by 256, then multiply by two. This may seem strange, as dividing by 256 theoretically requires another lsr/ror operation. But, there is a very fast way of dividing by 256, as the following code shows (code taken from my lines drawing program).

		lda sum
		sta sum+1
		
		lda multiplier+1
		sta sum
		
		lda multiplier
		sta multiplier+1
		
		lda #$00
		sta multiplier

This piece of code divides by 256 a 32 bit number, made up of the bytes multiplier, multiplier+1, sum, sum+1 (in higher bytes… lower bytes order). As it’s easy too see, dividing by 256 is an 8 bit shift. And an 8 bit shift, in turn, it’s just a byte shift. So, the lowest byte of the dividend is discarded and replaced by the higher byte near it. The process is repeated for the other bytes. The highest byte is replaced by a 0, since an 8 bit shift would leave it empty.

This simple trick actually allows for very fast fixed point math, using an offset value of 256 which can assure high accuracy for most applications.

As we need to divide by 128 in the Sphere program, we have to adapt the principle from the above code. Right at the end of the multiply subroutine of this program, I added the following code (multiplier is the high byte of the 16 bit number to be divided by offset, sum is the low byte):

                lda     sum
                sta     shift
                lda     multiplier
                sta     sum
                lda     #$00
                sta     multiplier      ;/256
                
                asl     shift
                rol     sum
                rol     multiplier      ;*2 => /256 * 2 = /128

The code performs a division by 256 followed by a multiplication by 2. That is just the same as dividing by 128, but we gain a speed advantage.

Another byte is introduced (called shift). Before performing the division, we store the lowest byte of the dividend in shift. It is necessary, as otherwise we would miss some digits and loose accuracy.

After division is performed, multiplication by two takes place. With the instruction asl shift, bit 7 of the original low byte is recovered, so that it can be used for the multiplication by two. This way, that digit is preserved and the multiplication is accurate. Asl shift is meant to put this bit in the carry, then rol sum makes use of this bit to compute the multiplication.

Since we have an offset of 128, we can modify the above code so that it is slightly faster. As we don’t really need to shift the byte shift, we can directly store the bit we need in the carry, then retrieve that bit from it.

                asl     sum             ;only one bit must be stored with offset = 128 
                                        ;store old bit 7 of sum in the carry
                lda     multiplier
                sta     sum
                lda     #$00
                sta     multiplier      ;/256
                
                rol     sum             ;the old bit 7 of sum is on the carry
                rol     multiplier      ;*2 => /256 * 2 = /128
                
                                        ;cycles for offset divide with offset 128: 32
                                        ;normal divide by 128 cycles: 84

We must take bit 7 from the low byte before the division takes place to preserve that digit. asl sum just puts that bit in the carry. After the division, rol sum just makes use of that old bit to rol the new value of sum (the value it has after the division), thus allowing to use the old digit in the multiplication, with no accuracy loss. The cycle count (which I hope is right) actually shows the great speed advantage that we can achieve be using this technique.

DOWNLOAD: Sphere program with offset divide trick (prg file)

DOWNLOAD: Sphere program with offset divide trick (source code)

This way we get a slightly faster program, that completes in 0,5 seconds less than the previous version. That speed advantage, although trivial, does show off the effectiveness of the above trick. Still, since I wanted to increase the speed of the program by a bigger amount, I had to work out something else.

A faster 8-bit multiply

This nice page provides useful information on implementing a fast 8 bit multiply routine. Since the 6502 lacks instructions for multiplications and it lacks internal 16 bit registers as well, if we need a great amount of speed multiplications must be replaced by additions and subtractions somehow.

It can be shown that:

a* b = ((a + b)^2)/4 - ((a - b)^2)/4

That means we can compute the product a*b by doing squares, additions and subtractions. And since squares can be performed by using tables, using this expression we can clearly create faster code.

If we could keep numbers low, squares values could just be hold on one single byte. That way, our new 8-bit multiply routine should be quite fast.

The products to be performed on the Sphere program are:

tc * r1
ts * r2

tc and ts quantities both range from 0 to 127. Both r1 and r2 maximum value is 90.

So, we can just accommodate our multiply routine for these values.

Let’s have a look at the terms in the expression above. Let’s put for example a = tc and b = r1.

a+b max value is 127+90, that is 217. Min value is 0.

|a-b| max value is 127. Please note that we need the module of a-b.

So, we just need a table with the squares of the numbers from 0 to 217, so 218 different values.

The program Sphere uses an integer offset of 128. Since after multiplication the result must be divided by 128, according this integer math squares should be evaluated as follows:

S = n^2/(4 * offset) = n^2/(4*128) = n^2/512

So, we could just put these values on the tables, with no need to perform the division by 512. That approach is correct, but we have an accuracy loss. As a result, the sphere will have empty points.

The fix is easy: we just have to put on the tables the squares divided by 256, then divide the product by two in the code. Now the question is: will those values fit in one byte only? Let’s check.

217^2/256 = 183 (integer value)

So, the answer is yes, and the multiply routine can use 8 bit values in the tables, resulting in faster code than by using 16 bit values.

The fast multiply routine I have come up with follows.

; fast multiply (max 127 * max 90)

multiply_ab_fast

                clc
                lda multiplicand
                adc multiplier
                tax             
                
                sec
                lda multiplicand
                sbc multiplier                
                bpl no_comp_dif
                
                eor #$ff
                clc
                adc #1
                
no_comp_dif     
                tay
                              
                sec
                lda square,x 
                sbc square,y
                sta result 
                
                rts

It just stores the values (a+b) and |a-b| in the index registers, then loads the squares and subtracts them, obtaining the result. In the code that will follow, it is called “sum” as a hangover from previous program versions, but of course it is the result of the product.

The following is the fast version of the Sphere program. The sphere now takes 4,7 seconds to be drawn. The older version of the program took 6,7 seconds instead.

DOWNLOAD: Sphere fast multiply version (prg file)

DOWNLOAD: Sphere fast multiply version (source code)

The image generated by this faster program seems even better than the previous version, as the sphere has a better round shape.

7 Replies to “Optimizing the assembly program “Sphere” – More tips on fixed point math. A fast 8 bit multiply routine.”

MageMaster says:

February 5, 2018 at 6:24 pm

Now you just need to do multiple screens and rotation to make it animated 🙂

1. retro64 says:
  
  February 5, 2018 at 8:13 pm
  
  Hello! I will try one day!
  
Sterling says:

October 19, 2023 at 10:09 pm

Hello,
i like your programs and, if i can, i make it a little bit faster.
My version needs 223/60 counts from line ; draw ellipses
lda #90
sta r1
…

Have a nice day.

Sterling says:

November 23, 2023 at 2:14 pm

Hello,
after updating the plot routine with the mandelbrot plot routine from Marcello we get an huge improbement. Now the stopwatch gives me 123/60 counts.

And thanks to prepare the super fast multiply routine. I used it in some other applications.

BR
Sven

retro64 says:

November 29, 2023 at 7:47 am

Thank you for your interest on my programs. Have fun! 🙂

Commodore 64 & 6510 retro-programming says:

January 17, 2024 at 2:33 pm

Just to thank you for all your tutorials and explanations. They’re great. Well, you are great. Very, very, very smooth and clear unlike many others.
They will be a source of inspiration as I’m currently working on a 3D hat plotting in assembly on the Commodore 64. It’s based on the following BASIC Code:
105 REM HIRES screen mode set
110 BM=8192
120 POKE53272,PEEK(53272)or8
130 POKE53265,PEEK(53265)OR32
140 FORI=1024TO2023:POKEI,22:NEXT
142 FOR I=BM TO BM+7999:POKE I,0:NEXT I
150 REM main calculation routin
151 P=170:Q=100
152 XP=144:XR=1.5*pi:rem pi can be replaced with 3.1415927
153 YP=56:YR=1:ZP=64
154 XF=XR/XP:YF=YP/YR:ZF=XR/ZP
155 FORZI=-QTOQ-l setp 2
156 IFZIZP THEN 164
157 ZT=ZI*XP/ZP:ZZ=ZI
158 XL=INT(.5+SQR(XP*XP-ZT*ZT))
159 FORXI=-XL TO XL step 2
160 XT=SQR(XI*XI+ZT*ZT)*XF:XX=XI
161 YY=(SIN(XT)+.4*SIN(3*XT))*YF
162 GOSUB 170
163 NEXT XI
164 NEXT ZI
166 REM waiting space key press routine
167 WAIT 203,60:STOP
168 REM plotting routine
170 X1=XX+ZZ+P
171 Y1=INT(90-(YY-ZZ))
180 IF X1319 THEN 220
181 IF Y1199 THEN 220
185 P1=BM+320*INT(Y1/8)+8*INT(X1/8)+(Y1AND7)
190 POKEP1,PEEK(P1)OR(2^(7-(X1AND7)))
200 REM not required IFY1=0 THEN 220
210 REM return to line 163
220 RETURN

1. retro64 says:
  
  February 12, 2024 at 2:19 pm
  
  Hello, thank you for your kind words, very appreciated. Happy programming and have fun! 🙂

Optimizing the assembly program “Sphere” – More tips on fixed point math. A fast 8 bit multiply routine.

A faster 8-bit multiply

7 Replies to “Optimizing the assembly program “Sphere” – More tips on fixed point math. A fast 8 bit multiply routine.”

Leave a Reply Cancel reply

Zoccoli per integrati THT

Il Commodore 1541, consigli d'uso e manutenzione (parte 2)

Dissaldare un integrato THT in maniera sicura

Commodore 64 con RAM guasta

Guida alla riparazione del Commodore 64