archive.worldofdragon.org

Posted: **Wed Jan 21, 2015 4:14 pm**

Just to share a discovery... it's probably been done a million times before but it's new to me

The idea behind loop unrolling is simple enough. It seeks to reduce the time overhead of looping by doing more work per loop. To use a copy routine as an example, instead of copying one byte per loop, copy eight bytes per loop. Clearly this example works best when the number of bytes to be copied is a multiple of eight.

To copy any number of bytes, the traditional solution would be to perform a division to figure out the required number of loops and the remainder from the division tells us how many left over bytes there are to deal with. Sticking with powers of two this doesn't seem so bad but it's still a bit messy and you still need another byte-sized loop to deal with the remainder.

While thinking about how best to deal with the remainder it occurred to me that one way would be to use a 'divide and conquer' approach. Instead of copying the remaining bytes one at a time, copy four, then two, then one as necessary. After playing with it for a while I realised that this works much more nicely if you start with a single byte and work upwards.

The contrived example below copies 'count' bytes from u to y:

Code: Select all

copybytes
  lsr count
  bcc chk2
  lda ,u+		; copy 1 byte
  sta ,y+

chk2
  lsr count
  bcc chk4
  pulu d		; copy 2 bytes
  std ,y++

chk4
  lsr count
  bcc chk8
  pulu d,x	; copy 4 bytes
  std ,y++
  stx ,y++

  lda count
chk8
  beq done
lp8
  pulu d,x	; copy 8 bytes
  std ,y
  stx 2,y
  pulu d,x
  std 4,y
  stx 6,y
  leay 8,y
  dec count
  bne lp8

done
  rts

What's happening here is that each bit of the counter is tested, and if it is set, the appropriate amount of work is performed. It's quite a lot like binary multiplication. The nice thing is that more stages can be added to find the best size/speed tradeoff. The lsr instruction at each stage makes count the right value to use as a loop counter for the final stage.

If the size of the counter is the same as the number of bits tested then the final stage doesn't need to be a loop. In the above example, if count stays in the range 0-15 then the final stage will be required at most once.

A faster way of copying a block of 8 bytes would be to use both stack pointers but this requires some setup and I wanted to keep things simple for illustration.

Posted: **Thu Jan 22, 2015 10:25 pm**

That's a very interesting approach which I'll definitely keep in mind.

I use the dual stack pointer method a lot, which I also picked up off you.

Thanks for sharing.

Posted: **Fri Jan 23, 2015 6:27 pm**

Cheers. After playing a bit more it's proving to be quite a useful device. Even just a two stage version gives a decent speed up.

One thing it's handy for is clipping sprites vertically, where we want to draw a variable number of rows. The same routine can be used for different sizes while still being fast.

It's also really good for variable length copy operations. If the copy count is larger than eight bits, the low byte can be processed on its own and any left over bits can be shifted into the upper byte for the final stage.

I'd love to know what this optimisation is called but I haven't been able to find anything so far, probably because it doesn't seem very relevant to modern computing. Loop optimisations now are more about parallelism and memory latency and are usually best left to the compiler to figure out.

archive.worldofdragon.org

Fun with loops

Fun with loops

Re: Fun with loops

Re: Fun with loops