Fun with loops
Posted: Wed Jan 21, 2015 4:14 pm
Just to share a discovery... it's probably been done a million times before but it's new to me
The idea behind loop unrolling is simple enough. It seeks to reduce the time overhead of looping by doing more work per loop. To use a copy routine as an example, instead of copying one byte per loop, copy eight bytes per loop. Clearly this example works best when the number of bytes to be copied is a multiple of eight.
To copy any number of bytes, the traditional solution would be to perform a division to figure out the required number of loops and the remainder from the division tells us how many left over bytes there are to deal with. Sticking with powers of two this doesn't seem so bad but it's still a bit messy and you still need another byte-sized loop to deal with the remainder.
While thinking about how best to deal with the remainder it occurred to me that one way would be to use a 'divide and conquer' approach. Instead of copying the remaining bytes one at a time, copy four, then two, then one as necessary. After playing with it for a while I realised that this works much more nicely if you start with a single byte and work upwards.
The contrived example below copies 'count' bytes from u to y:
What's happening here is that each bit of the counter is tested, and if it is set, the appropriate amount of work is performed. It's quite a lot like binary multiplication. The nice thing is that more stages can be added to find the best size/speed tradeoff. The lsr instruction at each stage makes count the right value to use as a loop counter for the final stage.
If the size of the counter is the same as the number of bits tested then the final stage doesn't need to be a loop. In the above example, if count stays in the range 0-15 then the final stage will be required at most once.
A faster way of copying a block of 8 bytes would be to use both stack pointers but this requires some setup and I wanted to keep things simple for illustration.
The idea behind loop unrolling is simple enough. It seeks to reduce the time overhead of looping by doing more work per loop. To use a copy routine as an example, instead of copying one byte per loop, copy eight bytes per loop. Clearly this example works best when the number of bytes to be copied is a multiple of eight.
To copy any number of bytes, the traditional solution would be to perform a division to figure out the required number of loops and the remainder from the division tells us how many left over bytes there are to deal with. Sticking with powers of two this doesn't seem so bad but it's still a bit messy and you still need another byte-sized loop to deal with the remainder.
While thinking about how best to deal with the remainder it occurred to me that one way would be to use a 'divide and conquer' approach. Instead of copying the remaining bytes one at a time, copy four, then two, then one as necessary. After playing with it for a while I realised that this works much more nicely if you start with a single byte and work upwards.
The contrived example below copies 'count' bytes from u to y:
Code: Select all
copybytes
lsr count
bcc chk2
lda ,u+ ; copy 1 byte
sta ,y+
chk2
lsr count
bcc chk4
pulu d ; copy 2 bytes
std ,y++
chk4
lsr count
bcc chk8
pulu d,x ; copy 4 bytes
std ,y++
stx ,y++
lda count
chk8
beq done
lp8
pulu d,x ; copy 8 bytes
std ,y
stx 2,y
pulu d,x
std 4,y
stx 6,y
leay 8,y
dec count
bne lp8
done
rts
What's happening here is that each bit of the counter is tested, and if it is set, the appropriate amount of work is performed. It's quite a lot like binary multiplication. The nice thing is that more stages can be added to find the best size/speed tradeoff. The lsr instruction at each stage makes count the right value to use as a loop counter for the final stage.
If the size of the counter is the same as the number of bits tested then the final stage doesn't need to be a loop. In the above example, if count stays in the range 0-15 then the final stage will be required at most once.
A faster way of copying a block of 8 bytes would be to use both stack pointers but this requires some setup and I wanted to keep things simple for illustration.