On Inner Loops by Rick Booth

A great book for those of us who savor tricky code-optimization problems the way gourmets savor fine wines. The first half of the book grounds the reader in the "theory" of modern Intel optimization, namely how processors since the 486 have worked; the second half applies that theory to the practice of optimizing different types of real-life inner loops.

The theory section starts from the basics and works through 486 and Pentium optimization, ending with the Pentium Pro and MMX extensions. This coverage is solid, worth reading even for the expert optimizer: I realized upon reading the Pentium chapter that I'd previously understood prefetching imperfectly. The Pentium Pro chapter gives the clearest explanation I've seen of that processor's fundamentally complex scheduler. The gist is that a Pentium Pro is like a Pentium with read/write memory-access pipelines in parallel with the familiar Pentium U and V pipelines; the effect is that many core operations may include memory accesses without incurring extra cycles, for instance "add eax,[ebx]". The MMX chapter is just as sturdy, quite impressive considering that no MMX processors were available when Booth wrote the book!

The "practice" half of the book is also good. Each chapter explores a different aspect of modern optimization, using an appropriate algorithm. For instance, the memcpy analysis spotlights cache effects, and the matrix multiplication explores the interleaving of dense floating point operations. Although the result is a clear tutorial for the intermediate programmer, it's not very exciting for the advanced reader, since the example algorithms are picked more for their ability to show off different techniques rather than for their depth as advanced optimization puzzles.

There Ain't No Such Thing As The Fastest Code

The only fault I find is with an opinion alluded to throughout the book, that an expert optimizer (like Booth) can reliably calculate the "speed limit" for a problem. It is true that, given a particular approach, one can distill it to its atomic reads, adds, etc. and use this as a guide to coding it. But as all expert optimizers eventually learn, as soon as one has slaved over a piece of code to get the best possible solution, another expert comes along and finds a fundamentally better approach. Perhaps Booth doesn't associate with enough other expert optimizers.

For instance, his chapter 7 caps the theory section with the optimization of the WordCount routine, which counts the number of words in a block of text. Booth's presentation is lucid and educational, moving briskly from a starting rate of 23 cycles per character to 7 to 5 to the final "speed limit" of 2 cycles per character. But by looking at the problem differently, we can find a 1-cycle-per-character solution (given in the appendix below). That's not just a few percent faster, it's more than twice as fast. This highlights the problem with believing that you have the fastest solution: you can't see past it to the even better solutions. A good maxim for optimizers is Abrash's "there ain't no such thing as the fastest code." An irony is that that is the title of Abrash's chapter on WordCount in his Zen book, in which he dismisses the super-fast approach used in the appendix here as unprofitable!

This quibble aside, this book is one of the few every expert optimizer should read. It's comparable to Abrash's Zen of Code Optimization: not quite as entertaining for advanced optimizers, but just as functional.

Appendix A: The 1-cycle-per-character WordCount()

Booth's best WordCount is based on a 4-instruction inner loop at each character: read-translate-compare-add. He reads the character, looks it up in a 256-byte table to convert it to a word/nonword code, compares that code with the code from the previous character to detect a nonword-to-word transition, and uses adc to count such transitions.

We can do better by making two assumptions: characters are 7-bit ascii (i.e. the high bit of each byte is zero); and a word character is defined as non-whitespace, rather than the definition (a-z, A-Z, 0-9) that Booth uses. These changes allow us to use arithmetic operations rather than a table lookup to determine whether a byte is a word-character or not; specifically, we add 0x5f to the character, and the high bit 0x80 is then set only for word characters.

Now that we're using arithmetic instead of a table lookup, we can do four of these operations simultaneously, on the four bytes in a 32-bit register. That gives us 4 useful bits, in the high bit of each of the register's four bytes. We can accumulate these bits in another 32-bit register using shifts and adds, so that after processing 32 characters we'll have a full 32 bits, in the accumulator, one bit for each character, although the bits are arranged strangely.

All that remains is to convert the strangely-interleaved word/nonword bits in this 32-bit register into a count of the number of word-to-nonword transitions, and add that count into the total. We count these transitions by anding the register with an inverted copy of itself shifted one character over, leaving us with a 32-bit register whose bits are 1 only at the word-to-nonword transitions.

Now we need just count the number of bits in the 32-bit register. Here we take advantage of the fact that there can never be two word-to-nonword transitions at adjacent chars (after all, right after such a transition we must be in the nonword state, so it'll take at least one char to get back to the word state before we can have another word-to-nonword transition). We can therefore add (or OR) together adjacent bits to reduce the 32-bit register to only 16 bits. We then do two lookups in a 256-byte lookup table that gives the number of set bits in a byte.

The trickiest part of the algorithm is shifting the 32-bit register so that each of its bits is at the position for the following character. It's particularly tricky to handle the overlap of the last character of one batch of 32 with the first character of the next batch. This is complicated; we need a picture. For characters c0 to c31 in the input stream, the corresponding word/nonword bits are arranged in the 32-bit register as follows:

 High byte                           Low Byte
c31...c3  c30...c2  c29...c1  c28...c0
The shifted value to be xored with this has each bit positioned where its preceding character's bit is in the above:
 High byte                           Low Byte
c30...c2  c29...c1  c28...c0  c27..c3. <-- c-1
In other words, we rotate the register left 8 bits, then shift the low byte left an extra bit, shifting in the ending bit from the last batch of 32 chars, and shifting out the ending bit from this batch of 32 chars. The following code does this quite efficiently, but doesn't work well on a Pentium Pro.

The final code is fairly long because it's doing 32 characters per loop, but it achieves 29 cycles/32 chars, i.e. less than 1 cycle per character. And I don't think it's the fastest possible solution; the situation is complex enough that there might well be a fundamentally better approach.

For best performance, the surrounding code should arrange esi to start with an address that's a multiple of 32, minus 4, by processing earlier characters with a different loop. That alignment works best with cache miss delays, as cogently described by the table on page 115 of Booth's book, by allowing the longest delay between the reading of the first dword from a cache line (the mov ebp,[esi+offset+32+4]) and the next read from that cache line (the mov ebx,[esi+offset+32+12] that comes 13 cycles later).

Note that the loop prereads some of the following iteration's chars, so you should terminate the loop while there are still enough entries, and finish up the processing another way.

; Count the number of words in a buffer pointed to by esi, ending at 'end',
; accumulating the count into edi,
; by processing groups of 32 chars at a time using the wcount macro.  
; Macro to process the remainder of the 32 chars in this group, count the
; number of word-to-nonword transitions in the group, and process the start
; of the next group of 32 chars.  The start of the process is to execute 
; the following basic code, which processes 4 chars, 8 times:
;    mov ebx,[esi+ebp]   ; Get 4 chars
;    add ebx,5f5f5f5fh   ; Make 80 bits hold word/nonword flag.
;    and ebx,80808080h   ; Mask out all but the flag bits.
;    shr ebx,shift       ; Shift the bits in prep for accumulation
;    or  ebp,ebx         ; Accumulate the 4 bits into a 32-bit reg.
; But this macro starts with the first 8 chars already completely processed,
; and the next 8 chars partially processed, in order to accomodate the
; partially-interleaved processing of adjacent 32-char groups.
; On entry:
;      ebp = the bits for the first 8 chars in the group,
;      ecx = the 8-11th chars plus 5f5f5f5fh,
;      ebx = the 12-15th chars.
;      edx = the 32 bits for the preceding group of 32 chars, of which only
;            the high bit (the one for the group's last char) is used.
; On exit everything is set up for entry into another copy of this macro.
wcount macro offset
    add ebx,5f5f5f5fh
    and ecx,80808080h
    shr ecx,5
    and ebx,80808080h
    shr ebx,4
    or  ebp,ecx         ; 3 cycles so far.
    or  ebp,ebx

    mov ebx,[esi+offset+20]    ;ebx = chars 20-23
    mov ecx,[esi+offset+16]    ;ecx = chars 16-19
    add ebx,5f5f5f5fh
    add ecx,5f5f5f5fh
    and ebx,80808080h
    shr ebx,2
    and ecx,80808080h
    shr ecx,3
    or  ebp,ebx         ; 8 cycles so far.
    or  ebp,ecx

    mov eax,[esi+offset+28]    ;eax = chars 28-31
    mov ecx,[esi+offset+24]    ;ecx = chars 24-27
    add eax,5f5f5f5fh
    add ecx,5f5f5f5fh
    and eax,80808080h
    or  eax,ebp

   ; Outdented instructions below are for the next group of 32 chars, for
   ; which ebp accumulates the bits, and ebx is a temporary register. 
   ; The other lines are for the current 32 chars, and don't use ebx or ebp.

    and ecx,80808080h
    shr ecx,1
   mov ebp,[esi+offset+32+4]  ; Read 4 characters for the next group.
    or  eax,ecx
   mov ebx,[esi+offset+32]    ; 14 cycles so far.
    ; eax = 32 word/nonword bits for the 32 chars.  The rest of the macro
    ; computes ~eax & ecx, where ecx = shifted version in eax, and counts
    ; the set bits with a lookup table.
    mov ecx,eax
   add ebp,5f5f5f5fh

    rol ecx,8

   add ebx,5f5f5f5fh
    add edx,edx   ; Carry flag = bit from last char of preceding 32 chars.

    adc cl,cl     ; Shift low byte an extra bit, see algorithm desc above.
   and ebp,80808080h
   shr ebp,6
    xor ecx,-1

    and ecx,eax   ; Set all bits of ecx for word-to-nonword transitions.
   and ebx,80808080h

    mov edx,ecx
    add cl,ch     ; Combine adjacent counts; see algorithm desc above.

    shr edx,16 
    and ecx,0ffh

   shr ebx,7
    add dl,dh     ; Combine adjacent counts; see algorithm desc above.

    mov cl,[countTable+ecx]
    and edx,0ffh

    add edi,ecx
   or ebp,ebx

    mov cl,[countTable+edx]
   mov ebx,[esi+offset+32+12]

    add edi,ecx
   mov ecx,[esi+offset+32+8]

   add ecx,5f5f5f5fh
   mov edx,eax
    endm          ; 28 cycles.
;End of wcount
    ; Start up the pipeline, see wcount macro's comment.
    mov ebp,[esi+ebp+4];
    mov ebx,[esi+ebp];
    add ebp,5f5f5f5fh   ;ebx = chars 0-3, eax = chars 4-7
    add ebx,5f5f5f5fh   
    and ebp,80808080h
    and ebx,80808080h
    shr ebp,6
    mov ecx,[esi+8]     ;ecx = chars 8-11
    shr ebx,7
    add ecx,5f5f5f5fh
    or  ebp,ebx          
    mov ebx,[esi+12]    ;ebx = chars 12-15
       wcount 0
       wcount 32
       add esi,64
       mov eax,end
    .until esi == eax       ; 2*28+2=58 cycles/64 chars = 29cyc/32 chars.
We can improve this even further by defining a word character as being one that has the 0x40 bit set. This includes all the letters, and some symbols, but none of the digits or common punctuation. This is not acceptable for applications that need to count numbers, since the string "12 45 67" is all non-word characters under this definition. But if this is acceptable for the application at hand, we can shorten the code that computes the 32-bit eax at the start of the loop by 5 cycles, giving 24cyc/32chars. I haven't worked out the exact interleaving but here's the basic idea:
	; Four copies of the following basic code are interleaved:
	;    mov ebx,[esi+ebp]   ; Get 8 chars
	;    mov ecx,[esi+ebp+4]
	;    and ebx,40404040h   ; Mask out all but the flag bits.
	;    and ecx,40404040h
	;    lea eax,[ebx*2+eax] ; Add ebx into the total, shifted left 1.
	;    or  eax,ecx         ; Add ecx into the total.
	;    shr eax,2           ; Shift accumulator in prep for next 8 chars
On the Pentium Pro, we can do much better by storing the 32 word-bits in a different order and using the rcl instruction to perform a 33-bit-wide rotate. This avoids the partial register stall as well as being fundamentally faster, since this single instruction (2 uops) replaces 4 or 5 instruction from the Pentium loop.

On the MMX or Pentium II, we can do 64 bits at a time rather than 32, the pcmpeqb instruction saves an instruction over the add/and pair, and we get some elbow room with the 8 MMX registers. I haven't worked it out, but another 2x improvement might well be possible over the straight Pentium code.

Substantive changes:
    1997-11-24: created.
    1997-11-25: ROL improves WordCount solution.
    1997-11-26: 32- and 27-cycle solutions.
    1997-11-30: Only 16 count bits, so 29 and 24 cycles.
Copyright © 1997-1998, Steve Colwell, All Rights Reserved
Home page