Optimization Tidbits

Every processor type has its particular constraints (e.g. too few registers, slow memory access, idiosyncratic instructions) that control the types of optimization that will best apply. The references cover Intel-specific instruction twiddling issues like avoiding AGIs, and UV and FXCH pairing. I'll focus here on more general techniques that apply to the problems common to all modern processors: slow memory and slow jumps.


Slow Memory

Slow memory (relative to internal processor speed) means that it makes sense to develop compressed data formats that cram an algorithm's working set of into a couple of K so it'll fit in the primary cache. It's like having a processor that accesses only 16K of RAM, with slower access to an additional 256K -- sound familiar? Just like the good ol' bank-switching days!

Superior optimization depends on a precise understanding of the caches and paths to main memory for your particular processor, because clever coding of the 95% of the code that hits the primary cache will be completely swamped by that other 5% that misses the cache. Main memory hits are the coin of the realm now, not cycle counts.

An important corollary of slow memory is that the era is past where it makes sense to use huge lookup tables to save 3 instructions in the inner loop. An access to main memory now uses up to 100 cycles (!), so most worthwhile tables must now fit in the primary cache.

An interesting side note is that on the newest processors, simultaneous load/store execution makes the primary cache nearly as accessible as the registers, finally relieving Intel's legacy too-few-registers problem.


Slow Jumps

Predictable jumps are plenty fast; the problem is with unpredictable conditional branches. This is the area where tight code can make the most difference. The hardware guys can speed up memory accesses by adding more cache, but they can't do much better at figuring out which way branches are likely to go. They're trying speculative parallel execution of both branches, but I'd rather use that 2x faster processor to go 2x faster than to go 1x faster at branches -- wouldn't you?

The key is to get rid of branches completely. It's worth between 4 and 10 straight-line instructions to avoid one branch. A common technique is to get a register full of 0s or 1s depending on the result of a flag (the sign flag using i>>31, or the carry flag in assembly using sbb), then to AND that with other data to get a conditional effect. For instance, min(i,j) { return j+( ((i-j)>>31) & (i-j) ); }. Or, instead of the typical increment-mod for advancing an array index with wrapping, decMod(i,mod) { --i; return i + ((i>>31)&mod); }.

One can use bitwise tricks to do complicated things with straightline bit twiddling. One class of tricks mixes the boolean and arithmetic ALU operators, which usually mix like oil and water. For instance, a well known idiom is the check for whether i is a power of two: (i&(i-1))==0; if you don't like the way that handles i==0, you can use (i|(i-1))>i (for i signed). There's a whole family of related equations that work because i-1 differs from i only in the low bits: 10..0 becomes 01..1. For instance, the lowest bit of i is ((i^(i-1))+1)>>1. For numbers of the form 0..01..10..0, the highest bit is ((i|(i-1))+1)>>1. And so forth.

We can use the bitwise view of values to eliminate branches in multistage arithmetic comparisons. Rather than a==0 && b==0 && c==0, use (a|b|c) == 0. The same thing works for >=0, and the & version works for <0.

The bitwise view of floats is even more profitable, because of the clumsiness of FPU access on the Pentia. And with everyone now using IEEE-standard floating point, these tricks are perfectly portable. Just watch out for the negative floating point zero (0x80000000). The multistage zero check above works great: f==0.0f && g==0.0f becomes ((*(int*)&f | *(int*)&g)&0x7ffffff) == 0. Even simple comparisons can be faster as integer operations on the bitwise float representation: f<g is simply *(int*)&f < *(int*)&g, for f and g >= 0. Negative floats are trickier because of the ones-complement sign storage, as well as that darned negative zero.

A neat trick is to use the FPU's fast integer to floating point conversion as a way to find the highest set bit position: intlog2(i) { float x=i; return (*(int*)&x >> 23) - 127; }. This takes only a few cycles, much better than the usual linear or binary bit-searching loop.

Once you've replaced branches with straight-line code, you can often pack byte values into processor registers to get parallel processing. This is one of the best optimizations around: four bytes in a 32-bit register means 4x faster processing. A simple example is to copy or compare strings four bytes at a time rather than one at a time.

Intel's MMX instructions were designed for this very purpose, but we don't need no stinking dedicated instructions. Pure boolean operations like masking require no tricks. Using a guard bit in each byte, you can do addition and subtraction. Shift-and-mask acts like per-byte shifts, and with shift and add, you can do multiplication by constants.


References

Abrash, Michael; Zen of Code Optimization, Coriolis Group, 1994, ISBN 1-883577-03-9. A classic, the first good Intel assembly optimization book, covers the 486 and Pentium. Mostly dated now but still fun reading.

Booth, Rick; Inner Loops, Addison Wesley Developers Press, ISBN 0-201-47960-5, 1997. Very good, focuses on the Pentium, with some MMX and Pro. See my review; my appendix contains a good example of sophisticated bytewise parallelism.

Glassner, Andrew; Graphics Gems, Academic Press, 1990, ISBN 0-12-286165-5. Multitudes of four-page tips and tricks for graphics programming. First in a series, which goes up to at least Graphics Gems V. I think the first book is the best, but the others aren't bad; browse them for items relevant to you.

Hummel, Robert L.; Programmer's Technical Reference: The Processor and Coprocessor, Ziff-Davis Press, 1992, ISBN 1-56276-016-5. By far the best reference work for the Intel instruction set, but unfortunately the timings only go up through the 486. Still the best for describing what each instruction does.

Schmit, Michael L.; Pentium Processor Optimization Tools, 1995, AP Professional, ISBN 0-12-627230-1. Nowhere near as nice as the others mentioned here, but this is the only one with a table of Pentium instruction cycle counts. Get the Intel doc instead for the table, and for details about cycle counts, look up the instruction in Booth's index.

 
 
Substantive changes:
    1998-01-07: created.
Copyright © 1998, Steve Colwell, All Rights Reserved
Home page