Comments:"Embedded in Academia : Stochastic Superoptimization"
URL:http://blog.regehr.org/archives/923
“Stochastic Superoptimization” is a fancy way to say “randomized search for fast machine code.” It is also the title of a nice paper that was presented recently at ASPLOS. Before getting into the details, let’s look at some background. At first glance the term “superoptimization” sounds like nonsense because the optimum point is already the best one. Compiler optimizers, however, don’t tend to create optimal code; rather, they make code better than it was. A superoptimizer, then, is either an optimizer that may be better than regular optimizers or else a code generator with optimality guarantees.
How can a superoptimizer (or a skilled human, for that matter) create assembly code that runs several times faster than the output of a state-of-the-art optimizing compiler? This is not accomplished by beating the compiler at its own game, which is applying a long sequence of simple transformations to the code, each of which improves it a little. Rather, a superoptimizer takes a specification of the desired functionality and then uses some kind of systematic search to find a good assembly-level implementation of that functionality. Importantly, the output of a superoptimizer may implement a different algorithm than the one used in the specification. (Of course, a regular optimizer could also change the algorithm, but it would tend to do so via canned special cases, not through systematic search.)
Over the last 25 years a thin stream of research papers on superoptimization has appeared:
- The original superoptimizer paper from 1987 describes a brute-force search for short sequences of instructions. Since candidate solutions were verified by testing, this superoptimizer was perfectly capable of emitting incorrect code. Additionally, only very short instruction sequences could be generated.
- The Denali superoptimizer works on an entirely different principle: the specification of the desired functionality, and the specification of the available instructions, are encoded as a satisfiability problem. Frustratingly, both the original paper and the follow-up paper on Denali-2 are short on evidence that this approach actually works. As idea papers, these are great. As practical compiler papers, I think they have to be considered negative results.
- A more practical superoptimizer was described in 2006. This one is modeled after the “peephole” passes found in all optimizing compilers, which perform low-level (and often architecture-specific) rewrites to improve the quality of short sequences of instructions or IR nodes. Although using exhaustive search makes it expensive to discover individual rewrites, the cost is amortized by storing rewrites in a fast lookup structure. Another cool feature of this work was to use testing to rapidly eliminate unsuitable code sequences, but then to verify the correctness of candidate solutions using a SAT solver. Verifying equivalence of already-discovered code is much more practical than asking the SAT solver to find a good code sequence on its own. The punchline of this paper was that the peephole superoptimizer could sometimes take unoptimized code emitted by GCC and turn it into code of about -O2 quality. Also, without any special help, it found interesting ways to use SSE instructions.
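To make the brute-force recipe concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the 8-bit single-register machine, the six instruction names, and the function `superoptimize` are hypothetical and come from none of the papers above. It enumerates programs shortest-first, uses a handful of test inputs to discard most candidates cheaply, and only then checks a survivor against every possible input. On this toy machine the exhaustive check over 256 inputs stands in for the SAT-based equivalence check; a real tool working on 64-bit registers cannot enumerate inputs this way.

```python
from itertools import product

MASK = 0xFF  # toy 8-bit machine with one register (an assumption, not a real ISA)

# Hypothetical instruction set; the names are illustrative only.
OPS = {
    "neg": lambda x: (-x) & MASK,
    "not": lambda x: (~x) & MASK,
    "inc": lambda x: (x + 1) & MASK,
    "dec": lambda x: (x - 1) & MASK,
    "shl": lambda x: (x << 1) & MASK,
    "shr": lambda x: x >> 1,
}


def run(program, x):
    """Execute a sequence of instruction names on the input value x."""
    for name in program:
        x = OPS[name](x)
    return x


def superoptimize(spec, max_len=3, tests=(0, 1, 2, 127, 255)):
    """Return the first shortest program matching spec, or None if none is found."""
    expected = [spec(t) & MASK for t in tests]
    for length in range(max_len + 1):          # shortest programs first
        for program in product(OPS, repeat=length):
            # Cheap filter: a few test inputs reject most candidates.
            if [run(program, t) for t in tests] != expected:
                continue
            # Expensive check: all 256 inputs (stand-in for a SAT solver).
            if all(run(program, x) == spec(x) & MASK for x in range(MASK + 1)):
                return program
    return None


if __name__ == "__main__":
    # The spec multiplies and adds; the result uses an increment and a shift.
    print(superoptimize(lambda x: 2 * x + 2))
```

Dropping the exhaustive check and trusting the test inputs alone gives the behavior of the 1987 design, including its ability to return wrong code. The number of candidates grows as 6^n with program length n even on this tiny instruction set, which is why only very short sequences are feasible.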
Ok, enough background. There’s more but this is the core.
The new paper on stochastic superoptimization, which is from the same research group that produced the peephole superoptimizer, throws out the database of stored optimizations. However, that aspect could easily be added back in. The idea behind STOKE (the stochastic superoptimizer tool)—that it’s better to sparsely search a larger region of the search space than to densely search a smaller region—is the same one that has been responsible for a revolution in computer Go over the last 20 years. The punchline of the new paper is that STOKE is sometimes able to discover code sequences as good as, or a bit better than, the best known ones generated by human assembly language programmers. Failing that, it produces code as good as GCC or Intel CC at a high optimization level. These results were obtained on a collection of small bit manipulation functions. STOKE optimizes x86-64 code, and it is only able to deal with loop-free instruction sequences.
At a high level, using randomized search to find good code sequences is simple. However, the devil is in the details and I’d imagine that getting interesting results out of this required a lot of elbow grease and good taste. Solving this kind of optimization problem requires (1) choosing a mutation operator that isn’t too stupid, (2) choosing a fitness function that is pretty good, and (3) making the main search loop fast. Other aspects of randomized search, such as avoiding local maxima, can be handled in fairly standard ways.
STOKE’s mutation operator is the obvious one: it randomly adds, removes, modifies, or reorders instructions. Part of the fitness function is also obvious; it is based on an estimate of performance supplied by an x86-64 interpreter. The less obvious part of STOKE’s fitness function is that it is willing to tolerate incorrect output: a piece of wrong code is penalized by the number of bits by which its output differs from the expected output, and also by erroneous conditions like segfaults. To make each iteration fast, STOKE does not try to ensure that the code sequence is totally correct, but rather (just like the original superoptimizer) runs some test cases through it. However, unlike the original superoptimizer, STOKE will never hand the user a piece of wrong code because a symbolic verification method is applied at the end to ensure that the optimized and original codes are equivalent. This equivalence check is made much easier by STOKE’s insistence on loop-freedom. STOKE’s speed is aided by the absence of loops and also by a highly parallelized implementation of the randomized search.
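The three ingredients (mutation operator, fitness function, search loop) can be sketched in a few dozen lines of Python. This is a toy under stated assumptions, not STOKE's implementation: the 8-bit single-register machine, the instruction names, the `wrong_weight` constant, the use of program length as the performance estimate, and the Metropolis-style acceptance rule with a fixed `temperature` are all choices made here for illustration. The acceptance rule is one standard way to escape local optima, since it sometimes accepts a worse candidate.

```python
import math
import random

MASK = 0xFF  # toy 8-bit, single-register machine (an assumption, not x86-64)

# Hypothetical instruction set; the names are illustrative only.
OPS = {
    "neg": lambda x: (-x) & MASK,
    "not": lambda x: (~x) & MASK,
    "inc": lambda x: (x + 1) & MASK,
    "dec": lambda x: (x - 1) & MASK,
    "shl": lambda x: (x << 1) & MASK,
    "shr": lambda x: x >> 1,
}


def run(program, x):
    for name in program:
        x = OPS[name](x)
    return x


def cost(program, spec, tests, wrong_weight=10):
    """Wrong output bits over the test cases, plus length as a crude speed estimate."""
    wrong_bits = sum(
        bin(run(program, t) ^ (spec(t) & MASK)).count("1") for t in tests
    )
    return wrong_weight * wrong_bits + len(program)


def mutate(program, rng, max_len=6):
    """Return a copy with one random add, remove, modify, or swap applied."""
    program = list(program)
    move = rng.choice(["add", "remove", "modify", "swap"])
    if move == "add" and len(program) < max_len:
        program.insert(rng.randrange(len(program) + 1), rng.choice(list(OPS)))
    elif move == "remove" and program:
        del program[rng.randrange(len(program))]
    elif move == "modify" and program:
        program[rng.randrange(len(program))] = rng.choice(list(OPS))
    elif move == "swap" and len(program) >= 2:
        i, j = rng.sample(range(len(program)), 2)
        program[i], program[j] = program[j], program[i]
    return program  # unchanged copy if the chosen move did not apply


def search(spec, tests, start=(), steps=20000, temperature=2.0, seed=0):
    """Metropolis-style search; returns the lowest-cost program seen and its cost."""
    rng = random.Random(seed)
    current = list(start)
    current_cost = cost(current, spec, tests)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = mutate(current, rng)
        candidate_cost = cost(candidate, spec, tests)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept regressions with decaying probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


if __name__ == "__main__":
    print(search(lambda x: 2 * x + 2, tests=[0, 1, 2, 127, 255]))
```

Note that a low cost here only means the program passes the test cases. As in the text above, a separate equivalence check would still be needed before trusting the result.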
The STOKE authors ran it in two modes. First, it was used to improve the quality of code generated by Clang at -O0. This worked, but the authors found that it was unable to find better, algorithmically distinct versions of the code that were sometimes known to exist. To fix this, they additionally seeded STOKE with completely random code, which they tried to evolve into correct code in the absence of a performance constraint. Once the code became correct, it was optimized as before. The fact that this worked—that is, it discovered faster, algorithmically distinct code—for 7 out of 28 benchmarks, is cool and surprising and I’d imagine this is why the paper got accepted to ASPLOS.
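The second mode can be sketched as a two-phase driver. This is again a toy with hypothetical names (`anneal`, `synthesize_then_optimize`) on an assumed 8-bit single-register machine, not the authors' code. Phase 1 starts from random instructions and minimizes only the number of wrong output bits. Phase 2 starts from whatever phase 1 found and adds a performance term, for which program length is used here as a stand-in.

```python
import math
import random

MASK = 0xFF  # toy 8-bit, single-register machine (an assumption, not x86-64)

# Hypothetical instruction set; the names are illustrative only.
OPS = {
    "neg": lambda x: (-x) & MASK,
    "not": lambda x: (~x) & MASK,
    "inc": lambda x: (x + 1) & MASK,
    "dec": lambda x: (x - 1) & MASK,
    "shl": lambda x: (x << 1) & MASK,
    "shr": lambda x: x >> 1,
}
NAMES = list(OPS)


def run(program, x):
    for name in program:
        x = OPS[name](x)
    return x


def wrong_bits(program, spec, tests):
    """Total number of output bits that differ from the spec over the tests."""
    return sum(bin(run(program, t) ^ (spec(t) & MASK)).count("1") for t in tests)


def mutate(program, rng, max_len=6):
    """Return a copy with one instruction removed, replaced, or inserted."""
    program = list(program)
    if program and rng.random() < 0.5:
        i = rng.randrange(len(program))
        if rng.random() < 0.5:
            del program[i]
        else:
            program[i] = rng.choice(NAMES)
    elif len(program) < max_len:
        program.insert(rng.randrange(len(program) + 1), rng.choice(NAMES))
    return program


def anneal(start, cost_fn, rng, steps, temperature=2.0):
    """Metropolis-style search that returns the lowest-cost program seen."""
    current, best = list(start), list(start)
    for _ in range(steps):
        candidate = mutate(current, rng)
        delta = cost_fn(candidate) - cost_fn(current)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if cost_fn(current) < cost_fn(best):
                best = current
    return best


def synthesize_then_optimize(spec, tests, seed=0, steps=5000):
    rng = random.Random(seed)
    random_start = [rng.choice(NAMES) for _ in range(4)]
    # Phase 1 (synthesis): correctness is the only term in the cost.
    correct = anneal(random_start, lambda p: wrong_bits(p, spec, tests), rng, steps)
    if wrong_bits(correct, spec, tests) != 0:
        return None  # synthesis did not converge within the step budget
    # Phase 2 (optimization): add the performance term back in.
    return anneal(
        correct, lambda p: 10 * wrong_bits(p, spec, tests) + len(p), rng, steps
    )


if __name__ == "__main__":
    print(synthesize_then_optimize(lambda x: 2 * x + 2, [0, 1, 2, 127, 255]))
```

Because the wrong-bit weight (10) exceeds the maximum program length (6), phase 2 can never prefer a program that fails the tests over the one it started from. The design point is that phase 1 is free to wander through long, slow programs on its way to a correct one, which a search constrained by performance from the start tends not to reach.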
Are superoptimizers ever going to be a practical technology? Perhaps. One can imagine them being used in a few ways. First, a system such as STOKE (with modest improvements such as handling loops) could be used by library developers to avoid writing hand-tuned assembly for math kernels, media codecs, compression algorithms, and other small, performance-critical codes that are amenable to tricky optimizations. The advantage is that it is easy to tell STOKE about a new instruction or a new chip (a new target chip only requires a new cost function; a new target architecture is much more difficult). This goal could be achieved in a short time frame, though I suspect that developers enjoy writing assembly language enough that it may be difficult to persuade them to give it up.
The second, and more radical, way that a superoptimizer could be used is to reduce the size of a production-grade optimizing compiler by removing most of its optimization passes. What we would end up with is a very simple code generator (TCC would be a good starting point) and a database-backed superoptimizer. Actually, I’m guessing that two superoptimizers would be needed to achieve this goal: one for the high-level IR and one for the assembly language. So perhaps we would use Clang to create LLVM code, superoptimize that code, use an LLVM backend to create assembly, and then finally superoptimize that assembly. If the optimization database lived in the cloud, it seems likely that a cluster of modest size could handle the problem of superoptimizing the hottest collections of IR and assembly code that everyone submits. Drawbacks of this approach include:
- Novel code will be optimized either very poorly or very slowly.
- Compilation predictability suffers since the cloud-based peephole database will constantly be improving.
- You can’t compile without a network connection.
It remains to be seen whether we want to do this sort of thing in practice. But if I were a compiler developer, I’d jump at a chance to substantially reduce the complexity, maintenance requirements, and porting costs of my toolchain.