Introduction to AVX2 optimizations in x264
x264 is a good example of a program with many integer SIMD functions that operate on a wide variety of data types and perform a variety of different operations. Few of these are simple enough that we can change register sizes from 128-bit to 256-bit trivially; many are frequency transforms, FIR filters, and other functions with a lot of data mixing and dependency. Just as moving from 64-bit to 128-bit registers rarely gave an exact doubling of performance, moving from 128-bit to 256-bit rarely will here, too.

The shift from SSE2-derived integer SIMD to AVX2 is rather analogous to the change from MMX to SSE2 back during the era of the Pentium 4, Athlon 64, and Core 2 CPUs. However, there are a few core differences.

When SSE2 was introduced, the first implementations were often quite slow. The most extreme examples, the Athlon 64 and Core 1, only had 64-bit execution units, so an SSE2 function was often equal to or slower in speed than the MMX equivalent. It wasn't until the Core 2 Penryn (45nm) that this finally ended, when Intel blessed Penryn with a lightning-fast 128-bit shuffle unit [1].
Secondly, AVX2 operations are often extended from 128-bit in a somewhat unusual way. Instead of being logically expanded in the expected fashion, they operate on 128-bit lanes of data [3], with a few special inter-lane instructions like vpermd. This lets Intel implement larger SIMD by copy-pasting smaller SIMD units, but it also adds yet another challenge in extending 128-bit code to 256-bit code.
Thirdly, x264 has many functions that are smaller in width than 256-bit AVX registers, making it more difficult (or sometimes impossible) to effectively use AVX2. This issue existed when moving from 64-bit MMX to 128-bit SSE2, of course, but it's more dramatic now, since there are quite a lot of functions that aren't wide enough to make easy use of AVX2.

To facilitate the move to AVX2, we've extended the x264asm abstraction layer (x86inc.asm) to include AVX2 operations and registers. This let us port a number of rather complex functions to AVX2 with far fewer changes than we would have expected. As always, x264asm is available under a BSD-like license for anyone to use – if you're working on lots of x86 assembly code, I strongly recommend trying it!

I'd love to write an AVX2 optimization guide; there are plenty of tricks and subtleties that we've discovered along the way. Many of the functions I wrote before I had access to a Haswell turned out to be grossly suboptimal. Unfortunately, I'm pretty sure I'm still bound by the NDA here; the most I think I'm currently allowed to do is post performance numbers for individual functions, which you'll see on the next page.
Notes:

1. Two of them, actually!
2. Like always, there might be some minor exceptions: for example, on Sandy Bridge, most AVX floating point operations had 256-bit execution units, but rcpps (reciprocal approximation) did not, likely to save hardware. This is also no guarantee for future (especially non-Intel) CPUs; for example, Bulldozer supported AVX, but only uses 128-bit execution units, so it's probably no faster than SSE for floating point math.
3. Each 128-bit lane is operated on separately, so for example, pshufb shuffles 2 groups of 16 bytes separately instead of 1 group of 32 bytes. The latter would be more convenient, but would require more hardware and greater circuit depth, thus increasing latency or lowering clock speed.