An Introduction to GCC Compiler Intrinsics in Vector Processing
Speed is essential in multimedia, graphics and signal processing. Sometimes programmers resort to assembly language to get every last bit of speed out of their machines. GCC offers an intermediate between assembly and standard C that can get you more speed and processor features without having to go all the way to assembly language: compiler intrinsics. This article discusses GCC's compiler intrinsics, emphasizing vector processing on three platforms: X86 (using MMX, SSE and SSE2); Motorola, now Freescale (using Altivec); and ARM Cortex-A (using Neon). We conclude with some debugging tips and references.
Download the sample code for this article here: http://www.linuxjournal.com/files/linuxjournal.com/code/11108.tar
So, What Are Compiler Intrinsics?
Compiler intrinsics (sometimes called "builtins") are like the library functions you're used to, except they're built in to the compiler. They may be faster than regular library functions (the compiler knows more about them so it can optimize better) or handle a smaller input range than the library functions. Intrinsics also expose processor-specific functionality so you can use them as an intermediate between standard C and assembly language. This gives you the ability to get to assembly-like functionality, but still let the compiler handle details like type checking, register allocation, instruction scheduling and call stack maintenance. Some builtins are portable, others are not--they are processor-specific. You can find the lists of the portable and target-specific intrinsics in the GCC info pages and the include files (more about that below). This article focuses on the intrinsics useful for vector processing.
Vectors and Scalars
In this article, a vector is an ordered collection of numbers, like an array. If all the elements of a vector are measures of the same thing, it's said to be a uniform vector. Non-uniform vectors have elements that represent different things, and their elements have to be processed differently. In software, vectors have their own types and operations. A scalar is a single value, a vector of size one. Code that uses vector types and operations is said to be vector code. Code that uses only scalar types and operations is said to be scalar code.
Vector Processing Concepts
Vector processing is in the category of Single Instruction, Multiple Data (SIMD). In SIMD, the same operation happens to all the data (the values in the vector) at the same time. Each value in the vector is computed independently. Vector operations include logic and math. Math within a single vector is called horizontal math. Math between two vectors is called vertical math.
Instead of writing 10 x 2 = 20, express it vertically as:
      10
   x   2
  ------
      20
In vertical math, vectors are lines of these values; multiple operations happen at the same time:
      -----------------------------
      |  10  |  10  |  10  |  10  |   vector1
      -----------------------------
      -----------------------------
  x   |   2  |   2  |   2  |   2  |   vector2
      -----------------------------
  --------------------------------------
      -----------------------------
      |  20  |  20  |  20  |  20  |   vector3
      -----------------------------
All 10s are multiplied by all 2s at the same time.
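As a rough illustration of the two terms, here is a scalar model in plain C. The function names are made up for this sketch, and the loops only imitate what SIMD hardware would do for all lanes in a single instruction.

```c
#include <stddef.h>

/* Vertical math: combine matching elements of two vectors.
   Scalar model only; vector hardware handles all lanes at once. */
static void vertical_multiply(const int *a, const int *b, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];   /* lane i depends only on a[i] and b[i] */
}

/* Horizontal math: combine the elements within a single vector. */
static int horizontal_sum(const int *v, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```

Calling vertical_multiply() on a vector of four 10s and a vector of four 2s gives four 20s, matching the diagram above.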
So, to convert Celsius to Fahrenheit using F = (9/5) * C + 32 for a vector of temperatures in Celsius:
      -----------------------------
      |  C0  |  C1  |  C2  |  C3  |   Celsius temperatures vector
      -----------------------------
      -----------------------------
  x   |   9  |   9  |   9  |   9  |   vector2
      -----------------------------
  --------------------------------------
      -----------------------------
      |  p1  |  p2  |  p3  |  p4  |   partial result
      -----------------------------
      -----------------------------
  /   |   5  |   5  |   5  |   5  |   vector3
      -----------------------------
  --------------------------------------
      -----------------------------
      |  p1  |  p2  |  p3  |  p4  |   partial result
      -----------------------------
      -----------------------------
  +   |  32  |  32  |  32  |  32  |   vector4
      -----------------------------
  --------------------------------------
      -----------------------------
      |  F0  |  F1  |  F2  |  F3  |   Fahrenheit temperatures vector
      -----------------------------
Saturation arithmetic is like normal arithmetic, except that when the result of an operation would overflow or underflow an element in the vector, the result is clamped at the end of the range and not allowed to wrap around. (For instance, 255 is the largest unsigned character. In saturation arithmetic on unsigned characters, 250 + 10 = 255.) Regular arithmetic would allow the value to wrap around zero and become smaller. Saturation arithmetic is useful, for example, if you have a pixel that is slightly brighter than maximum brightness. It should be maximum brightness, not wrap around to be dark.
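To make the difference concrete, here is a minimal scalar sketch of one lane of an unsigned 8-bit add, once with saturation and once with ordinary wraparound. The helper names are hypothetical; vector hardware that supports saturation applies the same rule to every lane.

```c
#include <stdint.h>

/* One lane of an unsigned 8-bit saturating add (hypothetical helper). */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned int sum = (unsigned int)a + b;   /* widen so the sum cannot wrap */
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* Ordinary modulo-256 addition, for comparison. */
static uint8_t wrap_add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

With these definitions, sat_add_u8(250, 10) yields 255, while wrap_add_u8(250, 10) wraps around to 4.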
We're including integer math in the discussion. While integer math is not unique to vector processing, it's useful to have in case your vector hardware is integer-only or if integer math is much faster than floating-point math. Integer math will be an approximation of floating-point math, but you might be able to get a faster answer that is acceptably close.
The first option in integer math is rearranging operations. If your formula is simple enough, you might be able to rearrange operations to preserve precision. For instance, you could rearrange:
F = (9/5)C + 32
into:
F = (9*C)/5 + 32
So long as 9 * C doesn't overflow the type you're using, precision is preserved. It's lost when you divide by 5; so do that after the multiplication. Rearrangement may not work for more complex formulas.
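The effect of the ordering is easy to see in a small sketch using plain int arithmetic (the function names are ours, chosen for illustration):

```c
/* Integer Celsius-to-Fahrenheit with two orderings (hypothetical helpers). */

/* 9 / 5 is evaluated first in integer math and truncates to 1. */
static int c_to_f_divide_first(int c)
{
    return (9 / 5) * c + 32;
}

/* Multiply first, divide last: only the final quotient is truncated. */
static int c_to_f_multiply_first(int c)
{
    return (9 * c) / 5 + 32;
}
```

For 100 degrees Celsius, the first version returns 132 and the second returns the expected 212.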
The second choice is scaled math. In the scaled math option, you decide how much precision you want, then multiply both sides of your equation by a constant, round or truncate your coefficients to integers, and work with that. The final step to get your answer then is to divide by that constant. For instance, if you wanted to convert Celsius to Fahrenheit:
F = (9/5)C + 32
  = 1.8C + 32            -- but we can't have 1.8, so multiply by 10
sum = 10F = 18C + 320    -- 1.8 is now 18: now all integer operations
F = sum/10
If you multiply by a power of 2 instead of 10, you change that final division into a shift, which is almost certainly faster, though harder to understand. (So don't do this gratuitously.)
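Here is a sketch of both scalings. The scale of 256 and the rounded coefficient 461 are our choices for illustration, not values taken from the sample code, and the power-of-2 version takes unsigned input because right-shifting a negative value is implementation-defined in C.

```c
/* Scale by 10, as in the text: 10F = 18C + 320. */
static int c_to_f_scaled10(int c)
{
    return (18 * c + 320) / 10;
}

/* Scale by 256 (2^8): 1.8 * 256 = 460.8, rounded to 461; 32 * 256 = 8192.
   The final divide becomes a right shift by 8. */
static unsigned int c_to_f_scaled256(unsigned int c)
{
    return (461u * c + 8192u) >> 8;
}
```

Both versions return 212 for an input of 100. The coefficient 461 is slightly larger than the exact 460.8, so check the error over the input range you care about before relying on it.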
The third choice for integer math is shift-and-add. Shift-and-add is another method based on the idea that a floating-point multiplication can be implemented with a number of shifts and adds. So our troublesome 1.8C can be approximated as:
1.0C + 0.5C + 0.25C + ...
OR
C + (C >> 1) + (C >> 2) + ...
Again, it's almost certainly faster, but harder to understand.
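As a minimal sketch, the function below continues the series with the next two nonzero binary digits of 1.8 (1/32 and 1/64), which is our assumption about where to stop; the sample code may use a different number of terms. Input is unsigned to keep the shifts well defined.

```c
/* Approximate 1.8 * C + 32 with shifts and adds (hypothetical helper).
   1.8 in binary is 1.110011..., so the terms used here are
   1 + 1/2 + 1/4 + 1/32 + 1/64 = 1.796875. */
static unsigned int c_to_f_shift_add(unsigned int c)
{
    unsigned int scaled = c + (c >> 1) + (c >> 2) + (c >> 5) + (c >> 6);
    return scaled + 32u;
}
```

For an input of 100 this returns 211 rather than 212: each shift truncates, and the series is cut short, so the result is an approximation.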
There are examples of integer math in samples/simple/temperatures*.c, and shift-and-add in samples/colorconv2/scalar.c.
Vector Types, the Compiler and the Debugger
To use your processor's vector hardware, tell the compiler to use intrinsics to generate SIMD code, include the file that defines the vector types, and use a vector type to put your data into vector form.
The compiler's SIMD command-line arguments are listed in Table 1. (This article covers only these, but GCC offers much more.)
Table 1. GCC Command-Line Options to Generate SIMD Code
Processor | Options |
---|---|
X86/MMX/SSE1/SSE2 | -mfpmath=sse -mmmx -msse -msse2 |
ARM Neon | -mfpu=neon -mfloat-abi=softfp |
Freescale Altivec | -maltivec -mabi=altivec |
Here are the include files you need:
- arm_neon.h - ARM Neon types & intrinsics
- altivec.h - Freescale Altivec types & intrinsics
- mmintrin.h - X86 MMX
- xmmintrin.h - X86 SSE1
- emmintrin.h - X86 SSE2
The X86 compatibles with MMX, SSE1 and SSE2 have the following types:
- MMX: __m64 64 bits of integers broken down as eight 8-bit integers, four 16-bit shorts or two 32-bit integers.
- SSE1: __m128 128 bits: four single precision floats.
- SSE2: __m128i 128 bits of any size packed integers, __m128d 128 bits: two doubles.
Because the debugger doesn't know how you're using these types, printing X86 vector variables in gdb/ddd shows you the packed form of the vector instead of the collection of elements. To get to the individual elements, tell the debugger how to decode the packed form with "print (type[]) x". For instance, if you have:
__m64 avariable; /* storing 4 shorts */
You can tell ddd to list individual elements as shorts by saying:
print (short[]) avariable
If you're working with char vectors and want gdb to print the vector's elements as numbers instead of characters, you can tell it to using the "/" option. For instance:
print/d acharvector
will print the contents of acharvector as a series of decimal values.
PowerPC Altivec Types and Debugging
PowerPC Processors with Altivec (also known as VMX and Velocity Engine) add the keyword "vector" to their types. They're all 16 bytes long. The following are some Altivec vector types:
- vector unsigned char: 16 unsigned chars
- vector signed char: 16 signed chars
- vector bool char: 16 unsigned chars (0 false, 255 true)
- vector unsigned short: 8 unsigned shorts
- vector signed short: 8 signed shorts
- vector bool short: 8 unsigned shorts (0 false, 65535 true)
- vector unsigned int: 4 unsigned ints
- vector signed int: 4 signed ints
- vector bool int: 4 unsigned ints (0 false, 2^32 -1 true)
- vector float: 4 floats
The debugger prints these vectors as collections of individual elements.
ARM Neon Types and Debugging
On ARM processors that have Neon extensions available, the Neon types follow the pattern [type]x[elementcount]_t. Types include those in the following list:
- uint64x1_t - single 64-bit unsigned integer
- uint32x2_t - pair of 32-bit unsigned integers
- uint16x4_t - four 16-bit unsigned integers
- uint8x8_t - eight 8-bit unsigned integers
- int32x2_t - pair of 32-bit signed integers
- int16x4_t - four 16-bit signed integers
- int8x8_t - eight 8-bit signed integers
- int64x1_t - single 64-bit signed integer
- float32x2_t - pair of 32-bit floats
- uint32x4_t - four 32-bit unsigned integers
- uint16x8_t - eight 16-bit unsigned integers
- uint8x16_t - 16 8-bit unsigned integers
- int32x4_t - four 32-bit signed integers
- int16x8_t - eight 16-bit signed integers
- int8x16_t - 16 8-bit signed integers
- uint64x2_t - pair of 64-bit unsigned integers
- int64x2_t - pair of 64-bit signed integers
- float32x4_t - four 32-bit floats
The debugger prints these vectors as collections of individual elements.
There are examples of these in the samples/simple directory.
Now that we've covered the vector types, let's talk about vector programs.
As Ian Ollmann points out, vector programs are blitters. They load data from memory, process it, then store it to memory elsewhere. Moving data between memory and vector registers is necessary, but it's overhead. Taking big bites of data from memory, processing it, then writing it back to memory will minimize that overhead.
Alignment is another aspect of data movement to watch for. Use GCC's "aligned" attribute to align data sources and destinations on 16-byte boundaries for best performance. For instance:
float anarray[4] __attribute__((aligned(16))) = { 1.2, 3.5, 1.7, 2.8 };
Failure to align can result in getting the right answer, silently getting the wrong answer or crashing. Techniques are available for handling unaligned data, but they are slower than using aligned data. There are examples of these in the sample code.
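One simple precaution is to check a pointer's alignment before handing it to an aligned load or store. The sketch below shows such a check next to an array declared with the attribute; the helper and array names are hypothetical and are not taken from the sample code.

```c
#include <stdint.h>

/* Returns 1 when p sits on a 16-byte boundary, 0 otherwise. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* The attribute from the text: the array starts on a 16-byte boundary. */
static float aligned_array[4] __attribute__((aligned(16))) =
    { 1.2f, 3.5f, 1.7f, 2.8f };
```

is_aligned_16(aligned_array) is true, while a pointer to the second element (4 bytes further on) fails the check and would need an unaligned access technique instead.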
The sample code uses intrinsics for vector operations on X86, Altivec and Neon. These intrinsics follow naming conventions to make them easier to decode. Here are the naming conventions:
Altivec intrinsics are prefixed with "vec_". C++ style overloading accommodates the different type arguments.
Neon intrinsics follow the naming scheme [opname][flags]_[type]. A "q" flag means it operates on quad word (128-bit) vectors.
X86 intrinsics follow the naming convention _mm_[opname]_[suffix], where the suffixes are:
Suffix | Meaning |
---|---|
s | single-precision floating point |
d | double-precision floating point |
i128 | signed 128-bit integer |
i64 | signed 64-bit integer |
u64 | unsigned 64-bit integer |
i32 | signed 32-bit integer |
u32 | unsigned 32-bit integer |
i16 | signed 16-bit integer |
u16 | unsigned 16-bit integer |
i8 | signed 8-bit integer |
u8 | unsigned 8-bit integer |
pi# | 64-bit vector of packed #-bit integers |
pu# | 64-bit vector of packed #-bit unsigned integers |
epi# | 128-bit vector of packed #-bit signed integers |
epu# | 128-bit vector of packed #-bit unsigned integers |
ps | 128-bit vector of packed single-precision floats |
ss | 128-bit vector of one single-precision float |
pd | 128-bit vector of double-precision floats |
sd | 128-bit vector of one double-precision (64-bit) float |
si64 | 64-bit vector of single 64-bit integer |
si128 | 128-bit vector |
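To see how an X86 name decodes, take _mm_adds_epu8: "_mm_" is the prefix, "adds" is add with saturation, and "epu8" means a 128-bit vector of packed unsigned 8-bit integers. The sketch below wraps it in a hypothetical helper; it uses unaligned SSE2 loads and stores so the caller's arrays need no special alignment, and it falls back to scalar code when __SSE2__ is not defined.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: __m128i and the _mm_*_epi/epu intrinsics */
#endif

/* Saturating add of two 16-element unsigned char arrays. */
static void add_saturate_16_bytes(const uint8_t *a, const uint8_t *b,
                                  uint8_t *out)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* unaligned load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
#else
    for (int i = 0; i < 16; i++) {                      /* scalar fallback */
        unsigned int sum = (unsigned int)a[i] + b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
#endif
}
```

All 16 additions happen in one _mm_adds_epu8 call on the SSE2 path; any lane whose sum exceeds 255 is clamped to 255.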
Table 2 lists the intrinsics used in the sample code.
Table 2. Subset of vector operators and intrinsics used in the examples.
Operation | Altivec | Neon | MMX/SSE/SSE2 |
---|---|---|---|
loading vector | vec_ld, vec_splat, vec_splat_s16, vec_splat_s32, vec_splat_s8, vec_splat_u16, vec_splat_u32, vec_splat_u8 | vld1q_f32, vld1q_s16, vsetq_lane_f32, vld1_u8, vdupq_lane_s16, vdupq_n_s16, vmovq_n_f32, vset_lane_u8 | _mm_set_epi16, _mm_set1_epi16, _mm_set1_pi16, _mm_set_pi16, _mm_load_ps, _mm_set1_ps, _mm_loadh_pi, _mm_loadl_pi |
storing vector | vec_st | vst1_u8, vst1q_s16, vst1q_f32, vst1_s16 | _mm_store_ps |
add | vec_madd, vec_mladd, vec_adds | vaddq_s16, vaddq_f32, vmlaq_n_f32 | _mm_add_epi16, _mm_add_pi16, _mm_add_ps |
subtract | vec_sub | vsubq_s16 | |
multiply | vec_madd, vec_mladd | vmulq_n_s16, vmulq_s16, vmulq_f32, vmlaq_n_f32 | _mm_mullo_epi16, _mm_mullo_pi16, _mm_mul_ps |
arithmetic shift | vec_sra, vec_srl, vec_sr | vshrq_n_s16 | _mm_srai_epi16, _mm_srai_pi16 |
byte permutation | vec_perm, vec_sel, vec_mergeh, vec_mergel | vtbl1_u8, vtbx1_u8, vget_high_s16, vget_low_s16, vdupq_lane_s16, vdupq_n_s16, vmovq_n_f32, vbsl_u8 | _mm_shuffle_pi16, _mm_shuffle_ps |
type conversion | vec_cts, vec_unpackh, vec_unpackl, vec_ctu | vmovl_u8, vreinterpretq_s16_u16, vcvtq_u32_f32, vqmovn_s32, vqmovun_s16, vqmovn_u16, vcvtq_f32_s32, vmovl_s16, vmovq_n_f32 | _mm_packs_pu16, _mm_cvtps_pi16, _mm_packus_epi16 |
vector combination | vec_pack, vec_packsu | vcombine_u16, vcombine_u8, vcombine_s16 | |
maximum | | | _mm_max_ps |
minimum | | | _mm_min_ps |
vector logic | | | _mm_andnot_ps, _mm_and_ps, _mm_or_ps |
rounding | vec_trunc | | |
misc | | | _mm_empty |
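To show a few of these intrinsics working together, here is a minimal sketch of the Celsius-to-Fahrenheit conversion on floats. It is not the article's sample code: the function name is hypothetical, it assumes the element count is a multiple of 4, and on the SSE path it assumes both arrays are 16-byte aligned. The __ARM_NEON spelling is the newer ACLE form of the __ARM_NEON__ symbol and is checked here as a convenience.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>      /* SSE1: __m128 and the _mm_*_ps intrinsics */
#elif defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>       /* Neon: float32x4_t and the v*q_f32 intrinsics */
#endif

/* F = 1.8 * C + 32 for n floats, four at a time where possible. */
static void celsius_to_fahrenheit(const float *c, float *f, size_t n)
{
#if defined(__SSE__)
    const __m128 scale  = _mm_set1_ps(1.8f);    /* 1.8 in all four lanes */
    const __m128 offset = _mm_set1_ps(32.0f);   /* 32 in all four lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(c + i);          /* aligned load of 4 floats */
        v = _mm_add_ps(_mm_mul_ps(v, scale), offset);
        _mm_store_ps(f + i, v);                 /* aligned store */
    }
#elif defined(__ARM_NEON__) || defined(__ARM_NEON)
    const float32x4_t offset = vmovq_n_f32(32.0f);
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t v = vld1q_f32(c + i);
        v = vmlaq_n_f32(offset, v, 1.8f);       /* offset + v * 1.8 */
        vst1q_f32(f + i, v);
    }
#else
    for (size_t i = 0; i < n; i++)              /* portable scalar fallback */
        f[i] = c[i] * 1.8f + 32.0f;
#endif
}
```

The load, multiply, add and store steps follow the "blitter" pattern described earlier: bring a block of data into vector registers, process it, and write it back.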
Examine the Tradeoffs
Writing vector code with intrinsics forces you to make trade-offs. Your program will have a balance between scalar and vector operations. Do you have enough work for the vector hardware to make using it worthwhile? You must balance the portability of C against the need for speed and the complexity of vector code, especially if you maintain code paths for scalar and vector code. You must judge the need for speed versus accuracy. It may be that integer math will be fast enough and accurate enough to meet the need. One way to make those decisions is to test: write your program with a scalar code path and a vector code path and compare the two.
Data Structures
Start by laying out your data structures assuming that you'll be using intrinsics. This means getting data items aligned. If you can arrange the data for uniform vectors, do that.
Write Portable Scalar Code and Profile
Next, write your portable scalar code and profile it. This will be your reference code for correctness and the baseline to time your vector code. Profiling the code will show where the bottlenecks are. Make vector versions of the bottlenecks.
Write Vector Code
When you're writing that vector code, group the non-portable code into separate files by architecture. Write a separate Makefile for each architecture. That makes it easy to select the files you want to compile and supply arguments to the compiler for each architecture. Minimize the intermixing of scalar and vector code.
Use Compiler-Supplied Symbols if you #ifdef
For files that are common to more than one architecture, but have architecture-specific parts, you can #ifdef with symbols supplied by the compiler when SIMD instructions are available. These are:
- __MMX__ -- X86 MMX
- __SSE__ -- X86 SSE
- __SSE2__ -- X86 SSE2
- __VEC__ -- altivec functions
- __ARM_NEON__ -- neon functions
To see the baseline macros defined for other processors:
touch emptyfile.c
gcc -E -dD emptyfile.c | more
To see what's added for SIMD, do this with the SIMD command-line arguments for your compiler (see Table 1). For example:
touch emptyfile.c
gcc -E -dD emptyfile.c -mmmx -msse -msse2 -mfpmath=sse | more
Then compare the two results.
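As a small sketch of how the compiler-supplied symbols can select a code path, the function below reports which one the file was compiled for. The function name and the returned strings are ours; only the preprocessor symbols come from the list above.

```c
/* Report which SIMD path this file was compiled for (hypothetical helper).
   The most capable X86 extension is tested first. */
static const char *compiled_simd_path(void)
{
#if defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#elif defined(__MMX__)
    return "MMX";
#elif defined(__VEC__)
    return "Altivec";
#elif defined(__ARM_NEON__)
    return "Neon";
#else
    return "scalar";
#endif
}
```

This only tells you what the compiler was allowed to generate. Whether the processor you run on actually has the feature is a separate, runtime question.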
Check Processor at Runtime
Next, your code should check your processor at runtime to see if you have vector support for it. If you don't have a vector code path for that processor, fall back to your scalar code. If you have vector support, and the vector support is faster, use the vector code path. Test processor features on X86 with the cpuid instruction from <cpuid.h>. (You saw examples of that in samples/simple/x86/*c.) We couldn't find something that well established for Altivec and Neon, so the examples there parse /proc/cpuinfo. (Serious code might insert a test SIMD instruction. If the processor throws a SIGILL signal when it encounters that test instruction, you do not have that feature.)
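For the X86 case, a minimal sketch of such a check might look like the following. It uses __get_cpuid() and the bit_SSE2 mask from GCC's <cpuid.h>; the wrapper name is hypothetical, and on non-X86 targets it simply reports that the feature is absent.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>          /* GCC's wrapper around the cpuid instruction */
#endif

/* Returns 1 if the running CPU reports SSE2 support, 0 otherwise. */
static int cpu_has_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                       /* cpuid leaf 1 not available */
    return (edx & bit_SSE2) != 0;       /* feature flag in EDX */
#else
    return 0;                           /* not an X86 processor */
#endif
}
```

A caller would test this once at startup and pick the SSE2 or scalar code path accordingly.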
Test, Test, Test
Test everything. Test for timing: see if your scalar or vector code is faster. Test for correct results: compare the results of your vector code against the results of your scalar code. Test at different optimization levels: the behavior of the programs can change at different levels of optimization. Test against integer math versions of your code. Finally, watch for compiler bugs. GCC's SIMD and intrinsics are a work in progress.
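One way to compare a vector code path against the scalar reference is to run both over the same inputs and look at the worst-case difference. The helper below is a hypothetical sketch of that comparison for float outputs; the tolerance you accept is up to you.

```c
#include <stddef.h>

/* Largest absolute difference between reference and candidate outputs. */
static float max_abs_error(const float *reference, const float *candidate,
                           size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = reference[i] - candidate[i];
        if (d < 0.0f)
            d = -d;                 /* absolute value without <math.h> */
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

For integer code paths, where results should match exactly, the same loop can count mismatches instead of tracking a maximum.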
This gets us to our last code sample. In samples/colorconv2 is a colorspace conversion library that takes images in non-planar YUV422 and turns them into RGBA. It runs on PowerPCs using Altivec; ARM Cortex-A using Neon; and X86 using MMX, SSE and SSE2. (We tested on PowerMac G5 running Fedora 12, a Beagleboard running Angstrom 2009.X-test-20090508 and a Pentium 3 running Fedora 10.) Colorconv detects CPU features and uses code for them. It falls back to scalar code if no supported features are detected.
To build, untar the sources file and run make. Make uses the "uname" command to look for an architecture specific Makefile. (Unfortunately, Angstrom's uname on Beagleboard returns "unknown", so that's what the directory is called.)
Test programs are built along with the library. Testrange compares the results of the scalar code to the vector code over the entire range of inputs. Testcolorconv runs timing tests comparing the code paths it has available (intrinsics and scalar code) so you can see which runs faster.
Finally, here are some performance tips.
First, get a recent compiler and use the best code generation options. (Check the info pages that come with your compiler for things like the -mcpu option.)
Second, profile your code. Humans are bad at guessing where the bottlenecks are. Fix the bottlenecks, not other parts.
Third, get the most work you can from each vector operation by using the vector with the narrowest type elements that your data will fit into. Get the most work you can in each time slice by having enough work that you keep your vector hardware busy. Take big bites of data. If your vector hardware can handle a lot of vectors at the same time, use them. However, exceeding the number of vector registers you have available will slow things down. (Check your processor's documentation.)
Fourth, don't re-invent the wheel. Intel, Freescale and ARM all offer libraries and code samples to help you get the most from their processors. These include Intel's Integrated Performance Primitives, Freescale's libmotovec and ARM's OpenMAX.
Summary
In summary, GCC offers intrinsics that allow you to get more from your processor without the work of going all the way to assembly. We have covered basic types and some of the vector math functions. When you use intrinsics, make sure you test thoroughly. Test for speed and correctness against a scalar version of your code. Because each processor has different features, and those features perform differently, this is a wide open field. The more effort you put into it, the more you will get out.
References:
The GCC include files that map intrinsics to compiler built-ins (e.g., arm_neon.h) and the GCC info pages that explain those built-ins:
http://gcc.gnu.org/onlinedocs/gcc/Target-Builtins.html
http://ds9a.nl/gcc-simd/
http://softpixel.com/~cwright/programming/simd/index.php
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/BABCJFDG.html
http://www.arm.com/products/processors/technologies/neon.php
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s04s02.html
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0205j/BABGHIFH.html
http://www.tommesani.com/Docs.html
http://www.linuxjournal.com/article/7269
http://developer.apple.com/hardwaredrivers/ve/sse.html
http://en.wikipedia.org/wiki/Multiplication_algorithm#Shift_and_add
http://www.ibm.com/developerworks/power/library/pa-unrollav1/
http://en.wikipedia.org/wiki/MMX_(instruction_set)
Integrated Performance Primitives
http://software.intel.com/en-us/articles/intel-ipp/
http://software.intel.com/en-us/articles/non-commercial-software-download/
OpenMAX
http://www.khronos.org/developers/resources/openmax
Freescale AltiVec Libs for Linux
http://www.freescale.com/webapp/sps/site/overview.jsp?code=DRPPCNWALTVCLIB
AltiVec™ Technology Programming Interface Manual
http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf
http://developer.apple.com/hardwaredrivers/ve/instruction_crossref.html
Ian Ollmann's Altivec Tutorial
http://www-linux.gsi.de/~ikisel/reco/Systems/Altivec.pdf
http://arstechnica.com/civis/viewtopic.php?f=19&t=381165
RealView Compilation Tools Compiler Reference Guide (especially Appendix E)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348c/index.html
RealView Compilation Tools Assembler Guide (esp chapter 5)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/index.html
Intel C++ Intrinsics Reference
http://software.intel.com/sites/default/files/m/9/4/c/8/e/18072-347603.pdf