- Transformation for Performance
- Turning off and on vectorization
- Architecture notes
- MCDRAM and Cluster Modes
- Cache performance
- NUMACTL and memory allocations
- Tile Architecture
- Performance recommendations
- Vector Operation Costs
- Data Alignment
- General Programming Advice
- Environmental Variables
- Streaming Stores
- Loop Vectorization Requirements
- Compiler options for Vectorization
- Vector Directives: ivdep
- Vectorization of Random Numbers
- Optimization and Profiling
Transformation for Performance
Quoting from Jeffers, Reinders, and Sodani:
- Memory access and loop transformations (e.g., cache blocking, loop unrolling, prefetching, tiling, loop interchange, alignment, affinity).
- Vectorization works best on unit-stride accesses (the data being consumed is contiguous in memory). Data structure transformations can increase the amount of data accessed with unit stride, such as Array of Structures (AoS) to Structure of Arrays (SoA) transformations, or recoding to use packed arrays instead of indirect accesses (see the sketch after this list).
- Use of full (not partial) vectors is best, and data transformations to accomplish this should be considered.
- Vectorization is best with properly aligned data.
- Large page considerations (we recommend the widely used Linux libhugetlbfs library).
- Algorithm selection (change) to favor those that are parallelization and vectorization friendly.
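A minimal sketch (illustrative types, not code from the book) of the AoS-to-SoA transformation, showing how it turns strided field accesses into unit-stride ones:

```c
#define N 1024

/* Array of Structures: touching only the x field strides through memory in
   steps of sizeof(struct particle), which hinders vectorization. */
struct particle { double x, y, z, w; };
struct particle aos[N];

/* Structure of Arrays: each field is contiguous, so a loop over x is
   unit-stride and vectorizes cleanly. */
struct particles { double x[N], y[N], z[N], w[N]; } soa;

void shift_x_aos(double dx) {
    for (int i = 0; i < N; i++)
        aos[i].x += dx;   /* stride of 4 doubles between consecutive x values */
}

void shift_x_soa(double dx) {
    for (int i = 0; i < N; i++)
        soa.x[i] += dx;   /* unit stride: consecutive doubles in memory */
}
```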
Turning off and on vectorization
- To turn off vectorization: -no-vec -no-simd
- When using vectorization, compile with at least: -O2 -xHost
Architecture notes
- Each processor consists of dozens of tiles.
- Each tile has 2 cores, 2 vector processing units per core, 1MB of L2 cache, and a caching/home agent.
- L2 cache is coherent across tiles.
- Aggregate bandwidth on the 2D mesh interconnect is 700 GB/s.
- Cluster modes may affect performance when using more than 1 MPI rank per processor.
- There are 8 MCDRAM devices, each with 2GB. Aggregate bandwidth is 450GB/s.
- MCDRAM can be cache, flat (standard memory), or hybrid.
- Aggregate DDR bandwidth from 6 channels is 90GB/s.
MCDRAM and Cluster Modes
- MPI+OpenMP may run faster in SNC-4 cluster mode than in Quadrant mode.
- It is hard to beat the performance of MCDRAM cache mode.
- Many applications will run fine in Quadrant+Cache.
- Most applications will benefit more from added parallelism than from tuning cluster and MCDRAM modes.
- The key difference between Quadrant and SNC modes is whether MCDRAM and DDR appear as UMA or NUMA.
- For SNC, applications must be NUMA-aware and divided into multiple MPI ranks per processor.
- Two-way modes (Hemisphere, SNC-2) have higher latency; use Quadrant or SNC-4.
- When an application uses more than 16GB of memory, running MCDRAM in flat (non-cache) mode might be better.
- Memory usage model summary on page 29.
- numactl -H will print information on memory mode
- numastat can provide additional information
- setKNLmodes script on page 59 can help with setting the cluster and memory modes
- SNC-4 is analogous to a 4-socket Intel Xeon system (p75)
Cache performance
- L1 data cache is 32KB per core.
- L2 cache is 1MB per tile, or about 512KB per core.
- Performance degrades steeply at each step down the memory hierarchy (L1 -> L2 -> MCDRAM).
- DDR is dramatically slower again than MCDRAM (see Figure 3.4 on page 32).
NUMACTL and memory allocations
- numactl -m 1 program forces the program's allocations into MCDRAM (NUMA node 1).
- numactl -p 1 program prefers MCDRAM but falls back to DDR if MCDRAM is full.
- See page 38 for an example.
- The memkind library lets code allocate directly into MCDRAM (e.g., hbw_malloc(), or overriding new in C++); see the sketch after this list.
- In cache mode, MCDRAM is not exposed as a separate NUMA node, so hbw_check_available() reports that high-bandwidth memory is unavailable and memkind cannot be used to target it.
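A minimal sketch (assuming the memkind hbwmalloc API; not code from the book) that allocates an array from MCDRAM when it is exposed as addressable memory, falling back to DDR otherwise. Link with -lmemkind.

```c
#include <hbwmalloc.h>   /* hbw_check_available, hbw_malloc, hbw_free */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t n = 1 << 20;
    double *a;
    int have_hbw = (hbw_check_available() == 0);   /* 0 means HBW memory is visible */

    if (have_hbw)
        a = hbw_malloc(n * sizeof(double));        /* allocate from MCDRAM */
    else
        a = malloc(n * sizeof(double));            /* fall back to ordinary DDR */
    if (!a) return 1;

    for (size_t i = 0; i < n; i++) a[i] = (double)i;
    printf("a[42] = %g\n", a[42]);

    if (have_hbw) hbw_free(a); else free(a);
    return 0;
}
```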
Tile Architecture
- Each VPU can execute one 512-bit vector fused multiply-add (FMA) instruction per cycle.
- Each core can therefore do 32 double-precision FP ops per cycle (2 VPUs x 8 DP lanes x 2 ops per FMA).
- Cores on a tile share the L2 cache read and write bandwidth.
- AVX-512 registers are 512 bits wide, holding 8 double-precision values.
- Using two threads per core usually provides maximum performance.
Performance recommendations
- Use static libraries.
- Put "export LD_PREFER_MAP_32BIT_EXEC=1" in .bashrc.
- Use 2M or 1G pages.
- Avoid SSE instructions.
- Reference multiple pointers before dereferencing the first.
- Use AVX-512 instructions.
Vector Operation Costs
- Simple math, loads, and stores have cost 1.
- Gathers of 8 or 16 elements have cost 14 or 20, respectively.
- Horizontal reductions have cost 30.
- Division and square roots have cost 15.
- See examples on pages 122-123.
Data Alignment to Assist Vectorization
- Use "_mm_malloc()" and "_mm_free()" to get aligned allocations (e.g., 64-byte alignment for AVX-512).
- Use "__assume_aligned(a, 64)" before a loop to tell the compiler that a is aligned.
- Also "#pragma vector aligned", which asserts that all arrays in the loop are aligned.
- When combined with "#pragma omp parallel for", place it after the OpenMP pragma.
- Data alignment information is on page 181.
- A combined sketch using the assume-aligned hint and "#pragma vector aligned" follows.
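A minimal sketch (assuming the Intel compiler-specific hints __assume_aligned and #pragma vector aligned; function and variable names are illustrative):

```c
#include <immintrin.h>   /* _mm_malloc, _mm_free */

/* Scale one array by another; both pointers are promised to be 64-byte aligned. */
void scale(double *a, const double *b, int n) {
    __assume_aligned(a, 64);    /* compiler hint: a is 64-byte aligned */
    __assume_aligned(b, 64);
#pragma vector aligned           /* all array accesses in this loop are aligned */
    for (int i = 0; i < n; i++)
        a[i] *= b[i];
}

int main(void) {
    int n = 1024;
    double *a = _mm_malloc(n * sizeof(double), 64);   /* 64-byte-aligned allocations */
    double *b = _mm_malloc(n * sizeof(double), 64);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0; }
    scale(a, b, n);
    _mm_free(a);
    _mm_free(b);
    return 0;
}
```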
General Programming Advice
- Manage Domain Parallelism
- Increase Thread Parallelism
- Exploit Data Parallelism
- Improve Data Locality
Environmental Variables
- KMP_AFFINITY=scatter to distribute threads across cores.
- KMP_STACKSIZE=16MB instead of the standard 12MB.
- KMP_BLOCKTIME=infinite to prevent threads from sleeping.
- There are other OMP variables for nested threads, for future reference.
- Autovectorization using -O2 or -O3
- For a compiler optimization report, add "-qopt-report -qopt-report-phase=loop,vec".
- Avoid gather/scatter; instead align and pack memory.
- Fetch from cache, not memory: prefetch to L2, then prefetch from L2 to L1. Look at "_mm_prefetch".
- Re-use data in cache if possible.
- If data is being written out and will not be re-used, use streaming stores to prevent evictions from cache. Data must occupy linear memory without gaps.
- Avoid manual loop unrolling.
- SIMD directives on page 193
- Vectorization may not produce numerically identical results to scalar operations, especially in reductions. Use “-fp-model precise” to prevent vectorization of reductions (and other things).
- Compiler prefetching via "-qopt-prefetch=n"; it is automatically set to n=3 with -Ox.
- Pragma hint "#pragma prefetch var:hint:distance", where hint=0 (L1 and L2) or hint=1 (L2 only).
- "_mm_prefetch(char const *address, int hint)" loads one cache line of data at address.
- Too many prefetches are problematic. Compiler prefetching can be disabled with "-qopt-prefetch=0".
- Disable compiler prefetching for a particular loop with "#pragma noprefetch".
- Example code is on page 184; an illustrative sketch follows below.
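A minimal sketch of explicit prefetching (a hypothetical loop, not the book's page-184 example), assuming the Intel prefetch pragma and the _mm_prefetch intrinsic:

```c
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0, _MM_HINT_T1 */

/* Sum an array while pulling data into cache ahead of use. In practice you
   would rely on either the pragma or the intrinsic, not both; both appear
   here only for illustration. */
double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
#pragma prefetch a:1:64                 /* hint 1 = prefetch into L2, 64 iterations ahead */
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)                 /* intrinsic: pull a line ~16 iterations ahead into L1 */
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```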
Streaming Stores
- Compiler option "-qopt-streaming-stores keyword", where keyword is always, never, or auto (auto is the default).
- Whether a loop's stores should be streaming can sometimes only be determined at runtime (e.g., a variable trip count), so such loops may need "#pragma vector nontemporal"; a sketch follows.
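A minimal sketch of a nontemporal (streaming) store hint, assuming the Intel "#pragma vector nontemporal" directive (the function and array names are illustrative):

```c
/* Initialize a large output array that will not be read again soon. The
   nontemporal hint asks the compiler to use streaming stores so the written
   cache lines are not kept in (and do not evict useful data from) cache. */
void fill(double *out, double value, long n) {
#pragma vector nontemporal (out)
    for (long i = 0; i < n; i++)
        out[i] = value;
}
```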
Loop Vectorization Requirements
- Inner loop in a loop nest.
- Straight-line code: no jumps or branches, though if statements that can be implemented with masking are allowed.
- Must be countable, with no data-dependent exit conditions.
- No backward loop-carried dependencies: iteration i must not need a result produced in iteration i-1, as in a[i] = a[i-1] + 1 (a small sketch contrasting such loops appears after this list).
- No special operators, and no calls to functions or subroutines.
- Intrinsic math functions such as sin(), log(), and fmax() are OK.
- The following math functions are OK: sin, cos, tan, asin, acos, atan, log, log2, log10, exp, exp2, sinh, cosh, tanh, asinh, acosh, atanh, erf, erfc, erfinv, sqrt, cbrt, trunc, round, ceil, floor, fabs, fmin, fmax, pow, and atan2.
- Reductions and vector assignments OK.
- Avoid mixed data types.
- Use contiguous memory locations, with unit stride.
- Use "#pragma ivdep" to advise the compiler that there are no loop-carried dependencies.
- Use "#pragma vector always" to force vectorization.
- Check vectorization report.
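A small sketch (illustrative, not from the book) contrasting a loop that meets these requirements with one that breaks the dependency rule:

```c
#include <math.h>

/* Vectorizable: countable, unit stride, no loop-carried dependency, and the
   only call is to a vectorizable math intrinsic. */
void ok(double *a, const double *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = sqrt(b[i]) + 1.0;
}

/* Not vectorizable: iteration i reads the value written in iteration i-1,
   a backward loop-carried dependency. */
void not_ok(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1.0;
}
```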
Compiler options for Vectorization
- "-restrict" allows restrict to be used as a keyword in C, marking pointers as non-aliasing so loops over them can be vectorized; see the sketch below.
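A minimal sketch of restrict-qualified pointers (illustrative function; compile with the Intel -restrict option, or in a C99 mode where restrict is already a keyword):

```c
/* Because a and b are declared non-aliasing, the compiler can assume there is
   no dependency between the store to a[i] and the load from b[i], so the loop
   can be vectorized safely. */
void copy(double *restrict a, const double *restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i];
}
```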
Vector Directives: ivdep
- The following style of loop would not vectorize without ivdep, since the value of k is not known at compile time and could be negative (k < 0), which would create a backward dependency.
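A sketch in the spirit of the standard Intel ivdep example (not copied from the book):

```c
/* Without #pragma ivdep the compiler must assume k could be negative, in
   which case a[i + k] would be an element written earlier in the loop and
   vectorization would be unsafe. The pragma asserts no such dependency exists. */
void scale_shifted(double *a, int k, double c, int n) {
#pragma ivdep
    for (int i = 0; i < n; i++)
        a[i] = a[i + k] * c;
}
```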
Vectorization of Random Numbers
- drand48, erand48, lrand48, nrand48, mrand48, and jrand48 can be vectorized.
Optimization and Profiling
- Use “-xCOMMON-AVX512”
- For profiling, use “-g”
- Survey usage:
- Set environment variable: “source /opt/intel/advisor_xe_2016/advixe-vars.sh”
- Collect Survey data: "advixe-cl --collect=survey --project-dir=<dir>"
- Launch the Advisor GUI: "advixe-gui"
- Output result data usually lands in a directory named e000 or something similar.
- Information on Vectorization Advisor on page 217
AVX-512 intrinsics perform operations on packed data in 512-bit chunks: 8 doubles or 16 singles (other data types are also available). They provide vectorized add, subtract, multiply, divide, and FMA. Jeffers et al. give example code; note the casting of the 512-bit SIMD data types when passing them to a function. A simpler sketch in the same spirit follows.
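A minimal sketch (not the book's code) using AVX-512 intrinsics on 8 packed doubles; compile with -xMIC-AVX512 (or -mavx512f):

```c
#include <immintrin.h>
#include <stdio.h>

/* Computes a*b + c on 8 packed doubles at once. */
static __m512d fma8(__m512d a, __m512d b, __m512d c) {
    return _mm512_fmadd_pd(a, b, c);
}

int main(void) {
    double x[8] __attribute__((aligned(64))) = {1, 2, 3, 4, 5, 6, 7, 8};
    double y[8] __attribute__((aligned(64))) = {8, 7, 6, 5, 4, 3, 2, 1};
    double r[8] __attribute__((aligned(64)));

    __m512d vx = _mm512_load_pd(x);                  /* aligned 512-bit load of 8 doubles */
    __m512d vy = _mm512_load_pd(y);
    __m512d vr = fma8(vx, vy, _mm512_set1_pd(1.0));  /* x*y + 1 across all lanes */
    _mm512_store_pd(r, vr);                          /* aligned 512-bit store */

    for (int i = 0; i < 8; i++) printf("%g ", r[i]); /* 9 15 19 21 21 19 15 9 */
    printf("\n");
    return 0;
}
```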
Intel Intrinsics Guide
The Intel Intrinsics Guide, available online from Intel, documents these intrinsics in full.