This is a ~30-minute talk given at idea.edu.cn.

Random Performance Share



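Two kinds of programmers almost never write multithreading bugs: those who know nothing and simply take a lock whenever threads are involved, and those who have thoroughly studied memory models, computer architecture, cache coherence protocols, memory barriers, race conditions, instruction reordering... and then decide to simply take a lock whenever threads are involved.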


SIMD

  • Single (Assembly) Instruction Multiple Data
  • How much data? It depends on the register width: a 128-bit register holds four 32-bit values, a 256-bit register holds eight.
  • Great source of performance improvement.

  • x86/x86-64 Architecture (Intel/AMD)
    • MMX Registers (MMX Technology):
      • MM0-MM7: 64-bit registers used for integer SIMD operations.
    • SSE (Streaming SIMD Extensions):
      • XMM0-XMM15: 128-bit registers. XMM0-XMM7 were introduced with SSE; XMM8-XMM15 are available only in 64-bit mode.
    • AVX (Advanced Vector Extensions):
      • YMM0-YMM15: 256-bit registers introduced with AVX. Each YMM register extends the corresponding XMM register.
    • AVX-512:
      • ZMM0-ZMM31: 512-bit registers introduced with AVX-512, extending the YMM and XMM registers.
      • Mask Registers (K0-K7): up to 64 bits wide (16 bits are used in the base AVX-512F), holding one bit per vector element to control which lanes an AVX-512 operation affects.

  • ARM Architecture
    • NEON (Advanced SIMD), ARMv7/AArch32 register view:
      • Q0-Q15: 128-bit registers, each of which can be accessed as two 64-bit D registers.
      • D0-D31: 64-bit registers; D0-D15 can also be accessed as pairs of 32-bit S registers.
      • S0-S31: 32-bit registers.
      • AArch64 instead provides V0-V31, 128-bit registers.
  • Caveats:
    • On an in-order execution pipeline, memory accesses have high latency and can stall the SIMD code.
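To make "how much data depends on the register" concrete, here is a minimal sketch that computes how many elements (lanes) fit in one register. The helper name `simd_lanes` is hypothetical and only illustrates the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements ("lanes") of a given size that fit in one SIMD register.
   register_bits: 64 (MMX), 128 (SSE / NEON Q), 256 (AVX), 512 (AVX-512).
   simd_lanes is a hypothetical helper, not part of any library. */
static size_t simd_lanes(size_t register_bits, size_t element_bytes)
{
    /* The register must hold a whole number of elements. */
    assert(element_bytes > 0 && register_bits % (8 * element_bytes) == 0);
    return register_bits / (8 * element_bytes);
}
```

For example, `simd_lanes(256, 4)` gives the number of 32-bit floats in a YMM register, and `simd_lanes(512, 1)` the number of bytes in a ZMM register.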

Example

  • AVX512 SSE SIMD Example
  • The key to SIMD is an efficient data layout, which must be arranged manually.
  • Modern compilers come with auto-vectorization.
    • Conditions are handled without branches (branchless programming) by using SIMD masks.
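As an illustration of what "efficient data layout" can mean, the sketch below contrasts an array-of-structures layout with a structure-of-arrays layout. The type and function names are hypothetical, and whether a compiler actually vectorizes either loop depends on the compiler, flags, and target.

```c
#include <stddef.h>

#define N 1024  /* hypothetical capacity, chosen only for this sketch */

/* Array of structures (AoS): x, y, z of one point sit next to each other,
   so consecutive x values are sizeof(struct point_aos) bytes apart. */
struct point_aos { float x, y, z; };

/* Structure of arrays (SoA): all x values are contiguous, so one vector
   load can fetch several consecutive x values. */
struct points_soa {
    float x[N];
    float y[N];
    float z[N];
};

/* Scale every x in the AoS layout: strided access. */
void scale_x_aos(struct point_aos *p, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        p[i].x *= k;
}

/* Scale every x in the SoA layout: unit-stride access, which is the
   pattern auto-vectorizers handle most easily. */
void scale_x_soa(struct points_soa *p, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] *= k;
}
```

Both functions compute the same result; only the memory access pattern differs.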

Conditional SIMD

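/* Scalar version: two nested, data-dependent branches per element. */
for (i = 0; i < 1024; i++)
{
    if (A[i] > 0)
    {
        C[i] = B[i];
        if (B[i] < 0)
            D[i] = E[i];
    }
}

/* Vectorized version (pseudocode, 4 elements per iteration).
   A[i:i+3] denotes the vector A[i]..A[i+3]; vec_sel(x, y, m) takes y where
   the mask m is set and x elsewhere. */
for (i = 0; i < 1024; i += 4)
{
    if (vec_any_gt(A[i:i+3], (0, 0, 0, 0)))    /* skip if no lane has A > 0 */
    {
        vPA = A[i:i+3] > (0, 0, 0, 0);         /* mask: outer condition */
        C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vPA);
        vT = B[i:i+3] < (0, 0, 0, 0);
        vPB = vec_sel((0, 0, 0, 0), vT, vPA);  /* mask: outer AND inner condition */
        if (vec_any_ne(vPB, (0, 0, 0, 0)))     /* skip if no lane needs D updated */
            D[i:i+3] = vec_sel(D[i:i+3], E[i:i+3], vPB);
    }
}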
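The same masking idea can be written in plain, portable C, one element at a time. This is a sketch rather than real SIMD code: `select32` is a hypothetical scalar stand-in for `vec_sel`, and the function assumes the arrays do not overlap and that every `E[i]` is readable, since it is read unconditionally.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for vec_sel: take b where mask bits are 1, a elsewhere.
   The final conversion back to int32_t is implementation-defined in C for
   values above INT32_MAX; mainstream compilers wrap (two's complement). */
static inline int32_t select32(int32_t a, int32_t b, uint32_t mask)
{
    return (int32_t)(((uint32_t)a & ~mask) | ((uint32_t)b & mask));
}

/* Branchless form of:
     if (A[i] > 0) { C[i] = B[i]; if (B[i] < 0) D[i] = E[i]; } */
void conditional_copy(const int32_t *A, const int32_t *B, int32_t *C,
                      int32_t *D, const int32_t *E, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* All-ones when the comparison holds, all-zeros otherwise,
           like the result of a SIMD compare. */
        uint32_t pa = -(uint32_t)(A[i] > 0);
        uint32_t pb = pa & -(uint32_t)(B[i] < 0);  /* outer AND inner */
        C[i] = select32(C[i], B[i], pa);
        D[i] = select32(D[i], E[i], pb);
    }
}
```

Because the loop body has no branches, every iteration does the same work, which is the shape a vectorizer or a hand-written intrinsic version would also use.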

SIMD: The good parts

  • String processing: finding characters, validating UTF-8, parsing JSON, and CSV.
  • Hashing: random generation, cryptography (AES).
  • Columnar databases: bit packing, filtering, joins.
  • Sorting built-in types: VQSort, QuickSelect.
  • Machine Learning and Artificial Intelligence: speeding up PyTorch, TensorFlow.
  • Aside: Cloudflare continues to use Intel CPUs because of SIMD, even though Qualcomm’s ARM chip is half the price with better performance.
    • It didn’t even turbo to handle the request.

Caveats

  • Also from Cloudflare: they observed that AVX-512 lowers the CPU frequency and is very hard to scale.
    • That’s probably why auto-vectorization defaults to SSE/AVX.
  • SIMD code may underperform on some CPUs, even with the same assembly, because of differences in CPU design.
    • GCC’s default -mtune=generic may put only 128 bits of data in a 256-bit register to improve performance, LOL.
  • Alignment is key.
    • Have fun handling padding (it is also type-checked).
    • Have fun handling element counts that are not a multiple of the vector width :).
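The alignment and leftover-element chores mentioned above might look like the following sketch. It assumes a C11 toolchain (for `aligned_alloc`), a hypothetical vector width of 4 floats, and 64-byte alignment; all function names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LANES 4  /* assumed vector width in floats (one 128-bit register) */

/* out[i] = a[i] + b[i], processed in blocks of LANES plus a scalar tail. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    size_t vec_end = n - (n % LANES);  /* largest multiple of LANES <= n */

    /* Main loop: each outer iteration is one vector's worth of work. */
    for (; i < vec_end; i += LANES)
        for (size_t lane = 0; lane < LANES; lane++)
            out[i + lane] = a[i + lane] + b[i + lane];

    /* Tail loop: the 0..LANES-1 leftover elements. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Allocate n floats on a 64-byte boundary. The byte count is rounded up to
   a multiple of the alignment, which C11 aligned_alloc expects.
   Release the result with free(). */
float *alloc_floats_aligned(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t padded = (bytes + 63) / 64 * 64;
    return aligned_alloc(64, padded);
}

/* True if p is a multiple of alignment (alignment must be nonzero). */
int is_aligned(const void *p, size_t alignment)
{
    return (uintptr_t)p % alignment == 0;
}
```

With `n = 7`, the main loop handles elements 0 to 3 and the tail loop handles elements 4 to 6.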

False Sharing && True Sharing

  • Happens only in multithreaded programs.
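As a rough illustration: false sharing arises when threads update different variables that happen to share a cache line, while true sharing is when they update the same variable. The sketch below shows only the data layout side, assuming a 64-byte cache line (typical on x86-64, but check the target CPU) and C11 `alignas`; the struct names are hypothetical.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumption: cache line size in bytes */

/* Both counters usually land in one cache line: a thread updating a and
   another thread updating b keep invalidating each other's cached copy,
   even though they never touch the same variable (false sharing). */
struct counters_shared {
    long a;
    long b;
};

/* Each counter starts its own cache line, so the two threads no longer
   contend for the same line. The cost is extra memory for padding. */
struct counters_padded {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};
```

Padding does not help with true sharing; that needs a different design, such as per-thread counters that are combined at the end.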

Pseudo Random Number Generator