This is a roughly 30-minute talk given at idea.edu.cn
Random Performance Share
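Two kinds of programmers almost never write multithreading bugs: the first kind knows nothing and simply takes a lock whenever threads are involved; the second kind has studied memory models, computer architecture, cache coherence protocols, memory barriers, race conditions, instruction reordering... and then decides to simply take a lock whenever threads are involved.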
SIMD
- Single (Assembly) Instruction Multiple Data
- How much data? It depends on the register.
- Great source of performance improvement.
- x86/x86-64 Architecture (Intel/AMD)
- MMX Registers (MMX Technology):
- MM0-MM7: 64-bit registers used for integer SIMD operations.
- SSE (Streaming SIMD Extensions):
- XMM0-XMM15: 128-bit registers introduced with SSE (XMM8-XMM15 are available only in 64-bit mode); later SSE versions added more instructions on the same registers.
- AVX (Advanced Vector Extensions):
- YMM0-YMM15: 256-bit registers introduced with AVX. Each YMM register extends the corresponding XMM register.
- AVX-512:
- ZMM0-ZMM31: 512-bit registers introduced with AVX-512, extending the YMM and XMM registers.
- Mask Registers (K0-K7): opmask registers used for per-lane predication of AVX-512 operations; 16 bits wide in AVX-512F, 64 bits wide with AVX-512BW.
- ARM Architecture
- NEON (Advanced SIMD):
- Q0-Q15: 128-bit registers (ARMv7/AArch32 view); each maps to two 64-bit D registers, and Q0-Q7 also map to four 32-bit S registers each.
- D0-D31: 64-bit registers; D0-D15 can also be accessed as pairs of 32-bit S registers.
- S0-S31: 32-bit registers.
- Caveats:
- In-order execution pipeline; high latency due to memory access.
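To make "one instruction, multiple data" concrete, here is a minimal sketch of adding two float arrays four elements at a time with SSE intrinsics. The function name add_f32 is made up for illustration, and the code assumes an x86 target with SSE2; elsewhere it falls back to the plain scalar loop.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE/SSE2 intrinsics (baseline on x86-64)
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for n floats.
void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    // One 128-bit XMM register holds four 32-bit floats,
    // so each _mm_add_ps adds four pairs at once.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail for leftover elements (and full fallback on non-x86 targets).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

With 256-bit YMM registers the same loop would step by eight floats, and with 512-bit ZMM registers by sixteen.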
Example
- AVX512 SSE SIMD Example
- The key to SIMD is an efficient data layout, which must be arranged manually.
- Modern compilers now come with auto-vectorization.
- SIMD conditionals: branchless programming using SIMD masks.
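The data layout point can be sketched as array-of-structures versus structure-of-arrays. All type and function names below are hypothetical. The loop is the kind of simple, contiguous loop a compiler may auto-vectorize at -O2 or -O3, though whether it actually does depends on the compiler and flags.

```cpp
#include <cstddef>
#include <vector>

// Array of structures: x, y, z are interleaved in memory, so loading
// "four x values" into one register needs strided access.
struct ParticleAoS { float x, y, z; };

// Structure of arrays: each field is contiguous, so a loop over one
// field reads consecutive floats.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// Hypothetical update step: x[i] += vx[i] * dt.
// Assumes vx has at least as many elements as p.x.
void advance_x(ParticlesSoA& p, const std::vector<float>& vx, float dt) {
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += vx[i] * dt;
    }
}
```

GCC's -fopt-info-vec and Clang's -Rpass=loop-vectorize report which loops were vectorized, which is a quick way to check.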
Conditional SIMD
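A minimal sketch of a branchless conditional, assuming SSE2: the comparison produces a per-lane mask, and bitwise AND/ANDNOT/OR select between two values without any per-element branch. The helper name clamp_max is made up for illustration.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: out[i] = (x[i] > limit) ? limit : x[i].
void clamp_max(const float* x, float* out, std::size_t n, float limit) {
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128 vlimit = _mm_set1_ps(limit);   // limit in all four lanes
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        __m128 mask = _mm_cmpgt_ps(v, vlimit);  // all-ones bits where v > limit
        __m128 hi = _mm_and_ps(mask, vlimit);   // limit where the mask is set
        __m128 lo = _mm_andnot_ps(mask, v);     // v where the mask is clear
        _mm_storeu_ps(out + i, _mm_or_ps(hi, lo));
    }
#endif
    // Scalar tail (and fallback on non-x86 targets).
    for (; i < n; ++i) out[i] = (x[i] > limit) ? limit : x[i];
}
```

Here the mask lives in an ordinary vector register. AVX-512 moves it into the dedicated K registers, where an intrinsic such as _mm512_mask_blend_ps does the select in one step.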
SIMD: The good parts
- String processing: finding characters, validating UTF-8, parsing JSON, and CSV.
- Hashing: random generation, cryptography (AES).
- Columnar databases: bit packing, filtering, joins.
- Sorting built-in types: VQSort, QuickSelect.
- Machine Learning and Artificial Intelligence: speeding up PyTorch, TensorFlow.
- Aside: Cloudflare continues to use Intel CPUs because of SIMD, even though Qualcomm’s ARM chip is half the price with better performance.
- It didn’t even turbo to handle the request.
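As a rough illustration of the "finding characters" case, the sketch below compares 16 bytes at a time against a needle byte. The name find_byte is hypothetical, the code assumes SSE2, and __builtin_ctz is a GCC/Clang builtin rather than standard C++.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: index of the first c in s[0..n), or n if absent.
std::size_t find_byte(const char* s, std::size_t n, char c) {
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8(c);  // c in all 16 byte lanes
    for (; i + 16 <= n; i += 16) {
        __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
        // cmpeq gives 0xFF per matching byte; movemask packs the top bit
        // of each byte into a 16-bit integer.
        int bits = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (bits != 0) {
            // The lowest set bit is the first match (GCC/Clang builtin).
            return i + static_cast<std::size_t>(
                           __builtin_ctz(static_cast<unsigned>(bits)));
        }
    }
#endif
    // Scalar tail (and fallback on non-x86 targets).
    for (; i < n; ++i) {
        if (s[i] == c) return i;
    }
    return n;
}
```

CSV and JSON parsers build on the same compare-and-movemask idea to locate delimiters and quotes in bulk.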
Caveats
- Also Cloudflare: they observed that AVX-512 lowers CPU frequency and is very hard to scale.
- That’s probably why auto-vectorization defaults to SSE/AVX.
- It may underperform on some CPUs, even for the same assembly, due to CPU design.
- GCC’s default -mtune=generic may only put 128 bits of data in a 256-bit register to improve performance, LOL.
- Alignment is the key.
- Have fun handling padding (it is also type-checked).
- Have fun handling odd element counts :).
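One way to deal with alignment and odd element counts is to over-allocate to a multiple of the lane count and align the storage. The sketch below assumes 8 float lanes (one 256-bit YMM register) and 32-byte alignment. All names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed lane count: 8 floats fill one 256-bit YMM register.
constexpr std::size_t kLanes = 8;

// Round n up to the next multiple of kLanes so a SIMD loop over the
// buffer never needs a scalar tail.
constexpr std::size_t padded_count(std::size_t n) {
    return (n + kLanes - 1) / kLanes * kLanes;
}

// Hypothetical fixed-size buffer: 32-byte aligned and padded with zeros.
template <std::size_t N>
struct alignas(32) PaddedFloats {
    float data[padded_count(N)] = {};  // padding elements start at zero
};

// True if p sits on the given power-of-two byte boundary.
inline bool is_aligned(const void* p, std::size_t bytes) {
    return reinterpret_cast<std::uintptr_t>(p) % bytes == 0;
}
```

Aligned load intrinsics such as _mm256_load_ps require a 32-byte aligned address, while _mm256_loadu_ps accepts any address. Zero padding is harmless for sums but not for every operation (a minimum, for example), so the pad value has to match what the loop computes.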
False Sharing && True Sharing
- Happens only in multithreaded programs.
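False sharing is when threads update different variables that happen to sit on the same cache line; true sharing is when they update the same variable. A minimal layout sketch, assuming 64-byte cache lines (common on current x86-64 parts, but not guaranteed) and using made-up type names:

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Both counters most likely share one cache line: a thread updating a
// and another updating b still invalidate each other's copy of the
// line (false sharing).
struct SharedLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// alignas pads each counter out to a full cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

// Each counter now lives on its own cache line.
struct SeparateLines {
    PaddedCounter a;
    PaddedCounter b;
};
```

The padded layout trades memory for less cache-line ping-pong. It does nothing for true sharing, where the fix is to restructure the algorithm, for example with per-thread counters that are summed at the end.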