Lab 13: Data-Level Parallelism and SIMD

Getting Started

Download the files for Lab13 first.

In modern processor design, Data-Level Parallelism (DLP) is essential for exploiting structured, throughput-oriented workloads.

Scenario. SisTorLink, a scrappy telecom startup, is rushing to demo the world's first software-defined 6G receiver at next month's MWC. Their prototype works — but it's embarrassingly slow, processing only a trickle of the 50 GB/s link budget. The CTO dumps a hotspots.txt on your desk with three bottlenecks circled in red:

A signal clipper full of unpredictable branches that mispredict like crazy on random power spikes.
A 4×4 MIMO projection written as naive triply-nested loops — fine for MATLAB, disastrous for a real-time pipeline crunching millions of timestamps.
Nobody on the team actually knows what -O3 does. They just add it to the Makefile and hope.

Your mission: vectorize the first two with hand-written x86 SIMD, and settle the -O3 debate with cold, hard assembly evidence. Demos are in four weeks. Clock's ticking.

SIMD applies the same operation to multiple data elements in parallel using vector registers. This course is taught in a RISC-V setting, but this lab focuses on x86 SIMD (spanning from SSE to modern AVX-512) because most students do not have native execution environments for the latest RISC-V vector extensions.

Useful technical references:

Intel Software Developer's Manual
Intel Intrinsics Guide (crucial for finding SSE/AVX/AVX-512 commands)
AMD AMD64 Architecture Programmer's Manual
RISC-V V Standard Extension for Vector Operations, Version 1.0
UC Berkeley RISC-V Vector Extension Intrinsic API Reference

Part 1: SIMD Vectorization and Fringe Handling

The first hotspot circled on hotspots.txt: the signal clipper. Incoming samples suffer random power spikes (interference, amplifier nonlinearity) and must be hard-clipped to $[-5.0, +5.0]$ before hitting the demodulator. The legacy code does this with an if-ladder — two branches per sample, mispredicting constantly on noise-like data. Every branch mispredict flushes the pipeline. At 50 GB/s, that's billions of wasted cycles per second.

Your job: replace the inner loop with branchless SIMD intrinsics that clip four samples at a time. (The tail fringe will still use a scalar fallback — that's fine.)

You start in part1.c. The starter code provides a naive_clip_signal() function in standard C.

Your Task: Implement simd_clip_signal() using SSE intrinsics (_mm_loadu_ps, _mm_set1_ps, _mm_min_ps, _mm_max_ps, and _mm_storeu_ps). Crucially, your SIMD loop must handle cases where the signal length is not a clean multiple of 4. Processing these leftover elements is known as "fringe" or "tail" handling.

Run your code:

make test1

The cases 37 and 43 explicitly exercise your leftover-element logic.

Checkoff Questions

Describe how your simd_clip_signal() manages the bounds requirement. Why would a blind 128-bit vector load/store on the final 1–3 leftover elements cause a segmentation fault or memory corruption?
You replaced the conditional if branches in the naive code with dedicated min/max intrinsics. Architecturally, why is eliminating conditional branches crucial for maintaining continuous Instruction-Level Parallelism (ILP)?

Part 2: Loop Unrolling and Instruction-Level Parallelism

Second hotspot: the 4×4 MIMO spatial projection. SisTorLink's receiver uses a 4-antenna array; at every timestamp $t$ , the incoming 4-element signal vector must be multiplied by a fixed 4×4 precoding matrix. The original author wrote it as triply-nested loops (for i, for j, for k) — readable, but the inner-loop bookkeeping and pointer arithmetic bury the actual floating-point work under a mountain of overhead. Over millions of timestamps, those j and k loop counters add up.

The file contains:

naive_mimo — the original triple loop
loop_unroll_simd_mimo — blank canvas for your hand-vectorized version

Your Task: In the starter code, loop_unroll_simd_mimo() is empty. Complete this function by using explicitly x86 SIMD primitives (_mm_loadu_ps, _mm_set1_ps, _mm_add_ps, _mm_mul_ps, etc.). Hint: Instead of a complex horizontal dot-product, think about how broadcasting scalar signal elements across matrix rows cleanly resolves the math. Because both the signal vectors and the projection matrix are fixed exactly at 4 dimensions, you can completely unroll the matrix multiplication, eliminating all j and k loop overhead inside the main i loop!

Run:

make test2

Checkoff Questions

In your loop_unroll_simd_mimo() kernel, why does fully unrolling the 4x4 projection mathematically map perfectly to 4-lane wide vectors?
How many total _mm_add_ps and _mm_mul_ps instructions execute per timestamp i in your unrolled loop?

Part 3: Compiler Auto-Vectorization and Assembly

Third hotspot isn't really code — it's a culture problem. The SisTorLink build system throws -O3 at everything because "it makes stuff faster," but nobody has ever checked what it actually does to their loops. The team lead wants proof before she'll sign off on the -O3 flag for the production firmware. Your job: show her the assembly receipts.

First, make sure the binaries are built (make test2 or make all). Then disassemble both the slow (-O0) and optimized (-O3) versions:

objdump -d part2_3 > part2_3_O0.s
objdump -d part2_3_opt > part2_3_O3.s

Open the .s files and search for the <naive_mimo> function block.

Checkoff Questions

Compare naive_mimo in part2_3_O0.s to part2_3_O3.s. Do you see vectorized loop structures being used (addps, mulps, or ymm usage) in the -O3 dump?
Look at naive_mimo in part2_3_O3.s. Do you see vectorized loop structures being used (addps, mulps, or ymm usage) even though the source C code wasn't using intrinsics? Does the compiler successfully auto-vectorize the traditionally written nested loops?

📋 目录

Lab 13: Data-Level Parallelism and SIMD

Getting Started

Part 1: SIMD Vectorization and Fringe Handling

Checkoff Questions

Part 2: Loop Unrolling and Instruction-Level Parallelism

Checkoff Questions

Part 3: Compiler Auto-Vectorization and Assembly

Checkoff Questions