Lab 13: Data-Level Parallelism and SIMD
Computer Architecture I@ShanghaiTech
Getting Started
Download the files for Lab13 first.
In modern processor design, Data-Level Parallelism (DLP) is essential for exploiting structured, throughput-oriented workloads.
Scenario. SisTorLink, a scrappy
telecom startup, is rushing to demo the world's first
software-defined 6G receiver at next month's
MWC. Their prototype works — but it's embarrassingly slow,
processing only a trickle of the 50 GB/s link budget. The CTO
dumps a hotspots.txt on your desk with three
bottlenecks circled in red:
- A signal clipper full of unpredictable branches that mispredict like crazy on random power spikes.
- A 4×4 MIMO projection written as naive triply-nested loops — fine for MATLAB, disastrous for a real-time pipeline crunching millions of timestamps.
- Nobody on the team actually knows what
-O3does. They just add it to the Makefile and hope.
Your mission: vectorize the first two with hand-written x86
SIMD, and settle the -O3 debate with cold, hard
assembly evidence. Demos are in four weeks. Clock's ticking.
SIMD applies the same operation to multiple data elements in parallel using vector registers. This course is taught in a RISC-V setting, but this lab focuses on x86 SIMD (spanning from SSE to modern AVX-512) because most students do not have native execution environments for the latest RISC-V vector extensions.
Useful technical references:
- Intel Software Developer's Manual
- Intel Intrinsics Guide (crucial for finding SSE/AVX/AVX-512 commands)
- AMD AMD64 Architecture Programmer's Manual
- RISC-V V Standard Extension for Vector Operations, Version 1.0
- UC Berkeley RISC-V Vector Extension Intrinsic API Reference
Part 1: SIMD Vectorization and Fringe Handling
The first hotspot circled on hotspots.txt: the
signal clipper. Incoming samples suffer random power spikes
(interference, amplifier nonlinearity) and must be hard-clipped
to
before hitting the demodulator. The legacy code does this with
an if-ladder — two branches per sample,
mispredicting constantly on noise-like data. Every branch
mispredict flushes the pipeline. At 50 GB/s, that's billions of
wasted cycles per second.
Your job: replace the inner loop with branchless SIMD intrinsics that clip four samples at a time. (The tail fringe will still use a scalar fallback — that's fine.)
You start in part1.c. The starter code provides
a naive_clip_signal() function in standard C.
Your Task: Implement
simd_clip_signal() using SSE intrinsics
(_mm_loadu_ps, _mm_set1_ps,
_mm_min_ps, _mm_max_ps, and
_mm_storeu_ps). Crucially, your SIMD loop must
handle cases where the signal length is not a clean multiple of
4. Processing these leftover elements is known as "fringe" or
"tail" handling.
Run your code:
make test1The cases 37 and 43 explicitly
exercise your leftover-element logic.
Checkoff Questions
- Describe how your
simd_clip_signal()manages the bounds requirement. Why would a blind 128-bit vector load/store on the final 1–3 leftover elements cause a segmentation fault or memory corruption? - You replaced the conditional
ifbranches in the naive code with dedicated min/max intrinsics. Architecturally, why is eliminating conditional branches crucial for maintaining continuous Instruction-Level Parallelism (ILP)?
Part 2: Loop Unrolling and Instruction-Level Parallelism
Second hotspot: the 4×4 MIMO spatial projection. SisTorLink's
receiver uses a 4-antenna array; at every timestamp
,
the incoming 4-element signal vector must be multiplied by a
fixed 4×4 precoding matrix. The original author wrote it as
triply-nested loops (for i, for j,
for k) — readable, but the inner-loop bookkeeping
and pointer arithmetic bury the actual floating-point work under
a mountain of overhead. Over millions of timestamps, those
j and k loop counters add up.
The file contains:
naive_mimo— the original triple looploop_unroll_simd_mimo— blank canvas for your hand-vectorized version
Your Task: In the starter code,
loop_unroll_simd_mimo() is empty. Complete this
function by using explicitly x86 SIMD primitives
(_mm_loadu_ps, _mm_set1_ps,
_mm_add_ps, _mm_mul_ps, etc.).
Hint: Instead of a complex horizontal dot-product,
think about how broadcasting scalar signal elements across
matrix rows cleanly resolves the math. Because both the signal
vectors and the projection matrix are fixed exactly at 4
dimensions, you can completely unroll the matrix
multiplication, eliminating all j and
k loop overhead inside the main i
loop!
Run:
make test2Checkoff Questions
- In your
loop_unroll_simd_mimo()kernel, why does fully unrolling the 4x4 projection mathematically map perfectly to 4-lane wide vectors? - How many total
_mm_add_psand_mm_mul_psinstructions execute per timestampiin your unrolled loop?
Part 3: Compiler Auto-Vectorization and Assembly
Third hotspot isn't really code — it's a culture problem. The
SisTorLink build system throws -O3 at everything
because "it makes stuff faster," but nobody has ever checked
what it actually does to their loops. The team lead
wants proof before she'll sign off on the -O3 flag
for the production firmware. Your job: show her the assembly
receipts.
First, make sure the binaries are built
(make test2 or make all). Then
disassemble both the slow (-O0) and optimized
(-O3) versions:
objdump -d part2_3 > part2_3_O0.s
objdump -d part2_3_opt > part2_3_O3.sOpen the .s files and search for the
<naive_mimo> function block.
Checkoff Questions
- Compare
naive_mimoinpart2_3_O0.stopart2_3_O3.s. Do you see vectorized loop structures being used (addps,mulps, orymmusage) in the-O3dump? - Look at
naive_mimoinpart2_3_O3.s. Do you see vectorized loop structures being used (addps,mulps, orymmusage) even though the source C code wasn't using intrinsics? Does the compiler successfully auto-vectorize the traditionally written nested loops?