Lab13

Computer Architecture I (CS110 / CS110P) Document
Reference: CS110 Course Page

Lab 13: Data-Level Parallelism and SIMD

Computer Architecture I@ShanghaiTech

Getting Started

Download the files for Lab13 first.

In modern processor design, Data-Level Parallelism (DLP) is essential for exploiting structured, throughput-oriented workloads.

Scenario. SisTorLink, a scrappy telecom startup, is rushing to demo the world's first software-defined 6G receiver at next month's MWC. Their prototype works — but it's embarrassingly slow, processing only a trickle of the 50 GB/s link budget. The CTO dumps a hotspots.txt on your desk with three bottlenecks circled in red:

  1. A signal clipper full of unpredictable branches that mispredict like crazy on random power spikes.
  2. A 4×4 MIMO projection written as naive triply-nested loops — fine for MATLAB, disastrous for a real-time pipeline crunching millions of timestamps.
  3. Nobody on the team actually knows what -O3 does. They just add it to the Makefile and hope.

Your mission: vectorize the first two with hand-written x86 SIMD, and settle the -O3 debate with cold, hard assembly evidence. Demos are in four weeks. Clock's ticking.

SIMD applies the same operation to multiple data elements in parallel using vector registers. This course is taught in a RISC-V setting, but this lab focuses on x86 SIMD (spanning from SSE to modern AVX-512) because most students do not have native execution environments for the latest RISC-V vector extensions.

Useful technical references:

Part 1: SIMD Vectorization and Fringe Handling

The first hotspot circled on hotspots.txt: the signal clipper. Incoming samples suffer random power spikes (interference, amplifier nonlinearity) and must be hard-clipped to [−5.0,+5.0][-5.0, +5.0] before hitting the demodulator. The legacy code does this with an if-ladder — two branches per sample, mispredicting constantly on noise-like data. Every branch mispredict flushes the pipeline. At 50 GB/s, that's billions of wasted cycles per second.

Your job: replace the inner loop with branchless SIMD intrinsics that clip four samples at a time. (The tail fringe will still use a scalar fallback — that's fine.)

You start in part1.c. The starter code provides a naive_clip_signal() function in standard C.

Your Task: Implement simd_clip_signal() using SSE intrinsics (_mm_loadu_ps, _mm_set1_ps, _mm_min_ps, _mm_max_ps, and _mm_storeu_ps). Crucially, your SIMD loop must handle cases where the signal length is not a clean multiple of 4. Processing these leftover elements is known as "fringe" or "tail" handling.

Run your code:

make test1

The cases 37 and 43 explicitly exercise your leftover-element logic.

Checkoff Questions

  1. Describe how your simd_clip_signal() manages the bounds requirement. Why would a blind 128-bit vector load/store on the final 1–3 leftover elements cause a segmentation fault or memory corruption?
  2. You replaced the conditional if branches in the naive code with dedicated min/max intrinsics. Architecturally, why is eliminating conditional branches crucial for maintaining continuous Instruction-Level Parallelism (ILP)?

Part 2: Loop Unrolling and Instruction-Level Parallelism

Second hotspot: the 4×4 MIMO spatial projection. SisTorLink's receiver uses a 4-antenna array; at every timestamp tt, the incoming 4-element signal vector must be multiplied by a fixed 4×4 precoding matrix. The original author wrote it as triply-nested loops (for i, for j, for k) — readable, but the inner-loop bookkeeping and pointer arithmetic bury the actual floating-point work under a mountain of overhead. Over millions of timestamps, those j and k loop counters add up.

The file contains:

  • naive_mimo — the original triple loop
  • loop_unroll_simd_mimo — blank canvas for your hand-vectorized version

Your Task: In the starter code, loop_unroll_simd_mimo() is empty. Complete this function by using explicitly x86 SIMD primitives (_mm_loadu_ps, _mm_set1_ps, _mm_add_ps, _mm_mul_ps, etc.). Hint: Instead of a complex horizontal dot-product, think about how broadcasting scalar signal elements across matrix rows cleanly resolves the math. Because both the signal vectors and the projection matrix are fixed exactly at 4 dimensions, you can completely unroll the matrix multiplication, eliminating all j and k loop overhead inside the main i loop!

Run:

make test2

Checkoff Questions

  1. In your loop_unroll_simd_mimo() kernel, why does fully unrolling the 4x4 projection mathematically map perfectly to 4-lane wide vectors?
  2. How many total _mm_add_ps and _mm_mul_ps instructions execute per timestamp i in your unrolled loop?

Part 3: Compiler Auto-Vectorization and Assembly

Third hotspot isn't really code — it's a culture problem. The SisTorLink build system throws -O3 at everything because "it makes stuff faster," but nobody has ever checked what it actually does to their loops. The team lead wants proof before she'll sign off on the -O3 flag for the production firmware. Your job: show her the assembly receipts.

First, make sure the binaries are built (make test2 or make all). Then disassemble both the slow (-O0) and optimized (-O3) versions:

objdump -d part2_3 > part2_3_O0.s
objdump -d part2_3_opt > part2_3_O3.s

Open the .s files and search for the <naive_mimo> function block.

Checkoff Questions

  1. Compare naive_mimo in part2_3_O0.s to part2_3_O3.s. Do you see vectorized loop structures being used (addps, mulps, or ymm usage) in the -O3 dump?
  2. Look at naive_mimo in part2_3_O3.s. Do you see vectorized loop structures being used (addps, mulps, or ymm usage) even though the source C code wasn't using intrinsics? Does the compiler successfully auto-vectorize the traditionally written nested loops?