Computer Architecture I (CS110 / CS110P) Document
Reference: CS110 Course Page

Lab10 Cache

Computer Architecture I@ShanghaiTech

Getting Started

Download the files for Lab10 first.

This lab has two parts:

  1. Access Pattern Cache Simulation
  2. 6G Signal Demodulation Prototyping Optimization

SIST’s recent publication in Nature, which achieves an astonishing breakthrough in communication link throughput, reports rates of up to 50 GB/s for wireless communication. This has been described by People’s Daily as “China’s new 6G breakthrough.” However, achieving high end-to-end link speed is not solely a physical-layer problem. The fully digital side of the receiver can become a bottleneck during signal demodulation, thereby limiting overall system throughput.

Traditionally, the communications community has relied on DSPs or MCUs, implemented using C and assembly, to achieve efficient demodulation. As the market share of RISC-V continues to grow—particularly in the MCU domain—engineers are increasingly developing demodulation subsystems on RISC-V platforms. This is the focus of this lab. These platforms must process data at high speed under tight resource constraints, making cache-friendly code critical.

Use the Venus simulator for Part 1.

Use a Linux environment for Part 2.


Part 1: Cache Visualization

This part uses cache.s to model locality issues:

  • execution order: row-major traversal vs. column-major traversal
  • data layout: padded records vs. compact records
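The two traversal orders above can be sketched in C (a hypothetical sum over a 16×16 int matrix; the lab itself exercises these patterns through cache.s, not this code):

```c
#include <stddef.h>

#define ROWS 16
#define COLS 16

/* Row-major traversal: consecutive accesses are 4 bytes apart, so
 * after the first miss on a 16-byte block, the next three accesses hit. */
long sum_row_major(int a[ROWS][COLS]) {
    long s = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            s += a[r][c];
    return s;
}

/* Column-major traversal: consecutive accesses are COLS * 4 = 64 bytes
 * apart, so each inner-loop access may land in a different cache block. */
long sum_col_major(int a[ROWS][COLS]) {
    long s = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            s += a[r][c];
    return s;
}
```

Both functions compute the same result; only the order of memory accesses, and therefore the hit rate, differs.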

For all scenarios:

  1. Copy the code in cache.s to Venus.
  2. In main, set the program parameters by changing the immediates of the commented li instructions.
  3. Click Simulator. In the right partition, open the Cache tab.
  4. Set the cache parameters listed for the scenario.
  5. Click Simulator -> Assemble & Simulate from Editor.
  6. Click Simulator -> Step and observe the cache state.

Program parameters:

  • a0: number of rows
  • a1: number of columns
  • a2: repetition count
  • a3: mode selector

Modes:

  • 0: packed records, row-major traversal
  • 1: packed records, column-major traversal
  • 2: padded records, row-major traversal
  • 3: compact records, row-major traversal
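A rough C illustration of the padded-vs-compact distinction (the field names and sizes are made up; the actual record layout is defined in cache.s):

```c
#include <stdint.h>

/* Compact record: payload packed back to back, so a 16-byte cache
 * block holds four records. */
struct CompactRec {
    int32_t value;      /* 4 bytes of payload */
};

/* Padded record: padding stretches each record to a full block,
 * so every record access touches its own cache block and most of
 * the fetched bytes are wasted. */
struct PaddedRec {
    int32_t value;      /* 4 bytes of payload ...     */
    int32_t pad[3];     /* ... plus 12 bytes of padding */
};
```

With the 16-block cache configured below, an array of PaddedRec covers four times as many blocks as the same number of CompactRec entries.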

Scenario 1: Execution Order

Cache Parameters:

  • Cache Levels: 1
  • Block Size (Bytes): 16
  • Number of Blocks: 16
  • Associativity: 1
  • Placement Policy: Direct Mapping
  • Block Replacement Policy: LRU
  • Enable the currently selected cache level.

Program Parameters

Run  Rows  Cols  Rep Count  Mode
A    16    16    1          0
B    16    16    1          1

Checkoff

  1. Compare the hit rates of Run A and Run B. Why are they different even though both runs touch the same logical records?
  2. In the column-major case, what stride does the program use between two consecutive accesses in the inner loop?
  3. Which traversal order is more suitable for the row-major data layout used in Part 2, and why?

Scenario 2: Data Layout

Cache Parameters:

  • Cache Levels: 1
  • Block Size (Bytes): 16
  • Number of Blocks: 16
  • Associativity: 1
  • Placement Policy: Direct Mapping
  • Block Replacement Policy: LRU
  • Enable the currently selected cache level.

Program Parameters

Run  Rows  Cols  Rep Count  Mode
A    16    16    1          2
B    16    16    1          3

Checkoff

  1. Compare the hit rates of Run A and Run B. Why does the padded layout perform worse?
  2. In the padded-record case, how many different cache blocks can be touched by one logical record access?
  3. Which layout is closer to what you want for Part 2, and why?

Scenario 3: Associativity

Use the same program parameters in all three runs.

Program Parameters

Run  Rows  Cols  Rep Count  Mode
A    16    16    1          1
B    16    16    1          1
C    16    16    1          1

Shared Cache Parameters

  • Cache Levels: 1
  • Block Size (Bytes): 16
  • Number of Blocks: 16
  • Block Replacement Policy: LRU
  • Enable the currently selected cache level.

Run-Specific Cache Parameters

Run  Associativity  Placement Policy
A    1              Direct Mapping
B    2              2-way Set Associative
C    16             Fully Associative
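As a sketch of how associativity changes placement (assuming the 16-byte blocks and 16 total blocks configured above), each address is constrained to one set, and higher associativity means fewer, larger sets:

```c
#include <stdint.h>

#define BLOCK_SIZE 16   /* bytes per block, as configured above */
#define NUM_BLOCKS 16   /* total blocks in the cache */

/* Which set can a given address map to? With associativity `ways`,
 * the cache has NUM_BLOCKS / ways sets: 16 (direct-mapped),
 * 8 (2-way), or 1 (fully associative, i.e. any block may be used). */
unsigned set_index(uint32_t addr, unsigned ways) {
    unsigned num_sets = NUM_BLOCKS / ways;
    return (addr / BLOCK_SIZE) % num_sets;  /* block number mod set count */
}
```

Two addresses 256 bytes apart conflict in the direct-mapped case (same set index) but can coexist once the set is wide enough to hold both.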

Checkoff

  1. Compare the hit rates of Run A, Run B, and Run C. What changes as associativity increases?
  2. Which misses from the direct-mapped case are reduced by the set-associative and fully associative cases?
  3. Why does higher associativity help less than changing the access order in Scenario 1?

Part 2: 6G Signal Demodulation Prototyping Optimization

In this part, you will optimize simplified demodulation-side code in exe2_template.

Some brief background (the task requires none of it):

Modern communication systems such as 6G and Wi-Fi 7 typically use OFDM with QAM modulation. In a digital modem, transmitted symbols are mapped to baseband I/Q values and later recovered during demodulation. In OFDM systems, these values are arranged on a two-dimensional resource grid indexed by OFDM symbol and subcarrier.
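As a side note (not needed for the task), hard-decision demapping of 4-QAM/QPSK can be as simple as reading the signs of the I and Q components. This hypothetical sketch assumes a Gray mapping with one bit per axis:

```c
/* Hypothetical hard-decision QPSK (4-QAM) demapper: each complex
 * baseband sample carries 2 bits, one from the sign of I and one
 * from the sign of Q. Illustration only; not the lab's code. */
unsigned qpsk_demap(float i, float q) {
    unsigned b0 = (i < 0.0f);   /* bit from the in-phase sign */
    unsigned b1 = (q < 0.0f);   /* bit from the quadrature sign */
    return (b0 << 1) | b1;      /* 2 bits per symbol */
}
```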

This lab uses a simplified receiver model: assume that filtering, sampling, and RF front-end processing have already been completed by a 6G modem ASIC. This is the main reason the tasks below require you to COPY the data into your own data structure (struct QAMRecord in the existing apply_qam_fast.c; you may optimize its layout) rather than bypassing any of these steps, so read the code! Your task is to prototype the remaining RISC-V-based digital processing subsystem. In other words, you implement no real communication logic; focus on memory-side optimization.
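One memory-side idea worth weighing here is the array-of-structs vs struct-of-arrays trade-off. The sketch below is a generic illustration with made-up field names; the real struct QAMRecord and its fields are defined in apply_qam_fast.c:

```c
#include <stddef.h>

#define N 1024

/* Array-of-structs: if the hot loop reads only i and q, the meta
 * field is still dragged into the cache with every record. */
struct SampleAoS { float i, q; int meta; };

/* Struct-of-arrays: each field is contiguous, so a loop over i/q
 * touches only the bytes it actually needs. (Generic illustration,
 * not the actual QAMRecord layout.) */
struct SamplesSoA {
    float i[N];
    float q[N];
    int   meta[N];
};

float energy_soa(const struct SamplesSoA *s) {
    float e = 0.0f;
    for (size_t k = 0; k < N; k++)
        e += s->i[k] * s->i[k] + s->q[k] * s->q[k];  /* meta[] never loaded */
    return e;
}
```

Whether a layout change like this helps depends on which fields the baseline's hot loop actually reads, which is exactly what the checkoff questions ask you to identify.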

Run all Part 2 commands from inside exe2_template:

cd exe2_template

For this part, only edit:

  • apply_qam_fast.c

Keep the function interface and final output unchanged. The setup code is fixed; your goal is to improve the demodulation-side implementation while preserving correctness.

How to test

cd exe2_template
make base_test
make fast_test
make check_test
make benchmark

  • make base_test runs the baseline version
  • make fast_test runs your optimized version
  • make check_test compares the outputs
  • make benchmark reports performance on a larger workload

You can also run:

cd exe2_template
make all

You can compare the throughput in MB/s, noting that for simplicity this lab uses 4-QAM (QPSK), so a throughput below 100 Mbps is entirely normal and expected.
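As a unit sanity check (with made-up numbers): a benchmark figure reported in MB/s converts to Mbps by a factor of 8, and with 4-QAM each symbol carries only 2 bits, so modest Mbps figures are expected.

```c
/* 1 byte = 8 bits, so MB/s * 8 = Mbps (illustrative conversion only). */
double mbps_from_mbytes_per_s(double mb_per_s) {
    return mb_per_s * 8.0;
}
```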

Checkoff

  1. What locality problems did you find in the baseline?
  2. What changes did you make in apply_qam_fast.c?
  3. Why do your changes improve cache performance?
  4. Show that your output matches the baseline.