Lab10 Cache
Computer Architecture I @ ShanghaiTech
Getting Started
Download the files for Lab10 first.
This lab has two parts:
- Access Pattern Cache Simulation
- 6G Signal Demodulation Prototyping Optimization
SIST’s recent publication in Nature, which achieves an astonishing breakthrough in communication link throughput, reports rates of up to 50 GB/s for wireless communication. This has been described by People’s Daily as “China’s new 6G breakthrough.” However, achieving high end-to-end link speed is not solely a physical-layer problem. The fully digital side of the receiver can become a bottleneck during signal demodulation, thereby limiting overall system throughput.
Traditionally, the communications community has relied on DSPs or MCUs, implemented using C and assembly, to achieve efficient demodulation. As the market share of RISC-V continues to grow—particularly in the MCU domain—engineers are increasingly developing demodulation subsystems on RISC-V platforms. This is the focus of this lab. These platforms must process data at high speed under tight resource constraints, making cache-friendly code critical.
Use the Venus simulator for Part 1.
Use a Linux environment for Part 2.
Part 1: Cache Visualization
This part uses cache.s to model two locality issues:
- execution order: row-major traversal vs. column-major traversal
- data layout: padded records vs. compact records
For all scenarios:
- Copy the code in cache.s to Venus.
- In main, set the program parameters by changing the immediates of the commented li instructions.
- Click Simulator. In the right partition, open the Cache tab.
- Set the cache parameters listed for the scenario.
- Click Simulator -> Assemble & Simulate from Editor.
- Click Simulator -> Step and observe the cache state.
Program parameters:
- a0: number of rows
- a1: number of columns
- a2: repetition count
- a3: mode selector
Modes:
- 0: packed records, row-major traversal
- 1: packed records, column-major traversal
- 2: padded records, row-major traversal
- 3: compact records, row-major traversal
Scenario 1: Execution Order
Cache Parameters:
- Cache Levels: 1
- Block Size (Bytes): 16
- Number of Blocks: 16
- Associativity: 1
- Placement Policy: Direct Mapping
- Block Replacement Policy: LRU
- Enable the currently selected level of the cache.
Program Parameters
| Run | Rows | Cols | Rep Count | Mode |
|---|---|---|---|---|
| A | 16 | 16 | 1 | 0 |
| B | 16 | 16 | 1 | 1 |
Checkoff
- Compare the hit rates of Run A and Run B. Why are they different even though both runs touch the same logical records?
- In the column-major case, what stride does the program use between two consecutive accesses in the inner loop?
- Which traversal order is more suitable for the row-major data layout used in Part 2, and why?
Scenario 2: Data Layout
Cache Parameters:
- Cache Levels: 1
- Block Size (Bytes): 16
- Number of Blocks: 16
- Associativity: 1
- Placement Policy: Direct Mapping
- Block Replacement Policy: LRU
- Enable the currently selected level of the cache.
Program Parameters
| Run | Rows | Cols | Rep Count | Mode |
|---|---|---|---|---|
| A | 16 | 16 | 1 | 2 |
| B | 16 | 16 | 1 | 3 |
Checkoff
- Compare the hit rates of Run A and Run B. Why does the padded layout perform worse?
- In the padded-record case, how many different cache blocks can be touched by one logical record access?
- Which layout is closer to what you want for Part 2, and why?
Scenario 3: Associativity
Use the same program parameters in all three runs.
Program Parameters
| Run | Rows | Cols | Rep Count | Mode |
|---|---|---|---|---|
| A | 16 | 16 | 1 | 1 |
| B | 16 | 16 | 1 | 1 |
| C | 16 | 16 | 1 | 1 |
Shared Cache Parameters
- Cache Levels: 1
- Block Size (Bytes): 16
- Number of Blocks: 16
- Block Replacement Policy: LRU
- Enable the currently selected level of the cache.
Run-Specific Cache Parameters
| Run | Associativity | Placement Policy |
|---|---|---|
| A | 1 | Direct Mapping |
| B | 2-way | Set Associative |
| C | 16 | Fully Associative |
Checkoff
- Compare the hit rates of Run A, Run B, and Run C. What changes as associativity increases?
- Which misses from the direct-mapped case are reduced by the set-associative and fully associative cases?
- Why does higher associativity help less than changing the access order in Scenario 1?
Part 2: 6G Signal Demodulation Prototyping Optimization
In this part, you will optimize simplified demodulation-side
code in exe2_template.
Some brief background (the task does not require any of it):
Modern communication systems such as 6G and Wi-Fi 7 typically use OFDM with QAM modulation. In a digital modem, transmitted symbols are mapped to baseband I/Q values and later recovered during demodulation. In OFDM systems, these values are arranged on a two-dimensional resource grid indexed by OFDM symbol and subcarrier.
This lab uses a simplified receiver model: assume that filtering, sampling, and RF/front-end processing have already been completed by a 6G modem ASIC. This is the main reason you must COPY the data into your own data structure (struct QAMRecord in the existing apply_qam_fast.c; you may optimize its layout) rather than bypassing those stages in the tasks that follow. Read the code! Your task is to prototype the remaining RISC-V-based digital processing subsystem. In other words, you implement no real communications logic; focus on memory-side optimization.
Run all Part 2 commands from inside
exe2_template:
cd exe2_template

For this part, only edit:
apply_qam_fast.c
Keep the function interface and final output unchanged. The setup code is fixed; your goal is to improve the demodulation-side implementation while preserving correctness.
How to test
cd exe2_template
make base_test
make fast_test
make check_test
make benchmark

- make base_test runs the baseline version.
- make fast_test runs your optimized version.
- make check_test compares the outputs.
- make benchmark reports performance on a larger workload.
You can also run:
cd exe2_template
make all

You can compare the throughput in MB/s. Note that for simplicity this lab uses 4QAM (QPSK), so a throughput below 100 Mbps is normal and expected.
Checkoff
- What locality problems did you find in the baseline?
- What changes did you make in apply_qam_fast.c?
- Why do your changes improve cache performance?
- Show that your output matches the baseline.