Lab 8

Computer Architecture I @ ShanghaiTech University

Getting Started

Download the files for lab8 first.

In this lab, you will build an 8-input signed adder tree in Logisim Evolution - first as a pure combinational circuit (Exercise 1), then pipelined at every level (Exercise 2). The tree reduces eight narrow samples into one 11-bit sum, which is then sign-extended to the native XLEN = 64 register width of an RV64I machine and written into a sum register - the same convention that lb / lh / mulw / addw use when writing a narrow result back into a 64-bit GPR. It is also your first taste of the latency vs. throughput tradeoff that shows up everywhere in Project 2.

Tip. addw / mulw are RV64-only instructions. They are mentioned here, together with lb / lh, only as examples of the same pattern: a narrow signed result is sign-extended before being written back to a full-width register.

In this lab, the adder tree output is an 11-bit signed value, so it must be sign-extended to the 64-bit sum register.

Exercises

Exercise 1: Non-pipelined Adder Tree

Load the given starter file ex1.circ into Logisim and implement an 8-input signed adder tree as a purely combinational block feeding a single output register. Show this completed circuit to your TA (remember to save!).

Functionality

Parallel Summation: The circuit takes 8 signed 8-bit operands \(S_0, S_1, \ldots, S_7\) at the same time and produces the signed sum

\(P = \sum_{i=0}^{7} S_i\)

using a balanced binary adder tree (4 adders at level 1, 2 adders at level 2, 1 adder at level 3).
Width Growth: To avoid overflow, each level must widen its operands by 1 bit before adding. The tree output is therefore an 11-bit signed value, since \(8 + \log_2 8 = 11\).
Sign Extension to XLEN = 64 bits: The 11-bit tree output is sign-extended to 64 bits before being written into the sum register. Think of the adder tree as a narrow functional unit sitting inside a 64-bit datapath: its result must be promoted to the full XLEN = 64 register width before write-back, exactly like the addw / mulw family does in RV64I. Zero extension would silently corrupt negative sums once sum is consumed by a signed 64-bit ALU.
Registered Output Only: There are no pipeline registers inside the tree. On every rising edge of clk, the current combinational sum is latched into the 64-bit sum register.

Input Ports

Operands (S0 - S7): Eight signed 8-bit inputs supplied via a single S bus and a splitter, as shown in the starter file.
Clock Signal (clk): Drives the 64-bit output register. On every rising edge the register captures the current sign-extended sum.

Output Ports

Sum (sum): A 64-bit signed value holding the most recently latched tree output.

Details

You must use the Adder component from the Arithmetic library. Do not use Bit Adder, and do not collapse the 7 adders into fewer, wider ones - we are explicitly practicing the tree topology.
At every level, widen both operands by 1 bit using a Bit Extender (Arithmetic -> Bit Extender, type = sign). The three levels therefore use adders of width 9, 10, 11.
Use a Bit Extender with type = sign and output width 64 to extend the 11-bit tree output to 64 bits before the register.
The register must be 64-bit wide. Leave the enable pin tied high (or unconnected - Logisim defaults to enabled) and keep the reset pin tied low unless the starter file says otherwise.
Use a Tunnel for clk (and any other signals you want to keep from crossing the schematic) to keep things readable. Your final layout should look roughly like this:

flowchart LR S0["S0 | 8b"] --> A1["+ | 9b"] S1["S1 | 8b"] --> A1 S2["S2 | 8b"] --> A2["+ | 9b"] S3["S3 | 8b"] --> A2 S4["S4 | 8b"] --> A3["+ | 9b"] S5["S5 | 8b"] --> A3 S6["S6 | 8b"] --> A4["+ | 9b"] S7["S7 | 8b"] --> A4 A1 --> B1["+ | 10b"] A2 --> B1 A3 --> B2["+ | 10b"] A4 --> B2 B1 --> C1["+ | 11b"] B2 --> C1 C1 --> SX["sign extend
11 -> 64"] SX --> REG["sum
64b register"] REG --> OUT["sum"]

Check-off

Show ex1.circ to your TA, and explain:
- Why each level of the tree needs one more bit of width than the previous level.
- Why sign extension (not zero extension) is used on the 11-bit tree output.
- What the critical path of this circuit is, in terms of number of adders in series.
Advance the clock one tick at a time with Ctrl+T or F9 - do not use Ctrl+K (Ticks Enabled) - and set a few operand combinations with the Poke tool. After a single tick, sum should already reflect the current inputs. Tell your TA what this implies for the maximum clock frequency compared to Exercise 2.

Exercise 2: Pipelined Adder Tree

Load the given starter file ex2.circ into Logisim. Starting from your Exercise 1 circuit, insert pipeline registers so that every adder level is followed by a register, turning the tree into a 3-stage pipeline that also feeds the 64-bit sum register. Show this completed circuit to your TA (remember to save!).

Functionality

Pipelined Summation: The circuit computes the same signed sum

\(P = \sum_{i=0}^{7} S_i\)

but breaks the combinational path into 3 pipeline stages, one per tree level, plus the existing output register. The final sign-extended result appears in sum 4 cycles after its inputs are applied.
Throughput = 1 sample / cycle: On every rising edge the circuit accepts a new set of 8 operands and, once the pipeline is full, produces one valid sum per cycle - matching the instruction throughput you would expect from a fully pipelined functional unit in a 64-bit datapath.
Latency = 4 cycles: 3 cycles through the tree stages + 1 cycle into the output register. Stage widths are 9, 10, 11 bits respectively, matching Exercise 1.
Uniform Clock Domain: All pipeline registers share the same clk. There is no per-stage enable - data moves forward every cycle, just like a textbook 5-stage RV64 pipeline.

Input Ports

Operands (S0 - S7): Eight signed 8-bit inputs, identical to Exercise 1.
Clock Signal (clk): Drives all pipeline registers and the output register via a single clk tunnel, so every stage shares one synchronous clock domain.

Output Ports

Sum (sum): The 64-bit signed sum corresponding to the operands applied 4 cycles earlier.

Details

Insert registers at the following points (see diagram below):
- After level 1: 4 x 9-bit registers (R1A, R1B, R1C, R1D).
- After level 2: 2 x 10-bit registers (R2A, R2B).
- After level 3: 1 x 11-bit register (R3).
- The existing 64-bit sum register stays as the final stage.
Every register must be driven by the same clk tunnel. Do not introduce separate clocks - in this course, all pipeline stages are synchronous to a single clock domain.
The sign extender (11 -> 64) sits between R3 and the sum register, exactly as in Exercise 1. Do not widen pipeline registers to 64 bits - that wastes area without changing behavior.
Do not reorder the tree or merge levels. The whole point of this exercise is that each pipeline stage contains exactly one adder delay.

Check-off

Show ex2.circ to your TA, and explain:
- The latency (in cycles) from applying S0-S7 to observing the corresponding value on sum.
- The throughput of the circuit, and why inserting registers does not reduce it.
- The new critical path, and why this lets a pipelined version run at a higher clock frequency than Exercise 1 even though both compute the same function.
Drive the circuit with a changing input bus, advancing the clock one tick at a time with Ctrl+T or F9 (do not use Ctrl+K). Verify with your TA that:
1. The first valid sum appears exactly 4 cycles after the first valid input - i.e. the pipelined version's outputs are shifted by 4 cycles relative to the non-pipelined version in Exercise 1.
2. Stopping the clock mid-computation freezes all intermediate register values, and resuming produces the correct result without any data loss.

The following TA(s) are responsible for this lab: Chaofan Li <lichf2025@shanghaitech.edu.cn>