Lab 8

Computer Architecture I (CS110) Lab Document
Reference: CS110 Course Page

Computer Architecture I @ ShanghaiTech University

Getting Started

Download the files for lab8 first.

In this lab, you will build an 8-input signed adder tree in Logisim Evolution - first as a pure combinational circuit (Exercise 1), then pipelined at every level (Exercise 2). The tree reduces eight narrow samples into one 11-bit sum, which is then sign-extended to the native XLEN = 64 register width of an RV64I machine and written into a sum register - the same convention that lb / lh / mulw / addw use when writing a narrow result back into a 64-bit GPR. It is also your first taste of the latency vs. throughput tradeoff that shows up everywhere in Project 2.

Tip. addw / mulw are RV64-only instructions. They are mentioned here, together with lb / lh, only as examples of the same pattern: a narrow signed result is sign-extended before being written back to a full-width register.

In this lab, the adder tree output is an 11-bit signed value, so it must be sign-extended to the 64-bit sum register.

Exercises

Exercise 1: Non-pipelined Adder Tree

Load the given starter file ex1.circ into Logisim and implement an 8-input signed adder tree as a purely combinational block feeding a single output register. Show this completed circuit to your TA (remember to save!).

Functionality

  • Parallel Summation: The circuit takes 8 signed 8-bit operands \(S_0, S_1, \ldots, S_7\) at the same time and produces the signed sum

    \(P = \sum_{i=0}^{7} S_i\)

    using a balanced binary adder tree (4 adders at level 1, 2 adders at level 2, 1 adder at level 3).

  • Width Growth: To avoid overflow, each level must widen its operands by 1 bit before adding. The tree output is therefore an 11-bit signed value, since \(8 + \log_2 8 = 11\).

  • Sign Extension to XLEN = 64 bits: The 11-bit tree output is sign-extended to 64 bits before being written into the sum register. Think of the adder tree as a narrow functional unit sitting inside a 64-bit datapath: its result must be promoted to the full XLEN = 64 register width before write-back, exactly like the addw / mulw family does in RV64I. Zero extension would silently corrupt negative sums once sum is consumed by a signed 64-bit ALU.

  • Registered Output Only: There are no pipeline registers inside the tree. On every rising edge of clk, the current combinational sum is latched into the 64-bit sum register.

Input Ports

  • Operands (S0 - S7): Eight signed 8-bit inputs supplied via a single S bus and a splitter, as shown in the starter file.
  • Clock Signal (clk): Drives the 64-bit output register. On every rising edge the register captures the current sign-extended sum.

Output Ports

  • Sum (sum): A 64-bit signed value holding the most recently latched tree output.

Details

  • You must use the Adder component from the Arithmetic library. Do not use Bit Adder, and do not collapse the 7 adders into fewer, wider ones - we are explicitly practicing the tree topology.
  • At every level, widen both operands by 1 bit using a Bit Extender (Arithmetic -> Bit Extender, type = sign). The three levels therefore use adders of width 9, 10, 11.
  • Use a Bit Extender with type = sign and output width 64 to extend the 11-bit tree output to 64 bits before the register.
  • The register must be 64-bit wide. Leave the enable pin tied high (or unconnected - Logisim defaults to enabled) and keep the reset pin tied low unless the starter file says otherwise.
  • Use a Tunnel for clk (and any other signals you want to keep from crossing the schematic) to keep things readable. Your final layout should look roughly like this:
flowchart LR S0["S0 | 8b"] --> A1["+ | 9b"] S1["S1 | 8b"] --> A1 S2["S2 | 8b"] --> A2["+ | 9b"] S3["S3 | 8b"] --> A2 S4["S4 | 8b"] --> A3["+ | 9b"] S5["S5 | 8b"] --> A3 S6["S6 | 8b"] --> A4["+ | 9b"] S7["S7 | 8b"] --> A4 A1 --> B1["+ | 10b"] A2 --> B1 A3 --> B2["+ | 10b"] A4 --> B2 B1 --> C1["+ | 11b"] B2 --> C1 C1 --> SX["sign extend
11 -> 64"] SX --> REG["sum
64b register"] REG --> OUT["sum"]

Check-off

  • Show ex1.circ to your TA, and explain:
    • Why each level of the tree needs one more bit of width than the previous level.
    • Why sign extension (not zero extension) is used on the 11-bit tree output.
    • What the critical path of this circuit is, in terms of number of adders in series.
  • Advance the clock one tick at a time with Ctrl+T or F9 - do not use Ctrl+K (Ticks Enabled) - and set a few operand combinations with the Poke tool. After a single tick, sum should already reflect the current inputs. Tell your TA what this implies for the maximum clock frequency compared to Exercise 2.

Exercise 2: Pipelined Adder Tree

Load the given starter file ex2.circ into Logisim. Starting from your Exercise 1 circuit, insert pipeline registers so that every adder level is followed by a register, turning the tree into a 3-stage pipeline that also feeds the 64-bit sum register. Show this completed circuit to your TA (remember to save!).

Functionality

  • Pipelined Summation: The circuit computes the same signed sum

    \(P = \sum_{i=0}^{7} S_i\)

    but breaks the combinational path into 3 pipeline stages, one per tree level, plus the existing output register. The final sign-extended result appears in sum 4 cycles after its inputs are applied.

  • Throughput = 1 sample / cycle: On every rising edge the circuit accepts a new set of 8 operands and, once the pipeline is full, produces one valid sum per cycle - matching the instruction throughput you would expect from a fully pipelined functional unit in a 64-bit datapath.

  • Latency = 4 cycles: 3 cycles through the tree stages + 1 cycle into the output register. Stage widths are 9, 10, 11 bits respectively, matching Exercise 1.

  • Uniform Clock Domain: All pipeline registers share the same clk. There is no per-stage enable - data moves forward every cycle, just like a textbook 5-stage RV64 pipeline.

Input Ports

  • Operands (S0 - S7): Eight signed 8-bit inputs, identical to Exercise 1.
  • Clock Signal (clk): Drives all pipeline registers and the output register via a single clk tunnel, so every stage shares one synchronous clock domain.

Output Ports

  • Sum (sum): The 64-bit signed sum corresponding to the operands applied 4 cycles earlier.

Details

  • Insert registers at the following points (see diagram below):
    • After level 1: 4 x 9-bit registers (R1A, R1B, R1C, R1D).
    • After level 2: 2 x 10-bit registers (R2A, R2B).
    • After level 3: 1 x 11-bit register (R3).
    • The existing 64-bit sum register stays as the final stage.
  • Every register must be driven by the same clk tunnel. Do not introduce separate clocks - in this course, all pipeline stages are synchronous to a single clock domain.
  • The sign extender (11 -> 64) sits between R3 and the sum register, exactly as in Exercise 1. Do not widen pipeline registers to 64 bits - that wastes area without changing behavior.
  • Do not reorder the tree or merge levels. The whole point of this exercise is that each pipeline stage contains exactly one adder delay.
flowchart LR subgraph S1["Stage 1"] A1["+ | 9b"] --> R1A["R1A | 9b"] A2["+ | 9b"] --> R1B["R1B | 9b"] A3["+ | 9b"] --> R1C["R1C | 9b"] A4["+ | 9b"] --> R1D["R1D | 9b"] end subgraph S2["Stage 2"] B1["+ | 10b"] --> R2A["R2A | 10b"] B2["+ | 10b"] --> R2B["R2B | 10b"] end subgraph S3["Stage 3"] C1["+ | 11b"] --> R3["R3 | 11b"] end subgraph S4["Stage 4"] SX["sign extend
11 -> 64"] --> REG["sum | 64b"] end R1A --> B1 R1B --> B1 R1C --> B2 R1D --> B2 R2A --> C1 R2B --> C1 R3 --> SX

Check-off

  • Show ex2.circ to your TA, and explain:
    • The latency (in cycles) from applying S0-S7 to observing the corresponding value on sum.
    • The throughput of the circuit, and why inserting registers does not reduce it.
    • The new critical path, and why this lets a pipelined version run at a higher clock frequency than Exercise 1 even though both compute the same function.
  • Drive the circuit with a changing input bus, advancing the clock one tick at a time with Ctrl+T or F9 (do not use Ctrl+K). Verify with your TA that:
    1. The first valid sum appears exactly 4 cycles after the first valid input - i.e. the pipelined version's outputs are shifted by 4 cycles relative to the non-pipelined version in Exercise 1.
    2. Stopping the clock mid-computation freezes all intermediate register values, and resuming produces the correct result without any data loss.

The following TA(s) are responsible for this lab: Chaofan Li <lichf2025@shanghaitech.edu.cn>