Homework 2: FP16 Multiplication Simulator
Computer Architecture - CS110 @ ShanghaiTech University
Overview
In this homework, you will implement a software simulator for IEEE 754 half-precision floating-point (FP16) multiplication. Instead of merely computing the final arithmetic result, your program is required to accurately trace and output the intermediate states of the hardware multiplication pipeline.
Through this assignment, you are expected to:
- Deeply understand the bit-level representation of IEEE 754 floating-point numbers.
- Practice extracting and manipulating mantissas, including the proper handling of implicit hidden bits.
- Implement fractional multiplication and the dynamic bit-shifting required for result normalization.
- Comprehend and enforce the strict Round-to-+ (Up) rounding mode, specifically handling its counter-intuitive behaviors when applied to negative numbers. (Note: The default rounding mode is typically Round-to-Nearest-Ties-to-Even, but for simplicity of implementation, we choose Round-to-.)
- Correctly manage complex edge cases, including Underflow, Overflows, NaNs, and Infinities.
Before you start, you have to join the GitHub Classroom and accept the assignment via this link. A repository containing the starter code will be generated to you. You can then start on the assignment by cloning it to your Ubuntu system. To submit the assignment, you should push your local clone to GitHub and turn in your work on Gradescope by connecting to your GitHub repo.
Important Restriction: You must simulate the
hardware logic using purely unsigned/signed integer types (e.g.,
uint16_t, uint32_t) and bitwise
operations. Type-casting to float or
double, or using standard library math/rounding
functions is STRICTLY FORBIDDEN. Your code will
be audited both automatically and manually.
Input & Output Specification
The C program should read from stdin line by
line until EOF is reached. Each line contains two
FP16 floating-point numbers expressed in 4-digit lowercase hex.
For each pair of operands, your program must output exactly 5
lines detailing the pipeline steps.
Floating-Point Format (FP16)
The format comprises 16 bits: 1 sign bit, 5 exponent bits, and 10 explicit mantissa (fraction) bits. The exponent bias is 15.
MSB LSB
+----+----------------+---------------+
| 15 | 14 ... 10 | 9 ... 0 |
+----+----------------+---------------+
Sign Exponent Mantissa
| Exponent | Mantissa = 0 | Mantissa != 0 |
|---|---|---|
0x00 |
zero | subnormal |
0x1F |
infinity | NaN |
| otherwise | normal | normal |
Rounding Mode: Round-to-+ (Up)
The objective of this rounding mode is to round the result to the smallest representable floating-point number that is greater than or equal to the exact mathematical result.
When truncating the intermediate 22-bit product to fit the 10-bit target fraction, the discarded low-order bits are summarized by three bits to determine the loss of precision:
- G (Guard bit): The first truncated bit immediately to the right of the target fraction's least significant bit.
- R (Round bit): The bit immediately to the right of the Guard bit.
- S (Sticky bit): The logical OR of all remaining truncated bits to the right of the Round bit.
Let
.
Note that the Sticky bit
()
must represent the logical OR of all discarded
bits. This includes not only the lower bits of the raw 22-bit
product, but crucially, any 1 bits that are shifted
out during normalization or underflow right-shifts. Therefore,
if any precision is lost at any step.
- If : The result is exact. Truncate GRS bits.
- If AND (Positive): The exact mathematical result is greater than the truncated result. You must round up (add 1 to the lowest bit of the mantissa).
- If AND (Negative): The exact mathematical result (e.g., ) is greater (closer to ) than the rounded-down magnitude (e.g., ). Therefore, you must truncate (do nothing to the mantissa and discard GRS).
Overflow Rules:
- Positive Overflow: Rounds to Positive Infinity
(
+Inf). - Negative Overflow: Rounds to the Maximum Negative
Normal Number (
-MaxNormal, which is0xFBFF), NOT-Inf.
Output Format Details
Line 1 & 2: Unpack
Op[1|2]: S=<sign> E=<unbiased_exp> M=<implicit>.<fraction_3hex> <type>
<sign>: The sign bit (0or1).<unbiased_exp>: The actual exponent value in decimal (e.g.,0,-10). For zeros and subnormals, output the implicit exponent-14. For Inf/NaN, output0.<implicit>: The hidden leading bit shown in the unpack trace.1for normal/inf/nan,0for subnormal/zero.<fraction_3hex>: The 10-bit explicit fraction left-adjusted to 12 bits, formatted as 3 lowercase hex digits. (e.g., fraction0000000001becomes004).<type>: Must be one of"normal","subnormal","zero","inf","nan".,
Line 3: Raw Product
Raw: <22_bit_binary> E_raw=<raw_exp>
<22_bit_binary>: The pure integer multiplication of the two 11-bit mantissas (including hidden bits).<raw_exp>: The decimal sum of the two unbiased exponents.
Line 4: Normalize & Round
Norm: E_norm=<e_norm> Fraction=<10_bit_binary> G=<g> R=<r> S=<s> Action=<Up|Truncate>
<e_norm>: The decimal unbiased exponent after normalization or underflow alignment, but before rounding logic is applied.<10_bit_binary>: The 10-bit fraction before rounding.<g>,<r>,<s>: The computed Guard, Round, and Sticky bits (0or1).<Action>: Determine the rounding action (UporTruncate).
Note: If a special value(nan,
inf) bypasses Raw/Norm steps, output
Raw: N/A and Norm: N/A for those
lines.
Special invalid operation rule: 0 × inf (in
either operand order, with any sign) must produce
nan.
Line 5: Final Result
Result: <4_digit_hex_lower_case>
Note: If nan bypasses Result steps, output
Result: N/A for those lines.
An Example Run
Input:
3e00 3e00
bc01 3c01
1400 1400
Output:
Op1: S=0 E=0 M=1.800 normal
Op2: S=0 E=0 M=1.800 normal
Raw: 1001000000000000000000 E_raw=0
Norm: E_norm=1 Fraction=0010000000 G=0 R=0 S=0 Action=Truncate
Result: 4080
Op1: S=1 E=0 M=1.004 normal
Op2: S=0 E=0 M=1.004 normal
Raw: 0100000000100000000001 E_raw=0
Norm: E_norm=0 Fraction=0000000010 G=0 R=0 S=1 Action=Truncate
Result: bc02
Op1: S=0 E=-10 M=1.000 normal
Op2: S=0 E=-10 M=1.000 normal
Raw: 0100000000000000000000 E_raw=-20
Norm: E_norm=-14 Fraction=0000010000 G=0 R=0 S=0 Action=Truncate
Result: 0010
Compilation Standard
Your code will be evaluated using the following strict compilation flags:
gcc -std=c11 -Wall -Wextra -Wpedantic -Werror -O2 -ggdb fpmul.c -o fpmul
Warnings are treated as errors. No memory leaks are allowed.
Academic Integrity
You are expected to submit your own original work. Plagiarism, including but not limited to copying code from peers, downloading from online repositories, or submitting AI-generated code, is a serious academic offense and will not be tolerated. Penalties may include a failing grade for the course.