Project 1.1 : A RISC-V Assembler (Individual Project)
IMPORTANT INFO - PLEASE READ
The projects are part of your CS110P Design Project, worth 2 credit points. Project 1.1 contributes 17% to your CS110P grade. These projects run in parallel with the course, meaning that project and homework deadlines may be very close to each other. Start early and avoid procrastination!
Introduction to Project 1.1
Our objective is to implement a basic RISC-V assembler that converts assembly instructions into machine code. This implementation will support the .data segment of assembly files (the .space, .byte, and .word directives) and focus specifically on the RV32I instruction set along with part of the RV32M extension.
Before you start, please accept the assignment via this link. A repository containing the starter code will be generated for you. You can then start on the assignment by cloning it to your Ubuntu system. To submit the assignment, push your local clone to GitHub and turn in your work on Gradescope by connecting it to your GitHub repo.
Academic integrity is strictly enforced in this course, and any plagiarism will have serious consequences.
The assembler operates in two phases:
- Phase 1: Parse the assembly source code (.S file), remove comments, process the data in the .data section, create symbol tables for labels, and generate an intermediate representation consisting of basic instructions and extended pseudo-instructions.
- Phase 2: Using the symbol table and intermediate representation, translate each instruction into its corresponding machine code (binary representation) and output the result in hexadecimal format.
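As a rough illustration of the symbol table that Phase 1 builds, the sketch below keeps labels in a growable array and rejects duplicate names. The names DemoSymbol, DemoTable, demo_add, demo_lookup, and demo_free are hypothetical and are not the interfaces of the starter code; a linear search and a doubling growth policy are assumptions made for simplicity.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types for illustration; the starter code defines its own. */
typedef struct {
    char *name;
    uint32_t addr;
} DemoSymbol;

typedef struct {
    DemoSymbol *items;
    uint32_t len; /* number of symbols stored */
    uint32_t cap; /* number of symbols that fit before growing */
} DemoTable;

/* Returns the address of name, or -1 if the label is unknown. */
int64_t demo_lookup(const DemoTable *t, const char *name) {
    for (uint32_t i = 0; i < t->len; i++) {
        if (strcmp(t->items[i].name, name) == 0) {
            return (int64_t)t->items[i].addr;
        }
    }
    return -1;
}

/* Adds a label; returns 0 on success, -1 on duplicate or allocation failure. */
int demo_add(DemoTable *t, const char *name, uint32_t addr) {
    if (demo_lookup(t, name) >= 0) {
        return -1; /* the label already exists */
    }
    if (t->len == t->cap) {
        uint32_t new_cap = t->cap ? t->cap * 2 : 4;
        DemoSymbol *grown = realloc(t->items, new_cap * sizeof *grown);
        if (grown == NULL) {
            return -1;
        }
        t->items = grown;
        t->cap = new_cap;
    }
    size_t n = strlen(name) + 1;
    char *copy = malloc(n);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, name, n);
    t->items[t->len].name = copy;
    t->items[t->len].addr = addr;
    t->len++;
    return 0;
}

/* Releases every stored name and the array itself. */
void demo_free(DemoTable *t) {
    for (uint32_t i = 0; i < t->len; i++) {
        free(t->items[i].name);
    }
    free(t->items);
    t->items = NULL;
    t->len = 0;
    t->cap = 0;
}
```

Returning a 64-bit value from the lookup leaves room for a negative "not found" result next to the full 32-bit address range.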
For machine code transformation, you can refer to the RISC-V Green Card and Venus.
Instruction Set to Support
R-Type (14)
add, sub, xor, or, and, sll, srl, sra, slt, sltu, mul, mulh, div, rem
I-Type (16)
addi, xori, ori, andi, slli, srli, srai, slti, sltiu, lb, lh, lw, lbu, lhu, jalr, ecall
S-Type (3)
sb, sh, sw
SB-Type (6)
beq, bne, blt, bge, bltu, bgeu
U-Type (2)
lui, auipc
UJ-Type (1)
jal
Text segment pseudo-instructions (10)
beqz, bnez, li, mv, j, jr, jal, jalr, lw, la
Data segment pseudo-instructions (3)
.word, .space, .byte
Hint: div is DIVide (word), mul is MULtiply (word), mulh is MULtiply High, and rem is REMainder (word). If you are unfamiliar with div, mul, mulh, or rem, please refer to the RISC-V Green Sheet.
Hint: .space allocates a block of memory whose bytes are initialized to 0. If you feel confused, you can try these instructions in Venus.
Hint: For pseudo-instructions beyond the RISC-V Green Sheet, refer to this link.
Hint: The immediate value following li may exceed the range supported by addi. In such cases, you should use a combination of lui and addi to correctly represent li.
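The lui and addi combination mentioned in the last hint can be sketched as follows. Because addi sign-extends its 12-bit immediate, the upper part has to be rounded up whenever bit 11 of the value is set. The helper names fits_imm12 and split_li are hypothetical, and whether the reference assembler emits one or two instructions for a given value is not specified here, so check Venus for the expected expansion.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if imm fits in a 12-bit signed field. */
int fits_imm12(int32_t imm) {
    return imm >= -2048 && imm <= 2047;
}

/* Splits imm so that (upper20 << 12) + lower12 equals imm modulo 2^32.
   upper20 is the operand for lui, lower12 the operand for addi. */
void split_li(int32_t imm, uint32_t *upper20, int32_t *lower12) {
    uint32_t u = (uint32_t)imm;
    /* Adding 0x800 rounds the upper part up when bit 11 is set,
       compensating for the sign extension performed by addi. */
    *upper20 = ((u + 0x800u) >> 12) & 0xFFFFFu;
    int32_t low = (int32_t)(u & 0xFFFu);
    if (low >= 0x800) {
        low -= 0x1000; /* interpret the low 12 bits as a signed value */
    }
    *lower12 = low;
}
```

For example, 0x12345FFF splits into an upper part of 0x12346 and a lower part of -1, since 0x12346000 - 1 = 0x12345FFF.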
Getting Started
Directory Tree
.
|-- Makefile
|-- README.m
|-- assembler.c
|-- assembler.h
|-- test
|   |-- in
|   |   |-- labels.s
|   |   ...
|   |-- Makefile
|   `-- ref
|       |-- labels.log
|       ...
`-- src
    |-- block.c
    |-- block.h
    |-- tables.c
    |-- tables.h
    |-- translate.c
    |-- translate.h
    |-- translate_utils.c
    |-- translate_utils.h
    |-- utils.c
    `-- utils.h
How a RISC-V Assembly File Is Translated
The main assembly process is implemented in
assembler.c:assembler().
Consider an assembly input file with the following content:
.data
my_label:
.word 0x01234567
.byte 0x11 0x22 0x33 0x44
.space 4
.text
li x1, 0
li x2, 5
loop:
addi x1, x1, 1
blt x1, x2, loop
j exit
exit:
Phase 1 - pass_one()
During this phase, the assembler processes the input file line by line:
- Labels (e.g., my_label, loop, exit) are recorded in the symbol table (table) along with their corresponding addresses.
- Data: After appropriate error checks on the parameters, the value of each byte is recorded in memory.
- Instructions:
  - Pseudo-instructions are expanded using their respective handlers and stored in the intermediate representation block (blk).
  - Regular instructions are recorded directly in blk.
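To make the expansion step concrete, here is a minimal sketch that rewrites a few single-instruction pseudo-instructions as text. The function expand_simple_pseudo and its signature are hypothetical and do not match the transform_* handlers of the starter code, which write into blk. The exact spelling of the zero register in the output (x0 or zero) is an assumption; the mappings themselves follow the standard RISC-V pseudo-instruction definitions.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the starter code: writes the expansion
   of a few pseudo-instructions into out. Returns the number of basic
   instructions produced, or 0 if the name is not handled here. */
unsigned expand_simple_pseudo(const char *name, const char **args,
                              int num_args, char *out, size_t out_size) {
    if (strcmp(name, "mv") == 0 && num_args == 2) {
        /* mv rd, rs  ->  addi rd, rs, 0 */
        snprintf(out, out_size, "addi %s, %s, 0", args[0], args[1]);
        return 1;
    }
    if (strcmp(name, "beqz") == 0 && num_args == 2) {
        /* beqz rs, label  ->  beq rs, x0, label */
        snprintf(out, out_size, "beq %s, x0, %s", args[0], args[1]);
        return 1;
    }
    if (strcmp(name, "bnez") == 0 && num_args == 2) {
        /* bnez rs, label  ->  bne rs, x0, label */
        snprintf(out, out_size, "bne %s, x0, %s", args[0], args[1]);
        return 1;
    }
    if (strcmp(name, "j") == 0 && num_args == 1) {
        /* j label  ->  jal zero, label (the label stays unresolved) */
        snprintf(out, out_size, "jal zero, %s", args[0]);
        return 1;
    }
    return 0;
}
```

Checking num_args before formatting keeps a malformed line such as a mv with one operand from being expanded silently.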
After Phase 1, the intermediate results are as follows:
- table (Symbol Table): Stores label-address mappings. See src/tables.h and src/tables.c for details. Note: data addresses start from 0x10000000 and text addresses start from 0, which is the same as in Venus.
    268435456 my_label
    8 loop
    20 exit
- blk (Intermediate Representation Block): Stores expanded instructions while retaining unresolved label references. For example, li x1, 0 is expanded into addi x1, x0, 0.
    addi x1, x0, 0
    addi x2, x0, 5
    addi x1, x1, 1
    blt x1, x2, loop
    jal zero, exit
- DataImage: Stores every valid byte of the .data segment. For example, the data in the .data section above is output as:
    0x10000000 0x67
    0x10000001 0x45
    0x10000002 0x23
    0x10000003 0x01
    0x10000004 0x11
    0x10000005 0x22
    0x10000006 0x33
    0x10000007 0x44
    0x10000008 0x00
    0x10000009 0x00
    0x1000000a 0x00
    0x1000000b 0x00
If you have any doubts about this, you can load the program in Venus and check how the data is stored in memory.
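The byte order in the data listing is little-endian: .word 0x01234567 places 0x67 at the lowest address. A minimal sketch of parsing a data operand and storing it that way is shown below. The names parse_number and store_word_le are hypothetical and are not the starter code's parse_data_number or DataImage interfaces; the use of strtol with base 0 is an assumption about which number formats should be accepted.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parses a decimal or 0x-prefixed token. With base 0,
   strtol also treats a leading 0 as octal. Returns 0 on success, -1 if the
   token is empty, has trailing characters, or is out of range for long. */
int parse_number(const char *token, long *out) {
    char *end = NULL;
    errno = 0;
    long value = strtol(token, &end, 0);
    if (end == token || *end != '\0' || errno == ERANGE) {
        return -1;
    }
    *out = value;
    return 0;
}

/* Stores a 32-bit word at buf[offset] .. buf[offset + 3],
   least significant byte first. */
void store_word_le(uint8_t *buf, uint32_t offset, uint32_t word) {
    for (int i = 0; i < 4; i++) {
        buf[offset + i] = (uint8_t)(word >> (8 * i));
    }
}
```

A .byte operand would additionally need a range check on the parsed value before it is narrowed to a single byte.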
Phase 2 - pass_two()
In this phase, each instruction in blk is processed sequentially:
- Labels are resolved using the symbol table (src/translate.c:translate_inst()).
- Each instruction is converted into its corresponding machine code in binary form.
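Label resolution for branches and jumps comes down to computing the byte offset (label address minus the address of the instruction) and scattering its bits into the SB-type or UJ-type immediate fields. The sketch below uses hypothetical function names, encode_sb and encode_uj, and omits the range and alignment checks a complete implementation would need.

```c
#include <stdint.h>

/* Encodes an SB-type instruction. offset is the byte distance
   (target address - instruction address); bit 0 is not stored. */
uint32_t encode_sb(uint32_t opcode, uint32_t funct3, uint32_t rs1,
                   uint32_t rs2, int32_t offset) {
    uint32_t imm = (uint32_t)offset;
    return (((imm >> 12) & 0x1u) << 31) |  /* imm[12]   */
           (((imm >> 5) & 0x3Fu) << 25) |  /* imm[10:5] */
           (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
           (((imm >> 1) & 0xFu) << 8) |    /* imm[4:1]  */
           (((imm >> 11) & 0x1u) << 7) |   /* imm[11]   */
           opcode;
}

/* Encodes a UJ-type instruction with the same offset convention. */
uint32_t encode_uj(uint32_t opcode, uint32_t rd, int32_t offset) {
    uint32_t imm = (uint32_t)offset;
    return (((imm >> 20) & 0x1u) << 31) |   /* imm[20]    */
           (((imm >> 1) & 0x3FFu) << 21) |  /* imm[10:1]  */
           (((imm >> 11) & 0x1u) << 20) |   /* imm[11]    */
           (((imm >> 12) & 0xFFu) << 12) |  /* imm[19:12] */
           (rd << 7) | opcode;
}
```

In the example program, blt x1, x2, loop sits at address 12 and loop is at address 8, so the offset is -4 and encode_sb(0x63, 0x4, 1, 2, -4) yields 0xFE20CEE3. Likewise, jal zero, exit at address 16 with exit at 20 gives encode_uj(0x6F, 0, 4), which is 0x0040006F.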
Final Output (Hexadecimal Machine Code)
0x00000093
0x00500113
0x00108093
0xFE20CEE3
0x0040006F
This output represents the fully assembled machine code for the given input assembly file.
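The first three words of this output are I-type encodings of addi. A minimal sketch of how such a word can be assembled and formatted follows. The helpers parse_xreg, encode_itype, and format_word are hypothetical; only the xN register spelling is handled, and the uppercase 0x%08X format is inferred from the sample output rather than from the starter code.

```c
#include <ctype.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: maps "x0".."x31" to 0..31 and returns -1 otherwise.
   ABI names such as "zero" or "a0" are left out of this sketch. */
int parse_xreg(const char *str) {
    if (str == NULL || str[0] != 'x' || !isdigit((unsigned char)str[1])) {
        return -1;
    }
    char *end = NULL;
    long n = strtol(str + 1, &end, 10);
    if (*end != '\0' || n > 31) {
        return -1;
    }
    return (int)n;
}

/* Packs the I-type fields: imm[11:0] | rs1 | funct3 | rd | opcode. */
uint32_t encode_itype(uint32_t opcode, uint32_t funct3, uint32_t rd,
                      uint32_t rs1, int32_t imm) {
    return (((uint32_t)imm & 0xFFFu) << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* Formats a word as 0x followed by eight uppercase hex digits. */
void format_word(uint32_t word, char out[11]) {
    snprintf(out, 11, "0x%08" PRIX32, word);
}
```

For addi x2, x0, 5 the fields are opcode 0x13, funct3 0, rd 2, rs1 0, and immediate 5, which gives 0x00500113.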
The main assembly workflow (assemble()) and
helper functions are provided. You need to implement the
following label addition function and two-phase parsing
functions:
static int data_image_reserve(DataImage *image, size_t additional);
static int parse_data_number(const char *token, long *out);
static int parse_data_directive(uint32_t input_line, char *directive, char *rest_ctx,
                                DataImage *data_image, uint32_t *data_offset);
static int parse_data_line(uint32_t input_line, char *line, SymbolTable *table,
                           DataImage *data_image, uint32_t *data_offset);
static int add_if_label(uint32_t input_line, char* str, uint32_t byte_offset, SymbolTable* symtbl);
int pass_one(FILE* input, Block* blk, SymbolTable* table);
int pass_two(Block* blk, SymbolTable* table, FILE* output);
tables.h and tables.c
You need to complete the SymbolTable structure
defined in src/tables.h:
typedef struct {
/* Define your data structure here. */
uint32_t len;
uint32_t cap;
int mode;
} SymbolTable;
And implement the following SymbolTable management interfaces in src/tables.c:
SymbolTable* create_table(int mode);
void free_table(SymbolTable* table);
static Symbol* lookup(SymbolTable* table, const char* name);
int add_to_table(SymbolTable* table, const char* name, uint32_t addr);
int64_t get_addr_for_symbol(SymbolTable* table, const char* name);
void resize_table(SymbolTable* table);
translate_utils.h and translate_utils.c
You need to implement the following utility functions, which will be frequently used during the instruction translation process:
int translate_num(long int* output, const char* str, ImmType type);
int translate_reg(const char* str);
int is_valid_imm(long imm, ImmType type);
Different instruction types have different immediate value ranges. Add more immediate types to ImmType and add the corresponding range validation in is_valid_imm():
typedef enum {
IMM_NONE, /* No immediate value */
IMM_12_SIGNED, /* 12-bit signed number */
/* Add more types here */
} ImmType;
Pass One Writing Function (translate.c)
write_pass_one() already calls the relevant handlers for pseudo-instruction expansion. You need to complete the processing for regular instructions.
unsigned write_pass_one(Block* blk, const char* name, char** args, int num_args, uint32_t current_line, uint32_t offset);
Pseudo Expansion (translate.c)
You need to implement the following functions to expand pseudo-instructions and save them in the intermediate representation block:
static const InstrInfo instr_table[];
unsigned transform_beqz(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
unsigned transform_bnez(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
unsigned transform_li(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
unsigned transform_mv(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
unsigned transform_j(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
unsigned transform_jr(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
unsigned transform_jal(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
unsigned transform_jalr(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
unsigned transform_lw(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
unsigned transform_la(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
Instruction Writing Functions (translate.c)
You need to implement the following functions to translate regular instructions and output them in hexadecimal format:
int write_rtype(FILE* output, const InstrInfo* info, char** args, size_t num_args);
int write_itype(FILE* output, const InstrInfo* info, char** args,
                size_t num_args, uint32_t offset, SymbolTable* symtbl);
int write_stype(FILE* output, const InstrInfo* info, char** args,
                size_t num_args);
int write_sbtype(FILE* output, const InstrInfo* info, char** args,
                 size_t num_args, uint32_t offset, SymbolTable* symtbl);
int write_utype(FILE* output, const InstrInfo* info, char** args,
                size_t num_args, uint32_t offset, SymbolTable* symtbl);
int write_ujtype(FILE* output, const InstrInfo* info, char** args,
                 size_t num_args, uint32_t offset, SymbolTable* symtbl);
For the functions mentioned above, you can quickly locate them by searching for TODO in the source files.
HINT: We recommend installing the Better Comments extension for VS Code. With it, the TODO markers for the parts you need to complete are highlighted.
Each function should be implemented between the start and end markers:
//TODO
/* === start === */
...
/* === end === */
The framework code is only meant to provide a general approach. In addition to completing the required sections, please also pay attention to the return values. Some functions have default return values as placeholders, which may be outside the expected range. Make sure to update the return values accordingly.
Important: If you need to add helper
functions, additional variables, structures, macro definitions,
etc., you are free to include them in the files we provide.
However, since the Makefile is fixed,
please do not add extra files, as this will
lead to compilation issues.
How to Run the Assembler
After running:
make assembler
an assembler executable will be generated. (This command will automatically invoke make clean first.)
To run your newly generated assembler, use the following command:
./assembler --input_file <input_file> --output_folder <output_folder>
- <input_file>: The input RISC-V file.
- <output_folder>: The directory where the output files will be stored.
When processing an input file such as test.S, the <output_folder> will contain three files:
- test.out: Contains the binary instruction results.
- test.log: A log file that records all errors or confirms a successful assembly.
- test.data: Contains the address and content of every byte in memory.
To achieve a correct result, ensure that these files match the corresponding reference files provided.
Example Usage
If you want to compile testcases/testcase1.S and save the output in the out/ directory, run:
./assembler --input_file testcases/testcase1.S --output_folder out/
To check valid labels and instructions after the first pass, add the --test option:
./assembler --input_file <input_file> --output_folder <output_folder> --test
This will generate .tbl and .inst files in <output_folder> for intermediate verification.
Note: The correctness of these .tbl and .inst files does not affect your grade; they are provided solely as a reference for you to verify your intermediate results. If your .out and .log files are correct, you can ignore any differences in .tbl and .inst.
How to Perform Testing
1. Provided Test Cases
We have provided several test cases that you can run using:
make check
Test case inputs are stored in ./test/in/, and their outputs are saved in the ./test/out/ folder.
To remove all previously generated output, you can run:
make clean
Important: Running make check does not automatically execute make assembler. You must ensure that you have built the latest version of your assembler before running the tests.
Evaluation Criteria
Your assembler will be evaluated based on three
aspects:
- Correct binary instruction generation
- Proper error handling (catching all errors)
- Memory safety (no memory leaks)
Checking Output and Errors
For correctness, we compare your .log and
.out files against the reference outputs using the
diff command. If you see "Diff .out check
failed" or "Diff .log check failed",
it means your output differs from the expected result. You can
manually compare the files using:
diff file1 file2
For more details on diff results, refer to this guide.
Memory Leak Detection
If you receive a "Valgrind check failed"
message, check out/%.memcheck for error
details.
To detect memory leaks, we use the following command:
valgrind --tool=memcheck --leak-check=full --track-origins=yes ./assembler --input_file <input_file> --output_folder <output_folder>
2. Running a Single Test Case
We also provide a way to run selected test cases, saving time and allowing you to test custom cases. To run a specific test case:
make test TEST_NAME=<test_name>
This will use ./test/in/<test_name>.S as input and output results to ./test/out/.
If you want to include your own test case every time you run
make check, you can modify the
Makefile located in ./test/. Simply
add the name of your test case to the FULL_TESTS
variable.
Running Your Own Test Case
If you create a custom test file (test.S),
follow these steps:
Move your test case to the correct folder:
mv test.S ./test/in/test.S
Run the test:
make test TEST_NAME=test
Using Venus for Test Case Creation
For easier test case generation, you can use Venus, a powerful RISC-V simulator.
Steps to Use Venus
- Write RISC-V assembly instructions in the Editor page.
- Navigate to the Simulator page to view the corresponding machine code.
- Use the Dump button to export the machine code for reference.
Grading
Warning: Passing all local tests does not guarantee full marks!
The provided test cases are only a basic guide. The Online
Judge (OJ) system contains many additional corner cases that
will rigorously test your assembler.
You should thoroughly test your implementation to ensure robustness. Do not rely on OJ as your debugger!
Submission
Submit on Gradescope by selecting your GitHub repository and the right branch.
Only the last active submission will count toward your project grade. Make sure it is your best version.
- Due: 2026-04-07 23:59
In Project 1.1 are:
Siting Liu liust@shanghaitech.edu.cn
Yuan Xiao xiaoyuan@shanghaitech.edu.cn
Maoxi Ma mamx2023@shanghaitech.edu.cn
Luntian Zhang zhanglt2023@shanghaitech.edu.cn