
Computer Architecture I (CS110 / CS110P) Document
Reference: CS110 Course Page

Project 1.1 : A RISC-V Assembler (Individual Project)

IMPORTANT INFO - PLEASE READ

The projects are part of your CS110P Design Project, worth 2 credit points. Project 1.1 contributes 17% to your CS110P grade. These projects run in parallel with the course, meaning that project and homework deadlines may be very close to each other. Start early and avoid procrastination!

Introduction to Project 1.1

Our objective is to implement a basic RISC-V assembler that converts assembly instructions into machine code. The assembler supports the .data segment of assembly files (the .space, .byte, and .word directives) and targets the RV32I instruction set plus part of the RV32M extension.

Before you start, please accept the assignment via this link. A repository containing the starter code will be generated for you. Clone it to your Ubuntu system to begin working. To submit, push your local clone to GitHub and turn in your work on Gradescope by connecting it to your GitHub repo.

Academic integrity is strictly enforced in this course, and any plagiarism will have serious consequences.

The assembler operates in two phases:

  • Phase 1: Parse the assembly source code (.S file), remove comments, process the data in the .data section, create symbol tables for labels, and generate an intermediate representation consisting of basic instructions and extended pseudo-instructions.
  • Phase 2: Using the symbol table and intermediate representation, translate each instruction into its corresponding machine code (binary representation) and output the result in hexadecimal format.

For machine code transformation, you can refer to the RISC-V Green Card and Venus.

Instruction Set to Support

R-Type (14)

add, sub, xor, or, and, sll, srl, sra, slt, sltu, mul, mulh, div, rem

I-Type (16)

addi, xori, ori, andi, slli, srli, srai, slti, sltiu, lb, lh, lw, lbu, lhu, jalr, ecall

S-Type (3)

sb, sh, sw

SB-Type (6)

beq, bne, blt, bge, bltu, bgeu

U-Type (2)

lui, auipc

UJ-Type (1)

jal

Text segment pseudo-instructions (10)

beqz, bnez, li, mv, j, jr, jal, jalr, lw, la

Data segment pseudo-instructions (3)

.word, .space, .byte
  • Hint: div is DIVide (word), mul is MULtiply (word), mulh is MULtiply High, and rem is REMainder (word). If you are unfamiliar with div, mul, mulh, or rem, please refer to the RISC-V Green Sheet.

  • Hint: .space reserves the given number of bytes in memory and initializes them to 0. If you are unsure how it behaves, try these directives in Venus.

  • Hint: For pseudo-instructions beyond the RISC-V Green Sheet, refer to this link.

  • Hint: The immediate value following li may exceed the range supported by addi. In such cases, you should use a combination of lui and addi to correctly represent li.
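
The lui/addi split for li can be sketched as follows. The helper name split_li_imm is hypothetical and not part of the starter code. The point it illustrates is that addi sign-extends its 12-bit immediate, so when bit 11 of the value is set the lui operand has to be one larger than the plain upper 20 bits.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the starter code): split a 32-bit
 * immediate into the 20-bit operand for lui and the signed 12-bit
 * operand for addi, so that (upper << 12) + lower == imm (mod 2^32). */
void split_li_imm(int32_t imm, uint32_t *upper, int32_t *lower) {
    uint32_t u = (uint32_t)imm;
    /* Low 12 bits, sign-extended the same way addi does. */
    int32_t lo = (int32_t)(u & 0xFFFu);
    if (lo >= 0x800) {
        lo -= 0x1000;
    }
    /* A negative low part is compensated by a larger upper part. */
    *upper = ((u - (uint32_t)lo) >> 12) & 0xFFFFFu;
    *lower = lo;
}
```

For example, 0x12345FFF splits into a lui operand of 0x12346 and an addi operand of -1, while 0x12345678 splits into 0x12345 and 0x678.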

Getting Started

Directory Tree

.
|-- Makefile
|-- README.m
|-- assembler.c
|-- assembler.h
|-- test
|   |-- in
|   |   |-- labels.s
|   |   ...
|   |-- Makefile
|   `-- ref
|       |-- labels.log
|       ...
`-- src
    |-- block.c
    |-- block.h
    |-- tables.c
    |-- tables.h
    |-- translate.c
    |-- translate.h
    |-- translate_utils.c
    |-- translate_utils.h
    |-- utils.c
    `-- utils.h

How a RISC-V Assembly File Is Translated

The main assembly process is implemented in assembler.c:assembler().

Consider an assembly input file with the following content:

    .data
my_label:
    .word 0x01234567
    .byte 0x11 0x22 0x33 0x44
    .space 4

    .text
    li   x1, 0
    li   x2, 5
loop:  
    addi x1, x1, 1
    blt  x1, x2, loop
    j    exit
exit:

Phase 1 - pass_one()

During this phase, the assembler processes the input file line by line:

  • Labels (e.g., my_label, loop, exit) are recorded in the symbol table (table) along with their corresponding addresses.

  • Data: After the directive's parameters pass the appropriate error checks, each byte of data is recorded at its memory address.

  • Instructions:

    • Pseudo-instructions are expanded using their respective handlers and stored in the intermediate representation block blk.
    • Regular instructions are recorded directly in blk.
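
The parameter checks for data directives mentioned above usually start with a strict numeric parse. A minimal sketch, assuming tokens are decimal or 0x-prefixed hexadecimal; parse_number is a hypothetical name, and the exact set of accepted formats and value ranges is defined by the reference outputs, not by this example.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical sketch: parse a decimal or 0x-prefixed token.
 * Returns 0 and stores the value on success, -1 on malformed input. */
int parse_number(const char *token, long *out) {
    if (token == NULL || *token == '\0') return -1;
    char *end;
    errno = 0;
    /* Base 0 lets strtol pick hex for "0x..." and decimal otherwise
     * (note that it also treats a leading "0" as octal). */
    long v = strtol(token, &end, 0);
    /* Reject overflow, empty parses, and trailing garbage. */
    if (errno == ERANGE || end == token || *end != '\0') return -1;
    *out = v;
    return 0;
}
```

Checking the end pointer is what distinguishes a clean token such as 0x11 from a malformed one such as 12ab, which strtol would otherwise accept as 12.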

After Phase 1, the intermediate results are as follows:

  • table (Symbol Table): Stores label-address mappings. See src/tables.h and src/tables.c for details. Note: data addresses start at 0x10000000 and text addresses start at 0, the same as in Venus.

    268435456 my_label
    
    8  loop
    
    20 exit
  • blk (Intermediate Representation Block): Stores expanded instructions while retaining unresolved label references. For example, li x1, 0 is expanded into addi x1, x0, 0.

    addi x1, x0, 0
    
    addi x2, x0, 5
    
    addi x1, x1, 1
    
    blt x1, x2, loop
    
    jal zero, exit
  • DataImage: Stores every valid byte of the .data segment. For example, the data in the .data section above is output as:

    0x10000000 0x67
    0x10000001 0x45
    0x10000002 0x23
    0x10000003 0x01
    0x10000004 0x11
    0x10000005 0x22
    0x10000006 0x33
    0x10000007 0x44
    0x10000008 0x00
    0x10000009 0x00
    0x1000000a 0x00
    0x1000000b 0x00

    If you have any doubts about this, you can refer to Venus. Then, check how this data is stored in memory.
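
The listing above is little-endian: the least significant byte of the .word value 0x01234567 lands at the lowest address. A minimal sketch of that layout, using a plain byte buffer and a hypothetical helper store_word_le rather than the starter code's DataImage:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write a 32-bit word into buf at offset in
 * little-endian order (least significant byte first). Returns the
 * offset just past the stored word. */
size_t store_word_le(uint8_t *buf, size_t offset, uint32_t value) {
    for (int i = 0; i < 4; i++) {
        buf[offset + i] = (uint8_t)(value >> (8 * i));
    }
    return offset + 4;
}
```

Storing 0x01234567 at offset 0 yields the bytes 0x67, 0x45, 0x23, 0x01, matching addresses 0x10000000 through 0x10000003 in the example.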

Phase 2 - pass_two()

In this phase, each instruction in blk is processed sequentially:

  • Labels are resolved using the symbol table (src/translate.c:translate_inst()).
  • Each instruction is converted into its corresponding machine code in binary form.

Final Output (Hexadecimal Machine Code)

0x00000093
0x00500113
0x00108093
0xFE20CEE3
0x0040006F

This output represents the fully assembled machine code for the given input assembly file.
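
To see where these words come from, the I-type and SB-type bit layouts can be packed by hand. The encoder names below are hypothetical and only illustrate the field positions of the RV32I base formats; they are not the signatures used in the starter code.

```c
#include <stdint.h>

/* Hypothetical encoders; field layout follows the RV32I base formats. */

/* I-type: imm[11:0] | rs1 | funct3 | rd | opcode */
uint32_t encode_itype(uint32_t opcode, uint32_t funct3,
                      uint32_t rd, uint32_t rs1, int32_t imm) {
    uint32_t u = (uint32_t)imm & 0xFFFu;
    return (u << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode;
}

/* SB-type: imm[12|10:5] | rs2 | rs1 | funct3 | imm[4:1|11] | opcode.
 * imm is the byte offset from the branch to its target. */
uint32_t encode_sbtype(uint32_t opcode, uint32_t funct3,
                       uint32_t rs1, uint32_t rs2, int32_t imm) {
    uint32_t u = (uint32_t)imm;
    return (((u >> 12) & 0x1u) << 31) | (((u >> 5) & 0x3Fu) << 25) |
           (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
           (((u >> 1) & 0xFu) << 8) | (((u >> 11) & 0x1u) << 7) | opcode;
}
```

With opcode 0x13 and funct3 0 for addi, encode_itype(0x13, 0, 2, 0, 5) gives 0x00500113, the second word of the output. The branch blt x1, x2, loop sits at address 12 and targets address 8, so its offset is -4, and encode_sbtype(0x63, 0x4, 1, 2, -4) gives 0xFE20CEE3.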

The main assembly workflow (assemble()) and helper functions are provided. You need to implement the following label addition function and two-phase parsing functions:

static int data_image_reserve(DataImage *image, size_t additional)

static int parse_data_number(const char *token, long *out)

static int parse_data_directive(uint32_t input_line, char *directive, char *rest_ctx, DataImage *data_image,
                                uint32_t *data_offset)
                                
static int parse_data_line(uint32_t input_line, char *line, SymbolTable *table,
                            DataImage *data_image, uint32_t *data_offset)                                

static int add_if_label(uint32_t input_line, char* str, uint32_t byte_offset, SymbolTable* symtbl);

int pass_one(FILE* input, Block* blk, SymbolTable* table);

int pass_two(Block* blk, SymbolTable* table, FILE* output);

tables.h and tables.c

You need to complete the SymbolTable structure defined in src/tables.h:

typedef struct {

  /* Define your data structure here. */

  uint32_t len;

  uint32_t cap;

  int mode;

} SymbolTable;

Implement the following SymbolTable management interfaces in src/tables.c:

SymbolTable* create_table(int mode);

void free_table(SymbolTable* table);

static Symbol* lookup(SymbolTable* table, const char* name);

int add_to_table(SymbolTable* table, const char* name, uint32_t addr);

int64_t get_addr_for_symbol(SymbolTable* table, const char* name);

void resize_table(SymbolTable* table);
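
The len and cap fields suggest a growable array. One possible shape is sketched below with hypothetical types and names (Entry, NameTable, name_table_*), which are assumptions made for illustration and not the starter code's definitions. This sketch always rejects duplicate names; how the mode field should influence that is left to the starter code's comments.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types for illustration; the real layout is up to you. */
typedef struct {
    char *name;
    uint32_t addr;
} Entry;

typedef struct {
    Entry *items;
    uint32_t len;
    uint32_t cap;
} NameTable;

/* Returns the address for name, or -1 if it is not in the table. */
int64_t name_table_find(const NameTable *t, const char *name) {
    for (uint32_t i = 0; i < t->len; i++) {
        if (strcmp(t->items[i].name, name) == 0) return t->items[i].addr;
    }
    return -1;
}

/* Appends a copy of name; doubles the capacity when full.
 * Returns 0 on success, -1 on duplicate name or allocation failure. */
int name_table_add(NameTable *t, const char *name, uint32_t addr) {
    if (name_table_find(t, name) >= 0) return -1;
    if (t->len == t->cap) {
        uint32_t new_cap = t->cap ? t->cap * 2 : 4;
        Entry *p = realloc(t->items, new_cap * sizeof *p);
        if (p == NULL) return -1;
        t->items = p;
        t->cap = new_cap;
    }
    size_t n = strlen(name) + 1;
    char *copy = malloc(n);
    if (copy == NULL) return -1;
    memcpy(copy, name, n);
    t->items[t->len].name = copy;
    t->items[t->len].addr = addr;
    t->len++;
    return 0;
}

/* Frees every stored name and the array itself. */
void name_table_free(NameTable *t) {
    for (uint32_t i = 0; i < t->len; i++) free(t->items[i].name);
    free(t->items);
    t->items = NULL;
    t->len = t->cap = 0;
}
```

Copying each name into the table and freeing it in one place keeps ownership simple, which matters because the project is also checked for memory leaks. The lookup returns a 64-bit signed value so that -1 can signal a missing name without colliding with any valid 32-bit address.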

translate_utils.h and translate_utils.c

You need to implement the following utility functions, which will be frequently used during the instruction translation process:

int translate_num(long int* output, const char* str, ImmType type);

int translate_reg(const char* str);

int is_valid_imm(long imm, ImmType type);
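
As an illustration of register-name translation, the sketch below maps x0 through x31 and the alias zero to register numbers. The name parse_reg is hypothetical, and both the set of accepted aliases and the rejection of forms like x01 are assumptions of this example; check the starter code and reference outputs for the names the real function must accept.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: map a register name to its number.
 * Handles only "x0".."x31" and the alias "zero"; returns -1 otherwise. */
int parse_reg(const char *str) {
    if (str == NULL) return -1;
    if (strcmp(str, "zero") == 0) return 0;
    if (str[0] != 'x' || str[1] == '\0') return -1;
    /* Reject signs, whitespace, and leading zeros such as "x01". */
    if (str[1] < '0' || str[1] > '9') return -1;
    if (str[1] == '0' && str[2] != '\0') return -1;
    char *end;
    long n = strtol(str + 1, &end, 10);
    if (*end != '\0' || n > 31) return -1;
    return (int)n;
}
```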

Different instruction types accept different immediate ranges. Add more immediate types to ImmType and complete the corresponding range validation in is_valid_imm():

typedef enum {

  IMM_NONE,         /* No immediate value */

  IMM_12_SIGNED,    /* 12-bit signed number */

  

  /* Add more types here */

} ImmType;
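
Range validation for each immediate type reduces to a width check. A minimal sketch with hypothetical helpers fits_signed and fits_unsigned; which widths your ImmType entries need depends on the instruction formats you support. A 12-bit signed field covers -2048 to 2047, and a 5-bit unsigned field (as used for RV32I shift amounts) covers 0 to 31.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: does value fit in a signed two's-complement
 * field of the given width (1..32 bits)? */
int fits_signed(int64_t value, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    int64_t min = -((int64_t)1 << (bits - 1));
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    return value >= min && value <= max;
}

/* Hypothetical helper: does value fit in an unsigned field of the
 * given width (1..32 bits)? Useful for shift amounts, for example. */
int fits_unsigned(int64_t value, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    return value >= 0 && value <= (((int64_t)1 << bits) - 1);
}
```

Doing the comparison in 64-bit arithmetic avoids overflow when the width is 32.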

Pass One Writing Function (translate.c)

write_pass_one() already calls the relevant handlers for pseudo-instruction expansion. You need to complete the handling of regular instructions.

unsigned write_pass_one(Block* blk, const char* name, char** args, int num_args, uint32_t current_line, uint32_t offset)

Pseudo Expansion (translate.c)

You need to implement the following functions to expand pseudoinstructions and save them in the intermediate representation block:

static const InstrInfo instr_table[];

unsigned transform_beqz(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);

unsigned transform_bnez(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);

unsigned transform_li(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);

unsigned transform_mv(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);

unsigned transform_j(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);

unsigned transform_jr(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);

unsigned transform_jal(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);

unsigned transform_jalr(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);

unsigned transform_lw(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);

unsigned transform_la(Block* blk, char** args, int num_args, uint32_t current_line, uint32_t offset);
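
The simplest expansions are one-to-one rewrites into a base instruction. The sketch below formats a few of them into a string buffer; expand_pseudo is a hypothetical helper that does not use the starter code's Block interface, and it spells the zero register as x0, which may differ from the spelling in the reference .inst files (the earlier example shows jal zero, exit for j).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the base-instruction form of a few
 * pseudo-instructions into out. Returns 0 on success, -1 if the name
 * or operand count is not handled or the buffer is too small. */
int expand_pseudo(const char *name, char **args, int num_args,
                  char *out, size_t out_size) {
    int n = -1;
    if (strcmp(name, "mv") == 0 && num_args == 2) {
        /* mv rd, rs  ->  addi rd, rs, 0 */
        n = snprintf(out, out_size, "addi %s, %s, 0", args[0], args[1]);
    } else if (strcmp(name, "beqz") == 0 && num_args == 2) {
        /* beqz rs, label  ->  beq rs, x0, label */
        n = snprintf(out, out_size, "beq %s, x0, %s", args[0], args[1]);
    } else if (strcmp(name, "bnez") == 0 && num_args == 2) {
        /* bnez rs, label  ->  bne rs, x0, label */
        n = snprintf(out, out_size, "bne %s, x0, %s", args[0], args[1]);
    } else if (strcmp(name, "j") == 0 && num_args == 1) {
        /* j label  ->  jal x0, label */
        n = snprintf(out, out_size, "jal x0, %s", args[0]);
    }
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Checking the operand count before formatting is where an error for a malformed pseudo-instruction would be raised in a real handler.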

Instruction Writing Functions (translate.c)

You need to implement the following functions to translate regular instructions and output them in hexadecimal format:

int write_rtype(FILE* output, const InstrInfo* info, char** args, size_t num_args) ;

int write_itype(FILE *output, const InstrInfo *info, char **args,
                size_t num_args, uint32_t offset, SymbolTable *symtbl);

int write_stype(FILE *output, const InstrInfo *info, char **args,
                size_t num_args);

int write_sbtype(FILE* output, const InstrInfo* info, char** args,
                 size_t num_args, uint32_t offset, SymbolTable* symtbl) ;

int write_utype(FILE* output, const InstrInfo* info, char** args,
                size_t num_args, uint32_t offset, SymbolTable* symtbl) ;

int write_ujtype(FILE* output, const InstrInfo* info, char** args,
                 size_t num_args, uint32_t offset, SymbolTable* symtbl);
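
As an illustration of the bit packing behind two of these formats, the sketch below encodes R-type and UJ-type words. The function names are hypothetical and the signatures are deliberately simpler than the write_* functions listed above; operand parsing, label lookup, and error reporting are left out.

```c
#include <stdint.h>

/* R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode */
uint32_t encode_rtype(uint32_t opcode, uint32_t funct3, uint32_t funct7,
                      uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* UJ-type: imm[20|10:1|11|19:12] | rd | opcode.
 * imm is the byte offset from the jump to its target. */
uint32_t encode_ujtype(uint32_t opcode, uint32_t rd, int32_t imm) {
    uint32_t u = (uint32_t)imm;
    return (((u >> 20) & 0x1u) << 31) | (((u >> 1) & 0x3FFu) << 21) |
           (((u >> 11) & 0x1u) << 20) | (((u >> 12) & 0xFFu) << 12) |
           (rd << 7) | opcode;
}
```

In the earlier example, jal zero, exit sits at address 16 and exit is at address 20, so the offset is 4 and encode_ujtype(0x6F, 0, 4) gives 0x0040006F. For R-type, add x1, x2, x3 is encode_rtype(0x33, 0, 0, 1, 2, 3), which gives 0x003100B3; mul uses the same opcode and funct3 with funct7 set to 1.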

You can quickly locate the functions mentioned above by searching for TODO in the source files.

HINT: We recommend installing the Better Comments plugin for VSCode, which highlights the TODO markers in the parts of the project you need to complete.

Implement each function between its start and end markers:

//TODO
/* === start === */


...

/* === end === */

The framework code is only meant to provide a general approach. In addition to completing the required sections, please also pay attention to the return values. Some functions have default return values as placeholders, which may be out of the expected range. Make sure to update the return values accordingly.

Important: If you need to add helper functions, additional variables, structures, macro definitions, etc., you are free to include them in the files we provide. However, since the Makefile is fixed, please do not add extra files, as this will lead to compilation issues.

How to Run the Assembler

After running:

make assembler

an assembler executable will be generated. (This command will automatically invoke make clean first.)

To run your newly generated assembler, use the following command:

./assembler --input_file <input_file> --output_folder <output_folder>
  • <input_file>: The input RISC-V file.
  • <output_folder>: The directory where the output files will be stored.

When processing an input file such as test.S, the <output_folder> will contain three files:

  • test.out: Contains the assembled machine code in hexadecimal.
  • test.log: A log file that records all errors or confirms a successful assembly.
  • test.data: Contains the address and content of every byte of data in memory.

To achieve a correct result, ensure that your output files match the corresponding reference files provided.

Example Usage

If you want to compile testcases/testcase1.S and save the output in the out/ directory, run:

./assembler --input_file testcases/testcase1.S --output_folder out/

To check valid labels and instructions after the first pass, add the --test option:

./assembler --input_file <input_file> --output_folder <output_folder> --test

This will generate .tbl and .inst files in <output_folder> for intermediate verification.

Note: The correctness of these .tbl and .inst files does not affect your grade; they are provided solely as a reference for you to verify your intermediate results. If your .out and .log files are correct, you can ignore any differences in .tbl and .inst.

How to Perform Testing

1. Provided Test Cases

We have provided several test cases that you can run using:

make check

Test case inputs are stored in ./test/in/, and their outputs are saved in the ./test/out/ folder.

To remove all previously generated output, you can run:

make clean

Important: Running make check does not automatically execute make assembler. You must ensure that you have built the latest version of your assembler before running the tests.

Evaluation Criteria

Your assembler will be evaluated based on three aspects:

  1. Correct binary instruction generation
  2. Proper error handling (catching all errors)
  3. Memory safety (no memory leaks)

Checking Output and Errors

For correctness, we compare your .log and .out files against the reference outputs using the diff command. If you see "Diff .out check failed" or "Diff .log check failed", it means your output differs from the expected result. You can manually compare the files using:

diff file1 file2

For more details on diff results, refer to this guide.

Memory Leak Detection

If you receive a "Valgrind check failed" message, check out/%.memcheck for error details.

To detect memory leaks, we use the following command:

valgrind --tool=memcheck --leak-check=full --track-origins=yes ./assembler --input_file <input_file> --output_folder <output_folder>

2. Running a Single Test Case

We also provide a way to run selected test cases, saving time and allowing you to test custom cases. To run a specific test case:

make test TEST_NAME=<test_name>

This will use ./test/in/<test_name>.S as input and output results to ./test/out/.

If you want to include your own test case every time you run make check, you can modify the Makefile located in ./test/. Simply add the name of your test case to the FULL_TESTS variable.

Running Your Own Test Case

If you create a custom test file (test.S), follow these steps:

  1. Move your test case to the correct folder:

    mv test.S ./test/in/test.S
  2. Run the test:

    make test TEST_NAME=test

Using Venus for Test Case Creation

For easier test case generation, you can use Venus, a powerful RISC-V simulator.

Steps to Use Venus
  1. Write RISC-V assembly instructions in the Editor page.
  2. Navigate to the Simulator page to view the corresponding machine code.
  3. Use the Dump button to export the machine code for reference.

Grading

Warning: Passing all local tests does not guarantee full marks!

The provided test cases are only a basic guide. The Online Judge (OJ) system contains many additional corner cases that will rigorously test your assembler.

You should thoroughly test your implementation to ensure robustness. Do not rely on OJ as your debugger!

Submission

Submit on Gradescope by selecting your GitHub repository and the right branch.

Only your last active submission counts toward your project grade. Make sure it is your best version.

  • Due: 2026-04-07 23:59

Contacts for Project 1.1:

Siting Liu liust@shanghaitech.edu.cn

Yuan Xiao xiaoyuan@shanghaitech.edu.cn

Maoxi Ma mamx2023@shanghaitech.edu.cn

Luntian Zhang zhanglt2023@shanghaitech.edu.cn