
Computer Architecture I (CS110 / CS110P) Document
Reference: CS110 Course Page

Homework 3: vLLM PagedAttention Memory Manager

Computer Architecture - CS110 @ ShanghaiTech University

Overview

In this homework, you will implement a custom physical memory allocator. This memory manager uses a multi-level block table (similar to Page Tables in Operating Systems) to manage a large memory pool. This design is heavily inspired by the PagedAttention mechanism from the vLLM framework, a foundational technology for modern AI systems.

Before you start, you have to join the GitHub Classroom and accept the assignment via this link. A repository containing the starter code will be generated for you. You can then start the assignment by cloning it to your Ubuntu system. To submit, push your local clone to GitHub and turn in your work on Gradescope by connecting it to your GitHub repo.

Part 1: Background & Motivation (The "Why")

Note: You do not need to understand AI algorithms to complete this C programming assignment. This is just the real-world motivation.

1. What is a "Sequence" and "KV Cache"? When you chat with an AI (like ChatGPT), a single chat session is called a Sequence. The AI generates its response word by word (token by token). To generate the next word efficiently, it needs to remember the context of all previous words. Instead of re-reading the entire conversation every time, the AI calculates and saves the historical context data. This cached context is known as the KV Cache (Key-Value Cache).

2. The Problem: Memory Fragmentation The system starts with a massive, continuous physical memory pool. Traditionally, when a new Sequence (chat session) starts, the memory allocator guesses the "maximum possible length" of the conversation and reserves a huge, continuous chunk of memory from the pool for that Sequence.

  • Internal Fragmentation: If the allocator reserves 1,000 slots for a Sequence, but the user says "Hello" and ends the chat after 5 slots, the remaining 995 slots in that chunk are completely wasted and cannot be used by anyone else.
  • External Fragmentation: Over time, as different Sequences reserve and free continuous chunks of varying sizes, the memory pool ends up looking like Swiss cheese: filled with irregular, scattered empty spaces that are too small to serve new large requests.

3. The Solution: Virtual Memory & Paging (PagedAttention) To solve this, PagedAttention borrows the concept of Paging from Operating Systems. Instead of giving a Sequence a continuous chunk of memory, the system divides the entire memory pool into small, fixed-size Physical Blocks (Pages).

When a Sequence needs memory, the allocator simply finds any free blocks in the pool and assigns them. The Sequence logically thinks its KV Cache is growing continuously, but physically, the blocks are scattered everywhere. A Block Table (like a Page Table) is used to map the Sequence's logical progress to the scattered physical blocks.

Part 2: Core Concepts & Architecture (The "How")

To implement this, we map these high-level AI concepts into basic C data types.

1. KV Records and Physical Blocks

  • KV Record: In reality, the KV cache stores complex data. For this assignment, one KV record is simplified to a single int (occupying 4 Bytes).
  • Physical Block: The total memory pool provided to your program (pool_addr) is divided into fixed-size chunks called "Blocks".
    • Each block is exactly 64 Bytes.
    • Since a KV record is 4 Bytes, each block can hold exactly KV_PER_BLOCK = 16 KV records.

2. The 2-Level Block Table (Address Translation)

To manage a sequence's scattered memory, you will build a 2-Level Block Table.

  • L1 Block Table (L1_PT_SIZE = 16): An array of 16 pointers. Each pointer can point to an L2 Table.
  • L2 Block Table (L2_PT_SIZE = 16): An array of 16 integers. It stores the actual Physical Block ID mapping.

Maximum Capacity: 16 (L1 slots) × 16 (L2 slots) = 256 logical blocks per sequence.

📖 Step-by-Step Address Translation Example: Imagine a sequence is about to write the KV record at index 613 (so seq->kv_count = 613 records are already stored). Where exactly in the physical memory does this integer go?

  1. Logical Block Index: 613 / 16 = 38 (integer division). The record belongs in logical block 38 of this sequence.
  2. Intra-block Offset: 613 % 16 = 5. Inside that specific block, it goes into slot 5.
  3. Table Traversal: How do we find logical block 38?
    • L1 Index: 38 / 16 = 2. Go to block_table[2].
    • L2 Index: 38 % 16 = 6. Go to block_table[2][6].
  4. Physical Write: You read the integer stored at block_table[2][6] (e.g., let's say it's 3). This means the data belongs in Physical Block #3. You calculate the memory address of Physical Block #3, jump to its integer slot 5, and write the data.

[Figure HW3_1: two-level block table address translation]

3. Copy-on-Write (CoW) & Reference Counting

Why do we need CoW? Have you ever clicked "Regenerate response" in an AI chat, or asked the AI to "give me 3 different ideas"? In these cases, the AI creates multiple Branches (child sequences) from the same initial user prompt. Because all these branches share the exact same prompt, they have the exact same initial KV Cache. To save memory, they shouldn't duplicate this data; they should just share the physical memory blocks of the prompt.

  • Reference Counting (ref_count): You must maintain a global array of size total_local_blocks; each entry tracks how many sequences are currently using the corresponding physical block.

  • Copy-on-Write Protocol (The Implementation): If Sequence A and Sequence B share Physical Block #8 (ref_count = 2), and Sequence A wants to append a new, unique word to Block #8, it must not write it directly. If it does, Sequence B's memory is corrupted.

    Instead, Sequence A must trigger CoW: It finds a new free block (e.g., Block #12), copies all 64 bytes from Block #8 to Block #12, updates its own L2 table to point to Block #12, decrements the ref_count of Block #8, sets the ref_count of Block #12 to 1, and then appends the new data to Block #12.

4. Core Structures

The entire system relies on the following structs defined in your header file:

struct list_node {
    struct list_node *prev;
    struct list_node *next;
};

typedef struct Sequence {
    int seq_id;                     // Unique identifier for the sequence
    int **block_table;              // 2-Level block table root: array of L2 table pointers
    int kv_count;                   // Total number of KV records currently held
    struct list_node node;          // Intrusive list node (Must be at the end)
} Sequence;

typedef struct {
    void *pool_base_ptr;            // Base address of the memory pool (Injected by TA)
    int total_local_blocks;         // Total number of physical blocks
    int free_local_blocks;          // Number of currently free physical blocks
    int *ref_count;                 // Dynamic array: Reference count for each block
    struct list_node active_seqs;   // Dummy head for the active sequence list
} LLMEngine;

5. Intrusive Linked List (🚨 DANGER ZONE)

We use a Linux-kernel style intrusive linked list to track active sequences. In a standard linked list, a node contains data. In an intrusive linked list, the data contains the node. Look closely at the Sequence struct above: the node is embedded at the very end.

Memory Layout Analysis: Imagine the Sequence struct in memory. It starts at address 0x1000.

  • 0x1000: seq_id (4 bytes)
  • 0x1008: block_table (8 bytes)
  • 0x1010: kv_count (4 bytes)
  • 0x1018: node (16 bytes)

If you are traversing the linked list, you will be holding a pointer to the node (address 0x1018). If you simply cast (Sequence*)node, C will mistakenly treat 0x1018 as the start of the Sequence. It will read node's memory as if it were seq_id and block_table, causing immediate memory corruption and Segmentation Faults.

(Note: You might notice the addresses jump by 8 bytes between seq_id and block_table, as well as kv_count and node. This is due to Data Structure Alignment (Padding) in 64-bit systems, which inserts empty bytes to ensure 8-byte pointers are aligned to 8-byte boundaries in memory.)

To solve this, you must use the <stddef.h> macro offsetof(type, member). This macro computes the exact byte offset of a field within a struct. By doing pointer arithmetic (subtracting the offset from your current node pointer), you can correctly recover the true starting address (0x1000) of the Sequence.

[Figure HW3_2: recovering a Sequence from its embedded list node with offsetof]

API Implementation Specification

Implement the following APIs in llm_memory.c.

🚨 Global Constraints (STRICTLY ENFORCED)

Your code will be evaluated using White-Box testing. You must strictly follow these rules, or the autograder will fail you.

  1. Pointer Safety: Any API receiving LLMEngine *engine MUST check if engine == NULL on its first line. If true, return NULL or -1.
  2. Validation Priority Rule (Check Before Allocate): You must follow this exact order:
    • Phase 1: Basic validity (e.g., pointer checks, sizes <= 0).
    • Phase 2: ID Constraints (check if seq_id exists or is unique).
    • Phase 3: Resource Probing. Before modifying any state, check if there are enough free physical blocks.
    • Contract: If any phase fails, return immediately. DO NOT allocate memory (malloc) or modify block tables if the operation will eventually fail due to Out-Of-Memory (OOM).
  3. First-Fit Allocation: When you need a new physical block, you MUST use a for loop from 0 to total_local_blocks - 1 and pick the very first block where ref_count == 0.
  4. No Premature Memory Modification: The autograder pre-fills the physical memory pool (pool_addr) with special validation magic numbers. You are strictly forbidden to call memset or zero out a physical block when you allocate it during create_sequence. You may ONLY write to the physical memory during an actual append_kv operation.
  5. Free Block Consistency: The variable engine->free_local_blocks tracks available physical blocks. It must be decremented when a new block is assigned, and incremented when a block's ref_count drops to 0.

Task 1: Engine Initialization

Background: The framework hands you a raw block of memory (pool_addr). Your task is to set up the management structures (the LLMEngine) before any Sequence arrives.

LLMEngine* llm_init(void *pool_addr, size_t pool_size);
  • Return NULL if pool_addr == NULL or pool_size < BLOCK_SIZE.
  • Allocate LLMEngine. Calculate total_local_blocks = pool_size / BLOCK_SIZE.
  • Initialize free_local_blocks = total_local_blocks. Store pool_addr.
  • Allocate the ref_count array using calloc (initialized to 0).
  • Initialize the active_seqs dummy head. For an empty list, the dummy head's prev and next pointers MUST point to itself.

Task 2: Sequence Creation & Deletion

Background (The "Prefill" Phase): When a user sends a prompt to the AI, we immediately know the exact length of that prompt (initial_kv_len). Therefore, we can preemptively calculate how many physical blocks are required and map them all at once.

Sequence* create_sequence(LLMEngine *engine, int seq_id, int initial_kv_len);
  • Phase 1: Return NULL if initial_kv_len <= 0 or if seq_id already exists.
  • Phase 2 (Capacity Pre-Check): Calculate required blocks using ceiling division: (initial_kv_len + KV_PER_BLOCK - 1) / KV_PER_BLOCK. If required blocks exceed 256 OR exceed free_local_blocks, return NULL immediately.
  • Allocate Sequence and its L1 block table (an array of 16 pointers). Initialize all 16 pointers to NULL.
  • Iteratively map the required blocks:
    • If an L2 block table is needed, malloc it (an array of 16 ints) and initialize all 16 entries to -1.
    • Allocate a free physical block (First-Fit), map its ID into the L2 table, set its ref_count to 1, and decrement free_local_blocks.
    • 🚨 Reminder: Do not write data or memset the physical blocks here!
  • Set seq->kv_count = initial_kv_len, append the sequence to the tail of active_seqs (which is head->prev), and return it.
int free_sequence(LLMEngine *engine, int seq_id);
  • Find and detach the target sequence from the list. Return -1 if not found.
  • 🚨 CRITICAL TRAVERSAL RULE: You MUST loop through ALL 16 slots of the L1 table. DO NOT break the loop early if you encounter a NULL pointer. A sequence might have blocks logically scattered, leaving empty slots in between.
  • For every mapped physical block (!= -1), decrement its ref_count. If a block's count becomes 0, increment free_local_blocks.
  • Free all allocated L2 arrays, the L1 array, and the sequence itself. Return 0.

Task 3: Sequence Forking (Tree Deep Copy)

Background: The AI wants to explore a new branch (e.g., "regenerate response"). We create a child sequence. To save memory, it should point to the exact same physical blocks as its parent.

Sequence* fork_sequence(LLMEngine *engine, int parent_seq_id, int new_seq_id);
  • Ensure the parent exists and new_seq_id is globally unique.
  • Allocate a new Sequence and a new L1 block table for the child.
  • For each non-NULL parent L2 table: allocate a child L2 table, initialize to -1, then exactly copy the mapped entries.
  • Increment the global ref_count for every valid physical block inherited by the child.
  • Assign kv_count and new_seq_id. Append the child sequence to the tail of active_seqs, and return it.

Task 4: Virtual Addressing, CoW & Appending

Background (The "Decode" Phase): After the initial prompt, the AI generates text one word at a time. The KV Cache grows dynamically. We only allocate a new physical block when the current one is completely full, or if we trigger CoW.

int append_kv(LLMEngine *engine, int seq_id, int kv_data);
  • Find the sequence. Calculate logical_idx and offset (as explained in the Address Translation section). Return -1 if logical_idx >= 256.
  • OOM Interception: Does this append require a new physical block?
    • Yes, if the L2 slot is -1 (unmapped).
    • Yes, if the L2 slot is mapped but ref_count > 1 (Requires CoW).
    • No, if the L2 slot is mapped and ref_count == 1 (Exclusive access).
    • If a new block is required but free_local_blocks == 0, return -1 immediately without modifying any states.
  • Execution:
    • Allocate an L2 table if it is NULL.
    • New Page: Allocate a free block (First-Fit), map it, set ref to 1, and decrement free_local_blocks.
    • Copy-on-Write (CoW): Allocate a free block. Use memcpy to copy exactly 64 bytes from the old physical block to the new one. Adjust reference counts, update the L2 mapping, and decrement free_local_blocks.
  • Calculate the exact physical memory address of the target block, cast it to int*, and store kv_data at the correct offset. Increment kv_count and return 0.

Task 5: Engine Destruction

Background: Shutting down the allocator and cleaning up.

int llm_destroy(LLMEngine *engine);
  • Safely traverse the list and destroy all remaining sequences. (Do NOT attempt to free the dummy head itself).
  • Free the ref_count array and the engine structure.
  • DO NOT free pool_base_ptr (the testing framework owns it). Return 0.

Compilation Standard

Your code will be evaluated using the following strict compilation flags.

Compile object file only:

gcc -std=c11 -Wall -Wextra -Wpedantic -Werror -O2 -ggdb -c llm_memory.c -o llm_memory.o

When linking with a grader/main program:

gcc -std=c11 -Wall -Wextra -Wpedantic -Werror -O2 -ggdb llm_memory.c <grader_or_main>.c -o <program>

Warnings are treated as errors. Your code will be rigorously tested with Valgrind. Memory leaks (e.g., un-freed inner L2 block tables) are strictly forbidden; Valgrind must report 0 bytes lost.

Academic Integrity

All submissions must be your own work. Collaboration is allowed only within the limits specified in the course policy. Plagiarism, including copying code from others or online sources without proper attribution, is strictly prohibited and will result in severe penalties.