Homework 3: vLLM PagedAttention Memory Manager
Computer Architecture - CS110 @ ShanghaiTech University
Overview
In this homework, you will implement a custom physical memory allocator. This memory manager uses a multi-level block table (similar to Page Tables in Operating Systems) to manage a large memory pool. This design is heavily inspired by the PagedAttention mechanism from the vLLM framework, a foundational technology for modern AI systems.
Before you start, you have to join the GitHub Classroom and accept the assignment via this link. A repository containing the starter code will be generated to you. You can then start on the assignment by cloning it to your Ubuntu system. To submit the assignment, you should push your local clone to GitHub and turn in your work on Gradescope by connecting to your GitHub repo.
Part 1: Background & Motivation (The "Why")
Note: You do not need to understand AI algorithms to complete this C programming assignment. This is just the real-world motivation.
1. What is a "Sequence" and "KV Cache"? When you chat with an AI (like ChatGPT), a single chat session is called a Sequence. The AI generates its response word by word (token by token). To generate the next word efficiently, it needs to remember the context of all previous words. Instead of re-reading the entire conversation every time, the AI calculates and saves the historical context data. This cached context is known as the KV Cache (Key-Value Cache).
2. The Problem: Memory Fragmentation The system starts with a massive, continuous physical memory pool. Traditionally, when a new Sequence (chat session) starts, the memory allocator guesses the "maximum possible length" of the conversation and reserves a huge, continuous chunk of memory from the pool for that Sequence.
- Internal Fragmentation: If the allocator reserves 1,000 slots for a Sequence, but the user says "Hello" and ends the chat after 5 slots, the remaining 995 slots in that chunk are completely wasted and cannot be used by anyone else.
- External Fragmentation: Over time, as different Sequences reserve and free continuous chunks of varying sizes, the memory pool ends up looking like Swiss cheeseโfilled with irregular, scattered empty spaces that are too small to serve new large requests.
3. The Solution: Virtual Memory & Paging (PagedAttention) To solve this, PagedAttention borrows the concept of Paging from Operating Systems. Instead of giving a Sequence a continuous chunk of memory, the system divides the entire memory pool into small, fixed-size Physical Blocks (Pages).
When a Sequence needs memory, the allocator simply finds any free blocks in the pool and assigns them. The Sequence logically thinks its KV Cache is growing continuously, but physically, the blocks are scattered everywhere. A Block Table (like a Page Table) is used to map the Sequence's logical progress to the scattered physical blocks.
Part 2: Core Concepts & Architecture (The "How")
To implement this, we map these high-level AI concepts into basic C data types.
1. KV Records and Physical Blocks
- KV Record: In reality, the KV cache stores
complex data. For this assignment, one KV record is simplified
as a single
int(occupying 4 Bytes). - Physical Block: The total memory pool
provided to your program (
pool_addr) is divided into fixed-size chunks called "Blocks".- Each block is exactly
64 Bytes. - Since an
KV recordis 4 Bytes, each block can hold exactlyKV_PER_BLOCK = 16KV records.
- Each block is exactly
2. The 2-Level Block Table (Address Translation)
To manage a sequence's scattered memory, you will build a 2-Level Block Table.
- L1 Block Table
(
L1_PT_SIZE = 16): An array of 16 pointers. Each pointer can point to an L2 Table. - L2 Block Table
(
L2_PT_SIZE = 16): An array of 16 integers. It stores the actual Physical Block ID mapping.
Maximum Capacity: logical blocks per sequence.
๐ Step-by-Step Address Translation Example: Imagine a sequence wants to write its 613th KV record (so
seq->kv_count = 613). Where exactly in the physical memory does this integer go?
- Logical Block Index:
613 / 16 = 38. This means the data belongs in the 38rd logical block of this sequence.- Intra-block Offset:
613 % 16 = 5. Inside that specific block, it goes into the 5th slot.- Table Traversal: How do we find the 38rd logical block?
- L1 Index:
38 / 16 = 2. Go toblock_table[2].- L2 Index:
38 % 16 = 6. Go toblock_table[2][6].- Physical Write: You read the integer stored at
block_table[2][6](e.g., let's say it's3). This means the data belongs in Physical Block #3. You calculate the memory address of Physical Block #3, jump to its 5th integer slot, and write the data.
3. Copy-on-Write (CoW) & Reference Counting
Why do we need CoW? Have you ever clicked "Regenerate response" in an AI chat, or asked the AI to "give me 3 different ideas"? In these cases, the AI creates multiple Branches (child sequences) from the same initial user prompt. Because all these branches share the exact same prompt, they have the exact same initial KV Cache. To save memory, they shouldn't duplicate this data; they should just share the physical memory blocks of the prompt.
Reference Counting (
ref_count): You must maintain a global array(size equal tototal_local_blocks, eachref_countrepresent a reference counting of one individual block) tracking how many sequences are currently using a specific physical block.Copy-on-Write Protocol (The Implementation): If Sequence A and Sequence B share Physical Block #8 (
ref_count = 2), and Sequence A wants to append a new, unique word to Block #8, it must not write it directly. If it does, Sequence B's memory is corrupted.Instead, Sequence A must trigger CoW: It finds a new free block (e.g., Block #12), copies all 64 bytes from Block #8 to Block #12, updates its own L2 table to point to Block #12, decrements the
ref_countof Block #8, sets theref_countof Block #12 to 1, and then appends the new data to Block #12.
4. Core Structures
The entire system relies on the following structs defined in your header file:
struct list_node {
struct list_node *prev;
struct list_node *next;
};
typedef struct Sequence {
int seq_id; // Unique identifier for the sequence
int **block_table; // 2-Level block table root: array of L2 table pointers
int kv_count; // Total number of KV records currently held
struct list_node node; // Intrusive list node (Must be at the end)
} Sequence;
typedef struct {
void *pool_base_ptr; // Base address of the memory pool (Injected by TA)
int total_local_blocks; // Total number of physical blocks
int free_local_blocks; // Number of currently free physical blocks
int *ref_count; // Dynamic array: Reference count for each block
struct list_node active_seqs; // Dummy head for the active sequence list
} LLMEngine;
5. Intrusive Linked List (๐จ DANGER ZONE)
We use a Linux-kernel style intrusive linked list to track
active sequences. In a standard linked list, a node contains
data. In an intrusive linked list, the data contains
the node. Look closely at the Sequence struct
above: the node is embedded at the very end.
Memory Layout Analysis: Imagine the
Sequence struct in memory. It starts at address
0x1000.
0x1000:seq_id(4 bytes)0x1008:block_table(8 bytes)0x1010:kv_count(4 bytes)0x1018:node(16 bytes)
If you are traversing the linked list, you will be holding a
pointer to the node (address 0x1018).
If you simply cast (Sequence*)node, C will
mistakenly treat 0x1018 as the start of the
Sequence. It will read node's memory as if it were
seq_id and block_table, causing
immediate memory corruption and Segmentation
Faults.
(Note: You might notice the addresses jump by 8 bytes
between seq_id and block_table, as
well as kv_count and node. This is due
to Data Structure Alignment (Padding) in 64-bit
systems, which inserts empty bytes to ensure 8-byte pointers are
aligned to 8-byte boundaries in memory.)
To solve this, you must use the <stddef.h>
macro offsetof(type, member). This macro computes
the exact byte offset of a field within a struct. By doing
pointer arithmetic (subtracting the offset from your current
node pointer), you can correctly recover the true
starting address (0x1000) of the
Sequence.
API Implementation Specification
Implement the following APIs in
llm_memory.c.
๐จ Global Constraints (STRICTLY ENFORCED)
Your code will be evaluated using White-Box testing. You must strictly follow these rules, or the autograder will fail you.
- Pointer Safety: Any API receiving
LLMEngine *engineMUST check ifengine == NULLon its first line. If true, returnNULLor-1. - Validation Priority Rule (Check Before
Allocate): You must follow this exact order:
- Phase 1: Basic validity (e.g., pointer checks,
sizes
<= 0). - Phase 2: ID Constraints (check if
seq_idexists or is unique). - Phase 3: Resource Probing. Before modifying any state, check if there are enough free physical blocks.
- Contract: If any phase fails, return immediately.
DO NOT allocate memory (
malloc) or modify block tables if the operation will eventually fail due to Out-Of-Memory (OOM).
- Phase 1: Basic validity (e.g., pointer checks,
sizes
- First-Fit Allocation: When you need a new
physical block, you MUST use a
forloop from0tototal_local_blocks - 1and pick the very first block whereref_count == 0. - No Premature Memory Modification: The
autograder pre-fills the physical memory pool
(
pool_addr) with special validation magic numbers. You are strictly forbidden to callmemsetor zero out a physical block when you allocate it duringcreate_sequence. You may ONLY write to the physical memory during an actualappend_kvoperation. - Free Block Consistency: The variable
engine->free_local_blockstracks available physical blocks. It must be decremented when a new block is assigned, and incremented when a block'sref_countdrops to0.
Task 1: Engine Initialization
Background: The framework hands you a raw block of memory (
pool_addr). Your task is to set up the management structures (theLLMEngine) before any Sequence arrives.
LLMEngine* llm_init(void *pool_addr, size_t pool_size);
- Return
NULLifpool_addr == NULLorpool_size < BLOCK_SIZE. - Allocate
LLMEngine. Calculatetotal_local_blocks = pool_size / BLOCK_SIZE. - Initialize
free_local_blocks = total_local_blocks. Storepool_addr. - Allocate the
ref_countarray usingcalloc(initialized to 0). - Initialize the
active_seqsdummy head. For an empty list, the dummy head'sprevandnextpointers MUST point to itself.
Task 2: Sequence Creation & Deletion
Background (The "Prefill" Phase): When a user sends a prompt to the AI, we immediately know the exact length of that prompt (
initial_kv_len). Therefore, we can preemptively calculate how many physical blocks are required and map them all at once.
Sequence* create_sequence(LLMEngine *engine, int seq_id, int initial_kv_len);
- Phase 1: Return
NULLifinitial_kv_len <= 0or ifseq_idalready exists. - Phase 2 (Capacity Pre-Check): Calculate
required blocks using ceiling division:
(initial_kv_len + KV_PER_BLOCK - 1) / KV_PER_BLOCK. If required blocks exceed 256 OR exceedfree_local_blocks, returnNULLimmediately. - Allocate
Sequenceand its L1 block table (an array of 16 pointers). Initialize all 16 pointers toNULL. - Iteratively map the required blocks:
- If an L2 block table is needed,
mallocit (an array of 16 ints) and initialize all 16 entries to-1. - Allocate a free physical block (First-Fit), map its ID into
the L2 table, set its
ref_countto 1, and decrementfree_local_blocks. - ๐จ Reminder: Do not write data or
memsetthe physical blocks here!
- If an L2 block table is needed,
- Set
seq->kv_count = initial_kv_len, append the sequence to the tail ofactive_seqs(which ishead->prev), and return it.
int free_sequence(LLMEngine *engine, int seq_id);
- Find and detach the target sequence from the list. Return
-1if not found. - ๐จ CRITICAL TRAVERSAL RULE: You MUST loop
through ALL 16 slots of the L1 table. DO NOT
breakthe loop early if you encounter aNULLpointer. A sequence might have blocks logically scattered, leaving empty slots in between. - For every mapped physical block (
!= -1), decrement itsref_count. If a block's count becomes0, incrementfree_local_blocks. - Free all allocated L2 arrays, the L1 array, and the sequence
itself. Return
0.
Task 3: Sequence Forking (Tree Deep Copy)
Background: The AI wants to explore a new branch (e.g., "regenerate response"). We create a child sequence. To save memory, it should point to the exact same physical blocks as its parent.
Sequence* fork_sequence(LLMEngine *engine, int parent_seq_id, int new_seq_id);
- Ensure the parent exists and
new_seq_idis globally unique. - Allocate a new
Sequenceand a new L1 block table for the child. - For each non-
NULLparent L2 table: allocate a child L2 table, initialize to-1, then exactly copy the mapped entries. - Increment the global
ref_countfor every valid physical block inherited by the child. - Assign
kv_countandnew_seq_id. Append the child sequence to the tail ofactive_seqs, and return it.
Task 4: Virtual Addressing, CoW & Appending
Background (The "Decode" Phase): After the initial prompt, the AI generates text one word at a time. The KV Cache grows dynamically. We only allocate a new physical block when the current one is completely full, or if we trigger CoW.
int append_kv(LLMEngine *engine, int seq_id, int kv_data);
- Find the sequence. Calculate
logical_idxandoffset(as explained in the Address Translation section). Return-1iflogical_idx >= 256. - OOM Interception: Does this append require
a new physical block?
- Yes, if the L2 slot is
-1(unmapped). - Yes, if the L2 slot is mapped but
ref_count > 1(Requires CoW). - No, if the L2 slot is mapped and
ref_count == 1(Exclusive access). - If a new block is required but
free_local_blocks == 0, return-1immediately without modifying any states.
- Yes, if the L2 slot is
- Execution:
- Allocate an L2 table if it is
NULL. - New Page: Allocate a free block
(First-Fit), map it, set ref to 1, and decrement
free_local_blocks. - Copy-on-Write (CoW): Allocate a free block.
Use
memcpyto copy exactly 64 bytes from the old physical block to the new one. Adjust reference counts, update the L2 mapping, and decrementfree_local_blocks.
- Allocate an L2 table if it is
- Calculate the exact physical memory address of the target
block, cast it to
int*, and storekv_dataat the correctoffset. Incrementkv_countand return0.
Task 5: Engine Destruction
Background: Shutting down the allocator and cleaning up.
int llm_destroy(LLMEngine *engine);
- Safely traverse the list and destroy all remaining
sequences. (Do NOT attempt to
freethe dummy head itself). - Free the
ref_countarray and the engine structure. - DO NOT free
pool_base_ptr(the testing framework owns it). Return0.
Compilation Standard
Your code will be evaluated using the following strict compilation flags.
Compile object file only:
gcc -std=c11 -Wall -Wextra -Wpedantic -Werror -O2 -ggdb -c llm_memory.c -o llm_memory.o
When linking with a grader/main program:
gcc -std=c11 -Wall -Wextra -Wpedantic -Werror -O2 -ggdb llm_memory.c <grader_or_main>.c -o <program>
Warnings are treated as errors. Your code will be rigorously
tested with Valgrind. Memory leaks (e.g.,
un-freed inner L2 block tables) are strictly forbidden and must
report 0 bytes lost.
Academic Integrity
All submissions must be your own work. Collaboration is allowed only within the limits specified in the course policy. Plagiarism, including copying code from others or online sources without proper attribution, is strictly prohibited and will result in severe penalties.