Skip to content

belelaritra/xv6-memory-management-walkthrough

 
 

Repository files navigation

📑 Xv6 Memory Management Walkthrough

Table of Content

Terminology

  • VA : Virtual Address
  • PA : Physical Address
  • PD : Page Directory
  • PDE : Page Directory Entry
  • PT : Page Table
  • PTE : Page Table Entry
  • PPN : Physical Page Number

Overview of xv6 page table

All process works on virtual address. Machine’s RAM is physical address. Page Table maps virtual address to physical address.

Xv6 does this in 2 steps.

Untitled

Some things to note for xv6 :

  • 32 bit virtual address (so, virtual address space 4GB)
  • page size of 4KB
  • The CPU register CR3 contains a pointer to the outer page directory of the current running process.

Untitled

Caution

xv6 does not do demand paging, so there is no concept of virtual memory. That is, all valid pages of a process are always allocated physical pages.

Memory Layout

Every page table in xv6 has mappings for user pages as well as kernel pages. The part of the page table dealing with kernel pages is the same across all processes.

Kernel Only Memory

Untitled

  • virtual memory > KERNBASE (0x8000 0000) is for kernel
  • always mapped as kernel-mode only
    • check PTE_U fag
    • protection fault for user-mode programs to access
  • physical memory address N is mapped to KERNBASE+N or 0:PHYSTOP to KERNBASE:KERNBASE+PHYSTOP
    • Note that although the size of physical memory is 4 GB, only 2 GB can be used by Xv6
  • kernel code loaded into contiguous physical addresses

User Only Memory

Each process has separate page table. For any process, user memory VA range is 0:KERNBASE where KERNBASE is 0x80000000 i.e. 2 GB of memory is available to process.

Untitled

  • The kernel code doesn’t exactly begin at KERNBASE, but a bit later at KERNLINK, to leave some space at the start of memory for I/O devices. Next comes the kernel code+read-only data from the kernel binary. Apart from the memory set aside for kernel code and I/O devices, the remaining memory is in the form of free pages managed by the kernel. When any user process requests for memory to build up its user part of the address space, the kernel allocates memory to the user process from this free space list. That is, most physical memory can be mapped twice, once into the kernel part of the address space of a process, and once into the user part.

Untitled

P2V and V2P

  • V2P(a) (virtual to physical)

    • convert kernel address a to physical address
      • subtract KERNBASE (0x8000 0000)
    #define V2P(a) (((uint) (a)) - KERNBASE)
  • P2V(a) (physical to virtual)

    • convert physical address a to kernel address
      • add KERNBASE (0x8000 0000)
    #define P2V(a) ((void *)(((char *) (a)) + KERNBASE))

Functions of interest {Part 1 : building blocks}

PGROUNDUP

What it does is round the address up as a multiple of page number i.e. do a CEIL type operation.

#define PGROUNDUP(sz)  (((sz)+PGSIZE-1) & ~(PGSIZE-1))

PGROUNDUP(620) → ((620 + (1024 -1)) & ~(1023)) → 1024

PGROUNDDOWN

another related function is PGROUNDDOWN , works similarly. just makes the address round down as a multiple of page number

#define PGROUNDDOWN(a) (((a)) & ~(PGSIZE-1))

PGROUNDDOWN(2400) → (2400 & ~(1023)) → 2048

switchuvm

  • u here stands for user
  • basically OS loads the user process information to run it
  • loads process's Page Table to %cr3

kalloc

This function is responsible to return an address of one new, currently unused page (4096 byte) in RAM. kalloc removes first free page from kmem and returns its (virtual!) address, where kmem points to the head of a list of free (that is, available) pages of memory.

Failure : If it returns 0, that means there are no available unused pages currently.

walkpgdir

Main job of this function is to get the content of 2nd level PTE. This is such an important function that we will go line by line.

static pte_t *
walkpgdir(pde_t *pgdir, const void *va, int alloc)
{
  pde_t *pde;
  pte_t *pgtab;

  pde = &pgdir[PDX(va)]; // PDX(va) returns the first 10 bit. pgdir is level 1 page table. So, pgdir[PDX(va)] is level 1 PTE where there is PPN and offset
  if(*pde & PTE_P){ // not NULL and Present
    pgtab = (pte_t*)P2V(PTE_ADDR(*pde)); // PTE_ADDR return the first 20 bit or PPN. PPN is converted to VPN for finding 2nd level PTE. pgtab is level 2 page table
  } else {
    if(!alloc || (pgtab = (pte_t*)kalloc()) == 0)
      return 0;
    // Make sure all those PTE_P bits are zero.
    memset(pgtab, 0, PGSIZE);
    // The permissions here are overly generous, but they can
    // be further restricted by the permissions in the page table
    // entries, if necessary.
    *pde = V2P(pgtab) | PTE_P | PTE_W | PTE_U;
  }
  return &pgtab[PTX(va)]; // PDX(va) returns the second 10 bit. So, pgtab[PTX(va)] is level 2 PTE where there is PPN and offset
}

Case 1 : when no allocation needed (alloc = 0)

Recall that

// +--------10------+-------10-------+---------12----------+
// | Page Directory |   Page Table   | Offset within Page  |
// |      Index     |      Index     |                     |
// +----------------+----------------+---------------------+
//  \--- PDX(va) --/ \--- PTX(va) --/

Untitled

  • PDX(va) returns the first 10 bit. pgdir is level 1 page table. So, pgdir[PDX(va)] is level 1 PTE where there is PPN and offset

    pde = &pgdir[PDX(va)];
  • PTE_ADDR return the first 20 bit or PPN. PPN is converted to VPN using P2V for finding 2nd level PTE. pgtab is level 2 page table

    pgtab = (pte_t*)P2V(PTE_ADDR(*pde));
  • PDX(va) returns the second 10 bit. So, pgtab[PTX(va)] is level 2 PTE where there is PPN and offset

    &pgtab[PTX(va)];

Case 2 : creating second-level page tables (When alloc = 1)

  • return NULL if not trying to make new page table otherwise use kalloc to allocate it

    if(!alloc || (pgtab = (pte_t*)kalloc()) == 0)
        return 0;
  • clear the new second-level page table. Make sure all those PTE_P bits are zero.

    memset(pgtab, 0, PGSIZE);
  • now that we have made the 2nd level page table, we have to put that address in 1st level page table. plus add some flags

    *pde = V2P(pgtab) | PTE_P | PTE_W | PTE_U;

mappages

static int
mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm)
{
  char *a, *last;
  pte_t *pte;

  a = (char*)PGROUNDDOWN((uint)va);
  last = (char*)PGROUNDDOWN(((uint)va) + size - 1);
  for(;;){
    if((pte = walkpgdir(pgdir, a, 1)) == 0)
      return -1;
    if(*pte & PTE_P)
      panic("remap");
    *pte = pa | perm | PTE_P;
    if(a == last)
      break;
    a += PGSIZE;
    pa += PGSIZE;
  }
  return 0;
}

As the name suggests, mappages adds a mapping from the virtual address (starting from given VA to VA+SIZE) to the physical address (starting from given PA to PA+SIZE)

  • for each virtual page in range, get its page table entry. if 2nd level not present allocate it (recall alloc=1 case of walkpgdir). failure happens if runs out of memory

    if((pte = walkpgdir(pgdir, a, 1)) == 0)
          return -1;
  • make sure it’s not already set

    if(*pte & PTE_P)
    		panic("remap");
  • set page table entry to valid value pointing to physical page at PA with specified permission (perm) and P for present

    *pte = pa | perm | PTE_P; // pa is first 20 bit, perm is flags, PTE_P is done to mark as present
  • advance to next physical page (PA) and next virtual page (VA)

    a += PGSIZE;
    pa += PGSIZE;

Functions of interest {Part 2 : core functions}

allocuvm

Allocates user virtual memory (page tables and physical memory) to grow process. This function is responsible to increase the user's virtual memory in a specific page directory from oldsz to newsz . This function used for initial allocation plus expanding heap on request

int
allocuvm(pde_t *pgdir, uint oldsz, uint newsz)
{
  char *mem;
  uint a;

  if(newsz >= KERNBASE)
    return 0;
  if(newsz < oldsz)
    return oldsz;

  a = PGROUNDUP(oldsz);
  for(; a < newsz; a += PGSIZE){
    mem = kalloc();
    if(mem == 0){
      cprintf("allocuvm out of memory\n");
      deallocuvm(pgdir, newsz, oldsz);
      return 0;
    }
    memset(mem, 0, PGSIZE);
    if(mappages(pgdir, (char*)a, PGSIZE, V2P(mem), PTE_W|PTE_U) < 0){
      cprintf("allocuvm out of memory (2)\n");
      deallocuvm(pgdir, newsz, oldsz);
      kfree(mem);
      return 0;
    }
  }
  return newsz;
}

Returns newsz if succeeded, 0 otherwise.

  • walks the virtual address space between the old size and new size in page-sized chunks. For each new logical page to be created, it allocates a new free page from the kernel, and adds a mapping from the virtual address to the physical address by calling mappages

  • allocate a new, zero page

    mem = kalloc();
  • add page to second-level page table, also do the allocation. recall that mappages call walkpgdir with alloc = 1

    mappages(pgdir, (char*)a, PGSIZE, v2p(mem), PTE_W|PTE_U);
  • There are indeed 2 cases where this function can fail:

    Case 1: kalloc (kernel allocation) function failed. This function is responsible to return an address of a new, currently unused page in RAM. If it returns 0, that means there are no available unused pages currently.

    Case 2: mappages function failed. This function is responsible of making the new allocated page to be accessible by the process who uses the given page directory by mapping that page with the next virtual address available in the page directory. If this function fails that means it failed in doing so, probably due to the page directory being already full.

    In both cases, allocuvm didn't managed to increase the user's memory to the size requested, Therefore, deallocuvm is undoing all allocations until the point of failure, so the virtual memory will remain unchanged, and returns an error it self.

deallocuvm

deallocuvm looks at all the logical pages from the (bigger) old size of the process to the (smaller) new size, locates the corresponding physical pages, frees them up, and zeroes out the corresponding PTE as well.

copyuvm

Once a child process is allocated, its memory image is setup as a complete copy of the parent’s memory image by a call to copyuvm

  • setupkvm : sets up kernel virtual memory
  • walks through the entire address space of the parent in page-sized chunks, gets the physical address of every page of the parent using a call to walkpgdir
  • allocates a new physical page for the child using kalloc
  • copies the contents of the parent’s page into the child’s page, adds an entry to the child’s page table using mappages
  • returns the child’s page table
  • has similar failure cases as discussed in allocuvm

How all of this works

Now that we are done with the function basics, let’s focus on how things actually work. Most of the findings and intuition gained are from a shit ton of debug statements and by reading a few books and notes. So, take it with a grain of salt.

 

The first user process is the init function. It is initiated by userinit inside main of main.c.

Let’s focus on this first.

init

userinit function

//PAGEBREAK: 32
// Set up first user process.
void
userinit(void)
{
  struct proc *p;
  extern char _binary_initcode_start[], _binary_initcode_size[];

  p = allocproc();
  
  initproc = p;
  if((p->pgdir = setupkvm()) == 0)
    panic("userinit: out of memory?");
  inituvm(p->pgdir, _binary_initcode_start, (int)_binary_initcode_size);
  p->sz = PGSIZE;
  memset(p->tf, 0, sizeof(*p->tf));
  p->tf->cs = (SEG_UCODE << 3) | DPL_USER;
  p->tf->ds = (SEG_UDATA << 3) | DPL_USER;
  p->tf->es = p->tf->ds;
  p->tf->ss = p->tf->ds;
  p->tf->eflags = FL_IF;
  p->tf->esp = PGSIZE;
  p->tf->eip = 0;  // beginning of initcode.S

  safestrcpy(p->name, "initcode", sizeof(p->name));
  p->cwd = namei("/");

  // this assignment to p->state lets other cores
  // run this process. the acquire forces the above
  // writes to be visible, and the lock is also needed
  // because the assignment might not be atomic.
  acquire(&ptable.lock);

  p->state = RUNNABLE;

  release(&ptable.lock);
}

Now let’s see what things are done step by step

  • allocproc : for every process pid needs to be assigned, proc structure needs to be initialized. all of this is done here. for the init process pid is returned 1 by allocproc

    • kalloc : allocproc calls kalloc which sets up data on the kernel stack
  • setupkvm : creates kernel page table of this init process

  • inituvm : allocates one page of physical memory, copies the init executable into that memory, sets up a page table entry (PTE) for the first page of the user virtual address space.

    Let’s also keep track of how many pages are being allocated where, this will be key to our understanding when we want to find what which pages are being deallocated in deallocuvm in later part.

    PAGE ALLOCATED HERE = 1

init executable

now the init executable that has been loaded runs, which to my understanding points to this part inside code

Untitled

exec

SYS_exec in turn calls exec . The exec function looks like this -

int
exec(char *path, char **argv)
{
  char *s, *last;
  int i, off;
  uint argc, sz, sp, ustack[3+MAXARG+1];
  struct elfhdr elf;
  struct inode *ip;
  struct proghdr ph;
  pde_t *pgdir, *oldpgdir;
  struct proc *curproc = myproc();
  
  begin_op();

  if((ip = namei(path)) == 0){
    end_op();
    cprintf("exec: fail\n");
    return -1;
  }
  ilock(ip);
  pgdir = 0;

  // Check ELF header
  if(readi(ip, (char*)&elf, 0, sizeof(elf)) != sizeof(elf))
    goto bad;
  if(elf.magic != ELF_MAGIC)
    goto bad;

  if((pgdir = setupkvm()) == 0)
    goto bad;

  // Load program into memory.
  sz = 0;
  for(i=0, off=elf.phoff; i<elf.phnum; i++, off+=sizeof(ph)){
    if(readi(ip, (char*)&ph, off, sizeof(ph)) != sizeof(ph))
      goto bad;
    if(ph.type != ELF_PROG_LOAD)
      continue;
    if(ph.memsz < ph.filesz)
      goto bad;
    if(ph.vaddr + ph.memsz < ph.vaddr)
      goto bad;
    if((sz = allocuvm(pgdir, sz, ph.vaddr + ph.memsz)) == 0)
      goto bad;
    if(ph.vaddr % PGSIZE != 0)
      goto bad;
    if(loaduvm(pgdir, (char*)ph.vaddr, ip, ph.off, ph.filesz) < 0)
      goto bad;
  }
  iunlockput(ip);
  end_op();
  ip = 0;

  // Allocate two pages at the next page boundary.
  // Make the first inaccessible.  Use the second as the user stack.
  sz = PGROUNDUP(sz);
  if((sz = allocuvm(pgdir, sz, sz + 2*PGSIZE)) == 0)
    goto bad;
 
  clearpteu(pgdir, (char*)(sz - 2*PGSIZE));
  sp = sz;

  // Push argument strings, prepare rest of stack in ustack.
  for(argc = 0; argv[argc]; argc++) {
    if(argc >= MAXARG)
      goto bad;
    sp = (sp - (strlen(argv[argc]) + 1)) & ~3;
    if(copyout(pgdir, sp, argv[argc], strlen(argv[argc]) + 1) < 0)
      goto bad;
    ustack[3+argc] = sp;
  }
  ustack[3+argc] = 0;

  ustack[0] = 0xffffffff;  // fake return PC
  ustack[1] = argc;
  ustack[2] = sp - (argc+1)*4;  // argv pointer

  sp -= (3+argc+1) * 4;
  if(copyout(pgdir, sp, ustack, (3+argc+1)*4) < 0)
    goto bad;

  // Save program name for debugging.
  for(last=s=path; *s; s++)
    if(*s == '/')
      last = s+1;
  safestrcpy(curproc->name, last, sizeof(curproc->name));

  // Commit to the user image.
  oldpgdir = curproc->pgdir;
  curproc->pgdir = pgdir;
  curproc->sz = sz;
  curproc->tf->eip = elf.entry;  // main
  curproc->tf->esp = sp;

  switchuvm(curproc);

  freevm(oldpgdir);

  return 0;

 bad:
  if(pgdir)
    freevm(pgdir);
  if(ip){
    iunlockput(ip);
    end_op();
  }
  return -1;
}
  • setupkvm : The thing that I understood so far is that, the 1 page allocated inside inituvm is responsible for executing this exec function. Whatever this exec function now wants to execute, will be in a separate new kernel page table, hence a setupkvm is called once more. (what I exactly mean here will be more clear when I talk about other user processes)

  • allocuvm : allocates pages for the executable it wants to run, here 1 page is sufficient for init

    PAGE ALLOCATED HERE = 1
  • loaduvm : loads the executable from disk into the newly allocated pages in allocuvm

  • allocuvm : the new memory image so far only has executable code and data, now we also need stack space. Rather than allocating 1 page for the stack, it allocates 2. The 2nd one is the actual stack, 1st one serves as a guard page. This is the only page in the user’s memory which is marked as present but not user-accessible (PTE_U is cleared)

    Reason for having a guard page : To guard a stack growing off the stack page, xv6 places a guard page right below the stack. As the guard page is not mapped, if the stack runs off the stack page, the hardware will generate an exception because it cannot translate the faulting address.

    Strings containing the command-line arguments, as well as an array of pointers to them, are at the very top of the stack

    Untitled

    PAGE ALLOCATED HERE = 2
  • trap frame update : It is important to note that exec does not replace/reallocate the kernel stack. The exec system call only replaces the user part of the memory image, and does nothing to the kernel part. And if you think about it, there is no way the process can replace the kernel stack, because the process is executing in kernel mode on the kernel stack itself, and has important information like the trap frame stored on it.

    Recall that a process that makes the exec system call has moved into kernel mode to service the software interrupt of the system call. Normally, when the process moves back to user mode again (by popping the trap frame on the kernel stack), it is expected to return to the instruction after the system call. However, in the case of exec, the process doesn’t have to return to the instruction after exec when it gets the CPU next, but instead must start executing the new executable it just loaded from disk. So, the code in exec changes the return address in the trap frame to point to the entry address of the binary. It also sets the stack pointer in the trap frame to point to the top of the newly created user stack.

    curproc->tf->eip = elf.entry;  // main
    curproc->tf->esp = sp;
  • switchuvm : Finally, once all these operations succeed, exec switches page tables to start using the new memory image. That is, it writes the address of the new page table into the CR3 register, so that it can start accessing the new memory image when it goes back to userspace.

  • freevm : the process that called exec i.e. init , was still using the old memory image. we free the old memory image that it was pointing to after updating

    • deallocuvm : The pages that are deleted here are the pages that were created in inituvm . So,

      PAGE DEALLOCATED HERE = 1

The new memory image created by exec looks like this

Untitled

At this point, the process that called exec can start executing on the new memory image when it returns from trap

Note : exec waits until the end to do this switch of page tables, because if anything went wrong in the system call, exec returns from trap into the old memory image and prints out an error

sh

Apart from init, all other processes are created by the fork system call. When the init process runs, it executes the init executable, whose main function forks a shell (sh) and starts listening to the user

fork

  • allocproc : returns pid = 2 for sh

  • copyuvm : once a child process is allocated, its memory image is setup as a complete copy of the parent’s memory. This is done using copyuvm .

    • setupkvm : inside copyuvm, a new kernel page table is created, and parent memory is allocated here.

    Recall that 3 pages were created in exec of init . These 3 pages are copied here

    PAGES ALLOCATED HERE = 3

exec

The description of exec for shell is the same as discussed for init . So lets just find out how many pages are allocated where

  • setupkvm

  • allocuvm

    The executable that sh wants to run needs more memory than init needed. So 2 pages are allocated here.

    PAGES ALLOCATED HERE = 2
  • loaduvm

  • allocuvm

  • freevm

    • deallocuvm The pages that were allocated (copied) using copyuvm are deallocated as they are pointing to old page directory

      PAGE DEALLOCATED HERE = 3

Now let’s look at a general user process. Every other user process will fork sh . Let’s look at echo

echo

fork

  • allocproc

  • copyuvm : here 4 page will be copied (2 + 2 pages that were allocated inside sh exec)

    PAGE ALLOCATED HERE = 4

growproc

After fork, growproc is called to grow userspace memory image. growproc can be called by the sbrk system call. It basically increases the heap size

PAGE ALLOCATED HERE = 8

exec

  • allocuvm

    PAGE ALLOCATED HERE = 1
  • loaduvm

  • allocuvm

    PAGE ALLOCATED HERE = 2
  • freevm

    • deallocuvm : 4+8 pages deleted of old memory image

      PAGE DEALLOCATED HERE = 12

exit

  • freevm
    • deallocuvm : 1+2 pages deleted of new memory image

      PAGE DEALLOCATED HERE = 3

Finally, an overview of all the things said in a picture - because ofcourse a picture speaks more than a thousand words!

Untitled


Final Words

Thanks for reading this far ! I am not sure how much helpful this has been for you, but I hope you can start navigating the memory management part of the xv6 codebase after reading this. Go through this writeup multiple times if you don't undersand anything and if you have some more time to spare, please go through this doc. A lot of the part of this writeup has been taken or inspired from this.

May the force be with you! And if you liked the write-up, you can support me by buying me a cup of coffee!

Buy Me A Coffee

Resource

About

Walkthrough of Xv6 Memory Management

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published