RFC: eBPF architecture #394
Comments
In my opinion we should begin with proposal 1: enrich every instrumentation with `cgroup` and `nspid` information.

The guiding principle I would like to push the project towards is: All workloads are cgroups. Essentially what I am saying is that any workload on a host running `auraed` should have a cgroup. As we manage executables, cells, pods, VMs, etc. we should always have a cgroup associated with the workload, even if just to manage an empty `cgroup`.

Reminding ourselves that Aurae intends to manage every process on a host, we have unveiled another guiding principle: All processes belong to Aurae.

I believe that with these two guiding principles we can clearly see that the safest way to manage a host is to surface the cgroup information. If we can guarantee that every workload has a cgroup, we can guarantee that we can map back to the original workload.

Decision: Go with proposal 1, and let's start building our instrumentation metadata that is common to all instrumentations, as well as a system in Rust to do the association at runtime. Cgroups are the parent feature we want to be able to trust to tie everything together.
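As a rough illustration of what that common instrumentation metadata could look like, the sketch below shows one possible Rust shape for it; the struct and field names are hypothetical, not existing Aurae code.

```rust
/// Hypothetical metadata envelope attached to every kernel-level event
/// before it is surfaced over the Aurae APIs.
#[derive(Clone, Debug)]
pub struct InstrumentationMeta {
    /// cgroup id as reported by the kernel (e.g. via bpf_get_current_cgroup_id).
    pub cgroup_id: u64,
    /// cgroup path resolved in userspace, used to tie the event to a workload.
    pub cgroup_path: Option<String>,
    /// Host PID of the process the event pertains to.
    pub pid: u32,
    /// PID as seen inside the workload's PID namespace, if that namespace is unshared.
    pub nspid: Option<u32>,
}
```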
Update 2/13/2023

The bad news: I have been looking into the pid mapping and unfortunately we won't be able to leverage the `bpf_get_ns_current_pid_tgid` helper for it.

The good news: We will likely be able to do this fairly trivially by adding a pair of tracepoint probes to monitor process creation and exiting:
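As a rough sketch, assuming an aya-based eBPF object, such a pair of probes could look like the following; the program bodies are illustrative, and the real probes would push events into a map or ring buffer for `auraed` to consume.

```rust
#![no_std]
#![no_main]

use aya_bpf::{
    helpers::bpf_get_current_pid_tgid,
    macros::tracepoint,
    programs::TracePointContext,
};

// Attached from userspace to the "sched/sched_process_fork" tracepoint.
#[tracepoint]
pub fn sched_process_fork(_ctx: TracePointContext) -> u32 {
    // A new process exists: capture its PID (and, per proposal 1, its cgroup id)
    // while the task is alive so the PID -> workload association stays current.
    let pid = (bpf_get_current_pid_tgid() >> 32) as u32;
    let _ = pid; // placeholder: emit a "process created" event here
    0
}

// Attached from userspace to the "sched/sched_process_exit" tracepoint.
#[tracepoint]
pub fn sched_process_exit(_ctx: TracePointContext) -> u32 {
    // Last chance to observe the process before it disappears from procfs.
    let pid = (bpf_get_current_pid_tgid() >> 32) as u32;
    let _ = pid; // placeholder: emit a "process exited" event here
    0
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    unsafe { core::hint::unreachable_unchecked() }
}
```

On the userspace side, aya's `TracePoint` program type takes the category and name at attach time, e.g. `attach("sched", "sched_process_exit")`.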
Can you please reference the authentication if we are hooking into `execve`?
Yep, will make sure to mention this in the new issue. Will share the link in the #ebpf-kernel channel when I've typed it up.
Background
In Aurae we leverage eBPF to surface kernel-level information. As of now, we expose POSIX signals being generated from the kernel, but in the future, many more use cases will likely be built upon the eBPF subsystem. Think of `syscall` tracing, tracing the OOM killer, etc.

Problem
The eBPF probe we have traces every process on the host. We want to be able to narrow the scope of the eBPF instrumentation to one or more user-specified Aurae workloads. The different workload types Aurae is planning to support are `executables`, `cells`, `pods`, `virtual machines`, and spawned Aurae instances [1]. We also have to consider that users could spawn (fork) other processes from `executables` that they schedule via the Aurae API. The Aurae daemon, as it stands now, doesn't have knowledge of these processes. However, when we instrument a workload that is running such a forked process, we should surface instrumentation for this process and thus be able to associate it with the Aurae workload it is running in.

There are two main problems:

1. Associating instrumentation with the Aurae workload it originates from.
2. Mapping the host PID of a process to its namespaced PID (`NSPid`) when the workload runs in an unshared PID namespace.
There is additional complexity with (1) as there might be no meaningful way to create this association from kernel facilities when processing that instrumentation. Typically we should be able to associate instrumentation with a workload via the `cgroup`. The instrumentation will be associated with a process (PID), and we could look up the `cgroup` of that process in `procfs` to create the association with an Aurae workload. However, when we receive an event pertaining to the exiting of a process (`signal/signal_generate` with `signr` 9, `sched/sched_process_exit` for a process, or maybe a `kprobe` that traces the `oom_kill_process` kernel function), that process won't be registered anymore in `procfs` and we won't be able to do the lookup to determine which workload the instrumentation is associated with.

This complexity exists for (2) as well in cases where the PID namespace is unshared, as we won't be able to look up the `NSPid` anymore from `procfs` after a process has exited.
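For context, both lookups described above boil down to reading a file under `/proc/<pid>/`. A minimal sketch, assuming cgroup v2 and a standard procfs layout (the helper names are illustrative):

```rust
use std::fs;
use std::io;

/// cgroup (v2) path of a live process, relative to the cgroup2 mount point.
fn cgroup_of(pid: u32) -> io::Result<String> {
    let raw = fs::read_to_string(format!("/proc/{pid}/cgroup"))?;
    // On cgroup v2 this file contains a single line of the form "0::/<relative path>".
    Ok(raw
        .lines()
        .find_map(|line| line.strip_prefix("0::"))
        .unwrap_or("/")
        .trim()
        .to_string())
}

/// PIDs of a process in every PID namespace it is visible in (host PID first),
/// taken from the NSpid field of /proc/<pid>/status.
fn nspids_of(pid: u32) -> io::Result<Vec<u32>> {
    let status = fs::read_to_string(format!("/proc/{pid}/status"))?;
    Ok(status
        .lines()
        .find_map(|line| line.strip_prefix("NSpid:"))
        .map(|v| v.split_whitespace().filter_map(|p| p.parse().ok()).collect())
        .unwrap_or_default())
}
```

Both reads fail once the process has exited, which is exactly the gap described above for exit-related events.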
Proposal 1: Enrich every instrumentation with `cgroup` and `nspid` information

We could augment every kernel-level event with cgroup information. There is a BPF helper `u64 bpf_get_current_cgroup_id(void)` [2] that returns a cgroup id, and we should be able to map this to a cgroup path and thus associate it with a workload. There is a similar helper for getting the `nspid` from the kernel: `long bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct bpf_pidns_info *nsdata, u32 size)`.
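A minimal sketch of the userspace half of this proposal, assuming cgroup v2 mounted at `/sys/fs/cgroup`, where the id returned by the helper corresponds to the inode number of the cgroup directory; the function name is illustrative.

```rust
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::path::PathBuf;

/// Walk the cgroup2 hierarchy and return the directory whose inode number
/// matches the cgroup id reported by the eBPF probe.
fn cgroup_path_for_id(cgroup_id: u64) -> Option<PathBuf> {
    let mut stack = vec![PathBuf::from("/sys/fs/cgroup")];
    while let Some(dir) = stack.pop() {
        if fs::metadata(&dir).map(|m| m.ino()).ok() == Some(cgroup_id) {
            return Some(dir);
        }
        if let Ok(entries) = fs::read_dir(&dir) {
            for entry in entries.flatten() {
                if entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
                    stack.push(entry.path());
                }
            }
        }
    }
    None
}
```

In practice we would presumably cache this id-to-path mapping (or resolve it once per workload) rather than walking the hierarchy for every event.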
Drawbacks/constraints for proposal 1

This doesn't actually solve for the namespace problem (2). Found a helper for getting the `nspid`, updated the proposal ☝️.

Proposal 2: Leverage our cache and use eBPF to keep the cache up-to-date
We could start registering `executables` in a cache in the daemon. We could then leverage eBPF and attach a `kprobe` to `syscall__execve` to reverse-register processes that Aurae-scheduled executables are forking in the cache. Once we have all the `executables` in the cache, we can do the workload association via the `executables` in the cache.
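A minimal sketch of what such a cache could look like in the daemon; the types, names, and the way eBPF events reach it are illustrative, not Aurae's actual API.

```rust
use std::collections::HashMap;

/// Illustrative reference to the Aurae workload an executable belongs to.
#[derive(Clone, Debug)]
struct WorkloadRef {
    name: String,
    cgroup_path: String,
}

/// PID -> workload cache kept by the daemon.
#[derive(Default)]
struct ExecutableCache {
    by_pid: HashMap<u32, WorkloadRef>,
}

impl ExecutableCache {
    /// Called when Aurae itself schedules an executable.
    fn register(&mut self, pid: u32, workload: WorkloadRef) {
        self.by_pid.insert(pid, workload);
    }

    /// Called from the eBPF event stream when a known process forks/execs a child:
    /// the child inherits the parent's workload association.
    fn reverse_register(&mut self, parent_pid: u32, child_pid: u32) {
        if let Some(workload) = self.by_pid.get(&parent_pid).cloned() {
            self.by_pid.insert(child_pid, workload);
        }
    }

    /// Called when a process exits, so the cache does not grow unbounded.
    fn unregister(&mut self, pid: u32) {
        self.by_pid.remove(&pid);
    }

    /// Association used when enriching instrumentation for a given PID.
    fn workload_of(&self, pid: u32) -> Option<&WorkloadRef> {
        self.by_pid.get(&pid)
    }
}
```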
Drawbacks/constraints for proposal 2

Requires surfacing information from the `task struct` to create those `executables` in Auraed.

References
[1] https://aurae.io/
[2] https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
[3] https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html