Although they are called containers, it might be more accurate to use the term “containerized processes”. A container is still a Linux process, running on the host machine - it just has a limited view of that host machine, and it only has access to a subtree of the file system and perhaps to a limited set of resources restricted by cgroups. Because it’s really just a process, it exists within the context of the host operating system, and it shares the host’s kernel.
-- Liz Rice
Pull down the code and presentation. I assume that you are running this on a modern Linux host. I am using AWS Cloud9 with Ubuntu.
git clone https://github.com/nbrandaleone/what-is-a-container.git
I provide a filesystem based upon the frapsoft/fish Docker container image, which has been flattened into a single tarball.
wget bit.ly/fish-container -O fish.tar
mkdir rootfs; cd rootfs
tar -xf ../fish.tar
Poke around in your new filesystem. It is based upon busybox, so it is not very large.
I like to create a "YOU-ARE-HERE" file to mark this fs.
# While in the new rootfs
touch YOU-ARE-HERE
# Have 2 terminals open.
# in terminal 1
sleep 500 &
# grab PID
# In terminal 2
sudo lsns
sudo lsns -p <PID>
The proc filesystem is virtual, in that it does not exist on disk, only in memory.
ls -l /proc/self/ns
readlink /proc/$$/ns/pid
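Each entry under /proc/<pid>/ns is a symlink whose target looks like pid:[4026531836]; the bracketed number identifies the namespace. As a small sketch (the helper name ns_id is mine, not part of any tool), the following pulls out that number so the namespaces of two processes can be compared by eye:

```shell
#!/bin/sh
# ns_id PID TYPE: print the numeric ID of a process's namespace.
# ns_id is a hypothetical helper name used only for this illustration.
ns_id() {
    # The link target looks like "pid:[4026531836]"; keep only the digits.
    readlink "/proc/$1/ns/$2" | sed 's/^[a-z_]*:\[\([0-9]*\)\]$/\1/'
}

echo "pid namespace of this shell: $(ns_id "$$" pid)"
echo "uts namespace of this shell: $(ns_id "$$" uts)"
```

Two processes that print the same number for a given type are in the same namespace of that type.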
# In terminal 1, outside of chroot
top
We can see the top process from the other terminal.
sudo chroot rootfs /bin/sh
# mount -t proc proc /proc
# ps aux | grep top
Better yet, our new shell is running as root, so...
# kill <top process id>
So much for containment.
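chroot only changes what the process treats as /; it leaves the PID namespace alone, which is why the kill worked. One way to see this from the host, sketched here with a hypothetical helper root_of, is to read the /proc/<pid>/root link. For the chrooted shell it should point at the rootfs directory, while for an ordinary process it is normally /. Reading the link of a process owned by root requires sudo.

```shell
#!/bin/sh
# root_of PID: print the directory a process sees as "/".
# root_of is a hypothetical helper for this illustration.
root_of() {
    readlink "/proc/$1/root"
}

# For the current shell this is normally "/". Pass the PID of the
# chrooted /bin/sh instead (and run with sudo) to see the rootfs path.
echo "root of this shell: $(root_of "$$")"
```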
sudo unshare --uts /bin/sh
# change hostname. Verify it has changed
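The check can also be scripted. This is a minimal sketch, assuming the script is run with sudo; uts-demo is an arbitrary example name. It sets a hostname inside a throwaway UTS namespace and prints the name as seen from inside and outside.

```shell
#!/bin/sh
# Sketch: set a hostname inside a throwaway UTS namespace and confirm the
# host keeps its own name. "uts-demo" is an arbitrary example name.
before=$(uname -n)

if [ "$(id -u)" -eq 0 ]; then
    # Creating the namespace requires root (CAP_SYS_ADMIN).
    unshare --uts /bin/sh -c 'hostname uts-demo && echo "inside:  $(uname -n)"' \
        || echo "unshare failed; namespace was not created"
else
    echo "not root; re-run this script with sudo to see the inside view"
fi

after=$(uname -n)
echo "outside: $after"
```

The outside name should match the name recorded before the call, since the change was confined to the new namespace.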
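Now, let's unshare the PID and UTS namespaces and chroot into our new filesystem. Finally, change the hostname and run the fish shell. The --uts flag keeps the hostname change inside the container instead of renaming the host. Run this from inside the rootfs directory.
sudo unshare --pid --uts --fork --mount-proc chroot "$PWD" \
/bin/sh -c "/bin/mount -t proc none /proc && \
hostname containers-fun-times && /usr/bin/fish"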
A powerful aspect of namespaces is their composability; processes may choose to separate some namespaces but share others. For instance, it may be useful for two programs to have isolated PID namespaces but share a network namespace (e.g. Kubernetes pods). This brings us to the setns syscall and the nsenter command-line tool.
Let's find the shell running in a chroot from our last example.
# From the host, not the chroot.
ps aux | grep /bin/sh | grep root
...
root <PID>
Grab the PID of the running shell (inside the container). For example, say it is 29840.
sudo nsenter --pid=/proc/29840/ns/pid \
unshare -f --mount-proc=$PWD/rootfs/proc \
chroot rootfs /bin/sh
Having entered the namespace successfully, when we run ps in the second shell (PID 5) we see the first shell (PID 1).
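To check which namespaces the two shells actually share, their /proc/<pid>/ns links can be compared from the host. The sketch below uses a hypothetical helper, shared_ns, and compares the current shell with itself so that it runs anywhere. With the host-side PIDs of the two shells substituted (and sudo), I would expect pid to read shared and mnt to read separate, since the second shell was given its own mount namespace by unshare --mount-proc.

```shell
#!/bin/sh
# shared_ns PID_A PID_B: report, per namespace type, whether two processes
# are in the same namespace. shared_ns is a hypothetical helper name.
shared_ns() {
    for type in pid uts mnt net ipc; do
        a=$(readlink "/proc/$1/ns/$type") || return 1
        b=$(readlink "/proc/$2/ns/$type") || return 1
        if [ "$a" = "$b" ]; then
            echo "$type: shared"
        else
            echo "$type: separate"
        fi
    done
}

# Comparing the shell with itself reports every type as shared. Replace
# the arguments with the two shell PIDs as seen from the host.
shared_ns "$$" "$$"
```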
The kernel exposes cgroups through the /sys/fs/cgroup directory.
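The layout under /sys/fs/cgroup depends on which cgroup version the host mounts. The commands in this walkthrough use cgroup v1 names such as memory.limit_in_bytes and cpu.shares; on a v2 (unified) host the corresponding files are memory.max and cpu.weight. A rough way to tell the two apart, sketched with a hypothetical helper cgroup_layout:

```shell
#!/bin/sh
# cgroup_layout: guess which cgroup hierarchy is mounted at /sys/fs/cgroup.
# cgroup_layout is a hypothetical helper name for this illustration.
cgroup_layout() {
    if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
        echo "v2"        # unified hierarchy: one tree, files like memory.max
    elif [ -d /sys/fs/cgroup/memory ]; then
        echo "v1"        # per-controller trees: /sys/fs/cgroup/memory, ...
    else
        echo "unknown"
    fi
}

echo "cgroup layout: $(cgroup_layout)"
```

If this prints v2, the cgset parameter names and the paths used in the rest of this section will likely need adjusting.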
sudo apt install cgroup-tools
cgroup_id="cgroup_$(shuf -i 1000-2000 -n 1)"
sudo cgcreate -g "cpu,cpuacct,memory,pids:$cgroup_id"
sudo cgset -r cpu.shares=128 "$cgroup_id"
sudo cgset -r memory.limit_in_bytes="100M" "$cgroup_id"
sudo cgset -r memory.swappiness=0 ${cgroup_id}
sudo cgset -r pids.max=10 "$cgroup_id"
echo ${cgroup_id}
cd /sys/fs/cgroup/memory/${cgroup_id}
Examine limits
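Examining the limits amounts to reading the control files in that directory. A minimal sketch, assuming the cgroup v1 layout used above; show_limits is a hypothetical helper, and demo is only a placeholder used when cgroup_id is not set:

```shell
#!/bin/sh
# show_limits DIR: print the memory settings stored in a cgroup v1
# memory directory. show_limits is a hypothetical helper name.
show_limits() {
    dir=$1
    for f in memory.limit_in_bytes memory.swappiness; do
        if [ -r "$dir/$f" ]; then
            printf '%s = %s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%s: not found in %s\n' "$f" "$dir"
        fi
    done
}

# "demo" is a placeholder group name used when cgroup_id is unset.
show_limits "/sys/fs/cgroup/memory/${cgroup_id:-demo}"
```

The kernel stores the limit in bytes, so the 100M set earlier reads back as 104857600.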
Another way of setting cgroup limits is to write to the control files directly, as root. Here the group is named demo.
# echo "100M" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
# echo "0" > /sys/fs/cgroup/memory/demo/memory.swappiness
# On Amazon Linux 1, first run: yum install glibc-static -y
cd ~/environment/what-is-a-container/src/c
gcc -static -o munch munch.c
cp munch ~/environment/rootfs/munch
Now, let's create our container with the cgroups.
You may have to export cgroup_id=<NAME> from one terminal to the other.
sudo cgexec -g "memory,pids:${cgroup_id}" \
unshare -fmuipn --mount-proc \
chroot "$PWD" \
/bin/sh -c "/bin/mount -t proc proc /proc && hostname container-fun-times && /usr/bin/fish"
Inside of the container, run munch
# Inside container
./munch
cd ~/environment/what-is-a-container/src/c
gcc -static fork-bomb.c -o fb
cp fb ~/environment/rootfs/
./fb
ps aux | grep fb | wc -l
The output should be 8. Press Ctrl-C to stop the program.
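The cap comes from the pids.max value set earlier, and the kernel keeps a running count next to it in pids.current. From the host, the two can be read side by side. This is a sketch assuming the cgroup v1 pids controller; pids_status is a hypothetical helper, and demo is a placeholder used when cgroup_id is not set.

```shell
#!/bin/sh
# pids_status DIR: print how many tasks a pids cgroup holds versus its cap.
# pids_status is a hypothetical helper; DIR is a cgroup v1 pids directory.
pids_status() {
    current=$(cat "$1/pids.current") || return 1
    max=$(cat "$1/pids.max") || return 1
    echo "$current of $max pids in use"
}

# "demo" is a placeholder group name used when cgroup_id is unset.
dir="/sys/fs/cgroup/pids/${cgroup_id:-demo}"
if [ -d "$dir" ]; then
    pids_status "$dir"
else
    echo "no pids cgroup at $dir"
fi
```

While the fork bomb is running, the first number should sit at or just under the second.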
# In the directory one above rootfs
runc spec
sudo runc run test
# From other terminal
sudo runc list
sudo runc state test
sudo runc ps test
sudo runc kill test KILL
Look around from the other terminal session.
ps axfo pid,ppid,command | grep runc
pstree <pid>
Namespaces will clean themselves up if there are no more processes in them.
sudo cgdelete "memory,pids:${cgroup_id}"
There is a separate ignite directory, and README file.