Kubernetes
In the last post I mentioned I may stuff every service※ I run on my home rack into Kubernetes. On Saturday, while drinking my home-brew cider and listening to LTJ Bukem, I did just that. While I did wake up with a mild hangover, I found everything still working, so clearly the methodology is sound.
Here are some notes. It isn’t a guide, but by all means reference it if you want to do something similar.
※ When I say ‘service’, I generally mean something you’d use with rcctl or systemctl or initctl. Like systemctl stop website.service. But for the rest of this post, I’ll use service to mean a kubernetes service object, and the process it ultimately sits in front of I’ll begin calling a workload.
Easy Stuff
Any workload that has a developer-maintained container image, uses TCP rather than UDP, and doesn’t need persistent storage or odd capabilities (such as needing to run as root), is a pretty easy bit of work. The general structure is as follows:
- One Deployment object (which specifies a template for a pod)
- One Service object (which exposes ports to the cluster with iptables-based load balancing)
- One Ingress object (which links a domain to a service-exposed port, and is consumed by an ingress controller)
- Zero-or-more Secrets that hold data such as credentials.
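Wired together, those objects look roughly like this. A trimmed sketch for a hypothetical workload called `app` - the image, hostname, and ports are all placeholders, not anything from my actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 1
  selector:
    matchLabels: { app: app }
  template:
    metadata:
      labels: { app: app }
    spec:
      containers:
        - name: app
          image: registry.example.com/app:latest   # hypothetical image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  selector: { app: app }
  ports:
    - port: 80          # port exposed to the cluster
      targetPort: 8080  # port the container listens on
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com   # hypothetical domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```

The Service’s selector matches the pod labels from the Deployment’s template, and the Ingress points at the Service by name - that chain is the whole trick.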
If I had more to run at home, I’d have finally looked into Yoke. But given I really only run three or four workloads, and they’re all dissimilar in requirements, I’ve put it off again.
You may consider using a StatefulSet instead of a Deployment. The general point of a StatefulSet is to ensure that the resulting pods are relatively static. They move around nodes less (not at all, unless you delete them), they stop fully before replacements are created, and they have predictable names that increment. I have one node, and I’m using hostPath volume mounts instead of PersistentVolume mounts - so there’s really little merit to using StatefulSets in my case.
My process began with knocking up the three aforementioned objects in YAML for the first workload - Komga. To give the workload access to my existing library (and to preserve the existing configuration I had kicking about), I specified two volumes of type hostPath. This is as close to the --volume flag in docker/podman as you can get. It can work with multiple hosts in the cluster, provided the directories bound are network filesystems like NFS (although if you were doing that, you may as well use the NFS volume type instead).
Here’s what the volume configuration looks like, with unrelated keys/values stripped:
```yaml
spec:
  template:
    spec:
      volumes:
        - name: books
          hostPath:
            path: /mnt/XS1/books
            type: Directory
        - name: config
          hostPath:
            path: /var/lib/komga
            type: Directory
      containers:
        - name: komga
          volumeMounts:
            - name: books
              mountPath: /data
            - name: config
              mountPath: /config
```
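For reference, if the library lived on an NFS export rather than a local disk, the hostPath volume could be swapped for the nfs volume type directly. A sketch - the server address and export path here are made up:

```yaml
volumes:
  - name: books
    nfs:
      server: 192.168.1.10   # hypothetical NFS server
      path: /export/books    # hypothetical export
      readOnly: true
```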
Limits are just good practice. The CPU limit is arbitrary - 1 is a nice round number of CPUs to ensure Komga has available. Kubernetes will allow pods to attempt to exceed that limit, and throttle them in response. Memory is less permissive. If a pod goes above its memory limit (entering the territory where it would begin to use swap), Kubernetes will terminate the pod before it manages to ritual-summon the host’s OOM Killer:
```yaml
spec:
  template:
    spec:
      containers:
        - name: komga
          resources:
            limits:
              memory: 1024Mi
              cpu: '1'
```
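Limits pair naturally with requests - requests are what the scheduler uses when placing the pod, limits are the ceiling it gets throttled or killed at. A sketch with arbitrary values (I haven’t bothered on a one-node cluster, where scheduling pressure isn’t really a thing):

```yaml
resources:
  requests:
    memory: 256Mi   # guaranteed to the pod at scheduling time
    cpu: 250m       # a quarter of a CPU
  limits:
    memory: 1024Mi
    cpu: '1'
```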
The fun part, however, is the securityContext. I expect a lot of these options will eventually become the default in Kubernetes. Not quickly, as they’d inevitably break existing workloads upon upgrade - but as the Kubernetes API versions increment, I imagine it’ll happen at some point.
Effectively, the point of these settings is to make the workload run within a constrained environment where it isn’t root, it can’t become root, and even if it could, features such as intercepting traffic a-la libpcap aren’t possible. In theory, at least. This is really about putting hurdles in front of attackers in order to disincentivise them.
- `runAsNonRoot: true`, alongside the `runAsUser` and `runAsGroup` options, simply ensures the workload doesn’t run as root. This is probably the most important lever to pull. Even if you’re in a container, root remains root - and being root makes the easy attack vectors easy, and the hard attack vectors possible.
- `capabilities.drop: ["ALL"]` explicitly drops the container’s ability to do things like load kernel modules. If you are allowing root, and know your container doesn’t need to do things like load kernel modules or intercept traffic (a-la ngrep), then you should be dropping those capabilities. With that said, the defaults shipped by CRI-O are reasonable.
- `allowPrivilegeEscalation: false` sets the no_new_privs flag on the process. A hurdle for attackers that manage to spawn processes in the container.
- `seccompProfile.type: RuntimeDefault` restricts the kernel syscalls the workload can make to the default profile specified by CRI-O. Which is maybe reasonable? I’m finding it difficult to find a clear illustration of what is possible in a container under CRI-O’s default profile - presumably the code is the documentation? It doesn’t hurt to set, anyway.
```yaml
spec:
  template:
    spec:
      containers:
        - name: komga
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsGroup: 119
            runAsUser: 119
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
```
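One footnote on readOnlyRootFilesystem: plenty of workloads still want to scribble somewhere like /tmp. Rather than relaxing the whole filesystem, an emptyDir volume can re-open just that path. A sketch, not something from my actual manifests:

```yaml
spec:
  template:
    spec:
      volumes:
        - name: tmp
          emptyDir: {}   # scratch space, discarded with the pod
      containers:
        - name: komga
          volumeMounts:
            - name: tmp
              mountPath: /tmp
```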
Harder Stuff
Komga and Jellyfin were very similar. They bind to different hostPaths, but aside from that, they’re very straightforward. What took more effort was Misskey. Not only does it depend on Postgres and Redis - the documentation is primarily in Japanese, fails to mention that Misskey has official container images, and is out of date regarding the configuration options that must be set.
I’m not going to go into setting up Misskey in Kubernetes right now. I think I’d rather contribute proper documentation to the project.
Postgres gave me the most trouble, as I had a good year or two of data sitting on the box that needed to somehow get into the container. While I tried to discern where directories on the host should be mounted into the pod, I ultimately gave up, then dumped and restored the data within the pod. Permissions/Grants/Roles in Postgres got in the way (as is always the case). But once the data was in the pod, everything worked.
While I’m being hard on myself for dumping and restoring, rather than relocating the data (a task that is seldom possible with gargantuan datasets) - it did allow me to migrate from Postgres 15 to 17 quite cleanly.
Static Sites and Ingress
Static sites aren’t really suited to run inside Kubernetes. I mean, you can do it. I’m doing it. But it’s more of a uniformity decision than anything else. To run a static site in Kubernetes, you’re either compiling the site as part of your image building process and dumping the output into a directory that exists within the container, or you’re mounting the directory in the container - in which case the container is simply a generic HTTP server pointed at (for example) /var/www/html/.
I’m doing the first option. The Dockerfile (which is actually called Containerfile, as I’m not using Docker) is a two-stage affair. It uses a recent ruby image with a few gems to compile the site into a directory named ‘output’. The output directory is then copied into a busybox image that spins up a tiny HTTP server as a non-root user on an unprivileged port.
```dockerfile
FROM docker.io/library/ruby:3-bookworm
RUN gem install nanoc kramdown nokogiri
COPY . /build
WORKDIR /build
RUN nanoc

FROM docker.io/library/busybox:1.37.0-glibc
RUN adduser -D static
USER static
WORKDIR /home/static
COPY --from=0 /build/output .
CMD ["busybox", "httpd", "-f", "-v", "-p", "11245"]
```
To be as transparent as a freshly polished jellyfish - I originally used webrick, and struggled to have it listen on something other than localhost. I’d like this option more if the busybox image shipped with only the httpd code (rather than an entire toolbox). But it works, and it’s all behind ingress-nginx (which is not to be confused with nginx-ingress).
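For completeness, the Ingress fronting the busybox httpd looks roughly like this - ingress-nginx terminates TLS on 443 and forwards to the Service on the unprivileged port. The hostname and TLS secret name are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: website
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - www.example.com        # hypothetical domain
      secretName: website-tls    # hypothetical Secret holding the cert
  rules:
    - host: www.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: website
                port:
                  number: 11245  # busybox httpd's port
```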
BGP and public addresses
Just as an experiment, I popped over to my ISP’s control panel and assigned myself some /64s I could use for the pod address range and service address range. I don’t really need to be able to reach pods or service addresses - it’s all on port 443 after all. But I thought it would be interesting. Of course, the problem is that my router needs to know to route those /64s to the kubernetes host. I could add static routes, but that’s no fun. Let’s use BGP!
For calico to do this, it needs two things. The first is a BGP configuration that specifies our service address spaces and the ASN used to talk BGP:
```yaml
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  nodeMeshMaxRestartTime: 120s
  asNumber: 64512
  serviceClusterIPs:
    - cidr: 10.96.0.0/12
    - cidr: 2001:8b0:ca70:32b1::/112
  listenPort: 178
  bindMode: NodeIP
```
The second is a peer configuration that tells it who it should be speaking BGP to (and their ASN):
```yaml
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: ruhuna
spec:
  peerIP: "[2001:8b0:ca70:32b2::1]:178"
  asNumber: 64512
```
Of course, we also need our router to speak BGP. I’m familiar with BIRD and Quagga, but as Ruhuna (my router) currently runs OpenBSD, the obvious choice is OpenBGPD:
```
AS 64512
router-id 192.168.1.254

neighbor 2001:8b0:ca70:32b2::2 {
    remote-as 64512
    local-address 2001:8b0:ca70:32b2::1
}

allow from ibgp
```
All set up, this gives my router (and all the hosts on my home network) visibility of the pod and service ranges within the cluster. Although the pod range is a /64 (2001:8b0:ca70:32b0::/64), calico segments it into /122s to reduce the number of routes in large deployments. The service range can’t be bigger than a /112, so 2001:8b0:ca70:32b1::/64 is mostly wasted:
```
ruhuna# bgpctl show fib bgp
flags: B = BGP, C = Connected, S = Static
       N = BGP Nexthop reachable via this route
       r = reject route, b = blackhole route

flags prio destination                                  gateway
B     48   2001:8b0:ca70:32b0:76f4:d124:e3ae:b100/122   2001:8b0:ca70:32b2::2
B     48   2001:8b0:ca70:32b1::/112                     2001:8b0:ca70:32b2::2
```
Yggdrasil
Small note - the yggdrasil mirror is currently down, as it binds to not-port-80. It’ll be back soon, ish.