linux capabilities - DEVOPS BLOG

# **Linux Capabilities: A Complete Deep-Dive for DevOps, SRE, Cloud and Kubernetes Security Engineers**

Linux Capabilities are one of the most impactful, yet routinely overlooked, foundations of modern Linux security. Although containers, orchestrators, cloud runtimes, and Kubernetes itself rely on them heavily, most engineers only encounter capabilities when a container throws an “Operation not permitted” error.

This article provides a **full and unified deep-dive** into Linux Capabilities:

- their conceptual origins
- inner working principles inside the Linux kernel
- file capabilities and process capability sets
- the complete table of 40+ capabilities
- real-world usage
- how capabilities work in Docker
- how they work in Kubernetes SecurityContext
- essential best practices for modern DevOps and SRE teams

By the end, you’ll understand how capabilities protect your systems and how to use them deliberately rather than accidentally.

---

# **1. Why Linux Needed Capabilities**

Before Linux kernel **2.2**, the privilege model was simple:

- **root (UID 0)** → absolute power
- **everyone else (UID ≠ 0)** → highly restricted

This binary model created a serious security problem.

### Example:

A simple daemon needed to bind to port 80. Binding to ports below 1024 required _full root privileges_. That means:

- if the daemon was compromised → the attacker immediately gained **full system control**
- if the developer wanted to follow least privilege → they simply couldn't

Linux lacked a way to say:

> “Give this process _only_ the ability to bind to port 80, and nothing else.”

This is why **Linux Capabilities were introduced**. Capabilities split the huge “root privilege blob” into multiple small permissions that can be granted individually. Instead of “root or not-root,” we now have a fine-grained security model.

---

# **2. What Linux Capabilities Actually Are**

Capabilities are **discrete units of privilege**. Each one permits a specific group of related privileged operations and nothing outside that group.

Examples:

- Bind low ports (<1024) → `CAP_NET_BIND_SERVICE`
- Create raw sockets → `CAP_NET_RAW`
- Manage routes, interfaces, firewalls → `CAP_NET_ADMIN`
- Change system time → `CAP_SYS_TIME`
- Load kernel modules → `CAP_SYS_MODULE`
- Perform ptrace debugging → `CAP_SYS_PTRACE`

Instead of giving an entire root-level permission set, the system grants _only the required pieces_.

### Conceptually:

```
root privileges = CAP_NET_BIND_SERVICE + CAP_NET_RAW + CAP_NET_ADMIN + CAP_SYS_TIME + CAP_SYS_MODULE + ...
```

You can give a process one capability (e.g., networking), while blocking all others (e.g., kernel module loading).

---

# **3. The Linux Kernel's Capability Architecture**

Capabilities are not a single flag on a process. They integrate deeply with:

- **task_struct**
- **execve()**
- **file metadata (xattrs)**
- **Bounding set**
- **Ambient set**
- **PAM & login processes**
- **Filesystem capabilities**
- **Process credential structures**

A Linux process has **five** capability sets:

|Set|Definition|
|---|---|
|**Permitted (P)**|Capabilities the process _may_ raise into its effective set; the upper limit on what it can use|
|**Effective (E)**|Capabilities the kernel actually checks when the process attempts a privileged operation|
|**Inheritable (I)**|Capabilities preserved across `execve()`; they reach the new permitted set only if the executed file also marks them inheritable|
|**Bounding (B)**|Hard ceiling on the capabilities a process can gain from file capabilities during `execve()`|
|**Ambient (A)**|Capabilities kept across `execve()` of a program that has no file capabilities and is not setuid/setgid|

### How capabilities are evaluated

During `execve()`:

1. The kernel loads file capabilities from the `security.capability` xattr.
2. The new permitted set is computed from the old sets: the capabilities present in both the process and file inheritable sets, plus the file's permitted capabilities masked by the bounding set, plus the ambient set. The ambient set is cleared if the file has capabilities or is setuid/setgid.
3. The effective set is computed: if the file's effective bit is set, it equals the new permitted set; otherwise it equals the new ambient set.
4. The final capability sets are installed in the process credentials.
5. Security modules like **SELinux** or **AppArmor** may further restrict what the process can do.

Dropping a capability from the permitted set is **irreversible** for the running process. It can only be regained by a later `execve()` of a file that grants it (through file capabilities or setuid-root), and only within the bounding set.

---

# **4. File Capabilities (setcap/getcap)**

Linux allows you to apply capabilities directly to executable files:

```bash
sudo setcap CAP_NET_BIND_SERVICE=+ep /usr/bin/myapp
```

This means:

- the file carries the capability
- when executed, the capability becomes active even for non-root users

Check file capabilities:

```bash
getcap /usr/bin/myapp
```

Remove them:

```bash
sudo setcap -r /usr/bin/myapp
```

File capabilities grant far less than a setuid-root binary does, which often removes the need for SUID bits and root wrappers.

---

# **5. Full Table of Linux Capabilities (Complete List)**

Below is the full set of capabilities available in modern Linux kernels (5.9 and later). These represent the decomposed root privilege set.

|Capability|What It Allows|
|---|---|
|**CAP_AUDIT_CONTROL**|Configure the audit subsystem|
|**CAP_AUDIT_READ**|Read the audit log via a multicast netlink socket|
|**CAP_AUDIT_WRITE**|Write records to the audit log|
|**CAP_BLOCK_SUSPEND**|Prevent system suspend|
|**CAP_BPF**|Load BPF programs and access BPF maps|
|**CAP_CHECKPOINT_RESTORE**|Process checkpoint/restore (CRIU)|
|**CAP_CHOWN**|Change file ownership|
|**CAP_DAC_OVERRIDE**|Bypass file read, write, and execute permission checks|
|**CAP_DAC_READ_SEARCH**|Bypass file read checks and directory read/search checks|
|**CAP_FOWNER**|Bypass checks that require the caller to own the file|
|**CAP_FSETID**|Keep set-user-ID/set-group-ID bits when a file is modified|
|**CAP_IPC_LOCK**|Lock memory|
|**CAP_IPC_OWNER**|Bypass IPC ownership checks|
|**CAP_KILL**|Send signals to any process, bypassing permission checks|
|**CAP_LEASE**|Establish file leases|
|**CAP_LINUX_IMMUTABLE**|Mark files immutable or append-only|
|**CAP_MAC_ADMIN**|Modify MAC policies|
|**CAP_MAC_OVERRIDE**|Override MAC access|
|**CAP_MKNOD**|Create special files|
|**CAP_NET_ADMIN**|Manage networking|
|**CAP_NET_BIND_SERVICE**|Bind privileged ports|
|**CAP_NET_BROADCAST**|Broadcast packets|
|**CAP_NET_RAW**|Raw and packet sockets (used by traditional `ping`)|
|**CAP_PERFMON**|Kernel performance counters|
|**CAP_SETFCAP**|Set file capabilities|
|**CAP_SETGID**|Change GID|
|**CAP_SETPCAP**|Modify capability sets|
|**CAP_SETUID**|Change UID|
|**CAP_SYS_ADMIN**|Almost-full root; extremely dangerous|
|**CAP_SYS_BOOT**|Reboot the system|
|**CAP_SYS_CHROOT**|Use chroot|
|**CAP_SYS_MODULE**|Load/unload kernel modules|
|**CAP_SYS_NICE**|Change process priority|
|**CAP_SYS_PACCT**|Configure process accounting|
|**CAP_SYS_PTRACE**|Debug other processes|
|**CAP_SYS_RAWIO**|Direct hardware access|
|**CAP_SYS_RESOURCE**|Override resource limits|
|**CAP_SYS_TIME**|Change system clock|
|**CAP_SYS_TTY_CONFIG**|Configure terminal devices|
|**CAP_SYSLOG**|Perform privileged kernel log (syslog) operations|
|**CAP_WAKE_ALARM**|Trigger system wake events|

Especially dangerous: `CAP_SYS_ADMIN`, `CAP_SYS_PTRACE`, `CAP_SYS_MODULE`, `CAP_SYS_RAWIO`.

---

# **6. Capabilities in Docker and Container Runtimes**

Even though containers often run as **UID 0**, they do _not_ get host-level root privileges. Container runtimes start them with a reduced capability set and drop the dangerous ones.

### Default Docker dropped capabilities include:

- SYS_ADMIN
- SYS_MODULE
- SYS_TIME
- SYS_BOOT
- SYS_PTRACE
- NET_ADMIN
- SYS_RAWIO

This limits the attack surface of a compromised container.

### Example: Network interface creation without capabilities

```bash
docker run --rm -it busybox sh
/ # ip link add dummy0 type dummy
ip: RTNETLINK answers: Operation not permitted
```

Add capability:

```bash
docker run --rm -it --cap-add=NET_ADMIN busybox sh
```

Now the same `ip link add` command succeeds.

### Drop a capability

```bash
docker run --rm --cap-drop=NET_RAW busybox
```

### Drop everything except what you need (best practice)

```bash
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE ...
```

### Privileged mode (avoid!)

```bash
docker run --privileged ...
```

This grants all capabilities and access to host devices, which amounts to near-full host root. It is almost never necessary.

---

# **7. Linux Capabilities in Kubernetes**

Kubernetes exposes capabilities through the container-level **securityContext**.

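Before adding or dropping capabilities, it helps to see which ones a process actually holds. The kernel reports each capability set as a hexadecimal mask in `/proc/<pid>/status`. The following is a minimal sketch, not an official tool, that decodes such a mask into names. The function name `decode_caps` and the variable `CAP_NAMES` are made up for this illustration; the bit order is assumed to follow `<linux/capability.h>`.

```bash
#!/bin/sh
# Capability names listed by bit number (bit 0 first), following the
# numbering in <linux/capability.h>.
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID
SETUID SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN
NET_RAW IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE
SYS_PACCT SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME
SYS_TTY_CONFIG MKNOD LEASE AUDIT_WRITE AUDIT_CONTROL SETFCAP
MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM BLOCK_SUSPEND AUDIT_READ
PERFMON BPF CHECKPOINT_RESTORE"

# decode_caps MASK
# Print the comma-separated names of the bits set in the hex MASK
# (hypothetical helper, written for this article).
decode_caps() {
  mask=$((0x$1))
  bit=0
  out=""
  for name in $CAP_NAMES; do
    # Test one bit per name, lowest bit first.
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out,}$name"
    fi
    bit=$((bit + 1))
  done
  echo "$out"
}

# Decode the capability sets of the current shell, if /proc is available.
if [ -r /proc/self/status ]; then
  while read -r field value; do
    case $field in
      CapInh:|CapPrm:|CapEff:|CapBnd:|CapAmb:)
        printf '%s %s\n' "$field" "$(decode_caps "$value")" ;;
    esac
  done < /proc/self/status
fi
```

To inspect a container, read the `CapEff` line from `/proc/1/status` inside it (for example with `grep CapEff /proc/1/status`) and pass the value to `decode_caps`. Where libcap's tools are installed, `capsh --decode=<mask>` produces a similar listing.
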
### Example: Add capabilities

```yaml
securityContext:
  capabilities:
    add:
      - NET_ADMIN
```

### Drop capabilities

```yaml
securityContext:
  capabilities:
    drop:
      - NET_RAW
```

### Drop all and add minimal (best practice)

```yaml
securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]
```

### Example: Disable `ping` by removing NET_RAW

Default BusyBox pod can ping:

```
64 bytes from 8.8.8.8...
```

Now deploy a pod without NET_RAW:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox-secure
spec:
  containers:
    - name: busybox
      image: busybox
      command: ["sleep", "3600"]
      securityContext:
        capabilities:
          drop:
            - NET_RAW
```

Now:

```
ping: permission denied
```

On runtimes that allow unprivileged ICMP sockets (through the `net.ipv4.ping_group_range` sysctl), `ping` may still work without `NET_RAW`.

Dropping unneeded capabilities in this way is a basic building block of Kubernetes workload hardening.

---

# **8. Best Practices for DevOps & SRE Teams**

### ✔ 1. Drop all capabilities by default

```
drop: ["ALL"]
```

### ✔ 2. Add only what is explicitly required

```
add: ["NET_BIND_SERVICE"]
```

### ✔ 3. Avoid CAP_SYS_ADMIN at all costs

It is nearly equivalent to root.

### ✔ 4. Avoid `--privileged` containers

Except for debugging on isolated test machines.

### ✔ 5. Use Seccomp + AppArmor + Capabilities together

Capabilities limit which privileged kernel operations a process may perform, Seccomp filters the system calls it may make, and AppArmor confines its access to files and other resources.

### ✔ 6. Use file capabilities instead of SUID wrappers

They grant a narrower privilege and are easier to maintain.

### ✔ 7. Audit container permissions regularly

Using tools like:

- `docker inspect`
- `kube-bench`
- `kubeaudit`
- `falco`

---

# **9. Summary**

Linux Capabilities provide:

- granular privilege control
- safer container security
- powerful Kubernetes hardening
- compliance with least privilege
- greatly reduced attack surface

Capabilities are foundational to all container runtimes and are essential knowledge for DevOps, SRE, platform engineers, and anyone running workloads on Linux.

With proper use, they help ensure your infrastructure is secure **before** issues are discovered in audits, not after.