eBPF for application observability: the boring parts
The marketing pitch for eBPF in 2026 sounds the same as it did in 2023: kernel-level visibility without modifying applications, with negligible overhead. The reality for a small-team production stack is more nuanced.
What worked
Pixie and Cilium's Hubble gave us network-flow visibility we didn't have before. We caught two slow upstream APIs that had been quietly bleeding latency budget for months. The dashboards just told us where the time went, no instrumentation needed.
For TCP-level metrics on connection setup time, retransmits, and zero-window events, eBPF tooling beat anything we'd built ourselves with sidecars and log aggregation.
What didn't
The kernel version dependency was the biggest pain. Our Ubuntu 22.04 nodes worked fine, but a couple of older worker nodes on 20.04 needed kernel upgrades that we'd been deferring. That's not exactly eBPF's fault, but it's the kind of "incidentally needs infrastructure work" surprise that doesn't show up in the docs.
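A preflight check along these lines would have flagged the problem nodes before rollout. This is a minimal sketch, not what we actually ran: the minimum version in MIN_KERNEL is an assumed placeholder (check the requirements of whichever tool you deploy), the function names are made up for illustration, and the BTF check only looks for the /sys/kernel/btf/vmlinux file that kernels built with BTF support expose.

```python
import os
import re
from typing import Tuple

# Assumed placeholder threshold, NOT taken from any tool's docs.
# Replace with the minimum your eBPF tooling actually documents.
MIN_KERNEL: Tuple[int, int] = (5, 8)


def parse_kernel_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string like '5.4.0-150-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_ok(release: str, minimum: Tuple[int, int] = MIN_KERNEL) -> bool:
    """True if the kernel's (major, minor) is at least the given minimum."""
    return parse_kernel_release(release) >= minimum


def has_btf(path: str = "/sys/kernel/btf/vmlinux") -> bool:
    """True if the kernel exposes BTF type information at the usual path."""
    return os.path.exists(path)


if __name__ == "__main__":
    # os.uname() is available on Linux and other Unix-like systems.
    release = os.uname().release
    print(f"kernel {release}: version_ok={kernel_ok(release)} btf={has_btf()}")
```

Run it on each node (or feed it the release strings from your inventory) and the nodes that need attention show up as version_ok=False or btf=False.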
Resource overhead on busy nodes was measurable: a few percent CPU on our load-testing rigs. That's negligible for steady-state traffic, but it ate into the headroom in our capacity calculations.
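To see why a few percent matters for headroom, here is the arithmetic as a small sketch. Every number in it is hypothetical (100 cores of peak demand, 16-core nodes, a 70% utilization target, 3% agent overhead), and it assumes the overhead can be modeled as a flat fraction of each node's CPU, which is a simplification.

```python
import math


def nodes_needed(demand_cores: float, cores_per_node: int,
                 target_util: float, agent_overhead: float = 0.0) -> int:
    """Nodes required to serve demand_cores while staying under target_util.

    agent_overhead is the fraction of each node's CPU consumed by the
    observability agent (assumed flat), which shrinks the usable budget.
    """
    usable_fraction = target_util - agent_overhead
    if usable_fraction <= 0:
        raise ValueError("overhead consumes the entire utilization budget")
    usable_cores_per_node = cores_per_node * usable_fraction
    return math.ceil(demand_cores / usable_cores_per_node)


if __name__ == "__main__":
    # Hypothetical fleet: 100 cores of peak demand, 16-core nodes, 70% target.
    without_agent = nodes_needed(100, 16, 0.70)
    with_agent = nodes_needed(100, 16, 0.70, agent_overhead=0.03)
    print(f"without agent: {without_agent} nodes, with agent: {with_agent} nodes")
```

With these made-up inputs the overhead is enough to tip the fleet from 9 nodes to 10, which is the kind of step change that a "negligible" percentage can hide.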
What we removed
We turned off application-layer tracing via eBPF. The kernel-level visibility was great, but trying to use it as a replacement for OpenTelemetry trace IDs added complexity without payoff. Stuck with explicit instrumentation for that layer.
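For readers who haven't looked at what explicit instrumentation carries: the trace ID is application-level context passed from service to service in a request header, by default the W3C traceparent header in OpenTelemetry. The sketch below handles that header with the standard library only. It is not the OpenTelemetry SDK, the function names are made up for illustration, and it ignores tracestate and future header versions.

```python
import re
import secrets
from typing import Optional, Tuple

# Version 00 of the W3C traceparent format:
#   00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip().lower())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return trace_id, parent_id, sampled


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span IDs."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: Optional[str]) -> str:
    """Header for an outgoing call: keep the trace ID, mint a new span ID."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is None:
        return new_traceparent()
    trace_id, _parent_id, sampled = parsed
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each service reads the incoming header, keeps the trace ID, and sends a header with a fresh span ID on its outgoing calls. That continuity across hops is what the explicit instrumentation layer maintains.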
Net: kept the network/system observability, dropped the app-layer dream. Honest assessment six months in.