
Kubernetes Monitoring and Observability: A Practical Guide for Enterprise Teams

THNKBIG Team

Engineering Insights

title: "Kubernetes Monitoring and Observability: A Practical Guide for Enterprise Teams"
meta_description: "A practical guide for enterprise teams on Kubernetes monitoring and observability. Learn to implement comprehensive metrics, logs, and tracing with Prometheus, Loki, and Jaeger."
url_slug: "/kubernetes-monitoring-observability"
primary_keyword: "kubernetes monitoring and observability"
secondary_keywords:
- "kubernetes observability"
- "prometheus kubernetes"
- "kubernetes logging"
- "kubernetes tracing"
author: "Rudy Salo"
date: "2026-04-09"

Introduction

You can't manage what you can't measure. In Kubernetes environments, observability isn't just about detecting issues—it's about understanding system behavior, optimizing performance, and making data-driven decisions about capacity and architecture.

Enterprise Kubernetes deployments generate massive amounts of metrics, logs, and traces. Without proper observability infrastructure, you're flying blind. With the right tools and practices, you gain visibility into every aspect of your cluster's health and performance.

This guide covers the three pillars of observability—metrics, logs, and traces—and how to implement comprehensive monitoring for Kubernetes at scale.

The Three Pillars of Observability

Metrics: Quantitative measurements over time

Metrics answer "what is happening?" They help identify trends, set baselines, and trigger alerts. Prometheus is the standard for Kubernetes metrics collection.

Logs: Detailed event records

Logs answer "why did it happen?" They provide context for incidents and help debug issues. Fluentd, Fluent Bit, and Loki are common choices.

Traces: Request-level visibility across services

Traces answer "how did it happen?" They follow requests through distributed systems, identifying bottlenecks and latency sources. Jaeger and Tempo provide distributed tracing.

Metrics: Prometheus and kube-prometheus-stack

Installation

# Add Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.replicas=2

Core Metrics to Track

**Node-level metrics:**

# NodeExporter metrics you should monitor
- node_cpu_seconds_total
- node_memory_MemAvailable_bytes
- node_filesystem_avail_bytes
- node_network_receive_bytes_total
- node_network_transmit_bytes_total
- node_disk_io_time_seconds_total
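
Since `node_cpu_seconds_total` is a cumulative counter, utilization has to be derived from the difference between two samples. A minimal Python sketch (illustrative only, not part of any tooling above) of the arithmetic behind an expression like `1 - rate(node_cpu_seconds_total{mode="idle"}[5m])`:

```python
# Sketch: how two samples of the cumulative idle-CPU counter become a
# utilization fraction, mirroring what PromQL's rate() computes.
def cpu_utilization(idle_start: float, idle_end: float,
                    window_seconds: float, num_cpus: int) -> float:
    """Fraction of CPU busy over the window, from two idle-counter samples."""
    idle_rate = (idle_end - idle_start) / window_seconds  # idle CPU-seconds per second
    return 1.0 - idle_rate / num_cpus

# Example: a 4-CPU node accrues 1080 idle CPU-seconds over a 5-minute (300 s) window:
# 1080/300 = 3.6 idle cores out of 4, so the node is about 10% busy.
util = cpu_utilization(idle_start=50_000.0, idle_end=51_080.0,
                       window_seconds=300.0, num_cpus=4)
```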

**Kubernetes control plane:**

# API server metrics
- apiserver_request_total
- apiserver_request_duration_seconds
- etcd_server_leader_changes_total
- etcd_disk_wal_fsync_duration_seconds

# Controller manager metrics
- leader_election_master_status
- workqueue_depth

# Scheduler metrics
- scheduler_scheduling_attempt_duration_seconds
- scheduler_pending_pods

**Workload metrics:**

# Deployment metrics
- kube_deployment_status_replicas_available
- kube_deployment_spec_replicas
- kube_deployment_metadata_generation

# Pod metrics
- kube_pod_status_phase
- kube_pod_container_status_restarts_total
- kube_pod_container_resource_requests
- kube_pod_container_resource_limits

Example Grafana Dashboard

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  cluster-overview.json: |
    {
      "dashboard": {
        "title": "Kubernetes Cluster Overview",
        "panels": [
          {
            "title": "CPU Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) by (instance)",
                "legendFormat": "{{instance}}"
              }
            ]
          },
          {
            "title": "Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)",
                "legendFormat": "{{instance}}"
              }
            ]
          },
          {
            "title": "Pod Count by Namespace",
            "type": "piechart",
            "targets": [
              {
                "expr": "sum(kube_pod_info) by (namespace)",
                "legendFormat": "{{namespace}}"
              }
            ]
          }
        ]
      }
    }

Alerting Best Practices

PromQL Alert Examples

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-resources
      rules:
        # High CPU usage
        - alert: HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
            / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (namespace, pod) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on pod {{ $labels.namespace }}/{{ $labels.pod }}"
            description: "CPU usage is above 80% for more than 5 minutes."

        # Pod restarting frequently
        - alert: PodRestartingTooMuch
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 6
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting frequently"
            description: "Pod has restarted more than 6 times in the last hour."

        # Node memory pressure
        - alert: NodeMemoryPressure
          expr: |
            (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has low memory available"
            description: "Available memory is below 15%."

        # Deployment unavailable
        - alert: DeploymentReplicasMismatch
          expr: |
            kube_deployment_status_replicas_available{namespace=~"production.*"}
            != kube_deployment_spec_replicas{namespace=~"production.*"}
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has replica mismatch"
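
The restart alert's threshold is just counter arithmetic: a per-second `rate()` has to be scaled to a human-readable per-hour figure before it matches a "6 restarts per hour" intent. A sketch of that conversion (illustrative helper, not part of any alerting stack):

```python
# Sketch: convert two samples of a cumulative restart counter into
# restarts per hour, the quantity a restart alert really cares about.
def restarts_per_hour(counter_start: float, counter_end: float,
                      window_seconds: float) -> float:
    per_second = (counter_end - counter_start) / window_seconds  # what rate() returns
    return per_second * 3600.0

# 3 restarts observed over a 15-minute (900 s) window -> 12 restarts/hour,
# well above a 6-per-hour threshold.
rph = restarts_per_hour(counter_start=10, counter_end=13, window_seconds=900)
```

This is also a useful reminder that a bare `rate(...) > 0.1` threshold means 0.1 restarts *per second*, far stricter than it looks.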

Alert Routing with Alertmanager

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: critical-alerts
  namespace: monitoring
spec:
  route:
    receiver: "critical-slack"
    groupBy: ["alertname", "severity"]
    matchers:
      - name: severity
        value: critical
        matchType: "="
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
  receivers:
    - name: "critical-slack"
      slackConfigs:
        - channel: "#critical-alerts"
          apiURL:
            name: slack-webhook
            key: url
          title: "{{ .CommonLabels.alertname }}"
          text: "{{ .CommonAnnotations.summary }}"
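
To make the grouping behavior concrete, here is a toy Python sketch (the alert dicts are invented for illustration) of how grouping by `alertname` and `severity` collapses multiple firing alerts into fewer notifications:

```python
# Sketch: Alertmanager-style grouping. Alerts sharing the same values for
# the groupBy labels end up in one group, hence one notification.
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname", "severity")):
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(k) for k in group_by)
        groups[key].append(alert)
    return dict(groups)

firing = [
    {"labels": {"alertname": "NodeMemoryPressure", "severity": "critical", "node": "n1"}},
    {"labels": {"alertname": "NodeMemoryPressure", "severity": "critical", "node": "n2"}},
    {"labels": {"alertname": "HighCPUUsage", "severity": "warning", "pod": "api-1"}},
]
grouped = group_alerts(firing)
# Both memory alerts land in one group: two notifications instead of three.
```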

Logging with Loki and Fluent Bit

Loki Installation

# Add the Grafana Helm repo (hosts the Loki chart)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki; values keys (including schema configuration) vary by chart
# version, so review the chart's values.yaml before overriding them
helm install loki grafana/loki \
  --namespace monitoring

Fluent Bit Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             5
        Log_Level         info
        Daemon            off
        Parsers_File      parsers.conf
        HTTP_Server       On
        HTTP_Listen       0.0.0.0
        HTTP_Port         2020
        Health_Check      On

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Labels              On
        Annotations         Off

    [OUTPUT]
        Name        loki
        Match       kube.*
        Host        loki.monitoring.svc.cluster.local
        Port        3100
        Labels      job=fluent-bit
        Line_Format json

Log Query Examples

# Error logs from production namespace
{namespace="production"} |= "ERROR"

# Logs from specific pod
{pod="api-server-7d8f9c6b5-x4k9m"} |~ "timeout|failed"

# Parse JSON logs
{service="checkout"} | json | status_code >= 500

# Aggregate error rates
rate({namespace="production"} |= "ERROR"[5m])
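
If you want to sanity-check a query like `| json | status_code >= 500` against sample lines before running it in Loki, the same pipeline is easy to mimic in Python (field names here are assumptions, mirroring the query above):

```python
# Sketch: the LogQL json-parse-and-filter pipeline in plain Python.
import json

def count_server_errors(raw_lines):
    """Count log lines whose JSON payload has status_code >= 500."""
    errors = 0
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON lines drop out, as in LogQL's json stage
        if int(record.get("status_code", 0)) >= 500:
            errors += 1
    return errors

sample = [
    '{"status_code": 200, "path": "/checkout"}',
    '{"status_code": 503, "path": "/checkout"}',
    'plain text line',
    '{"status_code": 500, "path": "/cart"}',
]
n = count_server_errors(sample)  # two 5xx lines in the sample
```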

Distributed Tracing with Jaeger

Jaeger Installation

# Add the Jaeger Helm repo
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

# Install Jaeger (pin an explicit image tag in production rather than latest)
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set collector.replicas=2

Instrumenting Applications

For Go applications:

import (
    "io"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go"
    "github.com/uber/jaeger-client-go/config"
)

func initJaeger(service string) (opentracing.Tracer, io.Closer) {
    cfg := config.Configuration{
        ServiceName: service,
        Sampler: &config.SamplerConfig{
            Type:  jaeger.SamplerTypeConst,
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LogSpans:          true,
            CollectorEndpoint: "http://jaeger-collector.monitoring.svc:14268/api/traces",
        },
    }

    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        panic(err)
    }

    opentracing.SetGlobalTracer(tracer)
    return tracer, closer
}

For Python applications:

from jaeger_client import Config

def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'local_agent': {
                'reporting_host': 'jaeger-agent.monitoring.svc',
                'reporting_port': 6831,
            },
        },
        service_name=service_name,
        validate=True,
    )
    # initialize_tracer() also registers the tracer as the opentracing global
    return config.initialize_tracer()

Trace Query Examples

# Find traces with specific service
service:api-server operation:/api/v1/users

# Find slow traces
duration:>2s service:checkout

# Find traces with errors
tag:error=true service:*
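
These queries amount to simple predicates over spans. A sketch using hypothetical trace records represented as plain dicts, combining the duration and error-tag filters above:

```python
# Sketch: filter traces the way the "duration:>2s" and "tag:error=true"
# queries do, over invented example records.
def find_slow_or_errored(traces, min_duration_s=2.0):
    return [
        t for t in traces
        if t["duration_s"] > min_duration_s or t["tags"].get("error") is True
    ]

traces = [
    {"service": "checkout", "duration_s": 0.4, "tags": {}},
    {"service": "checkout", "duration_s": 3.1, "tags": {}},          # slow
    {"service": "api-server", "duration_s": 0.2, "tags": {"error": True}},  # errored
]
hits = find_slow_or_errored(traces)  # the slow trace and the errored trace
```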

Service Level Objectives (SLOs)

Define SLOs for Critical Services

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-availability
  namespace: monitoring
spec:
  service: api-server
  slos:
    - name: requests-availability
      objective: 99.9
      description: "99.9% of API requests succeed over the window."
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
      alerting:
        name: APIAvailability
        labels:
          severity: critical
        annotations:
          summary: "API availability SLO breached"

Common SLOs for Kubernetes Services

| Service Type | Availability SLO | Latency SLO |
|--------------|------------------|-------------|
| API Gateway | 99.9% (43.2 min/month) | p99 < 500ms |
| Core APIs | 99.5% (3.6 hr/month) | p99 < 200ms |
| Batch Jobs | 99% (7.2 hr/month) | p95 < 5min |
| User-Facing | 99.9% | p95 < 2s, p99 < 5s |
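
The downtime budgets follow from one formula: budget = window × (1 − SLO). A quick sketch over a 30-day window:

```python
# Sketch: downtime budget implied by an availability SLO over a 30-day window.
def downtime_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Minutes of allowed downtime for a given availability target."""
    return window_days * 24 * 60 * (1.0 - slo)

budget_999 = downtime_budget_minutes(0.999)  # ~43.2 minutes/month
budget_995 = downtime_budget_minutes(0.995)  # ~216 minutes (~3.6 hours)/month
```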

Implementation Roadmap

Phase 1: Foundation (Days 1-3)

  1. Deploy kube-prometheus-stack
  2. Configure persistent storage for Prometheus
  3. Set up basic cluster overview dashboard
  4. Create alerting for critical resources

Phase 2: Logging (Days 4-5)

  1. Deploy Loki and Fluent Bit
  2. Configure log retention policies
  3. Create log dashboards for critical services
  4. Set up log-based alerts

Phase 3: Advanced Observability (Days 6-10)

  1. Deploy Jaeger for distributed tracing
  2. Instrument critical services
  3. Define SLOs for critical paths
  4. Create SLO dashboards and burn rate alerts

Phase 4: Runbooks (Days 11-14)

  1. Document alert response procedures
  2. Create runbooks for common incidents
  3. Set up on-call rotation
  4. Conduct game day exercises

Common Pitfalls to Avoid

  1. **Monitoring everything:** Focus on metrics that drive decisions. Don't collect data you never query.
  2. **No retention policy:** Without retention policies, Prometheus fills up disk and degrades. Set appropriate retention.
  3. **Alert fatigue:** Too many alerts lead to ignored notifications. Prioritize actionable alerts only.
  4. **No log aggregation:** Centralize logs from all namespaces. Searching individual pods doesn't scale.
  5. **Missing trace context:** Trace requests across all services. Incomplete traces are useless.
  6. **Ignoring cardinality:** High-cardinality labels (like user IDs) create storage issues. Aggregate before storing.
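
A back-of-envelope way to see the cardinality problem: the worst-case series count is the product of each label's distinct values, so a single high-cardinality label multiplies everything else. Sketch (the label counts are illustrative, and the product is an upper bound since not every combination occurs):

```python
# Sketch: worst-case time-series count as the product of per-label
# cardinalities. One user-ID label turns 50k series into 500M.
from math import prod

def series_count(label_cardinalities: dict) -> int:
    """Upper bound on series for a metric with these label cardinalities."""
    return prod(label_cardinalities.values())

base = {"namespace": 20, "pod": 500, "status_code": 5}
with_user = dict(base, user_id=10_000)

before = series_count(base)       # 20 * 500 * 5 = 50,000 series
after = series_count(with_user)   # 10,000x more: 500,000,000 series
```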

Key Metrics to Monitor

Cluster Health

  • API server latency
  • etcd disk and network performance
  • Scheduler queuing time
  • Controller reconciliation time

Node Health

  • CPU and memory utilization
  • Disk I/O and capacity
  • Network throughput and errors
  • Kubelet status

Workload Performance

  • Pod restart count
  • OOM kill frequency
  • Resource request vs. actual usage
  • HPA scaling events

Application Performance

  • Request latency (p50, p95, p99)
  • Error rates by status code
  • Throughput (requests/second)
  • Availability percentage
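
These percentiles are straightforward to compute from raw latency samples; here is a nearest-rank sketch (note that Prometheus histograms estimate quantiles differently, by interpolating within buckets):

```python
# Sketch: nearest-rank percentile over raw latency samples.
import math

def percentile(samples, q):
    """Nearest-rank percentile; q in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = list(range(1, 101))  # toy data: 1..100 ms
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```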

Conclusion

Comprehensive observability enables you to understand, debug, and optimize your Kubernetes environment. Start with metrics (Prometheus), add logging (Loki), and layer in tracing (Jaeger) for complete visibility.

The investment in observability pays dividends in faster incident resolution, better capacity planning, and improved user experience. Without visibility into your clusters, you're reactively fighting fires instead of proactively improving systems.

OpenTelemetry: The Future of Instrumentation

Consider OpenTelemetry (OTel) for vendor-neutral instrumentation:

  • **Vendor agnostic**: Send data to any backend (Prometheus, Jaeger, Tempo, commercial tools)
  • **Auto-instrumentation**: Auto-generate spans for common frameworks and libraries
  • **Standardized**: CNCF project with broad industry support
  • **Low overhead**: Efficient sampling and compression

Start with OTel if you're building new services. It future-proofs your observability investment.

---

**Want to assess your Kubernetes observability maturity?**

Schedule a free Assessment Workshop with our team to review your current monitoring setup, identify gaps, and develop a practical roadmap for comprehensive observability.

[Book Assessment Workshop]


THNKBIG Team

Engineering Insights

Expert infrastructure engineers at THNKBIG, specializing in Kubernetes, cloud platforms, and AI/ML operations.
