
Logging in Kubernetes: Best Practices

A practical guide to Kubernetes logging — centralized architecture, backend selection, structured logging, retention policies, and troubleshooting patterns.

THNKBIG Team

Engineering Insights


Logging in Kubernetes Is a Different Problem

On a virtual machine, your application writes to /var/log and you set up rsyslog or a Filebeat agent. The file stays put. The machine stays put. In Kubernetes, pods are ephemeral. A deleted pod takes its logs with it. A node removed by the cluster autoscaler takes every log file that lived on it. If you are not shipping logs off-node in real time, you are losing data.

Centralized logging is not optional in Kubernetes. It is infrastructure. This post covers the architecture decisions, tool choices, and operational practices that separate useful logging from expensive noise.

Centralized Logging Architecture

There are two primary patterns for collecting logs in Kubernetes: the DaemonSet collector and the sidecar collector. The DaemonSet approach deploys one log agent per node — typically Fluent Bit or Fluentd — that reads container log files from /var/log/containers/. This is the standard approach for most clusters.

The sidecar approach runs a logging agent inside each pod. It uses more resources but gives you per-application control over log parsing and routing. Use sidecars when specific workloads produce non-standard log formats that need custom parsing before ingestion.

Our recommendation: start with a Fluent Bit DaemonSet for cluster-wide collection. Add sidecars only for workloads that genuinely need them. Fluent Bit uses roughly 15 MB of memory per node versus 60+ MB for Fluentd. At scale, that difference matters.
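As a sketch, the DaemonSet half of that setup looks like the following. The namespace, image tag, and resource limits are illustrative, and the ConfigMap carrying the actual Fluent Bit configuration is omitted:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging               # illustrative namespace
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit   # needs RBAC read access to pod metadata
      tolerations:
        - operator: Exists           # collect from every node, including tainted ones
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.0   # pin the version you have validated
          resources:
            limits:
              memory: 64Mi           # headroom over Fluent Bit's small baseline
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log           # where the kubelet writes container log files
```

One agent per node scales linearly with the cluster, which is exactly why the per-agent memory footprint matters.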

Choosing a Log Backend: EFK, Loki, or Cloud-Native

The EFK stack (Elasticsearch, Fluent Bit/Fluentd, Kibana) remains the most common open-source logging backend. Elasticsearch is powerful for full-text search and complex queries. It is also resource-intensive. A production Elasticsearch cluster needs dedicated nodes, careful JVM tuning, and index lifecycle management.

Grafana Loki takes a different approach. It indexes only metadata (labels), not the log content itself. Storage costs drop dramatically because logs are compressed and stored in object storage like S3. Query speed on raw text is slower than Elasticsearch, but for most troubleshooting workflows — filtering by namespace, pod, or container — Loki is fast enough and far cheaper.

Cloud-native options (CloudWatch Logs, Google Cloud Logging, Azure Monitor) eliminate operational overhead entirely. If your cluster runs in a single cloud and your team is small, these are reasonable choices. The trade-off is vendor lock-in and potentially higher costs at scale. See our observability practice for help evaluating the right stack for your workloads.

Structured Logging: JSON or Nothing

Unstructured logs (plain text strings) are nearly impossible to query at scale. When every microservice formats log messages differently, your logging backend spends most of its resources on parsing, and your engineers spend most of their time guessing field names.

Standardize on JSON-structured logs across every service. Define a shared schema: timestamp, level, service, traceId, message, and any domain-specific fields. Use a logging library that outputs structured JSON natively — like Zap for Go, structlog for Python, or Pino for Node.js.
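Even without a dedicated library, the schema can be emitted with the standard library. A minimal sketch in Python; the service name, field set, and timestamp format are illustrative:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, matching a shared schema."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": "checkout",           # hypothetical service name
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` land as attributes on the record;
        # copy the ones the schema cares about into the JSON body.
        if hasattr(record, "traceId"):
            payload["traceId"] = record.traceId
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)  # stdout, so Kubernetes captures it
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized", extra={"traceId": "abc123"})
```

A dedicated library adds conveniences (bound context, sampling), but the schema is the part that must be shared.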

Structured logs unlock machine-readable queries. Instead of grepping for a substring, you filter by level:error AND service:checkout AND traceId:abc123. That is the difference between a five-minute investigation and a two-hour one.

Log Levels and Volume Control

Most production applications should run at INFO level. DEBUG logging in production generates enormous volume, increases storage costs, and buries the signal in noise. Use DEBUG only during active troubleshooting, and make log levels configurable at runtime without redeploying.
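One pattern for runtime-configurable levels is to re-read the level from a ConfigMap-mounted file on a timer or signal; editing the ConfigMap updates the mount in place, with no redeploy. A sketch, where the file path is a hypothetical convention:

```python
import logging

def refresh_level(logger, path="/etc/app/log-level"):
    """Re-read the desired log level from a file (e.g. a ConfigMap mount).

    Note: ConfigMap volumes update in place after an edit, but subPath
    mounts do not -- mount the whole volume for this to work.
    """
    try:
        with open(path) as f:
            name = f.read().strip().upper()
    except OSError:
        name = "INFO"  # file missing or unreadable: fall back to the default
    logger.setLevel(getattr(logging, name, logging.INFO))
```

Call `refresh_level` periodically or from a SIGHUP handler; flipping the ConfigMap value to `debug` then takes effect within the refresh interval.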

Define what each log level means in your organization. A common standard: ERROR means an operation failed and requires human attention. WARN means something unexpected happened but the system recovered. INFO records normal operational events. DEBUG records internal state for developers.

Set up alerts on ERROR log rates, not individual errors. A single error is normal. A spike of 10x the baseline error rate in five minutes is an incident. Use your logging backend or a metrics pipeline to track error rates by service and namespace.
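If your backend is Loki, its ruler evaluates Prometheus-style alerting rules over LogQL. A sketch, assuming JSON logs with a `level` field and a `service` label; the threshold, namespace, and labels are illustrative and should come from your own baselines:

```yaml
groups:
  - name: log-error-rates
    rules:
      - alert: ErrorRateSpike
        # Fires when a service's error-log rate stays above threshold for 5m.
        expr: |
          sum by (service) (
            rate({namespace="prod"} | json | level = "error" [5m])
          ) > 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error log rate spike in {{ $labels.service }}"
```

The same idea works in Elasticsearch with a threshold alert on an error-count query, or by exporting error counts to a metrics pipeline and alerting there.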

Log Retention Policies

Storing every log line forever is expensive and usually unnecessary. Define retention policies based on data classification. Application debug logs: 7 days. General application logs: 30 days. Audit logs and security events: 1 year or per your compliance requirements (SOC 2, HIPAA, PCI-DSS).

In Elasticsearch, use Index Lifecycle Management (ILM) to automatically roll over, shrink, and delete indices. In Loki, configure retention per tenant. Tier old logs to cheaper storage — S3 Glacier or equivalent — for compliance archives that you query rarely.
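In Elasticsearch, an ILM policy created via `PUT _ilm/policy/app-logs` expresses the 30-day tier directly; the rollover sizes and ages below are illustrative:

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

A warm or cold phase can sit between the two to move older indices onto cheaper nodes before deletion.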

Troubleshooting with Logs: Practical Patterns

Effective troubleshooting requires correlating logs across services. Distributed tracing with OpenTelemetry gives you trace IDs. Inject those trace IDs into every log line. When a user reports a failure, pull the trace ID from the request and query all logs across all services for that single transaction.
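One way to do the injection is a logging filter that reads the current ID from request context, so no call site has to remember it. A sketch using Python's `contextvars`; the variable name and `"-"` default are illustrative:

```python
import contextvars
import logging

# Holds the trace ID for the request currently being handled on this
# task/thread; middleware sets it from the incoming trace context.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp the current trace ID onto every record so formatters can emit it."""

    def filter(self, record):
        record.traceId = trace_id_var.get()
        return True  # never drop the record, only enrich it

logger = logging.getLogger("checkout")
logger.addFilter(TraceIdFilter())
```

In practice the middleware would set `trace_id_var` from the `traceparent` header or the active OpenTelemetry span at the start of each request.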

Use Kubernetes metadata enrichment in your Fluent Bit config. Every log line should include the pod name, namespace, node, container name, and labels. This metadata lets you answer questions like: did this failure happen on a specific node? A specific replica? After a specific deployment?
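In Fluent Bit's classic configuration syntax, that enrichment is a single filter; the sketch below assumes a containerd runtime (hence the `cri` parser):

```ini
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Tag     kube.*
    Parser  cri

[FILTER]
    # Queries the API server and attaches pod name, namespace, container
    # name, and labels to every record from the kube.* streams.
    Name                kubernetes
    Match               kube.*
    Merge_Log           On
    Keep_Log            Off
    K8S-Logging.Parser  On
```

`Merge_Log On` additionally parses JSON application logs into structured fields rather than leaving them as a string.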

Build dashboards in Grafana or Kibana that show error rates by service, log volume by namespace, and top error messages. These dashboards are the first place your on-call engineers should look during an incident.

Common Logging Pitfalls in Kubernetes

Logging secrets. Applications that log request bodies or headers may inadvertently log API keys, tokens, or PII. Scrub sensitive fields before they reach your logging backend. Most logging libraries support redaction middleware.

Stdout vs. file logging. Kubernetes captures stdout and stderr from containers automatically. If your application writes to a log file inside the container, the DaemonSet collector will not see it. Refactor applications to log to stdout, or add a sidecar to tail the file.
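When refactoring is not feasible, the tailing sidecar shares a volume with the application and streams the file to its own stdout. A sketch; the app image and log path are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
spec:
  containers:
    - name: app
      image: legacy-app:1.0            # hypothetical image writing /var/log/app/app.log
      volumeMounts:
        - name: applog
          mountPath: /var/log/app
    - name: log-tailer
      image: busybox:1.36
      # Tail the file to the sidecar's stdout, where the node agent picks it up.
      command: ["sh", "-c", "tail -n+1 -F /var/log/app/app.log"]
      volumeMounts:
        - name: applog
          mountPath: /var/log/app
          readOnly: true
  volumes:
    - name: applog
      emptyDir: {}
```

The cost is one extra container per pod and a duplicate copy of the log data on the node, which is why logging directly to stdout remains the better fix.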

Ignoring log pipeline backpressure. When your logging backend is slow or down, Fluent Bit buffers logs in memory or on disk. If the buffer fills up, you lose logs. Configure appropriate buffer sizes, set up dead-letter queues, and monitor your log pipeline as critical infrastructure.
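In Fluent Bit, filesystem buffering and pipeline monitoring are a few configuration keys. A sketch; the paths, limits, and Loki host are illustrative:

```ini
[SERVICE]
    HTTP_Server               On       # expose metrics for pipeline monitoring
    HTTP_Port                 2020
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 16M

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    Tag           kube.*
    storage.type  filesystem           # spill to disk instead of dropping in memory

[OUTPUT]
    Name                      loki
    Match                     *
    Host                      loki-gateway   # hypothetical Loki service name
    storage.total_limit_size  2G             # cap the on-disk backlog for this output
```

Scrape the metrics endpoint and alert on growing output retries and dropped records, the same way you would for any other critical pipeline.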

Build Logging That Survives Incidents

Your logging architecture is only tested when something breaks. If your logs disappear when pods crash, if your team cannot correlate events across services, or if your storage costs are growing faster than your cluster, the architecture needs work.

Talk to an engineer about designing a Kubernetes logging stack that scales with your workloads.

Why This Matters for Your Operations Team

  • Kubernetes logs are ephemeral by default — when a pod is deleted, its logs disappear unless they have been shipped to a centralized backend.
  • The recommended logging stack: Fluent Bit (DaemonSet collector, low overhead) shipping to Loki (cost-effective, label-indexed) or Elasticsearch (full-text search, higher cost).
  • Structured JSON logging dramatically reduces query time during incidents — engineers filter by field value rather than parsing raw text.

Building on the logging architecture covered in this post, the operational priority is ensuring your logging pipeline is resilient enough to survive the incidents you are trying to debug. Fluent Bit should have disk-based buffering configured so that a temporary Loki or Elasticsearch outage does not cause log loss. Monitor the Fluent Bit DaemonSet health aggressively — a failing log collector on any node creates blind spots during the exact moments when full visibility is most critical.

For compliance-driven logging (HIPAA audit trails, SOC 2 evidence, FedRAMP AU control family), logging infrastructure must be treated as security-grade infrastructure. Logs must be tamper-evident, retained for the required period, and accessible for audit queries within a defined SLA. This often requires a dedicated audit log pipeline separate from operational logs. THNKBIG designs logging architectures for regulatory compliance as part of our cybersecurity and compliance practice. Contact us.


THNKBIG Team

Engineering Insights

Expert infrastructure engineers at THNKBIG, specializing in Kubernetes, cloud platforms, and AI/ML operations.
