Container logs in Kubernetes follow this flow:
Application → Container Runtime → Node Filesystem → Log Aggregation System
                                        ↓
                          /var/log/containers/ (symlinks)
                                        ↓
                          /var/log/pods/ (log files, or symlinks when Docker is the runtime)
                                        ↓
                          /var/lib/docker/containers/ (Docker's own json-file logs)
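The symlink names under /var/log/containers/ encode which pod, namespace, and container a log belongs to, in the form `<pod>_<namespace>_<container>-<container-id>.log`. The following is a minimal sketch of splitting such a name. The function name and the sample file name are hypothetical (real container IDs are 64 hex characters).

```python
from typing import NamedTuple


class ContainerLogName(NamedTuple):
    pod: str
    namespace: str
    container: str
    container_id: str


def parse_container_log_name(filename: str) -> ContainerLogName:
    """Split '<pod>_<namespace>_<container>-<container-id>.log' into its parts."""
    if not filename.endswith(".log"):
        raise ValueError(f"not a container log name: {filename!r}")
    stem = filename[: -len(".log")]
    # Pod and namespace names cannot contain underscores, so two splits suffice.
    pod, namespace, rest = stem.split("_", 2)
    # The container ID is the hex string after the last hyphen.
    container, container_id = rest.rsplit("-", 1)
    return ContainerLogName(pod, namespace, container, container_id)


# Hypothetical file name with a shortened container ID.
example = parse_container_log_name("web-7d4b9c_default_app-0123456789abcdef.log")
print(example)
```

A log shipper can use fields like these to attach pod and namespace labels without querying the API server for every file.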
Root Causes of Log-Related Disk IO Issues
1. Uncontrolled Log Volume
- Applications logging at verbose levels (DEBUG, TRACE)
- High-frequency log generation without rate limiting
- Large log messages or stack traces
- No log rotation or size limits
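One way to address high-frequency logging at the source is a rate limiter inside the application. The sketch below is only an illustration, assuming a Python application that uses the standard `logging` module. `RateLimitFilter` is a hypothetical name, and the token-bucket parameters are arbitrary.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Allow bursts of up to `burst` records, refilled at `rate` records per second."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        super().__init__()
        self.rate = rate
        self.burst = burst
        self.clock = clock  # Injectable so the behaviour can be tested.
        self.tokens = float(burst)
        self.last = clock()

    def filter(self, record: logging.LogRecord) -> bool:
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # Bucket is empty: drop the record.


logger = logging.getLogger("rate-limited-example")
logger.addFilter(RateLimitFilter(rate=50.0, burst=100))
```

Dropped records are lost entirely, so a limiter like this suits noisy debug output better than error reporting.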
2. Inefficient Log Handling
- Multiple processes reading the same log files
- Lack of centralized logging leading to local accumulation
- Poor log rotation policies
- Insufficient disk space allocation for logs
3. Container Runtime Configuration
- Default log drivers without size limits
- Missing log rotation configuration
- Inadequate garbage collection policies
Kubernetes-Native Solutions
1. Pod-Level Log Management
Container Log Configuration:
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-limits
spec:
  containers:
  - name: app
    image: myapp:latest
    env:
    - name: LOG_LEVEL
      value: "INFO" # Reduce log verbosity
    - name: LOG_FORMAT
      value: "structured" # Efficient log format
Key logging environment variables (conventions that take effect only if the application reads them):
- LOG_LEVEL: Controls application verbosity
- LOG_FORMAT: Structured logs (JSON) are more efficient to process
- Application-specific configuration to limit log output
Ephemeral Storage Limits:
spec:
  containers:
  - name: app
    resources:
      limits:
        ephemeral-storage: "2Gi" # Cap on ephemeral storage, including container logs; exceeding it gets the pod evicted
      requests:
        ephemeral-storage: "1Gi" # Amount the scheduler reserves on the node
2. Node-Level Configuration
kubelet Log Rotation Settings:
# kubelet configuration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "10Mi" # Maximum size per log file
containerLogMaxFiles: 5 # Maximum number of log files
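The two kubelet settings above bound how much disk each container's logs can occupy: roughly the maximum file size times the number of files. The sketch below estimates a node-wide upper bound. The helper names and the figure of 100 containers per node are assumptions for illustration, and the parser handles only the Ki, Mi, and Gi suffixes.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_quantity(quantity: str) -> int:
    """Convert a binary-suffixed quantity such as '10Mi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # Plain byte count, no suffix.


def max_log_bytes(max_size: str, max_files: int, containers: int) -> int:
    """Approximate upper bound on log bytes retained on one node."""
    return parse_quantity(max_size) * max_files * containers


# Assumed: 100 containers on the node, using the settings shown above.
total = max_log_bytes("10Mi", 5, containers=100)
print(f"{total / 2**30:.2f} GiB")  # prints "4.88 GiB"
```

The result is an estimate only: the active file can grow past the limit between rotation checks, so some headroom is worth planning for.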
Container Runtime Configuration (Docker, /etc/docker/daemon.json):
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
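Docker's json-file driver does not cap log size unless `max-size` is set, so it can be worth checking a node's daemon.json for missing rotation options. The following is a small sketch of such a check. The function name is hypothetical, and it inspects only the two options shown above.

```python
import json
from typing import List


def missing_log_limits(daemon_json: str) -> List[str]:
    """Return the rotation options that are absent from a daemon.json document."""
    log_opts = json.loads(daemon_json).get("log-opts", {})
    return [key for key in ("max-size", "max-file") if key not in log_opts]


configured = """
{
  "log-driver": "json-file",
  "log-opts": {"max-size": "10m", "max-file": "3"}
}
"""
print(missing_log_limits(configured))  # prints []
print(missing_log_limits("{}"))        # prints ['max-size', 'max-file']
```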
Centralized Logging Architecture
1. Log Aggregation Strategy
Application Pods → Node Log Files → Log Shipper (DaemonSet) → Centralized Storage
Benefits of centralized logging:
- Reduced local disk usage
- Centralized search and analysis
- Retention policy management
- Separation of concerns
2. DaemonSet-Based Log Collection
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
  namespace: logging
spec:
  selector:
    matchLabels:
      name: log-collector
  template:
    metadata:
      labels:
        name: log-collector # Must match spec.selector.matchLabels
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENTD_SYSTEMD_CONF
          value: "disable"
        resources:
          limits:
            memory: 200Mi # Limit collector resource usage
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
DaemonSet design considerations:
- Resource limits to prevent the collector from overwhelming nodes
- Read-only mounts, wherever the collector does not need to write, for security
- Efficient log parsing and filtering
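To illustrate the parsing and filtering step, the sketch below parses one line in the CRI log format written by runtimes such as containerd and CRI-O (timestamp, stream, a P or F tag for partial or full lines, then the message) and decides whether to ship it. This is an assumption about the runtime: Docker's json-file logs under /var/lib/docker/containers use a JSON format instead. The function names and the substring-based drop rule are hypothetical simplifications.

```python
from typing import NamedTuple, Optional, Tuple


class CriLogLine(NamedTuple):
    timestamp: str
    stream: str    # "stdout" or "stderr"
    partial: bool  # True when the tag is "P" (the line continues)
    message: str


def parse_cri_line(line: str) -> Optional[CriLogLine]:
    """Parse '<timestamp> <stream> <P|F> <message>'; return None if malformed."""
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) < 3 or parts[2] not in ("P", "F"):
        return None
    message = parts[3] if len(parts) == 4 else ""
    return CriLogLine(parts[0], parts[1], parts[2] == "P", message)


def should_ship(entry: CriLogLine,
                drop_markers: Tuple[str, ...] = ("DEBUG", "TRACE")) -> bool:
    """Drop entries whose message contains a low-severity marker."""
    return not any(marker in entry.message for marker in drop_markers)


entry = parse_cri_line("2016-10-06T00:17:09.669794202Z stdout F DEBUG cache miss\n")
print(entry, should_ship(entry))
```

Filtering at the collector reduces network and storage load downstream, but the line has already been written to the node's disk by then, so it complements rather than replaces lower verbosity in the application.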
Advanced Log Management Patterns
1. Structured Logging Implementation
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-logging-config
data:
  log4j2.xml: |
Benefits of structured logging:
- Efficient parsing and indexing
- Reduced storage requirements
- Better query performance
- Consistent log format across services
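As an illustration of structured logging, here is a minimal JSON formatter. It uses Python's standard `logging` module rather than the Log4j 2 configuration referenced above, and the field names (`time`, `level`, `logger`, `message`) are an assumed schema, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the stack trace inside the same JSON object.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("structured-example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order processed: %s", "A-1001")
```

Emitting one JSON object per line lets the collector parse records without multi-line heuristics, which matters most for stack traces.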
2. Application-Level Log Sampling
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  application.yml: |
    logging:
      level:
        com.company.app: INFO
        org.springframework: WARN
      pattern:
        console: "%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n"
      # Application-defined keys, not built-in Spring Boot properties;
      # the application must implement the sampling itself.
      sampling:
        enabled: true
        rate: 100 # Sample 1 in 100 debug logs
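The sampling setting above only has an effect if the application implements it. The following sketch shows what "1 in 100 debug logs" could look like, written as a Python `logging` filter for illustration. `DebugSampler` is a hypothetical name, and counter-based sampling is one possible choice (random sampling is another).

```python
import logging


class DebugSampler(logging.Filter):
    """Pass every record above DEBUG, and one in every `rate` DEBUG records."""

    def __init__(self, rate: int):
        super().__init__()
        self.rate = rate
        self.seen = 0  # Number of DEBUG-or-lower records observed so far.

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True
        self.seen += 1
        # Keep the 1st, (rate+1)th, (2*rate+1)th ... low-level record.
        return (self.seen - 1) % self.rate == 0


logger = logging.getLogger("sampled-example")
logger.setLevel(logging.DEBUG)
logger.addFilter(DebugSampler(rate=100))
```

A deterministic counter keeps the retained volume predictable, at the cost of always sampling the same positions in a repeating pattern.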
Storage Optimization Strategies
1. Node Storage Management
Automated Cleanup CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: log-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          # Each run schedules one pod on one node, so only that node is cleaned;
          # covering every node needs a per-node mechanism such as a DaemonSet.
          containers:
          - name: cleanup
            image: alpine:latest
            command:
            - /bin/sh
            - -c
            - |
              # Clean up old container logs
              find /host/var/log/containers -name "*.log" -mtime +7 -delete
              # Clean up old pod logs
              find /host/var/log/pods -name "*.log" -mtime +7 -delete
              # Clean up Docker container logs
              find /host/var/lib/docker/containers -name "*.log" -mtime +7 -delete
            volumeMounts:
            - name: host-var
              mountPath: /host/var
            - name: host-var-lib
              mountPath: /host/var/lib
            securityContext:
              privileged: true
          volumes:
          - name: host-var
            hostPath:
              path: /var
          - name: host-var-lib
            hostPath:
              path: /var/lib
          restartPolicy: OnFailure
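Before enabling deletion, it can help to list what a cleanup pass would remove. The sketch below is a dry-run counterpart to the `find ... -mtime +7` commands above. The function name is hypothetical, and the age boundary differs slightly from `find`, which rounds file ages down to whole days.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def stale_logs(root: str, max_age_days: float,
               now: Optional[float] = None) -> List[Path]:
    """List regular *.log files under `root` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        path
        for path in Path(root).rglob("*.log")
        # Skip symlinks so each log file is reported once, at its real location.
        if path.is_file() and not path.is_symlink() and path.stat().st_mtime < cutoff
    )


if os.path.isdir("/var/log/pods"):
    for path in stale_logs("/var/log/pods", max_age_days=7):
        print(path)  # Report only; nothing is deleted.
```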
2. Storage Class Optimization
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ephemeral
provisioner: ebs.csi.aws.com # EBS CSI driver; the in-tree kubernetes.io/aws-ebs provisioner is deprecated
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
Monitoring and Alerting
1. Disk Usage Monitoring
Key metrics to monitor:
- Node disk utilization by mount point
- Container log file sizes and growth rates
- Log rotation effectiveness
- I/O wait times and disk pressure
2. Log-Specific Alerts
# Prometheus alert rules
groups:
- name: logging.rules
  rules:
  - alert: HighLogVolume
    # Counts all filesystem writes attributed to the container, not only log output
    expr: increase(container_fs_writes_bytes_total[5m]) > 100000000 # 100MB in 5min
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High log volume detected on {{ $labels.instance }}"
  - alert: DiskSpaceForLogs
    # Matches only when /var/log is a separate mount point
    expr: (node_filesystem_avail_bytes{mountpoint="/var/log"} / node_filesystem_size_bytes{mountpoint="/var/log"}) < 0.1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space for logs on {{ $labels.instance }}"
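The DiskSpaceForLogs rule compares available bytes to total bytes on a filesystem. The same ratio can be checked locally on a node, which is handy when validating a threshold. This is a sketch for Unix-like systems only. The function names are hypothetical, and the 0.1 default simply mirrors the alert above.

```python
import os


def free_ratio(path: str) -> float:
    """Fraction of the filesystem at `path` that is available to unprivileged users."""
    stats = os.statvfs(path)
    # Block size cancels out, so the block counts can be divided directly.
    return stats.f_bavail / stats.f_blocks


def is_low_on_space(ratio: float, threshold: float = 0.1) -> bool:
    """Mirror the alert condition: fire when less than `threshold` is free."""
    return ratio < threshold


ratio = free_ratio("/")
print(f"free: {ratio:.1%}, alert: {is_low_on_space(ratio)}")
```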
Best Practices for Production
1. Log Lifecycle Management
- Define clear retention policies
- Implement automated cleanup procedures
- Regular capacity planning and monitoring
- Cost optimization through appropriate storage tiers
2. Application Design
- Implement log sampling for high-volume debug logs
- Use appropriate log levels for different environments
- Structured logging for efficient processing
- Error aggregation to reduce duplicate log entries
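Error aggregation can be approximated inside the application by suppressing repeats of the same message within a time window. The sketch below shows that idea as a Python `logging` filter. `DuplicateSuppressor` is a hypothetical name, and the cache of seen messages is unbounded, which a production version would need to cap.

```python
import logging
import time


class DuplicateSuppressor(logging.Filter):
    """Drop records that repeat the same message template within `window_seconds`."""

    def __init__(self, window_seconds: float, clock=time.monotonic):
        super().__init__()
        self.window = window_seconds
        self.clock = clock  # Injectable so the behaviour can be tested.
        self.last_emitted = {}

    def filter(self, record: logging.LogRecord) -> bool:
        # Key on the unformatted template so differing arguments still group together.
        key = (record.name, record.levelno, record.msg)
        now = self.clock()
        previous = self.last_emitted.get(key)
        if previous is not None and now - previous < self.window:
            return False
        self.last_emitted[key] = now
        return True


logger = logging.getLogger("dedup-example")
logger.addFilter(DuplicateSuppressor(window_seconds=60))
```

Because the window restarts only when a record is emitted, a persistent error still appears once per window rather than disappearing entirely.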
3. Operational Excellence
- Regular log infrastructure health checks
- Disaster recovery procedures for log data
- Performance testing of logging infrastructure
- Integration with incident response procedures