How do you monitor and operate Kafka in production?
Answer
A production Kafka monitoring and operations setup typically covers the following areas:
- JMX metrics: Kafka exposes 700+ metrics via JMX. The critical ones are UnderReplicatedPartitions (should be 0), OfflinePartitionsCount (should be 0), ActiveControllerCount (should sum to exactly 1 across the cluster), BytesInPerSec and BytesOutPerSec (throughput), and RequestQueueSize (a persistently high value indicates an overloaded broker).
- Prometheus + Grafana: run the JMX Exporter on each broker so that Prometheus can scrape the JMX metrics, and visualize them with Kafka-specific Grafana dashboards.
- Consumer lag: check it ad hoc with kafka-consumer-groups.sh, or track it continuously with Burrow or Kafka Lag Exporter.
- Operational tools: kafka-topics.sh (manage topics), kafka-configs.sh (alter configs), kafka-reassign-partitions.sh (manually move partitions between brokers).
- Automated operations: Cruise Control for rebalancing, the Strimzi operator for Kubernetes deployments.
- Runbooks: prepare documented procedures for under-replicated partitions, consumer lag alerts, full disks (add storage or reduce the retention period), broker failure recovery, and rolling restarts for upgrades.
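As a rough illustration of how the metric thresholds and consumer lag above turn into alerts, here is a minimal Python sketch. It assumes the gauge values have already been scraped (for example from Prometheus) into plain dictionaries. The function names, the dictionary shapes, and the alert messages are hypothetical and not part of Kafka or any exporter. Treating a partition with no committed offset as lagging by its full log end offset is also a simplification.

```python
from typing import Dict, List


def check_cluster_health(per_broker: Dict[str, Dict[str, int]]) -> List[str]:
    """Return alert messages for the cluster-level invariants.

    per_broker maps a broker id to its scraped gauge values
    (hypothetical shape, e.g. {"1": {"UnderReplicatedPartitions": 0}}).
    """
    def total(name: str) -> int:
        # Sum a gauge across all brokers; a missing gauge counts as 0.
        return sum(metrics.get(name, 0) for metrics in per_broker.values())

    alerts = []
    under_replicated = total("UnderReplicatedPartitions")
    if under_replicated > 0:
        alerts.append(f"UnderReplicatedPartitions={under_replicated} (expected 0)")

    offline = total("OfflinePartitionsCount")
    if offline > 0:
        alerts.append(f"OfflinePartitionsCount={offline} (expected 0)")

    # Exactly one broker should report itself as the active controller.
    controllers = total("ActiveControllerCount")
    if controllers != 1:
        alerts.append(f"ActiveControllerCount={controllers} (expected exactly 1)")
    return alerts


def consumer_lag(log_end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: log end offset minus committed offset, floored at 0.

    Assumption: a partition without a committed offset is treated as if
    the group had committed offset 0.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }


if __name__ == "__main__":
    sample = {
        "1": {"UnderReplicatedPartitions": 0, "OfflinePartitionsCount": 0,
              "ActiveControllerCount": 1},
        "2": {"UnderReplicatedPartitions": 2, "OfflinePartitionsCount": 0,
              "ActiveControllerCount": 0},
    }
    for alert in check_cluster_health(sample):
        print("ALERT:", alert)
    print("lag per partition:", consumer_lag({0: 100, 1: 50}, {0: 90}))
```

In practice these checks usually live in Prometheus alerting rules or in a lag exporter rather than in application code. The sketch only makes the comparisons explicit.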