How do you monitor and operate Kafka in production?

Answer

A production Kafka monitoring and operations setup usually covers the following areas.

JMX metrics: Kafka brokers expose several hundred metrics via JMX. The critical ones are UnderReplicatedPartitions (should be 0), OfflinePartitionsCount (should be 0), ActiveControllerCount (should sum to exactly 1 across the cluster), BytesInPerSec and BytesOutPerSec (throughput), and RequestQueueSize (sustained growth indicates an overloaded broker).

Prometheus + Grafana: run the Prometheus JMX Exporter on each broker to expose the JMX metrics over HTTP, have Prometheus scrape them, and visualize them with Kafka-specific Grafana dashboards.

Consumer lag: use kafka-consumer-groups.sh for ad hoc checks, and Burrow or Kafka Lag Exporter for continuous monitoring and alerting.

Operational tools: kafka-topics.sh (manage topics), kafka-configs.sh (alter configs), and kafka-reassign-partitions.sh (manual partition reassignment).

Automated operations: Cruise Control for rebalancing, and the Strimzi operator for Kubernetes deployments.

Runbooks: prepare documented procedures for under-replicated partitions, consumer lag alerts, a full disk (add storage or reduce the retention period), broker failure recovery, and rolling restarts for upgrades.
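The expected values for the critical metrics can be turned into a simple health check. The following is a minimal Python sketch, not a definitive implementation: it assumes the metric values have already been collected per broker (for example from Prometheus), and the names BrokerSnapshot, evaluate_cluster, and the max_request_queue threshold of 100 are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BrokerSnapshot:
    """One broker's metric values at a point in time (hypothetical container)."""
    broker_id: int
    under_replicated_partitions: int
    offline_partitions_count: int
    active_controller_count: int
    request_queue_size: int


def evaluate_cluster(snapshots: List[BrokerSnapshot],
                     max_request_queue: int = 100) -> List[str]:
    """Return a list of alert messages; an empty list means all checks passed.

    max_request_queue is an assumed alerting threshold, not a Kafka default.
    """
    alerts = []

    # ActiveControllerCount is 1 on the controller and 0 elsewhere,
    # so the sum across the cluster should be exactly 1.
    controllers = sum(s.active_controller_count for s in snapshots)
    if controllers != 1:
        alerts.append(
            f"cluster: expected exactly 1 active controller, found {controllers}"
        )

    for s in snapshots:
        if s.under_replicated_partitions > 0:
            alerts.append(
                f"broker {s.broker_id}: "
                f"{s.under_replicated_partitions} under-replicated partitions"
            )
        if s.offline_partitions_count > 0:
            alerts.append(
                f"broker {s.broker_id}: "
                f"{s.offline_partitions_count} offline partitions"
            )
        if s.request_queue_size > max_request_queue:
            alerts.append(
                f"broker {s.broker_id}: request queue size "
                f"{s.request_queue_size} exceeds {max_request_queue}"
            )
    return alerts


if __name__ == "__main__":
    cluster = [
        BrokerSnapshot(1, 0, 0, 1, 3),
        BrokerSnapshot(2, 2, 0, 0, 7),
    ]
    for message in evaluate_cluster(cluster):
        print(message)
```

In practice these rules would more likely live in Prometheus alerting rules than in application code. The sketch only makes the thresholds explicit so that each alert can be matched to a runbook.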