Apache Kafka default checks and policies
Summary of Apache Kafka default checks and policies
ServiceNow's Agent Client Collector offers a comprehensive set of default policies and checks for monitoring the health of Apache Kafka deployments on both Windows and Linux platforms. These checks cover Kafka Zookeeper status, topic replication, leader assignments, partition counts, broker status, and key metrics from Kafka brokers and Zookeeper instances. They help ensure Kafka clusters operate reliably by identifying critical issues early.
Kafka Topic and Broker Health Checks
- Kafka Zookeeper Status Check: Detects if the Kafka Zookeeper service is down and raises a critical event.
- Topic Replicas Check: Identifies topics with partitions having unknown replicas, supporting topic inclusion/exclusion filters and detailed output.
- Replication Factor Check: Alerts when any topic’s replication factor deviates from the expected value, with options for detailed reporting and topic filtering.
- Topic Leader Check: Detects partitions with unknown leaders or unpreferred replicas acting as leaders, allowing detailed output and filtering.
- Topic Partitions Check: Raises alerts if topics have fewer partitions than the defined minimum, with inclusion/exclusion support.
- Kafka Broker Status Check: Monitors if the Kafka broker process is running, raising critical events when down.
Kafka Broker and Zookeeper Metrics Collection
- Broker Metrics: Collects key Kafka broker metrics via JMX, such as request rates and leader election statistics, configurable by Java path and JMX port.
- Zookeeper Metrics: Gathers Zookeeper performance metrics from the admin server port, including outstanding requests, latency, connection counts, and file descriptors.
Practical Implementation Notes
- All checks use the commonchecks utility with specific flags to customize monitoring parameters such as ports, topic filters (using wildcards), and expected values.
- Include and exclude lists for topics must be enclosed in double quotes and support comma-separated names with wildcards.
- Default ports are 2181 for Zookeeper, 9092 for Kafka broker, 9999 for JMX, and 8085 for Zookeeper admin server but can be customized per deployment.
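
The include and exclude behavior described above can be sketched in Python. This is an illustration only, not the commonchecks implementation: the function name `select_topics` is hypothetical, and the sketch assumes that wildcards behave like shell-style globs, that matching is case-sensitive, and that an exclude match overrides an include match.

```python
from fnmatch import fnmatchcase


def select_topics(topics, include="*", exclude=""):
    """Return topics that match any include pattern and no exclude pattern.

    Both lists are comma-separated strings, mirroring the quoted values
    passed on the command line. Glob semantics are an assumption.
    """
    include_patterns = [p.strip() for p in include.split(",") if p.strip()]
    exclude_patterns = [p.strip() for p in exclude.split(",") if p.strip()]
    selected = []
    for topic in topics:
        included = any(fnmatchcase(topic, p) for p in include_patterns)
        excluded = any(fnmatchcase(topic, p) for p in exclude_patterns)
        if included and not excluded:
            selected.append(topic)
    return selected


topics = ["accMetrics", "TestTopic", "testTopic", "orders"]
print(select_topics(topics, include="accMetrics,*Topic", exclude="testTopic"))
# prints ['accMetrics', 'TestTopic']
```

Running a filter like this against a known topic list is a quick way to sanity-check a pattern before placing it in a check command, although the agent's actual matching rules may differ.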
Benefits for ServiceNow Customers
By leveraging these predefined policies and checks, ServiceNow customers can proactively monitor Apache Kafka clusters to detect issues like broker downtime, replication inconsistencies, and partition misconfigurations. This facilitates early troubleshooting, maintains data integrity, and ensures high availability of Kafka services integrated with their IT operations management workflows.
Agent Client Collector provides the following policies for Apache Kafka health monitoring. Policies come with the checks specified in the indicated table. Policies and checks are available for both Windows and Linux.
| Check | Description | Usage | Output |
|---|---|---|---|
| kafka.check-zookeeper-status | Raises a critical event if the hosted Kafka Zookeeper is down. | commonchecks check-kafka-zk-status [flags] Where the flags are: -p, --port = Zookeeper Port (default "2181"). | Kafka Zookeeper Status OK: Kafka Zookeeper is Up! |
| kafka.check-topic-replicas | Raises a critical event if any topic has partitions with unknown replicas. | commonchecks check-kafka-replicas [flags] | <topic> has partitions with unknown replicas. Unknown replicas are: {"0":["0"],"1":["0"],"2":["0"]}. <topic> has partitions with unknown replicas. Unknown replicas are: {"0":["0"]}. |
| kafka.check-topic-replication-factor | Raises a critical event if the replication factor of at least one topic is above or below the provided replication factor parameter. | commonchecks check-kafka-rf [flags] | TestTopic has replication factor 1, which is less than expected: 2. accMetrics has replication factor 1, which is less than expected: 2. |
| kafka.check-topic-leader | Raises a critical event if any topic has partitions with unknown leaders or an unpreferred replica as leader. | commonchecks check-kafka-leader [flags] | <topic> contains, partitions with unpreferred replica as leader.(partitions with unpreferred replicas are [0]). <topic> contains, partitions with unpreferred replica as leader.(partitions with unpreferred replicas are [0]). |
| kafka.check-topic-partitions | Raises a critical event if the number of partitions for a topic is less than the min_partitions parameter. | commonchecks check-kafka-partitions [flags] Usage example 1: | <topic> has 1 partitions, expected at least 3. <topic> has 1 partitions, expected at least 3. <topic> has 1 partitions, expected at least 3. |
| | | Usage example 2: commonchecks check-kafka-partitions -H localhost -p 2181 -P 3 -i "accMetrics,*Topic" -e "testTopic" | <topic> has 1 partitions, expected at least 3. <topic> has 1 partitions, expected at least 3. |
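
The partition check output shown above is a sequence of sentences, one per offending topic. A minimal Python sketch for turning that text into structured findings follows. It assumes the message wording is exactly as in the Output column and that topic names contain no whitespace; the names `PARTITION_MSG` and `parse_partition_findings` are hypothetical helpers, not part of the agent.

```python
import re

# Pattern based on the message format shown in the Output column (assumed stable).
PARTITION_MSG = re.compile(
    r"(?P<topic>\S+) has (?P<actual>\d+) partitions, "
    r"expected at least (?P<minimum>\d+)\."
)


def parse_partition_findings(output):
    """Extract one finding per '<topic> has N partitions...' message."""
    return [
        {
            "topic": m["topic"],
            "actual": int(m["actual"]),
            "minimum": int(m["minimum"]),
        }
        for m in PARTITION_MSG.finditer(output)
    ]


sample = (
    "TestTopic has 1 partitions, expected at least 3. "
    "accMetrics has 1 partitions, expected at least 3."
)
for finding in parse_partition_findings(sample):
    missing = finding["minimum"] - finding["actual"]
    print(f"{finding['topic']}: {missing} more partition(s) needed")
```

The same approach could be adapted to the replication factor and leader messages by swapping in a pattern for their wording.
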
| Check | Description | Usage | Output |
|---|---|---|---|
| kafka.check-broker-status | Raises a critical event if the Kafka Broker on the host is down. | commonchecks check-kafka-broker-status [flags] Where the flags are: -p, --port = Kafka Broker port (default "9092"). | Kafka Broker Status OK: Kafka Broker ubuntu20:9092 is Up! |
| Check | Description | Usage | Output |
|---|---|---|---|
| kafka.metrics.broker | Collects Kafka Broker Metrics from the host. | commonchecks metric-kafka-broker [flags] | hostname.Kafka.Broker.ReplicaManager.IsrExpandsPerSec.OneMinuteRate 0.000 hostname.Kafka.Broker.DelayedOperationPurgatory.PurgatorySize.Fetch.Value 627.000 hostname.Kafka.Broker.ControllerStats.UncleanLeaderElectionsPerSec.OneMinuteRate 0.000 hostname.Kafka.Broker.RequestMetrics.RequestsPerSec.Produce.OneMinuteRate 0.000 |
| Check | Description | Usage | Output |
|---|---|---|---|
| kafka.metrics.zookeeper | Collects Zookeeper Metrics from the host. | commonchecks metric-kafka-zookeeper [flags] | hostname.Kafka.Zookeeper.outstanding_requests 2.000 1648183249 hostname.Kafka.Zookeeper.avg_latency 1.05 1648183249 hostname.Kafka.Zookeeper.num_alive_connections 1.000 1648183249 hostname.Kafka.Zookeeper.open_file_descriptor_count 124.000 1648183249 |
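
The broker and Zookeeper metric outputs above follow a plain-text layout of a dotted metric name, a numeric value, and, for Zookeeper, what appears to be a Unix timestamp. A minimal Python sketch for parsing such output is shown below. It assumes the raw tool output places one metric per line (the table cells above flatten them onto a single line); `Metric` and `parse_metric_lines` are hypothetical names used for illustration.

```python
from typing import NamedTuple, Optional


class Metric(NamedTuple):
    name: str
    value: float
    timestamp: Optional[int]  # None when the line carries no timestamp


def parse_metric_lines(text):
    """Parse 'name value [timestamp]' lines, skipping anything malformed."""
    metrics = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) not in (2, 3):
            continue
        try:
            value = float(fields[1])
            timestamp = int(fields[2]) if len(fields) == 3 else None
        except ValueError:
            continue  # non-numeric value or timestamp
        metrics.append(Metric(fields[0], value, timestamp))
    return metrics


sample = """\
hostname.Kafka.Zookeeper.outstanding_requests 2.000 1648183249
hostname.Kafka.Zookeeper.avg_latency 1.05 1648183249
hostname.Kafka.Broker.ControllerStats.UncleanLeaderElectionsPerSec.OneMinuteRate 0.000
"""
for metric in parse_metric_lines(sample):
    short_name = metric.name.split(".", 1)[1]  # drop the leading hostname
    print(short_name, metric.value, metric.timestamp)
```

Treating the timestamp as optional lets one parser handle both the broker and Zookeeper samples shown in the tables.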