Advanced High Availability transfer with Hermes

  • Release version: Zurich
  • Updated July 31, 2025

    Learn how messages are produced and consumed in Hermes during normal operation, Advanced High Availability (AHA) transfer, and failover scenarios.

    ServiceNow production instances operate in geographically separate data centers. Each data center is paired with another data center to provide redundancy with failover support. One data center is designated as the active side and the other as standby. For example, your instance might be configured in the DC1 and DC2 data centers, with DC1 as the active side.

    When StreamConnect, LES, or IDR is activated, a Hermes Kafka cluster is provisioned in each of the two data centers. To ensure high availability and provide failover support, Hermes runs these two Kafka clusters as an active/active pair.

    Near cluster
    The Hermes Kafka cluster located in the same data center as the instance is the near cluster.
    Far cluster
    The Hermes Kafka cluster running in the other data center is the far cluster. The designation is relative to each instance. For the instance in the other data center, the roles are reversed: its near cluster is the one in its own data center, and its far cluster is the one in the paired data center.
    Figure 1. Near and far Hermes Kafka clusters
    Near and far Hermes Kafka clusters are relative to the instance.
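    Because "near" and "far" depend only on where an instance runs, the relationship can be modeled in a few lines. The following Python sketch is illustrative only; `DataCenterPair` and `near_and_far` are hypothetical names, not part of any ServiceNow API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCenterPair:
    """A pair of data centers that back each other up (for example, DC1 and DC2)."""
    first: str
    second: str

    def other(self, dc: str) -> str:
        """Return the data center paired with dc."""
        if dc == self.first:
            return self.second
        if dc == self.second:
            return self.first
        raise ValueError(f"{dc} is not part of this pair")


def near_and_far(pair: DataCenterPair, instance_dc: str) -> tuple[str, str]:
    """Return (near, far) Hermes cluster locations relative to an instance.

    The near cluster is in the instance's own data center; the far cluster
    is in the paired data center.
    """
    return instance_dc, pair.other(instance_dc)


pair = DataCenterPair("DC1", "DC2")
print(near_and_far(pair, "DC1"))  # ('DC1', 'DC2')
print(near_and_far(pair, "DC2"))  # ('DC2', 'DC1')
```

    The same pair yields opposite answers for the two instances, which is the point of the figure: neither cluster is inherently near or far.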

    Normal operation

    Under normal operating conditions, the instance or an external client produces messages to the near Hermes cluster. For example, if your instance is running in the DC1 data center, messages are produced to the near Hermes cluster in DC1. An external client produces messages to the cluster through a port in the 400x range, as defined in the producer bootstrap URL.

    When a topic is created in Hermes, it is created in both clusters. Two consumer processes are used, one for each cluster, but under normal circumstances only one of them is actively consuming messages. Each consumer must use a distinct bootstrap URL: one in the 410x range and the other in the 420x range.
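    The port conventions above can be checked before a client is deployed. The following Python sketch assumes that "400x" means ports 4000 to 4009, "410x" means 4100 to 4109, and "420x" means 4200 to 4209; the host name `hermes.example.com` and the functions `port_of`, `range_of`, and `check_client_config` are hypothetical placeholders, not real ServiceNow endpoints or APIs.

```python
# Assumption: "400x" denotes ports 4000-4009, "410x" ports 4100-4109,
# and "420x" ports 4200-4209. Host names used below are placeholders.
PORT_RANGES = {
    "producer (400x)": range(4000, 4010),
    "consumer (410x)": range(4100, 4110),
    "consumer (420x)": range(4200, 4210),
}


def port_of(bootstrap_url: str) -> int:
    """Extract the port from a single host:port bootstrap entry."""
    host, sep, port = bootstrap_url.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected host:port, got {bootstrap_url!r}")
    return int(port)


def range_of(bootstrap_url: str) -> str:
    """Name the Hermes port range that a bootstrap entry falls into."""
    port = port_of(bootstrap_url)
    for name, ports in PORT_RANGES.items():
        if port in ports:
            return name
    raise ValueError(f"port {port} is outside the known Hermes ranges")


def check_client_config(producer_url: str, consumer_urls: list[str]) -> None:
    """Raise ValueError unless the producer uses 400x and the two consumers use 410x and 420x."""
    if range_of(producer_url) != "producer (400x)":
        raise ValueError("producer bootstrap URL must use a 400x port")
    consumer_ranges = sorted(range_of(url) for url in consumer_urls)
    if consumer_ranges != ["consumer (410x)", "consumer (420x)"]:
        raise ValueError("expected two consumers: one on a 410x port and one on a 420x port")


check_client_config(
    "hermes.example.com:4000",
    ["hermes.example.com:4100", "hermes.example.com:4200"],
)  # passes silently
```

    The sketch handles one `host:port` entry per URL; a real bootstrap setting may list several comma-separated hosts, which would need to be split first.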

    Failover process

    Under the following circumstances, the cluster where messages are produced can change.

    Instance Advanced High Availability (AHA) Transfer
    When an instance undergoes an AHA transfer, the standby instance is set to active, and the previously active instance is set to standby. The instance then switches to using the Hermes cluster on the newly active side.

    For example, if the instance is running in the DC1 and DC2 data centers with DC1 as the current active side and an AHA transfer occurs, the instance switches to using the Hermes cluster in DC2.
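    An AHA transfer amounts to swapping the active and standby roles, after which production follows the active side. The following Python sketch models only that role swap; `InstancePair`, `producer_cluster`, and `aha_transfer` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstancePair:
    """Hypothetical model of an active/standby instance pair."""
    active: str   # data center of the active instance, for example "DC1"
    standby: str  # data center of the standby instance, for example "DC2"

    def producer_cluster(self) -> str:
        """The Hermes cluster the instance produces to under normal operation."""
        return self.active

    def aha_transfer(self) -> None:
        """Swap roles: the standby side becomes active, the active side becomes standby."""
        self.active, self.standby = self.standby, self.active


pair = InstancePair(active="DC1", standby="DC2")
print(pair.producer_cluster())  # DC1
pair.aha_transfer()
print(pair.producer_cluster())  # DC2
```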

    Hermes failover
    The instance actively monitors the health of the near Hermes cluster. If it detects an issue with that cluster, it enters failover mode. Until the instance detects that the near cluster has recovered, it produces messages to the far cluster, which is the Hermes cluster in the standby data center.

    For example, if the instance is running in the DC1 and DC2 data centers with DC1 as the active side, it uses the Hermes cluster in DC1. If it detects an issue with the DC1 Hermes cluster, it enters Hermes failover mode and produces messages to the DC2 cluster until the DC1 cluster is healthy again. After recovery, it resumes using the Hermes cluster in DC1.
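    The failover behavior described above is essentially a two-state switch driven by health checks. The following Python sketch illustrates it under the assumption of a simple boolean health signal; `HermesRouter`, `on_health_check`, and `target_cluster` are hypothetical names, and the actual health checks performed by the instance are not documented here.

```python
class HermesRouter:
    """Hypothetical model of how an instance chooses a Hermes cluster.

    near: the cluster in the active instance's data center.
    far:  the cluster in the standby data center.
    """

    def __init__(self, near: str, far: str) -> None:
        self.near = near
        self.far = far
        self.failover = False

    def on_health_check(self, near_healthy: bool) -> None:
        # Enter failover mode while the near cluster is unhealthy;
        # leave it as soon as the near cluster is reported healthy again.
        self.failover = not near_healthy

    def target_cluster(self) -> str:
        """Cluster that new messages are produced to."""
        return self.far if self.failover else self.near


router = HermesRouter(near="DC1", far="DC2")
observed = []
for healthy in [True, False, False, True]:
    router.on_health_check(healthy)
    observed.append(router.target_cluster())
print(observed)  # ['DC1', 'DC2', 'DC2', 'DC1']
```

    This sketch switches on a single check. A production implementation would plausibly require several consecutive failed or successful checks before switching, to avoid flapping between clusters.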

    When failover occurs, if a consumer is lagging, both consumers can consume messages at the same time until the lagging consumer catches up. For example, if the current active side is DC1, the consumer reading from the DC1 cluster is actively processing messages. If a problem in the DC1 cluster causes failover to the DC2 cluster, the consumer reading from the DC2 cluster starts processing the newly produced messages. If the DC1 consumer was lagging, both consumers continue to consume messages until the DC1 consumer catches up.
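    The overlap period can be pictured with a toy simulation. The following Python sketch is not a model of Kafka internals: it assumes each consumer processes one message per tick and ignores partitions and offsets. The function name `drain` is hypothetical.

```python
from collections import deque


def drain(dc1_backlog: list[str], dc2_new: list[str]) -> tuple[list[tuple[str, ...]], list[str]]:
    """Simulate both Hermes consumers after a failover from DC1 to DC2.

    dc1_backlog: messages already in the DC1 cluster that its lagging
                 consumer has not processed yet.
    dc2_new:     messages produced to the DC2 cluster after failover.
    Each tick, every consumer with pending messages processes one message.
    Returns (consumers active in each tick, overall processing order).
    """
    queues = {"DC1": deque(dc1_backlog), "DC2": deque(dc2_new)}
    ticks: list[tuple[str, ...]] = []
    order: list[str] = []
    while any(queues.values()):
        active = []
        for name, queue in queues.items():
            if queue:
                order.append(queue.popleft())
                active.append(name)
        ticks.append(tuple(active))
    return ticks, order


ticks, order = drain(["m1", "m2"], ["m3", "m4", "m5"])
print(ticks)  # [('DC1', 'DC2'), ('DC1', 'DC2'), ('DC2',)]
print(order)  # ['m1', 'm3', 'm2', 'm4', 'm5']
```

    Both consumers are active until the DC1 backlog is empty, and m3 is processed before m2 even though it was produced later. This interleaving is why ordering cannot be taken for granted during failover.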

    Maintaining order

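    If message order must be maintained, the consumer application is responsible for maintaining it. The global ordering of messages also depends on how the Kafka topic is defined: Kafka guarantees ordering only within a single partition, so a topic with more than one partition has no global order. During failover, messages for the same topic can also be spread across both clusters and processed by both consumers at the same time.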
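    One common consumer-side approach is a reorder buffer that holds messages until the next expected one arrives. The following Python sketch assumes that the producer stamps each message with a per-key sequence number starting at 0; that sequence number and the `ReorderBuffer` class are hypothetical and are not provided by Hermes.

```python
class ReorderBuffer:
    """Release messages for one key in sequence order, regardless of which
    cluster (near or far) delivered them.

    Assumes the producer stamps each message with a per-key sequence number
    starting at 0 (a hypothetical convention, not a Hermes feature).
    """

    def __init__(self) -> None:
        self.next_seq = 0
        self.pending: dict[int, str] = {}

    def add(self, seq: int, payload: str) -> list[str]:
        """Buffer a message and return every message that is now in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate: already released or already buffered
        self.pending[seq] = payload
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released


# Messages 2 and 3 arrive from the DC2 consumer before the lagging
# DC1 consumer delivers 0 and 1.
buf = ReorderBuffer()
out = []
for seq, payload in [(2, "c"), (0, "a"), (3, "d"), (1, "b")]:
    out.extend(buf.add(seq, payload))
print(out)  # ['a', 'b', 'c', 'd']
```

    Because the two consumers run as separate processes, a real application would need to share this state (for example, in a database) or route both streams through a single process; the sketch assumes a single process.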