Hi ServiceNow Community,
After being part of a few Stream Connect implementations over the past year or two, I thought I would create a community post to share some lessons learnt and some techniques I had to come up with during my projects to fully unleash the power of Stream Connect.
As a reminder, this post is by no means fully comprehensive. Its intent is to be another source of knowledge covering Stream Connect capabilities. My hope is that, together with other sources, this post can assist you during a future Stream Connect implementation. I strongly recommend that you read and become familiar with the existing documentation. I will list some useful links in the Resource Links & References section at the bottom of this post.
Without further ado, let us jump in. 🙂
Please remind me again what Stream Connect is
Stream Connect is a ServiceNow offering that links your Kafka environment to your ServiceNow instance, enabling you to stream data between your instance and your external systems.
Topics and Partitions (Freeway with Lanes)
This part is fun to explain. Imagine a topic as being a channel managed by the Kafka cluster where messages are organized and isolated from other channels. Each topic, in its turn, is divided into partitions for parallelism and scalability. Like in a freeway (topic), the more lanes (partitions) you have, the more cars (messages) can get through at the same time (parallelism).
One important characteristic of a topic partition is its ability to maintain the order of messages: it will always deliver messages in the same order they were received. The same cannot be said of two different partitions, even when they belong to the same topic.
Since we have a recommendation to limit the number of topics to 30, please spend some time planning how the topics and namespaces will be organized. Avoid using a topic for each table, as you would run out of topics quickly. Think instead about how to group tables by process (e.g., records and tables related to Incident); that gives a nice starting point. Also, please reserve topics for future expansion. Don't use all topics for, let's say, ITSM and leave none for other ServiceNow products (e.g., HR, CMDB, CSM, etc.).
Finally, if you have a use case to move dozens of tables (related or not) to a data lake solution on the customer side, it could justify having a separate topic just for that.
How to use the Header section
When producing messages/payloads, you are offered a chance to add a header. As its name suggests, it is an introductory section of the payload. I like to use it to hold the highest level of information about the message.
There is no convention here, nor is there a right or wrong, but I usually standardize the header to contain the application source of the message (e.g., servicenow) and the source base table that produced/triggered the message (e.g., incident). I encode the values/attributes mentioned above as a JSON object, making it easier for downstream consumers/subscribers to parse.
Header example:
{"source": "servicenow", "table": "incident"}
Naturally, you can add more details to your header but try to keep it short. The bigger the header the longer the parsing time. Remember: milliseconds matter when dealing with millions of messages a day.
A well-designed header can also help when you combine multiple tables in the same topic. Having the table name in the header allows consumer applications to quickly parse the header, instead of the usually much bigger payload, to determine whether a given message should be processed or dropped.
Validating the source base table before processing the message is a common use case when multiple tables are combined in the same topic, and some consumer applications are interested in just some of the tables. In other words, if a downstream consumer application only cares about incident details, it would be able to analyze the header and decide to drop a message containing the details of, let us say, an incident task or an incident SLA.
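To make that concrete, here is a minimal sketch (in ServiceNow-style JavaScript) of the check a consumer could run against the header alone before touching the full payload. The function name and the way the raw header string reaches your script are assumptions; they depend entirely on how the consumer application is built. The header format matches the example above.

// Minimal sketch: decide whether a message is relevant based on its header alone.
// How the raw header string reaches this function is an assumption and depends on
// the consumer implementation (e.g., a consumer script or middleware component).
function shouldProcessMessage(headerString) {
    var header;
    try {
        header = JSON.parse(headerString); // e.g. {"source": "servicenow", "table": "incident"}
    } catch (e) {
        return false; // malformed header: drop it (or route it to an error queue)
    }
    // This particular consumer only cares about incidents produced by ServiceNow
    return header.source === 'servicenow' && header.table === 'incident';
}
// Only parse the (much bigger) payload when the header says the message is relevant:
// if (shouldProcessMessage(msgHeader)) { var payload = JSON.parse(msgBody); /* ... */ }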
The use of an attribute containing the application source in the header allows this design to expand and add more producers in the future.
Finally, it goes without saying that if multiple producers are part of the solution, then they all need to use the same header format and have their own unique application identifier.
How does the partition key work?
When producing messages/payloads, you will notice that you have the option to set a partition key. The key is used by Kafka to decide which one of the topic's partitions should be used for that particular message. Without going into too many details, Kafka will hash the key and select the appropriate partition based on the hash and the number of available partitions.
Kafka hash function:
hash(key) % number_of_partitions
An important thing to remember: because the hash function is deterministic, the same key will always put the message in the same partition (unless, of course, new partitions are added).
This feature allows us to use the partition key to put messages in the same partition, and more importantly, in a specific sequence that is guaranteed to be delivered to the consumers.
A practical example is the use of the parent incident sys_id as the partition key for both the parent incident record and its child task incidents. This would ensure the parent incident and its incident tasks would be delivered in the correct order (an incident task would not be delivered before its parent incident is processed).
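As a rough sketch of that idea, the snippet below picks a partition key so that a child incident task always hashes to the same partition as its parent incident. Only the key-selection logic is shown; the actual produce call is omitted because it depends on whether you use the ProducerV2 API or the Producer Step.

// Sketch: choose a partition key so parent and child records land in the same partition.
// 'current' is the GlideRecord being produced (e.g., from a Business Rule or script).
function getPartitionKey(current) {
    if (current.getTableName() === 'incident_task' && !current.incident.nil()) {
        // Child record: key on the parent incident sys_id so it follows the parent
        return current.getValue('incident');
    }
    // Parent (or standalone) record: key on its own sys_id
    return current.getUniqueValue();
}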
Does the order matter? When?
It is an excellent question! The order may or may not matter depending on your use cases. For example, sending Kafka messages to a data lake used solely for reporting may not need the data to be perfectly sequenced every time. If a couple of messages are processed out of order, it will not cause any issues (and it simplifies the mechanisms used to produce such messages).
On the other hand, if the use case includes a real-time consumer application expecting data in a precise sequence, then the order not only matters, but it will be a deciding factor on the design. For example, a child incident task MUST be sent after its parent is sent, so that the consumer application would have processed the parent incident (and potentially created a local representation of the record on their end) by the time the incident tasks are received. Now imagine a bad scenario where a consumer application receives an incident task first and does not know what to do with it (or where to link it to) because it had not received the actual parent incident. Not a good thing to happen, right?
The order in which ServiceNow sometimes saves records into its database also does not help, particularly if you are leveraging Business Rules and Flows to create messages/payloads. For example, catalog variables that belong to a RITM are saved into the database before the actual RITM record (only a few milliseconds before, but nevertheless before its parent record). Again, not good, right? There are a few of these cases throughout ServiceNow where the order is a bit off. It usually does not matter for the ServiceNow platform itself, but it may matter a lot if you are sending records in a precise sequence to a real-time consumer application. Be on the lookout for those scenarios. Always test to confirm the order in which records are being saved into the database.
Remember: Topic Inspector is your best friend
Topic Inspector has always been an important troubleshooting tool when using Stream Connect. It consolidates details about Topics, Partitions and even consumer information. It allows you to see actual payloads and save them for testing. Newer releases have improved it even more by offering better filters.
But do not stick with the Topic Inspector only. The new Stream Connect Home is another source of information. Use it too!
Finally, if you still need additional logging, you can either add it to your running script/jobs or leverage existing logs (e.g., Flow logs if using Flow Designer). If creating your own log messages, just make sure you have an option to enable or disable it as needed. We don’t want millions of additional log entries in PROD if everything is working as expected.
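A minimal sketch of that toggle, assuming a custom system property (the property name below is just an example, not an OOTB property):

// Sketch: custom logging that can be switched off in PROD via a system property.
// 'x_streamconnect.debug.enabled' is an example property name, not an OOTB property.
var debugEnabled = gs.getProperty('x_streamconnect.debug.enabled', 'false') === 'true';
if (debugEnabled) {
    gs.info('Stream Connect producer: sent message for ' + current.getTableName() +
        ' / ' + current.getUniqueValue());
}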
Far and near clusters? What are those? And what is an offset? Can I use it? No!
When using the Topic Inspector for the first time, you will notice ServiceNow referring to its Kafka clusters as near and far. The Stream Connect infrastructure contains two clusters in a primary/standby architecture to provide high availability. Kafka messages are ONLY sent to whichever cluster is active (primary) at that moment. No messages are sent to the stand-by cluster.
This is the reason for configuring the consumer client to use two sets of ports (4100-4150 and 4200-4250). Each port set points to a different cluster. The expected behavior is for the cluster in stand-by to not receive any messages unless something happens to the primary cluster. As a quick reminder, producer clients use a distinct set of ports (4000-4050).
In the event of an issue with the primary cluster, messages are temporarily stored in special ServiceNow tables while the failover between the primary and stand-by clusters takes place. Once the primary cluster is restored or the stand-by cluster takes over, the stored messages are retried and sent to whichever cluster is active (primary) at that moment.
The cluster is called “near” if it is in the same datacenter currently hosting your active ServiceNow application server. If the cluster is in the datacenter hosting your stand-by application servers, then it is called “far”.
Finally, each cluster will have its own offsets for each partition. An offset is just a sequential ID number that uniquely identifies a message within the partition. Offsets are NOT synchronized between clusters because messages are not synchronized either. Therefore, each cluster will have its own set of offsets for the messages it receives. Because of that, please DO NOT try to use offsets to order (or reorder) your messages. In the event of a cluster failover, the offsets will be totally different, and it will mess up the order of the messages if you are using offset numbers to reorder them.
Help! My BRs sometimes do not run. The CDC may help soon! 🙁
This one is important. Certain features and operations in ServiceNow will not trigger Business Rules (BRs) or Flows, and that is the expected behavior. It could, however, break your Stream Connect implementation by not sending messages to Kafka. Skipping a BR is usually done either to intentionally avoid triggering business rule logic or to maximize performance.
Any script, OOTB or custom, that calls setWorkflow(false) on a GlideRecord will not run BRs. IDR and Transform Maps both have options to skip business rules. Operations executed by Archiving and Table Rotation, for example, will not execute BRs either. The list goes on.
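For example, a background or fix script like the sketch below will quietly bypass any Business Rule you are relying on to produce Kafka messages (the query and field values are just examples):

// Sketch: a typical bulk update script that silently skips Business Rules.
// None of these updates will fire the BR (or Flow) that produces your Kafka messages.
var gr = new GlideRecord('incident');
gr.addQuery('state', 7); // example filter: closed incidents
gr.query();
while (gr.next()) {
    gr.work_notes = 'Bulk clean-up';
    gr.setWorkflow(false);   // Business Rules, Flows and other engines are skipped
    gr.autoSysFields(false); // often used alongside it; also worth watching out for
    gr.update();
}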
There are alternatives, but they all come with tradeoffs. If timing and order are not critical, then scheduled jobs can be used (but then you must find a way to batch them if the volume becomes too large; otherwise you will run into performance issues). If timing and order are critical requirements, then we start running out of options.
There is good news, however. Let us start with the usual Safe Harbor statement that timelines and new capabilities may change or be dropped. For the Brazil release, the smart people behind Stream Connect are working to leverage Change Data Capture (CDC) capabilities in ServiceNow to overcome some of the limitations described above. Stream Connect should be able to see change operations without relying on Business Rules.
Producers: Producer API (Script) vs IH Producer Step
Another important design decision is choosing between the ProducerV2 API and the Kafka Producer Step (within Workflow Studio) for producer use cases (outbound).
In my opinion, the decision comes down to transaction volume. For a few thousand messages/payloads a day, I would give the Producer Step a good look. For much bigger transaction volumes, however, I would stick with scripting. For high volumes, the idea of tens of thousands of additional flow contexts being created does not feel right, even if you turn the flow logging down to a minimum.
Another factor is the developers' familiarity with pro-code. Usually, the ProducerV2 API will require more actual coding, while the Kafka Producer Step (within Workflow Studio) may be a better fit for low-code/no-code teams.
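One pattern that keeps this decision reversible is to put the message-building logic in a Script Include and keep the actual produce call thin, whichever mechanism you pick. The sketch below is only illustrative; the produce call itself is intentionally omitted because the real ProducerV2 class and method names should be taken from the Stream Connect documentation, and the payload here is trimmed to two attributes.

// Sketch: keep message building separate from the produce mechanism, so switching
// between the ProducerV2 API (script) and the Kafka Producer Step stays a small change.
var KafkaMessageBuilder = Class.create();
KafkaMessageBuilder.prototype = {
    initialize: function () {},
    // Returns key, header and body strings that either mechanism can send as-is
    build: function (current, operation) {
        return {
            key: current.getUniqueValue(),
            header: JSON.stringify({ source: 'servicenow', table: current.getTableName() }),
            body: JSON.stringify({
                sys_id: current.getUniqueValue(),
                operation: operation // e.g. 'insert', 'update', 'delete'
            })
        };
    },
    type: 'KafkaMessageBuilder'
};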
Static Payload vs Dynamic Payload vs Schemas
We made it to the best part of this post: the message/payload creation section. There are many ways to create messages. Whichever way we choose to create our messages, it is probably a good idea to use a JSON object to represent them. Most consumer applications should have no issues parsing a JSON object.
Before explaining the many ways we can create messages, it may be a suitable time to explain the relationship between message formats/contents and their source tables. Each implementation is different, each customer use case is different, but frequently, for producer use cases, the message format/content is directly associated with a source table. Assume we want to produce and share a Kafka message every time an incident record is created (or updated) in ServiceNow. It makes sense that the message would contain incident attributes (and sometimes dot-walked data too). That is what I call a "message format-table relationship". Now let us continue.
I usually only use static payloads if I am only doing a handful of message formats (source tables) and the customer has no intention of drastically increasing the number of tables in the near future. It should not take a lot of time to develop a script that builds a message containing attributes from a GlideRecord and sends it to Kafka. It also has some advantages if a lot of dot-walking is needed. The problem with this approach is the potential maintenance it could require if constant changes are made to the message layout or if new tables are constantly added.
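To give a rough idea of the static approach (the field names, including the dot-walked one, are just examples):

// Sketch: static payload for incident - every attribute is mapped by hand.
// Straightforward, and dot-walking is easy, but every layout change means a code change.
function buildIncidentPayload(current) {
    return JSON.stringify({
        sys_id: current.getUniqueValue(),
        number: current.getValue('number'),
        short_description: current.getValue('short_description'),
        state: current.getDisplayValue('state'),
        assignment_group: current.getDisplayValue('assignment_group'),
        caller_email: current.caller_id.email.toString() // dot-walked example
    });
}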
At some point, especially for customers wanting to use Stream Connect to produce messages for dozens (if not hundreds) of tables, it makes more sense to create a dynamic script that builds the message format automatically based on the GlideRecord source table. Such a script would loop over all fields of the source table and dynamically create the message. Additional code would have to be added to cover design decisions such as whether to include all field values for every message or only the field values that have changed; whether to include values only, display values only, or a combination of both; and perhaps whether to blacklist or whitelist certain fields. There is a lot of potential with a script creating dynamic message formats. The benefits are obvious: you do not have to map every field every time you want to share a new table, and new custom fields are picked up automatically (assuming the customer wants that). The problem with this solution is that the development is a bit more complex; nothing too advanced, but certainly more complex than the static approach. Another potential problem is the inclusion of dot-walked fields. Again, it can be done, but additional code would have to be added to the script.
When creating a payload, sometimes I also add metadata attributes to it in addition to the expected field names and values. An example of metadata attributes I always use are the source table and the database operation that triggered the message (e.g., Insert, Update and Delete).
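A heavily simplified sketch of that dynamic approach, including the metadata attributes just mentioned. The "changed fields only" rule, the display-value choice and the blacklist are examples of the design decisions described above, and the sketch assumes the scoped GlideRecord API (getElements()).

// Sketch: dynamic payload - loop over the record's fields instead of mapping them by hand.
// Assumes the scoped API, where getElements() returns an array of GlideElement objects.
function buildDynamicPayload(current, operation) {
    var blacklist = { sys_mod_count: true, sys_tags: true }; // example exclusions
    var payload = {
        _table: current.getTableName(), // metadata: source table
        _operation: operation           // metadata: insert / update / delete
    };
    var elements = current.getElements();
    for (var i = 0; i < elements.length; i++) {
        var el = elements[i];
        var name = el.getName();
        if (blacklist[name])
            continue;
        if (operation === 'update' && !el.changes())
            continue; // example rule: only send changed fields on update
        payload[name] = el.getDisplayValue(); // example choice: display values only
    }
    return JSON.stringify(payload);
}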
Finally, the last option is to use schemas when mapping messages and tables. Schemas are based on the Apache Avro format and can be imported from a Confluent Schema Registry or created locally in ServiceNow (standalone). Hopefully, the diagram below can help with understanding how the schemas work:
One of the main benefits of this approach is that message sizes are smaller. More details can be found in the Resource Links & References section below.
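For reference, a minimal Avro schema for a (very) trimmed-down incident message could look like this; the record and field names are illustrative, not an out-of-the-box Stream Connect schema:

{
  "type": "record",
  "name": "IncidentEvent",
  "namespace": "com.example.servicenow",
  "fields": [
    { "name": "sys_id", "type": "string" },
    { "name": "number", "type": "string" },
    { "name": "short_description", "type": ["null", "string"], "default": null },
    { "name": "operation", "type": "string" }
  ]
}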
One final aspect related to messages is their maximum size. Stream Connect supports a maximum size of 2MB for a single message, which is bigger than Kafka's default maximum of 1MB. Always confirm the maximum configured message size used by the customer's Kafka cluster to avoid issues (especially if you are replicating/mirroring messages). With that being said, 2MB is a very big limit for a single message. In my experience, a well-designed message ranges from a few hundred bytes to a few thousand bytes.
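If you want a cheap guardrail, something like the check below can catch oversized messages before they are produced. The character count from JSON.stringify is only an approximation of the byte size (close enough for mostly-ASCII payloads), and the limit shown is a conservative take on the 2MB ceiling; match it to whatever your target cluster is configured for.

// Sketch: rough size guardrail before producing a message.
var MAX_MESSAGE_CHARS = 2000000; // conservative approximation of the 2MB ceiling
var body = JSON.stringify(payload); // 'payload' built earlier by your message builder
if (body.length > MAX_MESSAGE_CHARS) {
    gs.warn('Kafka message for ' + current.getUniqueValue() + ' exceeds the size limit; skipping');
} else {
    // hand the message off to the producer (ProducerV2 API call or Producer Step input)
}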
You have not covered Consumers. Why? 🙁
Although Stream Connect consumer use cases are frequent, in my experience they are a little less common than producer use cases. There are diverse ways of consuming Kafka messages into ServiceNow, and covering them in detail would make this post even longer. If I receive enough requests, I may create another post dedicated solely to Stream Connect Consumers.
Resource Links & References
