"ECC Queue Retry" plugin - can be used for monitoring/fixing outgoing errors?

bradmartin
Giga Contributor

We have a MID Server set up for passing calls between our instance and that of one of our partners. Occasionally, due to intermittent network issues, we get Bridge transactions failing with a "handshake timeout" error.

I've been having a look at the ECC Queue Retry plug-in as a possible fix/workaround for this, but as per the doco, it seems it's only geared towards dealing with failing incoming transactions? Does anyone know if there's a way this plug-in can work for the outgoing transactions as well... or if there's a viable alternative? (Or am I just totally misunderstanding the doco provided for the plugin?)

http://wiki.servicenow.com/index.php?title=ECC_Queue_Retry_Policy#gsc.tab=0

What we are seeing is entries like this, for a record showing up in error:

05-06-2017 11:35:17 - SNCMIDServer SNCMIDServerLog
Probe Error: run failed with error javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake

05-06-2017 11:35:16 - Optus Ticket ExchangeLog
Successfully Create Outbound Client Ack Health Check (28)

05-06-2017 11:35:16 - Optus Ticket ExchangeLog
Created ECC Queue entry f2210f934f47b60036604c111310c70d

Generally, we can "fix" this issue by manually hitting the "retry" option, but I would much rather be able to use an automated/scripted option like that in the Retry plugin. Does anyone have any experience with setting this kind of thing up? At present, I'm having to manually deal with anything up to 20 or so failed transactions a day, with some days being quieter than others... but it's a manual task I could quite happily go without 😉
Any constructive suggestions/feedback gratefully welcomed.
Cheers,
Brad.
1 ACCEPTED SOLUTION

bradmartin
Giga Contributor

Thanks, Shivani, for the follow-up - I hadn't forgotten the thread; I was waiting to confirm successful testing on what we had set up before coming back to close this one off.



On Friday we managed to have a successful attempt at a retry in our DEV environment, after finally getting another chance to take a look at this. In case others are looking into this also, what we ended up using for the script was:



var ProbeError = gs.getXMLText(current.payload, "//ProbeError");


answer = ProbeError.startsWith('run failed with error javax.net.ssl.SSLHandshakeException:');



The answers from both Ahmed and Brian were helpful in getting us to a working solution, so I'm not entirely sure how to handle accepting a single solution?



8 REPLIES

Ahmed Hmeid1
Kilo Guru

Hi Brad,



The ECC Queue Retry plugin is for outbound integrations, not inbound - the ECC Queue isn't designed for inbound integrations in general. As per the documentation:



"1 Overview

Define retry policies for outbound Web Services that are executed via the ECC Queue table. Retry policies specify a matching error condition for ECC Queue input records that are a result or response of an output queue record, the interval for retry, and the maximum number of retries. Because it matches on the input queue record, the retry policies only work when an input ECC Queue record is expected, and therefore requires that the outbound messages are queued on the ECC Queue table as well. Advanced matching criteria may be specified using script."



It provides retry policies for outbound web services, but it runs on the "input" queue record. The reason for this is how the ECC Queue is managed. First, an outbound record is created. Then, when the other system responds (or not), an input ECC Queue record is created in response to the outbound one. This is where you'll be seeing your handshake error.



So yes, enable the plugin, set the conditions to match the input record (that'll be the one with the handshake error), and it will automatically retry.
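For illustration, here's a minimal standalone sketch of what such a condition script could look like. The `current` object here is a hypothetical stand-in for the input ECC Queue record the policy evaluates (the real one only exists inside an instance), and I've used a contains-style match rather than a prefix match so it fires wherever the exception name sits in the error string:

```javascript
// Hypothetical mock of the input ECC Queue record ('current' in a real
// instance); the field name mirrors ecc_queue.error_string.
var current = {
  queue: 'input',
  error_string: 'run failed with error javax.net.ssl.SSLHandshakeException: ' +
                'Remote host closed connection during handshake'
};

// Retry policy condition: set 'answer' to true when this input record's
// error matches the handshake failure we want to retry automatically.
var answer = current.error_string.toString()
  .indexOf('javax.net.ssl.SSLHandshakeException') > -1;
```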



Ahmed


brian_degroot
ServiceNow Employee

Hi Brad,



The MID Server (at least the JDBCProbe to my knowledge) is configured to retry 3 times by default in scenarios like this. What might help remedy this is to set a Connection Timeout value in your Data Source [sys_data_source] record. The MID Server uses connection pooling and will determine how long a connection to this data source is kept in cache based on this value. When set to 0, the connection will never be reclaimed and will remain in the pool - even if invalidated. Setting this to, say, 300 would terminate the connection if not used in 5 minutes, so any process run afterwards would simply initiate a new connection.



Now, even if it attempts to run on an invalid cached connection, it should still go through the retry process on a new connection. If the data source is not online, though, or the MID Server is still unable to connect at that time, the retries would still fail. In that scenario there's not much that can be done automatically at the connection level. You may be able to parse through the payloads of input records in the ECC Queue for that error message. If found, you can use the 'response_to' value to look up the outbound job and either set the state back to 'Ready' or execute the 'Run Again' action. This would probably be best handled through a scheduled job.
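To sketch the payload-parsing step of that scheduled job: the helper below is hypothetical and standalone (a regex standing in for `gs.getXMLText(payload, "//ProbeError")`). In an instance you'd feed it payloads from a GlideRecord query on ecc_queue (queue=input, state=error) and, on a match, follow `response_to` back to the outbound record:

```javascript
// Hypothetical helper: decide whether an input ECC Queue payload carries
// the retryable handshake error. The regex mimics extracting the text of
// the <ProbeError> element, as gs.getXMLText would inside an instance.
function shouldRetry(payload) {
  var m = /<ProbeError>([\s\S]*?)<\/ProbeError>/.exec(payload);
  if (!m) {
    return false; // no ProbeError element, nothing to retry
  }
  return m[1].indexOf('javax.net.ssl.SSLHandshakeException') > -1;
}

// Sample payload shaped like the errors pasted earlier in this thread:
var sample =
  '<results><ProbeError>run failed with error ' +
  'javax.net.ssl.SSLHandshakeException: Remote host closed connection ' +
  'during handshake</ProbeError></results>';
```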



On a side note, I have come across issues where the retries do not occur as they should when the ojdbc6.jar JDBC Driver is being used. If you're using this driver and you don't see this happening in your MID Server agent logs, I would recommend opening an incident in HI for further analysis.



Best regards,



Brian


bradmartin
Giga Contributor

Thank you both for your responses - they were both very helpful.



Ahmed - the integration we have set up is actually a two-way integration, but the transactions I am looking at this for are the outgoing ones (as per the title and the pasted error logs). However, what you said has helped me locate what I think I needed to find: an incoming ECC transaction. I was looking at too high a level - at Bridge transactions instead of the ECC records, and there was no returning Bridge transaction for these error records.



Looking at the relevant inbound ECC Queue record, is this the error that the retry policy is looking for?



<ProbeError>run failed with error javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake</ProbeError>



I found the example in the Wiki a little misleading/confusing, though - in the Retry policy example it gives, the logic says "starts with", and the description of what it's doing says "contains". Is that an error in the Wiki, or just how the logic works? I did have my Retry Policy rule configured with:



answer = current.error_string.toString().startsWith('javax.net.ssl.SSLHandshakeException');

(generated by plagiarising the existing rules and the example)



But it failed to fire. The syntax matches what's in the doco, but I can't help thinking it should start with "run failed with error javax.net.ssl.SSLHandshakeException" instead, given the above? (I have actually set it to this for now, but it can sometimes be days before we get one of these errors in our DEV environment, due to considerably lower volumes - is there a way to generate these artificially?)
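To illustrate why the wiki's example didn't fire for me, here's a small standalone check using the actual error string from our logs. A prefix match on the bare exception name fails because the string begins with "run failed with error"; matching the full leading text, or doing a contains-style check, both work:

```javascript
// The error string as it appears on the inbound ECC Queue record:
var err = 'run failed with error javax.net.ssl.SSLHandshakeException: ' +
          'Remote host closed connection during handshake';

// Wiki-style prefix match on the exception name alone: fails, since
// the exception name is not at position 0.
var prefixMatch = err.indexOf('javax.net.ssl.SSLHandshakeException') === 0;

// Prefix match on the full leading text: succeeds.
var fullPrefix = err.indexOf(
  'run failed with error javax.net.ssl.SSLHandshakeException') === 0;

// Contains-style match: succeeds wherever the exception name appears.
var contains = err.indexOf('javax.net.ssl.SSLHandshakeException') > -1;
```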



Brian - I'll take a look at the Connection Timeout config also, to see if that's possibly part of the underlying issue.


The best thing to do is look at the inbound ECC Queue record. If the error_string field isn't visible, add it to the form; that'll display exactly what it contains. You could use a more generic check, like current.error_string.toString().indexOf('javax.net.ssl.SSLHandshakeException') > -1; so wherever the exception appears in the string, it'll retry.



I'm not sure there is a way to force it to error - is it a certificate-based integration? Maybe delete the certificate, let it fail with a long retry policy in place, and in the meantime add the certificate back.