Sudden MID Server Connectivity Failure

Daisy3 · 07-18-2024

Our MID server intermittently fails to connect to the instance. We want to understand what is causing this and to optimise whatever we can from our end. I cannot tell what went wrong: the log below shows a normal heartbeat, followed by a sudden connection failure.

Can anyone please advise how to debug this, or what steps we can take to stop it from recurring?

Thanks!

2024-07-12T17:09:01.817+1000 INFO (Worker-Interactive:HeartbeatProbe-0b99fe1e47d38a90685ada30116d43a9) [AWorker:145] Worker completed: HeartbeatProbe time: 0:00:00.001
2024-07-12T17:09:01.818+1000 INFO (ECCQueueMonitor.1) [ECCQueueMonitor:389] Received message with timestamp: 2024-07-12 07:09:01. Existing Query window is : 2024-07-12 05:05:49, Updated the query window to: 2024-07-12 05:09:01
2024-07-12T17:09:01.819+1000 INFO (ECCQueueMonitor.1) [FileReadWrite:75] Time being written to the file : 1720760941000
2024-07-12T17:09:02.044+1000 INFO (ECCSender.1) [ECCSenderCache:409] Sending ecc_queue.0b99fe1e47d38a90685ada30116d43a9.xml
2024-07-12T17:09:20.394+1000 INFO (LogStatusMonitor.60) [LogStatusMonitor:54] 2024-07-12T07:09:20.394Z, stats threads: 102, memory max: 910.0mb, allocated: 460.0mb, used: 98.0mb, standard.queued: 0 probes, standard.processing: 0 probes, expedited.queued: 0 probes, expedited.processing: 0 probes, interactive.queued: 0 probes, interactive.processing: 0 probes
2024-07-12T17:10:20.399+1000 INFO (LogStatusMonitor.60) [LogStatusMonitor:54] 2024-07-12T07:10:20.399Z, stats threads: 101, memory max: 910.0mb, allocated: 460.0mb, used: 98.0mb, standard.queued: 0 probes, standard.processing: 0 probes, expedited.queued: 0 probes, expedited.processing: 0 probes, interactive.queued: 0 probes, interactive.processing: 0 probes
2024-07-12T17:10:49.124+1000 WARN (ECCQueueMonitor.40) [HTTPClient:830] java.net.ConnectException: Connection refused: connect
2024-07-12T17:10:49.124+1000 ERROR (ECCQueueMonitor.40) [RemoteGlideRecord:918] getRecords failed (java.net.ConnectException: Connection refused: connect)
2024-07-12T17:10:49.125+1000 WARN (ECCQueueMonitor.40) [RetryExecutor:114] MIDRemoteGlideRecord.query failed with error: java.net.ConnectException: Connection refused: connect, retrying in 10 seconds
2024-07-12T17:11:20.434+1000 INFO (LogStatusMonitor.60) [LogStatusMonitor:54] 2024-07-12T07:11:20.434Z, stats threads: 102, memory max: 910.0mb, allocated: 460.0mb, used: 98.0mb, standard.queued: 0 probes, standard.processing: 0 probes, expedited.queued: 0 probes, expedited.processing: 0 probes, interactive.queued: 0 probes, interactive.processing: 0 probes
2024-07-12T17:11:25.165+1000 WARN (ECCQueueMonitor.40) [HTTPClient:830] java.net.ConnectException: Connection refused: connect
2024-07-12T17:11:25.165+1000 ERROR (ECCQueueMonitor.40) [RemoteGlideRecord:918] getRecords failed (java.net.ConnectException: Connection refused: connect)
2024-07-12T17:11:25.166+1000 WARN (ECCQueueMonitor.40) [RetryExecutor:114] MIDRemoteGlideRecord.query failed with error: java.net.ConnectException: Connection refused: connect, retrying in 15 seconds
2024-07-12T17:12:06.200+1000 WARN (ECCQueueMonitor.40) [HTTPClient:830] java.net.ConnectException: Connection refused: connect
2024-07-12T17:12:06.200+1000 ERROR (ECCQueueMonitor.40) [RemoteGlideRecord:918] getRecords failed (java.net.ConnectException: Connection refused: connect)
2024-07-12T17:12:06.200+1000 WARN (ECCQueueMonitor.40) [RetryExecutor:114] MIDRemoteGlideRecord.query failed with error: java.net.ConnectException: Connection refused: connect, retrying in 22 seconds
2024-07-12T17:12:20.395+1000 INFO (LogStatusMonitor.60) [LogStatusMonitor:54] 2024-07-12T07:12:20.394Z, stats threads: 101, memory max: 910.0mb, allocated: 460.0mb, used: 98.0mb, standard.queued: 0 probes, standard.processing: 0 probes, expedited.queued: 0 probes, expedited.processing: 0 probes, interactive.queued: 0 probes, interactive.processing: 0 probes
2024-07-12T17:12:54.733+1000 WARN (ECCQueueMonitor.40) [HTTPClient:830] java.net.ConnectException: Connection refused: connect
2024-07-12T17:12:54.733+1000 ERROR (ECCQueueMonitor.40) [RemoteGlideRecord:918] getRecords failed (java.net.ConnectException: Connection refused: connect)
2024-07-12T17:12:54.734+1000 WARN (ECCQueueMonitor.40) [RetryExecutor:114] MIDRemoteGlideRecord.query failed with error: java.net.ConnectException: Connection refused: connect, retrying in 33 seconds

Community Alums · 07-18-2024

Hi @Daisy3 ,

Here are a few steps to troubleshoot and resolve this issue:

Network Connectivity:
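- Verify that the MID server can reach the ServiceNow instance over HTTPS (port 443) from the MID server machine. A ping can be a quick first check, but ICMP is often blocked, so a failed ping alone is not conclusive.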
- Ensure there are no firewall rules blocking the connection between the MID server and the ServiceNow instance.
Proxy Settings:
- If your environment uses a proxy server, make sure the MID server is configured correctly to use the proxy. You can check the proxy settings in the config.xml file of the MID server.
Instance URL:
- Ensure that the instance URL configured in the MID server is correct. The URL should include the protocol (https://) and should not have any typos.
MID Server Credentials:
- Verify that the credentials used by the MID server to connect to the ServiceNow instance are correct and have the necessary permissions.
ServiceNow Instance:
- Check if the ServiceNow instance is up and running. Sometimes, the instance might be down or undergoing maintenance.
MID Server Logs:
- Continue monitoring the MID server logs for any additional errors or warnings that could provide more context about the issue.
Network Configuration:
- Ensure that there are no intermittent network issues between the MID server and the ServiceNow instance. Sometimes, network flakiness can cause sporadic connection issues.
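
The connectivity checks above can be sketched as a small probe that runs on the MID server host. The `Connection refused` error in the log is raised at the TCP connect step, so the probe only attempts a TCP connection and records the result with a UTC timestamp. This is a rough sketch rather than a ServiceNow tool: `check_tcp` and `monitor` are hypothetical names, the host name is a placeholder for your own instance, and if the MID server connects through a proxy the probe should target the proxy host and port instead.

```python
import socket
import time
from datetime import datetime, timezone


def check_tcp(host, port=443, timeout=5.0):
    """Attempt one TCP connection and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, timeouts and DNS resolution errors.
        return False, f"{type(exc).__name__}: {exc}"


def monitor(host, port=443, interval=10.0, attempts=6,
            check=check_tcp, sleep=time.sleep):
    """Probe repeatedly and return a list of (utc_timestamp, ok, detail)."""
    results = []
    for attempt in range(attempts):
        ok, detail = check(host, port)
        stamp = datetime.now(timezone.utc).isoformat()
        results.append((stamp, ok, detail))
        if attempt < attempts - 1:
            sleep(interval)
    return results


def format_results(results):
    """Render probe results as one line per attempt."""
    return [
        f"{stamp} {'OK' if ok else 'FAIL'} {detail}"
        for stamp, ok, detail in results
    ]
```

For example, `print("\n".join(format_results(monitor("your-instance.service-now.com"))))` prints one line per attempt. Leaving the probe running during a period when the heartbeat usually drops, and comparing its FAIL timestamps with the MID server log, can help show whether the refusals are also visible outside the MID server process.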

Thanks.

Hope this helps.

If my response turns out to be useful, you can mark it as helpful and accept it as the solution.

Daisy3 · 07-18-2024

Thanks @Community Alums for the detailed steps. I do not see any concerns with steps 1-5.

The issue occurs intermittently on multiple MID servers, and the connection recovers on its own. I checked the logs of two MID servers around the time the heartbeat failed, but unfortunately found no specific warnings or errors.

Apart from network flakiness, I do not see any other angle to review. Is there anything else we can check from this perspective in our MID server or instance logs?

Ronald Lucas TA · 11-05-2024

Our MID servers have recently started having intermittent issues. Symptoms include:

- Azure discovery failure with the error "Failed Exploring CI Pattern, Pattern name: Azure - Sub Account (LP)".
- AWS discovery failure with the error "The credentials can't be used with the account ID provided".
- MID server issue "AMB client disconnected after connecting".
- The MID server automatically stopping shortly after it starts, with errors about being unable to verify the upgrade even though it is up to date.

In some cases, retrying the Discovery Schedules or restarting the MID servers gets things working again.

I have several cases open with ServiceNow. If I get an answer from them, I'll post here.