Srey Waghray
Moderator

Introduction

In today’s digital landscape, organizations rely heavily on real-time monitoring and automated incident detection to maintain system stability and performance. However, identifying the root cause of performance degradation or failures can be challenging.

 

This is where Root Cause Correlation (RCC) comes in. RCC is an intelligent feature in Instance Observer that helps users quickly diagnose and resolve issues by automatically correlating logs, metrics, and alerts. By reducing manual intervention and leveraging machine learning-based pattern recognition, RCC enables teams to respond proactively rather than reactively to incidents.

 

In this article, we will explore:

  1. What Root Cause Correlation is
  2. How RCC works
  3. How to set up and configure RCC alerts
  4. RCC symptom categories and their corresponding alerts
  5. How to generate and interpret RCC reports
  6. The benefits of using RCC
  7. Best practices for implementation

Let’s dive in!

 

  1. What is Root Cause Correlation?

Root Cause Correlation (RCC) is a feature within Instance Observer that automates the correlation of performance metrics, logs, and events to determine the root cause of alerts.

When an alert is triggered due to an anomaly—such as memory issues, database locks, slow transactions, or high CPU usage—RCC analyzes data to correlate different symptoms and provide a clear diagnosis of the root cause.

 

Key Capabilities of RCC:

  • Automated correlation of logs and metrics to identify root causes faster
  • Reduced mean time to resolution (MTTR) by providing actionable insights
  • Advanced visualization tools to track performance issues
  • Machine learning-driven analysis to detect patterns
  • Customizable alert configurations for proactive monitoring

By automating root cause identification, RCC helps teams focus on strategic tasks rather than spending hours on manual troubleshooting.

 

  2. How RCC Works

RCC follows a structured workflow to correlate symptoms and pinpoint the root cause of an incident. The process includes:

  1. Data Collection

RCC collects data from various system components, including:

  • Logs: System logs, application logs, error logs
  • Metrics: CPU usage, memory consumption, database response times
  • Events: Performance degradation, application crashes, system anomalies

  2. Correlation Engine Analysis

Once the data is collected, RCC’s correlation engine runs a machine learning-based analysis to find patterns and establish relationships between symptoms.

For example:

  • High memory usage on a node correlates with garbage collection pauses
  • Slow transactions might be caused by database locks
  • CPU spikes could be linked to a background job running abnormally
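
The article does not describe the internals of the correlation engine, so the following is only a minimal sketch of the general idea: score how closely each candidate metric moves with a symptom metric and rank the candidates. It uses a plain Pearson correlation rather than RCC's machine learning-based analysis, and the function names, metric names, and sample values are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_candidates(symptom, candidates):
    """Rank candidate metrics by the strength of their correlation with the symptom."""
    scored = [(name, pearson(symptom, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical per-minute samples, invented purely for illustration.
memory_pct = [61, 63, 70, 82, 91, 95]
candidates = {
    "gc_pause_ms": [40, 45, 80, 150, 300, 420],
    "db_response_ms": [120, 118, 125, 119, 122, 121],
}

ranking = rank_candidates(memory_pct, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these made-up samples, the garbage collection series ranks first because it rises together with memory usage, while the database response time stays roughly flat. A strong correlation is only a hint about where to look, not proof of causation.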

  3. Root Cause Identification

Based on the correlation analysis, RCC determines the most probable root cause and categorizes it under symptom types like:

  • Memory issues
  • Slow transactions
  • Database locks
  • Cache flush issues

  4. Visualization & Actionable Insights

RCC then displays its findings on dashboards and suggests remedial actions that teams can take to fix the issue.

 

 

  3. Configuring RCC Alerts

To use RCC effectively, alerts need to be configured correctly.

Prerequisites:

  • Admin role is required to configure alerts
  • Ensure Instance Observer is enabled

Steps to Configure RCC Alerts:

  1. Navigate to: Impact → Platform Health → Monitor → Instance Observer
  2. Go to Alerts Menu and select Alerts Console
  3. Select a Production Instance
  4. Click "Get Snapshot" to load available alerts
  5. Choose Alert Options:
    • Instance
    • Date range
    • Metrics to monitor
    • Alert type (Self-Service Alerts, Diagnostic Events)
  6. Fine-Tune Thresholds:
    • Set an alert if transaction anomalies persist for X minutes
    • Track top 5%, 10%, or 15% of anomalous jobs
  7. Test the Alert Configuration
    • Simulate anomalies over 5, 10, or 15 minutes
    • Adjust thresholds for better accuracy
  8. Choose Notification Method (Email, SMS, or System Integration)
  9. Monitor Generated Alerts & Take Action

Once configured, RCC will continuously monitor system performance and trigger alerts based on predefined thresholds.
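
To make the two threshold ideas from step 6 concrete, here is a small sketch of what "anomalies persist for X minutes" and "top 5%, 10%, or 15% of anomalous jobs" could mean in code. This is an illustration only, not how Instance Observer evaluates alerts; the function names, the per-minute sampling, and the anomaly scores are assumptions.

```python
def persists(anomalous_flags, minutes):
    """Return True if at least `minutes` consecutive per-minute samples are anomalous."""
    run = 0
    for flag in anomalous_flags:
        run = run + 1 if flag else 0
        if run >= minutes:
            return True
    return False


def top_percent(job_scores, percent):
    """Return the names of the top `percent`% of jobs by anomaly score.

    Always keeps at least one job when any jobs are present.
    """
    ranked = sorted(job_scores, key=job_scores.get, reverse=True)
    keep = max(1, len(ranked) * percent // 100)
    return ranked[:keep]


# Hypothetical data: one flag per minute, and an anomaly score per job.
flags = [False, True, True, True, False, True]
scores = {f"job_{i}": i for i in range(20)}

print(persists(flags, 3))        # three anomalous minutes in a row
print(top_percent(scores, 10))   # the two highest-scoring jobs out of 20
```

Requiring a run of consecutive anomalous minutes filters out short spikes, which is one reason to test an alert against 5, 10, and 15 minute windows before settling on a threshold.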

 

  4. RCC Symptom Categories and Corresponding Alerts

RCC categorizes system symptoms into six major groups:

| Symptom Category | Description | Corresponding Alert |
|---|---|---|
| Database Impact | Detects extended SQL queries impacting performance | Database Response Time |
| Cache Flush | Identifies cache flushes & node restarts causing delays | Semaphore Mean Time |
| Longest Running Session | Finds high-processing transactions causing delays | Semaphore Mean Time |
| Slow Transactions | Identifies long-running transactions affecting system response | Semaphore Mean Time |
| Memory Issues | Detects garbage collection pauses and high memory usage | Garbage Collection Time |
| Database Locks | Identifies anomalous DB lock events | Threads Running |

Understanding these categories helps teams focus their troubleshooting efforts more effectively.
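
The category-to-alert mapping above can also be kept as a simple lookup, for example in a runbook script. The sketch below only restates the six pairs listed in this section; the constant and function names are hypothetical and are not part of any Instance Observer API.

```python
# Symptom category -> corresponding alert, as listed in this section.
SYMPTOM_TO_ALERT = {
    "Database Impact": "Database Response Time",
    "Cache Flush": "Semaphore Mean Time",
    "Longest Running Session": "Semaphore Mean Time",
    "Slow Transactions": "Semaphore Mean Time",
    "Memory Issues": "Garbage Collection Time",
    "Database Locks": "Threads Running",
}


def categories_for_alert(alert_name):
    """Return every symptom category that reports under the given alert."""
    return [
        category
        for category, alert in SYMPTOM_TO_ALERT.items()
        if alert == alert_name
    ]


print(categories_for_alert("Semaphore Mean Time"))
```

The reverse lookup shows why a Semaphore Mean Time alert needs a closer look: three different symptom categories report under it.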

 

  5. Generating RCC Reports

RCC provides detailed reports to help teams analyze alerts more effectively.

Steps to Generate an RCC Report:

  1. Go to Impact → Platform Health → Monitor → Instance Observer
  2. Select "Semaphores" from the Performance Menu
  3. Choose Report Options:
    • Select Production Instance
    • Pick Metrics & Date Range
    • Choose Alert Type (Self-Service Alerts)
  4. Click "Get Snapshot" to generate a report
  5. Analyze Findings:
    • Drill down into specific alerts
    • Use graphical visualizations
    • Export reports in PNG, SVG, or CSV formats
  6. Click "Generate Root Cause" to display detailed correlation insights
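
Once a report has been exported as CSV, it can be summarized outside the UI. The article does not document the export layout, so the column names (`timestamp`, `metric`, `value`) and the sample rows below are assumptions made for illustration; adjust them to match the real file.

```python
import csv
import io


def peak_by_metric(csv_text):
    """Return the highest value seen for each metric in a CSV export.

    Assumes hypothetical columns named 'metric' and 'value'.
    """
    peaks = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        metric = row["metric"]
        value = float(row["value"])
        if metric not in peaks or value > peaks[metric]:
            peaks[metric] = value
    return peaks


# Invented sample rows standing in for a real export.
SAMPLE = """timestamp,metric,value
2024-01-01T10:00,Semaphore Mean Time,120
2024-01-01T10:05,Semaphore Mean Time,480
2024-01-01T10:00,Garbage Collection Time,35
"""

print(peak_by_metric(SAMPLE))
```

For a real export, read the file with `open(path, newline="")` and pass its contents to the same function.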

  6. Benefits of Using RCC

1. Faster Incident Resolution

RCC reduces MTTR by automating the correlation process, ensuring quick issue identification.

 

2. Proactive Issue Management

Instead of reacting to system failures, RCC enables proactive monitoring, minimizing potential downtime.

 

3. Reduced Manual Effort

By automating log and metric correlation, RCC saves teams the time they would otherwise spend on manual troubleshooting.

 

4. Improved System Stability

By identifying issues before they escalate, RCC helps maintain a stable ServiceNow environment.

 

5. Enhanced Operational Efficiency

Organizations can focus on strategic initiatives rather than spending resources on resolving recurring issues.

 

7. Best Practices for Implementing RCC

To get the most out of RCC:

  • Fine-tune alert thresholds for more relevant insights
  • Regularly review RCC reports to identify repeated issues
  • Integrate RCC alerts with workflows for automated remediation
  • Train teams on RCC functionalities
  • Continuously monitor RCC history to detect recurring patterns
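
As a rough sketch of the last practice, recurring patterns can be surfaced by counting how often each symptom category appears in past root cause results. The history records below are invented, and the `category` field name and the `min_occurrences` threshold are assumptions rather than anything RCC defines.

```python
from collections import Counter


def recurring_causes(history, min_occurrences=3):
    """Return (category, count) pairs seen at least `min_occurrences` times,
    most frequent first."""
    counts = Counter(entry["category"] for entry in history)
    return [
        (category, count)
        for category, count in counts.most_common()
        if count >= min_occurrences
    ]


# Hypothetical history of past root cause results.
history = [
    {"category": "Memory Issues"},
    {"category": "Slow Transactions"},
    {"category": "Memory Issues"},
    {"category": "Database Locks"},
    {"category": "Memory Issues"},
    {"category": "Slow Transactions"},
]

print(recurring_causes(history))
```

A category that keeps reappearing is a candidate for a permanent fix or an automated remediation workflow rather than repeated one-off troubleshooting.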

 

Conclusion

Root Cause Correlation (RCC) streamlines incident resolution through automated correlation of logs, metrics, and events. By reducing manual effort, RCC helps teams identify root causes faster, leading to better system reliability and reduced downtime.

 

Organizations that use RCC can expect more proactive monitoring, better operational efficiency, and lower troubleshooting costs. Following the best practices above helps keep your environment stable, efficient, and resilient.

 

Ready to implement RCC? Start configuring alerts today and take your incident resolution process to the next level!

 
