
on 10-08-2021 02:32 PM
How ServiceNow Predictive AIOps can help with Facebook’s outage
By Lisa Wolfe, Eytan Chamovitz, Dor Juraski, Aleck Lin, Jimmy Yuan
According to Santosh Janardhan, VP of infrastructure at Facebook, the company's recent six-hour outage was caused by configuration changes on its backbone routers that caused “issues” interrupting the flow of traffic between routers in Facebook’s data centers around the world. He said, “This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.” In addition, Facebook discovered a bug in its auditing software that allowed the planned configuration command to go through rather than being stopped.
We learned that the loss of DNS also broke many of Facebook’s internal tools, which hampered the company's ability to investigate and resolve the outage. Had ServiceNow® Predictive AIOps been in use, then as a SaaS platform it would have continued streaming logs and metrics out of Facebook’s data centers to ServiceNow’s secure cloud for further analysis by our AI engine. Not only could ServiceNow Predictive AIOps have detected anomalies quickly, it could also have helped trace the interdependencies of those anomalies across data centers and allowed deep investigation while engineers were en route to the site, potentially saving critical time during a global outage.
For many of our customers, ServiceNow Predictive AIOps can be a lifesaver. It enables their operations teams to handle outages and helps prevent hours of downtime, and in some cases it can prevent outages outright by addressing the symptoms that precede them.
How does it work?
Predictive AIOps monitors the relevant services, servers, routers, and other infrastructure. The AI engine learns the normal behavior of the environment from events, logs, and metrics, and picks up on the first symptoms of a “cascading effect” like the one Facebook described, or on other out-of-the-ordinary occurrences in the environment. When multiple components are affected, the AI engine correlates the symptoms from those components and presents insights that describe the problem. Additionally, historical ITSM incident, problem, and change data gives our customers the context they need to help find the root cause.
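To make the idea of "learn a baseline, flag deviations, then correlate across components" more concrete, here is a minimal Python sketch. It is an illustration only and not how the ServiceNow engine is implemented: the function names, the z-score threshold, the correlation window, and the sample metrics are all hypothetical, and a simple z-score stands in for whatever models the product actually uses.

```python
from statistics import mean, stdev


def find_anomalies(series, baseline_len=20, threshold=3.0):
    """Return indices (after the baseline window) whose value lies more than
    `threshold` standard deviations from the baseline mean.
    Hypothetical helper; the threshold is an assumed value."""
    baseline = series[:baseline_len]
    mu = mean(baseline)
    sigma = stdev(baseline) or 1e-9  # avoid division by zero on flat baselines
    return [
        i for i in range(baseline_len, len(series))
        if abs(series[i] - mu) / sigma > threshold
    ]


def correlate(anomalies_by_component, window=2):
    """Group components whose first anomaly starts within `window` samples
    of the previous component's first anomaly (a chained time window)."""
    first_seen = sorted(
        (idxs[0], name)
        for name, idxs in anomalies_by_component.items()
        if idxs
    )
    groups, current = [], []
    for t, name in first_seen:
        if current and t - current[-1][0] > window:
            groups.append([n for _, n in current])
            current = []
        current.append((t, name))
    if current:
        groups.append([n for _, n in current])
    return groups


# Made-up metrics: 20 "normal" samples followed by 4 recent samples.
metrics = {
    "router_a": [10.0, 11.0] * 10 + [10.0, 11.0, 50.0, 55.0],
    "dns_b": [5.0, 6.0] * 10 + [5.0, 6.0, 6.0, 40.0],
    "cache_c": [1.0, 2.0] * 12,
}

anomalies = {name: find_anomalies(values) for name, values in metrics.items()}
print(anomalies)             # per-component anomalous sample indices
print(correlate(anomalies))  # components whose symptoms started close together
```

In this toy data, the router and DNS metrics leave their baselines one sample apart, so they land in the same group, while the cache metric stays normal and is left out. A real deployment would also draw on logs, events, and ITSM history, as described above.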
Learn more about ServiceNow Predictive AIOps: https://www.servicenow.com/products/predictive-aiops.html