You can't transform without observability

Q&A | June 12, 2023

Organizations need real-time insights into the digital platforms that drive their businesses.

In the always-on digital marketplace, 24/7 reliability is crucial. But it can be difficult to assess the performance of—and diagnose potential issues with—today’s complex, cloud-native applications. Real-time observability is necessary for organizations to meet these vital strategic needs. Ben Sigelman, cofounder of Lightstep (acquired by ServiceNow in 2021) and general manager of ServiceNow’s Cloud Observability business unit, spoke with Workflow about the present—and future—of this important capability.

The following interview has been edited for length and clarity.

Digital transformation has been used to describe such a wide variety of things. It's not that it doesn't mean anything, but when it starts to mean everything, it's hard to build a strategy around it.

I think there were two real digital transformations, and we use the same words to talk about them. The first was the transformation of operations themselves, which has been going on for decades. It was about taking your filing cabinet and replacing it with a giant mainframe. You're not changing a product or service, but you're digitizing the way that you deliver it. And then there's the transformation of customer relationships and of revenue itself. Even if you're not in the business of selling software (if you’re an airline, for example), you want to have an app. While the first digital transformation was ultimately about being more efficient and delivering things faster and cheaper, the second one is about changing what you're delivering. We use some of the same nouns and verbs to discuss both, but strategically they’re very different. The challenges, especially in recent years with truly digital products and services, are truly unique.

I see organizations struggling because they're trying to innovate faster by spending money faster. They hire 10 times the engineers expecting to accelerate the pace of innovation tenfold. Unfortunately, it’s really difficult to get more than a couple dozen engineers to work on one piece of software without running into fundamental collaboration and software development lifecycle problems. To overcome this, organizations split the software into microservices. And then, instead of having 1,000 engineers working on one thing, they have groups of 20 engineers working on 50 things. And in the end, they all come together to form one application.

Microservice, or cloud-native, architecture is a very difficult thing to understand, especially from an executive standpoint, and it creates a lot of business risk and many net new challenges. I've seen organizations far and wide struggle with how to move quickly while still creating reliable products that can compete in the marketplace. And it all hinges on this entirely new way of building software. It's not just moving to cloud. It's about moving to cloud and splitting your teams into these self-contained units that have to collaborate.

Observability is a lot like digital transformation. It’s a term that's been applied so broadly to so many different things that the definition is a bit blurry. Fundamentally, it’s the ability to understand the insides of a system just given the bits of data that are coming out. It's become increasingly important for complex production software, often run by hundreds or thousands of engineers operating in parallel to make changes to the software while it's running. Observability is the capacity to understand how software is working without a really robust test or staging environment. It’s the only way to actually see how these software applications are working when they're under a real workload affecting real customers and real revenue.

Without good observability, you can't make changes quickly, and you can't react to changes quickly.

When we look at AI, machine learning, and large language models (LLMs), people often describe their functionality as reducing a lot of toil for human operators through automation of incident management, garden-variety debugging, and things of that nature. In the next five to 10 years, that's going to continue to improve, but I think we're getting really close to a breakthrough with LLMs.

We’re going to see what they can do to the maintenance and operations of really complex software systems. They can see around corners in a way that most people can't. It takes a developer or site reliability engineer many, many years to develop a truly coherent understanding of how the whole thing fits together. I've been absolutely gobsmacked by how an LLM can take a 6,000-word strategy doc and pull out the one aspect you care about. Imagine what that could do for a complex system with thousands of microservices that are constantly being deployed and redeployed. And having something you can direct questions to with every new release, something that gives you global knowledge at the snap of your fingers, is mind-blowing to me. And I think observability will never be the same after those innovations have been incorporated into these products.