What is distributed tracing?

Distributed tracing is a method for tracking service requests in distributed systems, providing visibility into latency, performance bottlenecks, and failures.

The demands of modern business have led to an explosion in information technology, with centralized, legacy computer systems evolving into powerful and complex distributed IT environments. Unfortunately, along with the enhanced capabilities of today’s cloud-based networks and remote-access data processing, this increased complexity also carries greater risk.

Due to their intricate interdependencies, complex systems are more likely to experience problems. Failures in one part can cascade across the system, and identifying and fixing issues is often far more challenging than in centralized systems. At the same time, the more complex the system, the harder it is to predict how changes in one part will affect the others, leading to unexpected consequences for even the most innocuous adjustments. And through it all, thoroughly testing a complex system is exponentially more difficult, meaning that problems are increasingly likely to slip through undetected. Distributed tracing provides a solution.

Distributed tracing can be said to have begun with the Dapper paper—introduced by Google in 2010—which laid the foundational groundwork for large-scale distributed systems tracing infrastructure. Interestingly, Ben Sigelman, the founder of Lightstep (which later became ServiceNow Cloud Observability), was instrumental in the creation of Dapper. Following Dapper, Twitter released Zipkin in 2012, the first open-source distributed tracing project. Then in 2015, Uber launched Jaeger, which was itself inspired by Dapper.

In 2016, Sigelman wrote a blog post, "Toward Turnkey Distributed Tracing," which would come to be known as the OpenTracing Manifesto. This pivotal text introduced OpenTracing as a single standard, addressing the lack of standardization within the tracing ecosystem and laying the foundation for OpenTracing to become a project under the Cloud Native Computing Foundation (CNCF) and eventually merge with OpenCensus to form OpenTelemetry in 2019.

OpenTelemetry version 1.0 was released in 2021, and has since become the de facto standard for tracing, metrics, and logging. From Dapper in 2010 to today's OpenTelemetry capabilities, in little over a decade, distributed tracing has evolved from a single backend system to a widely used end-to-end solution, ultimately paving the way for modern comprehensive observability practices.

Distributed tracing allows organizations to profile and monitor their full range of applications, especially those built using a microservices architecture. This approach provides visibility into how individual services within a distributed system interact with one another, building an accurate picture of individual requests as they flow through the system.

By tracking the journey of requests and measuring how long each part takes, distributed tracing aids in pinpointing performance bottlenecks, latency issues, and potential failures. As such, distributed tracing is a crucial tool for DevOps and IT teams, allowing them to optimize, troubleshoot, and maintain their systems more effectively.

Distributed tracing structure

Distributed tracing is built around three core components:

  • Span
    A span is a single unit of work carried out by a service within the system, marked with start and end timestamps and possibly including metadata such as logs or tags. Spans are the building blocks of a trace, representing different parts of the workflow. The first span in a trace is the root span. Any span can act as a parent span with its own spans (called child spans), which in turn can also have child spans.
  • Trace
    A trace is made up of one or more spans which together represent the complete execution path of the service request as it moves through the distributed system. Traces are often visualized as trees, where the root node represents the user's interaction, and the other nodes represent the various microservices involved in processing the request and preparing the response.
  • Tags
    Tags are metadata elements attached to spans. These provide context and classification.

The trace/span structure offers a request-centric view—bridging the gaps between independent microservices and providing a unified perspective of the system's performance. With this information, organizations are better prepared to understand and improve the user's experience.
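
To make the span, trace, and tag relationship concrete, here is a minimal Python sketch. The `Span` class, its field names, and the sample service names are all illustrative assumptions, not the data model of any particular tracing product. It stores parent-child links on each span and prints the trace as a tree.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work. All field names here are illustrative."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def render_tree(spans: List[Span]) -> List[str]:
    """Return the trace as indented lines, children listed under their parent."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# A hypothetical four-span trace for a single checkout request.
trace = [
    Span("t1", "a", None, "GET /checkout", 0, 120, {"http.method": "GET"}),
    Span("t1", "b", "a", "cart-service", 5, 40),
    Span("t1", "c", "a", "payment-service", 45, 115, {"version": "2.3.1"}),
    Span("t1", "d", "c", "db.query", 50, 90),
]
tree_lines = render_tree(trace)
print("\n".join(tree_lines))
```

The root span covers the whole request, and each level of indentation in the output corresponds to one parent-child hop.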

Distributed tracing vs. logging

Tracing, logging, and metrics play pivotal roles in observability, but they are not the same concepts. Each serves distinct purposes, and understanding the differences and complementary nature of these concepts is essential for comprehensive system monitoring and debugging:

  • Tracing
    Distributed tracing provides a detailed view of requests as they traverse the components of a distributed system, capturing the flow of each request through the services involved. This makes it valuable for performance tuning and troubleshooting. Unlike logging and metrics, distributed tracing focuses on the journey of specific requests, giving a clear picture of the interactions between microservices.
  • Logging
    Logging is the practice of recording specific, individual events in a system, such as user actions, system errors, or other activities. Time-stamped logs provide granular information about what happened in the system at a particular moment, often essential for debugging and auditing. While distributed tracing tracks the flow of requests, logging offers a more static snapshot of events, without necessarily showing the relationships between different parts of the system.
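
One common way to relate the two is to stamp each log line with the ID of the trace it belongs to. The sketch below uses only Python's standard `logging` module. The `TraceIdFilter` class, the logger name, and the hard-coded trace ID are assumptions made for illustration; a real service would read the ID from the active request context.

```python
import io
import logging


class TraceIdFilter(logging.Filter):
    """Attach a trace ID to every record (hypothetical helper)."""

    def __init__(self, trace_id: str) -> None:
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        return True  # never drop the record, only annotate it


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s")
)

logger = logging.getLogger("checkout-example")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.addFilter(TraceIdFilter("4bf92f3577b34da6a3ce929d0e0e4736"))

logger.info("payment authorized")
log_output = stream.getvalue()
print(log_output, end="")
```

With the trace ID present in each line, a log entry can be looked up alongside the spans of the same request.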

Metrics in distributed tracing

Metrics are numerical values that represent the state of a system at a particular point in time or over a time interval. Metrics play a vital role in distributed tracing, offering a quantifiable way to monitor and analyze the performance of various services within a distributed system. These values can be derived from traces and logs, providing "at-a-glance" information or detailed reporting on specific aspects such as response times, error rates, system throughput, and resource utilization.

By considering trace and log data through the lens of metrics that summarize key performance indicators, organizations can gain a comprehensive understanding of their distributed architecture, allowing for quick diagnostics and actionable insights, and facilitating effective system optimization.
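
As a rough illustration of deriving metrics from trace data, the following Python sketch summarizes a batch of spans into a request count, an error rate, and latency percentiles. The span dictionaries, the `summarize` helper, and the nearest-rank percentile method are assumptions chosen for brevity.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(spans: List[Dict]) -> Dict[str, float]:
    """Collapse individual spans into a few summary metrics."""
    durations = [s["duration_ms"] for s in spans]
    errors = sum(1 for s in spans if s["error"])
    return {
        "count": len(spans),
        "error_rate": errors / len(spans),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
    }


# Ten hypothetical spans for one operation; two are slow and failed.
durations = [12, 15, 11, 240, 14, 13, 16, 12, 18, 500]
spans = [{"duration_ms": d, "error": d >= 200} for d in durations]
summary = summarize(spans)
print(summary)
```

The median stays low while the 95th percentile exposes the slow outliers, which is the kind of "at-a-glance" signal that then points back to specific traces.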

Microservices in distributed tracing

Microservices are a software architectural design where an application is structured as a collection of loosely coupled, independently deployable services. Each microservice focuses on a specific functional area and operates as an individual component within the broader system. This modular approach promotes flexibility, scalability, and can enhance development speed. In the context of distributed tracing, microservices play a significant role as the individual nodes that a request passes through.

As a request travels from one microservice to another, distributed tracing captures the details of these interactions, including the time taken at each step. This information details how the request flows through the numerous services, identifying bottlenecks, latencies, and potential failures.
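
For the hops to be stitched into one trace, each service has to pass the trace identity along with its outgoing call. The Python sketch below assumes the W3C Trace Context `traceparent` header layout (version, 32-hex trace ID, 16-hex parent span ID, flags), which this article does not itself specify. The helper names are hypothetical, and validation is deliberately minimal.

```python
import secrets
from typing import Dict, Optional, Tuple


def new_traceparent() -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def parse_traceparent(header: str) -> Optional[Tuple[str, str]]:
    """Return (trace_id, span_id), or None if the header looks malformed."""
    parts = header.split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return parts[1], parts[2]


def child_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build headers for an outgoing call that continue the caller's trace."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed is None:
        # No usable context arrived, so this service starts a new trace.
        return {"traceparent": new_traceparent()}
    trace_id, _caller_span_id = parsed
    # Same trace ID, fresh span ID for this service's own unit of work.
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-01"}


edge = {"traceparent": new_traceparent()}
downstream = child_headers(edge)
print(edge["traceparent"])
print(downstream["traceparent"])
```

The two printed headers share a trace ID but carry different span IDs, which is what lets a backend reassemble the request path.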

Understanding how microservices interact within a distributed system can be complex; distributed tracing provides invaluable insights into these interactions, empowering organizations to visualize the paths, monitor system performance, and troubleshoot any problems that may arise to foster a more robust and efficient system architecture.

Distributed tracing has become an indispensable tool for organizations working with distributed systems, particularly in the context of microservices and dynamic architectures. By comprehensively tracking and recording every interaction that a request has with each service, distributed tracing provides crucial insights into monitoring, debugging, and performance optimization. Attributes can be added to traces for further clarification, and each span is recorded with detailed metadata, including span parent-child relationships, allowing a complete understanding of how requests move through and across services.

As such, more and more organizations are turning to distributed tracing to manage the complexity of their modern application environments. With numerous potential failure points in today's intricate application stacks, pinpointing root causes of issues can be difficult, time-consuming, and potentially fraught with errors. Distributed tracing streamlines this process, facilitating quicker and more accurate identification of problems, thereby directly enhancing a company's ability to provide an excellent user experience.

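At the same time, distributed tracing helps with the problem of high cardinality, where the number of distinct values being tracked (for example, individual user or request IDs) grows to the point where data storage and computing power become difficult to manage. Such high-cardinality details can be attached to individual spans as attributes.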

The benefits of distributed tracing include a better understanding of microservice performance, quicker issue resolution, and higher customer satisfaction. With a detailed view of how each microservice performs, organizations can protect revenue streams while also dedicating more time to strategy and innovation.

The data provided through distributed tracing is crucial, but at the end of the day it is still just data. Without a clear understanding of what the data represents, it cannot positively impact the decision-making process. The true value in the data is the actionable insight that can be derived from the numbers—provided they are recent, relevant, and reliable.

It is through intelligent analysis and contextual understanding of this data that organizations can pinpoint issues, identify causes, and implement effective solutions. How does distributed tracing move beyond mere data collection to provide profound insights into various scenarios? Consider the following:

Tracing illuminates the relationship between cause and effect

Distributed tracing plays a critical role in recognizing the symptoms of poor software health, such as latency or low throughput. It acts like a diagnostic tool, connecting the observable effects with their underlying causes, allowing validation of hypotheses about what may have triggered the observed change.

Tracing helps identify the causes of outages

When a service becomes unavailable, that demands an explanation. Tracing aids organizations in determining what changes—internal or external—have been made prior to an outage. Whether the variation is a result of bugs in the software, changes driven by users, or alterations in infrastructure that lead to performance issues, distributed tracing makes it possible to determine the state of the system before and after the outage, and clearly identify what may have caused it.

Tracing informs insight into service changes

Understanding changes within individual services is vital. Whether it's deployments or version updates, distributed tracing breaks down performance across distinct stages, tagging each span with the version information. This detailed view aids in diagnosing changes that affect a service’s performance.
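
As an illustration of how version tags support this kind of comparison, the Python sketch below groups span durations by a tag value and averages them. The `version` tag name, the span dictionaries, and the `mean_duration_by_tag` helper are assumptions for the example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_duration_by_tag(spans: List[Dict], tag: str) -> Dict[str, float]:
    """Average span duration for each distinct value of the given tag."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["tags"].get(tag, "unknown")].append(span["duration_ms"])
    return {value: mean(durations) for value, durations in grouped.items()}


# Hypothetical spans recorded before and after a deployment.
spans = [
    {"name": "checkout", "duration_ms": 100, "tags": {"version": "1.4.0"}},
    {"name": "checkout", "duration_ms": 110, "tags": {"version": "1.4.0"}},
    {"name": "checkout", "duration_ms": 180, "tags": {"version": "1.5.0"}},
    {"name": "checkout", "duration_ms": 200, "tags": {"version": "1.5.0"}},
]
by_version = mean_duration_by_tag(spans, "version")
print(by_version)
```

A jump in the average between versions, as in this sample data, would be a prompt to inspect individual traces from the newer release. The same grouping works for any other tag, such as a region or host.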

Tracing accounts for changing user demands

Systems are not static, and neither are the users who operate within the systems. External factors such as shifts in user behavior can drive changes in service performance. Insightful tracing uses tags to capture essential parts of requests and user features, offering a deeper understanding of how users interact with the application, and how these interactions can create unexpected demands.

Tracing uncovers areas where resources may be strained

Resources are finite, and sometimes there simply is not enough to go around. Resource competition in distributed networks can significantly affect performance. Distributed tracing offers insights into how shared resources like CPUs, containers, or databases are utilized. Properly tagged traces allow aggregate analysis, uncovering when and where slower performance correlates with specific resource usage, helping in resource planning and conflict resolution.

Tracing creates visibility into upstream changes

The dynamic nature of dependencies means that upstream changes can impact your service’s performance. Insightful distributed tracing, including the tagging of egress operations and version numbers, enables visibility into how upstream services affect performance. Understanding these relationships helps in adapting to or mitigating the impacts of these changes.

Open-source distributed tracing standards are essential frameworks that guide the collection, management, and analysis of tracing data across different services in a standardized manner. These standards promote interoperability and reduce vendor lock-in, allowing developers to switch between different tracing backends and tools with minimal adjustments. They also provide a common ground for integrating various platforms, languages, and applications within complex distributed systems.

Among the most widely used open-source distributed tracing standards are:

OpenTracing

Hosted by the Cloud Native Computing Foundation (CNCF), OpenTracing was among the earliest open-source distributed tracing standards. The framework included APIs that supported distributed context propagation and allowed developers to add instrumentation to their application code without locking into any specific vendor. OpenTracing offered consistent tracing semantics across different platforms, but it has since been archived, and the CNCF no longer provides support for the project.

OpenCensus

OpenCensus was designed to support multiple exporters, allowing users to send trace data to different backends for analysis. This framework (which originated from Google) provided a clear set of APIs and libraries that allowed automatic and manual collection of distributed traces and metrics. By offering a unified solution for observability, OpenCensus streamlined the process of gathering and managing essential statistical data. OpenCensus has likewise been discontinued.

OpenTelemetry

OpenTelemetry is the result of merging OpenTracing and OpenCensus, combining the best features of both standards. Cofounded by Lightstep, OpenTelemetry offers a unified and more extensive set of APIs, libraries, agents, and instrumentation to provide a complete observability framework for cloud-native software. OpenTelemetry simplifies application instrumentation, offering built-in support for various popular frameworks and libraries. It serves as a common observability standard for distributed traces, logs, and metrics, backed by a growing community and broad industry support.

Various features are integral to the functionality and success of distributed tracing. Here is how each feature plays a role:

Alerts

Alerting mechanisms in the observability backend allow teams to set thresholds for specific metrics or behaviors that might indicate a problem. When these thresholds are reached, alerts can be sent to the relevant personnel, allowing for quick identification and resolution of potential issues, thus enhancing system reliability.
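
A threshold-based alert can be sketched in a few lines of Python. The `AlertRule` class, the metric names, and the on-call targets below are hypothetical, and real observability backends typically add evaluation windows and deduplication on top of a check like this.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlertRule:
    metric: str       # name of the metric to watch
    threshold: float  # fire when the value rises above this
    notify: str       # who should hear about it (hypothetical target name)


def evaluate(rules: List[AlertRule], metrics: Dict[str, float]) -> List[str]:
    """Return one message per rule whose threshold is exceeded."""
    messages = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            messages.append(
                f"[{rule.notify}] {rule.metric}={value} exceeds {rule.threshold}"
            )
    return messages


rules = [
    AlertRule("p95_latency_ms", 300, "payments-oncall"),
    AlertRule("error_rate", 0.05, "platform-oncall"),
]
current = {"p95_latency_ms": 420, "error_rate": 0.01}
alerts = evaluate(rules, current)
print(alerts)
```

Only the latency rule fires for this sample data, because the error rate is still under its threshold.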

End-to-end insights

Distributed tracing provides comprehensive visibility into the entire journey of a request through different services and components. This end-to-end insight helps in identifying bottlenecks, inefficiencies, and anomalies within the system, providing a detailed context for performance tuning and error analysis.

Time and cost efficiency

By offering precise information on system behavior, distributed tracing significantly reduces the time spent on debugging and identifying issues. This efficiency translates into cost savings, as teams can spend more time on feature development and innovation instead of troubleshooting.

Multi-region/multi-cloud integration

With the rise of distributed computing across multiple geographical locations and cloud providers, distributed tracing facilitates integration across these complex environments. It allows for a coherent view of the system's performance across different regions and cloud platforms, ensuring consistent monitoring and analysis.

Service-performance monitoring

Distributed tracing enables real-time monitoring and tracking of the performance of each service—understanding how they interact and pinpointing areas that might need optimization.

Collector

The collector acts as a critical component in gathering, processing, and exporting telemetry data (i.e., traces, metrics, and logs). It provides a unified, vendor-agnostic option for collecting and transmitting data, enabling seamless integration with various observability backends. The flexibility of the collector ensures that tracing can be adapted to different environments without altering the instrumentation code.
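
The decoupling described here can be sketched as a small receive, batch, and export pipeline. The `MiniCollector` class below is a toy stand-in written for this illustration, not the OpenTelemetry Collector or its API. The point is that the code emitting spans only ever calls `receive`, while the list of exporters can change independently.

```python
from typing import Callable, Dict, List

Span = Dict[str, object]
Exporter = Callable[[List[Span]], None]


class MiniCollector:
    """Toy pipeline: buffer incoming spans, then fan each batch out to exporters."""

    def __init__(self, exporters: List[Exporter], batch_size: int = 2) -> None:
        self._exporters = exporters
        self._batch_size = batch_size
        self._buffer: List[Span] = []

    def receive(self, span: Span) -> None:
        """Entry point used by instrumented code."""
        self._buffer.append(span)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered to every configured exporter."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        for export in self._exporters:
            export(batch)


# Two in-memory lists stand in for two different observability backends.
backend_a: List[Span] = []
backend_b: List[Span] = []
collector = MiniCollector([backend_a.extend, backend_b.extend], batch_size=2)

for i in range(3):
    collector.receive({"name": f"op-{i}"})
collector.flush()  # push the final partial batch
print(len(backend_a), len(backend_b))
```

Switching or adding a backend means editing the exporter list passed to the collector. The loop that produces spans stays untouched.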

Sampling

Sampling is a feature that allows the collection of a subset of requests (rather than every request) to reduce total volume of data sent to the tracing backend. This enables distributed tracing systems to operate at scale without overwhelming resources, while still providing valuable insights.
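
One simple way to sample is to decide from the trace ID itself, so that every service reaches the same keep-or-drop decision for a given request. The Python sketch below is an assumption-laden illustration of that idea: the `should_sample` function and the choice of the low 64 bits are made up for this example, and real tracing systems offer several other strategies.

```python
import random


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace when its ID falls in the lowest `ratio` share of the ID space."""
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    bound = int(ratio * (1 << 64))
    # Use the low 64 bits (last 16 hex digits) of the trace ID as the sample key.
    return int(trace_id_hex[-16:], 16) < bound


# Simulate 10,000 random 128-bit trace IDs and keep roughly 10% of them.
rng = random.Random(0)
trace_ids = [format(rng.getrandbits(128), "032x") for _ in range(10_000)]
kept = sum(should_sample(tid, 0.10) for tid in trace_ids)
kept_fraction = kept / len(trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the ID, a trace is either recorded in full across all services or not at all, which avoids partially captured requests.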

Scalability

As systems grow, distributed tracing must adapt to the increased complexity and volume of data. Scalability features ensure that tracing can handle large-scale environments, providing consistent performance insights regardless of the system's size. Of course, capturing the data is only the first step; to ensure that organizations can make sense of the data, they need access to a platform that can scale alongside distributed tracing as the systems grow and evolve.

Ability to function across heterogeneous full-stack environments

Modern applications are made up of various languages, frameworks, technologies, and clients (both web-based and mobile). Distributed tracing's ability to operate across heterogeneous full-stack environments ensures that developers have the insights they need from across the entire technology stack, no matter how diverse it may be.

As previously stated, the benefits of distributed tracing are tied to the enhanced visibility it provides into an organization’s distributed systems. But visibility in and of itself is only valuable when it makes other benefits possible. Key benefits of distributed tracing include:

Accurate evaluation of specific user actions

One of the key benefits of distributed tracing is its ability to measure the duration needed to perform essential user actions, such as making a purchase. Tracing the request pathway helps in locating and correcting backend impediments that may otherwise negatively impact the user's experience.

Easy assessment of SLAs

Most organizations operate within the bounds of service-level agreements (SLAs), formalizing performance commitments to customers or other internal divisions. Distributed tracing tools compile performance data from individual services, making it convenient for teams to assess whether they are adhering to their SLAs.

Support for managing SLOs and SLIs

Service-level objectives (SLOs) are specific, measurable targets that define the expected performance and availability of a service, and are supported by service-level indicators (SLIs), which help organizations measure their service levels. Properly deployed, distributed tracing provides an opportunity to track and meet SLOs by choosing which specific signals to monitor and setting alerts for any errors or data that falls outside of a predetermined range. This makes it possible to address any related incidents quickly and effectively.
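
To show how an SLI computed from trace data can be checked against an SLO, here is a small Python sketch. The 500 ms latency threshold, the 99.5% target, and the helper names are assumptions chosen for the example. The SLI is the share of requests that succeeded within the threshold, and the error budget is the portion of allowed failures not yet used.

```python
from typing import Dict, List


def latency_sli(spans: List[Dict], threshold_ms: float) -> float:
    """Fraction of requests that succeeded within the latency threshold."""
    good = sum(
        1 for s in spans if not s["error"] and s["duration_ms"] <= threshold_ms
    )
    return good / len(spans)


def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the allowed failure budget still unused (negative if overspent)."""
    allowed_failure = 1.0 - slo
    if allowed_failure <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - (1.0 - sli) / allowed_failure


# 1,000 hypothetical requests: 996 good, 2 too slow, 2 failed.
spans = (
    [{"duration_ms": 100, "error": False}] * 996
    + [{"duration_ms": 900, "error": False}] * 2
    + [{"duration_ms": 120, "error": True}] * 2
)
sli = latency_sli(spans, threshold_ms=500)
budget = error_budget_remaining(sli, slo=0.995)
print(f"SLI={sli:.3f}, error budget remaining={budget:.0%}")
```

In this sample the service meets its target but has used most of its budget, which is the kind of condition an alert could be attached to.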

Enhanced understanding of service relationships

Distributed tracing provides insights into the intricate cause-and-effect relationships between assorted services. This understanding helps companies optimize their overall system performance.

Improved collaboration

In environments where different teams oversee various services involved in fulfilling a request, distributed tracing offers clarity about where an error has occurred and which team needs to address it. This clarity enhances collaboration between teams, significantly reduces the time spent ‘finger pointing,’ and contributes to overall productivity within the organization.

Reduced time to resolution

When issues arise in application performance, support teams can utilize distributed traces to pinpoint whether the problem lies in the backend. By analyzing traces from the affected service, engineers can identify and resolve the problem. Utilizing end-to-end distributed tracing tools even enables investigation of frontend performance issues within the same platform, thereby reducing both the mean time to detection (MTTD) and the mean time to resolution (MTTR) for potentially problematic issues.

While distributed tracing brings numerous benefits, it also comes with inherent obstacles that might hinder its full potential. Understanding these challenges is key for organizations that aim to implement distributed tracing effectively. Below are some of the most notable challenges:

Difficulty of manual instrumentation

One of the hurdles that some distributed tracing platforms present is the necessity for manual instrumentation. This means the organization may have to alter or modify their existing code to initiate the tracing of requests. Such manual intervention not only takes up valuable engineering resources but can also lead to the introduction of errors within the applications as the code is revised.

Restriction to backend coverage

Traditional distributed tracing is often restricted to the backend services, generating a trace ID only when the request hits the first backend service. Without utilizing an end-to-end distributed tracing platform, visibility into the corresponding user session at the frontend remains obscured. This limitation makes it more difficult to discover the root cause of some problematic requests and to determine whether the issue needs to be resolved by the frontend or backend team.

Thankfully, the adoption of frameworks such as OpenTelemetry alleviates or removes the challenges of limited visibility into frontend transactions as well as issues associated with instrumentation. These and other challenges are also being addressed as many industry technologies (such as Kubernetes) incorporate OpenTelemetry into their core codebases.

As the modern business IT landscape continues to expand in terms of size and complexity, the benefits of distributed tracing are becoming ever more obvious. ServiceNow Cloud Observability—leveraging the award-winning Now Platform®—sets a new standard for tracing, delivering complete visibility across requests in distributed systems.

Integrate with existing tools. Bridge metrics and tracing to create unified telemetry. Significantly reduce your organization’s MTTR. And, through it all, align pricing with business outcomes, for enhanced value without scaling costs for increased usage.

Cloud Observability is revolutionizing distributed tracing to benefit your business. Contact ServiceNow to learn more!
