NGINX is a commonly used web server, boasting performance that is 2.5x faster than Apache. Many enterprises use it to host both internal and customer-facing web services. Because these web services are critical, continuous monitoring of the NGINX web server (especially when coupled with alerts) is essential to ensuring the performance and uptime of web applications.
This guide provides an overview of the metrics available from NGINX and how to ingest metrics from NGINX and send them to Cloud Observability for more intelligent analysis. Then, it looks at how to create charts in Cloud Observability to help with NGINX monitoring.
Key Metrics in NGINX
When NGINX runs with the ngx_http_stub_status_module, you gain access to basic information about the current status of your NGINX server. This module exposes the following key metrics:
- Requests: The total number of client requests.
- Accepts: The total number of accepted client connections.
- Handled: The total number of handled connections.
- Connections:
  - Active: The current number of active connections, including those that are waiting for a request.
  - Reading: The current number of connections where NGINX is reading the request header.
  - Writing: The current number of connections where NGINX is writing the response back to the client.
  - Waiting: The current number of idle connections waiting for a request.
The Requests metric is a cumulative counter: a single number that increases over time and resets to zero only when the NGINX server restarts. For this reason, simply monitoring the raw number is not very helpful. Instead, you want to look at the rate at which this number is increasing. For example, consider the following time series data:
| Timestamp | Total number of requests |
| --- | --- |
| 08:00:00 | 300 |
| 08:00:30 | 312 |
| 08:01:00 | 312 |
| 08:01:30 | 372 |
| 08:02:00 | 1,572 |
From 08:00:00 to 08:01:00, the total number of requests increased by 12, meaning the server saw an average of 0.2 requests per second (12 / 60) over that one-minute window. From 08:01:00 to 08:02:00, the count rose from 312 to 1,572, an increase of 1,260 requests in a single minute, or an average of 21 requests per second (1,260 / 60). Over the entire two-minute window, the rate of incoming requests works out to 10.6 requests per second (1,272 / 120). Your DevOps team would want to be alerted if the rate of requests suddenly spiked like this, perhaps indicating a DDoS attack underway.
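To make the rate arithmetic concrete, here is a minimal sketch in Python using the sample values from the table above; the helper function is purely illustrative and is not part of NGINX or the collector.

```python
# Minimal sketch: deriving per-second request rates from a cumulative counter.
# The (timestamp, total_requests) pairs below are the example values from the table.
samples = [
    ("08:00:00", 300),
    ("08:00:30", 312),
    ("08:01:00", 312),
    ("08:01:30", 372),
    ("08:02:00", 1572),
]

def rate_per_second(older_total: int, newer_total: int, interval_seconds: int) -> float:
    """Average requests per second between two cumulative counter readings."""
    return (newer_total - older_total) / interval_seconds

print(rate_per_second(samples[0][1], samples[2][1], 60))   # 0.2  (08:00:00 -> 08:01:00)
print(rate_per_second(samples[2][1], samples[4][1], 60))   # 21.0 (08:01:00 -> 08:02:00)
print(rate_per_second(samples[0][1], samples[4][1], 120))  # 10.6 (full two-minute window)
```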
The metrics representing current connections by state are also helpful for alerting your team to possible issues. For example, if you see a high number of connections in the writing state and very few in the waiting state, your server might be trying to process requests but is blocked while it waits for results from third-party, upstream services.
Consider when metrics show an unchanging number of handled connections, but a rapidly increasing number of accepts. This indicates a continuing influx of connection attempts that are not being handled by NGINX. These dropped connections may point to a wider problem that requires deeper investigation.
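As a simple illustration of that check (the counter readings below are hypothetical, and this helper is not part of NGINX or the collector), the number of dropped connections is just the difference between the two counters:

```python
# Minimal sketch: flagging dropped connections from the accepts/handled counters.
def dropped_connections(accepts: int, handled: int) -> int:
    """Connections NGINX accepted but did not handle (normally zero)."""
    return accepts - handled

# Hypothetical counter readings; in a healthy server the two values are equal.
if dropped_connections(accepts=16_630_948, handled=16_630_900) > 0:
    print("NGINX is dropping connections; investigate worker limits and upstream health.")
```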
To capture NGINX metrics and visualize them, use the OpenTelemetry Collector and Cloud Observability.
Configure OpenTelemetry Collector
Before you can begin using OpenTelemetry Collector to gather and ship metrics, you need to set up NGINX.
Set up the NGINX server
First, spin up an NGINX server locally with a metrics scraping endpoint exposed. Ensure that NGINX has been built with the --with-http_stub_status_module configuration parameter so the stub status module is available.
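You can check whether your NGINX binary includes the module by inspecting its compile-time options (an optional verification step, not part of the original setup):

```
$ nginx -V 2>&1 | grep -o with-http_stub_status_module
```

If the command prints the module name, the stub status module is available.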
Next, create a new virtual host configuration file called 00-site-with-status and place it in the /etc/nginx/sites-available folder. The file has the following contents:
server {
listen 80;
server_name localhost;
location / {
proxy_pass http://127.0.0.1:3000;
}
location /status {
stub_status;
}
}
With this configuration file, you've opened up localhost on port 80, with requests to the root path proxied to the web application listening on port 3000. Let's assume that you've spun up a simple web application (for example, a Node.js Express application) that is listening on port 3000.
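If your NGINX installation uses the Debian-style sites-available/sites-enabled layout (an assumption based on the path above), you'll also need to enable the new site and validate the configuration, for example:

```
$ sudo ln -s /etc/nginx/sites-available/00-site-with-status /etc/nginx/sites-enabled/
$ sudo nginx -t
```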
The /status location exposes the metrics from the NGINX stub status module. Once the site is enabled, restart the NGINX server with the following command:
$ sudo systemctl restart nginx
In the web browser, this is what you see when you visit http://localhost/status
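The status page is plain text; its output looks something like the following (the numbers here are sample values and will differ on your server):

```
Active connections: 2
server accepts handled requests
 16 16 31
Reading: 0 Writing: 1 Waiting: 1
```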
Install the Collector
The nginx receiver is bundled with the contributor distribution of the OpenTelemetry Collector. To use it, you need to install the contributor distribution binary found on GitHub. On a Debian-based Linux system, install the collector as follows:
$ wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.72.0/otelcol-contrib_0.72.0_linux_amd64.deb
$ sudo dpkg -i otelcol-contrib_0.72.0_linux_amd64.deb
After installing the collector, verify that it's running:
$ sudo systemctl status otelcol-contrib
● otelcol-contrib.service - OpenTelemetry Collector Contrib
Loaded: loaded (/lib/systemd/system/otelcol-contrib.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2023-02-27 14:10:00 PST; 5s ago
Main PID: 677031 (otelcol-contrib)
Tasks: 13 (limit: 18868)
Memory: 24.7M
CGroup: /system.slice/otelcol-contrib.service
├─677031 /usr/bin/otelcol-contrib --config=/etc/otelcol-contrib/config.yaml
└─677047 /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session
Configure the collector
Next, configure the collector receiver to scrape metrics from NGINX. To do this, edit the collector configuration file found at /etc/otelcol-contrib/config.yaml. Add the nginx receiver, setting it to retrieve metrics from the /status endpoint every 10 seconds.
receivers:
  nginx:
    endpoint: http://localhost/status
    collection_interval: 10s

processors:
  batch:

exporters:
  logging:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [nginx]
      processors: [batch]
      exporters: [logging]
For the processor, use the batch processor, which batches and compresses incoming data for more efficient exporting. For the exporter, start with the logging exporter, configured with verbosity set to detailed. After verifying that the collector is properly capturing metrics from NGINX, you'll change the exporter to send metrics to Cloud Observability.
Once the collector is configured, restart it:
$ sudo systemctl restart otelcol-contrib
To verify that the collector is receiving metrics from NGINX, run the following command:
$ journalctl -u otelcol-contrib -f
…
Feb 27 14:16:00 demo otelcol-contrib[1215451]: Resource SchemaURL:
Feb 27 14:16:00 demo otelcol-contrib[1215451]: ScopeMetrics #0
Feb 27 14:16:00 demo otelcol-contrib[1215451]: ScopeMetrics SchemaURL:
Feb 27 14:16:00 demo otelcol-contrib[1215451]: InstrumentationScope otelcol/nginxreceiver 0.68.0
Feb 27 14:16:00 demo otelcol-contrib[1215451]: Metric #0
Feb 27 14:16:00 demo otelcol-contrib[1215451]: Descriptor:
Feb 27 14:16:00 demo otelcol-contrib[1215451]: -> Name: nginx.connections_accepted
Feb 27 14:16:00 demo otelcol-contrib[1215451]: -> Description: The total number of accepted client connections
Feb 27 14:16:00 demo otelcol-contrib[1215451]: -> Unit: connections
Feb 27 14:16:00 demo otelcol-contrib[1215451]: -> DataType: Sum
Feb 27 14:16:00 demo otelcol-contrib[1215451]: -> IsMonotonic: true
Feb 27 14:16:00 demo otelcol-contrib[1215451]: -> AggregationTemporality: Cumulative
Feb 27 14:16:00 demo otelcol-contrib[1215451]: NumberDataPoints #0
Feb 27 14:16:00 demo otelcol-contrib[1215451]: StartTimestamp: 2023-02-27 21:15:50.558593496 +0000 UTC
Feb 27 14:16:00 demo otelcol-contrib[1215451]: Timestamp: 2023-02-27 21:16:00.583045553 +0000 UTC
Feb 27 14:16:00 demo otelcol-contrib[1215451]: Value: 1
After you've verified that NGINX metrics are being collected and logged, you're ready to send those metrics to Cloud Observability.
Send metrics from OpenTelemetry Collector to Cloud Observability
After logging into Cloud Observability, navigate to Project Settings, and then to the Access Tokens page. Your OpenTelemetry Collector will need an access token to authenticate requests when exporting metrics data to Cloud Observability. Create a new access token and copy down its value.
Configure the collector to export to Cloud Observability
Please note: Cloud Observability continues to use lightstep (the former product name) in code for ongoing compatibility.
Returning to the collector configuration at /etc/otelcol-contrib/config.yaml, you'll configure a different exporter called otlp/lightstep. You'll use Lightstep's ingestion endpoint and then paste in your access token. The resulting file should look like this:
receivers:
  nginx:
    endpoint: http://localhost/status
    collection_interval: 10s

processors:
  batch:

exporters:
  logging:
    verbosity: detailed
  otlp/lightstep:
    endpoint: ingest.lightstep.com:443
    headers: {"lightstep-access-token": "INSERT YOUR TOKEN HERE"}

service:
  pipelines:
    metrics:
      receivers: [nginx]
      processors: [batch]
      exporters: [otlp/lightstep]
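As an optional tweak (not part of the original setup), you can keep the local debugging output while also sending metrics to Cloud Observability by listing both exporters in the pipeline, for example exporters: [logging, otlp/lightstep].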
Then, restart the collector.
$ sudo systemctl restart otelcol-contrib
Now that you're up and running with NGINX and OpenTelemetry Collector, you can begin working with metrics in Cloud Observability.
Working with metrics in Cloud Observability
Let’s look at some basic ways to use Cloud Observability in conjunction with NGINX metrics. If you’re interested in more detailed examples, see the Cloud Observability documentation.
Create a dashboard
First, you'll create a new dashboard to display charts related to NGINX metrics. Go to the Dashboards page, and then click on Create Dashboard. Next, provide a name and description for the new dashboard.
Add a chart
Next, add a chart to the dashboard. Click on Add a chart.
The first chart will show the current number of connections, regardless of connection state. For this, you need to select the telemetry type that you'll be charting. Select “Metric.”
Then, search for the metric you're looking for: nginx.connections_current.
The resulting chart shows the number of current connections over the last 60 minutes.
If you want to focus on a smaller window of time, you can adjust the time range. For example, you can adjust it to show the last 10 minutes.
The scale of the chart adjusts, showing us data from the last 10 minutes only.
Finally, you can save the chart to your dashboard. The resulting dashboard shows the first chart.
Filter metrics by an attribute
The chart you created was based on the nginx.connections_current metric. However, if you look closely at the verbose logging of that metric, you'll see that it's a single metric made up of four data points, one for each connection state (active, reading, writing, waiting). The first chart did not filter by state, but instead aggregated the values across the different states and simply displayed the maximum of the four values.
A more helpful display of current connections would use filtering to show the value for each state. To do this, add a new chart. You'll use the same nginx.connections_current metric, but this time use the Filter box and select state as the attribute key.
Then, you can select the attribute value that you want to display. For the first metric in this chart (metric a), select connections with the active state.
A single chart can display more than one metric, and it can also display formulas that operate on metrics. For this chart, you want to display four metrics, one for each connection state. Click on Plot another metric.
Then, you'll add the nginx.connections_current metric with state = reading. This will be metric b on the chart.
Now, do the same for state = waiting (metric c), and again for state = writing (metric d).
The chart shows multiple bands of color, one per connection state, representing the current connection counts. A key below the chart shows which color corresponds to which metric.
UQL and Alerts
So far, you've created charts using the Query Builder, which provides a simple and easy-to-use interface for selecting metrics and configurations. However, users who are familiar with the Unified Query Language (UQL) can use the Query Editor to write queries directly. For example, your chart of current connections filtered by state can also be expressed directly in UQL.
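As a rough sketch only (the exact UQL syntax for your project may differ, so treat this as an assumption rather than a verified query), the query for the metric filtered to the active state might look something like:

```
metric nginx.connections_current
| filter state == "active"
| latest
| group_by [], sum
```

The queries for the reading, writing, and waiting states would follow the same pattern.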
In addition, you can create alerts to notify you when metrics surpass certain thresholds. For example, a DevOps team monitoring their NGINX server may want to be alerted when nginx.connections_accepted increases faster than nginx.connections_handled, as this may indicate that NGINX is dropping connections. Notifications can be set up to use webhooks or third-party services.
Conclusion
Making sure that your NGINX servers are performing properly is essential to delivering your web applications and services. For most enterprises, web applications and services are mission-critical to business. Therefore, proper monitoring of NGINX and quick alerting on issues is a must-have. With these practices in place, DevOps teams can respond quickly whenever an issue surfaces, and that will lead to increased uptime and reliability.