How can I debug why 99.99% of my spans are being dropped?
02-11-2025 02:39 PM
My service, IMS, is getting a ton of errors when trying to push traces to Lightstep.
Here's what I know:
* Some pushes work, so I have a few traces in LS, but not many
* I have a consistently high `rate(lightstep_error_event_total)` metric on all my pods, so traces are being continuously dropped
* I have a lot of `flush failed, could not send report to Collector` errors
* I am getting a lot of `status code (400) is not ok` errors in my tracing logs
Is there a way to configure the Lightstep microsatellite to log the requests that produce these 400 errors? For the LS Go client library, can I turn up the debugging level to get more details on why this is failing?
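One generic way to see what the 400 responses actually contain, independent of any tracer or microsatellite setting, is to point the client at a small logging proxy that forwards reports unchanged and prints the response body of every rejected request. The following is only a sketch using the Go standard library, not a Lightstep feature: it assumes the tracer reports over plain HTTP rather than gRPC, and the listen address and microsatellite URL passed on the command line are placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

// describeFailure formats one log line for a rejected report.
func describeFailure(method, path string, status int, body []byte) string {
	return fmt.Sprintf("%s %s -> %d: %s", method, path, status, bytes.TrimSpace(body))
}

// newDebugProxy forwards every request to target and logs the response body
// of any reply whose status code is 400 or higher.
func newDebugProxy(target *url.URL, logf func(format string, args ...any)) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.ModifyResponse = func(resp *http.Response) error {
		if resp.StatusCode < 400 {
			return nil
		}
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			return err
		}
		resp.Body.Close()
		// Restore the body so the tracer still receives the original reply.
		resp.Body = io.NopCloser(bytes.NewReader(body))
		logf("%s", describeFailure(resp.Request.Method, resp.Request.URL.Path, resp.StatusCode, body))
		return nil
	}
	return proxy
}

func main() {
	// Both arguments are placeholders chosen by the operator, for example:
	//   debugproxy 127.0.0.1:8361 http://<microsat-host>:<port>
	if len(os.Args) < 3 {
		fmt.Println("usage: debugproxy <listen-addr> <microsat-url>")
		return
	}
	target, err := url.Parse(os.Args[2])
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.ListenAndServe(os.Args[1], newDebugProxy(target, log.Printf)))
}
```

With the tracer's collector endpoint temporarily pointed at the proxy's listen address on a single pod, each failed flush should produce one log line containing whatever error text the microsatellite returned, if it returns any.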
02-12-2025 08:19 AM
Hi,
I am assuming this is for the `plaid-onprem` microsat pool? Please let me know if not.
I don't see any errors on the `microsat` -> `lightstep` path, so it looks like this is happening between your clients and the microsat.
Specifically I am seeing a lot of `client.spans.dropped` errors.
https://docs.lightstep.com/docs/understand-statsd-reporting-metrics-micro#clientspansdropped
Can you please try tuning based on the remediation steps suggested here:
https://docs.lightstep.com/docs/load-balance-lightstep#balance-and-tune-tracers
Thank you
Dan
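As a rough illustration of the kind of tuning the linked guide describes: a buffering tracer drops spans when more spans are produced in one reporting period than its buffer can hold. The sketch below estimates a minimum buffer size under that assumption. The function name, the headroom factor, and the sample numbers are all hypothetical and not taken from the Lightstep docs; the actual setting names should be taken from the linked page.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// minBufferedSpans estimates the smallest span buffer that can hold every
// span produced during one reporting period, multiplied by a headroom
// factor to absorb bursts. This is back-of-envelope arithmetic only.
func minBufferedSpans(spansPerSecond float64, period time.Duration, headroom float64) int {
	return int(math.Ceil(spansPerSecond * period.Seconds() * headroom))
}

func main() {
	// Illustrative numbers only: 2000 spans/s per pod, a 2.5 s reporting
	// period, and 50% headroom.
	fmt.Println(minBufferedSpans(2000, 2500*time.Millisecond, 1.5)) // prints 7500
}
```

If the configured buffer is already well above this estimate, buffer overflow is unlikely to explain the drops, and the rejected reports themselves are the more likely cause.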
02-12-2025 09:16 PM
Hi Dan,
I am also working with Mark on this. We tried tuning the batch size today and did not see any improvement.
Where are you seeing the `client.spans.dropped` errors? When I load the "Reporting Status Overview" page, it shows 0 client-dropped spans for this service; what is surprising instead is that the "Spans sent" value is so low. The "code 400" error message we are seeing is not much help for debugging on its own, and the Lightstep docs don't provide much more insight ("An HTTP response code of 400 Bad Request indicates either an issue with the access token in use, or a problem with the payload sent to the public Microsatellite pool").
How can we determine the actual underlying cause of the 400?
02-13-2025 07:04 AM - edited 02-13-2025 07:16 AM
Hi,
Can you please confirm:
1. The project in Lightstep that these spans are being sent to.
2. The name of the microsat pool that's having the issue.
3. The path this data is taking. I am assuming: client -> your_microsat_pool -> lightstep (but please correct me if this is wrong).
Thank you
Dan
02-13-2025 08:38 AM
1. SERVICE_ITEM_MANAGER
2. Microsat pool plaid-onprem
3. Yes, client -> microsat -> lightstep is the expected path