How can I debug why 99.99% of my spans are being dropped?

MarkR7232213139 · ‎02-11-2025

My service, IMS, is getting a ton of errors trying to push traces to Lightstep.

Here's what I know:

* Some pushes work, so I have a few, but not many, traces in LS

* I have a consistently high `rate(lightstep_error_event_total)` metric on all my pods, so traces are being continuously dropped

* I have a lot of `flush failed, could not send report to Collector` errors

* I am getting a lot of `status code (400) is not ok` in my logs for the trace.

Is there a way to configure the Lightstep micro-satellite to log the requests that produce these 400 errors? For the LS Go client library, can I turn up the debugging level to get more details on why this is failing?

DanTulovsky · ‎02-12-2025

Hi,

I am assuming this is for the `plaid-onprem` microsat pool? Please let me know if not.

I don't see any errors on the `microsat` -> `lightstep` path, so it looks like this is happening between your clients and the microsat.

Specifically I am seeing a lot of `client.spans.dropped` errors.

https://docs.lightstep.com/docs/understand-statsd-reporting-metrics-micro#clientspansdropped

Can you please try tuning based on the remediation steps suggested here:

https://docs.lightstep.com/docs/load-balance-lightstep#balance-and-tune-tracers

Thank you

Dan

EvanF8532133249 · ‎02-12-2025

Hi Dan,

I am also working with Mark on this. We tried tuning the batch size today and did not see any improvement.

Where are you seeing the `client.spans.dropped` errors? When I load the "Reporting Status Overview" page, it shows 0 client-dropped spans for this service; what is surprising instead is that the "Spans sent" value is so low. The "code 400" error message we are seeing is not much help for debugging without more information, and the Lightstep docs don't provide much more insight ("An HTTP response code of 400 Bad Request indicates either an issue with the access token in use, or a problem with the payload sent to the public Microsatellite pool").

How can we determine the actual underlying cause of the 400?

DanTulovsky · ‎02-13-2025

Hi,

Can you please confirm:

1. The project in Lightstep these spans are being sent to.

2. The name of the microsat pool that's having the issue.

3. The path this data is taking. I am assuming: client -> your_microsat_pool -> lightstep (but please correct me if this is wrong).

Thank you

Dan

EvanF8532133249 · ‎02-13-2025

1. SERVICE_ITEM_MANAGER

2. Microsat pool plaid-onprem

3. Yes, client -> microsat -> lightstep is the expected path