How can I debug why 99.99% of my spans are being dropped?

MarkR7232213139 · ‎02-11-2025

My service, IMS, is getting a ton of errors trying to push traces to Lightstep.

Here's what I know:

* Some pushes work, so I have a few traces in LS, but not many

* I have a consistently high `rate(lightstep_error_event_total)` metric on all my pods, so traces are being continuously dropped

* I have a lot of `flush failed, could not send report to Collector` errors

* I am getting a lot of `status code (400) is not ok` in my logs for the trace.

Is there a way to configure the Lightstep micro-satellite to log the requests that produce these 400 errors? For the LS Go client library, can I turn up the debugging level to get more details on why this is failing?

DanTulovsky · ‎02-14-2025

Ok thank you.

For 1), I don't think that's the Lightstep project. The project is what appears as the first thing in the URL after app.lightstep.com, e.g. https://app.lightstep.com/plaid-dev (I assume this is the `plaid-dev` project?).

The value of `clients.spans.dropped` is what I see on our system. I also see it on the reporting dashboard right now:

Also, it looks like data for that project is being sent via your plaid-onprem microsat pool, but also via our public microsats.

Can you also please share exactly what setting you changed and to what value? Specifically, what are the values for these settings: https://docs.lightstep.com/docs/load-balance-lightstep#balance-and-tune-tracers

Can you please confirm that your microsats have enough CPU and memory and are not running hot?

But to be clear, the dropping of data is happening on your clients; it's never actually making it to the microsat. I would double-check the tracer settings described in the link above and also check whether your workload itself has enough CPU.

Thank you

Dan