SSHCommandLong, AKA "Long Running Commands"

tim_broberg · ‎05-07-2019

Here's an attempt to capture the usual speeches I give on how long running commands work, how they have changed over time, and a few gotchas that tend to bite LRC users.

Overview

Long running commands (SSHCommandLong) are probes that run an SSH command that takes longer than a few minutes to complete, without hitting timeouts or tying up a mid server worker thread for an extended period of time.

Discovery does not currently support long running commands: the discovery sensor kicks off as soon as it sees the result of SSHCommandLong instead of waiting for the completion probe. From a user's point of view, the primary way to use long running commands is to tick the "long running" checkbox on an activity in a workflow.

An SSHCommand probe runs a single ssh command and waits for completion, capturing stdout and stderr output in real time as it flows back from the target server.

SSHCommandLong probes create a temp directory on the target server containing scripts to run your command. The command is run under nohup with the output going into files. The initial SSHCommandLong sets it all in motion, then a series of polling SSHCommand probes poll for completion, and finally a completion script SSHCommand probe is run to collect the results and clean up. The sensor is run on the output from this final completion probe.

Under the Hood - Universal Truths

How all this works has been evolving, and is continuing to evolve, particularly in how the polling works, but the basic flow remains the same.

In all cases, the mid server creates a script file based on the command in your SSHCommandLong and creates a temp directory, /tmp/.run.<guid>, which contains:

command script
complete script
nohup.out for stdout contents
nohup.out2 for stderr contents
a flag file, running, whose existence indicates that the command has not yet completed
any scripts you provided as arguments

The output of a successful SSHCommandLong probe contains the name of this temp directory.

Polling SSHCommand probes will look for the existence of the running file and report whether it still exists.
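To make the start / poll / complete cycle concrete, here is a minimal local sketch in Python. It is not the mid server's actual script: the function names are made up for illustration, it runs on the local machine rather than over ssh, and a detached shell session stands in for nohup. Only the directory layout (the running flag, nohup.out, nohup.out2) follows the description above.

```python
import os
import shutil
import subprocess
import tempfile
import uuid


def start(command):
    """Create the run directory, raise the 'running' flag, launch the command detached."""
    run_dir = os.path.join(tempfile.gettempdir(), ".run." + uuid.uuid4().hex)
    os.mkdir(run_dir, 0o700)
    # The flag exists before the command starts, so an early poll never misreads it.
    open(os.path.join(run_dir, "running"), "w").close()
    # stdout goes to nohup.out, stderr to nohup.out2; the flag is removed at the end.
    wrapper = "(" + command + ") > nohup.out 2> nohup.out2; rm -f running"
    subprocess.Popen(
        ["sh", "-c", wrapper],
        cwd=run_dir,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # stand-in for nohup: detach from the caller's session
    )
    return run_dir


def is_running(run_dir):
    """What a polling probe checks: does the flag file still exist?"""
    return os.path.exists(os.path.join(run_dir, "running"))


def complete(run_dir):
    """What the completion probe does: collect both outputs, then clean up."""
    with open(os.path.join(run_dir, "nohup.out")) as out_file:
        out = out_file.read()
    with open(os.path.join(run_dir, "nohup.out2")) as err_file:
        err = err_file.read()
    shutil.rmtree(run_dir)
    return out, err
```

A caller would invoke start(), loop on is_running() with a sleep in between, and call complete() once the flag disappears.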

When the mid server processes the SSHCommandLong probe, it inserts a long_sensor parameter in the input ecc_queue record, which tells the SensorProcessor what to do with the results. This script runs in lieu of the real sensor, which runs on completion, and it is this script that manages the long running command. The polling SSHCommands modify the long_sensor parameter to manage what should be done next in handling the long running command.

One good way to see this on an instance is to find an SSHCommandLong probe of interest on the ecc_queue, filter for its agent_correlator (containing rba.<workflow sysid>), and then remove any filters besides the agent_correlator. Alternatively, you can look at the ecc_queue for a workflow containing the long running command. You should be able to see an initial SSHCommandLong, many polling SSHCommands, (depending on version) a polling cancellation SSHCommand or two, and a completion SSHCommand.

Under the Hood - London

In releases <= London, the initial long_sensor pointed to the DiscoveryLongRunner class. (No, the irony of the name has not escaped me.)

Polling was handled by the mid server with the obscure (and buggy) "repeating commands" feature. A repeat_interval parameter was set in the polling SSHCommands, causing the same probe output to be run repeatedly until canceled: one output, many inputs.

Once a polling command reported completion, DiscoveryLongRunner would send an SSHCommand to cancel the polling, and once the polling was canceled, it would send the completion SSHCommand. long_sensor would get adjusted at each step along the way with instructions to perform the next step.
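The London sequence amounts to a small state machine, sketched below in Python. The step names and the next_step function are hypothetical; they only mirror the order described above (poll, cancel polling, complete, then the real sensor) and are not taken from DiscoveryLongRunner.

```python
from enum import Enum


class Step(Enum):
    POLL = "poll"                # repeating SSHCommand checks the running flag
    CANCEL_POLL = "cancel_poll"  # SSHCommand that stops the repeating poll
    COMPLETE = "complete"        # completion SSHCommand collects output, cleans up
    RUN_SENSOR = "run_sensor"    # the real sensor finally sees the result


def next_step(current, still_running=False):
    """Return the step that long_sensor would be pointed at next."""
    if current is Step.POLL:
        # Keep polling until a poll reports that the running flag is gone.
        return Step.POLL if still_running else Step.CANCEL_POLL
    if current is Step.CANCEL_POLL:
        return Step.COMPLETE
    if current is Step.COMPLETE:
        return Step.RUN_SENSOR
    raise ValueError("no step follows " + current.value)
```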

Under the Hood - Madrid

Madrid adds a new class to manage long running commands, LongRunner, and the SensorProcessor learned some new tricks to enable more flexibility in responding to problems.

The changes in Madrid address the problem that polling commands sometimes fail in transient ways. The chances of a glitch happening are proportional to the duration of the command, as are the costs of failure, so the pain of transient issues rises with the square of command duration.
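As a back-of-the-envelope version of that argument, assume a constant glitch rate \lambda per unit time and a cost of failure c per unit of elapsed command time (both symbols are illustrative assumptions, not measured quantities). For a command of duration T with \lambda T \ll 1:

```latex
P(\text{glitch}) \approx \lambda T, \qquad
C(\text{failure}) \approx c\,T, \qquad
\mathbb{E}[\text{loss}] \approx P(\text{glitch}) \cdot C(\text{failure}) \approx \lambda c\, T^{2}
```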

In London, the SensorProcessor would unconditionally bypass the long_sensor on error and send the result straight to the sensor. For workflows, this meant instant termination of the workflow, even though the long running command might continue on for a very long time thereafter. The long_sensor had no say in whether the error should be regarded as fatal or recoverable. Also, the complete script never got called, so the temp directory remained on the target system indefinitely.

In Madrid, a separate error handler, long_error_handler, analogous to long_sensor, is defined, which the SensorProcessor calls on error. The error handler returns true for recoverable failures and false for fatal ones.

The mid property mid.property.long_runner, which defaults to "LongRunner", allows admins to control which LongRunner class is used. SSHCommandLong sets long_sensor to the start() method of that class and long_error_handler to the error() method.

While this may seem a little convoluted, it allows admins to customize any aspect of the LongRunner they like; for example, they can adjust what constitutes a fatal vs a recoverable polling error. Since the class itself can be specified, customizations can be created without introducing update skips in LongRunner.

The Madrid LongRunner provides simple retries of failed polling probes.
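The contract between the SensorProcessor and the runner class can be sketched as follows. This is a Python illustration only: the real classes are instance-side scripts, the retry limit of 3 is an assumed number, and sending fatal errors on to the real sensor is an assumption carried over from the London behavior. Only the start()/error() split and the true-means-recoverable convention come from the description above.

```python
class SketchLongRunner:
    """Hypothetical stand-in for a LongRunner-style class."""

    MAX_RETRIES = 3  # assumed limit, chosen for illustration

    def __init__(self):
        self.consecutive_failures = 0

    def start(self, result):
        """Plays the long_sensor role: handle a successful probe result."""
        self.consecutive_failures = 0

    def error(self, result):
        """Plays the long_error_handler role: True means recoverable."""
        self.consecutive_failures += 1
        return self.consecutive_failures <= self.MAX_RETRIES


def process_result(result, runner, real_sensor):
    """Route one probe result the way the Madrid SensorProcessor is described."""
    if result.get("error"):
        if runner.error(result):
            return "retry"       # recoverable: poll again
        real_sensor(result)      # fatal: assumed to reach the real sensor
        return "fatal"
    runner.start(result)
    return "ok"
```

Because the runner is just an object with two methods, swapping in a subclass with a different error() is all it takes to change what counts as fatal.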

Under the Hood - New York

Prior to New York, polling was driven by the mid server. If the mid server went down, polling stopped and the long running command never completed.

New York moves polling to the instance.

A new table is created, long_runner_poll, which identifies when we should next poll and what the previous ecc_queue record was with all the details.
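As a rough picture of what such a table has to support, here is a Python sketch. The field names are hypothetical and are not the real column names of long_runner_poll; the point is only that each row pairs a next-poll time with a pointer to the previous ecc_queue record.

```python
from dataclasses import dataclass


@dataclass
class PollRow:
    """Hypothetical stand-in for one long_runner_poll record."""
    next_poll_at: float   # when to poll next, as epoch seconds
    previous_ecc_id: str  # the previous ecc_queue record, which carries the details


def due_rows(rows, now):
    """Return the rows whose next poll time has arrived, earliest first."""
    due = [row for row in rows if row.next_poll_at <= now]
    return sorted(due, key=lambda row: row.next_poll_at)
```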

The repeating command initiation and cancellation are gone, which saves some traffic for short long commands.

To keep the overhead from all the additional outbound commands under control, the polling frequency now rolls off as the command goes on. This takes advantage of the fact that the longer a command has been running, the less likely it is to complete in any given interval, so fewer polls are needed as time goes on.

There are several parameters to tune this, which are listed at the beginning of the New York version of the LongRunner class. Initial polling period, max polling period, and % decay per poll are all controllable. (For that matter, you can write your own LongRunner and create any polling scheme you like.)
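To show how those three knobs interact, here is a small Python sketch. The function and parameter names are made up, and it assumes "% decay per poll" means the polling frequency drops by that percentage after each poll (so the period is divided by 1 minus the decay fraction); the real LongRunner may define it differently.

```python
def poll_periods(initial_s, max_s, decay_pct, polls):
    """Return the wait before each of the first `polls` polls, in seconds.

    Assumes frequency drops by decay_pct percent per poll, capped at max_s.
    """
    if not 0 <= decay_pct < 100:
        raise ValueError("decay_pct must be in [0, 100)")
    periods = []
    period = min(float(initial_s), float(max_s))
    for _ in range(polls):
        periods.append(period)
        # A lower frequency means a longer period, never beyond the maximum.
        period = min(period / (1 - decay_pct / 100.0), float(max_s))
    return periods
```

With a 10 second initial period, a 60 second maximum, and 50% decay, the waits are 10, 20, 40, 60, 60 seconds: five polls cover over three minutes where a fixed 10 second period would have needed nineteen.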

One would think the last poll could also collect the results and save one round trip. I certainly thought that. The reason it is not currently implemented that way is that a failure during the complete probe is currently considered always to be fatal whereas polling probes are often recoverable. When a probe that might be a complete or might be a polling probe fails, it's hard to know whether it's recoverable. Perhaps we'll figure out a good answer to this little problem in the future and we can trim away one round trip per long running command.

Gotcha #1 - File Permissions

The long running command script sets umask 0077 before running your command to hide your files from prying eyes.

If you need those files to be accessible to other users, you will need to explicitly set the permissions.
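The effect is easy to reproduce locally. The Python sketch below (the helper name is made up) creates a file under umask 0077 and reports its mode, which on a POSIX system comes out as owner-only read/write; an explicit chmod is then what opens it up to others.

```python
import os
import stat


def create_under_umask(path, mask=0o077):
    """Create a file with the given umask in effect and return its permission bits."""
    previous = os.umask(mask)
    try:
        with open(path, "w") as handle:
            handle.write("data\n")
    finally:
        os.umask(previous)  # restore the caller's umask
    return stat.S_IMODE(os.stat(path).st_mode)
```

After create_under_umask("result.txt") returns 0o600, calling os.chmod("result.txt", 0o644) makes the file readable by other users.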

Gotcha #2 - Sudo Handling

Sudo handling in long running commands is touchy because any password prompt that appears while running under nohup is instant death. There is no console by which to receive the password prompt or send the password.

For this reason, if you require sudo for a long running command, it generally gets applied outside of the nohup command, which means you have to configure sudoers to allow running /tmp/.run.<anything>. You might as well grant ALL=(ALL) ALL, really, since you, as a sysadmin, have no control over what's in those scripts.

If sudoers has requiretty set for this command and user, and you specify sudo for a command inside the script, it will fail because there is no tty. It's nohup. Not having a tty is the whole point, so SSHCommandLong is forced to strip sudo out of your script and apply it around the script.

If sudoers does not have NOPASSWD for this command and user, sudo will prompt for a password, which we will be in no position to respond to, so SSHCommandLong will, again, strip sudo from the script and apply it to the whole nohup'ping thing.

So, you must not have requiretty set, and you must have NOPASSWD, to use sudo inside an SSHCommandLong command.
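The rule boils down to a two-input decision, written out here as a Python sketch. The function name and return values are invented for illustration, and how SSHCommandLong actually detects the sudoers settings is not covered here.

```python
def sudo_placement(requiretty, nopasswd):
    """Say where sudo ends up for a long running command.

    "inside" means sudo can stay on commands within the script;
    "around" means it is stripped from the script and wrapped around the whole run.
    """
    if not requiretty and nopasswd:
        return "inside"
    # Either a tty would be demanded or a password prompt would appear under nohup.
    return "around"
```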

Future Expansion

Long running commands could still use a cancellation mechanism to handle timeouts, fatal commands, workflow cancellations, etc. We're working on it, but these are future development plans, so you didn't hear any promises from me.

I fully expect that the current determination of fatal vs recoverable errors is still naive, and we're failing some cases that could be recovered. The LongRunner design is flexible to allow quick fixes to issues just like this.

The current polling frequency scheme is a straight exponential decay, which is fairly efficient, but the initial poll is considerably less likely to catch completion than the one that follows. The perfect polling scheme would poll hardest near the peak instead of at the beginning, perhaps even collecting data on when completion tends to happen and adjusting polling frequency to create bins of roughly equal probability of catching completion. The exponential decay approach is probably good enough, but one could certainly override the relevant method in LongRunner and adjust the polling frequencies based on a more sophisticated approach if one were just desperate to reduce ecc_queue traffic while maintaining responsiveness to command completion.
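As a sketch of the "bins of roughly equal probability" idea, the Python below derives poll times from the quantiles of past completion times. Nothing like this exists in LongRunner as far as this post describes; the function name and the idea of keeping a history of durations are assumptions for illustration.

```python
import statistics


def equal_probability_poll_times(past_durations_s, polls):
    """Pick `polls` poll times so each gap holds an equal share of past completions.

    Needs at least two past durations; times are seconds since the command started.
    """
    # n = polls + 1 bins gives exactly `polls` cut points between them.
    return statistics.quantiles(past_durations_s, n=polls + 1, method="inclusive")
```

If most commands historically finish around the ten minute mark, the cut points cluster there, so polling is densest where completion is most likely instead of at the very start.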

tim_broberg · ‎07-19-2020

Paris update: In Paris, the lr.sh script template becomes editable in the mid server scripts.

This opens the door to reworking long running command behavior on the target servers without editing the mid server download package.

Probably not a notable improvement to users, or at least the sane ones, but provided here for completeness.