When a child process returns to a waiting parent, the wait( &status ) family of system calls fills in the integer status with the condition of the child's demise. For all practical purposes, this integer should be treated as an opaque data type. Every libc since back in the day provides a set of functions (actually macros, but never mind) to test for a variety of demise conditions. For details about these test and extract macros, please refer to the man -s 2 wait manual page. The only thing that one can reliably say about this integer, which I will call the raw exit status, is that you can treat it as a boolean in the sense of the C language:

  • A value of 0 means the child terminated by returning 0 from main or calling exit(0) or similar means of exiting.
    This is typically equated with success of the child.
  • A value other than 0 means that the child either died on a signal, or terminated with a non-zero result from main.
    This is typically equated with failure of the child.

You should not make any assumptions about what the bits in the raw exit status mean, and should treat it as a boolean as shown above. Unless you are investigating the raw exit status on a system whose OS and libc versions are equivalent to those of the system that generated it, you may draw the wrong conclusions from it.
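As a minimal sketch of this advice, Python's os module exposes the local libc macros (os.WIFEXITED and friends) on Unix, so a status can be decoded without hand-rolled bit masks. The function names succeeded and classify are invented here for illustration, and the status is assumed to come from a wait call on the same machine.

```python
import os

def succeeded(status: int) -> bool:
    """Boolean view of the raw exit status: 0 is success, anything else failure."""
    return status == 0

def classify(status: int):
    """Decode a wait status with the local libc macros (Unix only)."""
    if os.WIFEXITED(status):
        # Child returned from main or called exit().
        return ("regular", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        # Child died on a signal.
        return ("signalled", os.WTERMSIG(status))
    # Stopped or otherwise unexpected states.
    return ("other", None)
```

On Unix, os.system returns a status in wait format, so something like classify(os.system("exit 3")) is one quick way to try this out locally.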

Warning

An application which uses a non-zero exit code to indicate conditions that are still successful always requires special treatment. The boolean approach outlined above does not hold for such programs. (But that will be the least of your problems.)

The raw status may be useful for debugging after the fact, though, and for this reason I am not fully convinced that we should do away with it in the Stampede database. However, the caveats need to be clearly documented. The raw value may only be safely investigated on a platform that is equivalent in OS version and libc version to the system that generated it.

When we scratch the surface of the raw exit status, the following bit pattern emerges. It has been in use ever since, but might change at some point in the future (looking at them Debian zealots):

raw exit code bit pattern

 1 1
 5 4        8 7 6     0
+-+----------+-+-------+
|?| exitcode |C| signo |
+-+----------+-+-------+

bits     meaning            range
15       unused             0
8..14    exit code          0..127
7        core flag          0 or 1
0..6     signal of death    1..127

While the status has room for both an exit code and a signal of death, the two are mutually exclusive and only one will be set. If the process died on a signal, there will be no exit code. If it exited regularly and not on a signal, the exit code is set, but the signal is undefined (though typically 0).
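Purely to illustrate the layout above, the fields can be pulled apart with shifts and masks. This is a sketch for after-the-fact debugging only, under the assumption that the status was produced on a platform using exactly this layout; the function name split_raw is invented here. Negative values are rejected because they carry no bit fields to inspect.

```python
def split_raw(raw: int):
    """Split a raw status per the documented layout.

    Returns (exitcode, core, signo). Illustration only: on a live system,
    prefer the libc macros over hand-written masks.
    """
    if raw < 0:
        raise ValueError("negative raw status has no bit fields to inspect")
    signo = raw & 0x7F             # bits 0..6: signal of death
    core = bool(raw & 0x80)        # bit 7: core flag
    exitcode = (raw >> 8) & 0x7F   # bits 8..14: exit code
    return exitcode, core, signo
```

For example, a raw value of 768 (3 shifted left by 8) splits into exit code 3 with no core flag and no signal.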

kickstart already translates the raw exit code, using the functions (macros) of the remote system's libc, into one of the following mutually exclusive child elements of the status element:

  1. The regular element is used when the program terminates regularly. Its exitcode attribute holds a value between 0 and 127.
  2. The signalled element is used when the program terminates on a signal. This element has two attributes.
    • The core attribute is boolean and indicates that a core file was dumped (provided the OS tells kickstart about it).
    • The signal attribute is a positive integer between 1 and 127 identifying the signal of death.
  3. The failure element is special. It is used when the program could not be started at all, e.g. if the executable does not exist. The error attribute contains the libc errno value that was captured as the cause of the failure.
  4. The suspended element should not happen. If you detect it, it is a bug.

In case of failure, kickstart will set the raw exit status to a special artificial value of -1, which means all bits set. The raw value of -1, or any negative raw value, must not be investigated with the macros. Any negative value for raw should be taken as indication of failure.

Warning

It is plain wrong to grep for the string regular or exitcode in kickstart output, because you will miss the other termination conditions.

  1. If you must grep, look for the status element.
  2. You should use an xmlgrep, not plain grep.
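A rough sketch of the XML-aware approach, using Python's standard xml.etree.ElementTree. The record layout in SAMPLE is an assumption assembled from the element and attribute names mentioned on this page, not copied from real kickstart output, and the function name terminations is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: layout assumed from the names used on this page.
SAMPLE = (
    '<invocation>'
    '<status raw="139"><signalled signal="11" core="true"/></status>'
    '</invocation>'
)

def local(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}status' down to 'status'."""
    return tag.rsplit("}", 1)[-1]

def terminations(xml_text: str):
    """Return (raw, kind, attributes) for every status element in the record."""
    found = []
    for elem in ET.fromstring(xml_text).iter():
        if local(elem.tag) != "status":
            continue
        raw = elem.get("raw")
        for child in elem:
            # kind is regular, signalled, failure or suspended
            found.append((int(raw) if raw is not None else None,
                          local(child.tag), dict(child.attrib)))
    return found

if __name__ == "__main__":
    print(terminations(SAMPLE))
```

Matching on the status element rather than on regular or exitcode means signalled, failure and suspended terminations are picked up as well.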

In the presence of kickstart, the raw status already exists, and the translation into the cooked exit code to be stored in the Stampede schema should be fairly easy:

  • For regular, store the exitcode value between 0 and 127.
  • For signalled, store the negated signal of death as value between -1 and -127.
  • For failure, store either a NULL value, or a special value from your database data type.

Info

If you decide to store the cooked exit code in the database equivalent of a signed char or signed byte, the only special value left, besides NULL, is -128.
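The cooking rules above can be sketched as a small function. The name cooked_exit_code and its (kind, value) signature are assumptions made for illustration; failure is mapped to None here, standing in for a database NULL.

```python
from typing import Optional

def cooked_exit_code(kind: str, value: Optional[int] = None) -> Optional[int]:
    """Map a kickstart status element to the cooked exit code.

    kind  -- name of the element: 'regular', 'signalled' or 'failure'
    value -- exitcode for regular, signal number for signalled
    """
    if kind == "regular":
        if value is None or not 0 <= value <= 127:
            raise ValueError("exit code must be in 0..127")
        return value
    if kind == "signalled":
        if value is None or not 1 <= value <= 127:
            raise ValueError("signal number must be in 1..127")
        return -value          # negated signal of death: -1..-127
    if kind == "failure":
        return None            # NULL; -128 would be the alternative sentinel
    raise ValueError(f"unexpected status element: {kind}")
```

Every non-NULL result lies between -127 and 127, so it fits a signed byte with -128 to spare.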

In the absence of kickstart, we have the same three test cases as above. However, obtaining the proper information to store will be more tricky:

  • For regular, DAGMan should report the exit code of the child.
  • For signalled, Fabio will have to test this case, and we need to revisit. Remember to negate the signal of death number.
  • For failure, it depends on the execution universe of Condor. However, monitord must ultimately detect this form of failure and store the appropriate values.

In addition, should we decide to keep the raw exit status in the data, in the absence of kickstart, we get to make the translation rules:

  • For a regular, compute the raw status as the reported exit code, a value between 0 and 127, shifted left by 8.
  • For a signalled, compute the raw status as the signal number that caused death, a value between 1 and 127.
  • For a failure, store a value of -1 for raw.
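These translation rules might look as follows in code. This is a sketch only; the name synthesize_raw and the (kind, value) arguments are hypothetical. Note that the core flag cannot be reconstructed this way and is always left at 0.

```python
def synthesize_raw(kind: str, value: int = 0) -> int:
    """Build a raw status when no kickstart record is available.

    kind  -- 'regular', 'signalled' or 'failure'
    value -- exit code (0..127) or signal number (1..127)
    """
    if kind == "regular":
        if not 0 <= value <= 127:
            raise ValueError("exit code must be in 0..127")
        return value << 8      # exit code lives in bits 8..14
    if kind == "signalled":
        if not 1 <= value <= 127:
            raise ValueError("signal number must be in 1..127")
        return value           # signal lives in bits 0..6, core flag unknown
    if kind == "failure":
        return -1              # artificial failure marker
    raise ValueError(f"unexpected kind: {kind}")
```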

For the shell planner, we will need to investigate the value of $? for the various cases. This special variable captures the value of the above status integer, albeit cooked by the shell. The trick is to uncook it.
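One possible starting point for the uncooking, as a sketch: it assumes the shell reports death by signal as 128 plus the signal number, which is what bash and dash do, but which still needs to be verified for the shells the planner targets. The function name uncook_shell_status is invented here.

```python
def uncook_shell_status(value: int):
    """Translate a shell $? value back into (kind, number).

    ASSUMPTION: the shell encodes death by signal N as 128 + N.
    Caveat: a program that itself exits with a code above 128 cannot be
    told apart from a signal death, and shells also use 126 and 127 for
    commands they could not run.
    """
    if not 0 <= value <= 255:
        raise ValueError("$? is expected to be in 0..255")
    if value <= 127:
        return ("regular", value)
    if value == 128:
        raise ValueError("128 is ambiguous: neither a 0..127 exit code nor 128+signal")
    return ("signalled", value - 128)
```

Under that assumption, a $? of 137 would be read as death by signal 9.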
