This document lists issues that algorithm developers should keep in mind while
developing their codes. Keeping them in mind will prevent many problems when
running the codes on the Grid.

Supported Platforms

Most of the hosts that make up a Grid run variants of Linux or, in some cases, Solaris.
For the purposes of this project, we should narrow this down to a manageable list of
supported Linux versions and hardware platforms.

At the very least, the algorithm codes should be able to run on the following Grids during
the first phase of the project.

Running on Windows

The majority of the machines making up the various Grid sites run Linux. In fact, there is
no widespread deployment of a Windows-based Grid. Currently, the server side software
of Globus does not run on Windows. Only the client tools can run on Windows.

The algorithm developers should not code exclusively for the Windows platforms. They
must make sure that their codes run on Linux or Solaris platforms. If the code is written
in a portable language like Java, then porting should not be an issue.

Packaging of software

As far as possible, binary packages (preferably statically linked) of the codes should be
provided. If for some reason the codes need to be built from source, they should have an
associated makefile (for C/C++ based tools) or an ant file (for Java tools). The build
process should refer to the standard libraries that are part of a normal Linux
installation. If the codes require non-standard libraries, clear documentation must be
provided on how to install those libraries and how to make the build process refer to
them.

Further, installing software as root is not a possibility. Hence, all the external libraries
that need to be installed can only be installed as non-root in non-standard locations.

MPI codes

If any of the algorithm codes are MPI based, their developers should contact the Grid
group. MPI can be run on the Grid, but the codes need to be compiled against the MPI
libraries installed at the various Grid sites. The Grid group has some experience running
MPI code through PBS.

Maximum Running time of the algorithm codes

Each of the Grid sites has a policy on the maximum time for which it will allow a job to
run. The algorithms catalog should record the maximum time (in minutes) that the job can
run for. This information is passed to the Grid site when a job is submitted, so that the
Grid site does not kill the job before that published time expires. (It is OK if the job
runs for only a fraction of the maximum time.)

Codes cannot specify the directory in which they should be run

Codes are installed in some standard location on the Grid sites or staged on demand.
However, they are not invoked from the directories where they are installed. It should be
possible to invoke the codes from any directory, as long as the directory where the codes
are installed is accessible.

This is especially relevant when writing scripts around the algorithm codes. There,
relative paths do not work, because a relative path is resolved against the directory
from which the script is invoked. A suggested workaround is to pick up the base directory
where the software is installed from the environment, or to derive it with the dirname
command or API. The workflow system can set appropriate environment variables while
launching jobs on the Grid.
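
As a minimal sketch of this workaround in Python, the helper below first looks for the
base directory in an environment variable and otherwise derives it from the script's own
location, which is the equivalent of dirname. The variable name ALGO_HOME and both
function names are hypothetical; substitute whatever the workflow system sets for your
code.

```python
import os
import sys


def install_base_dir(env_var="ALGO_HOME", script_path=None):
    """Return the directory the software is installed in.

    ALGO_HOME is a hypothetical variable name used for illustration only.
    """
    base = os.environ.get(env_var)
    if base:
        return os.path.abspath(base)
    # Fall back to the directory that contains the script itself. Call this
    # at startup, before any chdir, so a relative sys.argv[0] still resolves.
    if script_path is None:
        script_path = sys.argv[0]
    return os.path.dirname(os.path.abspath(script_path))


def installed_file(*parts, **kwargs):
    """Build the path of a file shipped with the installation."""
    return os.path.join(install_base_dir(**kwargs), *parts)
```

With this in place a wrapper script can call installed_file("bin", "tool") regardless of
the directory it was started from.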

No hard-coded paths

The algorithms should not hard-code any directory paths in the code. All directory
paths should be picked up explicitly, either from the environment (through environment
variables) or from command line options passed to the algorithm code.
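
One way to follow this rule, sketched here in Python, is to take every path from a
command line option and let an environment variable supply the default. The option
--work-dir, the variable ALGO_WORK_DIR and the function name are assumptions made for
this illustration; --input and --output mirror the example invocation given under
logical file naming.

```python
import argparse
import os


def parse_paths(argv, environ=None):
    """Collect all paths from options or the environment, never from constants."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Hypothetical algorithm code")
    parser.add_argument("--input", required=True, help="input file")
    parser.add_argument("--output", required=True, help="output file")
    parser.add_argument(
        "--work-dir",
        default=environ.get("ALGO_WORK_DIR"),  # hypothetical variable name
        help="working directory (default: value of ALGO_WORK_DIR)",
    )
    args = parser.parse_args(argv)
    if args.work_dir is None:
        # parser.error prints a usage message and exits with status 2.
        parser.error("give --work-dir or set ALGO_WORK_DIR")
    return args
```

A script would call parse_paths(sys.argv[1:]) once at startup and pass the resulting
paths on to the rest of the code.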

Propagating back the right exitcode

A job in the workflow is only released for execution if its parents have executed
successfully. Hence, it is very important that the algorithm codes exit with the correct
exit code on both success and failure. The algorithms should exit with a status of 0 in
case of success, and a nonzero status in case of error. Failure to do so will result in
erroneous workflow execution, where jobs might be released for execution even though
their parents had exited with an error.

The algorithm codes should catch all errors and exit with a nonzero exitcode.
Successful execution of the algorithm code can only be determined by an exitcode of 0.
The algorithm code should not rely upon something being written to stdout to designate
success. For example, if the algorithm code writes SUCCESS to stdout and exits with a
nonzero status, the job will be marked as failed.
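
The pattern can be sketched in Python as a main function that turns any uncaught error
into a nonzero status. run_algorithm and its --fail switch are placeholders invented for
this example; they stand in for the real work.

```python
import sys
import traceback


def run_algorithm(argv):
    """Placeholder for the real work; raises an exception on failure."""
    if "--fail" in argv:
        raise RuntimeError("simulated failure")


def main(argv):
    """Return 0 on success and a nonzero status on any error."""
    try:
        run_algorithm(argv)
    except Exception:
        # Log the reason on stderr, then report failure through the status.
        traceback.print_exc(file=sys.stderr)
        return 1
    return 0

# A script would end with: sys.exit(main(sys.argv[1:]))
```

The exit status, not any message printed along the way, is what reports the outcome.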

Temporary files

If the algorithm codes create temporary files during execution, the codes should remove
them on both error and successful termination. The algorithm codes will run on scratch
file systems that are also used by others. The scratch directories fill up very easily,
and jobs will fail if a directory runs out of free space. Temporary files are the files
that are not tracked explicitly through the workflow generation process.
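
A minimal Python sketch of such cleanup, assuming the code can keep all of its temporary
files under one private directory: tempfile.TemporaryDirectory removes that directory
when the with block is left, whether normally or through an exception. The function
name, file name and fail switch are made up for the example.

```python
import os
import shutil
import tempfile


def process(output_path, scratch_root=None, fail=False):
    """Write output_path, using a private temporary directory for scratch data."""
    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp_dir:
        partial = os.path.join(tmp_dir, "partial.dat")
        with open(partial, "w") as handle:
            handle.write("intermediate data\n")
        if fail:
            raise RuntimeError("simulated failure")
        # Only the finished, tracked output leaves the temporary directory.
        shutil.move(partial, output_path)
    # tmp_dir and anything left in it are gone here, on success or error.
```

Note that this does not cover a job that is killed outright, so keeping temporary data
small is still worthwhile.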

STDOUT/STDERR Handling

stdout and stderr should be used for logging purposes only. Any results of the
algorithm codes should be saved to data files that can be tracked through the workflow
system.
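
As an illustration in Python, the sketch below sends progress messages to stderr through
the logging module and writes the actual result to a file whose name is supplied by the
caller. The logger name and the function are hypothetical.

```python
import logging
import sys

# Log messages go to stderr; nothing here is treated as result data.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
log = logging.getLogger("algo")


def write_result(values, result_path):
    """Compute a result and save it to a data file rather than printing it."""
    log.info("processing %d values", len(values))
    total = sum(values)
    with open(result_path, "w") as handle:
        handle.write("%s\n" % total)
    log.info("result written to %s", result_path)
    return total
```

The result file can then be declared to the workflow system like any other output.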

Configuration files

If your code requires a configuration file to run and the configuration changes from one
run to another, then this file needs to be tracked explicitly via the workflow system. The
configuration file should not contain any absolute paths to any data or libraries used by
the code. If any libraries, scripts, etc. need to be referenced, they should be referred
to by relative paths starting with ./ (as in ./xyz, where xyz is a tracked file defined in
the workflow) or as $ENV-VAR/xyz, where $ENV-VAR is set at execution time and evaluated
internally by your application code.
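
The internal evaluation could look roughly like the following Python sketch, which
accepts the two forms described above and rejects absolute paths written directly in the
configuration. The function name is hypothetical, and ALGO_LIB_DIR in the usage note is
an assumed variable name standing in for $ENV-VAR.

```python
import os


def resolve_config_path(value, run_dir=None):
    """Turn a path entry from the configuration file into a usable path."""
    if os.path.isabs(value):
        raise ValueError("absolute path in configuration: %r" % value)
    # Expands $NAME and ${NAME}; unset variables are left unchanged.
    expanded = os.path.expandvars(value)
    if "$" in expanded:
        raise ValueError("unset environment variable in: %r" % value)
    if os.path.isabs(expanded):
        return os.path.normpath(expanded)
    # Relative entries such as ./xyz are resolved against the run directory.
    return os.path.normpath(os.path.join(run_dir or os.getcwd(), expanded))
```

For example, with ALGO_LIB_DIR set to /opt/algo/lib, the entry $ALGO_LIB_DIR/xyz resolves
to /opt/algo/lib/xyz, while ./xyz resolves to xyz inside the directory the job runs in.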

Logical file naming

The logical file names used by your code can be of two types.

  1. Without a directory path, e.g. f.a, f.b etc.
  2. With a directory path, e.g. a/1/f.a, b/2/f.b

Both types of files are supported. We will create any directory structure mentioned in
your logical files on the remote execution site when we stage in data as well as when we
store the output data to a permanent location.

An example invocation of a code that consumes and produces files will be

    $ /bin/test --input f.a --output f.b

or

    $ /bin/test --input a/1/f.a --output b/1/f.b

Note: A logical file name should never be an absolute file path, e.g. /a/1/f.a (there
should not be a starting /).
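
A code that receives logical file names can check this rule before using them. The
Python sketch below is only an illustration; check_lfn and local_path are hypothetical
helpers, not part of the workflow system.

```python
import os


def check_lfn(lfn):
    """Return lfn unchanged if it is a valid logical file name."""
    if not lfn:
        raise ValueError("empty logical file name")
    if lfn.startswith("/"):
        raise ValueError("logical file name must not start with /: %r" % lfn)
    return lfn


def local_path(lfn, work_dir):
    """Map a logical file name onto a path below the job's working directory."""
    return os.path.join(work_dir, *check_lfn(lfn).split("/"))
```

Both f.a and a/1/f.a pass the check, while /a/1/f.a raises a ValueError.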

1 Comment

  1. Unknown User (voeckler)

    We need to distinguish two levels here:

    1. When writing a new application, it is perfectly acceptable to use stdin for a single piece of input data, stdout for a single piece of output data. In the Unix world, it will work well, but may hiccup in the Windows world. stderr should only be used for logging and debugging, never to put data on it.
    2. We are suggesting that you don't use stdio for data, because there is the implied expectation that stdio data is magically handled, including streaming and staging to the submit host. There is no magic! If you produce data on stdout, you need to declare to Pegasus that your stdout has data that you care about, and what LFN to track it in. After the application is done, it will be a (remote) file like any other data product. If you produce logs on stderr that you care about, you must make it a tracked file in the same manner.

    Internally, Pegasus handles stdio the same way the shell handles stdio redirections. When you redirect stdin or stdout on your command prompt, you are also required to specify a file name.