Some of this information is outdated. For more up-to-date info see: http://pegasus.isi.edu/wms/docs/latest/submit_directory.php

Layout

Each planned workflow is associated with a submit directory. In it you will see the following

  • <daxlabel-daxindex>.dag - This is the Condor DAGMan dag file corresponding to the executable workflow generated by Pegasus. The dag file describes the edges in the DAG and information about the jobs in the DAG. A Pegasus-generated .dag file usually contains the following information for each job
    • the job submit file for the job.
    • the post script that is to be invoked when the job completes. This is usually $PEGASUS_HOME/bin/exitpost, which parses the kickstart record in the job's .out file and determines the exitcode.
    • JOB RETRY - the number of times the job is to be retried in case of failure. In Pegasus, the job postscript exits with a non-zero exitcode if it determines that a failure occurred.
  • <daxlabel-daxindex>.dag.dagman.out - When a DAG ( .dag file ) is executed by Condor DAGMan, DAGMan writes its output to the <daxlabel-daxindex>.dag.dagman.out file. This file records the progress of the workflow and can be used to determine its status. Most Pegasus tools mine the dagman.out or jobstate.log file to determine the progress of a workflow.
  • <daxlabel-daxindex>.dot - Pegasus creates a dot file for the executable workflow in addition to the .dag file. This can be used to visualize the executable workflow using the dot program.
  • <job>.sub - Each job in the executable workflow is associated with its own submit file. The submit file tells Condor how to execute the job.
  • <job>.out.00n - The stdout of the executable referred to in the job submit file. In Pegasus, jobs are usually launched via kickstart, so this file contains the kickstart XML provenance record that captures runtime provenance on the remote node where the job executed. n varies from 1 to N, where N is the JOB RETRY value in the .dag file. The exitpost executable is invoked on the <job>.out file and moves <job>.out to <job>.out.00n so that the job's .out files are preserved across retries.
  • <job>.err.00n - The stderr of the executable referred to in the job submit file. In Pegasus, jobs are usually launched via kickstart, so this file contains the stderr of kickstart. It is usually empty unless there is an error in kickstart itself, e.g. kickstart segfaults, or the kickstart location specified in the submit file is incorrect. The exitpost executable also moves <job>.err to <job>.err.00n so that the job's .err files are preserved across retries.
  • jobstate.log - The jobstate.log file is written out by the tailstatd daemon, which is launched when a workflow is submitted for execution by pegasus-run. The tailstatd daemon parses the dagman.out file and writes out the jobstate.log file, which is easier to parse. The jobstate.log captures the various states through which a job goes during workflow execution.
  • braindump.txt - Contains information about the run, such as the pegasus version, dax file, dag file, and dax label:

    dax /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dax/CyberShake_LGU.dax
    dag CyberShake_LGU-0.dag
    basedir /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags
    run /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags/gmehta/pegasus/CyberShake_LGU/20100106T162339-0800
    jsd /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags/gmehta/pegasus/CyberShake_LGU/20100106T162339-0800/jobstate.log
    rundir 20100106T162339-0800
    pegasushome /usr/local/pegasus/default
    vogroup pegasus
    label CyberShake_LGU
    planner /usr/local/pegasus/default/bin/pegasus-plan
    pegasus_generator Pegasus
    pegasus_version 2.4.0cvs
    pegasus_build 20091221194342Z
    pegasus_wf_name CyberShake_LGU-0
    pegasus_wf_time 20100106T162339-0800
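The braindump.txt format above is plain key/value pairs, one per line, where the value may itself contain spaces. A minimal sketch of parsing it into a dictionary (the sample keys below are taken from the listing above):

```python
def parse_braindump(text):
    """Parse braindump.txt content: 'key value' per line, value may have spaces."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        entries[key] = value.strip()
    return entries

# A few lines copied from the sample braindump.txt above.
sample = """\
dag CyberShake_LGU-0.dag
rundir 20100106T162339-0800
pegasus_version 2.4.0cvs
"""

info = parse_braindump(sample)
print(info["dag"])  # CyberShake_LGU-0.dag
```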

Condor DAGMan file

The Condor DAGMan file ( .dag ) is the input to Condor DAGMan, the workflow executor used by Pegasus.

A Pegasus-generated .dag file usually contains the following information for each job

  • the job submit file for the job.
  • the post script that is to be invoked when the job completes. This is usually $PEGASUS_HOME/bin/exitpost, which parses the kickstart record in the job's .out file and determines the exitcode.
  • JOB RETRY - the number of times the job is to be retried in case of failure. In Pegasus, the job postscript exits with a non-zero exitcode if it determines that a failure occurred.

Reading a Condor DAG file

The Condor DAG file below has the following fields highlighted for a single job

  1. JOB and the submit file for the job
  2. Post script that is invoked on the stdout brought back to the submit directory
  3. JOB RETRY

At the end of the DAG file, the relations between the jobs ( which identify the underlying DAG structure ) are highlighted.

Condor DAGMan File

######################################################################
# PEGASUS WMS GENERATED DAG FILE
# DAG scb
# Index = 0, Count = 1
######################################################################

JOB das_tide_ID000001 das_tide_ID000001.sub
SCRIPT POST das_tide_ID000001 /lfs1/software/install/pegasus/default/bin/exitpost -Dpegasus.user.properties=/lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/pegasus.32479.properties -e /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/das_tide_ID000001.out
RETRY das_tide_ID000001 3

....

JOB create_dir_scb_0_cobalt create_dir_scb_0_cobalt.sub
SCRIPT POST create_dir_scb_0_cobalt /lfs1/software/install/pegasus/default/bin/exitpost -Dpegasus.user.properties=/lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/pegasus.32479.properties -e /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/create_dir_scb_0_cobalt.out
RETRY create_dir_scb_0_cobalt 3

PARENT das_tide_ID000001 CHILD fcst_tide_ID000002
...
PARENT create_dir_scb_0_cobalt CHILD das_tide_ID000001
PARENT create_dir_scb_0_cobalt CHILD fcst_tide_ID000002
######################################################################
# End of DAG
######################################################################
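The .dag grammar shown above ( JOB, SCRIPT POST, RETRY, PARENT ... CHILD ) is simple enough to mine directly. The following is an illustrative sketch, not a Pegasus tool, that extracts the jobs, retry counts, and edges:

```python
def parse_dag(text):
    """Extract JOB, RETRY, and PARENT/CHILD records from a .dag file."""
    jobs, retries, edges = {}, {}, []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue
        if tokens[0] == "JOB":
            jobs[tokens[1]] = tokens[2]          # job name -> submit file
        elif tokens[0] == "RETRY":
            retries[tokens[1]] = int(tokens[2])  # job name -> retry count
        elif tokens[0] == "PARENT":
            # PARENT p1 [p2 ...] CHILD c1 [c2 ...] expands to all (p, c) pairs
            child_idx = tokens.index("CHILD")
            for parent in tokens[1:child_idx]:
                for child in tokens[child_idx + 1:]:
                    edges.append((parent, child))
    return jobs, retries, edges

# Lines copied from the sample DAG above.
dag = """\
JOB das_tide_ID000001 das_tide_ID000001.sub
RETRY das_tide_ID000001 3
PARENT create_dir_scb_0_cobalt CHILD das_tide_ID000001
"""
jobs, retries, edges = parse_dag(dag)
print(edges)  # [('create_dir_scb_0_cobalt', 'das_tide_ID000001')]
```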

Kickstart XML Record

Kickstart is a lightweight C executable that is shipped with the pegasus worker package. All jobs are launched via kickstart on the remote end, unless explicitly disabled at the time of running pegasus-plan.

Kickstart does not work with

  1. Condor Standard Universe Jobs
  2. MPI Jobs

Pegasus automatically disables kickstart for the above jobs.

Kickstart captures useful runtime provenance information about the job it launches on the remote node, and puts it in an XML record that it writes to its stdout. The stdout appears in the workflow submit directory as <job>.out.00n . Some of the useful information captured and logged by kickstart is as follows

  1. the exitcode with which the job it launched exited.
  2. the duration of the job
  3. the start time for the job
  4. the node on which the job ran
  5. the stdout/stderr of the job
  6. the arguments with which it launched the job
  7. the environment that was set for the job before it was launched.
  8. the machine information about the node that the job ran on

Of the above information, the dagman.out file provides only a coarser-grained estimate of the job duration and start time.

Reading a Kickstart Output File

The kickstart file below has the following fields highlighted

  1. the host on which the job executed and the IP address of that host
  2. the duration and start time of the job. The time here is in reference to the clock on the remote node where job executed.
  3. exitcode with which the job executed
  4. the arguments with which the job was launched.
  5. the directory in which the job executed on the remote site
  6. the stdout of the job
  7. the stderr of the job
  8. the environment of the job
Kickstart Output

<?xml version="1.0" encoding="ISO-8859-1"?>
<invocation xmlns="http://pegasus.isi.edu/schema/invocation" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/invocation http://pegasus.isi.edu/schema/iv-2.0.xsd" version="2.0" start="2009-01-30T19:17:41.157-06:00" duration="0.321" transformation="pegasus::dirmanager" derivation="pegasus::dirmanager:1.0" resource="cobalt" wf-label="scb" wf-stamp="2009-01-30T17:12:55-08:00" hostaddr="141.142.30.219" hostname="co-login.ncsa.uiuc.edu" pid="27714" uid="29548" user="vahi" gid="13872" group="bvr" umask="0022">
<mainjob start="2009-01-30T19:17:41.426-06:00" duration="0.052" pid="27783">
<usage utime="0.036" stime="0.004" minflt="739" majflt="0" nswap="0" nsignals="0" nvcsw="36" nivcsw="3"/>
<status raw="0"><regular exitcode="0"/></status>
<statcall error="0">
<!-- deferred flag: 0 -->
<file name="/u/ac/vahi/SOFTWARE/pegasus/default/bin/dirmanager">23212F7573722F62696E2F656E762070</file>
<statinfo mode="0100755" size="8202" inode="85904615883" nlink="1" blksize="16384" blocks="24" mtime="2008-09-22T18:52:37-05:00" atime="2009-01-30T14:54:18-06:00" ctime="2009-01-13T19:09:47-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<argument-vector>
<arg nr="1">--create</arg>
<arg nr="2">--dir</arg>
<arg nr="3">/u/ac/vahi/globus-test/EXEC/vahi/pegasus/scb/run0001</arg>
</argument-vector>

</mainjob>
<cwd>/u/ac/vahi/globus-test/EXEC</cwd>
<usage utime="0.012" stime="0.208" minflt="4232" majflt="0" nswap="0" nsignals="0" nvcsw="15" nivcsw="74"/>
<machine page-size="16384" provider="LINUX">
<stamp>2009-01-30T19:17:41.157-06:00</stamp>
<uname system="linux" nodename="co-login" release="2.6.16.54-0.2.5-default" machine="ia64">#1 SMP Mon Jan 21 13:29:51 UTC 2008</uname>
<ram total="148299268096" free="123371929600" shared="0" buffer="2801664"/>
<swap total="1179656486912" free="1179656486912"/>
<boot idle="1315786.920">2009-01-15T10:19:50.283-06:00</boot>
<cpu count="32" speed="1600" vendor=""></cpu>
<load min1="3.50" min5="3.50" min15="2.60"/>
<proc total="841" running="5" sleeping="828" stopped="5" vmsize="10025418752" rss="2524299264"/>
<task total="1125" running="6" sleeping="1114" stopped="5"/>
</machine>
<statcall error="0" id="stdin">
<!-- deferred flag: 0 -->
<file name="/dev/null"/>
<statinfo mode="020666" size="0" inode="68697" nlink="1" blksize="16384" blocks="0" mtime="2007-05-04T05:54:02-05:00" atime="2007-05-04T05:54:02-05:00" ctime="2009-01-15T10:21:54-06:00" uid="0" user="root" gid="0" group="root"/>
</statcall>
<statcall error="0" id="stdout">
<temporary name="/tmp/gs.out.s9rTJL" descriptor="3"/>
<statinfo mode="0100600" size="29" inode="203420686" nlink="1" blksize="16384" blocks="128" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
<data>mkdir finished successfully.
</data>
</statcall>
<statcall error="0" id="stderr">
<temporary name="/tmp/gs.err.kobn3S" descriptor="5"/>
<statinfo mode="0100600" size="0" inode="203420689" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>

<statcall error="0" id="gridstart">
<!-- deferred flag: 0 -->
<file name="/u/ac/vahi/SOFTWARE/pegasus/default/bin/kickstart">7F454C46020101000000000000000000</file>
<statinfo mode="0100755" size="255445" inode="85904615876" nlink="1" blksize="16384" blocks="504" mtime="2009-01-30T18:06:28-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T18:06:28-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="logfile">
<descriptor number="1"/>
<statinfo mode="0100600" size="0" inode="53040253" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:39-06:00" atime="2009-01-30T19:17:39-06:00" ctime="2009-01-30T19:17:39-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="channel">
<fifo name="/tmp/gs.app.Ien1m0" descriptor="7" count="0" rsize="0" wsize="0"/>
<statinfo mode="010640" size="0" inode="203420696" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<environment>
<env key="GLOBUS_GRAM_JOB_CONTACT">https://co-login.ncsa.uiuc.edu:50001/27456/1233364659/</env>
<env key="GLOBUS_GRAM_MYJOB_CONTACT">URLx-nexus://co-login.ncsa.uiuc.edu:50002/</env>
<env key="GLOBUS_LOCATION">/usr/local/prews-gram-4.0.7-r1/</env>
....
</environment>

<resource>
<soft id="RLIMIT_CPU">unlimited</soft>
<hard id="RLIMIT_CPU">unlimited</hard>
<soft id="RLIMIT_FSIZE">unlimited</soft>
....
</resource>
</invocation>
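Because the kickstart record is namespaced XML, the fields highlighted above can be pulled out with any standard XML parser. A minimal sketch using Python's ElementTree, run on a trimmed record adapted from the sample above (the XML declaration is omitted since ElementTree rejects encoding declarations in unicode strings):

```python
import xml.etree.ElementTree as ET

# The invocation schema namespace, as seen in the sample record above.
NS = {"iv": "http://pegasus.isi.edu/schema/invocation"}

# A trimmed-down invocation record based on the sample above.
record = """\
<invocation xmlns="http://pegasus.isi.edu/schema/invocation"
            duration="0.321" hostname="co-login.ncsa.uiuc.edu">
  <mainjob start="2009-01-30T19:17:41.426-06:00" duration="0.052" pid="27783">
    <status raw="0"><regular exitcode="0"/></status>
  </mainjob>
</invocation>
"""

root = ET.fromstring(record)
# Host, overall duration, and the mainjob's exit code.
exitcode = int(root.find("iv:mainjob/iv:status/iv:regular", NS).get("exitcode"))
print(root.get("hostname"), root.get("duration"), exitcode)
```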

Jobstate.log File

The jobstate.log file logs the various states that a job goes through during workflow execution. It is created by the tailstatd daemon that is launched  when a workflow is submitted to Condor DAGMan by pegasus-run. Tailstatd parses the dagman.out file and writes out the jobstate.log file, the format of which is more amenable to parsing.

Note: The jobstate.log file is not created if a user uses condor_submit_dag to submit a workflow to Condor DAGMan.

The jobstate.log file can also be created after a workflow has finished executing, by running tailstatd on the dagman.out file in the workflow submit directory.
Executing tailstatd when pegasus-run was not used to submit the workflow

cd workflow-submit-directory
tailstatd -n  --nodatabase $dagman.outfile

Below is a snippet from the jobstate.log file for a single job executed via Condor-G

1239666049 create_dir_blackdiamond_0_isi_viz SUBMIT 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz EXECUTE 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz GLOBUS_SUBMIT 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz GRID_SUBMIT 3758.0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz JOB_TERMINATED 3758.0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz JOB_SUCCESS 0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_STARTED - isi_viz -
1239666069 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_TERMINATED 3758.0 isi_viz -
1239666069 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_SUCCESS - isi_viz -


Each entry in jobstate.log has the following fields

  • the Unix timestamp (seconds since the epoch) at which the particular event happened
  • the name of the job
  • the event recorded by DAGMan for the job
  • the condor id of the job in the queue on the submit node
  • the pegasus site to which the job is mapped
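Given that layout, each jobstate.log line splits cleanly on whitespace into those five fields (plus a trailing "-"). A minimal sketch:

```python
def parse_jobstate(text):
    """Split each jobstate.log line into (timestamp, job, event, condor id, site)."""
    events = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 5:
            ts, job, event, condor_id, site = parts[:5]
            events.append((int(ts), job, event, condor_id, site))
    return events

# Two lines copied from the sample jobstate.log above.
log = """\
1239666049 create_dir_blackdiamond_0_isi_viz SUBMIT 3758.0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz JOB_SUCCESS 0 isi_viz -
"""
for ts, job, event, condor_id, site in parse_jobstate(log):
    print(ts, event)
```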

The lifecycle of a job executed as part of a workflow is as follows

  • SUBMIT - the job is submitted by the condor schedd for execution.
  • EXECUTE - the condor schedd detects that the job has started executing.
  • GLOBUS_SUBMIT - the job has been submitted to the remote resource. This event is only written for GRAM jobs (i.e. gt2 and gt4).
  • GRID_SUBMIT - same as the GLOBUS_SUBMIT event. The ULOG_GRID_SUBMIT event is written for all grid universe jobs.
  • JOB_TERMINATED - the job terminated on the remote node.
  • JOB_SUCCESS - the job succeeded on the remote host; the condor id field will be zero (the successful exit code).
  • JOB_FAILURE - the job failed on the remote host; the condor id field will be the job's exit code.
  • POST_SCRIPT_STARTED - the post script was started by DAGMan on the submit host, usually to parse the kickstart output.
  • POST_SCRIPT_TERMINATED - the post script finished on the submit node.
  • POST_SCRIPT_SUCCESS | POST_SCRIPT_FAILURE - the post script succeeded or failed.
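One common use of these events is deciding whether a job ultimately succeeded. The reduction below is a sketch under the assumption (not official Pegasus semantics) that the last success/failure event for a job wins:

```python
# Terminal events and the verdict each implies. Assumption for illustration:
# the last such event in a job's stream determines the outcome, so a
# POST_SCRIPT_SUCCESS overrides an earlier JOB_FAILURE after a retry.
TERMINAL = {
    "JOB_SUCCESS": True,
    "JOB_FAILURE": False,
    "POST_SCRIPT_SUCCESS": True,
    "POST_SCRIPT_FAILURE": False,
}

def final_status(events):
    """Return True/False for the last terminal event seen, or None if none."""
    verdict = None
    for event in events:
        if event in TERMINAL:
            verdict = TERMINAL[event]
    return verdict

# The event sequence from the sample jobstate.log snippet above.
seq = ["SUBMIT", "EXECUTE", "GLOBUS_SUBMIT", "GRID_SUBMIT", "JOB_TERMINATED",
       "JOB_SUCCESS", "POST_SCRIPT_STARTED", "POST_SCRIPT_TERMINATED",
       "POST_SCRIPT_SUCCESS"]
print(final_status(seq))  # True
```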

Pegasus job.map file

Pegasus creates a workflow.job.map file that links jobs in the DAG with the jobs in the DAX. The contents of the file are in netlogger format. The purpose of this file is to make it possible to link an invocation record of a task to the corresponding job in the DAX.

Here, workflow is replaced by the name of the workflow, i.e. the same prefix as the .dag file.

In the file there are two types of events.

pegasus.job
pegasus.job.map

pegasus.job - This event is created for each job in the DAG. The following information is associated with this event.

  • job.id - the id of the job in the DAG
  • job.class - an integer designating the type of the job
  • job.xform - the logical transformation to which the job refers
  • task.count - the number of tasks associated with the job. This is equal to the number of pegasus.job.map events created for that job.

pegasus.job.map - This event associates a job in the DAG with the corresponding jobs (tasks) in the DAX. The following information is associated with this event.

  • task.id - the id of the corresponding task in the DAX
  • task.class - an integer designating the type of the task
  • task.xform - the logical transformation to which the task refers

Some sample entries are as follows

ts=2009-04-21T23:09:03.091658Z event=pegasus.job job.id=analyze_ID000004 job.class="7" job.xform="vahi::analyze:1.0" task.count="1"
ts=2009-04-21T23:09:03.091772Z event=pegasus.job.map job.id=analyze_ID000004 task.id="ID000004" task.class="7" task.xform="vahi::analyze:1.0"
ts=2009-04-21T23:09:03.092063Z event=pegasus.job job.id=create_dir_blackdiamond_0_isi_viz job.class="6" job.xform="pegasus::dirmanager" task.count="0"
ts=2009-04-21T23:09:03.092165Z event=pegasus.job job.id=merge_vahi-findrange-1.0_PID2_ID1 job.class="1" job.xform="pegasus::seqexec" task.count="2"
ts=2009-04-21T23:09:03.093259Z event=pegasus.job.map job.id=merge_vahi-findrange-1.0_PID2_ID1 task.id="ID000002" task.class="7" task.xform="vahi::findrange:1.0"
ts=2009-04-21T23:09:03.093402Z event=pegasus.job.map job.id=merge_vahi-findrange-1.0_PID2_ID1 task.id="ID000003" task.class="7" task.xform="vahi::findrange:1.0"
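Because the entries are netlogger-style key=value pairs with optionally quoted values, they can be parsed with a shell-style tokenizer. A minimal sketch (the sample line is copied from above):

```python
import shlex

def parse_netlogger(line):
    """Split a netlogger line into a dict; shlex strips the double quotes."""
    fields = {}
    for token in shlex.split(line):
        key, _, value = token.partition("=")
        fields[key] = value
    return fields

# A pegasus.job entry copied from the samples above.
line = ('ts=2009-04-21T23:09:03.091658Z event=pegasus.job '
        'job.id=analyze_ID000004 job.class="7" '
        'job.xform="vahi::analyze:1.0" task.count="1"')
rec = parse_netlogger(line)
print(rec["event"], rec["task.count"])  # pegasus.job 1
```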