Some of this information is outdated. For more up-to-date info see: http://pegasus.isi.edu/wms/docs/latest/submit_directory.php
Layout
Each planned workflow is associated with a submit directory, in which you will see the following files:
- <daxlabel-daxindex>.dag - This is the Condor DAGMan dag file corresponding to the executable workflow generated by Pegasus. The dag file describes the edges in the DAG and information about the jobs in the DAG. A Pegasus-generated .dag file usually contains the following information for each job:
- the job submit file for each job in the DAG.
- the post script that is to be invoked when a job completes. This is usually $PEGASUS_HOME/bin/exitpost that parses the kickstart record in the job's .out file and determines the exitcode.
- JOB RETRY - the number of times the job is to be retried in case of failure. The Pegasus job postscript exits with a non-zero exitcode if it determines that a failure occurred.
- <daxlabel-daxindex>.dag.dagman.out - When a DAG (.dag file) is executed by Condor DAGMan, DAGMan writes its output to the <daxlabel-daxindex>.dag.dagman.out file. This file tells us the progress of the workflow, and can be used to determine its status. Most Pegasus tools mine the dagman.out or jobstate.log to determine the progress of workflows.
- <daxlabel-daxindex>.dot - Pegasus creates a dot file for the executable workflow in addition to the .dag file. This can be used to visualize the executable workflow using the dot program.
- <job>.sub - Each job in the executable workflow is associated with its own submit file. The submit file tells Condor how to execute the job.
- <job>.out.00n - The stdout of the executable referred to in the job submit file. In the case of Pegasus, most jobs are launched via kickstart, so this file contains the kickstart XML provenance record that captures runtime provenance on the remote node where the job was executed. Here n varies from 1 to N, where N is the JOB RETRY value in the .dag file. The exitpost executable is invoked on the <job>.out file and moves <job>.out to <job>.out.00n so that the job's .out files are preserved across retries.
- <job>.err.00n - The stderr of the executable referred to in the job submit file. In the case of Pegasus, most jobs are launched via kickstart, so this file contains the stderr of kickstart. It is usually empty unless there is an error in kickstart itself, e.g. kickstart segfaults, or the kickstart location specified in the submit file is incorrect. The exitpost executable moves <job>.err to <job>.err.00n so that the job's .err files are preserved across retries.
- jobstate.log - The jobstate.log file is written out by the tailstatd daemon that is launched when a workflow is submitted for execution by pegasus-run. The tailstatd daemon parses the dagman.out file and writes out the jobstate.log that is easier to parse. The jobstate.log captures the various states through which a job goes during the workflow.
- braindump.txt - Contains information about the Pegasus version, dax file, dag file, and dax label, e.g.:
dax /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dax/CyberShake_LGU.dax
dag CyberShake_LGU-0.dag
basedir /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags
run /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags/gmehta/pegasus/CyberShake_LGU/20100106T162339-0800
jsd /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags/gmehta/pegasus/CyberShake_LGU/20100106T162339-0800/jobstate.log
rundir 20100106T162339-0800
pegasushome /usr/local/pegasus/default
vogroup pegasus
label CyberShake_LGU
planner /usr/local/pegasus/default/bin/pegasus-plan
pegasus_generator Pegasus
pegasus_version 2.4.0cvs
pegasus_build 20091221194342Z
pegasus_wf_name CyberShake_LGU-0
pegasus_wf_time 20100106T162339-0800
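Since braindump.txt is a simple key/value listing, it is easy to consume programmatically. A minimal sketch (the function name is illustrative, not part of Pegasus):

```python
# Parse braindump.txt content into a dict of key -> value.
# Assumes each line is "key<whitespace>value", as in the sample above.
def parse_braindump(text):
    info = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            info[parts[0]] = parts[1]
    return info

# A few lines from the sample braindump.txt above.
sample = """dag CyberShake_LGU-0.dag
label CyberShake_LGU
pegasus_version 2.4.0cvs"""

info = parse_braindump(sample)
```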
Condor DAGMan file
The Condor DAGMan file (.dag) is the input to Condor DAGMan, the workflow executor used by Pegasus.
A Pegasus-generated .dag file usually contains the following information for each job:
- the job submit file for each job in the DAG.
- the post script that is to be invoked when a job completes. This is usually $PEGASUS_HOME/bin/exitpost that parses the kickstart record in the job's .out file and determines the exitcode.
- JOB RETRY - the number of times the job is to be retried in case of failure. The Pegasus job postscript exits with a non-zero exitcode if it determines that a failure occurred.
Reading a Condor DAG file
The Condor DAG file below shows the following fields for a single job:
- JOB and the submit file for the job
- Post script that is invoked on the stdout brought back to the submit directory
- JOB RETRY
At the end of the DAG file, the relations between the jobs (which identify the underlying DAG structure) are listed.
######################################################################
# PEGASUS WMS GENERATED DAG FILE
# DAG scb
# Index = 0, Count = 1
######################################################################
JOB das_tide_ID000001 das_tide_ID000001.sub
SCRIPT POST das_tide_ID000001 /lfs1/software/install/pegasus/default/bin/exitpost -Dpegasus.user.properties=/lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/pegasus.32479.properties -e /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/das_tide_ID000001.out
RETRY das_tide_ID000001 3
....
JOB create_dir_scb_0_cobalt create_dir_scb_0_cobalt.sub
SCRIPT POST create_dir_scb_0_cobalt /lfs1/software/install/pegasus/default/bin/exitpost -Dpegasus.user.properties=/lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/pegasus.32479.properties -e /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/create_dir_scb_0_cobalt.out
RETRY create_dir_scb_0_cobalt 3
PARENT das_tide_ID000001 CHILD fcst_tide_ID000002
...
PARENT create_dir_scb_0_cobalt CHILD das_tide_ID000001
PARENT create_dir_scb_0_cobalt CHILD fcst_tide_ID000002
######################################################################
# End of DAG
######################################################################
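The JOB, RETRY and PARENT ... CHILD lines above follow a simple line-oriented format. A minimal sketch that recovers the jobs, retry counts and edges (illustrative; it handles only one parent and child per PARENT line, as in this sample):

```python
# Recover jobs, retry counts and edges from a Pegasus-generated .dag file.
# Only JOB, RETRY and PARENT ... CHILD lines are handled; comment lines
# (starting with '#') and SCRIPT POST lines are skipped.
def parse_dag(text):
    jobs, retries, edges = {}, {}, []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue
        if tokens[0] == "JOB":
            jobs[tokens[1]] = tokens[2]          # job name -> submit file
        elif tokens[0] == "RETRY":
            retries[tokens[1]] = int(tokens[2])  # job name -> retry count
        elif tokens[0] == "PARENT":
            child = tokens[tokens.index("CHILD") + 1]
            edges.append((tokens[1], child))     # (parent, child) edge
    return jobs, retries, edges

# A few lines from the sample DAG file above.
sample = """JOB das_tide_ID000001 das_tide_ID000001.sub
RETRY das_tide_ID000001 3
PARENT create_dir_scb_0_cobalt CHILD das_tide_ID000001"""

jobs, retries, edges = parse_dag(sample)
```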
Kickstart XML Record
Kickstart is a lightweight C executable that is shipped with the Pegasus worker package. All jobs are launched via kickstart on the remote end, unless explicitly disabled at the time of running pegasus-plan.
Kickstart does not work with
- Condor Standard Universe Jobs
- MPI Jobs
Pegasus automatically disables kickstart for the above jobs.
Kickstart captures useful runtime provenance information about the job it launches on the remote node, and puts it in an XML record that it writes to its stdout. The stdout appears in the workflow submit directory as <job>.out.00n. Some useful information captured and logged by kickstart is as follows:
- the exitcode with which the job it launched exited.
- the duration of the job
- the start time for the job
- the node on which the job ran
- the stdout/stderr of the job
- the arguments with which it launched the job
- the environment that was set for the job before it was launched.
- the machine information about the node that the job ran on
Of the above information, the dagman.out file only gives a coarse-grained estimate of the job duration and start time.
Reading a Kickstart Output File
The kickstart file below has the following fields highlighted
- the host on which the job executed and the ipaddress of that host
- the duration and start time of the job. The time here is relative to the clock on the remote node where the job executed.
- exitcode with which the job executed
- the arguments with which the job was launched.
- the directory in which the job executed on the remote site
- the stdout of the job
- the stderr of the job
- the environment of the job
<?xml version="1.0" encoding="ISO-8859-1"?>
<invocation xmlns="http://pegasus.isi.edu/schema/invocation" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/invocation http://pegasus.isi.edu/schema/iv-2.0.xsd" version="2.0" start="2009-01-30T19:17:41.157-06:00" duration="0.321" transformation="pegasus::dirmanager" derivation="pegasus::dirmanager:1.0" resource="cobalt" wf-label="scb" wf-stamp="2009-01-30T17:12:55-08:00" hostaddr="141.142.30.219" hostname="co-login.ncsa.uiuc.edu" pid="27714" uid="29548" user="vahi" gid="13872" group="bvr" umask="0022">
<mainjob start="2009-01-30T19:17:41.426-06:00" duration="0.052" pid="27783">
<usage utime="0.036" stime="0.004" minflt="739" majflt="0" nswap="0" nsignals="0" nvcsw="36" nivcsw="3"/>
<status raw="0"><regular exitcode="0"/></status>
<statcall error="0">
<!-- deferred flag: 0 -->
<file name="/u/ac/vahi/SOFTWARE/pegasus/default/bin/dirmanager">23212F7573722F62696E2F656E762070</file>
<statinfo mode="0100755" size="8202" inode="85904615883" nlink="1" blksize="16384" blocks="24" mtime="2008-09-22T18:52:37-05:00" atime="2009-01-30T14:54:18-06:00" ctime="2009-01-13T19:09:47-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<argument-vector>
<arg nr="1">--create</arg>
<arg nr="2">--dir</arg>
<arg nr="3">/u/ac/vahi/globus-test/EXEC/vahi/pegasus/scb/run0001</arg>
</argument-vector>
</mainjob>
<cwd>/u/ac/vahi/globus-test/EXEC</cwd>
<usage utime="0.012" stime="0.208" minflt="4232" majflt="0" nswap="0" nsignals="0" nvcsw="15" nivcsw="74"/>
<machine page-size="16384" provider="LINUX">
<stamp>2009-01-30T19:17:41.157-06:00</stamp>
<uname system="linux" nodename="co-login" release="2.6.16.54-0.2.5-default" machine="ia64">#1 SMP Mon Jan 21 13:29:51 UTC 2008</uname>
<ram total="148299268096" free="123371929600" shared="0" buffer="2801664"/>
<swap total="1179656486912" free="1179656486912"/>
<boot idle="1315786.920">2009-01-15T10:19:50.283-06:00</boot>
<cpu count="32" speed="1600" vendor=""></cpu>
<load min1="3.50" min5="3.50" min15="2.60"/>
<proc total="841" running="5" sleeping="828" stopped="5" vmsize="10025418752" rss="2524299264"/>
<task total="1125" running="6" sleeping="1114" stopped="5"/>
</machine>
<statcall error="0" id="stdin">
<!-- deferred flag: 0 -->
<file name="/dev/null"/>
<statinfo mode="020666" size="0" inode="68697" nlink="1" blksize="16384" blocks="0" mtime="2007-05-04T05:54:02-05:00" atime="2007-05-04T05:54:02-05:00" ctime="2009-01-15T10:21:54-06:00" uid="0" user="root" gid="0" group="root"/>
</statcall>
<statcall error="0" id="stdout">
<temporary name="/tmp/gs.out.s9rTJL" descriptor="3"/>
<statinfo mode="0100600" size="29" inode="203420686" nlink="1" blksize="16384" blocks="128" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
<data>mkdir finished successfully.
</data>
</statcall>
<statcall error="0" id="stderr">
<temporary name="/tmp/gs.err.kobn3S" descriptor="5"/>
<statinfo mode="0100600" size="0" inode="203420689" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="gridstart">
<!-- deferred flag: 0 -->
<file name="/u/ac/vahi/SOFTWARE/pegasus/default/bin/kickstart">7F454C46020101000000000000000000</file>
<statinfo mode="0100755" size="255445" inode="85904615876" nlink="1" blksize="16384" blocks="504" mtime="2009-01-30T18:06:28-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T18:06:28-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="logfile">
<descriptor number="1"/>
<statinfo mode="0100600" size="0" inode="53040253" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:39-06:00" atime="2009-01-30T19:17:39-06:00" ctime="2009-01-30T19:17:39-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="channel">
<fifo name="/tmp/gs.app.Ien1m0" descriptor="7" count="0" rsize="0" wsize="0"/>
<statinfo mode="010640" size="0" inode="203420696" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<environment>
<env key="GLOBUS_GRAM_JOB_CONTACT">https://co-login.ncsa.uiuc.edu:50001/27456/1233364659/</env>
<env key="GLOBUS_GRAM_MYJOB_CONTACT">URLx-nexus://co-login.ncsa.uiuc.edu:50002/</env>
<env key="GLOBUS_LOCATION">/usr/local/prews-gram-4.0.7-r1/</env>
....
</environment>
<resource>
<soft id="RLIMIT_CPU">unlimited</soft>
<hard id="RLIMIT_CPU">unlimited</hard>
<soft id="RLIMIT_FSIZE">unlimited</soft>
....
</resource>
</invocation>
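A few of the fields called out above (host, duration, exitcode) can be pulled out of the record with the Python standard library. A minimal sketch, assuming the invocation namespace used in the sample record:

```python
import xml.etree.ElementTree as ET

# Namespace of the sample record above; other schema versions may differ.
NS = "{http://pegasus.isi.edu/schema/invocation}"

# Summarize a kickstart invocation record: host, duration and the
# exitcode of the main job.
def summarize_invocation(xml_text):
    root = ET.fromstring(xml_text)
    status = root.find(NS + "mainjob/" + NS + "status/" + NS + "regular")
    return {
        "host": root.get("hostname"),
        "duration": float(root.get("duration")),
        "exitcode": int(status.get("exitcode")) if status is not None else None,
    }

# Trimmed-down record with the same structure as the sample above.
sample = (
    '<invocation xmlns="http://pegasus.isi.edu/schema/invocation" '
    'hostname="co-login.ncsa.uiuc.edu" duration="0.321">'
    '<mainjob><status><regular exitcode="0"/></status></mainjob>'
    '</invocation>'
)

summary = summarize_invocation(sample)
```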
Jobstate.log File
The jobstate.log file logs the various states that a job goes through during workflow execution. It is created by the tailstatd daemon that is launched when a workflow is submitted to Condor DAGMan by pegasus-run. Tailstatd parses the dagman.out file and writes out the jobstate.log file, the format of which is more amenable to parsing.
Note: The jobstate.log file is not created if a user uses condor_submit_dag to submit a workflow to Condor DAGMan.
The jobstate.log file can be created after a workflow has finished executing by running tailstatd on the .dag file in the workflow submit directory.
Executing tailstatd for cases where pegasus-run was not used to submit the workflow:
cd workflow-submit-directory
tailstatd -n --nodatabase $dagman.outfile
Below is a snippet from the jobstate.log for a single job executed via Condor-G:
1239666049 create_dir_blackdiamond_0_isi_viz SUBMIT 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz EXECUTE 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz GLOBUS_SUBMIT 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz GRID_SUBMIT 3758.0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz JOB_TERMINATED 3758.0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz JOB_SUCCESS 0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_STARTED - isi_viz -
1239666069 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_TERMINATED 3758.0 isi_viz -
1239666069 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_SUCCESS - isi_viz -
Each entry in jobstate.log has the following fields:
- the timestamp (seconds since the epoch) at which the particular event happened
- the name of the job
- the event recorded by DAGMan for the job
- the condor id of the job in the queue on the submit node
- the pegasus site to which the job is mapped
The lifecycle for a job executed as part of the workflow is as follows:
State/Event | Description |
---|---|
SUBMIT | job is submitted by condor schedd for execution. |
EXECUTE | condor schedd detects that a job has started execution. |
GLOBUS_SUBMIT | the job has been submitted to the remote resource. It is only written for GRAM jobs (i.e. gt2 and gt4). |
GRID_SUBMIT | same as the GLOBUS_SUBMIT event. The ULOG_GRID_SUBMIT event is written for all grid universe jobs. |
JOB_TERMINATED | job terminated on the remote node. |
JOB_SUCCESS | job succeeded on the remote host, condor id will be zero (successful exit code). |
JOB_FAILURE | job failed on the remote host, condor id will be the job's exit code. |
POST_SCRIPT_STARTED | post script started by DAGMan on the submit host, usually to parse the kickstart output |
POST_SCRIPT_TERMINATED | post script finished on the submit node. |
POST_SCRIPT_SUCCESS / POST_SCRIPT_FAILURE | post script succeeded or failed. |
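Given the fixed field order described above, jobstate.log is straightforward to parse. A minimal sketch (the function name is illustrative, not part of Pegasus):

```python
# Parse jobstate.log lines into (timestamp, job, event, condor id, site)
# tuples. The condor id column holds the exit code for JOB_SUCCESS /
# JOB_FAILURE events and "-" for post script events, so it is kept as text.
def parse_jobstate(text):
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            ts, job, event, condor_id, site = fields[:5]
            entries.append((int(ts), job, event, condor_id, site))
    return entries

# Two lines from the jobstate.log snippet above.
sample = """1239666049 create_dir_blackdiamond_0_isi_viz SUBMIT 3758.0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz JOB_SUCCESS 0 isi_viz -"""

entries = parse_jobstate(sample)
```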
Pegasus job.map file
Pegasus creates a <workflow>.job.map file that links jobs in the DAG with the tasks in the DAX. The contents of the file are in netlogger format. The purpose of this file is to be able to link the invocation record of a task in the DAX to the corresponding job in the DAG.
Here <workflow> is replaced by the name of the workflow, i.e. the same prefix as the .dag file.
There are two types of events in the file:
pegasus.job
pegasus.job.map
pegasus.job - This event is logged for all jobs in the DAG. The following information is associated with this event:
- job.id - the id of the job in the DAG
- job.class - an integer designating the type of the job
- job.xform - the logical transformation which the job refers to
- task.count - the number of tasks associated with the job. This is equal to the number of pegasus.job.map events created for that job.
pegasus.job.map - This event associates a job in the DAG with the tasks in the DAX. The following information is associated with this event:
- job.id - the id of the job in the DAG
- task.id - the id of the task in the DAX
- task.class - an integer designating the type of the task
- task.xform - the logical transformation which the task refers to
Some sample entries are as follows
ts=2009-04-21T23:09:03.091658Z event=pegasus.job job.id=analyze_ID000004 job.class="7" job.xform="vahi::analyze:1.0" task.count="1"
ts=2009-04-21T23:09:03.091772Z event=pegasus.job.map job.id=analyze_ID000004 task.id="ID000004" task.class="7" task.xform="vahi::analyze:1.0"
ts=2009-04-21T23:09:03.092063Z event=pegasus.job job.id=create_dir_blackdiamond_0_isi_viz job.class="6" job.xform="pegasus::dirmanager" task.count="0"
ts=2009-04-21T23:09:03.092165Z event=pegasus.job job.id=merge_vahi-findrange-1.0_PID2_ID1 job.class="1" job.xform="pegasus::seqexec" task.count="2"
ts=2009-04-21T23:09:03.093259Z event=pegasus.job.map job.id=merge_vahi-findrange-1.0_PID2_ID1 task.id="ID000002" task.class="7" task.xform="vahi::findrange:1.0"
ts=2009-04-21T23:09:03.093402Z event=pegasus.job.map job.id=merge_vahi-findrange-1.0_PID2_ID1 task.id="ID000003" task.class="7" task.xform="vahi::findrange:1.0"
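Because the entries are flat key=value pairs, the DAG-job to DAX-task mapping can be rebuilt from the pegasus.job.map events. A minimal sketch (the helper names are illustrative, not part of Pegasus):

```python
# Parse one netlogger-style line into a dict, stripping the quotes
# that some values carry.
def parse_event(line):
    event = {}
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            event[key] = value.strip('"')
    return event

# Build a DAG job id -> list of DAX task ids mapping from the
# pegasus.job.map events.
def job_to_tasks(lines):
    mapping = {}
    for line in lines:
        ev = parse_event(line)
        if ev.get("event") == "pegasus.job.map":
            mapping.setdefault(ev["job.id"], []).append(ev["task.id"])
    return mapping

# Two pegasus.job.map entries from the sample above.
sample = [
    'ts=2009-04-21T23:09:03.093259Z event=pegasus.job.map '
    'job.id=merge_vahi-findrange-1.0_PID2_ID1 task.id="ID000002" '
    'task.class="7" task.xform="vahi::findrange:1.0"',
    'ts=2009-04-21T23:09:03.093402Z event=pegasus.job.map '
    'job.id=merge_vahi-findrange-1.0_PID2_ID1 task.id="ID000003" '
    'task.class="7" task.xform="vahi::findrange:1.0"',
]

mapping = job_to_tasks(sample)
```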