Author: Karan Vahi October 21st, 2009
Last Updated: Karan Vahi January 28th, 2010
Introduction
This page lists the various metrics/graphs a user can generate by executing a workflow through Pegasus. Pegasus takes in an abstract workflow (DAX) and generates an executable workflow (DAG) in the submit directory.
Layout
Each planned workflow is associated with a submit directory. In it you will see the following:
- <daxlabel-daxindex>.dag file - This is the Condor DAGMan dag file corresponding to the executable workflow generated by Pegasus. The dag file describes the edges in the DAG and information about the jobs in the DAG. The Pegasus generated .dag file usually contains the following information for each job:
- the job submit file for each job in the DAG.
- the post script that is to be invoked when a job completes. This is usually $PEGASUS_HOME/bin/exitpost that parses the kickstart record in the job's .out file and determines the exitcode.
- JOB RETRY - the number of times the job is to be retried in case of failure. In the case of Pegasus, the job postscript exits with a non-zero exitcode if it determines that a failure occurred.
- <daxlabel-daxindex>.dag.dagman.out - When a DAG (.dag file) is executed by Condor DAGMan, DAGMan writes its output to the <daxlabel-daxindex>.dag.dagman.out file. This file tells us the progress of the workflow, and can be used to determine its status. Most of the Pegasus tools mine the dagman.out or jobstate.log to determine the progress of the workflow.
- <daxlabel-daxindex>.dot - Pegasus creates a dot file for the executable workflow in addition to the .dag file. This can be used to visualize the executable workflow using the dot program.
- <job>.sub - Each job in the executable workflow is associated with its own submit file. The submit file tells Condor how to execute the job.
- <job>.out.00n - The stdout of the executable referred to in the job submit file. In the case of Pegasus, most jobs are launched via kickstart. Hence, this file contains the kickstart XML provenance record that captures runtime provenance on the remote node where the job was executed. n varies from 1-N where N is the JOB RETRY value in the .dag file. The exitpost executable is invoked on the <job>.out file and moves <job>.out to <job>.out.00n so that the job's .out files are preserved across retries.
- <job>.err.00n - The stderr of the executable referred to in the job submit file. In the case of Pegasus, most jobs are launched via kickstart. Hence, this file contains the stderr of kickstart. This is usually empty unless there is an error in kickstart, e.g. kickstart segfaults, or the kickstart location specified in the submit file is incorrect. The exitpost executable is invoked on the <job>.out file and moves <job>.err to <job>.err.00n so that the job's .err files are preserved across retries.
- jobstate.log - The jobstate.log file is written out by the tailstatd daemon that is launched when a workflow is submitted for execution by pegasus-run. The tailstatd daemon parses the dagman.out file and writes out the jobstate.log, which is easier to parse. The jobstate.log captures the various states through which a job goes during the workflow.
- braindump.txt - Contains information about the pegasus version, dax file, dag file, dax label and other run metadata, for example:
dax /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dax/CyberShake_LGU.dax
dag CyberShake_LGU-0.dag
basedir /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags
run /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags/gmehta/pegasus/CyberShake_LGU/20100106T162339-0800
jsd /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags/gmehta/pegasus/CyberShake_LGU/20100106T162339-0800/jobstate.log
rundir 20100106T162339-0800
pegasushome /usr/local/pegasus/default
vogroup pegasus
label CyberShake_LGU
planner /usr/local/pegasus/default/bin/pegasus-plan
pegasus_generator Pegasus
pegasus_version 2.4.0cvs
pegasus_build 20091221194342Z
pegasus_wf_name CyberShake_LGU-0
pegasus_wf_time 20100106T162339-0800
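Since braindump.txt is a simple key/value file (each line is a key followed by whitespace and a value), it is easy to read from your own scripts when you need to locate things like the jobstate.log of a run. Below is a minimal Python sketch of such a helper; the function name read_braindump is purely illustrative and is not part of Pegasus.

import os

def read_braindump(submit_dir):
    """Parse braindump.txt in a workflow submit directory into a dict.

    Assumes the key/value layout shown above; illustrative helper only.
    """
    entries = {}
    with open(os.path.join(submit_dir, "braindump.txt")) as f:
        for line in f:
            parts = line.strip().split(None, 1)  # key, rest-of-line value
            if len(parts) == 2:
                entries[parts[0]] = parts[1]
    return entries

# Example: find the jobstate.log of a run
# info = read_braindump("/path/to/submit/dir")
# print(info.get("jsd"))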
Condor DAGMan file
The Condor DAGMan file (.dag) is the input to Condor DAGMan (the workflow executor used by Pegasus).
The Pegasus generated .dag file usually contains the following information for each job:
- the job submit file for each job in the DAG.
- the post script that is to be invoked when a job completes. This is usually $PEGASUS_HOME/bin/exitpost that parses the kickstart record in the job's .out file and determines the exitcode.
- JOB RETRY - the number of times the job is to be retried in case of failure. In the case of Pegasus, the job postscript exits with a non-zero exitcode if it determines that a failure occurred.
Reading a Condor DAG file
The Condor DAG file below has the following fields highlighted for a single job:
- JOB and the submit file for the job
- Post script that is invoked on the stdout brought back to the submit directory
- JOB RETRY
At the end of the DAG file, the relations between the jobs (which identify the underlying DAG structure) are highlighted.
######################################################################
# PEGASUS WMS GENERATED DAG FILE
# DAG scb
# Index = 0, Count = 1
######################################################################
JOB das_tide_ID000001 das_tide_ID000001.sub
SCRIPT POST das_tide_ID000001 /lfs1/software/install/pegasus/default/bin/exitpost -Dpegasus.user.properties=/lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/pegasus.32479.properties -e /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/das_tide_ID000001.out
RETRY das_tide_ID000001 3
....
JOB create_dir_scb_0_cobalt create_dir_scb_0_cobalt.sub
SCRIPT POST create_dir_scb_0_cobalt /lfs1/software/install/pegasus/default/bin/exitpost -Dpegasus.user.properties=/lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/pegasus.32479.properties -e /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/create_dir_scb_0_cobalt.out
RETRY create_dir_scb_0_cobalt 3
PARENT das_tide_ID000001 CHILD fcst_tide_ID000002
...
PARENT create_dir_scb_0_cobalt CHILD das_tide_ID000001
PARENT create_dir_scb_0_cobalt CHILD fcst_tide_ID000002
######################################################################
# End of DAG
######################################################################
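The JOB and PARENT/CHILD lines shown above are regular whitespace-delimited text, so the structure of a generated DAG can be inspected with a few lines of Python. This is only an illustrative sketch (read_dag is not a Pegasus tool) that collects the jobs and edges described in this section.

def read_dag(dag_file):
    """Return (jobs, edges) from a Condor DAGMan .dag file.

    jobs  maps job name -> submit file (from JOB lines)
    edges is a list of (parent, child) tuples (from PARENT ... CHILD ... lines)
    """
    jobs, edges = {}, []
    with open(dag_file) as f:
        for line in f:
            tokens = line.split()
            if not tokens or tokens[0].startswith("#"):
                continue  # skip blank lines and comments
            if tokens[0] == "JOB" and len(tokens) >= 3:
                jobs[tokens[1]] = tokens[2]
            elif tokens[0] == "PARENT" and "CHILD" in tokens:
                i = tokens.index("CHILD")
                for parent in tokens[1:i]:
                    for child in tokens[i + 1:]:
                        edges.append((parent, child))
    return jobs, edges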
Kickstart XML Record
Kickstart is a lightweight C executable that is shipped with the Pegasus worker package. All jobs are launched via kickstart on the remote end, unless explicitly disabled at the time of running pegasus-plan.
Kickstart does not work with
- Condor Standard Universe Jobs
- MPI Jobs
Pegasus automatically disables kickstart for the above jobs.
Kickstart captures useful runtime provenance information about the job it launches on the remote node, and puts it in an XML record that it writes to its stdout. The stdout appears in the workflow submit directory as <job>.out.00n. Some of the useful information captured and logged by kickstart is as follows:
- the exitcode with which the job it launched exited.
- the duration of the job
- the start time for the job
- the node on which the job ran
- the stdout/stderr of the job
- the arguments with which it launched the job
- the environment that was set for the job before it was launched.
- the machine information about the node that the job ran on
Of the information above, the job duration and start time are also available from the dagman.out file, but only as coarser grained estimates.
Reading a Kickstart Output File
The kickstart file below has the following fields highlighted
- the host on which the job executed and the IP address of that host
- the duration and start time of the job. The time here is in reference to the clock on the remote node where job executed.
- exitcode with which the job executed
- the arguments with which the job was launched.
- the directory in which the job executed on the remote site
- the stdout of the job
- the stderr of the job
- the environment of the job
<?xml version="1.0" encoding="ISO-8859-1"?>
<invocation xmlns="http://pegasus.isi.edu/schema/invocation" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/invocation http://pegasus.isi.edu/schema/iv-2.0.xsd" version="2.0" start="2009-01-30T19:17:41.157-06:00" duration="0.321" transformation="pegasus::dirmanager" derivation="pegasus::dirmanager:1.0" resource="cobalt" wf-label="scb" wf-stamp="2009-01-30T17:12:55-08:00" hostaddr="141.142.30.219" hostname="co-login.ncsa.uiuc.edu" pid="27714" uid="29548" user="vahi" gid="13872" group="bvr" umask="0022">
<mainjob start="2009-01-30T19:17:41.426-06:00" duration="0.052" pid="27783">
<usage utime="0.036" stime="0.004" minflt="739" majflt="0" nswap="0" nsignals="0" nvcsw="36" nivcsw="3"/>
<status raw="0"><regular exitcode="0"/></status>
<statcall error="0">
<!-- deferred flag: 0 -->
<file name="/u/ac/vahi/SOFTWARE/pegasus/default/bin/dirmanager">23212F7573722F62696E2F656E762070</file>
<statinfo mode="0100755" size="8202" inode="85904615883" nlink="1" blksize="16384" blocks="24" mtime="2008-09-22T18:52:37-05:00" atime="2009-01-30T14:54:18-06:00" ctime="2009-01-13T19:09:47-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<argument-vector>
<arg nr="1">--create</arg>
<arg nr="2">--dir</arg>
<arg nr="3">/u/ac/vahi/globus-test/EXEC/vahi/pegasus/scb/run0001</arg>
</argument-vector>
</mainjob>
<cwd>/u/ac/vahi/globus-test/EXEC</cwd>
<usage utime="0.012" stime="0.208" minflt="4232" majflt="0" nswap="0" nsignals="0" nvcsw="15" nivcsw="74"/>
<machine page-size="16384" provider="LINUX">
<stamp>2009-01-30T19:17:41.157-06:00</stamp>
<uname system="linux" nodename="co-login" release="2.6.16.54-0.2.5-default" machine="ia64">#1 SMP Mon Jan 21 13:29:51 UTC 2008</uname>
<ram total="148299268096" free="123371929600" shared="0" buffer="2801664"/>
<swap total="1179656486912" free="1179656486912"/>
<boot idle="1315786.920">2009-01-15T10:19:50.283-06:00</boot>
<cpu count="32" speed="1600" vendor=""></cpu>
<load min1="3.50" min5="3.50" min15="2.60"/>
<proc total="841" running="5" sleeping="828" stopped="5" vmsize="10025418752" rss="2524299264"/>
<task total="1125" running="6" sleeping="1114" stopped="5"/>
</machine>
<statcall error="0" id="stdin">
<!-- deferred flag: 0 -->
<file name="/dev/null"/>
<statinfo mode="020666" size="0" inode="68697" nlink="1" blksize="16384" blocks="0" mtime="2007-05-04T05:54:02-05:00" atime="2007-05-04T05:54:02-05:00" ctime="2009-01-15T10:21:54-06:00" uid="0" user="root" gid="0" group="root"/>
</statcall>
<statcall error="0" id="stdout">
<temporary name="/tmp/gs.out.s9rTJL" descriptor="3"/>
<statinfo mode="0100600" size="29" inode="203420686" nlink="1" blksize="16384" blocks="128" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
<data>mkdir finished successfully.
</data>
</statcall>
<statcall error="0" id="stderr">
<temporary name="/tmp/gs.err.kobn3S" descriptor="5"/>
<statinfo mode="0100600" size="0" inode="203420689" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="gridstart">
<!-- deferred flag: 0 -->
<file name="/u/ac/vahi/SOFTWARE/pegasus/default/bin/kickstart">7F454C46020101000000000000000000</file>
<statinfo mode="0100755" size="255445" inode="85904615876" nlink="1" blksize="16384" blocks="504" mtime="2009-01-30T18:06:28-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T18:06:28-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="logfile">
<descriptor number="1"/>
<statinfo mode="0100600" size="0" inode="53040253" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:39-06:00" atime="2009-01-30T19:17:39-06:00" ctime="2009-01-30T19:17:39-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="channel">
<fifo name="/tmp/gs.app.Ien1m0" descriptor="7" count="0" rsize="0" wsize="0"/>
<statinfo mode="010640" size="0" inode="203420696" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<environment>
<env key="GLOBUS_GRAM_JOB_CONTACT">https://co-login.ncsa.uiuc.edu:50001/27456/1233364659/</env>
<env key="GLOBUS_GRAM_MYJOB_CONTACT">URLx-nexus://co-login.ncsa.uiuc.edu:50002/</env>
<env key="GLOBUS_LOCATION">/usr/local/prews-gram-4.0.7-r1/</env>
....
</environment>
<resource>
<soft id="RLIMIT_CPU">unlimited</soft>
<hard id="RLIMIT_CPU">unlimited</hard>
<soft id="RLIMIT_FSIZE">unlimited</soft>
....
</resource>
</invocation>
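Since the kickstart record is plain XML in the invocation namespace shown above, the highlighted fields can be pulled out with Python's standard library for quick checks. The sketch below assumes the .out file contains a single invocation document (clustered jobs may contain more than one), and the summarize_kickstart name is purely illustrative.

import xml.etree.ElementTree as ET

NS = {"iv": "http://pegasus.isi.edu/schema/invocation"}

def summarize_kickstart(out_file):
    """Extract a few provenance fields from a kickstart invocation record."""
    root = ET.parse(out_file).getroot()          # the <invocation> element
    mainjob = root.find("iv:mainjob", NS)
    status = mainjob.find("iv:status/iv:regular", NS)
    args = [a.text for a in mainjob.findall("iv:argument-vector/iv:arg", NS)]
    return {
        "host": root.get("hostname"),
        "hostaddr": root.get("hostaddr"),
        "start": mainjob.get("start"),
        "duration": float(mainjob.get("duration")),
        "exitcode": int(status.get("exitcode")) if status is not None else None,
        "cwd": root.findtext("iv:cwd", namespaces=NS),
        "arguments": args,
    }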
Jobstate.log File
The jobstate.log file logs the various states that a job goes through during workflow execution. It is created by the tailstatd daemon that is launched when a workflow is submitted to Condor DAGMan by pegasus-run. Tailstatd parses the dagman.out file and writes out the jobstate.log file, the format of which is more amenable to parsing.
Note: The jobstate.log file is not created if a user uses condor_submit_dag to submit a workflow to Condor DAGMan.
The jobstate.log file can be created after a workflow has finished executing by running tailstatd on the dagman.out file in the workflow submit directory.
Executing tailstatd when pegasus-run was not used to submit the workflow
cd workflow-submit-directory
tailstatd -n --nodatabase <daxlabel-daxindex>.dag.dagman.out
Below is a snippet from the jobstate.log for a single job executed via Condor-G:
1239666049 create_dir_blackdiamond_0_isi_viz SUBMIT 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz EXECUTE 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz GLOBUS_SUBMIT 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz GRID_SUBMIT 3758.0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz JOB_TERMINATED 3758.0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz JOB_SUCCESS 0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_STARTED - isi_viz -
1239666069 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_TERMINATED 3758.0 isi_viz -
1239666069 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_SUCCESS - isi_viz -
Each entry in jobstate.log has the following
- the timestamp (seconds since the epoch) at which the particular event happened
- the name of the job.
- the event recorded by DAGMan for the job.
- the condor id of the job in the queue on the submit node
- the pegasus site to which the job is mapped
The lifecycle of a job when executed as part of the workflow is as follows:
State/Event | Description |
---|---|
SUBMIT | job is submitted by condor schedd for execution. |
EXECUTE | condor schedd detects that a job has started execution. |
GLOBUS_SUBMIT | the job has been submitted to the remote resource. It is only written for GRAM jobs (i.e. gt2 and gt4). |
GRID_SUBMIT | same as the GLOBUS_SUBMIT event. The ULOG_GRID_SUBMIT event is written for all grid universe jobs. |
JOB_TERMINATED | job terminated on the remote node. |
JOB_SUCCESS | job succeeded on the remote host; the condor id field contains zero (the successful exit code). |
JOB_FAILURE | job failed on the remote host; the condor id field contains the job's exit code. |
POST_SCRIPT_STARTED | post script started by DAGMan on the submit host, usually to parse the kickstart output. |
POST_SCRIPT_TERMINATED | post script finished on the submit node. |
POST_SCRIPT_SUCCESS / POST_SCRIPT_FAILURE | post script succeeded or failed. |
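The entries described above are plain whitespace-delimited lines, so jobstate.log is straightforward to post-process. A minimal Python sketch (illustrative only, not a Pegasus tool) that yields the fields of each entry:

from collections import namedtuple

JobstateEntry = namedtuple("JobstateEntry", "timestamp job event condor_id site")

def parse_jobstate(path):
    """Yield the first five fields of every entry in a jobstate.log file."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 5:
                continue  # skip blank or malformed lines
            yield JobstateEntry(int(fields[0]), fields[1], fields[2],
                                fields[3], fields[4])

# Example: count how often each event occurred
# from collections import Counter
# print(Counter(e.event for e in parse_jobstate("jobstate.log")))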
Pegasus job.map file
Pegasus creates a workflow.job.map file that links jobs in the DAG with the jobs in the DAX. The contents of the file are in netlogger format. The purpose of this file is to be able to link an invocation record of a task to the corresponding job in the DAX
In the filename, workflow is replaced by the name of the workflow, i.e. the same prefix as the .dag file.
In the file there are two types of events:
- pegasus.job
- pegasus.job.map
pegasus.job - This event is for all the jobs in the DAG. The following information is associated with this event.
- job.id the id of the job in the DAG
- job.class an integer designating the type of the job
- job.xform the logical transformation which the job refers to.
- task.count the number of tasks associated with the job. This is equal to the number of pegasus.job.task events created for that job.
pegasus.job.map - This event allows us to associate a job in the DAG with the jobs in the DAX. The following information is associated with this event.
- task.id the id of the corresponding job (task) in the DAX
- task.class an integer designating the type of the task
- task.xform the logical transformation which the task refers to.
Some sample entries are as follows
ts=2009-04-21T23:09:03.091658Z event=pegasus.job job.id=analyze_ID000004 job.class="7" job.xform="vahi::analyze:1.0" task.count="1"
ts=2009-04-21T23:09:03.091772Z event=pegasus.job.map job.id=analyze_ID000004 task.id="ID000004" task.class="7" task.xform="vahi::analyze:1.0"
ts=2009-04-21T23:09:03.092063Z event=pegasus.job job.id=create_dir_blackdiamond_0_isi_viz job.class="6" job.xform="pegasus::dirmanager" task.count="0"
ts=2009-04-21T23:09:03.092165Z event=pegasus.job job.id=merge_vahi-findrange-1.0_PID2_ID1 job.class="1" job.xform="pegasus::seqexec" task.count="2"
ts=2009-04-21T23:09:03.093259Z event=pegasus.job.map job.id=merge_vahi-findrange-1.0_PID2_ID1 task.id="ID000002" task.class="7" task.xform="vahi::findrange:1.0"
ts=2009-04-21T23:09:03.093402Z event=pegasus.job.map job.id=merge_vahi-findrange-1.0_PID2_ID1 task.id="ID000003" task.class="7" task.xform="vahi::findrange:1.0"
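Netlogger lines like the ones above are whitespace-separated key=value pairs, with some values quoted. A small Python sketch (illustrative only) to turn each line of the job.map file into a dictionary:

import shlex

def parse_netlogger_line(line):
    """Parse one netlogger-style line into a dict of key -> value."""
    record = {}
    for token in shlex.split(line):  # shlex removes the surrounding quotes
        key, _, value = token.partition("=")
        record[key] = value
    return record

# Example with the first sample entry above:
# rec = parse_netlogger_line('ts=2009-04-21T23:09:03.091658Z event=pegasus.job '
#                            'job.id=analyze_ID000004 job.class="7" '
#                            'job.xform="vahi::analyze:1.0" task.count="1"')
# rec["event"]      -> 'pegasus.job'
# rec["task.count"] -> '1'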
Pegasus Workflow Job States and Delays
The various job states that a job goes through (as captured in the dagman.out and jobstate.log files) during its lifecycle are illustrated below. The figure below highlights the various local and remote delays during the job lifecycle.
In some cases the Grid Submit and Condor Execute events may be interchanged, depending on whether the Condor Grid Monitor is enabled or not.
The information in the kickstart output files and the Condor DAGMan logs can be mined to retrieve useful statistics about how the workflow ran. The data retrieved can then be used to generate useful graphs. This section lists the various ways to visualize workflows and mine statistics using helper scripts in the Pegasus distribution and external tools such as netlogger.
Visualizing Graph Structure of A Workflow
Visualizing the structure of a workflow is a two step process. The first step is to convert the workflow description (DAX or DAG) into a DOT file. DOT is a special file format used for the visual display of graphs. The next step is to view the DOT file in a viewer, or use it to generate an image. These steps are described in more detail below.
Generating DOT from DAX
You can generate a DOT file from a DAX file using the pegasus-graphviz tool provided with Pegasus in the $PEGASUS_HOME/bin directory.
Usage
Usage: pegasus-graphviz [options] FILE

Parses FILE and generates a DOT-formatted graphical representation of the DAG.
FILE can be a Condor DAGMan file, or a Pegasus DAX file.

Options:
  -h, --help            show this help message and exit
  -s, --nosimplify      Do not simplify the graph by removing redundant edges.
                        [default: False]
  -l LABEL, --label=LABEL
                        What attribute to use for labels. One of 'label',
                        'xform', 'id', 'xform-id', 'label-xform', 'label-id'.
                        For 'label', the transformation is used for jobs that
                        have no node-label. [default: label]
  -o FILE, --output=FILE
                        Write output to FILE [default: stdout]
  -r XFORM, --remove=XFORM
                        Remove jobs from the workflow by transformation name
  -W WIDTH, --width=WIDTH
                        Width of the digraph
  -H HEIGHT, --height=HEIGHT
                        Height of the digraph
  -f, --files           Include files. This option is only valid for DAX
                        files. [default: false]
Example
$PEGASUS_HOME/bin/pegasus-graphviz --output scb.dot scb_dax.xml
Generating DOT from DAG
Pegasus automatically generates a DOT file (<daxlabel-daxindex>.dot) for each executable workflow and saves it in the submit directory. You can use this file, or you can generate a different one using the pegasus-graphviz tool provided with Pegasus ($PEGASUS_HOME/bin/pegasus-graphviz). The difference is that pegasus-graphviz gives you some additional options that aren't available if you use the automatically-generated DOT file.
Usage
Usage: pegasus-graphviz [options] FILE

Parses FILE and generates a DOT-formatted graphical representation of the DAG.
FILE can be a Condor DAGMan file, or a Pegasus DAX file.

Options:
  -h, --help            show this help message and exit
  -s, --nosimplify      Do not simplify the graph by removing redundant edges.
                        [default: False]
  -l LABEL, --label=LABEL
                        What attribute to use for labels. One of 'label',
                        'xform', 'id', 'xform-id', 'label-xform', 'label-id'.
                        For 'label', the transformation is used for jobs that
                        have no node-label. [default: label]
  -o FILE, --output=FILE
                        Write output to FILE [default: stdout]
  -r XFORM, --remove=XFORM
                        Remove jobs from the workflow by transformation name
  -W WIDTH, --width=WIDTH
                        Width of the digraph
  -H HEIGHT, --height=HEIGHT
                        Height of the digraph
  -f, --files           Include files. This option is only valid for DAX
                        files. [default: false]
Example
$PEGASUS_HOME/bin/pegasus-graphviz --output scb.dot scb-0.dag
Viewing DOT files
DOT files can be used to generate images or displayed using a viewer. You can find DOT file viewers at http://www.graphviz.org. In addition, on Mac OS X the OmniGraffle program can read and display DOT files. The advantage of OmniGraffle is that you can edit the DOT file visually and export it in a number of formats.
To generate a jpeg file using the "dot" program distributed with GraphViz run:
dot -Tjpeg -o SCB_DAX.jpg scb.dot
Here is an example image generated from a DAX.
Here is an example image generated from a DAG.
Visualizing a Single Workflow Run
show-job
show-job is a Perl script that can be used to generate a Gantt chart of a workflow run.
It generates the Gantt chart in ploticus input format and then generates an eps file and a png file using the ploticus program. The ploticus executable should be in your path.
Usage
$PEGASUS_HOME/contrib/showlog/show-job --color-file <the file mapping job transformation names to color> <path to the dag file>
Sample Usage
sukhna 59% $PEGASUS_HOME/contrib/showlog/show-job --color-file color.in dags/vahi/pegasus/scb/run0001/scb-0.dag
# min=1233364634 2009-01-30T17:17:14-08:00
# max=1233376529 2009-01-30T20:35:29-08:00
# diff=11895
# running y=1...
# xstubs=1800, xticks=600, width=13.2166666666667, height=5
job scb::das_tide has color green
job pegasus::transfer has color magenta
job pegasus::dirmanager has color lavender
job scb::fcst_tide has color orange
job pegasus::rc-client has color powderblue2
job unknown has color gray(0.75)
job scb::interpolate has color blue
# /old-usr/sukhna/install/ploticus/pl232src/bin/ploticus /tmp/sj-UDs7SZ-1.pls -eps -o /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-1.eps
# /usr/ucb/convert -density 96x96 /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-1.eps /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-1.png
# running y=2...
# xstubs=1800, xticks=600, width=13.2166666666667, height=5
job scb::das_tide has color green
job pegasus::transfer has color magenta
job pegasus::dirmanager has color lavender
job scb::fcst_tide has color orange
job pegasus::rc-client has color powderblue2
job unknown has color gray(0.75)
job scb::interpolate has color blue
# /old-usr/sukhna/install/ploticus/pl232src/bin/ploticus /tmp/sj-UDs7SZ-2.pls -eps -o /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-2.eps
# /usr/ucb/convert -density 96x96 /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-2.eps /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-2.png
number of jobs: 8
number of script failures: 0
sequential duration of jobs: 9090 s
total workflow duration: 11895 s (speed-up 0.8)
Sample color file
The color file can be used to provide different colors for the different transformations in the DAX.
scb::das_tide green
scb::fcst_tide orange
scb::interpolate blue
Here is a sample Gantt chart for a workflow execution run:
[Image: SCB Workflow Gantt Chart]
Visualizing a Workflow of Workflows Runs using Netlogger
If a user executes a workflow that in turn contains other sub workflows, it is possible to visualize them with a little help from the netlogger folks.
SCEC runs a workflow of workflows, where each outer level workflow has about 80 sub workflows. The workflow logs need to be loaded into the netlogger database, and then a Gantt chart can be plotted using the R statistical package.
Here is a sample cumulative time Gantt chart:
[Image: Netlogger Visualization of SCEC workflows]
Netlogger Database Structure
Tables
- event
- ident
- attr
Entries from job map
Event | Ident | Attr |
---|---|---|
pegasus.job | job | task.count |
| | job.xform |
| | job.class |
pegasus.job.map | job | task.class |
| task | task.xform |
Entries from condor dag
Event | Ident | Attr |
---|---|---|
condor.dag.edge | comp.parent | |
| comp.child | |
Entries from jobstate log
Event | Ident | Attr |
---|---|---|
pegasus.jobstate.submit | site | status |
pegasus.jobstate.execute | condor | dur |
pegasus.jobstate.job_terminated | comp | |
pegasus.jobstate.postscript | | |
pegasus.jobstate.image_size | | |
pegasus.jobstate.job_evicted | | |
pegasus.jobstate.job_disconnected | | |
pegasus.jobstate.job_reconnect_failed | | |
pegasus.jobstate.shadow_exception | | |
Entries from Invocation Records (.out files)
Event | Ident | Attr |
---|---|---|
pegasus.invocation | workflow | status |
| comp | nsignals |
| | transformation |
| | host |
| | user |
| | duration |
| | arguments |
pegasus.invocation.stat_error | comp | status |
| | group |
| | file |
| | user |
Workflow Statistics
genstats
genstats is a Perl script distributed with Pegasus that generates a table listing statistics for each job in the executable workflow (Condor DAG).
Usage
genstats --dag <dagfilename> --output <the output results directory> --jobstate-log <path to the jobstate.log file>
Sample Usage
genstats --dag scb-0.dag --output /lfs1/work/jpl/scb_results/run0001 --jobstate-log jobstate.log
genstats generates the following information for each job in the jobs file in the output results directory:
- Job - the name of the job
- Site - the site where the job ran
- Kickstart - the actual duration of the job in seconds on the remote compute node
- Post - the postscript time as reported by DAGMan
- Condor - the time between submission by DAGMan and the remote Grid submission. It is an estimate of the time spent in the Condor queue on the submit node
- Resource - the time between the remote Grid submission and the start of remote execution. It is an estimate of the time the job spent in the remote queue
- Runtime - the time spent on the resource as seen by Condor DAGMan. It is always >= the kickstart duration
- CondorQLen - the number of outstanding jobs in the queue when this job was released.
Here is a sample jobs file created by genstats
Job Site Kickstart Post DAGMan Condor Resource Runtime CondorQLen Seqexec Seqexec-Delay
create_dir_scb_0_cobalt cobalt 0.00 5.00 13.00 15.00 0.00 15.00 1 - -
das_tide_ID000001 cobalt 0.00 5.00 5.00 15.00 3906.00 3855.00 1 - -
fcst_tide_ID000002 cobalt 0.00 5.00 5.00 15.00 90.00 465.00 1 - -
interpolate_ID000003 cobalt 0.00 5.00 5.00 15.00 155.00 160.00 1 - -
stage_in_das_tide_ID000001_0 cobalt 0.00 5.00 5.00 20.00 5.00 2946.00 1 - -
stage_in_fcst_tide_ID000002_0 cobalt 0.00 5.00 5.00 20.00 5.00 1805.00 2 - -
stage_in_interpolate_ID000003_0 cobalt 0.00 5.00 5.00 15.00 0.00 435.00 3 - -
stage_out_interpolate_ID000003_0 cobalt 0.00 5.00 5.00 15.00 0.00 135.00 1 - -
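Because the jobs file is a simple whitespace-delimited table with the header shown above, it can be aggregated further with a few lines of Python. The sketch below (illustrative only, not a Pegasus tool) sums a numeric column such as Kickstart or Runtime; the path to the jobs file created by genstats is assumed.

def sum_column(jobs_file, column="Kickstart"):
    """Sum a numeric column of a genstats jobs file; '-' entries are skipped."""
    total = 0.0
    with open(jobs_file) as f:
        header = f.readline().split()
        idx = header.index(column)
        for line in f:
            fields = line.split()
            if len(fields) <= idx:
                continue
            try:
                total += float(fields[idx])
            except ValueError:
                pass  # non-numeric entries like '-'
    return total

# Example (path to the jobs file is assumed):
# print(sum_column("/lfs1/work/jpl/scb_results/run0001/jobs", column="Runtime"))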
genstats-breakdown
genstats-breakdown is a Perl script distributed with Pegasus that generates a table listing statistics for each type of logical transformation in the executable workflow (Condor DAG). For example, this tool will generate statistics grouped by the transfer transformation, which encompasses the stage-in, stage-out, inter-site and symlinking transfer jobs.
Usage
$PEGASUS_HOME/bin/genstats-breakdown --output=<output file> -x <the workflow submit directory>
Users can pass multiple workflow submit directories using the -x option. In that case, the statistics are written for each of the submit directories, and also across all the directories.
Sample Usage
genstats-breakdown --output breakdown.txt -x dags/vahi/pegasus/scb/run000*
Here is a sample breakdown.txt file created
dags/vahi/pegasus/scb/run0001
Transformation Count Mean Variance
pegasus::transfer 4 1200.65 1660108.49
scb::das_tide 1 3806.65 0.00
pegasus::dirmanager 1 0.32 0.00
scb::fcst_tide 1 346.39 0.00
scb::interpolate 1 134.49 0.00
dags/vahi/pegasus/scb/run0002
Transformation Count Mean Variance
pegasus::transfer 4 1191.27 1580276.06
scb::das_tide 1 3811.54 0.00
pegasus::dirmanager 1 0.34 0.00
scb::fcst_tide 1 344.90 0.00
scb::interpolate 1 128.56 0.00
dags/vahi/pegasus/scb/run0003
Transformation Count Mean Variance
pegasus::transfer 4 1203.00 1635850.78
scb::das_tide 1 3794.60 0.00
pegasus::dirmanager 1 0.32 0.00
scb::fcst_tide 1 492.81 0.00
scb::interpolate 1 108.58 0.00
dags/vahi/pegasus/scb/run0004
Transformation Count Mean Variance
pegasus::transfer 4 1168.31 1521384.54
scb::das_tide 1 3861.94 0.00
pegasus::dirmanager 1 0.29 0.00
scb::fcst_tide 1 348.76 0.00
scb::interpolate 1 139.54 0.00
All
Transformation Count Mean Variance
pegasus::transfer 16 1190.81 1279724.52
scb::das_tide 4 3818.68 882.31
pegasus::dirmanager 4 0.32 0.00
scb::fcst_tide 4 383.22 5341.00
scb::interpolate 4 127.79 184.18
Populating and Mining Netlogger Database
For large workflows, users can load the workflow logs into a netlogger database.
Populating a Netlogger Database
Details about installing netlogger database and loading data into it can be found at
http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
In general netlogger requires the following components
- mysql | sqlite backend to populate to
- python 2.5
- python bindings for mysql | sqlite
Mining a netlogger database
Once data has been loaded into a netlogger database, a variety of queries can be issued to the db.
The queries can help users answer the following questions:
- how many jobs ran on a given day
- what was the cumulative runtime of these jobs
- how many jobs ran on given hosts
- how many jobs of a given type ran on a given day
- how many jobs failed
- how many jobs succeeded
Complex Queries
Users can issue complex queries to the DB on the basis of the DAX label in the original DAX.
In the case of a workflow of workflows, where each of the DAXes has a similar DAX label, users can generate statistics either for an individual sub workflow or for all the workflows together.
The queries below are for all the workflows together, organized by workflow id.
Queries Per Workflow Where Workflow ID Is a DAX Label
- Total number of jobs
select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id where attr.name = 'status' and ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%';
- Total number of succeeded jobs
select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id where attr.name = 'status' and attr.value = '0' and ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%';
- Breakdown of jobs
select attr.value, count(attr.e_id) from attr join ident on attr.e_id = ident.e_id where ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%' and attr.name='type' group by attr.value;
- Total Runtime of jobs
select sum(attr.value) from attr join ident on attr.e_id=ident.e_id where attr.name='duration' and ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%';
Queries Per Workflow Per Job Type
- Runtime Breakdown by job type per workflow
select TRANSFORMATION, count(TRANSFORMATION) as number ,round(sum(attr.value),2) as sum_seconds, round(sum(attr.value)/(3600),2) as sum_hours, round(avg(attr.value),2) as avg_seconds from attr join (select attr.e_id as event_id, attr.value as TRANSFORMATION from attr join ident on attr.e_id=ident.e_id where attr.name='type' and ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident on attr.e_id=event_id WHERE attr.name='duration' group by TRANSFORMATION;
- Number of failures by job type per workflow
select TRANSFORMATION, count(TRANSFORMATION) as failures from attr join (select attr.e_id as event_id, attr.value as TRANSFORMATION from attr join ident on attr.e_id=ident.e_id where attr.name='type' and ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident on attr.e_id=event_id WHERE attr.name = 'status' and attr.value != '0' group by TRANSFORMATION;
Queries Per Unit Time Per Workflow
- Jobs Per Day Per Workflow
select count(id) as 'count', day(from_unixtime(time)) as day from event join attr on attr.e_id = event.id join ident on attr.e_id=ident.e_id where event.name = 'pegasus.invocation' and attr.name = 'host' and ident.name='workflow' and ident.value LIKE 'CyberShake_CCP%' group by day;
- Jobs Per Day Per Hour Per Workflow
SELECT day(from_unixtime(time)) as day, hour(from_unixtime(time)) as hour, count(event.id) as 'count' FROM event JOIN attr on attr.e_id = event.id JOIN ident on attr.e_id=ident.e_id WHERE event.name = 'pegasus.invocation' and attr.name = 'host' and ident.name='workflow' and ident.value LIKE 'CyberShake_CCP%' GROUP BY day, hour ORDER BY day, hour;
- Jobs Per Host Per Hour Per Workflow
SELECT attr.value as host, day(from_unixtime(time)) as 'day', hour(from_unixtime(time)) as 'hour', count(event.id) as 'count' from event JOIN attr on attr.e_id = event.id JOIN ident on attr.e_id=ident.e_id WHERE event.name = 'pegasus.invocation' and attr.name = 'host' and ident.name='workflow' and ident.value LIKE 'CyberShake_USC%' group by host, day,hour ORDER BY day, hour;
Full details are available at http://www.cedps.net/index.php/Pegasus_Sample_Queries