Author: Karan Vahi October 21st, 2009
Last Updated: Karan Vahi January 28th, 2010

 

Introduction

This page lists the various metrics/graphs a user can generate by executing a workflow through Pegasus. Pegasus takes in an abstract workflow (DAX) and generates an executable workflow (DAG) in the submit directory.

Some of this information is outdated. For more up-to-date info see: http://pegasus.isi.edu/wms/docs/latest/submit_directory.php

Layout

Each planned workflow is associated with a submit directory. In it you will see the following

  • <daxlabel-daxindex>.dag file - This is the Condor DAGMan dag file corresponding to the executable workflow generated by Pegasus. The dag file describes the edges in the DAG and information about the jobs in the DAG. A Pegasus-generated .dag file usually contains the following information for each job
    • the job submit file for each job in the DAG.
    • the post script that is to be invoked when a job completes. This is usually $PEGASUS_HOME/bin/exitpost that parses the kickstart record in the job's .out file and determines the exitcode.
    • JOB RETRY - the number of times the job is to be retried in case of failure. In the case of Pegasus, the job postscript exits with a non-zero exitcode if it determines that a failure occurred.
  • <daxlabel-daxindex>.dag.dagman.out - When a DAG (.dag file) is executed by Condor DAGMan, DAGMan writes its output to the <daxlabel-daxindex>.dag.dagman.out file. This file tells us the progress of the workflow, and can be used to determine the status of the workflow. Most of the Pegasus tools mine the dagman.out or jobstate.log to determine the progress of the workflows.
  • <daxlabel-daxindex>.dot - Pegasus creates a dot file for the executable workflow in addition to the .dag file. This can be used to visualize the executable workflow using the dot program.
  • <job>.sub - Each job in the executable workflow is associated with its own submit file. The submit file tells Condor how to execute the job.
  • <job>.out.00n - The stdout of the executable referred to in the job submit file. In the case of Pegasus, most jobs are launched via kickstart, so this file contains the kickstart XML provenance record that captures runtime provenance on the remote node where the job was executed. n varies from 1 to N, where N is the JOB RETRY value in the .dag file. The exitpost executable is invoked on the <job>.out file and moves it to <job>.out.00n so that the job's .out files are preserved across retries.
  • <job>.err.00n - The stderr of the executable referred to in the job submit file. In the case of Pegasus, most jobs are launched via kickstart, so this file contains the stderr of kickstart. It is usually empty unless there is an error in kickstart itself, e.g. kickstart segfaults, or the kickstart location specified in the submit file is incorrect. The exitpost executable moves the <job>.err to <job>.err.00n so that the job's .err files are preserved across retries.
  • jobstate.log - The jobstate.log file is written out by the tailstatd daemon that is launched when a workflow is submitted for execution by pegasus-run. The tailstatd daemon parses the dagman.out file and writes out the jobstate.log that is easier to parse. The jobstate.log captures the various states through which a job goes during the workflow.
  • braindump.txt - Contains information about the planning run, such as the Pegasus version, dax file, dag file and dax label. For example:

    dax /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dax/CyberShake_LGU.dax
    dag CyberShake_LGU-0.dag
    basedir /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags
    run /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags/gmehta/pegasus/CyberShake_LGU/20100106T162339-0800
    jsd /Users/gmehta/Documents/Work/2009/SCEC/Cybershake/OSG/dags/gmehta/pegasus/CyberShake_LGU/20100106T162339-0800/jobstate.log
    rundir 20100106T162339-0800
    pegasushome /usr/local/pegasus/default
    vogroup pegasus
    label CyberShake_LGU
    planner /usr/local/pegasus/default/bin/pegasus-plan
    pegasus_generator Pegasus
    pegasus_version 2.4.0cvs
    pegasus_build 20091221194342Z
    pegasus_wf_name CyberShake_LGU-0
    pegasus_wf_time 20100106T162339-0800
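
Since braindump.txt is a simple whitespace-delimited key/value file, its entries can be read with a few lines of code. Below is a minimal illustrative sketch in Python (not a Pegasus-provided tool); it uses only the file and key names shown in the example above.

    # Sketch: read braindump.txt into a dictionary (key = first token, value = rest of the line).
    def read_braindump(path="braindump.txt"):
        entries = {}
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line:
                    key, _, value = line.partition(" ")
                    entries[key] = value.strip()
        return entries

    # Example: locate the jobstate.log and the .dag file of a planned workflow.
    info = read_braindump()
    print(info.get("dag"), info.get("jsd"))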

Condor DAGMan file

The Condor DAGMan file (.dag) is the input to Condor DAGMan (the workflow executor used by Pegasus).

A Pegasus-generated .dag file usually contains the following information for each job

  • the job submit file for each job in the DAG.
  • the post script that is to be invoked when a job completes. This is usually $PEGASUS_HOME/bin/exitpost that parses the kickstart record in the job's .out file and determines the exitcode.
  • JOB RETRY - the number of times the job is to be retried in case of failure. In the case of Pegasus, the job postscript exits with a non-zero exitcode if it determines that a failure occurred.

Reading a Condor DAG file

The Condor DAG file below has the following fields highlighted for a single job

  1. JOB and the submit file for the job
  2. Post script that is invoked on the stdout brought back to the submit directory
  3. JOB RETRY

At the end of the DAG file, the relations between the jobs (that identify the underlying DAG structure) are highlighted.

Condor DAGMan File

######################################################################
# PEGASUS WMS GENERATED DAG FILE
# DAG scb
# Index = 0, Count = 1
######################################################################

JOB das_tide_ID000001 das_tide_ID000001.sub
SCRIPT POST das_tide_ID000001 /lfs1/software/install/pegasus/default/bin/exitpost -Dpegasus.user.properties=/lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/pegasus.32479.properties -e /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/das_tide_ID000001.out
RETRY das_tide_ID000001 3

....

JOB create_dir_scb_0_cobalt create_dir_scb_0_cobalt.sub
SCRIPT POST create_dir_scb_0_cobalt /lfs1/software/install/pegasus/default/bin/exitpost -Dpegasus.user.properties=/lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/pegasus.32479.properties -e /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/create_dir_scb_0_cobalt.out
RETRY create_dir_scb_0_cobalt 3

PARENT das_tide_ID000001 CHILD fcst_tide_ID000002
...
PARENT create_dir_scb_0_cobalt CHILD das_tide_ID000001
PARENT create_dir_scb_0_cobalt CHILD fcst_tide_ID000002
######################################################################
# End of DAG
######################################################################
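
Because the .dag file is plain text, the JOB, SCRIPT POST, RETRY and PARENT ... CHILD lines can be pulled out with straightforward string handling. The sketch below is illustrative only; it assumes the simple PARENT/CHILD form shown above and is not a Pegasus tool.

    # Sketch: extract jobs, retry counts and edges from a Condor DAGMan file.
    def parse_dag(path):
        jobs, retries, edges = {}, {}, []
        with open(path) as fh:
            for line in fh:
                tokens = line.split()
                if not tokens or tokens[0].startswith("#"):
                    continue
                if tokens[0] == "JOB":
                    jobs[tokens[1]] = tokens[2]          # job name -> submit file
                elif tokens[0] == "RETRY":
                    retries[tokens[1]] = int(tokens[2])  # job name -> retry count
                elif tokens[0] == "PARENT" and "CHILD" in tokens:
                    split = tokens.index("CHILD")
                    for parent in tokens[1:split]:
                        for child in tokens[split + 1:]:
                            edges.append((parent, child))
        return jobs, retries, edges

    jobs, retries, edges = parse_dag("scb-0.dag")
    print(len(jobs), "jobs,", len(edges), "edges")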

Kickstart XML Record

Kickstart is a lightweight C executable that is shipped with the Pegasus worker package. All jobs are launched via kickstart on the remote end, unless explicitly disabled at the time of running pegasus-plan.

Kickstart does not work with

  1. Condor Standard Universe Jobs
  2. MPI Jobs

Pegasus automatically disables kickstart for the above jobs.

Kickstart captures useful runtime provenance information about the job it launches on the remote node, and puts it in an XML record that it writes to its stdout. The stdout appears in the workflow submit directory as <job>.out.00n. Some of the useful information captured and logged by kickstart is as follows

  1. the exitcode with which the job it launched exited.
  2. the duration of the job
  3. the start time for the job
  4. the node on which the job ran
  5. the stdout/stderr of the job
  6. the arguments with which it launched the job
  7. the environment that was set for the job before it was launched.
  8. the machine information about the node that the job ran on

Of the above information, the dagman.out file only provides a coarser-grained estimate of the job duration and start time.

Reading a Kickstart Output File

The kickstart file below has the following fields highlighted

  1. the host on which the job executed and the ipaddress of that host
  2. the duration and start time of the job. The time here is in reference to the clock on the remote node where job executed.
  3. exitcode with which the job executed
  4. the arguments with which the job was launched.
  5. the directory in which the job executed on the remote site
  6. the stdout of the job
  7. the stderr of the job
  8. the environment of the job
Kickstart Output

<?xml version="1.0" encoding="ISO-8859-1"?>
<invocation xmlns="http://pegasus.isi.edu/schema/invocation" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/invocation http://pegasus.isi.edu/schema/iv-2.0.xsd" version="2.0" start="2009-01-30T19:17:41.157-06:00" duration="0.321" transformation="pegasus::dirmanager" derivation="pegasus::dirmanager:1.0" resource="cobalt" wf-label="scb" wf-stamp="2009-01-30T17:12:55-08:00" hostaddr="141.142.30.219" hostname="co-login.ncsa.uiuc.edu" pid="27714" uid="29548" user="vahi" gid="13872" group="bvr" umask="0022">
<mainjob start="2009-01-30T19:17:41.426-06:00" duration="0.052" pid="27783">
<usage utime="0.036" stime="0.004" minflt="739" majflt="0" nswap="0" nsignals="0" nvcsw="36" nivcsw="3"/>
<status raw="0"><regular exitcode="0"/></status>
<statcall error="0">
<!-- deferred flag: 0 -->
<file name="/u/ac/vahi/SOFTWARE/pegasus/default/bin/dirmanager">23212F7573722F62696E2F656E762070</file>
<statinfo mode="0100755" size="8202" inode="85904615883" nlink="1" blksize="16384" blocks="24" mtime="2008-09-22T18:52:37-05:00" atime="2009-01-30T14:54:18-06:00" ctime="2009-01-13T19:09:47-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<argument-vector>
<arg nr="1">--create</arg>
<arg nr="2">--dir</arg>
<arg nr="3">/u/ac/vahi/globus-test/EXEC/vahi/pegasus/scb/run0001</arg>
</argument-vector>

</mainjob>
<cwd>/u/ac/vahi/globus-test/EXEC</cwd>
<usage utime="0.012" stime="0.208" minflt="4232" majflt="0" nswap="0" nsignals="0" nvcsw="15" nivcsw="74"/>
<machine page-size="16384" provider="LINUX">
<stamp>2009-01-30T19:17:41.157-06:00</stamp>
<uname system="linux" nodename="co-login" release="2.6.16.54-0.2.5-default" machine="ia64">#1 SMP Mon Jan 21 13:29:51 UTC 2008</uname>
<ram total="148299268096" free="123371929600" shared="0" buffer="2801664"/>
<swap total="1179656486912" free="1179656486912"/>
<boot idle="1315786.920">2009-01-15T10:19:50.283-06:00</boot>
<cpu count="32" speed="1600" vendor=""></cpu>
<load min1="3.50" min5="3.50" min15="2.60"/>
<proc total="841" running="5" sleeping="828" stopped="5" vmsize="10025418752" rss="2524299264"/>
<task total="1125" running="6" sleeping="1114" stopped="5"/>
</machine>
<statcall error="0" id="stdin">
<!-- deferred flag: 0 -->
<file name="/dev/null"/>
<statinfo mode="020666" size="0" inode="68697" nlink="1" blksize="16384" blocks="0" mtime="2007-05-04T05:54:02-05:00" atime="2007-05-04T05:54:02-05:00" ctime="2009-01-15T10:21:54-06:00" uid="0" user="root" gid="0" group="root"/>
</statcall>
<statcall error="0" id="stdout">
<temporary name="/tmp/gs.out.s9rTJL" descriptor="3"/>
<statinfo mode="0100600" size="29" inode="203420686" nlink="1" blksize="16384" blocks="128" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
<data>mkdir finished successfully.
</data>
</statcall>
<statcall error="0" id="stderr">
<temporary name="/tmp/gs.err.kobn3S" descriptor="5"/>
<statinfo mode="0100600" size="0" inode="203420689" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>

<statcall error="0" id="gridstart">
<!-- deferred flag: 0 -->
<file name="/u/ac/vahi/SOFTWARE/pegasus/default/bin/kickstart">7F454C46020101000000000000000000</file>
<statinfo mode="0100755" size="255445" inode="85904615876" nlink="1" blksize="16384" blocks="504" mtime="2009-01-30T18:06:28-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T18:06:28-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="logfile">
<descriptor number="1"/>
<statinfo mode="0100600" size="0" inode="53040253" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:39-06:00" atime="2009-01-30T19:17:39-06:00" ctime="2009-01-30T19:17:39-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="channel">
<fifo name="/tmp/gs.app.Ien1m0" descriptor="7" count="0" rsize="0" wsize="0"/>
<statinfo mode="010640" size="0" inode="203420696" nlink="1" blksize="16384" blocks="0" mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<environment>
<env key="GLOBUS_GRAM_JOB_CONTACT">https://co-login.ncsa.uiuc.edu:50001/27456/1233364659/</env>
<env key="GLOBUS_GRAM_MYJOB_CONTACT">URLx-nexus://co-login.ncsa.uiuc.edu:50002/</env>
<env key="GLOBUS_LOCATION">/usr/local/prews-gram-4.0.7-r1/</env>
....
</environment>

<resource>
<soft id="RLIMIT_CPU">unlimited</soft>
<hard id="RLIMIT_CPU">unlimited</hard>
<soft id="RLIMIT_FSIZE">unlimited</soft>
....
</resource>
</invocation>
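
The invocation record is well-formed XML, so fields such as the host, start time, duration and exitcode can be extracted with any XML parser. The sketch below is illustrative only; it uses Python's ElementTree and only the element and attribute names visible in the record above, and the output file name in the example call is hypothetical.

    # Sketch: pull a few provenance fields out of a kickstart invocation record.
    import xml.etree.ElementTree as ET

    NS = "{http://pegasus.isi.edu/schema/invocation}"

    def read_invocation(path):
        root = ET.parse(path).getroot()              # the <invocation> element
        status = root.find(NS + "mainjob").find(NS + "status")
        regular = status.find(NS + "regular")
        return {
            "host": root.get("hostname"),
            "start": root.get("start"),
            "duration": float(root.get("duration")),
            "raw_status": int(status.get("raw")),
            "exitcode": int(regular.get("exitcode")) if regular is not None else None,
        }

    print(read_invocation("das_tide_ID000001.out.001"))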

Jobstate.log File

The jobstate.log file logs the various states that a job goes through during workflow execution. It is created by the tailstatd daemon that is launched  when a workflow is submitted to Condor DAGMan by pegasus-run. Tailstatd parses the dagman.out file and writes out the jobstate.log file, the format of which is more amenable to parsing.

Note: The jobstate.log file is not created if a user uses condor_submit_dag to submit a workflow to Condor DAGMan.

The jobstate.log file can be created after a workflow has finished executing by running tailstatd on the .dag file in the workflow submit directory.
Executing Tailstatd for cases where pegasus-run was not used to submit workflow

cd workflow-submit-directory
tailstatd -n  --nodatabase $dagman.outfile

Below is a snippet from the jobstate.log for a single job executed via Condor-G

1239666049 create_dir_blackdiamond_0_isi_viz SUBMIT 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz EXECUTE 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz GLOBUS_SUBMIT 3758.0 isi_viz -
1239666059 create_dir_blackdiamond_0_isi_viz GRID_SUBMIT 3758.0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz JOB_TERMINATED 3758.0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz JOB_SUCCESS 0 isi_viz -
1239666064 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_STARTED - isi_viz -
1239666069 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_TERMINATED 3758.0 isi_viz -
1239666069 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_SUCCESS - isi_viz -


Each entry in jobstate.log has the following

  • the timestamp (seconds since the Unix epoch) at which the particular event happened
  • the name of the job
  • the event recorded by DAGMan for the job
  • the condor id of the job in the queue on the submit node
  • the pegasus site to which the job is mapped
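
Since every jobstate.log line carries these fields in a fixed order, the file can be parsed with a simple split. A minimal illustrative sketch in Python (not a Pegasus tool):

    # Sketch: parse jobstate.log lines into (timestamp, job, event, condor id/exitcode, site).
    def read_jobstate(path="jobstate.log"):
        records = []
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) < 5:
                    continue
                records.append({
                    "time": int(parts[0]),   # seconds since the Unix epoch
                    "job": parts[1],
                    "event": parts[2],
                    "id": parts[3],          # condor id, exitcode or '-' depending on the event
                    "site": parts[4],
                })
        return records

    for rec in read_jobstate():
        print(rec["time"], rec["job"], rec["event"])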

The states/events that a job goes through during its lifecycle as part of the workflow are as follows:

  • SUBMIT - the job is submitted to the condor schedd for execution.
  • EXECUTE - the condor schedd detects that the job has started execution.
  • GLOBUS_SUBMIT - the job has been submitted to the remote resource. It is only written for GRAM jobs (i.e. gt2 and gt4).
  • GRID_SUBMIT - same as the GLOBUS_SUBMIT event. The ULOG_GRID_SUBMIT event is written for all grid universe jobs.
  • JOB_TERMINATED - the job terminated on the remote node.
  • JOB_SUCCESS - the job succeeded on the remote host; the condor id field will be zero (successful exit code).
  • JOB_FAILURE - the job failed on the remote host; the condor id field will be the job's exit code.
  • POST_SCRIPT_STARTED - the post script was started by DAGMan on the submit host, usually to parse the kickstart output.
  • POST_SCRIPT_TERMINATED - the post script finished on the submit node.
  • POST_SCRIPT_SUCCESS | POST_SCRIPT_FAILURE - the post script succeeded or failed.

Pegasus job.map file

Pegasus creates a workflow.job.map file that links jobs in the DAG with the jobs in the DAX. The contents of the file are in netlogger format. The purpose of this file is to be able to link an invocation record of a task to the corresponding job in the DAX.

Here workflow is replaced by the name of the workflow, i.e. the same prefix as the .dag file.

In the file there are two types of events.

pegasus.job
pegasus.job.map

pegasus.job - This event is for all the jobs in the DAG. The following information is associated with this event.

  • job.id - the id of the job in the DAG
  • job.class - an integer designating the type of the job
  • job.xform - the logical transformation to which the job refers
  • task.count - the number of tasks associated with the job. This is equal to the number of pegasus.job.map events created for that job.

pegasus.job.map - This event allows us to associate a job in the DAG with the jobs in the DAX. The following information is associated with this event.

  • task.id - the id of the corresponding job (task) in the DAX
  • task.class - an integer designating the type of the task
  • task.xform - the logical transformation to which the task refers

Some sample entries are as follows

ts=2009-04-21T23:09:03.091658Z event=pegasus.job job.id=analyze_ID000004 job.class="7" job.xform="vahi::analyze:1.0" task.count="1"
ts=2009-04-21T23:09:03.091772Z event=pegasus.job.map job.id=analyze_ID000004 task.id="ID000004" task.class="7" task.xform="vahi::analyze:1.0"
ts=2009-04-21T23:09:03.092063Z event=pegasus.job job.id=create_dir_blackdiamond_0_isi_viz job.class="6" job.xform="pegasus::dirmanager" task.count="0"
ts=2009-04-21T23:09:03.092165Z event=pegasus.job job.id=merge_vahi-findrange-1.0_PID2_ID1 job.class="1" job.xform="pegasus::seqexec" task.count="2"
ts=2009-04-21T23:09:03.093259Z event=pegasus.job.map job.id=merge_vahi-findrange-1.0_PID2_ID1 task.id="ID000002" task.class="7" task.xform="vahi::findrange:1.0"
ts=2009-04-21T23:09:03.093402Z event=pegasus.job.map job.id=merge_vahi-findrange-1.0_PID2_ID1 task.id="ID000003" task.class="7" task.xform="vahi::findrange:1.0"
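
These lines are space-separated key=value pairs, with the string values double-quoted, so the mapping from a DAG job to its DAX task ids can be recovered with very little code. An illustrative sketch in Python; the file name in the example call (following the workflow.job.map convention described above) is hypothetical.

    # Sketch: build a DAG-job -> DAX-task-id mapping from the .job.map file.
    import shlex
    from collections import defaultdict

    def netlogger_fields(line):
        # Each field is key=value; shlex.split honours the double-quoted values.
        return dict(field.split("=", 1) for field in shlex.split(line))

    def job_to_tasks(path):
        mapping = defaultdict(list)
        with open(path) as fh:
            for line in fh:
                fields = netlogger_fields(line)
                if fields.get("event") == "pegasus.job.map":
                    mapping[fields["job.id"]].append(fields["task.id"])
        return mapping

    print(job_to_tasks("scb-0.job.map"))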

Pegasus Workflow Job States and Delays

The various states that a job goes through (as captured in the dagman.out and jobstate.log files) during its lifecycle are illustrated below. The figure highlights the various local and remote delays during the job lifecycle.

In some cases the Grid Submit and Condor Execute events may be interchanged, depending on whether the Condor Grid Monitor is enabled.
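
The local and remote delays can be estimated directly from the jobstate.log timestamps: SUBMIT to GRID_SUBMIT/GLOBUS_SUBMIT approximates the time spent in the local condor queue, and GRID_SUBMIT to EXECUTE approximates the time spent in the remote queue (roughly the quantities genstats later reports as Condor and Resource). A self-contained illustrative sketch in Python:

    # Sketch: estimate per-job queueing delays (in seconds) from jobstate.log timestamps.
    def job_delays(path="jobstate.log"):
        events = {}                                   # job -> {event name: timestamp}
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 5:
                    events.setdefault(parts[1], {})[parts[2]] = int(parts[0])
        delays = {}
        for job, ev in events.items():
            grid = ev.get("GRID_SUBMIT", ev.get("GLOBUS_SUBMIT"))
            if "SUBMIT" in ev and "EXECUTE" in ev and grid is not None:
                delays[job] = {
                    "condor": grid - ev["SUBMIT"],      # local queue estimate
                    "resource": ev["EXECUTE"] - grid,   # remote queue estimate; may be negative
                                                        # when Grid Monitor reorders the events
                    "runtime": ev.get("JOB_TERMINATED", ev["EXECUTE"]) - ev["EXECUTE"],
                }
        return delays

    print(job_delays())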

The information in the kickstart output files and the condor dagman logs can be mined to retrieve useful statistics about how the workflow ran. The data retrieved can then be used to generate useful graphs. This section lists the various ways to visualize and mine statistics using helper scripts in the Pegasus distribution and related tools.

Visualizing Graph Structure of A Workflow

Visualizing the structure of a workflow is a two step process. The first step is to convert the workflow description (DAX or DAG) into a DOT file. DOT is a special file format used for the visual display of graphs. The next step is to view the DOT file in a viewer, or use it to generate an image. These steps are described in more detail below.

Generating DOT from DAX

You can generate a DOT file from a DAX file using the pegasus-graphviz tool provided with Pegasus in the $PEGASUS_HOME/bin directory.

Usage

Usage: pegasus-graphviz [options] FILE

Parses FILE and generates a DOT-formatted graphical representation of the DAG.
FILE can be a Condor DAGMan file, or a Pegasus DAX file.

Options:

  -h, --help            show this help message and exit
  -s, --nosimplify      Do not simplify the graph by removing redundant edges.
                        [default: False]
  -l LABEL, --label=LABEL
                        What attribute to use for labels. One of 'label',
                        'xform', 'id', 'xform-id', 'label-xform', 'label-id'.
                        For 'label', the transformation is used for jobs that
                        have no node-label. [default: label]
  -o FILE, --output=FILE
                        Write output to FILE [default: stdout]
  -r XFORM, --remove=XFORM
                        Remove jobs from the workflow by transformation name
  -W WIDTH, --width=WIDTH
                        Width of the digraph
  -H HEIGHT, --height=HEIGHT
                        Height of the digraph

  -f, --files           Include files. This option is only valid for DAX
                        files. [default: false]

Example

$PEGASUS_HOME/bin/pegasus-graphviz --output scb.dot scb_dax.xml

Generating DOT from DAG

Pegasus automatically generates a DOT file (<daxlabel-daxindex>.dot) for each executable workflow and saves it in the submit directory. You can use this file, or you can generate a different one using the pegasus-graphviz tool provided with Pegasus ($PEGASUS_HOME/bin/pegasus-graphviz). The difference is that pegasus-graphviz gives you some additional options that aren't available if you use the automatically-generated DOT file.

Usage

Usage: pegasus-graphviz [options] FILE

Parses FILE and generates a DOT-formatted graphical representation of the DAG.
FILE can be a Condor DAGMan file, or a Pegasus DAX file.

Options:

  -h, --help            show this help message and exit
  -s, --nosimplify      Do not simplify the graph by removing redundant edges.
                        [default: False]
  -l LABEL, --label=LABEL
                        What attribute to use for labels. One of 'label',
                        'xform', 'id', 'xform-id', 'label-xform', 'label-id'.
                        For 'label', the transformation is used for jobs that
                        have no node-label. [default: label]
  -o FILE, --output=FILE
                        Write output to FILE [default: stdout]
  -r XFORM, --remove=XFORM
                        Remove jobs from the workflow by transformation name
  -W WIDTH, --width=WIDTH
                        Width of the digraph
  -H HEIGHT, --height=HEIGHT
                        Height of the digraph

  -f, --files           Include files. This option is only valid for DAX
                        files. [default: false]

Example

$PEGASUS_HOME/bin/pegasus-graphviz --output scb.dot scb-0.dag

Viewing DOT files

DOT files can be used to generate images or displayed using a viewer. You can find DOT file viewers at http://www.graphviz.org. In addition, on Mac OS X the OmniGraffle program can read and display DOT files. The advantage of OmniGraffle is that you can edit the DOT file visually and export it in a number of formats.

To generate a jpeg file using the "dot" program distributed with GraphViz run:

dot -Tjpeg -o SCB_DAX.jpg  scb.dot

Here is an example image generated from a DAX.

Here is an example image generated from a DAG.

Visualizing a Single Workflow Run

show-job

show-job is a perl script that can be used to generate a Gantt chart of a workflow run.
It writes the chart in ploticus input format and then generates an eps file and a png file using the ploticus program. The ploticus executable should be in your path.

Usage

$PEGASUS_HOME/contrib/showlog/show-job --color-file <the file mapping job transformation names to color> <path to the dag file>

Sample Usage

sukhna 59% $PEGASUS_HOME/contrib/showlog/show-job --color-file color.in dags/vahi/pegasus/scb/run0001/scb-0.dag
# min=1233364634 2009-01-30T17:17:14-08:00
# max=1233376529 2009-01-30T20:35:29-08:00
# diff=11895
# running y=1...
# xstubs=1800, xticks=600, width=13.2166666666667, height=5
job scb::das_tide has color green
job pegasus::transfer has color magenta
job pegasus::dirmanager has color lavender
job scb::fcst_tide has color orange
job pegasus::rc-client has color powderblue2
job unknown has color gray(0.75)
job scb::interpolate has color blue
# /old-usr/sukhna/install/ploticus/pl232src/bin/ploticus /tmp/sj-UDs7SZ-1.pls -eps -o /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-1.eps
# /usr/ucb/convert -density 96x96 /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-1.eps /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-1.png
# running y=2...
# xstubs=1800, xticks=600, width=13.2166666666667, height=5
job scb::das_tide has color green
job pegasus::transfer has color magenta
job pegasus::dirmanager has color lavender
job scb::fcst_tide has color orange
job pegasus::rc-client has color powderblue2
job unknown has color gray(0.75)
job scb::interpolate has color blue
# /old-usr/sukhna/install/ploticus/pl232src/bin/ploticus /tmp/sj-UDs7SZ-2.pls -eps -o /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-2.eps
# /usr/ucb/convert -density 96x96 /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-2.eps /lfs1/work/jpl/dags/vahi/pegasus/scb/run0001/scb-2.png

number of jobs: 8
number of script failures: 0
sequential duration of jobs: 9090 s
total workflow duration: 11895 s (speed-up 0.8)

Sample color file
The color file can be used to provide different colors for the different transformations in the DAX.

scb::das_tide           green
scb::fcst_tide          orange
scb::interpolate        blue

Here is a sample gantt chart for a workflow execution run


Visualizing a Workflow of Workflows Runs using Netlogger

If a user executes a workflow that in turn contains other sub workflows, it is possible to visualize them with a little help from the netlogger folks.

SCEC runs a workflow of workflows, where each outer level workflow has about 80 sub workflows. The workflow logs need to be populated into the netlogger database, and then a Gantt chart can be plotted using the R statistical package.

Here is a sample cumulative time Gantt chart


Netlogger Database Structure

Tables

  1. event
  2. ident
  3. attr

Entries from job map

  Event              Ident    Attr
  pegasus.job        job      task.count
                              job.xform
                              job.class

  Event              Ident    Attr
  pegasus.job.map    job      task.class
                     task     task.xform

Entries from condor dag

  Event              Ident          Attr
  condor.dag.edge    comp.parent
                     comp.child

Entries from jobstate log

  Event                                    Ident     Attr
  pegasus.jobstate.submit                  site      status
  pegasus.jobstate.execute                 condor    dur
  pegasus.jobstate.job_terminated          comp
  pegasus.jobstate.postscript
  pegasus.jobstate.image_size
  pegasus.jobstate.job_evicted
  pegasus.jobstate.job_disconnected
  pegasus.jobstate.job_reconnect_failed
  pegasus.jobstate.shadow_exception

Entries from Invocation Records (.out files)

  Event                            Ident       Attr
  pegasus.invocation               workflow    status
                                   comp        nsignals
                                               transformation
                                               host
                                               user
                                               duration
                                               arguments

  Event                            Ident       Attr
  pegasus.invocation.stat_error    comp        status
                                               group
                                               file
                                               user

Workflow Statistics

genstats

genstats is a perl script distributed with Pegasus that generates a table listing statistics for each job in the executable workflow (Condor DAG).

Usage

genstats --dag <dagfilename> --output <the output results directory> --jobstate-log <path to the jobstate.log file>

Sample Usage
genstats --dag scb-0.dag --output /lfs1/work/jpl/scb_results/run0001 --jobstate-log jobstate.log

genstats generates the following information for each job in the jobs file in the output results directory

  1. Job - the name of the job
  2. Site - the site where the job ran
  3. Kickstart - the actual duration of the job in seconds on the remote compute node
  4. Post - the postscript time as reported by DAGMan
  5. Condor - the time between submission by DAGMan and the remote Grid submission. It is an estimate of the time the job spent in the condor queue on the submit node
  6. Resource - the time between the remote Grid submission and the start of remote execution. It is an estimate of the time the job spent in the remote queue
  7. Runtime - the time spent on the resource as seen by Condor DAGMan. It is always >= the Kickstart time
  8. CondorQLen - the number of outstanding jobs in the queue when this job was released.

Here is a sample jobs file created by genstats


Job Site Kickstart Post DAGMan Condor Resource Runtime CondorQLen Seqexec Seqexec-Delay
create_dir_scb_0_cobalt cobalt 0.00 5.00 13.00 15.00 0.00 15.00 1 - -
das_tide_ID000001 cobalt 0.00 5.00 5.00 15.00 3906.00 3855.00 1 - -
fcst_tide_ID000002 cobalt 0.00 5.00 5.00 15.00 90.00 465.00 1 - -
interpolate_ID000003 cobalt 0.00 5.00 5.00 15.00 155.00 160.00 1 - -
stage_in_das_tide_ID000001_0 cobalt 0.00 5.00 5.00 20.00 5.00 2946.00 1 - -
stage_in_fcst_tide_ID000002_0 cobalt 0.00 5.00 5.00 20.00 5.00 1805.00 2 - -
stage_in_interpolate_ID000003_0 cobalt 0.00 5.00 5.00 15.00 0.00 435.00 3 - -
stage_out_interpolate_ID000003_0 cobalt 0.00 5.00 5.00 15.00 0.00 135.00 1 - -
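
Because the jobs file is whitespace-delimited with a header row (as in the sample above), it is easy to post-process. Below is a small illustrative sketch in Python, not part of Pegasus, that totals a few of the columns to show where the workflow spent its time; the path in the example call is hypothetical.

    # Sketch: total selected columns of a genstats jobs file ('-' entries are skipped).
    def column_totals(path, columns=("Kickstart", "Condor", "Resource", "Runtime")):
        with open(path) as fh:
            header = fh.readline().split()
            index = {name: header.index(name) for name in columns}
            totals = {name: 0.0 for name in columns}
            for line in fh:
                fields = line.split()
                if not fields:
                    continue
                for name, idx in index.items():
                    if fields[idx] != "-":
                        totals[name] += float(fields[idx])
        return totals

    print(column_totals("run0001/jobs"))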

genstats-breakdown

genstats-breakdown is a perl script distributed with Pegasus that generates a table listing statistics for each type of logical transformation in the executable workflow (Condor DAG). For example, this tool will generate statistics grouped by the transfer transformation, which encompasses the stage-in, stage-out, inter-site and symlinking transfer jobs.

Usage
$PEGASUS_HOME/bin/genstats-breakdown --output=<output file> -x <the workflow submit directory>

Users can pass multiple workflow submit directories using the -x option. In that case, the statistics are written for each of the submit directories individually, and also across all the directories.

Sample Usage

genstats-breakdown --output breakdown.txt -x dags/vahi/pegasus/scb/run000*

Here is a sample breakdown.txt file created

dags/vahi/pegasus/scb/run0001


Transformation Count Mean Variance
pegasus::transfer 4 1200.65 1660108.49
scb::das_tide 1 3806.65 0.00
pegasus::dirmanager 1 0.32 0.00
scb::fcst_tide 1 346.39 0.00
scb::interpolate 1 134.49 0.00

dags/vahi/pegasus/scb/run0002


Transformation Count Mean Variance
pegasus::transfer 4 1191.27 1580276.06
scb::das_tide 1 3811.54 0.00
pegasus::dirmanager 1 0.34 0.00
scb::fcst_tide 1 344.90 0.00
scb::interpolate 1 128.56 0.00

dags/vahi/pegasus/scb/run0003


Transformation Count Mean Variance
pegasus::transfer 4 1203.00 1635850.78
scb::das_tide 1 3794.60 0.00
pegasus::dirmanager 1 0.32 0.00
scb::fcst_tide 1 492.81 0.00
scb::interpolate 1 108.58 0.00

dags/vahi/pegasus/scb/run0004


Transformation Count Mean Variance
pegasus::transfer 4 1168.31 1521384.54
scb::das_tide 1 3861.94 0.00
pegasus::dirmanager 1 0.29 0.00
scb::fcst_tide 1 348.76 0.00
scb::interpolate 1 139.54 0.00

All


Transformation Count Mean Variance
pegasus::transfer 16 1190.81 1279724.52
scb::das_tide 4 3818.68 882.31
pegasus::dirmanager 4 0.32 0.00
scb::fcst_tide 4 383.22 5341.00
scb::interpolate 4 127.79 184.18
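
The per-transformation breakdown above is essentially the count, mean and variance of the job durations grouped by transformation name. A minimal illustration of that aggregation in Python, assuming the (transformation, duration) pairs have already been extracted from the kickstart records; population variance is used here purely for illustration.

    # Sketch: group durations by transformation and report count, mean and variance.
    from collections import defaultdict
    from statistics import mean, pvariance

    def breakdown(samples):
        # samples: iterable of (transformation, duration in seconds)
        groups = defaultdict(list)
        for xform, duration in samples:
            groups[xform].append(duration)
        for xform, durations in sorted(groups.items()):
            variance = pvariance(durations) if len(durations) > 1 else 0.0
            print(f"{xform} {len(durations)} {mean(durations):.2f} {variance:.2f}")

    breakdown([("pegasus::dirmanager", 0.32), ("scb::das_tide", 3806.65),
               ("scb::fcst_tide", 346.39), ("scb::interpolate", 134.49)])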

Populating and Mining Netlogger Database

For large workflows, users can load the workflow logs into a netlogger database.

Populating a Netlogger Database

Details about installing netlogger database and loading data into it can be found at

http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page

In general netlogger requires the following components

  • mysql | sqlite backend to populate to
  • python 2.5
  • python bindings for mysql | sqlite

Mining a netlogger database

Once data has been loaded into a netlogger database, a variety of queries can be issued to the db.

The queries can help users answer the following questions

  1. how many jobs ran on a given day
  2. what was the cumulative runtime of these jobs
  3. how many jobs ran on given hosts
  4. how many jobs of a given type ran on a given day
  5. how many jobs failed
  6. how many jobs succeeded

Complex Queries

Users can issue complex queries to the DB on the basis of the DAX label in the original DAX.
In the case of a workflow of workflows, where each of the DAXes has a similar DAX label, users can generate statistics either for an individual sub workflow or for all the workflows together.

The queries below are for all the workflows together, organized by workflow id.

Queries Per Workflow Where Workflow ID Is a DAX Label

  1. Total number of jobs
    select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id
    where  attr.name = 'status' and ident.name='workflow' and ident.value
    LIKE 'CyberShake_WNGC%';
    
  2. Total number of succeeded jobs
    select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id
    where  attr.name = 'status' and attr.value = '0' and
    ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%';
    
  3. Breakdown of jobs
    select attr.value, count(attr.e_id) from attr 
    join ident on attr.e_id = ident.e_id
    where  ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%'  and
           attr.name='type' group by attr.value;
    
  4. Total Runtime of jobs
    select sum(attr.value) from attr join ident on attr.e_id=ident.e_id
    where attr.name='duration' and ident.name='workflow' and ident.value
    LIKE 'CyberShake_WNGC%';
    

Queries Per Workflow Per Job Type

  1. Runtime Breakdown by job type per workflow
     select TRANSFORMATION, count(TRANSFORMATION) as number
     ,round(sum(attr.value),2) as sum_seconds,
     round(sum(attr.value)/(3600),2) as sum_hours, round(avg(attr.value),2)
     as avg_seconds from attr join (select attr.e_id as event_id,
     attr.value as TRANSFORMATION from  attr join ident on
     attr.e_id=ident.e_id  where attr.name='type' and
     ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident
     on attr.e_id=event_id WHERE attr.name='duration' group by
     TRANSFORMATION;
    
  2. Number of failures by job type per workflow
    select TRANSFORMATION, count(TRANSFORMATION) as failures from attr
    join (select attr.e_id as event_id, attr.value as TRANSFORMATION from
    attr join ident on attr.e_id=ident.e_id  where attr.name='type' and
    ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident on
    attr.e_id=event_id WHERE attr.name = 'status' and attr.value != '0'
    group by TRANSFORMATION;
    

Queries Per Unit Time Per Workflow

  1. Jobs Per Day Per Workflow
    select count(id) as 'count', day(from_unixtime(time)) as day from event 
      join attr on attr.e_id = event.id 
      join ident on attr.e_id=ident.e_id 
      where event.name = 'pegasus.invocation' and attr.name = 'host' and 
            ident.name='workflow' and 
            ident.value LIKE 'CyberShake_CCP%' 
      group by day;
    
  2. Jobs Per Day Per Hour Per Workflow
    SELECT day(from_unixtime(time)) as day, hour(from_unixtime(time)) as hour, count(event.id) as 'count'  FROM event
               JOIN attr on attr.e_id = event.id 
               JOIN ident on attr.e_id=ident.e_id 
    WHERE event.name = 'pegasus.invocation' and attr.name = 'host' and ident.name='workflow' and ident.value LIKE 'CyberShake_CCP%' 
    GROUP BY  day, hour ORDER BY day, hour;
    
  3. Jobs Per Host Per Hour Per Workflow
    SELECT  attr.value  as host, day(from_unixtime(time)) as 'day', hour(from_unixtime(time)) as 'hour', count(event.id) as 'count'  from  event 
               JOIN attr on attr.e_id = event.id
               JOIN ident on attr.e_id=ident.e_id
     WHERE event.name = 'pegasus.invocation' and attr.name = 'host' and ident.name='workflow' and ident.value LIKE 'CyberShake_USC%'
     group by host, day,hour  ORDER BY day, hour;
    

Full details are available at http://www.cedps.net/index.php/Pegasus_Sample_Queries
