Workflow Statistics

genstats

genstats is a Perl script distributed with Pegasus that generates a table listing statistics for each job in the executable workflow (Condor DAG).

Usage

genstats --dag <dagfilename> --output <the output results directory> --jobstate-log <path to the jobstate.log file>

Sample Usage
genstats --dag scb-0.dag --output /lfs1/work/jpl/scb_results/run0001 --jobstate-log jobstate.log

genstats generates the following information for each job, written to a jobs file in the output results directory:

  1. Job - the name of the job
  2. Site - the site where the job ran
  3. Kickstart - the actual duration of the job in seconds on the remote compute node
  4. Post - the postscript time as reported by DAGMan
  5. Condor - the time between submission by DAGMan and the remote Grid submission. It is an estimate of the time the job spent in the Condor queue on the submit node
  6. Resource - the time between the remote Grid submission and the start of remote execution. It is an estimate of the time the job spent in the remote queue
  7. Runtime - the time spent on the resource as seen by Condor DAGMan. It is always >= Kickstart
  8. CondorQLen - the number of outstanding jobs in the queue when this job was released.

Here is a sample jobs file created by genstats. In addition to the fields above, the output contains DAGMan, Seqexec, and Seqexec-Delay columns; the Seqexec columns are empty ('-') for the jobs in this run:


Job                               Site    Kickstart  Post  DAGMan  Condor  Resource  Runtime  CondorQLen  Seqexec  Seqexec-Delay
create_dir_scb_0_cobalt           cobalt  0.00       5.00  13.00   15.00   0.00      15.00    1           -        -
das_tide_ID000001                 cobalt  0.00       5.00  5.00    15.00   3906.00   3855.00  1           -        -
fcst_tide_ID000002                cobalt  0.00       5.00  5.00    15.00   90.00     465.00   1           -        -
interpolate_ID000003              cobalt  0.00       5.00  5.00    15.00   155.00    160.00   1           -        -
stage_in_das_tide_ID000001_0      cobalt  0.00       5.00  5.00    20.00   5.00      2946.00  1           -        -
stage_in_fcst_tide_ID000002_0     cobalt  0.00       5.00  5.00    20.00   5.00      1805.00  2           -        -
stage_in_interpolate_ID000003_0   cobalt  0.00       5.00  5.00    15.00   0.00      435.00   3           -        -
stage_out_interpolate_ID000003_0  cobalt  0.00       5.00  5.00    15.00   0.00      135.00   1           -        -

genstats-breakdown

genstats-breakdown is a Perl script distributed with Pegasus that generates a table listing statistics for each type of logical transformation in the executable workflow (Condor DAG). For example, this tool will generate statistics grouped by the transfer transformation, which encompasses the stage-in, stage-out, inter-site, and symlinking transfer jobs.

Usage
$PEGASUS_HOME/bin/genstats-breakdown --output=<output file> -x <the workflow submit directory>

Users can pass multiple workflow submit directories using the -x option. In that case, statistics are written for each submit directory individually, and also aggregated across all the directories.

Sample Usage

genstats-breakdown --output breakdown.txt -x dags/vahi/pegasus/scb/run000*

Here is a sample breakdown.txt file created by genstats-breakdown:

dags/vahi/pegasus/scb/run0001


Transformation       Count  Mean     Variance
pegasus::transfer    4      1200.65  1660108.49
scb::das_tide        1      3806.65  0.00
pegasus::dirmanager  1      0.32     0.00
scb::fcst_tide       1      346.39   0.00
scb::interpolate     1      134.49   0.00

dags/vahi/pegasus/scb/run0002


Transformation       Count  Mean     Variance
pegasus::transfer    4      1191.27  1580276.06
scb::das_tide        1      3811.54  0.00
pegasus::dirmanager  1      0.34     0.00
scb::fcst_tide       1      344.90   0.00
scb::interpolate     1      128.56   0.00

dags/vahi/pegasus/scb/run0003


Transformation       Count  Mean     Variance
pegasus::transfer    4      1203.00  1635850.78
scb::das_tide        1      3794.60  0.00
pegasus::dirmanager  1      0.32     0.00
scb::fcst_tide       1      492.81   0.00
scb::interpolate     1      108.58   0.00

dags/vahi/pegasus/scb/run0004


Transformation       Count  Mean     Variance
pegasus::transfer    4      1168.31  1521384.54
scb::das_tide        1      3861.94  0.00
pegasus::dirmanager  1      0.29     0.00
scb::fcst_tide       1      348.76   0.00
scb::interpolate     1      139.54   0.00

All


Transformation       Count  Mean     Variance
pegasus::transfer    16     1190.81  1279724.52
scb::das_tide        4      3818.68  882.31
pegasus::dirmanager  4      0.32     0.00
scb::fcst_tide       4      383.22   5341.00
scb::interpolate     4      127.79   184.18

Populating and Mining a NetLogger Database

For large workflows, users can load the workflow logs into a NetLogger database.

Populating a NetLogger Database

Details about installing a NetLogger database and loading data into it can be found at

http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page

In general, NetLogger requires the following components:

  • a MySQL or SQLite backend to populate
  • Python 2.5
  • Python bindings for MySQL or SQLite

Mining a NetLogger Database

Once data has been loaded into a NetLogger database, a variety of queries can be issued against it.

These queries can help users answer questions such as the following (a sample query for the failure count appears after this list):

  1. how many jobs ran on a given day
  2. what was the cumulative runtime of these jobs
  3. how many jobs ran on given hosts
  4. how many jobs of a given type ran on a given day
  5. how many jobs failed
  6. how many jobs succeeded
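
For example, question 5 can be answered by counting jobs with a non-zero exit status. The following is a minimal sketch, assuming the same attr/ident schema and the CyberShake_WNGC DAX label used in the queries below:

    select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id
    where attr.name = 'status' and attr.value != '0' and
          ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%';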

Complex Queries

Users can issue complex queries to the database on the basis of the DAX label in the original DAX.
In the case of a workflow of workflows, where each DAX has a similar DAX label, users can generate statistics either for an individual sub-workflow or for all the workflows together.

The queries below are for all the workflows together, organized by workflow ID.
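
To restrict a query to an individual sub-workflow instead, replace the LIKE pattern with an exact match on that sub-workflow's DAX label. A minimal sketch, using a hypothetical label 'CyberShake_WNGC':

    select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id
    where attr.name = 'status' and ident.name='workflow' and
          ident.value = 'CyberShake_WNGC';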

Queries Per Workflow Where Workflow ID Is a DAX Label

  1. Total number of jobs
    select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id
    where  attr.name = 'status' and ident.name='workflow' and ident.value
    LIKE 'CyberShake_WNGC%';
    
  2. Total number of succeeded jobs
    select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id
    where  attr.name = 'status' and attr.value = '0' and
    ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%';
    
  3. Breakdown of jobs by type
    select attr.value, count(attr.e_id) from attr 
    join ident on attr.e_id = ident.e_id
    where  ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%'  and
           attr.name='type' group by attr.value;
    
  4. Total Runtime of jobs
    select sum(attr.value) from attr join ident on attr.e_id=ident.e_id
    where attr.name='duration' and ident.name='workflow' and ident.value
    LIKE 'CyberShake_WNGC%';
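
The first two queries can also be combined into a single pass over the attr table. A sketch, assuming the same schema and a backend that supports CASE expressions (MySQL does):

    select count(attr.e_id) as total,
           sum(case when attr.value = '0' then 1 else 0 end) as succeeded,
           sum(case when attr.value != '0' then 1 else 0 end) as failed
    from attr join ident on attr.e_id = ident.e_id
    where attr.name = 'status' and ident.name='workflow' and
          ident.value LIKE 'CyberShake_WNGC%';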
    

Queries Per Workflow Per Job Type

  1. Runtime Breakdown by job type per workflow
     select TRANSFORMATION, count(TRANSFORMATION) as number
     ,round(sum(attr.value),2) as sum_seconds,
     round(sum(attr.value)/(3600),2) as sum_hours, round(avg(attr.value),2)
     as avg_seconds from attr join (select attr.e_id as event_id,
     attr.value as TRANSFORMATION from  attr join ident on
     attr.e_id=ident.e_id  where attr.name='type' and
     ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident
     on attr.e_id=event_id WHERE attr.name='duration' group by
     TRANSFORMATION;
    
  2. Number of failures by job type per workflow
    select TRANSFORMATION, count(TRANSFORMATION) as failures from attr
    join (select attr.e_id as event_id, attr.value as TRANSFORMATION from
    attr join ident on attr.e_id=ident.e_id  where attr.name='type' and
    ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident on
    attr.e_id=event_id WHERE attr.name = 'status' and attr.value != '0'
    group by TRANSFORMATION;
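
Inverting the status test in the failure query gives the number of successes by job type per workflow. A sketch under the same schema assumptions:

    select TRANSFORMATION, count(TRANSFORMATION) as successes from attr
    join (select attr.e_id as event_id, attr.value as TRANSFORMATION from
    attr join ident on attr.e_id=ident.e_id where attr.name='type' and
    ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident on
    attr.e_id=event_id WHERE attr.name = 'status' and attr.value = '0'
    group by TRANSFORMATION;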
    

Queries Per Unit Time Per Workflow

  1. Jobs Per Day Per Workflow
    select count(id) as 'count', day(from_unixtime(time)) as day from event 
      join attr on attr.e_id = event.id 
      join ident on attr.e_id=ident.e_id 
      where event.name = 'pegasus.invocation' and attr.name = 'host' and 
            ident.name='workflow' and 
            ident.value LIKE 'CyberShake_CCP%' 
      group by day;
    
  2. Jobs Per Day Per Hour Per Workflow
    SELECT day(from_unixtime(time)) as day, hour(from_unixtime(time)) as hour, count(event.id) as 'count'  FROM event
               JOIN attr on attr.e_id = event.id 
               JOIN ident on attr.e_id=ident.e_id 
    WHERE event.name = 'pegasus.invocation' and attr.name = 'host' and ident.name='workflow' and ident.value LIKE 'CyberShake_CCP%' 
    GROUP BY  day, hour ORDER BY day, hour;
    
  3. Jobs Per Host Per Hour Per Workflow
    SELECT  attr.value  as host, day(from_unixtime(time)) as 'day', hour(from_unixtime(time)) as 'hour', count(event.id) as 'count'  from  event 
               JOIN attr on attr.e_id = event.id
               JOIN ident on attr.e_id=ident.e_id
     WHERE event.name = 'pegasus.invocation' and attr.name = 'host' and ident.name='workflow' and ident.value LIKE 'CyberShake_USC%'
     group by host, day,hour  ORDER BY day, hour;
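
Cumulative runtime per day (question 2 earlier) can be obtained by summing the duration attribute with the same date grouping. A sketch, assuming the schema used in the queries above:

    select day(from_unixtime(time)) as day,
           round(sum(attr.value),2) as total_seconds
    from event
         join attr on attr.e_id = event.id
         join ident on attr.e_id = ident.e_id
    where event.name = 'pegasus.invocation' and attr.name = 'duration' and
          ident.name='workflow' and ident.value LIKE 'CyberShake_CCP%'
    group by day order by day;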
    

Full details are available at http://www.cedps.net/index.php/Pegasus_Sample_Queries
