Introduction

pegasus-statistics is a command line tool for generating workflow execution statistics.

Pegasus Statistics Output

pegasus-statistics generates the following statistics.

Workflow summary :- A summary of the workflow run, shown on the command line console. If the given workflow has sub workflows, they are recursively parsed to generate the summary statistics.

Workflow statistics file :- A file containing statistics for each individual workflow, separated by workflow UUID. If a given workflow has sub workflows, each is counted as a single job; sub workflows are not recursively parsed. The file is named 'workflow.txt'.

Job statistics file :- A file containing job statistics for each individual workflow, separated by workflow UUID. The file is named 'jobs.txt'.

Transformation statistics file :- A file containing transformation statistics for each individual workflow, separated by workflow UUID. The file is named 'breakdown.txt'.

This document uses the examples described below to explain the statistics information.

Note: The example in Figure 1 is a diamond workflow with 4 tasks in the DAX. pegasus-plan creates three jobs in the DAG, with B2 and B3 clustered. During the execution of the workflow the clustered job fails after 3 retries: within the clustered job B2 runs, but B3 fails in all retries.

The example in Figure 2 is a hierarchical workflow with 4 tasks in DAX A and 4 tasks in DAX B. A3 is the sub workflow task.

The example in Figure 3 is a hierarchical workflow with 4 tasks in DAX A and 4 tasks in DAX B. However, the A3 sub workflow task fails at the prescript, so the DAX B workflow never gets planned and the database is not populated with DAX B workflow details.

Figure 1 :- Diamond workflow [Failed Run]

 

Figure 2 :- Hierarchical Workflow [Successful Run]

Figure 3 :- Hierarchical Workflow [Failed Run]

Workflow Summary

The workflow summary is a summary of the statistics information of the workflow, shown on the command line output. Sub workflows are recursively parsed to generate the statistics information.

Workflow status (shows the last retry details)

The workflow status table contains information about the planned jobs and tasks.

The job information is obtained from the job table. Information about the job status (failed, succeeded, etc.) is obtained from the jobstate table by looking at the state of the last retry.

The task information is obtained from the task table. Information about the task status is obtained from the invocation table. The query combines the task, job, job_instance and invocation tables using task_id and job_id.
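A minimal sketch of that lookup is given below. It follows the join keys named above (task_id, job_id); the job_instance_id join key and the exitcode column are assumptions for illustration and may be named differently in the actual Stampede schema.

-- Sketch only: per-task status for one workflow, joining
-- task -> job -> job_instance -> invocation as described above.
select t.task_id, i.exitcode
from task as t
join job as j on j.job_id = t.job_id
join job_instance as ji on ji.job_id = j.job_id
join invocation as i on i.job_instance_id = ji.job_instance_id
                    and i.task_id = t.task_id
where t.wf_id = 1   -- wf_id from the lookup query in the Pegasus Statistics Queries section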

Note: For a workflow of workflows, the original job count includes the jobs of a sub workflow only if the sub workflow was actually invoked. Otherwise the sub DAG or sub DAX job is counted as a single job with status 'Failed'. In other words, only when entries corresponding to a sub workflow are present is the count of that workflow's jobs added to the total original count. The tables below show the workflow status for the examples described above.

 

        | Original | Succeeded | Failed | Unsubmitted | Unknown
Jobs    | 3        | 1         | 1      | 1           | 0
Tasks   | 4        | 2         | 1      | NA          | NA

Case 1: Refer Figure 1 [Diamond Failed Run]

 

        | Original | Succeeded | Failed | Unsubmitted | Unknown
Jobs    | 7        | 7         | 0      | 0           | 0
Tasks   | 8        | 8         | 0      | NA          | NA

Case 2: Refer Figure 2 [Hierarchical Successful Run]

 

        | Original | Succeeded | Failed | Unsubmitted | Unknown
Jobs    | 4        | 2         | 1      | 1           | 0
Tasks   | 4        | 2         | 1      | NA          | NA

Case 3: Refer Figure 3 [Hierarchical Failed Run]

Workflow statistics (shows the cumulative of all retries)

The workflow statistics table contains information about the jobs and tasks actually executed during the workflow run.

This information is obtained from the job_instance and invocation tables respectively. The tables below show the workflow statistics for the examples described above.
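As a rough illustration, the jobs row of these tables could be gathered along the following lines. This is only a sketch: the exitcode column and the success test (exitcode = 0) are assumptions, not taken from the text.

-- Sketch only: jobs actually run, split into succeeded and failed.
select count(*) as actually_run,
       sum(case when ji.exitcode = 0 then 1 else 0 end) as succeeded,
       sum(case when ji.exitcode <> 0 then 1 else 0 end) as failed
from job_instance as ji
join job as j on j.job_id = ji.job_id
where j.wf_id in (1, 2)  -- wf_ids of the workflow and its sub workflows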

 

        | Actually Run | Succeeded | Failed
Jobs    | 4            | 1         | 3
Tasks   | 7            | 4         | 3

Case 1: Refer Figure 1 [Diamond Failed Run]

 

        | Actually Run | Succeeded | Failed
Jobs    | 7            | 7         | 0
Tasks   | 8            | 8         | 0

Case 2: Refer Figure 2 [Hierarchical Successful Run]

 

        | Actually Run | Succeeded | Failed
Jobs    | 5            | 2         | 3
Tasks   | 5            | 2         | 3

Case 3: Refer Figure 3 (B3 fails after 3 retries) [Hierarchical Failed Run]

Workflow wall time :

The wall time from the start of the workflow execution to the end, as reported by DAGMan. In case of a rescue DAG the value is the cumulative of all retries.

Workflow cumulative job wall time :

The sum of the wall time of all jobs as reported by kickstart. In case of job retries the value is the cumulative of all retries. For workflows having sub workflow jobs (i.e. SUBDAG and SUBDAX jobs), the wall time value includes jobs from the sub workflows as well. The value is obtained from the remote_runtime column in the invocation table.

Cumulative job wall time as seen from submit side:

The sum of the wall time of all jobs as reported by DAGMan. This is similar to the regular cumulative job wall time, but includes job management overhead and delays. In case of job retries the value is the cumulative of all retries. For workflows having sub workflow jobs (i.e. SUBDAG and SUBDAX jobs), the wall time value includes jobs from the sub workflows as well. The value is obtained from the local_duration column in the job_instance table.
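The two sums above can be sketched as follows, using the column names given in the text (remote_runtime and local_duration); it is assumed that the invocation table carries a wf_id column and that the wf_id list covers the top level workflow and all of its sub workflows.

-- Workflow cumulative job wall time (kickstart view)
select sum(i.remote_runtime)
from invocation as i
where i.wf_id in (1, 2)

-- Cumulative job wall time as seen from the submit side (DAGMan view)
select sum(ji.local_duration)
from job_instance as ji
join job as j on j.job_id = ji.job_id
where j.wf_id in (1, 2)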

Workflow statistics file

The workflow statistics file contains statistics for each individual workflow. The parent workflow does not recursively include sub workflow jobs; each sub workflow (SUB DAX, SUB DAG) job is counted as a single job.

The information in this file is calculated similarly to the summary information. However, it is calculated only for the given workflow; if the workflow has SUB DAX or SUB DAG jobs, they are not recursively parsed.

Note: 'Jobs' here means the non sub workflow jobs.
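A per-workflow count with sub workflow jobs reported separately might look like the sketch below. The type_desc column and its values are assumptions used purely for illustration; the text only states that SUB DAX and SUB DAG jobs appear in their own rows.

-- Sketch only: job counts for a single workflow (no recursion into sub
-- workflows), grouped by an assumed job type column so that SUB DAX /
-- SUB DAG jobs are separated from ordinary jobs.
select j.type_desc, count(*) as original
from job as j
where j.wf_id = 1
group by j.type_desc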

Workflow status (shows the last retry details)

 

        | Original | Succeeded | Failed | Unsubmitted | Unknown
Jobs    | 4        | 2         | 1      | 1           | 0
SUB DAX | 0        | 0         | 0      | 0           | 0
SUB DAG | 0        | 0         | 0      | 0           | 0
Tasks   | 4        | 2         | 1      | NA          | NA

Case 1: Refer Figure 1 [Diamond Failed Run]

 

        | Original | Succeeded | Failed | Unsubmitted | Unknown
Jobs    | 3        | 3         | 0      | 0           | 0
SUB DAX | 1        | 1         | 0      | 0           | 0
SUB DAG | 0        | 0         | 0      | 0           | 0
Tasks   | 4        | 4         | 0      | NA          | NA

Case 2: Refer Figure 2 (DAX A workflow) [Hierarchical Successful Run]

 

        | Original | Succeeded | Failed | Unsubmitted | Unknown
Jobs    | 3        | 2         | 0      | 1           | 0
SUB DAX | 1        | 0         | 1      | 0           | 0
SUB DAG | 0        | 0         | 0      | 0           | 0
Tasks   | 4        | 2         | 1      | NA          | NA

Case 3: Refer Figure 3 (DAX A workflow, B3 fails after 3 retries) [Hierarchical Failed Run]

Workflow statistics (shows the cumulative of all retries)

 

        | Actually Run | Succeeded | Failed
Jobs    | 4            | 1         | 3
SUB DAX | 0            | 0         | 0
SUB DAG | 0            | 0         | 0
Tasks   | 7            | 4         | 3

Case 1: Refer Figure 1 [Diamond Failed Run]

 

        | Actually Run | Succeeded | Failed
Jobs    | 3            | 3         | 0
SUB DAX | 1            | 1         | 0
SUB DAG | 0            | 0         | 0
Tasks   | 4            | 4         | 0

Case 2: Refer Figure 2 (DAX A workflow) [Hierarchical Successful Run]

 

        | Actually Run | Succeeded | Failed
Jobs    | 2            | 2         | 0
SUB DAX | 3            | 0         | 3
SUB DAG | 0            | 0         | 0
Tasks   | 5            | 2         | 3

Case 3: Refer Figure 3 (DAX A workflow, B3 fails after 3 retries) [Hierarchical Failed Run]

Workflow wall time :

The wall time from the start of the workflow execution to the end, as reported by DAGMan. In case of a rescue DAG the value is the cumulative of all retries.

Workflow cumulative job wall time :

The sum of the wall time of all jobs as reported by kickstart. In case of job retries the value is the cumulative of all retries. For workflows having sub workflow jobs (i.e. SUBDAG and SUBDAX jobs), the wall time value does not include jobs from the sub workflows. The value is obtained from the remote_runtime column in the invocation table.

Cumulative job wall time as seen from submit side:

The sum of the wall time of all jobs as reported by DAGMan. This is similar to the regular cumulative job wall time, but includes job management overhead and delays. In case of job retries the value is the cumulative of all retries. For workflows having sub workflow jobs (i.e. SUBDAG and SUBDAX jobs), the wall time value does not include jobs from the sub workflows. The value is obtained from the local_duration column in the job_instance table.

Job statistics file

The job statistics file contains the following information about the jobs in each individual workflow.

    Job - the name of the job

    Site - the site where the job ran

    Kickstart(sec.) - the actual duration of the job in seconds on the remote compute node. In case of retries the value is the cumulative of all retries. The value is obtained from the remote_runtime column in the invocation table.

    Post(sec.) - the postscript time as reported by DAGMan. In case of retries the value is the cumulative of all retries. The value is calculated as [POST_SCRIPT_TERMINATED - POST_SCRIPT_STARTED/JOB_TERMINATED]. The information is obtained from the jobstate table.

    DAGMan(sec.) - the time between the completion of a job's last parent job and the submission of the job. In case of retries the value of the last retry is used for the calculation. The value is calculated as [SUBMIT] - last parent job's [POST_SCRIPT_TERMINATED]. The information is obtained from the jobstate table.

    CondorQTime(sec.) - the time between submission by DAGMan and the remote Grid submission. It is an estimate of the time spent in the Condor queue on the submit node. In case of retries the value is the cumulative of all retries. The value is calculated as [GRID_SUBMIT/GLOBUS_SUBMIT/EXECUTE - SUBMIT]. The information is obtained from the jobstate table.

    Resource(sec.) - the time between the remote Grid submission and the start of remote execution. It is an estimate of the time the job spent in the remote queue. In case of retries the value is the cumulative of all retries. The value is calculated as [EXECUTE - GRID_SUBMIT/GLOBUS_SUBMIT]. The information is obtained from the jobstate table.

    Runtime(sec.) - the time spent on the resource as seen by Condor DAGMan. It is always >= Kickstart. In case of retries the value is the cumulative of all retries. The value is obtained from the local_duration column in the job_instance table.

    Seqexec(sec.) - the time taken for the completion of a clustered job. In case of retries the value is the cumulative of all retries. The value is obtained from the cluster_duration column in the job_instance table.

    Seqexec-Delay(sec.) - the difference between the completion time of a clustered job and the sum of all its individual tasks' Kickstart times. In case of retries the value is the cumulative of all retries. The value is obtained as the difference between the cluster_duration in the job_instance table and the sum of the corresponding tasks' remote_runtime in the invocation table (a query sketch follows this list).
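The Seqexec-Delay calculation can be sketched as below, using the cluster_duration and remote_runtime columns named in the list; treating a non-null cluster_duration as the marker for a clustered job is an assumption for illustration.

-- Sketch only: Seqexec-Delay = cluster_duration minus the sum of the tasks'
-- Kickstart (remote_runtime) values, per clustered job instance.
select ji.job_instance_id,
       ji.cluster_duration - sum(i.remote_runtime) as seqexec_delay
from job_instance as ji
join invocation as i on i.job_instance_id = ji.job_instance_id
where ji.cluster_duration is not null
group by ji.job_instance_id, ji.cluster_duration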

Transformation statistics file

The transformation statistics file contains the following information about each transformation in an individual workflow.

    Transformation - name of the transformation.

    Count - the number of times the transformation was executed.

    Mean(sec.) - the mean of the transformation runtime. The value is obtained from the remote_runtime column in the invocation table.

    Variance(sec.) - the variance of the transformation runtime. Variance is calculated using the on-line algorithm by Knuth (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance).

    Min(sec.) - the minimum transformation runtime value.

    Max(sec.) - the maximum transformation runtime value.

    Total(sec.) - the cumulative transformation runtime (a query sketch follows this list).
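Except for the variance, which the text says is computed with Knuth's on-line algorithm, these per-transformation values could be gathered with a single grouped query along the following lines. The transformation column on the invocation table and the per-workflow wf_id filter are assumptions for illustration.

-- Sketch only: per-transformation count, mean, min, max and total runtime,
-- based on the remote_runtime column named in the text.
select i.transformation,
       count(*) as times_run,
       avg(i.remote_runtime) as mean_sec,
       min(i.remote_runtime) as min_sec,
       max(i.remote_runtime) as max_sec,
       sum(i.remote_runtime) as total_sec
from invocation as i
where i.wf_id = 1
group by i.transformation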

Pegasus Statistics Queries

This section contains the queries used for fetching the statistics information from the Stampede DB.

Workflow Summary (across workflows)

Workflow status (shows the last retry details)

Query for finding all the workflow ids (i.e. all sub workflows and the top level workflow) by passing the wf_uuid:
select wf_id
from workflow as wf
where wf.root_wf_id = (
    select wf_id from workflow where wf_uuid = '1e8b9ab6-8cdd-4e90-95cb-989f246dab56'
)

Note: All the queries use the wf_id values obtained by passing the wf_uuid to the above query. This avoids the need to join the workflow table in each query.

Total jobs
select
(