This page lists metrics that we wish to derive for Pegasus workflows.
- Total number of workflows executed
- Number of workflows executed per day
- Total runtime of each workflow
- Sum of task durations, with and without pre/post scripts
- Total walltime of each workflow
- Sum of DAGMan time (difference between DAGMan end and start times)
- Total number of workflow jobs/tasks executed. This counts every job execution for a given workflow, including failures, retries, and successes.
- Total number of workflow jobs and tasks that failed
- Total number of workflow jobs and tasks that succeeded
- Total number of workflow jobs/tasks that were automatically retried
- Workflow jobs/tasks that were retried (breakdown by job name or transformation)
- Number of jobs/tasks executed per <time-period>: per hour, per day, per week, per month, per year
- Overheads for the jobs (cumulative and average). We should also be able to quantify the % overhead in relation to the overall job time.
- DAGMan overhead
- Amount of time spent in the Condor queue
- Time from a job's release into the queue until it starts running
- Kickstart overhead
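As a rough illustration of the overhead metrics above, the following sketch computes the cumulative overhead, the average overhead per job, and the overhead as a percentage of overall job time. The `JobRecord` class and its field names are hypothetical and do not correspond to any existing Pegasus schema; it also assumes that the per-job delays have already been extracted from the logs, in seconds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobRecord:
    """Hypothetical per-job timing record; all values are in seconds."""
    dagman_delay: float        # assumed: DAGMan overhead before submission
    queue_delay: float         # assumed: time spent waiting in the Condor queue
    kickstart_overhead: float  # assumed: wrapper time outside the task itself
    runtime: float             # time spent running the task itself

    @property
    def overhead(self) -> float:
        return self.dagman_delay + self.queue_delay + self.kickstart_overhead


def overhead_summary(jobs: List[JobRecord]) -> Dict[str, float]:
    """Return cumulative, average, and percentage overhead for a set of jobs."""
    total_overhead = sum(j.overhead for j in jobs)
    # Overall job time is taken here to be overhead plus runtime.
    total_time = total_overhead + sum(j.runtime for j in jobs)
    return {
        "cumulative": total_overhead,
        "average": total_overhead / len(jobs) if jobs else 0.0,
        "percent": 100.0 * total_overhead / total_time if total_time else 0.0,
    }


if __name__ == "__main__":
    jobs = [JobRecord(1, 2, 1, 16), JobRecord(2, 3, 1, 24)]
    print(overhead_summary(jobs))
```

Whether "overall job time" should include the overhead itself, as assumed here, or only the runtime is a choice that would need to be settled when the metric is defined.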
Related to resource utilization:
- Number of jobs executed on a host/glidein for a particular provisioning request
- Average number of idle jobs in the queue over time (also maybe min/max)
- Average number of running jobs over time (also maybe min/max)
- Average number of idle glideins over time (and min/max)
- Job type (data transfer in/out, registration, application, other Pegasus jobs)
- Workflows/jobs/tasks over time
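The "average over time (and min/max)" metrics above could be derived from job or glidein intervals with a simple sweep over start and end events. The sketch below is one possible approach, not an existing Pegasus tool: the function name `concurrency_stats` and the `(start, end)` interval input are assumptions, with times expressed as seconds. Feeding it running intervals gives running-job statistics; feeding it idle intervals gives idle-job or idle-glidein statistics.

```python
from typing import Dict, Iterable, Tuple


def concurrency_stats(
    intervals: Iterable[Tuple[float, float]],
    window_start: float,
    window_end: float,
) -> Dict[str, float]:
    """Time-weighted average, min, and max number of overlapping intervals
    within the observation window [window_start, window_end)."""
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")

    events = []
    for start, end in intervals:
        # Clip each interval to the observation window.
        s, e = max(start, window_start), min(end, window_end)
        if s < e:
            events.append((s, 1))
            events.append((e, -1))
    events.sort()  # at equal timestamps, ends (-1) sort before starts (+1)

    count = 0            # number of intervals currently open
    prev = window_start  # start of the current constant-count segment
    area = 0.0           # integral of count over time
    lo, hi = None, 0

    def close_segment(until: float) -> None:
        nonlocal area, lo, hi, prev
        if until > prev:
            area += count * (until - prev)
            lo = count if lo is None else min(lo, count)
            hi = max(hi, count)
            prev = until

    for t, delta in events:
        close_segment(t)
        count += delta
    close_segment(window_end)

    return {
        "average": area / (window_end - window_start),
        "min": lo,
        "max": hi,
    }


if __name__ == "__main__":
    print(concurrency_stats([(0, 10), (5, 15)], 0, 20))
```

The average is weighted by time rather than taken over sampled snapshots, so short spikes in the queue contribute in proportion to how long they last.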