
...

In the case of job clustering without kickstart, we will have multiple tasks associated with one job.
That job can have multiple job instances (depending on DAGMan retrying the job in case of failure).
Each job instance will have only one main invocation associated with it (plus the additional prescript and postscript invocations, if specified).
Although each job has multiple tasks associated with it, the main invocation for each job only has information coming from the submit file, and does not provide the high level of detail available when kickstart is used.

Information Source for Each Table

task

==> Information comes from the DAX (Pegasus will generate events in the NetLogger format in a file in the submit directory)

...

==> Information will come from kickstart output file

Sample NetLogger Events

As pegasus-monitord parses the various files in a workflow directory (braindump, workflow-map, dagman.out file), it will generate NetLogger events that can be used to populate a database using the Stampede schema. All events have the "stampede." prefix. Here are examples for each of these events:  
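The events are flat key=value records, with shell-style quoting for values that contain spaces (e.g. arguments="..."). As a minimal sketch of how a consumer of these events might parse one line into a dictionary (parse_netlogger_event is a hypothetical helper, not part of pegasus-monitord):

```python
import shlex

def parse_netlogger_event(line):
    """Parse one NetLogger key=value event line into a dict.

    shlex.split honors the double quotes used for values with
    embedded spaces, such as arguments="--dax ... -v".
    """
    fields = {}
    for token in shlex.split(line.strip()):
        key, sep, value = token.partition("=")
        if sep:  # skip any stray token without an '='
            fields[key] = value
    return fields

# Example: the stampede.workflow.start sample shown below
sample = ('ts=2011-10-12T17:43:26.000000Z event=stampede.workflow.start '
          'level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d restart_count=0')
event = parse_netlogger_event(sample)
```

Note that all values come back as strings; a loader feeding a Stampede database would still need to cast fields such as restart_count or ts to the appropriate column types.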

stampede.workflow.plan event

ts=2010-10-12T17:43:23.000000Z event=stampede.workflow.plan level=Info parent.wf.id=None root.wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d submit_hostname=butterfly.isi.edu dax_label=diamond dax_index=0 planner_version=3.0.0cvs grid_dn=null user=prasanth submit_dir=/lfs1/prasanth/grid-setup/pegasus/3.0.0/examples/grid-blackdiamond/work/prasanth/pegasus/diamond/20101012T104323-0700 wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d planner_arguments="--dax /dax/diamond.dax --force --dir dags -s local -o local --nocleanup -v" dax_version=3.2 dax_file=/dax/diamond.dax

This event is generated when pegasus-monitord parses braindump.txt. The wf.id field is generated by Pegasus and is guaranteed to be unique. The ts field contains the timestamp the workflow was planned.

stampede.workflow.start event

ts=2011-10-12T17:43:26.000000Z event=stampede.workflow.start level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d restart_count=0

This event is generated by pegasus-monitord when it detects that DAGMan has started. The ts field contains the timestamp DAGMan started.

stampede.workflow.end event

ts=2011-10-12T18:06:57.000000Z event=stampede.workflow.end level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d restart_count=0

This event is generated by pegasus-monitord when it detects that DAGMan has finished. The ts field contains the timestamp DAGMan ended.

stampede.task event

ts=2011-10-12T17:43:26.000000Z event=stampede.task level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d transformation=diamond::preprocess:2.0 arguments="-a preprocess -T60 -i f.a -o f.b1 f.b2" abs_task.id=ID0000001 type=job

This event is generated by Pegasus during the planning phase, and is written to a file in the workflow's directory in NetLogger format. pegasus-monitord reads this file when it enters the workflow directory and pipes its contents to the loader.

stampede.task.map event

ts=2011-10-12T17:43:26.000000Z event=stampede.task.map level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d abs_task.id=ID0000001 exec_job.id=preprocess_ID00000001

This event is generated by Pegasus during the planning phase of a workflow, but appears only after a task (or group of tasks) is mapped into a job.

stampede.job event

ts=2011-10-12T17:44:15.000000Z event=stampede.job level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d exec_job.id=create_dir_diamond_0_ISIViz submit_file=create_dir_diamong_0_ISIViz.sub jobtype="create dir" clustered=False max_retries=3 executable=/opt/pegasus/2.4/bin/kickstart arguments="-n pegasus::dirmanager -N pegasus::dirmanager:1.0 -R futuregrid -L diamond -T 2010-08-13T13:37:20-07:00 /opt/pegasus/pegasus-3.0.0cvs/bin/dirmanager --create --dir /Users/voeckler/Pegasus/futuregrid/work/outputs/voeckler/pegasus/diamond/20100813T175039-0700" task_count=0

This event is generated by Pegasus during the planning phase of a workflow, and contains a description of every job in the executable workflow. Jobs inserted by Pegasus, which do not have a mapped task from the DAX, will have their task_count set to 0.

stampede.task.edge event

ts=2011-10-12T17:43:26.000000Z event=stampede.task.edge level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d parent_abs_task.id=ID0000001 child_abs_task.id=ID00000002

This event is generated by pegasus-monitord when it parses the file generated by Pegasus during the planning phase.

stampede.jobinstance.prescript.start event

ts=2010-02-20T23:25:28.000000Z event=stampede.jobinstance.prescript.start level=Info wf.id=wftest-id exec_job.id=pegasus-plan_ID000001 job.id=2

This event is generated by pegasus-monitord whenever it detects the start of a prescript for a new job. This event is similar to the stampede.jobinstance.mainjob.start event (see below), but it does not contain the sched.id field (as one has not yet been assigned). The ts field contains the timestamp the prescript started.

stampede.jobinstance.prescript.end event

ts=2010-02-20T23:14:11.000000Z event=stampede.jobinstance.prescript.finish level=Info wf.id=wftest-id exec_job.id=pegasus-plan_ID000001 job.id=2

This event is generated by pegasus-monitord whenever it detects the end of a prescript for a job. The ts field contains the timestamp the prescript ended.

stampede.jobinstance.mainjob.start event

ts=2011-10-12T17:43:40.000000Z event=stampede.jobinstance.mainjob.start level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d sched.id=388.0 job.id=1 exec_job.id=create_dir_diamond_0_ISIViz job_stdout=/workflow/run/create_dir_diamong_0_ISIViz.out job_stderr=/workflow/run/create_dir_diamong_0_ISIViz.err job_stdin=None

The stampede.jobinstance.mainjob.start event is generated by pegasus-monitord every time a job is found in the dagman.out file. The job.id field is generated by pegasus-monitord and starts at 1. The combination of wf.id and job.id guarantees a unique job instance. When a job begins, only certain information is available. Later, when the job finishes, pegasus-monitord will parse the kickstart output file and send the rest of the information in the stampede.jobinstance.mainjob.end event (see below). The ts field contains the timestamp the main job started.

stampede.jobinstance.mainjob.end event

ts=2011-10-12T17:44:15.000000Z event=stampede.jobinstance.mainjob.end level=Info remote_user=prasanth site_name=ISIViz exec_job.id=create_dir_diamond_0_ISIViz remote_working_dir=/tmp job.id=1 wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d sched.id=388.0 job_stdout=/workflow/run/create_dir_diamong_0_ISIViz.out job_stderr=/workflow/run/create_dir_diamong_0_ISIViz.err job_stdin=None cluster_start_time=None cluster_duration=None local_duration=5.23 subwf.id=None

This event is generated by pegasus-monitord whenever a main job finishes. It contains all the remaining information for the job table (which comes from the kickstart output file) that was unavailable at the beginning of the job execution. The ts field contains the timestamp the main job ended.
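Because the information for one job instance arrives split across the start and end events, a consumer typically has to merge the two records. A minimal sketch (not monitord's or the loader's actual code) keyed on the (wf.id, job.id) pair that identifies a job instance:

```python
# Partial records indexed by (wf.id, job.id); the mainjob.start event
# creates the entry, and the mainjob.end event fills in the remaining
# fields such as local_duration and remote_working_dir.
job_instances = {}

def on_event(fields):
    key = (fields["wf.id"], fields["job.id"])
    if fields["event"] == "stampede.jobinstance.mainjob.start":
        job_instances[key] = dict(fields)              # partial record
    elif fields["event"] == "stampede.jobinstance.mainjob.end":
        job_instances.setdefault(key, {}).update(fields)  # complete it

# Example with abbreviated field sets
on_event({"event": "stampede.jobinstance.mainjob.start", "wf.id": "w1",
          "job.id": "1", "job_stdout": "/run/create_dir.out"})
on_event({"event": "stampede.jobinstance.mainjob.end", "wf.id": "w1",
          "job.id": "1", "local_duration": "5.23"})
merged = job_instances[("w1", "1")]
```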

stampede.jobinstance.postscript.start event

ts=2011-10-12T17:44:15.000000Z event=stampede.jobinstance.postscript.start level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d job.id=1 exec_job.id=create_dir_diamond_0_ISIViz

This event is generated by pegasus-monitord when it detects the start of the postscript for a given job. The ts field contains the timestamp the postscript started.

stampede.jobinstance.postscript.end event

ts=2011-10-12T17:44:20.000000Z event=stampede.jobinstance.postscript.end level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d job.id=1 exec_job.id=create_dir_diamond_0_ISIViz

This event is generated by pegasus-monitord when it detects the end of the postscript for a given job. The ts field contains the timestamp the postscript ended.

stampede.jobinstance.state event

ts=2011-10-12T17:44:15.000000Z event=stampede.jobinstance.state level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d state=POST_SCRIPT_STARTED job.id=1 exec_job.id=create_dir_diamond_0_ISIViz js.id=7

A stampede.jobinstance.state event is generated every time a job changes state (e.g. SUBMIT, then EXECUTE, then JOB_SUCCESS, and so on). The ts field contains the timestamp of the state change. The js.id field contains the submit sequence number for this particular state transition.
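Since js.id is a monotonically increasing sequence number within a job instance, it can be used to reconstruct a job's state history even when events are consumed out of order. A hypothetical sketch:

```python
# Abbreviated state events for one job instance, deliberately out of order.
events = [
    {"state": "POST_SCRIPT_STARTED", "js.id": "7"},
    {"state": "SUBMIT", "js.id": "1"},
    {"state": "EXECUTE", "js.id": "2"},
    {"state": "JOB_SUCCESS", "js.id": "3"},
]

# Order the transitions on the js.id sequence number to recover the
# chronological state history of the job instance.
history = [e["state"] for e in sorted(events, key=lambda e: int(e["js.id"]))]
```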

stampede.invocation.prescript, stampede.invocation.mainjob, stampede.invocation.postscript events

ts=2011-10-12T17:44:32.000000Z event=stampede.invocation.mainjob level=Info executable=/lfs1/prasanth/grid-setup/pegasus/default/bin/pegasus-transfer exec_job.id=stage_in_local_ISIViz_0 start_time=1286905467 job.id=2 remote_duration=2.008 task.id=1 arguments="" wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d transformation=pegasus::pegasus-transfer exitcode=0 abs_task.id=None

...

These three events are similar in structure and indicate the termination of a prescript, main job, or postscript invocation. The ts field contains the timestamp the invocation ended. The task.id field contains the value -1 for prescript invocations, -2 for postscript invocations, and a positive integer (starting at 1) for each main job invocation. The abs_task.id field is only populated for jobs that are in the DAX (and not for Pegasus-generated jobs, nor for prescript and postscript invocations).
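The task.id convention above can be turned into a small classifier; invocation_kind is a hypothetical helper illustrating the mapping, not part of the Stampede tooling:

```python
def invocation_kind(task_id):
    """Map a task.id value from an invocation event to the invocation type.

    Per the convention: -1 = prescript, -2 = postscript, and positive
    integers (starting at 1) number the main-job invocations. Clustered
    jobs produce several main-job invocations, hence task.id > 1.
    """
    tid = int(task_id)
    if tid == -1:
        return "prescript"
    if tid == -2:
        return "postscript"
    if tid >= 1:
        return "mainjob"
    raise ValueError("unexpected task.id: %r" % task_id)
```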

stampede.host event

ts=2011-10-12T17:44:15.000000Z event=stampede.host level=Info job.id=1 site_name=ISIViz exec_job.id=create_dir_diamond_0_ISIViz total_ram=2124730368 hostname=viz-login.isi.edu uname=linux-2.6.18-194.3.1.el5-i686 wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d ip_address=128.9.72.178

This event is generated by pegasus-monitord whenever it parses a kickstart output file. In the case of clustered jobs (when there is more than 1 task in a mainjob), it is generated once per task. The ts field contains the timestamp the task associated with this host ended.

stampede.job.edge event

ts=2011-10-12T17:43:26.000000Z event=stampede.job.edge level=Info wf.id=934cb609-ddd4-4b67-ad7a-886ae40fc94d parent_exec_job.id=stage_out_local_ISIViz_1_0 child_exec_job.id=clean_up_stage_out_local_ISIViz_1_0

...