SCB Ocean Workflows

Input and Output Data

  • The input data for the workflow was staged in from corbusier.isi.edu.
    • Size of the raw input data for the workflow: 1.7 GB
  • The output data generated was staged back to the ISI machine (corbusier.isi.edu).
    • Corbusier runs GridFTP Server 3.11 (gcc32, 1213742010-78), Globus Toolkit 4.2.0 ready.
    • Size of the final output of the interpolate job that is staged out: 8.7 MB
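For reference, data staged against a GridFTP server such as the one on corbusier is typically moved with globus-url-copy. A minimal sketch, with hypothetical paths (the actual dataset locations are not recorded on this page):

# Hypothetical stage-in from corbusier's GridFTP server to local scratch
# (source and destination paths are placeholders):
globus-url-copy gsiftp://corbusier.isi.edu/path/to/input.nc file:///scratch/scb/input.nc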

Workflow Information

DAX/Abstract Workflow

http://www.isi.edu/~vahi/scb/roms_dax_v5.jpg

Number of Jobs by Type

Type of Job      Number
das_tide              1
fcst_tide             1
interpolate           1
Total                 3

Pegasus Configuration Files

The Pegasus configuration files for the runs on the cobalt system are below.

Executable Workflow (Generated by Pegasus)

  • Each job has its own data stage-in job.
  • Only the final output of the interpolate job is staged out.
  • Image of the workflow
  • Compute jobs (SCB jobs): 3

    Number of Compute/SCB Jobs by Type

    Type of Job      Number    Processors per job
    das_tide              1                     8
    fcst_tide             1                     8
    interpolate           1                     4
    Total                 3

Number of Jobs by Type

Type of Job                         Number
Compute/SCB jobs                         3
Data stage-in                            3
Data stage-out                           1
Directory creation and sync jobs         1
Total                                    8

Pegasus Workflow Job States and Delays

Workflow Runtimes

  • Number of test workflows executed: 4
  • Data generated by executing the gensim utility on the submit directory
    Each individual run:
    $PEGASUS_HOME/contrib/showlog/gensim --dag scb-0.dag --output /lfs1/work/jpl/scb_results/run0001 --jobstate-log jobstate.log
    All runs:
    $PEGASUS_HOME/contrib/showlog/gentimes -x run*

Visualization of runs over time

X axis: time in seconds. Y axis: number of jobs.

Runs On Teragrid

Legend

  • Job - the name of the job
  • Site - the site where the job ran
  • Kickstart - the actual duration of the job in seconds on the remote compute node
  • Post - the postscript time as reported by DAGMan
  • Condor - the time between submission by DAGMan and the remote Grid submission; an estimate of the time spent in the Condor queue on the submit node
  • Resource - the time between the remote Grid submission and the start of remote execution; an estimate of the time the job spent in the remote queue
  • Runtime - the time spent on the resource as seen by Condor DAGMan; always >= Kickstart
  • CondorQLen - the number of outstanding jobs in the queue when this job was released

run0001

#Job                                                     Site Kickstart     Post   DAGMan   Condor Resource  Runtime CondorQLen
create_dir_scb_0_cobalt                                cobalt     0.32     5.00    13.00    15.00     0.00    15.00     1
das_tide_ID000001                                      cobalt  3806.65     5.00     5.00    15.00  3906.00  3855.00     1
fcst_tide_ID000002                                     cobalt   346.39     5.00     5.00    15.00    90.00   465.00     1
interpolate_ID000003                                   cobalt   134.49     5.00     5.00    15.00   155.00   160.00     1
stage_in_das_tide_ID000001_0                           cobalt  2805.49     5.00     5.00    20.00     5.00  2946.00     1
stage_in_fcst_tide_ID000002_0                          cobalt  1665.65     5.00     5.00    20.00     5.00  1805.00     2
stage_in_interpolate_ID000003_0                        cobalt   318.15     5.00     5.00    15.00     0.00   435.00     3
stage_out_interpolate_ID000003_0                       cobalt    13.31     5.00     5.00    15.00     0.00   135.00     1

run0002

#Job                                                     Site Kickstart     Post   DAGMan   Condor Resource  Runtime CondorQLen
create_dir_scb_0_cobalt                                cobalt     0.34     5.00    13.00    15.00     0.00    15.00     1
das_tide_ID000001                                      cobalt  3811.54     5.00     5.00    20.00 13806.00  3915.00     1
fcst_tide_ID000002                                     cobalt   344.90     5.00     5.00    10.00   175.00   405.00     1
interpolate_ID000003                                   cobalt   128.56     5.00     5.00    20.00   225.00   160.00     1
stage_in_das_tide_ID000001_0                           cobalt  2740.53     5.00     5.00    20.00     0.00  2890.00     2
stage_in_fcst_tide_ID000002_0                          cobalt  1670.40     5.00     5.00    15.00     0.00  1815.00     3
stage_in_interpolate_ID000003_0                        cobalt   341.18     5.00     5.00    20.00     0.00   490.00     1
stage_out_interpolate_ID000003_0                       cobalt    12.97     5.00     5.00    10.00     5.00   155.00     1

run0003

#Job                                                     Site Kickstart     Post   DAGMan   Condor Resource  Runtime CondorQLen
create_dir_scb_0_cobalt                                cobalt     0.32     5.00    14.00    15.00     0.00    15.00     1
das_tide_ID000001                                      cobalt  3794.60     5.00     5.00    10.00  1650.00  3916.00     1
fcst_tide_ID000002                                     cobalt   492.81     5.00     6.00    15.00   110.00   520.00     1
interpolate_ID000003                                   cobalt   108.58     5.00     5.00    10.00   120.00   160.00     1
stage_in_das_tide_ID000001_0                           cobalt  2797.34     5.00     5.00    15.00     0.00  2955.00     2
stage_in_fcst_tide_ID000002_0                          cobalt  1658.32     5.00     5.00    10.00     5.00  1755.00     3
stage_in_interpolate_ID000003_0                        cobalt   348.12     5.00     5.00    15.00     0.00   495.00     1
stage_out_interpolate_ID000003_0                       cobalt     8.22     5.00     5.00    15.00     0.00    95.00     1

run0004

#Job                                                     Site Kickstart     Post   DAGMan   Condor Resource  Runtime CondorQLen
create_dir_scb_0_cobalt                                cobalt     0.29     5.00    13.00    15.00     0.00   125.00     1
das_tide_ID000001                                      cobalt  3861.94     5.00     5.00    10.00  2735.00  3916.00     1
fcst_tide_ID000002                                     cobalt   348.76     5.00     5.00    15.00   150.00   405.00     1
interpolate_ID000003                                   cobalt   139.54     5.00     5.00    15.00   150.00   165.00     1
stage_in_das_tide_ID000001_0                           cobalt  2686.90     5.00     5.00    10.00     5.00  2821.00     3
stage_in_fcst_tide_ID000002_0                          cobalt  1641.17     5.00     5.00    10.00     5.00  1806.00     1
stage_in_interpolate_ID000003_0                        cobalt   333.03     5.00     5.00    10.00     5.00   455.00     2
stage_out_interpolate_ID000003_0                       cobalt    12.13     5.00     5.00    15.00     0.00   135.00     1

All Runs

#All
#Transformation                           Count         Mean         Variance
pegasus::transfer                            16      1190.81       1279724.52
scb::das_tide                                 4      3818.68           882.31
pegasus::dirmanager                           4         0.32             0.00
scb::fcst_tide                                4       383.22          5341.00
scb::interpolate                              4       127.79           184.18
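As a consistency check, the mean for scb::das_tide follows directly from the four per-run Kickstart values above: (3806.65 + 3811.54 + 3794.60 + 3861.94) / 4 = 3818.68 seconds. The pegasus::transfer count of 16 covers the four transfer jobs (three stage-ins and one stage-out) in each of the four runs, and its large variance reflects the spread between the long das_tide stage-ins (roughly 2700-2800 seconds) and the short interpolate stage-outs (under 15 seconds). All times are in seconds.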

Runs On Pollux

The legend is the same as for the Teragrid runs above. The Pollux tables also include Seqexec and Seqexec-Delay columns, which are not populated for these runs; note that the stage-in and stage-out jobs ran on the local (submit) site rather than on pollux.

run0001

#Job                                                     Site Kickstart     Post   DAGMan   Condor Resource  Runtime CondorQLen Seqexec Seqexec-Delay
create_dir_scb-gemini_0_pollux                         pollux     1.19     5.00    60.00     0.00     0.00   100.00     0           -          -
das_tide_ID000001                                      pollux  4561.04    10.00   108.00     0.00     0.00  4430.00     0           -          -
fcst_tide_ID000002                                     pollux   956.74     5.00     6.00     0.00     0.00   935.00     0           -          -
interpolate_ID000003                                   pollux   262.02    10.00     7.00     0.00     0.00   255.00     0           -          -
stage_in_das_tide_ID000001_0                            local   160.00    10.00   125.00     0.00     0.00   115.00     0                       
stage_in_fcst_tide_ID000002_0                           local   160.00    10.00   125.00     0.00     0.00   115.00     0                       
stage_in_interpolate_ID000003_0                         local   160.00     5.00   125.00     0.00     0.00   160.00     0                       
stage_out_interpolate_ID000003_0                        local   126.00     5.00     5.00     0.00     0.00   126.00     0             

run0002

#Job                                                     Site Kickstart     Post   DAGMan   Condor Resource  Runtime CondorQLen Seqexec Seqexec-Delay
create_dir_scb-gemini_0_pollux                         pollux     1.19    10.00   121.00     0.00     0.00   215.00     0           -          -
das_tide_ID000001                                      pollux  4496.42     5.00    65.00     0.00     0.00  4485.00     0           -          -
fcst_tide_ID000002                                     pollux  1271.32     5.00     5.00     0.00     0.00  1255.00     0           -          -
interpolate_ID000003                                   pollux   256.19     5.00     6.00     0.00     0.00   260.00     0           -          -
stage_in_das_tide_ID000001_0                            local   360.00     5.00   306.00     0.00     0.00   265.00     0                       
stage_in_fcst_tide_ID000002_0                           local   255.00     5.00   306.00     0.00     0.00   155.00     0                       
stage_in_interpolate_ID000003_0                         local   100.00     5.00   306.00     0.00     0.00   100.00     0                       
stage_out_interpolate_ID000003_0                        local   100.00     5.00     5.00     0.00     0.00   100.00     0                       

run0003

#Job                                                     Site Kickstart     Post   DAGMan   Condor Resource  Runtime CondorQLen Seqexec Seqexec-Delay
create_dir_scb-gemini_0_pollux                         pollux     1.17    10.00   120.00     0.00     0.00   155.00     0           -          -
das_tide_ID000001                                      pollux  4359.44     5.00     5.00     0.00     0.00  4345.00     0           -          -
fcst_tide_ID000002                                     pollux   864.60    10.00     5.00     0.00     0.00   856.00     0           -          -
interpolate_ID000003                                   pollux   305.44     5.00    57.00     0.00     0.00   290.00     0           -          -
stage_in_das_tide_ID000001_0                            local   421.00     5.00   245.00     0.00     0.00   260.00     0                       
stage_in_fcst_tide_ID000002_0                           local   311.00     5.00   245.00     0.00     0.00   155.00     0                       
stage_in_interpolate_ID000003_0                         local   161.00     5.00   245.00     0.00     0.00   161.00     0                       
stage_out_interpolate_ID000003_0                        local   170.00     5.00     5.00     0.00     0.00   170.00     0                       

DAX Generator

Input

  • current simulation time, in YYYYMMDDHH format. HH is usually 03, 09, 15, or 21.
  • forecast length (for fcst jobs). Usually 6 hours, but should be configurable.

Notes

Vahi 14:33, 3 March 2009 (PST)

  • For files ending in monMM, MM must be the month from the simulation time passed in.
  • In the sample DAX, the files that end in 06 are the ones whose HH is 3 hours earlier than the HH of the time passed in the input.
  • The ROMS bulk file and the scbclim file that are input to the forecast job are always from the day before.

CODE DOWNLOAD

DAX Generator

Building from source

Building from source creates the binary distribution scb-binary-1.0.tar.gz.
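The build command itself is not recorded on this page. A sketch, assuming the checkout uses an Ant build with a dist target (both assumptions; check the checkout for a build.xml and its actual targets):

# Hypothetical build step; the "dist" target name is an assumption
cd <svn-checkout-dir>   # path to the checked-out source (placeholder)
ant dist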

Installing the binary distribution

  • tar zxvf scb-binary-1.0.tar.gz
  • cd scb-1.0
  • export SCB_HOME=`pwd`
  • unset CLASSPATH
  • source setup.sh

Setting up the user environment

  • cd scb-1.0
    • this is the directory that was created when you untarred the scb-binary-1.0.tar.gz file.
  • export SCB_HOME=`pwd`
  • unset CLASSPATH
  • source setup.sh
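To confirm the environment took effect, a quick check (assuming setup.sh puts the SCB tools on the PATH, which is how scb-dax-gen is invoked in the examples below):

# Both should resolve if setup.sh did its job
echo $SCB_HOME
which scb-dax-gen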

DAX Generator Description

The SCB DAX generator is a Java program that uses the Pegasus Java DAX API to generate SCB DAXes (abstract workflows). To generate a DAX, the user needs to specify the data assimilation time and, optionally, the forecast duration; the forecast duration defaults to 6 hours unless specified. In addition to the SCB DAX, the DAX generator generates the following (see the sketch after this list):

  • the input and output LOF (list-of-filenames) files for the jobs in the DAX
  • a file-based replica catalog that catalogs the locations of the LOF files, so that Pegasus can transfer them as part of the workflow
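For illustration, a Pegasus file-based replica catalog holds one mapping per line, of the form LFN PFN pool="site"; scb-dax-gen writes these mappings into the .cache file (for example scb-2008060409-6.cache below). A sketch of what the generated catalog might contain, using LOF file names from the example below but hypothetical URLs and pool values:

# Hypothetical replica catalog entries (URLs and pool values are assumptions):
das_input_2008060409_lof gsiftp://server.isi.edu/scb/dax/das_input_2008060409_lof pool="local"
fcst_input_2008060409_lof gsiftp://server.isi.edu/scb/dax/fcst_input_2008060409_lof pool="local"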

Generating an SCB DAX

scb-dax-gen

USAGE

{panel}
 $Id: DAXGenerator.java 1717 2009-03-10 04:09:08Z vahi $ 
 1.0
 scb-dax-gen - The main class used to run SCB dax generator 
 Usage: scb-dax-gen [-Dprop [..]] --time <YYYYMMDDHH> [--dir <output directory>]
        [--name <dax basename>] [--fcst-duration <forecast duration>] [-u <url-prefix>]
        [--verbose] [--version] [-h]
{panel}

 Mandatory Options 
{panel}
 -t |--time            the time at which data assimilation took place. In YYYYMMDDHH format.
 Other Options  
 -n |--name            the basename to be given to the DAX file that is generated.
 -D |--dir             the directory where to generate the DAX and the LOF files.
 -f |--fcst-duration   the duration in hours of the forecast. Defaults to 6.
 -u |--url-prefix      the url prefix to the server hosting the LOF files e.g. gsiftp://server.isi.edu 
 -v |--verbose         increases the verbosity of messages about what is going on
 -V |--version         displays the version of the SCB DAX Generator
 -h |--help            generates this help.
 The following exit codes are produced:
 0 the dax generator was able to generate the DAX and associated LOF files
 1 an error occurred. In most cases, the error message logged should give a
   clear indication as to where things went wrong.
 2 an error occurred while loading a specific module implementation at runtime
{panel}
 

EXAMPLE

corbusier:scb-1.0 vahi$ scb-dax-gen --time 2008060409 --dir dax --fcst-duration 6 --url-prefix "gsiftp://dummy.isi.edu"
2009.03.09 21:06:40.968 PDT: [INFO] event.scb.dax-generator scb.version 1.0  - STARTED 
2009.03.09 21:06:41.011 PDT: [INFO]  Time taken to execute is 0.043 seconds 
2009.03.09 21:06:41.011 PDT: [INFO] event.scb.dax-generator scb.version 1.0  - FINISHED 

corbusier:scb-1.0 vahi$ ls dax/
das_input_2008060409_lof		fcst_output_2008060409_lof		scb-2008060409-6.cache
fcst_input_2008060409_lof		interpolate_input_2008060409_lof	scb-2008060409-6.dax

SCB Wrapper Scripts

Vahi 12:20, 2 November 2009 (PDT)

Each job in the SCB workflow has an associated wrapper script that prepares the input for the OpenMP code to execute.
The wrappers figure out the required input files and the other arguments, and pass them on to the OpenMP code. To get the codes executed as part of a workflow, the wrapper scripts needed to be modified as follows (a hypothetical invocation is sketched after this list):

  • Remove hardcoded paths
    • The wrapper scripts expected the input data to be available in a fixed location relative to where the codes are installed. This is not feasible in the Grid environment, as the jobs are usually executed on a scratch filesystem. Pegasus is able to transfer the input data to the scratch directories; however, the scripts needed to be modified to pick up the input data from the workflow-specific scratch directory. The modifications involved passing a list-of-input-filenames file (LOF file) and a list-of-output-filenames file to the wrapper scripts. These LOF files identify the input/output data that a job requires/produces.
  • Remove a layer of scripts
    • The wrapper scripts were launched by another wrapper script that set the arguments for the jobs as environment variables, along with certain environment variables for the OpenMP system. The outermost script was removed, and the immediate wrappers around the codes were modified to take the arguments on the command line. The OpenMP variables are specified in the DAX and are set in the job environment when it is launched on the remote site.
  • Names of the SCB wrapper scripts
    • scb_pegasus_run_das_tide
    • scb_pegasus_interscript_das
    • scb_pegasus_run_fcst_tide
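For illustration only, a modified-style invocation might look like the following; the argument interface is an assumption (the scripts in JPL/bin define the real one), and the LOF names are taken from the scb-dax-gen example above:

# Hypothetical wrapper invocation with input and output LOF files
# (argument order and interface are assumptions):
scb_pegasus_run_fcst_tide fcst_input_2008060409_lof fcst_output_2008060409_lof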

The SCB wrapper scripts are available in the SVN checkout, which contains a tar file, scb-codes-050409.tgz.

Untar it; the scripts will be in the JPL/bin directory.
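To extract them (the tar file name and the JPL/bin location are as described above):

tar zxvf scb-codes-050409.tgz
ls JPL/bin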

Instructions for running workflows on pollux

  • Log onto pollux as user gmehta.
  • Check for a grid proxy using grid-proxy-info. This is required to stage the input data from the GridFTP server on pollux.
    pollux scb_test/run0005> grid-proxy-info 
    subject  : /DC=org/DC=doegrids/OU=People/CN=Karan Vahi 476301/CN=200344285
    issuer   : /DC=org/DC=doegrids/OU=People/CN=Karan Vahi 476301
    identity : /DC=org/DC=doegrids/OU=People/CN=Karan Vahi 476301
    type     : Proxy draft (pre-RFC) compliant impersonation proxy
    strength : 512 bits
    path     : /tmp/x509up_u41244
    timeleft : 73:35:04  (3.0 days)
    
  • change to pegasus submit directory
    pollux /home/gmehta> cd ~/pegasus-submit-dir/
    pollux gmehta/pegasus-submit-dir> pwd
    /workp/oba/gmehta/SUBMIT
    pollux gmehta/pegasus-submit-dir> ls
    conf  dags  dax  dax-gen  EXEC  pegasus-plan.txt  pegasus-plan.txt~  STORAGE
    
  • pegasus-plan.txt contains the commands that we run.
  • generating the DAX using scb-dax-gen
    #generating dax using dax generator
    pollux gmehta/pegasus-submit-dir> scb-dax-gen --time 2008060409 --dir dax-gen --fcst-duration 6 --url-prefix "file:///" -p pollux
    2009.05.07 14:06:35.154 PDT: [INFO] event.scb.dax-generator scb.version 1.0  - STARTED 
    2009.05.07 14:06:35.328 PDT: [INFO]  Time taken to execute is 0.156 seconds 
    2009.05.07 14:06:35.329 PDT: [INFO] event.scb.dax-generator scb.version 1.0  - FINISHED 
    
    pollux gmehta/pegasus-submit-dir> ls dax-gen/
    das_input_2008060409_lof   fcst_output_2008060409_lof        scb-2008060409-6.cache
    fcst_input_2008060409_lof  interpolate_input_2008060409_lof  scb-2008060409-6.dax
    
    
    This will create the DAX in the dax-gen directory along with the associated LOF files required for the workflow.
    
  • Edit the DAX file to give it a shorter label, due to a bug in the fcst code: change the label to scb-test (a sketch follows).
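    A minimal sketch of that edit, assuming the label lives in the adag element's name attribute and that the generated label is scb-2008060409-6 (both assumptions; verify against the actual file before editing in place):

    # Hypothetical in-place label edit; check the attribute first
    sed -i 's/name="scb-2008060409-6"/name="scb-test"/' dax-gen/scb-2008060409-6.dax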
  • planning and submitting the workflow
     
    pollux gmehta/pegasus-submit-dir> pegasus-plan -Dpegasus.user.properties=./conf/properties --dax dax-gen/scb-2008060409-6.dax --cache dax-gen/scb-2008060409-6.cache -s pollux -o local --dir dags --nocleanup --force --submit
    
    2009.05.07 14:09:14.780 PDT: [INFO] event.pegasus.planner planner.version 2.4.0cvs  - STARTED 
    2009.05.07 14:09:16.152 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.174 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.287 PDT: [INFO] event.pegasus.parse.dax dax.id /workp/oba/gmehta/SUBMIT/dax-gen/scb-2008060409-6.dax  - STARTED 
    2009.05.07 14:09:16.584 PDT: [INFO] event.pegasus.parse.dax dax.id /workp/oba/gmehta/SUBMIT/dax-gen/scb-2008060409-6.dax  - FINISHED 
    2009.05.07 14:09:16.645 PDT: [INFO] event.pegasus.refinement dax.id scb-test_1  - STARTED 
    2009.05.07 14:09:16.683 PDT: [INFO] event.pegasus.load.cache dax.id scb-test_1  - STARTED 
    2009.05.07 14:09:16.695 PDT: [INFO] event.pegasus.load.cache dax.id scb-test_1  - FINISHED 
    2009.05.07 14:09:16.704 PDT: [INFO] event.pegasus.siteselection dax.id scb-test_1  - STARTED 
    2009.05.07 14:09:16.732 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.738 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.742 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.746 PDT: [INFO] event.pegasus.siteselection dax.id scb-test_1  - FINISHED 
    2009.05.07 14:09:16.763 PDT: [INFO]  Grafting transfer nodes in the workflow 
    2009.05.07 14:09:16.763 PDT: [INFO] event.pegasus.generate.transfer-nodes dax.id scb-test_1  - STARTED 
    Ignoring PFN file:////workp/oba/gmehta/SUBMIT/STORAGE/2008060409_rst.nc
    2009.05.07 14:09:16.827 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.836 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.841 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.843 PDT: [WARNING]  profile condor.grid_resource is empty, Removing! 
    2009.05.07 14:09:16.843 PDT: [WARNING]  profile pegasus.style is empty, Removing! 
    2009.05.07 14:09:16.855 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.861 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.866 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.867 PDT: [WARNING]  profile condor.grid_resource is empty, Removing! 
    2009.05.07 14:09:16.868 PDT: [WARNING]  profile pegasus.style is empty, Removing! 
    2009.05.07 14:09:16.877 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.881 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.885 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.886 PDT: [WARNING]  profile condor.grid_resource is empty, Removing! 
    2009.05.07 14:09:16.887 PDT: [WARNING]  profile pegasus.style is empty, Removing! 
    2009.05.07 14:09:16.895 PDT: [INFO] event.pegasus.generate.transfer-nodes dax.id scb-test_1  - FINISHED 
    2009.05.07 14:09:16.913 PDT: [INFO] event.pegasus.generate.workdir-nodes dax.id scb-test_1  - STARTED 
    2009.05.07 14:09:16.922 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.924 PDT: [INFO] event.pegasus.generate.workdir-nodes dax.id scb-test_1  - FINISHED 
    2009.05.07 14:09:16.925 PDT: [INFO] event.pegasus.generate.cleanup-wf dax.id scb-test_1  - STARTED 
    2009.05.07 14:09:16.928 PDT: [WARNING]  unknown profile condor.grid_resource,  using anyway 
    2009.05.07 14:09:16.931 PDT: [INFO] event.pegasus.generate.cleanup-wf dax.id scb-test_1  - FINISHED 
    2009.05.07 14:09:16.931 PDT: [INFO] event.pegasus.refinement dax.id scb-test_1  - FINISHED 
    2009.05.07 14:09:17.152 PDT: [INFO]  Generating codes for the concrete workflow 
    2009.05.07 14:09:17.524 PDT: [INFO]  Generating codes for the concrete workflow -DONE 
    2009.05.07 14:09:17.525 PDT: [INFO]  Generating code for the cleanup workflow 
    2009.05.07 14:09:17.801 PDT: [INFO]  Generating code for the cleanup workflow -DONE 
    2009.05.07 14:09:19.181 PDT: [ERROR]  Rescued /tmp/scb-test-17713699551907048940.log as /tmp/scb-test-17713699551907048940.log.000 
    2009.05.07 14:09:19.212 PDT: [INFO]   
    2009.05.07 14:09:19.224 PDT: [INFO]  Checking all your submit files for log file names. 
    2009.05.07 14:09:19.236 PDT: [INFO]  This might take a while...  
    2009.05.07 14:09:19.248 PDT: [INFO]  Done. 
    2009.05.07 14:09:19.263 PDT: [INFO]  ----------------------------------------------------------------------- 
    2009.05.07 14:09:19.280 PDT: [INFO]  File for submitting this DAG to Condor           : scb-test-1.dag.condor.sub 
    2009.05.07 14:09:19.310 PDT: [INFO]  Log of DAGMan debugging messages                 : scb-test-1.dag.dagman.out 
    2009.05.07 14:09:19.320 PDT: [INFO]  Log of Condor library output                     : scb-test-1.dag.lib.out 
    2009.05.07 14:09:19.332 PDT: [INFO]  Log of Condor library error messages             : scb-test-1.dag.lib.err 
    2009.05.07 14:09:19.344 PDT: [INFO]  Log of the life of condor_dagman itself          : scb-test-1.dag.dagman.log 
    2009.05.07 14:09:19.356 PDT: [INFO]   
    2009.05.07 14:09:19.368 PDT: [INFO]  -no_submit given, not submitting DAG to Condor.  You can do this with: 
    2009.05.07 14:09:19.380 PDT: [INFO]  "condor_submit scb-test-1.dag.condor.sub" 
    2009.05.07 14:09:19.392 PDT: [INFO]  ----------------------------------------------------------------------- 
    2009.05.07 14:09:19.404 PDT: [INFO]  Submitting job(s). 
    2009.05.07 14:09:19.416 PDT: [INFO]  Logging submit event(s). 
    2009.05.07 14:09:19.428 PDT: [INFO]  1 job(s) submitted to cluster 347. 
    2009.05.07 14:09:19.440 PDT: [INFO]   
    2009.05.07 14:09:19.452 PDT: [INFO]  I have started your workflow, committed it to DAGMan, and updated its 
    2009.05.07 14:09:19.464 PDT: [INFO]  state in the work database. A separate daemon was started to collect 
    2009.05.07 14:09:19.476 PDT: [INFO]  information about the progress of the workflow. The job state will soon 
    2009.05.07 14:09:19.488 PDT: [INFO]  be visible. Your workflow runs in base directory.  
    2009.05.07 14:09:19.500 PDT: [INFO]   
    2009.05.07 14:09:19.512 PDT: [INFO]  cd /workp/oba/gmehta/SUBMIT/dags/gmehta/pegasus/scb-test/run0003 
    2009.05.07 14:09:19.524 PDT: [INFO]   
    2009.05.07 14:09:19.536 PDT: [INFO]  *** To monitor the workflow you can run *** 
    2009.05.07 14:09:19.548 PDT: [INFO]   
    2009.05.07 14:09:19.560 PDT: [INFO]  pegasus-status -w scb-test-1 -t 20090507T140914-0700  
    2009.05.07 14:09:19.572 PDT: [INFO]  or 
    2009.05.07 14:09:19.584 PDT: [INFO]  pegasus-status /workp/oba/gmehta/SUBMIT/dags/gmehta/pegasus/scb-test/run0003 
    2009.05.07 14:09:19.596 PDT: [INFO]   
    2009.05.07 14:09:19.608 PDT: [INFO]  *** To remove your workflow run *** 
    2009.05.07 14:09:19.620 PDT: [INFO]   
    2009.05.07 14:09:19.632 PDT: [INFO]  pegasus-remove -d 347.0 
    2009.05.07 14:09:19.644 PDT: [INFO]  or 
    2009.05.07 14:09:19.656 PDT: [INFO]  pegasus-remove /workp/oba/gmehta/SUBMIT/dags/gmehta/pegasus/scb-test/run0003 
    2009.05.07 14:09:19.668 PDT: [INFO]   
    2009.05.07 14:09:19.681 PDT: [INFO]  Time taken to execute is 4.871 seconds 
    2009.05.07 14:09:19.681 PDT: [INFO] event.pegasus.planner planner.version 2.4.0cvs  - FINISHED 
    
    

Voicecall on May 12th, 2009

Topics to discuss after demonstration

Condor installation

Right now Condor is running as user gmehta. No other user can use it.

Options

  1. One way to get around this is to run Condor as root. That way multiple users can use the same Condor installation.
  2. Copy the installation to Peggy's account, so that Condor also runs as user peggy.

If Condor is run as a regular user, it runs at a lower priority, so occasionally Condor is not responsive and Condor commands take longer to execute. Part of the reason is that we are running Condor directly on the machine instead of going via PBS.

Population of Replica Catalog

Right now we have the mappings for the input data for the sample workflow in a file-based replica catalog.

The input data is pulled from a GridFTP server at ISI.

Where is the input data hosted as and when it is generated? We need to stand up a GridFTP server in front of it to stage data for the workflows.

Also, the locations of the input data need to be catalogued in the replica catalog for DB to use.

Proxy for Peggy

Right now user gmehta uses Karan's proxy to stage in the data. One option is for Gaurang to generate a user certificate for Peggy from the CA he runs at ISI.
If Peggy has DOE certs, we can use them.
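For reference, a proxy like the one shown in the pollux instructions above is created with grid-proxy-init; the validity period below is just an example matching the roughly three-day lifetime seen there:

# Create a proxy valid for 72 hours (duration is an example)
grid-proxy-init -valid 72:00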

Missing output files

Sometimes a job does not create an output file that is referred to in the DAX for that job.

For example, in the sample workflow the fcst_tide job does not create the output file 2008060409_2008060415_avg.nc.

Hence the stage-out job for fcst_tide fails.

One way around this is for the wrapper script to create an empty output file when the code exits successfully but a declared output file was not created; see the sketch below.
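A minimal sketch of that workaround, assuming the wrapper has the job's output LOF file at hand (the LOF name is from the scb-dax-gen example above; everything else is a placeholder):

# After the code exits successfully, touch any declared output
# file that was not produced (sketch, not the actual wrapper logic)
output_lof=fcst_output_2008060409_lof
while read f; do
    [ -e "$f" ] || touch "$f"
done < "$output_lof"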
