Request for No GridFTP server on the submit host

Vahi 11:30, 18 July 2008 (PDT)

Requested by Duncan Brown

LIGO wants to decrease the amount of software that needs to be installed on the submit host.
Currently Pegasus requires a GridFTP server on the submit node as data needs to be transferred back to the local pool for LIGO.

However, since LIGO always wants the data on the submit node, it may be possible to have the stage-out jobs
execute in the local universe, with the data being pulled from the remote side.
This approach may have some issues if there are firewalls on the remote side. A sketch of such a local-universe stage-out job is shown below.
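
As an illustration only (not necessarily how Pegasus would implement it), a stage-out job running in Condor's local universe on the submit host could look roughly like the snippet below. The executable path, URLs and file names are hypothetical.

  # hypothetical submit file for a local-universe stage-out job;
  # the job runs on the submit host and pulls the data from the remote site,
  # so no GridFTP server is required on the submit host itself
  universe   = local
  executable = /usr/bin/globus-url-copy
  arguments  = gsiftp://remote.site.example.edu/scratch/run01/result.xml file:///home/ligo/run01/result.xml
  output     = stage_out_result.out
  error      = stage_out_result.err
  log        = stage_out_result.log
  queue

Because the transfer is initiated from the submit host, the remote site's firewall still needs to allow the GridFTP control and data channel connections.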

Request for deploying the worker package dynamically

Vahi 11:47, 18 June 2008 (PDT)

Chad Hanna requested this feature on the weekly voice call held on June 8, 2008.

A first stab at an implementation of this feature was checked in on June 16, 2008.

This feature is also being tracked through the Pegasus Bugzilla: http://vtcpc.isi.edu/bugzilla/show_bug.cgi?id=35

The first stab at the implementation does the following:

  • The worker package is staged automatically to the remote site by adding a
    setup transfer job to the workflow. By default the setup transfer job uses GUC to stage the data, but this can be configured by setting the pegasus.transfer.setup.impl property. If you have pegasus.transfer.*.impl set in your properties file, then you need to set pegasus.transfer.setup.impl to GUC.
  • The code discovers the worker package by looking up pegasus::worker in the
    transformation catalog. For the time being, you will have to put the entries in the transformation catalog yourself. The location of the appropriate worker package can be picked up from http://pegasus.isi.edu/mapper/code.php#Worker_Packages . In the future, we will automatically look up this link to determine the locations. Note that the basenames of the URLs should not be changed; Pegasus parses the basename to determine the version of the worker package.
  • An untar job is added to the workflow after the setup job; it untars
    the worker package on the remote site. It defaults to /bin/tar, but this can be overridden by specifying an entry for tar in the transformation catalog for a particular site. A sketch of the corresponding property and catalog entries follows this list.
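
As a rough illustration, the configuration might look like the snippet below. The site name ligo_site, the worker package URL placeholder, and the sysinfo values are made up, and the exact column layout depends on the format of your transformation catalog.

  # property: only needed if pegasus.transfer.*.impl is also set;
  # the setup transfer job defaults to GUC otherwise
  pegasus.transfer.setup.impl        GUC

  # transformation catalog entries (site name and sysinfo are placeholders;
  # substitute the worker package URL from the page above without changing its basename)
  # site       logical name      physical location                      type           sysinfo
  ligo_site    pegasus::worker   <worker-package-url-from-page-above>   STATIC_BINARY  INTEL32::LINUX
  ligo_site    tar               /bin/tar                               INSTALLED      INTEL32::LINUX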

Request for specifying job prefixes for creating jobnames

Vahi 13:21, 3 June 2008 (PDT)

The ihope workflow executes multiple workflows from the same submit directory, so the job submit files are clobbered/overwritten. Normally, Pegasus generates a new submit directory for each workflow, but that is not applicable in this scenario. LIGO requested a way of specifying a job prefix on the command line.

This feature has been implemented and is exposed via the --job-prefix option to pegasus-plan.
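
For example, an invocation might look like the following (the DAX file, submit directory and prefix value are illustrative only):

  pegasus-plan --dax ihope_inspiral.dax --dir /home/ligo/ihope/submit --job-prefix inspiral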

Request for pattern-based replica selection

Vahi 15:40, 24 March 2008 (PDT)
Requested by Scott Koranda from the LIGO Milwaukee team

Workshop on Inspiral Workflows, November 6-7, 2007

Feature Request for Condor Submit Files and DAG variable substitution

Vahi 16:39, 9 November 2007 (PST)

Requested by Scott Koranda and Nick from the LIGO Milwaukee team

Currently, Pegasus generates a submit file for each node in the workflow.
However, there is a feature in Condor and DAGMan where one can have a generic
submit file for classes of jobs and then do variable substitution in the DAGMan
file.

For example, for all inspiral thinca jobs in a workflow, there would be only a
single submit file, with the different arguments being passed as variable
substitutions in the .dag file.

The inspiral glue code currently generates DAGs in that form. A sketch of what such a generic submit file and .dag file look like is shown below.
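
A rough sketch of the DAGMan variable substitution mechanism, with made-up job names, paths and arguments:

  # thinca.sub -- one generic submit file shared by all thinca jobs
  universe   = vanilla
  executable = /path/to/lalapps_thinca
  arguments  = $(macroarguments)
  output     = $(macrojobname).out
  error      = $(macrojobname).err
  log        = thinca.log
  queue

  # workflow.dag -- per-job values supplied via VARS
  JOB  thinca_0001 thinca.sub
  VARS thinca_0001 macrojobname="thinca_0001" macroarguments="--gps-start-time 800000000 --gps-end-time 800002048"
  JOB  thinca_0002 thinca.sub
  VARS thinca_0002 macrojobname="thinca_0002" macroarguments="--gps-start-time 800002048 --gps-end-time 800004096"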

The above issue is being tracked via Bugzilla:

http://vtcpc.isi.edu/bugzilla/show_bug.cgi?id=17

Moving a failed computation to another grid site

Vahi 16:45, 9 November 2007 (PST)

Feature requested by Duncan Brown

Pegasus currently can do workflow reduction on the basis of the existing data
products in the replica catalog. However, it cannot do reduction on the basis
of the data products created for a particular run.

LIGO wants a feature whereby a workflow that failed on grid site X can be
replanned and submitted to another grid site Y, reusing the intermediate data
products that were already created.

An important point to note is that, in the LIGO case, the intermediate data
products are always shipped back to the submit host, and for the replan
operation LIGO wants to use the intermediate products that were transferred
back to the submit host.

LDR does not support registration for all users.
Duncan wants a single text file created in the submit directory that contains
these mappings, and then the ability to replan from it. An example of what such a mapping file could look like is sketched below.
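
Purely as an illustration of the idea (the file names are placeholders, and the format is assumed to follow the Pegasus file-based replica catalog convention), such a mapping file could look like:

  # LFN                               PFN                                                        attributes
  H1-INSPIRAL-800000000-2048.xml      file:///home/ligo/run01/H1-INSPIRAL-800000000-2048.xml     pool="local"
  L1-INSPIRAL-800000000-2048.xml      file:///home/ligo/run01/L1-INSPIRAL-800000000-2048.xml     pool="local"

On replanning for site Y, the workflow could then be reduced against these entries, staging the listed files from the submit host instead of recomputing them.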

The above issue is being tracked via Bugzilla:

http://vtcpc.isi.edu/bugzilla/show_bug.cgi?id=18

Planning Terminology of Miron

Vahi 16:53, 9 November 2007 (PST)

Miron characterizes planning as follows:

  • Eager Planning

    Before submitting the workflow, we map it to a grid site. This is what we internally refer to in Pegasus as full-ahead planning.

  • Lazy Planning

    A job/DAG is planned when the DAGMan managing the outer-level DAG determines that the job is ready to run, purely on the basis of the graph dependencies.
    This is what we internally call deferred planning in Pegasus.

  • Just-in-Time Planning

    This is currently not supported in any configuration of Pegasus. Here the decision to map a job is made at the matchmaking point,
    i.e. when the matchmaker matches the job to a particular site. In the LIGO context, where the smallest unit of work is a DAG, this would involve
    the matchmaker telling Pegasus via a callout to plan the workflow for that resource.
    This is for the future, when tighter integration of DAGMan and Pegasus happens.

Setup of the PTC at ISI for LIGO workflows

Vahi 13:58, 23 October 2007 (PDT)

Set up a provenance catalog (PTC) server at ISI to populate the provenance records of the LIGO workflows run by Britta.

The following properties need to be set to populate the server:

pegasus.catalog.provenance                    InvocationSchema
pegasus.catalog.provenance.db.driver          MySQL
pegasus.catalog.provenance.db.url             jdbc:mysql://devrandom.isi.edu/ligo_pegasus
pegasus.catalog.provenance.db.user            ligo
pegasus.catalog.provenance.db.password        caltech

Note: the database can only be populated if your submit host is osg-itb-se.ligo.caltech.edu.
Additionally, the database can be accessed by the user ligo from the following hosts:

sukhna.isi.edu
devrandom.isi.edu (log on to devrandom and then connect to the host as localhost)

Vahi 14:40, 25 October 2007 (PDT)

While this setup was in use, a bug was identified and fixed.
Details can be found at:

http://vtcpc.isi.edu/bugzilla/show_bug.cgi?id=11

Interaction with Scott Koranda

September 2007

The Pegasus team mentioned to Scott that we may be able to provide a "super Pegasus node" similar to the one provided in Condor, where the user can do a simple parameter sweep with a single workflow node. This is a possible extension of the Pegasus DAX schema, or rather a higher-level specification which would make a DAX easier to construct.

We may also want to look into simple data-flow patterns (fan-out, fan-in) to simplify workflow construction.

  • On 9/25/07 Scott Koranda requested a feature which would provide more fault tolerance in the way we deal with RLS: basically, if RLS fails to respond, implement an exponential backoff with a fixed number of tries. During one of the workflow planning runs, it was discovered that the RLS at Milwaukee was intermittently down due to either
    1) high load, or
    2) a temporary loss of internet connectivity.

If the RLS server/client is able to distinguish these conditions, then an exponential backoff scheme could be implemented (a rough sketch of such a retry loop is included at the end of this section).

  • On 10/08/2007 Karan volunteered to follow up with the RLS developers on whether this is possible.
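
A minimal sketch of the requested retry behavior, assuming a hypothetical query_rls() callable that raises an exception when the RLS server does not respond:

  import random
  import time

  def query_with_backoff(query_rls, max_tries=5, base_delay=1.0):
      """Retry an RLS lookup, doubling the wait between a fixed number of tries."""
      for attempt in range(max_tries):
          try:
              return query_rls()
          except Exception:
              if attempt == max_tries - 1:
                  raise  # give up after the final try
              # wait 1s, 2s, 4s, ... plus a little jitter before retrying
              time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))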

Meeting with Kent Blackburn at Caltech 9/24/07

We need to work out an MOU between LIGO and ISI to enable ISI to more directly help fix problems and troubleshoot issues.

The 2nd-year milestone for LIGO in OSG will be Einstein@Home. It is not clear what the role of Pegasus may be; perhaps allowing the download of multiple jobs and their data, and then running these jobs using Pegasus and DAGMan on the OSG.

Issues of late binding are coming up as well. Ewa pointed out that we are looking at a related approach of acquiring resources via Condor glide-ins and then scheduling onto them.

Note: when we are talking about late binding, we also need to consider job migration, which is a whole research topic in itself.

SRM might be important for LIGO in the future.

Other possible LIGO applications: burst. We may want to help port them to the OSG. Burst applications themselves are short-running, but they are wrapped by scripts which can run over periods of days. We may need to break them up and use clustering techniques to improve behavior and performance.

Kent suggested a specialized location for user logs so that users do not delete them. (Britta deletes all logs once jobs are done).

Ewa also had a discussion with Stuart. He wanted to better understand Pegasus functionality and how it works with Condor-C.
