Notice: The ensemble manager is no longer supported

Ensemble Manager: Managing multiple workflows

The Ensemble Manager (EM) component of the SR system supports the creation and execution of multiple workflows at the same time.

Current workflow systems allow only sequential or uncoordinated creation and execution of a single workflow. The Ensemble Manager we will develop will coordinate and efficiently handle the planning and execution of hundreds to thousands of workflows simultaneously on the Grid.

The EM will manage sets of workflows, with each set specified as a workflow ensemble. A workflow ensemble may, for example, contain the pool of candidate workflows being considered for a given step in the workflow generation process. The EM will be invoked to perform on workflow ensembles any of the generation, ranking, planning, and execution steps of the workflow generation processes.

Detailed documentation is available on formulating requests for the workflow system for SE18, along with background material giving a general description of workflow requests.

Design

In our design, the EM provides operations to submit, plan, execute, and monitor several workflows simultaneously. A workflow is submitted with an indication of the kind of refinement operation it needs to go through. The submission may optionally indicate that the workflow is part of a workflow ensemble.

We assume that a workflow and all its associated information are stored on a local file system. We differentiate between workflow submission and workflow start: a workflow is submitted with a start time at which the EM is expected to begin processing it. This way we can model the following cases:

  • Submit the workflow and indicate a start time of infinity
  • Submit the workflow and indicate a specific start time in the future
  • Submit the workflow and indicate an immediate start time
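The three cases can be pictured as seed-configuration fragments. This is purely illustrative — the actual key names and value syntax are defined by the seed configuration files in $ENSEMBLE_HOME/examples (the seed configuration section below notes that starttime, walltime, and priority can be set there):

```
# Hypothetical seed-configuration fragments; key names and value syntax
# are assumptions, not the documented format.
starttime=infinity                 # case 1: submit only, never start automatically
starttime=2008-09-15T09:00:00Z     # case 2: start at a specific future time
starttime=now                      # case 3: start immediately
```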

Installation

Prerequisites

  • ANT
  • JAVA 1.5+
  • PEGASUS 2.2.0CVS
  • WINGS
  • CONDOR 7.1.0 (only this version is supported)
  • MYSQL

Build and Install Wings

  • Set env WINGS_HOME to the Wings directory
    $ export WINGS_HOME=<path to wings>
  • source $WINGS_HOME/setenv.sh (if your shell is bash)
    $ source $WINGS_HOME/setenv.sh

Build Pegasus

  • Set env PEGASUS_HOME to the checkout directory
    $ export PEGASUS_HOME=<path to pegasus-svn-checkout>
  • source $PEGASUS_HOME/setup-devel.sh (if your shell is bash)
    $ source $PEGASUS_HOME/setup-devel.sh
  • Build pegasus using ant
    $ ant clean dist

Install Pegasus

Copy $PEGASUS_HOME/dist/pegasus-binary-*.tar.gz to a temporary location and untar it
$ cp $PEGASUS_HOME/dist/pegasus-binary-*.tar.gz /tmp
$ cd <path to software installation directory>
$ gtar zxvf /tmp/pegasus-binary-*.tar.gz

  • Set env PEGASUS_HOME to the binary installation path
    $ export PEGASUS_HOME=</path to binary install>
  • Remove wings jar included in Pegasus
    $ rm $PEGASUS_HOME/lib/wings.jar

Build Ensemble Manager

  • Download the Ensemble Manager code from SVN
  • Set env ENSEMBLE_HOME to the checked out directory

    $ export ENSEMBLE_HOME=<path checked out>

  • source $ENSEMBLE_HOME/setup-devel.csh if your shell is csh, or setup-devel.sh if your shell is bash

    $ source $ENSEMBLE_HOME/setup-devel.sh

  • Run ant clean package

    $ ant clean package

Buildfile: build.xml

clean:
{panel}
   [delete] /nfs/asd2/gmehta/jbproject/Ensemble/dist not found.
   [delete] /nfs/asd2/gmehta/jbproject/Ensemble/build not found.
{panel}

init:
{panel}
    [mkdir] Created dir: /nfs/asd2/gmehta/jbproject/Ensemble/dist/ensemble
    [mkdir] Created dir: /nfs/asd2/gmehta/jbproject/Ensemble/build/src
     [echo] full ISO timestamp: 
{panel}

compile:
{panel}
    [javac] Compiling 28 source files to /nfs/asd2/gmehta/jbproject/Ensemble/build/src
{panel}

...
...
{panel}
 [mkdir] Created dir: /nfs/asd2/gmehta/jbproject/Ensemble/dist/ensemble/var
     [copy] Copying 5 files to /nfs/asd2/gmehta/jbproject/Ensemble/dist/ensemble
     [gzip] Building: /nfs/asd2/gmehta/jbproject/Ensemble/dist/ensemble.tar.gz
   [delete] Deleting: /nfs/asd2/gmehta/jbproject/Ensemble/dist/ensemble.tar
{panel}
  • A tarball will be created in $ENSEMBLE_HOME/dist/ensemble.tar.gz

Install Ensemble Manager

  • Copy the binary tarball built in the earlier step to an installation location and untar it

    $ gtar zxvf $ENSEMBLE_HOME/dist/ensemble.tar.gz

  • Set environment ENSEMBLE_HOME to the untarred directory

    $ export ENSEMBLE_HOME=<path to ensemble binary directory>

  • Set up the remaining paths, in the correct order:

$ unset CLASSPATH

$ source $WINGS_HOME/setenv.sh

$ source $PEGASUS_HOME/setup.sh

$ source $ENSEMBLE_HOME/setup.sh

Create the Ensemble DB

  • As the MySQL root user, create a database for storing the ensemble schema

create database <databasename>;

  • Add a username and password with access to this database

grant all on <databasename>.* to <username>@"<hostname>" identified by "<password>";

flush privileges;

  • Populate the created database with the ensemble schema from $ENSEMBLE_HOME/sql/ensemble.sql

mysql -u <username> -p <databasename> < $ENSEMBLE_HOME/sql/ensemble.sql

Edit the Ensemble configuration file

Edit $ENSEMBLE_HOME/etc/properties, or create a file $HOME/.ensemblerc, with the following entries:

condor.home=<path to condor install home directory>
pegasus.home=<path to pegasus install home directory $PEGASUS_HOME>
wings.home=<path to wings install>
ensemble.db.url=<jdbc url to the ensemble db, e.g. jdbc:mysql://smarty.isi.edu/ensembledb>
ensemble.db=MySQL
ensemble.db.user=<dbusername>
ensemble.db.password=<dbpassword>
ensemble.localdir=<path where the ensemble workflows are planned and dags are generated. Default is $ENSEMBLE_HOME/var>
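A filled-in example may make this clearer. All paths, hostnames, and credentials below are illustrative values, not defaults:

```
# Example $HOME/.ensemblerc with made-up values
condor.home=/opt/condor-7.1.0
pegasus.home=/opt/pegasus
wings.home=/opt/wings
ensemble.db.url=jdbc:mysql://localhost/ensembledb
ensemble.db=MySQL
ensemble.db.user=emuser
ensemble.db.password=secret
ensemble.localdir=/opt/ensemble/var
```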

Edit the log4j configuration file

log4j.rootCategory=DEBUG, File, Console, Socket

log4j.logger.anchor.datametrics=OFF
log4j.logger.com.hp.hpl.jena=OFF
log4j.logger.org.griphyn=DEBUG
log4j.logger.pegasus=DEBUG
log4j.logger.edu.isi=DEBUG

#
# The default file appender
#
log4j.appender.File=org.apache.log4j.RollingFileAppender
log4j.appender.File.Threshold=DEBUG
log4j.appender.File.File=/tmp/ensemblemanager.log
log4j.appender.File.layout=org.apache.log4j.SimpleLayout
log4j.appender.File.Append=true
log4j.appender.File.MaxFileSize=100MB


#
# Console Appender
#
log4j.appender.Console=org.apache.log4j.ConsoleAppender
log4j.appender.Console.layout=org.apache.log4j.SimpleLayout
log4j.appender.Console.Threshold=INFO


#
# The Socket Appender
#
log4j.appender.Socket=org.apache.log4j.net.SocketAppender
log4j.appender.Socket.Threshold=INFO
log4j.appender.Socket.RemoteHost=artemis.stdc.com
log4j.appender.Socket.Port=40940
log4j.appender.Socket.ReconnectionDelay=5000
log4j.appender.Socket.LocationInfo=true

The sample log4j.properties shown here is shipped in $ENSEMBLE_HOME and is used by default. You can modify any of the properties in $ENSEMBLE_HOME/log4j.properties, or provide an alternative log4j.properties file by passing the option -Dlog4j.configuration=file:/path/to/log4j.properties/file

Note that you need to add the following entries to any non-standard log4j.properties file for all the logging to appear correctly:

log4j.logger.anchor.datametrics=OFF
log4j.logger.com.hp.hpl.jena=OFF
log4j.logger.org.griphyn=DEBUG
log4j.logger.pegasus=DEBUG
log4j.logger.edu.isi=DEBUG

Create a portfolio file

The portfolio file given as input to the ensemble is a simple two-column file:

Seed1 /path/to/seed-config1
Seed2 /path/to/seed-config2
...

e.g. $ENSEMBLE_HOME/examples/portfolio
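As a quick sketch, you can create a minimal portfolio file and sanity-check the two-column format before submitting. The seed names and config paths here are made up for illustration; use your own seed configuration files:

```shell
# Create an illustrative two-column portfolio file
cat > /tmp/portfolio <<'EOF'
Seed1 /tmp/seed-config1
Seed2 /tmp/seed-config2
EOF

# Sanity check: every non-empty line must have exactly two fields,
# a seed name and the path to its seed configuration file
awk 'NF && NF != 2 { bad = 1 } END { exit bad }' /tmp/portfolio \
  && echo "portfolio format OK"
```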

Create a seed configuration file

See $ENSEMBLE_HOME/examples/seed1.config or seed2.config for examples.

Please note that the checked-in seed2.config has a typo: change viz.isi.edu to viz-login.isi.edu. I will try to fix this in SVN.

There are several paths and values that you may need to change to reflect the locations of Wings, Pegasus, the site catalog, etc.

https://wiki.boozallenet.com/tangram/index.php/Formulating_Requests_for_the_Workflow_System_in_SE18#Creating_Request_Qualifiers

You can also define the starttime, walltime, priority, etc. for the workflow.

Create or copy a site catalog

You can create a site catalog based on the instructions here:

https://wiki.boozallenet.com/tangram/index.php/SR-SE-18-SiteCatalog-Instructions

An example site catalog which works for isi_viz and isi_wind is at

$ENSEMBLE_HOME/examples/sites.xml

You will need to change the workdir and storage parameters to reflect paths you can write to on the viz and wind clusters.

Clients

sr-submit

sr-submit takes a portfolio file and submits it for execution on a given site or list of sites, with a given output location.

sr-submit -s isi_viz -o local -p $ENSEMBLE_HOME/examples/portfolio


INFO - ts=2008-09-14T08:56:49.205291Z event=event.ensemble.parse.start msgid=4cfa4870-c485-409a-ae84-f393d9f5031e eventId=event.ensemble.parse_b70373b9-27ac-4270-a8b8-a23c9c19638d portfolio.id=c0559b23-942c-49ae-84ce-baefe4a48b11 prog=EnsembleManager 
INFO - ts=2008-09-14T08:56:49.216896Z event=event.ensemble.parse msgid=c9450b64-0ddf-48ef-b7ce-b0c3e45795b4 eventId=event.ensemble.parse_b70373b9-27ac-4270-a8b8-a23c9c19638d msg="Parsing seed \"SE18-SingleGroupDetector-Tangram\" with config \"/Users/gmehta/Documents/NetBeansProjects/Ensemble/examples/seed1.config\"" 
INFO - ts=2008-09-14T08:56:49.223372Z event=event.ensemble.parse.end msgid=5ef6fa0e-afac-418d-a1fc-063bfe0a8bc8 eventId=event.ensemble.parse_b70373b9-27ac-4270-a8b8-a23c9c19638d 
INFO - ts=2008-09-14T08:56:49.224645Z event=event.ensemble.parse.start msgid=3233690a-e8d9-4973-bbb5-a66144338f0f eventId=event.ensemble.parse_4b0f5881-43d9-40a4-a57f-ab4f896d3873 portfolio.id=c0559b23-942c-49ae-84ce-baefe4a48b11 prog=EnsembleManager 
INFO - ts=2008-09-14T08:56:49.225056Z event=event.ensemble.parse msgid=a4a0c004-d0e4-4106-9a93-83d7aa733db6 eventId=event.ensemble.parse_4b0f5881-43d9-40a4-a57f-ab4f896d3873 msg="Parsing seed \"SE18-SingleGroupDetector-Tangram\" with config \"/Users/gmehta/Documents/NetBeansProjects/Ensemble/examples/seed2.config\"" 
INFO - ts=2008-09-14T08:56:49.231395Z event=event.ensemble.parse.end msgid=19d8228e-dc3e-4772-9b86-25e3bf78bdbd eventId=event.ensemble.parse_4b0f5881-43d9-40a4-a57f-ab4f896d3873 
INFO - ts=2008-09-14T08:56:49.232766Z event=event.ensemble.submit.start msgid=5c073eed-1375-4b21-a8b9-b525a494bfa4 eventId=event.ensemble.submit_efc0511e-e70e-4abc-9212-9487f8c9cecd portfolio.id=c0559b23-942c-49ae-84ce-baefe4a48b11 prog=EnsembleManager 
INFO - ts=2008-09-14T08:56:49.842701Z event=event.id.creation msgid=d5df18ce-97bf-40b0-a776-55878592efe7 eventId=idHierarchy_ce3cf3a6-d4cc-44e1-a371-b2a866d3d8f9 parent.id.type=portfolio.id parent.id=c0559b23-942c-49ae-84ce-baefe4a48b11 child.ids.type=seed.id child.ids={74207515-ae19-432a-bc13-f956074491a1,d08f2e50-beef-4385-9fbe-0da7e9934bf0,} 
EnsembleId = c0559b23-942c-49ae-84ce-baefe4a48b11
WorkflowID = 74207515-ae19-432a-bc13-f956074491a1
WorkflowID = d08f2e50-beef-4385-9fbe-0da7e9934bf0


This shows that the submitted portfolio was assigned the id EnsembleId, and each seed was assigned a seed id (WorkflowID).


You can additionally call the client with a non-default log4j.properties file like this:

sr-submit -Dlog4j.configuration=file:/path/to/log4j.properties/file -s isi_viz,isi_wind -o local -p portfolio

See above for an example log4j.properties file, or $ENSEMBLE_HOME/log4j.properties.

Debugging the submission

EM creates a directory structure for each submission in $ENSEMBLE_HOME/var, or in the directory pointed to by the ensemble.localdir property in the properties file.

The directory structure created has the following format: {$user}/{portfolioid}/{seedid}/...

Each EM submission results in a portfolioid (called the EM id) being created, along with several seedids (called workflowids by the EM).

Each portfolioid directory contains an ensemble.in file (an input file created by the sr-submit client for internal use) and an ensemble.dag file that coordinates the run. When the ensemble runs, other files are created, e.g. ensemble.dag.dagman.out, which has information about the overall health of the portfolio.
In case the portfolio fails, a rescue file is created which can be resubmitted by running condor_submit_dag on the ensemble.dag.rescuexxx file. This is only useful when the errors are due to transient system failures and not due to path errors, etc.
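The layout described above can be sketched as follows. The user name and ids here are made-up stand-ins; the real portfolioid and seedid values are UUIDs, as in the sr-submit output above:

```shell
# Sketch of the per-submission directory layout EM creates:
# {$user}/{portfolioid}/{seedid}/..., with portfolio-level files alongside
base=/tmp/ensemble-var                        # stands in for $ENSEMBLE_HOME/var
mkdir -p "$base/gmehta/portfolio-0001/seed-0001"
touch "$base/gmehta/portfolio-0001/ensemble.in" \
      "$base/gmehta/portfolio-0001/ensemble.dag"
find "$base" -mindepth 1 | sort
```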

Each workflowid/seedid directory contains the jobs for Wings, ranking, Pegasus planning, etc.

WINGS

The Wings submit file is wings.sub.
The stdout of the Wings job goes to wings.out.
The stderr of the Wings job goes to wings.err.

Wings writes a log file to wings_log/{seedid}.log.
Wings also writes the output DAX to the wings_output directory.

RANKING

The ranking submit file is rank.sub.
The stdout of the rank job goes to rank.out.
The stderr of the rank job goes to rank.err.

Ranking writes out the ranked DAXes in rank_output.
Ranking also writes out a rank file called ranked.in, which is used as input to subsequent steps.

PEGASUS PLANNING

Depending on the number of DAXes generated and the rank.top factor set in the seed configuration file, planning jobs are created in the planner_dag directory.

The job for planning each DAX is planner_dag/dax_id.sub.

The output of each planning job goes to planner_dag/dax_id.out.
The error of each planning job goes to planner_dag/dax_id.err.

Planning generates the executable DAG in the planner_dag/dax_id directory, which contains the necessary DAG and submit files.

PLANNED WORKFLOW EXECUTION

The planned workflow is executed from the planner_dag/dax_id directory generated during planning.

Inside this directory you can look at the dax_id.dag.dagman.out file to see which job failed.
You can then trace the jobid.out.??? file to see the XML output.

Look at the exitcode at the top; it tells you what exit code the job exited with.
Look at the stdout and stderr sections in the XML file; they show the STDOUT and STDERR of the application itself for further debugging.
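As a sketch of this debugging step, the exit code can be pulled out of a job's XML output record with grep. The XML below is a minimal stand-in written for illustration, not the actual schema of the record:

```shell
# Fabricate a minimal stand-in for a job's XML output record
cat > /tmp/jobid.out.000 <<'EOF'
<invocation exitcode="1">
  <stdout>application output</stdout>
  <stderr>application error</stderr>
</invocation>
EOF

# Extract the exitcode attribute to see how the job exited
grep -o 'exitcode="[0-9]*"' /tmp/jobid.out.000
```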

ensemble-status

To monitor the workflows, run ensemble-status on a single seed or portfolio id.

ensemble-status -e d7065da3-421d-47ac-b9aa-9447cf9f231c

SUBMITTED    STARTED    PLANNED    RUNNING    SUCCESS    FAILED
{panel}
    0           4          3          2          0         1
{panel}
ensemble-status -e d7065da3-421d-47ac-b9aa-9447cf9f231c --verbose

Workflow : 09523b5a-421d-4cbf-8819-d796b4a09798 2007-11-14 14:29:03.0 STARTED
Workflow : 0e9c2bba-36f6-47ac-9059-1b811b279903 2007-11-14 14:29:01.0 STARTED
Workflow : 3108d383-d83f-4910-b9aa-027c5aa439b7 2007-11-14 14:29:02.0 STARTED
Workflow : 32957769-ee00-468f-9f1a-9447cf9f231c 2007-11-14 14:30:01.0 PLANNED
Workflow : 489eb95b-9d76-4808-849a-b75f36eb05e5 2007-11-14 14:31:03.0 PLANNED
Workflow : 662001c6-75e3-4e8f-b942-47cca5820645 2007-11-14 14:31:02.0 RUNNING
Workflow : 7e13f990-8874-4c70-a4ab-c5d75ce000d4 2007-11-14 14:31:12.0 RUNNING
Workflow : 8142a37c-4e5e-468d-b4ac-4060ef4c5d97 2007-11-14 14:40:11.0 SUCCESS
Workflow : ce19605e-16a0-4d26-a886-31fa791b785a 2007-11-14 14:40:12.0 SUCCESS
Workflow : d7065da3-dc70-4582-9f6e-8d372bacf980 2007-11-14 14:31:02.0 FAILED

SUBMITTED    STARTED    PLANNED    RUNNING    SUCCESS    FAILED
{panel}
    0           3          2          2          2         1
{panel}