Notice: The ensemble manager is no longer supported
Ensemble Manager: Managing multiple workflows
The Ensemble Manager (EM) component of the SR system is responsible for supporting the creation and the execution of multiple workflows at the same time.
Current workflow systems allow only sequential or uncoordinated creation and execution of a single workflow. The Ensemble Manager that we will develop will coordinate and efficiently handle the planning and execution of hundreds to thousands of workflows simultaneously on the Grid.
The EM will manage sets of workflows, with each set specified as a workflow ensemble. A workflow ensemble may, for example, contain the pool of candidate workflows being considered for a given step in the workflow generation process. The EM can be invoked to perform any of the generation, ranking, planning, and execution steps of the workflow generation process on a workflow ensemble.
Detailed documentation is available on the formulation of requests for the workflow system for SE18. There is also background on the general description of workflow requests.
Design
In our design, the EM provides several operations to submit, plan, execute, and monitor several workflows simultaneously. In this system, a workflow is submitted with an indication of the kind of refinement operation that it needs to go through. The submission may optionally indicate that the workflow is part of a workflow ensemble.
We assume that a workflow and all its associated information are stored on a local file system. We differentiate between workflow submission and workflow start: a workflow is submitted with a start time that tells the EM when to begin processing it. This lets us model the following cases:
- Submit the workflow and indicate a start time of infinity
- Submit the workflow and indicate a specific start time in the future
- Submit the workflow and indicate an immediate start time
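The three cases above can be illustrated with a minimal sketch. The representation is an assumption made for illustration only: the start time is held as seconds since the epoch, `math.inf` stands for "infinity", and `is_ready` is a hypothetical helper, not part of the EM.

```python
import math
import time

def is_ready(start_time: float, now: float) -> bool:
    """Return True once processing of the workflow should begin."""
    return start_time <= now

now = time.time()
held = math.inf          # case 1: submitted, but never started automatically
scheduled = now + 3600   # case 2: start one hour from now
immediate = now          # case 3: start right away

for label, start in (("held", held), ("scheduled", scheduled), ("immediate", immediate)):
    print(label, is_ready(start, now))
```

With this encoding a single comparison covers all three cases, because infinity compares greater than any finite time.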
Installation
Prerequisites
- ANT
- JAVA 1.5+
- PEGASUS 2.2.0CVS
- WINGS
- CONDOR 7.1.0 only.
- MYSQL
Build and Install Wings
- Download Wings from SVN, then build and install it following SR-SE-18_Wings_Instructions
- Set env WINGS_HOME to the Wings directory
$ export WINGS_HOME=<path to wings>
- Source $WINGS_HOME/setenv.sh (if your shell is bash)
$ source $WINGS_HOME/setenv.sh
Build Pegasus
- Download Pegasus from SVN at https://tangram.stdc.com/svn/SystemResearch/branches/pegasus/current/
- Set env PEGASUS_HOME to the checkout directory
$ export PEGASUS_HOME=<path to pegasus-svn-checkout>
- Source $PEGASUS_HOME/setup-devel.sh (if your shell is bash)
$ source $PEGASUS_HOME/setup-devel.sh
- Build Pegasus using ant
$ ant clean dist
Install Pegasus
- Copy $PEGASUS_HOME/dist/pegasus-binary-*.tar.gz to a temporary location and untar it in the installation directory
$ cp $PEGASUS_HOME/dist/pegasus-binary-*.tar.gz /tmp
$ cd <path to software installation directory>
$ gtar zxvf /tmp/pegasus-binary-*.tar.gz
- Set env PEGASUS_HOME to the binary installation path
$ export PEGASUS_HOME=<path to binary install>
- Remove the Wings jar included in Pegasus
$ rm $PEGASUS_HOME/lib/wings.jar
Build Ensemble Manager
- Download Ensemble Manager code from SVN at
- Set env ENSEMBLE_HOME to the checked out directory
$ export ENSEMBLE_HOME=<path checked out>
- Source $ENSEMBLE_HOME/setup-devel.csh if your shell is csh, or setup-devel.sh if your shell is bash
$ source $ENSEMBLE_HOME/setup-devel.sh
- Run ant clean package
$ ant clean package
Buildfile: build.xml
clean:
   [delete] /nfs/asd2/gmehta/jbproject/Ensemble/dist not found.
   [delete] /nfs/asd2/gmehta/jbproject/Ensemble/build not found.
init:
    [mkdir] Created dir: /nfs/asd2/gmehta/jbproject/Ensemble/dist/ensemble
    [mkdir] Created dir: /nfs/asd2/gmehta/jbproject/Ensemble/build/src
     [echo] full ISO timestamp:
compile:
    [javac] Compiling 28 source files to /nfs/asd2/gmehta/jbproject/Ensemble/build/src
...
...
    [mkdir] Created dir: /nfs/asd2/gmehta/jbproject/Ensemble/dist/ensemble/var
     [copy] Copying 5 files to /nfs/asd2/gmehta/jbproject/Ensemble/dist/ensemble
     [gzip] Building: /nfs/asd2/gmehta/jbproject/Ensemble/dist/ensemble.tar.gz
   [delete] Deleting: /nfs/asd2/gmehta/jbproject/Ensemble/dist/ensemble.tar
- A tarball will be created in $ENSEMBLE_HOME/dist/ensemble.tar.gz
Install Ensemble Manager
- Copy the binary tarball built in the earlier step to an installation location and untar it
$ gtar zxvf $ENSEMBLE_HOME/dist/ensemble.tar.gz
- Set env ENSEMBLE_HOME to the untarred directory
$ export ENSEMBLE_HOME=<path to ensemble binary directory>
- Configure other paths in correct order
$ unset CLASSPATH
$ source $WINGS_HOME/setenv.sh
$ source $PEGASUS_HOME/setup.sh
$ source $ENSEMBLE_HOME/setup.sh
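Before sourcing the setup scripts it can help to confirm that the three home variables point at real directories. The following is a minimal sketch of such a check; `missing_homes` is a hypothetical helper written for this page and is not shipped with the Ensemble Manager.

```python
import os

REQUIRED_VARS = ("WINGS_HOME", "PEGASUS_HOME", "ENSEMBLE_HOME")

def missing_homes(env=os.environ):
    """Return the required variables that are unset or are not directories."""
    return [name for name in REQUIRED_VARS
            if not os.path.isdir(env.get(name, ""))]

if __name__ == "__main__":
    for name in missing_homes():
        print(f"{name} is unset or does not point to a directory")
```

An empty result only means the directories exist; it does not verify that the installations inside them are complete.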
Create the Ensemble DB
- As user root, create a DB for storing the ensemble schema in MySQL
create database <databasename>;
- Add a username and password with access to this DB
grant all on <databasename>.* to <username>@"<hostname>" identified by "<password>";
flush privileges;
- Populate the created DB with the ensemble schema from $ENSEMBLE_HOME/sql/ensemble.sql
$ mysql -u <username> -p <databasename> < $ENSEMBLE_HOME/sql/ensemble.sql
Edit the Ensemble configuration file
Edit the $ENSEMBLE_HOME/etc/properties or create a file $HOME/.ensemblerc
condor.home=<path to condor install home directory>
pegasus.home=<path to pegasus install home directory $PEGASUS_HOME>
wings.home=<path to wings install>
ensemble.db.url=<jdbc url to ensemble db, e.g. jdbc:mysql://smarty.isi.edu/ensembledb>
ensemble.db=MySQL
ensemble.db.user=<dbusername>
ensemble.db.password=<dbpassword>
ensemble.localdir=<path where the ensemble workflows are planned and dags are generated. Default is $ENSEMBLE_HOME/var>
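A quick way to catch a missing entry before running the clients is to parse the file and compare it against the keys listed above. This is a sketch under two assumptions: the file uses plain `key=value` lines with `#` comments, and every key except `ensemble.localdir` (which has a default) must be present. The helper names and the sample values are hypothetical.

```python
REQUIRED_KEYS = (
    "condor.home", "pegasus.home", "wings.home",
    "ensemble.db.url", "ensemble.db",
    "ensemble.db.user", "ensemble.db.password",
)

def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props

def missing_keys(props):
    """Return the required keys that are absent or empty, in listed order."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]

# Hypothetical, deliberately incomplete configuration for illustration.
sample = """\
# paths below are placeholders
condor.home=/opt/condor
pegasus.home=/opt/pegasus
ensemble.db=MySQL
"""
print(missing_keys(parse_properties(sample)))
```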
Edit the Log4j.configuration file
log4j.rootCategory=DEBUG, File, Console, Socket
log4j.logger.anchor.datametrics=OFF
log4j.logger.com.hp.hpl.jena=OFF
log4j.logger.org.griphyn=DEBUG
log4j.logger.pegasus=DEBUG
log4j.logger.edu.isi=DEBUG
#
# The default file appender
#
log4j.appender.File=org.apache.log4j.RollingFileAppender
log4j.appender.File.Threshold=DEBUG
log4j.appender.File.File=/tmp/ensemblemanager.log
log4j.appender.File.layout=org.apache.log4j.SimpleLayout
log4j.appender.File.Append=true
log4j.appender.File.MaxFileSize=100MB
#
# Console Appender
#
log4j.appender.Console=org.apache.log4j.ConsoleAppender
log4j.appender.Console.layout=org.apache.log4j.SimpleLayout
log4j.appender.Console.Threshold=INFO
#
# The Socket Appender
#
log4j.appender.Socket=org.apache.log4j.net.SocketAppender
log4j.appender.Socket.Threshold=INFO
log4j.appender.Socket.RemoteHost=artemis.stdc.com
log4j.appender.Socket.Port=40940
log4j.appender.Socket.ReconnectionDelay=5000
log4j.appender.Socket.LocationInfo=true
The sample log4j.properties shown here is shipped in $ENSEMBLE_HOME and is used by default. You can modify any of the properties in $ENSEMBLE_HOME/log4j.properties, or provide an alternative log4j.properties file by passing the option -Dlog4j.configuration=file:/path/to/log4j.properties/file
Note that you need to add the following entries to any non-standard log4j.properties file for all the logging to appear correctly.
log4j.logger.anchor.datametrics=OFF
log4j.logger.com.hp.hpl.jena=OFF
log4j.logger.org.griphyn=DEBUG
log4j.logger.pegasus=DEBUG
log4j.logger.edu.isi=DEBUG
Create a portfolio file
The portfolio file that is input to ensemble is a simple two-column file: the seed name, followed by the path to that seed's configuration file.
Seed1 /path/to/seed-config1
Seed2 /path/to/seed-config2
...
For an example, see $ENSEMBLE_HOME/examples/portfolio
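To illustrate the two-column format, here is a minimal sketch of a reader for it. It assumes the columns are separated by whitespace and that blank lines can be ignored; neither detail is confirmed by the EM sources, and `parse_portfolio` is a hypothetical helper.

```python
def parse_portfolio(text):
    """Return a list of (seed_name, config_path) pairs from a two-column file."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected '<seed> <config path>'")
        entries.append((parts[0], parts[1].strip()))
    return entries

portfolio = "Seed1 /path/to/seed-config1\nSeed2 /path/to/seed-config2\n"
for seed, config in parse_portfolio(portfolio):
    print(seed, config)
```

Running a check like this before submission catches lines with a missing second column early.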
Create seed configuration file
For an example, see $ENSEMBLE_HOME/examples/seed1.config or seed2.config
Note that the checked-in seed2.config has a typo: change viz.isi.edu to viz-login.isi.edu. I will try to fix this in SVN.
There are several paths and values that you may need to change to reflect the paths of Wings, Pegasus, the site catalog, etc.
You can also define the starttime, walltime, priority, etc. for the workflow.
Create or copy site catalog
You can create a site catalog based on the instructions here:
https://wiki.boozallenet.com/tangram/index.php/SR-SE-18-SiteCatalog-Instructions
An example site catalog that works for isi_viz and isi_wind is at
$ENSEMBLE_HOME/examples/sites.xml
You will need to change the workdir and storage parameters to reflect paths that you can write to on the viz and wind clusters.
Clients
sr-submit
sr-submit takes a portfolio file and submits the portfolio for execution on a given site or list of sites, with a given output location.
$ sr-submit -s isi_viz -o local -p $ENSEMBLE_HOME/examples/portfolio
INFO - ts=2008-09-14T08:56:49.205291Z event=event.ensemble.parse.start msgid=4cfa4870-c485-409a-ae84-f393d9f5031e eventId=event.ensemble.parse_b70373b9-27ac-4270-a8b8-a23c9c19638d portfolio.id=c0559b23-942c-49ae-84ce-baefe4a48b11 prog=EnsembleManager
INFO - ts=2008-09-14T08:56:49.216896Z event=event.ensemble.parse msgid=c9450b64-0ddf-48ef-b7ce-b0c3e45795b4 eventId=event.ensemble.parse_b70373b9-27ac-4270-a8b8-a23c9c19638d msg="Parsing seed \"SE18-SingleGroupDetector-Tangram\" with config \"/Users/gmehta/Documents/NetBeansProjects/Ensemble/examples/seed1.config\""
INFO - ts=2008-09-14T08:56:49.223372Z event=event.ensemble.parse.end msgid=5ef6fa0e-afac-418d-a1fc-063bfe0a8bc8 eventId=event.ensemble.parse_b70373b9-27ac-4270-a8b8-a23c9c19638d
INFO - ts=2008-09-14T08:56:49.224645Z event=event.ensemble.parse.start msgid=3233690a-e8d9-4973-bbb5-a66144338f0f eventId=event.ensemble.parse_4b0f5881-43d9-40a4-a57f-ab4f896d3873 portfolio.id=c0559b23-942c-49ae-84ce-baefe4a48b11 prog=EnsembleManager
INFO - ts=2008-09-14T08:56:49.225056Z event=event.ensemble.parse msgid=a4a0c004-d0e4-4106-9a93-83d7aa733db6 eventId=event.ensemble.parse_4b0f5881-43d9-40a4-a57f-ab4f896d3873 msg="Parsing seed \"SE18-SingleGroupDetector-Tangram\" with config \"/Users/gmehta/Documents/NetBeansProjects/Ensemble/examples/seed2.config\""
INFO - ts=2008-09-14T08:56:49.231395Z event=event.ensemble.parse.end msgid=19d8228e-dc3e-4772-9b86-25e3bf78bdbd eventId=event.ensemble.parse_4b0f5881-43d9-40a4-a57f-ab4f896d3873
INFO - ts=2008-09-14T08:56:49.232766Z event=event.ensemble.submit.start msgid=5c073eed-1375-4b21-a8b9-b525a494bfa4 eventId=event.ensemble.submit_efc0511e-e70e-4abc-9212-9487f8c9cecd portfolio.id=c0559b23-942c-49ae-84ce-baefe4a48b11 prog=EnsembleManager
INFO - ts=2008-09-14T08:56:49.842701Z event=event.id.creation msgid=d5df18ce-97bf-40b0-a776-55878592efe7 eventId=idHierarchy_ce3cf3a6-d4cc-44e1-a371-b2a866d3d8f9 parent.id.type=portfolio.id parent.id=c0559b23-942c-49ae-84ce-baefe4a48b11 child.ids.type=seed.id child.ids={74207515-ae19-432a-bc13-f956074491a1,d08f2e50-beef-4385-9fbe-0da7e9934bf0,}
EnsembleId = c0559b23-942c-49ae-84ce-baefe4a48b11
WorkflowID = 74207515-ae19-432a-bc13-f956074491a1
WorkflowID = d08f2e50-beef-4385-9fbe-0da7e9934bf0
This shows that the submitted portfolio's ID is reported as the EnsembleId, and each seed's ID is reported as a WorkflowID.
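The log events printed by the client are sequences of `key=value` fields, with quoted values for messages. A small sketch of pulling those fields out of one event, assuming each event is available as a single line of text, follows. `parse_event` is a hypothetical helper; it relies on Python's standard `shlex` module to handle the quoted `msg` field, and a line containing an unbalanced quote would make `shlex.split` raise `ValueError`.

```python
import shlex

def parse_event(line):
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep and key:  # skip tokens such as "INFO" and "-"
            fields[key] = value
    return fields

line = ('INFO - ts=2008-09-14T08:56:49.216896Z event=event.ensemble.parse '
        'msg="Parsing seed \\"SE18-SingleGroupDetector-Tangram\\" with config '
        '\\"/Users/gmehta/Documents/NetBeansProjects/Ensemble/examples/seed1.config\\""')
event = parse_event(line)
print(event["event"])
print(event["msg"])
```

The same function can be used to collect the `portfolio.id` and `child.ids` fields when tracing a submission.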
You can also call the client with a non-default log4j.properties file, like this:
$ sr-submit -Dlog4j.configuration=file:/path/to/log4j.properties/file -s isi_viz,isi_wind -o local -p portfolio
See above for an example log4j.properties file, or see $ENSEMBLE_HOME/log4j.properties
Debugging the submission
EM creates a directory structure for each submission in $ENSEMBLE_HOME/var, or in the directory pointed to by the ensemble.local.dir variable in the properties file.
The directory structure has the following format: {$user}/{portfolioid}/{seedid}/...
Each EM submission creates a portfolioid (called the EM id) with several seedids (called workflowids by EM).
Each portfolioid directory contains an ensemble.in file (an input file created by the sr-submit client for internal use) and an ensemble.dag file that coordinates the run. When the ensemble runs, other files are created, e.g. ensemble.dag.dagman.out, which has information about the overall health of the portfolio.
If the portfolio fails, a rescue file is created, which can be resubmitted by running condor_submit_dag on the ensemble.dag.rescuexxx file. This is only useful when the errors are due to transient system failures, not path errors and the like.
Each workflowid/seedid directory contains the jobs for Wings, ranking, Pegasus planning, etc.
WINGS
The Wings submit file is wings.sub
The stdout of the Wings job goes to wings.out
The stderr of the Wings job goes to wings.err
Wings writes a log file, which goes to wings_log/{seedid}.log
Wings also writes the output dax, which goes in the directory wings_output
RANKING
The ranking submit file is rank.sub
The stdout of the rank job goes to rank.out
The stderr of the rank job goes to rank.err
Ranking writes out the ranked daxes in rank_output
Ranking writes out a rank file called ranked.in, which is used as input to the later steps.
PEGASUS PLANNING
Depending on the number of daxes generated and the rank.top factor set in the seed configuration file, the planning jobs are created in the planner_dag directory.
The job for planning each dax is planner_dag/dax_id.sub
The output of each planning job goes to planner_dag/dax_id.out
The error of each planning job goes to planner_dag/dax_id.err
Planning generates the executable dag in the planner_dag/dax_id directory, which will have the necessary dag and submit files.
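When debugging, it can be convenient to compute the per-seed paths described above rather than navigating to them by hand. The sketch below assumes that the Wings and ranking files sit directly inside the seed directory and that `dax_id` is the name of one generated dax; both function names are hypothetical and are not part of the EM.

```python
from pathlib import Path

def seed_dir(local_dir, user, portfolio_id, seed_id):
    """Build {local_dir}/{user}/{portfolioid}/{seedid}."""
    return Path(local_dir) / user / portfolio_id / seed_id

def debug_files(seed_path, dax_id):
    """Map each stage to its submit, stdout, and stderr files."""
    planner = seed_path / "planner_dag"
    return {
        "wings": [seed_path / n for n in ("wings.sub", "wings.out", "wings.err")],
        "ranking": [seed_path / n for n in ("rank.sub", "rank.out", "rank.err")],
        "planning": [planner / f"{dax_id}{ext}" for ext in (".sub", ".out", ".err")],
    }

# Hypothetical identifiers for illustration.
seed = seed_dir("var", "alice", "portfolio-1", "seed-1")
for stage, paths in debug_files(seed, "dax-1").items():
    print(stage, [p.as_posix() for p in paths])
```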
PLANNED WORKFLOW EXECUTION
The executable dag generated by planning is in the planner_dag/dax_id directory, together with the necessary dag and submit files.
Inside this directory, look at the dax_id.dag.dagman.out file to see which job failed.
You can then open the jobid.out.??? file to see the XML output.
The exitcode near the top of the file tells you what exit code the job exited with.
The stdout and stderr sections in the XML file show the STDOUT and STDERR of the application itself, for further debugging.
ensemble-status
To monitor the workflows, run ensemble-status on a single seed ID or portfolio ID.
$ ensemble-status -e d7065da3-421d-47ac-b9aa-9447cf9f231c
SUBMITTED STARTED PLANNED RUNNING SUCCESS FAILED
0         4       3       2       0       1
$ ensemble-status -e d7065da3-421d-47ac-b9aa-9447cf9f231c --verbose
Workflow : 09523b5a-421d-4cbf-8819-d796b4a09798 2007-11-14 14:29:03.0 STARTED
Workflow : 0e9c2bba-36f6-47ac-9059-1b811b279903 2007-11-14 14:29:01.0 STARTED
Workflow : 3108d383-d83f-4910-b9aa-027c5aa439b7 2007-11-14 14:29:02.0 STARTED
Workflow : 32957769-ee00-468f-9f1a-9447cf9f231c 2007-11-14 14:30:01.0 PLANNED
Workflow : 489eb95b-9d76-4808-849a-b75f36eb05e5 2007-11-14 14:31:03.0 PLANNED
Workflow : 662001c6-75e3-4e8f-b942-47cca5820645 2007-11-14 14:31:02.0 RUNNING
Workflow : 7e13f990-8874-4c70-a4ab-c5d75ce000d4 2007-11-14 14:31:12.0 RUNNING
Workflow : 8142a37c-4e5e-468d-b4ac-4060ef4c5d97 2007-11-14 14:40:11.0 SUCCESS
Workflow : ce19605e-16a0-4d26-a886-31fa791b785a 2007-11-14 14:40:12.0 SUCCESS
Workflow : d7065da3-dc70-4582-9f6e-8d372bacf980 2007-11-14 14:31:02.0 FAILED
SUBMITTED STARTED PLANNED RUNNING SUCCESS FAILED
0         3       2       2       2       1
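The summary row is a per-state count of the workflows in the verbose listing. As a sketch of how the two relate, the following tallies states from captured verbose output, assuming each workflow is printed on its own line in the form `Workflow : <id> <date> <time> <STATE>`. The `tally` helper is hypothetical and only post-processes text; it does not call ensemble-status.

```python
from collections import Counter

STATES = ("SUBMITTED", "STARTED", "PLANNED", "RUNNING", "SUCCESS", "FAILED")

def tally(verbose_output):
    """Count workflows per state, in the column order of the summary row."""
    counts = Counter()
    for line in verbose_output.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[0] == "Workflow" and parts[-1] in STATES:
            counts[parts[-1]] += 1
    return [counts[state] for state in STATES]

captured = """\
Workflow : 09523b5a-421d-4cbf-8819-d796b4a09798 2007-11-14 14:29:03.0 STARTED
Workflow : 32957769-ee00-468f-9f1a-9447cf9f231c 2007-11-14 14:30:01.0 PLANNED
Workflow : d7065da3-dc70-4582-9f6e-8d372bacf980 2007-11-14 14:31:02.0 FAILED
"""
print(dict(zip(STATES, tally(captured))))
```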