
March 2014

March 17th, 2014

Agenda

  • XSEDE poster and tutorial proposal
    • will get it done this week. mats and karan will work on it.
  • idafen will work on a workshop paper for xsede on reproducibility
    • 4 page limit
    • deadline is april 5th.
  • energy simulation for SC 2014
    • measure energy when running workflows
    • check whether energy usage changes depending on whether data is transferred to a site or everything is executed at one site.
  • sane defaults for 4.4 for transfer jobs, pre scripts etc
  • leaf cleanup for hierarchical workflows
  • sudharshan's paper
    • emphasize that the goal is not improving the makespan.
  • 4.3.2 release
    • release notes checked in on friday
    • mats will tag after the release.
    • the service should be installed in the tutorial VM image.

March 10th, 2014

Agenda

  • Should we stage sub-workflow output files to parent workflow scratch? (related to leaf cleanup)
  • Should we enable DAX jobs to have input and output uses, and distinguish between planner inputs and sub-workflow inputs?
  • SUB DAG keyword to make pegasus generated subdag submit files always match the dagman version
  • data reuse edge case
    • have fix for it and have added unit test cases
  • Atlassian licenses expiring?
  • plan for a pegasus workshop / meeting for 2nd week of January 2015


March 3rd, 2014

  • monitord fix for LIGO
    • pegasus plan prescripts were not logged in the database.
  • checkpointing files
    • karan will create a JIRA item and send it to ligo folks for comment.
  • transfer fix
  • held jobs ?
  • separate pegasus plan planning jobs
    • throttle jobs via category.
  • real full ahead planning
    • plan full ahead -
    • will help in debugging workflows
  • hierarchical workflows planner arguments in the prescript wrapper shell scripts.
  • final cleanup job for the workflow
  • fix for iplant workflows cleanup. previously generated files whose locations are determined in the replica catalog should not be cleaned up

Workflow reproducibility ( idafen )

  • here for 3 months - march/april and may
  • document the infrastructure that was used to generate the workflows
  • created ontologies to describe infrastructure.
  • precip API
    • expressed an interest in it.
    • he focuses not on how to deploy, but on how to describe the infrastructure
    • then do experiments that take in his description and deploy it using precip
  • target two conferences
    • one systems
    • other semantic

Pegasus Submit Node on HPCC

  • waiting on glite recommendations from condor-admin

Feb 2014

February 24th, 2014

SCEC Transfer Issues

  • hpc login crashed for scec workflows because of too many stageout jobs
  • there were too many connections open at xinetd level
  • also the stageout jobs were starving all the other local universe jobs in the workflows
  • so the workflows were getting bunched at the stageout level
    • we solved it by moving only the transfers to the vanilla universe on shock
    • ran into the backward compatibility code we put into 4.4 after the new credential handling was introduced.

Transfer Configuration for 4.4

  • by default the number of threads will be 2
  • we will expose a way via properties to increase the number if users want to have better bandwidth
  • in case of any failures, pegasus-transfer will fall back to a single thread
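
A minimal sketch of the threading behaviour described above, assuming one reading of the fallback: transfers that fail in the threaded pass are retried sequentially on a single thread. The function run_transfers and the copy_one callable are hypothetical names for illustration and are not taken from pegasus-transfer.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_THREADS = 2  # default number of threads from the notes above


def run_transfers(transfers, copy_one, threads=DEFAULT_THREADS):
    """Copy every (src, dst) pair; retry failed pairs on a single thread.

    `copy_one` is a hypothetical callable that raises on failure.
    Returns the pairs that still failed after the single-threaded retry.
    """
    def attempt(pair):
        # Return None on success, or the pair itself on failure.
        try:
            copy_one(*pair)
            return None
        except Exception:
            return pair

    # First pass: run the transfers on a small thread pool.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        failed = [p for p in pool.map(attempt, transfers) if p is not None]

    # Fallback: retry the failures one at a time in the calling thread.
    return [p for p in failed if attempt(p) is not None]
```

A caller wanting more bandwidth would pass a larger threads value, which is the role the notes give to the planned property.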

February 10th, 2014

Postscript handling

  • We have implemented a solution in PM-737 to get around condor quoting rules.
  • MPI codes are not kickstart wrapped
    • Pegasus should indicate whether a clustered job or a kickstart job.
  • DAGMan exitcode

checkpoint jobs

  • 10% of runtimes
  • pegasus-transfer will have to be changed
  • link is set to type checkpoint
  • transaction support for checkpoint
  • timeout is job runtime - process
  • pegasus-kickstart timeout method
  • also has dv/dt implications for monitoring.

pegasus-exitcode assumes success and checks for failure

  • refactored the script for unit tests as a library
  • pegasus-statistics
  • pegasus-analyzer ( maybe some commonality)
  • pegasus python library has to be included in worker package

pegasus-transfer

  • threads are handled similar to pegasus-s3
  • default threading
  • expose options end to end
  • initial threads to irods
  • what options to set

pegasus-config will now work with a source checkout

December 2013

December 16th, 2013

  • TODO: Talk about ADAMANT design

December 3rd, 2013

  • 4.3.1 release
    • just need to send the announcement.
    • gideon has updated the build infrastructure in bamboo to build the release
    • to do
      • do a drupal snippet, to update the downloads page automatically.
        • dynamically render the page using the shared directory in drupal.
    • pegasus-analyzer will have a recurse option.
  • identity management for pegasus service
    • portal use case
    • user authentications
    • website
      • put a token in a cookie.
    • draw bigger pictures on the identity stuff.
  • Unicore Testing

November 2013

November 11th, 2013

  • 4.4 Planning
    • according to proposal, we need pegasus as a service, metadata registration, enhanced notifications on long runtimes etc.
    • ligo realtime analysis?
      • scott and kent mentioned that real time analysis is a priority.
      • gstreamer interface.
      • investigate streaming workflows
    • unicore testing support
  • Pegasus Tutorial on (Mats VM on oregon region)
  • Pegasus as a service
  • Ensemble Manager
    • an ensemble has no end state currently.
    • update documentation on the website
    • gideon plans to remove the upload catalog options. instead the clients will read in the properties and automatically upload.
  • NSF Cloud Proposal
    • Experiment management.... maybe does not align itself with NSF Cloud.
  • Adamant Demo
    • workflows are setup and done.

November 4th, 2013

  • Tutorial format finalized for November 14th meeting. similar to software carpentry layout
  • 4.4 release things
    • pegasus metadata support
      • dax schema changes
      • irods - support for metadata attributes
      • s3 objects - they can have tags associated with them.
    • transient replica catalog.
    • unicore support
    • for JIRA items move to the next one.
    • moteur support.
    • dv/dt wrapper support ( probably in a separate dv/dt branch)
  • move to VMWare for hosting websites
    • pegasus.isi.edu will be a VM in a VMware ESX pool.
      • initially 4 VM's for Bamboo BNT
      • retire the machine for PAGE QC
    • long term we are moving to ESX

October 2013

October 1st, 2013

Pegasus 4.3 release

  • dashboard is separate
  • prepare rpm for ligo
  • ssh submission for 4.3
  • tutorial vm almost done
    • the clock issue remains. probably an issue with how virtualbox does the time.
  • need to hear back from scott
  • sepideh is working on a Makeflow compatible code generator.

September 2013

September 23rd, 2013

Software Carpentry followup
  • Create a pegasus youtube channel.
  • See if that can be linked from the ISI webcast page.

ISI Pegasus Workshop

  • Submit host setup at HPCC
  • specs are similar to workflow.isi.edu
  • gideon will mail to HPCC admins today about this

Tutorial VM

  • networking issue
    • persistent rules file /etc/udev/rules.d/70-persistent-net.rules
    • instead of deleting it lets just disable it in our VM's
  • X with virtual box guest additions for enabling copy paste
  • turn on ntp
  • larger virtual disk - will increase the size to 8GB
  • X should just add couple of hundred MB's

Pegasus Release

  • JDBC RC
  • Tutorial VM
  • pegasus-statistics
  • pick up a release date
  • tentatively next friday i.e the 4th.

September 9th, 2013

Software Carpentry

  • Karan will prepare introductory slides for Pegasus.
  • Talk to John about providing a Pegasus submit node.
  • Rajiv will be working on the Pegasus RNASeq VM.
  • John Mehringer will go first in the second day.
  • Parking is in Levy structure in southwest corner.
  • Inquire about shuttle from Health Science Campus.
  • Still do - RNASeq module.
  • Put Information about parking and HSC Shuttle.
    • Parking Center.

Pegasus Release

  • waiting for Scott to do release testing.

Pegasus Lite Paper

  • Karan will send the camera ready version today.

Precip

  • using netlogger for logging.
  • replace python logging framework
  • incorporating events from the remote site
  • AMQP ?
    • Getting events into a common file.
  • Run montage using precip

Condo of Condos Workshop

  • Laurent and Gideon have 10 minutes each.
  • Bosco new name is MyHTC.

 

August 26th, 2013

Pegasus 4.3 release

  • dagman metrics not implemented yet by kent. still in design phase.
  • testing stuff
    • unit tests running in bamboo.
  • add missing data dependencies
    • still checks and produces errors

Precip Logging

  • getting the metrics back

Pegasus Hold

  • how to get dagman to stop submitting jobs
  • idle jobs need to go on hold.
  • we can send sigusr1 to dagman.
  • need to handle hierarchical workflows.
  • JDBC RC stuff

JDBC RC

  • we will just update the existing version one.
  • have a python based RC for Replica Catalog.

Ensemble Manager Paper

  • Gideon will be working on it.

DAGMan replacement??

  • Software engg stuff.

August 19th, 2013

  • Pegasus 4.3 release
    • output mapper stuff implemented.
    • pegasus-statistics changes checked in by Rajiv
    • app metrics associated with the metrics report
      • pegasus.metrics.app
      • can be used for RNASeq tracking and other applications
      • the metrics UI will be able to filter on the name.
  • Globus Online Support - move to 4.4 release
    • can only do certain parts of transfers.
    • for transfers from local submit host , we need to use globus connect
      • credentials issue
      • for the submit host, there needs to be a local endpoint.
  • LIGO testing ?
    • prepare a pre release RPM for LIGO 

August 12th, 2013

  • Pegasus Lite Paper
    • Wait for the Big Data and Science Workshop
  • 4.3 Release
      • Output Mapper Submission
        • error if both an output site and an output mapper replica catalog are specified
      • Globus Online Support in pegasus-transfer
        • OAuth tokens issue.. when to get the token
        • support for multi end point with different credentials
        • probably need to do a pegasus-globus-online
          • the client needs to be blocking .
      • SSH Submission
        • Will use RNASeq for that.
      • Boto downgrade worked.
        • did not build on RHEL 5
      • Test Suite
        • Suite of integration tests
          • checksum the files
  • Ensemble Manager
    • Almost done with the first version
    • Will work on the Galactic Plane version
  • General JUnit Tests for Pegasus
  • Galactic Plane Paper

July 2013

July 29th, 2013

Software Carpentry

  • Workflows Tutorial
    • 1 hour overview of HPCC if HPCC folks are interested.
    • Pegasus Tutorial ( 2 hours )
    • An info part on where to run jobs
      • OSG
      • HPCC
      • XSEDE

  • Pegasus Development
    • Rajiv will complete the pegasus-statistics part
    • error messages ( give more hints on what went wrong on site selection )

  • Monitoring API
    • wants a jar with a simple API to monitor workflows
    • wrap it up in a jar
    • provide interface 
    • portal integration
      • rest interface for the pegasus service

July 8th, 2013

  • gideon has changes checked in dax2dot based on the closures and reductions
  • karan has checked in the LCA approach. But does not scale for our performance test case.
  • Also changed the way edges are added for the create dir nodes. that will go in for 4.3.
  • Precip Paper
    • deadline extended to the 19th of July.
  • Posters to be made for XSEDE
  • Sudharshan will make a poster on his cleanup work on Monday.
    • Sudharshan will be going on Monday to campus to present the poster around 1-3PM
    • Will give a talk to CCG group Tuesday July 16th at 11:00AM
  • Currently, sudharshan's algo takes 15 seconds on a 1000 node montage workflow.
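
The July 8th items mention closures and reductions in dax2dot without giving details. As a generic illustration only (not the dax2dot code, and not Sudharshan's cleanup algorithm), a transitive reduction that drops edges already implied by a longer path could look like the sketch below. The function name and the edge-list input format are assumptions.

```python
def transitive_reduction(edges):
    """Return the edges of a DAG that are not implied by a longer path.

    `edges` is an iterable of (parent, child) pairs describing a DAG.
    """
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
        children.setdefault(child, set())

    def reachable(start):
        # All nodes reachable from `start` via one or more edges (iterative DFS).
        seen, stack = set(), list(children[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(children[node])
        return seen

    reduced = set()
    for parent, kids in children.items():
        # A child that is reachable through another child is redundant.
        indirect = set()
        for kid in kids:
            indirect |= reachable(kid)
        reduced.update((parent, kid) for kid in kids - indirect)
    return reduced
```

This version recomputes reachability for every child, so it is simple rather than fast; a large workflow would want cached reachability sets.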


July 1st, 2013

  • monitord bug fix checked in
  • algorithm to remove extra graph dependencies
  • backups
    • we need to update the pegasus machine
      • jira, svn , website ( website and svn need to move at the same time ) , crowd updates
      • confluence was moved to another . also coordinate with action to do the move.
      • mats already updated crowd today
        • there is secret number of conf files... apache on top of tomcat
      • update to debian machines
        • obelix, cartman and stewie, and the ccg worker nodes.
  • mats has updated the bamboo tests to use new filesystem paths
  • ADAS abstract
    • for galactic plane on Amazon. if accepted due in september.
  • 4.3 release
    • fix error messages. see what can be done to improve them .
    • output replica catalog
    • pegasus-transfer tests.
    • updates to cleanup algorithm based on sudharshan's work ??
    • release notes will be updated to indicate the dashboards move to pegasus-services thing.
  • Precip Paper
    • mats will do the zotero work.
    • submitting to cloud com in bristol uk.
    • sepideh has some data on openstack. could not get all instances started up.
    • sepideh will release the token to gideon to do an edit pass
  • Cleanup Algorithm

June 2013

June 24th, 2013

  • Pegasus Development
  • Update on SCEC visit
    • pegasus-archive tool
      • archive everything other than the stampede db and braindump file
    • scott will try to cluster rupture variations for the same rupture in one task based on runtime estimates
    • the SGT will become 16 times bigger and post processing 8 times bigger on move to 1HZ. clustering rupture variations in scec code will help in reducing the number of jobs in the DAX
    • Scott tried to generate a single DAX for the post processing workflow. Was unable to do so. Has generated two dax'es
  • Galactic Plane
    • Cut out service. Slow times on retrieving the image from S3. Small bandwidth between S3 and EC2
    • Will need to have monitoring etc... Not fast enough for a webpage to be responsive.. will need some queuing up
    • Backups
      • Mats working on Kepler data.
      • mats tried backup with S3. does not like symlinks. will change the way backups are managed. the transfer times can be long.
  • Update from Sudharshan
    • Good progress. showed some simulations
  • Adamant Update
    • we are on hook for providing the interfaces in pegasus-transfer that will talk to the exo planner service
    • also provide shadow queue service, that gives estimates on jobs that will be in the queue.
    • supercomputing demo?
  • Precip Paper
    • maciek is doing some experiments

June 17th, 2013

  • Pegasus Development
    • the dax job handling is completed.
    • update on ligo front.
    • condor priorities for local universe jobs
      • not handled right now.
      • gideon has a ticket open for them.
    • gideon observation of s3
      • scalable but not good latency or
  • Pegasus Lite Paper
    • mats is almost done with the runs. to grep through the runs to get the intermediate files in and out of S3
    • not done the S3 caching for rosetta as yet. still not sure. too much work for the time remaining.
    • mats did do the runs with task clustering. he got better numbers and saw a difference in case of rosetta.
    • interleaving of compute jobs and transfers. may help montage.. but won't help rosetta
    • whether we should include the new pegasus 4.2 features.
  • Cleanup Algorithm
  • Glacier Backups for NFS?
    • instead of using two qnaps, just have one and use other for duplicates
    • we need a place for backups
    • currently the QNAPS are 18TB each with raid 6. Raid 10 is a better configuration on the QNAP according to the forums. This means though we will have half the space.
      • have one qnap for scratch
      • have other qnap for storage - the storage will be backed up to glacier. right now QNAP only supports S3. Support for glacier is coming.
    • ewa and richard think glacier backups are a good option.
      • there might be a purge policy required on glacier.
  • Precip Paper
    • change tracking on
    • use dropbox
    • broadcast when you are making a new version.

June 10th, 2013

Pegasus Development

  • change to dax handling
  • fix of stdout
  • regex based replica catalog.
  • changes to pegasus-statistics for aggregate statistics
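
The notes do not say how the regex based replica catalog works. One way such a catalog could map logical file names to physical locations is sketched below, as an in-memory illustration only. The class name, method names and URL templates are hypothetical and do not reflect the Pegasus implementation or its file format.

```python
import re


class RegexReplicaCatalog:
    """Hypothetical catalog: LFN patterns map to PFN templates."""

    def __init__(self):
        self._rules = []

    def add(self, lfn_pattern, pfn_template, site):
        # Compile once so that repeated lookups stay cheap.
        self._rules.append((re.compile(lfn_pattern), pfn_template, site))

    def lookup(self, lfn):
        """Return (pfn, site) for every rule whose pattern matches the whole LFN."""
        results = []
        for pattern, template, site in self._rules:
            match = pattern.fullmatch(lfn)
            if match:
                # Backreferences such as \1 or \g<0> are filled from the match.
                results.append((match.expand(template), site))
        return results
```

A single rule can then stand in for many explicit entries, for example one pattern covering every output file of a run.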

 

Pegasus Lite Paper

  • compute data between s3 and local disk.
  • compute costs for the runs ?
  • have data outside
  • local cache for the S3 client ?? could affect the rosetta cache.
    • change the rosetta workflow.
    • if there are a lot of small files.
    • reading parts of files.
  • Ewa will send her version of the changes.

 

Sudharshan Algorithm for Cleanup

  • Greedy approach planned
  • will try implementing a version and show the different executable workflows created


June 3rd, 2013

Pegasus Lite Paper

  • Breakdown of the runtimes , experiments
    • In case of sharedfs, the kickstart runtimes in the breakdown file will be longer
    • for the S3 case we can calculate the S3 transfer time by calculating the difference between the cumulative runtimes
    • doing two experiments rosetta(cpu intensive) and montage( io intensive)
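
The S3 calculation above is plain subtraction. The sketch below assumes the two cumulative figures are the job-level runtime (which includes staging to and from S3) and the kickstart-level runtime (which covers only the wrapped executables). That pairing is an interpretation of the notes, and the function name is hypothetical.

```python
def s3_transfer_seconds(cumulative_job_seconds, cumulative_kickstart_seconds):
    """Estimate the time one run spent moving data to and from S3.

    Assumption: the job-level figure includes staging, while the
    kickstart-level figure covers only the wrapped executables.
    """
    if cumulative_kickstart_seconds > cumulative_job_seconds:
        raise ValueError("kickstart runtime cannot exceed job runtime")
    return cumulative_job_seconds - cumulative_kickstart_seconds
```

For instance, a run with 5400 seconds of cumulative job runtime and 4800 seconds of cumulative kickstart runtime would attribute 600 seconds to S3 transfers under this assumption.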

Pegasus Development

  • Java DAX API issues
    • might be some bugs in there.

Precip Paper

  • Ewa wants a link to pegasus website in the paper.
  • have more logical thinking in the paper, like reliability and repeatability
  • Sepideh adding some new figures to the paper.
  • Maciek will provide an experiment use-case for the paper.

Stampede and Corral Annual Reports

  • Karan and Mats will be working on these

Sudharshan's Project

  • Going to look into providing a cleanup algorithm that meets a given storage constraint
  • Will look at the static problem of inserting dependencies into the workflow to achieve a solution

PMC Paper

  • on amazon
  • with clustering and pmc

Shirts

  • Should get the logo sample this week, once we approve then we can order shirts

dV/dT

  • Rafael is working on a draft of the data collection and modeling paper
  • We are planning on publishing data, will start drafting a format this week

May 2013

May 20th, 2013

Confluence is going slow. Mats is going to look.

Analytics are set up on Confluence now.

Pegasus Transfer

  • Mats committed a new version that has support for 2-stage transfers

Pegasus S3 Client

  • Gideon changed .s3cfg to .pegasus/s3cfg

Pegasus Lite Paper

  • Mats is working on the experiments
  • We have two weeks to the deadline

PMC Paper

  • Experiments on Amazon comparing Pegasus, Pegasus w/ Clustering, PMC alone

Pegasus Service

  • Finished setting up users and test suite
  • Next is a quick-and-dirty ensemble manager implementation
  • Gideon is going to commit a change to Pegasus that removes the dashboard components. They will live in the pegasus-service repository from now on.

Summer Student

  • Need to think up a project. Needs to be research-oriented and relatively small.
  • Cleanup? Precip? 

Contacting users

  • Find out if they need anything.

Examples

  • Simple examples in Perl, Python and Java
  • Gideon will add them to the examples in the pegasus Git repo

April 2013

April 22nd, 2013

Pegasus 4.2.1 Release
  • monitord prescript handling fixed
    • pegasus-analyzer should detect prescript failures, and the prescript exitstatus should be logged in the database
    • pegasus-statistics was updated for the job instance report
  • pegasus planner
    • need to confirm all checkin's are complete
  • do we want to get LIGO to do a test or just release?

Pegasus statistics across workflows - Rajiv

Pegasus Lite Paper

  • Mats will do the runs on Amazon
  • Karan will work on paper when he comes back

pegasus-hold and pegasus-release

  • any difference between doing a hold on the dagman directly or pegasus-dagman
  • we need to do more investigations on monitord

BOSCO

  • Mats is trying to run on HPCC
  • a single job is running fine.

April 8th, 2013

Pegasus 4.2.1 Release
  • Work on it towards this week
  • monitord prescript issue to fix
Pegasus 4.3

Pegasus Posters

  • One at XSEDE
  • joint one with BOSCO team

Pegasus Lite Paper

  • Submission to IEEE Big Data

New Programmer Hire

  • expanded posting on confluence
  • will send out to HPC Wire , RENCI and USC SC Connect

April 1st, 2013

Pegasus Lite Paper

  • Waiting on Ewa
  • Not much we can do about the IEEE conference. The page limit is 8 , the current size of the paper.

XSEDE Poster

  • Pegasus Poster. Karan will send update
  • Also a joint Pegasus BOSCO poster
  • Also as part of that we will get the MPI workflows up and running through Pegasus and BOSCO

Pegasus Development

  • Bypass of staging input files for Pegasus Lite Case
  • Inplace cleanup bug fixes done.
  • pegasus-s3
    • gideon checked in changes of copy from one file to another
    • mats adds a pegasus transfer
  • workflow cleanup nodes
    • separate cleanup node in the workflow
    • for hierarchical workflows we only delete the outermost workflow
    • what happens if no output-site specified
      • the ligo case!
  • backward compatibility for LIGO
  • Pegasus Dashboard
    • general javascript updates
  • Generic Pegasus Slides
    • 2-3 slides.



 

March 2013

March 25th, 2013

  • Pegasus Lite Paper Submission
  • pegasus-statistics
    • Waiting on Scott to get back with the list of metrics
    • Rajiv will be working on it
  • pegasus-s3 changes
    • we want to be able to copy output files from one s3 bucket to another
    • requires changes to pegasus-transfer and pegasus-s3
  • final node for cleaning up remote directories
    • also related is getting the cleanup algorithm working when we bypass first level staging.

March 18th, 2013

  • Mats has an RPM almost sorted out for LIGO that does not require us to have PYTHONPATH set. Instead the libraries go into standard locations
  • Karan is testing this RPM on spice-dev1 and has set up a page with instructions on how to submit a test workflow to VIRGO
  • Statistics across root workflows
    • earlier gaurang had generated statistics for scec runs by hand... executing queries on the mysql command line
    • he does not have the queries documented anywhere
    • this is something we have talked about in context of 4.3 with Rajiv
    • will follow up with scott on wednesday's call
  • 4.2.1 release
    • backward compatibility for LIGO . still to be done
    • probably next week after the pegasus annual report
    • RPM to handle native python installation
  • Pegasus Annual Report
    • Karan will work on it this week
    • Try to follow the same template as earlier.

March 4th, 2013

  • Sent link to DAGMan Metrics Reporting to Ewa
  • Metrics for Rob Quick's workflow
  • Gideon pushed out kickstart changes
  • Rajiv has pushed changes to the queries for the dashboard.
  • Setup meeting with Jaime and Derrick at OSG AHM to discuss
    • remote_initialdir
    • extra attributes for glite/bosco submissions
    • mpi workflows.
  • OSG Poster to be made this week. And 4.2 Release slides.

February 2013

February 11th, 2013

Direct submission of workflows to PBS

  • Glite submission in Condor. We set up a VM that hosts a PBS scheduler and are using that to test
  • Karan prepared an example for 4.2 that can be used to submit directly to local PBS using the glite interfaces in Condor
    • the remote_initialdir  / +remote_iwd  does not work
      • problem for MPI codes
      • for the time being, the example prepared relies on kickstart to change the directory before launching a job
    • there is also a ssh style that allows us to use BOSCO to do remote submissions using SSH to a PBS cluster
      • that one also has the issue of remote initialdir

  • jobstate.log refactoring.
  • data transfer ( support for globus online)
  • lightweight tracing
    • taskstats. netlink socket in pegasus-kickstart. how much memory the task used and io used.
    • add taskstats to kickstart
    • ptrace
    • dtrace: the linux equivalent is SystemTap
  • dashboard improvements
    • single api for clients
    • last week drop down
    • performance run on large workflows.

 

February 4th, 2013

  • CCGrid / Pegasus Lite Paper
    •  Performance section
    •  remove the experiments section?
    •  OR
    •  extra experiments section 
    •  have the squid proxy cache
    • find a workshop to submit the paper
  • Cloud Paper
    •  Ewa is working on it.

  • Git HUB Migration
    •  - couple of branches like monitord , pmc and dang are branches
    •  - svn will be made read only . 
    •  - update the website with all the development information
    •  - bamboo scripts
    •  - documentation ( long term )
    •  - nightly builds
  • SSH Submission
    •  - gsissh submission for blue waters
    •  - ssh to blue waters is required for OTP
    •  - passing of parameters to PBS
    •  - SSH key
    •  - ssh agent.
    •  - queue keyword
    •  - Batch session
    •  - submit jobs to HPCC
    •  - Gideon will do that. 

  • monitord memory explosion
    •  - long term for monitord 
    •  - pegasus-dagman replacement 

  •   minor release 4.2.1
    •  - potential monitord bug issue
    •  - long term dagman replacement

  • Response time for metrics page
    •  - occasionally it is slow