March 2014
March 31st, 2014
- Gideon changed the tutorial VM.
- Put in backward support for old credential handling.
- Mats started on an outline for the optimizations chapter.
- next week's developer meeting is cancelled.
- general Pegasus dependencies
- python > 2.4 and less than 3.0
- in general, easier to build from source rather than from source RPMs
- update Pegasus README
- change the build.xml to say default build without docs. remove the dist-nodoc target. instead we will have ant dist-release as the default target
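The Python dependency noted above can be checked up front. A minimal sketch that reads the note literally (strictly newer than 2.4, older than 3.0); whether 2.4 itself is meant to be supported is not stated here, and the function name is hypothetical:

```python
import sys

def python_version_supported(version_info=None):
    """Return True if (major, minor) is > 2.4 and < 3.0, as the note reads."""
    # Assumption: the bound on 2.4 is strict, exactly as written in the notes.
    major_minor = tuple((version_info or sys.version_info)[:2])
    return (2, 4) < major_minor < (3, 0)

if __name__ == "__main__":
    if not python_version_supported():
        sys.stderr.write("unsupported python version: %s\n" % sys.version.split()[0])
```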
March 24th, 2014
- Pegasus 4.3.2 release done last week
- storage constraints paper - gideon, rafael and karan worked on it.
- karan worked on the hpc-pegasus setup.. has workflows running through PMC
- karan and mats have a XSEDE tutorial proposal that will be submitted today
- dv/dt paper rejected for HPDC. Will try for a middleware conference due mid-May
- 4.4 release
- checkpointing solution
- leaf cleanup for hierarchical workflows
- md5checksum option for guc transfers
- we won't follow up on kickstart generating the checksums, but tracking checksums in replica catalog.
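The checksum items above come down to comparing a computed digest with a recorded one. A minimal sketch, assuming the expected md5 is available as a hex string (for example from a catalog entry); the function names are hypothetical and not part of Pegasus:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_matches(path, expected_md5):
    """Compare the file's digest with a recorded hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()
```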
March 17th, 2014
Agenda
- XSEDE poster and tutorial proposal
- will get it done this week. mats and karan will work on it.
- idafen will work on a workshop paper for xsede on reproducibility
- 4 page limit
- deadline is april 5th.
- energy simulation for SC 2014
- measure energy when running workflows
- try to check if energy usage changes whether data is transferred to a site, or everything is executed at one site.
- sane defaults for 4.4 for transfer jobs, pre scripts etc
- transfer jobs
- how many stage in jobs - 2 jobs and each job with 2 threads.
- how many threads for each transfer job - pegasus-transfer has a default of 2
- pegasuslite job
- change sls name ? property name change
- control the number of threads
- add a chapter called tuning workflows
- mats will add a section on tuning transfers.
- setting clustering parameters.
- changing back the default refiner to bundle???
- cleanup job
- change hold release time to one hour.
- transfer jobs
- new transfer refiner
- maybe can use k means clustering ?
- leaf cleanup for hierarchical workflows
- --cleanup leaf,inplace,none
- tell the planner to throw a warning when
- sudharshan's paper
- emphasize that the goal is not improving the makespan.
- 4.3.2 release
- release notes checked in on friday
- mats will tag after the release.
- the service should be installed in the tutorial VM image.
- Condor Categories
- similar to dagman categories.
- will condor accounting groups work??
March 10th, 2014
Agenda
- Should we stage sub-workflow output files to parent workflow scratch? (related to leaf cleanup)
- Should we enable DAX jobs to have input and output uses, and distinguish between planner inputs and sub-workflow inputs?
- SUB DAG keyword to make pegasus generated subdag submit files match with dagman version always
- From Kent, Wenger
Hey, I just wanted to touch base and find out whether you guys have made any progress towards making Pegasus-generated sub-DAG submit files match
the "normal" DAGMan format.
(See https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3891,4.)
- data reuse edge case
- have fix for it and have added unit test cases
- atlassian licenses expiring?
- plan for a pegasus workshop / meeting for 2nd week of January 2015
March 3rd, 2014
- monitord fix for LIGO
- pegasus plan prescripts were not logged in the database.
- checkpointing files
- karan will create a JIRA item and send it to ligo folks for comment.
- transfer fix
- held jobs ?
- separate pegasus plan planning jobs
- throttle jobs via category.
- real full ahead planning
- plan full ahead -
- will help in debugging workflows
- hierarchical workflows planner arguments in the prescript wrapper shell scripts.
- final cleanup job for the workflow
- fix for iplant workflows cleanup. previously generated files whose locations are determined in the replica catalog should not be cleaned up
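The "throttle jobs via category" item maps onto DAGMan's CATEGORY and MAXJOBS keywords. A small sketch that emits those DAG file lines; the job and category names are made up for illustration, and this is not how the planner itself writes them:

```python
def category_lines(job_categories, max_jobs):
    """Build DAGMan CATEGORY / MAXJOBS lines.

    job_categories: dict of job name -> category name
    max_jobs:       dict of category name -> max concurrently submitted jobs
    """
    lines = ["CATEGORY %s %s" % (job, cat)
             for job, cat in sorted(job_categories.items())]
    lines += ["MAXJOBS %s %d" % (cat, limit)
              for cat, limit in sorted(max_jobs.items())]
    return lines

if __name__ == "__main__":
    # Hypothetical example: keep sub-workflow planning jobs to one at a time.
    print("\n".join(category_lines({"plan_sub1": "planning",
                                    "plan_sub2": "planning"},
                                   {"planning": 1})))
```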
Workflow reproducibility ( idafen )
- here for 3 months - march/april and may
- document the infrastructure that was used to generate the workflows
- created ontologies to describe infrastructure.
- precip API
- expressed an interest in it .
- he focuses not on how to deploy, but instead to describe the infrastructure
- then do experiments that take in his description and deploy it using precip
- target two conferences
- one systems
- other semantic
Pegasus Submit Node on HPCC
- waiting on glite recommendations from condor-admin
Feb 2014
February 24th, 2014
SCEC Transfer Issues
- hpc login crashed for scec workflows because of too many stageout jobs
- there were too many connections open at xinetd level
- also the stageout jobs were starving all the other local universe jobs in the workflows
- so the workflows were getting bunched at the stageout level
- we solved it by moving only the transfers to the vanilla universe on shock
- ran into a credential handling backward compatibility issue; backward support was put in 4.4 after the new credential handling.
Transfer Configuration for 4.4
- by default the number of threads will be 2
- we will expose a way via properties to increase the number if users want to have better bandwidth
- in case of any failures, pegasus-transfer will fall back to a single thread
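One possible reading of the thread default and the failure fallback above, as a sketch: try everything with the configured number of threads, then retry whatever failed one at a time. The names are hypothetical and this is not the pegasus-transfer implementation:

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_THREADS = 2  # default from the notes above

def run_transfers(transfers, transfer_one, threads=DEFAULT_THREADS):
    """Run transfers in parallel, then retry failures on a single thread.

    transfer_one is a caller-supplied callable that raises on failure.
    Returns the list of transfers that failed both attempts.
    """
    def attempt(item):
        try:
            transfer_one(item)
            return None
        except Exception:
            return item

    with ThreadPoolExecutor(max_workers=threads) as pool:
        failed = [item for item in pool.map(attempt, transfers) if item is not None]

    # Fallback pass: serial, in the calling thread.
    return [item for item in failed if attempt(item) is not None]
```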
February 10th, 2014
Postscript handling
- We have implemented a solution in PM-737 to get around condor quoting rules.
- MPI codes are not kickstart wrapped
- Pegasus should indicate whether a clustered job or a kickstart job.
- DAGMan exitcode
checkpoint jobs
- 10% of runtimes
- pegasus-transfer will have to be changed
- link is set to type checkpoint
- transaction support for checkpoint
- timeout is job runtime - process
- pegasus-kickstart timeout method
- also has dv/dt implications for monitoring.
pegasus-exitcode assumes success and checks for failure
- refactored the script for unit tests as a library
- pegasus-statistics
- pegasus-analyzer ( maybe some commonality)
- pegasus python library has to be included in worker package
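The "assume success, check for failure" pattern can be sketched as a function that starts with no failures and only collects reasons to fail. Treating a job with no task records as a failure is an assumption made here for illustration; the function name is hypothetical and this is not the pegasus-exitcode code:

```python
def find_failures(task_exitcodes):
    """Return a list of failure reasons; an empty list means success."""
    failures = []
    # Assumption: a job that produced no task records did not run properly.
    if not task_exitcodes:
        failures.append("no task records found")
    for index, code in enumerate(task_exitcodes):
        if code != 0:
            failures.append("task %d exited with code %d" % (index, code))
    return failures
```

Keeping the check as a plain function with no I/O is what makes it easy to unit test as a library.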
pegasus-transfer
- threads are handled similar to pegasus-s3
- default threading
- expose options end to end
- initial threads to irods
- what options to set
pegasus-config will now work with a source checkout
December 2013
December 16th, 2013
- TODO: Talk about ADAMANT design
December 3rd, 2013
- 4.3.1 release
- just need to send the announcement.
- gideon has updated the build infrastructure in bamboo to build the release
- to do
- do a drupal snippet, to update the downloads page automatically.
- dynamically render the page using the shared directory in drupal.
- pegasus-analyzer will have a recurse option.
- identity management for pegasus service
- portal use case
- user authentications
- website
- put a token in a cookie.
- draw bigger pictures on the identity stuff.
- Unicore Testing
November 2013
November 11th, 2013
- 4.4 Planning
- according to proposal, we need pegasus as a service, metadata registration, enhanced notifications on long runtimes etc.
- ligo realtime analysis?
- scott and kent mentioned that real time analysis is a priority.
- gstreamer interface.
- investigate streaming workflows
- unicore testing support
- Pegasus Tutorial on (Mats VM on oregon region)
- Pegasus as a service
- Ensemble Manager
- an ensemble has no end state currently.
- update documentation on the website
- gideon plans to remove the upload catalog options. instead the clients will read in the properties and automatically upload.
- NSF Cloud Proposal
- Experiment management.... maybe does not align itself with NSF Cloud.
- Adamant Demo
- workflows are setup and done.
November 4th, 2013
- Tutorial format finalized for November 14th meeting. similar to software carpentry layout
- 4.4 release things
- pegasus metadata support
- dax schema changes
- irods - support for metadata attributes
- s3 objects - they can have tags associated with it.
- transient replica catalog.
- unicore support
- for JIRA items move to the next one.
- moteur support.
- dv/dt wrapper support ( probably in a separate dv/dt branch)
- move to VMWare for hosting websites
- pegasus.isi.edu will be as a VM in a VMWARE ESX pool.
- initially 4 VM's for Bamboo BNT
- retire the machine for PAGE QC
- long term we are moving to ESX
October 2013
October 1st, 2013
Pegasus 4.3 release
- dashboard is separate
- prepare rpm for ligo
- ssh submission for 4.3
- tutorial vm almost done
- the clock issue remains. probably an issue with how virtualbox does the time.
- need to hear back from scott
- sepideh is working on a makeflow compatible code generator.
September 2013
September 23rd, 2013
- Create a pegasus youtube channel.
- See if that can be linked from the ISI webcast page.
ISI Pegasus Workshop
- Submit host setup at HPCC
- specs are similar to workflow.isi.edu
- gideon will mail to HPCC admins today about this
Tutorial VM
- networking issue
- persistent rules file /etc/udev/rules.d/70-persistent-net.rules
- instead of deleting it let's just disable it in our VMs
- X with virtual box guest additions for enabling copy paste
- turn on ntp
- larger virtual disk - will increase the size to 8GB
- X should just add couple of hundred MB's
Pegasus Release
- JDBC RC
- Tutorial VM
- pegasus-statistics
- pick up a release date
- tentatively next friday, i.e. the 4th.
September 9th, 2013
Software Carpentry
- Karan will prepare introductory slides for Pegasus.
- Talk to John about providing a Pegasus submit node.
- Rajiv will be working on the Pegasus RNASeq VM.
- John Mehringer will go first in the second day.
- Parking is in Levy structure in southwest corner.
- Inquire about shuttle from Health Science Campus.
- Still do - RNASeq module.
- Put Information about parking and HSC Shuttle.
- Parking Center.
Pegasus Release
- waiting for Scott to do release testing.
Pegasus Lite Paper
- Karan will send the camera ready version today.
Precip
- using netlogger for logging.
- replace python logging framework
- incorporating events from the remote site
- AMQP ?
- Getting events into a common file.
- Run montage using precip
Condo of Condos Workshop
- Laurent and Gideon have 10 minutes each.
- Bosco new name is MyHTC.
August 26th, 2013
Pegasus 4.3 release
- dagman metrics not implemented yet by kent. still in design phase.
- testing stuff
- unit tests running in bamboo.
- add missing data dependencies
- still checks and produces errors
Precip Logging
- getting the metrics back
Pegasus Hold
- how to get dagman to stop submitting jobs
- idle jobs need to go on hold.
- we can send sigusr1 to dagman.
- need to handle hierarchical workflows.
- JDBC RC stuff
JDBC RC
- we will just update the existing version one.
- have a python based RC for Replica Catalog.
Ensemble Manager Paper
- Gideon will be working on it.
DAGMan replacement??
- Software engg stuff.
August 19th, 2013
- Pegasus 4.3 release
- output mapper stuff implemented.
- pegasus-statistics changes checked in by Rajiv
- app metrics associated with the metrics report
- pegasus.metrics.app
- can be used for RNASeq tracking and other applications
- the metrics UI will be able to filter on the name.
- Globus Online Support - move to 4.4 release
- can only do certain parts of transfers.
- for transfers from local submit host , we need to use globus connect
- credentials issue
- for the submit host, there needs to be a local endpoint.
- LIGO testing ?
- prepare a pre release RPM for LIGO
August 12th, 2013
- Pegasus Lite Paper
- Wait for the Big Data and Science Workshop
- 4.3 Release
- Output Mapper Submission
- error if both an output site and an output mapper replica catalog are specified
- Globus Online Support in pegasus-transfer
- OAuth tokens issue.. when to get the token
- support for multi end point with different credentials
- probably need to do a pegasus-globus-online
- the client needs to be blocking .
- SSH Submission
- Will use RNASeq for that.
- Boto downgrade worked.
- did not build on RHEL 5
- Test Suite
- Suite of integration tests
- checksum the files
- Ensemble Manager
- Almost done with the first version
- Will work on the Galactic Plane version
- General JUnit Tests for Pegasus
- Galactic Plane Paper
July 2013
July 29th, 2013
Software Carpentry
- Workflows Tutorial
- 1 hour overview of HPCC if HPCC folks are interested.
- Pegasus Tutorial ( 2 hours )
- An info part on where to run jobs
- OSG
- HPCC
- XSEDE
- Pegasus Development
- Rajiv will complete the pegasus-statistics part
- error messages ( give more hints on what went wrong on site selection )
- Monitoring API
- wants a jar with a simple API to monitor workflows
- wrap it up in a jar
- provide interface
- portal integration
- rest interface for the pegasus service
July 8th, 2013
- gideon has changes checked in dax2dot based on the closures and reductions
- karan has checked in the LCA approach. But does not scale for our performance test case.
- Also changed the way edges are added for the create dir nodes. that will go in for 4.3.
- Precip Paper
- deadline extended to the 19th of July.
- Posters to be made for XSEDE
- Sudharshan will make a poster on his cleanup work on Monday.
- Sudharshan will be going on Monday to campus to present the poster around 1-3PM
- Will give a talk to CCG group Tuesday July 16th at 11:00AM
- Currently, sudharshan's algo takes 15 seconds on a 1000 node montage workflow.
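Removing redundant edges from a workflow graph (the "reductions" mentioned above) is a transitive reduction: an edge u -> v can go if v is still reachable from u through another child. A straightforward sketch for small DAGs, not the dax2dot or planner code, and not tuned for the large performance test case:

```python
def transitive_reduction(graph):
    """Return a copy of a DAG with redundant edges removed.

    graph: dict mapping every node to the set of its direct children.
    """
    def descendants(start):
        seen, stack = set(), list(graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen

    reduced = {}
    for node, children in graph.items():
        keep = set(children)
        for child in children:
            # Anything reachable through this child does not need a direct edge.
            keep -= descendants(child)
        reduced[node] = keep
    return reduced
```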
July 1st, 2013
- monitord bug fix checked in
- algorithm to remove extra graph dependencies
- backups
- we need to update the pegasus machine
- jira, svn , website ( website and svn need to move at the same time ) , crowd updates
- confluence was moved to another . also coordinate with action to do the move.
- mats already updated crowd today
- there is a secret number of conf files... apache on top of tomcat
- update to debian machines
- obelix, cartman and stewie, and the ccg worker nodes.
- mats has updated the bamboo tests to use new filesystem paths
- ADAS abstract
- for galactic plane on Amazon. if accepted due in september.
- 4.3 release
- fix error messages. see what can be done to improve them .
- output replica catalog
- pegasus-transfer tests.
- updates to cleanup algorithm based on sudharshan's work ??
- release notes will be updated to indicate the dashboard's move to pegasus-service.
- Precip Paper
- mats will do the zotero work.
- submitting to cloud com in bristol uk.
- sepideh has some data on openstack. could not get all instances started up.
- sepideh will release the token to gideon to do an edit pass
- Cleanup Algorithm
June 2013
June 24th, 2013
- Pegasus Development
- Monitord issue https://jira.isi.edu/browse/PM-712
- karan has a fix that works for him.
- needs to test the replay mode.
- Update on SCEC visit
- pegasus-archive tool
- archive everything other than the stampede db and braindump file
- scott will try to cluster rupture variations for the same rupture in one task based on runtime estimates
- the SGT will become 16 times bigger and post processing 8 times bigger on move to 1HZ. clustering rupture variations in scec code will help in reducing the number of jobs in the DAX
- Scott tried to generate a single DAX for the post processing workflow. Was unable to do so. Has generated two DAXes
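The pegasus-archive idea above (archive everything except the stampede db and the braindump file) can be sketched with the standard tarfile module. The file name patterns `braindump.txt` and `*.stampede.db` are assumptions made for the example, and the function name is hypothetical:

```python
import fnmatch
import os
import tarfile

# Assumed names for the files that must stay outside the archive.
EXCLUDE = ("braindump.txt", "*.stampede.db")

def archive_submit_dir(submit_dir, tar_path, exclude=EXCLUDE):
    """Tar up a submit directory, skipping files that match exclude patterns."""
    def keep(info):
        name = os.path.basename(info.name)
        if any(fnmatch.fnmatch(name, pattern) for pattern in exclude):
            return None  # returning None drops the member from the archive
        return info

    top = os.path.basename(os.path.normpath(submit_dir))
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(submit_dir, arcname=top, filter=keep)
```

The archive should be written outside the submit directory so it does not end up inside itself.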
- Galactic Plane
- Cut out service. Slow times on retrieving the image from S3. Small bandwidth between S3 and EC2
- Will need to have monitoring etc... Not fast enough for a webpage to be responsive.. will need some queuing up
- Backups
- Mats working on Kepler data.
- mats tried backup with S3. does not like symlinks. will change the way backups are managed. the transfer times can be long.
- Update from Sudharshan
- Good progress. showed some simulations
- Adamant Update
- we are on hook for providing the interfaces in pegasus-transfer that will talk to the exo planner service
- also provide shadow queue service, that gives estimates on jobs that will be in the queue.
- supercomputing demo?
- Precip Paper
- maciek is doing some experiments
June 17th, 2013
- Pegasus Development
- the dax job handling is completed.
- update on ligo front.
- condor priorities for local universe jobs
- not handled right now.
- gideon has a ticket open for them.
- gideon observation of s3
- scalable but not good latency or
- Pegasus Lite Paper
- mats is almost done with the runs. to grep through the runs to get the intermediate files in and out of S3
- not done the S3 caching for rosetta as yet. still not sure. too much work for the time remaining.
- mats did do the runs with task clustering. he got better numbers and saw a difference in case of rosetta.
- interleaving of compute jobs and transfers. may help montage.. but won't help rosetta
- whether we should include the new pegasus 4.2 features.
- Cleanup Algorithm
- Glacier Backups for NFS?
- instead of using two qnaps, just have one and use other for duplicates
- we need a place for backups
- currently the QNAPS are 18TB each with raid 6. Raid 10 is a better configuration on the QNAP according to the forums. This means though we will have half the space.
- have one qnap for scratch
- have other qnap for storage - the storage will be backed up to glacier. right now QNAP only supports S3. Support for glacier is coming.
- ewa and richard think glacier backups are a good option.
- there might be a purge policy required on glacier.
- Precip Paper
- change tracking on
- use dropbox
- broadcast when you are making a new version.
June 10th, 2013
- Pegasus Development
- change to dax handling
- fix of stdout
- regex based replica catalog.
- changes to pegasus-statistics for aggregate statistics
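The idea behind a regex based replica catalog is that one rule can cover many logical file names. A minimal sketch with a made-up rule format (pattern plus a template using `\1` style back-references); the actual catalog file syntax is not shown in these notes and may differ:

```python
import re

# Hypothetical rules: (compiled LFN pattern, PFN template).
RULES = [
    (re.compile(r"f\.(\w+)"), r"file:///data/inputs/\1.dat"),
    (re.compile(r"(\d{4})/(.+)\.fits"), r"gsiftp://example.org/archive/\1/\2.fits"),
]

def lookup(lfn, rules=RULES):
    """Return the PFN for the first rule that matches the whole LFN, else None."""
    for pattern, template in rules:
        match = pattern.fullmatch(lfn)
        if match:
            return match.expand(template)
    return None
```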
Pegasus Lite Paper
- compute data between s3 and local disk.
- compute costs for the runs ?
- have data outside
- local cache for the S3 client ?? could affect the rosetta cache.
- change the rosetta workflow.
- if there are a lot of small files.
- reading parts of files.
- Ewa will send her version of the changes.
Sudharshan Algorithm for Cleanup
- Greedy approach planned
- will try implementing a version and show the different executable workflows created
June 3rd, 2013
Pegasus Lite Paper
- Breakdown of the runtimes , experiments
- In case of sharedfs, the kickstart runtimes in the breakdown file will be longer
- for the S3 case we can calculate the S3 transfer time by calculating the difference between the cumulative runtimes
- doing two experiments rosetta(cpu intensive) and montage( io intensive)
Pegasus Development
- Java DAX API issues
- might be some bugs in there.
Precip Paper
- Ewa wants a link to pegasus website in the paper.
- have more logical thinking in the paper, like reliability and repeatability
- Sepideh adding some new figures to the paper.
- Maciek will provide an experiment use-case for the paper.
Stampede and Corral Annual Reports
- Karan and Mats will be working on these
Sudharshan's Project
- Going to look into providing a cleanup algorithm that meets a given storage constraint
- Will look at the static problem of inserting dependencies into the workflow to achieve a solution
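To judge whether a workflow fits a storage constraint, one needs the peak disk footprint of a given execution order, with and without cleanup. A small sketch under simplifying assumptions (serial execution, a file is removed right after its last consumer finishes, files with no consumer are kept); this only illustrates the quantity being constrained and is not the algorithm from the project:

```python
def peak_storage(order, inputs, outputs, sizes, cleanup=True):
    """Peak bytes on disk when jobs run one at a time in the given order.

    inputs / outputs: dict of job -> list of file names
    sizes:            dict of file name -> size
    """
    last_consumer = {}
    for index, job in enumerate(order):
        for name in inputs.get(job, ()):
            last_consumer[name] = index

    on_disk, usage, peak = set(), 0, 0
    for index, job in enumerate(order):
        # Staged-in inputs and freshly written outputs both occupy space.
        for name in list(inputs.get(job, ())) + list(outputs.get(job, ())):
            if name not in on_disk:
                on_disk.add(name)
                usage += sizes[name]
        peak = max(peak, usage)
        if cleanup:
            for name in [f for f in on_disk if last_consumer.get(f) == index]:
                on_disk.remove(name)
                usage -= sizes[name]
    return peak
```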
PMC Paper
- on amazon
- with clustering and pmc
Shirts
- Should get the logo sample this week, once we approve then we can order shirts
dV/dT
- Rafael is working on a draft of the data collection and modeling paper
- We are planning on publishing data, will start drafting a format this week
May 2013
May 20th, 2013
Confluence is going slow. Mats is going to look.
Analytics are set up on Confluence now.
Pegasus Transfer
- Mats committed a new version that has support for 2-stage transfers
Pegasus S3 Client
- Gideon changed .s3cfg to .pegasus/s3cfg
Pegasus Lite Paper
- Mats is working on the experiments
- We have two weeks to the deadline
PMC Paper
- Experiments on Amazon comparing Pegasus, Pegasus w/ Clustering, PMC alone
Pegasus Service
- Finished setting up users and test suite
- Next is a quick-and-dirty ensemble manager implementation
- Gideon is going to commit a change to Pegasus that removes the dashboard components. They will live in the pegasus-service repository from now on.
Summer Student
- Need to think up a project. Needs to be research-oriented and relatively small.
- Cleanup? Precip?
Contacting users
- Find out if they need anything.
Examples
- Simple examples in Perl, Python and Java
- Gideon will add them to the examples in the pegasus Git repo
April 2013
April 22nd, 2013
- monitord prescript handling fixed
- pegasus-analyzer should detect prescript failures, and the prescript exitstatus should be logged in the database
- pegasus-statistics was updated for the job instance report
- pegasus planner
- need to confirm all check-ins are complete
- do we want to get LIGO to do a test or just release?
Pegasus statistics across workflows - Rajiv
Pegasus Lite Paper
- Mats will do the runs on Amazon
- Karan will work on paper when he comes back
pegasus-hold and pegasus-release
- any difference between doing a hold on the dagman directly or pegasus-dagman
- we need to do more investigations on monitord
BOSCO
- Mats is trying to run on HPCC
- a single job is running fine.
April 8th, 2013
- Work on it towards this week
- monitord prescript issue to fix
- pegasus statistics extensions
- across root workflows
- https://jira.isi.edu/browse/PM-507
- condor temp file
Pegasus Posters
- One at XSEDE
- joint one with BOSCO team
Pegasus Lite Paper
- Submission to IEEE Big Data
New Programmer Hire
- expanded posting on confluence
- will send out to HPC Wire , RENCI and USC SC Connect
April 1st, 2013
Pegasus Lite Paper
- Waiting on Ewa
- Not much we can do about the IEEE conference. The page limit is 8 , the current size of the paper.
XSEDE Poster
- Pegasus Poster. Karan will send update
- Also a joint Pegasus BOSCO poster
- Also as part of that we will get the MPI workflows up and running through Pegasus and BOSCO
Pegasus Development
- Bypass of staging input files for Pegasus Lite Case
- Inplace cleanup bug fixes done.
- pegasus-s3
- gideon checked in changes of copy from one file to another
- mats adds a pegasus transfer
- workflow cleanup nodes
- separate cleanup node in the workflow
- for hierarchical workflows we only delete the outermost workflow
- what happens if no output-site specified
- the ligo case!
- backward compatibility for LIGO
- Pegasus Dashboard
- general javascript updates
- Generic Pegasus Slides
- 2-3 slides.
March 2013
March 25th, 2013
- Pegasus Lite Paper Submission
- We will try for https://sites.google.com/site/sweetworkshop2013/
- Karan will move the paper to the ACM format
- pegasus-statistics
- Waiting on Scott to get back with the list of metrics
- Rajiv will be working on it
- pegasus-s3 changes
- we want to be able to copy output files from one s3 bucket to another
- requires changes to pegasus-transfer and pegasus-s3
- final node for cleaning up remote directories
- also related is getting the cleanup algorithm working when we bypass first level staging.
March 18th, 2013
- Mats has an RPM almost sorted out for LIGO that does not require us to have PYTHONPATH set. Instead the libraries go into standard locations
- Karan is testing this RPM on spice-dev1 and has set up a page with instructions on how to submit a test workflow to VIRGO
- Statistics across root workflows
- earlier gaurang had generated statistics for scec runs by hand... executing queries on the mysql command line
- he does not have the queries documented anywhere
- this is something we have talked about in context of 4.3 with Rajiv
- will follow up with scott on wednesday's call
- 4.2.1 release
- backward compatibility for LIGO . still to be done
- probably next week after the pegasus annual report
- RPM to handle native python installation
- Pegasus Annual Report
- Karan will work on it this week
- Try to follow the same template as earlier.
March 4th, 2013
- Sent link on DAGMan Metrics Reporting to Ewa
- Metrics for Rob Quick's workflow
- Gideon pushed out kickstart changes
- Rajiv has pushed changes to the queries for the dashboard.
- Setup meeting with Jaime and Derrick at OSG AHM to discuss
- remote_initialdir
- extra attributes for glite/bosco submissions
- mpi workflows.
- OSG Poster to be made this week. And 4.2 Release slides.
February 2013
February 11th, 2013
Direct submission of workflows to PBS
- Glite submission in Condor. We set up a VM that hosts a PBS scheduler and are using that to test
- Karan prepared an example for 4.2 that can be used to submit directly to local PBS using the glite interfaces in Condor
- the remote_initialdir / +remote_iwd does not work
- problem for MPI codes
- for the time being, the example prepared relies on kickstart to change the directory before launching a job
- there is also a ssh style that allows us to use BOSCO to do remote submissions using SSH to a PBS cluster
- that one also has the issue of remote initialdir
- jobstate.log refactoring.
- data transfer ( support for globus online)
- lightweight tracing
- task stats. netlink socket in pegasus-kickstart. how much memory and io the task used.
- add task stats to kickstart
- ptrace
- dtrace linux equivalent is systemtap
- dashboard improvements
- single api for clients
- last week drop down
- performance run on large workflows.
February 4th, 2013
- CCGrid / Pegasus Lite Paper
- Performance section
- remove the experiments section?
- OR
- extra experiments section
- have the squid proxy cache
- find a workshop to submit the paper
- Cloud Paper
- Ewa is working on it.
- Git HUB Migration
- - couple of branches like monitord , pmc and dang are branches
- - svn will be made read only .
- - update the website with all the development information
- - bamboo scripts
- - documentation ( long term )
- - nightly builds
- SSH Submission
- - gsissh submission for blue waters
- - ssh to blue waters is required for OTP
- - passing of parameters to PBS
- - SSH key
- - ssh agent.
- - queue keyword
- - Batch session
- - submit jobs to HPCC
- - Gideon will do that.
- monitord memory explosion
- - long term for monitord
- - pegasus-dagman replacement
- minor release 4.2.1
- - potential monitord bug issue
- - long term dagman replacement
- Response time for metrics page
- - occasionally it is slow