April 2019
April 12th 2019
- Pegasus 5.0
- Site Catalog Conversion to YAML
- mukund is mainly done
- pushed out his changes
- trying to make the tests green
- Checkpointing changes to accommodate LIGO use of vanilla universe
- Karan and Mats will explore and see if it is possible
- cumulative stdout|stderr
- what about time and duration values
- since there is no DAG Node retry and job just goes on HELD state
- Composite Events
- Kibana dashboard needs to be updated
- dropping __ in the event names
- George wants the AMQP library updated
- Will create a JIRA item
- Office Hours video
- Karan will work with Jasmine to upload the video
- Papers
- RACE Paper submitted last week
- PEARC Paper this week
- Proposals
- Army Research
- enabling in-situ support for ExaScale
- linked with what Tu is doing
- SCEC Proposal Submitted
- have a good chance
- Exascale one with Michigan
- the call will come out soon
- Ewa, Rafael and Deborah
- NSF GCR Proposal
- Modelling wild fires
- Has PRICE school input and also Deborah's postdoc
- EScience
- Pegasus Tutorial Proposal
- May 6, 2019: Tutorial Proposal Deadline
- Also trying for the workflow comparison paper
- Dynamo paper by George
- Pegasus connect discussion
- tabled it for later when Mats is around
- HTCondor Week
- Karan will be doing a Pegasus talk and Pegasus workshop
- Pegasus OLCF Poster
- combine the panda poster
- can also submit to EScience
- Ryan's work
- Loic is moving pachyderm setup to AWS
- Loic Rafael and Tu are working on a paper for Cluster
- Software X
March 2019
March 29th 2019
- 4.9.1 Release
- done and working on 4.9.2
- Site Catalog Conversion to YAML
- mukund working on it
- i still need to look at the bamboo tests
- bamboo is failing on the Condor mount scratch issue
- we have to fix this in pegasus also: fail when credentials are in /tmp
- run condor_config_val on the key and check whether /tmp is in there
- mainly affects all the users that use x509
- LIGO has also tripped over it, both with Pegasus and without Pegasus
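A minimal sketch of the /tmp check described above, for illustration only. The notes do not name the HTCondor key; MOUNT_UNDER_SCRATCH is assumed here because of the "mount scratch" remark, and both helper names are hypothetical rather than existing Pegasus code.

```python
import subprocess

# Assumption: the key in question is MOUNT_UNDER_SCRATCH; the notes do not name it.
ASSUMED_KEY = "MOUNT_UNDER_SCRATCH"


def tmp_paths_in_value(value):
    """Return the entries of a config value that are /tmp or live under /tmp."""
    # Accept comma or whitespace separated lists, with or without quotes.
    cleaned = value.replace(",", " ").replace('"', " ")
    entries = cleaned.split()
    return [e for e in entries if e == "/tmp" or e.startswith("/tmp/")]


def query_condor_key(key=ASSUMED_KEY):
    """Ask condor_config_val for a key; return '' if undefined or unavailable."""
    try:
        result = subprocess.run(
            ["condor_config_val", key],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # HTCondor is not installed on this host.
        return ""
    return result.stdout.strip() if result.returncode == 0 else ""


if __name__ == "__main__":
    hits = tmp_paths_in_value(query_condor_key())
    if hits:
        print("credentials under these paths will not be visible to jobs:", hits)
```

A planner-side check could call this before staging x509 credentials and fail early when the credential path falls under one of the returned entries.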
- Condor vanilla checkpointing
- karan asked him about what he is trying to do
- composite events
- check for keys with same values
- also do we need to pad extra keys for all events?
- Extensions to Jupyter Integration
- Pegasus Connect
- will discuss on whiteboard on April 12th
March 1st 2019
- 4.9.1 Release
- moving it to early next week
- Pending Issues
- https://jira.isi.edu/projects/PM/versions/11891
- Execution environment for titan
- service dependencies
- PyOpen SSL
- Rajiv Mayani please look at that and the flask dependencies
- HPSS transfer client incorporation
- Set the transfers to do remotely
- Office Hours
- On Friday March 22nd on real time monitoring
- transformation catalog for 5.0
- Mukund will work on it next
- EScience?
- Paper
- pegasus-exitcode test
- success message not parsed correctly
- Programmer
- will interview the
February 2019
February 22nd 2019
- 4.9.1 Release
- Pending Issues
- https://jira.isi.edu/projects/PM/versions/11891
This raises the larger issue of how long we want to support externals packages.
There are some packages we need to ship because of worker package dependencies.
Consensus:
We remove the mysql python externals package for 4.9.1 and 5.0.0, and also remove the dependencies from our deb and RPM builds.
- Transfers within containers
- We are only going to transfer from within the container till people complain
- George Papadimitriou will add to the documentation.
- non ascii encoding in the stdout
- Support HPSS storage
The tools we use are htar and hsi
https://docs.nersc.gov/filesystems/archive/
- Office Hours
- George on real time monitoring.
- Date?
- EScience?
- Paper
- Tutorial submission
February 1st 2019
- 4.9.1 Release
- ascii encoding breaks while parsing for monitoring events. monitord should keep the population working and log a warning.
- but we should ensure that stdout in database still gets populated
- Karan will fix this
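A small sketch of the intended behavior, assuming the job stdout arrives as raw bytes: try strict ASCII first, and on failure log a warning and fall back to a lossy decode so the stdout still reaches the database. The function name and logger name are hypothetical, not monitord code.

```python
import logging

log = logging.getLogger("monitord-sketch")  # hypothetical logger name


def decode_job_output(raw, job_id):
    """Decode job stdout without ever raising on non-ASCII bytes."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as err:
        # Warn, but keep going so the database column still gets populated.
        log.warning(
            "job %s: non-ASCII byte at offset %d in stdout, replacing undecodable bytes",
            job_id,
            err.start,
        )
        return raw.decode("utf-8", errors="replace")
```

Bytes that are not valid UTF-8 either come back as the Unicode replacement character, so the event parsing never aborts on encoding alone.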
- New TC Format
- Shifter Support in Pegasus
- is in 4.9 branch
- Pegasus Annual Report
- will be working on it in coming weeks
- will ask for input
- next year's report will be tricky in terms of effort allocation.
January 2019
January 25th 2019
- 4.9.1 Release
- ascii encoding breaks while parsing for monitoring events. monitord should keep the population working and log a warning.
- but we should ensure that stdout in database still gets populated
- YAML format for the TC
- the line numbers should be mentioned in the errors
- GitHub commits don't trigger bamboo builds right now
- move to webhooks?
- slack token in bamboo.yml .
- mats will look into it further
- SCEC for HPC Transfer certificate issue
- messed-up Globus Online certificates caused the hpc-transfer issue.
- Data Storage at NERSC
- almost full
- Singularity container with the entry point.
- docker → singularity container conversion does not add the entry point.
January 18th 2019
- 4.9.1
- container execution
- data transfers happen within the container
- python3 issue
- vague rules to discover what python to use
- Singularity Hub URLs updated
- Documentation and tutorials need to be updated
- montage examples
- python stuff: create JIRA item
- LIGO pull requests
- Build pull request
- PAM module
- subprocess package thing
- also related to Python3 movement
- Transformation Catalog Implementation
- Astro Py
- Shifter support at NERSC
- Panda Integration
- XENONnT
- Rucio data pull in
- fetching data might be easier
- Journal Paper
- need to write something about containers
December 2018
December 13th, 2018
- Pegasus 4.9.1 release
- local site catalog entry creation
- based on the pegasus version on the submit host
- encoding issue in the stdout.
- Pegasus 5.0 Release
- TC yaml implementation
- mukund will create a yaml schema compatible with the TC
- backwards compatibility
- case by case basis
- definitely for
- catalogs
- dax
- pegasus-transfer
- SWIP Paper
- we are in good shape
- Titan
- under the PBS batch gahp.
- ZTF
- the pipeline is based on docker-compose
- peter will visit ISI with postdoc Danny in January
- Tutorial at TACC
- karan has updated pegasus-init to work on wrangler
- will update the tutorial notes accordingly
- OLCF accounts
- make sure they work
- make sure karan and mats can log in
November 2018
Nov 29th, 2018
- Ryan
- working on comparison paper with george on workflow systems
- mats, karan shared neon meeting notes with Ryan
- Pegasus 4.9.1 release
- Due for december end
- potential issue in monitord in reference to hierarchical organization of submit directories
- pegasus-submitdir
- ADASS Paper
- due tomorrow
- need to add information about sample run
- SWIP paper
- mats and karan will work on it tomorrow afternoon.
- cull out sections
- add information about updated monitoring in 4.9
- OLCF Kubernetes
- Condor is installed and configured as root
- George tried moving the condor log directory to lustre, as condor in the container has to run as a user, not as root
- LOG_DIR should be /tmp
- volumes can be attached to container to contain workflows etc
- Dynamo
- Do dynamic scheduling
- George thinking of using flocking
- similar to what is done in OSG
- non-sharedfs deployments should work
Nov 1st, 2018
- Pegasus 4.9.0 and 4.8.5 Released
- We released it this week.
- Pegasus Business Card
- Advocate for job postings.
- Postdoc options
- Programmers
- pegasus.isi.edu/jobs
- We should take to conferences with us
- Pegasus JAVA 8 dependence in RPM
- there is a disconnect between RPM and common.sh
- ADASS
- Karan working on a wlpipe demo example
- New Student
- Mukund
- Duncan started using 4.9.0 and has updated pyCBC to use singularity
- changed our container execution model
- all transfers done within the container now.
October 2018
Oct 12th, 2018
- Rescheduling meetings
- New time is Thursdays 2PM starting from last week of October
- DAX API reporting
- Perl DAX API - Rajiv
- Atlas visit
- Wednesday we have Scientific Computing Seminar
- Will involve writing a Pegasus code generator
- Panda is second biggest after Condor on OSG
- Thursday
- Karan and George will be there.
- Mats might be available remotely
- 4.9.0 Release
- Mats' preference is to skip the beta tag
- Aim for the full release
- Documentation freeze on Oct 26th
- Try and do the builds over the weekend
- Duncan container usecase
- cvmfs hosted container images
- Demo repository
- panorama data and some runs from exogeni / nersc
- Mats has two new Elastic Search VMs that are part of the Elastic Search cluster
- the data on these VMs is also backed up
Oct 5th, 2018
- Rescheduling meetings
- Either Tuesday or Thursdays
- Karan will circulate a doodle poll
September 2018
September 28th, 2018
- Rescheduling meetings
- Either Tuesday or Thursdays
- Karan will circulate a doodle poll
- Pegasus 4.9.0 Release
- transformation selection issue
- karan has not been able to recreate it yet.
- will look into it more today
- docker singularity pulls
- container symlink
- deprecate APIs
- modify DAX generators to indicate version/ DAX API used.
- will look into ways on how to do it
- one way is workflow metadata attributes
- second is attribute to ADAG object.
- rajiv will check how it gets stored in the metrics server
- ADASS
- will try and do a poster with Mike at ADASS
- deadline is Oct 8th
September 21st, 2018
- Rescheduling meetings
- Either Tuesday or Thursdays
- Pegasus 4.9 release
- integrity error reporting
- pegasus-statistics reporting information about integrity errors
- the unicorn dashboard for internal swip purposes
- errors are appearing in the stream
- more brainstorming required. the data is there
- not clear whether to use grafana or kibana
- does not have drill down functionality
- mix of production and test workflows
- create different queues in AMQP exchanges
- container mount point support
- karan is close to having that implemented
- transferring outputs to multiple locations
- lets say one for portal and the other for
- list of output sites
- good feature to add for 4.9.1
- update --output-site option to pegasus-plan
- pull docker images for singularity runs
- we should do for 4.9.0
- planner needs to tell pegasus-transfer an extra attribute.
- add a type attribute
- Papers
- Github private papers repo
- Deprecate stuff
- perl api
- old catalog formats
- pegasus-plots
- Hiring
August 2018
August 24th, 2018
- Pegasus 4.8.4 Release
- when are we releasing?
- next week before mats goes on vacation
- error tagging
- update stampede schema to add a table called tags
- will allow us to capture number of integrity errors
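A rough sketch of what such a tags table could look like, using an in-memory SQLite database. All column names and helper functions are hypothetical; the notes only say that a table called tags would be added to the stampede schema so that integrity errors can be counted.

```python
import sqlite3


def create_tags_table(conn):
    """Create a hypothetical tags table keyed by workflow and job instance."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tags (
            tag_id          INTEGER PRIMARY KEY,
            wf_id           INTEGER NOT NULL,
            job_instance_id INTEGER NOT NULL,
            name            TEXT    NOT NULL,
            tag_count       INTEGER NOT NULL DEFAULT 1
        )
        """
    )


def add_tag(conn, wf_id, job_instance_id, name, tag_count=1):
    """Record that a job instance was tagged, e.g. with an integrity error."""
    conn.execute(
        "INSERT INTO tags (wf_id, job_instance_id, name, tag_count) VALUES (?, ?, ?, ?)",
        (wf_id, job_instance_id, name, tag_count),
    )


def total_for_tag(conn, wf_id, name):
    """Sum the occurrences of one tag across a workflow."""
    row = conn.execute(
        "SELECT COALESCE(SUM(tag_count), 0) FROM tags WHERE wf_id = ? AND name = ?",
        (wf_id, name),
    ).fetchone()
    return row[0]
```

With a layout like this, a statistics tool could report the number of integrity errors per workflow with a single aggregate query.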
August 17th, 2018
- Pegasus 4.8.4 Release
- RPM fix ?
- mats will manually verify
- Karan should follow up with Stuart
- AMQP filtering
- we are working on having filtering built into monitord
- nepomunk already has 33 errors identified
- we need the db connection, pegasus-db-admin and other tools to pass properties with the pegasus property prefix stripped off
- SWIP Paper
- one reject seems to be harsh
- we can try for HPDC also
August 3, 2018
- Pegasus 4.8.3 Release
- singularity fix
- mats talked to adam at nebraska about containers.
- the main doc book will not be updated for 4.9
- SLURM
- Design Safe / TACC on Wrangler headnode
- Nextflow has integration with SLURM and everything can be installed in user space
- PMC unit tests are broken
- lets fix the tests
- Pegasus 4.9 release
- more real life runs
- nepomunk against ceph-s3 from one of uchicago machines
- we need to get stats reported for integrity errors
- larger issue of error classification
- ADASS Tutorial
- we got into second round
- add on exercise to run montage in the end.
- LIGO
- Bruce group at AEI Hannover has left LSC
- Infrastructure
- HipChat mess
- should we move to ISI Slack
- Public Chat feature
- Some clients for Hipchat
- Get a free channel from Slack
- for all Hipchat rooms
- what about ISI slack??
- Github removal of old integrations
- moving email notifications . Rafael Ferreira Da Silva will take care of it
- we need to explore
- MINT Meeting
- went well overall
- issue of scoping .
July 2018
July 27th, 2018
- Pegasus 4.8.3 Release
- VM Tutorial
- will update pegasus-init requirements to get it working
- main tutorial chapter will be updated for 4.9
- because then tutorial based container may not work
- change how docker scripts set environment
- SCEC database loading error
- Failing Tests
- Issue in updates to the dashboard database
- Panorama Paper
- agreed on a re-organization
June 2018
June 29th, 2018
- Pegasus
- 4.8.3 needs to be released because of singularity launching options
- will wait till tutorial is updated.
- karan will update pegasus-init with population modeling or povray option
- 4.9
- pegasus-statistics updated with integrity metrics
- how to flag job errors because of integrity
- need to figure out logic
- value add proposition
- maybe we should value type in the pegasus lite
- need to implement the integrity dial
- Start creating default local site entries to execute without local site
- ADASS Tutorial
- Will submit today
- Google doc shared
June 22nd, 2018
- Pegasus
- SWIP paper submitted to escience
- 4.8 montage tests failing
- changes for integrity metrics in pegasus-transfer
- updated monitord to parse events from various sources like pegasus lite output
- mats pointed out to a bug in monitord
- LIGO
- pip for python source package
- update dependencies for latest packages , like pyopen ssl
- install in the pip repository
- pegasus-analyzer
- interested in swip and containers.
- SCEC CSEP
- will use containers
- run on Comet
- 1000 genome workflow or use chimerica workflow
- ADASS Tutorial
- montage ?
- probably pycbc is also submitting a proposal
June 8th, 2018
- Scott Replica Catalog issue
- Replica Catalog deletes take a long time
- Bamboo
- bamboo emails are no longer received, so we don't find out about workflow plan failures
- SWIP
- monitord integrity changes. population of data from ks records working now.
- we still need to populate data from pegasus lite records and pegasus-transfer
- pegasus-statistics needs to be updated
- 0.1% overhead on production osg gem workflow
- Pegasus deployment at ORNL
- we should be doing it similar to hpc-pegasus
- Pegasus Office Hours
- next one in August
- travels in July
May 2018
May 4th, 2018
- Pegasus 4.8.2 Release done on May 3rd
- we should consider separate user data to a separate file on pegasus-wms
- si2 meeting updates
- some potential new users
- ewa slides were a good overview summary
- integrity data schema changes.
- monitord changes need thinking
April 2018
April 6th, 2018
- Pegasus 4.8.2 Release
- PMC bugs
- tutorial for usc hpc
- no longer allow + or . in the names
- Pegasus Report
- Submitted for Ewa's review
- SWIP test run
- discovered integrity errors in the wild
- at colorado and university of nebraska
- we would not have caught it before
- e-science paper
March 2018
March 30th, 2018
- SWIP
- pegasus-run issue, with wf restarting from scratch
- because dagman rescue file is not there.
- so should we update pegasus-run to look at the dagman.out file
- so far we think it should be kept consistent with normal dagman behavior
- to be discussed at condor week
- mats created a Jira item for swip related statistics
- https://jira.isi.edu/browse/PM-1260
- will involve a database schema.
- Things remaining
- Dials to be implemented
- stampede changes
- pegasus-transfer changes???
- SC Tutorial Submission ( April 16th)
- https://sc18.supercomputing.org/submit/tutorials-submissions/
- We should try and add exercises for containers
- We will try for half day
- 45 minute introduction
- Feedback from Arizona Container Camp
- There is interest.
- coming up with an existing application that people understand or can relate to
- montage - complex dax generator
- rosetta
- only works in nonsharedfs stuff
- with
- machine learning example?
- with tensor flow?
- requires container
- NVIDIA has a lot of examples about machine learning
- has to be multistep
- and at least bag of tasks
- Ashwin is doing some tensor flow stuff
- on workflow.isi.edu
- is working out of jupyter notebook
- Genome sequencing workflows??
- use Broad GATK sequencing workflow to use
- SOYKB and IRRI use GATK
- and are huge communities
- http://biocontainers.pro/docs/101/running-example/
- Pegasus Report
- we should resolve Jira items as we fix them
- will be also doing cumulative statistics
- Pegasus Office Hours
- Jupyter Notebooks
- will update the example to use namd example used for Oakridge
- Panorama Stuff
- our multiplexing part in monitord done so far
- however we are relying on amqp queues and routing keys for filtering
- darshan data population
- we need a script (pegasus-darshan), invoked in the namd wrapper script, that pulls the data from darshan logs on the file system and generates an ASCII output
- Panorama.isi.edu VM
- AMQP
- Logstash
- Kibana
- Elastic Search
- Make it do a backup every so often.
- Warns against doing it as a permanent datastore
- Rajiv will verify
- Influx
- Backups
- CRASH PLAN backup for the /srv and /opt in the panorama VM
- LIGO Database locked issues
- we need to look into the locking issues by tinkering with monitord flush intervals
March 16th, 2018
- SWIP
- Most of the SWIP stuff is done as far as planner changes and getting the workflows running
- we are in a position to share something
- To do
- sharedfs
- Dial implementation
- Update monitoring
- Paper submission for EScience
- Pegasus Reports
- new applications to attribute to pegasus grants
- all the mike wangs work will go here
- SCEC
- LIGO - need to ping Duncan
- Panorama/ Pegasus workflow endpoints
- We seem to be going towards AMQP
- How is AMQP going to be configured
- So far we have
- amqp://[USERNAME:PASSWORD@]amqp.isi.edu[:port]/<exchange_name>
- Online monitoring in kickstart: amqp://[USERNAME:PASSWORD@]amqp.isi.edu[:port]/<virtualhost>/<exchange_name>
- Virtual Hosts
- right now virtual host is hardcoded in monitord code. we set it to pegasus
- global - across workflows
- Exchanges
- should be global across workflows
- type direct - in panorama
- we want them to be type -> topic instead
- Queue
- in panorama different queues for each workflows
- Routing Keys
- the routing key should be based on stampede event names
- Events populated
- https://pegasus.isi.edu/documentation/stampede_wf_events.php
- We should add periodic events about states of workflows
- SWIP integrity error events will be populated by clients
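A minimal sketch of how the two endpoint forms noted above could be parsed with the Python standard library. It assumes a one-segment path is an exchange on the default virtual host (pegasus, as currently hardcoded in monitord) and a two-segment path is virtual host plus exchange. The function name is hypothetical, and 5672 is the standard AMQP port.

```python
from urllib.parse import urlsplit, unquote


def parse_amqp_url(url, default_vhost="pegasus", default_port=5672):
    """Split an amqp:// endpoint into connection settings and exchange name."""
    parts = urlsplit(url)
    if parts.scheme != "amqp":
        raise ValueError("expected an amqp:// URL, got %r" % url)

    # Path is either /<exchange_name> or /<virtualhost>/<exchange_name>.
    segments = [unquote(s) for s in parts.path.split("/") if s]
    if len(segments) == 1:
        vhost, exchange = default_vhost, segments[0]
    elif len(segments) == 2:
        vhost, exchange = segments
    else:
        raise ValueError("cannot find an exchange name in %r" % url)

    return {
        "host": parts.hostname,
        "port": parts.port or default_port,
        "username": parts.username,   # None when no credentials are given
        "password": parts.password,
        "virtual_host": vhost,
        "exchange": exchange,
    }
```

Routing keys are left out on purpose: per the notes they would be derived from the stampede event names rather than from the URL.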
February 2018
February 23rd, 2018
- Eliminate support for Py2.6?
- Python Dependencies
- All: future
- pegasus-service: Flask, SQLAlchemy, Flask-SQLAlchemy, Flask-Cache, pam, plex, pyOpenSSL, ordereddict
- pegasus-monitord: SQLAlchemy
- pegasus-analyzer: SQLAlchemy
- pegasus-s3: boto
- pegasus-globus-*: globus-sdk
- pegasus-init: jinja2
- pegasus-metadata: argparse
- pegasus-em: requests
- PostgreSQL: psycopg2
- MySQL: MySQL-Python OR mysqlclient
Note: Packages in green are available from yum.
February 9th, 2018
- SWIP
- checksum computation will be implemented in pegasus-transfer.
- allows us to handle the case where the input files don't have checksums in the RC
- integrity checks are disabled now for files that don't have checksums in the RC
- dial knob
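A hedged sketch of the transfer-side logic discussed above: compute a checksum for every file, compare it when the replica catalog (RC) supplies one, and otherwise just return the computed value so it can be recorded. SHA-256 and both function names are assumptions; this is not the pegasus-transfer implementation.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large inputs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_integrity(path, rc_checksum=None):
    """Return (status, checksum); status is 'computed', 'ok' or 'mismatch'."""
    actual = sha256_of(path)
    if rc_checksum is None:
        # No checksum in the RC: nothing to compare against, hand back ours.
        return ("computed", actual)
    status = "ok" if actual == rc_checksum.lower() else "mismatch"
    return (status, actual)
```

A dial setting could then decide what to do with each status, for example failing the job on "mismatch" but only warning on "computed".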
- Tests
- seem to be slow
- bamboo could be moved to the new server
- storage constraint test
- Lizard FS
- Mats will give an update next time around
- Servers
- Trying to do two servers
- If we buy one server
- Buy a storage server. That is Mats' preference.
- SoyKB workflow has
- Compute
- we will get a compute server first.
- We should figure out the server and put in the request soon, and done by Feb end
- LSST
- Tom Glanzman?
- We will touch base on Monday with Tom and Nersc folks
- Office Hours today
- have a presentation on containers
- will upload on the website
January 2018
January 12th, 2018
- AWS Batch
- seems to be running in karan's account.
- update documentation about aws batch
- Pegasus 4.8.1 Release
- up to Mats whether we should tag or not.
- Pegasus Office Hours
- Rafael will look up a new name
- Container Presentation
- Talk about containers
- Blue Jeans
- Advertising avenues
- XSEDE workflows list
- OSG List
December 2017
December 1st, 2017
- AWS Batch
- Client done. still have to figure out about stdout and stderr
- maybe we should have batch push the files and control where the jobs go in
- also maybe each file should go to its own stdout stderr
- Metrics for SWIP
- Stampede
- Metrics Server
- Elastic Search
- Rajiv working on changing the salt configuration
- Model Integration with Wings
November 2017
November 10th, 2017
- Pegasus
- AWS Batch
- checked in stuff
- jars checked in aws sub directory in the jars folder. pegasus-config classpath is updated accordingly
- Bamboo builds
- change in how users are handled
- rajiv and mats worked on changing the salt configuration for the various machines
- the major part changed was how the users are handled
- the bamboo user got messed up and uid's were mismatching on the filesystem
- main group for people unix accounts should be pegasus for everybody
- only project users will have access to VM's for a particular project
- Stewie Rebuild
- move off stewie. the main OS needs to be updated
- panorama
- Rafael and George will create a VM for panorama
- CENTOS 7
- mats will help George create VM
- Ashwin consumers from Influx DB
- mysql server
- Pegasus metrics server
- JSON vs YAML
- initial impressions seem to favor yaml
- YAML does have benefit of including comments
- also YAML , JSON will result in additional lines
- templates for site catalogs
- LSST
- mats will update documentation for pyglidein
- to work with condor pool passwords thing
- also will take mike site catalog to update NERSC entries
- tests
- rosetta and montage appear working again. not clear what triggered errors in first place
- SC Next week
- Rafael and Karan are away
- AWS workshop for LIGO
- George Panorama work
- Dakota ends up launching multiple Pegasus workflows based on its gradient functions
- using ensemble manager to do multiple runs
- George will check in dakota test case and example
- pick one approach and update documentation
- SWIP Demo
- think about merging stuff from panorama back to production branch
- work with ian foster and raj kettimuthu on globus online
- do multi site run
- Tudo
- working on insitu
- data spaces approach to have staging area
- tudo wrote sample applications
- evaluating on CORI using shared memory
- burst buffers cannot be used
- Ashwin
- analyzes influx db data
- using statistical learning
- python pandas library
November 3rd, 2017
- Pegasus 4.8.1 release
- 3 bugs in worker package staging.
- pegasus-transfer PYTHONHOME unset does not work
- hierarchical workflow handling.
- to be discussed tomorrow
- AWS Batch
- need to check in changes.
- need to add options for the client and do error checking.
- still need to figure out how to integrate in pegasus
September 2017
September 15th, 2017
- Pegasus development
- Dashboard
- LSST might want it running out of a directory other than $HOME/.pegasus
- No plans to tackle it right now. requirements are vague, and it is a catch-22 situation
- Python problem with Pegasus install
- DAX3 problem does not work.
- Could not be recreated
- PyPI account should be disabled
- PyPI has a 4.3 pegasus package
- we should remove it
- The jobname with dagman not allowing . is fixed
- LIGO
- Heard from Duncan. Tried out metadata stuff
- Another person at NERSC that is interested in running Condor
- AWS Batch
- done initial development.
- how to retrieve logs etc.
September 8th, 2017
- Pegasus 4.8.0 Release
- went out this week
- documentation
- pyglidein
- out of icecube
- mats added a section in the documentation
- pretty neat once it is setup
- and works really well on machines with two factor
- not tuned for MPI things.
- on the submit machine a web based python thing.
- pegasus resource profiles will work out of the box with pyglidein
- Releases
- Post 4.8 Releases
- changes in the Debian build
- source package has been renamed. mats removed the source part
- changed the versioning of RPM and Debian packages. The dev series will have the timestamp in it.
- pegasus-version -f also has timestamp
- Will create a separate YUM and DEB developer repositories
- repositories will not be signed.
- Mats is still playing with the setup
- Worked a lot on Debian packaging.
- HipChat will be upgraded to Stride
- Mats updated JIRA today
- Sim Center Workflows
- Using Condor IO thing
- for 4.8.1 we should look at the remap thing
- SWIP Poster
- the first review is really good
- Docker and Singularity
- have stuff about engineering challenges
- But not enough usage
- Practical Aspect
- Von's Group SWAMP thing.
- pegasus is part of trustworthy software thing?
- AWS Batch
- AWS batch thing works
- Investigate how Dakota and Pegasus can work together
- Run Dakota as a job
- Run Dakota on submission machine
- dakota calls a script that does a pegasus workflow
- Mix of 1 and 2.
August 2017
August 25th, 2017
- Pegasus 4.8.0 Release
- beta3 tagged
- monitord replay issue for rc tables against mysql server
- Jupyter thing
- VM updated with Jupyter
- Docker example application
- R builds with pegasus
- for time being only brew builds have that disabled.
- Condor update to the brew installation.
- Pegasus 4.9 Roadmap
- SWIP
- lay out the changes
- prioritize stuff for production readiness
- the knob for integrity.
- get into transfers.
- signing stuff on the backburner.
- chaos monkey tests
- metadata things
- aws batch support
- Pegasus Tutorial
- George felt that Pegasus tutorial was a bit too easy.
- it should be maybe more interactive. get the user to develop a new workflow
- Tudo will pick up Decaf work
- Dataspaces
- do data management
- Ashwin will work on deep learning on panorama
- use tensor flow
- Dakota
- ini file . runs simulation and converges simulation points
- George will be working on it
- has a checkpointing facility
August 18th, 2017
- mats found a new hydrology user in boulder
- based at Boulder
- there was a magpie presentation there.
- mats did a hosted ce tutorial
- 4.8.0beta2 release
- tagged and sent it out.
- monitord workflow and read permissions creation
- should only happen when the database is created.
- ~/.pegasus directory should be 755
- dashboard errors
- rajiv should traverse the directory in the dashboard.
- LSST
- cleanup issue
- mats and karan agree that it is bad application behavior
- we should reply to it.
- the wrapper should copy the file and launch the job
- source a setup script for jobs
- has to be generically done
- registration jobs shell expansion
- we should not do getEnv=True
- testing repo
- stuart from LIGO asked for it.
- BOSCO
- we have the examples updated
- Karan will remind Eliu about LIGO and Bluewaters
- Slick Jupyter Demos
- Started up VM's
- Jupyter tutorial
- should be integrated into the VM
August 11th, 2017
- Bamboo is finally green
- we will do a Pegasus RC1. actually a beta since we still want to address some issues.
- Rajiv fixed the build with python crypto issues
- pyopen-ssl was updated during 4.7.x series
- we should only package things whose versions we are not sensitive to
- so right now pyopenssl is removed from binary builds, and all associated dependencies were removed.
- New throttling things.
- number of jobs scale with the size of the workflows.
- SCEC all hands meeting.
- Documentation
- Took a stab at the containers.
- Rafael has to add a separate jupyter chapter
- Karan will update the throttling docs
- LSST
- Mats and Karan had a call with Tom about designing a workflow for one of the production pipelines
- Mats and Rafael had a call with the French cluster folks (Frédéric Suter). Frédéric works for SimGrid
- Paper
- rvGAHP paper ready for submissions
- Suraj Poster
- Mings pass really helped
July 2017
July 21st, 2017
- VMs are down, so tests are slow, and cannot test the new features yet
- Mats will email (or call) Derek to check on the VM issue
- Try to run the Montage container test on OSG
- TODO: Reconfigure our pool (it is not flocked yet)
- Pegasus 4.8.0
- Bugs in the container (transformation catalog) handling are fixed
- Stage in/out nodes based on the number of computing jobs on the workflow
- TODO: add warning for errors (size of jobs)
- Warning for category is done
- TODO: reference implementation of a workflow using docker (1000 Genome workflow - Rafael)
- Jupyter: add container keyword for API
June 2017
June 23rd, 2017
- Pegasus 4.8.0
- Decaf
- local universe jobs do not honor request_cpus, and jobs remain idle if they ask for multiple cpus
- karan will update pegasus to remove the request_ parameters from the local universe jobs
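A small sketch of the planned change, assuming a job's Condor submit attributes are available as a plain dictionary. The function name and the dictionary representation are hypothetical; the planner's real data structures differ.

```python
def strip_resource_requests(submit_attrs):
    """Drop request_* attributes from local universe jobs, leave other jobs alone."""
    if str(submit_attrs.get("universe", "")).lower() != "local":
        return dict(submit_attrs)
    return {
        key: value
        for key, value in submit_attrs.items()
        if not key.lower().startswith("request_")
    }
```

For example, a local universe job asking for request_cpus = 4 would be submitted without that line, so it no longer sits idle waiting for a slot that never matches.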
- Steven Clark
- Pegasus build issue is related to python 3 compatibility in the DAX API
- LIGO
- Eliu plans to run on Bluewaters
- we should confirm that he only wants to run on bluewaters.
- they have sucky performance of getting data to the compute nodes in bluewaters.
- set the schedd start date
- NERSC
- Karan will do a test setup there.
- Pegasus Builds
- failed because of Debian version upgrades to build tools
- setuptools in python complains about pegasus 4.8.0-dev
June 9th, 2017
- Pegasus 4.7.5
- pegasus-rc-client bug fix is done
- 4.7.5 and 4.8.0 together
- Pegasus 4.8 release
- docker stuff is complete
- docker tests added are green
- karan will work on singularity next week.
- LIGO reports pegasus lite jobs filling up /tmp . karan will check with LIGO on whether there is any environment set?
- rafael will update his api to make it consistent with the container format
- also will add a bamboo example.
- DECAF integration
- karan has an idea about it.
June 2nd, 2017
- Pegasus 4.7.5
- pegasus-rc-client bug fix to be done
- Jupyter
- rafael will be working on it during June
- For 4.8.0
- container
- docker works in nonsharedfs right now.
- work on singularity support.
- clustering . clustered jobs can only refer to one container
- symlinks - for 4.8.0 they are disabled.
- container sharedfs example
- we have pegasus-lite with sharedfs. automatic translation of file URL's
- transfer refiner
- notification email updates
- mats updated default notification scripts. will generate svg files
- at end of workflow generate notifications that have statistics
- monitord needs to run the remaining notifications after the workflow is done.
- makeflow integration
- limitations for pegasus generating make flow integration
- makeflow model
- all files have to be on the submit host
- how do we translate auxiliary jobs to a makeflow description
- tyson at arizona.
- add new transfer jobs
- add new credentials
- no postscripts there
- monitoring
- won't work with monitoring
- write a new monitord.
- maybe do an opposite translation?
- what will be useful is to integrate with using work queue with our own dagman manager.
May 2017
May 12th, 2017
- auto scaling of stage out and stage in jobs
- 4.8 transfer refiner will be Cluster by default.
- auto-computation of number of stage in, stage out and cleanup jobs
- defaults should be computed based on number of jobs at a level.
- use a ratio or step function.
- come up with ratio ranges for auto determination
- 1:5 for number of jobs < 10K (20%)
- 1:20 for number of jobs > 20K (5%)
- will create a JIRA item for this
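The step function discussed above could be sketched as follows. This is only an illustration of the two ratio ranges in the notes: the function name `auto_transfer_job_count` is hypothetical, and the 1:10 ratio used between 10K and 20K jobs is an assumption, since the notes do not say what should happen in that range.

```python
import math


def auto_transfer_job_count(num_jobs_at_level: int) -> int:
    """Estimate the number of stage-in/stage-out/cleanup jobs for a level
    of the workflow, using a step function over the job count."""
    if num_jobs_at_level <= 0:
        return 0
    if num_jobs_at_level < 10_000:
        jobs_per_transfer = 5    # 1:5 ratio (20%), from the notes
    elif num_jobs_at_level > 20_000:
        jobs_per_transfer = 20   # 1:20 ratio (5%), from the notes
    else:
        # ASSUMPTION: the notes leave 10K-20K undefined; 1:10 is a placeholder.
        jobs_per_transfer = 10
    # Always create at least one transfer job for a non-empty level.
    return max(1, math.ceil(num_jobs_at_level / jobs_per_transfer))


small_level = auto_transfer_job_count(100)
large_level = auto_transfer_job_count(40_000)
```

A plain step function like this is discontinuous at the range boundaries, so a level with just under 10K jobs gets more transfer jobs than one with just over 10K; a smoother ratio would avoid that.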
- container stuff
- close to having one example running
- have not figured clustering jobs out yet.
- mats agrees with the approach now. pegasus lite invokes the docker run commands.
- integrity stuff
- will make slides
- be specific about what we have done.
- we give them an option of running synthetic stuff
- For
- also define best effort part.
- strict, off, minimal, best effort
- how do we handle case where SHA exists.
- WDL
- workflow definition language
- WDL is JSON based
- has a template approach with variable substitution
- AWS Cleanup
- need to delete snapshots and clean up VMs
March 2017
March 17th, 2017
- monitord stdout and stderr missing
- the VARS one. just expose the variable.
- SCEC issue
- job managers per resource
- got fixed by one job manager per job
- BOSCO works partly.
- containers call from yesterday
- metadata
- metadata population in postscripts
- move metadata population to the postscripts.
March 10th, 2017
- SCEC cleanup issue
- related to Jglobus not updating and enforcing RFC 2818 compliance
- LSST visit update
March 3rd, 2017
- Pegasus 4.7.4 Release
- sent out the release
- we did a ligo fix yesterday to pegasus transfer
- mats osg gem
- workflow did not finish
- pegasus-exitcode has a shortcut for a regex
- make it more strict. whether to trigger failure in pegasus-exitcode
- revisit how metadata is populated
- trigger failure for missing records.
- SCEC RC client issue
- Rafael will look into it for pegasus-rc-client
- containers support
- containers on a pause right now.
- Webinar
- let's try to schedule one for the end of April
- bluejeans will be an option
- the topic will be the new features in 4.8.0
February 2017
February 24th, 2017
- Pegasus 4.7.4 Release
- we will tag today.
- there is a potential monitord bug that happens on sub workflow retries only in the live mode, which Karan is unable to trace
- containers support
- pegasus lite launches docker wrap
- or the other way around. because worker package has to be installed in the container in some cases
- so double install
- Clustered jobs
- we want at most one container to be used by a clustered job.
- monitord performance
- on OSG connect there is a difference between 4.6 and 4.7 performance replay
- monitord.log has errors indicating unable to read .out .err files.
- we think it is a race between DAGMan and the filesystem
February 17th, 2017
- Pegasus 4.7.4 Release
- targeted for next week.
- LIGO ran into a prescript issue
- pegasus lite deleted the worker package in the workflow submit directory
- only triggered when there was a subsequent compute job.
- new transformation catalog format
- containers
- open issue whether docker wrapper launches pegasus lite
- or the other way around
February 10th, 2017
- Pegasus 4.7.3 Release
- SCEC has issue with pegasus-db-admin
- mysqldump times out when updating their replica catalog
- Database TC
- remove support for Database TC
- Stewie and fisheye upgrades
- fisheye upgrade
- Mats agreed to do the upgrade
- stewie runs debian 7
- we need to upgrade it sooner or later.
- runs GridFTP and mysql
- RabbitMQ is running there
- MongoDB is running there
- Catalog dependencies on stewie
- 5K limit for a new server
- OSG All Hands Meeting
- looks like there will be no tutorial
- lots of pegasus users coming there
- Containers Support
- pegasus lite invokes the docker wrap.
- singularity support will be required.
- container modes
- should we support docker definition file
- do we build on the worker nodes?
- pull in an existing docker image from the hub
- on the staging site
- whether we should unload an image or not
- we should try and cleanup
- credential renaming has to be worked out
- Transformation Catalog
- how to represent container dependency in the transformation catalog
February 3rd, 2017
- Pegasus 4.7.3 Release
- we tag later today or first thing monday
- waiting for scott to reply
- Jupyter Notebook
- in general, the Jupyter interactive interface closes if you close the tab
- in our case it does not affect us, since we invoke pegasus-plan at the server end
- Vicky has a workflow out of panorama that she has in Jupyter as a set of instructions
- Containers
- karan did some exploration of docker containers via HTCondor
- by default docker in the container runs as root.
- means output files are written out as root
- also the containers need to be shipped around.
January 2017
January 27th, 2017
...