April 2019
April 12th 2019
- Pegasus 4.5 release
- release candidate today rc2
- updates
- job throttling added to optimization guide.
- APP PATH addon
- REST monitoring API 5.0
- Site Catalog Conversion to YAML
- mukund is mainly done
- pushed out his changes
- trying to make the tests green
- Checkpointing changes to accommodate LIGO use of vanilla universe
- Karan and Mats will explore and see if it is possible
- cumulative stdout|stderr
- what about time and duration values
- since there is no DAG Node retry and job just goes on HELD state
- Composite Events
- Kibana dashboard needs to be updated
- dropping __ in the event names
- George wants the AMQP library updated
- Will create a JIRA item
- Office Hours video
- Karan will work with Jasmine to upload the video
- Papers
- RACE Paper submitted last week
- PEARC Paper this week
- Proposals
- Army Research
- enabling in-situ supports for ExaScale
- linked with what Tu is doing
- SCEC Proposal Submitted
- have a good chance
- Exascale one with Michigan
- the call will come out soon
- Ewa, Rafael and Deborah
- NSF GCR Proposal
- Modelling wild fires
- Has PRICE school input and also Deborah Post DOC
- EScience
- Pegasus Tutorial Proposal
- May 6, 2019: Tutorial Proposal Deadline
- Also trying for the workflow comparison paper
- Dynamo paper by George
- Pegasus connect discussion
- tabled it for later when Mats is around
- HTCondor Week
- Karan will be doing a Pegasus talk and Pegasus workshop
- Pegasus OLCF Poster
- combine the panda poster
- can also submit to EScience
- Ryan's work
- Loic is moving pachyderm setup to AWS
- Loic Rafael and Tu are working on a paper for Cluster
- Software X
March 2019
March 29th 2019
- 4.9.1 Release
- done and working on 4.9.2
- Site Catalog Conversion to YAML
- mukund working on it
- i still need to look at the bamboo tests
- bamboo failing on the condor mount scratch thing
- we have to fix this in pegasus also, to fail on credentials in /tmp
- check and do condor_config_val on the key and check if /tmp is in there
- mainly affects all the users that use x509
- LIGO has also tripped over it, both with Pegasus and without Pegasus
- Condor vanilla checkpointing
- karan asked him about what he is trying to do
- composite events
- check for keys with same values
- also do we need to pad extra keys for all events?
- Extensions to Jupyter Integration
- Pegasus Connect
- will discuss on whiteboard on April 12th
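A minimal sketch of the condor_config_val check mentioned above for credentials in /tmp. The notes do not name the key; MOUNT_UNDER_SCRATCH is assumed here because of the "mount scratch" remark, and the helper names are made up for illustration.

```python
import subprocess


def scratch_mounts(config_value):
    """Split a comma- or space-separated config value into a list of paths."""
    cleaned = config_value.replace('"', "").replace(",", " ")
    return [p for p in cleaned.split() if p]


def tmp_is_mounted_under_scratch(config_value):
    """True if /tmp, or a directory below it, is listed in the value."""
    return any(p == "/tmp" or p.startswith("/tmp/")
               for p in scratch_mounts(config_value))


def query_condor(key="MOUNT_UNDER_SCRATCH"):
    """Ask condor_config_val for the key (assumed key name).

    Returns an empty string if the tool is missing or the key is unset.
    """
    try:
        result = subprocess.run(["condor_config_val", key],
                                capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return ""
    return result.stdout.strip() if result.returncode == 0 else ""


if __name__ == "__main__":
    if tmp_is_mounted_under_scratch(query_condor()):
        print("warning: /tmp is remapped per job; credentials in /tmp will not be visible")
```

A planner-side check could call this before submission and fail early when an x509 credential path lives under /tmp.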
March 1st 2019
- 4.9.1 Release
- moving it to early next week
- Pending Issues
- https://jira.isi.edu/projects/PM/versions/11891
- Execution environment for titan
- service dependencies
- PyOpen SSL
- Rajiv Mayani please look at that and the flask dependencies
- HPSS transfer client incorporation
- Set the transfers to do remotely
- Office Hours
- On Friday March 22nd on real time monitoring
- transformation catalog for 5.0
- Mukund will work on it next
- EScience?
- Paper
- pegasus-exitcode test
- success message not parsed correctly
- Programmer
- will interview the
February 2019
February 22nd 2019
- 4.9.1 Release
- Pending Issues
- https://jira.isi.edu/projects/PM/versions/11891
This raises the larger issue of how long we want to support external packages.
There are some packages we need to ship because of worker package dependencies.
Consensus:
We remove the mysql python externals package for 4.9.1 and 5.0.0, and also remove the dependencies from our deb and RPM builds.
- Transfers within containers
- We are only going to transfer from within the container till people complain
- George Papadimitriou will add to the documentation.
- non ascii encoding in the stdout
- Support HPSS storage
The tools we use are htar and hsi
https://docs.nersc.gov/filesystems/archive/
- Office Hours
- George on real time monitoring.
- Date?
- EScience?
- Paper
- Tutorial submission
February 1st 2019
- 4.9.1 Release
- ascii encoding breaks while parsing for monitoring events. monitord should keep the population working and log a warning.
- but we should ensure that stdout in database still gets populated
- Karan will fix this
- New TC Format
- Shifter Support in Pegasus
- is in 4.9 branch
- Pegasus Annual Report
- will be working on it in coming weeks
- will ask for input
- next year's report will be tricky in terms of effort allocation.
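For the non-ascii stdout item in the 4.9.1 notes above, one possible shape of the fix is sketched below: decode strictly first, and on failure log a warning and fall back to a lossy decode so that stdout still reaches the database. The function and logger names are hypothetical and this is not the actual monitord code.

```python
import logging

log = logging.getLogger("monitord-sketch")  # hypothetical logger name


def decode_job_stdout(raw, encoding="ascii"):
    """Decode job stdout, replacing undecodable bytes instead of failing."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        # Keep populating the database; just record that data was altered.
        log.warning("non-%s byte in job stdout at offset %d, replacing",
                    encoding, err.start)
        return raw.decode(encoding, errors="replace")
```

Each undecodable byte becomes U+FFFD, so the stored text stays the same length as the original output and the position of the damage remains visible.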
January 2019
January 25th 2019
- 4.9.1 Release
- ascii encoding breaks while parsing for monitoring events. monitord should keep the population working and log a warning.
- but we should ensure that stdout in database still gets populated
- YAML format for the TC
- the line numbers should be mentioned in the errors
- GitHub commits don't trigger bamboo builds right now
- move to webhooks?
- slack token in bamboo.yml .
- mats will look into it further
- SCEC for HPC Transfer certificate issue
- Globus online certificates messed up hpc-transfer issue.
- Data Storage at NERSC
- almost full
- Singularity container with the entry point.
- docker → singularity container conversion does not add the entry point.
January 18th 2019
- 4.9.1
- container execution
- data transfers happen within the container
- python3 issue
- vague rules to discover what python to use
- Singularity Hub URLs updated
- Documentation and tutorials need to be updated
- montage examples
- python stuff: create JIRA item
- LIGO pull requests
- Build pull request
- PAM module
- subprocess package thing
- also related to Python3 movement
- Transformation Catalog Implementation
- Astro Py
- Shifter support at NERSC
- Panda Integration
- XENONnT
- Rucio data pull in
- fetching data might be easier
- Journal Paper
- need to write something about containers
December 2018
December 13th, 2018
- Pegasus 4.9.1 release
- local site catalog entry creation
- based on the pegasus version on the submit host
- encoding issue in the stdout.
- Pegasus 5.0 Release
- TC yaml implementation
- mukund will create a yaml schema compatible with the TC
- backwards compatibility
- case by case basis
- definitely for
- catalogs
- dax
- pegasus-transfer
- SWIP Paper
- we are in good shape
- Titan
- under the PBS batch gahp.
- ZTF
- the pipeline is based on docker-compose
- peter will visit ISI with postdoc Danny in January
- Tutorial at TACC
- karan has updated pegasus-init to work on wrangler
- will update the tutorial notes accordingly
- OLCF accounts
- make sure they work
- make sure karan and mats can log in
November 2018
Nov 29th, 2018
- Ryan
- working on comparison paper with george on workflow systems
- mats, karan shared neon meeting notes with Ryan
- Pegasus 4.9.1 release
- Due for december end
- potential issue in monitord in reference to hierarchical organization of submit directories
- pegasus-submitdir
- ADASS Paper
- due tomorrow
- need to add information about sample run
- SWIP paper
- mats and karan will work on it tomorrow afternoon.
- cull out sections
- add information about updated monitoring in 4.9
- OLCF Kubernetes
- Condor is installed and configured as root
- George tried moving the condor log directory to lustre, since condor in the container has to run as a user, not as root
- LOG_DIR should be /tmp
- volumes can be attached to container to contain workflows etc
- Dynamo
- Do dynamic scheduling
- George thinking of using flocking
- similar to what is done in OSG
- non-sharedfs deployments should work
Nov 1st, 2018
- Pegasus 4.9.0 and 4.8.5 Released
- We released it this week.
- Pegasus Business Card
- Advocate for job postings.
- Postdoc options
- Programmers
- pegasus.isi.edu/jobs
- We should take to conferences with us
- Pegasus JAVA 8 dependence in RPM
- there is a disconnect between RPM and common.sh
- ADASS
- Karan working on a wlpipe demo example
- New Student
- Mukund
- Duncan started using 4.9.0 and has updated pyCBC to use singularity
- changed our container execution model
- all transfers done within the container now.
October 2018
Oct 12th, 2018
- Rescheduling meetings
- New time is Thursdays 2PM starting from last week of October
- DAX API reporting
- Perl DAX API - Rajiv
- Atlas visit
- Wednesday we have Scientific Computing Seminar
- Will involve writing a Pegasus code generator
- Panda is second biggest after Condor on OSG
- Thursday
- Karan and George will be there.
- Mats might be available remotely
- 4.9.0 Release
- Mats' preference is to skip the beta tag
- Aim for the full release
- Documentation freeze on Oct 26th
- Try and do the builds over the weekend
- Duncan container usecase
- cvmfs hosted container images
- Demo repository
- panorama data and some runs from exogeni / nersc
- Mats has two new Elasticsearch VMs that are part of the Elasticsearch cluster
- these VMs' data is also backed up
Oct 5th, 2018
- Rescheduling meetings
- Either Tuesday or Thursdays
- Karan will circulate a doodle poll
September 2018
September 28th, 2018
- Rescheduling meetings
- Either Tuesday or Thursdays
- Karan will circulate a doodle poll
- Pegasus 4.9.0 Release
- transformation selection issue
- karan has not been able to recreate it yet.
- will look into it more today
- docker singularity pulls
- container symlink
- deprecate api's
- modify DAX generators to indicate version/ DAX API used.
- will look into ways on how to do it
- one way is workflow metadata attributes
- second is attribute to ADAG object.
- rajiv will check how it gets stored in the metrics server
- ADASS
- will try and do a poster with Mike at ADASS
- deadline is Oct 8th
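The "workflow metadata attributes" option from the deprecation discussion above could look roughly like the sketch below, which adds two metadata elements to an adag element. The key names dax.api and dax.api.version are invented for illustration, and the notes do not settle on this option over an attribute on the ADAG object.

```python
import xml.etree.ElementTree as ET


def tag_dax_with_generator(adag, api_name, api_version):
    """Record which DAX API produced the workflow as metadata entries.

    The metadata keys used here are hypothetical.
    """
    for key, value in (("dax.api", api_name),
                       ("dax.api.version", api_version)):
        node = ET.SubElement(adag, "metadata", {"key": key})
        node.text = value
    return adag


adag = ET.Element("adag", {"name": "example"})
tag_dax_with_generator(adag, "python", "4.9.0")
print(ET.tostring(adag, encoding="unicode"))
```

Carrying the information as metadata keeps the DAX schema unchanged, which matters if the metrics server only needs to read key/value pairs.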
September 21st, 2018
- Rescheduling meetings
- Either Tuesday or Thursdays
- Pegasus 4.9 release
- integrity error reporting
- pegasus-statistics reporting information about integrity errors
- the unicorn dashboard for internal swip purposes
- errors are appearing in the stream
- more brainstorming required. the data is there
- not clear whether to use grafana or kibana
- does not have drill down functionality
- mix of production and test workflows
- create different queues in AMQP exchanges
- container mount point support
- karan is close to having that implemented
- transferring outputs to multiple locations
- lets say one for portal and the other for
- list of output sites
- good feature to add for 4.9.1
- update --output-site option to pegasus-plan
- pull docker images for singularity runs
- we should do for 4.9.0
- planner needs to tell pegasus-transfer an extra attribute.
- add a type attribute
- Papers
- Github private papers repo
- Deprecate stuff
- perl api
- old catalog formats
- pegasus-plots
- Hiring
August 2018
August 24th, 2018
- Pegasus 4.8.4 Release
- when are we releasing?
- next week before mats goes on vacation
- error tagging
- update stampede schema to add a table called tags
- will allow us to capture number of integrity errors
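A rough sketch of what a tags table for error tagging might look like, using sqlite3 only so that the example is self-contained. All column names are assumptions; the notes only say that a table called tags should be added to the stampede schema so that integrity errors can be counted.

```python
import sqlite3

# Hypothetical layout; only the table name comes from the notes.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tags (
    tag_id          INTEGER PRIMARY KEY,
    wf_id           INTEGER NOT NULL,
    job_instance_id INTEGER NOT NULL,
    name            TEXT    NOT NULL,
    occurrences     INTEGER NOT NULL DEFAULT 1
)
"""


def record_tag(conn, wf_id, job_instance_id, name, occurrences=1):
    """Attach a tag such as 'integrity.error' to a job instance."""
    conn.execute(
        "INSERT INTO tags (wf_id, job_instance_id, name, occurrences) "
        "VALUES (?, ?, ?, ?)",
        (wf_id, job_instance_id, name, occurrences),
    )


def count_tag(conn, wf_id, name):
    """Total occurrences of a tag across one workflow."""
    row = conn.execute(
        "SELECT COALESCE(SUM(occurrences), 0) FROM tags "
        "WHERE wf_id = ? AND name = ?",
        (wf_id, name),
    ).fetchone()
    return row[0]
```

With a table like this, pegasus-statistics could report integrity errors per workflow with a single aggregate query.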
August 17th, 2018
- Pegasus 4.8.4 Release
- RPM fix ?
- mats will manually verify
- Karan should follow up with Stuart
- AMQP filtering
- we are working on having filtering built into monitord
- nepomunk already has 33 errors identified
- we need the db connection, pegasus-db-admin and other tools to pass properties with the pegasus property prefix stripped off
- SWIP Paper
- one reject seems to be harsh
- we can try for HPDC also
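For the AMQP filtering item above about passing properties with the pegasus prefix stripped off, a small sketch is given below. It assumes the prefix is the literal string "pegasus." and that unprefixed properties pass through unchanged; the property names in the usage are made up.

```python
def strip_property_prefix(props, prefix="pegasus."):
    """Return a copy of props with the prefix removed from matching keys.

    Keys that do not start with the prefix are kept as they are.
    """
    stripped = {}
    for key, value in props.items():
        if key.startswith(prefix):
            key = key[len(prefix):]
        stripped[key] = value
    return stripped


example = {
    "pegasus.example.db.url": "sqlite:///workflow.db",  # hypothetical property
    "timeout": "30",
}
print(strip_property_prefix(example))
```

If a prefixed key and an unprefixed key collapse to the same name, the later one wins in this sketch; a real tool would probably want to flag that.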
August 3, 2018
- Pegasus 4.8.3 Release
- singularity fix
- mats talked to adam at nebraska about containers.
- the main doc book will not be updated for 4.9
- SLURM
- Design Safe / TACC on Wrangler headnode
- Nextflow has integration with SLURM and everything can be installed in user space
- PMC unit tests are broken
- lets fix the tests
- Pegasus 4.9 release
- more real life runs
- nepomunk against ceph-s3 from one of uchicago machines
- we need to get stats reported for integrity errors
- larger issue of error classification
- ADASS Tutorial
- we got into second round
- add on exercise to run montage in the end.
- LIGO
- Bruce group at AEI Hannover has left LSC
- Infrastructure
- HipChat mess
- should we move to ISI Slack
- Public Chat feature
- Some clients for Hipchat
- Get a free channel from Slack
- for all Hipchat rooms
- what about ISI slack??
- Github removal of old integrations
- moving email notifications. Rafael Ferreira Da Silva will take care of it
- we need to explore
- MINT Meeting
- went well overall
- issue of scoping .
July 2018
July 27th, 2018
- Pegasus 4.8.3 Release
- VM Tutorial
- will update pegasus-init requirements to get it working
- main tutorial chapter will be updated for 4.9
- because then tutorial based container may not work
- change how docker scripts set environment
- SCEC database loading error
- Failing Tests
- Issue in updates to the dashboard database
- Panorama Paper
- agreed on a re-organization
June 2018
June 29th, 2018
- Pegasus
- 4.8.3 needs to be released because of singularity launching options
- will wait till tutorial is updated.
- karan will update pegasus-init with population modeling or povray option
- 4.9
- pegasus-statistics updated with integrity metrics
- how to flag job errors because of integrity
- need to figure out logic
- value add proposition
- maybe we should value type in the pegasus lite
- need to implement the integrity dial
- Start creating default local site entries to execute without local site
- ADASS Tutorial
- Will submit today
- Google doc shared
June 22nd, 2018
- Pegasus
- SWIP paper submitted to escience
- 4.8 montage tests failing
- changes for integrity metrics in pegasus-transfer
- updated monitord to parse events from various sources like pegasus lite output
- mats pointed out to a bug in monitord
- LIGO
- pip for python source package
- update dependencies for latest packages , like pyopen ssl
- install in the pip repository
- pegasus-analyzer
- interested in swip and containers.
- SCEC CSEP
- will use containers
- run on Comet
- 1000 genome workflow or use chimerica workflow
- ADASS Tutorial
- montage ?
- probably pycbc is also submitting a proposal
June 8th, 2018
- Scott Replica Catalog issue
- Replica Catalog deletes take a long time
- Bamboo
- bamboo emails are no longer received, so we don't find out about workflow plan failures
- SWIP
- monitord integrity changes. population of data from ks records working now.
- we still need to populate data from pegasus lite records and pegasus-transfer
- pegasus-statistics needs to be updated
- 0.1% overhead on production osg gem workflow
- Pegasus deployment at ORNL
- we should be doing it similar to hpc-pegasus
- Pegasus Office Hours
- next one in August
- travels in July
May 2018
May 4th, 2018
- Pegasus 4.8.2 Release done on May 3rd
- we should consider separating user data into a separate file on pegasus-wms
- si2 meeting updates
- some potential new users
- ewa slides were a good overview summary
- integrity data schema changes.
- monitord changes need thinking
April 2018
April 6th, 2018
- Pegasus 4.8.2 Release
- PMC bugs
- tutorial for usc hpc
- no longer allow + or . in the names
- Pegasus Report
- Submitted for Ewa's review
- SWIP test run
- discovered integrity errors in the wild
- at colorado and university of nebraska
- we would have not caught it before
- e-science paper
March 2018
March 30th, 2018
- SWIP
- pegasus-run issue, with wf restarting from scratch
- because dagman rescue file is not there.
- so should we update pegasus-run to look at the dagman.out file
- so far we think it should be kept consistent with normal dagman behavior
- to be discussed at condor week
- mats created a Jira item for swip related statistics
- https://jira.isi.edu/browse/PM-1260
- will involve a database schema.
- Things remaining
- Dials to be implemented
- stampede changes
- pegasus-transfer changes???
- SC Tutorial Submission ( April 16th)
- https://sc18.supercomputing.org/submit/tutorials-submissions/
- We should try and add exercises for containers
- We will try for half day
- 45 minute introduction
- Feedback from Arizona Container Camp
- There is interest.
- coming up with an existing application that people understand or can relate to
- montage - complex dax generator
- rosetta
- only works in nonsharedfs stuff
- with
- machine learning example?
- with tensor flow?
- requires container
- NVIDIA has a lot of examples about machine learning
- has to be multistep
- and at least bag of tasks
- Ashwin is doing some tensor flow stuff
- on workflow.isi.edu
- is working out of jupyter notebook
- Genome sequencing workflows??
- use Broad GATK sequencing workflow to use
- SOYKB and IRRI use GATK
- and are huge communities
- http://biocontainers.pro/docs/101/running-example/
- Pegasus Report
- we should resolve Jira items as we fix them
- will be also doing cumulative statistics
- Pegasus Office Hours
- Jupyter Notebooks
- will update the example to use namd example used for Oakridge
- Panorama Stuff
- our multiplexing part in monitord done so far
- however we are relying on amqp queues and routing keys for filtering
- darshan data population
- we need a script (pegasus-darshan), invoked from the namd wrapper script, that pulls the data from darshan logs on the file system and generates an ASCII output
- Panorama.isi.edu VM
- AMQP
- Logstash
- Kibana
- Elastic Search
- Make it do a backup every so often.
- Warns against doing it as a permanent datastore
- Rajiv will verify
- Influx
- Backups
- CRASH PLAN backup for the /srv and /opt in the panorama VM
- LIGO Database locked issues
- we need to look into the locking issues by tinkering with monitord flush intervals
March 16th, 2018
- SWIP
- Most of the SWIP stuff is done as far as planner changes and getting the workflows running
- we are in a position to share something
- To do
- sharedfs
- Dial implementation
- Update monitoring
- Paper submission for EScience
- Pegasus Reports
- new applications to attribute to pegasus grants
- all the mike wangs work will go here
- SCEC
- LIGO - need to ping Duncan
- Panorama/ Pegasus workflow endpoints
- We seem to be going towards AMQP
- How is AMQP going to be configured
- So far we have
- amqp://[USERNAME:PASSWORD@]amqp.isi.edu[:port]/<exchange_name>
- Online monitoring in kickstart: amqp://[USERNAME:PASSWORD@]amqp.isi.edu[:port]/<virtualhost>/<exchange_name>
- Virtual Hosts
- right now virtual host is hardcoded in monitord code. we set it to pegasus
- global - across workflows
- Exchanges
- should be global across workflows
- type direct - in panorama
- we want them to be type -> topic instead
- Queue
- in panorama different queues for each workflows
- Routing Keys
- the routing key should be based on stampede event names
- Events populated
- https://pegasus.isi.edu/documentation/stampede_wf_events.php
- We should add periodic events about states of workflows
- SWIP integrity error events will be populated by clients
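The two AMQP endpoint forms listed in the notes above can be parsed with the standard library, as in the sketch below. It assumes the virtual host defaults to "pegasus" when the URL has only an exchange name (the notes say monitord hardcodes that value) and that the port defaults to 5672, the standard AMQP port. The function name is hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_VHOST = "pegasus"  # hardcoded in monitord according to the notes
DEFAULT_PORT = 5672        # standard AMQP port


def parse_amqp_endpoint(url):
    """Split amqp://[user:pass@]host[:port]/[<vhost>/]<exchange> into parts."""
    parts = urlsplit(url)
    if parts.scheme != "amqp":
        raise ValueError("not an amqp URL: %r" % url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) == 1:
        vhost, exchange = DEFAULT_VHOST, segments[0]
    elif len(segments) == 2:
        vhost, exchange = segments
    else:
        raise ValueError("expected /<exchange> or /<vhost>/<exchange>: %r" % url)
    return {
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORT,
        "username": parts.username,
        "password": parts.password,
        "virtual_host": vhost,
        "exchange": exchange,
    }
```

Once the exchange is of type topic, a publisher could use the stampede event name directly as the routing key, so consumers can bind with patterns instead of needing one queue per workflow.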
February 2018
February 23rd, 2018
- Eliminate support for Py2.6?
- Python Dependencies
- All: future
- pegasus-service: Flask, SQLAlchemy, Flask-SQLAlchemy, Flask-Cache, pam, plex, pyOpenSSL, ordereddict
- pegasus-monitord: SQLAlchemy
- pegasus-analyzer: SQLAlchemy
- pegasus-s3: boto
- pegasus-globus-*: globus-sdk
- pegasus-init: jinja2
- pegasus-metadata: argparse
- pegasus-em: requests
- PostgreSQL: psycopg2
- MySQL: MySQL-Python OR mysqlclient
- Note: Packages in green are available from yum.
February 9th, 2018
- SWIP
- checksum computation will be implemented in pegasus-transfer.
- allows us to handle the case where the input files don't have checksums in the RC
- integrity checks are disabled now for files that don't have checksums in the RC
- dial knob
- Tests
- seem to be slow
- bamboo could be moved to the new server
- storage constraint test
- Lizard FS
- Mats will give an update next time around
- Servers
- Trying to do two servers
- If we buy one server
- Buy a storage server. That is Mats' preference.
- SoyKB workflow has
- Compute
- we will get a compute server first.
- We should figure out the server and put in the request soon, and done by Feb end
- LSST
- Tom Glanzman?
- We will touch base on Monday with Tom and Nersc folks
- Office Hours today
- have a presentation on containers
- will upload on the website
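For the SWIP item in this meeting about computing checksums in pegasus-transfer when the RC has none, a sketch of the decision is given below. SHA-256 is an assumption (the notes do not name the algorithm here), and the function names and return values are invented for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large inputs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_integrity(path, expected=None):
    """Verify against the catalog checksum, or compute one if it is missing.

    Returns a (status, checksum) pair where status is 'ok', 'mismatch',
    or 'computed' when no expected checksum was available.
    """
    actual = sha256_of(path)
    if expected is None:
        return ("computed", actual)
    status = "ok" if actual == expected.lower() else "mismatch"
    return (status, actual)
```

The 'computed' branch is what would let downstream jobs verify files that entered the workflow without a checksum, instead of skipping the check.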
January 2018
January 12th, 2018
- AWS Batch
- seems to be running in karan's account.
- update documentation about aws batch
- Pegasus 4.8.1 Release
- up to Mats whether we should tag or not.
- Pegasus Office Hours
- Rafael will look up a new name
- Container Presentation
- Talk about containers
- Blue Jeans
- Advertising avenues
- XSEDE workflows list
- OSG List
December 2017
December 1st, 2017
- AWS Batch
- Client done. still have to figure out about stdout and stderr
- maybe we should have batch push the files and control where the jobs go in
- also maybe each file should go to its own stdout stderr
- Metrics for SWIP
- Stampede
- Metrics Server
- Elastic Search
- Rajiv working on changing the salt configuration
- Model Integration with Wings
November 2017
November 10th, 2017
- Pegasus
- AWS Batch
- checked in stuff
- jars checked in aws sub directory in the jars folder. pegasus-config classpath is updated accordingly
- Bamboo builds
- change in how users are handled
- rajiv and mats worked on changing the salt configuration for the various machines
- the major part changed was how the users are handled
- the bamboo user got messed up and uid's were mismatching on the filesystem
- main group for people unix accounts should be pegasus for everybody
- only project users will have access to VM's for a particular project
- Stewie Rebuild
- move off stewie. the main OS needs to be updated
- panorama
- Rafael and George will create a VM for panorama
- CENTOS 7
- mats will help George create VM
- Ashwin consumers from Influx DB
- mysql server
- Pegasus metrics server
- JSON vs YAML
- initial impressions seem to favor yaml
- YAML does have benefit of including comments
- also YAML , JSON will result in additional lines
- templates for site catalogs
- LSST
- mats will update documentation for pyglidein
- to work with condor pool passwords thing
- also will take mike site catalog to update NERSC entries
- tests
- rosetta and montage appear working again. not clear what triggered errors in first place
- SC Next week
- Rafael and Karan are away
- AWS workshop for LIGO
- George Panorama work
- Dakota ends up launching multiple Pegasus workflows based on its gradient functions
- using ensemble manager to do multiple runs
- George will check in dakota test case and example
- pick one approach and update documentation
- SWIP Demo
- think about merging stuff from panorama back to production branch
- work with ian foster and raj kettimuthu on globus online
- do multi site run
- Tudo
- working on insitu
- data spaces approach to have staging area
- tudo wrote sample applications
- evaluating on CORI using shared memory
- burst buffers cannot be used
- Ashwin
- analyzes influx db data
- using statistical learning
- python pandas library
November 3rd, 2017
- Pegasus 4.8.1 release
- 3 bugs in worker package staging.
- pegasus-transfer PYTHONHOME unset does not work
- hierarchical workflow handling.
- to be discussed tomorrow
- AWS Batch
- need to check in changes.
- need to add options for the client and do error checking.
- still need to figure out how to integrate in pegasus
September 2017
September 15th, 2017
- Pegasus development
- Dashboard
- LSST might want it running out of a directory other than $HOME/.pegasus
- No plans to tackle it right now. requirements are vague, and it is a catch 22 situation
- Python problem with Pegasus install
- DAX3 problem does not work.
- Could not be recreated
- PyPI account should be disabled
- PyPI has a 4.3 pegasus package
- we should remove it
- The jobname with dagman not allowing . is fixed
- LIGO
- Heard from Duncan. Tried out metadata stuff
- Another person at NERSC that is interested in running Condor
- AWS Batch
- done initial development.
- how to retrieve logs etc.
September 8th, 2017
- Pegasus 4.8.0 Release
- went out this week
- documentation
- pyglidein
- out of icecube
- mats added a section in the documentation
- pretty neat once it is setup
- and works really well on machines with two factor
- not tuned for MPI things.
- on the submit machine a web based python thing.
- pegasus resource profiles will work out of the box with pyglidein
- Releases
- Post 4.8 Releases
- changes in the Debian build
- source package has been renamed. mats removed the source part
- changed the versioning of RPM and debian. The dev series will have the timestamp in it.
- pegasus-version -f also has timestamp
- Will create a separate YUM and DEB developer repositories
- repositories will not be signed.
- Mats is still playing with the setup
- Worked a lot on Debian packaging.
- HipChat will be upgraded to Stride
- Mats updated JIRA today
- Sim Center Workflows
- Using Condor IO thing
- for 4.8.1 we should look at the remap thing
- SWIP Poster
- the first review is really good
- Docker and Singularity
- have stuff about engineering challenges
- But not enough usage
- Practical Aspect
- Von's Group SWAMP thing.
- pegasus is part of trustworthy software thing?
- AWS Batch
- AWS batch thing works
- Investigate how Dakota and Pegasus can work together
- Run Dakota as a job
- Run Dakota on submission machine
- dakota calls a script that does a pegasus workflow
- Mix of 1 and 2.
August 2017
August 25th, 2017
- Pegasus 4.8.0 Release
- beta3 tagged
- monitord replay issue for rc tables against mysql server
- Jupyter thing
- VM updated with Jupyter
- Docker example application
- R builds with pegasus
- for time being only brew builds have that disabled.
- Condor update to the brew installation.
- Pegasus 4.9 Roadmap
- SWIP
- lay out the changes
- prioritize stuff for production readiness
- the knob for integrity.
- get into transfers.
- signing stuff on the backburner.
- chaos monkey tests
- metadata things
- aws batch support
- Pegasus Tutorial
- George felt that Pegasus tutorial was a bit too easy.
- it should be maybe more interactive. get the user to develop a new workflow
- Tudo will pick up Decaf work
- Dataspaces
- do data management
- Ashwin will work on deep learning on panorama
- use tensor flow
- Dakota
- ini file . runs simulation and converges simulation points
- George will be working on it
- has a checkpointing facility
August 18th, 2017
- mats found a new hydrology user in boulder
- based at Boulder
- there was a magpie presentation there.
- mats did a hosted ce tutorial
- 4.8.0beta2 release
- tagged and sent it out.
- monitord workflow and read permissions creation
- should only happen when the database is created.
- ~/.pegasus directory should be 755
- dashboard errors
- rajiv should traverse the directory in the dashboard.
- LSST
- cleanup issue
- mats and karan agree on it, that it is bad application
- we should reply to it.
- the wrapper should copy the file and launch the job
- source a setup script for jobs
- has to be generically done
- registration jobs shell expansion
- we should not do getEnv=True
- testing repo
- stuart from LIGO asked for it.
- BOSCO
- we have the examples updated
- Karan will remind Eliu about LIGO and Bluewaters
- Slick Jupyter Demos
- Started up VM's
- Jupyter tutorial
- should be integrated into the VM
August 11th, 2017
- Bamboo is finally green
- we will do a Pegasus RC1. actually a beta since we still want to address some issues.
- Rajiv fixed the build with python crypto issues
- pyopen-ssl was updated during 4.7.x series
- we should package only things whose versions we are not sensitive to
- so right now pyopenssl is removed from binary builds, and all associated dependencies were removed.
- New throttling things.
- number of jobs scale with the size of the workflows.
- SCEC all hands meeting.
- Documentation
- Took a stab at the containers.
- Rafael has to add a separate jupyter chapter
- Karan will update the throttling docs
- LSST
- Mats and Karan had a call with Tom about designing a workflow for one of the production pipelines
- Mats and Rafael had a call with the French cluster folks (Fredrique Sutter). Fredrique works for simgrid
- Paper
- rvGAHP paper ready for submissions
- Suraj Poster
- Mings pass really helped
July 2017
July 21st, 2017
- VMs are down, so tests are slow, and cannot test the new features yet
- Mats will email (or call) Derek to check on the VM issue
- Try to run the Montage container test on OSG
- TODO: Reconfigure our pool (it is not flocked yet)
- Pegasus 4.8.0
- Bugs on the container (transformation catalog) are fixed
- Stage in/out nodes based on the number of computing jobs on the workflow
- TODO: add warning for errors (size of jobs)
- Warning for category is done
- TODO: reference implementation of a workflow using docker (1000 Genome workflow - Rafael)
- Jupyter: add container keyword for API
June 2017
June 23rd, 2017
- Pegasus 4.8.0
- Decaf
- local universe jobs do not honor request_cpus, and jobs remain idle if they ask for multiple cpus
- karan will update pegasus to remove the request_ parameters from the local universe jobs
- Steven Clark
- Pegasus build issue is related to python 3 compatibility in the DAX API
- LIGO
- Eliu plans to run on Bluewaters
- we should confirm that he only wants to run on bluewaters.
- they have sucky performance of getting data to the compute nodes in bluewaters.
- set the schedd start date
- NERSC
- Karan will do a test setup there.
- Pegasus Builds
- failed because of Debian version upgrades to build tools
- setup tools in python complains to pegasus 4.8.0-dev
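The Decaf item in this meeting says request_ parameters should be removed from local universe jobs. A minimal sketch of that filter follows; the dictionary representation of a job's condor profiles is an assumption, not how the planner actually stores them.

```python
def strip_resource_requests(universe, condor_profiles):
    """Drop request_* keys for local universe jobs; leave others untouched.

    condor_profiles is assumed to be a plain dict of condor profile keys.
    """
    if universe != "local":
        return dict(condor_profiles)
    return {key: value for key, value in condor_profiles.items()
            if not key.startswith("request_")}


profiles = {"request_cpus": "4", "request_memory": "2048", "priority": "10"}
print(strip_resource_requests("local", profiles))    # only 'priority' remains
print(strip_resource_requests("vanilla", profiles))  # unchanged copy
```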
June 9th, 2017
- Pegasus 4.7.5
- pegasus-rc-client bug fix is done
- 4.7.5 and 4.8.0 together
- Pegasus 4.8 release
- docker stuff is complete
- docker tests added are green
- karan will work on singularity next week.
- LIGO reports pegasus lite jobs filling up /tmp . karan will check with LIGO on whether there is any environment set?
- rafael will update his api to make it consistent with the container format
- also will add a bamboo example.
- DECAF integration
- karan has an idea about it.
June 2nd, 2017
- Pegasus 4.7.5
- pegasus-rc-client bug fix to be done
- Jupyter
- rafael will be working on it during June
- For 4.8.0
- container
- docker works in nonsharedfs right now.
- work on singularity support.
- clustering . clustered jobs can only refer to one container
- symlinks - for 4.8.0 they are disabled.
- container sharedfs example
- we have pegasus-lite with sharedfs. automatic translation of file URL's
- transfer refiner
- notification email updates
- mats updated default notification scripts. will generate svg files
- at end of workflow generate notifications that have statistics
- monitord needs to run the remaining notifications after the workflow is done.
- makeflow integration
- limitations for pegasus generating makeflow integration
- makeflow model
- all files have to be on the submit host
- how do we translate auxiliary jobs to a makeflow description
- tyson at arizona.
- add new transfer jobs
- add new credentials
- no postscripts there
- monitoring
- won't work with monitoring
- write a new monitord.
- maybe do an opposite translation?
- what will be useful is to integrate with using work queue with our own dagman manager.
May 2017
May 12th, 2017
- auto scaling of stage out and stage in jobs
- 4.8 transfer refiner will be Cluster by default.
- auto-computation of number of stage in, stage out and cleanup jobs
- defaults should be computed based on number of jobs at a level.
- use a ratio or step function .
- come up with ratio ranges for auto determination
- 1:5 for number of jobs < 10K (20%)
- 1:20 for number of jobs > 20K (5%)
- will create a JIRA item for this
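The ratio ranges above could be sketched as a step function. This is only an illustration, not the planner's implementation: the function name is hypothetical, and the ratio for the 10K to 20K range is an assumption because the notes leave that range unspecified.

```python
import math


def transfer_job_count(num_jobs_at_level: int) -> int:
    """Suggest how many stage-in, stage-out or cleanup jobs to create for a level."""
    if num_jobs_at_level <= 0:
        return 0
    if num_jobs_at_level < 10_000:
        ratio = 5    # 1:5, i.e. 20% (from the notes)
    elif num_jobs_at_level > 20_000:
        ratio = 20   # 1:20, i.e. 5% (from the notes)
    else:
        ratio = 10   # ASSUMPTION: the notes do not cover 10K to 20K
    # Always create at least one transfer job for a non-empty level.
    return max(1, math.ceil(num_jobs_at_level / ratio))
```

For example, a level with 100 compute jobs would get 20 transfer jobs under this sketch, and a level with 40,000 jobs would get 2,000.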
- container stuff
- close to having one example running
- have not figured clustering jobs out yet.
- mats agrees with the approach now. pegasus lite invokes the docker run commands.
- integrity stuff
- will make slides
- be specific about what we have done.
- we give them an option of running synthetic stuff
- For
- also define best effort part.
- strict, off, minimal , best effort
- how do we handle case where SHA exists.
- WDL
- workflow definition language
- WDL is JSON based
- has a template approach with variable substitution
- AWS Cleanup
- need to delete snapshots and cleanup VM's
March 2017
March 17th, 2017
- monitord stdout and stderr missing
- the VARS one. just expose the variable.
- SCEC issue
- job managers per resource
- got fixed by one job manager per job
- BOSCO works partly.
- containers call from yesterday
- metadata
- metadata population in postscripts
- move metadata population to the postscripts.
March 10th, 2017
- SCEC cleanup issue
- related to JGlobus not updating and enforcing RFC 2818 compliance
- LSST visit update
March 3rd, 2017
- Pegasus 4.7.4 Release
- sent out the release
- we did a ligo fix yesterday to pegasus transfer
- mats osg gem
- workflow did not finish
- pegasus-exitcode has a shortcut for a regex
- make it more strict. whether to trigger failure in pegasus-exitcode
- revisit how metadata population
- trigger failure for missing records.
- SCEC RC client issue
- Rafael will look into it for pegasus-rc-client
- containers support
- containers on a pause right now.
- Webinar
- let's try and schedule one for the end of April
- bluejeans will be an option
- topic will cover new features for 4.8.0
February 2017
February 24th, 2017
- Pegasus 4.7.4 Release
- we will tag today.
- there is a potential monitord bug that happens on sub workflow retries only in the live mode, which Karan is unable to trace
- containers support
- pegasus lite launches docker wrap
- or the other way around. because worker package has to be installed in the container in some cases
- so double install
- Clustered jobs
- we want at most one container per clustered job.
- monitord performance
- on OSG connect there is a difference between 4.6 and 4.7 performance replay
- monitord.log has errors indicating unable to read .out .err files.
- we think it is a race between DAGMan and the filesystem
February 17th, 2017
- Pegasus 4.7.4 Release
- targeted for next week.
- LIGO ran into a prescript issue
- pegasus lite deleted the worker package in the workflow submit directory
- only triggered when there was a subsequent compute job.
- new transformation catalog format
- containers
- open issue whether docker wrapper launches pegasus lite
- or the other way around
February 10th, 2017
- Pegasus 4.7.3 Release
- SCEC has issue with pegasus-db-admin
- mysqldump times out when updating their replica catalog
- Database TC
- remove support for Database TC
- Stewie and fisheye upgrades
- fisheye upgrade
- Mats agreed to do the upgrade
- stewie runs debian 7
- we need to upgrade it sooner or later.
- runs GridFTP and mysql
- RabbitMQ is running there
- MongoDB is running there
- Catalog dependencies on stewie
- 5K limit for a new server
- OSG All Hands Meeting
- no tutorial looks like
- lots of pegasus users coming there
- Containers Support
- pegasus lite invokes the docker wrap.
- singularity support will be required.
- container modes
- should we support docker definition file
- do we build on the worker nodes?
- pull in an existing docker image from the hub
- on the staging site
- whether we should unload an image or not
- we should try and cleanup
- credential renaming has to be worked out
- Transformation Catalog
- how to represent container dependency in the transformation catalog
February 3rd, 2017
- Pegasus 4.7.3 Release
- we tag later today or first thing monday
- waiting for scott to reply
- Jupyter Notebook
- in general, the jupyter interactive interface closes if you close the tab
- in our case it does not affect us, since we invoke pegasus-plan at the server end
- Vicky has a workflow out of panorama that she has in jupyter as a set of the instructions
- Containers
- karan did some exploration of docker containers via HTCondor
- by default docker in the container runs as root.
- means output files are written out as root
- also the containers need to be shipped around.
January 2017
January 27th, 2017
- Pegasus 4.7.3 Release
- 4.7.3 release.
- condor stable release has been released.
- we will tag next friday one way or other
- fix monitord replay mode
- crosscheck with rajiv on dashboard
- centralized mysql server for master workflow dashboard
- LIGO wants to host a mysql server for master workflow databases
- Mats would like to see something similar
- also look at some publish subscribe options
- Rafael give an update on the container
- docker universe
- htcondor support i think is mainly geared towards startds
- preinstall software in user containers
- another model is to let pegasus figure out data and executables
- rafael did stuff in pegasus lite stuff
- will have to rewrite proxy and credential environment variables
- also how the environment is rewritten
- good to have a generic concept of multi-level wrappers
- need to have a pegasus-docker-wrapper or pegasus-container-wrapper to launch docker or singularity
- another container technology called shifter
- OSG uses singularity because it is more friendly to launch in user space.
- OSG is launching user jobs using singularity. the image is determined by the VO.
- lets target pegasus lite mode first
- little bit of data passing.
- Rafael will have a student to take forward the docker swarm stuff
- 8 hours every week
January 13th, 2017
- Pegasus 4.7.3 Release
- sub workflows
- better error message for pegasus-transfer when source files don't exist
- pegasus-kickstart
- improve error message
- dashboard to better separate kickstart and pegasus lite messages
- Potential SCEC issue with RV-GAHP
- results of qualtrics user survey
- Pegasus 4.8
- swip stuff for 4.8
- have sent emails for their use cases
October 2016
October 7th, 2016
- Pegasus 4.7 Release
- release notes and documentation are done
- need to follow up with Action for our build VM's
- LIGO is not going to test 4.7 release as they are in midst of a cluster upgrade.
- Rafael will write a blogpost about R API after the 4.7 release
- Dashboard requests 4.7.1
- rafael and rajiv will work on getting dashboard to display the database schema version and the pegasus version
- useful, when a new version of pegasus is deployed and .
- Unable to read the sqlite database
- related to users permissions on the database
- from braindump in replay mode should be able to pick up relative paths.
- brew error on macos sierra
- brew releases are built manually
- after the release we have to update the formula to reflect latest stable version.
- ACME workflow on MIRA
- GitHub page to be updated with list of dependent software
- ACME team needs to help with installation of one of the software.
September 2016
September 16th, 2016
- Builds
- disabling RHEL5, Debian 6, Ubuntu precise. Karan will make sure in the code it works
- Pegasus 4.7.0 Release
- reached out to LIGO. hopefully they will start testing
- rajiv checked in dashboard changes
- karan to write documentation for directory layout
- rafael will update pegasus-exitcode next week.
- Pegasus 4.8.0 release
- one of the first things will be to update the SUBDAG keyword.
- LLNL account approved for Karan
- OLCF account waiting for notarized documents to be received
- SCEC
- concurrency limits for transfer jobs
- prime candidate for priority stuff that will allow good interleaving of transfer jobs with the compute jobs
- ask Scott to see if 8.5.6 condor can be released.
- ACME workflow
- HSI client for HPSS storage.
- Karan will reply to Jamie.
- Bluewaters HTCondor install
- Bluewaters renewed till 2019
- Pegasus HPCC workshop on September 30th
- karan will be there.
September 9th, 2016
- Builds
- disabling RHEL5, Debian 6, Ubuntu precise
- Pegasus Development
- 4.6.2 released . LIGO has updated it.
- LIGO tripped over changes to planner submit directory behavior
- held job reasons are recorded in the database
- 4.7.0 release
- went through pending items
- targeting end of the month for the release
- proposal
- data aware workflow management
- no BPEL only a reference for it.
September 2nd, 2016
- Pegasus Development
- 4.6.2 released . LIGO has updated it.
- pegasus.dir.storage.deep true throws an error right now.
- 4.7.0 release
- karan looked into the HELD job
- rajiv thinks no dashboard change required.
- pegasus-exitcode changes will be done by rafael
- LIGO should install 4.7.0 on dev machine.
- SCEC production run
- Reverse GAHP OLCF
- once tokens are reactivated , karan will check up on rhea rvgahp and get it running
- HTCondor on bluewaters
- Karan opened a ticket.
- LLNL
- security training to be done by Karan
- panorama
- rafael is working on panorama demo
- two different pegasus workflows running on 2 exogeni slices
- and data staging server in between. shadow q has to propagate transfer priorities
- currently it is workflow level priority. will be manually assigned.
- 1000 genome workflow -
August 2016
August 12th, 2016
- Pegasus Development
- 4.6.2 release
- release notes are checked
- tutorial documentation will be updated to include the docker tutorial
- pegasus service init script
- we will not include it and enable by default in the builds
- mats will update the item accordingly
- 4.7.0 release
- submit directory structure
- we need to get the depth thing fixed. Karan needs to check whether the DAGMan knob can be set automatically.
- we should have a way to have it set for deeper
- documentation to be set
- pegasus-exitcode to have a wait lock to set up its logs
- one option is to log only exceptions initially.
- pegasus-keg to mimic IO pattern
- read files over and over again.
- this way we can increase IO without increasing file size ( that results in higher data transfer costs)
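The repeated-read idea above could look roughly like the following. This is a minimal Python sketch, not pegasus-keg code; the function name and parameters are hypothetical. Note that on most systems repeated reads of the same file may be served from the OS page cache, so the physical IO generated can be lower than the byte count returned here.

```python
def read_repeatedly(path: str, passes: int, chunk_size: int = 1 << 20) -> int:
    """Read the file at `path` `passes` times and return the total bytes read."""
    total = 0
    for _ in range(passes):
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break  # end of file, start the next pass
                total += len(chunk)
    return total
```

A 1 GB input read 10 times yields 10 GB of read IO while only 1 GB has to be transferred to the job.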
- DECAF WMS
[10:30 AM] Rafael Ferreira da Silva: https://bytebucket.org/tpeterka1/decaf/wiki/public-docs/peterka-decaf-handout.pdf?token=c693fc8bb177...
[10:31 AM] Rafael Ferreira da Silva: https://cfwebprod.sandia.gov/cfdocs/CompResearch/templates/insert/project.cfm?proj=134
August 5th, 2016
- Pegasus development
- waiting for LIGO to check the support for changes for OSG, where pegasuslite URLs are converted to file URL if the staging site and compute site are same
- Karan has a call with Duncan next week planned.
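The URL conversion being tested can be illustrated with a small sketch. It assumes the rewrite simply keeps the path component of the remote URL; the function name, site handles and example host are hypothetical, and the actual planner logic may differ.

```python
from urllib.parse import urlparse


def to_local_url(url: str, staging_site: str, compute_site: str) -> str:
    """Rewrite a remote URL to a file URL when staging and compute site match."""
    if staging_site != compute_site:
        return url  # data really is remote, keep the original URL
    parsed = urlparse(url)
    if parsed.scheme == "file" or not parsed.path:
        return url  # already local, or nothing usable to rewrite
    # ASSUMPTION: the remote path is visible at the same location locally.
    return "file://" + parsed.path
```

With this sketch, `to_local_url("gsiftp://host.example.org/scratch/f.a", "osg", "osg")` gives `file:///scratch/f.a`, while the same call with different site handles returns the URL unchanged.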
- staging sites deep directory structure
- mats has it working for one of the workflow.
- https://jira.isi.edu/browse/PM-1049
- automatic delayed job retries
- the real fix should be in DAGMan. Karan will follow up with Kent. Will address for 4.8
- postscript output redirects
- one file per job is what we had considered earlier
- maybe we should do it per workflow log file.
- DIPA workflow development
- good progress there.
- Titan Setup
- we should consider setting up it the same way as bluewaters
- Next Pegasus proposal
- next week meeting we should iterate on items.
- Samrat issue
- get pegasus-exitcode to look for final output files
- checked in workflows to the pegasus repository
- bioconductor repository
- would be good to setup PAGE cloud VM with the workflow.
- Dieter Kranzlmüller
- director of supercomputing in germany
- SuperMUC supercomputing cluster
- will send a student for 3 months to ISI end of the month.
- Rafael plans a practical comparison paper
- Gui's docker stuff.
- do a blogpost of montage with above docker stuff.
July 2016
July 15th, 2016
- Pegasus development
- waiting for LIGO to check the support for changes for OSG, where pegasuslite URLs are converted to file URL if the staging site and compute site are same
- staging sites deep directory structure
- dashboard changes for nested submit directory structure
- fixed the on demand loading for the dashboard.
- identify workflows that will benefit
- LIGO
- Splinter
- OSG - Kink
- put in the test cases for testing it out.
- use the new montage dax generator
- pull the montage dax generator via squid cache.
- Release schedule
- Get 4.6.2 out first.
- 4.7 probably early august.
- ALCF Mira running.
- cobalt workflow
- ACME workflow compilation. Waiting on Ben for the source code.
- Panorama use case
- SNS is not enough in terms of data sizes.
- anirban will start working on it next week.
- R Examples
- samrat working on a bioconductor example
- has an example workflow
- code should be checked into github
- samrat is working on a more advanced workflow that will be put in the examples directory also
- Gui docker nodes work on amazon ec2
- uses docker swarm and docker machine to do setup etc
- workflows run in condor IO mode.
- DIPA Workflows
- waisman folks will start working on it.
- free surfer workflow
- mats does not think there is enough uptake.
- suchandra is working on a second version that will add more capabilities
- seismology workflow
- rafael will check in to the repo.
July 8th, 2016
- Pegasus development
- waiting for LIGO to check the support for changes for OSG, where pegasuslite URLs are converted to file URL if the staging site and compute site are same
- pegasuslite signal handling
- mats updated it. LIGO reported cases, where jobs got killed before the outputs were staged back . But the jobs themselves were not marked as failures.
- duncan's third issue could also be related to the signal handler
- modify kickstart to compute md5 checksums.
- we could potentially get kickstart to validate md5 checksums
- have an architectural idea about it.
- gridftp currently does not expose checksumming
- irods client has checksumming built in.
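To make the checksum idea concrete, here is a minimal Python sketch of computing and validating an md5 checksum for a file. It is not kickstart code (kickstart is not written in Python); the function names are hypothetical and only illustrate the compute-then-compare step.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_md5(path: str, expected_hex: str) -> bool:
    """Compare the file's md5 against an expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

The expected digest would have to come from somewhere trusted, for example a catalog entry recorded when the file was produced.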
- pegasus-init R example
- R example will not run on OSG because of module load issues
- all R examples will have a wrapper for the scripts
- 4.6.2 after changes are verified.
- DIPA Workflow
- with Waisman brain imaging pipeline that runs on Waisman cluster
- Rafael is working on a seismology workflow
- tophat workflow paper got accepted in a bio journal
- Pegasus Virtual Summer School
- would be similar to the XSEDE ones
- will be 1.5 hours long.
July 1st, 2016
- Mats has moved bamboo to a new RHEL7 VM
- migrated all the tests to it.
- there were issues with CondorC tests that are resolved now. because of path issues
- pegasus-init R
- Rafael will integrate Samrat's R example workflow
- Samrat is also working on a bioconductor example workflow
- rajiv made minor dashboard query changes
May 2016
May 13th, 2016
- Pegasus development
- kickstart wrappers
- process explosion.
- eventually we would want it to be in the workflow.
- handle these wrappers as credentials in the workflow.
- what are class of files that are always required.
- KICKSTART_WRAPPER in kickstart
- was done for the PAPI stuff originally.
- pegasus-init for OSG
- pegasus-init
- R examples?
- rafael will do it in june.
- job held scenarios
- open with htcondor admin: a job should never go to the held state
- maybe pegasus should do quick retry for small workflows
- for large workflows retries should happen at a longer delay
- for workflows less than 100 nodes held duration should be small, and failures maybe should be triggered earlier
- not for large workflows
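The size-dependent handling of held jobs discussed above could be sketched as follows. Only the 100-node threshold comes from the notes; the policy name and all of the durations are placeholder assumptions chosen for illustration.

```python
from collections import namedtuple

# Hypothetical policy record: how long a job may stay held, and the retry delay.
HeldJobPolicy = namedtuple("HeldJobPolicy", ["max_held_seconds", "retry_delay_seconds"])

SMALL_WORKFLOW_NODES = 100  # threshold mentioned in the notes


def held_job_policy(num_nodes: int) -> HeldJobPolicy:
    """Pick a held-job policy based on workflow size."""
    if num_nodes < SMALL_WORKFLOW_NODES:
        # Small workflow: fail fast and retry quickly (ASSUMED values).
        return HeldJobPolicy(max_held_seconds=300, retry_delay_seconds=30)
    # Large workflow: tolerate longer holds and retry after a longer delay (ASSUMED values).
    return HeldJobPolicy(max_held_seconds=3600, retry_delay_seconds=600)
```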
- revisit whether clustered jobs should be based on size of the cluster or the number of jobs
- mats no longer likes the idea of having fixed number of transfers
- deep directory structure for the workflows
- can splinter move to using them?
- right now they are condor io
- on the data side it deep directory structure will only work
- BOSCO SSH
- Mats tried with condor 8.5.4 on comet.
May 6th, 2016
- Pegasus development
- moved the submit directory creation stuff to the mapper interface
- reorganized the code for it.
- on the execution site for nonsharedfs case we will enable for the dashboard
- dashboard works mostly
- only improvement is on the file browser side. will open a JIRA item for it
- database changes
- for 4.7 we will add extra columns to workflow state and job state tables.
- the dashboard needs to show the task metadata better for 4.7
- pegasus tutorial for virtual summer school.
- will be based on the XSEDE tutorial
- bluewaters will setup a VM for the tutorial.
- Scott will do an introduction and an overview.
April 2016
April 22nd, 2016
- Pegasus development
- 4.6.1 released today
- had to fix bugs for symlinking not being triggered for SCEC
- dashboard for the home page should work without trailing slash
- all other pages should work the same way . For 4.7 we should do that
- Pegasus R example
- rafael will work on it
- OSG and XSEDE site catalog examples
- Submit Directory organization
- Relative DAGMan paths
- HTCondor week
- Lauren said training week
- Bluewaters training
- 2 day training might be too long
- we will work on pegasus training module.
April 15th, 2016
- Pegasus development
- 4.6.1 release next week
- pegasus-status change for new Condor changes
- cartoon will be upgraded to 8.5.x
- pegasus-analyzer
- will report correctly submit failures
- better errors for mismatch in cores/ppn requirements
- Tag and build on Thursday.
- pegasus-s3
- batched uploads and downloads
- output directory options fails if local scratch not specified
- LIGO transfer issue
- NFS reported write as successful for a transfer job.
- wget reported data was transferred and wget succeeded
- good use case for checksumming of data
- where do checksums come from
- for data files good placeholder in the transformation catalog.
- that is why SCEC put a specific job in the workflow and uses the ABORT-DAG-ON feature
- Call with Kent for adding nodes to a running DAG
- group jobs with similar errors
- might be a python library in there
- HTCondor Week
- proposed a hands on tutorial
- pegasus 4.7
- ignore integrity constraints in monitord
- only for duplicate keys
April 1st, 2016
- Pegasus development
- Submitted tutorial for XSEDE 16
- will include RADICAL
- might update tutorial with BOSCO. Mats already has BOSCO running on Comet
- Derrick Lazaro wants to build a bigger filesystem ( 400 TB )
- will be backed up
- has a commercial storage vendor in mind
- has backup capabilities built in (block level backup)
- let Mats know about storage needs
- Mats estimated our storage needs to 25-50TB
- Graduate student coming to the group mid may to july. brazilian student. currently in Florida
- Ahmad group got an EPSCoR grant
- CRAFT Meeting update
March 2016
March 25th, 2016
- Pegasus development
- Gideon has been working on kickstart online monitoring for panorama.
- the lib interpose monitoring requires app code to be dynamically linked to use LD_PRELOAD
- now kickstart has a new mode, where monitoring thread will scan the proc filesystem for all processes in resource group.
- this approach disables the PAPI counters as they need to be retrieved from app itself
- also is working on aggregation logic
- complicated accounting information
- added another process called pegasus-monitor . so it is usually pegasus-kickstart-> pegasus-monitor -> application
- can deploy without any external dependencies.
- 4.6.1 release
- in april when karan comes back from PAGE meeting
- Condor bug on schedd evicting dagman jobs
- LIGO noticed on other submit nodes
- mats worked with Derrick to make sure glideins work with BOSCO on comet
- CyVerse Talk - Mats will do a hands on thing with them. Mats may do an existing tutorial.
- rafael used the new slides.
- Pegasus workshop
- erin will get back to us with other feedback.
- make the intro slides simpler.
March 18th, 2016
- Pegasus development
- deep submit directory structure working for submit directory on PM-833 branch. however need to move to relative directory paths in the .dag file , before merging back to master
- gideon is reworking how kickstart online monitoring work
- working on kickstart monitor that goes through the /proc/ filesystem with the assumption all apps installed via kickstart have the same process group as pegasus-kickstart
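The /proc scan described above can be illustrated with a short Python sketch that lists the processes belonging to a given process group on Linux. It is not the kickstart implementation; the function name is hypothetical. It relies on the documented layout of `/proc/<pid>/stat`, where the process group id is the third field after the parenthesised command name.

```python
import os
from typing import List


def pids_in_process_group(pgid: int, proc_root: str = "/proc") -> List[int]:
    """Return the sorted PIDs whose process group id equals `pgid` (Linux only)."""
    if not os.path.isdir(proc_root):
        return []
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as handle:
                stat = handle.read()
        except OSError:
            continue  # the process exited or is not readable
        # The command name may contain spaces or ')', so split after the last ')'.
        # The remaining fields start with: state, ppid, pgrp, ...
        fields = stat.rsplit(")", 1)[-1].split()
        if len(fields) >= 3 and int(fields[2]) == pgid:
            pids.append(int(entry))
    return sorted(pids)
```

A monitoring thread could call this periodically with the process group of pegasus-kickstart and then read per-process counters for each PID returned.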
- pegasus workshop on campus on tuesday. it is setup https://pegasus.isi.edu/tutorial/usc/
- the tutorial is setup using pegasus-init
- will ask mats to move the XSEDE tutorial to pegasus-init
- rafael working on energy paper again
- stephan paper to HPDC got accepted
March 11th, 2016
- Pegasus development
- R DAX API is done
- will be proposing for CGSMD
- Deep hierarchy structure
- LIGO meeting
- do a local file copy against the staging site
- having a separate staging site bogs down inter site transfers
- metadata
- they are interested. want monitord to transfer the stampede database to another location from the scratch submit directories
- cannot really do it in monitord
- can also potentially do it in pegasus-dagman
- argument passing for sub workflows
- will be done 4.6.1
- jobs that work on output site directory.
- credentials issue
- variable substitution
- will make use of it
- submit directory and other directory organizations
- are interested in using it
- Rosa
- wants to do something with pegasus
- Monitord
March 4th, 2016
- Rosa
- dispel4py Stream based workflow mapped to MPI, Storm
- MPI 3 Failure Recovery from Node Failures
- Monitord
- Triggered by Condor failures. Workflow killed, condor recovery did not spit out all event on recovery.
- Need better way to test.
- DB Admin
- Merge issues
- rafael will confirm with gideon if there is an issue
- Bamboo
- Rebooted for DROWN Attack
- R API
- Unit tests done.
- Packaging - Ship, host?
February 2016
February 19th, 2016
Pegasus development
- support for GO - mats is working on it
- dashboard shows multiple workflows with same uuid. fixed in monitord
- pegasus transfer was prepending path because of globus location
- mats has changed the logic
- SCEC wanted to disable the stat of files that was happening automatically because of registration turned on.
- we now have the property that can explicitly turn it off
- SCEC tripped over replica catalog insert performance.
- rafael working on it. identified the bottleneck
- Catalog files in submit directories
- will create a catalogs directory
- what about file based replica catalogs and cache files etc? some of them can be large.
- Pegasus Blogs
- SCEC
- RVGahp?
- Website
- highlight applications better.
- workq has a catalog server running
- how do jobs report real time monitoring information back to monitor without rabbitmq
- have a condor submit wrapper
- will help us increase memory requirements in case of failures.
- PegasusLite to have pegasus-transfer invocations as kickstart records
- kickstart
February 12th, 2016
Pegasus development
- support for GO
- mats found a python REST API - is decent.
- will only work on a small subset of workflows
- only third party transfers
- how to handle file URL's on the submit host
- and how do we activate the end points.
- lifetime of credentials .
- cannot work on non shared fs mode, as what end point to use when staging to the worker nodes.
- maybe we should look at how condor does it.
- held jobs
- dagman added support in 8.3 where the held job reason appears in dagman.out
- will need schema change
- failing workflows
- held jobs.
- have a held job tab.
- pegasus-submitdir archive
- PMC job statistics in pegasus-statistics
- mats and rajiv
Annual Report
February 5th, 2016
Pegasus development
- 4.6.1 release
- pegasus-glite-configure
- change of how retries are done for transfer jobs, using requirements and dagnode retries
- https://jira.isi.edu/browse/PM-1049
- there are just 2 retries implemented for transfer jobs
- one more option is for pegasus-transfer to do better retries
- and let the dagman retry set to 1.
- use DAGMan influence to do in retry.
- do more testing at our end.
- let's change default retries for transfer jobs
- and do this only for transfer cleanups in condor environments
- LIGO runs
- symlinking
- R API
- will target 4.6.1 and keep it similar to the python API
- 4.7.0 release
- filesystem organization
- Keck workshop on Pegasus on Feb 26th
- Pegasus Annual Report
- Pegasus GUI email
- we will send user a direct link
- Pegasus Announce SLES email
- we have done on SLES 11 not on SLES 12
January 2016
January 28th, 2016
Pegasus development
- 4.6.0 release
- Released this week
- Pegasus Website
- new website there
- karan will put in the old release notes.
- Links for old documentation on the new website
- Rajiv has updated the docker tutorial
- Tutorials will be moved to Pegasus website
- Have a research link to point to Scitech website
- Gideon confirmed MoabGlite helper scripts work with stock condor
- will also check in a tool to put in the scripts to the right locations.
- Pegasus Lite pulls in a worker package
- should we download even by default from the worker package
- warnings for worker package not being found.
January 22nd, 2016
Pegasus development
- 4.6.0 release
- open items
- constraints algo implemented and checked in . tests worked .
- documentation
- karan added chapters on metadata and variable expansion
- gideon updated execution environments
- updated the BOSCO section about SSH
- pegasus-analyzer exits gracefully when nothing in the stampede database
- check if analyzer and statistics check for the version.
- pegasus-init
- pegasus-db-admin
- better error message for that case.
- karan will update tutorial to take account of default options
- for glite style condor arguments quoting is automatically turned off
- new website.
January 15th, 2016
Pegasus development
- 4.6.0 release
- open items
- https://jira.isi.edu/issues/?filter=10952
- Rafael almost done with Constraints cleanup algo. tests run fine on the branch
- pegasus-bootstrap
- gideon was doing it as Jinja templates
- will set it up a shell script. will be easier for people to update
- documentation needs to be updated
- map the globe
- for resource requirements add pegasus.queue keyword. update documentation to have one table. remove the documentation for priorities.
- MOAB stuff documentation. Will be considered for next major release.
- DAGMan wants to remove the functionality of running postscript in case of prescript failure
- does not affect pegasus
- DAGMan wants to remove DAG NOOP keyword
- was introduced for LIGO
January 8th, 2016
Pegasus development
- 4.6.0 release
- Condor DAGMan log messages contain HTCondor in 8.5 series
- broke monitord
- fixed both 4.5.4 and 4.6.0.
- 8.5.2 has DAGMan logging timestamp from condor job log also.
- monitord has been updated for that.
- metrics reported were updated
- Globus strict checking mode.
- gridftp + ssh version.
- Scott is working on getting the reverse GAHP stuff
- How to configure the batch_gahp
December 2015
December 18th, 2015
Pegasus development
- 4.6.0 release
- Reverse GAHP for Oakridge Titan
- https://github.com/juve/rvgahp
- done because cannot do incoming connections on titan
- and also they don't want to use pilot jobs, as it is not easy to yank a job from a HTCondor queue
- Harvard Pegasus installation
- with SLURM support.. Karan will work on this.
- We should explore remote batch GAHP stuff
- for remote batch do
- batch gahp --rgahp-key /give/key user@host
- look at the remote_gahp script.
- documentation for the batch gahp thing.
December 11th, 2015
Pegasus development
- 4.6.0 release
- open items
- pegasus-db-admin
- cleanup algorithm
- rafael will work on it.
- pegasus-s3 cert issue
- updated boto library to account for cacert change
- on mac, had to disable the automatic failover
- Bypass PFN's
- replica selectors can now order replicas. Default and regex ones updated
- monitord
- combination of missing job terminated and exception on casting job duration as int, triggered a bug that LIGO reported.
- default behavior of planner
- pick up pegasus.properties from cwd as a replacement for conf option
- --sites option for * behavior , remove local from candidate sites
- pegasus-bootstrap commands
- sets up pegasus with site catalog. and dax generators
December 4th, 2015
Pegasus development
- JDBCRC
- should work for 4.5.3 . will work for the release
- need to make the changes for 4.6.0
- should consider batch inserts
- rafael has implemented the batch inserts also
- the database locked errors are fixed.
- Rafael is looking into how the timeouts are implemented in SQLAlchemy
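As an illustration of the batch-insert idea, the sketch below inserts many replica entries in a single transaction using Python's standard sqlite3 module. JDBCRC itself is Java code, so this is not the implementation; the table name `replica` and its columns are hypothetical and do not reflect the actual Pegasus schema.

```python
import sqlite3
from typing import Iterable, Tuple


def insert_replicas(conn: sqlite3.Connection,
                    entries: Iterable[Tuple[str, str, str]]) -> None:
    """Insert (lfn, pfn, site) rows in one transaction instead of one commit per row."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO replica (lfn, pfn, site) VALUES (?, ?, ?)",  # hypothetical table
            entries,
        )
```

Grouping the rows into one transaction means the database write lock is taken once per batch rather than once per row, which is the main reason batching tends to help with contention.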
- Mac OSX El Capitan Builds
- Gideon fixed those. El Capitan does not allow root to modify files in /usr
- Gideon changed the installer to install to /local
- Upgrading the mac mini build host.
- LIGO proxy issue
- change in how proxies are generated.
- LIGO en-common proxies were not supported by J-Globus
- Gideon has the patch for making the updated jar.
- Gideon has added instructions on building globus for El - Capitan
- Jobmanager-condor for obelix was updated to support both shared fs and non shared fs cases.
- metadata registration
- information for output files is tracked.
- pegasus-metadata client . Rajiv.
- Cleanup algorithm - Rafael ?
- LIGO use case for fallback PFN for PegasusLite cases
- they want to use existing input data for frame files, on different locations across sites
- but have a single site catalog entry for the computation, as glideinwms provisions it
- Karan and Mats are working on it
- pegasus-transfer changes ?
- LIGO running workflows across LIGO and OSG .
- Database locked errors for monitord.
- Call the 4.6 release as 5.0 release.
- Gideon working on MOAB Blahp support.
October 2015
October 23rd, 2015
Pegasus development
- Tutorial VM
- rajiv will update dashboard screenshots and go through the Virtual machine based tutorial
- JDBCRC
- should work for 4.5.3 . will work for the release
- need to make the changes for 4.6.0
- should consider batch inserts
- sqlite supports unlimited connections
- write locks: 25 jobs contending for write locks; beyond 25, it ignores the timeout settings.
- 67 registration jobs.
- Rafael is implementing a back-off
- category for the registration jobs
- eventually do the dagman category stuff
- metadata registration
- information for output files is tracked.
- pegasus-metadata client
- concurrency limits
- in partitionable slots this has an effect on performance
- for 4.5.3 we will have a knob and set it to false by default.
- Dashboard and PAM problem.
- mats will create JIRA item.
- salon working on data from MYRA
- trying to find contention of data
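The back-off discussed for the registration jobs could look roughly like this. It is a sketch in Python against sqlite3, assuming that lock contention surfaces as an `OperationalError` whose message mentions "locked"; `execute_with_backoff` and its parameters are hypothetical names, not the actual implementation.

```python
import random
import sqlite3
import time


def execute_with_backoff(conn, sql, params=(), max_attempts=6, base_delay=0.05):
    """Run one write statement, retrying with exponential back-off on lock errors."""
    for attempt in range(max_attempts):
        try:
            with conn:  # commits on success, rolls back on error
                return conn.execute(sql, params)
        except sqlite3.OperationalError as err:
            # Only lock contention is retried; any other error propagates.
            if "locked" not in str(err).lower() or attempt == max_attempts - 1:
                raise
            # Exponential delay with jitter so competing jobs spread out.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

The jitter matters when dozens of registration jobs start at the same moment: without it they would all retry in lockstep and collide again.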
October 16th, 2015
Pegasus development
- does stime include I/O wait time? It does not appear so; the cp of a 1GB file indicates that.
- so then is there a way to capture the I/O wait time?
- pegasus-db-admin
- version migration for panorama works
- metadata schema finalized
- failing jdbc RC test
- metadata population
- metadata population from DAX working
- metadata attributes from transformation catalog and site catalog are now incorporated, as metadata events are generated at end of site selection
- output file sizes will be populated for files with register flag set to true.
- pegasus dashboard
- metadata display done other than the file information that needs to be populated
- cleanup algorithm
- will be done before Rafael leaves for vacation
- website changes
- panorama changes
- monitord change to make sure events don't get dropped
- online monitoring spawns a thread where there is a queue that is responsible for inserting the online monitoring events into the db
- the thread checks the database to make sure the job instance is populated.
- CURRENTLY, it is not done for the anomaly populations.
- SNS and Acme workflow
- maybe we can hire a student to do it
- maybe scalarm can be used for SNS workflows
- Ben said there is a meeting about Pegasus on Titan.
- Mats has installed wordpress on one of the machines.
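The monitord change above (a thread draining a queue, and checking that the job instance exists before inserting) can be illustrated with a small sketch. It is not monitord's code: plain Python lists and a set stand in for the database, and `insert_worker`, `run_demo`, and the event fields are hypothetical.

```python
import queue
import threading

_STOP = object()  # sentinel telling the worker to finish


def insert_worker(events, known_job_instances, stored, unresolved):
    """Store each event only once its job instance is known; defer it otherwise."""
    pending = []  # events whose job instance has not been populated yet
    while True:
        event = events.get()
        # Re-check deferred events first: their job instance may have arrived.
        still_pending = []
        for old in pending:
            if old["job_inst_id"] in known_job_instances:
                stored.append(old)  # stands in for the database insert
            else:
                still_pending.append(old)
        pending = still_pending
        if event is _STOP:
            unresolved.extend(pending)  # report leftovers instead of dropping them
            return
        if event["job_inst_id"] in known_job_instances:
            stored.append(event)
        else:
            pending.append(event)


def run_demo():
    events, known, stored, unresolved = queue.Queue(), set(), [], []
    worker = threading.Thread(
        target=insert_worker, args=(events, known, stored, unresolved)
    )
    worker.start()
    events.put({"job_inst_id": 7, "metric": "utime", "value": 1.5})
    known.add(7)  # the job instance shows up after its first event
    events.put({"job_inst_id": 7, "metric": "utime", "value": 2.0})
    events.put({"job_inst_id": 9, "metric": "utime", "value": 0.1})
    events.put(_STOP)
    worker.join()
    return stored, unresolved
```

Deferring rather than discarding is the point of the change: an event that arrives before its job instance is kept until the instance appears, and anything still unmatched at shutdown is reported.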
October 9th, 2015
Pegasus development
- pegasus-db-admin
- db version has been moved to string. a new column was added.
- metadata population
- files are populated if a user specifically associates metadata with a file in the DAX or if an output file is marked for registration
- make sure that for tasks metadata attributes are inherited from the transformation catalog.
- pegasus-metadata client
- output format ?
- is the client for end users
- list files for a workflow
- list workflow metadata
- pegasus dashboard
- workflow level
- task level
- file level metadata
October 2nd, 2015
Pegasus development
- pegasus-db-admin
- changes discussed last week?
- also change to string for the database version for allowing merges with panorama
- panorama db versions should be N.x and not whole integers
- jdbcrc sqlite test failures
- pegasus-transfer
- better job with grouping for ssh transfers.
- metadata population
- planner generates the events now for associating metadata with wf, job and files
- use case should be for a file what workflow and job created that file.
- Pegasus workshop
- we will be using workflow.isi.edu
- mats has created 30 training accounts on workflow.isi.edu
- suggestions on workflow example?
- blender rendering example..
- pegasus-dashboard should be installed
- Sipht portal
- back up and running
September 2015
September 25th, 2015
- Pegasus development
- pegasus-kickstart to return record on condor_rm ( SIGINT)
- changes to data reuse algo for Chris Edlund
- delete jobs when inplace cleanup is used for intermediate files that are not transferred to the output site.
- use of DAGMan NOOP keyword
- workflow test failures
- change monitord to not complain about noop jobs.
- comma separated directories for input dir
- automatically delete the input directory ? we all agree not a general use case.
- pegasus-transfer grouping should be done for all protocols?
- problem is some renames for output files
- avi has been running workflows on OSG with pegasus lite.
- 2 million connections over two days on SSH server
- pegasus-db-admin error handling.
- if it fails with error, it should not report that database has been updated. This is a bug
- other is what to do , when 4.5 is run against
- downgrade option
- warn if db-admin detects database version is higher than what it is currently running, and exit with 0 exitcode.
- Pegasus IEEE article accepted
- montage workflows
- dax generator is not maintained
- have it as a student project to convert the DAX generator to python API.
- they also do an overlap check
- montage jobs have varying memory requirements
- we should not showcase it.
- Pegasus Workshop in October
- fallback from USC HPCC cluster required
- whole day will be rough.
- Mats will not be around! Going for the duke workshop.
- panorama
- monitoring thread segfaults
- why was the segfault happening initially
- happening in fork system calls
- related to starting and stopping monitoring threads
- and how PAPI counters were updated.
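The db-admin behaviour proposed above (warn when the database is newer than the running tool, and still exit with 0) might be sketched as follows. This assumes dotted version strings such as "4.5" or "N.x"; `parse_version` and `check_db_version` are hypothetical helpers, not part of pegasus-db-admin.

```python
def parse_version(text):
    """Turn a dotted version string such as '4.10' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def check_db_version(db_version, tool_version):
    """Return (exit_code, message) for a database/tool version pair."""
    db, tool = parse_version(db_version), parse_version(tool_version)
    if db > tool:
        # Newer database than tool: warn, change nothing, exit cleanly.
        return 0, (
            f"WARNING: database version {db_version} is newer than "
            f"{tool_version}; leaving it untouched"
        )
    if db < tool:
        return 0, f"database version {db_version} can be updated to {tool_version}"
    return 0, "database is up to date"
```

Comparing tuples of integers rather than raw strings avoids ordering "4.10" before "4.9", which matters once versions are stored as strings.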
September 18th, 2015
- Pegasus development
- pegasus-db-admin updated
- for SCEC, added registration of flat LFNs when deep LFNs are used
- workflow tests now running.
- pegasus paper
- will add info about galactic plane and gtfar
- cloud challenges
- talk about virtual clusters: Precip / Wrangler
- tie more closely to setup stuff and talk about chef/puppet and Precip and Wrangler.
- gtfar
- add them in acknowledgements
- not much to add about cloud challenges other than image managements
- HubBub conference
- latech user who wants to run on bleaters
- tom bishop
- pegasus submit tutorial.
- to do with steven...
- panorama
- segfaults happening randomly
- happen when the monitoring thread is started.
- craft
- jarek
- hubzero
- chip design
- instead of hubzero use open science framework - a non profit funded thing
September 11th, 2015
- Pegasus development
- worker package tests in pegasus lite
- pegasus lite will complain if the system architecture
- panorama tests now work
- maybe some problems might be masked!
- jdbcrc
- updated jdbcrc . for mysql and postgres deletes work differently.
- Rafael will abstract it out
- gideon changed the way the papi counters are used in kickstart
- earlier signals were being used for threads to report counters
- PAPI now allows querying for counter values
- Pegasus cloud article
- ewa is doing the final edits
- HubBub presentation
- panorama
- darek working on getting papi counters to monitord
- changed the job metrics table in the stampede database.
September 4th, 2015
- Pegasus development
- worker package creation on the submit host.
- should we include the python externals directory?
- we will put that back in. we only need boto.
- also need to make sure it works for a RPM or deb install.
- implement the compatibility check in PegasusLite
- panorama tests
- better error for input file replica selection failures
- Scalr for openstack tests
- action has a new openstack deployment.
- have our two QNAPS setup on the build VM's to run workflow tests.
- run on vmware pool.
- SCEC shallow LFN's
- for registration in the replica catalog.
- put the test in 4.5 .
- Database schema changes
- pegasus-db-admin changes to database schema.
- downgrades work
- The short paper
- working on the google doc.
- we are not actively working on ec2.
- panorama
- adding papi counters to online monitoring.
- pegasus-transfer explodes when signal is sent
- online monitoring dashboard.
August 2015
August 28th, 2015
- pegasus 4.5.2 released
- worker package staging
- planner will use a worker package from the submit-side installation.
- pegasus s3 tests
- currently no s3 tests
- tests are running against 8.3.8
- cleanup algorithm update ( Rafael)
- estimate that it will be done in two weeks
- has to work for multiple sites
- cloud computing short paper
- HubBub
- panorama and dv/dt poster and presentations . in mid september
- metadata discussion
- google doc updated
- leaning towards monitor populating the database
- remove the estimated size and md5 checksum
August 21st, 2015
- pegasus 4.5.2 release
- release notes checked in
- db-admin changes?
- update man pages
- python source package
- tests are we moving to dev branch?
- docker problem
- how to get around it ?
- an issue inside docker, that is being exposed
- we will put in a wrapper around it.
- panorama branch is disabled
- but tests should be fixed.
- darek will be fixing it
- rajiv pushed out his dashboard changes for darek. for demo at supercomputing.
- cleanup algorithm
- Rafael will start next week
- how will the limits be passed
- kickstart changes
- metadata schema discussion
- next week.
- postscript
- dagman has plugins
- schema
- use case
- stampede is sqlite
- pegasus-exitcode write locks.
- separate sqlite database for metadata.
August 14th, 2015
- Pegasus 4.5.1 release
- Release notes online https://pegasus.isi.edu/news/4.5.1 is done
- Bamboo machine troubles
- panorama tests hung because of bamboo
- do experiment for the case where we do condor off and see what happens to pegasus-dagman.
- Panorama tests
- look at build #73
- pegasus-kickstart stuff
- for interpose stuff
- gideon investigating how to cover all cases for threads
- wants to make sure that descriptor table is accessed in a thread safe way. in worse case
- also is doing thread tracking, thread counters and thread lists
- directory structure organization for submit directories.
- nonsharedfs mode problem for auxiliary jobs
- sudharshan cleanup algorithm
- stefan update
- working on user models on how to submit jobs to HPC
- what user characteristics are of submission process
- to be able to show the IO part for SoyKB
- metrics of success
- makespan is reduced.
- number of service units is reduced
- what makes an application IO intensive
August 7th, 2015
- Pegasus 4.5.1 release
- Release notes online https://pegasus.isi.edu/news/4.5.1
- PEGASUS_SCRATCH_DIR is populated for all jobs in shared fs case.
- 4.6 common resource requirements
- we are now exposing three pegasus profiles cores, nodes and ppn.
- added logic to do specific translations for PBS and SGE
- cleanup bug fixed related to DAX transfer flag for input files
- larger question and agreement. transfer flags for input files usually don't have any meaning.
- transfer flag should be renamed or in the API
- change in schema
- at minimum we should change the DAX API's
- transfer attribute renamed to final output?
- spaces in Pegasus URL
- gideon feels it should be %20 instead
- somewhere in documentation .
- the planner should have more specific error message in case of spaces.
- kickstart enhancements - gideon
- fixing edge cases in kickstart for the extended reporting
- what can we do with the papi performance counters and see what will be used in panorama.
- will be updated for counters.
- gideon and darek will try and merge
July 2015
July 31st, 2015
- Pegasus 4.5.1 release
- will release it next week
- update the mapper documentation
- have a link to the replica catalog
- steven clarke cleanup issue
- resource requirements
- update the resource requirements section for 4.6
- acme integration
- rajiv will work with bibi to integrate it with the REST monitoring api
- kickstart changes to get papi counters
- Only triggered if -Z option is passed
- the paper on xsede mentioned about them reporting per threads
- also we keep better track of threads launched by the executable
- some edge cases for the thread case
- double execve of process does not work currently
- example: /usr/bin/env date
- also record command line options for all sub process launched
- in the proc record , the cmd tag
- grabs only first 1K of arguments
- monitord amqp population
- revert back to use the event name as the routing key for AMQP population.
- pegasus cleanup with peak storage requirements
- Panorama
- Data analysis done..
- ideas about writing a paper about workflow profiles
- Anomalies Detection
- showing anomalies in dashboard and population in stampede schema
July 24th, 2015
- XSEDE Tutorial
- 2 Posters and one tutorial
- news item online
- Pegasus Development
- common resource requirements PM-962
- documentation needs to be updated
- we have cores , hostcount
- karan should make sure cores is translated correctly to ncpus for PBS
- Pegasus REST API for integrating with Pegasus
- pegasus transfer
- checkpoint files
- LIGO developer notion of site attribute
- maybe we should be clearer in the documentation
- automatically changing parameters for memory on job retries
- check point file for the job is a partial solution
- monitord amqp population
- works.. we will document it on JIRA
- Panorama
- Darek implemented sending messages in batches from kickstart to rabbitmq
- socket-based communication between kickstart and libinterpose was done to take care of the file interleaving issue.
- tests on obelix and exogeni indicate socket writes are atomic for panorama message
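The batching of messages mentioned above can be illustrated with a small sketch. It is written in Python for brevity and is not the kickstart implementation: `BatchingSender` is a hypothetical class, and `send` is any callable supplied by the caller (for example a function that publishes to a message broker).

```python
class BatchingSender:
    """Buffer messages and hand them to `send` in newline-joined batches."""

    def __init__(self, send, max_batch=10):
        self._send = send            # callable taking one string payload
        self._max_batch = max_batch  # flush once this many messages are buffered
        self._buffer = []

    def publish(self, message):
        self._buffer.append(message)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self):
        # Send whatever is buffered as one payload; a no-op when empty.
        if self._buffer:
            self._send("\n".join(self._buffer))
            self._buffer = []
```

A final `flush()` at process exit is needed so that a partially filled batch is not lost.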
July 17th, 2015
- PMC Cpu affinity
- LIGO pegasus analyzer bug
- has been passed to LIGO . awaiting to hear from them
- Cleanup algo
- Resource Requirements
- common pegasus profiles
- SGE
- change.dir should be set automatically for shared filesystem stuff
- documented already.
- kickstart path variable to prepend.
- REST interface for monitoring for pegasus is done. Rajiv completed this week.
- extensions to the cleanup algorithm. rafael will start working .
- Pegasus 4.5.1 release
- will be done after XSEDE.
- Pegasus XSEDE tutorial
- XSEDE Pegasus Poster
- show a LIGO workflow for the XSEDE poster.
- Salt configuration needs to be updated
- Student machines on salt
- panorama
- rabbit mq installed on exogeni site.
- darek will get message batching working.
- gideon recommends doing it with the AMQP C API library
- message interleaving in kickstart.
- lot of unacknowledged messages in rabbit mq
- kickstart polling loop
- all kickstart memory values are in MB
July 10th, 2015
- PMC jobs automatic summing of maxwalltime. Should be disabled
- In PMC case we will do a division.
- PMC CPU affinity for jobs PM-953
- there might be a fragmentation approach.
- Pegasus REST interface
- short cut URL end points.
- karan will send email to Lavanya.
- running on SGE cluster using GLite interface.
- harmonized pegasus profiles
- Metadata
- will need the file implementation .
- Dashboard Panorama stuff
- September 16th. Time series and anomaly detection.
- Application level anomalies
- Infrastructure level anomalies.
- no plans for integration in production Pegasus.
- monitord profiling of monitord population.
- we want to see how long 1000 events take to be populated in case of LIGO .
- Panorama
- anomaly detection
- implemented a working prototype of threshold based anomaly detection
- kickstart sends events to rabbit mq, then monitord populates to influx db.
- darek tool queries influx db and takes in the metadata file generated by pegasus and determines the anomaly and sends it back to rabbit mq
- monitord then again picks up anomaly and populates it to stampede db for dashboard to display.
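Threshold-based anomaly detection, as described in the prototype above, reduces to comparing each monitored value against a per-metric limit. The sketch below assumes the thresholds have already been read from the metadata file and the samples from the time-series store; `detect_anomalies` and the field names are hypothetical.

```python
def detect_anomalies(samples, thresholds):
    """Return the samples whose value exceeds the threshold for their metric."""
    anomalies = []
    for sample in samples:
        limit = thresholds.get(sample["metric"])
        # Metrics without a configured threshold are never flagged.
        if limit is not None and sample["value"] > limit:
            anomalies.append({**sample, "threshold": limit})
    return anomalies
```

In the pipeline described in the notes, the returned records would be what gets sent back to the message queue for monitord to pick up.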
June 2015
June 12th, 2015
- Pegasus profiles for job/resource requirements
- postponed till next week when mats is here
- karan to create a list of relevant profiles
- pegasus dashboard
- locking issue?
- can this be related to new connection stuff or the failing tab?
- look at connection pooling .. or maybe transactions are not being closed properly?
- also see if there is an option for dashboard to set a read only lock when opening a connection to the databases
- panorama workflow tests
- failing.. but merge from master was done.
- karan to investigate
- panorama workflow dashboard
- updated the job metrics tab for doing the polling
- for mpi jobs the job name appears as aprun, since that is the process running on rank 0
- Job Survey paper
- Darek sent a final version
- will be submitting next week
- Pegasus Release timeline
- maybe we should put on our website somewhere?
- Rafael Energy paper
- information about building energy profile.
June 5th, 2015
- panorama usecase and metadata passing through
- not done yet for the metadata associated with files with replica catalog
- DONT rebase commits that have been pushed out
- job.runtime, cluster.maxruntime, maxwalltime parameters
- how to associate profiles. have a different namespace
- how is it exposed in the DAX API?
- python dependency
- stopped support for 2.5 and 2.6
- only affects redhat 5 systems.
- will have to install redhat 2.6 python package on 2.5
- setup tools for python 2.6 has to be at build time
- pegasus-dashboard updates for LIGO
- cleanup bug for intercept runs with InPlace cleanup.
- S3 storage
- about 9TB and rising for pegasus system services backup
- right now no backups are going to go to Glacier
- we only keep 2 weeks of data
- glacier is good if we want to keep 6 months of data
- 3 VMs for pegasus website, CROWD, etc.
- database on stewy and obelix
- qnaps /nfs/ccg3 and /nfs/ccg4
- Big ticket items of 9TB backup bucket in S3
- need to keep 2 backups in S3
- HubBub talk.
- abstract
- talk by Jack Dongarra.
May 2015
May 29th, 2015
- Bamboo test failures
- condor-c tests working now. changed the site catalog for those
- rhel5 json module
- pegasus-transfer will do a proper check and complain for missing json module
- mats will update documentation accordingly
- Python Dependencies
- New python dependency 2.6 from 2.4
- newer versions of Fedora use Python 3
- Fedora will keep python 2.x support till 2020.
- maybe have a dynamic bash wrapper across python code to pick the right python version
- have a tool called pegasus-python??
- concurrency limits
- apply to bamboo machine and our other workflow hosts.
- throttle number of grid jobs per categories of jobs. that is what SCEC wants and cannot be done.
- unless negotiation can be employed for grid universe jobs.
- define own throttles in compute jobs
- pegasus-dashboard
- LIGO has an issue with no authentication URL rendering.
- quoting for environment
- implemented. changed both for environment and +remote_environment
- docker universe support
- should work out of the box with condorio
- new dagman default values
- pegasus-statistics
- show badput?
- LIGO OSG
- Documentation
- 10 minutes using pegasus-docbook
- using new pipeline it uses 3 minutes
- the hyperlinks don't work
- include that into pegasus website template
- In PHP we tell Google not to index old version
- panorama
May 8th, 2015
Bamboo test failures
- montage tests are failing because of the remote service being down
- documentation title is messed up. gideon will look at it
pegasus-transfer new format
- mats has come up with a new JSON format.
- backward compatibility with the old format
- create dir and cleanup jobs will be different
Metadata
- google doc shared with people
- next steps are panorama use case for calling out
- ssh cleanup . JGlobus library does not implement ftp
LIGO on XSEDE
- have started using PMC
- data management
Python builds
- always check the python version.
- if we ship our own python modules, then we may have to
Bamboo build machine
- build and test plan ( running concurrently )
- also we can run docker stuff
- automate the salt setup of bamboo agents
- maintain one OS. Can action give us a beefier VM?
- we have too many documentation builds running ?
- VM with bamboo agent and use docker
- workflow tests are a separate issue
- they don't load the bamboo machine
- that is more related to a big condor pool.
- workflows tests will run always out of bamboo.
- mats and rajiv will work on it for the VM stuff.
Getting new SSL certificates
- *.isi.edu is screwed up in firefox
Metrics Server fixes
- google maps update broke the web UI.
- somehow all the colors were used in the trends ?
May 1st 2015
- Pegasus 4.5 release
- not heard back from SCEC and LIGO
- mats checked in the example
- will add release slider
- Variable Expansion
- pretty much done
- right now we have $()
- we will change with ${env-variable}
- have more helpful error message
- pegasus-kickstart
- file does not exist. now gives a proper error
- XSEDE poster due next week
- Monitoring Service API
- donald is almost done.
- PMC with PegasusLite
- PMC job by default runs on the shared filesystem
- tasks in PMC are pegasus lite tasks
- if a task does random I/O, then running on the shared fs might be tricky
- brazilian student contacted about pegasus application for real workflows.
- mats will be doing the transfer events for panorama next week
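The `${env-variable}` expansion with a more helpful error message, as discussed above, could be sketched like this. It is an illustration in Python rather than the planner's code; `expand` and the accepted variable-name pattern are assumptions.

```python
import re

# Matches ${NAME} where NAME looks like a typical environment variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand(text, env):
    """Replace every ${NAME} in text with env[NAME]; fail clearly if undefined."""

    def replace(match):
        name = match.group(1)
        if name not in env:
            # Name the variable and the offending string in the error.
            raise ValueError(f"undefined variable '{name}' in '{text}'")
        return env[name]

    return _VAR.sub(replace, text)
```

For example, `expand("file://${HOME}/data", {"HOME": "/home/bamboo"})` yields `file:///home/bamboo/data`, while a missing variable raises an error that names both the variable and the string it appeared in.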
April 2015
April 24th 2015
- Pegasus 4.5 release
- release candidate today rc2
- updates to pending items
- job throttling added to optimization guide.
- release notes are online https://pegasus.isi.edu/news/4.5.0
- waiting for db-admin unit tests to be checked in.
- pegasus-cleanup checking
- pegasus-lite-local.sh add some path before starting.
- rest monitoring API
- we have not heard back from lavanya yet
- PNNL acme stuff
- pegasus 4.6 release
- common pegasus-transfer , pegasus-cleanup and pegasus-createdir
- APP_PATH_PREPEND addon
- pegasus worker package staging
- planner calls out to common script to determine the worker package
- if it does not exist , we build a default worker package on the fly
- add extra logic to the untar job in the
- pegasus-gridftp modification for ssh ftp.
- software eggs
- panorama
- metadata for 4.6
April 17th 2015
- Pegasus 4.5.0 Release
- rc1 working for hub
- LIGO trying it out.. wanted to change checkpoint files. need to hear back on the dashboard changes.
- SCEC ? waiting to hear from Scott
- https://jira.isi.edu/issues/?filter=10851
- pegasus-db-admin sqlalchemy issues? for updating tables?
- pass through implemented for Glite to PBS
- verification of update to pegasus version on running workflows
- mats thinks his testing should do the trick.
- Pegasus Dashboard for bamboo user
- URL - https://cartman.isi.edu:5000
- Authentication - Uses PAM Authentication
- Admin Users - mayani, vahi, rynge, juve, rafsilva, darek, deelman
- Cedars visit
- SGE cluster
- we have 3 potential SGE cluster users: Cedars, Vision group at ISI and maybe Rutgers (that will be replaced with SLURM)
- Lavanya REST API
- Pegasus 4.6 release
- variable expansion thing figured out
- argument strings in dax, profile values in the dax
- site catalog.
- replica catalog file based one.
- need to now make changes in various parsers
- predefined environment variable
- metadata
- LIGO Dibbs .. ability to do data reuse based on metadata attributes
- panorama - pegasus - aspen interface
- iplant
- they want it in iRODS
- S3 tags.
- mats wants a better idea of what it looks like in the ideal world.
- file management on scratch directory, submit directory also?
- implementation of the REST API
- implementation for held job tracking
- Panorama requirements
- influx db monitoring , into pegasus-transfer.
- pegasus-transfer sends messages to rabbit mq about file size transferred
- pegasus aspen interface (modelling tool). aspen is a C++ library; the pegasus planner queries the aspen models for each node.
- command line tool pegasus-aspen
- planner needs to send application parameters, and all the metadata for the node.
- gets back a list of attributes , memory and usage, and convert them internally into pegasus profiles
- this can be a generator of metadata.
- application model which is a file and a machine model
- timeseries data . monitoring data about the dashboard, anomalies
- there is a CEP thing that anirban is developing and will determine anomalies.
- dv/dt requirements
- prediction service
- pegasus will query the prediction service
...