April 2019
April 12th 2019
- Pegasus 5.0
- Site Catalog Conversion to YAML
- mukund is mainly done
- pushed out his changes
- trying to make the tests green
- Checkpointing changes to accommodate LIGO use of vanilla universe
- Karan and Mats will explore and see if it is possible
- cumulative stdout|stderr
- what about time and duration values
- since there is no DAG Node retry and job just goes on HELD state
- Composite Events
- Kibana dashboard needs to be updated
- dropping __ in the event names
- George wants the AMQP library updated
- Will create a JIRA item
- Office Hours video
- Karan will work with Jasmine to upload the video
- Papers
- RACE Paper submitted last week
- PEARC Paper this week
- Proposals
- Army Research
- enabling in-situ supports for ExaScale
- linked with what Tu is doing
- SCEC Proposal Submitted
- have a good chance
- Exascale one with Michigan
- the call will come out soon
- Ewa, Rafael and Deborah
- NSF GCR Proposal
- Modelling wild fires
- Has Price School input and also Deborah's postdoc
- EScience
- Pegasus Tutorial Proposal
- May 6, 2019: Tutorial Proposal Deadline
- Also trying for the workflow comparison paper
- Dynamo paper by George
- Pegasus connect discussion
- tabled it for later when Mats is around
- HTCondor Week
- Karan will be doing a Pegasus talk and Pegasus workshop
- Pegasus OLCF Poster
- combine the panda poster
- can also submit to EScience
- Ryan's work
- Loic is moving pachyderm setup to AWS
- Loic, Rafael and Tu are working on a paper for Cluster
- Software X
March 2019
March 29th 2019
- 4.9.1 Release
- done and working on 4.9.2
- Site Catalog Conversion to YAML
- mukund working on it
- I still need to look at the Bamboo tests
- Bamboo is failing on the Condor mount scratch issue
- we have to fix this in Pegasus also, to fail on credentials in /tmp
- check and do condor_config_val on the key and check if /tmp is in there
- mainly affects all the users that use x509
- LIGO has also tripped over it, both with Pegasus and without Pegasus
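The check discussed above could look roughly like the following. This is a minimal sketch, assuming the key in question is HTCondor's `MOUNT_UNDER_SCRATCH` knob (the notes only say "the key") and that its value is a plain comma- or space-separated list of paths. The helper names are hypothetical and not part of Pegasus.

```python
import subprocess


def parse_path_list(raw):
    """Split a comma/space separated config value into normalized paths."""
    tokens = raw.replace(",", " ").split()
    # Drop surrounding quotes and trailing slashes so "/tmp/" equals "/tmp".
    return [t.strip('"').rstrip("/") or "/" for t in tokens]


def tmp_is_remapped(raw):
    """True if /tmp appears in the configured list of paths."""
    return "/tmp" in parse_path_list(raw)


def query_condor(key="MOUNT_UNDER_SCRATCH"):
    """Return the value of `key` from condor_config_val, or "" if unavailable."""
    try:
        result = subprocess.run(
            ["condor_config_val", key], capture_output=True, text=True
        )
    except FileNotFoundError:
        # HTCondor is not installed on this host.
        return ""
    return result.stdout.strip() if result.returncode == 0 else ""
```

A planner-side check would then call `tmp_is_remapped(query_condor())` and fail early when a credential path lives under /tmp.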
- Condor vanilla checkpointing
- karan asked him about what he is trying to do
- composite events
- check for keys with same values
- also do we need to pad extra keys for all events?
- Extensions to Jupyter Integration
- Pegasus Connect
- will discuss on whiteboard on April 12th
March 1st 2019
- 4.9.1 Release
- moving it to early next week
- Pending Issues
- https://jira.isi.edu/projects/PM/versions/11891
- Execution environment for titan
- service dependencies
- pyOpenSSL
- Rajiv Mayani please look at that and the flask dependencies
- HPSS transfer client incorporation
- Set the transfers to do remotely
- Office Hours
- On Friday March 22nd on real time monitoring
- transformation catalog for 5.0
- Mukund will work on it next
- EScience?
- Paper
- pegasus-exitcode test
- success message not parsed correctly
- Programmer
- will interview the
February 2019
February 22nd 2019
- 4.9.1 Release
- Pending Issues
- https://jira.isi.edu/projects/PM/versions/11891
This raises the larger issue of how long we want to support externals packages.
There are some packages we need to ship because of worker package dependencies.
Consensus:
We remove the mysql python externals package for 4.9.1 and 5.0.0, and also remove the dependencies from our deb and RPM builds.
- Transfers within containers
- We are only going to transfer from within the container till people complain
- George Papadimitriou will add to the documentation.
- non-ASCII encoding in the stdout
- Support HPSS storage
The tools we use are htar and hsi
https://docs.nersc.gov/filesystems/archive/
- Office Hours
- George on real time monitoring.
- Date?
- EScience?
- Paper
- Tutorial submission
February 1st 2019
- 4.9.1 Release
- ASCII encoding breaks while parsing for monitoring events. monitord should keep the population working and log a warning.
- but we should ensure that stdout in database still gets populated
- Karan will fix this
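One way to get the behaviour described above (keep populating stdout, but warn) is to fall back to a lossy decode. This is a minimal sketch, assuming the job output arrives as bytes. The function name is hypothetical and this is not monitord's actual code.

```python
import logging

logger = logging.getLogger("stdout_decoding_sketch")


def decode_job_output(raw, encoding="ascii"):
    """Decode job stdout; on bad bytes, warn and substitute U+FFFD."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        logger.warning(
            "stdout is not valid %s at byte %d; replacing bad bytes",
            encoding,
            err.start,
        )
        # errors="replace" never raises, so the database still gets a value.
        return raw.decode(encoding, errors="replace")
```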
- New TC Format
- Shifter Support in Pegasus
- is in 4.9 branch
- Pegasus Annual Report
- will be working on it in coming weeks
- will ask for input
- next year's report will be tricky in terms of effort allocation.
January 2019
January 25th 2019
- 4.9.1 Release
- ASCII encoding breaks while parsing for monitoring events. monitord should keep the population working and log a warning.
- but we should ensure that stdout in database still gets populated
- YAML format for the TC
- the line numbers should be mentioned in the errors
- GitHub commits don't trigger bamboo builds right now
- move to webhooks?
- Slack token in bamboo.yml
- mats will look into it further
- SCEC for HPC Transfer certificate issue
- Globus online certificates messed up hpc-transfer issue.
- Data Storage at NERSC
- almost full
- Singularity container with the entry point.
- docker → singularity container conversion does not add the entry point.
January 18th 2019
- 4.9.1
- container execution
- data transfers happen within the container
- python3 issue
- vague rules to discover what python to use
- Singularity Hub URLs updated
- Documentation and tutorials need to be updated
- montage examples
- python stuff: create JIRA item
- LIGO pull requests
- Build pull request
- PAM module
- subprocess package thing
- also related to Python3 movement
- Transformation Catalog Implementation
- Astro Py
- Shifter support at NERSC
- Panda Integration
- XENONnT
- Rucio data pull in
- fetching data might be easier
- Journal Paper
- need to write something about containers
December 2018
December 13th, 2018
- Pegasus 4.9.1 release
- local site catalog entry creation
- based on the pegasus version on the submit host
- encoding issue in the stdout.
- Pegasus 5.0 Release
- TC yaml implementation
- mukund will create a yaml schema compatible with the TC
- backwards compatibility
- case by case basis
- definitely for
- catalogs
- dax
- pegasus-transfer
- SWIP Paper
- we are in good shape
- Titan
- under the PBS batch gahp.
- ZTF
- the pipeline is based on docker-compose
- peter will visit ISI with postdoc Danny in January
- Tutorial at TACC
- karan has updated pegasus-init to work on wrangler
- will update the tutorial notes accordingly
- OLCF accounts
- make sure they work
- make sure Karan and Mats can log in
November 2018
Nov 29th, 2018
- Ryan
- working on comparison paper with george on workflow systems
- mats, karan shared neon meeting notes with Ryan
- Pegasus 4.9.1 release
- Due by the end of December
- potential issue in monitord in reference to hierarchical organization of submit directories
- pegasus-submitdir
- ADASS Paper
- due tomorrow
- need to add information about sample run
- SWIP paper
- mats and karan will work on it tomorrow afternoon.
- cull out sections
- add information about updated monitoring in 4.9
- OLCF Kubernetes
- Condor is installed and configured as root
- George tried pointing the Condor log directory at Lustre, since Condor in the container has to run as a user, not as root
- LOG_DIR should be /tmp
- volumes can be attached to container to contain workflows etc
- Dynamo
- Do dynamic scheduling
- George thinking of using flocking
- similar to what is done in OSG
- non-sharedfs deployments should work
Nov 1st, 2018
- Pegasus 4.9.0 and 4.8.5 Released
- We released it this week.
- Pegasus Business Card
- Advocate for job postings.
- Postdoc options
- Programmers
- pegasus.isi.edu/jobs
- We should take to conferences with us
- Pegasus JAVA 8 dependence in RPM
- there is a disconnect between RPM and common.sh
- ADASS
- Karan working on a wlpipe demo example
- New Student
- Mukund
- Duncan started using 4.9.0 and has updated pyCBC to use singularity
- changed our container execution model
- all transfers done within the container now.
October 2018
Oct 12th, 2018
- Rescheduling meetings
- New time is Thursdays 2PM starting from last week of October
- DAX API reporting
- Perl DAX API - Rajiv
- Atlas visit
- Wednesday we have Scientific Computing Seminar
- Will involve writing a Pegasus code generator
- Panda is second biggest after Condor on OSG
- Thursday
- Karan and George will be there.
- Mats might be available remotely
- 4.9.0 Release
- Mats' preference is to skip the beta tag
- Aim for the full release
- Documentation freeze on Oct 26th
- Try and do the builds over the weekend
- Duncan container usecase
- cvmfs hosted container images
- Demo repository
- panorama data and some runs from exogeni / nersc
- Mats has two new Elasticsearch VMs that are part of the Elasticsearch cluster
- the data on these VMs is also backed up
Oct 5th, 2018
- Rescheduling meetings
- Either Tuesday or Thursdays
- Karan will circulate a doodle poll
September 2018
September 28th, 2018
- Rescheduling meetings
- Either Tuesday or Thursdays
- Karan will circulate a doodle poll
- Pegasus 4.9.0 Release
- transformation selection issue
- karan has not been able to recreate it yet.
- will look into it more today
- docker singularity pulls
- container symlink
- deprecate APIs
- modify DAX generators to indicate version/ DAX API used.
- will look into ways on how to do it
- one way is workflow metadata attributes
- second is attribute to ADAG object.
- rajiv will check how it gets stored in the metrics server
- ADASS
- will try and do a poster with Mike at ADASS
- deadline is Oct 8th
September 21st, 2018
- Rescheduling meetings
- Either Tuesday or Thursdays
- Pegasus 4.9 release
- integrity error reporting
- pegasus-statistics reporting information about integrity errors
- the unicorn dashboard for internal swip purposes
- errors are appearing in the stream
- more brainstorming required. the data is there
- not clear whether to use grafana or kibana
- does not have drill down functionality
- mix of production and test workflows
- create different queues in AMQP exchanges
- container mount point support
- Karan is close to having that implemented
- transferring outputs to multiple locations
- lets say one for portal and the other for
- list of output sites
- good feature to add for 4.9.1
- update --output-site option to pegasus-plan
- pull docker images for singularity runs
- we should do for 4.9.0
- planner needs to tell pegasus-transfer an extra attribute.
- add a type attribute
- Papers
- Github private papers repo
- Deprecate stuff
- perl api
- old catalog formats
- pegasus-plots
- Hiring
August 2018
August 24th, 2018
- Pegasus 4.8.4 Release
- when are we releasing?
- next week before Mats goes on vacation
- error tagging
- update stampede schema to add a table called tags
- will allow us to capture number of integrity errors
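A rough idea of what such a table and the resulting count query might look like. This is a sketch only, using an in-memory SQLite database: the column names, the tag name `integrity.error`, and the sample rows are all hypothetical and are not the actual stampede schema.

```python
import sqlite3

# Hypothetical layout for a "tags" table; not the real stampede schema.
SCHEMA = """
CREATE TABLE tags (
    tag_id      INTEGER PRIMARY KEY,
    wf_id       INTEGER NOT NULL,
    job_name    TEXT    NOT NULL,
    name        TEXT    NOT NULL,
    occurrences INTEGER NOT NULL DEFAULT 1
);
"""


def count_tagged_errors(conn, wf_id, name="integrity.error"):
    """Total number of errors carrying tag `name` for one workflow."""
    row = conn.execute(
        "SELECT COALESCE(SUM(occurrences), 0) FROM tags "
        "WHERE wf_id = ? AND name = ?",
        (wf_id, name),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO tags (wf_id, job_name, name, occurrences) VALUES (?, ?, ?, ?)",
    [
        (1, "stage_in_0", "integrity.error", 2),
        (1, "analyze_3", "integrity.error", 1),
        (2, "merge_1", "integrity.error", 4),
    ],
)
```

Keeping tags in their own table means new error classes can be added later without another schema change.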
August 17th, 2018
- Pegasus 4.8.4 Release
- RPM fix ?
- mats will manually verify
- Karan should follow up with Stuart
- AMQP filtering
- we are working on having filtering built into monitord
- nepomunk already has 33 errors identified
- we need the db connection, pegasus-db-admin and other tools to pass properties with the pegasus property prefix stripped off
- SWIP Paper
- one reject seems to be harsh
- we can try for HPDC also
August 3, 2018
- Pegasus 4.8.3 Release
- singularity fix
- mats talked to adam at nebraska about containers.
- the main doc book will not be updated for 4.9
- SLURM
- Design Safe / TACC on Wrangler headnode
- Nextflow has integration with SLURM and everything can be installed in user space
- PMC unit tests are broken
- let's fix the tests
- Pegasus 4.9 release
- more real life runs
- nepomunk against ceph-s3 from one of uchicago machines
- we need to get stats reported for integrity errors
- larger issue of error classification
- ADASS Tutorial
- we got into second round
- add on exercise to run montage in the end.
- LIGO
- Bruce group at AEI Hannover has left LSC
- Infrastructure
- HipChat mess
- should we move to ISI Slack
- Public Chat feature
- Some clients for Hipchat
- Get a free channel from Slack
- for all Hipchat rooms
- what about ISI slack??
- Github removal of old integrations
- moving email notifications. Rafael Ferreira Da Silva will take care of it
- we need to explore
- MINT Meeting
- went well overall
- issue of scoping
July 2018
July 27th, 2018
- Pegasus 4.8.3 Release
- VM Tutorial
- will update pegasus-init requirements to get it working
- main tutorial chapter will be updated for 4.9
- because then tutorial based container may not work
- change how docker scripts set environment
- SCEC database loading error
- Failing Tests
- Issue in updates to the dashboard database
- Panorama Paper
- agreed on a re-organization
June 2018
June 29th, 2018
- Pegasus
- 4.8.3 needs to be released because of singularity launching options
- will wait till tutorial is updated.
- karan will update pegasus-init with population modeling or povray option
- 4.9
- pegasus-statistics updated with integrity metrics
- how to flag job errors because of integrity
- need to figure out logic
- value add proposition
- maybe we should value type in the pegasus lite
- need to implement the integrity dial
- Start creating default local site entries to execute without local site
- ADASS Tutorial
- Will submit today
- Google doc shared
June 22nd, 2018
- Pegasus
- SWIP paper submitted to escience
- 4.8 montage tests failing
- changes for integrity metrics in pegasus-transfer
- updated monitord to parse events from various sources like pegasus lite output
- Mats pointed out a bug in monitord
- LIGO
- pip for python source package
- update dependencies for latest packages, like pyOpenSSL
- install in the pip repository
- pegasus-analyzer
- interested in swip and containers.
- SCEC CSEP
- will use containers
- run on Comet
- 1000 genome workflow or use chimerica workflow
- ADASS Tutorial
- montage ?
- probably pycbc is also submitting a proposal
June 8th, 2018
- Scott Replica Catalog issue
- Replica Catalog deletes take a long time
- Bamboo
- Bamboo emails are no longer received, so we don't find out about workflow plan failures
- SWIP
- monitord integrity changes. population of data from ks records working now.
- we still need to populate data from pegasus lite records and pegasus-transfer
- pegasus-statistics needs to be updated
- 0.1% overhead on production osg gem workflow
- Pegasus deployment at ORNL
- we should be doing it similar to hpc-pegasus
- Pegasus Office Hours
- next one in August
- travels in July
May 2018
May 4th, 2018
- Pegasus 4.8.2 Release done on May 3rd
- we should consider separating user data into a separate file on pegasus-wms
- si2 meeting updates
- some potential new users
- ewa slides were a good overview summary
- integrity data schema changes.
- monitord changes need thinking
April 2018
April 6th, 2018
- Pegasus 4.8.2 Release
- PMC bugs
- tutorial for usc hpc
- no longer allow + or . in the names
- Pegasus Report
- Submitted for Ewa's review
- SWIP test run
- discovered integrity errors in the wild
- at colorado and university of nebraska
- we would not have caught it before
- e-science paper
March 2018
March 30th, 2018
- SWIP
- pegasus-run issue, with wf restarting from scratch
- because dagman rescue file is not there.
- so should we update pegasus-run to look at the dagman.out file
- so far we think it should be kept consistent with normal dagman behavior
- to be discussed at Condor Week
- mats created a Jira item for swip related statistics
- https://jira.isi.edu/browse/PM-1260
- will involve a database schema.
- Things remaining
- Dials to be implemented
- stampede changes
- pegasus-transfer changes???
- SC Tutorial Submission (April 16th)
- https://sc18.supercomputing.org/submit/tutorials-submissions/
- We should try and add exercises for containers
- We will try for half day
- 45 minute introduction
- Feedback from Arizona Container Camp
- There is interest.
- coming up with an existing application that people understand or can relate to
- montage - complex dax generator
- rosetta
- only works in nonsharedfs stuff
- with
- machine learning example?
- with tensor flow?
- requires container
- NVIDIA has a lot of examples about machine learning
- has to be multistep
- and at least bag of tasks
- Ashwin is doing some tensor flow stuff
- on workflow.isi.edu
- is working out of jupyter notebook
- Genome sequencing workflows??
- use Broad GATK sequencing workflow to use
- SOYKB and IRRI use GATK
- and are huge communities
- http://biocontainers.pro/docs/101/running-example/
- Pegasus Report
- we should resolve Jira items as we fix them
- will be also doing cumulative statistics
- Pegasus Office Hours
- Jupyter Notebooks
- will update the example to use namd example used for Oakridge
- Panorama Stuff
- our multiplexing part in monitord done so far
- however we are relying on amqp queues and routing keys for filtering
- darshan data population
- we need a script (pegasus-darshan), invoked from the namd wrapper script, that pulls the data from the Darshan logs on the file system and generates an ASCII output
- Panorama.isi.edu VM
- AMQP
- Logstash
- Kibana
- Elastic Search
- Make it do a backup every so often.
- Warns against doing it as a permanent datastore
- Rajiv will verify
- Influx
- Backups
- CRASH PLAN backup for the /srv and /opt in the panorama VM
- LIGO Database locked issues
- we need to look into the locking issues by tinkering with monitord flush intervals
March 16th, 2018
- SWIP
- Most of the SWIP stuff is done as far as planner changes and getting the workflows running
- we are in a position to share something
- To do
- sharedfs
- Dial implementation
- Update monitoring
- Paper submission for EScience
- Pegasus Reports
- new applications to attribute to pegasus grants
- all the mike wangs work will go here
- SCEC
- LIGO - need to ping Duncan
- Panorama/ Pegasus workflow endpoints
- We seem to be going towards AMQP
- How is AMQP going to be configured
- So far we have
- amqp://[USERNAME:PASSWORD@]amqp.isi.edu[:port]/<exchange_name>
- Online monitoring in kickstart: amqp://[USERNAME:PASSWORD@]amqp.isi.edu[:port]/<virtualhost>/<exchange_name>
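The two URL shapes listed above can be told apart by the number of path segments. This is a minimal sketch using only the Python standard library. The function and type names are hypothetical, the fallback virtual host `pegasus` follows the note in these minutes that monitord currently hardcodes it, and 5672 is the standard AMQP port.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote, urlsplit


class AmqpEndpoint(NamedTuple):
    host: str
    port: int
    username: Optional[str]
    password: Optional[str]
    virtual_host: str
    exchange: str


def parse_amqp_url(url, default_vhost="pegasus"):
    """Parse amqp://[user:pass@]host[:port]/[<vhost>/]<exchange>."""
    parts = urlsplit(url)
    if parts.scheme != "amqp" or not parts.hostname:
        raise ValueError("not an amqp:// URL: %r" % url)
    segments = [unquote(s) for s in parts.path.split("/") if s]
    if len(segments) == 1:
        # Short form: only the exchange is given.
        vhost, exchange = default_vhost, segments[0]
    elif len(segments) == 2:
        # Long form: virtual host followed by exchange.
        vhost, exchange = segments
    else:
        raise ValueError("expected /<exchange> or /<vhost>/<exchange>")
    return AmqpEndpoint(
        parts.hostname,
        parts.port or 5672,
        parts.username,
        parts.password,
        vhost,
        exchange,
    )
```

Accepting both forms in one parser would let monitord and kickstart share the same endpoint syntax while the virtual host question is still open.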
- Virtual Hosts
- right now virtual host is hardcoded in monitord code. we set it to pegasus
- global - across workflows
- Exchanges
- should be global across workflows
- type direct - in panorama
- we want them to be type -> topic instead
- Queue
- in panorama a different queue for each workflow
- Routing Keys
- the routing key should be based on stampede event names
- Events populated
- https://pegasus.isi.edu/documentation/stampede_wf_events.php
- We should add periodic events about states of workflows
- SWIP integrity error events will be populated by clients
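The move from direct to topic exchanges matters because topic bindings can select whole families of events by routing key. The sketch below implements the standard AMQP topic matching rules (`*` matches exactly one dot-separated word, `#` matches zero or more) in plain Python. The event names used in the usage note are illustrative routing keys in the style of the stampede event names, and the function name is hypothetical.

```python
def topic_matches(binding_key, routing_key):
    """Return True if routing_key matches an AMQP topic binding_key."""
    pattern = binding_key.split(".")
    words = routing_key.split(".")

    def match(p, w):
        if p == len(pattern):
            # Pattern exhausted: succeed only if all words were consumed.
            return w == len(words)
        if pattern[p] == "#":
            # '#' may absorb zero or more of the remaining words.
            return any(match(p + 1, k) for k in range(w, len(words) + 1))
        if w < len(words) and pattern[p] in ("*", words[w]):
            return match(p + 1, w + 1)
        return False

    return match(0, 0)
```

With this, a queue bound with `stampede.job_inst.#` would receive `stampede.job_inst.main.end` but not `stampede.xwf.start`, which is the kind of filtering a direct exchange cannot express.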
February 2018
February 23rd, 2018
Eliminate support for Py2.6?
Python Dependencies
All - future
pegasus-service - Flask, SQLAlchemy, Flask-SQLAlchemy, Flask-Cache, pam, plex, pyOpenSSL, ordereddict
pegasus-monitord - SQLAlchemy
pegasus-analyzer - SQLAlchemy
pegasus-s3 - boto
pegasus-globus-* - globus-sdk
pegasus-init - jinja2
pegasus-metadata - argparse
pegasus-em - requests
PostgreSQL - psycopg2
MySQL - MySQL-Python OR mysqlclient
Note: Packages in green are available from yum.
February 9th, 2018
- SWIP
- checksum computation will be implemented in pegasus-transfer.
- allows us to handle the case where the input files don't have checksums in the RC
- integrity checks are disabled now for files that don't have checksums in the RC
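The handling of files with and without a recorded checksum could be sketched as follows. This assumes SHA-256 as the checksum type (the notes do not name one). The function names and status strings are hypothetical and this is not pegasus-transfer's actual code.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_integrity(path, expected=None):
    """Return (status, checksum) for a transferred file.

    status is "computed" when no checksum was recorded for the file,
    otherwise "verified" or "mismatch".
    """
    actual = sha256_of(path)
    if expected is None:
        # Nothing to compare against: record the checksum for later checks.
        return "computed", actual
    status = "verified" if actual == expected.lower() else "mismatch"
    return status, actual
```

Returning the computed checksum in the no-checksum case is what lets later jobs verify a file even though the replica catalog never listed one.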
- dial knob
- Tests
- seem to be slow
- bamboo could be moved to the new server
- storage constraint test
- Lizard FS
- Mats will give an update next time around
- Servers
- Trying to do two servers
- If we buy one server
- Buy a storage server. That is Mats' preference.
- SoyKB workflow has
- Compute
- we will get a compute server first.
- We should figure out the server and put in the request soon, and be done by the end of February
- LSST
- Tom Glanzman?
- We will touch base on Monday with Tom and Nersc folks
- Office Hours today
- have a presentation on containers
- will upload on the website
January 2018
January 12th, 2018
...