March 2015
March 13th, 2015
- Metrics Server
- deployed on the production server.
- want to do anything on basis of distribution of files
- donald will create a new histogram page
- Pegasus NSF Report
- sent to Ewa
- Pegasus 4.4.2 release
- karan will check in release notes today
- Pegasus Tutorial as part of HPC Workshop Series in April
- Gideon will be going to the summer school.
- Pegasus 4.5.0 release
- Targeting May 1st release
- local-scratch is picked up.
- ensemble manager submission
- will support both modes
- bundle mode
- public ensemble manager. there are security issues. user credentials.
- the person who starts the service will setup the credentials
- pegasus-analyzer fix for case where jobs eventually succeed after failures
- pegasus-db-admin update
- ds
- transfer grouping of staging jobs
- Pending items
- User Questionnaire
- 12 responses for
- a lot of people are interested in a workshop
- better support for loops and branches
- better provenance support.
- Workflows on Google and Amazon
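The pegasus-analyzer item above (jobs that eventually succeed after failures) comes down to judging a job by its last attempt rather than by any earlier attempt. A minimal sketch of that idea, assuming attempts are available as (job name, try number, exit code) tuples. The function names and the record layout are hypothetical and are not pegasus-analyzer's actual code or schema:

```python
def final_job_status(attempts):
    """Map each job name to the exit code of its highest-numbered attempt."""
    latest = {}
    for job, try_number, exitcode in attempts:
        if job not in latest or try_number > latest[job][0]:
            latest[job] = (try_number, exitcode)
    return {job: code for job, (_, code) in latest.items()}


def failed_jobs(attempts):
    """Report only jobs whose last attempt ended with a non-zero exit code."""
    status = final_job_status(attempts)
    return sorted(job for job, code in status.items() if code != 0)


# Hypothetical attempt records: "merge" fails once and then succeeds,
# "stage_out" fails on every try.
attempts = [
    ("merge", 1, 1),
    ("merge", 2, 0),
    ("split", 1, 0),
    ("stage_out", 1, 2),
    ("stage_out", 2, 2),
]
print(failed_jobs(attempts))
```

With this rule "merge" is not reported as failed, since its retry succeeded.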
March 6th, 2015
- metrics server update
- plans to deploy the changes today. fixing last issue
- still has to make the database schema changes required for planner file counts
- will be done next week
- planner reports file breakdowns
- pegasus 4.4.2 release
- it has fixes that LIGO is interested in.
- most probably next week.
- pegasus-db-admin
- reorganization of the code and the schema.
- pegasus-archive / pegasus-delete
- rafael does not have time to work on these because of proposal work
- will move to either gideon or mats
- pegasus-dashboard updates
- has more LIGO requests for pegasus 4.5.0 release
- wsgi script for root mode
- LIGO visit
- post 4.5 we will do better organization of files on the file structure
- Pegasus poster for LIGO meeting
- ensemble manager
- scec folks will try it
- monitord netlogger bugfix
- pegasus-transfer enhancements for panorama
- job submission paper in github
- pegasus and job management systems.
- online monitoring for pegasus-kickstart
- application sends signal to pegasus-kickstart via libinterpose
- pegasus-keg extensions
- the pegasus-mpi-keg is a separate executable
- extensions to the io stuff
- will incorporate in 4.5.0
- NSF report
- still waiting to hear from mats and scott
- karan is still updating the metrics page.
February 2015
Feb 20th, 2015
- metrics server update
- donald still has to deploy the changes.
- pegasus user questionnaire
- gideon will send new links and will update
- SCEC update
- scott has debugged his memory
- Pegasus Report
- soykb and other iplant workflows ... part of ECSS
- galactic plane
- ahmed's work
- pegasus dashboard updates
- pegasus-dashboard is started whenever bamboo is built up
- dashboard shows all states for a job now.
- pegasus-db-admin tool
- test cases in bamboo
- documentation
- migration notes
- some python errors that need to be fixed.
- 4.5 release
- still remaining
- held jobs tracking in monitord
- job retry set to 1 and disable retries for DAX jobs
- decrease the held period from one hour when job is removed.
- improved documentation for output mappers
- ensemble manager todo's
- we won't have ensemble manager in multiuser mode
- support both modes ( upload a tar file and finer grained control where he specifies the DAX files and the submit directory )
- only the dashboard will run in multiuser mode
- how do we start ensemble manager process
- run per user.
- copying of catalog files to submit directory.
- still remaining
- input directory copies based on recursive transfers as part of directory
- it won't work in condorio mode because it flattens out
- add type directory in the DAX schema.
- pegasus tutorial
- environment variable file substitution in site catalog, replica catalog and transformation catalog
- XSEDE Tutorial proposal and Posters
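The environment variable substitution item above (for the site, replica and transformation catalog files) can be illustrated with a short sketch. It assumes a `${NAME}` placeholder syntax and uses Python's standard `string.Template`. The function name `substitute_env` and the sample line are hypothetical, and this is not how the planner itself implements it:

```python
import os
from string import Template


def substitute_env(text, env=None):
    """Replace ${NAME} placeholders with values from env (default: os.environ).

    Placeholders with no matching variable are left untouched, so a
    missing variable stays visible in the output instead of raising.
    """
    env = os.environ if env is None else env
    return Template(text).safe_substitute(env)


# Hypothetical catalog line with two placeholders.
line = "f.a  file://${SCRATCH}/inputs/f.a  owner=${OWNER}"
print(substitute_env(line, {"SCRATCH": "/lustre/scratch"}))
```

Leaving unknown placeholders in place is a design choice for the sketch. A real tool might prefer to fail loudly when a variable is undefined.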
January 2015
Jan 14th, 2015
- metrics server update
- no update from Donald; still away on vacation
- Pegasus development
- data configuration for different sites
- working for steven
- held jobs
- pegasus-dashboard
- root mode for dashboard and ensemble manager
- gideon needs to confirm for ensemble manager
- done for dashboard
- pegasus-analyzer bug fix
- pegasus-db-admin tool update
- unit tests
- bamboo pool will break.
- upgrade to newer version of Pegasus
- what happens to running workflows
- pegasus-statistics with PMC - Mats and Rajiv
- mats and rajiv will work on it.
- docker based tutorial launcher
- how to integrate in the build process
- form
- candidate machine
- obelix
- vmware colo vm
- Pegasus Poster for Si2
- will base on the previous years.
- any particular thing we want to focus on ? or general?
- Pegasus Annual Report
- User questionnaire - need to send out.
- list of people to send it out to. Gideon has one?
Jan 7th, 2015
- metrics server update
- no update from Donald; still away on vacation
- 4.4.1
- installed on workflow
- OSG and XSEDE submit hosts will be upgraded in 3 weeks
- need to follow up with LIGO
- database upgrade tool integration
- documentation and manage left
- import error for properties
- python test case
- support for per site data configuration
- mostly done/ still need to figure out worker package staging for that.
- pegasus-dashboard
- should we show all job instances for a job.
- held jobs logged by pegasus-monitord
- user questionnaire
December 2014
Dec 8th, 2014
- metrics server update
- minor bugs in the UI... still need to be fixed, especially how the session states are handled
- things remaining to do
- database/server side pagination
- figure out the scroll issue for the trend charts
- move the trends charts from the home page to under planner and download tabs
- rename run metrics to dagman metrics, and instead of showing the most number of times a workflow was run, we want to see the top applications for which dagman workflows were run
- for the time bar on the top, have drop down menu for years and months
- can the maps pin show the actual number, for example in the top downloads map thing
- monitord fixes
- for the race issue with postscript handling PM-798
- had to change the way stdout and stderr are populated for job_instance. They are now populated when the POST_SCRIPT_TERMINATED event happens
- pegasus-analyzer fixes
- show the planner log when prescript for sub dax fails. PM-808
- we want to release 4.4.1 before the break.
- has monitord fixes that LIGO requires
- tracking held jobs
- decided to add a column in the jobstate table to capture why a job was held
- changes to pegasus-keg
- to simulate reading in input and writing out of output files
- will also simulate cputime and walltime
- initially pegasus-keg will read in and write out the outputs and then do the sleep for the cpu time duration
- removing the system information that it prints out
- in the mpi version, the IO is solely done by the master.
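The planned pegasus-keg behaviour described above (read the inputs, write the outputs, then sleep for the CPU time duration) can be sketched as follows. This is an illustration only: `simulate_job` and its parameters are hypothetical names, and it does not reflect the actual pegasus-keg implementation or its command line:

```python
import os
import tempfile
import time


def simulate_job(inputs, outputs, cpu_seconds):
    """Read every input, write every output, then sleep for cpu_seconds.

    inputs: list of file paths that are read in full.
    outputs: dict mapping output path to the number of bytes to write.
    Returns the total number of bytes read.
    """
    bytes_read = 0
    for path in inputs:
        with open(path, "rb") as handle:
            bytes_read += len(handle.read())
    for path, size in outputs.items():
        with open(path, "wb") as handle:
            handle.write(b"\0" * size)
    time.sleep(cpu_seconds)  # stand-in for the simulated compute phase
    return bytes_read


# Small demonstration in a throwaway directory.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "f.in")
out_path = os.path.join(workdir, "f.out")
with open(in_path, "wb") as handle:
    handle.write(b"x" * 10)
total = simulate_job([in_path], {out_path: 5}, cpu_seconds=0)
print(total, os.path.getsize(out_path))
```

The order matches the note above: all IO first, then the sleep.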
December 3rd, 2014
- Update from Duncan on LIGO dashboard requirements
- run a flask module from apache
- let apache handle authentication
- read only dashboard view
- have a separate flask frontend.
- they are ok with a command line tool to remove workflow entries
- port collisions .. so they prefer apache to do the handling.
- failed JDBCRC unit test case
- glite quoting for the environment
- pegasus-dashboard delete workflows capability
- failing workflow reporting in the dashboard
- monitord to follow condor job log
- db admin tool updates
November 2014
November 12th, 2014
- DAGMan metrics reporting
- working and completed for 4.5.0cvs
- planned metrics
- exclude the metrics that never ran.
- have a drop down menu - planned , planned and run
- RPM and DEB tracking for downloads
- mats has a script that goes through the download logs to populate the server.
- So we are tracking those now.
- Failed data reuse regex test
- make it a planning only test case
- hierarchical workflows options forwarding
- have a value of null/none
- --inherit option with a comma separated list of long opts.
- higher level DAX API for sub workflows ?
- hack to figure out the command line arguments for the planner
- Pegasus Distribute Wrapper
- waiting to hear further from Steven
- a /bin/bash test case
- Metrics Server Updates by Donald
- has the geo location running
- DB Upgrade tool - Rafael ??
- version upgrades
- specific table version
- https://jira.isi.edu/browse/PM-776
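The options forwarding idea discussed above for hierarchical workflows (an --inherit option taking a comma separated list of long opts) can be sketched like this. The notes only record it as a proposal, so the function `forward_options`, the option names in the example, and the handling of boolean flags are all assumptions made for illustration:

```python
def forward_options(parent_opts, inherit):
    """Build the argument list a sub workflow would inherit.

    parent_opts: dict of long option name to value as given to the parent
                 planner invocation; a value of True marks a bare flag.
    inherit: comma separated list of long option names to forward.
    Options that the parent did not set are skipped.
    """
    names = [name.strip() for name in inherit.split(",") if name.strip()]
    args = []
    for name in names:
        if name not in parent_opts:
            continue
        value = parent_opts[name]
        if value is True:
            args.append("--" + name)
        else:
            args.extend(["--" + name, str(value)])
    return args


# Hypothetical parent invocation options.
parent = {"sites": "condorpool", "force": True, "dir": "/tmp/run"}
print(forward_options(parent, "sites, force"))
```

Only the listed options are forwarded, so "dir" stays with the parent here.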
November 5th, 2014
- DAGMan metrics reporting
- already in recent DAGMan versions. can be enabled.
- pegasus-run having the duplicate logic.
- Pegasus Distribute Wrapper
- Initial implementation done and there is an example for Steven to try out
- Metrics Server Updates by Donald
- DB Upgrade tool - Rafael ??
October 2014
October 29th, 2014
- Upcoming Proposals
- NEESGrid call
- Robert Flashgun with Nirav..ASU stuff. Do some earthquake stuff
- frank mckenna for nees type stuff
- SCEC is part of the proposal
- December 3rd due date
- Pegasus Development
- monitord postscript handling
- dynamic hierarchy stuff
- Condor C with LIGO
- Steven Clarke Distribute Stuff
- pegasus-hpc-cluster ( PHC )
- DAGMan metrics
- Kenichi Workflow
- SNS workflow
- Training material.
- Metrics UI updates
- Trends over time
- Geo overlay
- Darek from Poland - A postdoc 1206
- panorama project
- Adaptive Workflows
- adapting workflows... they are not converging.
- templating workflows
- Hopper Site Catalog
- Sample Site Catalogs
September 2014
September 17th, 2014
- Checkpointing feature
- tested and implemented into pegasus
- communicated with LIGO and John Veitch will test it next week.
- will be run from a binary install
- kickstart won't enforce a non-zero exit code for the application exit code. we will require application codes to exit with a non-zero status.
- Profile and Properties documentation integration
- database schema upgrade tool
- rafael starts working on it
- support for google storage
- hassan writes a paper for google storage
- compare S3 with google storage
- parallel uploads of chunks not supported with gsutil; relies on a very specific python module
- ~/.botoconfig
- uses OAuth token for authentication
- works paper revisions due oct 1st.
- dv/dt paper has been submitted as a CS dept tech report.
- DOE Oak Ridge meeting
- interface with ASPEN ( analytical modeling ) - domain specific language for defining code.
- combine aspen model with machine model and come up with estimates of runtimes.
- christopher riggers from RPI models parallel storage systems.
- Explore visualization stuff for pegasus-plots and dashboard?
August 2014
August 25th, 2014
- Ensemble Manager - User Authentication
- initially gideon is working on a PAM based approach
- refactored netlogger dead code
- Workflow Checkpointing support - ongoing
- Google Compute Engine
- related to google genomics
- put in support for GCE transfer tool to interact with Google Storage ( their S3 equivalent)
- put in credential handling in the planner.
- fits well with long term planning for pegasus.
- Replica Catalog Service
August 18th, 2014
- Data Reuse Partial Mode
- Service integration
- Profiles and Properties Documentation
- Scope Column in the properties documentation ( transformation, job and global )
- in profiles documentation corresponding property key
- pegasus-service integration
- need to integrate the documentation
- redhat 5 builds
- partially... pegasus-s3 fails because of the installed 2.4 version
- authentication mechanism
- pegasus-service-admin migrate option
- new tool pegasus-db-admin
- get a new 32 bit VM with CentOS 6.5
- also centos 7 VM
- add a setup task that cleans $HOME/.pegasus in bamboo infrastructure.
- Docker Kernel Problem
- if a docker build is running and you stop the build, then the whole thing crashes
- one solution is to upgrade the kernel version.
- cartman OS can be changed or move the docker builds to a VM.
August 11, 2014
August 4th, 2014
- how to handle a single job wrapping around PMC
- will add a property to turn the wrapping off.
- checkpointing for LIGO . synonym for checkpointing. user level state files.
- create a JIRA item that explains that.
- list the various cases that will be handled
- a lot of times in case of eviction kill -9 is sent.
- pegasus dashboard changes
- multi tenancy for users.
June 2014
June 30th, 2014
- pegasus-remove and pegasus-dagman. pegasus-dagman has a wait of 100 seconds before monitord is killed, when pegasus-remove is called.
- rafael will add a workflow test case for JDBCRC
- Still have to make a slider.
- Karan will work on XSEDE poster for Pegasus
- IPlant and metadata requirements.
- pegasus-dagman / monitord /condor-dagman
- hierarchical
- PMC
- GRAM
June 9th, 2014
- 4.4 release
- next week
- documentation items remaining
- JDBCRC test cases and handover to SCEC
- Dashboard improvements
- Post Release Activities
- integrate pegasus service back into the main codebase
May 2014
May 12th, 2014
- PM-747
- will be used for soykb
- test case
- Development releases
- 4.4
- plan for June 20th
- automatic data dependencies
- wrap up existing stuff
- documentation
- JDBCRC change
- documentation of FAQ's
- 4.5
- pegasus-service
- some form of multi tenancy
- python dependencies especially for external stuff is tricky
- rename of dashboard database tables
- pegasus-dashboard enhancements
- separate the planning job from the prescript
- checkpointing
- software cleanup
- transfers with hierarchies
- leverage condor asynch transfers in pegasus lite
- try for before christmas
- 5 minute youtube video
- 4.6
- metadata
- dax annotation
- enhanced notifications
- monitord
- PMC data locality
- globus online support ??
- get credentials . at least do more research.
- skipping symbolic links
May 5th, 2014
Condor week
- Lauren
- Karan needs to provide more documentation for her
- Kent Wenger
- dagman reporting
- dagman metrics file is created by newer versions of DAGMan in the submit directory.
- retry immediate parent
- CMS has a requirement for this also. The most important thing on Kent's plate
- dynamic workflows
- node expansion . may not be that worthwhile
- pegasus lite asynch transfers
- using condor chirp in the pegasus lite shell script once the main computations are done. that way we can pipeline
- does not work with partitionable slots
- does not work with condor file io
Bamboo Test Cases
- Job got hung for a long time??
User Survey
- Developer Meeting will be moved to 1PM for
April 2014
April 21st, 2014
- Pegasus Metrics
- ewa sent out the report for metrics to Dan. we need to get her final version.
- JIRA metrics
- work log feature of JIRA - not everybody finds it useful.
- all developers need to be diligent about putting tasks into JIRA
- sub tasks in JIRA ???
- how to track user feature requests
- performance improvement
- get the data structures up to speed.
- timing the cleanup is also important and canceling it if it goes too long
- SI2 Tasks
- Support Data as first class objects
- file movement open JIRA item
- data flow dependencies
- Support annotations for runtime and files sizes
- software review of streamlined
- remove pegasus-plots
- remove libexec
- remove unused example
- archive sub directory
- https://jira.isi.edu/browse/PM-672
- tutorial VM's
- refine and document metrics
- we have the confluence page that captures
- metadata registration in catalogs
- triggers for enhanced notifications for long runtimes
- we personally feel
- pegasus service
- have a release and multi tenancy
- sort out all the python stuff.
- reconsider moving pegasus-service back into pegasus git repo
- documentation for integrating pegasus
- enhance feature coverage and testing framework.
- unit test coverage
- adopt a model on how others can contribute to pegasus
- document the process how people can contribute.
- Customer Survey
- identify questions to ask.
April 14th, 2014
- JIRA Policy Document or page
- Pegasus Metrics
- Pegasus Survey
- Develop a list of questions .
- Forward to Duncan CBC Group
- New Default Transfer Refiner - BalancedCluster
March 2014
March 31st, 2014
- Gideon changed the tutorial VM.
- Put in backward support for old credential handling.
- Mats started on an outline for the optimizations chapter.
- next week's developer meeting is cancelled.
- general Pegasus dependencies
- python > 2.4 and < 3.0
- in general, easier to build from source rather than from source RPMs
- update Pegasus README
- change the build.xml to say default build without docs. remove the dist-nodoc target. instead we will have ant dist-release as the default target
- also we should start having documentation per minor release and not per major release as we do now.
March 24th, 2014
- Pegasus 4.3.2 release done last week
- storage constraints paper - gideon, rafael and karan worked on it.
- karan worked on the hpc-pegasus setup.. has workflows running through PMC
- karan and mats have a XSEDE tutorial proposal that will be submitted today
- dv/dt paper rejected for HPDC. Will try for a middleware conference due mid may
- 4.4 release
- checkpointing solution
- leaf cleanup for hierarchical workflows
- md5checksum option for guc transfers
- we won't follow up on kickstart generating the checksums, but tracking checksums in replica catalog.
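The checksum items above (an md5 checksum option for transfers, and tracking checksums in the replica catalog) rest on one basic check: compute a file's digest and compare it with the recorded value. A minimal sketch using Python's standard `hashlib`, where a plain dict stands in for the replica catalog. The names `md5sum`, `verify` and the logical file name "f.a" are hypothetical:

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, lfn, catalog):
    """Return True only if the file matches the checksum recorded for lfn."""
    expected = catalog.get(lfn)
    return expected is not None and md5sum(path) == expected


# Demonstration: register a checksum, then alter the file afterwards.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(b"hello")
catalog = {"f.a": md5sum(path)}  # checksum recorded at registration time
ok = verify(path, "f.a", catalog)
with open(path, "ab") as handle:
    handle.write(b"!")  # simulate corruption after registration
bad = verify(path, "f.a", catalog)
os.remove(path)
print(ok, bad)
```

Reading in chunks keeps memory use flat for large files.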
March 17th, 2014
Agenda
- XSEDE poster and tutorial proposal
- will get it done this week. mats and karan will work on it.
- idafen will work on a workshop paper for xsede on reproducibility
- 4 page limit
- deadline is april 5th.
- energy simulation for SC 2014
- measure energy when running workflows
- try to check if energy usage changes whether data is transferred to a site, or everything is executed at one site.
- sane defaults for 4.4 for transfer jobs, pre scripts etc
- transfer jobs
- how many stage in jobs - 2 jobs and each job with 2 threads.
- how many threads for each transfer job - pegasus-transfer has a default of 2
- pegasuslite job
- change sls name ? property name change
- control the number of threads
- add a chapter called tuning workflows
- mats will add a section on tuning transfers.
- setting clustering parameters.
- changing back the default refiner to bundle???
- cleanup job
- change hold release time to one hour.
- new transfer refiner
- maybe can use k means clustering ?
- leaf cleanup for hierarchical workflows
- --cleanup leaf,inplace,none
- tell the planner to throw a warning when
- sudharshan's paper
- emphasize that the goal is not improving the makespan.
- 4.3.2 release
- release notes checked in on friday
- mats will tag after the release.
- the service should be installed in the tutorial VM image.
- Condor Categories
- similar to dagman categories.
- will condor accounting groups work??
March 10th, 2014
Agenda
- Should we stage sub-workflow output files to parent workflow scratch? (related to leaf cleanup)
- Should we enable DAX jobs to have input and output uses, and distinguish between planner inputs and sub-workflow inputs?
- SUB DAG keyword to make pegasus generated subdag submit files match with dagman version always
- From Kent Wenger
Hey, I just wanted to touch base and find out whether you guys have made any progress towards making Pegasus-generated sub-DAG submit files match the "normal" DAGMan format.
(See https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3891,4.)
- data reuse edge case
- have fix for it and have added unit test cases
- Atlassian licenses expiring?
- plan for a pegasus workshop / meeting for 2nd week of January 2015
March 3rd, 2014
- monitord fix for LIGO
- pegasus plan prescripts were not logged in the database.
- checkpointing files
- karan will create a JIRA item and send it to ligo folks for comment.
- transfer fix
- held jobs ?
- separate pegasus plan planning jobs
- throttle jobs via category.
- real full ahead planning
- plan full ahead -
- will help in debugging workflows
- hierarchical workflows planner arguments in the prescript wrapper shell scripts.
- final cleanup job for the workflow
- fix for iplant workflows cleanup. previously generated files whose locations are determined in the replica catalog should not be cleaned up
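The "throttle jobs via category" item above maps onto DAGMan's CATEGORY and MAXJOBS keywords, which assign nodes to a named category and cap how many nodes of that category run at once. A small sketch that emits such lines for a DAG file. The helper `category_lines`, the job names and the category name are made up for illustration, and this is not the planner's own code:

```python
def category_lines(job_categories, max_jobs):
    """Return DAG file lines assigning jobs to categories and capping them.

    job_categories: dict of DAG node name to category name.
    max_jobs: dict of category name to maximum concurrently submitted nodes.
    """
    lines = []
    for job, category in sorted(job_categories.items()):
        lines.append("CATEGORY {} {}".format(job, category))
    for category, limit in sorted(max_jobs.items()):
        lines.append("MAXJOBS {} {}".format(category, limit))
    return lines


# Hypothetical planning jobs, limited to one running at a time.
jobs = {"plan_1": "planning", "plan_2": "planning"}
for line in category_lines(jobs, {"planning": 1}):
    print(line)
```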
Workflow reproducibility ( idafen )
- here for 3 months - march/april and may
- document the infrastructure that was used to generate the workflows
- created ontologies to describe infrastructure.
- precip API
- expressed an interest in it .
- he focuses not on how to deploy, but instead to describe the infrastructure
- then do experiments that take in his description and deploy it using precip
- target two conferences
- one systems
- other semantic
Pegasus Submit Node on HPCC
- waiting on glite recommendations from condor-admin
Feb 2014
February 24th, 2014
SCEC Transfer Issues
- hpc login crashed for scec workflows because of too many stageout jobs
- there were too many connections open at xinetd level
- also the stageout jobs were starving all the other local universe jobs in the workflows
- so the workflows were getting bunched at the stageout level
- we solved it by moving only the transfers to the vanilla universe on shock
- ran into the credential handling backward compatibility that we put in 4.4 after the new credential handling.
Transfer Configuration for 4.4
- by default the number of threads will be 2
- we will expose a way via properties to increase the number if users want to have better bandwidth
- in case of any failures, pegasus-transfer will fall back to a single thread
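The transfer configuration above (two threads by default, a configurable thread count, and a single-threaded fallback on failure) can be sketched as below. This is one possible reading of the notes, in which failed transfers are retried one at a time. `run_transfers` and `flaky` are hypothetical names and this is not pegasus-transfer's actual logic:

```python
from concurrent.futures import ThreadPoolExecutor


def run_transfers(transfers, do_transfer, threads=2):
    """Run transfers in a small thread pool, then retry failures serially.

    do_transfer is any callable that raises an exception on failure.
    Returns the list of transfers that still failed after the serial retry.
    """
    def attempt(item):
        try:
            do_transfer(item)
            return None
        except Exception:
            return item

    with ThreadPoolExecutor(max_workers=threads) as pool:
        failed = [item for item in pool.map(attempt, transfers) if item is not None]

    still_failed = []
    for item in failed:  # single-threaded fallback
        try:
            do_transfer(item)
        except Exception:
            still_failed.append(item)
    return still_failed


# Demonstration with a fake transfer: "b" fails once, "c" always fails.
calls = {}


def flaky(item):
    calls[item] = calls.get(item, 0) + 1
    if item == "b" and calls[item] == 1:
        raise IOError("simulated transient failure")
    if item == "c":
        raise IOError("simulated permanent failure")


remaining = run_transfers(["a", "b", "c"], flaky)
print(remaining)
```

The `threads` parameter plays the role of the property mentioned above for users who want more bandwidth.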
February 10th, 2014
Postscript handling
- We have implemented a solution in PM-737 to get around condor quoting rules.
- MPI codes are not kickstart wrapped
- Pegasus should indicate whether a clustered job or a kickstart job.
- DAGMan exitcode
checkpoint jobs
- 10% of runtimes
- pegasus-transfer will have to be changed
- link is set to type checkpoint
- transaction support for checkpoint
- timeout is job runtime - process
- pegasus-kickstart timeout method
- also has dv/dt implications for monitoring.
pegasus-exitcode assumes success and checks for failure
- refactored the script for unit tests as a library
- pegasus-statistics
- pegasus-analyzer ( maybe some commonality)
- pegasus python library has to be included in worker package
pegasus-transfer
- threads are handled similar to pegasus-s3
- default threading
- expose options end to end
- initial threads to irods
- what options to set
pegasus-config will now work with a source checkout
December 2013
December 16th, 2013
- TODO: Talk about ADAMANT design
December 3rd, 2013
- 4.3.1 release
- just need to send the announcement.
- gideon has updated the build infrastructure in bamboo to build the release
- to do
- do a drupal snippet, to update the downloads page automatically.
- dynamically render the page using the shared directory in drupal.
- pegasus-analyzer will have a recurse option.
- identity management for pegasus service
- portal use case
- user authentications
- website
- put a token in a cookie.
- draw bigger pictures on the identity stuff.
- Unicore Testing
November 2013
November 11th, 2013
- 4.4 Planning
- according to proposal, we need pegasus as a service, metadata registration, enhanced notifications on long runtimes etc.
- ligo realtime analysis?
- scott and kent mentioned that real time analysis is a priority.
- gstreamer interface.
- investigate streaming workflows
- unicore testing support
- Pegasus Tutorial on (Mats VM on oregon region)
- Pegasus as a service
- Ensemble Manager
- an ensemble has no end state currently.
- update documentation on the website
- gideon plans to remove the upload catalog options. instead the clients will read in the properties and automatically upload.
- NSF Cloud Proposal
- Experiment management.... maybe does not align itself with NSF Cloud.
- Adamant Demo
- workflows are setup and done.
November 4th, 2013
- Tutorial format finalized for November 14th meeting. similar to software carpentry layout
- 4.4 release things
- pegasus metadata support
- dax schema changes
- irods - support for metadata attributes
- s3 objects - they can have tags associated with it.
- transient replica catalog.
- unicore support
- for JIRA items move to the next one.
- moteur support.
- dv/dt wrapper support ( probably in a separate dv/dt branch)
- move to VMWare for hosting websites
- pegasus.isi.edu will be as a VM in a VMWARE ESX pool.
- initially 4 VM's for Bamboo BNT
- retire the machine for PAGE QC
- long term we are moving to ESX
October 2013
October 1st, 2013
Pegasus 4.3 release
- dashboard is separate
- prepare rpm for ligo
- ssh submission for 4.3
- tutorial vm almost done
- the clock issue remains. probably an issue with how virtualbox does the time.
- need to hear back from scott
- sepiddeh is working on a Makeflow-compatible code generator.
September 2013
September 23rd, 2013
- Create a pegasus youtube channel.
- See if that can be linked from the ISI webcast page.
ISI Pegasus Workshop
- Submit host setup at HPCC
- specs are similar to workflow.isi.edu
- gideon will mail to HPCC admins today about this
Tutorial VM
- networking issue
- persistent rules file /etc/udev/rules.d/70-persistent-net.rules
- instead of deleting it lets just disable it in our VM's
- X with virtual box guest additions for enabling copy paste
- turn on ntp
- larger virtual disk - will increase the size to 8GB
- X should just add a couple of hundred MB
Pegasus Release
- JDBC RC
- Tutorial VM
- pegasus-statistics
- pick up a release date
- tentatively next friday i.e the 4th.
September 9th, 2013
Software Carpentry
- Karan will prepare introductory slides for Pegasus.
- Talk to John about providing a Pegasus submit node.
- Rajiv will be working on the Pegasus RNASeq VM.
- John Mehringer will go first in the second day.
- Parking is in Levy structure in southwest corner.
- Inquire about shuttle from Health Science Campus.
- Still to do - RNASeq module.
- Put Information about parking and HSC Shuttle.
- Parking Center.
Pegasus Release
- waiting for Scott to do release testing.
Pegasus Lite Paper
- Karan will send the camera ready version today.
Precip
- using netlogger for logging.
- replace python logging framework
- incorporating events from the remote site
- AMQP ?
- Getting events into a common file.
- Run montage using precip
Condo of Condos Workshop
- Laurent and Gideon have 10 minutes each.
- Bosco's new name is MyHTC.
August 26th, 2013
Pegasus 4.3 release
- dagman metrics not implemented yet by kent. still in design phase.
- testing stuff
- unit tests running in bamboo.
- add missing data dependencies
- still checks and produces errors
Precip Logging
- getting the metrics back
Pegasus Hold
- how to get dagman to stop submitting jobs
- idle jobs need to go on hold.
- we can send sigusr1 to dagman.
- need to handle hierarchical workflows.
- JDBC RC stuff
JDBC RC
- we will just update the existing version one.
- have a python based RC for Replica Catalog.
Ensemble Manager Paper
- Gideon will be working on it.
DAGMan replacement??
- Software engg stuff.
August 19th, 2013
- Pegasus 4.3 release
- output mapper stuff implemented.
- pegasus-statistics changes checked in by Rajiv
- app metrics associated with the metrics report
- pegasus.metrics.app
- can be used for RNASeq tracking and other applications
- the metrics UI will be able to filter on the name.
- Globus Online Support - move to 4.4 release
- can only do certain parts of transfers.
- for transfers from local submit host , we need to use globus connect
- credentials issue
- for the submit host, there needs to be a local endpoint.
- LIGO testing ?
- prepare a pre release RPM for LIGO
August 12th, 2013
- Pegasus Lite Paper
- Wait for the Big Data and Science Workshop
- 4.3 Release
- Output Mapper Submission
- error if both an output site and an output mapper replica catalog are specified
- Globus Online Support in pegasus-transfer
- OAuth tokens issue.. when to get the token
- support for multi end point with different credentials
- probably need to do a pegasus-globus-online
- the client needs to be blocking .
- SSH Submission
- Will use RNASeq for that.
- Boto downgrade worked.
- did not build on RHEL 5
- Test Suite
- Suite of integration tests
- checksum the files
- Ensemble Manager
- Almost done with the first version
- Will work on the Galactic Plane version
- General JUnit Tests for Pegasus
- Galactic Plane Paper
July 2013
July 29th, 2013
Software Carpentry
- Workflows Tutorial
- 1 hour overview of HPCC if HPCC folks are interested.
- Pegasus Tutorial ( 2 hours )
- An info part on where to run jobs
- OSG
- HPCC
- XSEDE
- Pegasus Development
- Rajiv will complete the pegasus-statistics part
- error messages ( give more hints on what went wrong on site selection )
- Monitoring API
- wants a jar with a simple API to monitor workflows
- wrap it up in a jar
- provide interface
- portal integration
- rest interface for the pegasus service
July 8th, 2013
- gideon has changes checked in dax2dot based on the closures and reductions
- karan has checked in the LCA approach. But does not scale for our performance test case.
- Also changed the way edges are added for the create dir nodes. that will go in for 4.3.
- Precip Paper
- deadline extended to the 19th of July.
- Posters to be made for XSEDE
- Sudharshan will make a poster on his cleanup work on Monday.
- Sudharshan will be going on Monday to campus to present the poster around 1-3PM
- Will give a talk to CCG group Tuesday July 16th at 11:00AM
- Currently, sudharshan's algo takes 15 seconds on a 1000 node montage workflow.
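The graph reductions mentioned above (removing extra dependencies from a workflow graph) amount to a transitive reduction of a DAG: an edge parent -> child is redundant when the child is already reachable through another child of the same parent. A simple sketch of that idea follows. It is not the dax2dot or LCA code referred to in these notes, the function name is hypothetical, and this straightforward version would be too slow for very large workflows:

```python
def transitive_reduction(edges):
    """Remove redundant edges from a DAG.

    edges: dict mapping each node to the set of its direct children.
    Returns a new dict with the same nodes and only the needed edges.
    Assumes the graph has no cycles.
    """
    def reachable(start):
        # All descendants of start, found with an iterative depth-first walk.
        seen, stack = set(), list(edges.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(edges.get(node, ()))
        return seen

    reduced = {}
    for parent, children in edges.items():
        keep = set(children)
        for child in children:
            # Anything reachable from one child is already implied by it.
            keep -= reachable(child)
        reduced[parent] = keep
    return reduced


# a -> c is redundant because a -> b -> c already orders a before c.
graph = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
print(transitive_reduction(graph))
```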
July 1st, 2013
- monitord bug fix checked in
- algorithm to remove extra graph dependencies
- backups
- we need to update the pegasus machine
- jira, svn , website ( website and svn need to move at the same time ) , crowd updates
- confluence was moved to another . also coordinate with action to do the move.
- mats already updated crowd today
- there is secret number of conf files... apache on top of tomcat
- update to debian machines
- obelix, cartman and stewie, and the ccg worker nodes.
- mats has updated the bamboo tests to use new filesystem paths
- ADAS abstract
- for galactic plane on Amazon. if accepted due in september.
- 4.3 release
- fix error messages. see what can be done to improve them .
- output replica catalog
- pegasus-transfer tests.
- updates to cleanup algorithm based on sudharshan's work ??
- release notes will be updated to indicate the dashboard's move to pegasus-service.
- Precip Paper
- mats will do the zotero work.
- submitting to cloud com in bristol uk.
- seppideh has some data on openstack. could not get all instances started up.
- seppideh will release the token to gideon to do an edit pass
- Cleanup Algorithm
June 2013
June 24th, 2013
- Pegasus Development
- Monitord issue https://jira.isi.edu/browse/PM-712
- karan has a fix that works for him.
- needs to test the replay mode.
- Update on SCEC visit
- pegasus-archive tool
- archive everything other than the stampede db and braindump file
- scott will try to cluster rupture variations for the same rupture in one task based on runtime estimates
- the SGT will become 16 times bigger and post processing 8 times bigger on move to 1HZ. clustering rupture variations in scec code will help in reducing the number of jobs in the DAX
- Scott tried to generate a single DAX for the post processing workflow. Was unable to do so. Has generated two DAXes
- Galactic Plane
- Cut out service. Slow times on retrieving the image from S3. Small bandwidth between S3 and EC2
- Will need to have monitoring etc... Not fast enough for a webpage to be responsive.. will need some queuing up
- Backups
- Mats working on Kepler data.
- mats tried backup with S3. does not like symlinks. will change the way backups are managed. the transfer times can be long.
- Update from Sudharshan
- Good progress. showed some simulations
- Adamant Update
- we are on the hook for providing the interfaces in pegasus-transfer that will talk to the exo planner service
- also provide shadow queue service, that gives estimates on jobs that will be in the queue.
- supercomputing demo?
- Precip Paper
- Maciek is doing some experiments
June 17th, 2013
- Pegasus Development
- the dax job handling is completed.
- update on ligo front.
- condor priorities for local universe jobs
- not handled right now.
- gideon has a ticket open for them.
- gideon observation of s3
- scalable but not good latency or
- Pegasus Lite Paper
- mats is almost done with the runs. to grep through the runs to get the intermediate files in and out of S3
- not done the S3 caching for rosetta as yet. still not sure. too much work for the time remaining.
- mats did do the runs with task clustering. he got better numbers and saw a difference in case of rosetta.
- interleaving of compute jobs and transfers. may help montage.. but won't help rosetta
- whether we should include the new pegasus 4.2 features.
- Cleanup Algorithm
- Glacier Backups for NFS?
- instead of using two qnaps, just have one and use other for duplicates
- we need a place for backups
- currently the QNAPs are 18TB each with RAID 6. RAID 10 is a better configuration on the QNAP according to the forums. This means, though, that we will have half the space.
- have one qnap for scratch
- have other qnap for storage - the storage will be backed up to glacier. right now QNAP only supports S3. Support for glacier is coming.
- ewa and richard think glacier backups are a good option.
- there might be a purge policy required on glacier.
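The RAID 6 versus RAID 10 capacity trade-off from the QNAP discussion above can be sketched as below. This is a minimal illustration only: the disk count and per-disk size are hypothetical (the notes give only the 18TB figure), and the formulas are the standard usable-capacity rules (RAID 6 loses two disks to parity, RAID 10 keeps half of the raw space).

```python
def raid6_usable_tb(disks, disk_tb):
    """Usable capacity of RAID 6: two disks' worth of space goes to parity."""
    if disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (disks - 2) * disk_tb


def raid10_usable_tb(disks, disk_tb):
    """Usable capacity of RAID 10: mirrored pairs, so half the raw space."""
    if disks < 4 or disks % 2 != 0:
        raise ValueError("RAID 10 needs an even number of disks, at least 4")
    return (disks // 2) * disk_tb


# Hypothetical layout, NOT taken from the notes: 8 bays with 3TB disks.
DISKS = 8
DISK_TB = 3

print("raw     :", DISKS * DISK_TB, "TB")
print("RAID 6  :", raid6_usable_tb(DISKS, DISK_TB), "TB")
print("RAID 10 :", raid10_usable_tb(DISKS, DISK_TB), "TB")
```

With one QNAP for scratch and one for storage, the RAID 10 figure is what would be available on each box.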
- Precip Paper
- change tracking on
- use dropbox
- broadcast when you are making a new version.
June 10th, 2013
- Pegasus Development
- change to dax handling
- fix of stdout
- regex based replica catalog.
- changes to pegasus-statistics for aggregate statistics
Pegasus Lite Paper
- compute data between s3 and local disk.
- compute costs for the runs ?
- have data outside
- local cache for the S3 client ?? could affect the rosetta cache.
- change the rosetta workflow.
- if there are a lot of small files.
- reading parts of files.
- Ewa will send her version of the changes.
Sudharshan Algorithm for Cleanup
- Greedy approach planned
- will try implementing a version and show the different executable workflows created
June 3rd, 2013
Pegasus Lite Paper
- Breakdown of the runtimes , experiments
- In case of sharedfs, the kickstart runtimes in the breakdown file will be longer
- for the S3 case we can calculate the S3 transfer time by calculating the difference between the cumulative runtimes
- doing two experiments rosetta(cpu intensive) and montage( io intensive)
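The S3 transfer-time estimate mentioned above can be sketched as follows. This assumes the two cumulative runtimes are the cumulative job time (which includes staging to and from S3) and the cumulative kickstart time (payload only); the record layout and field names are hypothetical, not the actual breakdown file format.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    name: str
    job_seconds: float        # whole job, including S3 staging (assumed)
    kickstart_seconds: float  # payload runtime as measured by kickstart


def estimate_transfer_seconds(records):
    """Difference between cumulative job time and cumulative kickstart time."""
    total_job = sum(r.job_seconds for r in records)
    total_kickstart = sum(r.kickstart_seconds for r in records)
    return total_job - total_kickstart


# Made-up numbers, for illustration only.
records = [
    JobRecord("job_a", job_seconds=100.0, kickstart_seconds=80.0),
    JobRecord("job_b", job_seconds=50.0, kickstart_seconds=45.0),
]
print(estimate_transfer_seconds(records))
```

The same subtraction would not isolate transfer time in the sharedfs case, since there the I/O cost shows up inside the kickstart runtimes themselves.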
Pegasus Development
- Java DAX API issues
- might be some bugs in there.
Precip Paper
- Ewa wants a link to pegasus website in the paper.
- have more logical thinking in the paper, like reliability and repeatability
- Sepideh adding some new figures to the paper.
- Maciek will provide an experiment use-case for the paper.
Stampede and Corral Annual Reports
- Karan and Mats will be working on these
Sudharshan's Project
- Going to look into providing a cleanup algorithm that meets a given storage constraint
- Will look at the static problem of inserting dependencies into the workflow to achieve a solution
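To make the static problem above concrete, here is a minimal sketch of how adding a dependency can lower the peak storage footprint. It is only an illustration under stated assumptions, not the planned algorithm: jobs run one at a time in the given order, a file is deleted right after its last consumer finishes, and all job names, file names and sizes are made up.

```python
def peak_storage(order, outputs, inputs, sizes):
    """Peak disk usage for a serial job order with cleanup after last use."""
    # Index of the last job that reads each file.
    last_use = {}
    for idx, job in enumerate(order):
        for f in inputs.get(job, []):
            last_use[f] = idx

    on_disk = set()
    current = peak = 0
    for idx, job in enumerate(order):
        # Outputs are written while the job's inputs are still on disk.
        for f in outputs.get(job, []):
            if f not in on_disk:
                on_disk.add(f)
                current += sizes[f]
        peak = max(peak, current)
        # Cleanup: remove inputs that no later job needs.
        for f in inputs.get(job, []):
            if last_use[f] == idx and f in on_disk:
                on_disk.remove(f)
                current -= sizes[f]
    return peak


# Two independent chains: a1 -> b1 and a2 -> b2 (hypothetical workflow).
outputs = {"a1": ["f1"], "b1": ["g1"], "a2": ["f2"], "b2": ["g2"]}
inputs = {"b1": ["f1"], "b2": ["f2"]}
sizes = {"f1": 10, "f2": 10, "g1": 1, "g2": 1}

# Without extra dependencies both producers may run before any consumer.
unconstrained = peak_storage(["a1", "a2", "b1", "b2"], outputs, inputs, sizes)
# Adding a dependency b1 -> a2 forces the first chain to be cleaned up first.
constrained = peak_storage(["a1", "b1", "a2", "b2"], outputs, inputs, sizes)
print(unconstrained, constrained)
```

The added dependency costs parallelism, which is the trade-off a constraint-aware cleanup algorithm would have to weigh.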
PMC Paper
- on amazon
- with clustering and pmc
Shirts
- Should get the logo sample this week, once we approve then we can order shirts
dV/dT
- Rafael is working on a draft of the data collection and modeling paper
- We are planning on publishing data, will start drafting a format this week
May 2013
May 20th, 2013
Confluence is going slow. Mats is going to look.
Analytics are set up on Confluence now.
Pegasus Transfer
- Mats committed a new version that has support for 2-stage transfers
Pegasus S3 Client
- Gideon changed .s3cfg to .pegasus/s3cfg
Pegasus Lite Paper
- Mats is working on the experiments
- We have two weeks to the deadline
PMC Paper
- Experiments on Amazon comparing Pegasus, Pegasus w/ Clustering, PMC alone
Pegasus Service
- Finished setting up users and test suite
- Next is a quick-and-dirty ensemble manager implementation
- Gideon is going to commit a change to Pegasus that removes the dashboard components. They will live in the pegasus-service repository from now on.
Summer Student
- Need to think up a project. Needs to be research-oriented and relatively small.
- Cleanup? Precip?
Contacting users
- Find out if they need anything.
Examples
- Simple examples in Perl, Python and Java
- Gideon will add them to the examples in the pegasus Git repo
April 2013
April 22nd, 2013
- monitord prescript handling fixed
- pegasus-analyzer should detect prescript failures, and the prescript exitstatus should be logged in the database
- pegasus-statistics was updated for the job instance report
- pegasus planner
- need to confirm all checkin's are complete
- do we want to get LIGO to do a test or just release?
Pegasus statistics across workflows - Rajiv
Pegasus Lite Paper
- Mats will do the runs on Amazon
- Karan will work on paper when he comes back
pegasus-hold and pegasus-release
- any difference between doing a hold on the dagman directly or pegasus-dagman
- we need to do more investigations on monitord
BOSCO
- Mats is trying to run on HPCC
- a single job is running fine.
April 8th, 2013
- Work on it towards this week
- monitord prescript issue to fix
- pegasus statistics extensions
- across root workflows
- https://jira.isi.edu/browse/PM-507
- condor temp file
Pegasus Posters
- One at XSEDE
- joint one with BOSCO team
Pegasus Lite Paper
- Submission to IEEE Big Data
New Programmer Hire
- expanded posting on confluence
- will send out to HPC Wire , RENCI and USC SC Connect
April 1st, 2013
Pegasus Lite Paper
- Waiting on Ewa
- Not much we can do about the IEEE conference. The page limit is 8, which is the current size of the paper.
XSEDE Poster
- Pegasus Poster. Karan will send update
- Also a joint Pegasus BOSCO poster
- Also as part of that we will get the MPI workflows up and running through Pegasus and BOSCO
Pegasus Development
- Bypass of staging input files for Pegasus Lite Case
- Inplace cleanup bug fixes done.
- pegasus-s3
- gideon checked in changes of copy from one file to another
- mats adds a pegasus transfer
- workflow cleanup nodes
- separate cleanup node in the workflow
- for hierarchical workflows we only delete the outermost workflow
- what happens if no output-site specified
- the ligo case!
- backward compatibility for LIGO
- Pegasus Dashboard
- general javascript updates
- Generic Pegasus Slides
- 2-3 slides.
March 2013
March 25th, 2013
- Pegasus Lite Paper Submission
- We will try for https://sites.google.com/site/sweetworkshop2013/
- Karan will move the paper to the ACM format
- pegasus-statistics
- Waiting on Scott to get back with the list of metrics
- Rajiv will be working on it
- pegasus-s3 changes
- we want to be able to copy output files from one s3 bucket to another
- requires changes to pegasus-transfer and pegasus-s3
- final node for cleaning up remote directories
- also related is getting the cleanup algorithm working when we bypass first level staging.
March 18th, 2013
- Mats has an RPM almost sorted out for LIGO that does not require us to have PYTHONPATH set. Instead the libraries go into standard locations
- Karan is testing this RPM on spice-dev1 and has set up a page with instructions on how to submit a test workflow to VIRGO
- Statistics across root workflows
- earlier gaurang had generated statistics for scec runs by hand, executing queries on the mysql command line
- he does not have the queries documented anywhere
- this is something we have talked about in context of 4.3 with Rajiv
- will follow up with scott on wednesday's call
- 4.2.1 release
- backward compatibility for LIGO. still to be done
- probably next week after the pegasus annual report
- RPM to handle native python installation
- Pegasus Annual Report
- Karan will work on it this week
- Try to follow the same template as earlier.
March 4th, 2013
- Sent link on DAGMan Metrics Reporting to Ewa
- Metrics for Rob Quick's workflow
- Gideon pushed out kickstart changes
- Rajiv has pushed changes to the queries for the dashboard.
- Setup meeting with Jaime and Derrick at OSG AHM to discuss
- remote_initialdir
- extra attributes for glite/bosco submissions
- mpi workflows.
- OSG Poster to be made this week. And 4.2 Release slides.
February 2013
February 11th, 2013
Direct submission of workflows to PBS
- Glite submission in Condor. We set up a VM that hosts a PBS scheduler and are using that to test
- Karan prepared an example for 4.2 that can be used to submit directly to local PBS using the glite interfaces in Condor
- the remote_initialdir / +remote_iwd does not work
- problem for MPI codes
- for the time being, the example prepared relies on kickstart to change the directory before launching a job
- there is also a ssh style that allows us to use BOSCO to do remote submissions using SSH to a PBS cluster
- that one also has the issue of remote initialdir
- jobstate.log refactoring.
- data transfer ( support for globus online)
- lightweight tracing
- task stats. netlink socket in pegasus-kickstart. how much memory and io the task used.
- add task stats to kickstart
- ptrace
- trace. linux equivalent is SystemTap
- dashboard improvements
- single api for clients
- last week drop down
- performance run on large workflows.
February 4th, 2013
- CCGrid / Pegasus Lite Paper
- Performance section
- remove the experiments section?
- OR
- extra experiments section
- have the squid proxy cache
- find a workshop to submit the paper
- Cloud Paper
- Ewa is working on it.
- GitHub Migration
- - a couple of branches like monitord, pmc and dang
- - svn will be made read only.
- - update the website with all the development information
- - bamboo scripts
- - documentation ( long term )
- - nightly builds
- SSH Submission
- - gsissh submission for blue waters
- - ssh to blue waters is required for OTP
- - passing of parameters to PBS
- - SSH key
- - ssh agent.
- - queue keyword
- - Batch session
- - submit jobs to HPCC
- - Gideon will do that.
- monitord memory explosion
- - long term for monitord
- - pegasus-dagman replacement
- minor release 4.2.1
- - potential monitord bug issue
- - long term dagman replacement
- Response time for metrics page
- - occasionally it is slow