3/29/2012 Teleconference

Attendees: David R., Taghrid, Karan, Gaurang, Ahmed, Dan G.

Topics

Dashboard

  • Change invocation to use submit directory

  • ISI: Add Python module to find database from braindump module

    e.g. from stampede.util import find_db

    dburl = find_db("/path/to/submit/dir")

  • Feature requests:
    • Troubleshooting: invoke pegasus-analyzer? Add "see failed jobs" button.
    • Submit directory
    • Troubleshooting
    • Link it to distribution (need release)
  • Live dashboard?
    • Install Condor and have pegasus running on a machine at LBL.
    • CentOS; yum
    • All runs in workflow gallery are 3.1 (no host information)
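
A minimal sketch of what the find_db helper proposed above might look like. Several details are assumptions, not confirmed by these notes: that the submit directory holds a braindump.txt file with one "key value" pair per line, that it has a "dag" entry, and that the sqlite database sits in the submit directory as <dag basename>.stampede.db.

```python
import os


def parse_braindump(path):
    """Parse a file with one 'key value' pair per line into a dict."""
    props = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, _, value = line.partition(" ")
            props[key] = value.strip()
    return props


def find_db(submit_dir, braindump_name="braindump.txt"):
    """Return a SQLAlchemy-style sqlite URL for a submit directory.

    Assumption: the 'dag' entry names the DAG file, and the database is
    stored in the submit directory as '<dag basename>.stampede.db'.
    """
    props = parse_braindump(os.path.join(submit_dir, braindump_name))
    base = os.path.splitext(os.path.basename(props["dag"]))[0]
    db_path = os.path.abspath(os.path.join(submit_dir, base + ".stampede.db"))
    return "sqlite:///" + db_path
```

With a helper like this, the dashboard could be invoked with only the submit directory and resolve the database URL itself.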

Triana integration

Triana (http://www.trianacode.org/)

  • Yang schema was important for getting things to work.
  • Triana has a "streaming" model. What gets counted as a job retry is streaming of the work.
  • Going to run some workflows through pegasus and their cluster.
  • Going to test-drive the dashboard? Didn't talk about it much.
  • Paper plans: we will see.

FutureGrid status

  • Re India: we need a window for when we are going to run our experiments.
  • Add monitoring stuff (blipp)
  • Space issues on submit VM – mount scratch drive; not done. For these experiments we don't need a shared file system.

2/21/2012 Project Meeting

  • Attendees: Ewa, Karan, Dan, Taghrid, Gaurang, Fabio, and Martin

    Stampede All Hands Meeting

    Pegasus 4.0 Update

      • pegasus-analyzer works with the stampede databases
      • pegasus-statistics changes
        • csv format output
      • stampede database
        • new job instance
      • new schema upgrade tool to migrate users from 3.1 to 4.0 schemas
        • minor changes still required to be done.
      • nothing new on pegasus-plots
    • Future plans
      • have a sort of provenance trail: pegasus-analyzer should point out which DAX job failed.
      • replay workflows… more at pegasus level than at monitord level
      • dashboard to be integrated in 4.1. Can be released separately also.
    • Stampede Dashboard
      • Chris Brooks's student was also working on the dashboard.
      • She will go through and build a whole lot of performance tests for the API.
      • David at LBNL is the main person working on the dashboard.
      • David is using jqPlot.
      • Karan will run a real-life LIGO workflow, tar it up, and send the workflow files to Dan.
      • David can use it for his development runs.
      • Pointer to the stampede API
      • Pointer to pegasus plots for David
      • Long term maybe we can support uploads to a restful service
      • automating the dashboard via monitord or pegasus-run.
      • how to integrate multiple sqlite databases in a single dashboard instance.
        • maybe when a workflow starts, a simple REST request is sent to the dashboard telling it which workflow to monitor
        • like a global sqlite db that resides in $HOME/.pegasus and registers the sqlalchemy string.
        • dashboard should start with a pointer to the sqlalchemy connection string for the main sqlite db, which has pointers to all the workflow dbs.
        • Dan and Martin feel that it should be a web service call
        • Move dashboard to Github to coincide with Ahmed's work on periscope
        • Users may have to start up the dashboard separately if we want it to track multiple sqlite databases.
    • Periscope Demonstration
      • Some visualization of experiments on FG
      • Will require tornado instead of web.py, which is being used by the dashboard.
      • The backend to the dashboard can be made compatible with the periscope stuff. It will talk to the stampede database for workflow information and to periscope for network monitoring. Does Periscope work better with tornado?
      • Dashboard should be released before the periscope work is stitched into it
      • monitord may send rest messages to periscope brokering service.
      • Populate the periscope information brokering service/location service from FG runs.
      • What does TG use for monitoring?
      • In general, Periscope may be able to forewarn about failures.
    • Online Analysis Work
      • all of it is written in R.
      • works against the database
      • does not use the stampede api
      • looking at the rate of job failures, can we predict whether a workflow will fail or not?
      • built probability models for different job types in a workflow.
      • online analysis will work on the same place where there is dashboard.
      • How does pegasus integrate with online analysis work? user notifications?
      • In the near term, we need to upgrade the analysis to generate some user notifications
      • Placeholder for using this, can be either in site selection or in retry hooks.
    • General
      • Ditched AMQP
    • To Do
      • Integrate dashboard with tornado work.
      • Move analysis tools to work online.
      • Analysis tools should be using the API.
      • figure out how much space a VM can get
    • Website / Stampede Organization
      • Organize the Stampede wiki page by the artifacts discussed below.
      • Update stampede papers on the WIKI.
      • Stampede Page should have the stampede logo.
      • WF prof should go somewhere else?
    • FutureGrid
      • experiments targeted for SC 2012
      • space on VM
        • m1 small - does not have much disk space
        • m1 large - also varies between India and Sierra
      • the place where the VMs are stored in FG has about 80 GB
      • in eucalyptus there might be a way to externally mount a filesystem using bit bucket?
      • submit host in VM
      • first set of experiments
    • Experiments
      • Failure once it starts (wrong executable) / change DAX (ISI)
      • task failure after start / kill process (Martin)
      • task failure (Martin)
      • task failure / hang: suspend process
      • Data transfer failures (ISI)
        • jobs don't bring the data back / write out the data
      • Host shutdown (Martin)
      • job monitor failure (Martin)
      • network issues
      • disk filling up
      • FutureGrid does not have a storage service.
        • Martin has some physical nodes that we can get
      • On India
        • eucalyptus on xen
        • openstack on xen
        • FG has maintenance every month, on the first Tuesday.
      • Larger story for the paper
        • stop early: early detection of failures.
        • induce failures, models of failures.

    • Papers
      • SC 2012 paper for the FG experiments
      • Journal Stampede Paper ( completing the picture and putting various parts together)
      • Special issue of workflows due May 1st.
      • How does the stampede work relate to provenance? Maybe a future paper on that. Martin is working with Beth on a similar thing. Performance and provenance overlap.
    • Future Plans
      • Workflow Gallery - make it available. Useful for others to use? Everybody should look at the wiki.
      • Do instrumenting and include it in kickstart. Tie it in with the work Martin has been doing. Will be useful for workflow debugging and re-running.
      • Some instrumentation at compiler level? Sort of a pegasus make. Potentially a lot of work in that.
      • Do experiments on FG where data is staged using gridftp instead of condor io. Martin has accelerators that will increase the bandwidth on internet2 or calit
      • Artifact from the stampede project
        • Dashboard
        • General SQL schema
        • Formalized the netlogger yang model
        • Analysis functions / Stampede DB API
        • Anomaly detection work? Can that be resurrected? Priscilla did not transfer, so not sure.
        • Connection with periscope ?
        • Also periscope is integrated with gridftp. We should use it.
    • For Wei
      • how to adapt scheduling of workflows as errors happen, disk space is used up, or hosts have problems.
    • Funding opportunities.
      • At LBNL a lot of dynamic workflows from a single domain.
      • DOE oscar - more for research not for software development
      • opportunities for application tie in for dynamic environments
      • potentially for NSF. STCI will not work. It will be more experimental. More to go in for Si2.
      • technologies for processing - extreme scale science requires multi scale and multi processors. Some DOE workshop.
        • potential tie in with ann data scheduling stuff
    • From Dan Email
      • 1. Finding:  Collaborative extreme-scale science requires a range of services to manage the collaboration itself.  These may include machine-accessible lists of collaborators with their roles, authorities and duties, project planning tools, effort and financial reporting services, meeting scheduling and management services.  The lack of uniformity, and indifferent quality of many of the existing services is a major impediment to collaborative science.
      • Recommendation: Develop Office-of-Science guidelines for centrally supported services, with special focus on the Software-as-a-Service mode of delivery.  The central support and any essential research and development, may be provided by ASCR in concert with science programs or by commercial services where appropriate.
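
The global sqlite registry floated in the Stampede Dashboard discussion above (a database under $HOME/.pegasus that records each workflow's sqlalchemy connection string) could be sketched roughly as follows. This is only an illustration using Python's standard sqlite3 module. The table name workflow_db, its columns, and the function names are hypothetical and not part of Pegasus or the dashboard.

```python
import os
import sqlite3


def open_registry(path):
    """Open the registry database, creating the file and table if needed."""
    os.makedirs(os.path.dirname(os.path.abspath(path)), exist_ok=True)
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per workflow submit directory.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS workflow_db ("
        " submit_dir TEXT PRIMARY KEY,"
        " db_url TEXT NOT NULL)"
    )
    return conn


def register_workflow(conn, submit_dir, db_url):
    """Record (or update) the connection string for one workflow."""
    conn.execute(
        "INSERT OR REPLACE INTO workflow_db (submit_dir, db_url) VALUES (?, ?)",
        (submit_dir, db_url),
    )
    conn.commit()


def list_workflows(conn):
    """Return all registered (submit_dir, db_url) pairs, sorted by directory."""
    rows = conn.execute(
        "SELECT submit_dir, db_url FROM workflow_db ORDER BY submit_dir"
    )
    return rows.fetchall()
```

A dashboard started with a pointer to this one registry file could call list_workflows to discover every per-workflow database. The web service call that Dan and Martin prefer would replace register_workflow with a request to the dashboard.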

12/1/2011

  • Attendees: Ewa, Karan, Dan, Taghrid, Monte, and Martin
  • Taghrid is working on a proposal. 500K? over 3 years. Taghrid will send a version to Ewa to look at by end of next week.
  • Doing some experiments on FutureGrid
    • Dan needs to get an account on FG, or already has one.
    • Dan says Taghrid should get an account.
    • Jens runs on FutureGrid. Periodogram workflow on FutureGrid.
    • Dan has some modifications to kickstart in mind.
    • He will put in a request in Pegasus JIRA.
  • Paper Deadline
    • List of experiments to be done.
    • HPDC too close. Maybe some workshop at HPDC that has a later deadline.
    • SC submission is a possibility (April 27th)
  • Status of the dashboard
    • Same as what was shown at SC.
    • Stephanie will be doing some scalability tests against the database API. Write a performance API test suite.
  • Integration with Periscope
    • Dan is interested in doing it. But not sure whether LBNL will have time to do it.
    • He feels the right time to look at periscope will be when they start doing runs on FG.
  • Follow on to Stampede Proposal
    • Ewa and Dan want to do something.
    • STCI proposals are now pushed into Si2.

...