Introduction

This page describes methods for benchmarking the latency, throughput, and other performance characteristics of the Stampede tools. Links to the results are provided.

The complete list of metrics by which the project will be evaluated, taken from the proposal, is here: STCI-metrics.pdf. That document divides the performance of the toolchain into responsiveness and scalability, an organization we echo below.

Benchmarks

System Responsiveness

Archival delay

This is the interval between the time an event is made available to the system and the time it is committed to the archive.

Plan:

  1. Start with the output of a completed workflow
  2. Create a skeleton copy of its output directory, with no jobs. This is the input directory.
  3. Run tailstatd, nl_broker, and nl_loader
    • nl_loader needs to be run against each target database: MongoDB, MySQL, and SQLite
  4. Copy one job file from the completed directory to the input directory and record the end-of-copy time
  5. Record the time at which the job file is added to the database (via debug instrumentation in the nl_loader code)
  6. Subtract the end-of-copy time from the database load time to get the latency
  7. Repeat steps 4 through 6 many times (e.g., 100 or 1000 iterations) to characterize the range of latencies.
    • Note that the program that "trickles" the jobs in should sleep for a small but non-trivial interval between each new job copied to the input directory, so that buffering and flushing in the pipeline are taken into account (a sketch of such a script follows this list).
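A minimal sketch of the copy-and-timestamp loop from steps 4 through 6, in Python. The directory paths, the job-file filename filter, and the sleep interval are placeholders; the matching database insert times are assumed to come separately from debug instrumentation in nl_loader.

    import os
    import shutil
    import time

    COMPLETED_DIR = "/path/to/completed_workflow"  # output of a finished workflow (placeholder)
    INPUT_DIR = "/path/to/input_skeleton"          # skeleton directory watched by tailstatd (placeholder)
    SLEEP_SECS = 0.5                               # small but non-trivial pause between jobs
    TIMES_FILE = "copy_times.log"

    def trickle_jobs(job_files):
        """Copy job files in one at a time, recording the end-of-copy timestamp for each."""
        with open(TIMES_FILE, "w") as out:
            for name in job_files:
                shutil.copy(os.path.join(COMPLETED_DIR, name), os.path.join(INPUT_DIR, name))
                # Wall-clock time at which the file became visible to the monitoring pipeline.
                out.write("%s %.6f\n" % (name, time.time()))
                out.flush()
                time.sleep(SLEEP_SECS)  # let buffering and flushing in the pipeline play out

    if __name__ == "__main__":
        # The ".out" suffix is only an assumption about how job files are named.
        trickle_jobs(sorted(f for f in os.listdir(COMPLETED_DIR) if f.endswith(".out")))

The per-job latency is then the loader's insert timestamp minus the end-of-copy timestamp recorded above.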

Analysis:

  1. Mean, median, quartiles, and outliers of the latency
  2. The proportion of the total delay attributable to each stage:
    • file-exists to tailstatd-write
    • broker-read to loader-read
    • loader-read to loader-write

The loader-read to loader-write numbers should be calculated separately for each database type.
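As a sketch of the first analysis item, assuming the per-job latencies (in seconds) have been collected into a list; the 1.5 x IQR outlier rule used here is one common convention, not a requirement from the proposal.

    import statistics

    def summarize_latencies(latencies):
        """Mean, median, quartiles, and outliers (1.5 * IQR rule) of a set of latencies."""
        data = sorted(latencies)
        q1, q2, q3 = statistics.quantiles(data, n=4)  # three quartile cut points
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return {
            "mean": statistics.mean(data),
            "median": q2,
            "q1": q1,
            "q3": q3,
            "outliers": [x for x in data if x < low or x > high],
        }

    # Example: latencies read from a file with one value per line.
    # with open("latencies.txt") as f:
    #     print(summarize_latencies([float(line) for line in f]))

The same summary can be computed per stage (file-exists to tailstatd-write, and so on) and per database type by feeding it the corresponding subsets of the measurements.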

Analysis delay

This is the interval between the time an event is made available to the system and the time it generates an alarm. It will be measured for a set of pre-defined alarm criteria that vary in complexity.

Plan: TBD. Real-time alarms are still a work in progress.

System Scalability

Archival rate

Per-node and overall scalability numbers for the sustained rate of archiving monitoring events.

Plan:

  1. Use an existing workflow output directory and, if necessary, inflate the number of jobs to a large number (say, 1 million). This is the input directory.
  2. Run tailstatd, nl_broker, and nl_loader on the input directory.
    • nl_loader needs to be run against each target database: MongoDB, MySQL, and SQLite
  3. Measure the time between the start of the run and the time the last event enters the database. This can be approximated from the throughput at the database loader, given that the mean latency is already known and is in any case much smaller than the total runtime (a sketch of the computation follows this list).
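A trivial sketch of the measurement in step 3, assuming the run's start time is recorded before the tools are launched and the last-insert time is obtained from the loader instrumentation or the database itself; the example numbers are hypothetical.

    def mean_archival_throughput(num_events, start_time, last_insert_time):
        """Sustained archival rate, in events per second, over the whole run."""
        elapsed = last_insert_time - start_time
        if elapsed <= 0:
            raise ValueError("last insert must come after the start of the run")
        return num_events / elapsed

    # Hypothetical example: 1,000,000 events loaded over 8,300 seconds
    # works out to roughly 120 events/second.
    print(mean_archival_throughput(1_000_000, 0.0, 8300.0))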

Analysis:

  1. Mean throughput for the whole run
  2. A graph of, and an approximation for, the ratio of mean throughput to per-event throughput (throughput / num. events) as the number of events increases.

The analysis should be calculated separately for each database type.
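One way to produce the data for the graph in item 2, assuming per-event insert timestamps are available from the loader, is to sample the cumulative throughput every N events and compare it to the overall mean; a sketch follows.

    def cumulative_throughput(insert_times, start_time, step=10000):
        """Cumulative throughput (events/sec) sampled every `step` events.

        insert_times: sorted wall-clock timestamps at which events entered the database.
        Returns (event_count, events_per_second) pairs suitable for plotting.
        """
        points = []
        for n in range(step, len(insert_times) + 1, step):
            elapsed = insert_times[n - 1] - start_time
            if elapsed > 0:
                points.append((n, n / elapsed))
        return points

If the resulting curve stays flat as the event count grows, the archival rate scales; a downward trend suggests the loader or the database is becoming the bottleneck.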

Analysis rate

Per-node and overall scalability numbers for the sustained rate of anomaly detection on monitoring events.

Plan: TBD. Real-time alarms are still a work in progress.
