
Meeting 15.11: Proper Implementation


Progress

  • Patrick
    • Time the runtime of MPI calls
    • Multiple iterations per job
    • Research allgather/allreduce algos used by openMPI
    • Synchronize processes before running using MPI_Barrier (see the timing sketch after this list)
  • Elwin
    • Use different physical nodes
    • Use new repetition format (multiple runs per job)
    • New Plots (Memory usage, compute ratio, variation across repetitions)
    • Verify results are actually correct
    • Use specific algorithm for allreduce
  • Roy
    • allreduce-rabenseifner
    • rabenseifner-gather
    • allreduce-butterfly --> finished it, and generalized to non-powers-of-2 processes
    • started rabenseifner-scatter --> divide final matrix into smaller submatrices (not yet finished)
  • Noe
    • Research the allreduce algos used by openMPI and their runtimes as functions of bandwidth, latency and compute time per byte
    • Research Rabenseifner's Algo
    • Extend our previous plotting framework, incl. adding scatter plots to the line plots for the runtime plots, using different aggregation methods such as median, 99th percentile and mean
    • LogP model of our initial butterfly implementation
  • Dave
    • Implement allreduce ring
    • Compare with openMPI allreduce ring
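
To illustrate the timing and synchronization items above, here is a minimal sketch (not the project's actual benchmark code) that pairs MPI_Barrier with MPI_Wtime around an MPI_Allreduce and repeats the measurement within one job; the vector length, iteration count and output format are placeholders.

```c
/*
 * Minimal timing sketch (illustrative only): synchronize all ranks with
 * MPI_Barrier, then time a single MPI_Allreduce with MPI_Wtime, repeated
 * for several iterations within one job.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;      /* placeholder vector length */
    const int iterations = 40;  /* multiple repetitions per job */
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) sendbuf[i] = (double)rank;

    for (int it = 0; it < iterations; it++) {
        MPI_Barrier(MPI_COMM_WORLD);  /* align all processes before timing */
        double t0 = MPI_Wtime();
        MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0) printf("iteration %d: %.6f s\n", it, t1 - t0);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```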

New Algos

  • allgather-async: All processes send to all other processes. While waiting, the outer products of already received vectors can be computed and added to the total
  • allreduce-butterfly: Each process calculates the whole matrix; a butterfly is used to distribute temporary matrices (composed by adding the own matrix and all received matrices). A minimal butterfly sketch follows this list.
  • allreduce-rabenseifner: Each process computes its own submatrix; a butterfly is used to add all submatrices, and a second butterfly distributes the final submatrices
  • rabenseifner-gather: distribute vectors using an allgather approach, each process calculates a (final) submatrix, use a butterfly to aggregate the final submatrices
  • rabenseifner-scatter: distribute vectors using a butterfly, each process calculates a (final) submatrix, use a butterfly to aggregate all final submatrices
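
The allreduce-butterfly described above corresponds roughly to a recursive-doubling exchange. Below is a minimal sketch, assuming the number of processes is a power of two (the generalization to non-powers-of-2 mentioned in the progress notes is not shown); the function name and buffers are illustrative, not the project's actual code.

```c
/*
 * Butterfly (recursive-doubling) allreduce sketch, assuming a power-of-two
 * number of processes. In each round a rank exchanges its current partial
 * sum with the partner rank XOR 2^k and adds the received buffer, so after
 * log2(P) rounds every rank holds the full sum.
 */
#include <mpi.h>
#include <stdlib.h>

void butterfly_allreduce(double *buf, int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    double *recv = malloc(n * sizeof(double));

    for (int dist = 1; dist < size; dist <<= 1) {
        int partner = rank ^ dist;
        MPI_Sendrecv(buf, n, MPI_DOUBLE, partner, 0,
                     recv, n, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < n; i++)
            buf[i] += recv[i];  /* accumulate the partner's partial sum */
    }
    free(recv);
}
```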

Presentation

Plots can be found here (scroll down). Only a few selected plots are shown; click the arrow to show the remaining variations. There are some subfolders containing the same plots for specific algorithms: native compares all allreduce algorithms, especially the MPI native ones, and some others cover specific algorithms (whose plots in the parent directory are somehow messed up).

  • Elwin: Introduction, high-level overview of what we've worked on
    • Separate physical nodes, verify operation results for correctness, repeated runs (40x) → show plots
    • What comes next? LogP, newly implemented algorithms, fixed MPI native algorithms
  • Noe:
    • Overview of the LogP model (a rough cost sketch follows after this list)
  • Roy (lead), Patrick, Dave:
    • Overview on new algorithms (whoever implemented it)
    • Show plots, some interpretation
    • Suggestion for further improvements (Roy has an idea)
  • Questions:
    • MPI_Send() returns asynchronously in some cases (small payloads) - how do we deal with that when measuring?
    • Benchmark on more than 48 nodes
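
For the LogP part, a rough cost estimate for the butterfly allreduce under a LogP-style model could serve as a starting point (this formula is an illustrative assumption, not the model from the meeting): with L the latency, o the per-message send/receive overhead, P the number of processes and T_{comp}(n) the time to add n received elements,

T_{butterfly}(n, P) \approx \log_2(P) \cdot (L + 2o + T_{comp}(n)),

since each of the \log_2(P) butterfly rounds pays one message latency, the overhead on both ends, and the local addition of the received buffer.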

Meeting notes

  • Need to wait until all data has been received --> synchronous --> maybe think about starting the computation earlier
  • Asynchronous Computation:
    • ...
  • Write up what we have done so far --> send it --> discuss next steps with Saleh (send by Saturday)
    • Summary of what we have done, results, and how they were obtained --> he should understand what we have done
  • Motivation - Neural Networks:
    • Neural Networks --> Distributed Matrix Computation
    • Y = X^\top W (output of a layer prior to nonlinearity)
    • Optimize NN using backprop --> gradient
    • Math to compute the gradient w.r.t. weights --> chain rule --> gradient of weight matrix W
    • Update weights: using Stochastic Gradient Descent
    • \partial W = X \cdot \partial Y (via the chain rule applied to Y = X^\top W)
    • For large datasets --> want to train distributedly
    • Distributed Training: Data Parallel
    • Data-Parallel: Divide dataset (e.g. images), put subsets onto individual processors, replicate model onto each machine
    • The forward pass can be computed completely independently
    • Backprop cannot be computed independently --> depends on the input X (images) --> each process gets a different weight update
    • Approach: Each process computes its own weight update (gradient \partial W_i) --> apply allreduce over the per-process gradients (see the sketch after these notes)
    • \partial W = \sum_i \partial W_i --> all replicas apply the same update
    • For \partial W_i = X_i \cdot \partial Y_i --> need the outer product to get the weight update of the own processor
    • Note: sometimes sparse vectors --> sparsify before sending etc. --> may change communication
    • ...