
Meeting 15.11: Proper Implementation


Progress

  • Patrick
    • Time the runtime of MPI calls
    • Multiple iterations per job
    • Research allgather/allreduce algos used by openMPI
    • Synchronize processes before running using MPI_Barrier (see the timing sketch after this list)
  • Elwin
    • Use different physical nodes
    • Use new repetition format (multiple runs per job)
    • New Plots (Memory usage, compute ratio, variation across repetitions)
    • Verify results are actually correct
    • Use specific algorithm for allreduce
  • Roy
    • allreduce-rabenseifner
    • rabenseifner-gather
    • allreduce-butterfly --> finished it, and generalized to non-powers-of-2 processes
    • started rabenseifner-scatter --> divide final matrix into smaller submatrices (not yet finished)
  • Noe
    • Research the allreduce algos used by openMPI and their runtimes as functions of bandwidth, latency and compute time per byte
    • Research Rabenseifner's Algo
    • Extend our previous plotting framework, incl. adding scatter plots to the line plots for the runtime plots, using different aggregation methods such as median, 99th percentile and mean
    • LogP model of our initial butterfly implementation
  • Dave
    • Implement allreduce ring
    • Compare with openMPI allreduce ring
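
To illustrate the timing and synchronization items above, here is a minimal sketch (not the project's actual benchmark code) that pairs MPI_Barrier with MPI_Wtime around an MPI_Allreduce and repeats the measurement within one job; the vector length, iteration count and output format are placeholders.

```c
/*
 * Minimal timing sketch (illustrative only): synchronize all ranks with
 * MPI_Barrier, then time a single MPI_Allreduce with MPI_Wtime, repeated
 * for several iterations within one job.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;      /* placeholder vector length */
    const int iterations = 40;  /* multiple repetitions per job */
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) sendbuf[i] = (double)rank;

    for (int it = 0; it < iterations; it++) {
        MPI_Barrier(MPI_COMM_WORLD);  /* align all processes before timing */
        double t0 = MPI_Wtime();
        MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0) printf("iteration %d: %.6f s\n", it, t1 - t0);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```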

New Algos

  • allgather-async: All processes send to all other processes. While waiting, the outer products of already received vectors can be computed and added to the total
  • allreduce-butterfly: Each process calculates the whole matrix; a butterfly is used to distribute temporary matrices (composed by adding the own matrix and all received matrices). A minimal butterfly sketch follows this list.
  • allreduce-rabenseifner: Each process computes its own submatrix; a butterfly is used to add all submatrices, and a second butterfly distributes the final submatrices
  • rabenseifner-gather: distribute vectors using an allgather approach, each process calculates a (final) submatrix, use a butterfly to aggregate the final submatrices
  • rabenseifner-scatter: distribute vectors using a butterfly, each process calculates a (final) submatrix, use a butterfly to aggregate all final submatrices
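
The allreduce-butterfly described above corresponds roughly to a recursive-doubling exchange. Below is a minimal sketch, assuming the number of processes is a power of two (the generalization to non-powers-of-2 mentioned in the progress notes is not shown); the function name and buffers are illustrative, not the project's actual code.

```c
/*
 * Butterfly (recursive-doubling) allreduce sketch, assuming a power-of-two
 * number of processes. In each round a rank exchanges its current partial
 * sum with the partner rank XOR 2^k and adds the received buffer, so after
 * log2(P) rounds every rank holds the full sum.
 */
#include <mpi.h>
#include <stdlib.h>

void butterfly_allreduce(double *buf, int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    double *recv = malloc(n * sizeof(double));

    for (int dist = 1; dist < size; dist <<= 1) {
        int partner = rank ^ dist;
        MPI_Sendrecv(buf, n, MPI_DOUBLE, partner, 0,
                     recv, n, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < n; i++)
            buf[i] += recv[i];  /* accumulate the partner's partial sum */
    }
    free(recv);
}
```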

Presentation

Plots can be found here (scroll down). Only a few selected plots are shown; click the arrow to show the remaining variations. There are some subfolders containing the same plots for specific algorithms: native compares all allreduce algorithms, especially the MPI native ones, and some others cover specific algorithms (whose plots in the parent directory are somehow messed up).

  • Elwin: Introduction, high-level overview of what we've worked on
    • Separate physical nodes, verify operation results for correctness, repeated runs (40x) → show plots
    • What comes next? LogP, newly implemented algorithms, fixed MPI native algorithms
  • Noe:
    • Overview of the LogP model (a rough cost sketch follows after this list)
  • Roy (lead), Patrick, Dave:
    • Overview on new algorithms (whoever implemented it)
    • Show plots, some interpretation
    • Suggestion for further improvements (Roy has an idea)
  • Questions:
    • MPI_Send() returns asynchronously in some cases (small payloads) - how do we deal with that when measuring?
    • Benchmark on more than 48 nodes
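
For the LogP part, a rough cost estimate for the butterfly allreduce under a LogP-style model could serve as a starting point (this formula is an illustrative assumption, not the model from the meeting): with L the latency, o the per-message send/receive overhead, P the number of processes and T_{comp}(n) the time to add n received elements,

T_{butterfly}(n, P) \approx \log_2(P) \cdot (L + 2o + T_{comp}(n)),

since each of the \log_2(P) butterfly rounds pays one message latency, the overhead on both ends, and the local addition of the received buffer.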

Meeting notes

  • Need to wait until all data has been received --> synchronous --> maybe think about starting the computation earlier
  • Asynchronous Computation:
    • ...
  • Write up what we have done so far --> send it --> discuss next steps with Saleh (send by Saturday)
    • Summary of what we have done, results, and how they were obtained --> he should understand what we have done
  • Motivation - Neural Networks:
    • Neural Networks --> Distributed Matrix Computation
    • Y = X^\top W (output of a layer prior to nonlinearity)
    • Optimize NN using backprop --> gradient
    • Math to compute the gradient w.r.t. weights --> chain rule --> gradient of weight matrix W
    • Update weights: using Stochastic Gradient Descent
    • \partial W = X \cdot \partial Y (via the chain rule applied to Y = X^\top W)
    • For large datasets --> want to train distributedly
    • Distributed Training: Data Parallel
    • Data-Parallel: Divide dataset (e.g. images), put subsets onto individual processors, replicate model onto each machine
    • The forward pass can be computed completely independently
    • Backprop cannot be computed independently --> depends on the input X (images) --> each process gets a different weight update
    • Approach: Each process computes its own weight update (gradient \partial W_i) --> apply allreduce over the per-process gradients (see the sketch after these notes)
    • \partial W = \sum_i \partial W_i --> all replicas apply the same update
    • For \partial W_i = X_i \cdot \partial Y_i --> need the outer product to get the weight update of the own processor
    • Note: sometimes sparse vectors --> sparsify before sending etc. --> may change communication
    • ...