# Meeting 15.11: Proper Implementation
- **Patrick**
  - Time the runtime of MPI calls (see the timing sketch after this list)
  - Multiple iterations per job
  - Research the allgather/allreduce algorithms used by OpenMPI
  - Synchronize processes before running using `MPI_Barrier`
- **Elwin**
  - Use different physical nodes
  - Use the new repetition format (multiple runs per job)
  - New plots (memory usage, compute ratio, variation across repetitions)
  - Verify that the results are actually correct
  - Use a specific algorithm for `allreduce`
- **Roy**
  - allreduce-rabenseifner
  - rabenseifner-gather
  - allreduce-butterfly → finished it and generalized it to non-power-of-two process counts
  - Started rabenseifner-scatter → divide the final matrix into smaller submatrices (not yet finished)
- **Noe**
  - Research allreduce: the algorithms used by OpenMPI and their runtimes as functions of bandwidth, latency, and compute time per byte
  - Research Rabenseifner's algorithm
  - Extend our previous plotting framework, incl. adding scatter plots to the line plots for the per-runtime plots and using different aggregation methods such as median, 99th percentile, and mean
  - LogP model of our initial butterfly implementation
- **Dave**
  - Implement the allreduce ring
  - Compare it with the OpenMPI allreduce ring
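
As a reference for the timing and synchronization points above, a minimal sketch (not our benchmark harness) of how a collective can be timed with a barrier before each repetition; the payload size, repetition count, and the choice of `MPI_Allreduce` are placeholders:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;      /* placeholder payload: 1M doubles */
    const int REPETITIONS = 10; /* multiple iterations per job */
    double *in  = malloc(N * sizeof(double));
    double *out = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) in[i] = rank;

    for (int rep = 0; rep < REPETITIONS; rep++) {
        MPI_Barrier(MPI_COMM_WORLD);  /* synchronize all ranks before timing */
        double t0 = MPI_Wtime();
        MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        printf("rank %d, repetition %d: %.6f s\n", rank, rep, t1 - t0);
    }

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
```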
- **allgather-async**: All processes send their vector to all other processes. While waiting, the outer products of the vectors received so far can already be computed and added to the total.
- **allreduce-butterfly**: Each process computes the whole matrix; a butterfly is used to distribute the temporary matrices (each composed by adding the process's own matrix and all received matrices). A sketch of the butterfly exchange pattern follows this list.
- **allreduce-rabenseifner**: Each process computes its own submatrix; one butterfly adds up all submatrices, and a second butterfly distributes the final submatrices.
- **rabenseifner-gather**: Distribute the vectors using an allgather approach; each process computes a (final) submatrix; a butterfly aggregates the final submatrices.
- **rabenseifner-scatter**: Distribute the vectors using a butterfly; each process computes a (final) submatrix; a butterfly aggregates all final submatrices.
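
For reference, a minimal sketch of the butterfly (recursive-doubling) exchange pattern behind allreduce-butterfly; it assumes a power-of-two number of ranks and full-size buffers, so it is an illustration rather than the generalized implementation mentioned above:

```c
#include <mpi.h>

/* Butterfly (recursive-doubling) allreduce sketch: after log2(size) rounds,
 * every rank holds the element-wise sum of all ranks' data. Assumes size is
 * a power of two; recv_buf must provide n doubles of scratch space. */
void butterfly_allreduce(double *data, double *recv_buf, int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;  /* partner differs in exactly one bit */
        MPI_Sendrecv(data, n, MPI_DOUBLE, partner, 0,
                     recv_buf, n, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < n; i++)
            data[i] += recv_buf[i]; /* fold the partner's contribution in */
    }
}
```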
Plots can be found here (scroll down). (1) Only a few selected plots are shown; click the arrow to show the remaining variations. (2) Some subfolders contain the same plots but for specific algorithms: `native` for a comparison of all allreduce algorithms, especially the MPI-native ones, and (3) some others for specific algorithms (whose plots in the parent directory are somehow messed up).
- **Elwin**: introduction, high-level overview of what we've worked on
  - Separate physical nodes, verify operation results for correctness, repeated runs (40x) → show plots
  - What comes next? LogP, newly implemented algorithms, fixed MPI-native algorithms
- **Noe**:
  - Overview of the LogP model (see the cost formulas after this list)
- **Roy (lead), Patrick, Dave**:
  - Overview of the new algorithms (presented by whoever implemented them)
  - Show plots, some interpretation
  - Suggestions for further improvements (Roy has an idea)
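
As a reference point for the LogP/cost-model part, the standard latency-bandwidth estimates from the literature (e.g. Thakur, Rabenseifner, and Gropp, 2005) for allreduce on p processes with n bytes per process, latency α, transfer time per byte β, and compute time per byte γ; these are textbook formulas, not our own LogP derivation:

```latex
% allreduce cost estimates (alpha = latency, beta = transfer time per byte,
% gamma = compute time per byte, p = processes, n = bytes per process)
\begin{align*}
T_{\text{recursive doubling}} &= \lceil \log_2 p \rceil \alpha
  + n \lceil \log_2 p \rceil \beta + n \lceil \log_2 p \rceil \gamma \\
T_{\text{Rabenseifner}} &= 2 \lceil \log_2 p \rceil \alpha
  + 2 \tfrac{p-1}{p} n \beta + \tfrac{p-1}{p} n \gamma \\
T_{\text{ring}} &= 2 (p-1) \alpha
  + 2 \tfrac{p-1}{p} n \beta + \tfrac{p-1}{p} n \gamma
\end{align*}
```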
- **Questions**:
  - `MPI_Send()` is asynchronous in some cases (small payloads); how do we deal with this when measuring? (see the sketch after this list)
  - Benchmark on more than 48 nodes
  - Need to wait until all data has been received → synchronous → maybe think about starting the computation earlier
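
Regarding the `MPI_Send()` question: for small payloads the eager protocol may let `MPI_Send` return before the receiver is involved, so timing it alone can under-measure the transfer. A minimal two-rank sketch (payload size and tags are placeholders) contrasting it with `MPI_Ssend`, which completes only once the matching receive has started:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[256];                        /* small payload: likely sent eagerly */
    for (int i = 0; i < 256; i++) buf[i] = rank;

    if (rank == 0) {
        double t0 = MPI_Wtime();
        MPI_Send(buf, 256, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  /* may return early */
        double t_send = MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        MPI_Ssend(buf, 256, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD); /* waits for matching recv */
        double t_ssend = MPI_Wtime() - t0;

        printf("MPI_Send: %.6f s, MPI_Ssend: %.6f s\n", t_send, t_ssend);
    } else if (rank == 1) {
        MPI_Recv(buf, 256, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(buf, 256, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```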
- **Asynchronous Computation**:
  - ...
- Write up what we have done so far → send it → discuss next steps with Saleh (send by Saturday)
  - A summary of what we have done, the results, and how they were obtained → he should understand what we have done
- **Motivation - Neural Networks** (see the gradient-allreduce sketch after this list):
  - Neural networks → distributed matrix computation
  - Y = X^\top W (output of a layer prior to the nonlinearity)
  - Optimize the NN using backprop → gradient
  - Math to compute the gradient w.r.t. the weights → chain rule → gradient of the weight matrix W
  - Update the weights using stochastic gradient descent
  - \partial W = X \, \partial Y (gradient of the loss w.r.t. W, from the chain rule)
  - For large datasets → want to train in a distributed fashion
  - Distributed training: data parallel
  - Data-parallel: divide the dataset (e.g. images), put the subsets onto individual processors, replicate the model onto each machine
  - The forward pass can be computed completely independently
  - Backprop cannot be computed independently → it depends on the input X (images) → the processes get different weight updates
  - Approach: each process computes its own weight update \partial W_i (gradient) → apply an allreduce over the per-process gradient updates
  - \partial W = \sum_i \partial W_i → all processes update with the same values
  - Per process, \partial W_i = x_i \, \partial y_i^\top → an outer product is needed to get the weight update from each processor's own data
  - Note: sometimes the vectors are sparse → sparsify before sending etc. → may change the communication
  - ...
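
To connect the motivation back to the allreduce work, a minimal sketch of the data-parallel update (dimensions, data, and names are placeholders, not our API): each rank forms its local gradient as the outer product \partial W_i = x_i \, \partial y_i^\top, and a single allreduce sums the contributions so that every rank applies the same update.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* toy layer dimensions: D input features, M outputs */
    enum { D = 4, M = 3 };
    double x[D], dy[M], dW_local[D * M], dW[D * M];

    /* placeholder local data; in training these come from the local batch */
    for (int k = 0; k < D; k++) x[k]  = rank + k;
    for (int j = 0; j < M; j++) dy[j] = 0.1 * (j + 1);

    /* local gradient as an outer product: dW_local[k][j] = x[k] * dy[j] */
    for (int k = 0; k < D; k++)
        for (int j = 0; j < M; j++)
            dW_local[k * M + j] = x[k] * dy[j];

    /* sum the per-rank gradients; afterwards every rank holds the same dW
     * and can apply the same SGD step */
    MPI_Allreduce(dW_local, dW, D * M, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```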