Add P2P distributed optimization to advanced examples #3189
Conversation
Resolved review threads (now outdated):
examples/advanced/distributed_optimization/nvdo/controllers/base.py
examples/advanced/distributed_optimization/nvdo/executors/base.py
examples/advanced/distributed_optimization/nvdo/executors/consensus.py
@chesterxgchen I implemented your suggested changes:
Let me know what you think. Now that it's moved to the core, I feel the implementation could be changed/improved by offloading things like saving the results, storing losses via callbacks, monitoring, etc. to the user - perhaps it makes more sense to do that at a later stage though.
Tested locally and runs fine. The tutorial is great! We should consider adding some CI testing in a future PR.
from nvflare.app_opt.p2p.types import Config


class P2PAlgorithmController(Controller):
The name implies this applies to all p2p algorithms. How does this controller work with swarm learning? In that case, the controller doesn't seem to have a payload for model, metrics, and metadata.
Or is this a special p2p controller only for distributed optimization?
The P2PAlgorithmController currently only sends the configuration to the clients and broadcasts a task to run whatever p2p algorithm they're meant to run. I guess we can either extend the P2PAlgorithmController later to handle payloads/metrics/metadata for swarm learning, or subclass it to create a SwarmLearningController.
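(For readers of this thread: a minimal sketch of the two-phase flow described above, assuming the standard NVFlare Controller API. The class name, task names, and config handling are illustrative, not the PR's actual code.)

from nvflare.apis.controller_spec import Task
from nvflare.apis.fl_context import FLContext
from nvflare.apis.impl.controller import Controller
from nvflare.apis.shareable import Shareable
from nvflare.apis.signal import Signal


class P2PAlgorithmControllerSketch(Controller):
    """Hypothetical sketch: send config to clients, then trigger the p2p algorithm."""

    def __init__(self, config: dict):
        super().__init__()
        self._config = config

    def start_controller(self, fl_ctx: FLContext):
        pass

    def control_flow(self, abort_signal: Signal, fl_ctx: FLContext):
        # Phase 1: broadcast each client's local configuration
        # (e.g. its neighbors in the network topology).
        config_data = Shareable()
        config_data["config"] = self._config
        self.broadcast_and_wait(
            task=Task(name="config", data=config_data),
            fl_ctx=fl_ctx,
            abort_signal=abort_signal,
        )
        # Phase 2: tell every client to run whatever p2p algorithm it is
        # configured to run; no model/metrics payload is exchanged here.
        self.broadcast_and_wait(
            task=Task(name="run_algorithm", data=Shareable()),
            fl_ctx=fl_ctx,
            abort_signal=abort_signal,
        )

    def stop_controller(self, fl_ctx: FLContext):
        pass

    def process_result_of_unknown_task(self, client, task_name, client_task_id, result, fl_ctx):
        pass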
We already have a swarm learning controller. Since swarm learning does fedavg-like operations on the client side, we also have a client-side controller for that.
My point is that before the new controller can capture more general requirements for p2p, we should probably give it a more specific name rather than a general one. We can always make it general later when consolidating the two algorithms, and pick a general name then. A general name at this point might limit us for later changes.
Maybe it doesn't take much to include the swarm learning controller's requirements (you might need to check the swarm controller), as the majority of the logic is on the client side.
Yes, makes sense, I wasn't considering that. Let me use a more specific name for the controller, making it clear it's for the distributed optimization algorithms implemented in the module (DistributedOptimizationController?). We can create a more general p2p controller later with all the general requirements. @chesterxgchen shall I also rename the module from p2p to something like do or dist_opt for the moment, in light of that?
p2p is fine. You could have class DistOptController() (DistributedOptimizationController is too long), in
app_opt/controllers/p2p/dist_opt_controller.py or
app_opt/controllers/dist_opt_controller.py or
app_opt/controllers/p2p_do_controller.py.
Either way is fine with me.
from nvflare.app_opt.p2p.types import LocalConfig, Neighbor


class BaseP2PAlgorithmExecutor(Executor, ABC):
Similar comments as for the controller.
if len(self.neighbors_values[iteration]) < len(self.neighbors):
    # wait for all neighbors to send their values for the current iteration
    # if not received after timeout, abort the job
    if not self.sync_waiter.wait(timeout=10):
The 10-second timeout is hard-coded; this needs to be made configurable.
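(A minimal, self-contained sketch of one way to address this, exposing the timeout as a constructor argument. The class and method names are hypothetical, not from the PR.)

import threading


class SyncTimeoutSketch:
    """Hypothetical executor fragment with a configurable sync timeout."""

    def __init__(self, sync_timeout: float = 10.0):
        self.sync_timeout = sync_timeout  # was hard-coded to 10 seconds above
        self.sync_waiter = threading.Event()
        self.neighbors = []
        self.neighbors_values = {}

    def wait_for_neighbors(self, iteration: int) -> bool:
        """Return False if the neighbors' values don't arrive before the timeout."""
        if len(self.neighbors_values.get(iteration, {})) < len(self.neighbors):
            # Wait for all neighbors to send their values for this iteration;
            # the caller aborts the job if this returns False.
            return self.sync_waiter.wait(timeout=self.sync_timeout)
        return True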
# Store the received value in the neighbors_values dictionary
self.neighbors_values[iteration][sender] = self._from_message(data["value"])
# Check if all neighbor values have been received for the iteration
if len(self.neighbors_values[iteration]) >= len(self.neighbors):
Do we reset neighbors_values once we have all of them, ready for the next round?
add a few more comments
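(On the reset question, a small sketch of the cleanup one might do once all values for an iteration have been consumed. The attribute names follow the snippet above, but the class and method are hypothetical.)

import threading


class NeighborBufferSketch:
    """Hypothetical per-iteration buffer with an explicit reset between rounds."""

    def __init__(self):
        self.neighbors_values = {}  # iteration -> {sender: value}
        self.sync_waiter = threading.Event()

    def consume_iteration(self, iteration: int) -> dict:
        # Pop the finished iteration's entry so the dict doesn't grow without
        # bound across rounds, and re-arm the event the main loop waits on.
        values = self.neighbors_values.pop(iteration)
        self.sync_waiter.clear()
        return values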
Description
This PR adds a new set of advanced examples in examples/advanced/distributed_optimization, showing how to use the lower-level APIs to build P2P distributed optimization algorithms.