
Add P2P distributed optimization to advanced examples #3189

Open · wants to merge 14 commits into main
Conversation

francescofarina

Description

This PR adds a new set of advanced examples in examples/advanced/distributed_optimization, showing how to use the lower-level APIs to build P2P distributed optimization algorithms.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).

@francescofarina
Author

@chesterxgchen I implemented your suggested changes:

  • I moved the implementation to app_opt and renamed the module from nvdo to p2p, since the classes can potentially be used for arbitrary p2p algorithms, not just distributed optimization. Happy to change the name though.
  • Added documentation to everything that's now in app_opt.
  • Moved the SyncAlgorithmExecutor to a separate file and renamed the base.py files to base_p2p_executor.py for the executor and p2p_controller.py for the controller. BaseP2PAlgorithmExecutor is now an ABC.
  • Removed pickle. For convenience, all the executors currently save their results with torch.save, but that could be removed and easily reimplemented by the user if needed.
  • The dependencies for examples/advanced/distributed_optimization are now specified in a requirements.txt, since the core code has been moved to app_opt and can be imported directly from nvflare (see the layout sketch below).
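Roughly, the layout now looks like this (a simplified sketch; the exact folder structure and any file names beyond the ones listed above may differ):

nvflare/app_opt/p2p/
    types.py                 # Config, LocalConfig, Neighbor
    p2p_controller.py        # P2PAlgorithmController
    base_p2p_executor.py     # BaseP2PAlgorithmExecutor (ABC)
    sync_executor.py         # SyncAlgorithmExecutor (file name is a guess)
examples/advanced/distributed_optimization/
    requirements.txt         # example-only dependencies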

Let me know what you think.

Now that it's moved to the core, I feel the implementation could be changed/improved by offloading things like saving results, storing losses via callbacks, and monitoring to the user, but perhaps it makes more sense to do that at a later stage.
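As a rough illustration of that idea (all names here are hypothetical, not part of this PR), result saving and loss tracking could be passed in as optional callbacks instead of being baked into the executors:

# Hypothetical sketch: let the user supply hooks instead of the executor calling torch.save itself.
from typing import Callable, Dict, List, Optional


class CallbackHooks:
    """Optional user-supplied hooks for tracking progress and persisting results."""

    def __init__(
        self,
        on_iteration_end: Optional[Callable[[int, float], None]] = None,
        on_finish: Optional[Callable[[Dict], None]] = None,
    ):
        self.on_iteration_end = on_iteration_end  # called as on_iteration_end(iteration, loss)
        self.on_finish = on_finish  # called once with the final results dict


# Example usage: collect losses in memory instead of writing them to disk.
losses: List[float] = []
hooks = CallbackHooks(
    on_iteration_end=lambda it, loss: losses.append(loss),
    on_finish=lambda results: print(f"done after {len(losses)} iterations"),
)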

@holgerroth
Collaborator

Tested locally and it runs fine. The tutorial is great! We should consider adding some CI testing in a future PR.

from nvflare.app_opt.p2p.types import Config


class P2PAlgorithmController(Controller):
Collaborator

The name implies this applies to all p2p algorithms. How does this controller work with swarm learning? In that case, the controller doesn't seem to have a payload for the model, metrics, and metadata.

Or is this a special p2p_controller only for distributed optimization?

Author

The P2PAlgorithmController currently only sends the configuration to the clients and broadcasts a task to run whatever p2p algorithm they're meant to run.

I guess we can either extend the P2PAlgorithmController later to handle payloads/metrics/metadata for swarm learning, or subclass it to create a SwarmLearningController.
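To make that concrete, here is a minimal sketch of what that flow amounts to (the task names and the config-delivery step are assumptions, not the exact code in this PR):

# Illustrative sketch only; the actual P2PAlgorithmController may differ.
from nvflare.apis.controller_spec import Task
from nvflare.apis.impl.controller import Controller
from nvflare.apis.shareable import Shareable


class SketchP2PController(Controller):
    def __init__(self, config):
        super().__init__()
        self.config = config  # network topology / per-client neighbor configuration

    def start_controller(self, fl_ctx):
        pass

    def stop_controller(self, fl_ctx):
        pass

    def process_result_of_unknown_task(self, client, task_name, client_task_id, result, fl_ctx):
        pass

    def control_flow(self, abort_signal, fl_ctx):
        # 1. send each client its local configuration (hypothetical task name)
        config_task = Task(name="send_config", data=Shareable())
        self.broadcast_and_wait(task=config_task, fl_ctx=fl_ctx, abort_signal=abort_signal)
        # 2. broadcast a single "run algorithm" task and wait until every client finishes
        run_task = Task(name="run_algorithm", data=Shareable())
        self.broadcast_and_wait(task=run_task, fl_ctx=fl_ctx, abort_signal=abort_signal)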

Collaborator

@chesterxgchen Feb 2, 2025

We already have a swarm learning controller. Since swarm learning does fedavg-like operations on the client side, we also have a client-side controller for that.
My point is that until the new controller can capture more general requirements for p2p, we should probably give it a more specific name rather than a general one. We can always generalize it later to consolidate the two algorithms and pick a general name then. A general name at this point might limit us for later changes.

Maybe it doesn't take much to include the swarm learning controller requirements (you might need to check the swarm controller), as the majority of the logic is on the client side.

Author

Yes, that makes sense, I wasn't considering that. Let me use a more specific name for the controller, making it clear it's for the distributed optimization algorithms implemented in the module (DistributedOptimizationController?). We can create a more general p2p controller later with all the general requirements. @chesterxgchen shall I also rename the module from p2p to something like do or dist_opt for the moment, in light of that?

Collaborator

p2p is fine. You could have

class DistOptController (DistributedOptimizationController is too long)

in

app_opt/controllers/p2p/dist_opt_controller.py, or
app_opt/controllers/dist_opt_controller.py, or
app_opt/controllers/p2p_do_controller.py.

Either way is fine with me.

from nvflare.app_opt.p2p.types import LocalConfig, Neighbor


class BaseP2PAlgorithmExecutor(Executor, ABC):
Collaborator

Similar comments as for the controller.

if len(self.neighbors_values[iteration]) < len(self.neighbors):
    # wait for all neighbors to send their values for the current iteration
    # if not received after timeout, abort the job
    if not self.sync_waiter.wait(timeout=10):
Collaborator

Hard-coded 10 sec timeout; this needs to be made configurable.
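For example (a standalone sketch, not the PR's code, showing the timeout as a constructor argument with a default):

import threading


class SyncExecutorSketch:
    """Sketch of exposing the neighbor-sync timeout instead of hard-coding 10 seconds."""

    def __init__(self, sync_timeout: float = 10.0):
        self.sync_timeout = sync_timeout  # how long to wait for all neighbors before giving up
        self.sync_waiter = threading.Event()
        self.neighbors = []
        self.neighbors_values = {}

    def wait_for_neighbors(self, iteration: int) -> bool:
        # wait for all neighbors to send their values for the current iteration;
        # return False if they don't arrive within sync_timeout so the caller can abort the job
        if len(self.neighbors_values.get(iteration, {})) < len(self.neighbors):
            return self.sync_waiter.wait(timeout=self.sync_timeout)
        return True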

# Store the received value in the neighbors_values dictionary
self.neighbors_values[iteration][sender] = self._from_message(data["value"])
# Check if all neighbor values have been received for the iteration
if len(self.neighbors_values[iteration]) >= len(self.neighbors):
Collaborator

Do we reset neighbors_values once we have all of them, ready for the next round?
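If not, one option (a fragment-level sketch; the right hook point depends on where the values are consumed) is to drop the finished iteration's entry once it has been used in the local update:

# Sketch: after the values for `iteration` have been consumed in the local update,
# remove them and re-arm the waiter so the dictionary doesn't grow round over round.
values = self.neighbors_values.pop(iteration, None)
self.sync_waiter.clear()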

Collaborator

@chesterxgchen left a comment

I added a few more comments.
