
Add P2P distributed optimization to advanced examples #3189

Open · wants to merge 14 commits into main
Conversation

francescofarina

Description

This PR adds a new set of advanced examples in examples/advanced/distributed_optimization, showing how to use the lower-level APIs to build P2P distributed optimization algorithms.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).

@francescofarina
Author

@chesterxgchen I implemented your suggested changes:

  • I moved the implementation to app_opt and renamed the module from nvdo to p2p, since the classes can potentially be used for arbitrary p2p algorithms, not just distributed optimization. Happy to change the name though.
  • Added documentation to everything that's now in app_opt.
  • Moved the SyncAlgorithmExecutor to a separate file and renamed the base.py files to base_p2p_executor.py for the executor and p2p_controller.py for the controller. BaseP2PAlgorithmExecutor is now an ABC.
  • Removed pickle. For convenience, all the executors currently save their results with torch.save, but that could be removed and easily reimplemented by the user if needed.
  • The dependencies for examples/advanced/distributed_optimization are now specified in a requirements.txt, since the core code has been moved to app_opt and can be imported directly from nvflare (see the layout sketch below).
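Roughly, the layout now looks like this (a simplified sketch; the exact folder structure and any file names beyond the ones listed above may differ):

nvflare/app_opt/p2p/
    types.py                 # Config, LocalConfig, Neighbor
    p2p_controller.py        # P2PAlgorithmController
    base_p2p_executor.py     # BaseP2PAlgorithmExecutor (ABC)
    sync_executor.py         # SyncAlgorithmExecutor (file name is a guess)
examples/advanced/distributed_optimization/
    requirements.txt         # example-only dependencies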

Let me know what you think.

Now that it's moved to the core, I feel the implementation could be changed/improved by offloading things like saving results, storing losses via callbacks, and monitoring to the user, but perhaps it makes more sense to do that at a later stage.
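As a rough illustration of that idea (all names here are hypothetical, not part of this PR), result saving and loss tracking could be passed in as optional callbacks instead of being baked into the executors:

# Hypothetical sketch: let the user supply hooks instead of the executor calling torch.save itself.
from typing import Callable, Dict, List, Optional


class CallbackHooks:
    """Optional user-supplied hooks for tracking progress and persisting results."""

    def __init__(
        self,
        on_iteration_end: Optional[Callable[[int, float], None]] = None,
        on_finish: Optional[Callable[[Dict], None]] = None,
    ):
        self.on_iteration_end = on_iteration_end  # called as on_iteration_end(iteration, loss)
        self.on_finish = on_finish  # called once with the final results dict


# Example usage: collect losses in memory instead of writing them to disk.
losses: List[float] = []
hooks = CallbackHooks(
    on_iteration_end=lambda it, loss: losses.append(loss),
    on_finish=lambda results: print(f"done after {len(losses)} iterations"),
)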

@holgerroth
Collaborator

Tested locally and it runs fine. The tutorial is great! We should consider adding some CI testing in a future PR.

from nvflare.app_opt.p2p.types import Config


class P2PAlgorithmController(Controller):
Collaborator

The name implies this applies to all p2p algorithms. How does this controller work with swarm learning? In that case, the controller doesn't seem to have a payload for the model, metrics, and metadata.

Or is this a special p2p_controller only for distributed optimization?

Author

The P2PAlgorithmController currently only sends the configuration to the clients and broadcasts a task to run whatever p2p algorithm they're meant to run.

I guess we can either extend the P2PAlgorithmController later to handle payloads/metrics/metadata for swarm learning, or subclass it to create a SwarmLearningController.
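To make that concrete, here is a minimal sketch of what that flow amounts to (the task names and the config-delivery step are assumptions, not the exact code in this PR):

# Illustrative sketch only; the actual P2PAlgorithmController may differ.
from nvflare.apis.controller_spec import Task
from nvflare.apis.impl.controller import Controller
from nvflare.apis.shareable import Shareable


class SketchP2PController(Controller):
    def __init__(self, config):
        super().__init__()
        self.config = config  # network topology / per-client neighbor configuration

    def start_controller(self, fl_ctx):
        pass

    def stop_controller(self, fl_ctx):
        pass

    def process_result_of_unknown_task(self, client, task_name, client_task_id, result, fl_ctx):
        pass

    def control_flow(self, abort_signal, fl_ctx):
        # 1. send each client its local configuration (hypothetical task name)
        config_task = Task(name="send_config", data=Shareable())
        self.broadcast_and_wait(task=config_task, fl_ctx=fl_ctx, abort_signal=abort_signal)
        # 2. broadcast a single "run algorithm" task and wait until every client finishes
        run_task = Task(name="run_algorithm", data=Shareable())
        self.broadcast_and_wait(task=run_task, fl_ctx=fl_ctx, abort_signal=abort_signal)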

Collaborator

@chesterxgchen Feb 2, 2025

We already have a swarm learning controller. Since swarm learning does fedavg-like operations on the client side, we also have a client-side controller for that.
My point is that until the new controller can capture more general requirements for p2p, we should probably give it a more specific name rather than a general one. We can always generalize it later to consolidate the two algorithms and pick a general name then. A general name at this point might limit us for later changes.

Maybe it doesn't take much to include the swarm learning controller requirements (you might need to check the swarm controller), as the majority of the logic is on the client side.

Author

Yes, that makes sense, I wasn't considering that. Let me use a more specific name for the controller, making it clear it's for the distributed optimization algorithms implemented in the module (DistributedOptimizationController?). We can create a more general p2p controller later with all the general requirements. @chesterxgchen shall I also rename the module from p2p to something like do or dist_opt for the moment, in light of that?

Collaborator

p2p is fine. You could have

class DistOptController (DistributedOptimizationController is too long)

in

app_opt/controllers/p2p/dist_opt_controller.py, or
app_opt/controllers/dist_opt_controller.py, or
app_opt/controllers/p2p_do_controller.py.

Either way is fine with me.

from nvflare.app_opt.p2p.types import LocalConfig, Neighbor


class BaseP2PAlgorithmExecutor(Executor, ABC):
Collaborator

Similar comments as for the controller.

if len(self.neighbors_values[iteration]) < len(self.neighbors):
    # wait for all neighbors to send their values for the current iteration
    # if not received after timeout, abort the job
    if not self.sync_waiter.wait(timeout=10):
Collaborator

Hard-coded 10 sec timeout; this needs to be made configurable.
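For example (a standalone sketch, not the PR's code, showing the timeout as a constructor argument with a default):

import threading


class SyncExecutorSketch:
    """Sketch of exposing the neighbor-sync timeout instead of hard-coding 10 seconds."""

    def __init__(self, sync_timeout: float = 10.0):
        self.sync_timeout = sync_timeout  # how long to wait for all neighbors before giving up
        self.sync_waiter = threading.Event()
        self.neighbors = []
        self.neighbors_values = {}

    def wait_for_neighbors(self, iteration: int) -> bool:
        # wait for all neighbors to send their values for the current iteration;
        # return False if they don't arrive within sync_timeout so the caller can abort the job
        if len(self.neighbors_values.get(iteration, {})) < len(self.neighbors):
            return self.sync_waiter.wait(timeout=self.sync_timeout)
        return True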

# Store the received value in the neighbors_values dictionary
self.neighbors_values[iteration][sender] = self._from_message(data["value"])
# Check if all neighbor values have been received for the iteration
if len(self.neighbors_values[iteration]) >= len(self.neighbors):
Collaborator

Do we reset neighbors_values once we have all of them, ready for the next round?
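If not, one option (a fragment-level sketch; the right hook point depends on where the values are consumed) is to drop the finished iteration's entry once it has been used in the local update:

# Sketch: after the values for `iteration` have been consumed in the local update,
# remove them and re-arm the waiter so the dictionary doesn't grow round over round.
values = self.neighbors_values.pop(iteration, None)
self.sync_waiter.clear()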

Collaborator

@chesterxgchen left a comment

I added a few more comments.
