arbitrary shard counts #177

hai719 · 2025-12-08T07:51:18Z

What was changed

Support replication between Temporal servers with arbitrary shard counts.

Why?

Checklist

Closes
How was this tested:

Any docs updates needed?

temporal-nick · 2025-12-12T19:39:25Z

proxy/adminservice.go

-		reportStreamValue: reportStreamValue,
-		shardCountConfig:  shardCountConfig,
-		lcmParameters:     lcmParameters,
+		clusterConnection:  clusterConnection,


ClusterConnection is a huge context to pass around (MuxManager, inbound and outbound servers and clients, now ShardManager and intraProxyManager and Send/Ack channels). Can this code be written without this back-reference?

Refactored the structure and removed the use of ClusterConnection

temporal-nick · 2025-12-12T19:44:42Z

develop/config/nginx.conf

@@ -0,0 +1,43 @@
+worker_processes 1;


Are we using nginx in this change? I don't see it being started in main.go

It's only for local testing for multi-proxy case. I can remove it.

…ster

Move channel management and intra-proxy routing from ClusterConnection to ShardManager.

hehaifengcn · 2025-12-17T18:54:57Z

.github/workflows/pull-request.yml

        uses: azure/[email protected]
        with:
-          version: v3.17.3
+          version: v3.19.4


what is this for? do the change in a separate PR?

This is to fix the helm error that I didn't see in main branch:

Error: plugin is installed but unusable: failed to load plugin at "/home/runner/.local/share/helm/plugins/helm-unittest.git/plugin.yaml": error unmarshaling JSON: while decoding JSON: json: unknown field "platformHooks" Error: Process completed with exit code 1.```

hehaifengcn · 2025-12-17T18:56:07Z

cmd/proxy/main.go

 	}

+	// Add debug endpoint handler
+	http.HandleFunc("/debug/connections", func(w http.ResponseWriter, r *http.Request) {


Nice stuff but prefer having a separate PR for this

hehaifengcn · 2025-12-17T18:58:32Z

develop/config/cluster-b-mux-server-proxy-1.yaml

+# shardCount:
+#   mode: "lcm"
+#   localShardCount: 3
+#   remoteShardCount: 2


can we remove comment code? we can create separate config if needed.

hehaifengcn · 2025-12-17T18:59:11Z

develop/config/dynamic-config.yaml

+history.shardUpdateMinInterval:
+  - value: 1s
+history.ReplicationStreamSendEmptyTaskDuration:


what are these config for?

This is the sender side keep alive when there is no replication task. This will keep the replication stream alive and make sure receiver gets latest watermark.

Is that a new mandatory configuration that clients will need to enable?

Clients do not need to explicit enable it. The default value in OSS is 1 minute.

hehaifengcn · 2025-12-17T18:59:52Z

interceptor/translation_interceptor.go

 ) error {
+
+	i.logger.Debug("InterceptStream", tag.NewAnyTag("method", info.FullMethod))
+	// Skip translation for intra-proxy streams


Translation already happen at inbound/outbound server. The intraProxy stream will still go through inbound/outbound server. The translation is skipped here to avoid multiple translations.

temporal-nick · 2026-01-02T21:21:35Z

proxy/test/replication_failover_test.go

+		if s.proxyB != nil {
+			s.proxyB.Stop()
+		}
+	} else {
+		if s.loadBalancerA != nil {
+			s.loadBalancerA.Stop()
+		}


Does this need to be if-else if we're already nil-checking everything?

temporal-nick · 2026-01-02T21:26:21Z

proxy/test/tcp_proxy.go

+		Upstream   *Upstream
+	}
+
+	TCPProxy struct {


Could you add some comments describing what this proxy is for?

temporal-nick · 2026-01-02T21:45:10Z

proxy/admin_stream_transfer.go

+			for i, task := range attr.Messages.ReplicationTasks {
+				msg = append(msg, fmt.Sprintf("[%d]: %v", i, task.SourceTaskId))
+			}
+			f.logger.Info(fmt.Sprintf("forwarding ReplicationMessages: exclusive %v, tasks: %v", attr.Messages.ExclusiveHighWatermark, strings.Join(msg, ", ")))


This looks like a large data dump. How often will this log? Do we want to put it behind a ticker?

moved to debug level.

temporal-nick · 2026-01-02T22:00:47Z

proxy/adminservice.go

 	forwarder := newStreamForwarder(
 		s.adminClient,
-		targetStreamServer,
+		streamServer,
 		targetMetadata,
 		sourceClusterShardID,
 		targetClusterShardID,
 		s.metricLabelValues,
 		logger,
 	)
 	err = forwarder.Run()
 	if err != nil {
 		return err
 	}
 	// Do not try to transfer EOF from the source here. Just returning "nil" is sufficient to terminate the stream
 	// to the client.
 	return nil


This is the other half of "if s.shardCountConfig.Mode == config.ShardCountLCM" above. Let's put the body of that if statement into a method to match streamRouting and streamIntraProxyRouting, and then we can consolidate the dispatch into a single obvious place.

updated and move these into another file.

temporal-nick · 2026-01-02T22:06:20Z

proxy/adminservice.go

 	return nil
 }

+func (s *adminServiceProxyServer) streamIntraProxyRouting(


For a future PR: The stream routing logic has outgrown this adminservice file, which is supposed to be the adminservice handler. We should put streamIntraProxyRouting, streamRouting,streamLCM, and streamDirect together into a separate file.

temporal-nick · 2026-01-02T22:07:42Z

proxy/adminservice_test.go

 			resp, err := server.DescribeCluster(ctx, req)
 			s.NoError(err)
-			s.Equal(c.expResp, resp)
+			s.Equal(c.expResp.FailoverVersionIncrement, resp.FailoverVersionIncrement)


Leave a comment that reminds us what parts of the response aren't supposed to match

updated with proto.Equal()

temporal-nick · 2026-01-02T22:10:22Z

proxy/cluster_connection.go

-		targetShardCount int32
 		logger           log.Logger
+
+		clusterConnection *ClusterConnection


Looks like this is only used to get a reference to the shardManager. Can we just put that here?

temporal-nick · 2026-01-02T22:11:56Z

proxy/debug.go

+	"github.com/temporalio/s2s-proxy/transport/mux"
+	"github.com/temporalio/s2s-proxy/transport/mux/session"
+)
+


A comment here describing that this is the handler for the proxy debug endpoint would be helpful

temporal-nick · 2026-01-02T22:18:57Z

proxy/intra_proxy_router.go

+	// lastWatermark tracks the last watermark received from source shard for late-registering target shards
+	lastWatermarkMu sync.RWMutex
+	lastWatermark   *replicationv1.WorkflowReplicationMessages


Can we replace this with atomic.Pointer? It should be both higher-performance and simpler to use

revert lastWatermark atomic.Pointer change due to test failure.
We can redo the change after oss ReplicationStreamSendEmptyTaskDuration is available.

temporal-nick · 2026-01-02T22:33:24Z

proxy/proxy.go

+	RoutedAck struct {
+		TargetShard history.ClusterShardID
+		Req         *adminservice.StreamWorkflowReplicationMessagesRequest
+	}
+
+	// RoutedMessage wraps a replication response with originating client shard info
+	RoutedMessage struct {
+		SourceShard history.ClusterShardID
+		Resp        *adminservice.StreamWorkflowReplicationMessagesResponse
+	}


These seem orphaned. Do they belong in another file?

temporal-nick · 2026-01-02T22:52:48Z

proxy/admin_stream_transfer.go

+	shutdownChan := channel.NewShutdownOnce()
+	go func() {
+		if err := sender.Run(streamServer, shutdownChan); err != nil {
+			logger.Error("intraProxyStreamSender.Run error", tag.Error(err))
+		}
+	}()
+	<-shutdownChan.Channel()


Two problems:

This shutdownChan doesn't escape anywhere, so this is equivalent to just calling sender.Run()

A given clusterConnection and all its resources may be closed unilaterally from a config endpoint. Instead of creating a new shutdown channel, pass in the lifetime context when this handler is built during ClusterConnection and wait on that.

I think passing the lifetime through to sender.Run in place of shutdownChan should accomplish the need for shutdown in that function.

The replication stream has different lifetime, not same as clusterConnection.

Right. But, when the clusterConnection is terminated, the replication streams should close. There's a hierarchy of lifetimes:

Proxy -> []ClusterConnection -> MuxManager -> Inbound GRPC Servers -> []inbound AdminService Streams -> Outbound GRPC server -> []outbound AdminService Streams -> StreamManager

wired lifetime.

temporal-nick · 2026-01-02T22:54:49Z

proxy/admin_stream_transfer.go

+	shutdownChan := channel.NewShutdownOnce()
+	wg := sync.WaitGroup{}
+	wg.Add(2)
+	go func() {
+		defer wg.Done()
+		proxyStreamSender.Run(streamServer, shutdownChan)
+	}()
+	go func() {
+		defer wg.Done()
+		proxyStreamReceiver.Run(shutdownChan)
+	}()
+	wg.Wait()


Same point as above: Pass in the lifetime context instead of creating a new shutdownChan here.

same as above.

We can redo the change after oss ReplicationStreamSendEmptyTaskDuration is available.

temporal-nick · 2026-01-02T23:49:41Z

proxy/cluster_connection.go

+	getLCMParameters := func(shardCountConfig config.ShardCountConfig, inverse bool) LCMParameters {
+		if shardCountConfig.Mode != config.ShardCountLCM {
+			return LCMParameters{}
+		}
+		lcm := common.LCM(shardCountConfig.LocalShardCount, shardCountConfig.RemoteShardCount)
+		if inverse {
+			return LCMParameters{
+				LCM:              lcm,
+				TargetShardCount: shardCountConfig.LocalShardCount,
+			}
+		}
+		return LCMParameters{
+			LCM:              lcm,
+			TargetShardCount: shardCountConfig.RemoteShardCount,
+		}
+	}
+	getRoutingParameters := func(shardCountConfig config.ShardCountConfig, inverse bool, directionLabel string) RoutingParameters {


Do these need to be functions? It seems like you could just construct LCMParameters and RoutingParameters here.

Oh, I see they're called twice, but they each embed an if-else which has custom code for each branch. I'd recommend compressing this into a simple struct definition embedded into the serverConfiguration structs below.

These functions can later be extracted (like xxTranslations) into separated file to make this function clean.

temporal-nick · 2026-01-03T00:03:54Z

proxy/proxy_streams.go

+) {
+	// Terminate any previous local receiver for this shard
+	if r.shardManager != nil {
+		r.shardManager.TerminatePreviousLocalReceiver(r.sourceShardID, r.logger)


For future PR: Is it possible to avoid the need for this using context lifetimes and/or locking?

temporal-nick · 2026-01-03T00:15:42Z

proxy/shard_manager.go

+	}
+
+	// ShardManager manages distributed shard ownership across proxy instances
+	ShardManager interface {


Still reading through this. My initial impression is that this is a huge interface. Do we use all of these methods? Is there a higher-level interface we could expose that would remove ~50% of these?

temporal-nick · 2026-01-03T00:29:16Z

Makefile

 # Disable cgo by default.
 CGO_ENABLED ?= 0
-TEST_ARG ?= -race -timeout=5m -tags test_dep
+TEST_ARG ?= -race -timeout=15m -tags test_dep -count=1


Do the tests really take 15 minutes to run? I think we need to change or separate them if this is the case.

The test took 11m to finish when replication tests were run in sequence for easy debugging. Updated the tests to run in parallel.

temporal-nick

Current version is clean enough to continue iterating on in the main branch. Approving!

hai719 added 16 commits November 25, 2025 11:13

add debug endpoint; add test configs

6078a49

add shard manager

4b1eb9b

proxy routing

de6633b

route only in target proxy; remap taskID and track ack in proxy.

d797c52

update

1b20ad0

retry when shard not available

fbc4af4

add tags to log

9f5cd62

add intra proxy streams

b51520a

fix incorrect stream pair

85c6a97

add log for debugging

04046e5

fix issues

796bef6

fix streams

6e98448

add ring_max_size to debug

ccdb29d

fix panic; fix memberlist join issue

939d6b2

fix panic

4ed7e50

update regarding cluster_connection

5fb2b88

hai719 requested a review from temporal-nick December 8, 2025 17:29

Merge branch 'main' into hai719/routing

9a9535a

temporal-nick reviewed Dec 12, 2025

View reviewed changes

hai719 added 8 commits December 16, 2025 00:30

fix for cluster_conn; fix routing

4c716c9

fix test error: handle case when no replication task after shard regi…

cf59b2d

…ster

Refactor: move channel management to ShardManager

4813e58

Move channel management and intra-proxy routing from ClusterConnection to ShardManager.

Merge branch 'main' into hai719/routing

d2d52f2

update helm

286f51f

remove clusterConnection from adminServiceProxyServer

0b7b2d4

fix unit test

34e07db

fix test error

2f26e6f

hehaifengcn reviewed Dec 17, 2025

View reviewed changes

fix intra proxy streams; add connection debug info.

970b6ad

hai719 requested review from hehaifengcn and temporal-nick December 26, 2025 18:04

hai719 marked this pull request as ready for review December 26, 2025 18:08

hai719 requested a review from a team as a code owner December 26, 2025 18:08