RSDK-10794: Add inflight request limit for each resource #5133

jmatth · 2025-07-11T20:54:30Z

No description provided.

jmatth · 2025-07-14T14:20:27Z

This is ready for review but still needs some work. I left PR comments on the parts I still have questions about.

dgottlieb · 2025-07-14T14:58:09Z

robot/web/web_test.go

@@ -1565,7 +1565,7 @@ func TestPerRequestFTDC(t *testing.T) {
 	// We can assert that there are two counters in our stats. The fact that `GetEndPosition` was
 	// called once and we hence spent (negligible) time in that RPC call.
 	stats := svc.RequestCounter().Stats().(map[string]int64)
-	test.That(t, len(stats), test.ShouldEqual, 4)
+	test.That(t, len(stats), test.ShouldEqual, 5)


Where did the extra stat come from? Is that the curr counter we use for comparing against the new limit?

Yes, it's the new in-flight request counter. Originally it was the maximum number of concurrent in-flight requests over the last minute; with the latest changes it's just the count of in-flight requests at the time .Stats() is called.

there are two counters in our stats

Can we update the comment above this assertion to explain why len(stats) == 5?

dgottlieb · 2025-07-14T15:49:37Z

robot/web/web.go

 type RequestCounter struct {
-	requestKeyToStats sync.Map
+	requestKeyToStats ssync.Map[string, *requestStats]
+	limiter           requestLimiter


I had to pull the PR to grok everything here. There are lots of new nouns. I think the core of it is that:

requestKeyToStats is keyed on the (resource name, api method) pair

limiter has a map internally that is just keyed on the resource name.

requestLimiter has a map where the value of that map is requestCounter

I'm going to softly float a layer of denormalization and some renaming:

// RequestCounter is used to track and limit incoming requests. It instruments every unary and
// streaming request coming in from both external clients and internal modules.
type RequestCounter struct {
	// requestKeyToStats maps individual API calls for each resource to a set of metrics. E.g:
	// `motor-foo.IsPowered` and `motor-foo.GoFor` would each have their own set of stats.
	requestKeyToStats ssync.Map[string, *requestStats]

	// inFlightRequests maps resource names to how many in flight requests are currently targeting
	// that resource name. There can only be `limit` API calls for any resource. E.g: `motor-foo`
	// can have 50 concurrent `IsPowered` calls with 50 more `GoFor` calls, or instead 100
	// `IsPowered` calls, before it starts to reject new incoming requests.
	inFlightRequests ssync.Map[string, *inFlightCounter]

	// limit is controlled with the `VIAM_RESOURCE_REQUESTS_LIMIT` env variable. Its default is 100.
	limit int
}

And add something about how streaming RPCs count against limits? I suppose each open stream is just counted as a single "in flight" request?

Out of curiosity -- does internal signaling count against these limits?

Based on my observation about the purpose of curr and max, I also feel inFlightCounter just becomes an *atomic.Int64. Maybe it can be a non-pointer, but I'm not 100% sure about the limitations of sync.Map.
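To make the "just an atomic int64" suggestion concrete, here is a minimal sketch of a per-resource in-flight counter with a limit. The type and method names (`inFlightLimiter`, `tryIncr`, `decr`) are hypothetical and are not taken from the PR; the sketch only assumes that a request reserves a slot before running and releases it afterwards.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// inFlightLimiter is a hypothetical stand-in for one resource's counter.
// It is not the PR's type; it only illustrates the atomic-only approach.
type inFlightLimiter struct {
	inFlight atomic.Int64
	limit    int64
}

// tryIncr reserves a slot and reports whether the request may proceed.
// It adds first and rolls back on overshoot, so no mutex is needed.
func (l *inFlightLimiter) tryIncr() bool {
	if l.inFlight.Add(1) > l.limit {
		l.inFlight.Add(-1)
		return false
	}
	return true
}

// decr releases a slot taken by a successful tryIncr.
func (l *inFlightLimiter) decr() {
	l.inFlight.Add(-1)
}

func main() {
	l := &inFlightLimiter{limit: 2}
	fmt.Println(l.tryIncr(), l.tryIncr(), l.tryIncr()) // true true false
	l.decr()
	fmt.Println(l.tryIncr()) // true
}
```

With this shape, reporting the current value for FTDC would be a single `Load()` call, which is why a separate `max` field and mutex may not be needed.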

Out of curiosity -- does internal signaling count against these limits?

No. Currently we are only counting and limiting requests to APIs that start with /viam.component, /viam.service, or /viam.robot and have a resource name*, or that have no resource name but are going to a /viam.robot.v1.RobotService/ API. As far as I can tell, internal signalling lives under proto.rpc.webrtc.v1.SignalingService/.

*: Or at least have a field called name that we assume is referring to a resource, or a field called controller that we assume is a resource for input controller APIs.
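The rule described in this reply can be sketched as a small predicate. This is an illustration only: `shouldLimit` is a hypothetical helper, the resource name is passed in directly rather than extracted from a message, and the method names in `main` are sample values.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldLimit is a hypothetical predicate mirroring the rule described above.
// resourceName is "" when the request carries no recognizable resource name.
func shouldLimit(fullMethod, resourceName string) bool {
	if resourceName != "" {
		// Named resources are limited on component, service, and robot APIs.
		for _, prefix := range []string{"/viam.component", "/viam.service", "/viam.robot"} {
			if strings.HasPrefix(fullMethod, prefix) {
				return true
			}
		}
		return false
	}
	// Without a resource name, only the robot service itself is limited.
	return strings.HasPrefix(fullMethod, "/viam.robot.v1.RobotService/")
}

func main() {
	fmt.Println(shouldLimit("/viam.component.motor.v1.MotorService/IsMoving", "motor-foo")) // true
	fmt.Println(shouldLimit("/viam.robot.v1.RobotService/ResourceNames", ""))              // true
	fmt.Println(shouldLimit("/proto.rpc.webrtc.v1.SignalingService/Call", ""))             // false
}
```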

dgottlieb · 2025-07-14T15:55:15Z

robot/web/web.go

-	})
+	for k, v := range rc.limiter.counts.Range {
+		max := v.max
+		v.max = v.curr


I don't understand what's happening here. This is not a pattern I've come across. I think you just want to use an atomic int64 here and do away with max and the mutex.

jmatth · 2025-07-15T21:01:23Z

robot/web/web.go

 		requestKey := buildRCKey(m, w.apiMethod)
-		w.requestKey.Store(&requestKey)
+		w.requestKey.Store(&streamRequestKey{


@dgottlieb Benji's question got me taking a closer look at this. Is it a problem that we just call Load here instead of calling Swap and then checking whether we lost the race?

Responded above thinking both you and Benji will be notified there

jmatth · 2025-07-16T21:38:49Z

Ok, I think this is ready for review again. We're now just reporting the number of in flight requests whenever FTDC happens to write rather than trying to track the max in between each FTDC write, and I added a special case for the one service Bohdan found that uses a different field name to refer to resources in its messages. Also there are tests.

I'd like to figure out a better system for identifying resources in gRPC messages but in the interest of helping everyone who's suddenly hitting stream limits I'd like to merge this and iterate on it as necessary later.

dgottlieb

This looks great to me. I'm most concerned with documenting some of the business logic. All of the choices look very reasonable, just need some clarity on "must be done this way" versus "being defensive against the future unknowns".

I also think the request counter stuff is outgrowing this file. But I don't want that kind of code movement to muddle/slow down this PR.

dgottlieb · 2025-07-16T22:53:52Z

robot/web/web.go

+	}
+}
+
+func (rc *RequestCounter) ensureKey(resource string) *atomic.Int64 {


Can this be simplified down to:

func (rc *RequestCounter) ensureKey(resource string) *atomic.Int64 {
	counter, _ := rc.inFlightRequests.LoadOrStore(resource, &atomic.Int64{})
	return counter
}

?

It could. I wrote it this way to avoid allocating an extra atomic.Int64 for every call after the first that just creates GC pressure. If you think that's not worth worrying about we can simplify the function.

Ah, that makes sense! Please leave a comment
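For readers following along, the allocation-avoiding variant discussed here can be sketched with the standard library's untyped `sync.Map`. This is an assumption-laden illustration: the `counters` type and `ensure` method are hypothetical, and the PR itself uses a typed `ssync.Map` whose exact API is not shown in this thread.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counters is a hypothetical wrapper; the stdlib sync.Map stands in for
// the typed map used in the PR.
type counters struct {
	m sync.Map // string -> *atomic.Int64
}

// ensure returns the counter for resource, creating it on first use.
func (c *counters) ensure(resource string) *atomic.Int64 {
	// Fast path: after the first call this returns without allocating.
	if v, ok := c.m.Load(resource); ok {
		return v.(*atomic.Int64)
	}
	// Slow path: allocate once. LoadOrStore resolves races between
	// concurrent first callers, so everyone gets the same pointer.
	v, _ := c.m.LoadOrStore(resource, &atomic.Int64{})
	return v.(*atomic.Int64)
}

func main() {
	var c counters
	a := c.ensure("motor-foo")
	a.Add(1)
	b := c.ensure("motor-foo")
	fmt.Println(a == b, b.Load()) // true 1
}
```

Calling `LoadOrStore` directly would be equally correct; the extra `Load` only avoids constructing a throwaway `atomic.Int64` on every call after the first.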

dgottlieb · 2025-07-16T23:06:37Z

robot/web/web.go

+	// streaming RPCs both count against the limit.`limit` defaults to 100 but
+	// can be configured with the `VIAM_RESOURCE_REQUESTS_LIMIT`
+	// environment variable.
+	inFlightRequests ssync.Map[string, *atomic.Int64]


What Josh is saying about copying being illegal is correct. Go chose to make atomic objects "values" rather than pointers under the hood. So one can embed a mutex or atomic.Int64 without needing to allocate on the heap.

But those can no longer be copied. When a mutex, for example, is "copied" there are now two mutexes. Each can be individually locked/unlocked.

I don't think immutability is relevant here. If one wanted an immutable int64 -- there's no need to make it atomic -- the two properties cannot meaningfully coexist. It's perfectly fine for any number of threads to concurrently read from the same memory address. There's no reason to have an atomic if one never intends to store into it.

dgottlieb · 2025-07-16T23:08:55Z

robot/web/web.go

+	}
+}
+
+func (rc *RequestCounter) ensureKey(resource string) *atomic.Int64 {


We should disambiguate the method name ensureKey. If it makes sense to have ensureKey initialize map entries for both the stats and inflight maps, the name ensureKey makes sense.

Otherwise we should choose something that makes it more obvious that we're only ensuring inFlightRequests has a given key.

dgottlieb · 2025-07-16T23:10:35Z

robot/web/web.go

-	rc.requestKeyToStats.Range(func(requestKeyI, requestStatsI any) bool {
-		requestKey := requestKeyI.(string)
-		requestStats := requestStatsI.(*requestStats)
+	for requestKey, requestStats := range rc.requestKeyToStats.Range {


ah, so this is a legit usage of the new language feature for custom implemented range/iterator functions?

Correct. Both of the function signatures it supports are declared and documented in the iter package.

dgottlieb · 2025-07-16T23:13:15Z

robot/web/web.go

+	name      string
+}
+
+func extractViamAPI(fullMethod string) apiMethod {
 	// Extract Service and Method name from `fullMethod` values such as:
 	// - `/viam.component.motor.v1.MotorService/IsMoving` -> MotorService/IsMoving


Can you update the comment here so the example return values are in terms of the new apiMethod type?
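To show what "in terms of a struct" could look like, here is a minimal parsing sketch. The `parsedMethod` type and `parseFullMethod` function are hypothetical and do not reproduce the PR's `apiMethod` type or its field set; only the input format `/package.Service/Method` is taken from the comment above.

```go
package main

import (
	"fmt"
	"strings"
)

// parsedMethod is a hypothetical result type, not the PR's apiMethod.
type parsedMethod struct {
	service string // e.g. "viam.component.motor.v1.MotorService"
	name    string // e.g. "IsMoving"
}

// parseFullMethod splits a gRPC full method string of the form
// "/package.Service/Method" into its service and method parts.
func parseFullMethod(fullMethod string) (parsedMethod, bool) {
	rest, ok := strings.CutPrefix(fullMethod, "/")
	if !ok {
		return parsedMethod{}, false
	}
	service, name, ok := strings.Cut(rest, "/")
	if !ok || service == "" || name == "" {
		return parsedMethod{}, false
	}
	return parsedMethod{service: service, name: name}, true
}

func main() {
	p, ok := parseFullMethod("/viam.component.motor.v1.MotorService/IsMoving")
	fmt.Println(p.service, p.name, ok)
	// viam.component.motor.v1.MotorService IsMoving true
}
```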

robot/web/web_test.go

dgottlieb · 2025-07-17T00:25:55Z

robot/web/web.go

@@ -648,10 +767,18 @@ func (rc *RequestCounter) UnaryInterceptor(
 	ctx context.Context, req any, info *googlegrpc.UnaryServerInfo, handler googlegrpc.UnaryHandler,
 ) (resp any, err error) {
 	apiMethod := extractViamAPI(info.FullMethod)
-
+	if resource := buildResourceLimitKey(req, apiMethod); resource != "" {


I wonder if we want this after the stat increments. Such that a resource limit exceeded error would register as both:

A call to the RPC method

An error being returned from the RPC method
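The ordering proposed here can be sketched without any gRPC dependency. Everything below is hypothetical: `methodStats`, `intercept`, and `errLimitExceeded` are stand-ins invented for the example, and plain integers are used only because the sketch is single-goroutine (a real interceptor would need atomics).

```go
package main

import (
	"errors"
	"fmt"
)

// methodStats is a hypothetical per-method stats record.
type methodStats struct {
	calls int64
	errs  int64
}

// errLimitExceeded is a hypothetical stand-in for the PR's limit error.
var errLimitExceeded = errors.New("resource request limit exceeded")

// intercept counts the call first and checks the limit second, so a
// rejected request registers as both a call and an error.
func intercept(stats *methodStats, admit func() bool, handler func() error) error {
	stats.calls++
	if !admit() {
		stats.errs++
		return errLimitExceeded
	}
	err := handler()
	if err != nil {
		stats.errs++
	}
	return err
}

func main() {
	var s methodStats
	reject := func() bool { return false }
	err := intercept(&s, reject, func() error { return nil })
	fmt.Println(err, s.calls, s.errs) // resource request limit exceeded 1 1
}
```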

dgottlieb · 2025-07-17T00:41:26Z

robot/web/web.go

+			rc:           rc,
+			requestKey:   atomic.Pointer[streamRequestKey]{},
+		}
+		defer wrappedStream.tryDecr()


StreamInterceptors hurt my head, so forgive me if this is a dumb question. Based on the code:

Stream handler(srv, &wrappedStream) methods only return when the stream is closed?

And during that time there may be multiple RecvMsg and SendMsg calls?

If we're only interested in resource limiting an entire stream -- should we increment here for simplicity? Or does the first RecvMsg add an important element with regards to the resource limit key we're tracking? I assume that's where we learn of the resource name.

I feel this unconditional tryDecr is very close to being an accidental bug. If tryIncr fails because of a resource limit, we do not want to decrement. And I believe we correctly don't decrement. But only because tryDecr checks if the requestKey was set. And right now, we only set the requestKey if we successfully incremented.

None of that is obvious. And if we wanted to go down a path where we do increment stats, despite failing to increment, we'd now introduce a double decrementing bug.

Part of me wants to just ignore streaming APIs and do it as a low priority follow-up. I know that'd be losing some of the parity we have today (this change isn't trying to maintain full parity). But we don't have good examples of streaming APIs being problematic and reasoning about streams is much harder.

Trying to come up with a more guided actionable. Choose one of these three:

Think if there's a safer way to know when to decrement

If not, definitely document the assumption tryDecr is making when an increment fails.

If you're actually just on board with undoing the stream limiting for this PR, I'm good with that.

I have undone the stream limiting for this PR, but to somewhat close the loop here: I traced a callstack for a streaming API call and our WebRTC code closes the stream when the handler returns. I'm not sure what normal gRPC over HTTP2 does but with our WebRTC system I'm pretty confident saying that if a handler spins off a goroutine to handle a stream and then returns, the stream will stop working.
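For a possible follow-up, one option for "a safer way to know when to decrement" is to have the increment return its own release function, so the decrement cannot run unless the increment succeeded. This is a sketch of a general pattern, not code from the PR; `limiter` and `acquire` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// limiter is a hypothetical single-resource limiter.
type limiter struct {
	inFlight atomic.Int64
	limit    int64
}

// acquire tries to reserve a slot. The returned release function is always
// safe to call or defer: it is a no-op when acquire failed, and it
// decrements at most once when acquire succeeded.
func (l *limiter) acquire() (release func(), ok bool) {
	if l.inFlight.Add(1) > l.limit {
		l.inFlight.Add(-1)
		return func() {}, false
	}
	var once sync.Once
	return func() { once.Do(func() { l.inFlight.Add(-1) }) }, true
}

func main() {
	l := &limiter{limit: 1}
	rel1, ok1 := l.acquire()
	rel2, ok2 := l.acquire()
	rel2() // no-op: the second acquire was rejected
	rel1()
	rel1() // no-op: already released
	fmt.Println(ok1, ok2, l.inFlight.Load()) // true false 0
}
```

Because the decrement is tied to the successful increment rather than to whether a request key was set, changing what gets recorded on failure would not introduce a double decrement.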

dgottlieb · 2025-07-17T00:53:12Z

robot/web/web.go

+	return method.name
+}
+
+func buildResourceLimitKey(clientMsg any, method apiMethod) string {


// buildResourceLimitKey returns a string of the form:
// - `foo-motor.viam.component.motor.v1.MotorService` for requests with a `name` field hitting a Viam resource API or
// - `viam.robot.v1.RobotService` for requests on the robot service directly without a `name` field

I'm happy pretending controllers don't meaningfully exist here.
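The two key forms in the suggested docstring can be illustrated with a short sketch. `limitKey` is a hypothetical helper that takes the already-extracted service and resource name as strings; it does not inspect a protobuf message the way the PR's function does, and it ignores the input controller special case.

```go
package main

import "fmt"

// limitKey is a hypothetical helper producing keys of the two documented
// forms. It returns "" when the request should not be limited.
func limitKey(service, resourceName string) string {
	if resourceName != "" {
		// e.g. "foo-motor.viam.component.motor.v1.MotorService"
		return resourceName + "." + service
	}
	if service == "viam.robot.v1.RobotService" {
		// Robot service calls without a name share one key.
		return service
	}
	return ""
}

func main() {
	fmt.Println(limitKey("viam.component.motor.v1.MotorService", "foo-motor"))
	// foo-motor.viam.component.motor.v1.MotorService
	fmt.Println(limitKey("viam.robot.v1.RobotService", ""))
	// viam.robot.v1.RobotService
}
```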

dgottlieb · 2025-07-17T00:56:33Z

robot/web/web_test.go

+	test.That(t, stats[statsKey], test.ShouldEqual, 0)
+}
+
+func TestPerResourceLimitsAndFTDC(t *testing.T) {


This test seemed pretty onerous to instrument. I imagine doing the same for a streaming example would be more so?

Support for stream APIs has been removed for now.

cheukt · 2025-07-18T16:00:06Z

robot/web/web.go

-
+	if resource := buildResourceLimitKey(req, apiMethod); resource != "" {
+		if ok := rc.incrInFlight(resource); !ok {
+			return nil, &RequestLimitExceededError{


Drive-by, but can we log on the server side too? If we're worried about spammy logs, we can make sure to only log every 30s or 1min.

I hope we haven't already come up with a reason to circumvent the existing log deduplication code?

jmatth · 2025-07-22T15:14:01Z

resource/utils.go

This is a modified version of a file in Bohdan's ongoing jobmanager PR (#5104).

jmatth added 5 commits July 11, 2025 15:26

RSDK-10794: Add inflight request limit for each resource

b5b2f5a

Adding docstrings, moving from channels to simple ints with mutexes

05271f7

Slightly better naming

d506998

Moving request limit initialization to web service start

0915a14

Removing unnecessary slices use, limiting all internal apis

2ca4b09

viambot added the safe to test label Jul 11, 2025

jmatth commented Jul 11, 2025

jmatth commented Jul 11, 2025

Moving request limit setup to web svc constructor

ef3f02c

viambot added safe to test and removed safe to test labels Jul 11, 2025

Fixing test

101758b

viambot added safe to test and removed safe to test labels Jul 11, 2025

Make max connection tracking less reliable to keep the race detector …

aac474f

…happy

viambot added safe to test and removed safe to test labels Jul 14, 2025

jmatth commented Jul 14, 2025

jmatth requested review from benjirewis and dgottlieb July 14, 2025 14:16

jmatth marked this pull request as ready for review July 14, 2025 14:16

dgottlieb reviewed Jul 14, 2025

Stop tracking max concurrent connections and just use current value w…

fc50889

…hen writing FTDC stats

viambot added safe to test and removed safe to test labels Jul 14, 2025

jmatth added 2 commits July 14, 2025 14:47

Including api information in resource name

fe236df

Refacor limit implementation

849272e

viambot added safe to test and removed safe to test labels Jul 14, 2025

viambot added the safe to test label Jul 15, 2025

jmatth commented Jul 15, 2025

Centralize api path parsing, special case inputcontroller messages

bdc72bc

viambot added safe to test and removed safe to test labels Jul 16, 2025

jmatth requested review from benjirewis and dgottlieb July 16, 2025 21:38

dgottlieb reviewed Jul 17, 2025

cheukt reviewed Jul 18, 2025

jmatth added 10 commits July 21, 2025 10:20

Unskip test

9b6bae2

Much less ambiguous name

11aabc4

Update comment

2593276

Moving request counter into separate file

c00bdf0

Removing request limits from stream APIs

b9493e4

Use central source of truth for resource name extraction

4840855

Document reasoning

458fe85

Move limit exceeded error between files

471f47d

Deduplicate webService constructor

cd758b8

Log when request limits are exceeded

f725515

viambot added safe to test and removed safe to test labels Jul 22, 2025

jmatth requested a review from dgottlieb July 22, 2025 15:09

jmatth commented Jul 22, 2025

s/namespace/service, fix incorrect dilimiter in string construction

725a8a0

viambot added safe to test and removed safe to test labels Jul 22, 2025

dgottlieb approved these changes Jul 22, 2025

jmatth merged commit e933632 into viamrobotics:main Jul 22, 2025
32 of 34 checks passed

jmatth deleted the RSDK-10794 branch July 22, 2025 17:24
