-
Notifications
You must be signed in to change notification settings - Fork 557
improvement(client-presence): improve join response scalability and reliability #25371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the scalability and reliability of the join response mechanism in the presence framework by implementing delayed broadcasting, audience-based selection, and enhanced tracking capabilities.
Key changes:
- Implements delayed join responses with configurable timing (200ms for named responders, 40ms increments for backup responders)
- Adds comprehensive telemetry tracking with events for join deferrals, requests, and responses
- Updates test infrastructure to support scale testing with higher client counts (up to 100 clients) when
FLUID_TEST_SCALE=true
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/pipelines/test-service-clients.yml | Adds environment variable to enable scale testing in CI |
| tools/pipelines/test-real-service.yml | Adds pipeline documentation link |
| packages/service-clients/end-to-end-tests/azure-client/src/test/multiprocess/presenceTest.spec.ts | Updates test configuration for conditional scale testing and removes previous test skip |
| packages/framework/presence/src/test/schemaValidation/protocol.spec.ts | Updates test expectations for new join response protocol with delays and response tracking |
| packages/framework/presence/src/test/presenceDatastoreManager.spec.ts | Comprehensive test updates for new join handling logic including delayed responses and audience management |
| packages/framework/presence/src/protocol.ts | Adds joinResponseFor field to track which clients a response satisfies |
| packages/framework/presence/src/presenceManager.ts | Updates event handling and adds disconnect callback to datastore manager |
| packages/framework/presence/src/presenceDatastoreManager.ts | Major implementation changes for delayed joins, audience-based responder selection, and enhanced broadcast timing |
| .changeset/floppy-sides-hammer.md | Documents the feature addition for improved join scalability |
alexvy86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving for docs with a suggestion. Didn't look at the rest in detail.
| // Remove self and non-interactive members from possibilities | ||
| othersWorthIgnoring.push(selfClientId); | ||
| for (const clientIdToDelete of othersWorthIgnoring) { | ||
| quorumMembers.delete(clientIdToDelete); | ||
| otherMembers.delete(clientIdToDelete); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: pushing selfClientID to othersWorthIgnoring is a bit confusing here. Maybe could replace the line with quorumMembers.delete(selfClientId) instead if I understand correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could but getAudienceInformation doesn't guarantee that self is removed. It happens to do so currently. This seems more robust. The comment didn't make it clear?
fea36cd to
c0e1555
Compare
872a3ef to
eb8ab6e
Compare
|
🔗 Found some broken links! 💔 Run a link check locally to find them. See linkcheck output |
eb8ab6e to
a1a27a5
Compare
…eliability - In broadcast join responses: - include list of Client Connection Ids that the broadcast response covers. This will prevent hole where existing client sees a broadcast result too soon after a Join message and assumes it contains all sufficient and timely data. - potential join respondents will only reply if they have received a join response (or appear to have same or more requests that other audience members indicating complete knowledge) - Increase trust of Audience (dictated by service) and select some of its member for named (primary) join responders when no Quorum members are known. - Delay Join response messages for scalability - *Important* Do not broadcast a join response immediately even when selected as a named responder. Wait 200ms for others that may join. - Add telemetry events for join handling: - `JoinRequested` - client has broadcast Join signal - `JoinResponse` - client is responding to join request Each event includes attendee and connection ids. `JoinResponse` additionally contains lists for whom they are responding to. Update tests for higher client counts and delayed join responses and enable previously failing test. Limit higher client count (longer duration) tests to dedicated CI pipelines.
a1a27a5 to
cce7cb9
Compare
JoinRequested- client has broadcast Join signalJoinResponse- client is responding to join requestEach event includes attendee and connection ids.
JoinResponseadditionally contains lists for whom they are responding to.Update tests for higher client counts and delayed join responses and enable previously failing test. Limit higher client count (longer duration) tests to dedicated CI pipelines.
AB#45620