@jason-ha
Contributor

@jason-ha jason-ha commented Sep 2, 2025

  • In broadcast join responses:
    • Include the list of client connection ids that the broadcast response covers. This closes a hole where an existing client sees a broadcast response too soon after a Join message and assumes it contains all sufficient and timely data.
    • Potential join respondents reply only if they have received a join response (or appear to have the same or more requests than other audience members, indicating complete knowledge).
  • Increase trust of the Audience (dictated by the service) and select some of its members as named (primary) join responders when no Quorum members are known.
  • Delay Join response messages for scalability:
    • Important: Do not broadcast a join response immediately, even when selected as a named responder. Wait 200ms for others that may join.
  • Add telemetry events for join handling:
    • JoinRequested - client has broadcast Join signal
    • JoinResponse - client is responding to join request
      Each event includes attendee and connection ids. JoinResponse additionally lists the clients being responded to.
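
As a rough illustration of the delayed, coverage-listing response described above (not the code in this PR): a responder can collect join requests for a short window, drop any joiner that another client's response already covers, and then broadcast once with the list of connection ids it is answering. Every name below (`createJoinResponseBatcher`, `Schedule`, the shape of `JoinResponse`) is hypothetical; only the `joinResponseFor` field name and the 200ms wait come from this PR.

```typescript
// Illustrative sketch only; these names are hypothetical, not the presence package's API.
type ClientConnectionId = string;

interface JoinResponse {
	// Connection ids of the joining clients that this broadcast is meant to satisfy.
	joinResponseFor: ClientConnectionId[];
}

// Injected timer so the delay can be faked in tests, e.g. (cb, ms) => setTimeout(cb, ms).
type Schedule = (callback: () => void, delayMs: number) => void;

function createJoinResponseBatcher(
	broadcast: (response: JoinResponse) => void,
	schedule: Schedule,
	delayMs: number = 200,
) {
	const pending = new Set<ClientConnectionId>();
	let timerArmed = false;

	function flush(): void {
		timerArmed = false;
		if (pending.size === 0) {
			return; // Every joiner we knew about was already answered by someone else.
		}
		const joinResponseFor = Array.from(pending);
		pending.clear();
		broadcast({ joinResponseFor });
	}

	return {
		// The first request arms the timer; later requests only accumulate.
		onJoinRequested(requestor: ClientConnectionId): void {
			pending.add(requestor);
			if (!timerArmed) {
				timerArmed = true;
				schedule(flush, delayMs);
			}
		},
		// A response seen from another client covers some joiners; skip those.
		onJoinResponseSeen(response: JoinResponse): void {
			response.joinResponseFor.forEach((id) => pending.delete(id));
		},
	};
}
```

Injecting the scheduler keeps the sketch deterministic under test; a real client would presumably pass a `setTimeout`-based scheduler.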

Update tests for higher client counts and delayed join responses, and enable a previously failing test. Limit higher-client-count (longer-duration) tests to dedicated CI pipelines.

AB#45620

Copilot AI review requested due to automatic review settings September 2, 2025 21:35
@jason-ha jason-ha requested a review from a team as a code owner September 2, 2025 21:35
@jason-ha jason-ha requested a review from WillieHabi September 2, 2025 21:35
@github-actions github-actions bot added the area: build, area: framework, changeset-present, and base: main labels Sep 2, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR improves the scalability and reliability of the join response mechanism in the presence framework by implementing delayed broadcasting, audience-based selection, and enhanced tracking capabilities.

Key changes:

  • Implements delayed join responses with configurable timing (200ms for named responders, 40ms increments for backup responders)
  • Adds comprehensive telemetry tracking with events for join deferrals, requests, and responses
  • Updates test infrastructure to support scale testing with higher client counts (up to 100 clients) when FLUID_TEST_SCALE=true
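
To make the timing in the first bullet concrete, here is a minimal sketch of one way such delays could be computed. The 200ms and 40ms figures come from the summary above; the function name, its parameters, and the assumption that backup responders are staggered after the named-responder window by their position are illustrative guesses, not something this PR confirms.

```typescript
// Values taken from the review summary above.
const NAMED_RESPONDER_DELAY_MS = 200;
const BACKUP_RESPONDER_STEP_MS = 40;

// Hypothetical helper: how long a client waits before broadcasting a join response.
// `backupOrder` is the client's zero-based position among backup responders and is
// ignored for named responders.
function joinResponseDelayMs(isNamedResponder: boolean, backupOrder: number): number {
	if (isNamedResponder) {
		return NAMED_RESPONDER_DELAY_MS;
	}
	// Assumption: backups wait past the named window, spaced 40ms apart, so each
	// one gets a chance to see an earlier response and stay silent.
	return NAMED_RESPONDER_DELAY_MS + BACKUP_RESPONDER_STEP_MS * (backupOrder + 1);
}
```

Under these assumptions a named responder waits 200ms, the first backup 240ms, the second 280ms, and so on.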

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tools/pipelines/test-service-clients.yml | Adds environment variable to enable scale testing in CI |
| tools/pipelines/test-real-service.yml | Adds pipeline documentation link |
| packages/service-clients/end-to-end-tests/azure-client/src/test/multiprocess/presenceTest.spec.ts | Updates test configuration for conditional scale testing and removes previous test skip |
| packages/framework/presence/src/test/schemaValidation/protocol.spec.ts | Updates test expectations for new join response protocol with delays and response tracking |
| packages/framework/presence/src/test/presenceDatastoreManager.spec.ts | Comprehensive test updates for new join handling logic including delayed responses and audience management |
| packages/framework/presence/src/protocol.ts | Adds joinResponseFor field to track which clients a response satisfies |
| packages/framework/presence/src/presenceManager.ts | Updates event handling and adds disconnect callback to datastore manager |
| packages/framework/presence/src/presenceDatastoreManager.ts | Major implementation changes for delayed joins, audience-based responder selection, and enhanced broadcast timing |
| .changeset/floppy-sides-hammer.md | Documents the feature addition for improved join scalability |
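
The pipeline and end-to-end test changes listed above gate the larger client counts on an environment variable. A minimal sketch of that kind of gating follows; `FLUID_TEST_SCALE` and the 100-client ceiling come from this review, while the function name and the baseline counts are made up for illustration.

```typescript
// Hypothetical helper: choose which client counts a test suite should exercise.
// The environment is passed in so the function stays pure and easy to test;
// a real suite would presumably pass process.env.
function clientCountsToTest(env: Record<string, string | undefined>): number[] {
	const baselineCounts = [5, 20]; // Illustrative defaults, not the real test values.
	const scaleCounts = [100]; // Upper bound mentioned in the review summary.
	return env.FLUID_TEST_SCALE === "true"
		? baselineCounts.concat(scaleCounts)
		: baselineCounts;
}
```

Keeping the long-running counts behind the flag means ordinary CI runs stay short, and only the dedicated pipeline that sets the variable pays for the larger runs.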

Contributor

@alexvy86 alexvy86 left a comment

Approving for docs with a suggestion. Didn't look at the rest in detail.

Comment on lines 302 to 307
// Remove self and non-interactive members from possibilities
othersWorthIgnoring.push(selfClientId);
for (const clientIdToDelete of othersWorthIgnoring) {
	quorumMembers.delete(clientIdToDelete);
	otherMembers.delete(clientIdToDelete);
}
Contributor

nit: pushing selfClientId to othersWorthIgnoring is a bit confusing here. Maybe replace the line with quorumMembers.delete(selfClientId) instead, if I understand correctly.

Contributor Author

We could, but getAudienceInformation doesn't guarantee that self is removed; it just happens to do so currently. This seems more robust. Did the comment not make that clear?
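
A small sketch of the point being made in this thread: removing self from both candidate sets works whichever set self happens to be in, whereas deleting only from the quorum set relies on the other set never containing self. The variable names mirror the quoted snippet; the wrapper function, its signature, and the use of plain strings for client ids are assumptions for illustration.

```typescript
// Hypothetical wrapper around the quoted loop. Returns filtered copies rather
// than mutating the inputs, which differs from the snippet under review.
function candidateResponders(
	quorumMembers: Set<string>,
	otherMembers: Set<string>,
	othersWorthIgnoring: string[],
	selfClientId: string,
): { quorumMembers: Set<string>; otherMembers: Set<string> } {
	const quorum = new Set(quorumMembers);
	const others = new Set(otherMembers);
	// Treat self like any other ignorable member so it is removed from BOTH
	// sets, without assuming the audience lookup already excluded it.
	othersWorthIgnoring.concat(selfClientId).forEach((clientIdToDelete) => {
		quorum.delete(clientIdToDelete);
		others.delete(clientIdToDelete);
	});
	return { quorumMembers: quorum, otherMembers: others };
}
```

With only `quorumMembers.delete(selfClientId)`, a self entry left in `otherMembers` would survive and this client could end up selecting itself as a responder.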

@github-actions github-actions bot added the area: examples and area: runtime labels Sep 3, 2025
@jason-ha jason-ha requested a review from WillieHabi September 5, 2025 23:00
@jason-ha jason-ha requested a review from a team as a code owner September 17, 2025 01:01
@github-actions github-actions bot added the area: loader and public api change labels Sep 17, 2025
@jason-ha jason-ha force-pushed the test/presence/support-large-client-count-join branch from fea36cd to c0e1555 Compare September 23, 2025 16:33
@jason-ha
Contributor Author

JsonString related changes will be merged separately. Currently PRs #25501 and #25522.

@jason-ha jason-ha force-pushed the test/presence/support-large-client-count-join branch 2 times, most recently from 872a3ef to eb8ab6e Compare October 27, 2025 04:29
@github-actions
Contributor

🔗 Found some broken links! 💔

Run a link check locally to find them. See
https://github.com/microsoft/FluidFramework/wiki/Checking-for-broken-links-in-the-documentation for more information.

linkcheck output


> [email protected] ci:check-links /home/runner/work/FluidFramework/FluidFramework/docs
> start-server-and-test "npm run serve -- --no-open" 3000 check-links

1: starting server using command "npm run serve -- --no-open"
and when url "[ 'http://127.0.0.1:3000' ]" is responding with HTTP status code 200
running tests using command "npm run check-links"


> [email protected] serve
> docusaurus serve --no-open

[SUCCESS] Serving "build" directory at: http://localhost:3000/

> [email protected] check-links
> linkcheck http://localhost:3000 --skip-file skipped-urls.txt

 ELIFECYCLE  Command failed with exit code 1.

@jason-ha jason-ha force-pushed the test/presence/support-large-client-count-join branch from eb8ab6e to a1a27a5 Compare October 27, 2025 19:30
@github-actions github-actions bot removed the area: loader and public api change labels Oct 27, 2025
@jason-ha jason-ha removed the request for review from a team October 27, 2025 19:31
@jason-ha jason-ha force-pushed the test/presence/support-large-client-count-join branch from a1a27a5 to cce7cb9 Compare October 27, 2025 21:03
@github-actions github-actions bot removed the area: runtime label Oct 27, 2025
@jason-ha jason-ha enabled auto-merge (squash) October 27, 2025 21:43
@jason-ha jason-ha merged commit 98dadd3 into main Oct 27, 2025
39 checks passed
@jason-ha jason-ha deleted the test/presence/support-large-client-count-join branch October 27, 2025 21:56
Labels: area: build, area: examples, area: framework, base: main, changeset-present

4 participants