Contributor

@ShahabT ShahabT commented Dec 10, 2025

What changed?

Enforce a maximum on the number of versions registered in a task queue.

Why?

Safety against user data growing too large.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

A user's polls will be rejected if the server cannot register a new version because of the limit.

@ShahabT ShahabT requested review from a team as code owners December 10, 2025 01:59
@ShahabT ShahabT requested a review from Shivs11 December 10, 2025 02:00

loadCause loadCause
MaxVersionsInTaskQueue func() int
Contributor

can you keep these settings consistent in the four places they have to appear? like put this right underneath MaxTaskQueuesInDeployment for all of them

Contributor Author

not all the versioning settings appear in per-tq config. but I moved this one around to make things more consistent.


// Remove this deleted version
delete(deploymentData.Versions, dv.buildID)
deletedVersions = deletedVersions[1:] // Remove from slice to keep count accurate

A more idiomatic way would be to track the number of deleted versions still remaining in a dedicated variable, instead of re-slicing deletedVersions inside the loop.
Example:

cleaned := false
deletedRemaining := len(deletedVersions)
for _, dv := range deletedVersions {
    totalCount := undeletedCount + deletedRemaining

    if !dv.updateTime.Before(aWeekAgo) && totalCount <= maxVersions {
        break
    }

    delete(deploymentData.Versions, dv.buildID)
    deletedRemaining--
    cleaned = true
}
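
For anyone reading along, here is a self-contained sketch of the counter-based loop suggested above. It is not the PR's code: `versionData`, `cleanupOldDeletedVersions`, and `demo` are hypothetical names, and sorting the deleted versions oldest-first is an assumption inferred from the early `break`.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// versionData is a hypothetical stand-in for a per-version entry in task queue user data.
type versionData struct {
	deleted    bool
	updateTime time.Time
}

// cleanupOldDeletedVersions removes deleted versions older than retention, and keeps
// removing the oldest deleted versions while the total count exceeds maxVersions.
// It reports whether anything was removed.
func cleanupOldDeletedVersions(versions map[string]versionData, now time.Time, retention time.Duration, maxVersions int) bool {
	type deletedVersion struct {
		buildID    string
		updateTime time.Time
	}
	var deletedVersions []deletedVersion
	undeletedCount := 0
	for buildID, v := range versions {
		if v.deleted {
			deletedVersions = append(deletedVersions, deletedVersion{buildID: buildID, updateTime: v.updateTime})
		} else {
			undeletedCount++
		}
	}
	// Assumption: process the oldest deleted versions first so the early break is safe.
	sort.Slice(deletedVersions, func(i, j int) bool {
		return deletedVersions[i].updateTime.Before(deletedVersions[j].updateTime)
	})

	cutoff := now.Add(-retention)
	cleaned := false
	deletedRemaining := len(deletedVersions)
	for _, dv := range deletedVersions {
		totalCount := undeletedCount + deletedRemaining
		// Stop once the entry is recent enough and the total fits under the limit.
		if !dv.updateTime.Before(cutoff) && totalCount <= maxVersions {
			break
		}
		delete(versions, dv.buildID)
		deletedRemaining--
		cleaned = true
	}
	return cleaned
}

// demo runs the cleanup on a small example and returns how many versions are left.
func demo() int {
	now := time.Now()
	versions := map[string]versionData{
		"v1": {deleted: true, updateTime: now.Add(-10 * 24 * time.Hour)}, // past retention
		"v2": {deleted: true, updateTime: now.Add(-2 * time.Hour)},       // recent, removed only because of the limit
		"v3": {deleted: true, updateTime: now.Add(-1 * time.Hour)},       // recent, kept
		"v4": {updateTime: now},
		"v5": {updateTime: now},
	}
	cleanupOldDeletedVersions(versions, now, 7*24*time.Hour, 3)
	return len(versions)
}

func main() {
	fmt.Println("versions left:", demo()) // prints: versions left: 3
}
```

The counter keeps the total accurate without mutating the slice that is being ranged over, which is the point of the suggestion.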

Member

@Shivs11 Shivs11 left a comment

just some minor comments/concerns but not blocking!

Comment on lines +1322 to +1327
MatchingMaxVersionsInTaskQueue = NewNamespaceIntSetting(
	"matching.maxVersionsInTaskQueue",
	200,
	`MatchingMaxVersionsInTaskQueue represents the maximum number of versions that can be registered in a single task queue.
Should be larger than MatchingMaxVersionsInDeployment because a task queue can be in versions spanning across more than one deployments.`,
)
Member

curious: was 200 just chosen arbitrarily?

Contributor Author

it's twice MatchingMaxVersionsInDeployment, so if someone moves to another deployment, they'd get "maximum versions in deployment exceeded" before getting "maximum versions in task queue exceeded". the latter is harder for a user to deal with because there is no direct way to see all the versions in a TQ.

}, nil
}

// CleanupOldDeletedVersions removes versions deleted more than 7 days ago. Also removes more deleted versions if
Member

note: we should also mention that older, stale versions are deleted from this task queue only when some caller invokes SyncDeploymentUserData for this specific task queue; this is not a recurring/background GC process.

This makes me wonder: in the worst case, if no one were to call this Sync method, there could be a world where a task queue just has a bunch of versions and would still keep on serving tasks to pollers. This could happen if someone had made a bunch of deployments in the past 7 days and then did not make a new one (I could see this coming up in our internal dogfooding environments)

Can we/Should we make this an automatic process? How much additional work would that be? If not, maybe we should document this fact here on this function, so that if there is a bottleneck tomorrow we know where to look

Contributor Author

> if no one were to call this Sync method, there could be a world where a task queue just has a bunch of versions and would still keep on serving tasks to pollers.

the tasks will be served only if the version is not deleted. and if it's not deleted, then even automatic GC would not touch it.

Member

yep, I get that - my concern was about this being a bottleneck for task delivery if no one has made the Sync call in like a week or something and the task queue has like 199 versions (theoretically possible, but low probability)
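
For reference, the "automatic process" raised in this thread could look roughly like the sketch below. This only illustrates the idea with a plain ticker loop: nothing here is in the PR, and `runPeriodicCleanup`, the `cleanup` callback, and `demo` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPeriodicCleanup calls cleanup once per interval until ctx is cancelled.
// In a real server, cleanup would be whatever removes stale deleted versions
// from a task queue's user data; here it is just a callback.
func runPeriodicCleanup(ctx context.Context, interval time.Duration, cleanup func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cleanup()
		}
	}
}

// demo runs the loop for a short time and returns how many cleanups happened.
func demo() int {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	runs := 0
	// runPeriodicCleanup blocks until the timeout, so runs is only touched from this goroutine.
	runPeriodicCleanup(ctx, 10*time.Millisecond, func() { runs++ })
	return runs
}

func main() {
	fmt.Println("cleanup runs:", demo())
}
```

The trade-off is an extra background loop per loaded task queue versus relying on SyncDeploymentUserData callers to trigger the cleanup.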

}
}

func (s *Versioning3Suite) skipFromVersion(version workerdeployment.DeploymentWorkflowVersion) {
Member

this guy seems to not be used anywhere

Comment on lines +4591 to +4640
for i := 0; i < maxVersions; i++ {
	tvVersion := tv.WithDeploymentSeriesNumber(i).WithBuildIDNumber(i)
	upsertVersions := make(map[string]*deploymentspb.WorkerDeploymentVersionData)
	upsertVersions[tvVersion.BuildID()] = &deploymentspb.WorkerDeploymentVersionData{
		Status: enumspb.WORKER_DEPLOYMENT_VERSION_STATUS_INACTIVE,
	}

	deploymentName := tvVersion.DeploymentVersion().GetDeploymentName()
	_, err := s.GetTestCluster().MatchingClient().SyncDeploymentUserData(
		ctx, &matchingservice.SyncDeploymentUserDataRequest{
			NamespaceId:        s.NamespaceID().String(),
			TaskQueue:          tv.TaskQueue().GetName(),
			TaskQueueTypes:     []enumspb.TaskQueueType{tqTypeWf},
			DeploymentName:     deploymentName,
			UpsertVersionsData: upsertVersions,
		},
	)
Member

nit: rather than doing this, you could also just have 5 different pollers, each coming from different deployments, poll on the task queue right?

Contributor Author

yeah, I felt this way is more deterministic, with less chance of flakes.

Member

not sure i understand why using pollers would not make it deterministic though - you just start 5 pollers, wait for those to appear in the task queue and you then proceed with the test.

Base automatically changed from shahab/async-2 to main December 13, 2025 05:51

5 participants