- 
                Notifications
    
You must be signed in to change notification settings  - Fork 256
 
[YUNIKORN-2057] Optimize FindQueueByAppID: O(n) DFS -> O(1) cached lookup #1037
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 
           cc @chenyulin0719, need your help. Thanks  | 
    
          Codecov Report❌ Patch coverage is  
 Additional details and impacted files@@            Coverage Diff             @@
##           master    #1037      +/-   ##
==========================================
+ Coverage   80.78%   80.91%   +0.13%     
==========================================
  Files          98       99       +1     
  Lines       15765    12882    -2883     
==========================================
- Hits        12735    10424    -2311     
+ Misses       2771     2201     -570     
+ Partials      259      257       -2     ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
  | 
    
        
          
                pkg/scheduler/objects/queue.go
              
                Outdated
          
        
      | // appQueueMapping is a thread safe mapping from applicationID to queuePath | ||
| type appQueueMapping struct { | ||
| byAppID map[string]string | ||
| locking.RWMutex | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be maintained globally, not in a per-queue basis. I think the ideal place is PartitionContext. However, the data type has to stay in pkg/scheduler/objects to avoid circular references.
Here is my take:
- Make it public (
AppQueueMapping) - Extract this type to a separate file eg. 
pkg/scheduler/objects/app_queue_mapping.go, create simple unit tests for it - Create a single instance inside 
PartitionContextwhen the context is created - When a 
Queueis created, inject the instance to theQueue. ExtendNewConfiguredQueue()andNewRecoveryQueue()andNewDynamicQueue()with an extra argument. - Have a reference inside 
QueuetoAppQueueMapping - You can add/remove mappings in 
PartitionContext.AddApplication()andPartitionContext.RemoveApplication() 
With this approach, everything is much simpler and we don't have to walk the Queue hierarchy or find the root queue at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially tried managing the global state in PartitionContext,
but FindQueueByAppID is often called together with other PartitionContext operations.
When PartitionContext holds a lock and the queue calls any getXXXX function, it can cause a deadlock.
As you suggested, making AppQueueMapping public, limiting the lock scope to AppQueueMapping itself, and injecting it into the queue should prevent this issue.
I’ll confirm the scope of changes and push an updated version.
Thanks for the review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The structure would look like this:
type PartitionContext struct {
    appQueueMapping *object.AppQueueMapping
}
type Queue struct {
    appQueueMapping *AppQueueMapping
}I'll try to inject PartitionContext.appQueueMapping when creating the queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be enhanced & simplified further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor nits, otherwise good.
| 
           +1  | 
    
What is this PR for?
Motivation
Key changes
Compatibility and API notes
Performance
What type of PR is it?
Todos
What is the Jira issue?
[YUNIKORN-2057] FindQueueByAppID is slow
How should this be tested?
Screenshots (if appropriate)
Questions: