
[proposal] Improve YARN with Koordinator so that BE pods and YARN tasks can share batch resources #2310

Open
hawkphantomnet opened this issue Jan 2, 2025 · 4 comments
Labels
kind/proposal

Comments

@hawkphantomnet

What is your proposal:
The current YARN-with-Koordinator solution synchronizes all batch resources to the YARN RM. We hope to improve this solution so that BE pods and YARN tasks can share batch resources. This requires enhancing the mechanism for synchronizing and managing batch resources between Koordinator and YARN: we propose introducing a new configuration, thirdPartyResourceConfig, to calculate the amount of batch resources that YARN can use, and enforcing real-time control over the YARN tasks' cgroup based on this configuration.

Example:

slo-colocation-config: |
{
   ...
   "thirdPartyResourceConfig": [
     {
       "thirdPartyName": "hadoop-yarn",
       "batchResourceRatio": {
         "batchCpu": 80,
         "batchMemory": 80
       },
       "cgroupPath": "/hadoop-yarn"
     }
   ]
   ...
}
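
To make the ratio calculation and cgroup control concrete, here is a minimal sketch in Go (not Koordinator's actual API; the type, field, and function names mirror the example config above and are assumptions) of how a node agent could derive YARN's share from the node's batch allocatable and write it to the YARN cgroup:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ThirdPartyResourceConfig mirrors one entry of the proposed
// thirdPartyResourceConfig shown above (illustrative names).
type ThirdPartyResourceConfig struct {
	ThirdPartyName     string
	BatchResourceRatio struct {
		BatchCPU    int64 // percentage of node batch CPU usable by the third party
		BatchMemory int64 // percentage of node batch memory usable by the third party
	}
	CgroupPath string // e.g. "/hadoop-yarn"
}

// applyYARNLimit computes the third party's share of the node-level batch
// allocatable and writes cgroup v1 limits under its cgroup path.
// batchCPUMilli / batchMemBytes are the node's batch allocatable values.
func applyYARNLimit(cfg ThirdPartyResourceConfig, batchCPUMilli, batchMemBytes int64, cgroupRoot string) error {
	cpuMilli := batchCPUMilli * cfg.BatchResourceRatio.BatchCPU / 100
	memBytes := batchMemBytes * cfg.BatchResourceRatio.BatchMemory / 100

	cpuDir := filepath.Join(cgroupRoot, "cpu", cfg.CgroupPath)
	memDir := filepath.Join(cgroupRoot, "memory", cfg.CgroupPath)

	// cpu.cfs_quota_us = milli-cores * cfs_period_us(100000) / 1000
	quota := fmt.Sprintf("%d", cpuMilli*100)
	if err := os.WriteFile(filepath.Join(cpuDir, "cpu.cfs_quota_us"), []byte(quota), 0644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(memDir, "memory.limit_in_bytes"),
		[]byte(fmt.Sprintf("%d", memBytes)), 0644)
}

func main() {
	cfg := ThirdPartyResourceConfig{ThirdPartyName: "hadoop-yarn", CgroupPath: "/hadoop-yarn"}
	cfg.BatchResourceRatio.BatchCPU = 80
	cfg.BatchResourceRatio.BatchMemory = 80
	// Example inputs: 50 cores (50000 milli-cores) and 200 GiB of batch allocatable.
	if err := applyYARNLimit(cfg, 50000, 200<<30, "/sys/fs/cgroup"); err != nil {
		fmt.Println("apply failed:", err)
	}
}
```

In the real design this calculation would presumably be re-run by the node agent whenever the node's batch allocatable changes, so the limit on the hadoop-yarn cgroup tracks the remaining batch capacity in real time.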

Why is this needed:
Described above

Is there a suggested solution, if so, please add it:
Here is an initial draft for the detailed design

@hawkphantomnet added the kind/proposal label on Jan 2, 2025
@hormes
Member

hormes commented Jan 2, 2025

Do you want to control the upper limit of batch resources that YARN can use?
If I understand correctly, I would like to know what scenarios require setting an upper limit.

@hawkphantomnet
Author

Yes, in our scenario, we want to deploy three kinds of workloads on k8s nodes:

  1. online serving pods: use prod resources.
  2. training pods: run as BE pods and use batch resources.
  3. yarn tasks: managed as containers (not pods) by YARN, using the batch resources that remain after the training containers have been allocated theirs. Setting an upper limit on the batch resources that YARN can use keeps some buffer in case new training pods arrive at short notice.
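
As a hypothetical illustration of that buffer: with batchCpu set to 80 in the example config and 50 cores of batch-allocatable CPU on a node, YARN tasks would be capped at 40 cores, and the remaining 10 cores stay available for training pods that get scheduled suddenly.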

@hormes
Member

hormes commented Jan 3, 2025

> Yes, in our scenario, we want to deploy three kinds of workloads on k8s nodes:
>
>   1. online serving pods: use prod resources.
>   2. training pods: run as BE pods and use batch resources.
>   3. yarn tasks: managed as containers (not pods) by YARN, using the batch resources that remain after the training containers have been allocated theirs. Setting an upper limit on the batch resources that YARN can use keeps some buffer in case new training pods arrive at short notice.

Can the YARN NodeManager run on k8s as a Pod?

@hawkphantomnet
Author

The NodeManager runs as a Pod; YARN tasks (like Spark drivers/executors) run as containers managed by the NodeManager.
The proposal is based on this solution: https://koordinator.sh/docs/designs/koordinator-yarn
