User QoS resource total limit #370
base: master
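The feature may not be fully implemented yet, but you could first look at AccountMetaContainer, QosResource, and UserResourceMeta and see whether there are problems with how they are implemented.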
e4a4a44 to a324eeb
These operations are too heavyweight. Let's discuss this in a meeting.
src/CraneCtld/AccountMetaContainer.h
Outdated
In theory we have no need to lock the entire collection here, so the restriction can be relaxed by using a concurrent hashmap with sharding. See https://github.com/greg7mdp/parallel-hashmap
Using exclusive locking everywhere in this AtomicHashmap code is too heavy.
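For reference, the sharding idea suggested above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not parallel-hashmap's API; `ShardedMap`, `Upsert`, and `Get` are hypothetical names, and the shard count of 16 is an arbitrary assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical illustration: a hash map split into independently locked
// shards, so writers touching keys in different shards do not contend.
template <typename V, std::size_t kShards = 16>
class ShardedMap {
 public:
  // Runs fn on the value for key (value-initializing it if absent)
  // while holding only that key's shard lock.
  template <typename Fn>
  void Upsert(const std::string& key, Fn&& fn) {
    Shard& s = ShardFor(key);
    std::lock_guard<std::mutex> lock(s.mu);
    fn(s.map[key]);
  }

  // Copies the value out under the shard lock; returns false if missing.
  bool Get(const std::string& key, V* out) const {
    const Shard& s = ShardFor(key);
    std::lock_guard<std::mutex> lock(s.mu);
    auto it = s.map.find(key);
    if (it == s.map.end()) return false;
    *out = it->second;
    return true;
  }

 private:
  struct Shard {
    mutable std::mutex mu;
    std::unordered_map<std::string, V> map;
  };

  Shard& ShardFor(const std::string& key) {
    return shards_[std::hash<std::string>{}(key) % kShards];
  }
  const Shard& ShardFor(const std::string& key) const {
    return shards_[std::hash<std::string>{}(key) % kShards];
  }

  std::array<Shard, kShards> shards_;
};

// Usage example: per-user CPU counters updated without a global lock.
inline uint32_t ShardedMapExample() {
  ShardedMap<uint32_t> cpus_in_use;
  cpus_in_use.Upsert("alice", [](uint32_t& v) { v += 4; });
  cpus_in_use.Upsert("alice", [](uint32_t& v) { v += 2; });
  uint32_t out = 0;
  bool found = cpus_in_use.Get("alice", &out);
  assert(found);
  return out;
}
```

Each operation locks a single shard, which is the property the comments above are asking for; whether a shared (reader) lock per shard is also worthwhile would depend on the read/write mix.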
src/CraneCtld/AccountManager.cpp
Outdated
@@ -833,6 +834,12 @@ AccountManager::CraneExpected<void> AccountManager::ModifyQos(
  // Mongodb
  Qos qos;
  g_db_client->SelectQos("name", name, &qos);

  // Modify QosResource when max_jobs_per_user or max_cpus_per_user is changed.
  if (item == "max_jobs_per_user" || item == "max_cpus_per_user")
Could this be turned into an enum?
It should be possible.
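One possible shape for the enum suggested above, as a hedged sketch. `QosModifyField`, `ParseQosModifyField`, and `AffectsPerUserLimit` are hypothetical names and not existing CraneSched identifiers; only the two string keys come from the diff. It assumes C++17 for `std::string_view`.

```cpp
#include <string_view>

// Hypothetical enum covering the two fields compared as strings in the diff.
enum class QosModifyField { kMaxJobsPerUser, kMaxCpusPerUser, kOther };

// Maps the textual item name to the enum once, at the API boundary.
constexpr QosModifyField ParseQosModifyField(std::string_view item) {
  if (item == "max_jobs_per_user") return QosModifyField::kMaxJobsPerUser;
  if (item == "max_cpus_per_user") return QosModifyField::kMaxCpusPerUser;
  return QosModifyField::kOther;
}

// True when a change to this field alters a per-user limit.
constexpr bool AffectsPerUserLimit(QosModifyField field) {
  switch (field) {
    case QosModifyField::kMaxJobsPerUser:
    case QosModifyField::kMaxCpusPerUser:
      return true;
    case QosModifyField::kOther:
      return false;
  }
  return false;
}

// Compile-time checks of the mapping.
static_assert(AffectsPerUserLimit(ParseQosModifyField("max_jobs_per_user")));
static_assert(!AffectsPerUserLimit(ParseQosModifyField("description")));
```

With the string parsed once, later code can switch on the enum, and the compiler can warn about unhandled cases when a new field is added.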
src/CraneCtld/AccountManager.cpp
Outdated
for (const auto& [partition, qos] :
     res_user.account_to_attrs_map[account_name].allowed_partition_qos_map) {
  for (const auto& qos_name : qos.second) {
    const Qos* qos_content = GetExistedQosInfoNoLock_(qos_name);
There shouldn't be any need to query the QoS table here, right?
src/CraneCtld/AccountManager.cpp
Outdated
@@ -1580,6 +1587,18 @@ AccountManager::CraneExpected<void> AccountManager::AddUser_(
  }
  res_user.account_to_attrs_map[object_account].blocked = false;

  AccountMetaContainer::QosResourceList qos_resource_list;
  for (const auto& [partition, qos] :
The QoS has already been applied during the check phase of task submission, so this code doesn't need to, and shouldn't, look up the QoS.
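To track each user's QoS resource usage, we still need somewhere to record changes in the user's resources. Could that go in CheckAndApplyQosLimitOnTask? Then the whole collection would no longer need to be locked, and the delete operation wouldn't be needed either, since CheckAndApplyQosLimitOnTask has already done the check. So we would only need to lock qos_to_resource_map on malloc and free.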
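The current logic should already work this way. When a task goes through CheckAndApplyQosLimitOnTask, the QoS limits are applied anyway, so there's no need to store them separately. We only need to record each user's total resources after the QoS limits have been applied.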
src/CraneCtld/CtldPublicDefs.h
Outdated
struct QosResourceLimit {
  QosResource res_total;
  QosResource res_avail;
There's no need for three of them here, is there?
If the QoS resources are modified while jobs are running, we do still need three.
7ae6410 to 475f71f
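Now the change in the user's QoS resources is evaluated in checkAndApplyQos. Changing a user's QoS does not affect jobs already in the queue, which matches the previous behavior. However, modifying max_jobs_per_user or max_cpus_per_user through ModifyQos takes effect on queued jobs immediately.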
src/CraneCtld/CtldPublicDefs.h
Outdated
@@ -703,6 +703,17 @@ inline bool CheckIfTimeLimitIsValid(absl::Duration d) {
  return CheckIfTimeLimitSecIsValid(sec);
}

struct QosResource {
  uint32_t cpus_per_user;
QoS may include GPU resources. How about representing this directly as something like ResPerUser?
Right. I think using ResourceView here would probably be enough.
src/CraneCtld/AccountManager.cpp
Outdated
@@ -973,6 +981,11 @@ result::result<void, std::string> AccountManager::CheckAndApplyQosLimitOnTask(
      qos_share_ptr->max_cpus_per_user)
    return result::fail("cpus-per-task reached the user's limit.");

  g_account_meta_container->AddQosResourceToUser(
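The g_account_meta_container class doesn't need to maintain the user's QoS; AccountManager alone should maintain that. Adding a copy here means keeping it consistent and introduces a lot of locks. Instead, inside AccountManager::CheckAndApplyQosLimitOnTask, fetch the user's total resources in use from g_account_meta_container, add the task's resources to it, and compare the sum with the QoS upper limit. g_account_meta_container is only responsible for simple runtime resource accounting, so exposing thread-safe add and subtract operations should be enough. That way there is no extra consistency to maintain; AddQosResourceToUser, FreeQosResource, InitFromDB_, and ModifyQosResourceOnUser can all be dropped; and it is more efficient. What remains is to increase the parallelism of the add and subtract operations and lower their overhead.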
The whole logic becomes much simpler.
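With that logic, QoS resources are deducted at the CheckAndApplyQosLimitOnTask stage, meaning every task in the queue is assumed to already be using QoS resources. What I implemented deducts QoS resources when the task runs, which indeed isn't necessary. I'll rework it following your logic.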
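The design agreed on in this thread could look roughly like the sketch below. It is an assumption-laden illustration, not the code in this PR: `UserUsageCounter`, `TryAdd`, `Sub`, and `CpusInUse` are hypothetical names, CPU counts are simplified to `uint32_t`, and the compare-and-add is done under one lock (a deliberate variation on "fetch, sum, compare") so that two concurrent submissions cannot both pass the check.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the runtime usage accounting discussed above.
// It stores only what each user currently holds; the QoS limits themselves
// are passed in by the caller, so no copy of the QoS is kept here.
class UserUsageCounter {
 public:
  // Adds the task's CPUs and one job to the user's usage if both limits
  // still hold afterwards. Returns false and changes nothing otherwise.
  bool TryAdd(const std::string& user, uint32_t task_cpus,
              uint32_t max_cpus_per_user, uint32_t max_jobs_per_user) {
    std::lock_guard<std::mutex> lock(mu_);
    Usage& u = usage_[user];
    // 64-bit arithmetic so the sums cannot wrap around.
    if (static_cast<uint64_t>(u.cpus) + task_cpus > max_cpus_per_user)
      return false;
    if (static_cast<uint64_t>(u.jobs) + 1 > max_jobs_per_user) return false;
    u.cpus += task_cpus;
    u.jobs += 1;
    return true;
  }

  // Releases what a finished or cancelled task had been counted for.
  void Sub(const std::string& user, uint32_t task_cpus) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = usage_.find(user);
    if (it == usage_.end()) return;
    it->second.cpus -= std::min(task_cpus, it->second.cpus);
    if (it->second.jobs > 0) it->second.jobs -= 1;
  }

  // Returns the CPUs currently counted against the user.
  uint32_t CpusInUse(const std::string& user) const {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = usage_.find(user);
    return it == usage_.end() ? 0u : it->second.cpus;
  }

 private:
  struct Usage {
    uint32_t cpus = 0;
    uint32_t jobs = 0;
  };
  mutable std::mutex mu_;
  std::unordered_map<std::string, Usage> usage_;
};
```

A single mutex is used here only to keep the sketch short; the per-shard locking discussed earlier in this review would replace it to raise parallelism.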
1b285a3 to e7cdbe8
  return;
}

val.resource.GetAllocatableRes().cpu_count += task.cpus_per_task;
This should directly add task.requested_node_res_view*node.
Oh, right.
user_meta_map_[username].qos_resource_in_use.modify_if(
    task.qos, [&](std::pair<const std::string, QosResource>& pair) {
      auto& val = pair.second;
      val.resource.GetAllocatableRes().cpu_count -= task.cpus_per_task;
Same as above.
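To make the two comments above concrete: the idea is to account for the whole per-node request multiplied by the node count, rather than only cpus_per_task. The sketch below is hypothetical. `NodeResView` is a simplified stand-in with integer fields, not CraneSched's actual ResourceView, and the operators are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified per-node resource request.
struct NodeResView {
  uint64_t cpu_count = 0;
  uint64_t memory_bytes = 0;
};

// Scales a per-node request by the number of nodes the task occupies.
inline NodeResView operator*(const NodeResView& r, uint32_t node_num) {
  NodeResView out;
  out.cpu_count = r.cpu_count * node_num;
  out.memory_bytes = r.memory_bytes * node_num;
  return out;
}

// Adds a task's total request to the user's in-use resources.
inline NodeResView& operator+=(NodeResView& a, const NodeResView& b) {
  a.cpu_count += b.cpu_count;
  a.memory_bytes += b.memory_bytes;
  return a;
}

// Releases a task's total request. Assumes b was added earlier, so the
// unsigned subtraction cannot underflow.
inline NodeResView& operator-=(NodeResView& a, const NodeResView& b) {
  assert(a.cpu_count >= b.cpu_count && a.memory_bytes >= b.memory_bytes);
  a.cpu_count -= b.cpu_count;
  a.memory_bytes -= b.memory_bytes;
  return a;
}
```

With operators like these, the add and subtract paths become symmetric one-liners of the form `in_use += per_node * node_num` and `in_use -= per_node * node_num`.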
src/CraneCtld/CtldPublicDefs.h
Outdated
  uint32_t jobs_per_user;
};

struct ResourcePerUser {
This could go in AccountMetaContainer.h, right?
Sure. That's where I put it originally.
…edQos, or ModifyQos are performed.
…eQosResourceOnUser.
e7cdbe8 to ae1341c
No description provided.