
User QoS resource total limit #370

Open
wants to merge 15 commits into
base: master

Conversation

huerni
Collaborator

@huerni huerni commented Nov 20, 2024

No description provided.

@huerni huerni added the enhancement New feature or request label Nov 20, 2024
@huerni
Collaborator Author

huerni commented Nov 20, 2024

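The feature may not be fully implemented yet. For now, please check whether there are any problems with the way AccountMetaContainer, QosResource, and UserResourceMeta are implemented.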

@huerni huerni linked an issue Nov 20, 2024 that may be closed by this pull request
@huerni huerni self-assigned this Nov 20, 2024
@RileyWen
Collaborator

These operations are too heavyweight. Let's discuss this in a meeting.

Collaborator

In theory there is no need to lock the whole collection here, so the restriction can be relaxed by using a sharded concurrent hash map. See https://github.com/greg7mdp/parallel-hashmap

Collaborator

Using exclusive locking everywhere in this AtomicHashmap is too heavyweight.
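To illustrate the sharding idea raised in the two comments above, here is a minimal sketch of a counter map split into independently locked shards. It is not the parallel-hashmap API and not code from this pull request: `ShardedCounterMap`, `Add`, and `Get` are hypothetical names, and the sketch uses only the standard library.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch: each key hashes to one shard, and only that shard's
// mutex is taken, so updates to keys in different shards do not contend.
template <std::size_t NumShards = 16>
class ShardedCounterMap {
  static_assert(NumShards > 0, "at least one shard is required");

 public:
  // Adds delta (which may be negative) to the counter stored for key.
  void Add(const std::string& key, int64_t delta) {
    Shard& shard = ShardFor(key);
    std::lock_guard<std::mutex> guard(shard.mu);
    shard.values[key] += delta;
  }

  // Returns the counter stored for key, or 0 if the key is absent.
  int64_t Get(const std::string& key) const {
    const Shard& shard = ShardFor(key);
    std::lock_guard<std::mutex> guard(shard.mu);
    auto it = shard.values.find(key);
    return it == shard.values.end() ? 0 : it->second;
  }

 private:
  struct Shard {
    mutable std::mutex mu;
    std::unordered_map<std::string, int64_t> values;
  };

  Shard& ShardFor(const std::string& key) {
    return shards_[std::hash<std::string>{}(key) % NumShards];
  }
  const Shard& ShardFor(const std::string& key) const {
    return shards_[std::hash<std::string>{}(key) % NumShards];
  }

  std::array<Shard, NumShards> shards_;
};
```

No operation in this sketch ever locks every shard at once, which is the relaxation the reviewer describes. An operation that needed a consistent view of all keys would have to take every shard lock in a fixed order.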

@@ -833,6 +834,12 @@ AccountManager::CraneExpected<void> AccountManager::ModifyQos(
// Mongodb
Qos qos;
g_db_client->SelectQos("name", name, &qos);

// Modify QosResource when max_jobs_per_user or max_cpus_per_user is changed.
if (item == "max_jobs_per_user" || item == "max_cpus_per_user")
Collaborator

Could this be turned into an enum?

Collaborator Author

It should be possible.
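As a rough sketch of what the enum suggestion could look like, the field name can be parsed once and then handled with a `switch` instead of repeated string comparisons. The enum `QosModifyField` and both helper functions are hypothetical names chosen for illustration. The real set of modifiable QoS fields in the project is larger and is not shown in this diff.

```cpp
#include <string_view>

// Hypothetical enum covering only the two fields named in the diff above.
enum class QosModifyField { kMaxJobsPerUser, kMaxCpusPerUser, kOther };

// Parses the field name once, at the request boundary.
constexpr QosModifyField ParseQosModifyField(std::string_view item) {
  if (item == "max_jobs_per_user") return QosModifyField::kMaxJobsPerUser;
  if (item == "max_cpus_per_user") return QosModifyField::kMaxCpusPerUser;
  return QosModifyField::kOther;
}

// Returns true when changing this field alters a per-user resource limit.
constexpr bool AffectsPerUserLimit(QosModifyField field) {
  switch (field) {
    case QosModifyField::kMaxJobsPerUser:
    case QosModifyField::kMaxCpusPerUser:
      return true;
    case QosModifyField::kOther:
      return false;
  }
  return false;  // Unreachable for valid enum values.
}
```

With a `switch` over an enum, most compilers can warn when a newly added enumerator is not handled, which a chain of string comparisons cannot do.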

for (const auto& [partition, qos] :
     res_user.account_to_attrs_map[account_name].allowed_partition_qos_map) {
  for (const auto& qos_name : qos.second) {
    const Qos* qos_content = GetExistedQosInfoNoLock_(qos_name);
Collaborator

There should be no need to look up the QoS table here, right?

@@ -1580,6 +1587,18 @@ AccountManager::CraneExpected<void> AccountManager::AddUser_(
}
res_user.account_to_attrs_map[object_account].blocked = false;

AccountMetaContainer::QosResourceList qos_resource_list;
for (const auto& [partition, qos] :
Collaborator

The QoS has already been applied during the check stage of task submission, so there is no need to look up the QoS here, and it should not be done.

Collaborator Author

To record each user's QoS resource usage, there still has to be somewhere that tracks changes in the user's resources. Could that go into CheckAndApplyQosLimitOnTask? Then the whole collection would no longer need to be locked, and the delete operation would not be needed either, because CheckAndApplyQosLimitOnTask already performs that check. So only qos_to_resource_map would need to be locked, during malloc and free.

Collaborator

The current implementation should already work this way. When a task goes through CheckAndApplyQosLimitOnTask, the QoS limits are applied to it, so they do not need to be stored separately. It is enough to record, for each user, the total resources after the QoS limits have been applied.


struct QosResourceLimit {
QosResource res_total;
QosResource res_avail;
Collaborator

There's no need for three of these here, is there?

Collaborator Author

If the QoS resources are modified while jobs are running, three are still needed.

@huerni huerni force-pushed the dev/qos_res_limit branch from 7ae6410 to 475f71f Compare December 3, 2024 08:06
@huerni
Collaborator Author

huerni commented Dec 3, 2024

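Now the change in the user's QoS resources is evaluated during checkAndApplyQos. Changing a user's QoS does not affect jobs already in the queue, which matches the previous behavior. Modifying max_jobs_per_user or max_cpus_per_user through ModifyQos, however, affects queued jobs in real time.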

@@ -703,6 +703,17 @@ inline bool CheckIfTimeLimitIsValid(absl::Duration d) {
return CheckIfTimeLimitSecIsValid(sec);
}

struct QosResource {
uint32_t cpus_per_user;
Collaborator

A QoS may include GPU resources. How about representing this directly as something like ResPerUser?

Collaborator

Right. Using ResourceView here would probably be enough.

@@ -973,6 +981,11 @@ result::result<void, std::string> AccountManager::CheckAndApplyQosLimitOnTask(
qos_share_ptr->max_cpus_per_user)
return result::fail("cpus-per-task reached the user's limit.");

g_account_meta_container->AddQosResourceToUser(
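Collaborator

The g_account_meta_container class does not need to maintain the user's QoS; only AccountManager needs to do that. Adding a copy here means maintaining consistency and introducing many locks.

Instead, inside AccountManager::CheckAndApplyQosLimitOnTask, fetch the user's total in-use resources from g_account_meta_container once, add the task's resources to it, and compare the sum against the QoS upper limit.

g_account_meta_container is only responsible for simple runtime resource accounting. Normally, exposing thread-safe add and subtract operations should be enough. That avoids extra consistency maintenance, so AddQosResourceToUser, FreeQosResource, InitFromDB_, and ModifyQosResourceOnUser can all be dropped, and it is more efficient.

What remains is to increase the parallelism of the add and subtract operations and reduce their overhead.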

Collaborator

The whole logic becomes much simpler.

Collaborator Author

With this logic, the QoS resources are already deducted at the CheckAndApplyQosLimitOnTask stage, which means all tasks in the queue are treated as already using QoS resources. What I implemented deducts the QoS resources when the task runs, which is indeed unnecessary. I'll rework it to follow your logic.
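The approach agreed on in this thread (read the user's in-use total, add the task's request, compare with the QoS limit, and expose only thread-safe add and subtract operations) might look roughly like the sketch below. All names here (`UsageTracker`, `TryReserve`, `Release`, `QosLimit`, `UserUsage`) are hypothetical stand-ins, not the project's types. A single mutex is used for brevity, although the thread asks for finer-grained locking in the real container.

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical per-user usage record (CPU count and job count only).
struct UserUsage {
  uint32_t cpus = 0;
  uint32_t jobs = 0;
};

// Hypothetical subset of the QoS limits discussed in this review.
struct QosLimit {
  uint32_t max_cpus_per_user;
  uint32_t max_jobs_per_user;
};

class UsageTracker {
 public:
  // Checks usage + request against the limit and, if it fits, records the
  // request. Check and update happen under one lock, so two concurrent
  // submissions cannot both pass the check and overshoot the limit.
  bool TryReserve(const std::string& user, uint32_t task_cpus,
                  const QosLimit& limit) {
    std::lock_guard<std::mutex> guard(mu_);
    UserUsage& usage = usage_[user];
    // Widen to 64 bits so the sums cannot wrap around.
    if (uint64_t{usage.jobs} + 1 > limit.max_jobs_per_user) return false;
    if (uint64_t{usage.cpus} + task_cpus > limit.max_cpus_per_user)
      return false;
    usage.cpus += task_cpus;
    usage.jobs += 1;
    return true;
  }

  // Gives back the resources of one finished or cancelled task.
  void Release(const std::string& user, uint32_t task_cpus) {
    std::lock_guard<std::mutex> guard(mu_);
    auto it = usage_.find(user);
    if (it == usage_.end()) return;
    it->second.cpus -= std::min(task_cpus, it->second.cpus);
    if (it->second.jobs > 0) it->second.jobs -= 1;
  }

  // Returns a copy of the user's current usage (zeros if unknown).
  UserUsage Get(const std::string& user) const {
    std::lock_guard<std::mutex> guard(mu_);
    auto it = usage_.find(user);
    return it == usage_.end() ? UserUsage{} : it->second;
  }

 private:
  mutable std::mutex mu_;
  std::unordered_map<std::string, UserUsage> usage_;
};
```

Because the limit is passed in at check time rather than copied into the tracker, a later change to the QoS limit needs no synchronization with the tracker, which is the consistency saving the reviewer points out.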

@huerni huerni force-pushed the dev/qos_res_limit branch from 1b285a3 to e7cdbe8 Compare December 5, 2024 08:39
@huerni huerni requested review from RileyWen and L-Xiafeng December 9, 2024 01:42
return;
}

val.resource.GetAllocatableRes().cpu_count += task.cpus_per_task;
Collaborator

This should add task.requested_node_res_view*node directly.

Collaborator Author

Oh, right.
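The point of the suggestion above is that a task's footprint is its per-node request multiplied by the number of nodes, not just `cpus_per_task`. A minimal sketch of that arithmetic follows, assuming "node" in the comment means the task's node count. `NodeResView` and its two fields are a hypothetical simplification, not the project's ResourceView type.

```cpp
#include <cstdint>

// Hypothetical, simplified per-node resource request.
struct NodeResView {
  double cpu_count = 0;
  uint64_t memory_bytes = 0;
};

// Scales a per-node request by the number of nodes the task occupies.
inline NodeResView operator*(const NodeResView& per_node, uint32_t node_num) {
  return NodeResView{per_node.cpu_count * node_num,
                     per_node.memory_bytes * node_num};
}

// Accumulates one task's total request into a running per-user total.
inline NodeResView& operator+=(NodeResView& lhs, const NodeResView& rhs) {
  lhs.cpu_count += rhs.cpu_count;
  lhs.memory_bytes += rhs.memory_bytes;
  return lhs;
}
```

For example, a request of 2 CPUs and 1024 bytes per node on 3 nodes adds 6 CPUs and 3072 bytes to the user's total.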

user_meta_map_[username].qos_resource_in_use.modify_if(
    task.qos, [&](std::pair<const std::string, QosResource>& pair) {
      auto& val = pair.second;
      val.resource.GetAllocatableRes().cpu_count -= task.cpus_per_task;
Collaborator

Same as above.

uint32_t jobs_per_user;
};

struct ResourcePerUser {
Collaborator

This could go in AccountMetaContainer.h, right?

Collaborator Author

Sure. That's where I put it originally.

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

User QoS resource total limit
3 participants