qiita.slurm_resource_allocations insertion #3415
Conversation
- Added functionality for updating resource allocation data in the db in qiita_db/util.py
- Added tests for the added functionality in qiita_db/test/test_util.py
- Moved the MaxRSS_helper function from qiita_core/util.py to qiita_db/util.py
- Moved the MaxRSS_helper test from qiita_core/tests/test_util.py to qiita_db/test/test_util.py
qiita_db/util.py
Outdated
sacct = ['sacct', '-p', '--format=JobID,ElapsedRaw,MaxRSS,Submit,Start,'
         'CPUTimeRAW,ReqMem,AllocCPUs,AveVMSize', '-j']
Rather than what could be a very large number of calls to sacct, why not batch jobs and query ','.join(list_of_job_ids)? The primary concern is creating a high burden for the resource manager, and secondary is the overhead of system calls.
I actually discussed this with @Gossty, and this decision was made based on my previous experience with sacct and pgsql. As far as I was able to tell, sacct didn't create any burden for the resource manager - I think it's because it is an actual database, which is not directly connected to the resource manager. Then, in the past we have had issues with large inserts (vs. multiple single inserts), so I think basic inserts are the way to go here.
See the warning here
Thank you for the link; if I read this correctly, smaller/simpler calls (a single job per query) are preferred over larger multi-job queries, no?
"That database needs to be modified constantly as jobs start and complete, so we don't want it tied up answering sacct queries."
As you know, SQL engines are really good at bulk queries as there is a lot of overhead for a single query.
for index, row in df.iterrows():
    with qdb.sql_connection.TRN:
        sql = """
            INSERT INTO qiita.slurm_resource_allocations (
The resources used are sensitive to extrinsic factors like other work on the system or cluster. Scoping to include that type of information carries a reasonable amount of complexity, so I don't think we want to attempt that. However, if the date the job was started on was tracked, then it would at least allow for excluding periods of known high burden that artificially increase resource use.
There are also intrinsic factors though that are important for interpretation of resource use, such as the node which ran the job, and ideally the model of processor (/proc/cpuinfo) on the node (as hostnames are not assured to be unique over time). It would be great to have information about the memory of the system, but that may require sudo (see e.g., dmidecode) so could be out of scope. These values are important as our compute cluster is heterogeneous.
@wasade, if I understand the comment correctly, you are basically suggesting storing:
a. the node name
b. the submission timestamp
correct?
(a) yes; (b) the job start timestamp, not submission; and ideally (c) the actual CPU model on the node.
Got it, thank you for the clarification. BTW, why explicitly (c) vs. only (a)? Also, will sacct return (c)?
"... hostnames are not assured to be unique over time..." so just getting hte name isn't reliable over time. I'm not sure what information is or isn't easily available from slurm
A few requests. Please let me know if you would like to chat to clarify any of the requests.
qiita_db/util.py
Outdated
return y


def update_resource_allocation_table(dates=['2023-04-28', '2023-05-05'],
Can you add a docstring explaining the parameters?
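For example, something like this (covering only the parameter visible in the diff; the described semantics are inferred from the surrounding code and should be adjusted to match the final behavior):

def update_resource_allocation_table(dates=['2023-04-28', '2023-05-05']):
    """Insert new rows into qiita.slurm_resource_allocations.

    Parameters
    ----------
    dates : list of str, optional
        [start, end] dates in 'YYYY-MM-DD' format bounding the window of
        jobs to pull from sacct; the start is replaced by the most recent
        job_start already stored, and the end is derived as the start
        plus one week.
    """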
qiita_db/util.py
Outdated
sql_timestamp = """
    SELECT
        sra.job_start
    FROM
        qiita.slurm_resource_allocations sra
    ORDER BY job_start DESC;
    """
You could add LIMIT 1 so you only get 1 result; this will make the query faster.
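That is, the query would become:

SELECT sra.job_start
FROM qiita.slurm_resource_allocations sra
ORDER BY job_start DESC
LIMIT 1;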
qiita_db/util.py
Outdated
columns = ['timestamp']
timestamps = pd.DataFrame(res, columns=columns)
if len(timestamps['timestamp']) == 0:
    dates[0] = datetime.strptime('2023-04-28', '%Y-%m-%d')
else:
    dates[0] = str(timestamps['timestamp'].iloc[0]).split(' ')[0]
    date0 = datetime.strptime(dates[0], '%Y-%m-%d')
    date1 = date0 + timedelta(weeks=1)
    dates[1] = date1.strftime('%Y-%m-%d')
Why convert to a dataframe? I mean, the database should already return a timestamp object that can be used directly.
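For example, a sketch assuming the TRN fetch helpers used elsewhere in qiita_db (names are illustrative):

from datetime import datetime, timedelta

with qdb.sql_connection.TRN:
    qdb.sql_connection.TRN.add(sql_timestamp)
    res = qdb.sql_connection.TRN.execute_fetchflatten()

# res[0] is already a datetime.datetime, so no DataFrame round-trip
# through strings is needed
date0 = res[0] if res else datetime(2023, 4, 28)
dates = [date0.strftime('%Y-%m-%d'),
         (date0 + timedelta(weeks=1)).strftime('%Y-%m-%d')]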
qiita_db/util.py
Outdated
sql_command = """
    SELECT
        pj.processing_job_id AS processing_job_id,
        pj.external_job_id AS external_job_id
    FROM
        qiita.software_command sc
    JOIN
        qiita.processing_job pj ON pj.command_id = sc.command_id
    JOIN
        qiita.processing_job_status pjs
        ON pj.processing_job_status_id = pjs.processing_job_status_id
    LEFT JOIN
        qiita.slurm_resource_allocations sra
        ON pj.processing_job_id = sra.processing_job_id
    WHERE
        pjs.processing_job_status = 'success'
        AND
        pj.external_job_id ~ '^[0-9]+$'
        AND
        sra.processing_job_id IS NULL;
    """
Note that this query will return all the rows, which will be a lot, so it would be good to limit based on the job_id of the latest result in the previous block (what you are using to extract the time). Could you limit it to return just what's newer than that job?
Would we determine whether the job is newer through external_id? I don't think qiita.processing_job stores information about when the job was submitted.
That's correct, the external id is an incremental integer so it can be used to find what's newer.
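e.g., the WHERE clause above could gain a filter like the following (a sketch; %s would be the largest external_job_id already recorded):

AND pj.external_job_id ~ '^[0-9]+$'
AND pj.external_job_id::integer > %s  -- only jobs newer than the last one stored
AND sra.processing_job_id IS NULL;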
Looks great, thank you!