
qiita.slurm_resource_allocations insertion #3415


Merged (21 commits) on Jun 19, 2024

Conversation

@Gossty (Contributor) commented Jun 5, 2024

No description provided.

Gossty added 12 commits April 29, 2024 17:28
- Added functionality of updating resource allocation data into db in qiita_db/util.py
- Added tests for the added functionality in qiita_db/test/test_util.py
- Moved MaxRSS_helper function from qiita_core/util.py to qiita_db/util.py.
- Moved MaxRSS_helper test from qiita_core/tests/test_util.py to qiita_db/test/test_util.py
@Gossty Gossty changed the title qiita.slurm_resource_allocations insertion into DB qiita.slurm_resource_allocations insertion Jun 5, 2024
qiita_db/util.py Outdated
Comment on lines 2749 to 2750
sacct = ['sacct', '-p', '--format=JobID,ElapsedRaw,MaxRSS,Submit,Start,'
'CPUTimeRAW,ReqMem,AllocCPUs,AveVMSize', '-j']
Contributor

Rather than what could be a very large number of calls to sacct, why not batch jobs and query ','.join(list_of_job_ids)? The primary concern is creating a high burden for the resource manager, and secondary is the overhead of system calls
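The batching suggestion could look roughly like this. This is a minimal sketch, not code from the PR; `build_sacct_command` and `parse_sacct_output` are hypothetical helper names, and the field list mirrors the one in the excerpt above.

```python
SACCT_FIELDS = ('JobID,ElapsedRaw,MaxRSS,Submit,Start,'
                'CPUTimeRAW,ReqMem,AllocCPUs,AveVMSize')


def build_sacct_command(job_ids):
    # One sacct invocation for the whole batch: -j accepts a
    # comma-separated list of job ids.
    return ['sacct', '-p', '--format=' + SACCT_FIELDS,
            '-j', ','.join(str(j) for j in job_ids)]


def parse_sacct_output(text):
    # With -p, sacct emits pipe-delimited rows (each with a trailing
    # '|'); the first row is the header.
    lines = [ln for ln in text.strip().splitlines() if ln]
    header = lines[0].rstrip('|').split('|')
    return [dict(zip(header, ln.rstrip('|').split('|')))
            for ln in lines[1:]]
```

The output of `subprocess.check_output(build_sacct_command(ids), text=True)` could then be fed to `parse_sacct_output`, replacing one system call per job with one per batch.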

Member

I actually discussed this with @Gossty, and the decision was based on my previous experience with sacct and pgsql. As far as I could tell, sacct didn't create any burden for the resource manager; I think this is because it queries an accounting database that is not directly connected to the resource manager. In the past we have also had issues with large inserts (vs. multiple single inserts), so I think basic inserts are the way to go here.

Contributor

See the warning here

Member

Thank you for the link; if I read this correctly, smaller/simpler calls (a single job) are preferred over larger calls covering multiple jobs, no?

Contributor

"That database needs to be modified constantly as jobs start and complete, so we don't want it tied up answering sacct queries."

As you know, SQL engines are really good at bulk queries as there is a lot of overhead for a single query.
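To illustrate the bulk-query point: instead of issuing one `INSERT` per DataFrame row, a single multi-row statement can be built and sent in one round-trip. This is a sketch only; `build_bulk_insert` is a hypothetical helper, and the column names used below are illustrative, not the actual schema of `qiita.slurm_resource_allocations`.

```python
def build_bulk_insert(table, cols, records):
    # One "INSERT ... VALUES (%s, %s), (%s, %s), ..." statement
    # covering every record, instead of one round-trip per row.
    placeholders = ', '.join(
        '(' + ', '.join(['%s'] * len(cols)) + ')' for _ in records)
    sql = 'INSERT INTO {} ({}) VALUES {}'.format(
        table, ', '.join(cols), placeholders)
    # Flatten the per-record values in column order for the driver.
    params = [rec[c] for rec in records for c in cols]
    return sql, params
```

The resulting `(sql, params)` pair can be handed to the usual parameterized-execution path; drivers such as psycopg2 also ship dedicated helpers for this pattern.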

for index, row in df.iterrows():
with qdb.sql_connection.TRN:
sql = """
INSERT INTO qiita.slurm_resource_allocations (
Contributor

The resources used are sensitive to extrinsic factors like other work on the system or cluster. Scoping to include that type of information carries a reasonable amount of complexity, so I don't think we want to attempt that. However, if the date the job started on were tracked, it would at least allow excluding periods of known high load that artificially inflate resource use.

There are also intrinsic factors that are important for interpreting resource use, such as the node that ran the job and, ideally, the model of processor (/proc/cpuinfo) on that node (as hostnames are not assured to be unique over time). It would be great to have information about the system's memory too, but that may require sudo (see, e.g., dmidecode), so it could be out of scope. These values matter because our compute cluster is heterogeneous.
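The processor model mentioned above is readable without sudo from /proc/cpuinfo. A minimal sketch (the helper name is hypothetical):

```python
def cpu_model_from_cpuinfo(text):
    # /proc/cpuinfo repeats a "model name : ..." line once per
    # logical core; the first occurrence identifies the CPU model.
    for line in text.splitlines():
        if line.startswith('model name'):
            return line.split(':', 1)[1].strip()
    return None
```

On the compute node, `cpu_model_from_cpuinfo(open('/proc/cpuinfo').read())` would run at job start and the result could be stored alongside the node name.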

Member

@wasade, if I understand the comment correctly basically you are suggesting storing:
a. the node name
b. the submission timestamp
correct?

Contributor

(a) yes; (b) the job start timestamp, not submission; and ideally (c) the actual CPU model on the node

Member

Got it, thank you for the clarification. BTW why explicitly (c) vs. only (a)? Also, will sacct return (c)?

Contributor

"... hostnames are not assured to be unique over time ..." so just getting the name isn't reliable over time. I'm not sure what information is or isn't easily available from slurm.

@coveralls

Coverage Status

coverage: 92.797% (-0.05%) from 92.85%
when pulling 022cd7c on Gossty:test-util-db
into ffe1ec8 on qiita-spots:dev.

@coveralls

Coverage Status

coverage: 92.816% (-0.03%) from 92.85%
when pulling 61cb992 on Gossty:test-util-db
into ffe1ec8 on qiita-spots:dev.

@coveralls

Coverage Status

coverage: 92.816% (-0.03%) from 92.85%
when pulling 07fa018 on Gossty:test-util-db
into ffe1ec8 on qiita-spots:dev.

@antgonza (Member) left a comment

A few requests; please let me know if you would like to chat to clarify any of them.

qiita_db/util.py Outdated
return y


def update_resource_allocation_table(dates=['2023-04-28', '2023-05-05'],
Member

Can you add a docstring explaining the parameters?

qiita_db/util.py Outdated
Comment on lines 2714 to 2720
sql_timestamp = """
SELECT
sra.job_start
FROM
qiita.slurm_resource_allocations sra
ORDER BY job_start DESC;
"""
Member

You could add LIMIT 1 so you get only one result; this will make the query faster.
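With that change, the query would read something like the sketch below; LIMIT 1 lets the planner stop after the first row of the DESC-ordered scan instead of returning every row.

```python
# Fetch only the most recent job_start from the allocations table.
SQL_LATEST_TIMESTAMP = """
    SELECT sra.job_start
    FROM qiita.slurm_resource_allocations sra
    ORDER BY sra.job_start DESC
    LIMIT 1;
"""
```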

qiita_db/util.py Outdated
Comment on lines 2728 to 2736
columns = ['timestamp']
timestamps = pd.DataFrame(res, columns=columns)
if len(timestamps['timestamp']) == 0:
dates[0] = datetime.strptime('2023-04-28', '%Y-%m-%d')
else:
dates[0] = str(timestamps['timestamp'].iloc[0]).split(' ')[0]
date0 = datetime.strptime(dates[0], '%Y-%m-%d')
date1 = date0 + timedelta(weeks=1)
dates[1] = date1.strftime('%Y-%m-%d')
Member

Why convert to a DataFrame? The database should already return a timestamp object that can be used directly.
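A DataFrame-free version of that logic might look like this. It is a sketch only: `next_week_window` is a hypothetical helper, and the '2023-04-28' fallback mirrors the default in the excerpt above.

```python
from datetime import datetime, timedelta


def next_week_window(latest, default=datetime(2023, 4, 28)):
    # `latest` is the newest job_start returned by the database
    # (already a datetime, or None when the table is empty), so no
    # DataFrame or string round-trip is needed.
    start = latest if latest is not None else default
    return start, start + timedelta(weeks=1)
```

Usage would be `dates = next_week_window(res[0][0] if res else None)`, keeping the values as datetimes until formatting is actually required.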

qiita_db/util.py Outdated
Comment on lines 2738 to 2758
sql_command = """
SELECT
pj.processing_job_id AS processing_job_id,
pj.external_job_id AS external_job_id
FROM
qiita.software_command sc
JOIN
qiita.processing_job pj ON pj.command_id = sc.command_id
JOIN
qiita.processing_job_status pjs
ON pj.processing_job_status_id = pjs.processing_job_status_id
LEFT JOIN
qiita.slurm_resource_allocations sra
ON pj.processing_job_id = sra.processing_job_id
WHERE
pjs.processing_job_status = 'success'
AND
pj.external_job_id ~ '^[0-9]+$'
AND
sra.processing_job_id IS NULL;
"""
Member

Note that this query will return all the rows, which will be a lot, so it would be good to limit it based on the job_id of the latest result in the previous block (what you are using to extract the time). Could you limit it to return only jobs newer than that one?

Contributor (Author)

Would we determine whether a job is newer through external_id? I don't think qiita.processing_job stores information about when a job was submitted.

Member

That's correct; the external id is an incremental integer, so it can be used to find what's newer.
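Building on that, the query could gain one condition that casts the text external_job_id for a numeric comparison against the latest recorded job. This sketch is abbreviated to the relevant joins and is not the PR's final query.

```python
# Only consider successful slurm jobs newer than the last one already
# recorded; external_job_id is stored as text, hence the cast.
SQL_NEW_JOBS = """
    SELECT pj.processing_job_id, pj.external_job_id
    FROM qiita.processing_job pj
    JOIN qiita.processing_job_status pjs
      ON pj.processing_job_status_id = pjs.processing_job_status_id
    WHERE pjs.processing_job_status = 'success'
      AND pj.external_job_id ~ '^[0-9]+$'
      AND pj.external_job_id::integer > %s;
"""
```

The `%s` parameter would be the highest external_job_id already present in qiita.slurm_resource_allocations.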

@coveralls
Copy link

Coverage Status

coverage: 92.816% (-0.03%) from 92.85%
when pulling 7c848bc on Gossty:test-util-db
into ffe1ec8 on qiita-spots:dev.

@antgonza (Member) left a comment

Looks great, thank you!

@antgonza antgonza merged commit cca6de7 into qiita-spots:dev Jun 19, 2024
4 checks passed