revise option to control number of resubmit fail jobs #545
base: master
Conversation
- add option: `"strategy": {"customized_script_header_template_file": ""}`
- add option: `sge_pe_name`
for more information, see https://pre-commit.ci
Update pbs.py
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dpdispatcher/machine.py (1), lines 78-101: Ensure retry_count is persisted and restored across (de)serialization.

retry_count is exposed via arginfo and set in `__init__`, but Machine.serialize/load_from_dict neither save nor rehydrate it. This drops user configuration on save/load and undermines centralization.
Apply:
```diff
 def load_from_dict(cls, machine_dict):
 @@
-        context = BaseContext.load_from_dict(machine_dict)
-        machine = machine_class(context=context)
+        context = BaseContext.load_from_dict(machine_dict)
+        machine = machine_class(context=context)
+        # propagate retry_count even if subclass overrides __init__
+        if "retry_count" in machine_dict:
+            setattr(machine, "retry_count", machine_dict["retry_count"])
         return machine
```

```diff
 def serialize(self, if_empty_remote_profile=False):
     machine_dict = {}
     machine_dict["batch_type"] = self.__class__.__name__
     machine_dict["context_type"] = self.context.__class__.__name__
     machine_dict["local_root"] = self.context.init_local_root
     machine_dict["remote_root"] = self.context.init_remote_root
+    machine_dict["retry_count"] = getattr(self, "retry_count", 3)
     if not if_empty_remote_profile:
         machine_dict["remote_profile"] = self.context.remote_profile
     else:
         machine_dict["remote_profile"] = {}
     # normalize the dict
     base = self.arginfo()
     machine_dict = base.normalize_value(machine_dict, trim_pattern="_*")
     return machine_dict
```
Optionally clarify docs:

```diff
-doc_retry_count = "Number of retries to resubmit failed jobs."
+doc_retry_count = "Maximum number of retries to resubmit failed jobs (0 = no retries)."
```

Also applies to: 401-420, 152-170.
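To make the intended round-trip concrete, here is a minimal, self-contained sketch (using a hypothetical `SimpleMachine` stand-in, not the real `Machine` class):

```python
class SimpleMachine:
    """Hypothetical stand-in illustrating retry_count persistence."""

    def __init__(self, retry_count=3):
        self.retry_count = retry_count

    def serialize(self):
        # Fall back to the default of 3 if the attribute was never set.
        return {"retry_count": getattr(self, "retry_count", 3)}

    @classmethod
    def load_from_dict(cls, machine_dict):
        machine = cls()
        # Rehydrate retry_count even if a subclass overrides __init__.
        if "retry_count" in machine_dict:
            machine.retry_count = machine_dict["retry_count"]
        return machine

restored = SimpleMachine.load_from_dict(SimpleMachine(retry_count=5).serialize())
print(restored.retry_count)  # 5
```

Without the rehydration step in `load_from_dict`, a non-default `retry_count` would silently revert to 3 after a save/load cycle, which is exactly the bug the comment describes.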
🧹 Nitpick comments (6)
dpdispatcher/machines/openapi.py (1), lines 31-43: Call `Machine.__init__` to initialize retry_count; drop stale commented code.

Subclasses overriding `__init__` should invoke the base initializer so Machine-level attributes (e.g., retry_count) are consistently set. This also avoids relying on hasattr fallbacks elsewhere and aligns with the PR goal of centralizing this option.
Apply:
```diff
 class OpenAPI(Machine):
     def __init__(self, context):
         if not found_bohriumsdk:
             raise ModuleNotFoundError(
                 "bohriumsdk not installed. Install dpdispatcher with `pip install dpdispatcher[bohrium]`"
             )
-        self.context = context
+        super().__init__(context=context)
         self.remote_profile = context.remote_profile.copy()
         self.grouped = self.remote_profile.get("grouped", True)
-        # self.retry_count = self.remote_profile.get("retry_count", 3)
         self.ignore_exit_code = context.remote_profile.get("ignore_exit_code", True)
```

dpdispatcher/machine.py (1)
Lines 321-336: Guard against errlog being None when constructing the last_err_file tail command.

Task.errlog allows None; formatting err_file unconditionally via shlex.quote(task.errlog) risks a TypeError or invalid tail paths. Consider a safe fallback.
Example:
```diff
-err_file=shlex.quote(task.errlog),
+err_file=shlex.quote(task.errlog or "err"),
```

dpdispatcher/machines/dp_cloud_server.py (2)
Lines 22-36: Initialize the base Machine to set retry_count; remove the commented legacy assignment.

Align with centralization by calling `Machine.__init__`, which binds the context and sets retry_count. The commented line can stay removed.
Apply:
```diff
 class Bohrium(Machine):
 @@
-    def __init__(self, context):
-        self.context = context
+    def __init__(self, context):
+        super().__init__(context=context)
         self.input_data = context.remote_profile["input_data"].copy()
 @@
-        # self.retry_count = context.remote_profile.get("retry_count", 3)
         self.ignore_exit_code = context.remote_profile.get("ignore_exit_code", True)
```
Lines 70-73: Use super() directly for clarity.

```diff
-shell_script = super(DpCloudServer, self).gen_script(job)
+shell_script = super().gen_script(job)
```

dpdispatcher/submission.py (1)
Lines 840-861: Retry policy simplification looks good; fix the log typo and harden edge cases.
- Typo: "fail_cout" ➜ "fail_count".
- Clamp negative retry_count to 0 to avoid surprising behavior.
- Optional: log attempt progress.
Apply:
```diff
 dlog.info(
-    f"job: {self.job_hash} {self.job_id} terminated; "
-    f"fail_cout is {self.fail_count}; resubmitting job"
+    f"job: {self.job_hash} {self.job_id} terminated; "
+    f"fail_count is {self.fail_count}; resubmitting job"
 )
 retry_count = 3  # Default retry count
 assert self.machine is not None
 if hasattr(self.machine, "retry_count"):
     retry_count = self.machine.retry_count
-dlog.info(f"retry_count: {retry_count}")
+retry_count = max(0, int(retry_count))
+dlog.info(f"retry_count: {retry_count}")
+dlog.info(f"retry attempt: {self.fail_count}/{retry_count}")
 if self.fail_count > retry_count:
```

dpdispatcher/machines/batch.py (1)
Line 8: Clarify the Windows batch script header.

The current header, `@echo off`, is appropriate for Windows batch files, but the class name suggests this might be intended as a generic batch system. Consider adding a comment to clarify that this is specifically for Windows batch files.

```diff
-shell_script_header_template = """@echo off"""
+shell_script_header_template = """@echo off"""  # Windows batch script header
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)

- dpdispatcher/machine.py (4 hunks)
- dpdispatcher/machines/batch.py (1 hunks)
- dpdispatcher/machines/dp_cloud_server.py (1 hunks)
- dpdispatcher/machines/openapi.py (1 hunks)
- dpdispatcher/machines/pbs.py (1 hunks)
- dpdispatcher/submission.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dpdispatcher/machines/batch.py (3)

- dpdispatcher/utils/job_status.py (1): JobStatus (4-11)
- dpdispatcher/utils/utils.py (1): customized_script_header_template (216-221)
- dpdispatcher/machines/shell.py (6): gen_script (14-16), gen_script_header (18-30), do_submit (32-61), check_status (63-91), check_finish_tag (104-107), kill (109-119)
🪛 GitHub Check: pyright
dpdispatcher/machines/batch.py
[failure] 57-57:
Method "default_resources" overrides class "Machine" in an incompatible manner
Parameter 2 name mismatch: base parameter is named "res", override parameter is named "resources" (reportIncompatibleMethodOverride)
🪛 GitHub Actions: Type checker
dpdispatcher/machines/batch.py
[error] 57-57: Pyright error: Method "default_resources" overrides class "Machine" in an incompatible manner. Parameter 2 name mismatch: base parameter is named "res", override parameter is named "resources" (reportIncompatibleMethodOverride). Command: /home/runner/actions-runner/cached/externals/node20/bin/node /opt/hostedtoolcache/pyright/1.1.404/x64/package/index.js --outputjson
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (3.9, windows-latest)
- GitHub Check: build (slurm)
🔇 Additional comments (3)
dpdispatcher/machines/batch.py (3)
Lines 12-14: LGTM! The script generation properly delegates to the parent class implementation, maintaining consistency with other machine implementations.
Lines 16-28: LGTM! The header generation logic correctly follows the same pattern as the Shell machine, supporting both customized headers and falling back to the default template.
Lines 81-83: LGTM! The finish tag checking follows the same pattern as other machine implementations and correctly uses the context to check file existence.
Actionable comments posted: 0
🧹 Nitpick comments (3)
.gitignore (2)
Lines 42-43: Add a brief comment explaining why settings.json is tracked.

This helps future contributors understand the exception.
Apply:
```diff
 .vscode/*
+# Track only shared workspace settings (formatters/lints), keep other VSCode files untracked.
 !.vscode/settings.json
```
Line 43: Prefer tool-native config over editor-specific settings (optional).

If the only setting is the ruff line length, consider moving it to pyproject.toml (tool.ruff.line-length) and dropping the VSCode override to avoid editor coupling.
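If that route is taken, the tool-native equivalent would look roughly like the fragment below (the value 88 is an assumption; use whatever limit the VSCode setting currently specifies):

```toml
# pyproject.toml — editor-agnostic Ruff configuration
[tool.ruff]
line-length = 88
```

Ruff reads this automatically, so the `.vscode/settings.json` override and the `.gitignore` exception become unnecessary.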
tests/test_argcheck.py (1)
Lines 30-31: Add tests for override and default round-trip semantics.

Cover a non-default value and the explicit default to guard against regressions.
Apply:
```diff
 @@ class TestJob(unittest.TestCase):
         self.assertDictEqual(norm_dict, expected_dict)

+    def test_machine_retry_count_override(self):
+        norm_dict = Machine.load_from_dict(
+            {
+                "batch_type": "slurm",
+                "context_type": "local",
+                "local_root": "./",
+                "remote_root": "/some/path",
+                "retry_count": 5,
+            }
+        ).serialize()
+        self.assertEqual(norm_dict["retry_count"], 5)
+
+    def test_machine_retry_count_default_roundtrip(self):
+        # Omit retry_count to ensure default is serialized as 3
+        norm_dict = Machine.load_from_dict(
+            {
+                "batch_type": "slurm",
+                "context_type": "local",
+                "local_root": "./",
+                "remote_root": "/some/path",
+            }
+        ).serialize()
+        self.assertEqual(norm_dict["retry_count"], 3)
```
📒 Files selected for processing (4)

- .gitignore (1 hunks)
- .vscode/settings.json (1 hunks)
- dpdispatcher/machine.py (5 hunks)
- tests/test_argcheck.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- .vscode/settings.json
🚧 Files skipped from review as they are similar to previous changes (1)
- dpdispatcher/machine.py
🔇 Additional comments (3)
.gitignore (1)
Line 43: Negated rule to track .vscode/settings.json looks correct.

This exception to .vscode/* is fine and will allow committing only the workspace settings file.
tests/test_argcheck.py (2)
Lines 30-31: LGTM: assertion updated for the new retry_count field.

Matches the base Machine serialization default of 3.
Lines 30-31: Clarify and consolidate retry_count semantics across all Machines and contexts.

Define whether `retry_count=3` means "total attempts" or "resubmits after the first try," remove or update legacy uses (e.g. in `dpcloudserver/client.py`, context defaults, commented-out assignments in machine classes), align documentation with the chosen behavior, and add an integration test to verify failed-job resubmission limits.
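As a sketch of one possible convention (an assumption, not the PR's settled choice): treat `retry_count` as the number of resubmissions after the first attempt, which matches the `fail_count > retry_count` check in submission.py:

```python
def should_resubmit(fail_count, retry_count=3):
    # fail_count counts failures so far; with retry_count = 3 the job may
    # run up to 4 times in total (1 initial attempt + 3 resubmissions).
    return fail_count <= retry_count

# Failures 1-3 each trigger a resubmit; the 4th failure terminates the job.
print([should_resubmit(n) for n in range(1, 6)])  # [True, True, True, False, False]
```

Whichever reading is chosen, stating it in `doc_retry_count` and pinning it with a test like this avoids off-by-one disagreements between backends.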
Actionable comments posted: 0
🧹 Nitpick comments (3)
dpdispatcher/machines/openapi.py (3)
Lines 33-37: Init order change: confirm safety or raise earlier for fail-fast.

Calling `super().__init__(context)` before verifying `found_bohriumsdk` changes the previous fail-fast behavior and may introduce side effects if the SDK is missing. If `Machine.__init__` does any I/O or assumes provider availability, this could be problematic. Either confirm it is side-effect free, or keep the SDK check first. Apply if you want to restore fail-fast:

```diff
 def __init__(self, context):
-    super().__init__(context)
-    if not found_bohriumsdk:
-        raise ModuleNotFoundError(
-            "bohriumsdk not installed. Install dpdispatcher with `pip install dpdispatcher[bohrium]`"
-        )
+    if not found_bohriumsdk:
+        raise ModuleNotFoundError(
+            "bohriumsdk not installed. Install dpdispatcher with `pip install dpdispatcher[bohrium]`"
+        )
+    super().__init__(context)
```
Lines 41-43: Use self.remote_profile consistently.

Minor inconsistency: other fields read from `self.remote_profile`, but `ignore_exit_code` reads from `context.remote_profile`. Prefer the local snapshot for consistency.

```diff
-self.ignore_exit_code = context.remote_profile.get("ignore_exit_code", True)
+self.ignore_exit_code = self.remote_profile.get("ignore_exit_code", True)
```
Lines 38-41: Potential duplication of base init assignments.

If `Machine.__init__` already sets `self.context` and `self.remote_profile`, these reassignments are redundant. Consider removing them to avoid drift.

```diff
-self.context = context
-self.remote_profile = context.remote_profile.copy()
+# self.context and self.remote_profile are initialized in Machine.__init__(context)
```
📒 Files selected for processing (2)

- dpdispatcher/machines/dp_cloud_server.py (1 hunks)
- dpdispatcher/machines/openapi.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- dpdispatcher/machines/dp_cloud_server.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (slurm)
- GitHub Check: build (pbs)
🔇 Additional comments (1)
dpdispatcher/machines/openapi.py (1)
Lines 32-43: No provider-level `retry_count` usage found.

Verified no occurrences of `retry_count` in `dpdispatcher/machines`; centralization in `Machine.__init__` is confirmed.
Actionable comments posted: 0
🧹 Nitpick comments (3)
dpdispatcher/machines/openapi.py (3)
Lines 34-37: Preserve fast-fail by checking SDK availability before base init.

If `Machine.__init__` performs non-trivial work (I/O, state writes), failing earlier avoids side effects when the Bohrium SDK is missing. Consider swapping the order.

```diff
-def __init__(self, context, **kwargs):
-    super().__init__(context=context, **kwargs)
-    if not found_bohriumsdk:
-        raise ModuleNotFoundError(
+def __init__(self, context, **kwargs):
+    if not found_bohriumsdk:
+        raise ModuleNotFoundError(
             "bohriumsdk not installed. Install dpdispatcher with `pip install dpdispatcher[bohrium]`"
         )
+    super().__init__(context=context, **kwargs)
```
Lines 41-42: Use self.remote_profile consistently.

Line 42 reads from `context.remote_profile` while line 41 uses `self.remote_profile`. Use the same source to avoid divergence.

```diff
-self.ignore_exit_code = context.remote_profile.get("ignore_exit_code", True)
+self.ignore_exit_code = self.remote_profile.get("ignore_exit_code", True)
```
Lines 38-40: Remove the redundant context reassignment.

- In `OpenAPI.__init__`, the call to `bind_context()` already does `self.context = context`; drop the extra `self.context = context` on line 38.
- Retain `self.remote_profile = context.remote_profile.copy()` to snapshot the already-normalized profile for use in the machine.
📒 Files selected for processing (3)

- dpdispatcher/machine.py (6 hunks)
- dpdispatcher/machines/dp_cloud_server.py (1 hunks)
- dpdispatcher/machines/openapi.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- dpdispatcher/machines/dp_cloud_server.py
- dpdispatcher/machine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build (slurm)
- GitHub Check: build (pbs)
- GitHub Check: test (3.7, macos-13)
🔇 Additional comments (1)
dpdispatcher/machines/openapi.py (1)
Lines 32-33: Forwarding `retry_count` via kwargs centralizes handling; no residual direct reads found.

A ripgrep search across all Python files returned zero instances of `remote_profile.retry_count` or similar, confirming no leftover provider-specific drift.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
dpdispatcher/machines/pbs.py (2)
Lines 234-236: Quote SGE submit paths for safety and parity with PBS.

Unquoted paths are brittle and unsafe. PBS already uses `shlex.quote`. Apply:

```diff
-stdin, stdout, stderr = self.context.block_checkcall(
-    "cd {} && {} {}".format(script_file_dir, "qsub", script_file_name)
-)
+stdin, stdout, stderr = self.context.block_checkcall(
+    "cd {} && {} {}".format(
+        shlex.quote(script_file_dir), "qsub", shlex.quote(script_file_name)
+    )
+)
```
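For illustration, the effect of the quoting on a directory containing spaces (the path values here are hypothetical, not taken from the codebase):

```python
import shlex

script_file_dir = "/scratch/my jobs/task 01"  # hypothetical path with spaces
script_file_name = "run.sub"

# Without quoting, the spaces would split the cd argument into several words;
# shlex.quote wraps unsafe strings in single quotes and leaves safe ones alone.
cmd = "cd {} && {} {}".format(
    shlex.quote(script_file_dir), "qsub", shlex.quote(script_file_name)
)
print(cmd)  # cd '/scratch/my jobs/task 01' && qsub run.sub
```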
Lines 89-91: Fix typo in error message ("erro info" → "error info").

Minor but user-facing. Apply:

```diff
-f"status command {command} fails to execute. erro info: {err_str} return code {ret}"
+f"status command {command} fails to execute. error info: {err_str} return code {ret}"
```

Also applies to: 139-141.
♻️ Duplicate comments (1)
dpdispatcher/machines/pbs.py (1)
Lines 201-207: Make the SGE `-pe` option truly optional; align schema and header generation.

Current code always injects `-pe`. Allow `None` and guard the header line. Apply:

```diff
-pe_name = resources.kwargs.get("pe_name", "mpi")
+pe_name = resources.kwargs.get("pe_name", None)
 sge_script_header_dict = {}
 sge_script_header_dict["select_node_line"] = f"#$ -N {job_name}\n"
-sge_script_header_dict["select_node_line"] += (
-    f"#$ -pe {pe_name} {resources.cpu_per_node}\n"
-)
+if pe_name is not None:
+    sge_script_header_dict["select_node_line"] += (
+        f"#$ -pe {pe_name} {resources.cpu_per_node}\n"
+    )
```

And in the resources schema (lines 311-317):

```diff
-Argument(
-    "pe_name",
-    str,
-    optional=True,
-    default="mpi",
-    doc=doc_pe_name,
-    alias=["sge_pe_name"],
-),
+Argument(
+    "pe_name",
+    [str, type(None)],
+    optional=True,
+    default=None,
+    doc=doc_pe_name,
+    alias=["sge_pe_name"],
+),
```
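A standalone sketch of the guarded header generation described above (hypothetical helper and values, not the actual SGE class):

```python
def sge_select_node_line(job_name, cpu_per_node, pe_name=None):
    # Emit the -pe directive only when a parallel environment is requested;
    # with pe_name=None the header contains just the job-name line.
    line = f"#$ -N {job_name}\n"
    if pe_name is not None:
        line += f"#$ -pe {pe_name} {cpu_per_node}\n"
    return line

print(sge_select_node_line("demo_job", 8))         # only the -N line
print(sge_select_node_line("demo_job", 8, "mpi"))  # -N line plus "#$ -pe mpi 8"
```

This keeps single-slot jobs valid on clusters that have no parallel environment configured, while preserving the old behavior for users who set `pe_name` (or its `sge_pe_name` alias) explicitly.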
🧹 Nitpick comments (1)
dpdispatcher/machines/pbs.py (1)
Lines 194-195: Accept positional args in SGE `__init__` to mirror PBS and avoid potential breakage.

```diff
-class SGE(PBS):
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
+class SGE(PBS):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
```
📒 Files selected for processing (1)

- dpdispatcher/machines/pbs.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build (ssh_rsync)
- GitHub Check: build (ssh)
- GitHub Check: build (pbs)
- GitHub Check: build (slurm)
🔇 Additional comments (1)
dpdispatcher/machines/pbs.py (1)
Lines 20-22: Preserve positional args in the `PBS.__init__` signature for backward compatibility.

Internal search found no calls passing `context` positionally, but external clients may still rely on `PBS(context, …)`. Apply:

```diff
 class PBS(Machine):
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
```

Verify that no callers depend on the old positional signature.
Pull Request Overview
This PR standardizes the retry mechanism for failed jobs across different machine backends by introducing a consistent retry_count
parameter. The change addresses inconsistencies in how different machine implementations handled job retry logic.
- Adds a unified `retry_count` parameter to the base Machine class with a default value of 3
- Removes machine-specific retry count implementations from individual backends (OpenAPI, Bohrium)
- Harmonizes constructor patterns across machine backends to use a consistent `**kwargs` approach
Reviewed Changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| dpdispatcher/machine.py | Adds retry_count parameter to base Machine class constructor, serialization, and argument info |
| dpdispatcher/machines/pbs.py | Simplifies PBS and SGE constructors to use **kwargs pattern |
| dpdispatcher/machines/openapi.py | Updates OpenAPI constructor and removes local retry_count implementation |
| dpdispatcher/machines/dp_cloud_server.py | Updates Bohrium constructor and removes local retry_count implementation |
| tests/test_argcheck.py | Updates test expectations to include retry_count in serialization |
| .vscode/settings.json | Adds VSCode configuration for Ruff line length |
```python
def __init__(self, **kwargs):
    super().__init__(**kwargs)
```
The empty `__init__` method only calls the parent constructor without adding any PBS-specific functionality. Consider removing this method entirely and relying on the parent class constructor, unless there are plans to add PBS-specific initialization logic.
Suggested change:

```diff
-def __init__(self, **kwargs):
-    super().__init__(**kwargs)
```
```python
def __init__(self, **kwargs):
    super().__init__(**kwargs)
```
Similar to the PBS class, this empty `__init__` method in SGE only calls the parent constructor. Consider removing this method to reduce code duplication unless SGE-specific initialization is planned.
Suggested change:

```diff
-def __init__(self, **kwargs):
-    super().__init__(**kwargs)
```
@njzjz, SGE will not work if you remove what Copilot asked.

Could you explain the reason? It only calls the method from the super class.
So far, the option to control the number of resubmitted failed jobs is only partially implemented and is inconsistent across different Machines.

This PR tries to address this problem, as discussed in issue #525.
Summary by CodeRabbit
New Features
Tests
Chores
Bug Fixes