✨ feat: add pod status checking to distinguish pending vs running jobs #51
…n_progress misclassification
…er and status of job is returned as unknown
- Extracted Celery and direct submission logic into helper functions.
- Fixed a blocking bug by offloading Celery's to a thread pool executor.
- Normalized generation to ensure consistent returns on failure.
- Improved readability and maintainability by simplifying the status mapping logic.
Avoid blocking the event loop while waiting for Celery response timeout.
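For readers following the two commits above, here is a minimal sketch of the general technique: moving a blocking wait off the asyncio event loop and onto the default thread pool. It is not the code from this PR. `blocking_wait` is a hypothetical stand-in for whatever blocking Celery call was involved, and `wait_without_blocking` is an illustrative name.

```python
import asyncio
import time


def blocking_wait(timeout_s: float) -> str:
    """Hypothetical stand-in for a blocking call (e.g. waiting on a task result)."""
    time.sleep(timeout_s)
    return "done"


async def wait_without_blocking(timeout_s: float) -> str:
    """Run the blocking call in a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, blocking_wait, timeout_s)


async def main() -> list:
    ticks = []

    async def heartbeat() -> None:
        # Keeps making progress while the blocking call runs in its thread.
        for i in range(3):
            ticks.append(i)
            await asyncio.sleep(0.01)

    result, _ = await asyncio.gather(wait_without_blocking(0.05), heartbeat())
    return [result, ticks]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

If `blocking_wait` were called directly inside a coroutine, the heartbeat would stall until it returned; with `run_in_executor`, both proceed concurrently.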
Deploying this to an OpenShift cluster to test...
    return pod_phase
    return "Unknown"

async def check_k8s_job_status(tune_id: str, retry_label_lookup=True):
@fMurugi is there a missing argument here?
check_pod_phase
Should we rename the function to check_tuning_task_status?
@fredotieno not sure about the argument.
    logger.debug(f"Error checking job conditions: {e}")
    return None

async def get_k8s_status(job_name: str) -> str:
@fMurugi should we rename this to something like get_aggregate_job_and_pod_status?
@fMurugi please use black and ruff to format the changes introduced.
Example for the gfmstudio/fine_tuning/core/kubernetes.py file:
black gfmstudio/fine_tuning/core/kubernetes.py && ruff check --select I --fix gfmstudio/fine_tuning/core/kubernetes.py
…tial-studio-core into feat/check-pod-phase
Returns
-------
str
    The pod phase: 'Running', 'Pending', 'Succeeded', 'Failed', 'Unknown', or None if no pod found
This is a great introduction. To make this easier to maintain in the future, consider creating a reference schema class for these statuses, so that there is only one place to change or update the values in case of a future change.
Something like:
from enum import Enum

class PodStatusPhase(str, Enum):
    RUNNING = "Running"
    PENDING = "Pending"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    UNKNOWN = "Unknown"
    NONE = "None"  # adjust to what None is here
Then when returning the outputs from k8s,
if result:
    try:
        return PodStatusPhase(status_str)
    except ValueError:
        # Update this to handle what states would error
        return PodStatusPhase.UNKNOWN
return None
Then, wherever your code uses these values, you can reference the enum members directly, e.g.
PodStatusPhase.FAILED
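A self-contained, runnable version of the suggestion above might look like the sketch below. The helper name `parse_pod_phase` is hypothetical and not part of the PR. It also assumes that "no pod found" is represented by Python's `None` rather than by an enum member, which is why the `NONE` member is left out here.

```python
from enum import Enum
from typing import Optional


class PodStatusPhase(str, Enum):
    """Single source of truth for the pod phase strings."""

    RUNNING = "Running"
    PENDING = "Pending"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    UNKNOWN = "Unknown"


def parse_pod_phase(raw: Optional[str]) -> Optional[PodStatusPhase]:
    """Map a raw phase string to the enum (hypothetical helper).

    Returns None when no phase is available (assumed to mean "no pod found")
    and falls back to UNKNOWN for any string that is not a known phase.
    """
    if raw is None:
        return None
    try:
        return PodStatusPhase(raw)
    except ValueError:
        return PodStatusPhase.UNKNOWN


if __name__ == "__main__":
    print(parse_pod_phase("Running"))
    print(parse_pod_phase("SomethingElse"))
    print(parse_pod_phase(None))
```

Because the class also inherits from `str`, members still compare equal to the plain strings (`PodStatusPhase.FAILED == "Failed"`), so existing string comparisons would keep working during a gradual migration.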
… parse the env variables
Summary
Related Issue (optional)
How to test this PR?
Screenshots / Logs (optional)
Checklist