Skip to content

✨ feat: add pod status checking to distinguish pending vs running jobs#51

Open
fMurugi wants to merge 30 commits intomainfrom
feat/check-pod-phase
Open

✨ feat: add pod status checking to distinguish pending vs running jobs#51
fMurugi wants to merge 30 commits intomainfrom
feat/check-pod-phase

Conversation

@fMurugi
Copy link
Copy Markdown
Contributor

@fMurugi fMurugi commented Apr 28, 2026

Summary

Related Issue (optional)

How to test this PR?

Screenshots / Logs (optional)

Checklist

  • This PR targets the main branch
  • I have added or updated relevant docs.
  • I have not included any secrets or credentials.
  • Linting and formatting checks pass.

fMurugi added 17 commits April 28, 2026 15:37
- Extracted Celery and direct submission logic into helper functions.
- Fixed a blocking bug by offloading Celery's  to a thread pool executor.
- Normalized  generation to ensure consistent returns on failure.
- Improved readability and maintainability by simplifying the status mapping logic.
Avoid blocking the event loop while waiting for Celery response timeout.
@fMurugi fMurugi marked this pull request as ready for review April 30, 2026 09:32
bedwards-ibm
bedwards-ibm previously approved these changes Apr 30, 2026
Copy link
Copy Markdown
Contributor

@bedwards-ibm bedwards-ibm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@bglar bglar changed the title ✨ feat: add pod status checking to distinfuish pending vs running jobs ✨ feat: add pod status checking to distinguish pending vs running jobs May 4, 2026
@bglar
Copy link
Copy Markdown
Contributor

bglar commented May 5, 2026

Deploying this to an openshift cluster to test...

return pod_phase
return "Unknown"

async def check_k8s_job_status(tune_id: str, retry_label_lookup=True):
Copy link
Copy Markdown
Contributor

@fredotieno fredotieno May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fMurugi is there a missing argument here?
check_pod_phase

should we rename the function to check_tuning_task_status

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fredotieno not sure about the argument.

logger.debug(f"Error checking job conditions: {e}")
return None

async def get_k8s_status(job_name: str) -> str:
Copy link
Copy Markdown
Contributor

@fredotieno fredotieno May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fMurugi
should we rename this to something like get_aggregate_job_and_pod_status

Copy link
Copy Markdown
Contributor

@bglar bglar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fMurugi use black and ruff to format changes introduced

Example for the gfmstudio/fine_tuning/core/kubernetes.py file:

black gfmstudio/fine_tuning/core/kubernetes.py && ruff check --select I --fix gfmstudio/fine_tuning/core/kubernetes.py

Comment thread gfmstudio/fine_tuning/core/kubernetes.py Outdated
Returns
-------
str
The pod phase: 'Running', 'Pending', 'Succeeded', 'Failed', 'Unknown', or None if no pod found
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great introduction. Thinking of easily maintaining this in the future, creating a reference schema class for these statuses would be ideal ensuring that you would only have one place to change or update the values incase of a future change/update.

Something like ...


class PodStatusPhase(str,Enum)
    RUNNING = "Running"
    PENDING = "Pending"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    UNKNOWN = "Unknown"
    NONE = "None" # adjust to what None is here

Then when returning the outputs from k8s,

    if result:
        try:
            return PodStatusPhase(status_str)
        except ValueError:
            # Update this to handle what states would error
            return PodStatusPhase.UNKNOWN
    return None

And whenever in your code you are using the values you can easily use the values i.e
PodStatusPhase.FAILED

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants