Description
I created an MPIJob with an invalid pod template, and the MPIJob never gets a status (I think the status should be Failed).
As a result, I can't distinguish MPIJobs that are too new to have a status from MPIJobs with an invalid pod template.
My MPIJob is shown below:
kubectl get mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod -oyaml
The output is:
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  creationTimestamp: "2023-11-15T02:01:44Z"
  generation: 1
  labels:
    deadline: 2023-11-15_02-06-44
  name: ai62da0dbe-6406-4252-85d6-51ef87eab10d
  namespace: cpod
  resourceVersion: "2787007"
  uid: e5703c73-f27e-45ef-9049-fd40c152d4d6
spec:
  launcherCreationPolicy: WaitForWorkersReady
  mpiImplementation: OpenMPI
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: "111"
            imagePullPolicy: IfNotPresent
            name: launcher
          hostIPC: true
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - image: "111"
            imagePullPolicy: IfNotPresent
            name: worker
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - mountPath: "111"
              name: ckpt-pv
            - mountPath: "111"
              name: saved-model-pv
          hostIPC: true
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-GeForce-RTX-3090
          volumes:
          - name: ckpt-pv
            persistentVolumeClaim:
              claimName: ai62da0dbe-6406-4252-85d6-51ef87eab10d-ckpt
              readOnly: false
          - name: saved-model-pv
            persistentVolumeClaim:
              claimName: ai62da0dbe-6406-4252-85d6-51ef87eab10d-modelsave
              readOnly: false
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      minAvailable: 1
    suspend: false
  slotsPerWorker: 1
  sshAuthMountPath: /root/.ssh
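The Worker container in the manifest above mounts two volumes at the same `mountPath` ("111"), which is the kind of pod template error that is only rejected when the pod is created. As a rough illustration, a client could catch this case before submitting the job. The sketch below is not part of the MPI operator; the function name `duplicate_mount_paths` is hypothetical, and it assumes the container spec has already been loaded into a plain Python dict.

```python
from collections import Counter


def duplicate_mount_paths(container: dict) -> list[str]:
    """Return the mountPath values used more than once in one container spec.

    Hypothetical helper: `container` is assumed to be a dict shaped like an
    entry of `spec.containers` in a pod template.
    """
    paths = [m["mountPath"] for m in container.get("volumeMounts", [])]
    counts = Counter(paths)
    return sorted(path for path, n in counts.items() if n > 1)


# Worker container from the manifest above, trimmed to the relevant fields.
worker = {
    "name": "worker",
    "volumeMounts": [
        {"mountPath": "111", "name": "ckpt-pv"},
        {"mountPath": "111", "name": "saved-model-pv"},
    ],
}

print(duplicate_mount_paths(worker))
```

This only covers duplicate mount paths within a single container; it does not replace the API server's full pod validation.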
When I describe the MPIJob:
kubectl describe mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod
The output is:
Name: ai62da0dbe-6406-4252-85d6-51ef87eab10d
Namespace: cpod
Labels: deadline=2023-11-15_02-06-44
Annotations: <none>
API Version: kubeflow.org/v2beta1
Kind: MPIJob
Metadata:
  Creation Timestamp: 2023-11-15T02:01:44Z
  Generation: 1
  Managed Fields:
    API Version: kubeflow.org/v2beta1
    Fields Type: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:deadline:
      f:spec:
        .:
        f:launcherCreationPolicy:
        f:mpiImplementation:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:hostIPC:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:hostIPC:
                f:nodeSelector:
                f:volumes:
        f:runPolicy:
          .:
          f:cleanPodPolicy:
          f:schedulingPolicy:
            .:
            f:minAvailable:
          f:suspend:
        f:slotsPerWorker:
        f:sshAuthMountPath:
    Manager: cpodmanager
    Operation: Update
    Time: 2023-11-15T02:01:44Z
  Resource Version: 2787007
  UID: e5703c73-f27e-45ef-9049-fd40c152d4d6
Spec:
  Launcher Creation Policy: WaitForWorkersReady
  Mpi Implementation: OpenMPI
  Mpi Replica Specs:
    Launcher:
      Replicas: 1
      Template:
        Spec:
          Containers:
            Image: 111
            Image Pull Policy: IfNotPresent
            Name: launcher
          Host IPC: true
    Worker:
      Replicas: 1
      Template:
        Spec:
          Containers:
            Image: 111
            Image Pull Policy: IfNotPresent
            Name: worker
            Resources:
              Limits:
                nvidia.com/gpu: 1
            Volume Mounts:
              Mount Path: 111
              Name: ckpt-pv
              Mount Path: 111
              Name: saved-model-pv
          Host IPC: true
          Node Selector:
            nvidia.com/gpu.product: NVIDIA-GeForce-RTX-3090
          Volumes:
            Name: ckpt-pv
            Persistent Volume Claim:
              Claim Name: ai62da0dbe-6406-4252-85d6-51ef87eab10d-ckpt
              Read Only: false
            Name: saved-model-pv
            Persistent Volume Claim:
              Claim Name: ai62da0dbe-6406-4252-85d6-51ef87eab10d-modelsave
              Read Only: false
  Run Policy:
    Clean Pod Policy: Running
    Scheduling Policy:
      Min Available: 1
    Suspend: false
  Slots Per Worker: 1
  Ssh Auth Mount Path: /root/.ssh
Events:
  Type     Reason         Age                   From                Message
  ----     ------         ----                  ----                -------
  Normal   MPIJobCreated  5m48s (x12 over 27m)  mpi-job-controller  MPIJob cpod/ai62da0dbe-6406-4252-85d6-51ef87eab10d is created.
  Warning  MPIJobFailed   5m48s (x12 over 27m)  mpi-job-controller  worker pod created failed: Pod "ai62da0dbe-6406-4252-85d6-51ef87eab10d-worker-0" is invalid: spec.containers[0].volumeMounts[1].mountPath: Invalid value: "111": must be unique
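Until the status reflects this failure, one possible client-side workaround is to combine the (missing) status with the Warning events shown above. The sketch below is only an illustration under stated assumptions: the function name `classify_mpijob` and the three returned labels are hypothetical, and the inputs are assumed to be the MPIJob and its events already parsed into Python dicts (for example from `-o json` output). It does not describe how the operator itself behaves.

```python
def classify_mpijob(job: dict, events: list[dict]) -> str:
    """Classify an MPIJob using its status conditions and its events.

    Hypothetical helper; the returned labels are made up for this sketch:
      "has-status"       - the controller has written status conditions
      "creation-failing" - no status, but a Warning/MPIJobFailed event exists
      "no-status-yet"    - no status and no failure event (possibly too new)
    """
    conditions = (job.get("status") or {}).get("conditions") or []
    if conditions:
        return "has-status"

    failed = [
        e for e in events
        if e.get("type") == "Warning" and e.get("reason") == "MPIJobFailed"
    ]
    return "creation-failing" if failed else "no-status-yet"


# Trimmed inputs modelled on the job and events in this report.
job = {"metadata": {"name": "ai62da0dbe-6406-4252-85d6-51ef87eab10d"}}
events = [
    {"type": "Normal", "reason": "MPIJobCreated"},
    {"type": "Warning", "reason": "MPIJobFailed"},
]

print(classify_mpijob(job, events))
```

Events expire after a while, so this only helps for recent jobs; a Failed condition in the status would still be the reliable signal.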