CLI states job successful even when job failed #461

Open
noryev opened this issue Dec 2, 2024 · 1 comment

noryev commented Dec 2, 2024

Describe the bug

When a Resource Provider (RP) is offered a job to compute but does not have enough resources available, the job will not succeed. The problem is that the CLI does not notify the user that the job failed, even though the lack of output makes it obvious that it did.
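
As a hedged sketch of the expected behavior (this is not lilypad's actual code; `run_module` and the command below are hypothetical placeholders), the fix amounts to surfacing the job process's real exit status instead of unconditionally reporting success:

```python
# Hedged sketch only -- not lilypad's implementation. run_module() and the
# command below are hypothetical; the point is that the wrapper should report
# failure whenever the job process exits non-zero.
import subprocess
import sys

def run_module(cmd: list[str]) -> int:
    """Run the job command and surface its real exit status to the user."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print(f"Job failed (exit code {result.returncode})", file=sys.stderr)
    else:
        print("Job completed successfully")
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_module(["python", "/workspace/run_sdxl.py"]))
```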

Reproduction

Run a job on an RP that has most of its VRAM already allocated to other tasks.
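
For anyone trying to reproduce this, one rough way to create the low-VRAM condition on the RP host (assuming PyTorch with CUDA is installed locally; this helper is not part of lilypad) is to pin most of the GPU's free memory from a separate process before submitting the job:

```python
# Assumption-laden helper, not part of lilypad: occupy most of GPU 0's free
# memory so a subsequently submitted SDXL job cannot fit on the card.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"GPU 0: {free_bytes / 1024**2:.0f} MiB free of {total_bytes / 1024**2:.0f} MiB")

# Leave roughly 200 MiB free; the allocation is released when this process exits.
hog = torch.empty(int(free_bytes - 200 * 1024**2), dtype=torch.uint8, device="cuda")
input("VRAM pinned; submit the lilypad job now, then press Enter to release it.")
```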

Logs

2024-12-02 23:15:32,414 - INFO - Starting SDXL lightweight script
2024-12-02 23:15:32,414 - INFO - Using prompt: "CLASSIFIED"
2024-12-02 23:15:32,414 - INFO - Loading SDXL-Turbo pipeline
2024-12-02 23:15:32,414 - INFO - Using pre-downloaded model
Couldn't connect to the Hub: (MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /api/models/stabilityai/sdxl-turbo (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x77c580c88550>: Failed to resolve \'huggingface.co\' ([Errno -3] Temporary failure in name resolution)"))'), '(Request ID: 5da4d690-df80-49b9-b404-c1cd694a4da6)').
Will try to load from local cache.

Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]
Loading pipeline components...:  29%|██▊       | 2/7 [00:00<00:00,  7.36it/s]
Loading pipeline components...:  57%|█████▋    | 4/7 [00:00<00:00,  4.78it/s]
Loading pipeline components...:  71%|███████▏  | 5/7 [00:01<00:00,  3.95it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:01<00:00,  5.86it/s]
2024-12-02 23:15:33,808 - INFO - Using device: cuda
2024-12-02 23:15:42,058 - ERROR - An error occurred: CUDA out of memory. Tried to allocate 14.00 MiB. GPU 0 has a total capacity of 11.60 GiB of which 126.50 MiB is free. Process 261350 has 8.66 MiB memory in use. Process 2850306 has 485.57 MiB memory in use. Process 3710576 has 6.74 GiB memory in use. Of the allocated memory 6.30 GiB is allocated by PyTorch, and 253.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/workspace/run_sdxl.py", line 37, in main
    pipe = pipe.to(device)
  File "/usr/local/lib/python3.9/site-packages/diffusers/pipelines/pipeline_utils.py", line 431, in to
    module.to(device, dtype)
  File "/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2905, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1174, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1160, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU 0 has a total capacity of 11.60 GiB of which 126.50 MiB is free. Process 261350 has 8.66 MiB memory in use. Process 2850306 has 485.57 MiB memory in use. Process 3710576 has 6.74 GiB memory in use. Of the allocated memory 6.30 GiB is allocated by PyTorch, and 253.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
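
The traceback shows the OOM is raised by `pipe.to(device)`. Independent of the CLI-side fix, a hedged sketch of how the job script could make the failure unambiguous to the orchestrator (the pipeline loading below is an assumption, not the script's real code) is to catch the OOM and exit non-zero:

```python
# Sketch under stated assumptions -- not the actual run_sdxl.py. The idea is
# simply that a CUDA OOM should translate into a non-zero exit code, which the
# CLI can then surface as a failed job instead of a silent "success".
import logging
import sys

import torch
from diffusers import AutoPipelineForText2Image

logging.basicConfig(level=logging.INFO)

def main() -> int:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    try:
        pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo")
        pipe = pipe.to(device)  # the call that raises torch.OutOfMemoryError in the log
    except torch.cuda.OutOfMemoryError as err:
        logging.error("An error occurred: %s", err)
        return 1  # non-zero exit lets the caller report the job as failed
    # ... image generation would go here ...
    return 0

if __name__ == "__main__":
    sys.exit(main())
```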

Screenshots

No response

System Info

Resource Providers are running lilypad from source (main branch, Dec 2nd).

Severity

Annoyance


noryev commented Dec 2, 2024

@walkerlj0
