CLI states job successful even when job failed #461

Open
noryev opened this issue Dec 2, 2024 · 1 comment

noryev commented Dec 2, 2024

Describe the bug

When a Resource Provider (RP) is offered a job to compute but does not have enough resources available, the job will not succeed. The problem is that the CLI does not notify the user that the job failed, even though the lack of output makes it obvious that it did.
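
As a hedged sketch of the expected behavior (this is not lilypad's actual code; `run_module` and the command below are hypothetical placeholders), the fix amounts to surfacing the job process's real exit status instead of unconditionally reporting success:

```python
# Hedged sketch only -- not lilypad's implementation. run_module() and the
# command below are hypothetical; the point is that the wrapper should report
# failure whenever the job process exits non-zero.
import subprocess
import sys

def run_module(cmd: list[str]) -> int:
    """Run the job command and surface its real exit status to the user."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print(f"Job failed (exit code {result.returncode})", file=sys.stderr)
    else:
        print("Job completed successfully")
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_module(["python", "/workspace/run_sdxl.py"]))
```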

Reproduction

Run a job on an RP that has most of its VRAM already allocated to other tasks.
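
For anyone trying to reproduce this, one rough way to create the low-VRAM condition on the RP host (assuming PyTorch with CUDA is installed locally; this helper is not part of lilypad) is to pin most of the GPU's free memory from a separate process before submitting the job:

```python
# Assumption-laden helper, not part of lilypad: occupy most of GPU 0's free
# memory so a subsequently submitted SDXL job cannot fit on the card.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"GPU 0: {free_bytes / 1024**2:.0f} MiB free of {total_bytes / 1024**2:.0f} MiB")

# Leave roughly 200 MiB free; the allocation is released when this process exits.
hog = torch.empty(int(free_bytes - 200 * 1024**2), dtype=torch.uint8, device="cuda")
input("VRAM pinned; submit the lilypad job now, then press Enter to release it.")
```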

Logs

2024-12-02 23:15:32,414 - INFO - Starting SDXL lightweight script
2024-12-02 23:15:32,414 - INFO - Using prompt: "CLASSIFIED"
2024-12-02 23:15:32,414 - INFO - Loading SDXL-Turbo pipeline
2024-12-02 23:15:32,414 - INFO - Using pre-downloaded model
Couldn't connect to the Hub: (MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /api/models/stabilityai/sdxl-turbo (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x77c580c88550>: Failed to resolve \'huggingface.co\' ([Errno -3] Temporary failure in name resolution)"))'), '(Request ID: 5da4d690-df80-49b9-b404-c1cd694a4da6)').
Will try to load from local cache.

Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]
Loading pipeline components...:  29%|██▊       | 2/7 [00:00<00:00,  7.36it/s]
Loading pipeline components...:  57%|█████▋    | 4/7 [00:00<00:00,  4.78it/s]
Loading pipeline components...:  71%|███████▏  | 5/7 [00:01<00:00,  3.95it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:01<00:00,  5.86it/s]
2024-12-02 23:15:33,808 - INFO - Using device: cuda
2024-12-02 23:15:42,058 - ERROR - An error occurred: CUDA out of memory. Tried to allocate 14.00 MiB. GPU 0 has a total capacity of 11.60 GiB of which 126.50 MiB is free. Process 261350 has 8.66 MiB memory in use. Process 2850306 has 485.57 MiB memory in use. Process 3710576 has 6.74 GiB memory in use. Of the allocated memory 6.30 GiB is allocated by PyTorch, and 253.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/workspace/run_sdxl.py", line 37, in main
    pipe = pipe.to(device)
  File "/usr/local/lib/python3.9/site-packages/diffusers/pipelines/pipeline_utils.py", line 431, in to
    module.to(device, dtype)
  File "/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2905, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1174, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1160, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU 0 has a total capacity of 11.60 GiB of which 126.50 MiB is free. Process 261350 has 8.66 MiB memory in use. Process 2850306 has 485.57 MiB memory in use. Process 3710576 has 6.74 GiB memory in use. Of the allocated memory 6.30 GiB is allocated by PyTorch, and 253.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
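
The traceback shows the OOM is raised by `pipe.to(device)`. Independent of the CLI-side fix, a hedged sketch of how the job script could make the failure unambiguous to the orchestrator (the pipeline loading below is an assumption, not the script's real code) is to catch the OOM and exit non-zero:

```python
# Sketch under stated assumptions -- not the actual run_sdxl.py. The idea is
# simply that a CUDA OOM should translate into a non-zero exit code, which the
# CLI can then surface as a failed job instead of a silent "success".
import logging
import sys

import torch
from diffusers import AutoPipelineForText2Image

logging.basicConfig(level=logging.INFO)

def main() -> int:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    try:
        pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo")
        pipe = pipe.to(device)  # the call that raises torch.OutOfMemoryError in the log
    except torch.cuda.OutOfMemoryError as err:
        logging.error("An error occurred: %s", err)
        return 1  # non-zero exit lets the caller report the job as failed
    # ... image generation would go here ...
    return 0

if __name__ == "__main__":
    sys.exit(main())
```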

Screenshots

No response

System Info

Resource Providers are running lilypad from source (main branch, Dec 2nd).

Severity

Annoyance


noryev commented Dec 2, 2024

@walkerlj0
