Training Job Error for ray_xgboost_gpu.ipynb #10

@smart-patrol

I am getting the following error when running the SageMaker notebook ray_xgboost_gpu.ipynb.

UnexpectedStatusException: Error for Training job pytorch-training-2023-05-19-19-07-59-014: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train_xgboost_airline.py"
2023-05-19 19:13:50,343	INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 10.2.116.47:9339...
2023-05-19 19:13:50,364	INFO worker.py:1625 -- Connected to Ray cluster.
Traceback (most recent call last):
  File "train_xgboost_airline.py", line 125, in <module>
    main()
  File "train_xgboost_airline.py", line 107, in main
    evals=[(dtrain, "train"), (dval, "val")])
  File "/opt/conda/lib/python3.6/site-packages/xgboost_ray/main.py", line 1565, in train
    placement_strategy,
  File "/opt/conda/lib/python3.6/site-packages/xgboost_ray/main.py", line 959, in _create_placement_group
    f"Placement group creation timed out after {timeout} seconds. "
TimeoutError: Placement group creation timed out after 100 seconds. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'node:10.2.116.47': 0.98, 'memory': 39562652059.0, 'CPU': 16.0, 'acc
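
The error message asks for a check that the cluster has enough resources for the requested actors. A minimal sketch of such a check follows. It assumes the training script builds an `xgboost_ray.RayParams` with `num_actors`, `cpus_per_actor`, and `gpus_per_actor`, and that the resource dictionary comes from `ray.available_resources()`. The helper name `missing_resources` and the example request (4 actors, 4 CPUs and 1 GPU each) are hypothetical and not taken from train_xgboost_airline.py. The snapshot reuses only the CPU and memory values visible in the log; the log is cut off before any GPU entry, so the snapshot leaves GPUs out purely for illustration.

```python
def missing_resources(available, num_actors, cpus_per_actor, gpus_per_actor=0):
    """Return {resource: shortfall} for resources the cluster lacks in total.

    `available` is a dict shaped like the result of ray.available_resources().
    This is an aggregate check only: a placement group can still fail if no
    single node can host one actor's bundle.
    """
    needed = {
        "CPU": num_actors * cpus_per_actor,
        "GPU": num_actors * gpus_per_actor,
    }
    return {
        name: amount - available.get(name, 0.0)
        for name, amount in needed.items()
        if amount > available.get(name, 0.0)
    }


# Hypothetical snapshot: CPU and memory copied from the log above,
# GPU entry omitted for illustration because the log is truncated.
available = {"CPU": 16.0, "memory": 39562652059.0}

# Hypothetical request: 4 actors, each with 4 CPUs and 1 GPU.
print(missing_resources(available, num_actors=4, cpus_per_actor=4, gpus_per_actor=1))
# → {'GPU': 4.0}
```

An empty result means the totals fit. A non-empty result names the resource to reduce in `RayParams` (fewer actors, or fewer CPUs or GPUs per actor) or to add by choosing a larger instance type or instance count.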
