RuntimeError: CUDA driver error: operation not supported #1446
I have a similar problem. It may not be exactly the same, but I get the same error message. I am using a vGPU VM with an A6000 and 24 GB of VRAM (the full card has 48 GB).
Full script: https://github.com/cool9203/unsloth-train/blob/d1c1ab702707ae5bdf69c0d303006c5726a61b23/unsloth_train/train_vision.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))== Unsloth 2024.12.4: Fast Mllama vision patching. Transformers: 4.46.2.
\\ /| GPU: NVIDIA RTXA6000-24Q. Max memory: 23.784 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.5.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.28.post2. FA2 = True]
"-____-" Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/iii/yoga/unsloth-train/unsloth_train/__main__.py", line 173, in <module>
train_model(**parameters)
File "/usr/lib/python3.10/unittest/mock.py", line 1379, in patched
return func(*newargs, **newkeywargs)
File "/home/iii/yoga/unsloth-train/unsloth_train/train_vision.py", line 47, in train_model
model, tokenizer = FastVisionModel.from_pretrained(
File "/home/iii/yoga/unsloth-train/.venv/lib/python3.10/site-packages/unsloth/models/loader.py", line 492, in from_pretrained
model, tokenizer = FastBaseVisionModel.from_pretrained(
File "/home/iii/yoga/unsloth-train/.venv/lib/python3.10/site-packages/unsloth/models/vision.py", line 145, in from_pretrained
model = AutoModelForVision2Seq.from_pretrained(
File "/home/iii/yoga/unsloth-train/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
File "/home/iii/yoga/unsloth-train/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4225, in from_pretrained
) = cls._load_pretrained_model(
File "/home/iii/yoga/unsloth-train/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4728, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/iii/yoga/unsloth-train/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 993, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/iii/yoga/unsloth-train/.venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 329, in set_module_tensor_to_device
new_value = value.to(device)
RuntimeError: CUDA driver error: operation not supported

nvidia-smi:

Installed packages:

Check that torch is installed properly with CUDA:
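A minimal check along these lines (just a sketch, not the exact script from my environment) should succeed if the torch/CUDA install itself is fine:

# Quick sanity check that the installed torch build can see and use the GPU.
import torch

print(torch.__version__)              # e.g. 2.5.0+cu124
print(torch.version.cuda)             # CUDA toolkit the wheel was built against
print(torch.cuda.is_available())      # should be True
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA RTXA6000-24Q

# Exercise the same kind of host-to-device copy that fails later on.
x = torch.randn(4, 4)
y = x.to("cuda")
print(y.sum().item())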
I feel this is not a torch problem; maybe it is a vGPU, newest CUDA driver, or flash-attention problem? But this is just my guess. I can run my script on another computer, and I can run the PyTorch quickstart on this one.
PyTorch quickstart output:
root@097ce03c393c:/app# python test.py
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to data/FashionMNIST/raw/train-images-idx3-ubyte.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.4M/26.4M [00:07<00:00, 3.77MB/s]
Extracting data/FashionMNIST/raw/train-images-idx3-ubyte.gz to data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.5k/29.5k [00:00<00:00, 109kB/s]
Extracting data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.42M/4.42M [00:02<00:00, 1.75MB/s]
Extracting data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.15k/5.15k [00:00<00:00, 83.0MB/s]
Extracting data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw
Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
Using cuda device
NeuralNetwork(
(flatten): Flatten(start_dim=1, end_dim=-1)
(linear_relu_stack): Sequential(
(0): Linear(in_features=784, out_features=512, bias=True)
(1): ReLU()
(2): Linear(in_features=512, out_features=512, bias=True)
(3): ReLU()
(4): Linear(in_features=512, out_features=10, bias=True)
)
)
Epoch 1
-------------------------------
loss: 2.304316 [ 64/60000]
loss: 2.289975 [ 6464/60000]
loss: 2.270400 [12864/60000]
loss: 2.259455 [19264/60000]
loss: 2.245398 [25664/60000]
loss: 2.209586 [32064/60000]
loss: 2.219506 [38464/60000]
loss: 2.183300 [44864/60000]
loss: 2.181254 [51264/60000]
loss: 2.146851 [57664/60000]
Test Error:
Accuracy: 41.1%, Avg loss: 2.141327
Epoch 2
-------------------------------
loss: 2.155709 [ 64/60000]
loss: 2.144184 [ 6464/60000]
loss: 2.084640 [12864/60000]
loss: 2.103154 [19264/60000]
loss: 2.049047 [25664/60000]
loss: 1.985874 [32064/60000]
loss: 2.025819 [38464/60000]
loss: 1.936248 [44864/60000]
loss: 1.945789 [51264/60000]
loss: 1.884334 [57664/60000]
Test Error:
Accuracy: 52.2%, Avg loss: 1.869702

I also tested training a text model and got the same problem:

Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/app/unsloth_train/__main__.py", line 173, in <module>
train_model(**parameters)
File "/root/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/unittest/mock.py", line 1396, in patched
return func(*newargs, **newkeywargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/unsloth_train/train.py", line 52, in train_model
model, tokenizer = FastLanguageModel.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.12/site-packages/unsloth/models/loader.py", line 256, in from_pretrained
model, tokenizer = dispatch_model.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.12/site-packages/unsloth/models/llama.py", line 1663, in from_pretrained
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4130, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 1083, in __init__
self.model = LlamaModel(config)
^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 812, in __init__
self.rotary_emb = LlamaRotaryEmbedding(config=config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.12/site-packages/unsloth/models/llama.py", line 1149, in __init__
self._set_cos_sin_cache(seq_len=self.current_rope_size, device=device, dtype=torch.get_default_dtype())
File "/app/.venv/lib/python3.12/site-packages/unsloth/models/llama.py", line 1164, in _set_cos_sin_cache
self.register_buffer("cos_cached", emb.cos().to(dtype=dtype, device=device, non_blocking=True), persistent=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA driver error: operation not supported
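To narrow this down, I think the failing pattern can be isolated outside Unsloth/transformers with a small sketch like the one below (my own guess at a reproduction, not code from either library). It mimics the dtype conversion plus non_blocking copy from the traceback above; if this also raises "operation not supported" on the vGPU VM, the issue is probably the driver/vGPU setup rather than the libraries:

# Stand-alone reproduction attempt for the failing .to(device) pattern.
import torch

device = "cuda"
emb = torch.randn(8192, 128)  # stand-in for the rotary embedding table

# Plain blocking copy with dtype conversion.
a = emb.cos().to(dtype=torch.bfloat16, device=device)

# Non-blocking copy, matching the line from the traceback above.
b = emb.cos().to(dtype=torch.bfloat16, device=device, non_blocking=True)

# Pinned-memory + non_blocking copy, where async transfers actually apply.
c = emb.cos().pin_memory().to(device, non_blocking=True)

torch.cuda.synchronize()
print(a.shape, b.shape, c.shape)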
Thank you @cool9203
Hi,
unfortunately, I get this error: