Undefined symbol: _ZNK5torch8autograd4Node4nameEv #2715

Open
wookjeHan opened this issue Jun 11, 2024 · 6 comments

Comments

@wookjeHan

Hi team,
I installed fbgemm_gpu with the command pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu121/ and am using torch 2.4.0.

I am currently getting the error below:

/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
Traceback (most recent call last):
  File "/home/gr-optimizations/train.py", line 29, in <module>
    import fbgemm_gpu  # noqa: F401, E402
  File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/__init__.py", line 22, in <module>
    import fbgemm_gpu.docs  # noqa: F401, E402
  File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/docs/__init__.py", line 9, in <module>
    from . import jagged_tensor_ops, table_batched_embedding_ops  # noqa: F401
  File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/docs/jagged_tensor_ops.py", line 14, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1131, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'

Could you please let me know how to resolve this issue?

Best,

@q10
Contributor

q10 commented Jun 11, 2024

The latest stable release of FBGEMM_GPU targets binary compatibility with torch 2.3.x. The nightly version should be used when running against torch 2.4.x:

pip install --pre fbgemm-gpu --index-url https://download.pytorch.org/whl/nightly/cu121/

Could you try this and let us know if there are any issues?
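
A quick way to see whether the installed pair lines up with the compatibility described above is to print both versions. The sketch below is only illustrative: it assumes the distributions are registered under the names torch and fbgemm-gpu (a nightly wheel may use a different distribution name), and the helper names installed_version and major_minor are made up for this example.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version):
    """Reduce a version such as '2.4.0+cu121' to the tuple (2, 4)."""
    release = version.split("+", 1)[0]  # drop the local tag, e.g. '+cu121'
    major, minor = release.split(".")[:2]
    return int(major), int(minor)


if __name__ == "__main__":
    # Assumed distribution names; adjust if your wheel is named differently.
    for name in ("torch", "fbgemm-gpu"):
        print(name, installed_version(name))

    torch_version = installed_version("torch")
    # Per the comment above, the stable FBGEMM_GPU release targets torch 2.3.x.
    if torch_version is not None and major_minor(torch_version) != (2, 3):
        print("torch is not 2.3.x; the stable fbgemm-gpu wheel may fail to load")
```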

@wookjeHan
Author

wookjeHan commented Jun 11, 2024

Thanks for your kind reply.
Now I am facing the following error.

Could you please let me know how to resolve this issue?

/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK3c105Error4whatEv
/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai_py.so: undefined symbol: _ZNK3c105Error4whatEv
INFO:root:cuda.matmul.allow_tf32: True
I0611 17:30:01.247293 139738748224128 train.py:135] cuda.matmul.allow_tf32: True
INFO:root:cudnn.allow_tf32: True
I0611 17:30:01.247349 139738748224128 train.py:136] cudnn.allow_tf32: True
INFO:root:Training model on rank 0.
I0611 17:30:01.247383 139738748224128 train.py:137] Training model on rank 0.
Initialize _item_emb.weight as truncated normal: torch.Size([131263, 256]) params
INFO:root:Rank 0: writing logs to ./exps/ml-20m-l200/HSTU_CUSTOM-b16-h8-dqk32-dv32-lsilud0.2-ad0.0_DotProduct_local-l2-eps1e-06_ssl-t0.05-n128-b128-lr0.001-wu0-wd0-2024-06-11
  0%|                                                                                                                                                           | 0/1082 [00:00<?, ?it/s]INFO:root:running build_ext
I0611 17:30:10.720256 139738748224128 dist.py:985] running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
  0%|                                                                                                                                                           | 0/1082 [00:03<?, ?it/s]
......
Traceback logs....
.......
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1131, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'asynchronous_complete_cumsum'

@q10
Contributor

q10 commented Jun 17, 2024

We have not run into the undefined symbol: _ZNK3c105Error4whatEv issue before. However, I suspect it has to do with the exact nightly version of PyTorch, and a more recent nightly version might resolve it. We use the installation instructions here for reproducible environments; could you try installing through them and let us know if you still run into the issue?
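
For reference, both undefined symbols in this thread are Itanium C++ ABI mangled names, and decoding them shows which function the extension failed to resolve at load time. A tool such as c++filt does this properly; the sketch below is a deliberately minimal decoder that only handles the simple form seen here (_ZN, an optional K for a const member function, length-prefixed identifiers, E, then v for an empty parameter list). The function name demangle_simple is made up for illustration.

```python
def demangle_simple(symbol):
    """Decode names of the form _ZN[K]<len><id>...Ev; raise ValueError otherwise."""
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    try:
        pos = 3
        is_const = symbol[pos] == "K"  # K marks a const member function
        if is_const:
            pos += 1
        parts = []
        while symbol[pos] != "E":  # E closes the nested name
            start = pos
            while symbol[pos].isdigit():
                pos += 1
            if pos == start:
                raise ValueError("unsupported construct at offset %d" % pos)
            length = int(symbol[start:pos])
            parts.append(symbol[pos:pos + length])
            pos += length
    except IndexError:
        raise ValueError("truncated symbol") from None
    if symbol[pos + 1:] != "v":  # v means the function takes no parameters
        raise ValueError("only empty parameter lists are supported")
    return "::".join(parts) + "()" + (" const" if is_const else "")


if __name__ == "__main__":
    print(demangle_simple("_ZNK5torch8autograd4Node4nameEv"))
    # torch::autograd::Node::name() const
    print(demangle_simple("_ZNK3c105Error4whatEv"))
    # c10::Error::what() const
```

Both names belong to the PyTorch C++ libraries, which is consistent with the extension having been built against a different torch binary than the one installed.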

@isuruf

isuruf commented Sep 9, 2024

Was pytorch built with C++11 ABI? (If torch.compiled_with_cxx11_abi() returns True, then yes). If so, the pre-built wheels are incompatible.
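
The check mentioned above can be run as a short script. This is a minimal sketch: torch.compiled_with_cxx11_abi() is the call named in the comment, while the describe_abi helper and its wording are made up for illustration.

```python
def describe_abi(cxx11_abi):
    """Turn the boolean from torch.compiled_with_cxx11_abi() into a short verdict."""
    if cxx11_abi:
        # Conclusion taken from the comment above, not independently verified.
        return "torch uses the C++11 ABI; the pre-built wheels are incompatible"
    return "torch uses the pre-C++11 ABI"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print(describe_abi(torch.compiled_with_cxx11_abi()))
```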

@NiDHanWang

Try a lower version of both torch and fbgemm_gpu.

This works for me:

torch==2.3.0+cu121
fbgemm_gpu==0.7.0+cu121
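
To confirm that a given pair of versions actually loads, one option is a small smoke test that imports fbgemm_gpu and looks up the two operators that failed earlier in this thread. This is a sketch, not an official check: the has_op helper is made up, and catching RuntimeError in addition to AttributeError is an assumption meant to cover torch releases that report a missing operator differently.

```python
def has_op(namespace, op_name):
    """Return True if looking up op_name on namespace succeeds."""
    try:
        getattr(namespace, op_name)
    except (AttributeError, RuntimeError):
        return False
    return True


if __name__ == "__main__":
    try:
        import torch
        import fbgemm_gpu  # noqa: F401  (importing it registers torch.ops.fbgemm.*)
    except (ImportError, AttributeError, OSError) as exc:
        # In this thread the import itself raised AttributeError.
        print("import failed:", exc)
    else:
        for op in ("jagged_2d_to_dense", "asynchronous_complete_cumsum"):
            print(op, has_op(torch.ops.fbgemm, op))
```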

@q10
Contributor

q10 commented Jan 6, 2025

Hi @wookjeHan, we no longer support fbgemm v0.7.0. Please use at least v1.0.0 and let us know if you still run into this error.
