This repository was archived by the owner on Jul 22, 2024. It is now read-only.

Question about using multiple gpus #11

@YunahJang

Hi! I'm having trouble using multiple GPUs with the run_finetune_rag_dialdoc.sh script.

I set the --gpus parameter to 4, but I keep getting the error below.

ValueError: ProcessGroupGloo::scatter: invalid tensor type at index 0 (expected TensorOptions(dtype=double, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)), got TensorOptions(dtype=float, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))

So I modified line 159 of dialdoc/models/rag/distributed_pytorch_retriever.py so that it no longer passes the target_type argument:
retrieved_doc_embeds = self._scattered(scatter_vectors, [n_queries, n_docs, combined_hidden_states.shape[1]])

After this modification, I get the error below, and I can't figure out why.

File "/home/yunah/multidoc2dial_ours/dialdoc/models/rag/distributed_pytorch_retriever.py", line 157, in retrieve
doc_ids = self._scattered(scatter_ids, [n_queries, n_docs], target_type=torch.int64)
File "/home/yunah/multidoc2dial_ours/dialdoc/models/rag/distributed_pytorch_retriever.py", line 82, in _scattered
dist.scatter(target_tensor, src=0, scatter_list=scatter_list, group=self.process_group)
File "/home/yunah/.conda/envs/multidoc2dial/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2191, in scatter
work = group.scatter(output_tensors, input_tensors, opts)
ValueError: ProcessGroupGloo::scatter: Incorrect input list size 1. Input list size should be 2, same as size of the process group.

Did I miss any other variables or settings that I should change before using multiple GPUs?
Is there a solution for this error?
Thanks a lot!

Best,
Yunah
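
Note: read together, the two error messages above suggest that on the source rank Gloo's scatter expects (a) one tensor in scatter_list per process in the group and (b) those tensors to share the dtype of the receiving tensor. Below is a minimal sketch of a pre-flight check for both conditions. The helper name check_scatter_args and its parameters are hypothetical; it is not part of PyTorch or of this repository, and it only inspects a .dtype attribute, so the stand-in objects used here could be swapped for real tensors.

```python
from types import SimpleNamespace


def check_scatter_args(target, scatter_list, world_size, is_src):
    """Return a list of problems that would likely make a scatter call fail.

    Hypothetical helper (an assumption for illustration, not a PyTorch API).
    `target` and the items of `scatter_list` only need a `.dtype` attribute.
    """
    problems = []
    if not is_src:
        # Only the source rank supplies the list of tensors to send.
        return problems
    items = scatter_list or []
    if len(items) != world_size:
        problems.append(
            f"scatter_list has {len(items)} entries, expected {world_size}"
        )
    for i, item in enumerate(items):
        if item.dtype != target.dtype:
            problems.append(
                f"dtype mismatch at index {i}: "
                f"target is {target.dtype}, got {item.dtype}"
            )
    return problems


# Stand-ins for tensors; only the dtype attribute matters for this check.
f32 = SimpleNamespace(dtype="float32")
f64 = SimpleNamespace(dtype="float64")

# Mirrors both errors in this issue: one entry for a group of two,
# and a float entry where the receiving tensor is double.
for problem in check_scatter_args(f64, [f32], world_size=2, is_src=True):
    print(problem)
```

Calling a check like this just before dist.scatter in _scattered might show which of the two conditions is violated on rank 0, though the underlying cause in the retriever would still need to be tracked down.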
