Support xccl distributed backend #3034
base: main
Conversation
Starting from `torch>=2.7`, the XCCL distributed backend is available for XPU devices (requires torch built with `USE_XCCL=1`). This commit is verified on Intel Data Center GPU Max with Bloom:

```shell
text-generation-launcher --sharded true --num-shard 2 \
    --model-id bigscience/bloom-560m
```

Signed-off-by: Dmitry Rogozhkin <[email protected]>
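As an aside (not part of the PR), the way `--num-shard 2` maps to torch.distributed ranks can be sketched as below. The `build_shard_env` helper and its defaults are hypothetical, invented only to illustrate the per-shard environment a launcher would hand to each worker process:

```python
# Hypothetical sketch: how a launcher might derive per-shard
# environments for `--num-shard 2`. These helper names are NOT
# from TGI; they only illustrate the shard-index -> rank mapping
# that torch.distributed rendezvous expects.

def build_shard_env(num_shard: int, master_addr: str = "127.0.0.1",
                    master_port: int = 29500) -> list[dict]:
    envs = []
    for rank in range(num_shard):
        envs.append({
            "RANK": str(rank),             # this shard's global rank
            "WORLD_SIZE": str(num_shard),  # total number of shards
            "MASTER_ADDR": master_addr,    # rendezvous address
            "MASTER_PORT": str(master_port),
        })
    return envs

envs = build_shard_env(2)  # one env dict per shard
```

Each worker then calls `torch.distributed.init_process_group()` with the chosen backend (here `"xccl"`), reading these variables from its environment.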
What's the benefit over the IPEX backend? If this allows suboptimal deployments compared to the IPEX image, I think we'd rather not merge this at all (and error out instead with instructions on how to get the better image). Not having flash attention is kind of a no-go nowadays (we still maintain the old paths, but only because they existed at some point; we're not adding any more).
Initially, IPEX is an external plugin for pytorch which brings in a few things that can essentially be grouped in two.

At the moment we are in the process of bringing Intel GPU support right into stock pytorch through the dedicated device backend called XPU. The distributed backend falls into the first group: it is one of the features being upstreamed to pytorch. The plan is that the IPEX "ccl" backend will be dropped going forward and IPEX will rely on the "xccl" backend exposed directly by pytorch. That process will take time. The XCCL distributed backend will first be available in PT 2.7 and will require manual pytorch compilation with `USE_XCCL=1`.

The change I propose in this PR is made with the above background in mind. It introduces "xccl" distributed support into TGI, which can be tried out if someone builds TGI against stock pytorch (without IPEX). As you correctly notice, such a build has limited value due to lacking flash attention support. That is basically the reason why I don't propose to expose such a configuration at a higher level in TGI (via docker and documentation covering such an environment). At the same time, such a build is interesting for development: it helps to identify issues earlier and builds a foundation for the future switch of the IPEX environment, which will ultimately reuse the code path I introduce now for stock pytorch.

Alternatively, we can postpone adding "xccl" distributed support until IPEX is ready to use it. Having "xccl" support now, however, even if it requires stock pytorch, will help me and other developers prepare things in advance.

I hope the above helps to clarify the story and make a decision.
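To make the backend split concrete, here is a minimal, hypothetical sketch of the selection logic described above: CUDA picks "nccl", an IPEX build on XPU keeps its custom "ccl" backend, stock `torch>=2.7` on XPU picks the upstream "xccl" backend, and "gloo" is the CPU fallback. The function and its boolean flags are illustrative only, not TGI's actual code:

```python
# Hypothetical sketch of distributed-backend selection. The
# booleans stand in for runtime checks (e.g. torch.cuda / torch.xpu
# availability, IPEX presence); the returned strings are the real
# backend identifiers discussed in this thread.

def select_backend(has_cuda: bool, has_xpu: bool, has_ipex: bool) -> str:
    if has_cuda:
        return "nccl"            # NVIDIA GPUs
    if has_xpu:
        # IPEX builds still ship their own "ccl" backend; stock
        # torch>=2.7 (built with USE_XCCL=1) exposes "xccl" directly.
        return "ccl" if has_ipex else "xccl"
    return "gloo"                # CPU fallback
```

Once IPEX itself switches to "xccl", the `has_ipex` branch collapses and both XPU paths share the same code, which is the eventual state this PR prepares for.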
Very clear explanation, thanks. If you're ok with that, we can keep the current PR open until there's enough support within torch so that we can modify the IPEX backend directly (for instance by using the XCCL backend). In TGI our external surface is only our docker images; everything internal to the docker image (packages, compile options, etc.) is considered internal and we can modify it at will (without breaking the surface, of course). The caveat with merging this directly is that sometimes users deploy a docker image on an improper node, which causes issues in the flash loading code, so it uses the fallback. Two years ago it was ok to downgrade to a non-flash implementation, but today it's increasingly better to just error out and let users fix their environment (since otherwise it's just a silent, extremely slow deployment).
Yes, sure. Let's wait for the IPEX part of the story.
This commit does not impact IPEX, which currently keeps using its custom distributed backend.
CC: @Narsil