feat: initial IFRT integration #764

Merged
merged 3 commits into from
Feb 23, 2025

Conversation

avik-pal
Collaborator

@avik-pal avik-pal commented Feb 17, 2025

I will split up this PR into multiple smaller PRs once this is ready

@avik-pal avik-pal force-pushed the ap/ifrt_shardings branch 2 times, most recently from f468e60 to 8509469 Compare February 17, 2025 23:43
@mofeing
Collaborator

mofeing commented Feb 18, 2025

changes are being made to the C-API, specifically to how PjRt and IFRT should interact (due to the discussions in #738, where I was adding the Julia API). see #751 and specifically this comment #751 (comment)

we are overlapping a bit: i was waiting on #751 to push some of the API you're adding here 😅. i think we should first fix this #751 (comment) and then add the Julia API. i have all the API you want locally but we need first to fix this comment i left, as otherwise sharding won't work for distributed TPU (which is why we want IFRT in the first place).

EDIT: just remembered we agreed with @wsmoses to start by trying just the IFRT-PjRt backend for testing, but we must be careful as this PR conflicts with #751 (where i'm changing the C-API) and with #738 (where I introduce the Julia bindings, which is blocked on #751).

@avik-pal
Collaborator Author

I am not adding any PjRt-IFRT interaction stuff (the Held* functions in that PR) here, except for the ifrt::Client constructed from the PjRtClient.

We need to use the direct IFRT calls for constructing arrays from an HloSharding, since that path doesn't take PjRtBuffers (at least not without multiple roundtrips).

@avik-pal avik-pal changed the base branch from main to ap/ifrt_shardings_jll_changes February 19, 2025 06:11
@avik-pal avik-pal changed the base branch from ap/ifrt_shardings_jll_changes to ap/refactor_pjrt February 19, 2025 15:38
@avik-pal avik-pal force-pushed the ap/refactor_pjrt branch 2 times, most recently from 2efddbf to 105eedb Compare February 19, 2025 16:45
@avik-pal avik-pal changed the title feat: IFRT shardings feat: initial IFRT integration Feb 19, 2025
Base automatically changed from ap/refactor_pjrt to main February 19, 2025 22:44
@avik-pal avik-pal added the IFRT label Feb 20, 2025
@avik-pal avik-pal force-pushed the ap/ifrt_shardings branch 2 times, most recently from 6da5b9d to 408e120 Compare February 21, 2025 20:41
@avik-pal avik-pal changed the base branch from main to ap/pjrt_distributed February 21, 2025 20:41
@avik-pal
Collaborator Author

I0000 00:00:1740173243.405339 1677674 pjrt_client.cc:524] PjRt-IFRT device count: total=4, addressable=2
I0000 00:00:1740173243.405393 1677674 pjrt_client.cc:528] Addressable PjRt-IFRT device: CpuDevice(id=0)
I0000 00:00:1740173243.405399 1677674 pjrt_client.cc:528] Addressable PjRt-IFRT device: CpuDevice(id=1)
I0000 00:00:1740173243.405382 1677675 pjrt_client.cc:524] PjRt-IFRT device count: total=4, addressable=2
I0000 00:00:1740173243.405436 1677675 pjrt_client.cc:528] Addressable PjRt-IFRT device: CpuDevice(id=131072)
I0000 00:00:1740173243.405442 1677675 pjrt_client.cc:528] Addressable PjRt-IFRT device: CpuDevice(id=131073)
Reactant.XLA.process_index(client) = 1
Reactant.XLA.process_index(client) = 0
1 CPU:0 cpu
0 CPU:0 cpu
1 CPU:1 cpu
1 CPU:131072 cpu
1 CPU:131073 cpu
0 CPU:1 cpu
0 CPU:131072 cpu
0 CPU:131073 cpu
1 CPU:131072 cpu
1 CPU:131073 cpu
0 CPU:0 cpu
0 CPU:1 cpu

CPU setup for distributed (and presumably TPU setup) now works
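
The device ids in the log above are not contiguous (0, 1, 131072, 131073), which fits the commits in this PR that stop assuming devices run from 0 to N-1. Below is a minimal Julia sketch of that idea. It assumes the ids are packed as process_index * 2^17 + local_index, which is only an inference from the log and is not confirmed in this thread. The helper names are hypothetical and are not part of Reactant.

```julia
# Hypothetical helpers for illustration only; none of this is Reactant API.

# Global device ids as printed in the log above
# (two processes, two addressable CPU devices each).
const GLOBAL_IDS = [0, 1, 131072, 131073]

# Assumption read off the log: ids are packed as
# process_index * 2^17 + local_index.
const PROCESS_STRIDE = 1 << 17

# Split a global id into (process_index, local_index) under that assumption.
unpack_device_id(id::Integer) = divrem(id, PROCESS_STRIDE)

# Map each global id to a dense logical id in 0..N-1, ordered by global id,
# so downstream code never indexes by the raw (sparse) global id.
function logical_device_ids(global_ids::AbstractVector{<:Integer})
    sorted = sort(global_ids)
    return Dict(id => i - 1 for (i, id) in enumerate(sorted))
end

logical = logical_device_ids(GLOBAL_IDS)
for id in GLOBAL_IDS
    process_index, local_index = unpack_device_id(id)
    println("global=$id process=$process_index local=$local_index logical=$(logical[id])")
end
```

Going through a lookup table rather than using the global id as an array index is the point of the sketch: it keeps working however the runtime chooses to number devices.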

@avik-pal avik-pal force-pushed the ap/pjrt_distributed branch from c9dd714 to 59a6689 Compare February 21, 2025 22:07
@avik-pal avik-pal force-pushed the ap/pjrt_distributed branch 2 times, most recently from c89dca8 to dad792d Compare February 22, 2025 17:20
@avik-pal avik-pal changed the base branch from ap/pjrt_distributed to main February 22, 2025 17:47
@avik-pal avik-pal force-pushed the ap/ifrt_shardings branch 3 times, most recently from 5b449d1 to 91c37ab Compare February 22, 2025 21:54
@avik-pal avik-pal force-pushed the ap/ifrt_shardings branch 4 times, most recently from 7041f2b to ea30e50 Compare February 23, 2025 18:18
refactor: rework how OpSharding works

feat: generate_device_list

feat: add placeholder code to simplify future sharding logic

fixup

fix: store results as HloSharding

docs: fix duplicate docs

feat: compile with logical device ids

fix: use correct global device ids

feat: use a global state to setup pjrt distributed runtime

fix: devices are not necessarily from 0 to N-1

fix: initialize clients on first use rather than on init

fix: make device selection consistent with clients

feat: add OMPI cluster detection

fix: correctly set kv_store

refactor: Distributed setup is not PJRT specific

refactor: OMPI detection doesn't need to be in an extension

feat: initial low-level IFRT API

fix: ifrt HloSharding

refactor: split up into IFRT/PJRT

feat: IFRT Client APIs

feat: IFRT Device API

fix: remove global_ordinals

feat: add devices list abstraction

feat: wrap memory and memory kinds

feat: ifrt::HloSharding now working

fix: use new ABI

chore: run formatter

fix: no finalizer

feat: initial draft of IFRT.Array interface (#774)

* feat: initial draft of IFRT.Array interface

* feat: Base.Array to ifrt::Array

* feat: buffer to host

chore: run formatter

fix: bad rebase

feat: more proxy servers

feat: add ConcreteIFRTArray

feat: add ConcreteIFRTNumber

refactor: rename ConcreteRNumber to ConcretePJRTNumber

revert: concreteifrtarray implementation

chore: run formatter

feat: ifrt loaded executable

feat: construct IFRT clients with distributed options

refactor: remove BasicDevicesList

fix: use global device ids

feat: sharding annotations across nodes now working

fix: Array construction from SingleShards

feat: support to_host for distributed cases

feat: add Gloo/MPI collectives for distributed CPU client

feat: low level compile API

feat: low-level IFRT compile + execute working
@avik-pal avik-pal marked this pull request as ready for review February 23, 2025 21:08
@avik-pal avik-pal requested review from wsmoses and mofeing February 23, 2025 21:08
@avik-pal avik-pal merged commit 2a5711e into main Feb 23, 2025
37 of 39 checks passed
@avik-pal avik-pal deleted the ap/ifrt_shardings branch February 23, 2025 22:27
3 participants