Re-enable environment variables to guard against hangs with Triton gemms #624

Draft
wants to merge 3 commits into
base: main
Changes from 2 commits
3 changes: 2 additions & 1 deletion .github/container/Dockerfile.jax
@@ -59,10 +59,11 @@ ENV XLA_FLAGS=""
ENV XLA_FLAGS="${XLA_FLAGS} --xla_gpu_enable_latency_hiding_scheduler=true"
ENV XLA_FLAGS="${XLA_FLAGS} --xla_gpu_enable_async_all_gather=true"
ENV XLA_FLAGS="${XLA_FLAGS} --xla_gpu_enable_async_reduce_scatter=true"
ENV XLA_FLAGS="${XLA_FLAGS} --xla_gpu_enable_triton_gemm=false"
ENV CUDA_DEVICE_MAX_CONNECTIONS=1
ENV NCCL_NVLS_ENABLE=0
ENV CUDA_MODULE_LOADING=EAGER
ENV JAX_SHARE_BINARY_BETWEEN_HOSTS=True
ENV JAX_SHARE_AUTOTUNE_CONFIG_BETWEEN_HOSTS=True

ADD --chmod=777 create-distribution.sh ${DEST_MANIFEST_DIR}/
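
A minimal sketch of how the chained `ENV XLA_FLAGS="${XLA_FLAGS} ..."` lines in this hunk resolve into one flag string. It only models the flags visible in this hunk (the real Dockerfile may set others outside the shown context), and the helper name `accumulate_env` is hypothetical, not part of the repository.

```python
import shlex

def accumulate_env(assignments, var="XLA_FLAGS"):
    """Apply Dockerfile-style ENV right-hand sides in order.

    Each entry may reference ${XLA_FLAGS}; the reference is replaced by the
    value built so far, mimicking how successive ENV lines append to it.
    """
    value = ""
    for rhs in assignments:
        value = rhs.replace("${" + var + "}", value)
    return value

# Right-hand sides copied from the ENV lines visible in this hunk.
ENV_LINES = [
    "",
    "${XLA_FLAGS} --xla_gpu_enable_latency_hiding_scheduler=true",
    "${XLA_FLAGS} --xla_gpu_enable_async_all_gather=true",
    "${XLA_FLAGS} --xla_gpu_enable_async_reduce_scatter=true",
    "${XLA_FLAGS} --xla_gpu_enable_triton_gemm=false",
]

final_flags = accumulate_env(ENV_LINES)
# Split on whitespace to inspect the individual flags.
flag_list = shlex.split(final_flags)
print(flag_list)
```

Because the first assignment is an empty string, the accumulated value starts with a leading space; splitting on whitespace ignores it.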

38 changes: 19 additions & 19 deletions .github/container/manifest.yaml
@@ -1,31 +1,31 @@
jax:
url: https://github.com/google/jax.git
tracking_ref: main
- latest_verified_commit: 75cdef7626b92b8b6563ea68ae4747fd6994db2e
+ latest_verified_commit: c94ec17c0d255b560e948e1b6bb98f1a1b8be45f
mode: git-clone
xla:
url: https://github.com/openxla/xla.git
tracking_ref: main
- latest_verified_commit: 831e9cef85493ff7ee2e24fd4cc64377d682aecc
+ latest_verified_commit: e025d9f3ed4d5b7ed80b2352c613e3706d917416
mode: git-clone
flax:
url: https://github.com/google/flax.git
mirror_url: https://github.com/nvjax-svc-0/flax.git
tracking_ref: main
- latest_verified_commit: aaf130c90eb46160a3234c258a48bf1b932d7829
+ latest_verified_commit: e4282ee187efbefbde268c6873c592a352f56313
mode: git-clone
patches:
pull/3340/head: file://patches/flax/PR-3340.patch # Add Sharding Annotations to Flax Modules
transformer-engine:
url: https://github.com/NVIDIA/TransformerEngine.git
tracking_ref: main
- latest_verified_commit: 9b2fed514ea419141146f843ab2c84b22b86bfd7
+ latest_verified_commit: b459ccc93549c13797f884271904e019ad7fa3db
mode: git-clone
t5x:
url: https://github.com/google-research/t5x.git
mirror_url: https://github.com/nvjax-svc-0/t5x.git
tracking_ref: main
- latest_verified_commit: ecb126e1f5c2aea648f39869d4e69fb4374a4868
+ latest_verified_commit: ecc6369c95b7b3066a46af5050c8ff9113eb219b
mode: git-clone
patches:
mirror/patch/partial-checkpoint-restore: file://patches/t5x/mirror-patch-partial-checkpoint-restore.patch # pull/1392/head # https://github.com/google-research/t5x/pull/1392: Add support for partial checkpoint restore
@@ -43,7 +43,7 @@ praxis:
url: https://github.com/google/praxis.git
mirror_url: https://github.com/nvjax-svc-0/praxis.git
tracking_ref: main
- latest_verified_commit: c4271181833d540ea22b1e3875e2bd54951763e9
+ latest_verified_commit: 833073e5f7516f487ced5b7b1dcc06b962ebd4e0
mode: git-clone
patches:
pull/27/head: file://patches/praxis/PR-27.patch # This PR allows XLA:GPU to detect the MHA pattern more easily to call fused kernels from cublas.
@@ -52,7 +52,7 @@ lingvo:
# Used only in ARM pax builds
url: https://github.com/tensorflow/lingvo.git
tracking_ref: master
- latest_verified_commit: 5bbe38c046519b86fa5c0488f813ffbf8b467d7e
+ latest_verified_commit: 05a076b0783a8bbf4a770095966c472bb37bbf65
mode: git-clone
tensorflow-text:
# Used only in ARM pax and t5x builds
@@ -73,12 +73,12 @@ fiddle:
airio:
url: https://github.com/google/airio.git
tracking_ref: main
- latest_verified_commit: e4c682e691354d75a6bea521cd61709b1ab81d34
+ latest_verified_commit: 9051cd3375de479df41e0385d200da5101285e55
mode: pip-vcs
clu:
url: https://github.com/google/CommonLoopUtils.git
tracking_ref: main
- latest_verified_commit: eed40a1facd526df0e0faa192525f357a3321dca
+ latest_verified_commit: 0b961e5390da283448b5fc4632b61641980e0a0e
mode: pip-vcs
dllogger:
url: https://github.com/NVIDIA/dllogger.git
@@ -93,54 +93,54 @@ jestimator:
optax:
url: https://github.com/google-deepmind/optax.git
tracking_ref: main
- latest_verified_commit: 623609c7a77a19d48b021cbc300262308846317e
+ latest_verified_commit: 18110a40153097d15015af2d5d2ee73a86c2a9df
mode: pip-vcs
seqio:
url: https://github.com/google/seqio.git
tracking_ref: main
- latest_verified_commit: e31af8c1a11f749edeac512f34d148b9933f863f
+ latest_verified_commit: 11706e4a1e01a81ea6b3e02c5ad147028d5b94bb
mode: pip-vcs
# used by Pallas
openxla-triton:
url: https://github.com/openxla/triton.git
tracking_ref: llvm-head
- latest_verified_commit: cl608559313
+ latest_verified_commit: c33bbb01d087129e3714cfc2a0bcd53b80dd2013
mode: git-clone
jax-triton:
url: https://github.com/jax-ml/jax-triton.git
tracking_ref: main
- latest_verified_commit: 708d3e8afe13b52e4191ad3b677c6f1238677c9e
+ latest_verified_commit: 08f10633740924de30d85a65386e226b5cbbe564
mode: git-clone
maxtext:
url: https://github.com/google/maxtext.git
tracking_ref: main
- latest_verified_commit: 5420bc5753fec4b3a811664cdb58f3c9e98d35fb
+ latest_verified_commit: fac91938e2f23670ced1f6569557927d31185c9e
mode: git-clone
levanter:
url: https://github.com/stanford-crfm/levanter.git
tracking_ref: main
- latest_verified_commit: 94a432e7999ae016645bc72e9dda55e724d0f834
+ latest_verified_commit: 26577417d6c48144e30d97b6aceaf5da917c4f79
mode: git-clone
haliax:
url: https://github.com/stanford-crfm/haliax.git
tracking_ref: main
- latest_verified_commit: 690623131e107972ec2ec67d6183c77649d4b7e0
+ latest_verified_commit: b94cef72e8bafec57e170fceb9ee86bbe2b1bcfa
mode: git-clone
mujoco:
url: https://github.com/google-deepmind/mujoco.git
tracking_ref: main
- latest_verified_commit: c6a41fbfe64ee7b2680a6bde90200ca660d08c2a
+ latest_verified_commit: ef3cb7ec2e75da93e3307b88b53923ee8f8c7f66
mode: git-clone
grain:
# Used only in ARM t5x builds
url: https://github.com/google/grain.git
tracking_ref: main
- latest_verified_commit: f58031724ff06bcc84943c9a8ec501c8941dd660
+ latest_verified_commit: 98531edd5332c3f212853f88513ac55f29be96b7
mode: git-clone
mujoco-mpc:
url: https://github.com/google-deepmind/mujoco_mpc.git
tracking_ref: main
- latest_verified_commit: 50a0159cbc70b38a7fee425b8bf5edbc04a1b62e
+ latest_verified_commit: 4700f4a13be18398f5aaf6a33ed42e531967e3ae
mode: git-clone
language-to-reward-2023:
url: https://github.com/google-deepmind/language_to_reward_2023.git
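
The `openxla-triton` entry in this diff moves from the short ref `cl608559313` to a full 40-character commit hash. A rough sketch of a check that flags entries whose `latest_verified_commit` is not a full SHA-1 hash is below. Whether the project actually requires full hashes is an assumption, and `unpinned_entries` and the `sample` dictionary are hypothetical illustrations, not code from the repository.

```python
import re

# A full git SHA-1 object name: exactly 40 lowercase hex characters.
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")

def unpinned_entries(manifest):
    """Return the sorted names of entries whose latest_verified_commit
    is missing or is not a full 40-character hex hash."""
    return sorted(
        name
        for name, entry in manifest.items()
        if not FULL_SHA.match(str(entry.get("latest_verified_commit", "")))
    )

# Hand-written stand-in for a parsed manifest.yaml; values are taken from this diff.
sample = {
    "openxla-triton-before": {"latest_verified_commit": "cl608559313"},
    "openxla-triton-after": {
        "latest_verified_commit": "c33bbb01d087129e3714cfc2a0bcd53b80dd2013"
    },
    "jax": {"latest_verified_commit": "c94ec17c0d255b560e948e1b6bb98f1a1b8be45f"},
}

print(unpinned_entries(sample))
```

With the sample above, only the pre-change `openxla-triton` value is reported, since the other two values are full hashes.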