ycchenzheng
Collaborator

Fixes / Features

  • Merge the --base-docker-image and --docker-image flags

Testing / Documentation

Tested with https://github.com/AI-Hypercomputer/maxtext/blob/wstcliyu/pw-405b-scale-test/benchmarks/recipes/pw_mcjax_benchmark_recipe.py for both mcjax and Pathways.
Changed https://github.com/AI-Hypercomputer/maxtext/blob/wstcliyu/pw-405b-scale-test/benchmarks/maxtext_xpk_runner.py#L624 to:

    docker_image_flag = f'--docker-image="{wl_config.base_docker_image}"'

mcjax uses RUNNER = "maxtext_base_image" and Pathways uses RUNNER = "gcr.io/tpu-prod-env-multipod/wstcliyu_latest:latest" as the runner image.
For mcjax, XPK pushes the local maxtext_base_image to the remote registry for pods to pull; for Pathways, pods pull the image directly from the remote registry.
XPK log:

[XPK] Building /usr/local/google/home/chzheng/maxtext into docker image.
[XPK] Task: `Building script_dir into docker image` is implemented by `docker buildx build --platform=linux/amd64 -f /tmp/tmpvl105_wh -t chzheng-runner /usr/local/google/home/chzheng/maxtext`, streaming output live.
[+] Building 0.0s (0/1)                                                                                                                                  docker:default
[+] Building 0.9s (1/2)                                                                                                                                  docker:default
[+] Building 2.0s (6/8)                                                                                                                                  docker:default
[+] Building 3.0s (8/9)                                                                                                                                  docker:default
[+] Building 3.8s (9/9) FINISHED                                                                                                                         docker:default
 => [internal] load build definition from tmpvl105_wh                                                                                                              0.0s
 => => transferring dockerfile: 212B                                                                                                                               0.0s
 => [internal] load metadata for docker.io/library/python:3.10                                                                                                     1.3s
 => [internal] load .dockerignore                                                                                                                                  0.0s
 => => transferring context: 45B                                                                                                                                   0.0s
 => [1/4] FROM docker.io/library/python:3.10@sha256:6ff000548a4fa34c1be02624836e75e212d4ead8227b4d4381c3ae998933a922                                               0.0s
 => [internal] load build context                                                                                                                                  0.0s
 => => transferring context: 39.83kB                                                                                                                               0.0s
 => CACHED [2/4] WORKDIR /app                                                                                                                                      0.0s
 => [3/4] COPY . .                                                                                                                                                 1.3s
 => [4/4] WORKDIR /app                                                                                                                                             0.0s
 => exporting to image                                                                                                                                             1.0s
 => => exporting layers                                                                                                                                            1.0s
 => => writing image sha256:8f7f59fdd22171fa0ac861a9e7559c4c58f80978950a7bb04eaf2ee37f004ffd                                                                       0.0s
 => => naming to docker.io/library/chzheng-runner                                                                                                                  0.0s
Waiting for `chzhe-pw-2-wtf`, for 14 seconds
[XPK] Task: `Building script_dir into docker image` terminated with code `0`
[XPK] Adding Docker Image: gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39 to tpu-prod-env-one-vm
[XPK] Task: `Tag Docker Image` is implemented by `docker tag chzheng-runner gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39`, streaming output live.
Waiting for `chzhe-pw-2-wtf`, for 15 seconds
[XPK] Task: `Tag Docker Image` terminated with code `0`
[XPK] Task: `Upload Docker Image` is implemented by `docker push gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39`, streaming output live.
Waiting for `chzhe-pw-2-wtf`, for 16 seconds
Waiting for `chzhe-pw-2-wtf`, for 17 seconds
The push refers to repository [gcr.io/tpu-prod-env-one-vm/chzheng-runner]
5f70bf18a086: Layer already exists 
917a4b2a5731: Pushing [=>                                                 ]  7.141MB/191.7MB
917a4b2a5731: Pushing [=====>                                             ]  20.51MB/191.7MB
917a4b2a5731: Pushing [========>                                          ]   33.3MB/191.7MB
917a4b2a5731: Pushing [===========>                                       ]  44.44MB/191.7MB
917a4b2a5731: Pushing [==============>                                    ]  54.42MB/191.7MB
917a4b2a5731: Pushing [================>                                  ]  64.45MB/191.7MB
917a4b2a5731: Pushing [====================>                              ]  77.25MB/191.7MB
917a4b2a5731: Pushing [=======================>                           ]  88.93MB/191.7MB
917a4b2a5731: Pushing [==========================>                        ]  101.2MB/191.7MB
917a4b2a5731: Pushing [=============================>                     ]  112.9MB/191.7MB
917a4b2a5731: Pushing [================================>                  ]  124.6MB/191.7MB
917a4b2a5731: Pushing [===================================>               ]  135.1MB/191.7MB
917a4b2a5731: Pushing [======================================>            ]  146.3MB/191.7MB
917a4b2a5731: Pushing [========================================>          ]  156.3MB/191.7MB
917a4b2a5731: Pushing [===========================================>       ]    168MB/191.7MB
917a4b2a5731: Pushing [==============================================>    ]  179.6MB/191.7MB
917a4b2a5731: Pushing [=================================================> ]  190.2MB/191.7MB
917a4b2a5731: Pushed 
Waiting for `chzhe-pw-2-wtf`, for 27 seconds
Waiting for `chzhe-pw-2-wtf`, for 28 seconds
Waiting for `chzhe-pw-2-wtf`, for 29 seconds
Waiting for `chzhe-pw-2-wtf`, for 30 seconds
Waiting for `chzhe-pw-2-wtf`, for 31 seconds
Waiting for `chzhe-pw-2-wtf`, for 32 seconds
Waiting for `chzhe-pw-2-wtf`, for 33 seconds
Waiting for `chzhe-pw-2-wtf`, for 34 seconds
Waiting for `chzhe-pw-2-wtf`, for 35 seconds
Waiting for `chzhe-pw-2-wtf`, for 36 seconds
xitg-2025-08-08-17-56-39: digest: sha256:766a71cc50f9c8100c98ccba5d451dbbb82d15f6b275ee884f1b2cb278153f36 size: 2420
[XPK] Task: `Upload Docker Image` terminated with code `0`

Pod log:

Events:
  Type     Reason                           Age                From               Message
  ----     ------                           ----               ----               -------
  Normal   Scheduled                        32s                default-scheduler  Successfully assigned default/chzhe-pw-2-wtf-slice-job-1-0-t8nn6 to gke-tpu-e96bd525-tn7c
  Normal   Pulling                          32s                kubelet            Pulling image "gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39"
  Normal   Pulled                           28s                kubelet            Successfully pulled image "gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39" in 4.365s (4.365s including waiting). Image size: 455215023 bytes.
  Normal   Created                          28s                kubelet            Created container: jax-tpu
  Normal   Started                          27s                kubelet            Started container jax-tpu
  Warning  FailedToRetrieveImagePullSecret  26s (x3 over 32s)  kubelet            Unable to retrieve some image pull secrets (None); attempting to pull the image may not succeed.

  • [ y ] Tests pass
  • [ y ] Appropriate changes to documentation are included in the PR

@ycchenzheng ycchenzheng self-assigned this Aug 8, 2025
@ycchenzheng
Collaborator Author

@SujeethJinesh

@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch from 3d26153 to 954377f Compare August 8, 2025 18:39
@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch 2 times, most recently from 36b0cfc to 95e6fa0 Compare August 11, 2025 01:04
@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch from 95e6fa0 to b70f1b8 Compare August 11, 2025 18:05
@SujeethJinesh
Collaborator

Hmm, I'm rethinking whether we should merge these flags. I think we should still support both flags, but when we're using the benchmark runner in MaxText, Pathways should be able to use either --base-docker-image or --docker-image.

@ycchenzheng
Collaborator Author

> Hmm, I'm rethinking whether we should merge these flags. I think we should still support both flags, but when we're using the benchmark runner in MaxText, Pathways should be able to use either --base-docker-image or --docker-image.

https://github.com/AI-Hypercomputer/xpk/blob/chzheng/docker_image_flag/src/xpk/core/docker_image.py#L228 checks --docker-image first, then --base-docker-image, and finally falls back to DEFAULT_DOCKER_IMAGE.
So this change is still backward compatible.
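
The fallback order described above can be sketched roughly as follows. This is an illustration only, not the code in docker_image.py: the function name `resolve_docker_image` is hypothetical, and the value given to `DEFAULT_DOCKER_IMAGE` here is a placeholder.

```python
from types import SimpleNamespace

# Placeholder for illustration; the real DEFAULT_DOCKER_IMAGE is defined in XPK.
DEFAULT_DOCKER_IMAGE = "python:3.10"


def resolve_docker_image(args) -> str:
    """Pick the image in the order --docker-image, --base-docker-image, default."""
    return (
        getattr(args, "docker_image", None)
        or getattr(args, "base_docker_image", None)
        or DEFAULT_DOCKER_IMAGE
    )


# Invocations that only pass --base-docker-image keep working.
old_style = SimpleNamespace(docker_image=None, base_docker_image="maxtext_base_image")
new_style = SimpleNamespace(docker_image="python:3.12", base_docker_image=None)
print(resolve_docker_image(old_style))
print(resolve_docker_image(new_style))
```

Because --docker-image is consulted first, a caller that sets both flags gets the --docker-image value, while existing callers that only set --base-docker-image see no change.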

Collaborator

@SujeethJinesh SujeethJinesh left a comment

Thanks Zheng!

Collaborator

@SujeethJinesh SujeethJinesh left a comment

Thanks Zheng!

Once the commented-out code is removed, it looks good to me.

@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch from 6bf4461 to 08a7f36 Compare August 12, 2025 21:45
@ycchenzheng
Collaborator Author

> Thanks Zheng!
>
> Once the commented-out code is removed, it looks good to me.

Done

@SujeethJinesh
Collaborator

@scaliby Would you be able to take a look at this PR?

Collaborator

@scaliby scaliby left a comment

Thanks for this change! Could you please also add the XPK execution command to the XPK log, so I can see which arguments XPK was run with?


Args:
args: user provided arguments for running the command.
is_cloud_image = any(
Collaborator

Why do we need a distinction between cloud and local images? What if I push my local ubuntu image to gcr and want to use it as a local image? It will be treated as a cloud image.

Collaborator Author

You are right. It should be based on the --script-dir flag.
I will update it and verify again.
@SujeethJinesh

Collaborator Author

It now checks --script-dir and is_cloud_image to determine whether a new image needs to be built.
I attached the commands and logs below.

Collaborator

Why do we need this is_cloud_image check at all? IMO this is completely wrong: if I, as a researcher, store my image outside of GCP, I might still want to build scripts, right?

Collaborator Author

The original purpose of is_cloud_image was to check whether the image is local; if it is, we need to push the local image to the cloud for pods to pull.

  if (
      args.script_dir and args.script_dir != DEFAULT_SCRIPT_DIR
  ) or not is_cloud_image:
    validate_code = validate_docker_image(docker_image, args)
    if validate_code != 0:
      xpk_exit(validate_code)
    build_code, docker_image = build_docker_image_from_base_image(args)
    if build_code != 0:
      xpk_exit(build_code)

And you are right, it will force the user to build and push the image if the image is outside of GCP.
I will update it. Thanks for the review!
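
One way to make the build decision independent of where the image is hosted could be sketched as follows. This is not the code in this PR: `run_quiet`, `image_exists_locally`, and `needs_build` are hypothetical names, the `DEFAULT_SCRIPT_DIR` value is a placeholder, and using `docker image inspect` to detect a local image is an assumption made for illustration.

```python
import os
import subprocess
from typing import Callable, Optional, Sequence

# Placeholder; XPK defines its own DEFAULT_SCRIPT_DIR.
DEFAULT_SCRIPT_DIR = os.getcwd()


def run_quiet(cmd: Sequence[str]) -> int:
    """Run a command with its output suppressed and return the exit code."""
    try:
        return subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False
        ).returncode
    except FileNotFoundError:
        return 127  # the docker CLI is not installed


def image_exists_locally(
    image: str, run: Callable[[Sequence[str]], int] = run_quiet
) -> bool:
    """`docker image inspect` exits non-zero when the local daemon has no such image."""
    return run(["docker", "image", "inspect", image]) == 0


def needs_build(
    script_dir: Optional[str],
    image: str,
    run: Callable[[Sequence[str]], int] = run_quiet,
) -> bool:
    """Build when a custom script dir is given or the image is found locally."""
    custom_script_dir = bool(script_dir) and script_dir != DEFAULT_SCRIPT_DIR
    # Short-circuits: docker is only queried when no custom script dir is set.
    return custom_script_dir or image_exists_locally(image, run)
```

The `run` parameter exists so the decision can be exercised without a Docker daemon. One caveat of this approach: a remote image that was previously pulled is also present locally, so a real implementation would need an additional rule to avoid rebuilding it.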

Collaborator Author

@ycchenzheng ycchenzheng Sep 26, 2025

@scaliby I updated the logic; please review it one more time.
The logs follow.
Local image (maxtext_base_image):
XPK will build and push the image.

[XPK] Docker image: maxtext_base_image
[XPK] Local image: True, build and push
[XPK] Building /usr/local/google/home/chzheng/maxtext into docker image.
[XPK] Task: `Building script_dir into docker image` is implemented by `docker buildx build --platform=linux/amd64 -f /tmp/tmplf2ae_b1 -t chzheng-runner /usr/local/google/home/chzheng/maxtext`, streaming output live.
Waiting for `chz-pw-llama3--3-e1g`, for 9 seconds
[+] Building 0.8s (1/2)                                                                                                                                  docker:default
[+] Building 1.7s (6/8)                                                                                                                                  docker:default
[+] Building 2.5s (9/9) FINISHED                                                                                                                         docker:default
 => [internal] load build definition from tmplf2ae_b1                                                                                                              0.0s
 => => transferring dockerfile: 212B                                                                                                                               0.0s
 => [internal] load metadata for docker.io/library/python:3.10                                                                                                     1.3s
 => [internal] load .dockerignore                                                                                                                                  0.0s
 => => transferring context: 45B                                                                                                                                   0.0s
 => [1/4] FROM docker.io/library/python:3.10@sha256:b568817037b2e14fb0d5d0deb972556ac15a915eb85e254bbc66e4eae446aa54                                               0.0s
 => [internal] load build context                                                                                                                                  0.1s
 => => transferring context: 66.74kB                                                                                                                               0.1s
 => CACHED [2/4] WORKDIR /app                                                                                                                                      0.0s
 => [3/4] COPY . .                                                                                                                                                 0.6s
 => [4/4] WORKDIR /app                                                                                                                                             0.0s
 => exporting to image                                                                                                                                             0.5s
 => => exporting layers                                                                                                                                            0.5s
 => => writing image sha256:caf3e50099b114251c19a2481570758515b5051fec14654144b9ff865c500ed6                                                                       0.0s
 => => naming to docker.io/library/chzheng-runner                                                                                                                  0.0s
[XPK] Task: `Building script_dir into docker image` terminated with code `0`
[XPK] Adding Docker Image: gcr.io/cloud-tpu-multipod-dev/chzheng-runner:adsc-2025-09-26-20-04-54 to cloud-tpu-multipod-dev
[XPK] Task: `Tag Docker Image` is implemented by `docker tag chzheng-runner gcr.io/cloud-tpu-multipod-dev/chzheng-runner:adsc-2025-09-26-20-04-54`, streaming output live.
Waiting for `chz-pw-llama3--3-e1g`, for 12 seconds
[XPK] Task: `Tag Docker Image` terminated with code `0`
[XPK] Task: `Upload Docker Image` is implemented by `docker push gcr.io/cloud-tpu-multipod-dev/chzheng-runner:adsc-2025-09-26-20-04-54`, streaming output live.
Waiting for `chz-pw-llama3--3-e1g`, for 13 seconds
The push refers to repository [gcr.io/cloud-tpu-multipod-dev/chzheng-runner]
5f70bf18a086: Preparing 
5f70bf18a086: Layer already exists 
ad25f8cf280d: Pushing [=======>                                           ]  11.03MB/72.57MB
ad25f8cf280d: Pushing [================>                                  ]  24.29MB/72.57MB
ad25f8cf280d: Pushing [=========================>                         ]  37.45MB/72.57MB
ad25f8cf280d: Pushing [=================================>                 ]  48.58MB/72.57MB
ad25f8cf280d: Pushing [========================================>          ]   58.6MB/72.57MB
ad25f8cf280d: Pushing [================================================>  ]   70.3MB/72.57MB
ad25f8cf280d: Pushed 
607ddfe5f3c3: Layer already exists 
185e04da9d94: Layer already exists 
Waiting for `chz-pw-llama3--3-e1g`, for 15 seconds
Waiting for `chz-pw-llama3--3-e1g`, for 16 seconds
Waiting for `chz-pw-llama3--3-e1g`, for 17 seconds
Waiting for `chz-pw-llama3--3-e1g`, for 18 seconds
Waiting for `chz-pw-llama3--3-e1g`, for 19 seconds
Waiting for `chz-pw-llama3--3-e1g`, for 20 seconds
Waiting for `chz-pw-llama3--3-e1g`, for 21 seconds
Waiting for `chz-pw-llama3--3-e1g`, for 22 seconds
adsc-2025-09-26-20-04-54: digest: sha256:dd5b481a3ad529d67fb0d24c0d025293b4df6858383b3b8a0662fb61ff267bb5 size: 2420
[XPK] Task: `Upload Docker Image` terminated with code `0`

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  12s   default-scheduler  Successfully assigned default/chz-pw-llama3--3-e1g-pathways-head-0-0-jq8rh to gke-pw-scale-test-v5e-32-cpu-np-157a9698-zl4g
  Normal  Pulling    12s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest"
  Normal  Pulled     12s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest" in 271ms (271ms including waiting). Image size: 189202517 bytes.
  Normal  Created    12s   kubelet            Created container: pathways-rm
  Normal  Started    12s   kubelet            Started container pathways-rm
  Normal  Pulling    11s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest"
  Normal  Pulled     11s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest" in 315ms (315ms including waiting). Image size: 176032745 bytes.
  Normal  Created    11s   kubelet            Created container: pathways-proxy
  Normal  Started    11s   kubelet            Started container pathways-proxy
  Normal  Pulling    10s   kubelet            Pulling image "gcr.io/cloud-tpu-multipod-dev/chzheng-runner:adsc-2025-09-26-20-04-54"
  Normal  Pulled     10s   kubelet            Successfully pulled image "gcr.io/cloud-tpu-multipod-dev/chzheng-runner:adsc-2025-09-26-20-04-54" in 330ms (330ms including waiting). Image size: 432209311 bytes.
  Normal  Created    10s   kubelet            Created container: jax-tpu
  Normal  Started    10s   kubelet            Started container jax-tpu

Cloud image outside of GCP (python:3.12):
XPK will not build an image if --script-dir is not set. Since the image is outside of GCP, validate_docker_image returns 0 and outputs no logs.

[XPK] Docker image: python:3.12
Waiting for `chz-pw-llama3--3-7xs`, for 9 seconds
[XPK] Remote image, move forward

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  37s   default-scheduler  Successfully assigned default/chz-pw-llama3--3-7xs-pathways-head-0-0-8vl6r to gke-pw-scale-test-v5e-32-cpu-np-157a9698-zl4g
  Normal  Pulling    38s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest"
  Normal  Pulled     38s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest" in 367ms (367ms including waiting). Image size: 189202517 bytes.
  Normal  Created    38s   kubelet            Created container: pathways-rm
  Normal  Started    38s   kubelet            Started container pathways-rm
  Normal  Pulling    37s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest"
  Normal  Pulled     36s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest" in 247ms (247ms including waiting). Image size: 176032745 bytes.
  Normal  Created    36s   kubelet            Created container: pathways-proxy
  Normal  Started    36s   kubelet            Started container pathways-proxy
  Normal  Pulling    36s   kubelet            Pulling image "python:3.12"
  Normal  Pulled     35s   kubelet            Successfully pulled image "python:3.12" in 225ms (225ms including waiting). Image size: 410344954 bytes.
  Normal  Created    35s   kubelet            Created container: jax-tpu
  Normal  Started    35s   kubelet            Started container jax-tpu

Cloud image (gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest):
XPK will not build an image if --script-dir is not set.

[XPK] Docker image: gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest
[XPK] Remote image, move forward
[XPK] Task: `Validate Docker Image` is implemented by `gcloud container images describe gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest --project cloud-tpu-multipod-dev`, hiding output unless there is an error.
Waiting for `chz-pw-llama3--3-866`, for 10 seconds
Waiting for `chz-pw-llama3--3-866`, for 11 seconds
[XPK] Task: `Validate Docker Image` succeeded.

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  119s  default-scheduler  Successfully assigned default/chz-pw-llama3--3-866-pathways-head-0-0-9rqjk to gke-pw-scale-test-v5e-32-cpu-np-157a9698-zl4g
  Normal  Pulling    119s  kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest"
  Normal  Pulled     119s  kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest" in 357ms (357ms including waiting). Image size: 189202517 bytes.
  Normal  Created    119s  kubelet            Created container: pathways-rm
  Normal  Started    119s  kubelet            Started container pathways-rm
  Normal  Pulling    119s  kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest"
  Normal  Pulled     118s  kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest" in 375ms (375ms including waiting). Image size: 176032745 bytes.
  Normal  Created    118s  kubelet            Created container: pathways-proxy
  Normal  Started    118s  kubelet            Started container pathways-proxy
  Normal  Pulling    118s  kubelet            Pulling image "gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest"
  Normal  Pulled     117s  kubelet            Successfully pulled image "gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest" in 420ms (420ms including waiting). Image size: 1573876003 bytes.
  Normal  Created    117s  kubelet            Created container: jax-tpu
  Normal  Started    117s  kubelet            Started container jax-tpu

@ycchenzheng
Collaborator Author

> Thanks for this change! Could you please also add the XPK execution command to the XPK log, so I can see which arguments XPK was run with?

This was run through the MaxText benchmark recipes:

~/maxtext$ PYTHONPATH=. python3 benchmarks/recipes/pw_mcjax_benchmark_recipe.py

It calls https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/maxtext_xpk_runner.py, which in turn calls xpk.py.
When using a local base Docker image, the XPK command is:

python3 ~/xpk/xpk.py workload create  --cluster=pw-scale-test-v5e-32 --project=cloud-tpu-multipod-dev --zone=us-south1-a  --device-type=v5litepod-32  --num-slices=2 --command="export PROJECT=cloud-tpu-multipod-dev && export CLUSTER=pw-scale-test-v5e-32 && export ZONE=us-south1-a &&  echo LIBTPU_INIT_ARGS=' --xla_tpu_scoped_vmem_limit_kib=98304 --xla_tpu_use_minor_sharding_for_major_trivial_input=true --xla_tpu_relayout_group_size_threshold_for_reduce_scatter=1 --xla_tpu_assign_all_reduce_scatter_layout=true --xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true --xla_tpu_enable_async_collective_fusion_fuse_all_reduce=false --xla_tpu_enable_sparse_core_collective_offload_all_reduce=true --xla_tpu_enable_all_reduce_offload_tracing=true --xla_tpu_use_tc_device_shape_on_sc=true --xla_sc_enable_instruction_fusion=false --xla_sc_disjoint_spmem=false --xla_sc_disable_megacore_partitioning=true --2a886c8_chip_config_name=megachip_tccontrol --xla_tpu_enable_all_experimental_scheduler_features=true --xla_tpu_enable_scheduler_memory_pressure_tracking=true --xla_tpu_host_transfer_overlap_limit=24 --xla_tpu_aggressive_opt_barrier_removal=ENABLED --xla_lhs_prioritize_async_depth_over_stall=ENABLED --xla_tpu_enable_ag_backward_pipelining=true --xla_should_allow_loop_variant_parameter_in_chain=ENABLED --xla_should_add_loop_invariant_op_in_chain=ENABLED --xla_max_concurrent_host_send_recv=100 --xla_tpu_scheduler_percent_shared_memory_limit=100 --xla_latency_hiding_scheduler_rerun=2' && export LIBTPU_INIT_ARGS=' --xla_tpu_scoped_vmem_limit_kib=98304 --xla_tpu_use_minor_sharding_for_major_trivial_input=true --xla_tpu_relayout_group_size_threshold_for_reduce_scatter=1 
--xla_tpu_assign_all_reduce_scatter_layout=true --xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true --xla_tpu_enable_async_collective_fusion_fuse_all_reduce=false --xla_tpu_enable_sparse_core_collective_offload_all_reduce=true --xla_tpu_enable_all_reduce_offload_tracing=true --xla_tpu_use_tc_device_shape_on_sc=true --xla_sc_enable_instruction_fusion=false --xla_sc_disjoint_spmem=false --xla_sc_disable_megacore_partitioning=true --2a886c8_chip_config_name=megachip_tccontrol --xla_tpu_enable_all_experimental_scheduler_features=true --xla_tpu_enable_scheduler_memory_pressure_tracking=true --xla_tpu_host_transfer_overlap_limit=24 --xla_tpu_aggressive_opt_barrier_removal=ENABLED --xla_lhs_prioritize_async_depth_over_stall=ENABLED --xla_tpu_enable_ag_backward_pipelining=true --xla_should_allow_loop_variant_parameter_in_chain=ENABLED --xla_should_add_loop_invariant_op_in_chain=ENABLED --xla_max_concurrent_host_send_recv=100 --xla_tpu_scheduler_percent_shared_memory_limit=100 --xla_latency_hiding_scheduler_rerun=2' && export ENABLE_PATHWAYS_PERSISTENCE=1 && export JAX_PLATFORMS=tpu,cpu && export ENABLE_PJRT_COMPATIBILITY=true &&  python3 -m MaxText.train MaxText/configs/base.yml per_device_batch_size=2 ici_fsdp_parallelism=-1 remat_policy=custom decoder_layer_input=offload out_proj=offload query_proj=offload key_proj=offload value_proj=offload max_target_length=8192 attention=flash use_iota_embed=True dataset_path=gs://max-datasets-rogue dataset_type=synthetic enable_checkpointing=False sa_block_q=2048 sa_block_kv=2048 sa_block_kv_compute=2048 sa_block_q_dkv=2048 sa_block_kv_dkv=2048 sa_block_kv_dkv_compute=2048 sa_block_q_dq=2048 sa_block_kv_dq=2048 
sa_use_fused_bwd_kernel=True profiler=xplane skip_first_n_steps_for_profiler=10 profiler_steps=5 use_vertex_tensorboard=True vertex_tensorboard_project=cloud-tpu-multipod-dev vertex_tensorboard_region=us-south1  steps=20 model_name=llama3.1-8b base_output_directory=gs://chzheng-us-south1/chzhengmcjax_2_slice_v5litepod-32_llama3_1-8b-8192-v5e-256/ use_vertex_tensorboard=false vertex_tensorboard_project="" vertex_tensorboard_region="" run_name=chzhe-mc-2-xml enable_rich_metrics=true " --docker-image="maxtext_base_image" --enable-debug-logs --workload=chzhe-mc-2-xml --priority=medium --max-restarts=0

XPK builds the image and pushes it to the remote registry:

[XPK] Task: `Building script_dir into docker image` terminated with code `0`
[XPK] Adding Docker Image: gcr.io/cloud-tpu-multipod-dev/chzheng-runner:sbry-2025-08-27-19-30-28 to cloud-tpu-multipod-dev
[XPK] Task: `Tag Docker Image` is implemented by `docker tag chzheng-runner gcr.io/cloud-tpu-multipod-dev/chzheng-runner:sbry-2025-08-27-19-30-28`, streaming output live.
Waiting for `chzhe-mc-2-xml`, for 46 seconds
[XPK] Task: `Tag Docker Image` terminated with code `0`
[XPK] Task: `Upload Docker Image` is implemented by `docker push gcr.io/cloud-tpu-multipod-dev/chzheng-runner:sbry-2025-08-27-19-30-28`, streaming output live.
Waiting for `chzhe-mc-2-xml`, for 47 seconds
The push refers to repository [gcr.io/cloud-tpu-multipod-dev/chzheng-runner]

And the pod events:

Events:
  Type     Reason                           Age                From               Message
  ----     ------                           ----               ----               -------
  Normal   Scheduled                        25s                default-scheduler  Successfully assigned default/chzhe-mc-2-xml-slice-job-0-0-4m5n9 to gke-tpu-8bb8a2ce-0w6w
  Normal   Pulling                          24s                kubelet            Pulling image "gcr.io/cloud-tpu-multipod-dev/chzheng-runner:sbry-2025-08-27-19-30-28"
  Normal   Pulled                           13s                kubelet            Successfully pulled image "gcr.io/cloud-tpu-multipod-dev/chzheng-runner:sbry-2025-08-27-19-30-28" in 11.473s (11.473s including waiting). Image size: 599034818 bytes.
  Normal   Created                          13s                kubelet            Created container: jax-tpu
  Normal   Started                          13s                kubelet            Started container jax-tpu
  Warning  FailedToRetrieveImagePullSecret  12s (x3 over 25s)  kubelet            Unable to retrieve some image pull secrets (None); attempting to pull the image may not succeed.

When using a remote Docker image, the XPK command is:

python3 ~/xpk/xpk.py workload create-pathways   --server-image=us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest   --proxy-server-image=us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest    --termination-grace-period-seconds=300  --pathways-gcs-location=gs://chzheng-us-south1/chzhengpathways_2_slice_v5litepod-32_llama3_1-8b-8192-v5e-256/  --custom-pathways-server-args="--xla_tpu_use_enhanced_launch_barrier=true"  --custom-pathways-proxy-server-args="--xla_tpu_scoped_vmem_limit_kib=98304 --xla_tpu_use_minor_sharding_for_major_trivial_input=true --xla_tpu_relayout_group_size_threshold_for_reduce_scatter=1 --xla_tpu_assign_all_reduce_scatter_layout=true --xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true --xla_tpu_enable_async_collective_fusion_fuse_all_reduce=false --xla_tpu_enable_sparse_core_collective_offload_all_reduce=true --xla_tpu_enable_all_reduce_offload_tracing=true --xla_tpu_use_tc_device_shape_on_sc=true --xla_sc_enable_instruction_fusion=false --xla_sc_disjoint_spmem=false --xla_sc_disable_megacore_partitioning=true --xla_tpu_enable_all_experimental_scheduler_features=true --xla_tpu_enable_scheduler_memory_pressure_tracking=true --xla_tpu_host_transfer_overlap_limit=24 --xla_tpu_aggressive_opt_barrier_removal=ENABLED --xla_lhs_prioritize_async_depth_over_stall=ENABLED --xla_tpu_enable_ag_backward_pipelining=true --xla_should_allow_loop_variant_parameter_in_chain=ENABLED --xla_should_add_loop_invariant_op_in_chain=ENABLED --xla_max_concurrent_host_send_recv=100 --xla_tpu_scheduler_percent_shared_memory_limit=100 --xla_latency_hiding_scheduler_rerun=2 --xla_tpu_use_enhanced_launch_barrier=true"  
--custom-pathways-worker-args="--xla_tpu_use_enhanced_launch_barrier=true"   --cluster=pw-scale-test-v5e-32 --project=cloud-tpu-multipod-dev --zone=us-south1-a  --tpu-type=v5litepod-32  --num-slices=2 --command="export PROJECT=cloud-tpu-multipod-dev && export CLUSTER=pw-scale-test-v5e-32 && export ZONE=us-south1-a &&    export ENABLE_PATHWAYS_PERSISTENCE=1 && export JAX_PLATFORMS=proxy && export ENABLE_PJRT_COMPATIBILITY=true &&  python3 -m MaxText.train MaxText/configs/base.yml per_device_batch_size=2 ici_fsdp_parallelism=-1 remat_policy=custom decoder_layer_input=offload out_proj=offload query_proj=offload key_proj=offload value_proj=offload max_target_length=8192 attention=flash use_iota_embed=True dataset_path=gs://max-datasets-rogue dataset_type=synthetic enable_checkpointing=False sa_block_q=2048 sa_block_kv=2048 sa_block_kv_compute=2048 sa_block_q_dkv=2048 sa_block_kv_dkv=2048 sa_block_kv_dkv_compute=2048 sa_block_q_dq=2048 sa_block_kv_dq=2048 sa_use_fused_bwd_kernel=True profiler=xplane skip_first_n_steps_for_profiler=10 profiler_steps=5 use_vertex_tensorboard=True vertex_tensorboard_project=cloud-tpu-multipod-dev vertex_tensorboard_region=us-south1 checkpoint_storage_use_ocdbt=False checkpoint_storage_use_zarr3=False enable_pathways_goodput=True enable_goodput_recording=True enable_single_controller=True metrics_file=metrics.txt goodput_upload_interval_seconds=30  steps=20 model_name=llama3.1-8b base_output_directory=gs://chzheng-us-south1/chzhengpathways_2_slice_v5litepod-32_llama3_1-8b-8192-v5e-256/  run_name=chzhe-pw-2-tan enable_rich_metrics=true " --docker-image=gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest --enable-debug-logs --workload=chzhe-pw-2-tan --priority=medium --max-restarts=0

The pods then pull the target image directly from the remote registry:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  90s   default-scheduler  Successfully assigned default/chzhe-pw-2-tan-pathways-head-0-0-nscgd to gke-pw-scale-test-v5e-32-cpu-np-4be25352-u4fd
  Normal  Pulling    90s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest"
  Normal  Pulled     90s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest" in 389ms (389ms including waiting). Image size: 185513979 bytes.
  Normal  Created    90s   kubelet            Created container: pathways-rm
  Normal  Started    90s   kubelet            Started container pathways-rm
  Normal  Pulling    89s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest"
  Normal  Pulled     89s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest" in 288ms (288ms including waiting). Image size: 180371181 bytes.
  Normal  Created    89s   kubelet            Created container: pathways-proxy
  Normal  Started    89s   kubelet            Started container pathways-proxy
  Normal  Pulling    88s   kubelet            Pulling image "gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest"
  Normal  Pulled     88s   kubelet            Successfully pulled image "gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest" in 334ms (334ms including waiting). Image size: 1829862064 bytes.
  Normal  Created    88s   kubelet            Created container: jax-tpu
  Normal  Started    88s   kubelet            Started container jax-tpu
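
The two cases above (a local image such as `maxtext_base_image` that is built and pushed, and a remote image that the pods pull directly) could be told apart roughly as follows. This is only a sketch: the thread does not show how XPK makes the decision, and `looks_like_remote_image` is a hypothetical name. The heuristic follows Docker's own convention that the first path component of a reference is a registry host when it contains a `.` or `:` or is `localhost`.

```python
def looks_like_remote_image(image: str) -> bool:
    """Heuristic: True if the image reference starts with a registry host.

    Assumption for illustration only; XPK may use a different check,
    such as inspecting the local Docker daemon.
    """
    first, sep, _rest = image.partition("/")
    if not sep:
        # No slash at all, e.g. "maxtext_base_image" or "ubuntu:22.04".
        return False
    # Docker treats the first component as a registry host when it
    # contains "." or ":" or is exactly "localhost".
    return "." in first or ":" in first or first == "localhost"
```

With this check, `maxtext_base_image` would be classified as local and `gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest` as remote, matching the two runs shown in this description.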

@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch 2 times, most recently from a32daf1 to f27717a Compare August 27, 2025 20:12
@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch from f27717a to 16e76b1 Compare August 27, 2025 20:13
@ycchenzheng ycchenzheng requested a review from scaliby August 27, 2025 20:15
@ycchenzheng
Collaborator Author

@scaliby can you please review this PR again?

@SikaGrr
Collaborator

SikaGrr commented Sep 26, 2025

It's very hard for me to see all of the consequences of merging these two flags, as I'm not (yet) familiar enough with all of the use cases. Is merging these flags just for convenience / simplification, or is there another reason to do it?

@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch 2 times, most recently from 3bb4c1b to a6b919e Compare September 26, 2025 22:05
@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch from a6b919e to 64527de Compare September 26, 2025 22:05
@ycchenzheng
Collaborator Author

It's very hard for me to see all of the consequences of merging these two flags, as I'm not (yet) familiar enough with all of the use cases. Is merging these flags just for convenience / simplification, or is there another reason to do it?

@SujeethJinesh can you please share more context here?
