retest: preparing to debug difference gke/ce #78

Open · wants to merge 1 commit into main

Conversation

@vsoch (Member) commented Dec 13, 2024

So far I have found differences in MTU and in using (or not using) a TIER_1 network, both of which would influence bandwidth. I am preparing a size32 study directory in anticipation of testing this. We were never able to get COMPACT placement on Compute Engine, so I suspect that will still be the case. I haven't found other differences yet, but I'm still looking.
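As a rough sense of why MTU matters for bandwidth, here is a back-of-the-envelope sketch of TCP payload efficiency at the two MTU settings. This assumes plain IPv4 + TCP with no options (40 bytes of headers per packet); 1460 is the GCP VPC default MTU and 8896 is the current maximum.

```python
# Rough estimate of how much raw TCP goodput an MTU bump buys.
# Assumes IPv4 (20 B) + TCP (20 B) headers with no options.

def tcp_goodput_fraction(mtu: int, header_bytes: int = 40) -> float:
    """Fraction of each packet that carries application payload."""
    return (mtu - header_bytes) / mtu

for mtu in (1460, 8896):
    print(f"MTU {mtu}: {tcp_goodput_fraction(mtu):.2%} payload")
# MTU 1460: 97.26% payload
# MTU 8896: 99.55% payload
```

So the header-overhead savings alone are only a couple of percent; the bigger wins from jumbo frames tend to come from fewer packets (less per-packet CPU and interrupt work), which is consistent with the small differences observed below.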


Signed-off-by: vsoch <[email protected]>
@vsoch added the wontfix label ("This will not be worked on") on Feb 23, 2025
@vsoch (Member, Author) commented Feb 23, 2025

This likely won't be merged, but I'll add the results (from when I ran them) for transparency. This thread is from December 15th 2024.

Some notes from retesting Compute Engine with the observations I made above. First, we still can't get COMPACT placement: even for 10 nodes the instances spin indefinitely, hit some timeout around 15-16 minutes, and then start over, and I think this would go on forever. These are c2d-standard-112 instances.

[screenshot: nodes stuck pending while requesting COMPACT placement]

I'm going to restart without COMPACT.

It's not looking any faster, but I'll wait until I've done 3 iterations at both sizes before saying that for sure.
I'm confirming the MTU, and that LAMMPS (on a node) is using all of the CPU, which suggests that the lower reported CPU utilization is still due to the network.
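For pulling these per-run numbers out of LAMMPS output programmatically, a small sketch like the following works; the log excerpt here is hypothetical, but recent LAMMPS versions print a `Performance:` line and a `% CPU use` line in roughly this shape at the end of a run.

```python
import re

# Hypothetical tail of a LAMMPS log (format assumed from typical output).
log = """\
Performance: 4.521 ns/day, 5.309 hours/ns, 52.323 timesteps/s, 2.893 Matom-step/s
99.3% CPU use with 112 MPI tasks x 1 OpenMP threads
"""

matom_steps = float(re.search(r"([\d.]+) Matom-step/s", log).group(1))
cpu_use = float(re.search(r"([\d.]+)% CPU use", log).group(1))
print(matom_steps, cpu_use)  # 2.893 99.3
```

If a run's log doesn't include the `Matom-step/s` field (older versions report only `timesteps/s`), the regex would need adjusting.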

[screenshot: MTU check and per-node LAMMPS CPU utilization]

Results! This is for size 32.

  • CPU utilization is a tiny bit better compared to the lower-MTU / non-premium runs. Surprisingly, it is actually better overall on the Compute Engine environments (first plot), despite the overall runtime being slower.
  • Matom steps per second is still better on GKE, and MTU/TIER_1 didn't seem to impact the Compute Engine environments (second plot).
  • Wall time is the tiniest bit lower, but it's nothing to write home about (third plot). It is still a minute or more slower than GKE, and I did fewer iterations.

[plots: CPU utilization, Matom steps per second, and wall time, size 32]
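Since the environments were run for different numbers of iterations, comparing means with spread is a bit safer than eyeballing single runs. A minimal sketch, with placeholder wall times that are NOT the actual study numbers:

```python
from statistics import mean, stdev

# Placeholder wall times in seconds, purely illustrative.
runs = {
    "gke": [412.0, 415.5, 410.8],
    "compute-engine": [478.2, 481.0, 476.5],
}

for env, times in runs.items():
    print(f"{env}: mean={mean(times):.1f}s stdev={stdev(times):.1f}s")
```

With real data, a difference of a minute or more between means that is many standard deviations wide would not be explained by run-to-run noise.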

My early conclusions:

  • Bumping the MTU and adding TIER_1 add complexity (and, for the latter, cost) that might yield tiny improvements to CPU utilization / runtime, but TIER_1 is probably not worth the cost (though this is hugely app-dependent). The MTU is easy enough to bump.
  • I kind of look at these plots and (largely) see no change I find interesting.
  • Using cluster-toolkit seemed to break the NFS and added complexity.
  • It still could be COMPACT, which we can get in GKE but not here.

TLDR: I would not blame the difference between GKE and Compute Engine on MTU or TIER_1, at least for LAMMPS. I don't know what else we could look at that we did "wrong," because we can't get COMPACT placement or better resources. Anyway, I guess I burned a few hundred dollars and a big chunk of today, but it was worth a try. I had tiny hopes.
