LAMMPS: try to Build OBMD package on arm #16

Closed
wants to merge 3 commits into from


laraPPr
Collaborator

@laraPPr laraPPr commented Jan 29, 2025

No description provided.

@dev-eessi-io-bot

Instance dev.eessi.io-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen2, x86_64/amd/zen4, aarch64/neoverse_n1
  • repositories: dev.eessi.io

@laraPPr
Collaborator Author

laraPPr commented Jan 29, 2025

bot: build arch:aarch64/neoverse_n1

@dev-eessi-io-bot

dev-eessi-io-bot bot commented Jan 29, 2025

Updates by the bot instance dev.eessi.io-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/neoverse_n1 from laraPPr

    • expanded format: build architecture:aarch64/neoverse_n1
  • handling command build architecture:aarch64/neoverse_n1 resulted in:
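The expansion shown above, where `arch:` becomes `architecture:`, can be pictured as an alias lookup applied to each `key:value` argument. This is only a rough sketch: `expand_bot_command` and the `ALIASES` table are hypothetical names, only the `arch` to `architecture` alias comes from this thread, and the real bot may well parse commands differently.

```python
# Hypothetical alias table; only 'arch' -> 'architecture' is confirmed by
# the bot output in this thread.
ALIASES = {"arch": "architecture"}


def expand_bot_command(comment_line):
    """Turn 'bot: build arch:X' into 'build architecture:X' (illustrative)."""
    prefix = "bot:"
    if not comment_line.startswith(prefix):
        return None  # not a bot command
    tokens = comment_line[len(prefix):].split()
    if not tokens:
        return None  # 'bot:' with nothing after it
    command, args = tokens[0], tokens[1:]
    expanded = []
    for arg in args:
        key, sep, value = arg.partition(":")
        # Only rewrite key:value arguments whose key is a known alias.
        if sep and key in ALIASES:
            key = ALIASES[key]
        expanded.append(key + sep + value)
    return " ".join([command] + expanded)


print(expand_bot_command("bot: build arch:aarch64/neoverse_n1"))
```

Keeping the aliases in a table means a new short form only needs one extra entry, without touching the parsing logic.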

@dev-eessi-io-bot

dev-eessi-io-bot bot commented Jan 29, 2025

New job on instance dev.eessi.io-bot-mc-azure for architecture aarch64-neoverse_n1 for repository dev.eessi.io in job dir /project/def-users/SHARED/jobs/2025.01/pr_16/235

date | job status | comment
Jan 29 14:29:43 UTC 2025 | submitted | job id 235 awaits release by job manager
Jan 29 14:29:49 UTC 2025 | released | job awaits launch by Slurm scheduler
Jan 29 14:33:52 UTC 2025 | running | job 235 is running
Jan 29 14:35:55 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-235.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-neoverse_n1-1738161286.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/example/software/linux/aarch64/neoverse_n1/modules/all
no module files in tarball
software under 2023.06/example/software/linux/aarch64/neoverse_n1/software
no software packages in tarball
other under 2023.06/example/software/linux/aarch64/neoverse_n1
no other files in tarball
Jan 29 14:35:55 UTC 2025 | test result | 🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job235.test does not exist in job directory or reading it failed.
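The checklist in the report above reads like a pattern scan over the Slurm output file. Here is a minimal sketch of that idea. The pattern strings are copied from the report, but `check_job_output`, the plain substring matching, and the sample log are assumptions for illustration, not the bot's actual implementation.

```python
# Patterns copied from the bot's checklist; how the real bot evaluates them
# (regex vs. substring, per line vs. whole file) is an assumption here.
MUST_NOT_MATCH = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
MUST_MATCH = ["No missing installations", ".tar.gz created!"]


def check_job_output(log_text):
    """Return (success, checks) with one (passed, pattern) pair per check."""
    checks = []
    for pattern in MUST_NOT_MATCH:
        checks.append((pattern not in log_text, pattern))
    for pattern in MUST_MATCH:
        checks.append((pattern in log_text, pattern))
    success = all(passed for passed, _ in checks)
    return success, checks


# Made-up two-line log: the build failed, yet a tarball was still created.
sample = "== FAILED: Installation ended unsuccessfully\n/tmp/demo.tar.gz created!\n"
success, checks = check_job_output(sample)
print("SUCCESS" if success else "FAILURE")
for passed, pattern in checks:
    print("ok  " if passed else "FAIL", pattern)
```

Under this reading, a job can report a created tarball and still fail overall, which matches the empty 45-byte artefact listed above.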

@Neves-P
Member

Neves-P commented Jan 29, 2025

From the slurm log:

  >> running command:
	[started at: 2025-01-29 14:34:00]
	[working dir: /tmp/bot/easybuild/build/LAMMPS/0cb72423b8ed2fdf138831c145a3bfb6ea42394e/foss-2023a-kokkos-dev_OBMD/lammps-0cb72423b8ed2fdf138831c145a3bfb6ea42394e]
	[output logged in /tmp/eb-w_sv3q0o/eb-9pov_lx4/easybuild-run_cmd-c69cx09o.log]
	python -c 'from archspec.cpu import host; print(host())'
  >> command completed: exit 0, ran in < 1s
== ... (took < 1 sec)
== FAILED: Installation ended unsuccessfully (build directory: /tmp/bot/easybuild/build/LAMMPS/0cb72423b8ed2fdf138831c145a3bfb6ea42394e/foss-2023a-kokkos-dev_OBMD): build failed (first 300 chars): Couldn't determine CPU architecture, you need to set 'kokkos_arch' manually. (took 14 secs)

@laraPPr
Collaborator Author

laraPPr commented Jan 29, 2025

Ah, that is not the error I was seeing. Weird, normally it should be set correctly in the easyblock.

@laraPPr
Collaborator Author

laraPPr commented Jan 29, 2025

Pedro, can you tell me what is in the output here: /tmp/eb-w_sv3q0o/eb-9pov_lx4/easybuild-run_cmd-c69cx09o.log?
Does it recognize neoverse_n1?

@Neves-P
Member

Neves-P commented Jan 29, 2025

Seems like archspec is returning something else (cortex_a72):

== 2025-01-29 14:34:00,175 run.py:222 DEBUG run_cmd: running cmd python -c 'from archspec.cpu import host; print(host())' (in /tmp/bot/easybuild/build/LAMMPS/0cb72423b8ed2fdf138831c145a3bfb6ea42394e/foss-2023a-kokkos-dev_OBMD/lammps-0cb72423b8ed2fdf138831c145a3bfb6ea42394e)
== 2025-01-29 14:34:00,175 run.py:251 INFO Using /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/bin/bash as shell for running cmd: python -c 'from archspec.cpu import host; print(host())'
== 2025-01-29 14:34:00,175 run.py:260 INFO running cmd: python -c 'from archspec.cpu import host; print(host())' 
== 2025-01-29 14:34:00,627 run.py:702 DEBUG cmd "python -c 'from archspec.cpu import host; print(host())'" exited with exit code 0 and output:
cortex_a72

== 2025-01-29 14:34:00,627 run.py:738 DEBUG Using default regular expression: (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w)
== 2025-01-29 14:34:00,689 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Couldn't determine CPU architecture, you need to set 'kokkos_arch' manually. (at easybuild/easyblocks/lammps.py:612 in get_kokkos_arch)
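The "default regular expression" in the debug log above is what flags words such as "error" or "failed" in command output. The sketch below tries it on a few lines. The pattern is copied verbatim from the log, while the case-insensitive flag and all sample lines are assumptions made for illustration.

```python
import re

# Pattern copied verbatim from the EasyBuild debug log above.
# re.IGNORECASE is an assumption about how the pattern is applied.
ERROR_RE = re.compile(
    r"(?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w)",
    re.IGNORECASE,
)

# Made-up sample lines, plus the actual command output from this thread.
SAMPLES = [
    "make: *** [all] Error 1",
    "0 errors, 0 warnings",
    "gcc -Werror -c foo.c",
    "2 tests failed",
    "cortex_a72",
]

for line in SAMPLES:
    hit = ERROR_RE.search(line)
    print("match   " if hit else "no match", repr(line))
```

The lookbehind and lookahead keep the pattern from firing on words that merely contain "error", such as `errors` or `-Werror`. The archspec output `cortex_a72` contains none of the trigger words.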

@laraPPr
Collaborator Author

laraPPr commented Jan 29, 2025

Ah yes, because this is an Arm instance on Azure, so it has different flags than the Amazon neoverse_n1. OK, so this is not going to help me further with debugging laraPPr/lammps#3. Then maybe I'll just do it interactively on the Amazon cluster to see if I can reproduce it there.

But @ocaisa is probably interested in this case.

@casparvl are we testing the ReFrame test-suite (and by extension LAMMPS) on an Arm instance on Azure? Maybe we should also set up the test-suite on this cluster?
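A toy model of the flag mismatch described above: if detection only accepts a target when every flag it requires is visible on the host, then hiding a single flag makes the same CPU fall back to an older target. The flag sets, the hierarchy, and the `detect` function are simplified placeholders, not archspec's real data or algorithm.

```python
# Toy targets: name -> (required flags, parent). The flag lists are shortened
# placeholders for illustration, NOT archspec's real definitions.
TARGETS = {
    "aarch64": (set(), None),
    "cortex_a72": ({"fp", "asimd", "crc32"}, "aarch64"),
    "neoverse_n1": ({"fp", "asimd", "crc32", "atomics", "asimddp"}, "cortex_a72"),
}


def depth(name):
    """Number of ancestors of a target in the toy hierarchy."""
    parent = TARGETS[name][1]
    return 0 if parent is None else 1 + depth(parent)


def detect(host_flags):
    """Pick the most specific target whose required flags are all present."""
    compatible = [n for n, (flags, _) in TARGETS.items() if flags <= host_flags]
    return max(compatible, key=depth)


full = {"fp", "asimd", "crc32", "atomics", "asimddp"}
print(detect(full))                # every flag exposed
print(detect(full - {"asimddp"}))  # one flag not exposed (hypothetical case)
```

In this model, the second call lands on `cortex_a72` even though the hardware is unchanged, which is the kind of outcome seen in the log earlier in the thread.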

@casparvl

I see we never rescheduled the cronjob after moving to the new cluster. Done so now. The ReFrame config already contained the neoverse_n1 node, so it's pretty simple to also run tests there. I've scheduled a new cronjob that should run on both zen4 and neoverse_n1. For now, I seem to have had a job stuck in CF state for the past 27 minutes, so I can't properly verify whether these runs now work, but I don't see any reason why not.

@laraPPr
Collaborator Author

laraPPr commented Jan 30, 2025

@casparvl did the LAMMPS test pass with the test-suite on the arm azure instance?

@laraPPr
Collaborator Author

laraPPr commented Jan 30, 2025

Features cortex-a72:
Application profile, AArch32 and AArch64, 1–4 SMP cores, TrustZone, NEON advanced SIMD, VFPv4, hardware virtualization, 3-width superscalar, deeply out-of-order pipeline
Is the Kokkos mapping ARMV81 also the best match for cortex-a72? It seems it is, which just requires a simple change to the easyblock.

Features neoverse_n1:
Application profile, AArch32 (non-privileged level or EL0 only) and AArch64, 1–4 SMP cores, TrustZone, NEON advanced SIMD, VFPv4, hardware virtualization, 4-width decode superscalar, 8-way dispatch/issue, 13 stage pipeline, deeply out-of-order pipeline
-> currently building with kokkos-mapping: ARMV81
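One way to picture the "simple change" mentioned above is a lookup from the archspec name to a Kokkos arch that gains one more entry. This is a sketch under assumptions: `KOKKOS_CPU_ARCH` and `pick_kokkos_arch` are hypothetical names, and the actual table and logic in the LAMMPS easyblock are not shown in this thread.

```python
# Hypothetical lookup table for illustration; the real mapping lives in the
# LAMMPS easyblock and is not reproduced in this thread.
KOKKOS_CPU_ARCH = {
    "neoverse_n1": "ARMV81",  # mapping stated above for neoverse_n1
    "cortex_a72": "ARMV81",   # the extra entry proposed above
}


def pick_kokkos_arch(archspec_name, kokkos_arch=None):
    """Resolve the Kokkos CPU arch, preferring an explicit user setting."""
    if kokkos_arch:
        return kokkos_arch
    try:
        return KOKKOS_CPU_ARCH[archspec_name]
    except KeyError:
        # Same wording as the error message quoted earlier in the thread.
        raise ValueError(
            "Couldn't determine CPU architecture, "
            "you need to set 'kokkos_arch' manually."
        ) from None


print(pick_kokkos_arch("cortex_a72"))
```

Letting an explicit `kokkos_arch` win over detection keeps a manual override available when archspec reports an unexpected name.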

@casparvl

@casparvl did the LAMMPS test pass with the test-suite on the arm azure instance?

Yep:

[       OK ] (32/80) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=2_nodes /2de1d2ca @Magic_Castle_Azure:aarch64-neoverse-N1-16c-62gb+default
P: perf: 704.569 timesteps/s (r:0, l:None, u:None)
[       OK ] (33/80) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @Magic_Castle_Azure:aarch64-neoverse-N1-16c-62gb+default
P: perf: 758.562 timesteps/s (r:0, l:None, u:None)
[       OK ] (34/80) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=2_nodes /05cdd170 @Magic_Castle_Azure:aarch64-neoverse-N1-16c-62gb+default
P: perf: 701.426 timesteps/s (r:0, l:None, u:None)
[       OK ] (35/80) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @Magic_Castle_Azure:aarch64-neoverse-N1-16c-62gb+default
P: perf: 772.942 timesteps/s (r:0, l:None, u:None)

@laraPPr
Collaborator Author

laraPPr commented Feb 5, 2025

bot: build arch:aarch64/neoverse_n1

@dev-eessi-io-bot

dev-eessi-io-bot bot commented Feb 5, 2025

Updates by the bot instance dev.eessi.io-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/neoverse_n1 from laraPPr

    • expanded format: build architecture:aarch64/neoverse_n1
  • handling command build architecture:aarch64/neoverse_n1 resulted in:

@dev-eessi-io-bot

dev-eessi-io-bot bot commented Feb 5, 2025

New job on instance dev.eessi.io-bot-mc-azure for architecture aarch64-neoverse_n1 for repository dev.eessi.io in job dir /project/def-users/SHARED/jobs/2025.02/pr_16/291

date | job status | comment
Feb 05 13:00:02 UTC 2025 | submitted | job id 291 awaits release by job manager
Feb 05 13:00:58 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 05 13:04:03 UTC 2025 | running | job 291 is running
Feb 05 13:11:19 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-291.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-neoverse_n1-1738760981.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/example/software/linux/aarch64/neoverse_n1/modules/all
no module files in tarball
software under 2023.06/example/software/linux/aarch64/neoverse_n1/software
no software packages in tarball
other under 2023.06/example/software/linux/aarch64/neoverse_n1
no other files in tarball
Feb 05 13:11:19 UTC 2025 | test result | 🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job291.test does not exist in job directory or reading it failed.

@laraPPr
Collaborator Author

laraPPr commented Feb 5, 2025

bot: build arch:x86_64/amd/zen4

@dev-eessi-io-bot

dev-eessi-io-bot bot commented Feb 5, 2025

Updates by the bot instance dev.eessi.io-bot-mc-azure (click for details)
  • received bot command build arch:x86_64/amd/zen4 from laraPPr

    • expanded format: build architecture:x86_64/amd/zen4
  • handling command build architecture:x86_64/amd/zen4 resulted in:

@dev-eessi-io-bot

dev-eessi-io-bot bot commented Feb 5, 2025

New job on instance dev.eessi.io-bot-mc-azure for architecture x86_64-amd-zen4 for repository dev.eessi.io in job dir /project/def-users/SHARED/jobs/2025.02/pr_16/292

date | job status | comment
Feb 05 13:03:24 UTC 2025 | submitted | job id 292 awaits release by job manager
Feb 05 13:04:01 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 05 13:07:10 UTC 2025 | running | job 292 is running
Feb 05 13:09:15 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-292.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-1738760890.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/example/software/linux/x86_64/amd/zen4/modules/all
no module files in tarball
software under 2023.06/example/software/linux/x86_64/amd/zen4/software
no software packages in tarball
other under 2023.06/example/software/linux/x86_64/amd/zen4
no other files in tarball
Feb 05 13:09:15 UTC 2025 | test result | 🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job292.test does not exist in job directory or reading it failed.

@laraPPr
Collaborator Author

laraPPr commented Feb 5, 2025

neoverse_n1 build is now working with the new easyblock

== Determined cpu arch: ARMV81

== Using Kokkos package with arch: CPU - ARMV81, GPU - None

@laraPPr
Collaborator Author

laraPPr commented Feb 5, 2025

New job on instance dev.eessi.io-bot-mc-azure for architecture aarch64-neoverse_n1 for repository dev.eessi.io in job dir /project/def-users/SHARED/jobs/2025.02/pr_16/291

date | job status | comment
Feb 05 13:00:02 UTC 2025 | submitted | job id 291 awaits release by job manager
Feb 05 13:00:58 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 05 13:04:03 UTC 2025 | running | job 291 is running
Feb 05 13:11:19 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Feb 05 13:11:19 UTC 2025 | test result | 🤷 UNKNOWN (click triangle for detailed information)

Job 291 failed because of a missing patch file in the easyconfig.

@laraPPr
Collaborator Author

laraPPr commented Feb 5, 2025

bot: build arch:aarch64/neoverse_n1

@dev-eessi-io-bot

dev-eessi-io-bot bot commented Feb 5, 2025

Updates by the bot instance dev.eessi.io-bot-mc-azure (click for details)
  • received bot command build arch:aarch64/neoverse_n1 from laraPPr

    • expanded format: build architecture:aarch64/neoverse_n1
  • handling command build architecture:aarch64/neoverse_n1 resulted in:

@dev-eessi-io-bot

dev-eessi-io-bot bot commented Feb 5, 2025

New job on instance dev.eessi.io-bot-mc-azure for architecture aarch64-neoverse_n1 for repository dev.eessi.io in job dir /project/def-users/SHARED/jobs/2025.02/pr_16/293

date | job status | comment
Feb 05 13:16:28 UTC 2025 | submitted | job id 293 awaits release by job manager
Feb 05 13:17:23 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 05 13:18:25 UTC 2025 | running | job 293 is running
Feb 05 13:24:33 UTC 2025 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-293.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-neoverse_n1-1738761760.tar.gz
size: 106 MiB (112059021 bytes)
entries: 4475
modules under 2023.06/example/software/linux/aarch64/neoverse_n1/modules/all
LAMMPS/0cb72423b8ed2fdf138831c145a3bfb6ea42394e-foss-2023a-kokkos-dev_OBMD.lua
software under 2023.06/example/software/linux/aarch64/neoverse_n1/software
LAMMPS/0cb72423b8ed2fdf138831c145a3bfb6ea42394e-foss-2023a-kokkos-dev_OBMD
other under 2023.06/example/software/linux/aarch64/neoverse_n1
no other files in tarball
Feb 05 13:24:33 UTC 2025 | test result | 🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job293.test does not exist in job directory or reading it failed.
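The artefact summary above groups tarball entries into modules, software, and other. A rough sketch of such a grouping with Python's standard `tarfile` module follows. `summarise_tarball` is a hypothetical helper, the directory layout is inferred from the paths in the report, and the demo archive is a tiny made-up stand-in for the real tarball.

```python
import io
import tarfile


def summarise_tarball(fileobj, prefix):
    """Group member names under prefix into modules, software, and other."""
    groups = {"modules": set(), "software": set(), "other": set()}
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for name in tar.getnames():
            if not name.startswith(prefix + "/"):
                continue
            rel = name[len(prefix) + 1:]
            parts = rel.split("/")
            if rel.startswith("modules/all/") and len(parts) >= 4:
                # modules/all/<name>/<module file>
                groups["modules"].add("/".join(parts[2:4]))
            elif parts[0] == "software" and len(parts) >= 3:
                # software/<name>/<version dir>/...
                groups["software"].add("/".join(parts[1:3]))
            elif parts[0] not in ("modules", "software"):
                groups["other"].add(rel)
    return groups


# Build a tiny in-memory archive with made-up, empty placeholder members.
prefix = "2023.06/example/software/linux/aarch64/neoverse_n1"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for path in ("modules/all/LAMMPS/demo.lua", "software/LAMMPS/demo/bin/lmp"):
        tar.addfile(tarfile.TarInfo(prefix + "/" + path))
buf.seek(0)
print(summarise_tarball(buf, prefix))
```

An archive with no members under the prefix yields three empty groups, which corresponds to the "no module files in tarball" lines in the failed reports.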

@laraPPr
Collaborator Author

laraPPr commented Feb 5, 2025

@Neves-P the same zen4 error we're seeing in multixscale/dev.eessi.io-lammps-plugin-obmd#2 also shows up here.

@laraPPr
Collaborator Author

laraPPr commented Feb 5, 2025

This was just to test arm builds before multixscale/dev.eessi.io-lammps was set up. Can close this now.

@laraPPr laraPPr closed this Feb 5, 2025
@dev-eessi-io-bot

PR merged! Moved ['/project/def-users/SHARED/jobs/2025.01/pr_16/235', '/project/def-users/SHARED/jobs/2025.02/pr_16/291', '/project/def-users/SHARED/jobs/2025.02/pr_16/292', '/project/def-users/SHARED/jobs/2025.02/pr_16/293'] to $HOME/trash_bin/EESSI/dev.eessi.io-example/2025.02.05
