@hsinhaoHHuang
Contributor

Issue

A user reported that the old job script no longer works correctly on Forerunner-I.
We previously assumed that the NUMA layout was identical on all Forerunner-I nodes, so the script would launch 16 MPI ranks per node with 7 OpenMP threads per rank (112 CPUs per node in total).

This still holds for some nodes. On other nodes, however, the script launches only 4 MPI ranks with 28 OpenMP threads per rank. Worse, the CPU core IDs recorded in Record__Note are repeated 4 times within each rank, so only 28 CPUs per node are actually used. Performance suffers because most of the cores on those nodes sit idle and the run is not fully parallelized.
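The arithmetic behind the two cases can be checked with a minimal sketch. The helper `summarize_binding` and the core numbering in both layouts are hypothetical, chosen only to reproduce the counts described above; they are not read from an actual Record__Note file.

```python
from collections import Counter


def summarize_binding(core_ids_per_rank):
    """Return (total threads, distinct cores used, max threads sharing one core)."""
    all_ids = [core for rank in core_ids_per_rank for core in rank]
    counts = Counter(all_ids)
    return len(all_ids), len(counts), max(counts.values())


# Hypothetical problematic node: 4 ranks x 28 threads, but each rank only
# touches 7 cores, so every core ID shows up 4 times within its rank.
bad_layout = [[7 * rank + (thread % 7) for thread in range(28)] for rank in range(4)]

# Hypothetical expected node: 16 ranks x 7 threads, one core per thread.
good_layout = [[7 * rank + thread for thread in range(7)] for rank in range(16)]

print(summarize_binding(bad_layout))   # (112, 28, 4)
print(summarize_binding(good_layout))  # (112, 112, 1)
```

Both layouts start 112 threads per node, but the first one only occupies 28 distinct cores, which matches the slowdown described above.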

Change

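As @koarakawaii suggested to me a while ago, we can use -map-by ppr:16:node:pe=7 instead to avoid this problem. This option explicitly requests 16 processes per node with 7 processing elements (cores) per process, so the layout no longer depends on the NUMA configuration of each node.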
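As a rough illustration of how the new mapping option fits into a launch line, the sketch below assembles the command and checks that ranks times threads matches the core count. It assumes an Open MPI `mpirun`; the helper `build_mpirun_command`, the `-x OMP_NUM_THREADS` export, and the executable name `./gamer` are assumptions for illustration and are not taken from the actual job script.

```python
import shlex


def build_mpirun_command(ranks_per_node, threads_per_rank, cores_per_node, executable):
    """Assemble an mpirun argument list with an explicit per-node mapping."""
    if ranks_per_node * threads_per_rank != cores_per_node:
        raise ValueError(
            f"{ranks_per_node} ranks x {threads_per_rank} threads "
            f"does not fill {cores_per_node} cores"
        )
    return [
        "mpirun",
        # 16 processes per node, 7 processing elements (cores) per process
        "-map-by", f"ppr:{ranks_per_node}:node:pe={threads_per_rank}",
        # Assumption: export the OpenMP thread count to every rank
        "-x", f"OMP_NUM_THREADS={threads_per_rank}",
        executable,
    ]


# Hypothetical executable name; the real job script may differ.
cmd = build_mpirun_command(16, 7, 112, "./gamer")
print(shlex.join(cmd))
```

Keeping the ranks, threads, and core count in one place makes a mismatch (for example 16 x 8 on a 112-core node) fail before the job is submitted rather than show up later as idle or oversubscribed cores.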
