Crash when running OpenIFS CY40 with 4 threads per task

Created by Unknown User (de3j) on Jul 10, 2019

Good afternoon

I'm running some experiments with OpenIFS T511L91, which seems to scale up to ~1200 MPI tasks. Beyond that, the model actually becomes slower as more CPUs are added.

So I thought I would try to use more threads and fewer tasks. Rather than running with 800 MPI tasks and 1 thread per task, I'm trying 200 MPI tasks and 4 threads per task.

Each node has 40 physical cores, so I've allocated 10 tasks per node to give each thread its own core, i.e. no hyperthreading.

I've also set NPROC=200.
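
As a quick sanity check of the layout described above, the arithmetic can be sketched as follows. This is only an illustration: the function name `job_layout` is made up here, and the printed option names are the usual SLURM ones (`--nodes`, `--ntasks-per-node`, `--cpus-per-task`), not copied from the attached job script.

```python
def job_layout(mpi_tasks, threads_per_task, cores_per_node):
    """Return (tasks_per_node, nodes) when each thread gets its own physical core."""
    if cores_per_node % threads_per_task != 0:
        raise ValueError("threads per task must divide the cores per node")
    tasks_per_node = cores_per_node // threads_per_task
    nodes = -(-mpi_tasks // tasks_per_node)  # ceiling division
    return tasks_per_node, nodes


# Values from the post: 200 MPI tasks, 4 threads per task, 40 physical cores per node.
tasks_per_node, nodes = job_layout(mpi_tasks=200, threads_per_task=4, cores_per_node=40)
print(f"--nodes={nodes} --ntasks-per-node={tasks_per_node} --cpus-per-task=4")
```

With these numbers the job occupies 20 full nodes, the same 800 cores as the 800-task, single-thread run.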

OpenIFS is compiled with "-qopenmp" and "-O3". DRHOOK is turned off.

This is all done on Intel Skylake chips with Intel Fortran and MPI compilers version 2018 and ecCodes 2.12.0.

However, when I try to run, I get the following error:

[84] ABORT! 85 uniform_distribution called before initialize_random_numbers

and this originates from algor/module/random_numbers_mix.F90

I'm attaching the namelist, NODE file (named "ifs.log"), and the full log from SLURM.

Does anyone know what happened here? I'm not sure what random_numbers_mix does, but I haven't set LSTOPH, so I assume it's off and there should be no need for random numbers.

Best regards
Joakim

fort-2.4

3 Comments

Unknown User (nagc)
Hi Joakim,
A first comment about scaling: it will disappear once, as you increase the number of tasks, there is not enough work for each task compared with the cost of the communication. In other words, there is a pragmatic limit to how many parallel processes are 'correct' for each resolution. Higher resolutions will work more efficiently with higher numbers of tasks. You probably already know this, but I thought I'd add a comment on it for the forums.
I'd need to check the code but my guess is that random_numbers_mix is still called as part of the setup phase of the model even if the stochastic physics is not on.
One task failed in random_numbers_mix and then raised SIGABRT (signal 6), but looking at the log, the other tasks failed with signal 11 (segmentation violation). I'd need to double check the code to see if the model really does send signal 6.
The log shows the error message 'uniform_distribution called before initialize_random_numbers', which, looking at algor/module/random_numbers_mix.F90, comes from a check that happens right at the start of the subroutine 'uniform_distribution' and is triggered by an incorrect data value in the ydstream structure.
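
To illustrate the kind of check described above, here is a minimal Python analogue. It is not the code from random_numbers_mix.F90: the `RandomStream` class, its `initialised` field, and the toy generator are all assumptions made for the sketch. Only the two routine names and the error message come from the log.

```python
from dataclasses import dataclass


@dataclass
class RandomStream:
    # Hypothetical stand-in for the ydstream structure; the real fields may differ.
    initialised: bool = False
    seed: int = 0


def initialize_random_numbers(stream, seed):
    """Mark the stream as usable and store its starting state."""
    stream.seed = seed
    stream.initialised = True


def uniform_distribution(stream):
    """Return a number in [0, 1), refusing to run on an uninitialised stream."""
    # Guard at the very top of the routine: a stream whose state is wrong
    # (never initialised, or overwritten by corrupted memory) aborts here.
    if not stream.initialised:
        raise RuntimeError(
            "uniform_distribution called before initialize_random_numbers"
        )
    # Toy linear congruential step, standing in for the real generator.
    stream.seed = (1103515245 * stream.seed + 12345) % 2**31
    return stream.seed / 2**31
```

The point of the sketch is that the message can appear even when the initialisation routine was called, if the stream's state is later overwritten with garbage.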
I think this shows there's been some serious data corruption, probably in the subroutine argument stack. As you're running with high task numbers, it would take a lot of work to pin down what's going on.
There's nothing wrong with your approach of using 4 threads; it makes sense to have fewer tasks on a shared-memory node. I'm pretty certain we run with 4 threads here.
Have you tried any more experiments adjusting the tasks/threads count? I'd suggest trying first with -O2 instead of -O3 and see what that does. Also try 2 threads instead of 4.
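
One way to organise such tests is to list the task/thread combinations that keep the total core count fixed. The sketch below assumes the 800 cores and 40 cores per node mentioned in the original post; the helper name `candidate_layouts` is hypothetical. `NPROC` is the namelist setting from the post and `OMP_NUM_THREADS` is the standard OpenMP environment variable.

```python
def candidate_layouts(total_cores, cores_per_node, thread_counts=(1, 2, 4)):
    """Return (mpi_tasks, threads) pairs that use exactly total_cores cores."""
    layouts = []
    for threads in thread_counts:
        # Skip thread counts that do not pack evenly into the job or a node.
        if total_cores % threads or cores_per_node % threads:
            continue
        layouts.append((total_cores // threads, threads))
    return layouts


for tasks, threads in candidate_layouts(total_cores=800, cores_per_node=40):
    print(f"NPROC={tasks}  OMP_NUM_THREADS={threads}")
```

Running each combination at both -O2 and -O3 would show whether the crash follows the thread count, the optimisation level, or both.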

Cheers, Glenn
- Jul 23, 2019
Unknown User (nagc)
Another test would be to try another compiler. Do you have the Cray compiler available?
- Jul 23, 2019
Unknown User (de3j)
Hi Glenn
Thanks for digging into the log file, and for figuring out where the problem occurs.
I haven't tried another threads/tasks distribution, so I could experiment a bit, e.g. 2 threads per task and 400 tasks.
Our German HPC does not have Cray, just Intel, GCC and OpenMPI. So I think I can use:
ifort + Intel MPI
ifort + OpenMPI
gfortran + OpenMPI
I'll give it a try and see where I get to and report back when I've made some progress.
Thanks again for the help and advice!
Cheers
Joakim
- Jul 31, 2019
