Hi,

This is an odd one. I successfully ran OpenIFS (Cy38r1 - I promise to upgrade once this project has finished!) on Friday at T511 resolution across 16 MPI tasks (single node). Now, I get an error (full error given below).

It looks like SUMPOUT is the problem, but I haven't changed anything there.

Additionally, the node file is called just NODE.0 instead of the regular NODE.001_01.

Any ideas?

Thanks

signal_drhook(SIGABRT=6): New handler installed at 0xa4afb7; old preserved at 0x2ae970cd5f50
signal_drhook(SIGBUS=7): New handler installed at 0xa4afb7; old preserved at 0x2ae970cd5f50
signal_drhook(SIGSEGV=11): New handler installed at 0xa4afb7; old preserved at 0x2ae970cd5f50
signal_drhook(SIGSTKFLT=16): New handler installed at 0xa4afb7; old preserved at 0x0
signal_drhook(SIGFPE=8): New handler installed at 0xa4afb7; old preserved at 0x2ae970cd5f50
signal_drhook(SIGILL=4): New handler installed at 0xa4afb7; old preserved at 0x2ae970cd5f50
signal_drhook(SIGTRAP=5): New handler installed at 0xa4afb7; old preserved at 0x2ae970cd5f50
signal_drhook(SIGINT=2): New handler installed at 0xa4afb7; old preserved at 0x0
signal_drhook(SIGQUIT=3): New handler installed at 0xa4afb7; old preserved at 0x2ae970cd5f50
signal_drhook(SIGTERM=15): New handler installed at 0xa4afb7; old preserved at 0x0
signal_drhook(SIGXCPU=24): New handler installed at 0xa4afb7; old preserved at 0x2ae970cd5f50
signal_drhook(SIGSYS=31): New handler installed at 0xa4afb7; old preserved at 0x2ae970cd5f50
MPL_BUFFER_METHOD:  2           0
[myproc#1,tid#1,pid#32351,signal#6(SIGABRT)]: Dr.Hook has detected an invalid key-pointer/handle while leaving the routine 'SUMPOUT' [hash=52855]
[myproc#1,tid#1,pid#32351,signal#6(SIGABRT)]: Expecting the key-pointer=0x2ae983faa238 and treeptr->active-flag = 1
[myproc#1,tid#1,pid#32351,signal#6(SIGABRT)]: A probable routine missing the closing DR_HOOK-call is 'SUMPOUT' [hash=52855]
JSETSIG: sl->active = 0
signal_harakiri(SIGALRM=14): New handler installed at 0xa462f7; old preserved at 0x0
***Received signal = 11 and ActivatED SIGALRM=14 and calling alarm(10), time =    0.01
[myproc#1,tid#1,pid#32351,signal#11(SIGSEGV)]: Received signal :: 17MB (heap), 19MB (rss), 0MB (stack), 0 (paging), nsigs 1, time     0.01
tid#1 starting drhook traceback, time =    0.01
[myproc#1,tid#1,pid#32351]:  MASTER 
[myproc#1,tid#1,pid#32351]:   CNT0< 
[myproc#1,tid#1,pid#32351]:    SU0YOMA 
[myproc#1,tid#1,pid#32351]:     SULUN 
[myproc#1,tid#1,pid#32351]:      SUMPOUT 
tid#1 starting sigdump traceback, time =    0.01
[gdb__sigdump] : Received signal#11(SIGSEGV), pid=32351
[LinuxTraceBack]: Backtrace(s) for program './master.exe' (pid=32351) :
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[64835,1],0] (PID 32351)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
(pid=32351): /home/samuelhatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:109  :  master.exe() [0xa69744]
(pid=32351):      /home/samuelhatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:883  :  master.exe() [0xa46398]
(pid=32351):     /home/samuelhatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1119  :  master.exe() [0xa4b2ad]
(pid=32351):                                                                <Unknown>  :  libpthread.so.0(+0xf6d0) [0x2ae972b7a6d0]
(pid=32351):     /home/samuelhatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1800  :  master.exe() [0xa4c871]
(pid=32351): /home/samuelhatfield/openifs-cy38r1/src/ifsaux/support/dr_hook_util.F90:148  :  master.exe() [0xa4f347]
(pid=32351):        /home/samuelhatfield/openifs-cy38r1/src/ifsaux/module/yomhook.F90:49  :  master.exe() [0x40a036]
(pid=32351):            /home/samuelhatfield/openifs-cy38r1/src/ifs/setup/sumpout.F90:90  :  master.exe() [0x41fb30]
(pid=32351):              /home/samuelhatfield/openifs-cy38r1/src/ifs/setup/sulun.F90:92  :  master.exe() [0x41fa4d]
(pid=32351):           /home/samuelhatfield/openifs-cy38r1/src/ifs/setup/su0yoma.F90:142  :  master.exe() [0x40b01e]
(pid=32351):            /home/samuelhatfield/openifs-cy38r1/src/ifs/control/cnt0.F90:124  :  master.exe() [0x40a4dd]
(pid=32351):              /home/samuelhatfield/openifs-cy38r1/src/programs/master.F90:65  :  master.exe() [0x409a0b]
(pid=32351):                                                                   <Unknown>  :  libc.so.6(__libc_start_main+0xf5) [0x2ae972da9445]
(pid=32351):                                                                   <Unknown>  :  master.exe() [0x409a7f]
[LinuxTraceBack] : End of backtrace(s)
Done tracebacks, calling exit with sig=11, time =    0.04
 ABORT!    1 Dr.Hook calls ABOR1 ...
[myproc#1,tid#1,pid#32351]:  MASTER 
[myproc#1,tid#1,pid#32351]:   CNT0< 
[myproc#1,tid#1,pid#32351]:    SU0YOMA 
[myproc#1,tid#1,pid#32351]:     SULUN 
[myproc#1,tid#1,pid#32351]:      SUMPOUT 
 SDL_TRACEBACK: Calling LINUX_TRBK, THRD =            1
[LinuxTraceBack]: Backtrace(s) for program './master.exe' (pid=32351) :
(pid=32351): /home/samuelhatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:109  :  master.exe() [0xa69744]
(pid=32351): /home/samuelhatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:189  :  master.exe() [0xa69a07]
(pid=32351):     /home/samuelhatfield/openifs-cy38r1/src/ifsaux/module/sdl_mod.F90:71  :  master.exe() [0xa9c3f8]
(pid=32351):      /home/samuelhatfield/openifs-cy38r1/src/ifsaux/support/abor1.F90:37  :  master.exe() [0xa4445e]
(pid=32351):     /home/samuelhatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1123  :  master.exe() [0xa4b2fc]
(pid=32351):                                                                <Unknown>  :  libpthread.so.0(+0xf6d0) [0x2ae972b7a6d0]
(pid=32351):     /home/samuelhatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1800  :  master.exe() [0xa4c871]
(pid=32351): /home/samuelhatfield/openifs-cy38r1/src/ifsaux/support/dr_hook_util.F90:148  :  master.exe() [0xa4f347]
(pid=32351):        /home/samuelhatfield/openifs-cy38r1/src/ifsaux/module/yomhook.F90:49  :  master.exe() [0x40a036]
(pid=32351):            /home/samuelhatfield/openifs-cy38r1/src/ifs/setup/sumpout.F90:90  :  master.exe() [0x41fb30]
(pid=32351):              /home/samuelhatfield/openifs-cy38r1/src/ifs/setup/sulun.F90:92  :  master.exe() [0x41fa4d]
(pid=32351):           /home/samuelhatfield/openifs-cy38r1/src/ifs/setup/su0yoma.F90:142  :  master.exe() [0x40b01e]
(pid=32351):            /home/samuelhatfield/openifs-cy38r1/src/ifs/control/cnt0.F90:124  :  master.exe() [0x40a4dd]
(pid=32351):              /home/samuelhatfield/openifs-cy38r1/src/programs/master.F90:65  :  master.exe() [0x409a0b]
(pid=32351):                                                                   <Unknown>  :  libc.so.6(__libc_start_main+0xf5) [0x2ae972da9445]
(pid=32351):                                                                   <Unknown>  :  master.exe() [0x409a7f]
[LinuxTraceBack] : End of backtrace(s)
 SDL_TRACEBACK: Done LINUX_TRBK, THRD =            1
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 32351 on node hsw212 exited on signal 9 (Killed).

4 Comments

  1. Unknown User (nagc)

    Hi Sam,

    This normally means that the OpenMP stack size is not large enough. The default per-process stack size limit is too low and needs to be increased. Check the value of the OMP_STACKSIZE environment variable. It might also mean the model was not compiled correctly; this can happen, for example, if files that require -r8 were compiled without it. Something has trashed the pointer to drhook, and that usually means a stack size problem.

    Also make sure you have 'ulimit -s unlimited' set in the script to run the model.

    Cheers,  Glenn
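
As a rough sketch of the stack-size advice above, the start of a run script might look something like this. The 128M value for OMP_STACKSIZE is only a placeholder assumption (the thread does not give a number), and the commented launch line is a guess based on the 16 MPI tasks and ./master.exe mentioned in the post.

```shell
#!/bin/bash
# Sketch of a run-script preamble; values are illustrative assumptions.

# Raise the per-process stack limit. If "unlimited" is not permitted,
# fall back to the hard limit, which is the most the shell may set.
ulimit -s unlimited 2>/dev/null || ulimit -s "$(ulimit -H -s)"

# Per-thread OpenMP stack size. 128M is a placeholder, not a value
# recommended anywhere in this thread; an existing setting is kept.
export OMP_STACKSIZE="${OMP_STACKSIZE:-128M}"

# Print the effective settings so they end up in the job log.
echo "stack limit:   $(ulimit -s)"
echo "OMP_STACKSIZE: ${OMP_STACKSIZE}"

# Hypothetical launch line (not taken from the original post):
# mpirun -np 16 ./master.exe
```

Printing the limits in the job log makes it easier to confirm later that the settings actually reached the compute node.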

  2. Unknown User (gbsh)

    Hi Glenn,

    Thanks for the response. I had a play around with OMP_STACKSIZE etc. but it seems that the cluster I'm using has a rather confused module system, with various conflicting versions of gfortran loaded at any one time. I played around with these and managed to find a combination that compiles and runs successfully. Sorry for the time wasting!

    Sam

    1. Unknown User (nagc)

      Hi Sam,

      Ok, great. Do you know more specifically what was causing the problem? Would be useful to know for future reference.

      Cheers,  Glenn

      1. Unknown User (gbsh)

        I believe it was because I compiled the executable with a different version of GNU Fortran than the one on the LD_LIBRARY_PATH at runtime (I think the matching libgfortran shared library, libgfortran.so, needs to be on LD_LIBRARY_PATH at runtime). Why that would result in a stacksize problem is beyond me though.

        Sam
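
For future reference, a compiler/runtime mismatch like the one described above can probably be spotted by comparing the gfortran found on PATH with the libgfortran that the dynamic linker resolves for the executable. The following is a minimal sketch, assuming a Linux system with ldd available; the helper name runtime_libgfortran and the EXE variable are made up for illustration.

```shell
#!/bin/bash
# Sketch: compare the compiler on PATH with the runtime libgfortran.
# EXE defaults to ./master.exe, the name that appears in the traceback.
EXE="${EXE:-./master.exe}"

# Hypothetical helper: print the libgfortran line reported by ldd for
# the given executable, or a short note if nothing can be reported.
runtime_libgfortran() {
    if [ ! -x "$1" ]; then
        echo "no executable at $1"
        return 1
    fi
    ldd "$1" 2>/dev/null | grep libgfortran \
        || echo "libgfortran not dynamically linked"
}

# Which gfortran would be used to compile right now, and its version.
echo "compiler on PATH: $(command -v gfortran || echo 'gfortran not found')"
gfortran --version 2>/dev/null | head -n 1

# Where the dynamic linker will search first at runtime.
echo "LD_LIBRARY_PATH:  ${LD_LIBRARY_PATH:-(unset)}"

# The library actually resolved for the executable.
runtime_libgfortran "$EXE" || true
```

If the path printed for libgfortran belongs to a different GCC installation than the gfortran on PATH, the loaded modules are likely inconsistent between build time and run time.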