Hi,

I use two machines to perform OpenIFS experiments: a desktop PC for testing and a small 16-core server for running experiments. The directory in which I build and run OpenIFS is mounted on both machines, and they should have identical environments, e.g. the same compiler versions. However, although I can build and run on the desktop machine, on the server the build succeeds but the program fails at startup. I get the following backtrace:

signal_drhook(SIGABRT=6): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGBUS=7): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGSEGV=11): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGSTKFLT=16): New handler installed at 0xac378a; old preserved at 0x0
signal_drhook(SIGFPE=8): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGILL=4): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGTRAP=5): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGINT=2): New handler installed at 0xac378a; old preserved at 0x0
signal_drhook(SIGQUIT=3): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGTERM=15): New handler installed at 0xac378a; old preserved at 0x0
signal_drhook(SIGXCPU=24): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGSYS=31): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
JSETSIG: sl->active = 0
signal_harakiri(SIGALRM=14): New handler installed at 0xabeae4; old preserved at 0x0
***Received signal = 4 and ActivatED SIGALRM=14 and calling alarm(10), time =    0.01
[myproc#1,tid#1,pid#2415,signal#4(SIGILL)]: Received signal :: 17MB (heap), 17MB (rss), 0MB (stack), 0 (paging), nsigs 1, time     0.01
tid#1 starting drhook traceback, time =    0.01
[myproc#1,tid#1,pid#2415]:  MASTER 
[myproc#1,tid#1,pid#2415]:   CNT0<1> 
tid#1 starting sigdump traceback, time =    0.01
[gdb__sigdump] : Received signal#4(SIGILL), pid=2415
[LinuxTraceBack]: Backtrace(s) for program './master.exe' (pid=2415) :
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:109  :  master.exe() [0xaf4ce8]
(pid=2415):      /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:883  :  master.exe() [0xabebe1]
(pid=2415):     /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1119  :  master.exe() [0xac3b5d]
(pid=2415):                                                                              <Unknown>  :  libpthread.so.0(+0x10330) [0x2afb43e18330]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/user_clock.F90:67  :  master.exe() [0xb007cf]
(pid=2415):    /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/gstats.F90:153  :  master.exe() [0xad288e]
(pid=2415):         /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifs/control/cnt0.F90:112  :  master.exe() [0x409f7f]
(pid=2415):           /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/programs/master.F90:65  :  master.exe() [0x408f06]
(pid=2415):                                                                              <Unknown>  :  libc.so.6(__libc_start_main+0xf5) [0x2afb44047f45]
(pid=2415):                                                                              <Unknown>  :  master.exe() [0x408f7d]
[LinuxTraceBack] : End of backtrace(s)
Done tracebacks, calling exit with sig=4, time =    0.05
 ABORT!    1 Dr.Hook calls ABOR1 ...
[myproc#1,tid#1,pid#2415]:  MASTER 
[myproc#1,tid#1,pid#2415]:   CNT0<1> 
 SDL_TRACEBACK: Calling LINUX_TRBK, THRD =            1
[LinuxTraceBack]: Backtrace(s) for program './master.exe' (pid=2415) :
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:109  :  master.exe() [0xaf4ce8]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:189  :  master.exe() [0xaf4d1d]
(pid=2415):     /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/module/sdl_mod.F90:71  :  master.exe() [0xb0599f]
(pid=2415):      /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/abor1.F90:37  :  master.exe() [0xab3417]
(pid=2415):     /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1123  :  master.exe() [0xac3bb1]
(pid=2415):                                                                              <Unknown>  :  libpthread.so.0(+0x10330) [0x2afb43e18330]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/user_clock.F90:67  :  master.exe() [0xb007cf]
(pid=2415):    /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/gstats.F90:153  :  master.exe() [0xad288e]
(pid=2415):         /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifs/control/cnt0.F90:112  :  master.exe() [0x409f7f]
(pid=2415):           /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/programs/master.F90:65  :  master.exe() [0x408f06]
(pid=2415):                                                                              <Unknown>  :  libc.so.6(__libc_start_main+0xf5) [0x2afb44047f45]
(pid=2415):                                                                              <Unknown>  :  master.exe() [0x408f7d]
[LinuxTraceBack] : End of backtrace(s)
 SDL_TRACEBACK: Done LINUX_TRBK, THRD =            1
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 2415 on node cirrus1 exited on signal 9 (Killed).
--------------------------------------------------------------------------


We have made some modifications to OpenIFS, but I don't think it's a bug on our side because it works fine on the desktop PC. It looks like there's an illegal instruction in one of the clock functions. Any idea what's going wrong?

Previously I was getting a similar error originating from drhook.c line 4040, but that's gone away for some reason.

I build from scratch on both machines with gcc/gfortran version 4.8.3.

Thanks,

Sam Hatfield

4 Comments

  1. This is my config file by the way:

    #  FCM compiler & build specific configuration file for OpenIFS
    #  gnu compiler, opt build
    
    #          Glenn Carver         ECMWF 2012-2014
    
    #  ? means if an environment variable of the same name exists, it will
    #  be used instead of the default specified in this file.
    #
    #  $HERE is a special FCM only variable which means the directory in 
    #  which this file exists. It cannot be replaced by environment variable.
    
    #  GRIB API
    #  Some compilers will link to shared library (if exists) by default. Care
    #  is needed to ensure correct library is loaded at runtime. The .a files could be
    #  listed explicitly but problems have been found with this approach and some 
    #  compiler wrappers.
    $OIFS_GRIB_API_DIR{?}     = $HOME/ecmwf/grib
    $OIFS_GRIB_API_INCLUDE{?} = -I $OIFS_GRIB_API_DIR/include
    $OIFS_GRIB_API_LIB{?}     = -L$OIFS_GRIB_API_DIR/lib -lgrib_api_f90 -lgrib_api
    
    # LAPACK & BLAS libraries
    $OIFS_LAPACK_LIB{?}   = -L/home/rd/openifs/software/lapack/3.4.2/lapack-gcc-4.5.0-opt -llapack -lblas
    
    # Extra libraries (architecture/compiler specific)
    $OIFS_EXTRA_LIB{?}    = 
    
    # Source files that FCM should specifically ignore
    $OIFS_SRC_EXCL{?} = 
    
    #  Fortran
    
    $OIFS_FC{?}     = mpif90
    #$OIFS_FFLAGS{?} = -g -O2 -m64 -march=native -fconvert=big-endian -fopenmp
    $OIFS_FFLAGS{?} = -g -O2 -m64 -march=native -fconvert=big-endian 
    $OIFS_FFIXED{?} = -fdefault-real-8 -fdefault-double-8 -ffixed-line-length-132
    $OIFS_FCDEFS{?} = BLAS LITTLE LINUX INTEGER_IS_INT F90 PARAL NONCRAYF
    $OIFS_LFLAGS{?} = -fopenmp
    
    # C compiler
    
    $OIFS_CC{?}     = mpicc
    $OIFS_CFLAGS{?} = -g -O -m64 -march=native
    $OIFS_CCDEFS{?} = BLAS LITTLE LINUX INTEGER_IS_INT _ABI64
    
    # Architecture/compiler specific FCM (if any)
    
    
    
    
  2. Unknown User (nagc)

    Hi Sam,

    Try taking the:

    -march=native

    option off the compile options line. It can produce faster code, because the compiler autodetects the chip and compiles specifically for that architecture, but I've seen it cause problems: either the compiler does not do the right thing, or an executable is compiled on one machine and then moved to another with a very similar but slightly different chip. For that reason I no longer specify it in the default options for OpenIFS 40r1.
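One way to check whether the second failure mode applies is to compare the CPU feature flags of the build host and the run host. The sketch below is only an illustration, not part of OpenIFS: the function names are hypothetical, and it assumes Linux machines where `/proc/cpuinfo` contains a `flags` line. Any flag the build host has and the run host lacks is a candidate for an instruction that `-march=native` may emit and the run host cannot execute.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The first "flags" entry is enough: all cores of one CPU report the same set.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def flags_missing_on_run_host(build_cpuinfo, run_cpuinfo):
    """Sorted list of flags present on the build host but absent on the run host."""
    return sorted(cpu_flags(build_cpuinfo) - cpu_flags(run_cpuinfo))
```

To use it, save the contents of `/proc/cpuinfo` from each machine to a file, read both files, and pass the two strings to `flags_missing_on_run_host`. An empty result suggests the instruction sets match and the cause lies elsewhere.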

    If that's not it, does the model run ok with 'noopt'?

    Glenn


  3. Hi Glenn,

    I think it is something to do with that flag, or maybe -m64. The desktop has an Intel Core i7-4770 and the server an Intel Xeon E5630. I played around with those flags and was able to get the model to run. When I switched back to the compilation setup shown above, however, I couldn't reproduce the original problem. Well, that's how it goes...
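The two CPUs named here come from different generations: the Core i7-4770 is a Haswell part with AVX2, while the Xeon E5630 is a Westmere part without AVX. `-march=native` would therefore be expected to resolve differently on each machine, and a binary built on the desktop could contain instructions the server does not have. A minimal sketch for asking GCC what it resolves to on a given host follows. It assumes `gcc` is on the `PATH` and that the `-Q --help=target` report contains a line of the form `-march=  <value>`; the function names are hypothetical.

```python
import subprocess


def parse_march(help_target_output):
    """Extract the -march= value from `gcc -Q --help=target` output, or None."""
    for line in help_target_output.splitlines():
        fields = line.split()
        # The setting is reported as two whitespace-separated fields: "-march=" and its value.
        if len(fields) == 2 and fields[0] == "-march=":
            return fields[1]
    return None


def resolved_march(compiler="gcc"):
    """Ask the compiler what -march=native resolves to on the current host."""
    result = subprocess.run(
        [compiler, "-march=native", "-Q", "--help=target"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        check=True,
    )
    return parse_march(result.stdout)
```

Running `resolved_march()` on both the desktop and the server and comparing the two values shows whether the builds target the same architecture. If they differ, the executable has to be rebuilt on the machine it runs on, or built with an explicit `-march` value that both CPUs support.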

    Sam

  4. Unknown User (nagc)

    It's possible the gnu compiler is not installed correctly on the Xeon and is generating bad code with -march. Do you have a more recent version to try? gfortran 4.9.3 is the earliest I test with, but I'm sure 38r1 was originally tested with gnu 4.8.3.

    The -m64 option should do nothing, as both are 64-bit machines?

    Glenn