I am compiling OpenIFS with the GNU compilation flag -ffpe-trap=zero,underflow,overflow in order to detect divide-by-zero, underflow and overflow errors. The compilation is successful, and floating-point exceptions do occur; they are caught by DrHook, which terminates the program. However, I can't find any information about the type of floating-point exception. The only way I can get this information is by compiling with only one exception check at a time (e.g. zero only) and running the model, which takes a long time.

Is there a way to see detailed floating-point exception information?

6 Comments

  1. Unknown User (nagc)

    Hi Sam,

    I had a look at some of my tracebacks and the signal that causes the exception seems to be reported:

    signal_harakiri(SIGALRM=14): New handler installed at 0xe34da2; old preserved at 0x0
    ***Received signal = 6 and ActivatED SIGALRM=14 and calling alarm(10), time =    0.19

    In this case the original signal was 6. Do you see this?

    If the compiler's runtime is already catching the signals correctly and producing a traceback, you can, if you prefer, turn off signal handling by DrHook in OpenIFS completely by setting this environment variable in the job:

    export DR_HOOK_IGNORE_SIGNALS=-1

    or use the same environment variable to give a list of signals to ignore (and let the runtime system handle them).

    There is also the equivalent DR_HOOK_CATCH_SIGNALS, which takes a comma-separated list of additional signals to be caught as well as the default ones.

    It's also possible to turn off DrHook tracing & signal handling completely with:

    export DR_HOOK=0

    I've attached a PDF describing DrHook which, although a little old now, should still be mostly accurate.

    Cheers,  Glenn


  2. Unknown User (nagc)

    Hi Sam,

    I checked about the capabilities of DrHook. In the version provided with OpenIFS 40r1, there is no provision for capturing the floating point exception.

    However, DrHook has been updated in the very latest IFS development cycle, and that version provides more information on the floating-point exception that caused the SIGFPE. I might look at this to see if we can port it back to an older cycle. But if you'd like to try it, let me know and I'll pass on the latest DrHook code (no promise it will work though).
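
For reference, on Linux the kernel already passes the cause of a SIGFPE to any handler installed with sigaction() and SA_SIGINFO, in the si_code field of siginfo_t (FPE_FLTDIV, FPE_FLTOVF, FPE_FLTUND and so on). The sketch below shows that general mechanism only. It is not taken from DrHook, and whether the newer DrHook works this way is an assumption. The names fpe_code_name, report_fpe and install_fpe_reporter are hypothetical.

```c
/* Ask for POSIX.1-2008 so siginfo_t and SA_SIGINFO are declared
   even when compiling with a strict -std= option. */
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Map the si_code delivered with SIGFPE to a readable description. */
const char *fpe_code_name(int si_code)
{
    switch (si_code) {
    case FPE_INTDIV: return "integer divide by zero";
    case FPE_INTOVF: return "integer overflow";
    case FPE_FLTDIV: return "floating-point divide by zero";
    case FPE_FLTOVF: return "floating-point overflow";
    case FPE_FLTUND: return "floating-point underflow";
    case FPE_FLTRES: return "floating-point inexact result";
    case FPE_FLTINV: return "invalid floating-point operation";
    case FPE_FLTSUB: return "subscript out of range";
    default:         return "unknown SIGFPE cause";
    }
}

/* Write a string to stderr using only async-signal-safe calls. */
static void emit(const char *s)
{
    ssize_t n = write(STDERR_FILENO, s, strlen(s));
    (void) n; /* nothing useful can be done if the write fails */
}

/* Hypothetical handler: report the cause of the SIGFPE, then terminate. */
void report_fpe(int sig, siginfo_t *info, void *context)
{
    (void) sig;
    (void) context;
    emit("SIGFPE cause: ");
    emit(fpe_code_name(info->si_code));
    emit("\n");
    _exit(128 + SIGFPE);
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_fpe_reporter(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = report_fpe;
    sa.sa_flags = SA_SIGINFO;   /* request the three-argument handler */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGFPE, &sa, NULL);
}
```

With all traps enabled (for example via -ffpe-trap=zero,underflow,overflow), calling install_fpe_reporter() early in a small test program should make a trapped exception print its cause before exiting. In OpenIFS, DrHook's own signal handling would need to be turned off, or this logic merged into it, for such a handler to receive the signal.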

    Glenn

  3. Hi Glenn,

    As an example of what I'm looking for, I get the following output when I compile OpenIFS (38r1) with -ffpe-trap=underflow:

    signal_harakiri(SIGALRM=14): New handler installed at 0xe7e481; old preserved at 0x0
    ***Received signal = 8 and ActivatED SIGALRM=14 and calling alarm(10), time =    0.85
    [myproc#1,tid#1,pid#15079,signal#8(SIGFPE)]: Received signal :: 63MB (heap), 51MB (rss), 0MB (stack), 0 (paging), nsigs 1, time     0.85
    tid#1 starting drhook traceback, time =    0.85
    [myproc#1,tid#1,pid#15079]:  MASTER 
    [myproc#1,tid#1,pid#15079]:   CNT0<1> 
    [myproc#1,tid#1,pid#15079]:    SU0YOMB 
    [myproc#1,tid#1,pid#15079]:     SUPHY 
    [myproc#1,tid#1,pid#15079]:      SUPHEC 
    [myproc#1,tid#1,pid#15079]:       SUECRAD 
    [myproc#1,tid#1,pid#15079]:        SURRTAB 
    tid#1 starting sigdump traceback, time =    0.85
    [gdb__sigdump] : Received signal#8(SIGFPE), pid=15079
    [LinuxTraceBack]: Backtrace(s) for program '../make/gnu-rpe/oifs/bin/master.exe' (pid=15079) :
    (pid=15079): /hatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:109  :  master.exe() [0xec6a7f]
    (pid=15079):      /hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:883  :  master.exe() [0xe7e56f]
    (pid=15079):     /hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1119  :  master.exe() [0xe82e39]
    (pid=15079):                                                     <Unknown>  :  libpthread.so.0(+0x10330) [0x2adc3b966330]
    (pid=15079):                                                     <Unknown>  :  libm.so.6(+0x5f980) [0x2adc3a452980]
    (pid=15079):                                                     <Unknown>  :  libm.so.6(exp+0x13) [0x2adc3a418f63]
    (pid=15079):     /hatfield/openifs-cy38r1/src/ifs/phys_radi/surrtab.F90:29  :  master.exe() [0x4ab7db]
    (pid=15079):   /hatfield/openifs-cy38r1/src/ifs/phys_radi/suecrad.F90:1307  :  master.exe() [0x49d278]
    (pid=15079):       /hatfield/openifs-cy38r1/src/ifs/phys_ec/suphec.F90:247  :  master.exe() [0x48c9d2]
    (pid=15079):           /hatfield/openifs-cy38r1/src/ifs/setup/suphy.F90:95  :  master.exe() [0x4778e8]
    (pid=15079):        /hatfield/openifs-cy38r1/src/ifs/setup/su0yomb.F90:486  :  master.exe() [0x409dde]
    (pid=15079):         /hatfield/openifs-cy38r1/src/ifs/control/cnt0.F90:131  :  master.exe() [0x407a97]
    (pid=15079):           /hatfield/openifs-cy38r1/src/programs/master.F90:65  :  master.exe() [0x406c60]
    (pid=15079):            /hatfield/openifs-cy38r1/src/programs/master.F90:3  :  master.exe() [0x406cd9]
    (pid=15079):                                                     <Unknown>  :  libc.so.6(__libc_start_main+0xf5) [0x2adc3bb95f45]
    (pid=15079):                                                     <Unknown>  :  master.exe() [0x406b09]
    [LinuxTraceBack] : End of backtrace(s)
    Done tracebacks, calling exit with sig=8, time =    0.90

    So an underflow occurs on line 29 of surrtab.F90, but I only know that because underflow is the only exception that I've enabled.

    I thought I once found a way to display the kind of floating-point exception, but I haven't been able to reproduce this with another model (which doesn't use DrHook). For example, I compiled this model with -ffpe-trap=overflow. I then triggered a runtime overflow error by multiplying huge(variable) by a number that depends on the system clock. However, this still just gives me SIGFPE with "erroneous arithmetic operation". There's no mention of an overflow. So I thought what I was asking for wasn't possible, but you mentioned that the latest DrHook somehow retrieves this information?

  4. Unknown User (nagc)

    Hi Sam,

    Have you looked at using 'gdb' to run OpenIFS? I am not an expert on gdb, but if you run the executable under gdb ( gdb ./master.exe ...), it will halt on the exception (you will probably need to turn off DrHook), and from the source line you should be able to determine the error.

    If you are comfortable with C code, then the changes to DrHook are straightforward. With the changes, no recompilation between runs would be needed: DrHook can be told to selectively trap each exception in turn.

    Locate drhook.c in ifsaux/support. Look for the trapfpe() function:

    static void trapfpe(void)
    {
      /* Enable some exceptions. At startup all exceptions are masked. */
      (void) feenableexcept(FE_INVALID|FE_DIVBYZERO|FE_OVERFLOW);
    }

    Now modify this to enable each exception individually based on environment variables:

    static void trapfpe(void)
    {
      /* Enable or disable each trap according to the drhook_trapfpe_* settings.
         At startup all exceptions are masked. */
      int enable = 0;
      int disable = 0;
      if (drhook_trapfpe_invalid) enable |= FE_INVALID; else disable |= FE_INVALID;
      if (drhook_trapfpe_divbyzero) enable |= FE_DIVBYZERO; else disable |= FE_DIVBYZERO;
      if (drhook_trapfpe_overflow) enable |= FE_OVERFLOW; else disable |= FE_OVERFLOW;
      if (enable) (void) feenableexcept(enable);    /* Turn ON these */
      if (disable) (void) fedisableexcept(disable); /* Turn OFF these */
    }

    Now define the new variables near the top of drhook.c (around line 80):

    static int drhook_trapfpe_invalid = 1;
    static int drhook_trapfpe_divbyzero = 1;
    static int drhook_trapfpe_overflow = 1;

    Lastly, in the process_options() function, add the new environment variables:

      env = getenv("DR_HOOK_TRAPFPE_INVALID");
      if (env) {
        int value = atoi(env);
        drhook_trapfpe_invalid = (value != 0) ? 1 : 0; /* currently accept just 0 or 1 */
      }
      env = getenv("DR_HOOK_TRAPFPE_DIVBYZERO");
      if (env) {
        int value = atoi(env);
        drhook_trapfpe_divbyzero = (value != 0) ? 1 : 0; /* currently accept just 0 or 1 */
      }
      env = getenv("DR_HOOK_TRAPFPE_OVERFLOW");
      if (env) {
        int value = atoi(env);
        drhook_trapfpe_overflow = (value != 0) ? 1 : 0; /* currently accept just 0 or 1 */
      }

    I've not tested these changes but I hope you can follow ok.  If there are other traps you need to implement just follow the pattern of changes above.

    With these, you should then be able to compile once, set each DR_HOOK_TRAPFPE_* in turn and run.
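
If more traps are added, the repeated getenv() blocks can be collapsed into a table. The following is a minimal standalone sketch of that idea, not DrHook code: the struct, the function name fpe_mask_from_env and the DR_HOOK_TRAPFPE_UNDERFLOW variable are hypothetical additions, and the defaults (underflow off, the other three on) are an assumption that mirrors the settings above.

```c
#include <fenv.h>
#include <stdlib.h>

/* One row per trappable exception. */
struct fpe_option {
    const char *env_name;    /* environment variable to read */
    int flag;                /* matching <fenv.h> exception flag */
    int enabled_by_default;  /* used when the variable is unset */
};

static const struct fpe_option fpe_options[] = {
    { "DR_HOOK_TRAPFPE_INVALID",   FE_INVALID,   1 },
    { "DR_HOOK_TRAPFPE_DIVBYZERO", FE_DIVBYZERO, 1 },
    { "DR_HOOK_TRAPFPE_OVERFLOW",  FE_OVERFLOW,  1 },
    { "DR_HOOK_TRAPFPE_UNDERFLOW", FE_UNDERFLOW, 0 }, /* hypothetical addition */
};

/* Return the OR of all exception flags whose trap should be enabled.
   As in the code above, any non-zero value switches a trap on. */
int fpe_mask_from_env(void)
{
    int mask = 0;
    size_t i;
    for (i = 0; i < sizeof fpe_options / sizeof fpe_options[0]; i++) {
        const char *env = getenv(fpe_options[i].env_name);
        int on = env ? (atoi(env) != 0) : fpe_options[i].enabled_by_default;
        if (on)
            mask |= fpe_options[i].flag;
    }
    return mask;
}
```

In trapfpe(), the returned mask would be passed to feenableexcept(), and the remaining table flags to fedisableexcept(). Both calls are glibc extensions, so _GNU_SOURCE must be defined before <fenv.h> is included.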

    Let me know how you get on. I can add this to future versions of OpenIFS as these changes are not available in IFS yet.

    Cheers,  Glenn

  5. Sorry Glenn, I never replied. I did try your modifications to DrHook and they worked quite well. However, I couldn't find a way to run the program with all FPE traps enabled, and have it report which one crashed the program - perhaps this isn't possible. For now I've found a way forward, but I might come back to this in the future.

    Thanks,

    Sam

  6. Unknown User (nagc)

    Hi Sam,

    Thanks for the update and letting me know the DrHook changes worked. If you figure this out, I'd be interested.

    Glenn