Hi all

I've encountered an odd crash with OpenIFS cy40r1. If I compile with "fpe0" (Intel compilers, version 19.0.3) and run, I get the following:

489 forrtl: error (65): floating invalid
489 Image PC Routine Line Source
489 master.exe 000000000194B5E4 Unknown Unknown Unknown
489 libpthread-2.17.s 00002AAAB3A515E0 Unknown Unknown Unknown
489 master.exe 0000000000E17476 posddh_ 202 posddh.F90
489 master.exe 0000000000824979 stepo_ 328 stepo.F90
489 master.exe 000000000061CC79 cnt4_ 1014 cnt4.F90
489 master.exe 00000000005B1B53 cnt3_ 256 cnt3.F90
489 master.exe 00000000005B1184 cnt2_ 73 cnt2.F90
489 master.exe 000000000055362B cnt1_ 86 cnt1.F90
489 master.exe 000000000041A204 cnt0_ 134 cnt0.F90
489 master.exe 000000000041368B MAIN__ 66 master.F90
489 master.exe 00000000004135E2 Unknown Unknown Unknown
489 libc-2.17.so 00002AAAB4074C05 __libc_start_main Unknown Unknown
489 master.exe 00000000004134E9 Unknown Unknown Unknown

The crash used to be at line 198 in posddh, where it takes the NINDAT and finds the month of the run, i.e.

IDATEF(2) = (NINDAT - 10000*IDATEF(1) ) / 100

When I added a print statement at the line just above it, the crash did not happen there anymore, but a few lines down when OpenIFS uses NINDAT to find the day of the run. 

My print statement shows that all values are totally fine. NINDAT, IDATEF(1) and IDATEF(2) are all what they should be, so I have no idea what the problem really is.  
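For reference, the date split described above can be sketched in a few lines. This is only an illustration in Python, not the posddh.F90 code: the month expression follows the line quoted above, while the year and day expressions and the function name `split_nindat` are assumptions. In this sketch every operation is integer arithmetic, which is why a "floating invalid" reported on these lines is surprising.

```python
def split_nindat(nindat):
    """Split a YYYYMMDD integer into (year, month, day) with integer division."""
    year = nindat // 10000                      # assumed form of IDATEF(1)
    month = (nindat - 10000 * year) // 100      # mirrors the quoted IDATEF(2) line
    day = nindat - 10000 * year - 100 * month   # assumed form of the day lookup
    return year, month, day


print(split_nindat(19500101))  # (1950, 1, 1)
```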

The backtrace is incomplete here. The problem seems to come from somewhere in libpthread and also master.exe, but I don't know where. 

If I run without "fpe0" the model runs fine, but since I'm getting odd numerical instabilities somewhere I'd really like to be able to use "fpe0" to trap any crazy numbers. 

I'm also using 

export DR_HOOK_IGNORE_SIGNALS='-1'


Has anyone experienced similar problems before, and does anyone know what the cause could be?


PS. Here are all the compiler settings: 

export OIFS_FFLAGS="-qopenmp -m64 -O3 -align array32byte -fp-model precise -convert big_endian -g -traceback -xCORE-AVX512 -qopt-zmm-usage=high -fpe0"
export OIFS_FFIXED="-r8"
export OIFS_FCDEFS="BLAS LITTLE LINUX INTEGER_IS_INT"
export OIFS_EXTRA_LIB=""
export OIFS_LFLAGS="-qopenmp"

# MKL
MKLLIBS="-lmkl_intel_lp64 -lmkl_core -lmkl_lapack95_lp64 -lmkl_blas95_lp64 -lmkl_sequential"
export OIFS_LAPACK_LIB="-L${MKLROOT} ${MKLLIBS}"

export OIFS_CC=mpiicc
export OIFS_FC=mpiifort
export OIFS_CFLAGS="-g -traceback -fp-model precise -O3 -qopenmp -xCORE-AVX512 -qopt-zmm-usage=high -fpe0"
export OIFS_CCDEFS="LINUX LITTLE INTEGER_IS_INT _ABI64 BLAS"

8 Comments

  1. Unknown User (de3j)

    Hi again. 
    Just to update: The error does not appear in initial runs, i.e. when not starting from restart files. So the problem seems to be related to restarting OpenIFS. 
    I've tried switching Intel compiler versions and also switching to "-O1" instead of "-O3", but that does not help.

  2. Unknown User (de3j)

    I've just managed to get around the problem, but in an ugly way. 
    The restart files in year 190 were named "srf0693960000.0001" etc, but I renamed them to "srf0036520000.0001". I also changed the "rcf" file to have CSTEP="0087648" and CTIME   = "0036520000", and used "grib_set" to change the date in the INIT and INIUA files. 
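The labels used in this workaround can be reproduced with a small helper. This is a sketch only: the layout (six digits of days followed by "0000" for CTIME, a seven-digit step count for CSTEP, 24 steps per day) is inferred from the examples quoted above and not taken from the OpenIFS source, and `restart_labels` is a hypothetical name.

```python
def restart_labels(days, steps_per_day=24):
    """Return (CTIME, CSTEP) strings for a run that is `days` days old.

    The field widths are inferred from the examples in this thread.
    """
    ctime = f"{days:06d}0000"               # DDDDDD followed by 0000
    cstep = f"{days * steps_per_day:07d}"   # zero-padded step count
    return ctime, cstep


ctime, cstep = restart_labels(3652)
print(f"srf{ctime}.0001", cstep)  # srf0036520000.0001 0087648
```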

    Before, OpenIFS was reading dataDate=19500101 and then setting the date as 19500101 + 69396 days (=21400101), but now it instead reads dataDate=21300101 and sets the date to 21300101 + 3652 days (=21400101). 
    And since I renamed the restart files, OpenIFS still reads the restarts from year 190, but now "thinks" it's at year 10. 
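The two date calculations above can be checked with the standard library. This is a minimal sketch assuming the Gregorian calendar, in which 2100 is not a leap year; `advance` is a hypothetical helper name.

```python
from datetime import date, timedelta


def advance(data_date, days):
    """Add `days` to a YYYYMMDD integer and return the result as YYYYMMDD."""
    year, rest = divmod(data_date, 10000)
    month, day = divmod(rest, 100)
    end = date(year, month, day) + timedelta(days=days)
    return end.year * 10000 + end.month * 100 + end.day


print(advance(19500101, 69396))  # 21400101
print(advance(21300101, 3652))   # 21400101
```

Both start dates land on the same model date, so the renamed restart files describe the same point in the simulation.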

    So it seems that my problem is related to the time step? Is there some limitation somewhere in OpenIFS that it can't handle more than 190 year long simulations? My time step number (69396 days * 24 hours = 1665504) is somewhere between 2^20 and 2^21. Could it be some integer overflow of NSTEP or some other internal counter? 
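A quick way to test the overflow idea is to compare the counters against the signed 32-bit range. This sketch assumes a 3600 s time step (implied by the "24 hours" factor above) and assumes that the relevant counters are stored as default 32-bit integers, which has not been checked against the model source.

```python
INT32_MAX = 2**31 - 1

days = 69396
steps = days * 24        # number of time steps, assuming hourly steps
seconds = steps * 3600   # elapsed model time in seconds


def fits_int32(value):
    """True if `value` is representable as a signed 32-bit integer."""
    return -2**31 <= value <= INT32_MAX


print(steps, 2**20 < steps < 2**21, fits_int32(steps))  # 1665504 True True
print(seconds, fits_int32(seconds))                     # 5995814400 False
print(INT32_MAX // 86400)                               # 24855
```

Under these assumptions the step count is nowhere near the 32-bit limit, so a plain overflow of a step counter looks unlikely. Elapsed seconds would not fit in 32 bits, but that limit is already crossed after about 24855 days (roughly 68 years), so it does not by itself explain a failure that first appears at year 190.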

    Curious to hear if anyone else has had this issue before. I'm also wondering if I'll run into the same problem again 190 simulated years down the road... 

    Cheers
    Joakim 

  3. Unknown User (nagc)

    Hi Joakim,

    Without further investigation, this looks like one of the problems IFS has with long integrations. Have you tried talking with the EC-Earth people?

    I am planning on adding long run fixes from PRIMAVERA & EC-Earth to the next release of 43r3, which should overcome these problems.

    EC-Earth have also noted a problem with model restarts where it appears the model does not correctly read a restart file. But I don't think that's the same problem as you have here, as they reported differing results from two identical model runs.

    Cheers,   Glenn

  4. Unknown User (de3j)

    Hi Glenn

    Sounds like a really good idea! I feel we've all done various "long run fixes" but all slightly differently... 

    As far as I know, I'm the only one who has experienced this problem. 
    It might be specific to 40r1 or the compiler (Intel 18.0.3) or the hardware (Intel Skylake). 
    I spoke to Klaus W from SMHI about their "long run fixes" and it seems that my fixes are pretty close to the EC-Earth fixes, but not identical, so maybe that could explain the problem as well. 

    I need to switch compiler versions soon anyways (for NEMO reasons), so maybe that could solve it. Or it may get solved with 43r3. 

    I'll keep you posted on how it goes. 

    Cheers

    Joakim

  5. Unknown User (20213020009@fudan.edu.cn)

    I have run into this problem too, although I am using 43r3:

    forrtl: error (65): floating invalid
    Image PC Routine Line Source
    master.exe 0000000001D9976E Unknown Unknown Unknown
    master.exe 00000000008B77AC Unknown Unknown Unknown
    libpthread-2.17.s 00002B3A8F4785D0 Unknown Unknown Unknown
    master.exe 0000000000AD07A6 tpm_pol_mp_ini_po 68 tpm_pol.F90
    master.exe 0000000000ABD4B1 suleg_mod_mp_sule 223 suleg_mod.F90
    master.exe 00000000008F682B setup_trans_ 405 setup_trans.F90
    master.exe 00000000004D2A46 sutrans_ 87 sutrans.F90
    master.exe 00000000004C703E sugeometry_ 142 sugeometry.F90
    master.exe 00000000004C5434 su0yoma_ 248 su0yoma.F90
    master.exe 00000000004189EC cnt0_ 131 cnt0.F90
    master.exe 0000000000412E57 MAIN__ 96 master.F90
    master.exe 0000000000412DDE Unknown Unknown Unknown
    libc-2.17.so 00002B3A8FA8A3D5 __libc_start_main Unknown Unknown
    master.exe 0000000000412CE9 Unknown Unknown Unknown
    ......


    In my NODE.001_01:
    ......
    ---- Set up geometry, (namelist variables) for global model -----
    ----------------------------------

    --- Printings in SUGEM_NAML:
    NSTTYP = 1 RMUCEN = 0.1000000E+01 RLOCEN = 0.0000000E+00 RSTRET = 0.1000000E+01
    NHTYP = 2
    RNLGINC = 0.1000000E+01
    ------ Set up transforms for this resolution --
    ----------------------------------
    LUSEFLT= F
    LUSERPNM= T LKEEPRPNM= T
    LFFTW= F


    And in my ifs.stat (only one line):
    18:38:50 000000000 CNT3      -999     3.376    3.376    5.391      0:00      0:00 0.00000000000000E+00     160GB       0MB

  6. Unknown User (gdcarver113@outlook.com)

    Hi Anna,

    The model has fallen over right at the start of the run. Was this a restart from a previous run? (I suspect not but just to check).

    I don't have the code in front of me but, from memory, the routine it has fallen over in, tpm_pol.F90, is responsible for setting up the Legendre transform coefficients. It could be a problem with the initial file (probably the ICMSH* file) containing bad data, or the wrong data for the resolution you are trying to run.

    Cheers.

    1. Unknown User (20213020009@fudan.edu.cn)

      It's not a restart from a previous run (sad). I am wondering if the problem is the resolution. 

      In my namelistfc file: 

      &NAMFPG
      NFPLEV=137,
      NFPMAX=799,

      &NAMFPD
      NLAT=1600,
      NLON=3200,


      which means OIFS_RES="799" right?

    2. Unknown User (20213020009@fudan.edu.cn)

      I think I know my problem now: I used grid type -l (linear reduced Gaussian grid, the default), but the initial file is on an octahedral Gaussian grid :(
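The mismatch can be seen from the usual relation between spectral truncation and Gaussian grid number. This is a sketch under the standard ECMWF conventions (linear grid: N = (T+1)/2, cubic octahedral grid: N = T+1, with 2N latitude rows in both cases); `gaussian_number` is a hypothetical helper, not an OpenIFS routine.

```python
def gaussian_number(truncation, grid):
    """Gaussian grid number N for a spectral truncation T.

    'linear' -> N = (T + 1) / 2, 'cubic' -> N = T + 1.
    """
    if grid == "linear":
        return (truncation + 1) // 2
    if grid == "cubic":
        return truncation + 1
    raise ValueError(f"unknown grid type: {grid}")


for grid in ("linear", "cubic"):
    n = gaussian_number(799, grid)
    print(grid, n, 2 * n)  # grid type, N, number of latitude rows
# linear 400 800
# cubic 800 1600
```

At truncation 799 the linear grid has 800 latitude rows while the cubic octahedral grid has 1600, which matches the NLAT=1600 quoted in the namelist above.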