Good afternoon all,

As you may see in the title of the post, I've encountered a really strange crash in OpenIFS. The configuration is T159L91 + ORCA05 (coupled OpenIFS cy40 + NEMO 3.6 via OASIS3-MCT3, WAM turned off). 
The coupled model runs fine for the first 170 years and has a relatively stable climate, but in April of year 170 it suddenly shows really strange fluxes where there is precipitation (plots from the NEMO output.abort.nc attached). Thick ice (up to 40 m thick!) with 100% concentration forms even in the tropics, which results in a massive upward water flux from NEMO.
A few hot spots of extremely warm SST (> 300°C) appear, and in temperature, salinity and sea-surface height there are two big white diamonds over the globe (the plot with all the blue stuff) where I think NEMO is unable to find a solution for the free surface because of the strange fluxes.
The problem seems to develop within a single time step, because the data at the last model output step still looks OK. I've only got the data that NEMO dumps when it stops.

I'm wondering if anyone has ever seen a problem like this? Is it an instability that sometimes happens, and is there a way to tweak around it?

Are there any diagnostics in OpenIFS that can be activated to continuously check that model variables remain reasonable?

Could it be a counter in OpenIFS that overflows after too long a "forecast time", or perhaps some array that runs out of space?
The technical reason the model stops is that NEMO encounters current speeds over 20 m/s where the > 300°C SSTs occur, but the problem definitely seems to originate in the atmosphere.

The model starts at 1950-01-01, so year 170 is actually year 2020 in OpenIFS. However, I've tweaked the code to always read 1852 conditions in updrgas to create a "pre-industrial" state.

Any help on this super-weird issue would be much appreciated! 

Best regards
Joakim 


10 Comments

  1. Unknown User (nagc)

    Hi Joakim,

    Not sure I can be of direct help here, but I spoke with one of our scientists. She suggested you look at the model output before the crash. She thought it was likely that the ice was forming in the tropics even before the crash; surely NEMO can't form that much ice in a single step?

    As for counters overflowing, maybe, but the EC-Earth folk will be able to tell you. Perhaps more likely is a problem in the coupling?

    Cheers,  Glenn

  2. Unknown User (jstreffi)

    The longest I have run OpenIFS-FESOM2 for is 165 years. I may have missed this issue by just a bit.

    Info for Glenn: I have already trawled through the EC-Earth issues archive and have not found anything like this there.

    1. Unknown User (nagc)

      I vaguely remember hearing about a bug in IFS that only occurs after a very long run, but I forget the details.

      In this case though, our suspicion is that this has been going wrong for more than 1 step, even though it might still be forced by the atmosphere.

      Are those diamond patterns related to the MPI decomposition?


  3. Unknown User (nagc)

    What happens if you do a restart?

  4. Unknown User (nagc)

    > Are there any diagnostics in OpenIFS to activate that continuously check model variables so that they are reasonable?

    The 'ifs.stat' file will give a first indication that the OpenIFS atmosphere is going wrong. The last number on each timestep line gives the model norm of global divergence (if I remember right). You can also check the prints in the NODE file for the spectral norms.
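
    If you want something more automatic, a rough sketch along these lines could watch that number during a run (it only assumes that the norm of interest is the last number on each timestep line; the exact layout of ifs.stat may differ, so treat it as a starting point):

        import math
        import sys

        def last_number(line):
            """Return the last whitespace-separated token that parses as a float, or None."""
            for token in reversed(line.split()):
                try:
                    return float(token)
                except ValueError:
                    continue
            return None

        def check_ifs_stat(path, jump_factor=10.0):
            """Warn if the norm is non-finite or jumps by more than jump_factor between steps."""
            previous = None
            with open(path) as f:
                for lineno, line in enumerate(f, start=1):
                    norm = last_number(line)
                    if norm is None:
                        continue
                    if not math.isfinite(norm):
                        print(f"{path}:{lineno}: non-finite norm {norm}")
                    elif previous not in (None, 0.0) and abs(norm) > jump_factor * abs(previous):
                        print(f"{path}:{lineno}: norm jumped from {previous:g} to {norm:g}")
                    previous = norm

        if __name__ == "__main__":
            check_ifs_stat(sys.argv[1] if len(sys.argv) > 1 else "ifs.stat")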


  5. Unknown User (joakimkjellsson@gmail.com)

    Hi again guys. 

    Thanks for the hints, and thanks Jan for checking with EC-Earth. Do you know what the longest run is that EC-Earth has done with OpenIFS + NEMO? I thought the EC-Earth workhorse for CMIP6 is IFS cy36 and that OpenIFS has only been used for a few tests. Perhaps they haven't run OpenIFS for long enough to run into these problems?
    I can't see anything obviously wrong with the spectral norms in the NODE file if I compare the last and second-to-last steps. Precipitation, liquid water, kinetic energy, divergence and so on do not seem to change much between the two steps.

    Interestingly, I restarted the model compiled with "-O1" rather than "-O3", and the problem vanished! I'm not sure whether it was the restart itself or the less aggressive optimisation by the Intel compiler, and I have no idea whether this can be used as a general fix for these problems. My current compile flags are

    OIFS_FFLAGS="-qopenmp -m64 -O1 -align array32byte -fp-model precise -convert big_endian -g -traceback -xCORE-AVX512 -qopt-zmm-usage=high -fpe0". 

    and I'm using Intel Fortran version 18.0.3. 

    I've seen the "white diamond" problem before, and it comes from the free-surface solver in NEMO when it does not converge to a solution. It can happen if the sea-level drops below the thickness of the upper-most layer, perhaps here due to the sudden removal of 20 m freshwater when forming ice. 

    I talked with a few ECHAM folks, and they said similar (although not identical) errors can randomly happen in ECHAM or MPI-ESM, at which point the common solution is to reduce the horizontal diffusion by 0.1% for one month, then turn it back up and continue. So perhaps these things just happen occasionally and we'll just have to "poke" the model a bit and continue?

    I'll just keep running for now and update if anything else goes horribly wrong.
    I guess I've entered uncharted waters with OpenIFS when running it for so long...

    Have a good weekend!

    Joakim 


    1. Unknown User (jstreffi)

      EC-Earth has not run OpenIFS-NEMO at all. The NEMO in the branch where they have OpenIFS did not even compile correctly when I put the trace-gas reading and orbital-parameter changes into their version of OpenIFS. I have heard of the horizontal diffusion issue with ECHAM. As far as I know, they have put a switch into the namelist to solve this problem, and some people at AWI have monitoring scripts that detect a crash and turn the switch on automatically. Might be an option (wink)
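
      Very roughly, the idea behind those scripts is something like the sketch below (all paths, the namelist file and the switch name here are made up; the real switch would be whatever the model's namelist provides):

          import os
          import re
          import subprocess

          RUN_DIR = "/path/to/run"                                # hypothetical run directory
          NAMELIST = os.path.join(RUN_DIR, "namelist.atm")        # hypothetical namelist file
          ABORT_FILE = os.path.join(RUN_DIR, "output.abort.nc")   # NEMO writes this when it aborts
          SUBMIT_CMD = ["sbatch", "run_coupled.sh"]               # hypothetical job script

          def set_switch(path, name, value):
              """Rewrite 'name = value' in the namelist, assuming the switch already
              appears exactly once on its own line without a trailing comma."""
              with open(path) as f:
                  text = f.read()
              text = re.sub(rf"{name}\s*=\s*\S+", f"{name} = {value}", text, flags=re.IGNORECASE)
              with open(path, "w") as f:
                  f.write(text)

          # If the model left an abort dump behind, flip the (made-up) switch and resubmit.
          if os.path.exists(ABORT_FILE):
              set_switch(NAMELIST, "lreduced_diffusion", ".true.")
              subprocess.run(SUBMIT_CMD, check=True)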

  6. Unknown User (joakimkjellsson@gmail.com)

    Hi Glenn and Jan

    I just wanted to update and say that the same crash happened again (sad) I restarted from year 170 and it did fine until year 198 this time. The fields in output.abort.nc from NEMO look almost exactly the same as before, with thick sea-ice forming in the tropics etc. 

    It could be that OpenIFS+NEMO is somehow in an unstable state where a small perturbation can set off this crash. 
    As far as I know, my simulations should be completely reproducible, since I haven't set anything in NAMSTOPH and LSTOPH is false by default?
    I'm therefore leaning more towards this being a random error caused by the compiler or hardware. Perhaps some value that is supposed to be communicated between nodes gets corrupted or lost?

    And since no one else has had this problem, it could be specific to my HPC and compiler set. 
    It's also interesting that it happened at years 170 and 198. Perhaps it only happens in long integrations? 

    I'm going to try with another version of the Intel compilers and see if it still happens.

    Cheers
    Joakim 

    1. Unknown User (jstreffi)

      This issue would suggest that your runs might not be bit reproducible: "Restart file generation can suffer occasional bit level changes affecting reproducibility"

      Maybe you can have a look at one timestep after 150 years and compare the state. That would at least explain why you get it at different timesteps.
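
      Something like this sketch could do the comparison, e.g. for two restarts (or the same output step from two runs) that should be bit-identical. The file names and the use of netCDF4/NetCDF files are just assumptions:

          import numpy as np
          from netCDF4 import Dataset  # assumes the restarts/outputs are NetCDF files

          def compare_states(file_a, file_b):
              """Print every common variable that differs between two files that should be identical."""
              with Dataset(file_a) as a, Dataset(file_b) as b:
                  for name in sorted(set(a.variables) & set(b.variables)):
                      va = np.asarray(a.variables[name][:])
                      vb = np.asarray(b.variables[name][:])
                      if va.shape != vb.shape:
                          print(f"{name}: shapes differ {va.shape} vs {vb.shape}")
                      elif not np.array_equal(va, vb):
                          if np.issubdtype(va.dtype, np.number):
                              diff = np.max(np.abs(va.astype(float) - vb.astype(float)))
                              print(f"{name}: max abs diff {diff:.3e}")
                          else:
                              print(f"{name}: values differ")

          # Hypothetical file names: the same state from two runs that should be bit-identical.
          compare_states("state_run1_y0150.nc", "state_run2_y0150.nc")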

  7. Unknown User (de3j)

    In case someone else suffers from this problem, I can happily say I found the bug and solved it. 

    In my configuration, OpenIFS combines evaporation and precipitation into one field (E-P) and sends it to NEMO. In order to close the water budget, I also apply the "GLBPOS" conservation option in OASIS-MCT.

    The problem is that the global sum of E-P may sometimes be very close to 0, in which case the conservation part of OASIS encounters a division by a very small number, leading to massive values in E-P.

    The solution is to couple E and P as separate fields, in which case the global sum is never close to 0 and all is well when using GLBPOS.
    I have not noticed any loss in model throughput, so any additional cost of sending/remapping the extra fields appears to be small.
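
    To illustrate the mechanism with a toy example (this is only a sketch of a multiplicative global-conservation step, not the actual OASIS code, and the numbers are arbitrary):

        import numpy as np

        def conserve_global(src_sum, tgt_field):
            """Toy multiplicative conservation: rescale the target field so that its
            global sum matches the source-side global sum."""
            return tgt_field * (src_sum / tgt_field.sum())

        # Coupled E-P: evaporation and precipitation nearly cancel globally, so both
        # global sums are tiny, and the ratio of two tiny numbers can be huge.
        e_minus_p = np.array([0.02, -0.01, -0.01, 0.03, -0.03 + 1e-9])  # global sum ~ 1e-9
        print(conserve_global(1e-6, e_minus_p))   # every value blown up by a factor ~1000

        # Separate E (or P): the field is sign-definite, the global sum is of the same
        # order as the field itself, and the rescaling factor stays close to 1.
        evap = np.array([0.02, 0.01, 0.015, 0.03, 0.025])               # global sum = 0.1
        print(conserve_global(0.1001, evap))      # barely changed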

    Cheers
    Joakim