Hi all

I'm trying to run a pre-industrial control run with a coupled model, OpenIFS_CY40R1 + NEMO_3.6 + OASIS3_MCT2.8. 

OpenIFS runs at T159 L91 resolution with a 1 hr time step. 

In IFS time, the model starts at 1950-01-01 and runs fine until 1994 (year 44). In year 44, the error message is

07:44:42 STEP 393220 H= 393220:00 +CPU= 0.400
07:44:43 STEP 393221 H= 393221:00 +CPU= 0.380
GRIB_API ERROR : unable to find units to set stepRange=1415599200
GRIB_API ERROR : unable to find units to set stepRange=1415599200
GRIB_API ERROR : unable to find units to set stepRange=1415599200
GRIB_API ERROR : unable to find units to set stepRange=1415599200
GRIB_API ERROR : unable to find units to set stepRange=1415599200
GRIB_SET_INT 7 endStep 1415599200 FAILED -25
GRIB_API ERROR : unable to find units to set stepRange=1415599200
GRIB_SET_INT 7 endStep 1415599200 FAILED -25
GRIB_API ERROR : unable to find units to set stepRange=1415599200
GRIB_SET_INT 7 endStep 1415599200 FAILED -25
GRIB_API ERROR : unable to find units to set stepRange=1415599200
GRIB_SET_INT 7 endStep 1415599200 FAILED -25
GRIB_API ERROR : unable to find units to set stepRange=1415599200

which I guess means that the value stepRange = 1415599200 cannot be written to the GRIB output files. The value is about two-thirds of 2^31, so I'm guessing it may be a matter of the integer type being too small? 
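
A quick arithmetic check of the failing value, as a minimal Python sketch (not part of the model; the variable names are illustrative). It assumes the 1 hr time step stated above, so that step numbers equal hours since the start:

```python
# Value reported by GRIB_API in the log above.
step_range = 1415599200

INT32_MAX = 2**31 - 1  # largest signed 32-bit integer

# With a 1 hr time step, seconds / 3600 gives the model step.
hours, remainder = divmod(step_range, 3600)

print(hours, remainder)              # 393222 0
print(step_range <= INT32_MAX)       # True
print(round(step_range / 2**31, 3))  # 0.659
```

So the value is 393222 hours expressed in seconds, i.e. the step following the last one logged, and it still fits in a signed 32-bit integer. This suggests the limit being hit may lie in how the step is encoded in the GRIB message rather than in integer overflow alone.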

Has anyone else encountered this problem? Is there a way to solve it by increasing the integer precision in the output files, or perhaps by removing the "stepRange" variable and using date and time instead? 

Any help is much appreciated!


Cheers

Joakim 


13 Comments

  1. Unknown User (nagc)

    Hi Joakim,

    This is a known problem when running OpenIFS over very long periods. The model was not designed for climate studies, of course.

    EC-Earth have solved this problem; I think you are in contact with them? There are a number of issues that arise when running for long periods. Some of these have been fixed with code taken from EC-Earth, but I think your problem comes from grib_code_message.F90, which EC-Earth have replaced to solve it.

    Let me know if you want me to get in touch with them.

    Cheers,  Glenn

  2. Unknown User (de3j)

    Hi Glenn

    Thanks for the reply. Yes, if you could get in touch and get a copy of the grib_code_message.F90 that works, that would be great. I've sent some emails but, alas, no response. Do you think changing this subroutine also solves the issue of file naming, i.e. that after 115 years, the files will exceed the label ICMGGgu5a+999999, and also solve the issue with starting before 1900? 
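
    For reference, the 115-year figure follows from the six-digit suffix. A small Python sketch, assuming the suffix is the step count (equal to hours for a 1 hr time step) and 365.25-day years; `output_name` is a hypothetical helper, not a model routine:

```python
MAX_SUFFIX = 999_999          # largest value that fits in six digits
HOURS_PER_YEAR = 24 * 365.25  # assumed average year length

years = MAX_SUFFIX / HOURS_PER_YEAR
print(f"{years:.1f}")         # 114.1

def output_name(prefix: str, expid: str, step: int) -> str:
    """Build a name like ICMGGgu5a+999999 with a zero-padded six-digit step."""
    return f"{prefix}{expid}+{step:06d}"

print(output_name("ICMGG", "gu5a", 999_999))          # ICMGGgu5a+999999
print(len(output_name("ICMGG", "gu5a", 1_000_000)))   # 17, one character longer
```

    Python simply widens the field here. A fixed-width integer format in Fortran would instead fill the field with asterisks, so how the model itself behaves at step 1000000 depends on how the name is built.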

    Best wishes

    Joakim 

  3. Unknown User (jstreffi)

    Sorry for the late answer, I only saw this after Joakim sent me an email. This problem and the related one (Name pattern for output files, ICMGG and ICMSH) were solved by the EC-Earth community. Uwe Fladrich identified the problems and the associated files, as listed below. He also solved them with a set of changes that I uploaded here:

     


    First, a categorisation of the problems:

    (1) Using too short integer type
        (1a) Converting RSTATI (from YOMRIP) to standard integer with NINT
        (1b) Intermediate computations of type: time_step_number*time_step_in_sec
        (1c) other variables
    
    (2) Too short output format
        (2a) Logging
        (2b) Time stamps used in file names
        (2c) other
    
    (3) Write time_in_seconds into grib message
    

    Next, a list of source files with their associated category of problems:

    climate
        updclie.F90             (1b)
        updrgas.F90             (1b)
    control
        cnt4.F90                (2b), (1b)
        reresf.F90              (2a), (2c)
    dia
        grib_code_message.F90   (3)
        ppeddh.F90              (1a)
        ppeddhec.F90            (1a), (2a?)
        wroutgpgb.F90           (1b), (2a)
        wroutspgb.F90           (1b), (2a)
    fullpos
        su4fpos.F90             (1c)
    include
        ppreq.intfb.h           (1c)
        ppreset.intfb.h         (1c)
        su4fpos.intfb.h         (1c)
    module
        yomres.F90              (2c)
    phys_ec
        callpar.F90             (1a)
        radpar.F90              (1a)
    pp_obs
        ppreq.F90               (1c), (2a)
        ppreset.F90             (1c)
    utility
        updtim.F90              (1a), (1b)
        wrresf.F90              (1b), (2b)
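
    Categories (1a) and (1b) can be illustrated with a short Python sketch. Python integers do not overflow, so the 32-bit behaviour is emulated with `ctypes`; `as_int32` is a hypothetical helper, the 1 hr time step is the one used in this thread, and the wrap-around shown is only the typical outcome of overflow in compiled code:

```python
import ctypes

def as_int32(value: int) -> int:
    """Wrap an integer the way a signed 32-bit variable would store it."""
    return ctypes.c_int32(value).value

TSTEP = 3600                      # time step in seconds (1 hr)
SECONDS_PER_YEAR = 365 * 86400    # assumed 365-day year, for illustration

# First step whose elapsed seconds no longer fit in 32 bits.
first_bad_step = (2**31) // TSTEP + 1
elapsed = first_bad_step * TSTEP

print(first_bad_step)                        # 596524
print(round(elapsed / SECONDS_PER_YEAR, 1))  # 68.1
print(as_int32(elapsed))                     # negative: wrapped around
print(elapsed)                               # correct with a wider integer
```

    With a 1 hr step, the product time_step_number*time_step_in_sec therefore stops fitting in a standard integer a little after 68 years of simulation.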
    


  4. Unknown User (de3j)

    Hi Jan

    Many thanks for this file! I've gone over it and made the changes. Unfortunately, they did not solve my problem.

    I've made some more changes to the grib_code_message.F90 file. I tried to change lines 366-338 to

    CALL IGRIB_SET_VALUE(KGRIB_HANDLE,'stepType','instant')
    CALL IGRIB_SET_VALUE(KGRIB_HANDLE,'stepUnits','h')
    CALL IGRIB_SET_VALUE(KGRIB_HANDLE,'endStep',IHOUR)

    with IHOUR=ISEC/3600. But the error still happens on the same time step. 

    09:44:53 STEP 393220 H= 393220:00 +CPU= 0.400
    09:44:54 STEP 393221 H= 393221:00 +CPU= 0.380
    GRIB_API ERROR : unable to find units to set stepRange=393222
    GRIB_API ERROR : unable to find units to set stepRange=393222
    GRIB_API ERROR : unable to find units to set stepRange=393222
    GRIB_API ERROR : unable to find units to set stepRange=393222
    GRIB_SET_INT 7 endStep 393222 FAILED -25
    GRIB_SET_INT 7 endStep 393222 FAILED -25

    I think maybe the problem is with the "units" rather than the actual value. 

    I've also set 

    CALL IGRIB_SET_VALUE(KGRIB_HANDLE,'timeRangeIndicator',1)

    rather than 

    CALL IGRIB_SET_VALUE(KGRIB_HANDLE,'timeRangeIndicator',0)

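    which I think means the time range is written in hours instead of minutes, but that did not help either. (As far as I can tell, GRIB1 holds the time unit in a separate header field, so this key may not control the unit at all.) 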

    Even stranger, the traceback tells me that the error occurs on a line that IFS should never reach! The error is on line 309 of my grib_code_message.F90, but I've added a logical "HOURSTEP" which is set to TRUE, so IFS should never get to line 309. 

    Glenn, do you think you could have a quick look at the file (attached below) and see if you can spot the problem? 


    Jan, could you check your version of "grib_code_message.F90" and see if that also writes "endStep" as seconds, and if timeRangeIndicator is 0? 

    If your version of OpenIFS works, there must be some difference in either grib_code_message.F90 or grib_api_interface.F90. 


    Cheers

    Joakim 

  5. Unknown User (de3j)

    Hi again guys

    I'm starting to think this is more of a GRIB_API issue than an OpenIFS issue. If I take an output file from OpenIFS (ICMGGKCM2+008760, attached), I can reproduce the error using grib_set:

    blogin3:/gfs1/work/shkjocke $ grib_set -s stepRange=393216 ICMGGKCM2+008760 ICMGGKCM2+008760_2
    blogin3:/gfs1/work/shkjocke $ grib_set -s stepRange=393222 ICMGGKCM2+008760 ICMGGKCM2+008760_2
    GRIB_API ERROR : unable to find units to set stepRange=393222
    GRIB_API ERROR : grib_set_values0 stepRange (1) failed: Unable to set step

    Now, if I use "grib_copy" to extract only "q" into one file and then try to set stepRange, it works fine. But if I do it for "skt" (skin temperature), it does not work. So I'm thinking it has to do with the GRIB edition ("q" is GRIB2 and "skt" is GRIB1). 

    If I do 

    grib_copy -w shortName=skt ICMGGKCM2+008760 skt.grb
    grib_set -s edition=2 skt.grb skt.grb2
    grib_set -s stepRange=393222 skt.grb2 skt2.grb2

    it works! CDO then shows the date and time as "1994-11-10 06:00:00", which is 393222 hours after 1950-01-01 00:00:00. 
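
    One possible reading of these results, offered as an assumption rather than something confirmed in this thread: a GRIB1 header stores the step as a small integer count of a time unit chosen from a fixed table, so a step is only encodable if some unit divides it exactly and leaves a count that fits. The Python sketch below assumes a 16-bit count (maximum 65535) and the hour-based units 1, 3, 6, 12 and 24 h. It is not grib_api code, and `find_unit` is a hypothetical helper.

```python
from typing import Optional

MAX_COUNT = 2**16 - 1              # assumed: step count limited to 16 bits
UNITS_HOURS = (1, 3, 6, 12, 24)    # assumed hour-based GRIB1 time units

def find_unit(step_hours: int) -> Optional[int]:
    """Return the first unit (in hours) that encodes the step, or None."""
    for unit in UNITS_HOURS:
        count, rest = divmod(step_hours, unit)
        if rest == 0 and count <= MAX_COUNT:
            return unit
    return None

for step in (8760, 393210, 393216, 393222):
    print(step, find_unit(step))
```

    Under these assumptions 393216 is still encodable (as 32768 units of 12 h) while 393222 is not, which matches the grib_set behaviour above. The last step encodable in 6 h units, 65535 × 6 h, is about 44.9 years, consistent with a failure in year 44 if output is written every 6 hours. GRIB2 stores the forecast time in a wider field, which would explain why the converted message accepts the step.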

    So, could it be that I have to run OpenIFS with only GRIB2 output to make long runs work? 

    Could you please advise how to change the GRIB edition that OpenIFS writes? 

    Cheers

    Joakim 

  6. Unknown User (de3j)

    Hi again (v3)


    An update:

    OpenIFS now runs fine for me, and can write both ICMSH and ICMGG files. I had to do two things: 

    1) change "seconds since start" to a 64-bit integer (KIND=JPIB) in a few routines. 

    2) set the "date" written by preset_grib_template.F90 to the start date of the current run. Previously this was 1950-01-01 (when my model starts), but now I set it to the start date of the restarted run. Since I restart the model each year, "time since start" is never more than 1 year. 

    I had to change the routines that do "NINT(RTIMTR)" or similar, since after about 68 years RTIMTR exceeds 2^31. I think the integer kind returned by NINT depends on the compiler and its flags, so this may not be necessary if you instead set some compiler flag. I did something like this to updo3ch.F90

    !* 1. SETTING CONSTANT VALUES.
    ! ---------------------------

    IERR=0

    IF(.NOT.LNF.AND.NSTADD == 0) THEN
      ! IN CASE OF RESTART:
      ! Note: ITIME and IZT must themselves be declared INTEGER(KIND=JPIB)
      ! for the wider NINT results to be kept.
      !ITIME=NINT(TSTEP)
      ITIME = NINT(TSTEP,KIND=JPIB) !JTK edit to allow for large values
      IF (LTWOTL) THEN
        !IZT=NINT(TSTEP*(REAL(NSTAR2,JPRB)+0.5_JPRB))
        IZT=NINT(TSTEP*(REAL(NSTAR2,JPRB)+0.5_JPRB),KIND=JPIB) !JTK edit to allow for large values
      ELSE
        IZT=ITIME*NSTAR2
      ENDIF
      ISTADD=IZT/NINT(RDAY)
    ELSE
      ISTADD=NSTADD
    ENDIF

    and also adding 

    USE PARKIND1 , ONLY : JPIM, JPRB, JPIB

    I also did this to surfrad_layer.F90

    IY0=NCCAA(NINDAT)
    IM0=NMM(NINDAT)
    ID0=NDD(NINDAT)
    IDINCR=(NSSSSS+NINT(RSTATI,KIND=JPIB))/NINT(RDAY,KIND=JPIB)
    ISEC=MOD(NSSSSS+NINT(RSTATI,KIND=JPIB),NINT(RDAY,KIND=JPIB))
    CALL UPDCAL(ID0,IM0,IY0,IDINCR,IDD,IMM,IYY,ILMON,-1)
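
    To illustrate what the two modified lines compute, here is a Python sketch of the same split of elapsed seconds into whole days and seconds of day. The 32-bit case is emulated explicitly, the sample values are made up, and `as_int32` and `split_elapsed` are hypothetical helpers:

```python
import ctypes

RDAY = 86400  # seconds per day

def as_int32(value: int) -> int:
    """Emulate storing a value in a signed 32-bit integer."""
    return ctypes.c_int32(value).value

def split_elapsed(nsssss: int, rstati: float):
    """Return (day increment, second of day), like IDINCR and ISEC."""
    return divmod(nsssss + round(rstati), RDAY)

# 70 years of 365 days plus 6 hours, in seconds since the start of the run.
rstati = 70 * 365 * 86400.0 + 6 * 3600.0

print(split_elapsed(0, rstati))          # (25550, 21600)
print(as_int32(round(rstati)) // RDAY)   # wrong day count after 32-bit wrap
```

    With the wider integer the day increment and second of day come out as expected, whereas the wrapped 32-bit value gives a negative number of days to pass on to the calendar update.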

    I then did two changes, the first in preset_grib_template.F90, where I put

    IF (NJTKDATE < 0) THEN
      CALL IGRIB_SET_VALUE(IGRIB_HANDLE,'date',IINDAT)
    ELSE
      CALL IGRIB_SET_VALUE(IGRIB_HANDLE,'date',NJTKDATE)
    END IF

    where NJTKDATE is a newly defined namelist parameter. It defaults to -1 (set in sugrib.F90), but I set it to the restart date of my model, e.g. 19850101, in the NAMGRIB namelist. 

    The second change is to grib_code_message.F90

    IF( LPPSTEPS ) THEN
      IF (NJTKOFFSETSTEP>0) THEN
        ISEC = (NSTEP-NJTKOFFSETSTEP)*3600._JPRB
      END IF
    ELSE
      IF (NJTKOFFSETSTEP>0) THEN
        ISEC = (NSTEP-NJTKOFFSETSTEP)*TSTEP
      END IF
    ENDIF

    where NJTKOFFSETSTEP is an integer that allows me to offset the "endStep" written to the GRIB message. 

    So, let's say I have a run with 1 hr time step (8760 steps per year), and I run for one year, and then restart the model. I start the model in 1950-01-01. 

    Before, OpenIFS would write dataDate=19500101, stepRange=8766 in the output file to represent 1951-01-01 06:00. 

    Now it writes dataDate=19510101, stepRange=6, i.e. the stepRange is reset to 0 at each restart!  
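
    The equivalence of the two encodings can be checked with a few lines of Python. This is only a sketch: it assumes the reference time is 00 UTC on dataDate and that stepRange is in hours, and `valid_time` is a hypothetical helper:

```python
from datetime import datetime, timedelta

def valid_time(data_date: str, step_hours: int) -> datetime:
    """Valid time = reference date (YYYYMMDD, 00 UTC assumed) + step in hours."""
    return datetime.strptime(data_date, "%Y%m%d") + timedelta(hours=step_hours)

before = valid_time("19500101", 8766)  # date fixed at the initial start
after = valid_time("19510101", 6)      # date reset at the yearly restart

print(before)           # 1951-01-01 06:00:00
print(before == after)  # True

# The step that failed earlier in the thread, relative to 1950-01-01:
print(valid_time("19500101", 393222))  # 1994-11-10 06:00:00
```

    Both encodings describe the same valid time, but the second keeps the step small no matter how long the experiment runs.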

    This seems to work for me for now. Hope it makes sense.
    I can always share the files if anyone else is having the same problems. 


    Best wishes

    Joakim 


    1. Unknown User (jstreffi)

      Hey Joakim,

      Great news! Once I am past my segmentation fault issues 5-15 years into the runs, I will apply these changes to my codebase too.

      Cheers Jan

    2. Unknown User (jstreffi)

      I ran into the same issue at 44 years with OpenIFS FESOM2. I introduced these changes and the problem was solved. Thank you for sharing this here!

      Best regards,

      Jan

  7. Unknown User (de3j)

    Sounds good, Jan. 

    I'm happy to share these changes, but my code looks really awful since I've tried a bunch of different ways to change how time steps are written to the output files. I'll put "clean up" on the to-do list... 

    But surely, EC-Earth must already do something similar, either reset the "stepRange" counter every few years, or write all output in GRIB2? Otherwise you would not be able to output anything past year 69? Or have Uwe and the rest of the EC-Earth team not done a run longer than that with OpenIFS yet? 


    Cheers

    Joakim


    1. Unknown User (jstreffi)

      The EC-Earth community is working with a heavily modified IFS cycle 36 for CMIP6. They've had an OIFS branch coupled to NEMO since Jan 2015. Uwe coupled it and started merging their IFS changes (such as the long-run fixes) into the branch, but right now their No. 1 priority is CMIP6 ESM tuning, so understandably progress is slow. When I first started using it, the EC-Earth OIFS branch crashed after a few time steps, so I'm pretty sure nobody else has done anything with it so far.

      The lines you changed in updo3ch.F90, surfrad_layer.F90 and grib_code_message.F90 are unmodified in the EC-Earth OIFS branch. I just checked grib_code_message.F90 in the EC-Earth trunk (IFS cy 36) and found the respective lines unchanged. They must have either found another way or don't use them.


  8. Unknown User (de3j)

    Hi Jan 

    Ok, so I'm probably the first to run into this problem then. There's a logical "LPERPET" which I think is for aquaplanet runs (and maybe Held-Suarez tests?), that resets the time as well to avoid this issue in very long runs. 

    Glenn, let me know if you want the source files for these fixes. I guess there won't be a new version of OIFS cy40r1, but perhaps CY43R3 has the same problem? 

    /J

  9. Unknown User (jstreffi)

    For OpenIFS you are probably the first. But the problem seems to come up again every once in a while. It has been around in one form or another for years. For example, from the PrepIFS version for EC-EARTH v2:

    "With the 6-hourly output settings, the multiyear suite can be used for integrations up to 44 years (this limitation is related to mars archiving of grib output)"

    Glenn (nagc): Christopher Roberts recently ran ECMWF-IFS experiments for 50+ years (https://www.geosci-model-dev-discuss.net/gmd-2018-90/). I assume he had to find a fix for this issue as well? His fixes should be included in cy 43. Maybe you can check if this is the case?

    Cheers, Jan

  10. Unknown User (nagc)

    Hi Jan, Joakim,

    Apologies for not replying sooner to this thread, it was on my todo list for a while. I did implement some of the EC-Earth long-run fixes into OpenIFS 40r1v2, but obviously those changes didn't catch all the problems. I understood that EC-Earth avoid the GRIB1 issue by resetting the date. Changing the GRIB1 fields to GRIB2 would involve modifying the grid templates and probably unknown code changes. I wouldn't recommend it, not least because it might cause archiving problems if we ever wanted to bring the files back into the ECMWF archive. Jan - you are correct that some of the fixes were also made for the PRIMAVERA runs at ECMWF. I have a copy of the changes made to 43r3.

    Late last year I visited SMHI and we discussed how to implement these long run changes into OpenIFS 43r3. There are now multiple copies of the same changes between you both, EC-Earth & ECMWF.  The point was made that the implementation is not so important as the added functionality.  Uwe and I agreed that we'd look at the changes required, any extra functionality, and work together to add them into the model code. The code changes above may be useful, so thanks for posting them here. When we get to the point of discussing long-run changes for 43r3 it would be good if you could both take part.

    Cheers,  Glenn