Hi all,

I am trying to run a T1279 case and output PEXTRA tendencies, but I seem to be running out of memory. I can run the model with the regular (non-PEXTRA) GRIB codes, but I am struggling to output the tendencies.

Here are the relevant fields from the namelists:

&NAEPHY
LEPHYS=true,      ! switch the full ECMWF physics package on/off.
LBUD23=true,            ! enable computation of physics tendencies and budget diagnostics
LEVDIF=true,              ! turn on/off the vertical diffusion scheme.
LESURF=true,             ! turn on/off the interface surface processes.
LECOND=true,            ! turn on/off the large-scale condensation processes.
LECUMF=true,            ! turn on/off the mass-flux cumulus convection.
LEPCLD=true,             ! turn on/off the prognostic cloud scheme.
LEEVAP=true,              ! turn on/off the evaporation of precipitation
LEVGEN=true,             ! turn on/off Van Genuchten hydrology (with soil type field)
LESSRO=true,            ! turn on/off orographic (VIC-type) runoff
LECURR=false,           ! if true, ocean current boundary condition is used.
LEOCWA=true,            ! turn on/off the warm ocean layer parametrization.
LEGWDG=true,            ! turn on/off gravity wave drag.
LEGWWMS=true,           ! turn on/off the Warner-McIntyre-Scinocca non-orographic gravity wave drag scheme.
LEOZOC=false,           ! turn on/off the climatological ozone.
LEQNGT=true,            ! turn on/off the negative humidity fixer.
LERADI=true,            ! turn on/off the radiation scheme.
LERADS=true,            ! turn on/off the interactive surface radiative properties.
LESICE=true,            ! turn on/off the interactive sea-ice processes.
LEO3CH=true,            ! turn on/off the ozone chemistry (for prognostic ozone).
LEDCLD=true,            ! turn on/off the diagnostic cloud scheme.
LDUCTDIA=false,         ! turn on/off computation and archiving of ducting diagnostics.
LELIGHT=false,          ! turn on/off the lightning parametrization.
LWCOU=true,             ! turn on/off coupled wave model (n.b. always off for OpenIFS model version 38r1).
LWCOU2W=true,           ! turn on/off two-way interaction with the wave model (n.b. always off for OpenIFS model version 38r1).
NSTPW=1,                ! frequency of call to the wave model.
RDEGREW=0.5,            ! resolution of the wave model (degrees).
RSOUTW=-81.0,           ! south boundary of the wave model.
RNORTW=81.0,            ! north boundary of the wave model.
/
&NAMFPC
CFPFMT="MODEL",
!
!  output on model levels
NFP3DFS=6,
MFP3DFS(:)=93,95,98,102,105,109,
NRFP3S(:)=1, ! I also tried it with all 137 levels
/
&NAMDPHY
NVEXTR=25,              ! set number of tendency output fields (see table)
NCEXTR=137,             ! edit to correctly set number of full model levels e.g. 60, 91, 137 etc
/
&NAMPHYDS
NVEXTRAGB(1:25)=91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,  ! define GRIB codes for the tendency fields
/
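
(As a sanity check, the parameter codes that actually end up in the upper-air output can be listed with grib_ls from ecCodes; the filename below is just the placeholder from my run, so substitute the real experiment ID:)

# list parameter code and model level for every field in the upper-air output file
grib_ls -p paramId,level ICMUA<expid>+000000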

I followed the steps on the How+to+control+OpenIFS+output page and took the advice regarding PEXTRA/LBUD23 on the forum into account. I think I set everything correctly, but maybe I overlooked a detail?

By the way, the model runs until step 2, and there is output for ICMUA<expid>+000000. This is the ifs.stat file:

14:29:19 000000000 CNT3      -999     4.172    4.172    5.629      0:00      0:00 0.00000000000000E+00       0GB       0MB
14:29:42 A00000000 STEPO        0    26.132   26.132   28.317      0:26      0:28 0.49588412029193E-04       0GB       0MB
14:29:42 0AA000000 STEPO        0     0.000    0.000    0.000      0:26      0:28 0.49588412029193E-04       0GB       0MB
14:29:53 FULLPOS-S DYNFPOS      0    11.476   11.476   11.600      0:37      0:40 0.49588412029193E-04       0GB       0MB
14:30:10 0AAA00AAA STEPO        0    16.431   16.431   16.520      0:54      0:56 0.49588412029193E-04       0GB       0MB
14:31:17 0AAA00AAA STEPO        1    66.838   66.838   67.221      2:00      2:04 0.49394979039009E-04       0GB       0MB
14:32:47 0AAA00AAA STEPO        2    88.038   88.038   89.921      3:29      3:33 0.49186674044295E-04     768GB       0MB

Does anyone have advice? It may simply be that I have reached the limits of the HPC at T1279 resolution.

Thanks in advance.

Kind regards,

~Thomas Batelaan



5 Comments

  1. Hi Thomas, I would first check the memory requirements of the job. You are running OpenIFS at 9 km resolution, which is rather memory intensive, and in my experience including PEXTRA fields multiplies the memory requirements. My suggestion would be to begin with a coarser grid and verify first that your experiments work at that resolution.

    Cheers,  Marcus

  2. Unknown User (gdcarver113@outlook.com)

    Hi Thomas,

    You don't give the actual error message reported by the job, but I believe you that the job has run out of memory. T1279 is a very high resolution, and adding all the PEXTRA arrays for additional diagnostics will create a lot more 3D fields. The other point to bear in mind is the amount of model output this will produce!

    I assume you have already got as much memory allocated to the job as you can. Have you tried using additional nodes and underpopulating them (reducing the number of MPI tasks per node) to increase the memory available to the job?

    If not, I'd suggest dropping down to a lower resolution, say T799 (which is still high), and making sure the job works at that resolution first, as Marcus suggests.

  3. Unknown User (thomas.batelaan@wur.nl)

    Dear Ryan and Marcus,

    Thanks a lot for your rapid replies.

    I understand that T1279 jobs consume a lot of memory. The fat partition of our cluster has 128 cores per node and 1 TiB of memory per node (8 GB per core); there are two partitions with even more memory, but those require special access rights. If I interpret the error file correctly, the without_pextra job is already very close to the cluster's limit (see the attached text files with the with_pextra and without_pextra specs), so if PEXTRA needs twice as much memory, as Marcus Koehler says, it easily goes over it.

    For the jobs I requested 384 cores (3 nodes of 128 cores each) with 1 core per task:

    #!/bin/bash
    #SBATCH -t 02:30:00
    #SBATCH --exclusive
    #SBATCH --partition=fat
    #SBATCH --ntasks=384
    #SBATCH --cpus-per-task=1
    #SBATCH --output=oifs.output%j.txt
    #SBATCH --error=oifs.error%j.txt

    # ... setting the environment etc. ...

    srun master.exe

    I am not sure how I can underpopulate further; I am still a beginner in HPC computing. Do I understand correctly that I currently use 128 MPI tasks per node, because I requested 1 core per task on a cluster with 128 cores per node?
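
    To double-check the task placement, I suppose I could print the standard SLURM variables from inside the job script before launching the model, something like:

    # show how SLURM actually distributed the tasks for this job
    echo "nodes:          ${SLURM_JOB_NUM_NODES}"
    echo "total tasks:    ${SLURM_NTASKS}"
    echo "tasks per node: ${SLURM_TASKS_PER_NODE}"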

    Thanks in advance,

    Kind regards,

    ~Thomas



  4. Unknown User (gdcarver113@outlook.com)

    Have you run this at a lower resolution than T1279? It's not a good idea to run at the highest resolution until you know the job works correctly at lower resolutions.

    If you have, then run the T1279 job without PEXTRA turned on first, to make sure it works and so you can see the memory requirement without PEXTRA. Underpopulating means increasing the node count to get the required memory for the job (once you know what it is), and then reducing the MPI task count per node to be less than the available CPUs per node. Your local HPC support should be able to help.
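
    As a rough sketch only (the node and task counts below are just an illustration for 128-core nodes, not a recommendation for your cluster): keeping the same 384 MPI tasks but spreading them over 6 half-populated nodes would double the memory available to each task.

    #SBATCH --nodes=6             # request more nodes than 384 tasks strictly need
    #SBATCH --ntasks=384          # total number of MPI tasks unchanged
    #SBATCH --ntasks-per-node=64  # half-populate each 128-core node
    #SBATCH --cpus-per-task=1
    #SBATCH --exclusive           # keep the unused cores and their memory for this job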

  5. Unknown User (thomas.batelaan@wur.nl)

    Thanks all! With your help I managed to get it running (and also got a better feeling for the memory requirements). Simply requesting more nodes did the job.

    (And yes, I had already got it working with and without PEXTRA at a lower resolution before I scaled up.)