In the first stage of training, standard practice is to train the model on a 6-hour interval.
In the second stage of training, we use 'rollout' and fine-tune the model on forecast errors at lead times up to 72 hours (i.e. 6, 12, 18, 24 hours, etc.).
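As a quick check of the numbers above, the following minimal sketch lists the lead times covered by a 6-hour step out to 72 hours. The variable names are illustrative only and are not part of the training code.

```python
# Illustrative arithmetic only; these names do not come from aifs-train.
STEP_HOURS = 6        # interval used in the first training stage
MAX_LEAD_HOURS = 72   # longest lead time used in rollout fine-tuning

# Lead times at which the forecast error is evaluated during rollout.
lead_times = list(range(STEP_HOURS, MAX_LEAD_HOURS + 1, STEP_HOURS))

# Number of consecutive 6-hour steps needed to reach 72 hours.
n_rollout_steps = len(lead_times)

print(lead_times)
print(n_rollout_steps)
```

Reaching 72 hours therefore takes 12 consecutive 6-hour steps, which matches the value of 12 used in the rollout settings.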
To run this second stage, change your batch script as follows:
srun aifs-train --config-name=**config-file** \
hardware.files.warm_start=last.ckpt \
training.run_id=8bdfd7a1-294c-41f5-8ffc-952d6170c9e9 \
training.max_epochs=**epoch_num from first run + 12**
Here, training.run_id is the name of the folder where the checkpoints are stored on scratch. You can find this run_id through Weights & Biases.
Then change the following options in your config file:
defaults:
  - dataloader: rollout
training:
  rollout:
    epoch_increment: 1
    max: 12
If you prefer, you can also put the run_id, warm_start and max_epochs in your config file instead of passing them on the command line.
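To see why 12 is added to the epoch count of the first run, here is a rough sketch of how the rollout length might grow during fine-tuning. It assumes the rollout starts at 1 and increases by epoch_increment after every epoch until it reaches max; that schedule is an assumption for illustration, not a description of the actual trainer. The function rollout_schedule and the value of first_run_epochs are hypothetical.

```python
def rollout_schedule(start, epoch_increment, max_rollout, n_epochs):
    """Return the rollout length used in each epoch, under the assumed schedule."""
    schedule = []
    rollout = start
    for _ in range(n_epochs):
        schedule.append(rollout)
        # Assumed behaviour: grow after each epoch, capped at max_rollout.
        rollout = min(rollout + epoch_increment, max_rollout)
    return schedule


first_run_epochs = 100  # hypothetical: replace with the epoch count of your first run
extra_epochs = 12       # the "+ 12" added to training.max_epochs

schedule = rollout_schedule(start=1, epoch_increment=1, max_rollout=12, n_epochs=extra_epochs)
max_epochs = first_run_epochs + extra_epochs

print(schedule)    # rollout length per fine-tuning epoch
print(max_epochs)  # value to pass as training.max_epochs
```

Under this assumption, the 12 extra epochs step through rollout lengths 1 to 12, so the last fine-tuning epoch trains on the full 72-hour rollout.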