We're using the Hydra package to make configuration management easier for our machine learning system.

The config system is modular, broken up into entities that (try to) make sense, such as:

  • data
  • dataloader
  • hardware
  • model
  • training
  • ...

Each of these groups contains YAML files that we can write and use. These configs serve as the defaults used during training. If you need new defaults for a specific purpose, you can add them and switch between them by keyword.

So a YAML file in model can hold the definitions for the model, such as:

num_channels: 123
...

hidden:
  num_layers: 12

Values can be plain top-level keys or sit inside one or more nested groups.
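
To make the nesting concrete, here is a minimal Python sketch that treats the YAML above as a plain dictionary and looks a value up by its dotted path, the same notation used for command-line overrides further down. The `select` helper is hypothetical and written only for illustration; it is not part of Hydra.

```python
from functools import reduce

# The example model config from above, written as a plain dictionary.
model_cfg = {
    "num_channels": 123,
    "hidden": {"num_layers": 12},
}


def select(cfg, dotted_key):
    """Follow a dotted path such as 'hidden.num_layers' through nested dicts."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), cfg)


print(select(model_cfg, "num_channels"))       # plain top-level key
print(select(model_cfg, "hidden.num_layers"))  # key nested inside a group
```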

File-based config overrides

At the root of the config folder, there's a config.yaml and a debug.yaml, as well as other default configs for specific use cases. config.yaml is the default config, which in turn defines which of the group files are merged into your configuration.

It looks something like this:

defaults:
  - hardware: atos
  - data: zarr
  - dataloader: default
  - model: gnn
  - training: default
  - diagnostics: eval_rollout
  - override hydra/job_logging: none
  - override hydra/hydra_logging: none
  - _self_

The debug.yaml, for example, overrides some default values directly: it adds _self_ to the end of that list and defines the new values in the file itself, e.g.:

diagnostics:
  log:
    wandb:
      offline: True
training:
  max_epochs: 3

This gives some very short runs. You can override any value from the groups in the main config. You can also switch out which config is used for a group, like we switch the hardware config between atos and atos_slurm for interactive and sbatch jobs.

defaults:
  - hardware: atos_slurm
  ...
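
One way to picture the layering described in this section is as a recursive dictionary merge in which later entries win. The sketch below is plain Python, not Hydra's actual implementation. `deep_merge` is a hypothetical helper, and the values in `from_groups` are made up for illustration; only the `debug_self` values come from the debug.yaml example above.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where values in `override` win over `base`, recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Assumed values standing in for what the group configs might provide.
from_groups = {
    "diagnostics": {"log": {"wandb": {"enabled": True, "offline": False}}},
    "training": {"max_epochs": 100},
}

# The values debug.yaml defines in the file itself.
debug_self = {
    "diagnostics": {"log": {"wandb": {"offline": True}}},
    "training": {"max_epochs": 3},
}

# _self_ comes last in the defaults list, so it is merged last and wins.
debug_cfg = deep_merge(from_groups, debug_self)
print(debug_cfg["training"]["max_epochs"])
print(debug_cfg["diagnostics"]["log"]["wandb"])
```

Keys that debug.yaml does not mention (here `enabled`) keep the value they got from the group config, which is why the file only needs to list what it changes.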

Command-line config overrides

It's annoying to edit config files for every change, so we can also override default configs, group configs and individual keywords for training on the command line.

We can switch the default config out by using:

aifs-train --config-name=<name-of-default-config.yaml>

Then we can also switch out the group configs:

aifs-train hardware=atos_slurm

This is used in jobscript.sh, for example.

And finally, we can override individual config entries such as:

aifs-train diagnostics.log.wandb.enabled=false

And combine everything together if you so desire:

aifs-train --config-name=<awesome-config> hardware=atos_slurm diagnostics.log.wandb.enabled=false
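
As a rough mental model of the individual-entry overrides, each `key.path=value` argument walks down the nested config and replaces one leaf. The sketch below is plain Python with hypothetical helpers (`parse_value`, `apply_override`) and assumed starting values; it is not how Hydra parses its command line, and it does not model group switches such as `hardware=atos_slurm`, which select a different file rather than set a value. It raises on unknown keys, loosely mirroring Hydra's behaviour of rejecting overrides for keys that are not already in the config unless they are prefixed with `+`.

```python
def parse_value(raw):
    """Very rough value parsing, for illustration only."""
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    try:
        return int(raw)
    except ValueError:
        return raw


def apply_override(cfg, override):
    """Apply one 'a.b.c=value' override to a nested dict, in place."""
    dotted_key, raw = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for part in parents:
        node = node[part]  # KeyError if an intermediate group is missing
    if leaf not in node:
        raise KeyError(dotted_key)
    node[leaf] = parse_value(raw)
    return cfg


# Assumed starting values; only the key names come from the examples above.
cfg = {
    "diagnostics": {"log": {"wandb": {"enabled": True}}},
    "training": {"max_epochs": 100},
}

for arg in ["diagnostics.log.wandb.enabled=false", "training.max_epochs=3"]:
    apply_override(cfg, arg)

print(cfg)
```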

