We use the Hydra package to make configuration management easier for our machine learning system.
The config system is modular, broken up into groups that (try to) make sense, such as hardware, data, dataloader, model, training and diagnostics.
Each group contains yaml files that we can write and use. These configs serve as the defaults used during training. If you have new defaults that serve specific purposes, you can add them and switch between them by keyword.
So a yaml file in model can hold the definitions for the models, such as:

    num_channels: 123
    ...
    hidden:
      num_layers: 12

with both plain entries and nested groups.
At the root of the config folder, there's a config.yaml and a debug.yaml, as well as other default configs for specific use cases. config.yaml is the default config, which in turn defines which of the files are merged into your configuration.
It looks something like this:

    defaults:
      - hardware: atos
      - data: zarr
      - dataloader: default
      - model: gnn
      - training: default
      - diagnostics: eval_rollout
      - override hydra/job_logging: none
      - override hydra/hydra_logging: none
      - _self_

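To picture what "merged" means for such a list, here is a minimal plain-Python sketch. It is not Hydra's implementation, only an illustration of the idea that each `group: option` entry selects one file and its contents end up under the group's key. The in-memory `GROUP_FILES` table stands in for the yaml files on disk; the hardware entry and the assumption that the model example lives in the gnn option are hypothetical.

```python
# Stand-in for the yaml files in the group folders: (group, option) -> contents.
# Hypothetical: the hardware contents, and that model/gnn holds the
# num_channels / hidden example shown earlier.
GROUP_FILES = {
    ("hardware", "atos"): {"note": "hypothetical hardware settings"},
    ("model", "gnn"): {"num_channels": 123, "hidden": {"num_layers": 12}},
}

# A shortened version of the defaults list, as (group, option) pairs.
DEFAULTS = [("hardware", "atos"), ("model", "gnn")]


def compose(defaults, files):
    """Place the contents of each selected file under its group key."""
    cfg = {}
    for group, option in defaults:
        cfg[group] = dict(files[(group, option)])
    return cfg


cfg = compose(DEFAULTS, GROUP_FILES)
print(cfg["model"]["hidden"]["num_layers"])  # prints 12
```

The nesting under the group name is what makes dotted paths such as model.hidden.num_layers meaningful when individual values are overridden.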
The debug.yaml, for example, overrides some default values directly by adding _self_ to the end of the defaults list and then defining the new values in the file itself, e.g.:

    diagnostics:
      log:
        wandb:
          offline: True
    training:
      max_epochs: 3

This is useful for very short runs. You can override any value from the groups in the main config. You can also switch out which config is used, like we switch the hardware config between atos and atos_slurm for interactive and sbatch jobs.

    defaults:
      - hardware: atos_slurm
      ...

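The way the entries written directly in debug.yaml win over the group defaults, because _self_ comes last, can be pictured as a recursive dictionary merge. The sketch below is plain Python under stated assumptions, not Hydra's own code: the `from_groups` values (including the `enabled` and `other_setting` keys) are hypothetical placeholders, and only the `debug_self` entries are taken from the debug.yaml example.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict in which `override` wins over `base`, key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested groups: merge them instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical values standing in for what the group configs provide.
from_groups = {
    "diagnostics": {"log": {"wandb": {"enabled": True, "offline": False}}},
    "training": {"max_epochs": 200, "other_setting": "unchanged"},
}

# The entries debug.yaml defines in the file itself.
debug_self = {
    "diagnostics": {"log": {"wandb": {"offline": True}}},
    "training": {"max_epochs": 3},
}

# _self_ is last in the list, so the file's own entries are merged on top.
cfg = deep_merge(from_groups, debug_self)
print(cfg["training"])
```

Only the keys named in debug.yaml change; sibling keys coming from the groups are kept.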
Editing config files for every change is tedious, so we can also override default configs, group configs and individual keywords for training on the command line.
We can switch out the default config by using

    aifs-train --config-name=<name-of-default-config.yaml>

Then we can also switch out the group configs:

    aifs-train hardware=atos_slurm

This is used in the jobscript.sh, for example.
And finally, we can override individual config entries such as:

    aifs-train diagnostics.log.wandb.enabled=false

And combine everything together if you so desire:

    aifs-train --config-name=<awesome-config> hardware=atos_slurm diagnostics.log.wandb.enabled=false

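As a rough illustration of what a dotted override such as diagnostics.log.wandb.enabled=false does to the merged config, here is a small plain-Python sketch. It is a simplification and not Hydra's override parser, whose grammar is richer; it also covers only value overrides, not group switches like hardware=atos_slurm. The helper names `parse_value` and `apply_override` and the starting config values are hypothetical.

```python
def parse_value(text):
    """Turn a command-line string into a bool, int or float where possible."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text  # fall back to the raw string


def apply_override(cfg, override):
    """Apply one `dotted.key=value` override to a nested dict, in place."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        # Walk down the nested groups, creating them if they are missing.
        node = node.setdefault(name, {})
    node[leaf] = parse_value(raw_value)


# Hypothetical starting config.
cfg = {"diagnostics": {"log": {"wandb": {"enabled": True, "offline": False}}}}

apply_override(cfg, "diagnostics.log.wandb.enabled=false")
apply_override(cfg, "training.max_epochs=3")
print(cfg)
```

Each dot in the key steps one level down into the nested groups, which is why the path on the command line mirrors the nesting in the yaml files.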