From October 2024, we will start using a new MLflow server managed by the Platform Engineering Team (PET): https://mlflow.ecmwf.int
An overview of the deployment can be found here. This new server uses the ECMWF SSO. You will now have to authenticate with your ECMWF account. This page walks you through the setup and the authentication process.
Carefully read through the following sections, as failing to authenticate will cause your training runs to crash.
Update anemoi-training to the latest version (0.2) available on PyPI.
In your diagnostics config (for example diagnostics/eval_rollout.yaml), set the same MLflow entries as with the old server, plus authentication, which needs to be True.
Please keep log_model set to False. Enabling it uploads the training checkpoints to MLflow, which we don't want for now.
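As a quick sanity check before launching a run, the two settings above can be verified programmatically. The sketch below is illustrative only: it assumes the MLflow settings have already been loaded into a plain dictionary, and the helper name `check_mlflow_settings` is hypothetical, not part of anemoi-training. Only the `authentication` and `log_model` keys are taken from this page.

```python
def check_mlflow_settings(settings):
    """Return a list of problems found in the MLflow settings (empty if none).

    `settings` is assumed to be a plain dict holding the MLflow entries of
    the diagnostics config; how it is loaded is outside this sketch.
    """
    problems = []
    # The new server uses the ECMWF SSO, so authentication must be switched on.
    if settings.get("authentication") is not True:
        problems.append("authentication must be True")
    # Checkpoints should not be uploaded to MLflow for now.
    # A missing key is also reported, so the value has to be set explicitly.
    if settings.get("log_model") is not False:
        problems.append("log_model must be False")
    return problems


if __name__ == "__main__":
    print(check_mlflow_settings({"authentication": True, "log_model": False}))
    print(check_mlflow_settings({"authentication": False, "log_model": True}))
```

An empty list means both settings match what this page asks for.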
Before starting a training run, you need to authenticate yourself to the MLflow server and obtain a valid token.
This is done with the anemoi-training mlflow login command.
The first time you run the command, you need to pass the server URL with --url (anemoi-training mlflow login --url https://mlflow.ecmwf.int), and you need to obtain a seed token.
On subsequent logins, you can drop --url and call the command without any options. It will use the URL from last time. If you ever need to change the URL, just pass --url again.
The second time you run the login command, it does not ask for a seed token from the website. That is because your login token is valid for 30 days.
Furthermore, every training run you start inside that 30-day period extends the token for another 30 days. So as long as you start at least one training run every 30 days, you do not have to log in again.
The login command tells you how long your token is valid for. If you are unsure whether your current token is still valid before starting a training run, just run the login command again.
It is good practice to run anemoi-training mlflow login before starting a training run, just to make sure you have a valid token. Otherwise your training run will crash.
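The 30-day sliding window described above can be modelled in a few lines. This only illustrates the rule as stated on this page; it is not the actual token handling in anemoi-training, and all function names are hypothetical.

```python
from datetime import datetime, timedelta

TOKEN_LIFETIME = timedelta(days=30)  # validity period stated on this page


def expiry(last_refresh):
    """A token stays valid for 30 days after the last login or training start."""
    return last_refresh + TOKEN_LIFETIME


def is_valid(last_refresh, now):
    """True while `now` is still inside the 30-day window."""
    return now < expiry(last_refresh)


def start_training(last_refresh, now):
    """Starting a run with a valid token extends it; return the new refresh time."""
    if not is_valid(last_refresh, now):
        raise RuntimeError("token expired: run the login command first")
    return now


if __name__ == "__main__":
    login = datetime(2024, 10, 1)
    refreshed = start_training(login, datetime(2024, 10, 20))
    print("valid until", expiry(refreshed).date())
```

In this model, a run started 19 days after login moves the expiry to 30 days after that run, whereas a run attempted after the window has closed fails until you log in again.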
To make it easier to log offline runs stored on a local filesystem (i.e. runs where the offline flag is set to True) to a remote MLflow server, we have developed a custom mlflow sync command as part of anemoi-training.
This command is based on the open-source library mlflow-export-import, which provides utilities for exporting and importing information between MLflow servers.
The mlflow sync command is a command-line interface (CLI) tool that allows you to synchronize runs between MLflow servers. It takes the following options:
-s <SOURCE_SERVER>: URL of the source MLflow server where the offline runs are stored. This can be either a remote server or a local filesystem where we have some offline runs.
-r <RUN_ID>: Unique identifier of the run to be synced.
-d <DESTINATION_SERVER>: URL of the destination MLflow server where the runs will be logged.
-e <EXPERIMENT_NAME>: Name of the experiment on the destination server where the run data will be stored.
-a: (Optional) Additional flag required if the destination server requires authentication.
If you forget to pass the authentication flag and the DESTINATION_SERVER requires it, the command fails with a connection error like:
ConnectionError: Could not connect to MLflow server at https://mlflow.ecmwf.int The server may require authentication did you forget to turn it on?
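When the sync is driven from a script, the options above can be assembled into an argument list. The sketch below is a minimal illustration: the helper `build_sync_command` is hypothetical, and the `./mlflow` source path is a placeholder. Only the flags listed above are used.

```python
import shlex


def build_sync_command(source, run_id, destination, experiment, authenticate=False):
    """Assemble the argument list for `anemoi-training mlflow sync`."""
    cmd = [
        "anemoi-training", "mlflow", "sync",
        "-s", source,        # where the offline runs are stored
        "-r", run_id,        # run to be synced
        "-d", destination,   # server the run is logged to
        "-e", experiment,    # experiment name on the destination server
    ]
    if authenticate:
        # Needed when the destination server requires authentication.
        cmd.append("-a")
    return cmd


if __name__ == "__main__":
    cmd = build_sync_command(
        source="./mlflow",  # placeholder path to a local offline store
        run_id="c76a59cd015c4ecf97bea9e805bb3845",
        destination="https://mlflow.ecmwf.int",
        experiment="aifs_debug",
        authenticate=True,
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keeping it as a list avoids shell-quoting problems with paths that contain spaces.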
You can access help information for the command with the -h option (anemoi-training mlflow sync -h).
This displays a help message with details about all available options and usage examples.
Note: this command requires the mlflow-export-import library to be installed. The library cannot be installed directly from PyPI (the installation does not work properly), so the first time you run the command you will be asked to follow the instructions in the mlflow-export-import GitHub repository to install it properly.
As an example, consider how runs logged offline with MLflow are usually laid out on disk: under mlflow/ there is one folder per experiment, and inside each experiment folder there is one folder per run. In this case, 803589923224981412 is the experiment folder for 'aifs_debug', and c76a59cd015c4ecf97bea9e805bb3845 is the ID of a run that was logged offline and can be synced.
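To find which run IDs are available for syncing, the directory layout described above can be walked with a short script. This is a sketch under simplifying assumptions: it treats every non-hidden sub-folder of an experiment folder as a run, and the helper `list_offline_runs` is hypothetical. The demo builds a throwaway directory instead of touching a real offline store.

```python
import tempfile
from pathlib import Path


def list_offline_runs(root):
    """Map each experiment folder under `root` to the run folders it contains."""
    runs = {}
    for exp_dir in sorted(Path(root).iterdir()):
        # Skip plain files and hidden folders such as a trash directory.
        if not exp_dir.is_dir() or exp_dir.name.startswith("."):
            continue
        runs[exp_dir.name] = sorted(
            p.name for p in exp_dir.iterdir()
            if p.is_dir() and not p.name.startswith(".")
        )
    return runs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Recreate the example layout: <experiment folder>/<run id>/
        run_dir = Path(tmp) / "803589923224981412" / "c76a59cd015c4ecf97bea9e805bb3845"
        run_dir.mkdir(parents=True)
        for experiment, run_ids in list_offline_runs(tmp).items():
            print(experiment, run_ids)
```

Each run ID found this way is a candidate value for the -r option of the sync command.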
Those working in a different codebase who want to log to our server have two options:
1. Sync offline runs with anemoi-training mlflow sync
2. Use the AnemoiMlflowClient provided in anemoi-training
In both cases, you will need to install anemoi-training in your environment. You probably do not want all of anemoi-training's dependencies installed, so unless your codebase is compatible with anemoi-training, it is recommended to install it without dependencies (for example, pip install --no-deps anemoi-training).
You can also use the custom MLflow client with authentication turned on.
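A minimal sketch of how such a client might be created is shown below. The import path `anemoi.training.diagnostics.mlflow.client` and the `authentication=True` keyword are assumptions that should be checked against your installed version of anemoi-training. The helpers `make_client` and `experiment_names` are hypothetical wrappers, and `search_experiments` is the standard MLflow client method, assumed here to be available on the custom client as well.

```python
def make_client(tracking_uri="https://mlflow.ecmwf.int"):
    """Create an MLflow client with authentication turned on."""
    # Imported lazily so this module loads even without anemoi-training installed.
    # ASSUMPTION: module path and `authentication` keyword; verify for your version.
    from anemoi.training.diagnostics.mlflow.client import AnemoiMlflowClient

    return AnemoiMlflowClient(tracking_uri, authentication=True)


def experiment_names(client):
    """List experiment names using the regular MLflow client interface."""
    return [experiment.name for experiment in client.search_experiments()]
```

Typical use would be `experiment_names(make_client())`, which requires a valid token obtained beforehand with the login command.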
The procedure around logging in and having a valid token still applies, so don't forget to run anemoi-training mlflow login before starting your experiment.