
1. Overview

From October 2024, we will start using a new MLflow server managed by the Platform Engineering Team (PET): https://mlflow.ecmwf.int 

An overview of the deployment can be found here. This new server uses the ECMWF SSO. You will now have to authenticate with your ECMWF account. This page walks you through the setup and the authentication process.

Carefully read through the following sections, as failing to authenticate will cause your training runs to crash.

2. Logging to the new server with anemoi-training


2.1. Config

In your diagnostics config, set the following entries. They are the same as for the old server, with the addition of authentication, which must be True.

Please keep log_model: False. Enabling it uploads the training checkpoints to MLflow, which we don't want for now.


diagnostics/eval_rollout.yaml

log:
  mlflow:
    enabled: True
    offline: False
    authentication: True
    log_model: False
    tracking_uri: https://mlflow.ecmwf.int
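As a quick sanity check before submitting a job, the entries above can be compared against an already-loaded config. The following is only a minimal sketch: it assumes the mlflow block has been read into a plain Python dict, and the names REQUIRED_MLFLOW_SETTINGS and check_mlflow_settings are hypothetical helpers, not part of anemoi-training. It checks the online-logging setup described here, so a deliberately offline run would be reported as a mismatch.

```python
# Settings this page asks for under log -> mlflow in the diagnostics config.
REQUIRED_MLFLOW_SETTINGS = {
    "enabled": True,
    "offline": False,
    "authentication": True,
    "log_model": False,
    "tracking_uri": "https://mlflow.ecmwf.int",
}


def check_mlflow_settings(mlflow_cfg: dict) -> list[str]:
    """Return human-readable problems; an empty list means the block matches."""
    problems = []
    for key, expected in REQUIRED_MLFLOW_SETTINGS.items():
        actual = mlflow_cfg.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, found {actual!r}")
    return problems


# Example: a block carried over from the old server, without authentication.
old_style = {
    "enabled": True,
    "offline": False,
    "log_model": False,
    "tracking_uri": "https://mlflow.ecmwf.int",
}
print(check_mlflow_settings(old_style))
# ['authentication: expected True, found None']
```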

2.2. Logging in

Before starting a training run, you need to authenticate with the MLflow server and obtain a token. Training will not start without a valid token.

This is done with the anemoi-training mlflow login command.

The first time you run the command, you need to pass the URL with --url and obtain a seed token:

$ anemoi-training mlflow login --url https://mlflow.ecmwf.int
 
2024-10-14 10:54:10 INFO 🌐 Logging in to https://mlflow.ecmwf.int
2024-10-14 10:54:10 INFO 📝 Please obtain a seed refresh token from https://mlflow.ecmwf.int/seed
2024-10-14 10:54:10 INFO 📝 and paste it here (you will not see the output, just press enter after pasting):
Refresh Token: ***
2024-10-14 11:00:17 INFO Your MLflow login token is valid until 2024-11-12 11:00:17 UTC
2024-10-14 11:00:17 INFO ✅ Successfully logged in to MLflow. Happy logging!


On subsequent runs you can drop --url and call the command without any options; it reuses the URL from the last login. If you ever need to change the URL, just pass --url again.

$ anemoi-training mlflow login
 
2024-10-14 11:00:49 INFO 🌐 Logging in to https://mlflow.ecmwf.int
2024-10-14 11:00:50 INFO Your MLflow login token is valid until 2024-11-12 11:00:50 UTC
2024-10-14 11:00:50 INFO ✅ Successfully logged in to MLflow. Happy logging!

2.3. Log in validity

You'll notice that in the example above, the second login did not ask for a seed token from the website. That is because your login token is valid for 30 days.

Furthermore, every training run you start within that 30-day period extends the token for another 30 days. So as long as you start at least one training run every 30 days, you do not have to log in again.

As the example above shows, the login command tells you until when your token is valid. If you are unsure whether your current token is still valid before starting a training run, just run the login command again.

It is good practice to run mlflow login before starting a training run, just to make sure you have a valid token. Otherwise your training run will crash.
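If you capture the output of the login command in a script, the expiry time can be read back from the "valid until" line. The sketch below is a hypothetical helper, not something anemoi-training provides; it assumes the log line keeps the exact format shown in the examples above.

```python
import re
from datetime import datetime, timezone

# Pattern for the expiry line printed by the login command in the examples above.
VALID_UNTIL = re.compile(r"valid until (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC")


def token_expiry(log_line: str) -> datetime:
    """Extract the token expiry time (UTC) from a login log line."""
    match = VALID_UNTIL.search(log_line)
    if match is None:
        raise ValueError("no 'valid until ... UTC' timestamp found")
    naive = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


def days_remaining(log_line: str, now: datetime) -> float:
    """Days of validity left at time `now` (negative if already expired)."""
    return (token_expiry(log_line) - now).total_seconds() / 86400


line = "2024-10-14 11:00:50 INFO Your MLflow login token is valid until 2024-11-12 11:00:50 UTC"
now = datetime(2024, 11, 1, 11, 0, 50, tzinfo=timezone.utc)
print(days_remaining(line, now))  # 11.0
```

A job script could refuse to submit when the remaining validity is shorter than the expected run time, although running the login command again remains the simplest check.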

2.4. Syncing MLflow runs

To make it easier to upload offline runs stored on a local filesystem (i.e. runs where the offline flag is set to True) to a remote MLflow server, we have developed a custom mlflow sync command, which is part of anemoi-training. This command is based on the open-source library mlflow-export-import, which provides utilities for exporting and importing information between MLflow servers.

2.4.1. Usage

The mlflow sync command is a command-line interface (CLI) tool that allows you to synchronize runs between MLflow servers.

anemoi-training mlflow sync -s <SOURCE_SERVER> -r <RUN_ID> -d <DESTINATION_SERVER> -e <EXPERIMENT_NAME> -a

If you forget to pass the authentication flag (-a) and the DESTINATION_SERVER requires it, the command fails with a connection error like: ConnectionError: Could not connect to MLflow server at https://mlflow.ecmwf.int The server may require authentication did you forget to turn it on?
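When syncing from a script, it can help to assemble the command in one place so that the -a flag is not forgotten. This is a minimal sketch using only the options shown in the usage line above; build_sync_command is a hypothetical helper and the source path is a placeholder.

```python
import shlex


def build_sync_command(source, run_id, destination, experiment, authenticate=True):
    """Assemble the argument list for `anemoi-training mlflow sync`."""
    cmd = [
        "anemoi-training", "mlflow", "sync",
        "-s", source,
        "-r", run_id,
        "-d", destination,
        "-e", experiment,
    ]
    if authenticate:
        cmd.append("-a")  # authentication is on by default in this helper
    return cmd


cmd = build_sync_command(
    source="/path/to/logs/mlflow",  # hypothetical local MLflow directory
    run_id="f92cafb62a2a4ba88640571eee547940",
    destination="https://mlflow.ecmwf.int",
    experiment="aifs-deterministic-benchmark",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once you have a valid login token.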

You can access help information for the command by using the -h option:

anemoi-training mlflow sync -h

This command will display a help message with details about all available options and usage examples.

Note: this command requires the mlflow-export-import library. The library cannot be installed directly from PyPI (the installation does not work properly), so the first time you run the command you will be asked to follow the instructions in the mlflow-export-import GitHub repo to install it properly.

Here is an example of how to use the command:

$ anemoi-training mlflow sync -s /leonardo_work/DestE_340_24/output/aprieton/logs/mlflow -r f92cafb62a2a4ba88640571eee547940 -d https://mlflow.ecmwf.int -e aifs-deterministic-benchmark -a
 
2024-09-11 10:24:21 - INFO Ussing default logging config with log file 'scratch_local/aprieton_1de9w9sx'
2024-09-11 10:24:21 - INFO 🌐 Logging in to https://mlflow.ecmwf.int
2024-09-11 10:24:21 - INFO Your MLflow login token is valid until 2024-11-08 11:08:05 UTC
2024-09-11 10:24:21 - INFO ✅ Successfully logged in to MLflow. Happy logging!
2024-09-11 10:24:21 - INFO - Access token refreshed: 58 milliseconds.
2024-09-11 10:24:32 - INFO - Exporting run: {'run_id': 'f92cafb62a2a4ba88640571eee547940', 'lifecycle_stage': 'active', 'experiment_id': '185030150257189948'}
2024-09-11 10:26:03 - INFO - Imported run c38393cc07ca4d42931c02ca854ddce8 into experiment aifs-deterministic-benchmark


When you log runs offline, MLflow usually stores them in a directory tree: under mlflow/ there is one folder per experiment, and inside each experiment folder one folder per run. For example, 803589923224981412 would be the experiment folder for 'aifs_debug', and c76a59cd015c4ecf97bea9e805bb3845 the ID of a run that was logged offline and could be synced.
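To find the run IDs you could pass to -r, you can walk that directory tree. The sketch below assumes the layout described above (experiment folders containing run folders named by a 32-character hexadecimal run ID); find_offline_runs is a hypothetical helper and the demo uses a throwaway directory rather than a real log location.

```python
import re
import tempfile
from pathlib import Path

# Run IDs on this page are 32 lowercase hexadecimal characters.
RUN_ID = re.compile(r"^[0-9a-f]{32}$")


def find_offline_runs(mlflow_dir):
    """Return sorted (experiment_folder, run_id) pairs found under mlflow_dir."""
    runs = []
    for exp_dir in Path(mlflow_dir).iterdir():
        # Skip plain files and hidden folders.
        if not exp_dir.is_dir() or exp_dir.name.startswith("."):
            continue
        for run_dir in exp_dir.iterdir():
            if run_dir.is_dir() and RUN_ID.match(run_dir.name):
                runs.append((exp_dir.name, run_dir.name))
    return sorted(runs)


# Demo on a temporary directory that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "mlflow" / "803589923224981412" / "c76a59cd015c4ecf97bea9e805bb3845"
    run.mkdir(parents=True)
    print(find_offline_runs(Path(tmp) / "mlflow"))
# [('803589923224981412', 'c76a59cd015c4ecf97bea9e805bb3845')]
```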

2.4.2. Functionality


3. Logging from different codebases

Those working in a different codebase but who want to log to our server have two options:

  1. Log offline to a local directory and then use anemoi-training mlflow sync
  2. Use the AnemoiMlflowClient provided in anemoi-training

In both cases, you will need to install anemoi-training in your environment. It is likely that you do not want all of anemoi-training's dependencies installed. 

Unless your codebase is compatible with anemoi-training, it's recommended to install it without dependencies:

$ pip install anemoi-training --no-deps
$ pip install anemoi-utils mlflow

You can use the custom mlflow client with authentication turned on like this:

from anemoi.training.diagnostics.mlflow.client import AnemoiMlflowClient
 
client = AnemoiMlflowClient("https://mlflow.ecmwf.int", authentication=True)
 
# do regular mlflow client things
client.search_experiments()
client.log_artifact(...)

The procedure for logging in and keeping a valid token still applies, so don't forget to run anemoi-training mlflow login before starting your experiment.

4. Guidelines and best practices