1. Overview

From October 2024, we will start using a new MLflow server managed by the Platform Engineering Team (PET): https://mlflow.ecmwf.int

An overview of the deployment can be found here. This new server uses the ECMWF SSO. You will now have to authenticate with your ECMWF account. This page walks you through the setup and the authentication process.

Carefully read through the following sections, as failing to authenticate will cause your training runs to crash.

2. Logging to the new server with anemoi-training

Important: Before running any of the code below, make sure you update anemoi-training to the latest version (0.2), available on PyPI.

2.1. Config

In your diagnostics config, set the following entries. These are the same as with the old server, with the addition of authentication, which must be set to True.

Please keep log_model: False. Enabling it uploads the training checkpoints to MLflow, which we don't want for now.

diagnostics/eval_rollout.yaml

log:
  mlflow:
    enabled: True
    offline: False
    authentication: True
    log_model: False
    tracking_uri: https://mlflow.ecmwf.int
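
The entries above can be checked before a run is submitted. The following is a minimal sketch, assuming the diagnostics config has already been loaded into a plain dictionary; check_mlflow_config is a hypothetical helper and not part of anemoi-training.

```python
# Hypothetical pre-flight check; NOT part of anemoi-training.
# Expected values are the ones listed in the config above (online logging).
EXPECTED_MLFLOW_SETTINGS = {
    "enabled": True,
    "offline": False,
    "authentication": True,
    "log_model": False,
    "tracking_uri": "https://mlflow.ecmwf.int",
}


def check_mlflow_config(diagnostics_cfg: dict) -> list:
    """Return a list of human-readable problems; empty if everything matches."""
    mlflow_cfg = diagnostics_cfg.get("log", {}).get("mlflow", {})
    problems = []
    for key, expected in EXPECTED_MLFLOW_SETTINGS.items():
        actual = mlflow_cfg.get(key)
        if actual != expected:
            problems.append(f"log.mlflow.{key} is {actual!r}, expected {expected!r}")
    return problems


# Example: a config where authentication was left switched off.
example_cfg = {
    "log": {
        "mlflow": {
            "enabled": True,
            "offline": False,
            "authentication": False,
            "log_model": False,
            "tracking_uri": "https://mlflow.ecmwf.int",
        }
    }
}
example_problems = check_mlflow_config(example_cfg)
for problem in example_problems:
    print(problem)
```

For runs that are deliberately logged offline, the expected value of offline would need to be changed accordingly.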

2.2. Logging in

Before starting a training run, you need to authenticate yourself to the MLflow server and obtain a token. A valid token is required before starting training.

This is done with the anemoi-training mlflow login command.

The first time you run the command, you need to pass the URL with --url and you need to obtain a seed token:

$ anemoi-training mlflow login --url https://mlflow.ecmwf.int

2024-10-14 10:54:10 INFO 🌐 Logging in to https://mlflow.ecmwf.int
2024-10-14 10:54:10 INFO 📝 Please obtain a seed refresh token from https://mlflow.ecmwf.int/seed
2024-10-14 10:54:10 INFO 📝 and paste it here (you will not see the output, just press enter after pasting):
Refresh Token: ***
2024-10-14 11:00:17 INFO Your MLflow login token is valid until 2024-11-12 11:00:17 UTC
2024-10-14 11:00:17 INFO ✅ Successfully logged in to MLflow. Happy logging!

On subsequent logins, you can drop --url and call the command without any options; it will reuse the URL from the previous login. If you ever need to change the URL, just pass --url again.

$ anemoi-training mlflow login

2024-10-14 11:00:49 INFO 🌐 Logging in to https://mlflow.ecmwf.int
2024-10-14 11:00:50 INFO Your MLflow login token is valid until 2024-11-12 11:00:50 UTC
2024-10-14 11:00:50 INFO ✅ Successfully logged in to MLflow. Happy logging!

2.3. Log in validity

You'll notice that in the above example, the second time the login command did not ask for a seed token from the website. That is because your login token is valid for 30 days.

Furthermore, every training run you start within that 30-day period extends the token for another 30 days. So as long as you start at least one training run every 30 days, you do not have to log in again with a seed token.

As you can see in the example above, it will tell you how long your token is valid for. If you are unsure whether your current token is still valid before starting a training run, just run the login command again.

It is good practice to run mlflow login before starting a training run, just to make sure you have a valid token. Otherwise your training run will crash.
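
If you capture the output of the login command, for example in a job script, the expiry date can be read from the "valid until" line. The sketch below assumes the log line format shown in the examples above; parse_token_expiry and days_remaining are hypothetical helpers and not part of anemoi-training.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches the "valid until" line as it appears in the example output above.
# Assumption: the timestamp format stays "YYYY-MM-DD HH:MM:SS UTC".
_EXPIRY_PATTERN = re.compile(
    r"token is valid until (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC"
)


def parse_token_expiry(login_output: str) -> Optional[datetime]:
    """Extract the token expiry (as an aware UTC datetime), or None if absent."""
    match = _EXPIRY_PATTERN.search(login_output)
    if match is None:
        return None
    expiry = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return expiry.replace(tzinfo=timezone.utc)


def days_remaining(expiry: datetime, now: datetime) -> float:
    """Days left before the token expires (negative if already expired)."""
    return (expiry - now).total_seconds() / 86400


line = (
    "2024-10-14 11:00:50 INFO Your MLflow login token is valid "
    "until 2024-11-12 11:00:50 UTC"
)
expiry = parse_token_expiry(line)
now = datetime(2024, 11, 10, 11, 0, 50, tzinfo=timezone.utc)
print(days_remaining(expiry, now))  # 2.0
```

In practice, now would be datetime.now(timezone.utc), and a job script could refuse to start when the remaining time is shorter than the expected run time.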

2.4. Syncing MLflow runs

To make it easier to upload offline runs stored in a local filesystem (i.e., runs where the offline flag is set to True) to a remote MLflow server, we have developed a custom mlflow sync command as part of anemoi-training. This command is based on the open-source library mlflow-export-import, which provides utilities for exporting and importing information between MLflow servers.

2.4.1. Usage

The mlflow sync command is a command-line interface (CLI) tool that allows you to synchronize runs between MLflow servers.

anemoi-training mlflow sync -s <SOURCE_SERVER> -r <RUN_ID> -d <DESTINATION_SERVER> -e <EXPERIMENT_NAME> -a

-s <SOURCE_SERVER>: URL of the source MLflow server where the offline runs are stored. This can be either a remote server or a path on the local filesystem that contains offline runs.
-r <RUN_ID>: Unique identifier of the run to be synced.
-d <DESTINATION_SERVER>: URL of the destination MLflow server where the runs will be logged.
-e <EXPERIMENT_NAME>: Name of the experiment on the destination server where the run data will be stored.
-a: (Optional) Additional flag required if the destination server requires authentication

If you forget to pass the authentication flag and the destination server requires it, the command fails with a connection error such as: ConnectionError: Could not connect to MLflow server at https://mlflow.ecmwf.int. The server may require authentication, did you forget to turn it on?
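
When syncing from a script, the command line can be assembled programmatically so that the -a flag is not forgotten. This is a minimal sketch using only the flags documented above; build_sync_command is a hypothetical helper and the source path is a placeholder.

```python
import shlex


def build_sync_command(source, run_id, destination, experiment_name, authenticate=True):
    """Assemble the argument list for `anemoi-training mlflow sync`.

    Only the flags documented above (-s, -r, -d, -e, -a) are used.
    """
    command = [
        "anemoi-training", "mlflow", "sync",
        "-s", str(source),
        "-r", run_id,
        "-d", destination,
        "-e", experiment_name,
    ]
    if authenticate:
        # Needed when the destination server requires authentication.
        command.append("-a")
    return command


command = build_sync_command(
    source="/path/to/logs/mlflow",  # placeholder, replace with your own path
    run_id="f92cafb62a2a4ba88640571eee547940",
    destination="https://mlflow.ecmwf.int",
    experiment_name="aifs-deterministic-benchmark",
)
# Print the command for inspection; it is not executed here.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) to actually run the sync.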

You can access help information for the command by using the -h option:

anemoi-training mlflow sync -h

This displays a help message with details about all available options and usage examples.

Note: this command requires the mlflow-export-import library to be installed. The library can't be installed directly from PyPI (the installation does not work properly), so the first time you run the command you will be asked to follow the instructions in the mlflow-export-import GitHub repo to install it properly.

Here is an example of how to use the command:

$ anemoi-training mlflow sync -s /leonardo_work/DestE_340_24/output/aprieton/logs/mlflow -r f92cafb62a2a4ba88640571eee547940 -d https://mlflow.ecmwf.int -e aifs-deterministic-benchmark -a

2024-09-11 10:24:21 - INFO Ussing default logging config with log file 'scratch_local/aprieton_1de9w9sx'
2024-09-11 10:24:21 - INFO 🌐 Logging in to https://mlflow.ecmwf.int
2024-09-11 10:24:21 - INFO Your MLflow login token is valid until 2024-11-08 11:08:05 UTC
2024-09-11 10:24:21 - INFO ✅ Successfully logged in to MLflow. Happy logging!
2024-09-11 10:24:21 - INFO - Access token refreshed: 58 milliseconds.
2024-09-11 10:24:32 - INFO - Exporting run: {'run_id': 'f92cafb62a2a4ba88640571eee547940', 'lifecycle_stage': 'active', 'experiment_id': '185030150257189948'}
2024-09-11 10:26:03 - INFO - Imported run c38393cc07ca4d42931c02ca854ddce8 into experiment aifs-deterministic-benchmark

When runs are logged offline with MLflow, the local log directory usually looks like the screenshot below: under mlflow/ there is one folder per experiment, and each experiment folder contains one folder per run. In this example, 803589923224981412 is the experiment folder for 'aifs_debug', and c76a59cd015c4ecf97bea9e805bb3845 is the ID of a run that was logged offline and could be synced.

[Screenshot 2024-04-03 at 15.22.34.png: example directory layout of an offline MLflow log folder]
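
To find the run IDs available for syncing, the experiment and run folders described above can be listed with a few lines of Python. This is a sketch only: it assumes that run IDs are 32-character hexadecimal folder names, as in the examples on this page, and list_offline_runs is a hypothetical helper.

```python
import re
import tempfile
from pathlib import Path

# Assumption: run IDs are 32-character lowercase hex strings, as in the examples.
_RUN_ID = re.compile(r"^[0-9a-f]{32}$")


def list_offline_runs(mlflow_dir):
    """Map each experiment folder name to the run IDs found directly inside it."""
    runs = {}
    for experiment_dir in sorted(Path(mlflow_dir).iterdir()):
        if not experiment_dir.is_dir():
            continue
        run_ids = sorted(
            child.name
            for child in experiment_dir.iterdir()
            if child.is_dir() and _RUN_ID.match(child.name)
        )
        if run_ids:
            runs[experiment_dir.name] = run_ids
    return runs


# Demonstration on a throwaway directory that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "mlflow"
    (root / "803589923224981412" / "c76a59cd015c4ecf97bea9e805bb3845").mkdir(parents=True)
    (root / "803589923224981412" / "notes.txt").touch()  # plain files are ignored
    found = list_offline_runs(root)
print(found)
```

Each run ID returned this way can be passed to the -r option of anemoi-training mlflow sync.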

2.4.2. Functionality

Offline Run Syncing: This command lets users synchronize offline MLflow runs with a remote server, so that all logged data is preserved and available for future analysis and model tracking. It also addresses the fact that the offline run_id cannot be kept when syncing a run, since MLflow does not allow a run_id to be passed when starting a new run. The implemented solution works as follows:
- In all cases (resumed, forked, and new runs), the offline runs will have a tag 'OfflineRun:True'.
- To sync resumed offline runs:
  - When we sync an offline resumed run, a child run appears, just as for an online run. The main difference is that there are two lineages: an online lineage pointing to the run_ids on the remote server, and an offline lineage pointing to the run_id in the local filesystem (that is, the run folders within the experiment folder).
- To sync forked offline runs:
  - When we sync an offline forked run, a new entry appears in the table with similar tags to an online run. Again, the main difference is that there is an online lineage pointing to the run_ids on the remote server and an offline lineage pointing to the run_id in the local filesystem (that is, the run folders within the experiment folder).
  - In a forked run, offline.fork_run_id should match the offline.run_id of the baseline branch, and fork_run_id should match the run_id (online lineage) of the baseline branch.
Server to Server Syncing: The command can also be used to port runs between two remote servers. So far, this is only supported if the source server does not require authentication.
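
The consistency rule for forked offline runs described above (both the offline and the online lineage of the fork must point at the baseline run) can be written as a small check. This is a sketch under assumptions: each run is represented as a flat dictionary using the key names mentioned above, the IDs are made-up placeholders, and how these values are actually stored on the server may differ.

```python
def fork_lineage_is_consistent(baseline: dict, fork: dict) -> bool:
    """Check both lineage conditions for a synced forked run.

    Assumption: each run is a flat dict with the keys named in the text
    ("run_id", "offline.run_id", "fork_run_id", "offline.fork_run_id").
    """
    offline_parent = fork.get("offline.fork_run_id")
    online_parent = fork.get("fork_run_id")
    if offline_parent is None or online_parent is None:
        # Not a (fully tagged) forked run.
        return False
    return (
        offline_parent == baseline.get("offline.run_id")
        and online_parent == baseline.get("run_id")
    )


# Placeholder IDs for illustration only.
baseline_run = {"run_id": "online-baseline", "offline.run_id": "offline-baseline"}
forked_run = {
    "run_id": "online-fork",
    "offline.run_id": "offline-fork",
    "fork_run_id": "online-baseline",
    "offline.fork_run_id": "offline-baseline",
}
print(fork_lineage_is_consistent(baseline_run, forked_run))  # True
```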

3. Logging from different codebases

Those working in a different codebase but who want to log to our server have two options:

Log offline to a local directory and then use anemoi-training mlflow sync
Use the AnemoiMlflowClient provided in anemoi-training

In both cases, you will need to install anemoi-training in your environment. It is likely that you do not want all of anemoi-training's dependencies installed.

Unless your codebase is compatible with anemoi-training, it's recommended to install it without dependencies:

$ pip install anemoi-training --no-deps
$ pip install anemoi-utils mlflow

You can use the custom mlflow client with authentication turned on like this:

from anemoi.training.diagnostics.mlflow.client import AnemoiMlflowClient

client = AnemoiMlflowClient("https://mlflow.ecmwf.int", authentication=True)

# do regular mlflow client things
client.search_experiments()
client.log_artifact(...)

The procedure for logging in and keeping a valid token still applies, so don't forget to run anemoi-training mlflow login before starting your experiment.

4. Guidelines and best practices

No underscores in experiment names.
No personal or individual names in experiment names. Organisation names are fine, e.g.: dwd-anemoi is ok.
Delete old and unwanted runs to keep the server clean.
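
The first guideline is easy to check automatically before creating an experiment. This is a minimal sketch with a hypothetical helper; the rule about personal names cannot be checked mechanically and is left to the user.

```python
def follows_naming_guideline(experiment_name: str) -> bool:
    """Return True if the name is non-empty and contains no underscores.

    Only the "no underscores" guideline is checked; whether a name contains
    a personal name has to be judged by a human.
    """
    return bool(experiment_name) and "_" not in experiment_name


for name in ["aifs-deterministic-benchmark", "dwd-anemoi", "aifs_debug"]:
    print(name, follows_naming_guideline(name))
```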