Table of Contents
- 1. Overview
- 2. Logging to the new server with anemoi-training
- 3. Logging from different codebases
- 4. Guidelines and best practices
1. Overview
From October 2024, we will start using a new MLflow server managed by the Platform Engineering Team (PET): https://mlflow.ecmwf.int
An overview of the deployment can be found here. This new server uses the ECMWF SSO. You will now have to authenticate with your ECMWF account. This page walks you through the setup and the authentication process.
Carefully read through the following sections, as failing to authenticate will cause your training runs to crash.
2. Logging to the new server with anemoi-training
- Important: Before running any of the code below, please ensure that you update anemoi-training to the latest version (0.2) available on PyPI.
2.1. Config
In your diagnostics config, set the following entries. These are the same as with the old server, with the addition of authentication, which needs to be True.
Please keep log_model: False. Enabling this uploads the training checkpoints to MLflow, which we don't want for now.
diagnostics/eval_rollout.yaml
|
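As a quick sanity check before launching a run, the two settings called out above can be verified programmatically. The sketch below assumes the MLflow block of the diagnostics config has already been loaded into a plain dictionary; the function name `check_mlflow_settings` and the flat dictionary layout are illustrative assumptions, not part of anemoi-training.

```python
def check_mlflow_settings(mlflow_cfg: dict) -> list[str]:
    """Return a list of problems found in the MLflow logging settings.

    `mlflow_cfg` is assumed to be the MLflow block of the diagnostics
    config, already loaded into a dict (hypothetical layout).
    """
    problems = []
    # The new server uses ECMWF SSO, so authentication must be switched on.
    if mlflow_cfg.get("authentication") is not True:
        problems.append("authentication must be True")
    # log_model uploads training checkpoints, which is not wanted for now.
    if mlflow_cfg.get("log_model") is not False:
        problems.append("log_model must be False")
    return problems


if __name__ == "__main__":
    example = {"authentication": True, "log_model": False}
    print(check_mlflow_settings(example) or "settings look fine")
```

An empty list means both entries are set as this page asks.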
2.2. Logging in
Before starting a training run, you need to authenticate yourself to the MLflow server and obtain a token. A valid token is required before starting training.
This is done with the anemoi-training mlflow login command.
The first time you run the command, you need to pass the URL with --url and obtain a seed token:
|
Subsequent times, you can drop the --url and just call the command without any options. It will use the URL from last time. If you ever need to change the URL, just pass --url again:
|
2.3. Log in validity
You'll notice that in the above example, the second time the login command did not ask for a seed token from the website. That is because your login token is valid for 30 days.
Furthermore, every training run you start inside that 30-day period will extend the token for another 30 days. So as long as you start at least one training run every 30 days, you do not have to log in again.
As you can see in the example above, it will tell you how long your token is valid for. If you are unsure whether your current token is still valid before starting a training run, just run the login command again.
It is good practice to run mlflow login before starting a training run, just to make sure you have a valid token. Otherwise your training run will crash.
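The validity rule can be made concrete with a little date arithmetic. The sketch below only models the behaviour described in this section and does not read or modify a real token. It assumes that a run started while the token is valid moves the expiry to 30 days after the run start; the text above does not say exactly where the new 30 days are counted from, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

TOKEN_LIFETIME = timedelta(days=30)  # validity period stated in this section


def expiry_after_login(login_time: datetime) -> datetime:
    """Expiry of a token obtained at `login_time`."""
    return login_time + TOKEN_LIFETIME


def expiry_after_run(current_expiry: datetime, run_start: datetime) -> datetime:
    """Expiry after starting a training run at `run_start`.

    Assumption: a run started while the token is valid extends it to
    30 days from the run start. A run started after expiry cannot extend
    the token (a new login is needed), so the expiry is returned unchanged.
    """
    if run_start > current_expiry:
        return current_expiry
    return run_start + TOKEN_LIFETIME


if __name__ == "__main__":
    expiry = expiry_after_login(datetime(2024, 10, 1))
    expiry = expiry_after_run(expiry, datetime(2024, 10, 20))
    print("token valid until", expiry.date())
```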
2.4. Syncing MLflow runs
To make it easier to log offline runs stored in a local filesystem (i.e. runs where the offline flag is set to True) to a remote MLflow server, we have developed a custom mlflow sync command, which is part of anemoi-training.
This command is based on the open-source library mlflow-export-import, which provides utilities for exporting and importing information between MLflow servers.
2.4.1. Usage
The mlflow sync command is a command-line interface (CLI) tool that allows you to synchronize runs between MLflow servers.
|
- -s <SOURCE_SERVER>: URL of the source MLflow server where the offline runs are stored. This can be either a remote server or a local filesystem where we have some offline runs.
- -r <RUN_ID>: Unique identifier of the run to be synced.
- -d <DESTINATION_SERVER>: URL of the destination MLflow server where the runs will be logged.
- -e <EXPERIMENT_NAME>: Name of the experiment on the destination server where the run data will be stored.
- -a: (Optional) Additional flag required if the destination server requires authentication.
If you forget to pass the authentication flag and the DESTINATION_SERVER requires it, the code will show a connection error like:
ConnectionError: Could not connect to MLflow server at https://mlflow.ecmwf.int The server may require authentication did you forget to turn it on?
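When syncing several runs from a script, it can help to assemble the command programmatically. The sketch below builds the argument list from the options described above without executing anything. `build_sync_command` is a hypothetical helper, the source path and experiment name are placeholders, and only the flags listed in this section are used.

```python
import shlex


def build_sync_command(source, run_id, destination, experiment, authentication=False):
    """Assemble the argument list for `anemoi-training mlflow sync`."""
    cmd = [
        "anemoi-training", "mlflow", "sync",
        "-s", source,        # source server or local offline directory
        "-r", run_id,        # run to be synced
        "-d", destination,   # destination MLflow server
        "-e", experiment,    # experiment name on the destination
    ]
    if authentication:
        cmd.append("-a")     # needed when the destination requires authentication
    return cmd


if __name__ == "__main__":
    cmd = build_sync_command(
        source="/path/to/offline/mlflow",      # placeholder path
        run_id="<RUN_ID>",                     # placeholder run ID
        destination="https://mlflow.ecmwf.int",
        experiment="my-experiment",            # placeholder experiment name
        authentication=True,
    )
    print(shlex.join(cmd))
```

Once a valid token is in place, the resulting list could be passed to `subprocess.run(cmd, check=True)`.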
You can access help information for the command by using the -h option:
|
This command will display a help message with details about all available options and usage examples.
Note: This command requires the mlflow-export-import library to be installed. This library can't be installed directly from PyPI (the installation does not work properly), so the first time you run the command you will be asked to follow the instructions in the mlflow-export-import GitHub repo to install it properly.
Here is an example of how to use the command:
|
When runs are logged offline with MLflow, the directory usually looks like the example above: after mlflow/ there is a folder for the experiment, and inside it a folder for each run. In this case, 803589923224981412 is the experiment folder for 'aifs_debug', and 'c76a59cd015c4ecf97bea9e805bb3845' is the ID of a run that was logged offline and that we could sync.
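To find out which run IDs are available for syncing, the layout described above can be walked with the standard library. This sketch assumes that every directory directly under the offline root is an experiment folder and that run folders are named with a 32-character hexadecimal ID, as in the example; other folders that may sit alongside them are skipped. The function name is hypothetical.

```python
import re
import tempfile
from pathlib import Path

RUN_ID_PATTERN = re.compile(r"^[0-9a-f]{32}$")  # e.g. c76a59cd015c4ecf97bea9e805bb3845


def list_offline_runs(root):
    """Map each experiment folder under `root` to the run IDs it contains."""
    runs = {}
    for exp_dir in sorted(Path(root).iterdir()):
        if not exp_dir.is_dir():
            continue
        run_ids = sorted(
            p.name for p in exp_dir.iterdir()
            if p.is_dir() and RUN_ID_PATTERN.match(p.name)
        )
        if run_ids:
            runs[exp_dir.name] = run_ids
    return runs


if __name__ == "__main__":
    # Build a throwaway directory mimicking the example above.
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp) / "803589923224981412" / "c76a59cd015c4ecf97bea9e805bb3845"
        run_dir.mkdir(parents=True)
        print(list_offline_runs(tmp))
```

Each run ID returned this way is a candidate value for the -r option of the sync command.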
2.4.2. Functionality
- Offline Run Syncing: This command allows users to synchronize offline MLflow runs with a remote server, ensuring that all logged data is preserved and available for future analysis and model tracking. It also addresses the fact that the 'offline' run_id cannot be kept when syncing a run (since MLflow does not allow you to pass a run_id when starting a new run). The diagram below shows the implemented solution:
- In all cases (resumed, forked and new runs), the offline runs will have the tag 'OfflineRun:True'.
- To sync resumed offline runs:
- When we sync an offline resumed run, we also see a child run, as we do online. The main difference is that there is an online lineage pointing to the run_ids on the remote server, and an offline lineage pointing to the run_id in the local filesystem (i.e. the run folders within our experiment folder).
- To sync forked offline runs:
- When we sync an offline forked run, we also see a new entry in the table with similar tags to those of an online fork. The main difference is that there is an online lineage pointing to the run_ids on the remote server, and an offline lineage pointing to the run_id in the local filesystem (i.e. the run folders within our experiment folder).
- In a forked run, the offline.fork_run_id should match the offline.run_id of the baseline branch, and the fork_run_id should match the run_id (online lineage) of the baseline branch.
- Server to Server Syncing: The command can also be used to port runs between two remote servers. Note that so far this is only supported if the source server does not require authentication.
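The lineage rule for forked runs described above can be expressed as a small consistency check. The sketch below works on plain dictionaries standing in for the IDs and tags of two synced runs. The key names follow the ones mentioned above, but the dictionary layout, the function name and the example IDs are assumptions made for illustration.

```python
def fork_lineage_is_consistent(fork: dict, baseline: dict) -> bool:
    """Check that a synced fork points at its baseline in both lineages."""
    offline_parent = fork.get("offline.fork_run_id")
    online_parent = fork.get("fork_run_id")
    # Both lineage entries must be present on the forked run.
    if offline_parent is None or online_parent is None:
        return False
    return (
        offline_parent == baseline.get("offline.run_id")
        and online_parent == baseline.get("run_id")
    )


if __name__ == "__main__":
    # Made-up IDs, for illustration only.
    baseline = {"run_id": "online-base", "offline.run_id": "offline-base"}
    fork = {
        "run_id": "online-fork",
        "offline.run_id": "offline-fork",
        "fork_run_id": "online-base",
        "offline.fork_run_id": "offline-base",
    }
    print(fork_lineage_is_consistent(fork, baseline))
```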
3. Logging from different codebases
Those working in a different codebase but who want to log to our server have two options:
- Log offline to a local directory and then use anemoi-training mlflow sync
- Use the AnemoiMlflowClient provided in anemoi-training
In both cases, you will need to install anemoi-training in your environment. It is likely that you do not want all of anemoi-training's dependencies installed.
Unless your codebase is compatible with anemoi-training, it's recommended to install it without dependencies:
|
You can use the custom mlflow client with authentication turned on like this:
|
The procedure around logging in and having a valid token still applies. So don't forget to run anemoi-training mlflow login before starting your experiment.
4. Guidelines and best practices
- No underscores in experiment names.
- No personal or individual names in experiment names. Organisation names are fine, e.g.: dwd-anemoi is ok.
- Delete old and unwanted runs to keep the server clean.
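The first guideline is easy to check automatically before a run is launched. The sketch below covers only the underscore rule, since whether a name refers to a person cannot be detected mechanically. The function name is hypothetical.

```python
def experiment_name_problems(name: str) -> list[str]:
    """Return guideline violations that can be detected automatically."""
    problems = []
    if "_" in name:
        # Suggest the same name with hyphens instead of underscores.
        suggestion = name.replace("_", "-")
        problems.append(f"contains an underscore, consider '{suggestion}'")
    return problems


if __name__ == "__main__":
    for candidate in ("dwd-anemoi", "aifs_debug"):
        print(candidate, experiment_name_problems(candidate) or "ok")
```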