Context

MLflow main functionalities:

- Experiment Tracking: log and query parameters, metrics and artifacts for each run
- Projects: package code in a reusable and reproducible way
- Models: a standard format to package trained models for different serving tools
- Model Registry: a central store to manage model versions and their lifecycle

For now we are mostly using the experiment tracking functionality, so that will be the focus for the rest of the page.

MLflow Experiment Tracking

To track experiments in MLflow we can use either the local filesystem or a remote tracking server as the backend.

Currently we are using a centralised remote server. Check the Remote Servers section below for information about the remote servers currently available.
The choice between a remote server and the local filesystem is made via the 'tracking_uri' variable (config.diagnostics.eval_rollout.log.mlflow.tracking_uri). When the tracking_uri points to a local path on our filesystem, MLflow saves the logged metrics/artifacts there, while if we pass an https address the logged metrics/artifacts are saved in the database of that server.
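
For illustration, a minimal sketch of the two modes using the plain mlflow Python API (the path, the server address and the logged names are placeholders, not the actual AIFS configuration; in practice this is handled through the config and the logger):

import mlflow

# Local filesystem backend: metrics/artifacts are written under this directory
mlflow.set_tracking_uri("/path/to/local/mlflow_logs")  # placeholder path

# Remote backend: metrics/artifacts are stored in that server's database
# mlflow.set_tracking_uri("https://my-mlflow-server.example.int")  # placeholder URL

mlflow.set_experiment("aifs_debug")
with mlflow.start_run(run_name="example_run"):
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_metric("val_loss", 0.42, step=1)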

To start the MLflow experiment tracking UI when using a local filesystem we can do:

'mlflow ui --backend-store-uri={path_mlflow_logs} --port=5000' 

(see 'mlflow ui --help' for all possible flags/options)


Pictures from 'https://mlflow.org/docs/latest/tracking.html'

Whether using a local server or a remote server, the MLflow interface to check the tracked experiments will look like this:

It's possible to compare 'mlflow runs' between different experiments and within the same experiment.
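
Runs can also be queried programmatically; a minimal sketch with mlflow.search_runs (the experiment names and the metric filter are hypothetical, and the experiment_names argument requires a recent mlflow version):

import mlflow

# Collect runs from one or several experiments into a pandas DataFrame
runs = mlflow.search_runs(
    experiment_names=["aifs_debug", "aifs_ensembles"],  # hypothetical experiment names
    filter_string="metrics.val_loss < 1.0",             # hypothetical metric filter
    order_by=["metrics.val_loss ASC"],
)
print(runs[["run_id", "metrics.val_loss"]].head())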

Namespaces

Within the MLflow experiments tab, it's possible to define different namespaces. To create a new namespace one just needs to pass an 'experiment_name' (config.diagnostics.eval_rollout.log.mlflow.experiment_name) to the MLflow logger (see the AIFSMLflowLogger section below).

Please choose meaningful experiment_names. Having several namespaces is better than having a single huge one, since the latter might take a while to load.

For example, in AIFS we have runs associated with ensembles, diffusion models, deterministic models, etc. Each of those could have a different namespace, while all of them could still share a tag 'ProjectName': 'AIFS'.

This tag could be changed when logging data from other 'projects', such as 'observations'.
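
As an illustration, a minimal sketch with the plain mlflow API (the experiment name and run name are hypothetical; in AIFS these values are set through the logger configuration described below):

import mlflow

# One namespace (experiment) per model family, with a shared project tag
mlflow.set_experiment("aifs_ensembles")  # hypothetical experiment_name

with mlflow.start_run(run_name="ens_rollout_test"):  # hypothetical run name
    mlflow.set_tag("ProjectName", "AIFS")
    mlflow.log_metric("val_loss", 0.42)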


https://mlflow.org/docs/latest/getting-started/logging-first-model/step3-create-experiment.html

https://mlflow.org/docs/latest/traditional-ml/hyperparameter-tuning-with-child-runs/part1-child-runs.html

Parent-Child Runs

In the experiment tracking UI, the runs appear under their 'run_name'. When we click on one of them, we can see a few more parameters:

The MLflow run_name can be modified directly from the UI. The MLflow run ID is a unique identifier for each run within the MLflow tracking system. It is crucial for referencing specific runs, comparing results and deploying models. While in W&B it was possible to specify the run_id, this is not possible in MLflow: if you pass a run_id, MLflow understands that it refers to an existing run.

In that sense, when we fork or resume a run we have the following:

RESUMING RUN


FORKING A RUN

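A minimal sketch of how resuming and parent-child runs look with the plain mlflow API (the run id is a placeholder; in AIFS this is handled by the AIFSMLflowLogger rather than called directly):

import mlflow

# Resuming: passing an existing run_id re-opens that run and keeps logging to it
with mlflow.start_run(run_id="existing_run_id_here"):  # placeholder run_id
    mlflow.log_metric("val_loss", 0.40, step=100)

# Parent-child: a new child run gets its own run_id assigned by MLflow and is
# linked to the enclosing parent run via nesting
with mlflow.start_run(run_name="parent_run"):
    with mlflow.start_run(run_name="child_run", nested=True):
        mlflow.log_metric("val_loss", 0.39, step=0)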

https://mlflow.org/docs/latest/getting-started/index.html#mlflow-tracking



Remote Servers

- Dev Server 

(temporary server until we get an official Production one)

Currently the Dev server supports traffic coming from ATOS and Leonardo, but not from LUMI.

- Production Server 

To be completed when we have more details.

MLFlowLogger - PyTorch Lightning

Runs are logged into the MLflow tracking UI using the 'AIFSMLflowLogger', a custom implementation which inherits from the existing 'MLFlowLogger' provided with PyTorch Lightning.

The inputs that need to be passed to the AIFSMLflowLogger can be found in aifs/config/diagnostics/eval_rollout.yaml:




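As an illustration of what sits underneath, a minimal sketch using the base PyTorch Lightning MLFlowLogger (all argument values are placeholders; the AIFSMLflowLogger takes its actual values from the yaml config above):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import MLFlowLogger

# Base class that AIFSMLflowLogger inherits from; values below are placeholders
mlf_logger = MLFlowLogger(
    experiment_name="aifs_debug",
    run_name="example_run",
    tracking_uri="https://my-mlflow-server.example.int",  # or a local path
    tags={"ProjectName": "AIFS"},
)

trainer = Trainer(logger=mlf_logger, max_epochs=1)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule defined elsewhere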
To compare runs one just needs to select the runs to be compared and click on the 'Compare' button:
 

This Excel sheet contains an MLflow link generator to create a link leading directly to a comparison of specified metrics for specified runs, without having to go through the UI.

mlflow-sync

As part of the MLflow ecosystem there is an open-source side library called mlflow-export-import (https://github.com/mlflow/mlflow-export-import) that provides tooling to export/import information between servers. Based on this library we have implemented our own 'mlflow-sync' command, which can be installed since it is now part of setup.py. While the library provides some useful tools, syncing a run from one server to another requires first exporting it and then importing it. 'mlflow-sync' was created to avoid those two separate steps (export and later import): it does everything in one step.

The inputs that need to be passed to 'mlflow-sync' can be found in config/mlflow_sync.yaml:

Note - when logging runs offline with MLflow the local store usually looks like this: after the mlflow/ folder there is one folder per experiment, and inside it one folder per run.

In this case 803589923224981412 would be the experiment folder for 'aifs_debug', and 'c76a59cd015c4ecf97bea9e805bb3845' the run id of a run logged offline that could be synced (i.e. mlflow/803589923224981412/c76a59cd015c4ecf97bea9e805bb3845/).


** While the mlflow-export-import library is listed as a dependency in setup.py, Mariana experienced some issues and had to install it using:

pip install git+https://github.com/amesar/mlflow-export-import/#egg=mlflow-export-import
!Please report if you are also seeing this problem!

Note - this command won't allow syncing the same parent run or forked run twice, as it will detect that the run is already logged to the server (the check does not yet work with resumed child runs).

This command also addresses the fact that we can't keep the 'offline' run_id when syncing a run (since mlflow does not allow you to pass a run_id when starting a new run). The diagram below shows the implemented solution:
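
The diagram is not reproduced here; as a rough, hypothetical illustration of the general idea (not necessarily the exact implemented solution), the offline run can be copied to the server under a newly assigned run_id while keeping the original offline id, for example as a tag:

import mlflow
from mlflow.tracking import MlflowClient

# Hypothetical sketch: copy an offline run to a remote server. The server
# assigns a new run_id; the original offline id is kept as a tag.
offline_client = MlflowClient(tracking_uri="/path/to/offline/mlflow")  # placeholder path
offline_run = offline_client.get_run("c76a59cd015c4ecf97bea9e805bb3845")  # offline run id

mlflow.set_tracking_uri("https://my-mlflow-server.example.int")  # placeholder server
mlflow.set_experiment("aifs_debug")
with mlflow.start_run(run_name=offline_run.info.run_name):
    mlflow.set_tag("offline_run_id", offline_run.info.run_id)  # hypothetical tag name
    for key, value in offline_run.data.params.items():
        mlflow.log_param(key, value)
    # note: run.data.metrics only holds the latest value of each metric
    for key, value in offline_run.data.metrics.items():
        mlflow.log_metric(key, value)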


Limitations and improvements


FAQ

Why do my model metrics look like a constant value or a bar chart?