Context

MLflow main functionalities:

- Experiment Tracking: log and query parameters, metrics and artifacts for each run
- Projects: package code in a reusable and reproducible way
- Models: a standard format to package trained models for different serving tools
- Model Registry: a central store to manage model versions and their lifecycle

For now we are mostly using the experiment tracking functionality, so that will be the focus for the rest of the page.

MLflow Experiment Tracking

To track experiments in MLflow we can use either the local filesystem or a remote tracking server as the backend.

Currently we are using a centralised remote server. Check the Remote Servers section below for information about the remote servers currently available.
The choice between a remote server and the local filesystem is made via the 'tracking_uri' variable (config.diagnostics.eval_rollout.log.mlflow.tracking_uri). When the tracking_uri points to a local path on our filesystem, MLflow saves the logged metrics/artifacts there, while if we pass an https address the logged metrics/artifacts are saved in the database of that server.
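
For illustration, a minimal sketch of the two modes using the plain mlflow Python API (the path, the server address and the logged names are placeholders, not the actual AIFS configuration; in practice this is handled through the config and the logger):

import mlflow

# Local filesystem backend: metrics/artifacts are written under this directory
mlflow.set_tracking_uri("/path/to/local/mlflow_logs")  # placeholder path

# Remote backend: metrics/artifacts are stored in that server's database
# mlflow.set_tracking_uri("https://my-mlflow-server.example.int")  # placeholder URL

mlflow.set_experiment("aifs_debug")
with mlflow.start_run(run_name="example_run"):
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_metric("val_loss", 0.42, step=1)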

To start the MLflow experiment tracking UI when using a local filesystem we can do:

'mlflow ui --backend-store-uri={path_mlflow_logs} --port=5000' 

(see 'mlflow ui --help' for all possible flags/options)


Pictures from 'https://mlflow.org/docs/latest/tracking.html'

Whether using a local server or a remote server, the MLflow interface to check the tracked experiments will look like this:

It's possible to compare 'mlflow runs' between different experiments and within the same experiment.
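
Runs can also be queried programmatically; a minimal sketch with mlflow.search_runs (the experiment names and the metric filter are hypothetical, and the experiment_names argument requires a recent mlflow version):

import mlflow

# Collect runs from one or several experiments into a pandas DataFrame
runs = mlflow.search_runs(
    experiment_names=["aifs_debug", "aifs_ensembles"],  # hypothetical experiment names
    filter_string="metrics.val_loss < 1.0",             # hypothetical metric filter
    order_by=["metrics.val_loss ASC"],
)
print(runs[["run_id", "metrics.val_loss"]].head())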

Namespaces

Within the MLflow experiments tab, it's possible to define different namespaces. To create a new namespace one just needs to pass an 'experiment_name' (config.diagnostics.eval_rollout.log.mlflow.experiment_name) to the MLflow logger (see the AIFSMLflowLogger section below).

Please choose meaningful experiment_names. Having several namespaces is better than having a single huge one, since the latter might take a while to load.

For example, in AIFS we have runs associated with ensembles, diffusion models, deterministic models, etc. Each of those could have a different namespace, while all of them could still share a tag 'ProjectName': 'AIFS'.

This tag could be changed when logging data from other 'projects', such as 'observations'.
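
As an illustration, a minimal sketch with the plain mlflow API (the experiment name and run name are hypothetical; in AIFS these values are set through the logger configuration described below):

import mlflow

# One namespace (experiment) per model family, with a shared project tag
mlflow.set_experiment("aifs_ensembles")  # hypothetical experiment_name

with mlflow.start_run(run_name="ens_rollout_test"):  # hypothetical run name
    mlflow.set_tag("ProjectName", "AIFS")
    mlflow.log_metric("val_loss", 0.42)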


https://mlflow.org/docs/latest/getting-started/logging-first-model/step3-create-experiment.html

https://mlflow.org/docs/latest/traditional-ml/hyperparameter-tuning-with-child-runs/part1-child-runs.html

Parent-Child Runs

In the experiment tracking UI, the runs appear under their 'run_name'. When we click on one of them, we can see a few more parameters:

The MLflow run_name can be modified directly from the UI. The MLflow run ID is a unique identifier for each run within the MLflow tracking system. It is crucial for referencing specific runs, comparing results and deploying models. While in W&B it was possible to specify the run_id, this is not possible in MLflow: if you pass a run_id, MLflow understands that it refers to an existing run.

In that sense, when we fork or resume a run we have the following:

RESUMING RUN


FORKING A RUN

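A minimal sketch of how resuming and parent-child runs look with the plain mlflow API (the run id is a placeholder; in AIFS this is handled by the AIFSMLflowLogger rather than called directly):

import mlflow

# Resuming: passing an existing run_id re-opens that run and keeps logging to it
with mlflow.start_run(run_id="existing_run_id_here"):  # placeholder run_id
    mlflow.log_metric("val_loss", 0.40, step=100)

# Parent-child: a new child run gets its own run_id assigned by MLflow and is
# linked to the enclosing parent run via nesting
with mlflow.start_run(run_name="parent_run"):
    with mlflow.start_run(run_name="child_run", nested=True):
        mlflow.log_metric("val_loss", 0.39, step=0)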

https://mlflow.org/docs/latest/getting-started/index.html#mlflow-tracking



Remote Servers

- Dev Server 

(temporary server until we get an official Production one)

Currently the Dev server supports traffic coming from ATOS and Leonardo, but not from LUMI.

- Production Server 

To be completed when we have more details.

MLFlowLogger - PyTorch Lightning

Runs are logged into the MLflow tracking UI using the 'AIFSMLflowLogger', a custom implementation which inherits from the existing 'MLFlowLogger' provided with PyTorch Lightning.

The inputs that need to be passed to the AIFSMLflowLogger can be found in aifs/config/diagnostics/eval_rollout.yaml:




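As an illustration of what sits underneath, a minimal sketch using the base PyTorch Lightning MLFlowLogger (all argument values are placeholders; the AIFSMLflowLogger takes its actual values from the yaml config above):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import MLFlowLogger

# Base class that AIFSMLflowLogger inherits from; values below are placeholders
mlf_logger = MLFlowLogger(
    experiment_name="aifs_debug",
    run_name="example_run",
    tracking_uri="https://my-mlflow-server.example.int",  # or a local path
    tags={"ProjectName": "AIFS"},
)

trainer = Trainer(logger=mlf_logger, max_epochs=1)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule defined elsewhere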
To compare runs one just needs to select the runs to be compared and click on the 'Compare' button:
 

This Excel sheet contains an MLflow link generator to create a link leading directly to a comparison of specified metrics for specified runs, without having to go through the UI.

mlflow-sync

As part of the MLflow ecosystem there is an open-source side library called mlflow-export-import (https://github.com/mlflow/mlflow-export-import) that provides tooling to export/import information between servers. Based on this library we have implemented our own 'mlflow-sync' command, which can be installed since it is now part of setup.py. While the library provides some useful tools, syncing a run from one server to another requires first exporting it and then importing it. 'mlflow-sync' was created to avoid those two separate steps (export and later import): it does everything in one step.

The inputs that need to be passed to 'mlflow-sync' can be found in config/mlflow_sync.yaml:

Note - when logging runs offline with MLflow the local store usually looks like this: after the mlflow/ folder there is one folder per experiment, and inside it one folder per run.

In this case 803589923224981412 would be the experiment folder for 'aifs_debug', and 'c76a59cd015c4ecf97bea9e805bb3845' the run id of a run logged offline that could be synced (i.e. mlflow/803589923224981412/c76a59cd015c4ecf97bea9e805bb3845/).


** While the mlflow-export-import library is listed as a dependency in setup.py, Mariana experienced some issues and had to install it using:

pip install git+https://github.com/amesar/mlflow-export-import/#egg=mlflow-export-import
!Please report if you are also seeing this problem!

Note - this command won't allow syncing the same parent run or forked run twice, as it will detect that the run is already logged to the server (the check does not yet work with resumed child runs).

This command also addresses the fact that we can't keep the 'offline' run_id when syncing a run (since mlflow does not allow you to pass a run_id when starting a new run). The diagram below shows the implemented solution:
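
The diagram is not reproduced here; as a rough, hypothetical illustration of the general idea (not necessarily the exact implemented solution), the offline run can be copied to the server under a newly assigned run_id while keeping the original offline id, for example as a tag:

import mlflow
from mlflow.tracking import MlflowClient

# Hypothetical sketch: copy an offline run to a remote server. The server
# assigns a new run_id; the original offline id is kept as a tag.
offline_client = MlflowClient(tracking_uri="/path/to/offline/mlflow")  # placeholder path
offline_run = offline_client.get_run("c76a59cd015c4ecf97bea9e805bb3845")  # offline run id

mlflow.set_tracking_uri("https://my-mlflow-server.example.int")  # placeholder server
mlflow.set_experiment("aifs_debug")
with mlflow.start_run(run_name=offline_run.info.run_name):
    mlflow.set_tag("offline_run_id", offline_run.info.run_id)  # hypothetical tag name
    for key, value in offline_run.data.params.items():
        mlflow.log_param(key, value)
    # note: run.data.metrics only holds the latest value of each metric
    for key, value in offline_run.data.metrics.items():
        mlflow.log_metric(key, value)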


Limitations and improvements


FAQ

Why do my model metrics look like a constant value or a bar chart?