FLAMES—Federated Learning for Advanced MEdical Segmentation
Funding: This work has also been supported by the PNRR project FAIR - Future AI Research (PE00000013), Spoke 3, under the NRRP MUR program funded by the NextGenerationEU, and by G.A.N.D.A.L.F. - Gan Approaches for Non-iiD Aiding Learning in Federations, CUP: E53D23008290006, PNRR - Missione 4 “Istruzione e Ricerca” - Componente C2 Investimento 1.1 “Fondo per il Programma Nazionale di Ricerca e Progetti di Rilevante Interesse Nazionale (PRIN).”
ABSTRACT
Federated learning (FL) is gaining traction across numerous fields for its ability to foster collaboration among multiple participants while preserving data privacy. In the medical domain, FL enables institutions to share knowledge while maintaining control over their data, which often vary in modality, source, and quantity. Institutions are frequently specialised in treating one or a few types of tumours, typically focusing on a specific organ. Hence, different institutions may contribute distinct types of medical imaging data of various organs, originating from diverse machines. Collaboration among these institutions enhances performance on shared tasks across different areas of the body. The proposed framework, FLAMES, employs modality-specific models hosted on the server, each dedicated to a particular imaging modality and trained to predict the presence of tumours in scans of that modality, regardless of the organ being imaged. Clients focus on their specific imaging modality, exploiting knowledge derived from images contributed by institutions employing the same modality. This design facilitates broader collaboration, extending beyond institutions specialising in the same organ to include those working with the same imaging modality, and avoids the potential noise introduced by clients with images of different modalities, which might hinder the model's ability to specialise and adapt to the data specific to each institution. Experiments show that FLAMES achieves strong performance on server data, even when tested across different organs, demonstrating its ability to generalise effectively across diverse medical imaging datasets. Our code is available at https://github.com/MODAL-UNINA/FLAMES.
Abbreviation
- FL: federated learning
1 Introduction
In the healthcare domain, multimodal data refers to information collected from different sources and modalities, such as Electronic Health Records, medical imaging, genomic data, environmental data, behavioural data, sensor data, or data from wearable devices (Di Cola et al. 2023; Shaik et al. 2024). For example, data from the UP Fall dataset, which contains recordings of the same activities from cameras, wearable sensors, and infrared sensors, are used in Qi et al. (2023) for fall detection experiments. Hospitals and specialised centres often generate large amounts of valuable data, which are critical for building accurate and robust Machine Learning models. By integrating data from different modalities, healthcare professionals can achieve a comprehensive understanding of a patient's health, leading to personalised care and informed decision-making (Acosta et al. 2022). However, sharing sensitive medical data across institutions can be restricted by privacy regulations. Federated Learning (FL) offers a solution by allowing multiple institutions to collaboratively train models without sharing sensitive data. This decentralised approach can even be implemented on tiny devices (Qi, Chiaro, and Piccialli 2024), and is particularly advantageous in the healthcare domain, where data privacy and security are paramount. When managing multiple clients, carefully selecting which ones participate in the aggregation process can help reduce energy consumption while maintaining good performance (Savoia et al. 2024). Using FL, institutions can contribute to the development of global models while ensuring that patient data remain localised. When dealing with multimodal data, such as different imaging techniques from various sources or machines, specific adaptations for each modality may be necessary to ensure effective model training; that is, personalised models can be built by modifying the FL model aggregation process (Tan et al. 2022).
In this work, we introduce the FL framework FLAMES (Federated Learning for Advanced MEdical Segmentation), designed to handle multimodal medical data efficiently. Each modality present in the FL system is assigned a dedicated model, managed by the server, which aggregates, for a given model, the parameters received from the clients operating with the same modality and sends the updated weights back to those clients. Hence, clients collaborate only with others handling the same data modality, maximising the efficiency of collaboration among such institutions. In this way, each model on the server specialises in one modality by customising the client training phase. Clients are thus assisted in specialising in the modality they require, preventing potential noise from clients with different data types, while the server retains the ability to generalise across images of various organs, relying solely on the specific imaging modality. The main contributions of this work are the following:
- A novel FL framework, FLAMES, tailored for multimodal medical data, ensuring both modality-specific specialisation and cross-organ generalisation.
- An adaptive model aggregation strategy that assigns a higher weight to the local model while retaining information from other clients.
- Generalised models, enhancing the framework's ability to handle diverse data modalities, independently of the organ type.
The remainder of this paper is organised as follows. Section 2 reviews related work, and Section 3 introduces concepts relevant to our work. Section 4 presents the FLAMES framework, and Section 5 describes the employed datasets and their pre-processing, the baseline method, the initialisation of hyperparameters, and the environment settings. Section 6 presents the FLAMES results and a comparison with the baseline method. Lastly, Section 7 summarises the contributions of our study and outlines potential directions to extend our research.
2 Related Work
2.1 Data Fusion and Medical Imaging Modalities
In areas such as medical analysis, data can be of different modalities (Di Cola et al. 2023; Shaik et al. 2024), can come from multiple devices and institutions and may include missing or incomplete information. Integrating these diverse sources efficiently, while maintaining data security and privacy, presents significant challenges.
Medical images can be of different types and each medical imaging modality has distinct characteristics and properties, making them suitable for examining specific organs, diagnosing diseases, and monitoring therapeutic results (Saleh et al. 2022). For instance, Magnetic Resonance Imaging (MRI) utilises magnetic fields to generate detailed cross-sectional images, especially useful for visualising soft tissues and detecting abnormalities. X-rays are commonly employed to identify bone fractures and structural anomalies. Computed Tomography (CT) provides high-resolution images of dense structures such as bones, with the ability to detect subtle differences in tissue composition. Ultrasound (US) relies on high-frequency sound waves to produce real-time images, often used to examine soft tissues and monitor fetal development. Positron Emission Tomography (PET) is widely utilised in clinical settings to assess metabolic processes by detecting gamma rays emitted by a radioactive tracer. Advanced imaging modalities such as PET-CT, MRI-PET, MRI-CT-PET, and MRI-SPECT-PET leverage image fusion techniques to integrate anatomical and functional data, enhancing diagnostic accuracy. Finally, Single Photon Emission Computed Tomography (SPECT) is used to evaluate organ functionality by detecting gamma radiation.
As data can be of various types, multimodal data fusion has become crucial for ML applications, and numerous techniques have been developed for managing data fusion. Two fusion approaches that can be utilised to combine the results from several modalities are early fusion and late fusion (Zhang, Sidibé, et al. 2021).
In early fusion, data from multiple modalities are combined at an initial stage, typically before any decision-making occurs. The raw features extracted from each modality are integrated into a unified representation, allowing the model to capture correlations between modalities at a low level. This approach can lead to a more comprehensive and informative representation of the input data.
Late fusion involves processing data from each modality independently, with fusion occurring at a later stage, often after each modality has been analysed. This approach allows each modality to contribute to the final decision based on its individual analysis. In late fusion, the multimodal data are processed separately through distinct branches, with the outputs subsequently merged into a common feature space via fusion operations during the decoding stage. This method is particularly useful for cases where the modalities have significantly different characteristics or representations.
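To make the distinction concrete, the sketch below contrasts the two strategies on a toy two-modality classifier. It is a minimal illustration: the module names, layer sizes, and feature dimensions are our own assumptions, not taken from any of the cited works.

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Early fusion: raw modality features are concatenated before any processing."""
    def __init__(self, dim_a=32, dim_b=32, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim_a + dim_b, 64), nn.ReLU(), nn.Linear(64, num_classes)
        )

    def forward(self, x_a, x_b):
        return self.head(torch.cat([x_a, x_b], dim=1))  # fuse at the input level

class LateFusionNet(nn.Module):
    """Late fusion: each modality is processed by its own branch; outputs are merged at the end."""
    def __init__(self, dim_a=32, dim_b=32, num_classes=2):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(dim_a, 64), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(dim_b, 64), nn.ReLU())
        self.head = nn.Linear(128, num_classes)

    def forward(self, x_a, x_b):
        # each branch analyses its modality independently; fusion happens late
        z = torch.cat([self.branch_a(x_a), self.branch_b(x_b)], dim=1)
        return self.head(z)
```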
Zhai et al. (2023) proposes a dual-branch generative adversarial network (DBGAN) model to fuse CT and MRI images, retaining the salient features and complementary information from the source images. Chen et al. (2024) presents the Deep Spatial Prior Interaction (DSPI) framework, which fuses different types of data: visual, textual, and spatial information about objects in each image. DSPI utilises Grounding DINO to extract spatial priors, providing a precise object location based on textual prompts. A meta adapter transforms textual inputs into structured object queries, aligning them with visual features. Lastly, the model employs attention mechanisms to fuse features of different modalities, ensuring that the combined representation captures both semantic meaning and spatial context. The RAQNet model (Zhai et al. 2024) improves crowd counting by fusing local and global features through a structured network composed of a feature extractor, ORA (object region awareness) modules, QDC (quantum-driven calibration) modules, and a decoder. Local features are captured using ORA modules with regional attention, which focus on crowd areas and suppress background noise. Global features are extracted using QDC modules that apply quantum attention mechanisms. These modules are arranged to mutually enhance each other, allowing the model to integrate both local and global contexts effectively. Finally, the decoder generates a density map from the fused features using transposed convolutions.
2.2 Medical Segmentation
One of the most common challenges in dealing with medical images is the identification and delineation of regions of interest such as organs, tissues, and tumours. In this direction, many works explore methodologies that perform well in these tasks. DoDNet, introduced in Zhang, Xie, et al. (2021), uses an encoder, a task encoding module, a dynamic filter generation module, and a dynamic segmentation head conditioned on the input image and the assigned task to segment multiple organs and tumours. The authors of Ma et al. (2024) present MedSAM, a model designed to enable universal medical image segmentation using bounding boxes, covering 10 imaging modalities and more than 30 cancer types.
In He et al. (2023), the authors treat tumour segmentation from whole-body PET/CT images as cascaded object detection and segmentation problems. They build a two-stage architecture that separates the complex task of whole-body tumour segmentation into the simpler tasks of tumour detection and tumour segmentation performed only on slices actually containing a tumour. The authors in Cinar et al. (2022) propose a hybrid DenseNet121-UNet model with MRI pre-processing and post-processing on the BraTS2019 dataset (Menze et al. 2014). To overcome data imbalance and increase the model's accuracy, they divide each image into 1, 2, or 4 patches of size 64 × 64, according to the tumour size and centred on the tumour coordinates. In Chen et al. (2023), a ResU-Net network is developed for brain tumour segmentation tasks in MRI; it employs residual units in conjunction with the U-Net framework. nnU-Net (Isensee et al. 2021) is a deep learning-based segmentation method that automatically configures itself, including pre-processing, network architecture, training, and post-processing, for organ and tumour segmentation in the biomedical domain. This network is used in various segmentation studies, such as Nishio et al. (2021) for lung tumour segmentation with different datasets. The authors in Andrearczyk et al. (2020) propose the automatic segmentation of head and neck tumours and nodal metastases from FDG-PET and CT images using 3D and 2D V-Nets.
2.3 Medical Segmentation in FL
FL (Konečný et al. 2015) is a decentralised approach in which models are trained directly on local devices, reducing the need for centralised data storage and ensuring data privacy. FL is widely applied in medicine to various tasks, including medical image segmentation. To perform organ segmentation in FL using CT images from diverse sources, Kanhere et al. (2024) use a 3D UNet architecture; the resulting SegViz framework can aggregate knowledge from heterogeneous medical imaging datasets into a single multi-organ segmentation model. In Borazjani et al. (2024), clients may hold multimodal and non-IID data; by employing a versatile distributed encoder-decoder architecture, the model eliminates the need for clients to possess identical sets of data modalities. It separately aggregates each modality's encoder and the classifiers of the different modality combinations existing across the institutions.
Attention-based transformers and FL algorithms are integrated into Shiri et al. (2023) for PET/CT image segmentation in patients with head and neck cancers.
The authors in Dai et al. (2024) propose an FL framework with federated modality-specific encoders and multimodal anchors (FedMEMA) to simultaneously address two problems: some FL participants only possess a subset of the complete imaging modalities and each participant would expect to obtain a personalised model tailored for its local data characteristics. FedMEMA is validated on the BraTS2020 (Menze et al. 2014) benchmark for multimodal brain tumour segmentation. It employs an encoder for each modality to allow a great extent of parameter specialisation. While the encoders are shared between the server and the clients, the decoders are personalised to cater to individual participants.
Although existing works have made significant progress in applying FL to medical image segmentation (Table 1), particularly in handling multimodal data, to the best of our knowledge FLAMES is the first framework to address both multi-organ and multi-modality segmentation in an FL setting. Most prior approaches focus either on a single organ across multiple modalities or on multiple organs within a single modality, and often do not adopt an FL setting at all. Such approaches break down when client data differ significantly in both anatomy and modality. In FLAMES, clients are supported in tailoring their models to the specific imaging modality they need, avoiding interference from clients with differing data types. At the same time, the server is able to generalise across images of various organs, relying solely on the imaging modality information. A significant advancement of FLAMES is its capacity to concurrently manage several models on the server side, a functionality achieved by modifying the foundational FL framework, Flower (Beutel et al. 2022), in a manner not previously investigated in the literature. This design allows for more flexible and efficient training, as the server can orchestrate heterogeneous learning tasks in parallel.
Paper | FL | Multimodal imaging data | Multi-organ tumour segmentation |
---|---|---|---|
Zhang, Xie, et al. (2021) | ✘ | ✓ | ✓ |
Ma et al. (2024) | ✘ | ✓ | ✓ |
He et al. (2023) | ✘ | ✓ | ✓ |
Cinar et al. (2022) | ✘ | ✓ | ✘ |
Chen et al. (2023) | ✘ | ✘ | ✘ |
Nishio et al. (2021) | ✘ | ✓ | ✘ |
Andrearczyk et al. (2020) | ✘ | ✓ | ✓ |
Kanhere et al. (2024) | ✓ | ✘ | ✓ |
Borazjani et al. (2024) | ✓ | ✘ | ✓ |
Shiri et al. (2023) | ✓ | ✓ | ✘ |
Dai et al. (2024) | ✓ | ✓ | ✘ |
FLAMES | ✓ | ✓ | ✓ |
3 Background
3.1 Federated Learning
When a federated system contains at least two different data types (i.e., modalities) among all local datasets, it can be defined as multimodal FL (Pan et al. 2024).
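One way to state this definition formally (the notation is ours, not taken from Pan et al. 2024): consider a federation of $K$ clients, where client $k$ holds a local dataset $\mathcal{D}_k$ whose images belong to a set of modalities $\mathcal{M}_k$ (e.g., {MRI} or {US}). The system is multimodal whenever at least two distinct modalities appear overall:

```latex
% The federation is multimodal when the union of the clients'
% modality sets contains at least two distinct modalities.
\[
\Bigl|\, \bigcup_{k=1}^{K} \mathcal{M}_k \,\Bigr| \;\geq\; 2,
\qquad
\mathcal{M}_k = \{\text{modalities present in } \mathcal{D}_k\}.
\]
```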
4 Methodology
Although existing works have made significant strides in medical segmentation, to the best of our knowledge, no existing FL framework independently handles different data modalities from various organs while enabling predictions on images without dependence on the organ type. FLAMES enables clients to specialise in their respective data modalities while facilitating the creation of generalised models on the server that can process input images without requiring prior knowledge of the specific organ (Figure 1).

Let $\mathcal{M}$ be the set of medical image modalities present in the federation. Without loss of generality, each dataset corresponding to a specific organ and modality can be considered a single client in this framework. Therefore, each client holds images of a unique combination of organ and modality, with the modality drawn from $\mathcal{M}$. The central server can access different sets of data $\{\mathcal{D}_m\}_{m \in \mathcal{M}}$, each corresponding to a modality in $\mathcal{M}$, irrespective of the organ type. The FLAMES workflow consists of two phases:
- Client personalisation: In this phase, the server maintains separate models, each dedicated to a specific data modality. FL is performed independently for each modality, ensuring that the global model for each data type is updated using only the corresponding clients. A novel weight aggregation strategy has been developed to address the multimodal nature of the datasets and the inherent class imbalance.
- Generalisation: The server can detect tumours in medical images, using the corresponding modality-specific model.
In detail, the FL process starts with a global model initialized on the central server. This model is distributed to all participating clients and serves as the starting point for local training. Each client loads its local dataset and applies data augmentation during each local epoch to enhance the robustness of the training process. Clients train the global model locally for a fixed number of epochs. After completion of local training, each client sends only the updated model weights back to the server. The server collects model updates from all clients and aggregates the model updates separately for each data modality.
This approach ensures that each client retains more influence from its data, which could be particularly beneficial in scenarios involving numerous clients with heterogeneous data to prevent dilution of the local model's learning.
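The exact weighting scheme of this adaptive aggregation is not reproduced here; a minimal sketch of the idea, using a hypothetical mixing coefficient `lam` and assuming the weights are lists of NumPy arrays, could look as follows.

```python
def personalised_update(local_weights, aggregated_weights, lam=0.7):
    """Blend a client's own weights with the modality-level aggregate.

    lam > 0.5 keeps more influence from the local model while still
    retaining information contributed by the other clients of the same
    modality. The value 0.7 is an illustrative choice, not the one used
    in the paper.
    """
    return [lam * lw + (1.0 - lam) * aw
            for lw, aw in zip(local_weights, aggregated_weights)]
```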
After aggregation, the server redistributes updated modality-specific models to clients corresponding to the same modality for subsequent rounds. Meanwhile, the server can predict across all modalities of various organs. Due to privacy constraints, models are tested on server-side datasets, each comprising images from a specific modality. This approach enables the server to generalise effectively across diverse data sources using modality-specific models.
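Putting the pieces together, one synchronous FLAMES round can be sketched as below. The helper names (`client.fit`, `fedavg_within_modality`), the assumption that weights are lists of NumPy arrays, and the use of a plain weighted average inside each modality group are illustrative simplifications, not the exact released implementation.

```python
def fedavg_within_modality(results):
    """Weighted average of (weights, n_examples) pairs from clients of one modality.
    Each 'weights' entry is assumed to be a list of NumPy arrays."""
    total = sum(n for _, n in results)
    n_layers = len(results[0][0])
    return [sum(w[i] * n for w, n in results) / total for i in range(n_layers)]

def run_round(global_models, clients):
    """One synchronous round: local training, then per-modality aggregation and redistribution."""
    updates = {m: [] for m in global_models}                   # modality -> [(weights, n_examples), ...]
    for client in clients:
        # each client trains only the model of its own modality (e.g. "MRI" or "US")
        weights, n = client.fit(global_models[client.modality])
        updates[client.modality].append((weights, n))
    for modality, results in updates.items():
        if results:                                            # aggregate only modalities with updates
            global_models[modality] = fedavg_within_modality(results)
    return global_models                                       # redistributed to clients of each modality
```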
On the client side, FLAMES avoids the need to participate in an FL system alongside many clients that, owing to different data modalities, may not contribute effectively to the client's modality personalisation. This reduces noise, improves overall learning efficiency, and lowers the per-round overhead in synchronous FL, since fewer clients need to complete their local epochs than in traditional FL systems. On the server side, FLAMES enables the training of generalised models capable of performing segmentation tasks across various data types and organs.
The proposed framework has been implemented and trained using the Flower framework (Beutel et al. 2022) detailed in Section 5.5.
5 Experimental Setting
5.1 Datasets and Preprocessing
- Task01_BrainTumour: This dataset contains 750 4D MRI volumes of brain tumours, a subset of the 2016 and 2017 Brain Tumour Image Segmentation (BraTS) challenges (Menze et al. 2014; Bakas et al. 2017) and part of the Medical Segmentation Decathlon (MSD) challenge (Antonelli et al. 2022). Data include multi-parametric MRI sequences: native T1, post-Gadolinium (T1-Gd), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (T2-FLAIR) scans, acquired using various protocols and scanners from 19 institutions. Brain tumour sub-regions are classified as edema, enhancing, and non-enhancing tumour (Simpson et al. 2019). For this study, only 484 training volumes (with segmentation masks) are used, excluding the 266 test images.
- Breast Ultrasound Images Dataset: Breast ultrasound images from 600 women aged 25–75 are included in this data collection. The dataset consists of 780 images, classified as normal, benign, or malignant (Al-Dhabyani et al. 2020). We select only benign and malignant cases. Each image has a binary mask that identifies the tumour.
- OTU 2D: OTU 2D and OTU CEUS are the two subsets that make up the MMOTU dataset (Li 2024). In this study, only the 2D ultrasound images from the OTU 2D subset are used. Each image has a binary mask that identifies the tumour.
Non-square images were padded with black borders to create square dimensions. Images were resized to 240 × 240 pixels and scaled to [0, 1].
All images were in grayscale. Note that, due to the large amount of data, Task01_BrainTumour was split into 3 parts, according to patient IDs, and assigned to different clients. Hence, each client works with a manageable data volume. The datasets reflect a range of imaging modalities, anatomical regions, and acquisition protocols, making them ideal for evaluating the performance of FLAMES across diverse data sources.
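A possible implementation of this preprocessing pipeline (pad to square with black borders, resize to 240 × 240, scale to [0, 1]) is sketched below using Pillow and NumPy; the function name and the symmetric padding are our own choices, not taken from the released code.

```python
import numpy as np
from PIL import Image, ImageOps

def preprocess(path, target_size=240):
    """Load a grayscale scan, pad it to a square with black borders,
    resize to target_size x target_size and scale intensities to [0, 1]."""
    img = Image.open(path).convert("L")          # force single-channel grayscale
    w, h = img.size
    side = max(w, h)
    pad_w, pad_h = side - w, side - h
    # symmetric black padding to make the image square
    img = ImageOps.expand(
        img,
        border=(pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2),
        fill=0,
    )
    img = img.resize((target_size, target_size), Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0   # scale to [0, 1]
```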
The pre-processing ensured uniformity and compatibility across all datasets for further analysis. Examples of client images after pre-processing are reported in Figure 2. Each dataset was divided into three distinct subsets: the client's training set, the client's test set, and the server's test set. There is no overlap between patients in these sets, as the data are partitioned based on patient IDs. The resulting server datasets, along with the number of patients in each subset, are shown in Figure 3.



Figure 6 shows the result of applying the dimensionality reduction technique t-Distributed Stochastic Neighbour Embedding (t-SNE) (Van der Maaten and Hinton 2008) to the clients' training data. It reveals a clear separation between the MRI and US modalities. This distinction underscores the challenge of applying a single, generalised model across heterogeneous datasets. FLAMES addresses this by initially training modality-specific models with the same architecture on separate clients' data, allowing each client to specialise in its respective modality.
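A visualisation along these lines can be produced with scikit-learn; flattening each image into a raw pixel vector and the chosen perplexity are our assumptions, not details reported in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def tsne_plot(images, labels, perplexity=30):
    """images: array of shape (n, 240, 240); labels: modality per image ("MRI" / "US")."""
    X = images.reshape(len(images), -1)                # flatten each image into a feature vector
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    labels = np.asarray(labels)
    for modality in sorted(set(labels)):
        mask = labels == modality
        plt.scatter(emb[mask, 0], emb[mask, 1], s=4, label=modality)
    plt.legend()
    plt.title("t-SNE of clients' training images")
    plt.show()
```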



5.2 Baseline
The baseline employed in this study is a traditional FL approach, in which a single global model is collaboratively trained among all clients. Each client contributes its local dataset to the training process, and the server iteratively aggregates the clients' updates into a global model using the FedAvg strategy. The datasets and augmented images used by both the clients and the server are identical to those in the FLAMES framework. As in FLAMES, the number of rounds is fixed at 50. Both the model architecture and the hyperparameters are consistent with those of the modality-specific models. With this approach, all clients share a unified model without specialisation or adaptation to specific modalities or data distributions.
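In Flower, this baseline corresponds to a standard single-strategy setup; a minimal server-side configuration might look as follows (the fraction parameters are illustrative defaults rather than values reported in the paper, and the exact argument names can vary slightly across Flower versions).

```python
import flwr as fl

# A single FedAvg strategy shared by all clients, regardless of modality.
strategy = fl.server.strategy.FedAvg(
    fraction_fit=1.0,         # all clients train every round (assumption)
    fraction_evaluate=1.0,    # all clients evaluate every round (assumption)
    min_available_clients=5,  # the five clients of Table 2 (assumption)
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=50),  # 50 rounds, as in the experiments
    strategy=strategy,
)
```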
5.3 Models and Hyperparameters
For each modality, we employed a UNet-based architecture (Ronneberger et al. 2015) suitable for segmentation tasks. Each modality uses the same model design, as shown in Figure 7.
The architecture is a customised UNet that integrates a pre-trained DenseNet121 as part of its encoder to leverage transfer learning. A 2D convolution layer with a single input channel is followed by the first dense block and transition layer from DenseNet121. Then, 3 convolutional blocks are employed, each comprising two convolutional layers followed by batch normalisation and ReLU activation. To avoid overfitting, a Dropout of 0.3 is used after each convolutional block. The input image, of size 240 × 240, is progressively downsampled through a max-pooling layer before each convolutional block.
The decoder employs transposed convolution layers for upsampling, concatenating the corresponding encoder outputs through skip connections to recover spatial details. The final layer is a convolution that maps the output to a single channel, representing the predicted segmentation mask. A sigmoid activation function is applied to produce probabilities for the binary segmentation task.
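A condensed PyTorch sketch of an architecture following this description is given below; the channel widths, the single-channel stem, the dropout placement, and the number of skip connections are our reading of the text, not the exact released implementation (see the repository for that).

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

def conv_block(in_ch, out_ch, p_drop=0.3):
    """Two conv layers, each followed by batch norm and ReLU, then dropout."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Dropout2d(p_drop),
    )

class FlamesUNet(nn.Module):
    def __init__(self):
        super().__init__()
        dense = densenet121(weights="IMAGENET1K_V1").features  # downloads ImageNet weights
        # Single-channel stem replacing DenseNet's RGB stem (assumption: stride 1).
        self.stem = nn.Conv2d(1, 64, kernel_size=3, padding=1)
        self.dense_block = dense.denseblock1      # 64 -> 256 channels
        self.transition = dense.transition1       # 256 -> 128 channels, spatial /2
        self.pool = nn.MaxPool2d(2)
        self.enc1 = conv_block(128, 256)
        self.enc2 = conv_block(256, 512)
        self.enc3 = conv_block(512, 512)
        # Decoder: transposed convolutions + skip connections from the encoder.
        self.up3 = nn.ConvTranspose2d(512, 512, 2, stride=2)
        self.dec3 = conv_block(512 + 512, 256)
        self.up2 = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.dec2 = conv_block(256 + 256, 128)
        self.up1 = nn.ConvTranspose2d(128, 128, 2, stride=2)
        self.dec1 = conv_block(128 + 128, 64)
        self.up0 = nn.ConvTranspose2d(64, 64, 2, stride=2)
        self.out = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, x):                                       # x: (B, 1, 240, 240)
        s0 = self.transition(self.dense_block(self.stem(x)))    # (B, 128, 120, 120)
        s1 = self.enc1(self.pool(s0))                           # (B, 256, 60, 60)
        s2 = self.enc2(self.pool(s1))                           # (B, 512, 30, 30)
        s3 = self.enc3(self.pool(s2))                           # (B, 512, 15, 15)
        d3 = self.dec3(torch.cat([self.up3(s3), s2], dim=1))    # (B, 256, 30, 30)
        d2 = self.dec2(torch.cat([self.up2(d3), s1], dim=1))    # (B, 128, 60, 60)
        d1 = self.dec1(torch.cat([self.up1(d2), s0], dim=1))    # (B, 64, 120, 120)
        return torch.sigmoid(self.out(self.up0(d1)))            # (B, 1, 240, 240)
```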
In our experiments, we fine-tuned the model hyperparameters to optimise performance and found that the best results were achieved by combining the Dice loss and the Focal loss. The Dice loss is designed to maximise the overlap between the predicted segmentation and the ground truth and is particularly effective for addressing class imbalance in segmentation tasks. The Focal loss is an extension of the Binary Cross-Entropy loss, designed to address class imbalance by emphasising difficult-to-classify examples. The Dice loss ensures a robust alignment between predicted and ground-truth segmentations, particularly in imbalanced datasets, while the Focal loss emphasises learning from harder examples, enhancing the model's discriminative ability. Data in the clients' training sets were augmented through random flips, rotations, zooms, stretches, and shifts of the original images. The augmented images varied between epochs and between different clients.
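A common way to implement such a combined objective is sketched below; the weighting between the two terms and the focal parameters (`alpha`, `gamma`) are hypothetical, since the exact values used in the paper are not reported here.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss on sigmoid probabilities: 1 - 2|P∩G| / (|P| + |G|)."""
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(dim=1)
    return 1.0 - ((2.0 * inter + eps) / (pred.sum(dim=1) + target.sum(dim=1) + eps)).mean()

def focal_loss(pred, target, alpha=0.25, gamma=2.0):
    """Focal loss built on binary cross-entropy, down-weighting easy pixels."""
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    p_t = target * pred + (1.0 - target) * (1.0 - pred)
    alpha_t = target * alpha + (1.0 - target) * (1.0 - alpha)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

def combined_loss(pred, target, w_dice=0.5, w_focal=0.5):
    """Weighted sum of Dice and Focal losses (weights are illustrative)."""
    return w_dice * dice_loss(pred, target) + w_focal * focal_loss(pred, target)
```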
5.4 Evaluation Metrics
- Intersection over Union (IoU) (Jaccard 1901), also known as the Jaccard Index, measures the overlap between the predicted segmentation mask $P$ and the ground-truth mask $G$, defined as:

  $$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G| + \epsilon} \quad (3)$$

  where $\epsilon$ is a small constant added to avoid division by zero.
- Dice Similarity Coefficient (DSC) (Dice 1945) measures the similarity between two sets and is particularly useful for handling class imbalance in segmentation tasks. It is defined as:

  $$\mathrm{DSC} = \frac{2\,|P \cap G|}{|P| + |G|} \quad (4)$$

  where $P$ and $G$ are the predicted and ground-truth masks, respectively.
- Pixel Accuracy measures the ratio of correctly predicted pixels to the total number of pixels in the ground-truth mask. It provides a straightforward evaluation of the overall accuracy of the prediction and is defined as:

  $$\mathrm{PA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(p_i = g_i) \quad (5)$$

  where $\mathbb{1}(\cdot)$ is an indicator function that equals 1 if the predicted pixel $p_i$ matches the ground-truth pixel $g_i$ and 0 otherwise, and $N$ is the total number of pixels. Pixel accuracy does not necessarily indicate good segmentation performance, especially with unbalanced classes; in such cases, it should ideally exceed the accuracy obtained by predicting only the majority class.
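For binary masks, the three metrics above can be computed as follows (our NumPy implementation, mirroring Equations (3)–(5)).

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-6):
    """pred, gt: binary NumPy arrays of identical shape (values in {0, 1})."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + eps)                          # Eq. (3)
    dsc = 2.0 * inter / (pred.sum() + gt.sum() + eps)    # Eq. (4)
    pixel_acc = (pred == gt).mean()                      # Eq. (5)
    return {"IoU": iou, "DSC": dsc, "PixelAccuracy": pixel_acc}
```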
5.5 Flower Framework
The Flower framework (Beutel et al. 2022) is used to manage the FL process and to establish coordinated collaboration between a central server and multiple participating institutions. Flower is particularly well-suited for extending FL to diverse client environments, including mobile and wireless devices. Furthermore, Flower's flexibility allows for the seamless integration of new algorithms, training strategies, and communication protocols, making it an ideal choice for medical applications of FL where continuous adaptation and improvement are required. The current server and strategy implementations assume synchronous FL.
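For reference, a minimal Flower NumPyClient skeleton (Flower ≥ 1.0 API) that a client along these lines could build on is shown below; `train_local` and `evaluate_local` are placeholder stubs standing in for the client-side logic described in Section 4, and the whole sketch is an illustration rather than the released code.

```python
import flwr as fl
import torch

def train_local(model, loader):
    """Placeholder for the local training loop (local epochs with augmentation)."""
    pass

def evaluate_local(model, loader):
    """Placeholder for the local evaluation; returns (loss, dice)."""
    return 0.0, 0.0

class FlamesClient(fl.client.NumPyClient):
    def __init__(self, model, train_loader, test_loader):
        self.model = model
        self.train_loader = train_loader
        self.test_loader = test_loader

    def get_parameters(self, config):
        return [v.cpu().numpy() for v in self.model.state_dict().values()]

    def set_parameters(self, parameters):
        keys = self.model.state_dict().keys()
        state = {k: torch.tensor(v) for k, v in zip(keys, parameters)}
        self.model.load_state_dict(state, strict=True)

    def fit(self, parameters, config):
        self.set_parameters(parameters)
        train_local(self.model, self.train_loader)
        return self.get_parameters(config), len(self.train_loader.dataset), {}

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        loss, dsc = evaluate_local(self.model, self.test_loader)
        return float(loss), len(self.test_loader.dataset), {"dsc": float(dsc)}

# Launch (one process per client):
# fl.client.start_numpy_client(server_address="127.0.0.1:8080",
#                              client=FlamesClient(model, train_loader, test_loader))
```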
5.6 Hardware Infrastructure
The experiments took place on a multi-node server configuration. We used the Infrastructure for Big Data and Scientific Computing (I.BI.S.CO) at the S.Co.P.E. Data Center, University of Naples Federico II (Barone et al. 2022). The system includes 36 nodes, 32 dedicated to computation, and 4 reserved for storage, meeting our requirements for simulating an FL environment. Each computational node features a DELL C4140 server equipped with 4 NVIDIA Tesla V100 GPUs, totaling 128 GPUs across the entire system.
6 Results and Discussion
We ran our experiments on I.BI.S.CO (see Section 5.6). The evaluation metrics are DSC, IoU, and Pixel Accuracy (see Section 5.4). Datasets are distributed among clients as described in Table 2, and the performance results for each client, comparing the baseline and FLAMES approaches, are reported in Table 3.
Client | Dataset | Modality | Train images | Test images |
---|---|---|---|---|
1 | Task01_BrainTumour_0 | MRI | 6223 | 1574 |
2 | Task01_BrainTumour_1 | MRI | 6130 | 1567 |
3 | Task01_BrainTumour_2 | MRI | 6217 | 1568 |
4 | Breast ultrasound | US | 165 | 43 |
5 | OTU_2d | US | 991 | 248 |
- Note: The “_0”, “_1” and “_2” in client datasets indicate the division of the Task01_BrainTumour dataset into three parts based on patient IDs.
Client | Baseline IoU | Baseline DSC | Baseline Pixel accuracy | FLAMES IoU | FLAMES DSC | FLAMES Pixel accuracy |
---|---|---|---|---|---|---|
1 | 0.70 | 0.78 | 99.23% | 0.71 | 0.78 | 99.22% |
2 | 0.70 | 0.77 | 99.22% | 0.68 | 0.76 | 99.22% |
3 | 0.70 | 0.79 | 99.22% | 0.70 | 0.78 | 99.20% |
4 | 0.00 | 0.00 | 91.94% | 0.42 | 0.51 | 94.66% |
5 | 0.00 | 0.00 | 90.30% | 0.59 | 0.69 | 95.90% |
As shown in Table 3, the baseline approach fails to generalise across all clients, particularly for ultrasound (US) data. Clients with larger datasets, such as client 1 (6223 training images) and client 2 (6130 training images), generally show better segmentation metrics (IoU and DSC) than those with smaller datasets (e.g., client 4 with 165 training images). US clients (clients 4–5) report test IoU and DSC values of 0.00, suggesting that the baseline models struggle to segment the data effectively within the fixed number of rounds. These results may be influenced by the use of FedAvg for aggregation, which weights model updates by the number of images from each client. Clients with more images, particularly those related to the MRI modality, are therefore given greater weight in the aggregation, leading to a dominance of the MRI clients during model updates. Hence, the non-IID nature of the datasets and the number of images strongly influence performance. As a result, this approach may prevent clients from specialising in their own modality and could introduce noise. The high pixel accuracy observed can be attributed to the model predicting most pixels as background, which is the majority class in the datasets.
In contrast, FLAMES shows clear advantages, especially for US clients. While maintaining comparable performance for MRI clients (with minimal changes in Dice and IoU), FLAMES dramatically improves US performance. Client 4, despite having access to a limited number of images, achieves a good Dice score by leveraging collaboration with the other US dataset (client 5), even though the images originate from different organs and institutions. Hence, this approach is particularly advantageous for clients with limited data availability.
To assess the generalisation capability of the modality-specific models trained by the server, we tested them on a separate test set containing unseen images from multiple datasets and organ types. Results in Table 4 demonstrate that the model performs well on unseen test data, where images from various datasets and organs are mixed.
Model | IoU | DSC | Pixel accuracy |
---|---|---|---|
MRI | 0.57 | 0.67 | 0.99 |
US | 0.53 | 0.62 | 0.95 |
7 Conclusions
FLAMES offers a robust solution for advancing segmentation tasks across diverse medical imaging datasets, addressing the challenges posed by non-IID and multimodal data in multi-organ tumour segmentation. Using a unified architecture can improve the collaboration between institutions handling the same modalities. The ability to manage multiple modality-specific models makes FLAMES more adaptable to real-world federated scenarios, where data heterogeneity is the norm. Moreover, the server can segment tumours across all modalities and organs. Future studies could explore several directions: implementing an asynchronous framework that allows clients of the same modality to continue training independently without waiting for the others; scaling the framework to larger, more heterogeneous client pools; testing the server models on modalities and organs not present in the current clients' datasets; and exploring a decentralised, server-less federated architecture in which clients coordinate model updates directly, reducing central bottlenecks and enhancing robustness in environments without stable server access.
Acknowledgements
The authors thank the IBiSco project (Infrastructure for BIg data and Scientific COmputing), PON R&I 2014-2020 under Call 424-2018—Action II.1, for the support and use of the HPC Cluster. The authors extend special thanks to Dr. Luisa Carraciuolo and Eng. Davide Bottalico for their constant and continuous support in utilising the IBiSco HPC Cluster.
This work has also been supported by the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU.
This work has also been supported by G.A.N.D.A.L.F.—Gan Approaches for Non-IID Aiding Learning in Federations, CUP: E53D23008290006, PNRR—Missione 4 “Istruzione e Ricerca”–Componente C2 Investimento 1.1 “Fondo per il Programma Nazionale di Ricerca e Progetti di Rilevante Interesse Nazionale (PRIN)”. Open access publishing facilitated by Università degli Studi di Napoli Federico II, as part of the Wiley - CRUI-CARE agreement.
Disclosure
The authors have nothing to report.
Conflicts of Interest
The authors declare no conflicts of interest.
Open Research
Data Availability Statement
- Task01_BrainTumour: http://medicaldecathlon.com/.
- Breast Ultrasound Images Dataset: https://www.kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset.
- OTU_2D: https://figshare.com/articles/dataset/_zip/25058690?file=44222642.