A Maturity Model for Practical Explainability in Artificial Intelligence-Based Applications: Integrating Analysis and Evaluation (MM4XAI-AE) Models
Abstract
The increasing adoption of artificial intelligence (AI) in critical domains such as healthcare, law, and defense demands robust mechanisms to ensure transparency and explainability in decision-making processes. While machine learning and deep learning algorithms have advanced significantly, their growing complexity presents persistent interpretability challenges. Existing maturity frameworks, such as Capability Maturity Model Integration, fall short in addressing the distinct requirements of explainability in AI systems, particularly where ethical compliance and public trust are paramount. To address this gap, we propose the Maturity Model for eXplainable Artificial Intelligence: Analysis and Evaluation (MM4XAI-AE), a domain-agnostic maturity model tailored to assess and guide the practical deployment of explainability in AI-based applications. The model integrates two complementary components: an analysis model and an evaluation model, structured across four maturity levels—operational, justified, formalized, and managed. It evaluates explainability across three critical dimensions: technical foundations, structured design, and human-centered explainability. MM4XAI-AE is grounded in the PAG-XAI framework, emphasizing the interrelated dimensions of practicality, auditability, and governance, thereby aligning with current reflections on responsible and trustworthy AI. The MM4XAI-AE model is empirically validated through a structured evaluation of thirteen published AI applications from diverse sectors, analyzing their design and deployment practices. The results show a wide distribution across maturity levels, underscoring the model’s capacity to identify strengths, gaps, and actionable pathways for improving explainability. This work offers a structured and scalable framework to standardize explainability practices and supports researchers, developers, and policymakers in fostering more transparent, ethical, and trustworthy AI systems.
1. Introduction
Artificial Intelligence (AI) increasingly influences critical decision-making processes in vital sectors such as healthcare, law, and security, where transparent, reliable, and accountable outcomes are paramount [1, 2]. In these high-stakes environments, explainability is crucial to prevent adverse outcomes, ensure ethical compliance, and build trust in AI-driven decisions [3, 4]. As highlighted in [5], eXplainable AI (XAI) addresses the inherent complexity of contemporary AI models, helping to clarify their operations, facilitate accountability, and foster public trust. Recent regulatory frameworks, such as the European Union’s AI Act [6], further emphasize the necessity of explainability, particularly in scenarios impacting fundamental rights, public services, and democratic processes, framing XAI as a technical goal and an essential ethical principle.
XAI encompasses multiple dimensions intended to render AI systems comprehensible to diverse user groups, each contributing uniquely to enhanced transparency and user trust. Longo et al. [4, 7] categorize explainability into data explainability, model explainability, post hoc explanations, and evaluation methods, underscoring the importance of tailoring explanations to distinct audiences. Similarly, Ali et al. [8] emphasize that AI models, especially those applied in sensitive fields such as healthcare and finance, must provide explanations comprehensible to end users and regulators to meet high standards of trustworthiness.
Herrera [5] proposes a holistic view of XAI maturity encompassing three complementary dimensions: applied explainability, auditability (A), and governance (G). While this study acknowledges these dimensions, it specifically targets the practical dimension of “applied explainability,” aiming to operationalize explainability through a structured maturity model. Such an operational approach complements broader theoretical frameworks by providing practical tools for systematic evaluation and improvement.
Despite significant research aimed at enhancing explainability in AI—particularly within machine learning (ML) and deep learning (DL)—no standardized or comprehensive maturity framework currently exists for systematically evaluating explainability practices. Current approaches tend to address isolated issues such as transparency or ethics independently, failing to integrate these into a cohesive and operational framework. This fragmentation hinders organizations’ ability to effectively adopt explainable AI practices, limiting their capacity to address the demands of users, regulators, and society.
Traditional maturity models such as Capability Maturity Model Integration (CMMI) [9, 10], which primarily focus on software quality and process improvement, serve as foundational inspirations but are inadequate for addressing the specific interpretability requirements unique to AI systems [11, 12]. Explainability is a nonfunctional attribute fundamentally distinct from traditional quality metrics such as accuracy or robustness [13]. Previous research by Chazette and Schneider [13], Deters et al. [7], Chmielowski et al. [14], and Rodriguez-Cardenas [15] further underscores that achieving interpretability requires specialized frameworks designed explicitly for AI.
To bridge this gap, we propose the Maturity Model for eXplainable Artificial Intelligence-Analysis and Evaluation (MM4XAI-AE), a maturity model specifically tailored to evaluate practical explainability in artificial intelligence-based applications. The MM4XAI-AE integrates two fundamental components: an analysis model and an evaluation model. These models work in tandem to characterize and assess the progressive adoption of explainability practices in AI applications. The model defines four progressive maturity levels: operational, justified, formalized, and managed, each associated with a set of clearly defined indicators. These indicators are structured across three key dimensions: technical foundations, structured design, and human-centered explainability, enabling a systematic understanding of how explainability practices evolve from basic technical implementation to advanced interpretation strategies oriented to end users.
This four-level model strategically avoids complexity and redundancy, offering clear, operational stages that range from basic implementation to the advanced integration of explainability practices. These stages are aligned with ethical standards and societal expectations [16, 17], enabling practical adoption across diverse contexts. Each maturity level in the MM4XAI-AE model corresponds to a set of clearly defined indicators that reflect progressive capabilities in transparency, interpretability, and explainability, facilitating systematic evaluation throughout the AI lifecycle. By focusing on practical applicability and structured progression, the model supports both technical development and alignment with growing regulatory and social demands for trustworthy AI.
This study contributes to the XAI domain in three primary ways. First, it introduces MM4XAI-AE, a maturity model focused on the analysis and evaluation of practical explainability practices, explicitly addressing the limitations of existing software-focused maturity frameworks. The model is conceptually aligned with the practicality (P), auditability (A), and governance (G) dimensions outlined in the PAG-XAI framework [5], providing an operational approach to those principles. Second, the model is empirically validated using thirteen AI-driven applications from diverse domains, demonstrating its practicality and versatility in real-world contexts. Finally, we offer a structured methodology and clear indicators for systematically evaluating and enhancing explainability practices, providing valuable guidance to researchers, practitioners, and policymakers alike.
The remainder of the paper is organized as follows: Section 2 reviews related literature on explainability in AI and maturity models. Section 3 describes the materials, methods, and foundational concepts underpinning the MM4XAI-AE, including the analysis and evaluation components and their associated indicators, as well as a comparative analysis with the PAG-XAI framework. Section 4 presents empirical validation results from thirteen case studies. Section 5 discusses these findings, analyzing trends and implications, and Section 6 concludes with key contributions and outlines future research directions, emphasizing the model’s potential to foster explainable and trustworthy AI adoption.
2. Related Work
XAI has become a critical area of research due to its importance in fostering trust and transparency in AI systems. In recent years, various techniques have been proposed to generate interpretable outputs and enhance human understanding of complex models. Tools such as Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) have been widely used across domains such as healthcare and finance. LIME explains individual predictions by locally approximating the model with interpretable representations, while SHAP applies game theory to assign feature importance, ensuring local accuracy and consistency [18–21]. Despite their success, LIME has been criticized for instability in local explanations, and SHAP, although more stable, is computationally intensive. These limitations highlight the need for more robust and standardized evaluation methods in XAI [22].
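As a concrete illustration of how these two tools are typically invoked, the following minimal sketch applies both to a scikit-learn classifier trained on synthetic tabular data; the dataset, model choice, and variable names are illustrative assumptions rather than details drawn from the cited studies.

```python
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy tabular classification task standing in for a real application.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]
model = RandomForestClassifier(random_state=0).fit(X, y)

# LIME: fit a sparse local surrogate around one instance and report its weights.
lime_explainer = LimeTabularExplainer(X, feature_names=feature_names, mode="classification")
lime_exp = lime_explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(lime_exp.as_list())  # locally weighted feature contributions for this prediction

# SHAP: Shapley-value attributions for the same model via the tree-specific explainer.
shap_values = shap.TreeExplainer(model).shap_values(X[:50])
```

The contrast also hints at the trade-off noted above: LIME refits its surrogate on random perturbations at every query, which explains its instability, while SHAP's Shapley-value attributions are more consistent but can become costly for large models or non-tree explainers.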
Complementing these methods, gradient-based techniques like Grad-CAM and Grad-CAM++ have proven essential in visualizing relevant areas in image classification tasks by DL models. These approaches generate class activation maps to show which image regions most influenced the model’s decisions. Their interpretability at the visual level makes them highly valuable in domains like medical imaging, providing intuitive and accessible insights into complex model behavior [18].
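To show how such class activation maps are commonly produced in code, the sketch below applies Grad-CAM through the Captum library (one of the toolkits referenced later in Section 3.1) to an untrained ResNet-18 and a random tensor standing in for a real image; it is a hedged example, not an excerpt from the works cited here.

```python
import torch
from torchvision.models import resnet18
from captum.attr import LayerGradCam, LayerAttribution

# Untrained backbone and a random tensor as stand-ins for a trained diagnostic
# model and a real image (torchvision >= 0.13; older versions use pretrained=False).
model = resnet18(weights=None).eval()
image = torch.randn(1, 3, 224, 224)

# Grad-CAM: gradients of the target class score weight the activation maps of the
# last convolutional block, localizing the regions that drove the prediction.
grad_cam = LayerGradCam(model, model.layer4)
attribution = grad_cam.attribute(image, target=0)

# Upsample the coarse map to input resolution so it can be overlaid on the image.
heatmap = LayerAttribution.interpolate(attribution, (224, 224))
print(heatmap.shape)  # torch.Size([1, 1, 224, 224])
```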
Researchers have also focused on establishing metrics and methods to evaluate the effectiveness of XAI techniques. In 2018, studies explored tools such as goodness checklists, satisfaction scales, and fidelity measures [23, 24]. However, most of these metrics are context specific and lack generalizability, making it difficult to standardize explainability evaluations across domains. Subsequent work emphasized the need for user-centered and culturally aware evaluation methods, such as those proposed by Díaz-Rodríguez et al. in cultural heritage applications [25].
From 2020 to 2021, the focus shifted toward identifying the barriers to XAI implementation in real-world contexts. Studies pointed to the lack of transparency as a major challenge in fields such as healthcare, autonomous systems, and human–computer interaction [26–28]. Cabitza et al. proposed metrics such as the degree of correspondence and weighted agreement for medical AI, but these too remained domain specific [29]. Batarseh et al. provided a systematic review of assurance methods, identifying gaps in achieving reliable AI systems, while Vilone et al. and Alangari et al. highlighted the need for consensus on validation processes and standardized interpretability metrics [4, 30, 31].
In response to these gaps, recent works have proposed layered evaluation frameworks to assess fidelity, clarity, and the stability of explanations. Mirzaei et al., for example, introduced a three-layer approach based on performance loss when removing key features, offering a structured method to compare XAI outputs [32]. This indicates a growing interest in designing scalable, rigorous evaluation protocols.
Throughout 2022, XAI continued to expand across multiple sectors. Applications included early detection of drug resistance, decision support in medical diagnostics, and modeling environmental factors like coastal water quality [33–37]. In Industry 4.0, XAI was integrated into intelligent diagnostics and monitoring, yet challenges in ensuring transparency in high-risk industrial applications persisted [38, 39]. Despite progress, a unified, domain-independent metric framework remains elusive.
By 2023, the connection between explainability and the trustworthiness of AI systems had been further reinforced. Díaz-Rodríguez et al. emphasized explainability as essential to fulfilling legal, ethical, and technical requirements, proposing seven key pillars for trustworthy AI [40]. Recent taxonomies have helped classify XAI techniques into ex-ante and post hoc categories, aligning tools with different audiences and phases of the model lifecycle [32, 41]. This classification facilitates the selection of appropriate techniques for regulators, developers, and end users alike.
In parallel, maturity models have emerged as effective tools for managing process quality and continuous improvement, particularly in the software industry. Originally applied to guide software process evolution [9], they have been adapted between 2020 and 2022 for the adoption of AI and ML in business and industrial contexts [9, 12, 17]. Rama et al., for example, presented a maturity model for managing the ML lifecycle, highlighting the absence of structured pathways for AI integration in enterprises [19].
In the context of XAI, maturity models have been proposed to align ethical principles with technical implementation. Vakkuri et al. advocated for models that ensure responsible AI adoption [42], and Pumplun et al. emphasized maturity-based approaches for adopting ML in healthcare settings [43]. These studies signal an ongoing interest in structuring the development and deployment of explainable systems.
Recent work has also applied maturity models to Industry 4.0 environments. Abadía et al. (2024) examined maturity in cyber-physical production systems (CPPSs), focusing on structured transitions from legacy technologies to autonomous systems [44]. Their approach classifies capabilities into maturity levels, enabling industrial adaptation to digital technologies. Other studies stress the need to evolve maturity models for digital transformation, addressing issues such as resistance to change and operational integration [45–47].
In [5], the PAG-XAI index is proposed, which introduces three core dimensions for explainability maturity: practicality (P), auditability (A), and governance (G). This paper analyzes the connection between the MM4XAI-AE and the PAG-XAI index. The model proposed in this work builds on these efforts, offering a novel maturity model focused specifically on explainability in AI systems. While existing models have focused on adoption or ethical alignment, this model aims to assess the degree to which AI systems offer understandable, interpretable, and transparent decisions in practical business contexts. It provides a structured tool for organizations to improve trust in AI systems, align processes with explainability goals, and guide implementation through clearly defined maturity levels.
3. Materials and Methods
This section introduces the conceptual and methodological foundations of MM4XAI-AE for assessing explainability in artificial intelligence-based applications. Grounded in established XAI principles and aligned with the phases of the AI lifecycle, MM4XAI-AE is structured into four progressive maturity levels: operational, justified, formalized, and managed. This section is organized into three main components. First, it presents the design of the analysis model, which constitutes the backbone of the maturity structure. This model is grounded in key concepts such as interpretability, transparency, explainability, and the AI lifecycle, providing the theoretical and practical basis for defining maturity indicators. These concepts are synthesized into a coherent model structure that supports the definition and classification of indicators across the four maturity levels. The names and descriptions of each level, along with their respective indicators, are introduced in this subsection. Second, the section describes the design of the evaluation model, which builds upon the analysis model by introducing a formal method for assessing the maturity level of explainability in AI applications. A scoring function is developed to operationalize the positioning of evaluated indicators within the defined maturity levels, enabling systematic and replicable classification. Finally, the section outlines the conceptual alignment between MM4XAI-AE and the PAG-XAI dimensions (P, A, and G), showing how the maturity levels and indicators reflect and operationalize those strategic objectives. This alignment reinforces the theoretical soundness of the model while enhancing its relevance for practical implementation.
3.1. Conceptual Design of the MM4XAI—Analysis Model
In recent years, XAI has gained significant importance, especially as AI applications permeate critical sectors. Explainability is essential for enabling users to understand, trust, and effectively manage AI systems. This research incorporates key definitions and dimensions of explainability, grounding the maturity model in established concepts such as interpretability and explainability. By aligning with the phases of the AI lifecycle, from data preparation to deployment and performance management, the model provides a structured foundation for assessing the maturity of explainability practices. Furthermore, essential tools and methods, including model-agnostic techniques such as SHAP and LIME, and gradient-based methods for DL interpretability, are integrated to support the model’s indicators. To establish a robust foundation for MM4XAI-AE, we define essential concepts—including explainability, transparency, and the AI lifecycle—that guide the model’s structure and indicators.
XAI refers to a system’s ability to provide reasons and details that clearly and comprehensibly explain its functioning to a human audience. This becomes especially important in critical AI contexts where ML and DL models, often operating as “black boxes,” pose serious challenges to interpretation. According to Barredo Arrieta et al. [3], explainability enables adequate understanding, trust, and management of AI systems and is key to facilitating transparency and ethical adoption in high-impact applications such as healthcare, finance, and security.
The design of the analysis model in MM4XAI-AE is strongly grounded in the AI lifecycle, which structures the maturity levels along six essential phases: problem definition and solution design, data preparation, model development and training, model evaluation and validation, deployment, and performance management with continuous improvement. These phases were defined by the U.S. General Services Administration (GSA) in 2024. Each of them requires specific considerations regarding explainability and maturity. These considerations allow for the articulation of relevant indicators across all levels of maturity, ensuring coverage of the full development and operational cycle of AI systems.
Transparency and explainability are fundamental to the design of the analysis model and are defined in terms of three key levels: simulatability, decomposability, and algorithmic transparency [3]. These levels provide methodological support for distinguishing the depth of understanding that a user can obtain from a model and are directly associated with the progression of indicators within the maturity levels. In addition, explainability serves multiple and context-dependent purposes, including increasing user trust and adoption, promoting the transferability of models across contexts, and supporting accountability and ethical decision-making in AI [3].
To construct the maturity structure, we adapted the foundational approach of CMMI [10] to the specific context of explainability in AI applications. Drawing from CMMI’s logic of gradual improvement, we propose a model with four levels: operational, justified, formalized, and managed. These levels were chosen for their balance between clarity and scalability, avoiding unnecessary complexity while providing meaningful differentiation between stages. Each level is associated with a set of indicators that reflect increasing capacities in terms of transparency, interpretability, and applicability of XAI techniques. The alignment with the AI lifecycle ensures that these explainability practices are coherently integrated at each development stage, allowing organizations to assess and improve their maturity levels consistently.
- Level 1 (operational): pertains to AI applications developed using both free and proprietary platforms that provide visual, end-to-end (e2e) support for data processing tasks in AI: Waikato Environment for Knowledge Analysis (Weka) [48], Orange Data Mining (Orange) [49], RapidMiner Studio [50], Konstanz Information Miner (KNIME) [51], and Java Hep Framework (JHepWork) [52], or guided code available in GitHub repositories or on platforms such as the Knowledge Discovery and Data Mining Cup (KDD Cup), the Large Scale Visual Recognition Challenge (LSVRC), Kaggle, and Open Machine Learning (OpenML) [53–55]. This level was chosen because many users and developers, when utilizing these platforms, function as “end users” of AI models, possessing limited understanding of the internal processes and explainability aspects. These platforms simplify complex tasks such as model selection and the computation of performance metrics, concealing the model’s inner workings and often making the process opaque to the user. The decision to include e2e platforms at this level responds to the fact that, despite their accessibility and popularity, these tools can restrict transparency and interpretability, elements essential for achieving true AI explainability. At this level, the selected indicators—such as correct model identification, use of e2e platforms, and choice of performance metrics—reflect an initial maturity stage, where developers rely on predesigned tools without a conscious focus on explainability. The consulted literature [48–55] supports the classification of these indicators, highlighting both the utility of these tools for beginners and their limitations for advancing toward a deeper understanding of AI. Table 2 shows the proposed indicators for maturity level 1.
- Level 2 (justified): represents a stage where developers begin to adopt best practices for implementing ML or DL processes, addressing each phase of the AI lifecycle in a structured manner. At this level, users move beyond black-box tools, developing an understanding of the various components of the development process, from problem identification to model deployment. The indicators at this level, such as justification of the ML or DL task, handling of missing and outlier data, feature engineering, and model selection, were selected to reflect more advanced knowledge and an awareness of the importance of each stage in AI model development. These indicators are grounded in the literature [56, 57], which supports a more justified and methodical approach to AI implementation to enhance the system’s quality and understanding. This level incorporates practices that increase transparency and interpretability in models, moving away from the opacity typical of more simplified platforms. Table 3 shows the proposed indicators for maturity level 2.
- Level 3 (formalized): addresses AI applications that implement ML or DL models and incorporate interpretability techniques to systematically understand the model’s behavior and decisions. At this level, developers not only focus on implementing a functional model but are also committed to ensuring that the model’s decisions can be interpreted and explained clearly at a local or global level. The selected indicators at this level, such as the purpose of interpretation (to create white-box models, explain black-box models, or improve model fairness) and the choice of specific interpretability techniques for different data types, are supported by studies in the literature [58–60]. These studies emphasize the importance of interpretability in applications where transparency and model fairness are essential for acceptance and adoption. Incorporating these indicators allows the model to evaluate the maturity of applications in terms of their capability to generate comprehensible interpretations, providing a solid foundation for informed decision-making and ethical risk management. Table 4 shows the proposed indicators for maturity level 3.
- Level 4 (managed): represents the highest level of maturity in the model, focusing on AI applications that not only implement interpretability techniques but also integrate advanced XAI frameworks such as IBM AI Explainability 360 (AIX360), H2O.ai (H2O), TensorFlow Explainability (Tf-explain), Skater Library (Skater), and Captum for PyTorch (CaptumAI) [23, 61]. At this level, the aim is for AI applications to achieve high standards of transparency, informativeness, and fairness [59], with alignment to ethical and social considerations. To ensure these criteria are met, we propose a unified framework that leverages the strengths of these tools. For example, AIX360 can be used for fairness assessment, while CaptumAI offers robust interpretability techniques for DL models, and H2O provides model-agnostic explanations for structured data applications. By unifying these tools into a cohesive framework, the model can more consistently assess applications against the maturity criteria outlined at this level.
Table 1: Maturity levels defined in the MM4XAI-AE model.
Level | Description |
---|---|
1 | Operational |
2 | Justified |
3 | Formalized |
4 | Managed |
Furthermore, we establish specific guidelines to aid in evaluating compliance with each indicator, addressing challenges in measuring the impact of explainability frameworks. For example, to meet the criterion of “bias identification,” a model could leverage SHAP values within AIX360 to systematically detect biases in feature importance. At the same time, fairness and ethical impacts can be evaluated through AIX360’s fairness metrics, ensuring each application meets these high-level indicators comprehensively. Table 5 shows the proposed indicators for maturity level 4.
Table 2: Proposed indicators for maturity level 1 (operational).
Indicator | Description |
---|---|
I1 | ML or DL model correctly identified for the application’s required task |
I2 | End-to-end platform selected for implementing ML or DL processes |
I3 | Validated third-party code selected from repositories such as GitHub or Kaggle |
I4 | At least three performance metrics selected for model evaluation |
Table 3: Proposed indicators for maturity level 2 (justified).
Indicator | Description |
---|---|
I1 to I4 | Indicators of the operational level |
I5 | Justified description demonstrating the type of ML or DL task to be performed. |
I6 | Justified description demonstrating the relevance of the research leading to the implementation of the ML or DL application. |
I7 | Techniques applied for managing missing data. |
I8 | Techniques applied for managing outliers in the dataset. |
I9 | Categorical variables encoded (dummification) according to the ML algorithm used. |
I10 | Feature selection and feature engineering techniques applied. |
I11 | Class-balancing techniques implemented in the dataset. |
I12 | Process identified for selecting and partitioning the dataset into training, validation, and testing stages. |
I13 | Structured process identified for implementing ML or DL models, transitioning from “white box” to “black box” models. |
I14 | Hyperparameter tuning process applied to the selected ML or DL model(s). |
I15 | Analytical description of values obtained from performance metrics of the ML model. |
I16 | Solution deployed using services with user-focused proxies. |
Table 4: Proposed indicators for maturity level 3 (formalized).
Indicator | Description |
---|---|
I1 to I16 | Indicators of the justified level |
I17 | Purpose of model interpretation defined: creating white-box (intrinsic) models, explaining black-box (post hoc) models, improving model fairness, or testing prediction sensitivity. |
I18 | Type of interpretation determined: local (specific records) or global (model generality-based). |
I19 | Interpretability method selected according to the dataset type: text, images, graphs, or tabular data [48]. |
Table 5: Proposed indicators for maturity level 4 (managed).
Indicator | Description |
---|---|
I1 to I19 | Indicators of the formalized level |
I20 | Biases in the dataset identified focusing on social and ethical impacts on customers and society based on the application’s decisions. SHAP values within AIX360 leveraged for systematic bias detection. |
I21 | Ethical responsibility (“ethical debt”) of the application assessed, determining whether the model requires retraining or data restructuring to mitigate potential biases. |
I22 | Comprehensive XAI framework implemented, integrating tools such as AIX360, H2O, Tf-explain, Skater, or CaptumAI to enhance clarity for diverse audiences, including clients and society, regarding the application’s decision-making process. |
I23 | Proxy mechanism established to bridge the application’s decisions with societal impact, ensuring explanations are accessible and intuitive for nonexpert users, facilitating quicker adoption and trust. |
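As an illustration of the “bias identification” guideline for indicator I20, the guideline above refers to SHAP values within AIX360; since the exact API calls are not shown in the paper, the minimal sketch below uses the standalone shap package instead to compare mean feature attributions across two groups defined by a hypothetical sensitive attribute. All names and the synthetic data are assumptions for illustration only.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data; feature 0 plays the role of a hypothetical sensitive attribute
# (e.g., a demographic flag). In a real audit this comes from the application's dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=1)
sensitive = (X[:, 0] > 0).astype(int)
model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Shapley attributions for every instance (binary model: one value per feature).
sv = shap.TreeExplainer(model).shap_values(X)

# Compare mean attributions between the two groups; large systematic gaps flag
# candidate biases that indicator I20 asks the developer to inspect and document.
gap = np.abs(sv[sensitive == 1].mean(axis=0) - sv[sensitive == 0].mean(axis=0))
for i in np.argsort(gap)[::-1][:3]:
    print(f"feature {i}: mean attribution gap = {gap[i]:.3f}")
```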
3.2. Conceptual Design of the MM4XAI—Evaluation Model
The evaluation model, as a core component of MM4XAI-AE, is designed to transform qualitative explainability indicators into a quantitative and replicable scoring system. This model enables the positioning of AI-based applications within one of the four maturity levels previously defined in the analysis model—operational, justified, formalized, and managed—based on the extent to which they fulfill the proposed indicators. To achieve this, we introduce the Explainability Maturity Score (EMS), a composite index that consolidates the contribution of each indicator through a weighted aggregation mechanism. The EMS serves as the output of the evaluation model, providing a standardized measure that facilitates benchmarking and comparison across AI systems. In what follows, we define the scoring ranges, weighting factors, and mathematical formulation used to calculate the EMS, ensuring that indicators at higher levels of maturity contribute more significantly to the overall assessment. This approach reflects the importance of advanced explainability practices in complex AI applications and supports the strategic objectives of MM4XAI-AE.
To operationalize the EMS, we defined a scoring range from 0 to 100 that is divided into four segments, each corresponding to a maturity level. Recognizing that not all indicators contribute equally to explainability, we applied a differentiated weighting scheme that assigns greater importance to those indicators located in higher maturity levels. This ensures that more advanced practices have a stronger impact on the final score and better reflect the complexity and rigor required in critical AI applications. Table 6 summarizes the score intervals associated with each maturity level.
Table 6: Evaluation score intervals associated with each model maturity level (MML).
Model maturity level (MML) | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Evaluation score | [0, 10) | [10, 30) | [30, 60) | [60, 100] |
The weights assigned to each level are as follows:
- Level 1 (operational): 10% of the total score (4 indicators), weight factor θ1 = 10/4 = 2.5
- Level 2 (justified): 20% (12 indicators), weight factor θ2 = 20/12 ≈ 1.67
- Level 3 (formalized): 30% (3 indicators), weight factor θ3 = 30/3 = 10.0
- Level 4 (managed): 40% (4 indicators), weight factor θ4 = 40/4 = 10.0
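The aggregation formula itself (referenced later as equation (1)) is not reproduced in this section, so the sketch below infers it from the weights listed above and the intervals in Table 6: the EMS is the weighted sum of fulfilled indicators, and the resulting score is mapped to a maturity level. Function and variable names are illustrative, not taken from the paper.

```python
# Weight per indicator, by maturity level (theta1-theta4 above); theta2 is rounded
# to 1.67, which is consistent with the totals reported in Table 8.
LEVEL_WEIGHTS = {1: 2.5, 2: 1.67, 3: 10.0, 4: 10.0}
# Level membership of each indicator, following Tables 2-5: I1-I4, I5-I16, I17-I19, I20-I23.
INDICATOR_LEVEL = {**{i: 1 for i in range(1, 5)},
                   **{i: 2 for i in range(5, 17)},
                   **{i: 3 for i in range(17, 20)},
                   **{i: 4 for i in range(20, 24)}}
# Score intervals from Table 6, checked from the highest level down.
THRESHOLDS = [(60, 4), (30, 3), (10, 2), (0, 1)]

def ems(fulfilled: set[int]) -> tuple[float, int]:
    """Return (EMS score, model maturity level) for a set of fulfilled indicators."""
    score = sum(LEVEL_WEIGHTS[INDICATOR_LEVEL[i]] for i in fulfilled)
    level = next(lvl for lower, lvl in THRESHOLDS if score >= lower)
    return round(score, 2), level

# Example: indicators I1-I3, I5, and I6 met, as in the first row of Table 8.
print(ems({1, 2, 3, 5, 6}))  # (10.84, 2)
```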
In addition, we have developed data capture tools to implement this evaluative model in AI applications. These tools consolidate the proposed indicators and apply the EMS to assess explainability and maturity levels with consistency and precision.
3.3. Conceptual Alignment Between MM4XAI-AE and PAG-XAI Index
The MM4XAI-AE model aligns conceptually and operationally with the PAG-XAI index [5], which proposes three strategic dimensions for evaluating the maturity of XAI: practicality (P), auditability (A), and governance (G). Practicality refers to the usefulness and applicability of explainability in real-world contexts. Auditability involves the ability to systematically assess, validate, and trace AI decision-making processes. Governance encompasses the implementation of ethical, transparent, and policy-based standards that ensure responsible AI practices.
Each maturity level defined in MM4XAI-AE progressively reflects and incorporates the core elements of the PAG-XAI dimensions. At the operational level, explainability is minimally developed. Practical implementation is limited to the use of end-to-end tools without explicit attention to explainability, auditability practices are virtually absent, and governance considerations are not yet introduced. This level corresponds to an initial and underdeveloped state across all three PAG-XAI dimensions.
As systems progress to the justified level, early-stage alignment with PAG-XAI begins to emerge. Developers demonstrate basic practical understanding and begin to use explainability tools and metrics consciously. Auditability remains limited but includes initial steps toward documenting processes and recognizing the need for transparency. Governance enters the picture through early awareness of ethical, legal, or regulatory concerns, though without formal mechanisms. At this stage, the MM4XAI-AE model begins to intersect meaningfully with PAG-XAI principles, particularly in fostering awareness and initiating systematic practices.
The formalized level shows stronger conceptual convergence with the PAG-XAI framework. Here, practical methods such as SHAP and LIME are applied systematically to generate explainability outputs tailored to user and domain needs. Interpretability methods are documented and integrated, allowing for structured auditing of model behavior. From a governance standpoint, organizations begin to establish internal standards, ethical guidelines, and preliminary compliance protocols. This level marks a critical stage in aligning explainability with the auditability and governance dimensions in a more deliberate and structured manner.
The highest level, managed, fully reflects the ideals of PAG-XAI. Practicality is demonstrated through user-centered explainability mechanisms that are embedded into decision-making workflows. Auditability is robust and continuous, involving advanced validation practices, bias detection, and accountability mechanisms. Governance becomes comprehensive, with organizations implementing ethical principles, external compliance standards, and mechanisms for social accountability. At this level, the maturity of explainability practices is not only technically advanced but also aligned with broader organizational responsibilities and societal expectations.
In this way, MM4XAI-AE not only complements the PAG-XAI framework but also serves as a practical instrument to realize its strategic vision. While PAG-XAI offers a high-level conceptual guide to achieving responsible AI, MM4XAI-AE provides a tangible model with concrete indicators and assessment mechanisms that translate those ideals into measurable actions. The EMS generated by MM4XAI-AE allows for the operational evaluation of an application’s maturity level in explainability, enabling a quantifiable and replicable assessment aligned with PAG-XAI’s principles.
The relationship between the two models is, therefore, both conceptual and instrumental. MM4XAI-AE offers a roadmap with practical steps toward achieving the ideals articulated by PAG-XAI. Conversely, PAG-XAI provides the normative and strategic framing that underscores why progressing through these maturity levels is necessary for trustworthy, auditable, and ethical AI. In the future, these models could be integrated further by combining the EMS indicators with PAG-XAI metrics to create a holistic maturity evaluation system, offering organizations both strategic direction and operational guidance for advancing explainability in AI.
4. Results
We validated the complete MM4XAI-AE model—including both the analysis and evaluation components—by applying it to thirteen AI-based applications that use ML algorithms to process structured datasets. These datasets are publicly available in the “Machine Learning Repository” of UC Irvine, and the corresponding scientific papers are accessible through various open-access journals. Table 7 consolidates the dataset names, corresponding scientific articles, task types (classification or regression), data characteristics (types of attributes, number of instances, and number of features), and year of publication.
Table 7: Datasets and AI-based applications evaluated with MM4XAI-AE.
Dataset | Article title | Type of task in ML | Attribute types | # Instances | # Attributes | Year of publication |
---|---|---|---|---|---|---|
Antifouling performance index | Establishment of an antifouling performance index derived from the assessment of biofouling on typical marine sensor materials [62] | Classification | Integer: Pixels in images | Not reported in the document | Features such as pixel intensity, edge filters, and texture | 2023 |
DNA barcode database | Artificial intelligence in timber forensics employing DNA barcode database [63] | Classification | Real | Not reported in the document | Not reported in the document | 2023 |
Algerian forest fires dataset | Predicting forest fire in Algeria using data-mining techniques: case study of the decision tree algorithm [64] | Classification, regression | Real | 244 | 12 | 2019 |
Alcohol QCM Sensor dataset | Classification of alcohols obtained by QCM sensors with different characteristics using ABC-based neural network [65] | Classification, regression, clustering | Real | 125 | 8 | 2019 |
AI4I 2020 predictive maintenance dataset | Explainable artificial intelligence for predictive maintenance applications [66] | Classification, regression, explainability | Real | 10,000 | 14 | 2020 |
Amphibians dataset | Predicting presence of amphibian species using features obtained from GIS and satellite images [67] | Classification | Integer, real | 189 | 23 | 2020 |
Intrusion detection evaluation dataset (CIC-IDS2017) | False positive identification in intrusion detection using XAI [68] | Classification | Categorical, numerical | 2,830,743 | 11 | 2017 |
Alzheimer’s Disease Neuroimaging Initiative (ADNI) | A multilayer multimodal detection and prediction model based on explainable artificial intelligence for Alzheimer’s disease [69] | Classification | Numeric, categorical, textual, and images | 1048 | 11 | 2004 |
Retinal optical coherence tomography angiography (OCTA) | Early detection of dementia through retinal imaging and trustworthy AI [70] | Classification | Images | 1192 | Process the images with a CNN | 2024 |
Demographic health survey (DHS), focusing on reproductive-aged women from six East African countries | Explainable artificial intelligence models for predicting pregnancy termination among reproductive-aged women in six East African countries: machine learning approach [71] | Classification | Categorical, numerical | 338,904 | 18 | 2024 |
B&VIIT Eye Center and Kim’s Eye Hospital | Code-Free Machine Learning Approach for EVO-ICL Vault Prediction: A Retrospective Two-Center Study [72] | Classification | Numeric | 640 | 18 | 2024 |
Pulmonary tuberculosis (TB) detection in chest radiographs (CXR) | A deep learning-based algorithm for pulmonary tuberculosis detection in chest radiography [73] | Classification | Image | 6881 | Process the images with a CNN | 2024 |
Images of herbal plants for classification purposes | Teachable machine: optimization of herbal plant image classification based on epoch value, batch size, and learning rate [74] | Classification | Image | 10,000 | Process the images with a CNN | 2024 |
To apply the model, we developed a structured data collection instrument aligned with all the indicators defined in the analysis model. This instrument includes guiding questions designed to evaluate the presence or absence of each indicator. Since direct interaction with the application developers was not feasible, we relied on a careful reading of the published papers, operating under the assumption that the development processes are sufficiently documented in the article texts.
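Because the data collection instrument itself is not reproduced in the paper, the sketch below only suggests how such a checklist could be encoded for use alongside the scoring function sketched in Section 3.2; the question wording and all names are hypothetical.

```python
# Hypothetical guiding questions paraphrasing a few indicators from Tables 2-5
# (wording is illustrative, not taken from the actual instrument).
GUIDING_QUESTIONS = {
    1: "Is the ML or DL model correctly identified for the application's task?",
    5: "Is the type of ML or DL task to be performed explicitly justified?",
    17: "Is the purpose of model interpretation (intrinsic, post hoc, fairness, sensitivity) stated?",
    20: "Are dataset biases and their social or ethical impacts identified?",
}

def collect_review(answers: dict[int, bool]) -> set[int]:
    """Convert an expert's yes/no answers into the set of fulfilled indicators."""
    return {indicator for indicator, met in answers.items() if met}

# Example: an expert confirms I1 and I5 but not I17 or I20 for a reviewed paper.
fulfilled = collect_review({1: True, 5: True, 17: False, 20: False})
```

The resulting set of fulfilled indicators can then be passed to an aggregation routine such as the EMS sketch in Section 3.2 to obtain the score and maturity level.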
The application of the MM4XAI-AE was carried out by four experts with extensive experience in AI, both as developers and researchers. Each expert independently reviewed the assigned articles using the instrument, following a guided scoring process based on the maturity indicators. The evaluation simulated an expert interview scenario, where the evaluator interprets the documentation as if interacting with the developers responsible for the AI implementation. For each article, the experts scored the indicators that were met, and the EMS was calculated using the weighted formula defined in equation (1).
This procedure allowed us to determine the maturity level of each AI-based application and to verify that the evaluated studies were distributed across all four maturity levels defined in the MM4XAI-AE. The section concludes with a detailed summary of the indicators identified in each of the thirteen evaluated papers, allowing for a transparent and structured view of how each AI-based application aligns with the maturity levels defined in MM4XAI-AE. Table 8 compiles the results of the expert evaluations and shows the score assigned to each paper, highlighting the diversity in explainability practices across the selected applications. The indicators used and the scoring methodology provide a replicable and systematic way to assess explainability in real-world AI systems.
Table 8: Indicators fulfilled by each evaluated application (1 = met, 0 = not met), total EMS, and resulting model maturity level (MML).
Application references | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | I10 | I11 | I12 | I13 | I14 | I15 | I16 | I17 | I18 | I19 | I20 | I21 | I22 | I23 | Total | MML |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[62] | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.84 | 2 |
[63] | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.84 | 2 |
[64] | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.85 | 2 |
[65] | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.85 | 2 |
[66] | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 72.53 | 4 |
[67] | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28.37 | 2 |
[68] | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 80.04 | 4 |
[69] | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 80.04 | 4 |
[70] | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 49.19 | 3 |
[71] | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 68.36 | 4 |
[72] | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 46.69 | 3 |
[73] | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.17 | 1 |
[74] | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.17 | 1 |
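Reading Table 8 against the weights of Section 3.2 (with θ2 rounded to 1.67) is consistent with the reported totals. For instance, the application in [62] fulfills three operational indicators and two justified indicators, giving 3 × 2.5 + 2 × 1.67 = 10.84, which falls in the [10, 30) interval of Table 6 and therefore corresponds to maturity level 2; likewise, [67] fulfills all four operational and eleven justified indicators, giving 4 × 2.5 + 11 × 1.67 = 28.37, which remains within level 2 despite the high indicator count.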
The AI applications analyzed span a wide range of domains and levels of complexity. Table 7 presents a summary of the datasets used, emphasizing the diversity of tasks (classification and regression), attribute types (numeric, categorical, and image), and publication years, which demonstrates the contemporary relevance of the selected studies. It is important to note that all papers describe systems that implement AI algorithms throughout the lifecycle, including model training, evaluation, and, in some cases, deployment and optimization, ensuring a comprehensive view of their explainability maturity.
Several datasets, such as the Antifouling Performance Index and the DNA Barcode Database, are associated with image and textural data and involve classification tasks. However, the limited reporting of attribute or instance counts in some articles restricted the full analysis of dataset complexity. In contrast, other datasets such as CIC-IDS2017 and ADNI demonstrate higher complexity due to their large number of features and tasks involving both classification and regression, often integrating interpretability strategies.
Advanced applications—particularly those using convolutional neural networks (CNNs) for image-based diagnosis, such as retinal OCTA and tuberculosis detection from CXR scans—reflect the integration of post hoc explainability techniques. These examples underscore the growing importance of explainability in domains where model decisions have high-stakes consequences.
The heterogeneity of the evaluated AI-based applications highlights the adaptability of MM4XAI-AE across different contexts. Its application in domains such as healthcare, cybersecurity, environmental science, and computer vision validates its robustness and relevance. The results demonstrate that the model is capable of capturing a wide range of maturity levels in explainability practices, regardless of the domain of application. Furthermore, this diversity supports the model’s potential for broader adoption and reinforces its value as a diagnostic and planning tool for organizations seeking to enhance the explainability of their AI systems.
Table 8 shows that the selected investigations cover the four levels proposed in the maturity model. The article “Establishment of an Antifouling Performance Index Derived from the Assessment of Biofouling on Typical Marine Sensor Materials” was classified at maturity level 2 (“Justified”). This classification reflects the correct identification of the supervised learning model required to classify biofouling organisms (I1) and the use of technological tools such as Weka Segmentation for process implementation (I2). In addition, it relies on validated code and standard techniques, such as image analysis libraries (I3). However, although the study justifies the classification task and its relevance for optimizing marine materials (I5 and I6), it does not address more advanced techniques, such as handling missing data, feature selection, or hyperparameter tuning, which limits its formalization. This combination of elements explains its placement at this maturity level, demonstrating a conscious development effort but lacking a fully structured methodological framework.
The article “Predicting Forest Fire in Algeria Using Data Mining Techniques: Case Study of the Decision Tree Algorithm” fulfilled several Level 2 indicators due to its methodological approach and results. Firstly, I1 was met by correctly identifying the ML model, the J48 decision tree, as the most suitable for predicting forest fires based on the available meteorological data and the specific requirements of the case study. Furthermore, I2 was evidenced by the model’s implementation using the Weka tool, an integrated platform for ML processes, while I4 was achieved by evaluating the model with key metrics such as accuracy, recall, and F-measure, demonstrating robust quantitative analysis. The article also addressed I5 and I6 by justifying the model selection and emphasizing the relevance of the forest fire prediction problem, a critical issue in the context of Algeria due to its significant environmental and economic impacts. In addition, I12 and I13 were fulfilled by explicitly identifying and describing the data partitioning process into training and testing stages and progressing toward a “black box” model by developing a decision tree-based approach that optimizes performance while maintaining simplicity for future hardware implementations. Finally, I15 was achieved by providing a detailed analysis of the evaluated metric values, showing an accuracy of 82.92% and a recall of 92%, aligning with the selected model’s expectations.
The article “Classification of Alcohols Obtained by QCM Sensors with Different Characteristics Using ABC-Based Neural Network” fulfilled several Level 2 indicators due to its methodological approach and the implementation of advanced ML techniques. The fulfillment of I1 was evident through the correct selection of an artificial neural network based on the artificial bee colony (ABC) optimization algorithm as the model for the required classification task. The use of specific platforms and tools for implementing these processes, as outlined in the methodology, supports the fulfillment of I3. In addition, multiple performance metrics, such as mean squared error (MSE), were analyzed, satisfying I4. The article also addresses I5 and I6 by providing a detailed justification of the relevance of alcohol classification using QCM sensor data and its potential impact on industrial and hygiene applications. The data partitioning process for training and testing was clearly described, aligning with indicator I12. Hyperparameter tuning, including the number of neurons in hidden layers and training cycles, was also thoroughly explored, fulfilling I14. Finally, the results are presented through a comprehensive analysis of the performance metrics achieved, demonstrating a notable improvement in the model’s accuracy and effectiveness, reflecting the fulfillment of I15.
The article “Explainable Artificial Intelligence for Predictive Maintenance Applications” achieved maturity level 4 due to its comprehensive approach and advanced use of XAI techniques. First, the article fulfilled indicator I1 by correctly identifying ensemble-based decision trees and interpretable decision tree models to address the predictive maintenance task. Indicator I3 was also met through validated code and implemented algorithms, as detailed in the methodology. At least three metrics, including precision, recall, and confusion matrices, were employed to evaluate model performance, achieving indicator I4. The paper provides a detailed justification of the type of ML task to be performed (I5) and its relevance to industrial applications, as illustrated by the discussion of predictive maintenance’s impact on production processes (I6). Techniques for feature engineering and selecting relevant attributes were implemented, aligning with indicators I9 and I10.
Furthermore, class balancing techniques were applied to address the inherent imbalance in the dataset (I11), and the data partitioning process for training, validation, and testing stages was clearly defined (I12). The article demonstrated compliance with I13 and I14 by exploring structured approaches to interpretable decision trees, including hyperparameter tuning of settings such as node count and selected feature sets. A detailed analysis of performance metrics was conducted, fulfilling I15. At the interpretability level, the purpose of explainable models to enhance trust and adoption was clearly defined, addressing indicators I17, I18, and I19. Finally, explanatory interfaces such as normalized feature deviations and decision trees provided an accessible evaluation for nontechnical audiences, effectively implementing XAI frameworks (I22). In addition, a mechanism was established to link model decisions to broader societal impacts, fostering trust and technological adoption (I23).
The article “Predicting Presence of Amphibian Species Using Features Obtained from GIS and Satellite Images” met Level 2 indicators due to its rigorous methodology and analytical results. First, I1 was fulfilled by correctly identifying models such as Gradient Boosted Trees (GBTs) for predicting amphibian presence based on relevant environmental data. Comprehensive platforms like RapidMiner, integrated with Weka and H2O plugins, were used to implement ML processes, meeting I2. In addition, metrics such as area under the curve (AUC) and balanced accuracy were selected to evaluate model performance, aligning with I4. The article also detailed the species classification problem and its ecological relevance, justifying I5 and I6. Indicators I7 and I8 were addressed through techniques for managing missing data and identifying outliers in the environmental datasets. Moreover, categorical variable encoding and feature engineering techniques were implemented to optimize model accuracy, fulfilling I9 and I10. The research team employed advanced techniques to balance uneven classes in the datasets, meeting I11, and provided a thorough description of the data partitioning process for training, validation, and testing stages, satisfying I12. The design and tuning of specific hyperparameters for the selected models demonstrated a meticulous approach, fulfilling I13 and I14. Finally, the results were presented through a detailed analysis of key metrics, fulfilling I15 and demonstrated that the proposed approach is suitable for practical applications in environmental planning and impact assessment.
The article titled “False Positive Identification in Intrusion Detection Using XAI” achieved maturity level 4 due to its comprehensive approach to identifying and mitigating false positives in anomaly-based intrusion detection systems (IDS) through XAI techniques. It fulfilled I1 by selecting a neural network model for intrusion detection and optimizing it using attributes derived from XAI tools, such as SHAP and the adversarial approach. Indicators I2 and I3 were met using advanced platforms and validated datasets, including LYCOS Intrusion Detection System 2017 (LYCOS-IDS2017). Model performance was evaluated using precision, false positive rate, and area under the curve (AUC), satisfying I4. The article justified the application of ML techniques to address cybersecurity challenges (I5) and highlighted the relevance of the proposed model by demonstrating its capacity to reduce false positives, thereby improving intrusion detection efficiency (I6). Missing data were managed, and outliers were addressed during preprocessing, fulfilling I7 and I8. Categorical variables were encoded to ensure proper data representation (I9).
Furthermore, advanced feature selection and transformation techniques were applied (I10), and methods were used to balance classes within the datasets (I11). The study included a structured data partitioning process into training, validation, and testing stages (I12) and implemented techniques that integrated both “white box” and “black box” models (I13). Hyperparameter tuning was performed to maximize model performance (I14), and a detailed analysis of the obtained metrics was conducted (I15). At maturity level 3, the article satisfied I17 by identifying the purpose of model interpretation and fulfilled I18 and I19 by selecting methods such as SHAP and the Adversarial Approach to explain decisions based on XAI vectors. Finally, at level 4, the study addressed I20 and I21 by identifying biases in the data and assessing the implications of the model’s decisions. This result was complemented by implementing a comprehensive XAI framework that included tools like SHAP and Principal Component Analysis (PCA), promoting clear and accessible explanations for non-technical users, thereby fostering trust and system adoption (I22 and I23).
The article “A Multilayer Multimodal Detection and Prediction Model Based on Explainable Artificial Intelligence for Alzheimer’s Disease” achieved maturity level 4 by addressing early diagnosis and progression prediction of Alzheimer’s disease through an XAI approach. The fulfillment of indicators begins with the precise identification of the ML model (I1), employing Random Forest (RF) as the primary classifier, and the selection of advanced platforms for implementing multimodal processes (I2). The model incorporated validated code and detailed metrics, such as precision, F1-score, and AUC, to evaluate performance (I3, I4). The study justified the relevance and nature of the Alzheimer’s prediction and classification problem, emphasizing its clinical impact and the benefits of early detection (I5, I6). Robust data management techniques, such as feature selection and class balancing strategies, were also addressed (I7-I11), ensuring rigorous data partitioning for training, validation, and testing stages (I12). The structured transition toward explainable models was achieved using SHAP and fuzzy rule-based systems to interpret decisions (I13 and I17). Advanced techniques for hyperparameter tuning, performance analysis through metrics, and deployment of detailed explanations for each model decision were implemented, with natural language representations accessible to medical practitioners (I14–I16, I18-I19). Finally, the model analyzed data biases and provided explanations consistent with medical standards, ensuring ethical responsibility and fostering trust in its clinical use (I20 and I21). This study exemplifies how AI can effectively balance precision and explainability to address complex medical challenges.
The article “Early Detection of Dementia Through Retinal Imaging and Trustworthy AI” reached Level 3 maturity in the technological maturity model due to its robust integration of advanced DL techniques and interpretability analysis. Indicator I1 was fulfilled by selecting the Eye-AD model, an architecture based on graph and CNNs. Indicator I3 was satisfied through the use of libraries such as PyTorch for model development, highlighting the implementation of validated techniques. Regarding indicator I4, the article employed precision, F1-score, and AUC metrics to evaluate model performance across different stages. At the justified level, the article establishes the nature and relevance of the problem addressed (I5 and I6), emphasizing the importance of early detection of Alzheimer’s disease and mild cognitive impairment using optical coherence tomography angiography (OCTA) images. Indicator I10 was addressed using advanced feature selection and extraction techniques from retinal images.
In addition, the article describes a structured process for partitioning datasets into training, validation, and testing stages (I12) and presents a progressive design for analyzing intra- and inter-instance relationships within the data, fulfilling indicator I13. Hyperparameter tuning (I14) was detailed, optimizing components such as the multilevel graph neural network. Furthermore, an exhaustive analysis of the values obtained from the selected metrics was conducted, meeting indicator I15. The model’s implementation in a user environment was notable for providing precise and reliable explanations of model decisions, achieving indicators I16, I17, and I18. Finally, indicator I19 was addressed by adapting interpretive methods based on heatmaps and statistical analysis, enabling an understanding of the relevance of different image regions in the diagnostic process. These elements position the article as an example of technological maturity in the application of AI for early disease diagnosis, with significant advances in interpretability techniques and biomedical analysis.
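The article’s own heatmap-based interpretation method is not reproduced here; as a hedged illustration, the sketch below computes a simple gradient-based saliency map for a placeholder CNN in PyTorch, one common way to highlight which image regions drive a prediction. The network and input are stand-ins, not the Eye-AD architecture or OCTA data.

```python
# Illustrative sketch only: a gradient-based saliency map for a CNN prediction,
# one simple way to produce the kind of heatmap interpretation noted for I19.
# The network and input below are placeholders, not the Eye-AD model.
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in CNN, not Eye-AD
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
)
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # placeholder retinal-style input
score = model(image)[0].max()               # score of the predicted class
score.backward()                            # gradients w.r.t. input pixels

# Saliency: max absolute gradient across channels gives a per-pixel relevance map.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)
print(saliency.shape)                       # torch.Size([224, 224])
```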
The article titled “Explainable Artificial Intelligence Models for Predicting Pregnancy Termination Among Reproductive-Aged Women in Six East African Countries” achieved Level 4 technological maturity due to its rigorous approach to the predictive and explainable analysis of pregnancy termination among women of reproductive age. First, ML models such as Random Forest (RF) and Extreme Gradient Boosting (XGB) were correctly identified to address the task, fulfilling I1. Additionally, at least three metrics, including precision, F1-score, and AUC, were selected to evaluate model performance, meeting I4. The article provided well-founded descriptions of the task and the relevance of the model, emphasizing how the results can influence reproductive health and public policies, fulfilling I5 and I6. During preprocessing, techniques such as imputation of missing data and class balancing using the Synthetic Minority Oversampling Technique (SMOTE) were applied, satisfying I7 and I11. Variable encoding and transformations were performed, addressing I9, and advanced feature selection techniques, such as Mutual Information and Step-Backward Feature Selection, were implemented, fulfilling I10. The process included stratified data partitioning into training and testing sets, meeting I12. At the formalized level, the RF model was chosen for its ability to handle high-dimensional data and robustness against overfitting, and it was interpreted using XAI tools such as SHAP, LIME, and Explain Like I’m Five (Eli5), fulfilling I17, I18, and I19. These tools identified key predictors, including wealth index, educational level, and access to potable water. The model also assessed biases in the data and applied interpretability techniques to enhance explainability, satisfying I20 and I22. Overall, the comprehensive and explainable approach positions this work as a benchmark in applying AI to address public health challenges and intervention planning in East Africa.
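As a minimal sketch of the preprocessing steps cited for I10–I12 (mutual-information feature selection, stratified partitioning, and SMOTE balancing), the following fragment uses synthetic placeholder data rather than the study’s survey data; the feature counts and parameters are illustrative assumptions.

```python
# Minimal sketch (placeholder data, not the study's survey data) of the
# preprocessing steps cited for I10-I12: mutual-information feature selection,
# SMOTE class balancing, and a stratified train/test split.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)

# Keep the k features with the highest mutual information with the target (I10).
X_selected = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# Stratified partitioning preserves the class ratio in both splits (I12).
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE is applied to the training split only, so the test set stays untouched (I11).
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(sorted(set(y_train_bal)), len(y_train_bal))
```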
The article “Code-Free Machine Learning Approach for EVO-ICL Vault Prediction: A Retrospective Two-Center Study” achieved Level 3 maturity in the technological maturity framework, excelling in postoperative vault prediction using ML tools without requiring coding. The fulfillment of indicator I1 was demonstrated by correctly identifying learning models such as Random Forest, Gradient Boosting, and Adaptive Boosting for predicting vault values in patients undergoing Implantable Collamer Lens (ICL) surgery. For indicator I4, robust metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Area Under the Curve (AUC) were employed to evaluate model performance in both internal and external validations. The study adequately justified the problem’s relevance (I5, I6), emphasizing the importance of optimizing implanted lens size to reduce postoperative complications. Handling missing data and normalizing key variables met the requirements of indicator I7, while the dataset partitioning into training and validation stages, performed through 10-fold cross-validation, satisfied indicator I12.
Furthermore, hyperparameter tuning using grid search was implemented, fulfilling indicator I14. The article also satisfied I15 by providing a detailed analysis of the metrics, demonstrating that the Random Forest and Gradient Boosting models outperformed traditional approaches such as linear regression. Regarding interpretability, indicators I17, I18, and I19 were achieved through tools such as t-distributed Stochastic Neighbor Embedding (t-SNE) visualizations and decision tree-based classification criteria, which enabled the interpretation of key factors, including lens size and anterior chamber volume. Finally, the article evaluated inter-center measurement biases and addressed the need for personalized models, ensuring greater accuracy across diverse clinical contexts and thereby reinforcing the study’s applicability to medical practice.
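The study’s code-free workflow is not available for reproduction; the sketch below merely illustrates, with placeholder data, how grid-search hyperparameter tuning combined with 10-fold cross-validation (the pattern referenced for I12 and I14) is typically expressed in scikit-learn, scoring with MAE as one of the metrics named above.

```python
# Illustrative sketch (placeholder data): grid-search hyperparameter tuning with
# 10-fold cross-validation, the combination referenced for indicators I12 and I14.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=10,                                  # 10-fold cross-validation
    scoring="neg_mean_absolute_error",      # MAE, one of the metrics cited for I4
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```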
The article “A Deep Learning-Based Algorithm for Pulmonary Tuberculosis Detection in Chest Radiography” achieved Level 1 technological maturity in the maturity model through its fulfillment of several key indicators. The deep neural network model MobileNet was correctly selected as the foundation for training a tuberculosis (TB) detection algorithm for chest radiographs, meeting indicator I1. The use of Google Teachable Machine, an accessible platform that does not require extensive coding, enabled the implementation of the DL process, aligning with indicator I2. Furthermore, the article used validated, open-source code and datasets drawn from Kaggle and other recognized sources, satisfying indicator I3. At least three metrics were selected to evaluate the model’s performance: precision, sensitivity, and AUC, fulfilling indicator I4. These metrics demonstrated that the algorithm achieved AUC values of 0.951 and 0.975 on the validation sets, highlighting its accuracy in detecting TB at levels comparable to clinical experts. This work stands as an example of the initial level of technological maturity in the application of AI for medical diagnostics, providing a solid foundation for future research and more advanced developments.
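For readers less familiar with the metrics behind indicator I4, the short sketch below shows one standard way to compute precision, sensitivity (recall), and AUC with scikit-learn; the labels and probabilities are made up for demonstration and are unrelated to the TB study.

```python
# Minimal sketch with made-up predictions: computing the three metrics named for
# indicator I4 (precision, sensitivity/recall, and AUC) with scikit-learn.
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # placeholder ground-truth labels
y_prob = [0.1, 0.4, 0.8, 0.9, 0.6, 0.2, 0.3, 0.05]   # placeholder predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]             # threshold at 0.5

print("precision  :", precision_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))   # recall == sensitivity
print("AUC        :", roc_auc_score(y_true, y_prob))  # AUC uses the raw probabilities
```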
The article fulfilled indicators I1, I2, I4, and I6, placing it at Level 1: it correctly identified the ML/DL models required for the application, used the Teachable Machine platform as an end-to-end environment to implement the ML/DL process, selected at least three metrics to evaluate model performance, and presented a justified description of the research’s relevance. The study employed a CNN architecture in combination with Teachable Machine to classify images of medicinal plants, testing parameters such as batch size, learning rate, and the number of epochs to optimize performance. It also detailed how these decisions were critical in achieving an accuracy of 98%–100%, highlighting the model’s utility in web applications for plant identification and justifying its implementation as a means of improving access to knowledge about medicinal plants. This work represents an initial step in applying ML techniques, but its scope remains limited with respect to the higher maturity levels.
The diverse applications and methodologies analyzed across the selected articles underscore the flexibility and adaptability of the technological maturity model in evaluating AI-based solutions across various domains. By systematically addressing different maturity levels and employing robust ML and explainability techniques, these studies highlight both the potential and the limitations of current AI implementations. The findings reveal a progression from foundational applications built on simple platforms to sophisticated frameworks integrating advanced models, interpretability tools, and ethical considerations. This broad spectrum of use cases validates the maturity model’s relevance and offers insight into the steps needed to advance AI-driven applications toward higher maturity levels. Such understanding is critical for promoting the development of reliable, scalable, and impactful AI solutions across scientific, medical, environmental, and societal contexts.
It is important to note that, although greater rigor in applying the maturity model is warranted, for instance by including multiple evaluators to corroborate the scores assigned to each article, the purpose of this initial exercise was to verify that applying the model to different articles would surface the various maturity levels. This finding suggests that the proposed model can effectively measure maturity levels across the established indicators.
5. Discussion
The findings of this study underscore the need for a dedicated maturity model to assess explainability in AI applications. The MM4XAI-AE, comprising both the analysis and evaluation components, was applied to thirteen AI-based developments using structured datasets and allowed for the identification of maturity levels based on specific explainability indicators and the computation of the EMS. This application demonstrated how the model effectively positions AI applications across the four defined maturity levels, revealing a wide variability in the adoption of interpretability and explainability practices. Importantly, the evaluation process was carried out by four experts in AI, who followed a structured guide to assess each article and score the presence of the indicators. The results reflect that, although efforts exist to promote explainability in AI, significant challenges remain in standardizing and operationalizing these practices across domains.
Of the thirteen evaluated studies, six achieved higher maturity levels (formalized and managed), while seven remained at the lower levels (operational and justified). Applications with lower maturity tended to lack formal strategies for interpretability and transparency, often relying on black-box models without documented explainability processes. Although technically functional, such systems face adoption barriers in high-stakes domains due to the absence of clear, traceable, and ethical decision-making mechanisms. In contrast, intermediate and advanced-level applications incorporated tools such as SHAP, LIME, and Captum to provide transparency. However, these tools are predominantly technical and geared toward developers, limiting their accessibility to broader audiences such as domain experts, nontechnical users, and regulators. This reinforces the need for explainability mechanisms that are not only technically sound but also user-centered and aligned with decision-making needs.
Temporal trends in the maturity levels derived from the EMS values suggest some fluctuation in explainability maturity over the years. In 2019, the average maturity level was 2.00; it rose to 2.67 in 2020 and peaked at 4.00 in 2021. However, a decline was observed in the subsequent years, with averages of 2.67 in 2023 and 2.40 in 2024. While this indicates some progress, it also reflects inconsistency in the integration of explainability practices, possibly influenced by sectoral interests, evolving technologies, and varying levels of awareness. Although this trend cannot be generalized to the entire field of AI, the study still provides valuable insight into how explainability adoption evolves over time in scientific research.
In critical sectors such as healthcare and security, the variability in explainability maturity is particularly concerning. The healthcare-related papers showed notable dispersion—ranging from Level 1 (operational), as in the case of the tuberculosis detection system (2024), to Level 4 (managed), such as the pregnancy termination prediction model (2024). This variability underscores the urgency of implementing clear explainability standards in fields where decisions directly impact human wellbeing and rights.
From a practical standpoint, the MM4XAI-AE offers tangible value for both academia and industry. For researchers, it provides a structured methodology for evaluating explainability maturity, promoting comparability and reproducibility. For practitioners and organizations, the model serves as a diagnostic tool that helps assess current explainability practices and define pathways for improvement. Its application can inform strategic planning for AI adoption in line with ethical and transparent standards, ensuring that outputs are interpretable and usable for diverse stakeholders. As a recommendation, organizations should prioritize the integration of interpretability methods in AI solutions to ensure that models are transparent, trustworthy, and responsive to societal needs.
Nevertheless, this study has limitations. The evaluation was based solely on the content documented in published articles, without direct access to the developers or source code of the AI systems. This limitation means that some practices might have been implemented but not reported, potentially affecting the scoring. Future work should include interviews or structured surveys with development teams to validate and complement the documentation-based evaluations. In addition, applying the model in diverse regions or regulatory environments may help uncover contextual variables that affect explainability adoption.
Another limitation lies in the size and scope of the sample. Although the current set of thirteen AI applications enabled a robust initial validation of the model, it is not representative of all AI use cases. Future research should broaden the application of MM4XAI-AE to include a larger, more diverse pool of domains—including industrial, commercial, and governmental implementations. This would improve the model’s generalizability and allow for a more comprehensive understanding of how explainability evolves across sectors.
Moreover, future studies may refine the maturity model by incorporating additional indicators. These could include dimensions such as the perceived usefulness of explanations by end users, the influence of explainability on decision-making processes, and adherence to domain-specific ethical and legal standards. Further iterations of the model could also explore how explainability interacts with other critical aspects such as algorithmic fairness, bias mitigation, and social responsibility.
This study represents a significant advancement in the assessment of explainability maturity in AI systems, offering a practical and replicable methodology through the MM4XAI-AE. The findings affirm the necessity of strengthening explainability practices and provide a roadmap for ongoing improvements in research and implementation. Future work should aim to complete the maturity framework by developing the third component—the improvement model—which will complement the analysis and evaluation models already established. This will ensure the model’s long-term relevance and utility across a wide spectrum of applications and regulatory contexts, guiding the evolution of more explainable, ethical, and human-centered AI.
6. Conclusions and Future Work
This research proposes a structured maturity model for explainable artificial intelligence named MM4XAI-AE, designed to assess the degree of explainability in AI-based applications through a two-part structure comprising an analysis model and an evaluation model. The MM4XAI-AE introduces four progressive maturity levels—operational, justified, formalized, and managed—and evaluates explainability using 23 indicators distributed across three key dimensions: technical foundations, structured design, and human-centered explainability. Each indicator contributes to the EMS, a weighted index ranging from 0 to 100 that prioritizes higher maturity levels to reflect their greater impact on achieving trustworthy AI.
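The exact indicator weights of the EMS are defined by the evaluation model and are not restated in this section; the sketch below only illustrates the general form of such a weighted, level-prioritizing index normalized to the 0–100 range, using assumed weights and an assumed indicator-to-level assignment.

```python
# Illustrative sketch of an EMS-style weighted score. The per-level weights and
# the indicator-to-level assignment below are assumptions for demonstration; the
# actual values are defined by the MM4XAI-AE evaluation model.
# level -> fulfilled/unfulfilled flags (0/1) for that level's indicators
fulfilled = {
    1: [1, 1, 1, 1],                  # placeholder indicator counts per level
    2: [1, 1, 0, 1, 1, 1, 0, 1],
    3: [1, 0, 1, 1, 0, 1, 1],
    4: [0, 1, 1, 0],
}
weights = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}   # assumed: higher levels weigh more

achieved = sum(weights[lvl] * sum(flags) for lvl, flags in fulfilled.items())
maximum = sum(weights[lvl] * len(flags) for lvl, flags in fulfilled.items())
ems = 100 * achieved / maximum               # normalized to the 0-100 range
print(round(ems, 1))
```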
The model was empirically validated by applying it to thirteen AI applications sourced from open-access scientific publications using structured datasets. Four AI experts applied the MM4XAI-AE following a guided evaluation protocol and scored the indicators in each study, enabling a robust classification of the applications across the maturity levels. This process revealed a wide distribution across the four levels, demonstrating the model’s capability to diagnose strengths, gaps, and areas of improvement in explainability practices.
The broader impact of this research lies in its potential to guide ethical development and inform regulatory and organizational strategies surrounding AI adoption. By providing a standardized and practical framework to evaluate explainability, the MM4XAI-AE supports transparency, fairness, and trust in AI systems, which is particularly relevant in high-stakes contexts such as healthcare, security, and public policy. Its conceptual alignment with the PAG-XAI framework (practicality, auditability, and governance) further reinforces its strategic relevance and utility.
Building on these results, future work will focus on the following directions:
1. Revising current indicators and integrating additional measures to more effectively capture explainability nuances.
2. Engaging domain experts to evaluate and validate the model’s relevance, reliability, and scoring mechanisms.
3. Extending the model’s application to broader ML and DL scenarios, testing its generalizability and robustness.
4. Developing a complementary improvement model to guide organizations toward achieving higher maturity levels and addressing identified gaps.
5. Establishing an evaluative approach to track AI explainability improvements over time.
6. Enhancing future iterations by explicitly integrating the PAG-XAI dimensions (practicality, auditability, and governance) to advance the practical use of XAI maturity approaches, reinforcing auditability, governance, and broader organizational adoption of explainable AI practices.
In an era where AI systems increasingly influence decisions that impact lives, rights, and public trust, the demand for transparency, fairness, and accountability has never been more urgent. MM4XAI-AE represents a critical step forward by bridging theoretical reflection with operational practice, offering a scalable and domain-agnostic pathway to evaluate and elevate explainability maturity in AI applications. Aligned with the PAG-XAI framework, the model embeds practicality, auditability, and governance as core principles of responsible AI. By providing measurable indicators and actionable insights, MM4XAI-AE empowers organizations to build AI systems that are not only technically robust but also ethically grounded and socially responsive.
As AI continues to permeate high-stakes domains, there is a pressing need for systematic and context-aware maturity assessments that move beyond mere compliance and reflect the evolving expectations of stakeholders. We call for a broader adoption of XAI maturity analysis across sectors, encouraging researchers, practitioners, and policymakers to collaborate in developing shared benchmarks, validating frameworks, and fostering continuous improvement. Explainability should not be treated as a static feature but as a dynamic, evolving capability—central to building AI systems that are trusted, transparent, and aligned with the values of the societies they serve.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding
The authors express their gratitude to the University of Cauca and Comfacauca University Corporation-Unicomfacauca for their support during the development of this work. This research received support from the “National Doctoral Call for University Professors—Cohort 2” of Colombia’s Ministry of Science, Technology, and Innovation (MinCiencias) and from the University of Cauca. Francisco Herrera was supported by the TSI-100927-2023-1 Project, funded by the Recovery, Transformation, and Resilience Plan from the European Union Next Generation through the Ministry for Digital Transformation and the Civil Service and the project PID2023-150070NB-I00 financed by MCIN/AEI/10.13039/501100011033.
Open Research
Data Availability Statement
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.