A Maturity Model for Practical Explainability in Artificial Intelligence-Based Applications: Integrating Analysis and Evaluation (MM4XAI-AE) Models
Abstract
The increasing adoption of artificial intelligence (AI) in critical domains such as healthcare, law, and defense demands robust mechanisms to ensure transparency and explainability in decision-making processes. While machine learning and deep learning algorithms have advanced significantly, their growing complexity presents persistent interpretability challenges. Existing maturity frameworks, such as Capability Maturity Model Integration, fall short in addressing the distinct requirements of explainability in AI systems, particularly where ethical compliance and public trust are paramount. To address this gap, we propose the Maturity Model for eXplainable Artificial Intelligence: Analysis and Evaluation (MM4XAI-AE), a domain-agnostic maturity model tailored to assess and guide the practical deployment of explainability in AI-based applications. The model integrates two complementary components: an analysis model and an evaluation model, structured across four maturity levels—operational, justified, formalized, and managed. It evaluates explainability across three critical dimensions: technical foundations, structured design, and human-centered explainability. MM4XAI-AE is grounded in the PAG-XAI framework, emphasizing the interrelated dimensions of practicality, auditability, and governance, thereby aligning with current reflections on responsible and trustworthy AI. The MM4XAI-AE model is empirically validated through a structured evaluation of thirteen published AI applications from diverse sectors, analyzing their design and deployment practices. The results show a wide distribution across maturity levels, underscoring the model’s capacity to identify strengths, gaps, and actionable pathways for improving explainability. This work offers a structured and scalable framework to standardize explainability practices and supports researchers, developers, and policymakers in fostering more transparent, ethical, and trustworthy AI systems.
1. Introduction
Artificial Intelligence (AI) increasingly influences critical decision-making processes in vital sectors such as healthcare, law, and security, where transparent, reliable, and accountable outcomes are paramount [1, 2]. In these high-stakes environments, explainability is crucial to prevent adverse outcomes, ensure ethical compliance, and build trust in AI-driven decisions [3, 4]. As highlighted in [5], eXplainable AI (XAI) addresses the inherent complexity of contemporary AI models, helping to clarify their operations, facilitate accountability, and foster public trust. Recent regulatory frameworks, such as the European Union’s AI Act [6], further emphasize the necessity of explainability, particularly in scenarios impacting fundamental rights, public services, and democratic processes, framing XAI as a technical goal and an essential ethical principle.
XAI encompasses multiple dimensions intended to render AI systems comprehensible to diverse user groups, each contributing uniquely to enhanced transparency and user trust. Longo et al. [4, 7] categorize explainability into data explainability, model explainability, post hoc explanations, and evaluation methods, underscoring the importance of tailoring explanations to distinct audiences. Similarly, Ali et al. [8] emphasize that AI models, especially those applied in sensitive fields such as healthcare and finance, must provide explanations comprehensible to end users and regulators to meet high standards of trustworthiness.
Herrera [5] proposes a holistic view of XAI maturity encompassing three complementary dimensions: applied explainability, auditability (A), and governance (G). While this study acknowledges these dimensions, it specifically targets the practical dimension of “applied explainability,” aiming to operationalize explainability through a structured maturity model. Such an operational approach complements broader theoretical frameworks by providing practical tools for systematic evaluation and improvement.
Despite significant research aimed at enhancing explainability in AI—particularly within machine learning (ML) and deep learning (DL)—no standardized or comprehensive maturity framework currently exists for systematically evaluating explainability practices. Current approaches tend to address isolated issues such as transparency or ethics independently, failing to integrate these into a cohesive and operational framework. This fragmentation hinders organizations’ ability to effectively adopt explainable AI practices, limiting their capacity to address the demands of users, regulators, and society.
Traditional maturity models such as Capability Maturity Model Integration (CMMI) [9, 10], which primarily focus on software quality and process improvement, serve as foundational inspirations but are inadequate for addressing the specific interpretability requirements unique to AI systems [11, 12]. Explainability is a nonfunctional attribute fundamentally distinct from traditional quality metrics such as accuracy or robustness [13]. Previous research by Chazette and Schneider [13], Deters et al. [7], Chmielowski et al. [14], and Rodriguez-Cardenas [15] further underscores that achieving interpretability requires specialized frameworks designed explicitly for AI.
To bridge this gap, we propose the Maturity Model for eXplainable Artificial Intelligence-Analysis and Evaluation (MM4XAI-AE), a maturity model specifically tailored to evaluate practical explainability in artificial intelligence-based applications. The MM4XAI-AE integrates two fundamental components: an analysis model and an evaluation model. These models work in tandem to characterize and assess the progressive adoption of explainability practices in AI applications. The model defines four progressive maturity levels: operational, justified, formalized, and managed, each associated with a set of clearly defined indicators. These indicators are structured across three key dimensions: technical foundations, structured design, and human-centered explainability, enabling a systematic understanding of how explainability practices evolve from basic technical implementation to advanced interpretation strategies oriented to end users.
This four-level model strategically avoids complexity and redundancy, offering clear, operational stages that range from basic implementation to the advanced integration of explainability practices. These stages are aligned with ethical standards and societal expectations [16, 17], enabling practical adoption across diverse contexts. Each maturity level in the MM4XAI-AE model corresponds to a set of clearly defined indicators that reflect progressive capabilities in transparency, interpretability, and explainability, facilitating systematic evaluation throughout the AI lifecycle. By focusing on practical applicability and structured progression, the model supports both technical development and alignment with growing regulatory and social demands for trustworthy AI.
This study contributes to the XAI domain in three primary ways. First, it introduces MM4XAI-AE, a maturity model focused on the analysis and evaluation of practical explainability practices, explicitly addressing the limitations of existing software-focused maturity frameworks. The model is conceptually aligned with the practicality (P), auditability (A), and governance (G) dimensions outlined in the PAG-XAI framework [5], providing an operational approach to those principles. Second, the model is empirically validated using thirteen AI-driven applications from diverse domains, demonstrating its practicality and versatility in real-world contexts. Finally, we offer a structured methodology and clear indicators for systematically evaluating and enhancing explainability practices, providing valuable guidance to researchers, practitioners, and policymakers alike.
The remainder of the paper is organized as follows: Section 2 reviews related literature on explainability in AI and maturity models. Section 3 describes the materials, methods, and foundational concepts underpinning the MM4XAI-AE, including the analysis and evaluation components and their associated indicators, as well as a comparative analysis with the PAG-XAI framework. Section 4 presents empirical validation results from thirteen case studies. Section 5 discusses these findings, analyzing trends and implications, and Section 6 concludes with key contributions and outlines future research directions, emphasizing the model’s potential to foster explainable and trustworthy AI adoption.
2. Related Work
XAI has become a critical area of research due to its importance in fostering trust and transparency in AI systems. In recent years, various techniques have been proposed to generate interpretable outputs and enhance human understanding of complex models. Tools such as Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) have been widely used across domains such as healthcare and finance. LIME explains individual predictions by locally approximating the model with interpretable representations, while SHAP applies game theory to assign feature importance, ensuring local accuracy and consistency [18–21]. Despite their success, LIME has been criticized for instability in local explanations, and SHAP, although more stable, is computationally intensive. These limitations highlight the need for more robust and standardized evaluation methods in XAI [22].
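As a concrete illustration of how these two tools are typically invoked, the following minimal sketch applies both to a scikit-learn classifier trained on synthetic tabular data; the dataset, model choice, and variable names are illustrative assumptions rather than details drawn from the cited studies.

```python
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy tabular classification task standing in for a real application.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]
model = RandomForestClassifier(random_state=0).fit(X, y)

# LIME: fit a sparse local surrogate around one instance and report its weights.
lime_explainer = LimeTabularExplainer(X, feature_names=feature_names, mode="classification")
lime_exp = lime_explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(lime_exp.as_list())  # locally weighted feature contributions for this prediction

# SHAP: Shapley-value attributions for the same model via the tree-specific explainer.
shap_values = shap.TreeExplainer(model).shap_values(X[:50])
```

The contrast also hints at the trade-off noted above: LIME refits its surrogate on random perturbations at every query, which explains its instability, while SHAP's Shapley-value attributions are more consistent but can become costly for large models or non-tree explainers.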
Complementing these methods, gradient-based techniques like Grad-CAM and Grad-CAM++ have proven essential in visualizing relevant areas in image classification tasks by DL models. These approaches generate class activation maps to show which image regions most influenced the model’s decisions. Their interpretability at the visual level makes them highly valuable in domains like medical imaging, providing intuitive and accessible insights into complex model behavior [18].
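To show how such class activation maps are commonly produced in code, the sketch below applies Grad-CAM through the Captum library (one of the toolkits referenced later in Section 3.1) to an untrained ResNet-18 and a random tensor standing in for a real image; it is a hedged example, not an excerpt from the works cited here.

```python
import torch
from torchvision.models import resnet18
from captum.attr import LayerGradCam, LayerAttribution

# Untrained backbone and a random tensor as stand-ins for a trained diagnostic
# model and a real image (torchvision >= 0.13; older versions use pretrained=False).
model = resnet18(weights=None).eval()
image = torch.randn(1, 3, 224, 224)

# Grad-CAM: gradients of the target class score weight the activation maps of the
# last convolutional block, localizing the regions that drove the prediction.
grad_cam = LayerGradCam(model, model.layer4)
attribution = grad_cam.attribute(image, target=0)

# Upsample the coarse map to input resolution so it can be overlaid on the image.
heatmap = LayerAttribution.interpolate(attribution, (224, 224))
print(heatmap.shape)  # torch.Size([1, 1, 224, 224])
```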
Researchers have also focused on establishing metrics and methods to evaluate the effectiveness of XAI techniques. In 2018, studies explored tools such as goodness checklists, satisfaction scales, and fidelity measures [23, 24]. However, most of these metrics are context specific and lack generalizability, making it difficult to standardize explainability evaluations across domains. Subsequent work emphasized the need for user-centered and culturally aware evaluation methods, such as those proposed by Díaz-Rodríguez et al. in cultural heritage applications [25].
From 2020 to 2021, the focus shifted toward identifying the barriers to XAI implementation in real-world contexts. Studies pointed to the lack of transparency as a major challenge in fields such as healthcare, autonomous systems, and human–computer interaction [26–28]. Cabitza et al. proposed metrics such as the degree of correspondence and weighted agreement for medical AI, but these too remained domain specific [29]. Batarseh et al. provided a systematic review of assurance methods, identifying gaps in achieving reliable AI systems, while Vilone et al. and Alangari et al. highlighted the need for consensus on validation processes and standardized interpretability metrics [4, 30, 31].
In response to these gaps, recent works have proposed layered evaluation frameworks to assess fidelity, clarity, and the stability of explanations. Mirzaei et al., for example, introduced a three-layer approach based on performance loss when removing key features, offering a structured method to compare XAI outputs [32]. This indicates a growing interest in designing scalable, rigorous evaluation protocols.
Throughout 2022, XAI continued to expand across multiple sectors. Applications included early detection of drug resistance, decision support in medical diagnostics, and modeling environmental factors like coastal water quality [33–37]. In Industry 4.0, XAI was integrated into intelligent diagnostics and monitoring, yet challenges in ensuring transparency in high-risk industrial applications persisted [38, 39]. Despite progress, a unified, domain-independent metric framework remains elusive.
By 2023, the connection between explainability and the trustworthiness of AI systems had been further reinforced. Díaz-Rodríguez et al. emphasized explainability as essential to fulfilling legal, ethical, and technical requirements, proposing seven key pillars for trustworthy AI [40]. Recent taxonomies have helped classify XAI techniques into ex-ante and post hoc categories, aligning tools with different audiences and phases of the model lifecycle [32, 41]. This classification facilitates the selection of appropriate techniques for regulators, developers, and end users alike.
In parallel, maturity models have emerged as effective tools for managing process quality and continuous improvement, particularly in the software industry. Originally applied to guide software process evolution [9], they have been adapted between 2020 and 2022 for the adoption of AI and ML in business and industrial contexts [9, 12, 17]. Rama et al., for example, presented a maturity model for managing the ML lifecycle, highlighting the absence of structured pathways for AI integration in enterprises [19].
In the context of XAI, maturity models have been proposed to align ethical principles with technical implementation. Vakkuri et al. advocated for models that ensure responsible AI adoption [42], and Pumplun et al. emphasized maturity-based approaches for adopting ML in healthcare settings [43]. These studies signal an ongoing interest in structuring the development and deployment of explainable systems.
Recent work has also applied maturity models to Industry 4.0 environments. Abadía et al. (2024) examined maturity in cyber-physical production systems (CPPSs), focusing on structured transitions from legacy technologies to autonomous systems [44]. Their approach classifies capabilities into maturity levels, enabling industrial adaptation to digital technologies. Other studies stress the need to evolve maturity models for digital transformation, addressing issues such as resistance to change and operational integration [45–47].
In [5], the PAG-XAI index is proposed, which introduces three core dimensions for explainability maturity: practicality (P), auditability (A), and governance (G). This paper analyzes the connection between the MM4XAI-AE and the PAG-XAI index. The model proposed in this work builds on these efforts, offering a novel maturity model focused specifically on explainability in AI systems. While existing models have focused on adoption or ethical alignment, this model aims to assess the degree to which AI systems offer understandable, interpretable, and transparent decisions in practical business contexts. It provides a structured tool for organizations to improve trust in AI systems, align processes with explainability goals, and guide implementation through clearly defined maturity levels.
3. Materials and Methods
This section introduces the conceptual and methodological foundations of MM4XAI-AE for assessing explainability in artificial intelligence-based applications. Grounded in established XAI principles and aligned with the phases of the AI lifecycle, MM4XAI-AE is structured into four progressive maturity levels: operational, justified, formalized, and managed. This section is organized into three main components. First, it presents the design of the analysis model, which constitutes the backbone of the maturity structure. This model is grounded in key concepts such as interpretability, transparency, explainability, and the AI lifecycle, providing the theoretical and practical basis for defining maturity indicators. These concepts are synthesized into a coherent model structure that supports the definition and classification of indicators across the four maturity levels. The names and descriptions of each level, along with their respective indicators, are introduced in this subsection. Second, the section describes the design of the evaluation model, which builds upon the analysis model by introducing a formal method for assessing the maturity level of explainability in AI applications. A scoring function is developed to operationalize the positioning of evaluated indicators within the defined maturity levels, enabling systematic and replicable classification. Finally, the section outlines the conceptual alignment between MM4XAI-AE and the PAG-XAI dimensions (P, A, and G), showing how the maturity levels and indicators reflect and operationalize those strategic objectives. This alignment reinforces the theoretical soundness of the model while enhancing its relevance for practical implementation.
3.1. Conceptual Design of the MM4XAI—Analysis Model
In recent years, XAI has gained significant importance, especially as AI applications permeate critical sectors. Explainability is essential for enabling users to understand, trust, and effectively manage AI systems. This research incorporates key definitions and dimensions of explainability, grounding the maturity model in established concepts such as interpretability and explainability. By aligning with the phases of the AI lifecycle, from data preparation to deployment and performance management, the model provides a structured foundation for assessing the maturity of explainability practices. Furthermore, essential tools and methods, including model-agnostic techniques such as SHAP and LIME, and gradient-based methods for DL interpretability, are integrated to support the model’s indicators. To establish a robust foundation for MM4XAI-AE, we define essential concepts—including explainability, transparency, and the AI lifecycle—that guide the model’s structure and indicators.
XAI refers to a system’s ability to provide reasons and details that clearly and comprehensibly explain its functioning to a human audience. This becomes especially important in critical AI contexts where ML and DL models, often operating as “black boxes,” pose serious challenges to interpretation. According to Barredo Arrieta et al. [3], explainability enables adequate understanding, trust, and management of AI systems and is key to facilitating transparency and ethical adoption in high-impact applications such as healthcare, finance, and security.
The design of the analysis model in MM4XAI-AE is strongly grounded in the AI lifecycle, which structures the maturity levels along six essential phases: problem definition and solution design, data preparation, model development and training, model evaluation and validation, deployment, and performance management with continuous improvement. These phases were defined by the U.S. General Services Administration (GSA) in 2024. Each of them requires specific considerations regarding explainability and maturity. These considerations allow for the articulation of relevant indicators across all levels of maturity, ensuring coverage of the full development and operational cycle of AI systems.
Transparency and explainability are fundamental to the design of the analysis model and are defined in terms of three key levels: simulatability, decomposability, and algorithmic transparency [3]. These levels provide methodological support for distinguishing the depth of understanding that a user can obtain from a model and are directly associated with the progression of indicators within the maturity levels. In addition, explainability serves multiple and context-dependent purposes, including increasing user trust and adoption, promoting the transferability of models across contexts, and supporting accountability and ethical decision-making in AI [3].
To construct the maturity structure, we adapted the foundational approach of CMMI [10] to the specific context of explainability in AI applications. Drawing from CMMI’s logic of gradual improvement, we propose a model with four levels: operational, justified, formalized, and managed. These levels were chosen for their balance between clarity and scalability, avoiding unnecessary complexity while providing meaningful differentiation between stages. Each level is associated with a set of indicators that reflect increasing capacities in terms of transparency, interpretability, and applicability of XAI techniques. The alignment with the AI lifecycle ensures that these explainability practices are coherently integrated at each development stage, allowing organizations to assess and improve their maturity levels consistently.
- Level 1 (operational): pertains to AI applications developed using both free and proprietary platforms that provide visual, end-to-end (e2e) support for data processing tasks in AI: Waikato Environment for Knowledge Analysis (Weka) [48], Orange Data Mining (Orange) [49], RapidMiner Studio [50], Konstanz Information Miner (KNIME) [51], and Java Hep Framework (JHepWork) [52], or guided code available in GitHub repositories or on platforms such as the Knowledge Discovery and Data Mining Cup (KDD Cup), the Large Scale Visual Recognition Challenge (LSVRC), Kaggle, and Open Machine Learning (OpenML) [53–55]. This level was chosen because many users and developers, when utilizing these platforms, function as “end users” of AI models, possessing limited understanding of the internal processes and explainability aspects. These platforms simplify complex tasks such as model selection and the computation of performance metrics, concealing the model’s inner workings and often making the process opaque to the user. The decision to include e2e platforms at this level responds to the fact that, despite their accessibility and popularity, these tools can restrict transparency and interpretability, elements essential for achieving true AI explainability. At this level, the selected indicators—such as correct model identification, use of e2e platforms, and choice of performance metrics—reflect an initial maturity stage, where developers rely on predesigned tools without a conscious focus on explainability. The consulted literature [48–55] supports the classification of these indicators, highlighting both the utility of these tools for beginners and their limitations for advancing toward a deeper understanding of AI. Table 2 shows the proposed indicators for maturity level 1.
- Level 2 (justified): represents a stage where developers begin to adopt best practices for implementing ML or DL processes, addressing each phase of the AI lifecycle in a structured manner. At this level, users move beyond black-box tools, developing an understanding of the various components of the development process, from problem identification to model deployment. The indicators at this level, such as justification of the ML or DL task, handling of missing and outlier data, feature engineering, and model selection, were selected to reflect more advanced knowledge and an awareness of the importance of each stage in AI model development. These indicators are grounded in the literature [56, 57], which supports a more justified and methodical approach to AI implementation to enhance the system’s quality and understanding. This level incorporates practices that increase transparency and interpretability in models, moving away from the opacity typical of more simplified platforms. Table 3 shows the proposed indicators for maturity level 2.
- Level 3 (formalized): addresses AI applications that implement ML or DL models and incorporate interpretability techniques to systematically understand the model’s behavior and decisions. At this level, developers not only focus on implementing a functional model but are also committed to ensuring that the model’s decisions can be interpreted and explained clearly at a local or global level. The selected indicators at this level, such as the purpose of interpretation (to create white-box models, explain black-box models, or improve model fairness) and the choice of specific interpretability techniques for different data types, are supported by studies in the literature [58–60]. These studies emphasize the importance of interpretability in applications where transparency and model fairness are essential for acceptance and adoption. Incorporating these indicators allows the model to evaluate the maturity of applications in terms of their capability to generate comprehensible interpretations, providing a solid foundation for informed decision-making and ethical risk management. Table 4 shows the proposed indicators for maturity level 3.
- Level 4 (managed): represents the highest level of maturity in the model, focusing on AI applications that not only implement interpretability techniques but also integrate advanced XAI frameworks such as IBM AI Explainability 360 (AIX360), H2O.ai (H2O), TensorFlow Explainability (Tf-explain), Skater Library (Skater), and Captum for PyTorch (CaptumAI) [23, 61]. At this level, the aim is for AI applications to achieve high standards of transparency, informativeness, and fairness [59], with alignment to ethical and social considerations. To ensure these criteria are met, we propose a unified framework that leverages the strengths of these tools. For example, AIX360 can be used for fairness assessment, while CaptumAI offers robust interpretability techniques for DL models, and H2O provides model-agnostic explanations for structured data applications. By unifying these tools into a cohesive framework, the model can more consistently assess applications against the maturity criteria outlined at this level.
Table 1: Maturity levels defined in the MM4XAI-AE model.
Level | Description |
---|---|
1 | Operational |
2 | Justified |
3 | Formalized |
4 | Managed |
Furthermore, we establish specific guidelines to aid in evaluating compliance with each indicator, addressing challenges in measuring the impact of explainability frameworks. For example, to meet the criterion of “bias identification,” a model could leverage SHAP values within AIX360 to systematically detect biases in feature importance. At the same time, fairness and ethical impacts can be evaluated through AIX360’s fairness metrics, ensuring each application meets these high-level indicators comprehensively. Table 5 shows the proposed indicators for maturity level 4.
Table 2: Proposed indicators for maturity level 1 (operational).
Indicator | Description |
---|---|
I1 | ML or DL model correctly identified for the application’s required task |
I2 | End-to-end platform selected for implementing ML or DL processes |
I3 | Validated third-party code selected from repositories such as GitHub or Kaggle |
I4 | At least three performance metrics selected for model evaluation |
Table 3: Proposed indicators for maturity level 2 (justified).
Indicator | Description |
---|---|
I1 to I4 | Indicators of the operational level |
I5 | Justified description demonstrating the type of ML or DL task to be performed. |
I6 | Justified description demonstrating the relevance of the research leading to the implementation of the ML or DL application. |
I7 | Techniques applied for managing missing data. |
I8 | Techniques applied for managing outliers in the dataset. |
I9 | Categorical variables encoded (dummification) according to the ML algorithm used. |
I10 | Feature selection and feature engineering techniques applied. |
I11 | Class-balancing techniques implemented in the dataset. |
I12 | Process identified for selecting and partitioning the dataset into training, validation, and testing stages. |
I13 | Structured process identified for implementing ML or DL models, transitioning from “white box” to “black box” models. |
I14 | Hyperparameter tuning process applied to the selected ML or DL model(s). |
I15 | Analytical description of values obtained from performance metrics of the ML model. |
I16 | Solution deployed using services with user-focused proxies. |
Table 4: Proposed indicators for maturity level 3 (formalized).
Indicator | Description |
---|---|
I1 to I16 | Indicators of the justified level |
I17 | Purpose of model interpretation defined: creating white-box (intrinsic) models, explaining black-box (post hoc) models, improving model fairness, or testing prediction sensitivity. |
I18 | Type of interpretation determined: local (specific records) or global (model generality-based). |
I19 | Interpretability method selected according to the dataset type: text, images, graphs, or tabular data [48]. |
Table 5: Proposed indicators for maturity level 4 (managed).
Indicator | Description |
---|---|
I1 to I19 | Indicators of the formalized level |
I20 | Biases in the dataset identified focusing on social and ethical impacts on customers and society based on the application’s decisions. SHAP values within AIX360 leveraged for systematic bias detection. |
I21 | Ethical responsibility (“ethical debt”) of the application assessed, determining whether the model requires retraining or data restructuring to mitigate potential biases. |
I22 | Comprehensive XAI framework implemented, integrating tools such as AIX360, H2O, Tf-explain, Skater, or CaptumAI to enhance clarity for diverse audiences, including clients and society, regarding the application’s decision-making process. |
I23 | Proxy mechanism established to bridge the application’s decisions with societal impact, ensuring explanations are accessible and intuitive for nonexpert users, facilitating quicker adoption and trust. |
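As an illustration of the “bias identification” guideline for indicator I20, the guideline above refers to SHAP values within AIX360; since the exact API calls are not shown in the paper, the minimal sketch below uses the standalone shap package instead to compare mean feature attributions across two groups defined by a hypothetical sensitive attribute. All names and the synthetic data are assumptions for illustration only.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data; feature 0 plays the role of a hypothetical sensitive attribute
# (e.g., a demographic flag). In a real audit this comes from the application's dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=1)
sensitive = (X[:, 0] > 0).astype(int)
model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Shapley attributions for every instance (binary model: one value per feature).
sv = shap.TreeExplainer(model).shap_values(X)

# Compare mean attributions between the two groups; large systematic gaps flag
# candidate biases that indicator I20 asks the developer to inspect and document.
gap = np.abs(sv[sensitive == 1].mean(axis=0) - sv[sensitive == 0].mean(axis=0))
for i in np.argsort(gap)[::-1][:3]:
    print(f"feature {i}: mean attribution gap = {gap[i]:.3f}")
```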
3.2. Conceptual Design of the MM4XAI—Evaluation Model
The evaluation model, as a core component of MM4XAI-AE, is designed to transform qualitative explainability indicators into a quantitative and replicable scoring system. This model enables the positioning of AI-based applications within one of the four maturity levels previously defined in the analysis model—operational, justified, formalized, and managed—based on the extent to which they fulfill the proposed indicators. To achieve this, we introduce the Explainability Maturity Score (EMS), a composite index that consolidates the contribution of each indicator through a weighted aggregation mechanism. The EMS serves as the output of the evaluation model, providing a standardized measure that facilitates benchmarking and comparison across AI systems. In what follows, we define the scoring ranges, weighting factors, and mathematical formulation used to calculate the EMS, ensuring that indicators at higher levels of maturity contribute more significantly to the overall assessment. This approach reflects the importance of advanced explainability practices in complex AI applications and supports the strategic objectives of MM4XAI-AE.
To operationalize the EMS, we defined a scoring range from 0 to 100 that is divided into four segments, each corresponding to a maturity level. Recognizing that not all indicators contribute equally to explainability, we applied a differentiated weighting scheme that assigns greater importance to those indicators located in higher maturity levels. This ensures that more advanced practices have a stronger impact on the final score and better reflect the complexity and rigor required in critical AI applications. Table 6 summarizes the score intervals associated with each maturity level.
Table 6: Evaluation score intervals associated with each model maturity level (MML).
Model maturity level (MML) | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Evaluation score | [0, 10) | [10, 30) | [30, 60) | [60, 100] |
The weights assigned to each level are as follows:
- Level 1 (operational): 10% of the total score (4 indicators), weight factor θ1 = 10/4 = 2.5
- Level 2 (justified): 20% (12 indicators), weight factor θ2 = 20/12 ≈ 1.67
- Level 3 (formalized): 30% (3 indicators), weight factor θ3 = 30/3 = 10.0
- Level 4 (managed): 40% (4 indicators), weight factor θ4 = 40/4 = 10.0
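The aggregation formula itself (referenced later as equation (1)) is not reproduced in this section, so the sketch below infers it from the weights listed above and the intervals in Table 6: the EMS is the weighted sum of fulfilled indicators, and the resulting score is mapped to a maturity level. Function and variable names are illustrative, not taken from the paper.

```python
# Weight per indicator, by maturity level (theta1-theta4 above); theta2 is rounded
# to 1.67, which is consistent with the totals reported in Table 8.
LEVEL_WEIGHTS = {1: 2.5, 2: 1.67, 3: 10.0, 4: 10.0}
# Level membership of each indicator, following Tables 2-5: I1-I4, I5-I16, I17-I19, I20-I23.
INDICATOR_LEVEL = {**{i: 1 for i in range(1, 5)},
                   **{i: 2 for i in range(5, 17)},
                   **{i: 3 for i in range(17, 20)},
                   **{i: 4 for i in range(20, 24)}}
# Score intervals from Table 6, checked from the highest level down.
THRESHOLDS = [(60, 4), (30, 3), (10, 2), (0, 1)]

def ems(fulfilled: set[int]) -> tuple[float, int]:
    """Return (EMS score, model maturity level) for a set of fulfilled indicators."""
    score = sum(LEVEL_WEIGHTS[INDICATOR_LEVEL[i]] for i in fulfilled)
    level = next(lvl for lower, lvl in THRESHOLDS if score >= lower)
    return round(score, 2), level

# Example: indicators I1-I3, I5, and I6 met, as in the first row of Table 8.
print(ems({1, 2, 3, 5, 6}))  # (10.84, 2)
```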
In addition, we have developed data capture tools to implement this evaluative model in AI applications. These tools consolidate the proposed indicators and apply the EMS to assess explainability and maturity levels with consistency and precision.
3.3. Conceptual Alignment Between MM4XAI-AE and PAG-XAI Index
The MM4XAI-AE model aligns conceptually and operationally with the PAG-XAI index [5], which proposes three strategic dimensions for evaluating the maturity of XAI: practicality (P), auditability (A), and governance (G). Practicality refers to the usefulness and applicability of explainability in real-world contexts. Auditability involves the ability to systematically assess, validate, and trace AI decision-making processes. Governance encompasses the implementation of ethical, transparent, and policy-based standards that ensure responsible AI practices.
Each maturity level defined in MM4XAI-AE progressively reflects and incorporates the core elements of the PAG-XAI dimensions. At the operational level, explainability is minimally developed. Practical implementation is limited to the use of end-to-end tools without explicit attention to explainability, auditability practices are virtually absent, and governance considerations are not yet introduced. This level corresponds to an initial and underdeveloped state across all three PAG-XAI dimensions.
As systems progress to the justified level, early-stage alignment with PAG-XAI begins to emerge. Developers demonstrate basic practical understanding and begin to use explainability tools and metrics consciously. Auditability remains limited but includes initial steps toward documenting processes and recognizing the need for transparency. Governance enters the picture through early awareness of ethical, legal, or regulatory concerns, though without formal mechanisms. At this stage, the MM4XAI-AE model begins to intersect meaningfully with PAG-XAI principles, particularly in fostering awareness and initiating systematic practices.
The formalized level shows stronger conceptual convergence with the PAG-XAI framework. Here, practical methods such as SHAP and LIME are applied systematically to generate explainability outputs tailored to user and domain needs. Interpretability methods are documented and integrated, allowing for structured auditing of model behavior. From a governance standpoint, organizations begin to establish internal standards, ethical guidelines, and preliminary compliance protocols. This level marks a critical stage in aligning explainability with the auditability and governance dimensions in a more deliberate and structured manner.
The highest level, managed, fully reflects the ideals of PAG-XAI. Practicality is demonstrated through user-centered explainability mechanisms that are embedded into decision-making workflows. Auditability is robust and continuous, involving advanced validation practices, bias detection, and accountability mechanisms. Governance becomes comprehensive, with organizations implementing ethical principles, external compliance standards, and mechanisms for social accountability. At this level, the maturity of explainability practices is not only technically advanced but also aligned with broader organizational responsibilities and societal expectations.
In this way, MM4XAI-AE not only complements the PAG-XAI framework but also serves as a practical instrument to realize its strategic vision. While PAG-XAI offers a high-level conceptual guide to achieving responsible AI, MM4XAI-AE provides a tangible model with concrete indicators and assessment mechanisms that translate those ideals into measurable actions. The EMS generated by MM4XAI-AE allows for the operational evaluation of an application’s maturity level in explainability, enabling a quantifiable and replicable assessment aligned with PAG-XAI’s principles.
The relationship between the two models is, therefore, both conceptual and instrumental. MM4XAI-AE offers a roadmap with practical steps toward achieving the ideals articulated by PAG-XAI. Conversely, PAG-XAI provides the normative and strategic framing that underscores why progressing through these maturity levels is necessary for trustworthy, auditable, and ethical AI. In the future, these models could be integrated further by combining the EMS indicators with PAG-XAI metrics to create a holistic maturity evaluation system, offering organizations both strategic direction and operational guidance for advancing explainability in AI.
4. Results
We validated the complete MM4XAI-AE model—including both the analysis and evaluation components—by applying it to thirteen AI-based applications that use ML algorithms to process structured datasets. These datasets are publicly available in the “Machine Learning Repository” of UC Irvine, and the corresponding scientific papers are accessible through various open-access journals. Table 7 consolidates the dataset names, corresponding scientific articles, task types (classification or regression), data characteristics (types of attributes, number of instances, and number of features), and year of publication.
Table 7: Datasets and AI-based applications evaluated with MM4XAI-AE.
Dataset | Article title | Type of task in ML | Attribute types | # Instances | # Attributes | Year of publication |
---|---|---|---|---|---|---|
Antifouling performance index | Establishment of an antifouling performance index derived from the assessment of biofouling on typical marine sensor materials [62] | Classification | Integer: Pixels in images | Not reported in the document | Features such as pixel intensity, edge filters, and texture | 2023 |
DNA barcode database | Artificial intelligence in timber forensics employing DNA barcode database [63] | Classification | Real | Not reported in the document | Not reported in the document | 2023 |
Algerian forest fires dataset | Predicting forest fire in Algeria using data-mining techniques: case study of the decision tree algorithm [64] | Classification, regression | Real | 244 | 12 | 2019 |
Alcohol QCM Sensor dataset | Classification of alcohols obtained by QCM sensors with different characteristics using ABC-based neural network [65] | Classification, regression, clustering | Real | 125 | 8 | 2019 |
AI4I 2020 predictive maintenance dataset | Explainable artificial intelligence for predictive maintenance applications [66] | Classification, regression, explainability | Real | 10,000 | 14 | 2020 |
Amphibians dataset | Predicting presence of amphibian species using features obtained from GIS and satellite images [67] | Classification | Integer, real | 189 | 23 | 2020 |
Intrusion detection evaluation dataset (CIC-IDS2017) | False positive identification in intrusion detection using XAI [68] | Classification | Categorical, numerical | 2,830,743 | 11 | 2017 |
Alzheimer’s Disease Neuroimaging Initiative (ADNI) | A multilayer multimodal detection and prediction model based on explainable artificial intelligence for Alzheimer’s disease [69] | Classification | Numeric, categorical, textual, and images | 1048 | 11 | 2004 |
Retinal optical coherence tomography angiography (OCTA) | Early detection of dementia through retinal imaging and trustworthy AI [70] | Classification | Images | 1192 | Process the images with a CNN | 2024 |
Demographic health survey (DHS), focusing on reproductive-aged women from six East African countries | Explainable artificial intelligence models for predicting pregnancy termination among reproductive-aged women in six East African countries: machine learning approach [71] | Classification | Categorical, numerical | 338,904 | 18 | 2024 |
B&VIIT Eye Center and Kim’s Eye Hospital | Code-Free Machine Learning Approach for EVO-ICL Vault Prediction: A Retrospective Two-Center Study [72] | Classification | Numeric | 640 | 18 | 2024 |
Pulmonary tuberculosis (TB) detection in chest radiographs (CXR) | A deep learning-based algorithm for pulmonary tuberculosis detection in chest radiography [73] | Classification | Image | 6881 | Process the images with a CNN | 2024 |
Images of herbal plants for classification purposes | Teachable machine: optimization of herbal plant image classification based on epoch value, batch size, and learning rate [74] | Classification | Image | 10,000 | Process the images with a CNN | 2024 |
To apply the model, we developed a structured data collection instrument aligned with all the indicators defined in the analysis model. This instrument includes guiding questions designed to evaluate the presence or absence of each indicator. Since direct interaction with the application developers was not feasible, we relied on a careful reading of the published papers, operating under the assumption that the development processes are sufficiently documented in the article texts.
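Because the data collection instrument itself is not reproduced in the paper, the sketch below only suggests how such a checklist could be encoded for use alongside the scoring function sketched in Section 3.2; the question wording and all names are hypothetical.

```python
# Hypothetical guiding questions paraphrasing a few indicators from Tables 2-5
# (wording is illustrative, not taken from the actual instrument).
GUIDING_QUESTIONS = {
    1: "Is the ML or DL model correctly identified for the application's task?",
    5: "Is the type of ML or DL task to be performed explicitly justified?",
    17: "Is the purpose of model interpretation (intrinsic, post hoc, fairness, sensitivity) stated?",
    20: "Are dataset biases and their social or ethical impacts identified?",
}

def collect_review(answers: dict[int, bool]) -> set[int]:
    """Convert an expert's yes/no answers into the set of fulfilled indicators."""
    return {indicator for indicator, met in answers.items() if met}

# Example: an expert confirms I1 and I5 but not I17 or I20 for a reviewed paper.
fulfilled = collect_review({1: True, 5: True, 17: False, 20: False})
```

The resulting set of fulfilled indicators can then be passed to an aggregation routine such as the EMS sketch in Section 3.2 to obtain the score and maturity level.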
The application of the MM4XAI-AE was carried out by four experts with extensive experience in AI, both as developers and researchers. Each expert independently reviewed the assigned articles using the instrument, following a guided scoring process based on the maturity indicators. The evaluation simulated an expert interview scenario, where the evaluator interprets the documentation as if interacting with the developers responsible for the AI implementation. For each article, the experts scored the indicators that were met, and the EMS was calculated using the weighted formula defined in equation (1).
This procedure allowed us to determine the maturity level of each AI-based application and to verify that the evaluated studies were distributed across all four maturity levels defined in the MM4XAI-AE. The section concludes with a detailed summary of the indicators identified in each of the thirteen evaluated papers, allowing for a transparent and structured view of how each AI-based application aligns with the maturity levels defined in MM4XAI-AE. Table 8 compiles the results of the expert evaluations and shows the score assigned to each paper, highlighting the diversity in explainability practices across the selected applications. The indicators used and the scoring methodology provide a replicable and systematic way to assess explainability in real-world AI systems.
Table 8: Indicators fulfilled by each evaluated application (1 = met, 0 = not met), total EMS, and resulting model maturity level (MML).
Application references | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | I10 | I11 | I12 | I13 | I14 | I15 | I16 | I17 | I18 | I19 | I20 | I21 | I22 | I23 | Total | MML |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[62] | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.84 | 2 |
[63] | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10.84 | 2 |
[64] | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.85 | 2 |
[65] | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.85 | 2 |
[66] | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 72.53 | 4 |
[67] | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28.37 | 2 |
[68] | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 80.04 | 4 |
[69] | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 80.04 | 4 |
[70] | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 49.19 | 3 |
[71] | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 68.36 | 4 |
[72] | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 46.69 | 3 |
[73] | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.17 | 1 |
[74] | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.17 | 1 |
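Reading Table 8 against the weights of Section 3.2 (with θ2 rounded to 1.67) is consistent with the reported totals. For instance, the application in [62] fulfills three operational indicators and two justified indicators, giving 3 × 2.5 + 2 × 1.67 = 10.84, which falls in the [10, 30) interval of Table 6 and therefore corresponds to maturity level 2; likewise, [67] fulfills all four operational and eleven justified indicators, giving 4 × 2.5 + 11 × 1.67 = 28.37, which remains within level 2 despite the high indicator count.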
The AI applications analyzed span a wide range of domains and levels of complexity. Table 7 presents a summary of the datasets used, emphasizing the diversity of tasks (classification and regression), attribute types (numeric, categorical, and image), and publication years, which demonstrates the contemporary relevance of the selected studies. It is important to note that all papers describe systems that implement AI algorithms throughout the lifecycle, including model training, evaluation, and, in some cases, deployment and optimization, ensuring a comprehensive view of their explainability maturity.
Several datasets, such as the Antifouling Performance Index and the DNA Barcode Database, are associated with image and textural data and involve classification tasks. However, the limited reporting of attribute or instance counts in some articles restricted the full analysis of dataset complexity. In contrast, other datasets such as CIC-IDS2017 and ADNI demonstrate higher complexity due to their large number of features and tasks involving both classification and regression, often integrating interpretability strategies.
Advanced applications—particularly those using convolutional neural networks (CNNs) for image-based diagnosis, such as retinal OCTA and tuberculosis detection from CXR scans—reflect the integration of post hoc explainability techniques. These examples underscore the growing importance of explainability in domains where model decisions have high-stakes consequences.
The heterogeneity of the evaluated AI-based applications highlights the adaptability of MM4XAI-AE across different contexts. Its application in domains such as healthcare, cybersecurity, environmental science, and computer vision validates its robustness and relevance. The results demonstrate that the model is capable of capturing a wide range of maturity levels in explainability practices, regardless of the domain of application. Furthermore, this diversity supports the model’s potential for broader adoption and reinforces its value as a diagnostic and planning tool for organizations seeking to enhance the explainability of their AI systems.
Table 8 shows that the selected investigations cover the four levels proposed in the maturity model. The article “Establishment of an Antifouling Performance Index Derived from the Assessment of Biofouling on Typical Marine Sensor Materials” was classified at maturity level 2 (“Justified”). This classification reflects the correct identification of the supervised learning model required to classify biofouling organisms (I1) and the use of technological tools such as Weka Segmentation for process implementation (I2). In addition, it relies on validated code and standard techniques, such as image analysis libraries (I3). However, although the study justifies the classification task and its relevance for optimizing marine materials (I5 and I6), it does not address more advanced techniques, such as handling missing data, feature selection, or hyperparameter tuning, which limits its formalization. This combination of elements explains its placement at this maturity level, demonstrating a conscious development effort but lacking a fully structured methodological framework.
The article “Predicting Forest Fire in Algeria Using Data Mining Techniques: Case Study of the Decision Tree Algorithm” fulfilled several Level 2 indicators due to its methodological approach and results. Firstly, I1 was met by correctly identifying the ML model, the J48 decision tree, as the most suitable for predicting forest fires based on the available meteorological data and the specific requirements of the case study. Furthermore, I2 was evidenced by the model’s implementation using the Weka tool, an integrated platform for ML processes, while I4 was achieved by evaluating the model with key metrics such as accuracy, recall, and F-measure, demonstrating robust quantitative analysis. The article also addressed I5 and I6 by justifying the model selection and emphasizing the relevance of the forest fire prediction problem, a critical issue in the context of Algeria due to its significant environmental and economic impacts. In addition, I12 and I13 were fulfilled by explicitly identifying and describing the data partitioning process into training and testing stages and progressing toward a “black box” model by developing a decision tree-based approach that optimizes performance while maintaining simplicity for future hardware implementations. Finally, I15 was achieved by providing a detailed analysis of the evaluated metric values, showing an accuracy of 82.92% and a recall of 92%, aligning with the selected model’s expectations.
The article “Classification of Alcohols Obtained by QCM Sensors with Different Characteristics Using ABC-Based Neural Network” fulfilled several Level 2 indicators due to its methodological approach and the implementation of advanced ML techniques. The fulfillment of I1 was evident through the correct selection of an artificial neural network based on the artificial bee colony (ABC) optimization algorithm as the model for the required classification task. The use of specific platforms and tools for implementing these processes, as outlined in the methodology, supports the fulfillment of I3. In addition, multiple performance metrics, such as mean squared error (MSE), were analyzed, satisfying I4. The article also addresses I5 and I6 by providing a detailed justification of the relevance of alcohol classification using QCM sensor data and its potential impact on industrial and hygiene applications. The data partitioning process for training and testing was clearly described, aligning with indicator I12. Hyperparameter tuning, including the number of neurons in hidden layers and training cycles, was also thoroughly explored, fulfilling I14. Finally, the results are presented through a comprehensive analysis of the performance metrics achieved, demonstrating a notable improvement in the model’s accuracy and effectiveness, reflecting the fulfillment of I15.
The article “Explainable Artificial Intelligence for Predictive Maintenance Applications” achieved maturity level 4 due to its comprehensive approach and advanced use of XAI techniques. First, the article fulfilled indicator I1 by correctly identifying ensemble-based decision trees and interpretable decision tree models to address the predictive maintenance task. Indicator I3 was also met through validated code and implemented algorithms, as detailed in the methodology. At least three metrics, including precision, recall, and confusion matrices, were employed to evaluate model performance, achieving indicator I4. The paper provides a detailed justification of the type of ML task to be performed (I5) and its relevance to industrial applications, as illustrated by the discussion of predictive maintenance’s impact on production processes (I6). Techniques for feature engineering and selecting relevant attributes were implemented, aligning with indicators I9 and I10.
Furthermore, class balancing techniques were applied to address the inherent imbalance in the dataset (I11), and the data partitioning process for training, validation, and testing stages was clearly defined (I12). The article demonstrated compliance with I13 and I14 by exploring structured approaches to interpretable decision trees, including hyperparameter tuning of settings such as node count and selected feature sets. A detailed analysis of performance metrics was conducted, fulfilling I15. At the interpretability level, the purpose of explainable models to enhance trust and adoption was clearly defined, addressing indicators I17, I18, and I19. Finally, explanatory interfaces such as normalized feature deviations and decision trees provided an accessible evaluation for nontechnical audiences, effectively implementing XAI frameworks (I22). In addition, a mechanism was established to link model decisions to broader societal impacts, fostering trust and technological adoption (I23).
The article “Predicting Presence of Amphibian Species Using Features Obtained from GIS and Satellite Images” met Level 2 indicators due to its rigorous methodology and analytical results. First, I1 was fulfilled by correctly identifying models such as Gradient Boosted Trees (GBTs) for predicting amphibian presence based on relevant environmental data. Comprehensive platforms like RapidMiner, integrated with Weka and H2O plugins, were used to implement ML processes, meeting I2. In addition, metrics such as area under the curve (AUC) and balanced accuracy were selected to evaluate model performance, aligning with I4. The article also detailed the species classification problem and its ecological relevance, justifying I5 and I6. Indicators I7 and I8 were addressed through techniques for managing missing data and identifying outliers in the environmental datasets. Moreover, categorical variable encoding and feature engineering techniques were implemented to optimize model accuracy, fulfilling I9 and I10. The research team employed advanced techniques to balance uneven classes in the datasets, meeting I11, and provided a thorough description of the data partitioning process for training, validation, and testing stages, satisfying I12. The design and tuning of specific hyperparameters for the selected models demonstrated a meticulous approach, fulfilling I13 and I14. Finally, the results were presented through a detailed analysis of key metrics, fulfilling I15 and demonstrated that the proposed approach is suitable for practical applications in environmental planning and impact assessment.
The article titled “False Positive Identification in Intrusion Detection Using XAI” achieved maturity level 4 due to its comprehensive approach to identifying and mitigating false positives in anomaly-based intrusion detection systems (IDS) through XAI techniques. It fulfilled I1 by selecting a neural network model for intrusion detection and optimizing it using attributes derived from XAI tools, such as SHAP and the adversarial approach. Indicators I2 and I3 were met using advanced platforms and validated datasets, including LYCOS Intrusion Detection System 2017 (LYCOS-IDS2017). Model performance was evaluated using precision, false positive rate, and area under the curve (AUC), satisfying I4. The article justified the application of ML techniques to address cybersecurity challenges (I5) and highlighted the relevance of the proposed model by demonstrating its capacity to reduce false positives, thereby improving intrusion detection efficiency (I6). Missing data were managed, and outliers were addressed during preprocessing, fulfilling I7 and I8. Categorical variables were encoded to ensure proper data representation (I9).
Furthermore, advanced feature selection and transformation techniques were applied (I10), and methods were used to balance classes within the datasets (I11). The study included a structured data partitioning process into training, validation, and testing stages (I12) and implemented techniques that integrated both “white box” and “black box” models (I13). Hyperparameter tuning was performed to maximize model performance (I14), and a detailed analysis of the obtained metrics was conducted (I15). At maturity level 3, the article satisfied I17 by identifying the purpose of model interpretation and fulfilled I18 and I19 by selecting methods such as SHAP and the Adversarial Approach to explain decisions based on XAI vectors. Finally, at level 4, the study addressed I20 and I21 by identifying biases in the data and assessing the implications of the model’s decisions. This result was complemented by implementing a comprehensive XAI framework that included tools like SHAP and Principal Component Analysis (PCA), promoting clear and accessible explanations for non-technical users, thereby fostering trust and system adoption (I22 and I23).
The article “A Multilayer Multimodal Detection and Prediction Model Based on Explainable Artificial Intelligence for Alzheimer’s Disease” achieved maturity level 4 by addressing early diagnosis and progression prediction of Alzheimer’s disease through an XAI approach. The fulfillment of indicators begins with the precise identification of the ML model (I1), employing Random Forest (RF) as the primary classifier, and the selection of advanced platforms for implementing multimodal processes (I2). The model incorporated validated code and detailed metrics, such as precision, F1-score, and AUC, to evaluate performance (I3, I4). The study justified the relevance and nature of the Alzheimer’s prediction and classification problem, emphasizing its clinical impact and the benefits of early detection (I5, I6). Robust data management techniques, such as feature selection and class balancing strategies, were also addressed (I7-I11), ensuring rigorous data partitioning for training, validation, and testing stages (I12). The structured transition toward explainable models was achieved using SHAP and fuzzy rule-based systems to interpret decisions (I13 and I17). Advanced techniques for hyperparameter tuning, performance analysis through metrics, and deployment of detailed explanations for each model decision were implemented, with natural language representations accessible to medical practitioners (I14–I16, I18-I19). Finally, the model analyzed data biases and provided explanations consistent with medical standards, ensuring ethical responsibility and fostering trust in its clinical use (I20 and I21). This study exemplifies how AI can effectively balance precision and explainability to address complex medical challenges.
The article “Early Detection of Dementia Through Retinal Imaging and Trustworthy AI” reached Level 3 maturity in the technological maturity model due to its robust integration of advanced DL techniques and interpretability analysis. Indicator I1 was fulfilled by selecting the Eye-AD model, an architecture based on graph and CNNs. Indicator I3 was satisfied through the use of libraries such as PyTorch for model development, highlighting the implementation of validated techniques. Regarding indicator I4, the article employed precision, F1-score, and AUC metrics to evaluate model performance across different stages. At the justified level, the article establishes the nature and relevance of the problem addressed (I5 and I6), emphasizing the importance of early detection of Alzheimer’s disease and mild cognitive impairment using optical coherence tomography angiography (OCTA) images. Indicator I10 was addressed using advanced feature selection and extraction techniques from retinal images.
In addition, the article describes a structured process for partitioning datasets into training, validation, and testing stages (I12) and presents a progressive design for analyzing intra- and inter-instance relationships within the data, fulfilling indicator I13. Hyperparameter tuning (I14) was detailed, optimizing components such as the multilevel graph neural network. Furthermore, an exhaustive analysis of the values obtained from the selected metrics was conducted, meeting indicator I15. The model’s implementation in a user environment was notable for providing precise and reliable explanations of model decisions, achieving indicators I16, I17, and I18. Finally, indicator I19 was addressed by adapting interpretive methods based on heatmaps and statistical analysis, enabling an understanding of the relevance of different image regions in the diagnostic process. These elements position the article as an example of technological maturity in the application of AI for early disease diagnosis, with significant advances in interpretability techniques and biomedical analysis.
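The article’s own heatmap-based interpretation method is not reproduced here; as a hedged illustration, the sketch below computes a simple gradient-based saliency map for a placeholder CNN in PyTorch, one common way to highlight which image regions drive a prediction. The network and input are stand-ins, not the Eye-AD architecture or OCTA data.

```python
# Illustrative sketch only: a gradient-based saliency map for a CNN prediction,
# one simple way to produce the kind of heatmap interpretation noted for I19.
# The network and input below are placeholders, not the Eye-AD model.
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in CNN, not Eye-AD
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
)
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # placeholder retinal-style input
score = model(image)[0].max()               # score of the predicted class
score.backward()                            # gradients w.r.t. input pixels

# Saliency: max absolute gradient across channels gives a per-pixel relevance map.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)
print(saliency.shape)                       # torch.Size([224, 224])
```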
The article titled “Explainable Artificial Intelligence Models for Predicting Pregnancy Termination Among Reproductive-Aged Women in Six East African Countries” achieved Level 4 technological maturity due to its rigorous approach to the predictive and explainable analysis of pregnancy termination among women of reproductive age. First, ML models such as Random Forest (RF) and Extreme Gradient Boosting (XGB) were correctly identified to address the task, fulfilling I1. Additionally, at least three metrics, including precision, F1-score, and AUC, were selected to evaluate model performance, meeting I4. The article provided well-founded descriptions of the task and the relevance of the model, emphasizing how the results can influence reproductive health and public policies, fulfilling I5 and I6. During preprocessing, techniques such as imputation of missing data and class balancing using the Synthetic Minority Oversampling Technique (SMOTE) were applied, satisfying I7 and I11. Variable encoding and transformations were performed, addressing I9, and advanced feature selection techniques, such as Mutual Information and Step-Backward Feature Selection, were implemented, fulfilling I10. The process included stratified data partitioning into training and testing sets, meeting I12. At the formalized level, the RF model was chosen for its ability to handle high-dimensional data and robustness against overfitting, and it was interpreted using XAI tools such as SHAP, LIME, and Explain Like I’m Five (Eli5), fulfilling I17, I18, and I19. These tools identified key predictors, including wealth index, educational level, and access to potable water. The model also assessed biases in the data and applied interpretability techniques to enhance explainability, satisfying I20 and I22. Overall, the comprehensive and explainable approach positions this work as a benchmark in applying AI to address public health challenges and intervention planning in East Africa.
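As a minimal sketch of the preprocessing steps cited for I10–I12 (mutual-information feature selection, stratified partitioning, and SMOTE balancing), the following fragment uses synthetic placeholder data rather than the study’s survey data; the feature counts and parameters are illustrative assumptions.

```python
# Minimal sketch (placeholder data, not the study's survey data) of the
# preprocessing steps cited for I10-I12: mutual-information feature selection,
# SMOTE class balancing, and a stratified train/test split.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)

# Keep the k features with the highest mutual information with the target (I10).
X_selected = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# Stratified partitioning preserves the class ratio in both splits (I12).
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE is applied to the training split only, so the test set stays untouched (I11).
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(sorted(set(y_train_bal)), len(y_train_bal))
```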
The article “Code-Free Machine Learning Approach for EVO-ICL Vault Prediction: A Retrospective Two-Center Study” achieved Level 3 maturity in the technological maturity framework, excelling in postoperative vault prediction using ML tools without requiring coding. The fulfillment of indicator I1 was demonstrated by correctly identifying learning models such as Random Forest, Gradient Boosting, and Adaptive Boosting for predicting vault values in patients undergoing Implantable Collamer Lens (ICL) surgery. For indicator I4, robust metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Area Under the Curve (AUC) were employed to evaluate model performance in both internal and external validations. The study adequately justified the problem’s relevance (I5, I6), emphasizing the importance of optimizing implanted lens size to reduce postoperative complications. Handling missing data and normalizing key variables met the requirements of indicator I7, while the dataset partitioning into training and validation stages, performed through 10-fold cross-validation, satisfied indicator I12.
Furthermore, hyperparameter tuning using grid search was implemented, fulfilling indicator I14. The article also satisfied I15 by providing a detailed analysis of the metrics, demonstrating that the Random Forest and Gradient Boosting models outperformed traditional approaches such as linear regression. Regarding interpretability, indicators I17, I18, and I19 were achieved through tools such as t-distributed Stochastic Neighbor Embedding (t-SNE) visualizations and decision tree-based classification criteria, which enabled the interpretation of key factors, including lens size and anterior chamber volume. Finally, the article evaluated inter-center measurement biases and addressed the need for personalized models, ensuring greater accuracy across diverse clinical contexts and thereby reinforcing the study’s applicability to medical practice.
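The study’s code-free workflow is not available for reproduction; the sketch below merely illustrates, with placeholder data, how grid-search hyperparameter tuning combined with 10-fold cross-validation (the pattern referenced for I12 and I14) is typically expressed in scikit-learn, scoring with MAE as one of the metrics named above.

```python
# Illustrative sketch (placeholder data): grid-search hyperparameter tuning with
# 10-fold cross-validation, the combination referenced for indicators I12 and I14.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=10,                                  # 10-fold cross-validation
    scoring="neg_mean_absolute_error",      # MAE, one of the metrics cited for I4
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```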
The article “A Deep Learning-Based Algorithm for Pulmonary Tuberculosis Detection in Chest Radiography” achieved Level 1 technological maturity in the maturity model through its fulfillment of several key indicators. The deep neural network model MobileNet was correctly selected as the foundation for training a tuberculosis (TB) detection algorithm for chest radiographs, meeting indicator I1. The use of Google Teachable Machine, an accessible platform that does not require extensive coding, enabled the implementation of the DL process, aligning with indicator I2. Furthermore, the article used validated, open-source code and datasets drawn from Kaggle and other recognized sources, satisfying indicator I3. At least three metrics were selected to evaluate the model’s performance: precision, sensitivity, and AUC, fulfilling indicator I4. These metrics demonstrated that the algorithm achieved AUC values of 0.951 and 0.975 on the validation sets, highlighting its accuracy in detecting TB at levels comparable to clinical experts. This work stands as an example of the initial level of technological maturity in the application of AI for medical diagnostics, providing a solid foundation for future research and more advanced developments.
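For readers less familiar with the metrics behind indicator I4, the short sketch below shows one standard way to compute precision, sensitivity (recall), and AUC with scikit-learn; the labels and probabilities are made up for demonstration and are unrelated to the TB study.

```python
# Minimal sketch with made-up predictions: computing the three metrics named for
# indicator I4 (precision, sensitivity/recall, and AUC) with scikit-learn.
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # placeholder ground-truth labels
y_prob = [0.1, 0.4, 0.8, 0.9, 0.6, 0.2, 0.3, 0.05]   # placeholder predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]             # threshold at 0.5

print("precision  :", precision_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))   # recall == sensitivity
print("AUC        :", roc_auc_score(y_true, y_prob))  # AUC uses the raw probabilities
```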
The article fulfilled indicators I1, I2, I4, and I6, placing it at Level 1: it correctly identified the ML/DL models required for the application, used the Teachable Machine platform as an end-to-end environment to implement the ML/DL process, selected at least three metrics to evaluate model performance, and presented a justified description of the research’s relevance. The study employed a CNN architecture in combination with Teachable Machine to classify images of medicinal plants, testing parameters such as batch size, learning rate, and the number of epochs to optimize performance. It also detailed how these decisions were critical in achieving an accuracy of 98%–100%, highlighting the model’s utility in web applications for plant identification and justifying its implementation as a means of improving access to knowledge about medicinal plants. This work represents an initial step in applying ML techniques, but its scope remains limited with respect to the higher maturity levels.
The diverse applications and methodologies analyzed across the selected articles underscore the flexibility and adaptability of the technological maturity model in evaluating AI-based solutions across various domains. By systematically addressing different maturity levels and employing robust ML and explainability techniques, these studies highlight both the potential and the limitations of current AI implementations. The findings reveal a progression from foundational applications built on simple platforms to sophisticated frameworks integrating advanced models, interpretability tools, and ethical considerations. This broad spectrum of use cases validates the maturity model’s relevance and offers insight into the steps needed to advance AI-driven applications toward higher maturity levels. Such understanding is critical for promoting the development of reliable, scalable, and impactful AI solutions across scientific, medical, environmental, and societal contexts.
It is important to note that, although greater rigor in applying the maturity model is warranted, for instance by including multiple evaluators to corroborate the scores assigned to each article, the purpose of this initial exercise was to verify that applying the model to different articles would surface the various maturity levels. This finding suggests that the proposed model can effectively measure maturity levels across the established indicators.
5. Discussion
The findings of this study underscore the need for a dedicated maturity model to assess explainability in AI applications. The MM4XAI-AE, comprising both the analysis and evaluation components, was applied to thirteen AI-based developments using structured datasets and allowed for the identification of maturity levels based on specific explainability indicators and the computation of the EMS. This application demonstrated how the model effectively positions AI applications across the four defined maturity levels, revealing a wide variability in the adoption of interpretability and explainability practices. Importantly, the evaluation process was carried out by four experts in AI, who followed a structured guide to assess each article and score the presence of the indicators. The results reflect that, although efforts exist to promote explainability in AI, significant challenges remain in standardizing and operationalizing these practices across domains.
Of the thirteen evaluated studies, six achieved higher maturity levels (formalized and managed), while seven remained at the lower levels (operational and justified). Applications with lower maturity tended to lack formal strategies for interpretability and transparency, often relying on black-box models without documented explainability processes. Although technically functional, such systems face adoption barriers in high-stakes domains due to the absence of clear, traceable, and ethical decision-making mechanisms. In contrast, intermediate and advanced-level applications incorporated tools such as SHAP, LIME, and Captum to provide transparency. However, these tools are predominantly technical and geared toward developers, limiting their accessibility to broader audiences such as domain experts, nontechnical users, and regulators. This reinforces the need for explainability mechanisms that are not only technically sound but also user-centered and aligned with decision-making needs.
Temporal trends in the maturity levels derived from the EMS values suggest some fluctuation in explainability maturity over the years. In 2019, the average maturity level was 2.00; it rose to 2.67 in 2020 and peaked at 4.00 in 2021. However, a decline was observed in the subsequent years, with averages of 2.67 in 2023 and 2.40 in 2024. While this indicates some progress, it also reflects inconsistency in the integration of explainability practices, possibly influenced by sectoral interests, evolving technologies, and varying levels of awareness. Although this trend cannot be generalized to the entire field of AI, the study still provides valuable insight into how explainability adoption evolves over time in scientific research.
In critical sectors such as healthcare and security, the variability in explainability maturity is particularly concerning. The healthcare-related papers showed notable dispersion—ranging from Level 1 (operational), as in the case of the tuberculosis detection system (2024), to Level 4 (managed), such as the pregnancy termination prediction model (2024). This variability underscores the urgency of implementing clear explainability standards in fields where decisions directly impact human wellbeing and rights.
From a practical standpoint, the MM4XAI-AE offers tangible value for both academia and industry. For researchers, it provides a structured methodology for evaluating explainability maturity, promoting comparability and reproducibility. For practitioners and organizations, the model serves as a diagnostic tool that helps assess current explainability practices and define pathways for improvement. Its application can inform strategic planning for AI adoption in line with ethical and transparent standards, ensuring that outputs are interpretable and usable for diverse stakeholders. As a recommendation, organizations should prioritize the integration of interpretability methods in AI solutions to ensure that models are transparent, trustworthy, and responsive to societal needs.
Nevertheless, this study has limitations. The evaluation was based solely on the content documented in published articles, without direct access to the developers or source code of the AI systems. This limitation means that some practices might have been implemented but not reported, potentially affecting the scoring. Future work should include interviews or structured surveys with development teams to validate and complement the documentation-based evaluations. In addition, applying the model in diverse regions or regulatory environments may help uncover contextual variables that affect explainability adoption.
Another limitation lies in the size and scope of the sample. Although the current set of thirteen AI applications enabled a robust initial validation of the model, it is not representative of all AI use cases. Future research should broaden the application of MM4XAI-AE to include a larger, more diverse pool of domains—including industrial, commercial, and governmental implementations. This would improve the model’s generalizability and allow for a more comprehensive understanding of how explainability evolves across sectors.
Moreover, future studies may refine the maturity model by incorporating additional indicators. These could include dimensions such as the perceived usefulness of explanations by end users, the influence of explainability on decision-making processes, and adherence to domain-specific ethical and legal standards. Further iterations of the model could also explore how explainability interacts with other critical aspects such as algorithmic fairness, bias mitigation, and social responsibility.
This study represents a significant advancement in the assessment of explainability maturity in AI systems, offering a practical and replicable methodology through the MM4XAI-AE. The findings affirm the necessity of strengthening explainability practices and provide a roadmap for ongoing improvements in research and implementation. Future work should aim to complete the maturity framework by developing the third component—the improvement model—which will complement the analysis and evaluation models already established. This will ensure the model’s long-term relevance and utility across a wide spectrum of applications and regulatory contexts, guiding the evolution of more explainable, ethical, and human-centered AI.
6. Conclusions and Future Work
This research proposes a structured maturity model for explainable artificial intelligence named MM4XAI-AE, designed to assess the degree of explainability in AI-based applications through a two-part structure comprising an analysis model and an evaluation model. The MM4XAI-AE introduces four progressive maturity levels—operational, justified, formalized, and managed—and evaluates explainability using 23 indicators distributed across three key dimensions: technical foundations, structured design, and human-centered explainability. Each indicator contributes to the EMS, a weighted index ranging from 0 to 100 that prioritizes higher maturity levels to reflect their greater impact on achieving trustworthy AI.
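The exact indicator weights of the EMS are defined by the evaluation model and are not restated in this section; the sketch below only illustrates the general form of such a weighted, level-prioritizing index normalized to the 0–100 range, using assumed weights and an assumed indicator-to-level assignment.

```python
# Illustrative sketch of an EMS-style weighted score. The per-level weights and
# the indicator-to-level assignment below are assumptions for demonstration; the
# actual values are defined by the MM4XAI-AE evaluation model.
# level -> fulfilled/unfulfilled flags (0/1) for that level's indicators
fulfilled = {
    1: [1, 1, 1, 1],                  # placeholder indicator counts per level
    2: [1, 1, 0, 1, 1, 1, 0, 1],
    3: [1, 0, 1, 1, 0, 1, 1],
    4: [0, 1, 1, 0],
}
weights = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}   # assumed: higher levels weigh more

achieved = sum(weights[lvl] * sum(flags) for lvl, flags in fulfilled.items())
maximum = sum(weights[lvl] * len(flags) for lvl, flags in fulfilled.items())
ems = 100 * achieved / maximum               # normalized to the 0-100 range
print(round(ems, 1))
```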
The model was empirically validated by applying it to thirteen AI applications sourced from open-access scientific publications using structured datasets. Four AI experts applied the MM4XAI-AE following a guided evaluation protocol and scored the indicators in each study, enabling a robust classification of the applications across the maturity levels. This process revealed a wide distribution across the four levels, demonstrating the model’s capability to diagnose strengths, gaps, and areas of improvement in explainability practices.
The broader impact of this research lies in its potential to guide ethical development and inform regulatory and organizational strategies surrounding AI adoption. By providing a standardized and practical framework to evaluate explainability, the MM4XAI-AE supports transparency, fairness, and trust in AI systems, which is particularly relevant in high-stakes contexts such as healthcare, security, and public policy. Its conceptual alignment with the PAG-XAI framework (practicality, auditability, and governance) further reinforces its strategic relevance and utility.
Building on these results, future work will focus on the following directions:
1. Revising current indicators and integrating additional measures to more effectively capture explainability nuances.
2. Engaging domain experts to evaluate and validate the model’s relevance, reliability, and scoring mechanisms.
3. Extending the model’s application to broader ML and DL scenarios, testing its generalizability and robustness.
4. Developing a complementary improvement model to guide organizations toward achieving higher maturity levels and addressing identified gaps.
5. Establishing an evaluative approach to track AI explainability improvements over time.
6. Enhancing future iterations by explicitly integrating the PAG-XAI dimensions (practicality, auditability, and governance) to advance the practical use of XAI maturity approaches, reinforcing auditability, governance, and broader organizational adoption of explainable AI practices.
In an era where AI systems increasingly influence decisions that impact lives, rights, and public trust, the demand for transparency, fairness, and accountability has never been more urgent. MM4XAI-AE represents a critical step forward by bridging theoretical reflection with operational practice, offering a scalable and domain-agnostic pathway to evaluate and elevate explainability maturity in AI applications. Aligned with the PAG-XAI framework, the model embeds practicality, auditability, and governance as core principles of responsible AI. By providing measurable indicators and actionable insights, MM4XAI-AE empowers organizations to build AI systems that are not only technically robust but also ethically grounded and socially responsive.
As AI continues to permeate high-stakes domains, there is a pressing need for systematic and context-aware maturity assessments that move beyond mere compliance and reflect the evolving expectations of stakeholders. We call for a broader adoption of XAI maturity analysis across sectors, encouraging researchers, practitioners, and policymakers to collaborate in developing shared benchmarks, validating frameworks, and fostering continuous improvement. Explainability should not be treated as a static feature but as a dynamic, evolving capability—central to building AI systems that are trusted, transparent, and aligned with the values of the societies they serve.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding
The authors express their gratitude to the University of Cauca and Comfacauca University Corporation-Unicomfacauca for their support during the development of this work. This research received support from the “National Doctoral Call for University Professors—Cohort 2” of Colombia’s Ministry of Science, Technology, and Innovation (MinCiencias) and from the University of Cauca. Francisco Herrera was supported by the TSI-100927-2023-1 Project, funded by the Recovery, Transformation, and Resilience Plan from the European Union Next Generation through the Ministry for Digital Transformation and the Civil Service and the project PID2023-150070NB-I00 financed by MCIN/AEI/10.13039/501100011033.
Open Research
Data Availability Statement
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.