Intelligent Wireless Power Scheduling for Lunar Multienergy Systems: Deep Reinforcement Learning for Real-Time Adaptive Beam Steering and Vehicle-to-Grid Energy Optimization
Abstract
The integration of wireless power transfer (WPT) and vehicle-to-grid (V2G) technologies is essential for the sustainable operation of lunar multienergy virtual power plants (MEVPPs), where rovers, habitats, and in situ resource utilization (ISRU) facilities rely on adaptive energy management. Unlike terrestrial systems, lunar environments present extreme challenges, including long-duration night cycles, regolith dust accumulation, severe temperature fluctuations, and dynamic rover mobility, all of which disrupt efficient power delivery. This paper proposes a reinforcement learning–based adaptive beam steering framework to optimize WPT scheduling, ensuring continuous and efficient energy transmission for both mobile and stationary lunar assets. Unlike traditional fixed-beam or heuristic-based WPT methods, the proposed system utilizes deep reinforcement learning (DRL) with proximal policy optimization (PPO) to autonomously adjust beam direction, power intensity, and charging priority in response to real-time rover movements, V2G interactions, and fluctuating energy demands. The proposed framework models WPT optimization as a Markov decision process (MDP), where the agent learns to dynamically adapt beam steering based on rover speed, response delay, solar power availability, and charging station congestion. The reward function penalizes energy misallocation and misalignment losses while maximizing charging efficiency and systemwide energy resilience. A case study simulating a 30-day mission near Shackleton Crater evaluates the effectiveness of the AI–driven WPT system, demonstrating a 54.6% reduction in energy downtime and a 41.3% improvement in beam alignment efficiency compared to static power scheduling methods. In addition, the system reduces latency-induced power deficits by 39.8%, ensuring reliable power distribution for ISRU oxygen extraction, habitat life support, and rover recharging stations. 
This study represents a novel advancement in lunar power infrastructure, integrating AI–driven adaptive WPT with intelligent energy scheduling to enhance V2G interactions in extraterrestrial environments. The results validate the feasibility of DRL–based WPT control, paving the way for scalable, resilient, and self-optimizing wireless power grids on the Moon. Future work will explore the integration of hybrid energy storage models, quantum-inspired optimization for real-time decision-making, and predictive beamforming algorithms to further enhance the reliability and efficiency of lunar energy networks.
1. Introduction
The development of sustainable energy infrastructures for extraterrestrial habitats is a critical challenge in modern space exploration. Future missions to the Moon, Mars, and other celestial bodies require robust, autonomous, and adaptable energy management systems capable of supplying continuous power to a diverse set of infrastructure, including lunar habitats, rovers, in situ resource utilization (ISRU) facilities, and scientific instruments [1]. Unlike terrestrial power grids, which benefit from well-established generation and distribution networks, lunar energy systems face significant operational constraints, such as prolonged lunar nights, extreme temperature fluctuations, regolith dust accumulation, dynamic power demand, and the absence of an atmospheric medium for convection-based cooling [2]. Among the various energy distribution strategies for lunar missions, wireless power transfer (WPT) has emerged as a transformative technology capable of enabling efficient and flexible energy transmission without requiring a physically connected power grid [3]. Several experimental demonstrations have validated the feasibility of space-based power beaming, bridging the gap between theoretical models and practical deployment. NASA’s Space Solar Power Exploratory Research and Technology (SERT) program has investigated microwave power transmission (MPT) for extraterrestrial applications, demonstrating the ability to beam energy across long distances with high efficiency [4]. Similarly, JAXA’s WPT experiments have successfully transmitted microwave energy over hundreds of meters on Earth, providing critical insights into beamforming precision and transmission losses in space environments. These prior studies establish a strong foundation for implementing WPT in lunar missions. 
The proposed reinforcement learning (RL)–based framework builds upon these advancements by introducing adaptive beam steering and real-time energy optimization, ensuring efficient power allocation despite environmental uncertainties. By leveraging artificial intelligence (AI)–driven dynamic control, this study aims to further advance the feasibility of WPT for future lunar energy networks. Several existing power transmission technologies have been explored in space missions, each with unique advantages and limitations [5]. Microwave-based WPT, which is the focus of this study, has been widely considered due to its high transmission efficiency, long-range capabilities, and ability to operate in a vacuum without significant atmospheric attenuation. However, beam divergence increases over long distances, necessitating adaptive beam steering techniques to maintain efficiency. Alternatively, laser-based WPT offers a highly collimated energy beam, minimizing dispersion and enabling energy transmission beyond 5 km, a range at which microwave-based approaches are limited by beam divergence. However, laser WPT suffers from lower energy conversion efficiency, higher sensitivity to dust accumulation, and the risk of optical misalignment in dynamic environments. A third method, inductive coupling–based WPT, has been successfully used in low-power space applications, such as satellite docking stations and proximity power transfer systems. While highly efficient over short distances, inductive WPT is not well suited for large-scale lunar energy distribution due to its limited range and reliance on close physical proximity between the transmitter and the receiver.
By positioning the proposed RL–based microwave WPT framework within the broader spectrum of space-based power transmission techniques, this study highlights the advantages of adaptive beam steering and intelligent power scheduling, ensuring reliable and scalable energy distribution in extraterrestrial environments [6]. The trade-offs between the two beamed approaches are complementary: microwave WPT offers high transmission efficiency, but its beam divergence significantly reduces energy reception efficiency beyond 5 km, whereas laser-based WPT keeps a highly collimated beam for distant lunar assets beyond 5 km but faces higher conversion losses at both transmission and reception stages and is highly sensitive to regolith dust accumulation and beam obstruction. Given these trade-offs, a hybrid WPT approach combining microwave for midrange power transmission and laser for long-range energy beaming could potentially enhance lunar power distribution efficiency. This study focuses on microwave WPT optimization, while future work will explore the feasibility of integrating laser-based transmission for extended-range energy delivery [7]. WPT enables the direct beaming of energy to mobile and stationary units, allowing for seamless power delivery across a distributed lunar network. However, existing WPT frameworks primarily rely on fixed-schedule power transmission, failing to account for real-time variations in power demand, environmental interference, and dynamic movement of energy receivers (e.g., rovers and autonomous ISRU units).
The lack of intelligent, adaptive scheduling mechanisms significantly reduces energy efficiency, introduces transmission losses, and leads to suboptimal resource allocation in complex lunar environments. Recent advances in AI and RL provide an opportunity to revolutionize WPT scheduling, allowing the system to autonomously learn optimal power allocation strategies and adapt in real time to dynamic mission conditions. This paper proposes an RL–based adaptive beam steering framework to optimize WPT scheduling, ensuring continuous and efficient energy transmission for both mobile and stationary lunar assets. To address the challenges posed by lunar dust accumulation, extreme temperature fluctuations, and potential signal interference, the proposed system integrates an adaptive recalibration mechanism that dynamically adjusts beam alignment and power intensity in response to environmental uncertainties, thereby mitigating long-term efficiency degradation [8].
The proposed framework employs an RL–based approach to optimize WPT scheduling within a lunar multienergy virtual power plant (MEVPP). The system is modeled as a Markov decision process (MDP), where the state space includes real-time information on receiver positions, battery charge levels, solar energy availability, charging station occupancy, and regolith dust accumulation. However, given the high variability in mission tasks and energy consumption patterns, relying solely on real-time data may lead to suboptimal long-term decision-making. To enhance scheduling stability, the framework integrates historical mission data and predictive analytics, allowing the system to anticipate future energy demands based on past operational trends. By incorporating these predictive elements, the model can proactively adjust power allocations, mitigating the impact of sudden energy fluctuations and improving overall system resilience. The action space consists of power allocation decisions, beamforming adjustments, and charging prioritization, while the reward function is designed to maximize energy efficiency and to penalize energy deficits and transmission losses [9]. A deep RL (DRL) model based on proximal policy optimization (PPO) is developed to train an adaptive policy for WPT scheduling. The PPO algorithm is selected due to its ability to handle high-dimensional state-action spaces and provide stable convergence, making it well suited to large-scale, data-driven energy optimization problems [10]. The DRL agent learns optimal power transmission policies by interacting with a simulated lunar environment, continuously refining its decisions through policy gradient updates [11]. To ensure scalability, multiagent RL (MARL) principles are integrated into the framework, allowing multiple power nodes to collaboratively optimize energy distribution.
To prevent conflicting energy allocation decisions among agents, the framework employs a hierarchical coordination mechanism, where a global energy dispatcher acts as a supervisory agent, providing high-level constraints on total power availability, transmission priorities, and fairness constraints. Each individual WPT node functions as an independent agent, learning to optimize local power transmission while adhering to global consensus rules enforced by the dispatcher. In addition, interagent communication is facilitated through a decentralized consensus protocol, where agents exchange real-time energy demand, power congestion status, and beam alignment updates to ensure nonconflicting power allocations. A soft-update rule is incorporated to prevent abrupt fluctuations in transmission assignments, ensuring that energy distribution remains stable across the system. This coordinated MARL approach allows each node to dynamically adapt to fluctuating energy demands while maintaining systemwide stability and fairness in power allocation. The introduction of global supervision and decentralized agent communication significantly enhances the robustness of RL–based WPT scheduling in lunar environments [12]. This multiagent architecture ensures that power transmission decisions remain decentralized yet coordinated, allowing for scalable deployment across future lunar base architectures [13]. The proposed approach is evaluated through a high-fidelity case study, simulating a lunar mission scenario near Shackleton Crater, where the system’s performance is tested against variable solar power availability, extreme temperature gradients, and diverse rover mobility patterns. Comparative results with fixed-schedule WPT and rule-based heuristic scheduling demonstrate the superior efficiency, adaptability, and resilience of the proposed DRL–based model. 
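The soft-update rule in the coordination mechanism above is not specified further in this section. As a minimal sketch, it could be read as exponential smoothing between the current and the newly proposed transmission assignments. The function name `soft_update` and the mixing coefficient `tau` are hypothetical and are not taken from the paper.

```python
def soft_update(current, proposed, tau=0.2):
    """Blend current per-receiver power assignments toward proposed ones.

    tau is a hypothetical smoothing coefficient in (0, 1]: tau=1 applies the
    proposal immediately, smaller values damp abrupt changes.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    if len(current) != len(proposed):
        raise ValueError("assignment vectors must have equal length")
    return [(1.0 - tau) * c + tau * p for c, p in zip(current, proposed)]


# Example: the dispatcher proposes shifting 4 kW from receiver 0 to receiver 1.
current_kw = [10.0, 2.0, 5.0]
proposed_kw = [6.0, 6.0, 5.0]
smoothed_kw = soft_update(current_kw, proposed_kw, tau=0.25)
```

Because the update is a convex combination, the total assigned power is preserved whenever the current and proposed assignments have the same total.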
This paper introduces a novel DRL–based WPT scheduling framework for lunar energy management, presenting four major contributions as follows.
1.1. RL–Based Adaptive WPT Model
Unlike traditional WPT systems that rely on static transmission schedules, this paper introduces a learning-based adaptive model that dynamically adjusts power allocations in real time, responding to changes in energy demand, receiver mobility, and environmental conditions.
1.2. MDP Formulation for Lunar WPT Optimization
The proposed system is formulated as a complex, high-dimensional MDP, integrating power allocation, beam steering, and charging prioritization into a single optimization framework. This allows for holistic decision-making, where the model learns the most efficient energy distribution strategy under uncertain and dynamic lunar conditions.
1.3. PPO–Based DRL for Real-Time Learning
This paper leverages PPO–based DRL training, enabling the WPT system to learn optimal policies through continuous interaction with the lunar environment. The use of PPO ensures stable convergence, robust performance under stochastic conditions, and computational efficiency suitable for large-scale deployment in future lunar bases.
1.4. Comprehensive Performance Evaluation With a 30-Day Lunar Mission Simulation
The proposed model is rigorously tested in a realistic lunar mission environment, where the DRL–optimized WPT scheduling strategy is compared against conventional fixed-schedule and heuristic-based methods. In addition, an extended experiment evaluates the impact of integrating historical mission data into the RL framework. The results indicate that incorporating predictive analytics improves power scheduling efficiency by 17.3% and reduces emergency power deficits by 12.6% over a 30-day mission. The findings demonstrate that leveraging past operational data enhances long-term energy management, reducing unexpected fluctuations and ensuring more consistent power delivery for critical lunar operations [14]. The results demonstrate that the proposed system reduces energy downtime by 52.4%, improves power transmission efficiency by 38.9%, and decreases energy congestion by 41.2%, making it a groundbreaking advancement for lunar energy management.
2. Literature Review
The development of sustainable power infrastructure is a critical challenge for space exploration, particularly for long-term lunar and Martian missions. Unlike terrestrial energy systems, which benefit from a stable grid infrastructure, extraterrestrial environments require highly flexible, autonomous, and efficient power management strategies [15]. The need for adaptable energy distribution is amplified by the unique constraints of lunar operations, including prolonged night cycles, extreme temperature variations, regolith dust interference, and the absence of a stable atmosphere for heat dissipation. Traditional wired power grids are impractical in such environments due to deployment challenges, vulnerability to environmental hazards, and the difficulty of maintaining permanent infrastructure on rugged and dynamically evolving surfaces [16]. As a result, WPT has emerged as a promising solution, offering the capability to beam energy efficiently to mobile and stationary units without the limitations of physical wiring [17]. Existing WPT research has primarily focused on terrestrial applications such as electric vehicle charging, consumer electronics, and medical implants. However, the extension of WPT technology to space applications introduces additional complexities, such as energy beam alignment in the absence of atmospheric stabilization, transmission losses due to dust accumulation, and the need for real-time power optimization to accommodate fluctuating energy demands [18]. Several studies have investigated the feasibility of microwave and laser-based energy beaming for lunar applications, demonstrating the potential of WPT as a viable power distribution method [19]. However, these approaches generally assume static power transmission schedules and fail to incorporate intelligent decision-making frameworks capable of dynamically adjusting energy allocation in response to real-time mission requirements. 
This limitation underscores the necessity for an adaptive, RL–based approach to WPT scheduling, capable of autonomously optimizing power distribution under varying lunar conditions [20].
Traditional energy distribution for space missions has relied on wired power grids, battery storage, and nuclear reactors to ensure continuous power availability. Battery-based energy storage, such as lithium-ion, lithium-sulfur, and solid-state batteries, has been widely employed in planetary rovers and landers, including NASA’s Curiosity and Perseverance missions [21]. These storage systems provide a reliable energy source but are inherently limited by capacity constraints, degradation over multiple charge cycles, and the inability to dynamically reallocate power to mobile units. Wired power grids, as proposed for lunar habitats under NASA’s Artemis program, offer a structured energy distribution mechanism but face significant deployment challenges, particularly in harsh extraterrestrial environments [1]. The installation of wired transmission lines on the lunar surface is impractical due to frequent regolith displacement, potential mechanical failures, and the inflexibility of fixed-position infrastructure. In addition to battery and wired grid solutions, nuclear power has been explored as a long-term energy source for extraterrestrial applications. NASA’s Kilopower project and similar initiatives have investigated small-scale nuclear fission reactors designed to provide continuous power for lunar and Martian bases. While nuclear reactors offer a reliable energy supply independent of solar availability, their integration into a flexible, decentralized energy distribution system remains an unresolved challenge. These conventional power solutions, while valuable in isolated applications, lack the adaptability and scalability required for complex multienergy systems where power demands fluctuate dynamically [22].
WPT has gained considerable attention as an alternative to traditional wired energy distribution, particularly for its potential applications in extraterrestrial environments. The primary advantage of WPT lies in its ability to deliver energy without requiring fixed transmission infrastructure, making it particularly suitable for mobile assets such as lunar rovers, ISRU units, and scientific instruments deployed across vast surface areas. Among the various WPT technologies explored for space applications, MPT and laser energy beaming have demonstrated significant potential. Studies on microwave-based WPT have highlighted its efficiency in transmitting energy over long distances with minimal loss, with proposals such as Japan’s Space-Based Solar Power (SBSP) system envisioning the deployment of geostationary satellites to beam energy directly to lunar surface operations [23]. Similarly, laser-based WPT systems have been explored as a means of high-precision, long-range energy delivery, with experimental demonstrations showing promising results in achieving targeted power transmission. Despite the potential benefits of WPT for extraterrestrial energy distribution, existing studies remain largely theoretical and do not account for the operational complexities involved in real-time lunar power scheduling. Most WPT research assumes fixed energy allocation strategies, failing to incorporate adaptive optimization frameworks that respond dynamically to fluctuating power demands, environmental disruptions, and mobility patterns of energy receivers. The lack of intelligent control mechanisms capable of optimizing beam alignment, prioritizing critical energy loads, and dynamically adjusting transmission parameters in response to real-time mission conditions represents a major gap in the current literature. 
This gap highlights the need for an advanced WPT scheduling framework that integrates RL–based optimization techniques to enable autonomous decision-making in complex extraterrestrial power networks [24].
3. Mathematical Modeling
The optimization of WPT scheduling and adaptive beam steering for lunar MEVPPs requires a robust mathematical framework that accurately models energy dynamics, receiver mobility, and beam alignment efficiency. Unlike terrestrial grid–based power distribution, lunar environments introduce unique constraints such as high-energy latency due to long transmission distances, fluctuating solar availability, regolith dust–induced power losses, and extreme thermal variations affecting energy conversion efficiency. In addition, the mobility of rovers and ISRU units necessitates real-time adjustments in beam direction and power allocation, ensuring minimal energy wastage while maximizing operational reliability. This section formulates the WPT beam steering and energy scheduling problem as a multiobjective optimization model, integrating spatiotemporal energy distribution constraints, battery state-of-charge (SoC) evolution, power balancing conditions, and transmission efficiency degradation due to misalignment effects. To systematically address these challenges, we define a set of objective functions and constraints that govern the adaptive WPT system, considering factors such as beamforming precision, real-time power redistribution, charging priorities, and system resilience under dynamic conditions. The first objective function focuses on maximizing total WPT efficiency by optimizing beam alignment and minimizing transmission losses, ensuring that both stationary and mobile receivers receive power in a timely manner. The second objective aims to reduce energy downtime, mitigating the risk of power shortages due to misalignment errors or response delays. The third function minimizes overall transmission losses by accounting for beam divergence, lunar terrain interference, and thermal effects. 
Lastly, we introduce an optimization function that prioritizes power allocation based on receiver criticality, ensuring that high-priority systems, such as habitat life support and ISRU operations, maintain uninterrupted power even under fluctuating energy conditions. To enhance real-world applicability, the model also incorporates emergency response mechanisms that dynamically adjust energy priorities in response to critical failures or rapid operational changes. Specifically, in the event of a sudden communication failure, energy allocation shifts toward autonomous system resilience, prioritizing onboard energy storage for essential functions such as navigation and hazard detection until communication is restored. Similarly, during rapid rover redeployment for urgent scientific tasks or terrain changes, the WPT scheduling framework reallocates power to mobile units requiring immediate charging, ensuring uninterrupted operation while maintaining sufficient reserves for stationary assets. These adaptive adjustments are encoded within the DRL–based policy network, enabling real-time energy redistribution that aligns with evolving mission demands. The energy allocation model employs a hierarchical prioritization framework to differentiate between essential and nonessential loads. Mission-critical systems, such as ISRU oxygen extraction, habitat life support, and rover mobility, are assigned higher priority weights in the RL reward function, ensuring that they receive uninterrupted power. Lower-priority loads, such as scientific instruments and secondary charging stations, are allocated energy dynamically based on systemwide availability. The RL framework continuously adjusts power distribution using real-time system state data, ensuring adaptive prioritization that responds to dynamic mission conditions. By incorporating this prioritization mechanism, the model optimizes power allocation efficiency while preventing disruptions in critical lunar operations.
Equation (4) optimizes charging priorities to ensure continuous energy delivery to high-priority lunar systems. Its priority weighting factor ensures that critical loads (e.g., life support and ISRU oxygen extraction) receive power first. The denominator accounts for the total energy allocation, with a penalty for overallocation. The last term penalizes delays and inefficiencies, ensuring optimal power scheduling.
Equation (30) serves as the final convergence condition for the overall WPT optimization algorithm, ensuring that power allocations reach a stable, steady-state solution over time. The summation quantifies the variation in optimized power transmission levels, and the limiting behavior guarantees that as time progresses, fluctuations vanish. This is particularly important for real-time RL–based scheduling models, ensuring that optimization processes do not oscillate indefinitely or converge to suboptimal solutions.
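Equation (30) itself is not reproduced in this section. One form consistent with the description above can be sketched under assumed notation, where $P_i^{*}(t)$ denotes the optimized power transmitted to receiver $i$ at time step $t$ and $N$ is the number of receivers. Both symbols are hypothetical and may differ from the paper's own.

```latex
\lim_{t \to \infty} \sum_{i=1}^{N} \left| P_i^{*}(t+1) - P_i^{*}(t) \right| = 0
```

Under this reading, the summation measures the step-to-step change in the allocation, and the limit states that this change vanishes as the schedule settles.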
4. Methodology
To solve the complex, nonlinear, and dynamic optimization problem formulated in the previous section, this study leverages DRL with PPO for adaptive beam steering and WPT scheduling. Unlike traditional rule-based WPT control mechanisms, RL enables the adaptive optimization of power allocation and beam positioning based on real-time state observations, allowing the system to self-learn and optimize power dispatch strategies under varying environmental and operational conditions. This methodology integrates an RL framework with MDP modeling, ensuring that the agent can continuously learn optimal power distribution strategies based on receiver mobility, energy demands, and real-time solar power fluctuations.
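For readers unfamiliar with PPO, the clipped surrogate objective at its core can be sketched in a few lines of plain Python. This is the standard textbook form rather than the paper's specific implementation. The clipping range `clip_eps=0.2` is a commonly used default and is an assumption here, as are the function and variable names.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    updated and the data-collecting policy; advantages: advantage estimates.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to move the ratio
        # outside the interval [1 - clip_eps, 1 + clip_eps].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Example: one action became more likely (ratio 1.5, positive advantage),
# another became less likely (ratio 0.5, negative advantage).
objective = ppo_clipped_objective(
    logp_new=[math.log(1.5), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

In the example, the first term is capped at 1.2 and the second is held at -0.8, which is how clipping limits the size of each policy update.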
The proposed learning model represents the WPT energy scheduling problem as an MDP, where the state space consists of rover positions, battery SoC levels, charging priorities, and beam alignment conditions. The action space includes power allocation decisions, beam steering adjustments, and priority-based power scheduling updates. The reward function is carefully designed to maximize overall WPT efficiency, minimize energy downtime, and penalize unnecessary idle time or misalignment-induced losses. The RL agent utilizes a policy gradient–based optimization approach with PPO, ensuring that the model converges rapidly while maintaining exploration–exploitation balance. A key enhancement of the proposed framework is its ability to dynamically reallocate power during emergency conditions. When a communication failure or loss of sensor data occurs, the DRL model immediately shifts energy resources toward local autonomy, ensuring that rovers and stationary units can operate independently until normal operations resume. In addition, in scenarios requiring rapid rover redeployment, the model learns to prioritize power delivery to high-mobility receivers while adjusting static unit power budgets to prevent disruptions in habitat support and ISRU operations. This adaptive response capability significantly enhances the framework’s resilience in unpredictable lunar mission environments. To further enhance decision-making stability, the framework integrates a predictive analytics layer that utilizes historical mission data to refine power scheduling strategies. By analyzing past rover mobility patterns, energy consumption trends, and environmental variations, the model adjusts its policy updates to incorporate anticipated future demands. This allows the system to proactively allocate energy resources, reducing the likelihood of sudden shortages or excessive allocations.
The incorporation of historical insights enables the RL model to balance real-time adaptability with long-term optimization, significantly improving system efficiency and reliability. The policy network continuously refines decision-making strategies, adjusting beam intensity and power scheduling in response to real-time environmental changes. Given the intermittent connectivity and potential signal delays in lunar environments, the proposed framework incorporates fail-safe mechanisms to maintain stable power delivery during communication disruptions. Specifically, each mobile receiver is equipped with a local predictive model trained using historical mission data and on-site observations to estimate power requirements in the event of temporary communication loss. This allows the receiver to autonomously adjust beam alignment and energy scheduling based on its last known state. In addition, the transmitter utilizes an adaptive scheduling buffer, where power transmission decisions are precomputed based on predicted rover trajectories and energy demand trends. This ensures that even during short-term signal outages, energy delivery continues without major interruptions. Furthermore, a hierarchical decision-making approach is employed, where high-priority receivers (such as habitats and ISRU units) are given redundant transmission paths through relay-based WPT stations, ensuring reliable power allocation even under extreme conditions. These fail-safe mechanisms enhance the system’s resilience to sudden communication failures, ensuring continued energy availability for mission-critical operations while maintaining overall power efficiency. For real-world deployment, the DRL model must operate within the computational constraints of space-grade embedded hardware.
To address this, the proposed framework employs a hybrid on-device and ground-assisted learning approach, where the training phase is conducted offline using high-performance computing clusters, while the trained model is compressed and optimized for onboard execution. Model reduction techniques such as quantization, pruning, and knowledge distillation are applied to minimize memory footprint and computational overhead, ensuring feasibility for low-power, radiation-hardened processors used in space missions. In addition, the framework leverages edge AI inference techniques, where policy updates are efficiently executed on embedded processors without requiring full-scale deep learning model retraining. This allows the DRL–based energy scheduling system to dynamically adjust power allocation in real time while minimizing computational latency. The framework also integrates an adaptive recalibration mechanism that dynamically updates WPT parameters based on real-time sensor feedback. By periodically assessing power transmission efficiency and environmental disruptions, the system proactively mitigates degradation due to dust accumulation and thermal variations, ensuring stable and reliable energy delivery.
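Of the model reduction techniques named above, quantization is the simplest to illustrate. The sketch below shows symmetric per-tensor int8 quantization of a weight list in plain Python. The paper does not state which quantization scheme is used, so the scheme and the helper names `quantize_int8` and `dequantize` are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integer codes in [-127, 127].

    Returns the integer codes and the scale needed to reconstruct the values.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Example: four policy-network weights stored as 8-bit codes plus one scale.
weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight then occupies one byte instead of four (for float32), at the cost of a reconstruction error of at most half the scale per weight.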
To address the potential electromagnetic interference (EMI) risks associated with high-power WPT, the proposed framework incorporates multiple mitigation techniques to ensure electromagnetic compatibility (EMC) in lunar energy systems. First, frequency modulation (FM) and frequency hopping techniques are implemented to dynamically adjust transmission frequency, ensuring minimal interference from nearby communication and sensor networks. By actively shifting operating frequencies, the system prevents prolonged exposure within any single frequency band, reducing EMI persistence and cross-system disturbances. Second, adaptive beamforming is employed to optimize phase control in the transmitting array, ensuring precise directional energy transmission while minimizing unintended radiation spillover. This technique significantly reduces EMI leakage to nontargeted zones, making the system more suitable for operation in lunar environments where sensitive scientific instruments and habitat electronics must be protected from electromagnetic disturbances. Third, electromagnetic shielding and antenna pattern optimization are integrated into the system. High-conductivity shielding materials are applied around the transmitting and receiving units to mitigate electromagnetic leakage. Moreover, low-sidelobe antenna designs are employed to further reduce unintended emissions, ensuring that most of the transmitted energy is confined within the desired beam path. Lastly, power density constraints are introduced within the WPT optimization framework to ensure compliance with internationally recognized EMI safety standards, such as IEEE C95.1 and ICNIRP guidelines. These constraints prevent excessive electromagnetic field strength in human-occupied zones and high-sensitivity scientific areas, enhancing the safety and reliability of the proposed WPT system.
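The power density constraint described above can be sketched with the standard far-field relation S = P·G / (4πd²). The limit value used below is a placeholder and is not quoted from IEEE C95.1 or the ICNIRP guidelines. The function names and the use of a single sidelobe gain toward the protected zone are illustrative assumptions.

```python
import math


def power_density_w_m2(tx_power_w, gain_linear, distance_m):
    """Far-field power density at distance_m in a direction with the given gain."""
    return tx_power_w * gain_linear / (4.0 * math.pi * distance_m ** 2)


def max_allowed_tx_power_w(limit_w_m2, gain_linear, distance_m):
    """Largest transmit power that keeps the density at or below the limit."""
    return limit_w_m2 * 4.0 * math.pi * distance_m ** 2 / gain_linear


def clamp_tx_power(requested_w, limit_w_m2, sidelobe_gain, distance_m):
    """Reduce the requested power if it would exceed the limit in a protected zone."""
    allowed = max_allowed_tx_power_w(limit_w_m2, sidelobe_gain, distance_m)
    return min(requested_w, allowed)


LIMIT_W_M2 = 10.0  # placeholder limit, not quoted from either standard

# Example: a habitat 50 m away sits in a sidelobe with linear gain 2.
clamped_w = clamp_tx_power(200_000.0, LIMIT_W_M2, sidelobe_gain=2.0, distance_m=50.0)
density_at_habitat = power_density_w_m2(clamped_w, 2.0, 50.0)
```

A scheduler could apply such a clamp to every candidate power allocation before it is executed, so that the learned policy never has to discover the safety limit by trial and error.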
- S: State space. It represents the system’s current state, including energy levels of receivers, rover positions, charging station availability, and environmental factors such as lunar dust accumulation and temperature fluctuations.
- A: Action space. It defines the available decisions, including beam steering, power allocation, charging prioritization, and scheduling strategies.
- P: Transition probability model. It describes the dynamics of the system, governing how the environment evolves after an action is taken.
- R: Reward function. It encodes the optimization objective, typically maximizing power efficiency while minimizing energy deficits and unnecessary charging cycles.
- γ: Discount factor. It controls the importance of future rewards, ensuring that the agent optimizes energy scheduling not just for the immediate step but over the long term.
The state vector includes:
- Battery SoC of receiver unit ι at time t, which determines the need for energy replenishment.
- Current power demand of the load at unit ι, reflecting real-time energy consumption.
- Charging station occupancy indicator, ensuring efficient scheduling to avoid congestion.
- Spatial position of mobile receivers (rovers and ISRU facilities), crucial for beam alignment and efficient power transmission.
- Solar energy availability, which impacts the overall power generation capacity.
- Regolith dust interference level, affecting the efficiency of power reception.
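One way to hold these state components in code is a small per-receiver record that is flattened into an observation vector. This is a minimal sketch: the field names, units, and ordering are assumptions, since the original symbols are not recoverable here, and solar availability may well be a global rather than per-receiver quantity.

```python
from dataclasses import dataclass, astuple


@dataclass
class ReceiverState:
    soc: float               # battery state of charge, fraction in [0, 1]
    demand_kw: float         # current load power demand
    station_occupied: int    # charging station occupancy indicator (0 or 1)
    x_km: float              # spatial position of the receiver
    y_km: float
    z_km: float
    solar_availability: float  # normalized solar input, 0 to 1
    dust_level: float          # regolith dust interference, 0 to 1


def to_observation(states):
    """Concatenate per-receiver states into one flat list of floats."""
    return [float(v) for s in states for v in astuple(s)]


rover = ReceiverState(0.5, 6.5, 0, 1.2, -0.8, 0.1, 0.3, 0.05)
habitat = ReceiverState(0.9, 30.0, 1, 0.0, 2.0, 0.0, 0.3, 0.02)
obs = to_observation([rover, habitat])
```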
The action vector includes:
- Power allocation decision, determining how much energy is assigned to each receiver at time t.
- Beam steering parameters, ensuring that transmitted power aligns optimally with moving receivers.
- Charging priority index, assigning priority levels to different loads based on criticality.
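Since the power allocation part of the action must respect the transmitter's capacity, a policy output is usually projected onto the feasible set before it is applied. The sketch below shows one simple projection (proportional scaling); the function name and the choice of proportional scaling are assumptions, not the paper's stated method.

```python
def project_allocation(requested_kw, budget_kw):
    """Clip negative requests, then scale down proportionally if over budget."""
    clipped = [max(0.0, p) for p in requested_kw]
    total = sum(clipped)
    if total <= budget_kw or total == 0.0:
        return clipped
    return [p * budget_kw / total for p in clipped]


# Raw policy output for three receivers against a 100 kW transmitter.
raw = [40.0, 50.0, 30.0]
feasible = project_allocation(raw, budget_kw=100.0)
```

A priority-aware variant could instead serve receivers in order of the charging priority index and cut the lowest-priority loads first.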
The state transitions capture:
- Battery state evolution, where power allocation increases SoC.
- Spatial displacement, as moving receivers experience position-dependent energy reception changes.
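The two transition components can be sketched as simple deterministic updates. The form below is an assumption: it adds a demand drain term and a fixed reception efficiency that the list above does not spell out, and all parameter names are hypothetical.

```python
def next_soc(soc, alloc_kw, demand_kw, capacity_kwh, dt_h, rx_efficiency=0.85):
    """SoC rises with received power and falls with load demand, clipped to [0, 1]."""
    delta = (rx_efficiency * alloc_kw - demand_kw) * dt_h / capacity_kwh
    return min(1.0, max(0.0, soc + delta))


def next_position(pos_km, velocity_km_h, dt_h):
    """Straight-line displacement of a mobile receiver over one time step."""
    return tuple(p + v * dt_h for p, v in zip(pos_km, velocity_km_h))


# A 20 kWh rover at 50% SoC, allocated 10 kW while drawing 6.5 kW for one hour.
soc = next_soc(0.5, alloc_kw=10.0, demand_kw=6.5, capacity_kwh=20.0, dt_h=1.0)
pos = next_position((1.0, 2.0), (0.5, -0.25), dt_h=2.0)
```

In the full model the reception efficiency would itself depend on the new position and beam alignment, which is what makes the transition position-dependent.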
The reward function penalizes:
- Energy overuse, ensuring efficient WPT scheduling.
- Unnecessary idle states, preventing wasted transmission power.
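A reward with these two penalty terms might be sketched as follows. The exact definitions are assumptions: here "overuse" is allocation beyond what a receiver needs, "idle" is power beamed at a receiver with no current need, and the weights `w_over` and `w_idle` are hypothetical.

```python
def step_reward(allocated_kw, needed_kw, delivered_kw, w_over=0.5, w_idle=0.2):
    """Useful delivered power minus penalties for overuse and idle transmission."""
    triples = list(zip(allocated_kw, needed_kw, delivered_kw))
    useful = sum(min(d, n) for _, n, d in triples)
    overuse = sum(max(0.0, a - n) for a, n, _ in triples if n > 0.0)
    idle = sum(a for a, n, _ in triples if n == 0.0)
    return useful - w_over * overuse - w_idle * idle


reward = step_reward(
    allocated_kw=[10.0, 5.0, 3.0],
    needed_kw=[8.0, 5.0, 0.0],
    delivered_kw=[8.0, 4.0, 0.0],
)
```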
The training complexity depends on:
- Nepoch: The number of training epochs required for convergence.
- Nbatch: The number of data samples processed per optimization step.
- The expected size of the state space, determining the dimensionality of the RL problem.
Equation (47) assesses generalization performance across different lunar missions, ensuring that the trained RL model performs consistently across varied scenarios. The expected reward in test environments should closely approximate the reward obtained in training environments, with an allowable deviation margin δgen. This guarantees that the WPT optimization framework remains reliable when deployed in new lunar regions, varying terrain conditions, or with different power infrastructures.
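The generalization criterion described here amounts to checking that the gap between mean test and mean training reward stays within δgen. A minimal sketch, assuming episode returns are available as plain lists and using made-up numbers:

```python
from statistics import mean


def generalizes(train_returns, test_returns, delta_gen):
    """Return (passes, gap), where gap is the absolute mean-reward difference."""
    gap = abs(mean(test_returns) - mean(train_returns))
    return gap <= delta_gen, gap


# Illustrative returns only; these are not results from the paper.
ok, gap = generalizes([100.0, 110.0, 90.0], [95.0, 105.0, 85.0], delta_gen=10.0)
```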
5. Case Studies
To evaluate the performance of the proposed RL–based adaptive WPT optimization framework, a high-fidelity simulation of a lunar MEVPP was conducted. The case study focuses on a 30-day continuous lunar mission near Shackleton Crater (89.9°S, 0.0°E), a region of interest due to its permanent shadow zones and fluctuating solar power availability. To further assess the adaptability of the proposed model across varying lunar terrains, additional considerations are made for equatorial regions where solar exposure patterns differ significantly. Unlike polar sites with prolonged shadow zones, equatorial locations experience alternating periods of full illumination and extended darkness, leading to more dynamic energy availability. The RL–based scheduling framework is designed to adjust power allocation in response to real-time solar input variations, ensuring applicability in environments with fluctuating solar flux. By leveraging predictive solar exposure models, the system can optimize WPT scheduling by preemptively dispatching energy during high-insolation periods and strategically utilizing stored power during extended night phases. To improve long-term WPT performance, the model incorporates a degradation-aware optimization strategy. Over time, cumulative losses in WPT hardware, caused by thermal cycling, material fatigue, and regolith-induced wear, gradually reduce transmission efficiency. A degradation-aware reward function enables RL to anticipate and compensate for these effects by dynamically adjusting beam intensity, recalibrating power allocation, and prioritizing maintenance when necessary. In addition, real-time sensor data on system degradation is continuously integrated into the learning framework, ensuring adaptive scheduling adjustments to mitigate performance declines. This enhancement ensures that the model remains robust and effective in long-duration lunar operations, improving the sustainability of WPT deployment over extended missions.
In addition, terrain variations at equatorial sites introduce new challenges for rover mobility and beam tracking. The adaptive motion compensation mechanism incorporated in the model, which was designed to handle Shackleton Crater’s rugged topography, remains applicable in equatorial conditions by dynamically adjusting beam alignment in response to shifting environmental constraints. These considerations demonstrate that the proposed framework is not limited to polar regions but can be extended to diverse lunar terrains, ensuring reliable power distribution under varying solar and mobility conditions.
The study considers a 10 × 10 km operational zone, where multiple energy receivers, including four autonomous rovers, two ISRU extraction units, and a primary lunar habitat, require continuous and adaptive energy allocation. The primary WPT transmission station, modeled as a 100 kW high-efficiency microwave beaming system operating at 2.45 GHz, is capable of transmitting power to multiple receivers simultaneously, with a maximum transmission range of 12 km and an efficiency rate of 85% under optimal beam alignment conditions. The energy demand profile is dynamically generated based on realistic lunar mission scenarios. The four rovers, each with a 20 kWh battery capacity, have varying energy consumption rates depending on their assigned tasks, with average power usage ranging from 2 kW during standby mode to 6.5 kW during excavation and mapping operations. The two ISRU extraction units, responsible for oxygen and water ice processing from lunar regolith, operate at a fixed load of 15 kW each, with intermittent peak demands reaching 18 kW during active refinement cycles. The lunar habitat module, which supports astronaut life support and scientific equipment, has a baseline power consumption of 30 kW, with fluctuations of ±10% depending on habitat occupancy and operational conditions. These energy demands present a highly dynamic and uncertain environment, making it an ideal testbed for RL–based optimization.
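The figures above can be combined into a quick power budget. The sketch below sums the stated baseline and peak loads and compares them with the power deliverable at 85% efficiency; treating all peaks as simultaneous is an assumption that gives an upper bound, not a scenario stated in the paper.

```python
N_ROVERS, ROVER_STANDBY_KW, ROVER_PEAK_KW = 4, 2.0, 6.5
N_ISRU, ISRU_BASE_KW, ISRU_PEAK_KW = 2, 15.0, 18.0
HABITAT_BASE_KW, HABITAT_SWING = 30.0, 0.10
TX_POWER_KW, TX_EFFICIENCY = 100.0, 0.85

baseline_kw = N_ROVERS * ROVER_STANDBY_KW + N_ISRU * ISRU_BASE_KW + HABITAT_BASE_KW
peak_kw = (
    N_ROVERS * ROVER_PEAK_KW
    + N_ISRU * ISRU_PEAK_KW
    + HABITAT_BASE_KW * (1.0 + HABITAT_SWING)
)
deliverable_kw = TX_POWER_KW * TX_EFFICIENCY

# Positive margin means the transmitter alone can cover the load.
baseline_margin_kw = deliverable_kw - baseline_kw
peak_margin_kw = deliverable_kw - peak_kw
```

Under these assumptions the baseline load (68 kW) fits within the 85 kW deliverable at best-case alignment, while the worst-case simultaneous peak (95 kW) does not, which is consistent with the need for priority-based scheduling and stored energy.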
The simulation is conducted using a Python-based RL framework, integrating Stable-Baselines3 for PPO training, OpenAI Gym for MDP–based state-action formulation, and TensorFlow for deep neural network optimization. The training and evaluation phases are performed on a high-performance computing cluster equipped with Intel Xeon 32-core processors (2.9 GHz), 256 GB RAM, and NVIDIA A100 Tensor Core GPUs, allowing for parallel training of RL agents. While these high-fidelity simulations provide a controlled environment for evaluating the WPT framework, real-world lunar deployment introduces additional challenges, including hardware constraints, communication latencies, and mission uncertainties. To enhance the generalization capability of the proposed framework, future work will integrate hardware-in-the-loop (HIL) simulations to assess real-time performance under actual system latencies and hardware limitations. In addition, incorporating field data from past lunar missions and terrestrial analog environments will further validate the robustness of DRL–based energy scheduling under real-world conditions. By adapting the framework to varying computation capacities, sensor noise, and dynamic mission scenarios, we aim to improve its practical feasibility for autonomous lunar power management.
The simulation runs for 5000 episodes, each representing a 24-hour operation cycle, ensuring sufficient training for policy convergence. The PPO model is configured with a discount factor (γ) of 0.99, an adaptive learning rate ranging from 1 × 10⁻⁴ to 5 × 10⁻⁵, and a batch size of 4096 experience samples per update step. This computational setup ensures that the RL agent achieves optimal decision-making under real-time constraints, learning to maximize power efficiency while minimizing energy deficits and beam misalignment losses.
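The adaptive learning rate can be expressed as a schedule over training progress. The paper gives only the two endpoint values, so the linear decay below is an assumption. Stable-Baselines3 accepts a callable of the remaining progress (running from 1 at the start to 0 at the end) as its learning rate, and the function is written in that form; the configuration dictionary is only a plain record of the stated hyperparameters.

```python
def linear_lr(progress_remaining, lr_start=1e-4, lr_end=5e-5):
    """Linearly decay the learning rate as progress_remaining goes from 1 to 0."""
    return lr_end + (lr_start - lr_end) * progress_remaining


ppo_config = {
    "gamma": 0.99,        # discount factor stated in the paper
    "batch_size": 4096,   # experience samples per update step
    "learning_rate": linear_lr,
    "n_episodes": 5000,   # each episode is one 24-hour operation cycle
}

lr_at_start = ppo_config["learning_rate"](1.0)
lr_at_midpoint = ppo_config["learning_rate"](0.5)
lr_at_end = ppo_config["learning_rate"](0.0)
```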
Figure 1 provides a highly detailed analysis of the solar exposure across a 10 × 10 km region surrounding the Shackleton Crater. The left side represents the average sun visibility over a given period, with a color scale from dark blue (low visibility, near 0) to yellow (high visibility, close to 1). The central dark blue region represents the permanently shadowed interior of the Shackleton Crater, where solar illumination is nearly nonexistent, making it one of the prime candidates for long-term ice preservation and ISRU. The surrounding regions exhibit varying degrees of sunlight exposure, with some areas receiving up to 90% visibility, suggesting optimal locations for solar panel installations and surface-based energy harvesting systems. The yellow-highlighted contour zones indicate terrain areas that receive moderate sunlight exposure, potentially suitable for deploying power relay stations or energy storage hubs. The right-side visualization presents a 3D perspective of the Shackleton Crater, emphasizing the extreme depth and sharp elevation changes within the crater. The color-coded elevation layers highlight how the terrain structure influences the solar exposure, with the deepest parts of the crater remaining entirely in shadow, while the upper rims and nearby ridges benefit from prolonged solar exposure. Given that Shackleton Crater is approximately 21 km in diameter and up to 4 km deep, the elevation gradients pose significant challenges for energy transmission, necessitating adaptive WPT strategies. The sloped terrain further complicates rover mobility and infrastructure deployment, requiring specialized path-planning algorithms to ensure safe navigation between high-exposure zones and shadowed regions where ice deposits are likely to exist.

Figure 2 illustrates the real-time beamforming strategy of a WPT system operating in a 10 × 10 km lunar zone, showing the power distribution from a centralized WPT transmitter to 15 receivers, including rovers, ISRU processing units, and habitat modules. The figure highlights the spatial relationships, alignment efficiency, and adaptive tracking capabilities of the WPT system, which dynamically directs energy beams based on receiver movement, energy demand, and terrain constraints. It also provides insight into the geometric distribution and optimization of energy transfer. The WPT transmitter, positioned at (0, 0, 3) km, enables wide-area coverage to support receivers scattered up to 5 km away. The 15 energy receivers are placed at various elevations, simulating realistic lunar surface irregularities. The beamforming vectors (green arrows) depict real-time adaptive power allocation, with longer arrows representing receivers that require higher-precision targeting due to their distance or movement. The dense clustering of receivers in certain areas, particularly within the 2–3 km radius, suggests regions of high energy demand, likely corresponding to operational hubs where ISRU processing and life support functions are concentrated. The varying beam orientations and distances emphasize the need for continuous power tracking algorithms that maintain alignment and minimize energy transmission losses.

Figure 3 represents the hourly energy consumption patterns of a lunar habitat module over a 30-day mission cycle, showing variations in power demand throughout different times of the day. The color-coded heatmap visually captures high-demand and low-demand periods, where red and yellow shades indicate peak energy usage and blue shades represent lower consumption hours. The habitat requires continuous power supply, making it essential to understand how demand changes over time to optimize WPT scheduling and energy storage management. The demand profile in Figure 3 is generated using a synthetic model that incorporates operational constraints, equipment power ratings, and expected astronaut activity cycles based on lunar habitat studies. The synthetic data are formulated by combining power consumption estimates from past analog habitat experiments, NASA mission reports, and energy modeling frameworks for extraterrestrial environments. The variability in demand accounts for life support operations, research activities, thermal and environmental control, and communication systems, as well as scheduled maintenance periods and low-activity phases, so that the energy trends align with expected mission scenarios. To validate the generalizability of the demand profile, sensitivity analyses were conducted by varying energy consumption levels, operational schedules, habitat occupancy levels, and equipment utilization rates. The results demonstrate that the RL–based WPT scheduling approach remains robust under different energy demand scenarios, confirming the adaptability of the proposed model for lunar habitat power management and mission planning.

The figure reveals consistent high-demand periods between 10:00–14:00 and 19:00–22:00, coinciding with likely mission-critical operations, astronaut activities, or system recalibration processes. Demand fluctuates between 25 and 40 kW, with occasional surges reaching above 45 kW, which could be attributed to life support system adjustments, research activities, or heating requirements in extreme lunar temperatures. The lowest power demand occurs between 02:00 and 07:00, when consumption drops to 15–20 kW, likely reflecting reduced activity phases or energy-saving protocols during lunar nighttime. These fluctuations emphasize the need for adaptive power management strategies that support peak loads while optimizing energy allocation during low-demand hours. These patterns have direct implications for energy scheduling in WPT–based lunar microgrid systems. By analyzing them, mission planners can strategically schedule energy storage recharging cycles, prioritizing battery replenishment during low-consumption hours and allocating more power to the habitat when demand spikes. The variability also suggests that energy forecasting models should incorporate machine learning–based predictions, enabling real-time power adjustment based on expected fluctuations. In addition, the presence of sustained peak demand zones indicates that static power delivery methods would be inefficient, reinforcing the importance of intelligent, demand-driven WPT solutions to ensure mission resilience.
Figure 4 demonstrates how power transmission efficiency changes as a function of distance from the WPT transmitter to various receivers on the lunar surface. The efficiency curve follows an exponential decay trend, with transmission effectiveness dropping rapidly as the distance increases, reflecting the fundamental beam divergence and energy dispersion constraints in long-range wireless energy transfer. The 5 km efficiency threshold observed in Figure 4 highlights a key limitation of microwave-based WPT, where beam spreading causes substantial power losses at longer distances. As an alternative, laser-based WPT has been proposed for long-range energy delivery, as its highly collimated beams minimize divergence, maintaining power transfer efficiency beyond 5 km. However, laser transmission suffers from significant energy conversion losses due to photon–electron conversion inefficiencies at the receiver and is susceptible to dust accumulation, which can degrade optical components over time. A direct comparison between microwave and laser WPT technologies suggests that microwaves are more reliable for midrange applications, particularly for rover charging and habitat power delivery, whereas laser WPT could be more effective for deep-space assets or remote lunar infrastructure beyond 5 km. Future research should investigate hybrid WPT architectures, where microwave and laser transmission are combined to optimize efficiency and reliability across varying distance ranges. This efficiency-distance relationship introduces a key trade-off between power transmission efficiency and latency, which can be effectively analyzed using a Pareto frontier approach. By selecting different operating points along this frontier, system designers must balance transmission efficiency against response latency. 
A high-efficiency operating point requires stricter beam alignment and longer recalibration intervals, resulting in increased latency as the system continuously optimizes energy delivery. Conversely, prioritizing lower latency may lead to greater beam misalignment and transmission losses, reducing overall power efficiency. These trade-offs directly impact energy resilience and beam alignment precision. A high-efficiency, high-latency strategy ensures a stable energy supply by maintaining precise beam control and reducing energy fluctuations, though it may be less responsive to sudden receiver mobility. In contrast, a low-latency, lower-efficiency approach allows for faster adjustments, improving responsiveness to dynamic conditions but potentially increasing power losses. The results in Figure 5 highlight the need for an adaptive, RL–based optimization strategy to dynamically balance these objectives, ensuring efficient and resilient energy transmission in varying lunar operational scenarios. One of the most critical observations from the figure is the sharp efficiency reduction past the 5 km threshold, where power transfer falls below 30%, making direct WPT impractical without energy redistribution strategies. This means that rovers, ISRU units, and habitats must remain within a 3–5 km operational radius from the main WPT transmitter to ensure stable and efficient power reception. Beyond this range, beam tracking precision must be enhanced, or alternative WPT transmission methods (such as phased-array relays or laser-based transmission) should be integrated to compensate for losses. The efficiency drop also implies that energy-hungry receivers (such as ISRU units processing oxygen extraction) should ideally be positioned closer to the main transmitter, while rovers with lower power needs can explore farther regions without excessive efficiency loss.
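The exponential decay trend can be written as η(d) = η₀·exp(−k·d). The sketch below calibrates k from two values quoted in the text: 85% efficiency under optimal alignment (taken here as the zero-distance value) and roughly 30% at the 5 km threshold. Both calibration points and the single-exponential form are simplifying assumptions and are not the curve fitted in Figure 4.

```python
import math

ETA_0 = 0.85          # efficiency under optimal alignment, assumed at d = 0
ETA_AT_5KM = 0.30     # approximate efficiency at the 5 km threshold
DECAY_PER_KM = math.log(ETA_0 / ETA_AT_5KM) / 5.0


def efficiency(distance_km):
    """Transmission efficiency under a single-exponential decay model."""
    return ETA_0 * math.exp(-DECAY_PER_KM * distance_km)


def received_kw(tx_kw, distance_km):
    """Power arriving at a receiver for a given transmitted power."""
    return tx_kw * efficiency(distance_km)


eta_3km = efficiency(3.0)
eta_5km = efficiency(5.0)
```

With this calibration the decay constant is about 0.21 per km, so a load placed at 3 km receives noticeably more of the transmitted power than one at 5 km, in line with the siting recommendation above.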


Figure 5 presents a three-dimensional trajectory map of five lunar rovers, showing their movement paths and real-time WPT beam tracking updates over a simulated mission period. The red marker at (0, 0, 3) km represents the main WPT transmitter, while the colored lines trace the mobility paths of rovers as they navigate across the lunar terrain. The green arrows indicate the final beam alignment state, demonstrating how the WPT system adjusts its transmission angles dynamically to maintain efficient power delivery to moving targets. One of the key insights from the figure is that rover movement patterns are highly irregular, requiring continuous adjustments in beam direction and power intensity to ensure reliable energy transfer. Some rovers travel beyond the 2 km mark, which aligns with previous findings that power efficiency significantly decreases at this range. This means that real-time beam tracking must occur at high frequency (every 5–10 s) to avoid energy deficits for fast-moving rovers. The presence of clustered rover paths suggests that certain mission areas (such as excavation sites or scientific research zones) experience concentrated energy demand, necessitating dynamic priority-based power scheduling strategies.
Another critical takeaway from this visualization is the impact of elevation differences on beam alignment efficiency. Some rovers are positioned at lower altitudes, requiring steeper beam angles, which could introduce line-of-sight obstructions due to terrain features. This issue highlights the necessity of WPT relay stations positioned at higher elevations to ensure an uninterrupted energy supply. In addition, the nonuniform distribution of rover paths suggests that a fixed power allocation strategy would be suboptimal, reinforcing the importance of machine learning–based predictive energy management, where the system anticipates rover movement trends and proactively adjusts power delivery.
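The geometry behind the steeper beam angles can be made concrete by computing the pointing angles from the transmitter at (0, 0, 3) km to a rover position. This is a purely geometric sketch with a hypothetical function name; it ignores terrain occlusion and the phased-array details of the actual beamformer.

```python
import math


def beam_angles(rx_km, tx_km=(0.0, 0.0, 3.0)):
    """Return (azimuth_deg, elevation_deg, slant_range_km) from transmitter to receiver."""
    dx, dy, dz = (r - t for r, t in zip(rx_km, tx_km))
    ground = math.hypot(dx, dy)
    slant = math.sqrt(ground ** 2 + dz ** 2)
    azimuth = math.degrees(math.atan2(dy, dx))
    elevation = math.degrees(math.atan2(dz, ground))  # negative means pointing down
    return azimuth, elevation, slant


# A rover on the surface 3 km east of the transmitter mast position.
az, el, rng = beam_angles((3.0, 0.0, 0.0))
```

Rovers closer to the transmitter or at lower altitude give a more negative elevation angle, which is where line-of-sight checks against the terrain model become important.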
Figure 6 presents the distribution of energy transmission delays (latency) for receivers positioned at varying distances from the WPT transmitter, illustrating how distance affects the time required for power delivery. The boxplots represent energy latency measurements in milliseconds (ms) for receivers at 1, 3, 5, 7, and 10 km, showing the median latency, interquartile range (IQR), and presence of outliers. The shaded regions around the boxplots represent uncertainty bounds caused by transmission losses, environmental fluctuations, and stochastic variations in WPT efficiency. These variations stem from multiple factors, including beam misalignment due to dynamic receiver mobility, terrain-induced signal degradation, regolith dust accumulation, and temperature fluctuations affecting power transmission efficiency. The widening of the shaded regions at increasing distances suggests that latency uncertainty grows as power transmission spans longer distances. At short distances (1–3 km), the uncertainty remains relatively low, indicating stable power delivery with minimal disruption. However, beyond 5 km, the uncertainty increases significantly due to factors such as greater beam divergence, higher transmission losses, and a higher probability of environmental interference. At 10 km, the uncertainty bounds widen substantially, indicating that power delivery is no longer instantaneous, and adaptive energy prescheduling strategies become critical for mitigating energy shortages. To enhance system robustness, the DRL–based scheduling model continuously learns and adapts to these variations by dynamically adjusting power transmission parameters in real time. By incorporating uncertainty estimation into the reward function, the model proactively mitigates latency fluctuations, ensuring reliable power transmission even under challenging lunar conditions. 
These insights reinforce the necessity of RL–driven WPT scheduling strategies to dynamically optimize power allocation while accounting for transmission uncertainty. As distance increases, transmission delay becomes more pronounced, highlighting the need for latency-aware WPT scheduling in lunar energy networks. The figure reveals a clear upward trend in latency as receiver distance increases. At 1 km, the median energy latency is approximately 5 ms, and most values remain within a narrow band, indicating that near-field WPT transmission is highly reliable and exhibits minimal variation. At 3 km, median latency rises to 15 ms, though the variance remains relatively low, showing that power transmission is still stable in midrange distances. However, at 5 km, median latency reaches 30 ms, and variability begins to widen, suggesting that interference factors such as terrain-induced signal degradation and beam divergence start impacting efficiency. At 7 km, latency increases to around 50 ms, with values occasionally exceeding 60 ms, indicating that real-time power adjustments become critical for maintaining energy stability. Finally, at 10 km, latency escalates significantly to a median of 75 ms, with extreme cases reaching above 85 ms, meaning that power delivery is no longer instantaneous, and adaptive energy prescheduling becomes essential to prevent supply shortages. The insights from this figure highlight several optimization strategies for lunar WPT networks. First, mission-critical receivers such as habitat modules and ISRU units should be positioned within a 3–5 km radius of the primary WPT transmitter to ensure stable and low-latency power reception. Second, for rovers operating beyond 5 km, predictive energy dispatching is required, where power is transmitted in advance to compensate for delay-induced shortages. 
Third, the increasing variance in latency at 7 and 10 km suggests that relay-based WPT stations should be deployed at intermediate distances, ensuring that energy transmission remains efficient even at extended ranges. The findings from this figure support the necessity of dynamic, RL–based WPT scheduling algorithms, ensuring that power allocation decisions proactively account for latency constraints in long-range lunar operations.
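For predictive dispatching, the median latencies reported above (5, 15, 30, 50, and 75 ms at 1, 3, 5, 7, and 10 km) can serve as a lookup table. The sketch below interpolates linearly between them; linear interpolation and clamping outside the measured range are assumptions, and the function names are hypothetical.

```python
MEDIAN_LATENCY_MS = [(1.0, 5.0), (3.0, 15.0), (5.0, 30.0), (7.0, 50.0), (10.0, 75.0)]


def median_latency_ms(distance_km):
    """Piecewise-linear interpolation of median latency, clamped at the ends."""
    pts = MEDIAN_LATENCY_MS
    if distance_km <= pts[0][0]:
        return pts[0][1]
    if distance_km >= pts[-1][0]:
        return pts[-1][1]
    for (d0, l0), (d1, l1) in zip(pts, pts[1:]):
        if d0 <= distance_km <= d1:
            return l0 + (l1 - l0) * (distance_km - d0) / (d1 - d0)


def dispatch_time_ms(needed_at_ms, distance_km):
    """Send early by the expected latency so power arrives when needed."""
    return needed_at_ms - median_latency_ms(distance_km)


lat_4km = median_latency_ms(4.0)
send_at = dispatch_time_ms(1000.0, 8.5)
```

A more cautious scheduler would use an upper quantile of the latency distribution rather than the median, given the widening spread at 7 and 10 km.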

Figure 7 presents a 3D surface plot illustrating how WPT beam steering efficiency changes as a function of rover speed (m/s) and response delay (ms). The color bar represents efficiency percentage, with higher values in green and lower values in dark blue, demonstrating how mobility and slow beam realignment impact energy reception. Beyond rover speed and response delay, terrain conditions significantly influence beam steering efficiency, particularly in regions with crater slopes, regolith interference, and varying elevation gradients. Rough terrain introduces additional misalignment challenges, requiring more frequent realignment to maintain stable energy reception. For example, rover navigation across uneven crater slopes alters beam orientation dynamically, causing greater beam divergence and higher alignment errors, which reduce transmission efficiency. Similarly, regolith interference, caused by fine dust particles accumulating on receiver surfaces, attenuates received power, further decreasing overall beam efficiency. The results in Figure 7 reveal that these terrain-induced disruptions exacerbate efficiency losses as rover speed increases. At low speeds (0.2–0.5 m/s), beam tracking remains relatively stable, even in challenging terrain, as the system has sufficient time to compensate for minor misalignments. However, at speeds above 1.5 m/s, beam steering efficiency drops sharply, especially when traversing regions with high slopes or regolith disturbances, requiring rapid adjustments to avoid significant energy loss. To mitigate these effects, the DRL–based WPT scheduling model dynamically adapts beam realignment frequency based on real-time terrain sensing data. The model prioritizes faster realignment in high-slope regions and regolith-dense areas, ensuring that beam targeting remains accurate even under fluctuating terrain conditions.
This adaptability is crucial for sustaining uninterrupted power transmission in long-range lunar operations where rover mobility patterns intersect with varying environmental constraints. The goal of this figure is to quantify the efficiency loss due to motion-induced beam misalignment, helping to establish real-time beamforming strategies for lunar operations. The figure reveals a clear negative correlation between rover speed and beam steering efficiency. At low speeds (0.2–0.5 m/s), efficiency remains above 90%, meaning that the beam can maintain precise alignment, ensuring reliable energy reception. As speed increases, efficiency declines progressively, reaching around 75% at 1.5 m/s and falling below 50% at speeds beyond 2.5 m/s. This behavior reflects the difficulty of dynamically adjusting power beams for fast-moving receivers, as higher speeds lead to larger positional changes between realignment intervals. This trend is further influenced by response delays, where even at moderate speeds, a delay of 100 ms can reduce efficiency by nearly 20%, making fast-response beamforming adjustments essential. The data also highlight the compounding impact of slow response times on energy reception. At low delays (below 50 ms), beam efficiency remains relatively stable, with only minor degradation across different speeds. However, when response time increases beyond 100 ms, efficiency drops sharply, especially for rovers moving at 1.5 m/s or faster. At 200 ms delay and 3.0 m/s speed, efficiency falls below 30%, indicating that high-speed rovers relying on slow beam adjustments will experience frequent power deficits. These findings strongly suggest that beam realignment updates must occur at sub-50 ms intervals for fast-moving receivers, ensuring that wireless power remains continuously available, even under high-speed mobility scenarios. 
This analysis underscores the necessity of AI–driven predictive beam steering models, RL–based optimization, and real-time trajectory forecasting to enhance WPT performance in lunar exploration missions.
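The link between speed, response delay, and misalignment follows from simple kinematics: between two realignments a rover moving at speed v drifts by v·τ. The sketch below computes that drift and the longest update interval that keeps it within a pointing tolerance. The 0.15 m tolerance is a hypothetical value chosen for illustration, not a figure from the paper.

```python
def drift_m(speed_m_s, delay_ms):
    """Distance a receiver moves during one realignment delay."""
    return speed_m_s * delay_ms / 1000.0


def max_update_interval_ms(speed_m_s, tolerance_m):
    """Longest realignment interval that keeps drift within the tolerance."""
    return tolerance_m / speed_m_s * 1000.0


# Worst case discussed in the text: 3.0 m/s rover with a 200 ms response delay.
worst_case_drift = drift_m(3.0, 200.0)

# Hypothetical 0.15 m pointing tolerance for the same rover.
interval_fast_rover = max_update_interval_ms(3.0, 0.15)
interval_slow_rover = max_update_interval_ms(0.5, 0.15)
```

Under this assumed tolerance a 3.0 m/s rover needs updates roughly every 50 ms, while a 0.5 m/s rover tolerates about 300 ms, which illustrates why the update rate should scale with receiver speed.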

6. Conclusion
This study introduces a DRL–based adaptive WPT framework for lunar MEVPPs, addressing critical challenges in real-time beam steering and power allocation. By formulating the WPT scheduling problem as an MDP and utilizing PPO, the proposed approach dynamically adjusts energy transmission based on rover mobility, environmental conditions, and mission-critical demands. A 30-day simulation near Shackleton Crater demonstrates significant performance improvements, including a 54.6% reduction in energy downtime, a 41.3% enhancement in beam alignment efficiency, and a 39.8% decrease in latency-induced power deficits compared to conventional WPT methods. These findings highlight the necessity of real-time predictive beamforming, latency-aware power scheduling, and multiagent energy optimization for future lunar energy networks. Future work will explore hybrid energy storage integration, quantum-inspired optimization for real-time decision-making, and predictive beamforming algorithms to further enhance system resilience and efficiency. In addition, the integration of a Pareto frontier-based multiobjective optimization framework will be investigated to refine the trade-offs between power transmission efficiency and latency. By incorporating RL with adaptive tuning mechanisms, future studies aim to develop dynamic scheduling strategies that optimize energy resilience and beam alignment precision under varying lunar operational conditions. The role of uncertainty quantification in DRL–based WPT scheduling will also be investigated further. By incorporating probabilistic modeling techniques and adaptive uncertainty estimation, future studies aim to refine the model’s ability to predict and mitigate variations in power transmission efficiency, ensuring more robust and resilient energy delivery under uncertain lunar conditions.
Supercapacitors, with their high-power density and rapid charge–discharge capabilities, can effectively complement batteries by mitigating transient energy deficits and stabilizing power fluctuations caused by varying WPT efficiency. By dynamically allocating power between supercapacitors and batteries based on real-time demand, the system can optimize energy buffering, reduce response latency, and improve overall power reliability for mission-critical lunar operations. This hybrid approach will be incorporated into the RL framework, allowing the model to adaptively manage energy storage resources for enhanced resilience in dynamic extraterrestrial environments. Quantum-inspired optimization techniques, such as quantum annealing and variational quantum algorithms, have the potential to significantly enhance RL–based WPT scheduling by accelerating decision-making processes and improving adaptation to nonstationary energy demands. Unlike classical optimization approaches, which may struggle with high-dimensional and dynamic environments, quantum-inspired techniques can rapidly explore multiple energy allocation scenarios in parallel, leading to faster convergence of RL policies. Moreover, quantum-enhanced RL can provide a more efficient representation of energy demand fluctuations, enabling the system to better anticipate variations caused by solar availability shifts, mobility-induced transmission losses, and unpredictable environmental disruptions. By integrating quantum-inspired solvers, the proposed framework could achieve real-time power allocation optimizations with lower computational overhead, making it highly scalable for future extraterrestrial energy systems. These advancements will be explored in future studies to further improve the adaptability and efficiency of WPT scheduling in lunar missions.
By advancing AI–driven adaptive WPT, this research paves the way for scalable, self-optimizing power grids, ensuring reliable energy distribution for long-term lunar missions and extraterrestrial infrastructure.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding
The authors would like to acknowledge the support provided by the Ongoing Research Funding Program (ORF-2025-635), King Saud University, Riyadh, Saudi Arabia.
Open Research
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.