A machine learning potential construction based on radial distribution function sampling
Abstract
Sampling reference data is crucial in machine learning potential (MLP) construction. Inadequate coverage of local configurations in reference data may lead to unphysical behaviors in MLP-based molecular dynamics (MLP-MD) simulations. To address this problem, this study proposes a new on-the-fly reference data sampling method called radial distribution function (RDF)-based data sampling for MLP construction. This method detects and extracts anomalous structures from the trajectories of MLP-MD simulations by focusing on the shapes of RDFs. The detected structures are added to the reference data to improve the accuracy of the MLP. This method allows us to realize a reasonable MLP construction for liquid water with minimal additional data. We prepare data from an H2O molecular cluster system and verify whether the constructed MLPs are practical for bulk water systems. MLP-MD simulations without RDF-based data sampling show unphysical behaviors, such as atomic collisions. In contrast, after applying this method, we obtain MLP-MD trajectories with features, such as RDF shapes and angle distributions, that are comparable to those of ab initio MD simulations. Our simulation results demonstrate that the RDF-based data sampling approach is useful for constructing MLPs that are robust to extrapolations from molecular cluster systems to bulk systems without any specialized know-how.
1 INTRODUCTION
Molecular dynamics (MD) simulation is a computational method for investigating the dynamic behaviors and interactions of atoms or molecules. It is applied in various fields, including life, material, astrochemical, and geological sciences.1-7 In classical MD (CMD) simulations, energies and forces are obtained from an empirical force field. Various force fields have been developed8-14 and applied in MD simulations. However, nearly all empirical force fields have been constructed by optimizing many parameters to reproduce the experimental results. Therefore, in conditions far from the environments that the empirical force fields can describe, CMD simulations often fail to reproduce the correct physical and chemical properties.15, 16 In addition, because the empirical force fields cannot incorporate quantum mechanical effects, CMD simulations cannot describe chemical reactions involving bond formation or breaking.17
Ab initio MD (AIMD) simulations contain few empirical parameters and provide more realistic atomic descriptions than CMD simulations. The systems evolve in time using energies and forces obtained from on-the-fly single-point ab initio calculations based on the quantum mechanical equations. Therefore, AIMD has been used to reveal the microscale properties of a wide variety of materials, including water, silicates, and lithium-ion batteries.18-20 However, because solving the quantum mechanical equations is computationally expensive, AIMD simulations are difficult to perform over large space and time scales.
Machine learning potentials (MLPs) have recently been developed to enable AIMD-quality simulations over large space and time scales.21, 22 An MLP is a machine learning model that learns the potential energy surface (PES) of a system from reference data containing atomic coordinates, energies, and forces; the local atomic environments of the reference structures are converted to descriptors that serve as inputs for neural networks. Neural network models for MLPs, such as the Behler–Parrinello high-dimensional neural network23 and deep potentials,24 achieve invariance under translation, rotation, and permutation of atoms, as well as size scalability.21, 22, 25-27 The accuracy of the energies and forces predicted by an adequately constructed MLP is equivalent to that of the ab initio level of the reference data. Furthermore, once an MLP is constructed, the energies can be obtained by simply supplying the coordinates, as in classical force fields. Therefore, MD simulations can be performed for large-scale systems at the ab initio computational level using MLPs.
In MLP construction, reference data is usually prepared from first-principles or quantum chemical (QC) calculations in periodic boundary or isolated cluster models. Calculations in periodic boundary models are typically used for bulk systems, such as liquids, crystals, and amorphous materials,25, 28-31 whereas calculations in isolated cluster models are used for smaller systems, such as small molecular clusters or chemical reactions.32-35 Recently, attempts to simulate large systems using MLPs constructed from QC calculations in isolated cluster models have become increasingly popular. Two primary approaches have been developed. The first reduces the computational cost of the QC calculations for the entire system. Liu et al. developed an electrostatically embedded generalized molecular fractionation method, a fragment-based QC approach, for large ion–water systems.36, 37 Other fragment-based or divide-and-conquer QC methods are also useful for preparing data for extensive systems at low computational cost.38 The second approach constructs MLPs from small molecular cluster data and applies them to large systems. The only successful extrapolation from water clusters to bulk water was performed by Zaverkin et al. using their original Gaussian moment neural network model.39-41 Therefore, optimal MLP construction methods for large systems based on QC calculations are still in the exploratory phase.
The sampling of reference data is an essential issue in MLP construction. Reference data is usually sampled using MD or Monte Carlo simulations.42, 43 Enhanced-sampling methods are also utilized to sample rare events.44-47 However, there are some cases where the reference data does not adequately cover the entire space of the local configurations, which could lead to unphysical behaviors in MLP-based MD (MLP-MD) simulations. Several data sampling methods have been proposed to improve the original dataset. For example, Nagai et al. proposed a self-learning hybrid Monte Carlo approach, which is a type of hybrid Monte Carlo simulation for materials science, combined with an MLP.48 Zhang et al. proposed a procedure for MLP construction through exploration, labeling, and training.49, 50 Thus, efficient data sampling methods are required to compensate for the lack of reference data and improve the quality and accuracy of MLPs.
In this study, we propose an on-the-fly reference data sampling method using a radial distribution function (RDF) called RDF-based data sampling. We focused on the RDF shape obtained by MLP-MD simulations and demonstrated that it is possible to modify the anomalous RDF with a few additional data points. In RDF-based data sampling, the RDF of the MLP-MD trajectory was compared with a reference RDF to verify whether it was an anomalous structure. The detected anomalous structures were then added to the reference data as a kind of counterexample to facilitate the MLPs in generating an accurate PES. In addition, we applied this sampling method to a liquid–water system.
Here, we constructed MLPs using the reference data of H2O molecular clusters obtained by QC calculations. Then, we verified whether the MLPs could extrapolate to bulk water systems across the size-scale gap between the isolated and periodic boundary systems. The MLPs constructed without our method exhibited unphysical behaviors in MLP-MD simulations, such as atomic collisions. In contrast, our method provided a significant improvement in MLP accuracy. We also confirmed that the resulting MLP is robust to increases in system size.
2 METHODOLOGY
2.1 Radial distribution function-based data sampling
Radial distribution function-based data sampling was performed to detect and extract anomalous structures from MLP-MD trajectories. Then, we retrained the MLP on the reference data augmented with these anomalous structures to improve its accuracy. Figure 1 shows a simplified computational flow of RDF-based data sampling. In this section, we briefly describe this approach. The detailed computational flow is provided in Supplementary Material.

First, we prepared an initial reference dataset consisting of a set of coordinates and corresponding energies and forces. An MLP was constructed by training a neural network model with reference data. Then, the MLP was used to perform short MLP-MD simulations, and the RDFs for each trajectory were computed. Note that this RDF is the distribution of the density of a particle–particle distance in a system at a given time, which does not include the time average. In this study, we calculated the RDFs of oxygen–oxygen (OO), oxygen–hydrogen (OH), and hydrogen–hydrogen (HH) atoms in water.
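As an illustration of the instantaneous (non-time-averaged) RDF described above, the following minimal Python sketch computes g(r) for a single snapshot in a cubic periodic cell. The function name `frame_rdf`, its arguments, and the binning are assumptions made for this example and are not taken from the original workflow; the sketch also assumes that `r_max` does not exceed half the box length.

```python
import numpy as np

def frame_rdf(pos_a, pos_b, box_length, r_max, n_bins, same_species=False):
    """Instantaneous g(r) between species A and B for one snapshot (cubic periodic box)."""
    # Pairwise displacement vectors with the minimum-image convention.
    delta = pos_a[:, None, :] - pos_b[None, :, :]
    delta -= box_length * np.round(delta / box_length)
    dist = np.linalg.norm(delta, axis=-1)
    if same_species:
        # Remove the self-pairs on the diagonal (distance of an atom to itself).
        dist = dist[~np.eye(len(pos_a), dtype=bool)]
    else:
        dist = dist.ravel()
    counts, edges = np.histogram(dist, bins=n_bins, range=(0.0, r_max))
    r = 0.5 * (edges[1:] + edges[:-1])
    shell_volume = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    # Number density of B partners seen by each A atom.
    n_partners = len(pos_b) - 1 if same_species else len(pos_b)
    density_b = n_partners / box_length ** 3
    # Normalize by the pair count expected for an ideal gas in each shell.
    g = counts / (len(pos_a) * density_b * shell_volume)
    return r, g
```

For water, the OO and HH RDFs would be obtained with `same_species=True` (passing the same coordinate array twice), and the OH RDF with the oxygen and hydrogen coordinates as the two inputs.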
Next, we detected anomalous structures using RDFs. We focused on two regions (A and B) in the RDFs (Figure 2). Region A corresponds to interatomic distances shorter than the onset of the first peak in the correct RDF. The RDF values in region A must therefore be zero, because such interatomic distances are too short to be observed in proper structures. If we detect structures with non-zero RDF values in region A in the MLP-MD simulations, we can assume that the MLP has not correctly learned these structures. Such anomalous structures were then added to the reference dataset as new data. The lower limit of region A was set to approximately the diatomic intramolecular distance. Interatomic distances below this lower limit were too short for the QC calculations to converge. In addition, because structures with anomalies in this region have high energies, sampling those structures is not expected to improve MLP accuracy. Thus, we excluded this region from the detection of anomalous structures. The exact criteria are provided in Supplementary Material.
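The region A criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `has_region_a_anomaly`, the tolerance argument, and the bounds used in the usage example are hypothetical placeholders, since the exact criteria are given only in Supplementary Material.

```python
import numpy as np

def has_region_a_anomaly(r, g, r_lower, r_upper, tol=0.0):
    """Return True if g(r) is non-zero anywhere inside region A = [r_lower, r_upper).

    r_lower and r_upper are hypothetical region A bounds; distances below
    r_lower are ignored, mirroring the exclusion described in the text.
    """
    in_region = (r >= r_lower) & (r < r_upper)
    return bool(np.any(g[in_region] > tol))

# Usage with synthetic data (placeholder bounds, not the values used in the study).
r = np.linspace(0.05, 4.95, 50)
g = np.zeros_like(r)
g[30] = 2.0                      # weight at r = 3.05, outside region A
flag_before = has_region_a_anomaly(r, g, r_lower=1.0, r_upper=2.4)
g[20] = 0.3                      # weight at r = 2.05, inside region A
flag_after = has_region_a_anomaly(r, g, r_lower=1.0, r_upper=2.4)
```

A snapshot for which the check returns `True` for any of the OO, OH, or HH RDFs would be a candidate for addition to the reference dataset.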

Any RDF can serve as a reference RDF (ref-RDF) provided that it captures the features of the target system. In this study, we chose the TIP4P force field to generate the ref-RDFs; this force field differs from the one used to generate the training data structures. The ref-RDFs were prepared from CMD simulations in the NVT ensemble, with the Nose–Hoover thermostat51, 52 for temperature control, using the LAMMPS software.53 Note that other ref-RDFs can also be used, such as those obtained from experiments or other computational methods. Because the TIP4P model54 is a rigid-body model, the intramolecular OH and HH distances are constant, and sharp peaks therefore appear in the RDFs. Owing to their large RDF values, these peaks are presumed to inflate the deviation measures evaluated in region B. To address this issue, we excluded these peak regions from the comparison. The series of processes from MLP construction to anomalous structure sampling was repeated three times to improve the MLP. More detailed information on the boundaries of regions A and B and on the sampling thresholds is provided in Supplementary Material.
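The repeated cycle of MLP construction, short MLP-MD simulation, anomaly detection, and relabeling can be summarized by the following schematic Python sketch. All names (`rdf_sampling_cycles`, `train`, `run_md`, `is_anomalous`, `label`) are hypothetical placeholders for the corresponding steps; the final retraining after the last cycle is an assumption of this sketch rather than a detail stated in the text.

```python
def rdf_sampling_cycles(dataset, train, run_md, is_anomalous, label, n_cycles=3):
    """Schematic RDF-based sampling loop.

    train(dataset)      -> model          (hypothetical: MLP training)
    run_md(model)       -> list of frames (hypothetical: short MLP-MD snapshots)
    is_anomalous(frame) -> bool           (hypothetical: region A / region B checks)
    label(frame)        -> labeled data   (hypothetical: QC energies and forces)
    """
    history = []
    for _ in range(n_cycles):
        model = train(dataset)
        frames = run_md(model)
        flagged = [frame for frame in frames if is_anomalous(frame)]
        # Add the relabeled anomalous structures to the reference data.
        dataset = dataset + [label(frame) for frame in flagged]
        history.append(len(flagged))
    # Assumption: the MLP is retrained once more on the final dataset.
    final_model = train(dataset)
    return final_model, dataset, history
```

Keeping the steps as injected callables makes it straightforward to swap the ref-RDF source or the QC method without changing the loop itself.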
2.2 Reference dataset
We prepared an initial reference dataset that included the geometries, energies, and forces of water cluster systems with 1, 10, 27, 64, and 100 H2O molecules. A set of structures in the monomeric H2O molecular system was constructed with two hydrogen atoms and one oxygen atom in a random arrangement. The structures of the other cluster systems were extracted from CMD simulations. The simulation cells had the same number of H2O molecules as molecular clusters. After CMD simulations, we extracted the water cluster structures in the unit cell of each trajectory. Then, we calculated the energies and forces of the cluster structures using QC calculations and added them to the reference dataset. This two-step dataset construction strategy was proposed as an efficient method for initial data sampling.42 The CMD simulations were performed using the TIP3P force field under the NVT ensemble. The densities of water were 0.94, 1.00, and 1.13 g/cm3 at temperatures of 300 and 600 K, resulting in a total of six combinations of density and temperature conditions. The number of reference structures obtained is listed in Table 1; they were separated into training and validation data.
Number of H2O molecules | 1 | 10 | 27 | 64 | 100 |
---|---|---|---|---|---|
Total reference data | 99 | 1200 | 3600 | 12,000 | 23,100 |
Training data (a) | 66 | 960 | 2880 | 10,200 | 18,600 |
Validation data (b) | 33 | 240 | 720 | 1800 | 4500 |
- (a) 60%–85% of total data.
- (b) Remainder of total data.
We constructed the initial MLPs using the initial reference dataset. Subsequently, during the RDF-based data sampling cycles described in Section 2.1, we updated only the 100 H2O molecular cluster data, while the datasets for the other cluster sizes remained unchanged.
2.3 Machine learning potential construction
This study used Deep Potential – Smooth Edition (DeepPot-SE)24 of the DeePMD-kit package55 as a neural network model to construct MLPs. In previous studies, DeepPot-SE was used for MLP construction in bulk water.30, 37, 42, 56 The sizes of the embedding and fitting networks of DeepPot-SE were set to (25, 50, 100) and (320, 160, 32, 16), respectively, and the number of learning steps was set to 1 × 10^6 per training run. In addition, five MLPs were trained in parallel during each cycle for greater efficiency. Detailed training conditions are provided in Supplementary Material.
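For orientation, the network sizes and step count quoted above could be expressed as a fragment of a DeePMD-kit training input, written here as a Python dictionary. This is a hedged sketch: the key names follow the DeePMD-kit version 2 JSON layout as we understand it and should be checked against the installed version; the `type_map` order and the use of a different random seed for each of the five models are assumptions; required entries such as cutoff radii, neighbor selection, loss, learning rate, and data paths are deliberately omitted because they are given only in Supplementary Material.

```python
import json

def deepmd_input_fragment(seed):
    """Partial DeePMD-kit style input (incomplete on purpose; see lead-in)."""
    return {
        "model": {
            "type_map": ["O", "H"],              # assumed element order
            "descriptor": {
                "type": "se_e2_a",               # DeepPot-SE descriptor
                "neuron": [25, 50, 100],         # embedding network sizes from the text
                "seed": seed,                    # assumption: per-model seed
            },
            "fitting_net": {
                "neuron": [320, 160, 32, 16],    # fitting network sizes from the text
                "seed": seed,
            },
        },
        "training": {
            "numb_steps": 1_000_000,             # 1 x 10^6 learning steps
            "seed": seed,
        },
    }

# Five fragments, one per MLP trained in parallel (seeds are illustrative).
configs = [deepmd_input_fragment(seed) for seed in range(1, 6)]
print(json.dumps(configs[0], indent=2))
```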
2.4 Machine learning potential-based molecular dynamics simulation
We performed MLP-MD simulations using the LAMMPS software53 with the DeepMD plugin.55 Short MLP-MD simulations in RDF-based data sampling cycles were performed over a period of 2 ps, extracting trajectory snapshots every 0.2 ps under the NVT ensemble using the Nose–Hoover thermostat with three densities (0.94, 1.00, and 1.13 g/cm3) at 300 K.
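The cubic cell size implied by each density follows from simple arithmetic, sketched below in Python. The function name `cubic_box_length` and the assumption of a cubic cell are ours; the molar mass of water (18.015 g/mol) and the Avogadro constant are standard values.

```python
AVOGADRO = 6.02214076e23   # 1/mol
M_WATER = 18.015           # g/mol
CM3_TO_A3 = 1.0e24         # cubic angstroms per cubic centimeter

def cubic_box_length(n_molecules, density_g_cm3):
    """Edge length (in angstroms) of a cubic cell holding n_molecules of water."""
    mass_g = n_molecules * M_WATER / AVOGADRO
    volume_a3 = mass_g / density_g_cm3 * CM3_TO_A3
    return volume_a3 ** (1.0 / 3.0)

# Cell edges for 100 H2O molecules at the three densities used in this study.
box_lengths = {rho: cubic_box_length(100, rho) for rho in (0.94, 1.00, 1.13)}
```

For 100 H2O molecules at 1.00 g/cm3, this gives an edge length of about 14.4 Å, so RDFs in such a cell are meaningful only up to roughly 7.2 Å under the minimum-image convention.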
2.5 Quantum chemical calculation
Quantum chemical calculations were performed to prepare reference data using Gaussian 16 software.57 To efficiently prepare the reference dataset, we employed the PM6 semi-empirical molecular orbital method58 to calculate the energies and forces of the water molecular clusters.
3 RESULTS AND DISCUSSION
3.1 Conventional construction of machine learning potential
First, we constructed an MLP without using the RDF-based data sampling method. Figure 3 shows parity plots of the PM6 and MLP energies and forces for the reference data of the 100 H2O clusters. The energy and force data span ranges of −253 to −237 eV and −4 to 4 eV/Å, respectively. The energy plot in Figure 3A bulges slightly downward, indicating that the obtained MLP tends to underestimate the energy. The RMSEs of the energies and forces for the validation data were 0.927 meV/atom and 38.4 meV/Å, respectively, which are smaller than the values of 1 meV/atom and 50 meV/Å recommended by Wen et al.59
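The two error measures quoted above can be computed as in the following minimal Python sketch. The function names are hypothetical, and the conventions (energy errors divided by the number of atoms before averaging, force errors averaged over all Cartesian components) are assumptions that may differ in detail from those used to produce the reported values.

```python
import numpy as np

def energy_rmse_per_atom(e_ref, e_pred, n_atoms):
    """RMSE of the total energy per atom (same energy unit as the inputs)."""
    err = (np.asarray(e_pred, dtype=float) - np.asarray(e_ref, dtype=float)) / n_atoms
    return float(np.sqrt(np.mean(err ** 2)))

def force_rmse(f_ref, f_pred):
    """RMSE over all force components of all atoms and structures."""
    diff = np.asarray(f_pred, dtype=float) - np.asarray(f_ref, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def within_recommended_thresholds(e_rmse_ev_per_atom, f_rmse_ev_per_a):
    """Compare with the 1 meV/atom and 50 meV/A values cited in the text."""
    return e_rmse_ev_per_atom < 1.0e-3 and f_rmse_ev_per_a < 50.0e-3
```

With energies in eV and forces in eV/Å, the reported validation errors (0.927 meV/atom and 38.4 meV/Å) would pass this check, which is precisely why low RMSE values alone did not reveal the problems seen in the subsequent MLP-MD simulation.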

We performed an MLP-MD simulation for 50 ps in a unit cell containing 100 H2O molecules. A snapshot of the simulation (Figure 4A) shows the aggregation of oxygen and hydrogen atoms and unnatural chain structures. Figure 4B–D show the RDFs of OO, OH, and HH, respectively. These RDFs differ significantly from the RDFs of bulk water. These results illustrate a case in which MLP-MD simulations show unphysical behaviors even though the prediction accuracy of the MLP appears satisfactory. Although the reference data in this study was prepared by combining CMD simulations under different conditions with QC calculations, it does not cover a sufficient structural space. Therefore, appropriate reference data construction is necessary to perform MLP-MD simulations comparable to AIMD simulations.

3.2 Machine learning potential with radial distribution function-based data sampling
We applied our RDF-based data sampling method to improve the accuracy of the MLP and the MLP-MD simulations. After three sampling cycles, 231 additional structures were obtained. This number is small compared with the original 23,100 structures of the 100 H2O clusters. The numbers of anomalous structures detected in the three cycles were 149, 51, and 31, respectively, so more than half were sampled in the first cycle. In the second and third cycles, anomalies were detected only in region B of the RDF, suggesting that the detected anomalous structures did not include those with extremely short interatomic distances. Moreover, these detected structures exceeded the thresholds only slightly. These threshold exceedances are likely caused by inconsistencies between the RDFs of PM6 and TIP4P. As a result, we obtained at least the minimum set of structures required to improve the MLP accuracy.
Figure 5 shows the parity plots for the energies and forces of PM6 and improved MLP in the system with 100 H2O clusters. The top end of the energy plots expanded from −240 (Figure 3A) to −180 eV (Figure 5A) through sampling. The force range also expanded from an initial range of −4 to 4 eV/Å (Figure 3B) to a final range of −30 to 30 eV/Å (Figure 5B). Additional structures caused these distribution spreads. In particular, the energy distribution indicated that structures could be obtained during the early stages of structural collapse. Strong repulsive interaction data was added to the reference data. The strong repulsion did not seem to have been estimated accurately by the MLP before applying this sampling method (Figure 4). Some water cluster structures added to the reference data are shown in Figure S3 (Supplementary Material). The RMSEs of the energy and force were 1.01 meV/atom and 43.9 meV/Å, respectively. The additional data caused this slight increase in the RMSEs.

We performed an MLP-MD simulation of the bulk water system to verify the appearance of any unphysical behaviors. The MLP-MD simulation was performed for 50 ps in the unit cell with 100 H2O at a density of 1.00 g/cm3 and 300 K. For comparison, we also conducted a semi-empirical MD simulation at the PM6 level using the CP2K software.60 Detailed information on the semi-empirical MD simulation is described in Supplementary Material.
Compared with the RDFs computed with the MLP trained without RDF-based data sampling (Figure 4), the MLP improved by RDF-based data sampling showed stable behaviors that reflected the characteristics of water. Figure 6 shows the RDFs of OO, OH, and HH derived from the MLP-MD and semi-empirical MD simulations. The RDF features, such as the positions and heights of the peaks and valleys in the three RDFs, are consistent between the MLP-MD (blue lines) and semi-empirical MD (red lines) simulations. The shapes of the obtained RDFs are also consistent with those of previous PM6 studies.61, 62

Furthermore, we computed the normalized frequency of the triplet OOO angles that consist of an oxygen atom and two of the four nearest-neighbor oxygen atoms (Figure 7A). The experimental results show that the triplet OOO angles are widely distributed at approximately 100°.63, 64 The results of the MLP-MD and semi-empirical MD simulations shown in Figure 7B show the same features, such as two peaks at approximately 60° and 100° and a decay of the distribution tail.
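The triplet OOO angle defined above can be evaluated for a single snapshot as in the following Python sketch. The function name `triplet_ooo_angles` is hypothetical, a cubic periodic cell is assumed, and at least five oxygen atoms are required so that each atom has four neighbors.

```python
import numpy as np

def triplet_ooo_angles(oxygen_pos, box_length, n_neighbors=4):
    """Angles (degrees) formed at each oxygen by pairs of its nearest oxygen neighbors."""
    # delta[i, j] = r_i - r_j under the minimum-image convention.
    delta = oxygen_pos[:, None, :] - oxygen_pos[None, :, :]
    delta -= box_length * np.round(delta / box_length)
    dist = np.linalg.norm(delta, axis=-1)
    np.fill_diagonal(dist, np.inf)            # an atom is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :n_neighbors]
    angles = []
    for i, neigh in enumerate(nearest):
        for a in range(n_neighbors):
            for b in range(a + 1, n_neighbors):
                v1 = -delta[i, neigh[a]]      # vector from atom i to neighbor a
                v2 = -delta[i, neigh[b]]      # vector from atom i to neighbor b
                cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
                angles.append(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))
    return np.array(angles)
```

Each oxygen contributes six angles. A normalized frequency such as that in Figure 7B could then be obtained with `np.histogram(angles, bins=90, range=(0.0, 180.0), density=True)` accumulated over the trajectory; the bin count here is an arbitrary choice.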

We also computed the mean square displacement (MSD) and self-diffusion coefficient to verify the reproducibility of the dynamics. We performed 50 ps MLP-MD and semi-empirical MD simulations at 300 K. The MSD and the estimated self-diffusion coefficient were computed using the MDAnalysis library.65, 66 Figure 8 shows the calculated MSDs as a function of time. The self-diffusion coefficients for the MLP-MD and semi-empirical MD simulations were 1.43 and 1.41 Å2/ps, respectively. These results show that the MLP-MD simulations reproduce the structural and dynamical characteristics well.
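The relation between the MSD and the self-diffusion coefficient can be illustrated with a plain NumPy sketch based on the three-dimensional Einstein relation, MSD(t) ≈ 6Dt in the linear regime. This is not the MDAnalysis routine used in this study: it uses a single time origin, assumes unwrapped coordinates, and the function names and fit window are hypothetical.

```python
import numpy as np

def mean_square_displacement(unwrapped_pos):
    """MSD relative to the first frame; unwrapped_pos has shape (n_frames, n_atoms, 3)."""
    disp = unwrapped_pos - unwrapped_pos[0]
    return np.mean(np.sum(disp ** 2, axis=-1), axis=1)

def self_diffusion_coefficient(time, msd, fit_start, fit_stop):
    """D from a linear fit of MSD versus time over frames [fit_start, fit_stop)."""
    slope, _intercept = np.polyfit(time[fit_start:fit_stop], msd[fit_start:fit_stop], 1)
    return slope / 6.0   # Einstein relation in three dimensions
```

With positions in Å and time in ps, the result is in Å2/ps. The fit window should be restricted to the part of the MSD curve that is linear in time, excluding the short-time ballistic regime.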

To verify the amount of data required to construct an appropriate MLP, we also constructed an MLP trained with half the amount of the original reference data. We computed the RMSEs of the energy and force of validation data and the RDFs. The results are shown in Supplementary Material. Although the accuracy and reproducibility of the RDFs were slightly lower than those of the original MLP, they were still sufficiently practical, suggesting that our method is effective even when the amount of data is reduced. The tuning of hyperparameters, such as the number of neurons and epoch number, will help further improve the MLP accuracy.
We conclude that the quality of MLPs cannot be accurately measured only by energy and force predictions in some cases, and attention should be paid to the structures that appear in the MLP-MD simulations. Our method overcomes this problem by focusing on anomalous RDF shapes. Although the initial reference data did not cover sufficient structural space, the MLPs improved to the point where they could reproduce semi-empirical MD simulations.
3.3 Verification of cluster size extrapolation
In Sections 3.1 and 3.2, we showed that even when the prediction accuracy for the reference data is sufficient, MLP-MD simulations can exhibit unphysical behaviors owing to an incomplete dataset, and that this problem is resolved by RDF-based data sampling. In this section, we investigate the cluster size scalability of the MLP. Here, we tested whether the MLP could predict the energies and forces of molecular clusters with sizes not included in the reference dataset. We prepared test data for 50 and 200 H2O clusters as cases with clusters smaller and larger than the 100 H2O clusters, respectively. A total of 1500 test data points for both 50 and 200 H2O clusters were newly prepared using CMD simulations to generate cluster structures and QC calculations at the PM6 level to obtain the energies and forces. Figure 9 shows the parity plots of the energies and forces of PM6 and MLP in systems with 50 and 200 H2O clusters. The RMSEs of the energy and force for the 50 H2O clusters were 0.872 meV/atom and 36.5 meV/Å, respectively, and those for the 200 H2O clusters were 0.733 meV/atom and 44.0 meV/Å. All these values were lower than the recommended thresholds, supporting the cluster size scalability of our MLP. These results also indicated that the RMSE of the energy per atom decreased as the cluster size increased. Therefore, we confirmed that our MLP exhibits size scalability for both bulk and cluster systems. In this work, we applied our method only to pure water. When other molecules are added to the aqueous solution or the computational method for the training data is changed, MLPs can be readily reconstructed using transfer learning67 or delta learning methods.68

4 CONCLUSION
The preparation of sufficient reference data is essential for generating MLPs. Insufficient data causes unphysical behavior in MLP-MD simulations. In this study, we proposed a simple on-the-fly data sampling method, namely RDF-based data sampling, to improve the accuracy of the MLP. This method was applied to liquid–water systems. RDF-based data sampling detected anomalous structures generated in MLP-MD simulations by focusing on anomalous RDF shapes and added those structures to the reference dataset. The MLP-MD simulation without RDF-based data sampling showed unphysical behaviors, such as aggregation of atoms and unnatural chain structures, despite the good RMSE values of the predictions for the validation data. However, by applying RDF-based data sampling, the prediction accuracy of the MLP improved significantly with only a small amount of additional data. Our final MLP-MD simulations produced appropriate RDFs, triplet OOO angle distributions, MSDs, and self-diffusion coefficients, which were comparable with the results of semi-empirical MD simulations. These results indicate that RDF-based data sampling helps construct an MLP that is robust to extrapolation from molecular clusters to bulk systems. This simple sampling method can be applied to molecular aggregation systems other than water by appropriately changing specific parameters. Further improvements in MLP accuracy are expected when RDF-based data sampling is combined with different sampling methods or machine learning algorithms.
ACKNOWLEDGMENTS
This study was supported by the Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research [Grant Numbers JP24K17108, JP24KJ0486]; a Grant-in-Aid for Transformative Research Areas “Materials Science of Meso-Hierarchy” [Grant Number JP23H04879]; the Japan Science and Technology Agency program for the establishment of university fellowships for the creation of science technology innovation [Grant Number JPMJFS2106]; the Institute for Quantum Chemical Exploration, 2024 Research Grants-in-Aid; and the Multidisciplinary Cooperative Research Program at the Center for Computational Sciences, University of Tsukuba. Some of the computations were performed using computer facilities at the Research Institute for Information Technology, Kyushu University, and, for General Projects, on the supercomputer “Flow” at the Information Technology Center, Nagoya University.
CONFLICT OF INTEREST STATEMENT
The authors declare no competing financial interests.
Open Research
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in MLP-for-Water at https://github.com/Natsuww/MLP-for-Water.