ORIGINAL ARTICLE

Open Access

Characterizing viral samples using machine learning for Raman and absorption spectroscopy

Miad Boodaghidizaji

orcid.org/0000-0001-8231-0309

School of Mechanical Engineering, Purdue University, West Lafayette, Indiana, USA

Contribution: Conceptualization (equal), Data curation (lead), Formal analysis (equal), Methodology (lead), Software (lead), Visualization (lead), Writing - original draft (lead), Writing - review & editing (lead)

Shreya Milind Athalye

Department of Agricultural and Biological Engineering, Purdue University, West Lafayette, Indiana, USA

Contribution: Conceptualization (equal), Data curation (lead), Formal analysis (equal), Visualization (lead), Writing - original draft (lead), Writing - review & editing (supporting)

Sukirt Thakur

School of Mechanical Engineering, Purdue University, West Lafayette, Indiana, USA

Contribution: Conceptualization (equal), Formal analysis (equal), Investigation (equal), Methodology (lead), Software (supporting), Writing - original draft (supporting)

Ehsan Esmaili

School of Mechanical Engineering, Purdue University, West Lafayette, Indiana, USA

Contribution: Conceptualization (equal), Data curation (lead), Formal analysis (equal), Investigation (equal), Writing - original draft (supporting)

Corresponding Author

Mohit S. Verma

[email protected]

orcid.org/0000-0002-6374-3333

Department of Agricultural and Biological Engineering, Purdue University, West Lafayette, Indiana, USA

Weldon School of Biomedical Engineering, Purdue University, West Lafayette, Indiana, USA

Birck Nanotechnology Center, Purdue University, West Lafayette, Indiana, USA

Correspondence Mohit S. Verma, Department of Agricultural and Biological Engineering, Purdue University, West Lafayette, IN 47907, USA.

Email: [email protected]

Arezoo M. Ardekani, School of Mechanical Engineering, Purdue University, 585 Purdue Mall, West Lafayette, IN 47907, USA.

Email: [email protected]

Contribution: Conceptualization (equal), Funding acquisition (supporting), Project administration (lead), Supervision (lead), Writing - review & editing (lead)

Corresponding Author

Arezoo M. Ardekani

[email protected]

School of Mechanical Engineering, Purdue University, West Lafayette, Indiana, USA

Contribution: Conceptualization (equal), Funding acquisition (lead), Project administration (lead), Supervision (lead), Writing - review & editing (lead)

https://doi.org/10.1002/mbo3.1336

Graphical Abstract

We applied machine learning techniques to Raman and absorption spectra to determine the concentration of samples containing viral particles (measles, mumps, rubella, and varicella-zoster viruses). We proposed two different networks to jointly use the Raman and absorption spectra, where our results demonstrated that concatenating the Raman and absorption data increases the prediction accuracy compared to using either Raman or absorption spectrum alone. Ultimately we were able to make predictions with accuracies as high as 95%.

Abstract

Machine learning methods can be used as robust techniques to provide invaluable information for analyzing biological samples in pharmaceutical industries, such as predicting the concentration of viral particles of interest in biological samples. Here, we utilized both convolutional neural networks (CNNs) and random forests (RFs) to predict the concentration of the samples containing measles, mumps, rubella, and varicella-zoster viruses (ProQuad®) based on Raman and absorption spectroscopy. We prepared Raman and absorption spectra data sets with known concentration values, then used the Raman and absorption signals individually and together to train RFs and CNNs. We demonstrated that both RFs and CNNs can make predictions with R² values as high as 95%. We proposed two different networks to jointly use the Raman and absorption spectra, where our results demonstrated that concatenating the Raman and absorption data increases the prediction accuracy compared to using either Raman or absorption spectrum alone. Additionally, we further verified the advantage of using joint Raman-absorption with principal component analysis. Furthermore, our method can be extended to characterize properties other than concentration, such as the type of viral particles.

1 INTRODUCTION

The recent outbreak of COVID-19 proved the importance of robust antiviral medications to stop the spread of pandemic viral infections. Antiviral drugs and vaccines are the two major solutions to keep viral infections at bay. A recent study suggested that, for example, the COVID-19 vaccine saved approximately 20 million human lives in 1 year (Watson et al., 2022). Measles, mumps, rubella, and varicella (MMRV) are common viral childhood diseases that can have serious complications. Developing efficient methods to mass-produce MMRV vaccines paves the way for limiting the spread of MMRV globally. Vaccine development flourished in the early 20th century, and Maurice Hilleman at Merck & Co Inc., a pioneer in the development of vaccinations, developed Rubeovax™ in 1963, the first commercial live vaccine for measles (Tulchinsky, 2018). Vaccine development and production have been continuously improving in upstream and downstream processing (Blue et al., 2015). Vaccine production involves challenging processes such as viral vector development, effective purification, polishing steps, and formulation with stable storage conditions. These processes require comprehensive and continuous quality management to maintain the product's efficacy and ensure public safety. With the advancement in viral vector-driven gene therapies and vaccine production, there is a growing interest in improving the continuous production of virus-like particle (VLP)-based vaccines (Gutierrez-Granados et al., 2018). The development of continuous manufacturing processes in the vaccine industry demands rapid, robust, and continuous analytical methods (process analytical technology [PAT] tools) to understand manufacturing processes in real time (Maruthamuthu, Rudge, et al., 2020).

Noninvasive in-line sensors such as Raman probes (Raman spectroscopy) hold great potential due to their high sensitivity in reading the molecular fingerprints of chemical and biological molecules, species, or products (Butler et al., 2016; Rolinger et al., 2020). Raman spectra possess clear spectral features that can be easily assigned to different chemical compounds. Additionally, minimal sample preparation is sufficient for making accurate quantitative predictions using Raman spectra (Pian et al., 2022). In other words, Raman spectroscopy provides invaluable information for various analyte molecules even at ultra-low concentrations (Panneerselvam et al., 2022). Similarly, absorption spectroscopy is a robust technique that, owing to its high sensitivity and large signal-to-noise ratio (Torrisi et al., 2020), has the potential to be implemented as a tool for making predictions. Generally, both Raman and absorption spectra have been widely used for particle detection and identification (Barnes et al., 2006; Nitkowski et al., 2008; Pallaoro et al., 2015; Probst et al., 2021) and quantitative analysis (Bao et al., 2018; Storey & Helmy, 2019; Strachan et al., 2007).

Recently, machine learning (ML) has become popular for making predictions based on spectroscopy data. Both supervised and unsupervised ML techniques have been applied to Raman signals to make predictions (Ralbovsky & Lednev, 2020). In particular, Raman spectroscopy has been utilized for cancer predictions (Ralbovsky & Lednev, 2020). For instance, techniques such as principal component analysis or artificial neural networks have been utilized for detecting cervical cancer (Daniel et al., 2018). Furthermore, Raman signals have been utilized for classification problems, such as classifying bacterial (Khan et al., 2018; Koya et al., 2018; Maruthamuthu, Raffiee, et al., 2020; Maruthamuthu, Rudge, et al., 2020), viral (Ditta et al., 2019; Tong et al., 2019), and fungal infections (Dzurendová et al., 2021; Guo et al., 2021). Additionally, Raman spectroscopy has been applied for regression purposes, such as predicting the concentration of markers of interest, for example, sensing pH and lactate in body fluids (Olaetxea et al., 2020). Absorption spectroscopy has also been utilized for classification and quantification purposes, such as the characterization of proteins (Zhang et al., 2021), the classification of wines (Philippidis et al., 2020), and the quantification of organic acid concentrations (Wolf et al., 2013). Furthermore, joint Raman and absorption spectra have been applied to predict concentration values (Isaev et al., 2020).

Previous studies, in particular, have confirmed the capability of ML techniques in making quantitative predictions based on Raman or absorption signals. However, a comparison of these signals and their strength in making accurate ML-based predictions for viral samples, such as MMRV, has not been reported before. Here, we aim to create methods based on Raman and absorption spectroscopy that enable monitoring of the concentration of viral particles in well plates. Additionally, it is not known whether using Raman and absorption spectra simultaneously can boost the prediction accuracy compared to using only Raman or absorption spectra separately. In our previous study, we demonstrated that deep learning enables the efficient detection of bacteria, fungi, and mammalian cells in static dried-down conditions (Maruthamuthu, Raffiee, et al., 2020). Following our previous study, we intend to build convolutional neural network (CNN) and random forest (RF) models that accept the Raman or absorption spectra or their combination as the input and predict the concentration of samples containing MMRV.

2 MATERIALS AND METHODS

2.1 Data acquisition

All samples prepared in this study are based on ProQuad®, a sterile, lyophilized, preservative-free, live virus vaccine that contains measles, mumps, rubella, and varicella-zoster viruses (Kuter et al., 2006). We procured ProQuad® (manufactured by Merck & Co Inc.) from the Purdue College of Pharmacy and stored it at −20°C. We prepared linear dilutions of the ProQuad® vaccine with a step size of 4% and an initial concentration of 7.20 × 10⁵ plaque-forming units/ml (PFU/ml) (lyophilized ProQuad® + 10 µl diluent). Throughout this article, we refer to the number of infective particles within the sample (PFU) as particles.

All the Raman spectra of the ProQuad® dilutions were collected with the Renishaw inVia™ Qontor confocal Raman microscope (Renishaw plc). We used a 785-nm excitation laser with 100% (300 mW) power and 10 s acquisition time (1 accumulation). The spectral resolution of the spectra was 1 cm⁻¹, and the spectrum ranged from 101 to 3200 cm⁻¹ corresponding to 3194 Raman shifts. The samples were focused with a 5× microscope objective (Leica DM2700M), and three replicate Raman spectra were collected for each dilution. The sample volume used for the measurement was 100 µl, and the substrate used for the measurements was a 96-well plate (Corning™ 3635 UV-Transparent Microplates). The experiment was repeated once. The raw Raman spectral data were collected using WiRE 5.5 software.

Furthermore, we collected the absorption spectrum for the ProQuad® dilutions using the BMG LABTECH, Inc. microplate reader (CLARIOstar Plus, SN: 430-2173). The spectrum ranged from 220 to 1000 nm with a spectral resolution of 1 nm, corresponding to 781 wavelengths. The sample volume used for the measurement was 100 µl, and the substrate used for the measurements was a 96-well plate (Corning™ 3635 UV-Transparent Microplates). We collected three spectral scans for each dilution. The experiment was repeated once.

In total, the data set includes Raman and absorption spectra for 25 different concentration values with 3 to 6 replicates for each value, making a total of 116 samples, where 20% of the data is held out for testing in each round of the 5-fold cross-validation described in Section 2.2.

2.2 Machine learning modeling

We adopt two widely used ML techniques to relate the Raman and absorption spectra to the concentration values: the RF and the CNN techniques. Before training, to ensure the reproducibility of the results, all the models are initialized by setting the seed number to zero. To assess the accuracy of predictions, we use the coefficient of determination (R² score). Further, to train the models, the 5-fold cross-validation technique is used for both the CNNs and the RFs. In this method, the whole data set is split into five sections, and the model is trained five times; each time, four sections are used as the training data set and one section as the testing data set. The 5-fold cross-validation procedure ensures that every data point falls into the testing data set exactly once, preventing biased predictions. The scikit-learn (Pedregosa et al., 2011) and PyTorch (Paszke et al., 2019) modules in Python are used for modeling the RFs and CNNs, respectively.
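The splitting and scoring procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the study's code: the data are synthetic, an ordinary least-squares fit stands in for the RF and CNN models, and the helper names `r2_score` and `kfold_indices` are hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, folds[i]

# Synthetic stand-in data (NOT the study's spectra): 116 samples, 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(116, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=116)

fold_scores, tested = [], []
for train_idx, test_idx in kfold_indices(len(y)):
    # Placeholder model: least squares with an intercept, instead of an RF or CNN.
    A_train = np.c_[X[train_idx], np.ones(len(train_idx))]
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    y_pred = np.c_[X[test_idx], np.ones(len(test_idx))] @ coef
    fold_scores.append(r2_score(y[test_idx], y_pred))
    tested.extend(test_idx.tolist())

mean_r2 = float(np.mean(fold_scores))
```

Reporting the per-fold scores alongside their mean, as in the tables of this article, makes it visible when a single fold is much harder than the others.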

A CNN is a supervised machine learning technique that, in our case, takes one-dimensional signals as the input and identifies the important parts of the signal, which paves the way for automatic learning of various features and hidden aspects in the signal that are important for the regression. In other words, a CNN can capture the dependencies between neighboring points in the Raman or absorption spectrum. The general architectures of the deep learning models used in this study are similar, that is, a feed-forward single CNN consisting of four convolutional layers followed by four fully connected layers when either the Raman or the absorption spectrum is used as the input, as shown in Figure 1a. However, when both the Raman and absorption spectra are used as the input, we use two different designs. In one design, we concatenate the Raman and absorption signals and feed them into a single CNN, as shown in Figure 1a. In the other design, a double CNN is created for feeding the inputs, as demonstrated in Figure 1b. In the double CNN, the Raman and absorption spectra are first fed into two separate networks, each with four convolutional layers followed by two fully connected layers. The outputs of the two networks are then concatenated and fed into a network with two fully connected layers. In all models, the architecture used for the convolutional layers is based on residual mapping following the deep residual learning method (He et al., 2016). The presence of residual blocks with shortcut connections between inputs and outputs boosts the training stability and paves the way for deeper networks (He et al., 2016).

Figure 1. Schematic view of the neural network structure when (a) the Raman, absorption, or concatenated Raman-absorption spectrum is used as the input and (b) both Raman and absorption spectra are used as separate inputs. The number of layers shown here is for illustration purposes and does not reflect the actual values.

Furthermore, the kernel size used for all the convolutional layers is three, with zero padding and a stride of one. Additionally, all the networks are trained for 6000 epochs (iterations); a further increase in the number of epochs does not significantly boost the prediction accuracy. We use the mean squared error loss function as the training criterion with back-propagation, and we adopt Adam, a stochastic gradient-based optimizer with momentum and an adaptive learning rate (Kingma & Ba, 2014), where the weight decay and learning rate are set to 0.1 and 10⁻⁸, respectively. Batch normalization and ReLU activation functions are applied consecutively at the end of each convolutional layer, and the ReLU function is applied at the end of each fully connected layer. After passing the last ReLU function, the data is mapped into one neuron as the output. The numbers of channels and neurons are hyperparameters that can be tuned for further accuracy. In the current study, we found that a maximum of 10 channels and 4000 neurons leads to sufficient accuracy while at the same time avoiding over-fitting.
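The single and double CNN designs described above can be sketched in PyTorch as follows. This is an illustrative reading under stated assumptions and not the authors' implementation: the class and function names are hypothetical, "zero padding" is interpreted as padding each side with one zero so that the shortcut can be added without cropping, the hidden width of 64 and the toy input lengths are chosen only to keep the example small, and the last of the four fully connected layers is taken to be the single output neuron.

```python
import torch
from torch import nn

class ResidualBlock1d(nn.Module):
    """Conv1d -> BatchNorm, plus a shortcut, followed by ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Assumption: padding=1 keeps the length fixed so the shortcut can be added.
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm1d(out_ch)
        # A 1x1 convolution matches the channel count on the shortcut when needed.
        self.shortcut = nn.Identity() if in_ch == out_ch else nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)) + self.shortcut(x))

def conv_stack(channels=10):
    """Four residual convolutional layers followed by flattening."""
    widths = [1, channels, channels, channels, channels]
    blocks = [ResidualBlock1d(a, b) for a, b in zip(widths[:-1], widths[1:])]
    return nn.Sequential(*blocks, nn.Flatten())

def dense_stack(sizes):
    """Fully connected layers, each followed by ReLU."""
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers)

class SingleCNN(nn.Module):
    """One network for a Raman, absorption, or concatenated spectrum."""
    def __init__(self, n_points, channels=10, hidden=64):
        super().__init__()
        self.features = conv_stack(channels)
        self.dense = dense_stack([channels * n_points, hidden, hidden, hidden])
        self.out = nn.Linear(hidden, 1)  # single output neuron: concentration

    def forward(self, x):  # x has shape (batch, n_points)
        return self.out(self.dense(self.features(x.unsqueeze(1))))

class DoubleCNN(nn.Module):
    """Separate Raman and absorption branches merged by concatenation."""
    def __init__(self, n_raman, n_abs, channels=10, hidden=64):
        super().__init__()
        self.raman = nn.Sequential(conv_stack(channels), dense_stack([channels * n_raman, hidden, hidden]))
        self.absorption = nn.Sequential(conv_stack(channels), dense_stack([channels * n_abs, hidden, hidden]))
        self.head = nn.Sequential(dense_stack([2 * hidden, hidden]), nn.Linear(hidden, 1))

    def forward(self, raman, absorption):
        a = self.raman(raman.unsqueeze(1))
        b = self.absorption(absorption.unsqueeze(1))
        return self.head(torch.cat([a, b], dim=1))

torch.manual_seed(0)              # seed set to zero for reproducibility
raman = torch.randn(4, 120)       # toy lengths; the study uses 3194 and 781 points
absorption = torch.randn(4, 40)
single = SingleCNN(n_points=160).eval()
double = DoubleCNN(n_raman=120, n_abs=40).eval()
with torch.no_grad():
    out_single = single(torch.cat([raman, absorption], dim=1))
    out_double = double(raman, absorption)
```

Training such a model would pair it with `nn.MSELoss()` and `torch.optim.Adam`, passing the learning rate and weight decay reported in the text.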

RF regression is a supervised machine-learning technique that utilizes the ensemble average of multiple decision trees to make final predictions (Grömping, 2009). Each tree makes its own prediction of the concentration, and the forest output is the average of these predictions. As shown in Figure 2, the Raman, absorption, or their concatenated spectrum is used as the input with the concentration as the output. RF is a powerful regression technique that runs efficiently on large data sets. RFs are generally suitable for making predictions within the training range, since each prediction is an average of training targets. Additionally, we use bootstrapping: multiple training samples are drawn with replacement from the original training set, and a different sample is used to train each decision tree. Bootstrapping reduces the chance of over-fitting and stabilizes the model. The squared error criterion in scikit-learn (Pedregosa et al., 2011) is used to measure the quality of the splits, and the forest consists of 100 trees.

3 RESULTS AND DISCUSSIONS

We use CNN and RF as two powerful ML techniques, with different levels of preprocessing to identify the optimum predictions. Here, we discuss how the algorithms work with the test data generated using 5-fold cross-validation, where each fold can contain points both inside and outside of the training ranges. For CNN, we discuss whether a single or double CNN works better when both Raman and absorption spectra are used as the input.

In this study, CNN models are composed of multiple convolutional layers with a kernel size of three, where, in each layer, hidden features and patterns are learned by convolving along the signal. To expedite the learning process and improve the model performance, it is beneficial to preprocess the data before training the models. Thus, for the Raman spectra, we apply baseline corrections and normalize the data using the standard normal variate method, that is, subtracting the mean value from each spectrum and dividing by its standard deviation, as described by Romero-Torres et al. (2006). Additionally, normalizing the Raman spectrum brings the intensities of the Raman and absorption spectra to approximately the same order of magnitude for further comparison. No baseline correction or normalization is required for the absorption spectrum since the difference between its maximum and minimum values is relatively low. Additionally, normalizing the absorption data led to no significant boost in prediction accuracy. Finally, the Raman and absorption signals are smoothed using the Savitzky-Golay (SG) filter (González-Viveros et al., 2021; Romero-Torres et al., 2006). Figure 3 demonstrates the Raman and absorption spectra before and after preprocessing for two different concentrations. In addition to normalization and applying filters, some studies trim the Raman spectrum to obtain the spectral range of interest (Pian et al., 2022). In the current study, we did not observe any significant gain in the prediction accuracy when the Raman or absorption spectrum is trimmed, as we have shown, for example, for the RF method in the Appendix. Additionally, we analyzed how the predictions change with the subtraction of the control spectrum of the solvent, as described in the Appendix, where we noticed a reduction in the accuracy with the subtraction of the control spectrum. Therefore, we excluded the subtraction of the control spectrum from the preprocessing steps.
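The normalization and smoothing steps can be sketched as follows. This is a hedged example on synthetic spectra: the function names are hypothetical, the SG window length and polynomial order are assumed values since the text does not report them, and the baseline-correction step is omitted because its method is not specified.

```python
import numpy as np
from scipy.signal import savgol_filter

def standard_normal_variate(spectra):
    """Subtract each spectrum's mean and divide by its standard deviation."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def smooth(spectra, window_length=11, polyorder=3):
    """Savitzky-Golay smoothing along the spectral axis (assumed settings)."""
    return savgol_filter(spectra, window_length=window_length, polyorder=polyorder, axis=1)

# Two synthetic spectra (NOT measured data) with different offsets and scales.
rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 500)
clean = np.exp(-0.5 * ((axis - 0.4) / 0.05) ** 2)
raw = np.stack([3.0 * clean + 10.0, 5.0 * clean + 20.0]) + 0.05 * rng.normal(size=(2, 500))

snv = standard_normal_variate(raw)   # each row now has zero mean and unit variance
processed = smooth(snv)              # smoothing is applied after normalization
```

After the standard normal variate step, the two spectra share a common scale despite their different offsets and amplitudes, which is what makes them comparable to spectra from another modality.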

The R² coefficients of the 5-fold predictions for both RF and CNN are listed in Table 1. The average R² score for all the predictions is above 90%. However, the prediction accuracy is higher when the concatenated Raman-absorption spectrum is used for RF and CNN compared to the predictions based on either the Raman or the absorption spectrum alone. Furthermore, the prediction accuracy is slightly higher for RF compared to CNN in the hyperparameter space we studied. However, both RF and CNN lead to predictions with R² values as high as 98% when the joint Raman-absorption data is used. Additionally, we note that the single CNN demonstrates higher prediction accuracy compared to the double CNN, which might be attributed to the low predictability of the absorption spectrum compared to the Raman spectrum, as the prediction accuracy is higher when only Raman data is used compared to when only absorption data is used. Furthermore, we have made a comparison between the RF and support vector machine (SVM) methods in the Appendix, where we note that RF predictions are slightly more accurate than SVM predictions. Additionally, we visually demonstrate how the predictions of CNN and RF vary for the testing data set in one of the folds of the 5-fold cross-validation. As demonstrated in Figure 4, prediction values based on the Raman spectrum are more in line with the actual values than those based on the absorption spectrum, for which the average R² coefficient is lower. This difference can be attributed to the larger size of the Raman signal and, therefore, larger regions of dissimilarity corresponding to different concentrations, which make Raman spectra more distinguishable from each other. Further, the use of joint Raman-absorption spectra boosts the prediction accuracy compared to the case when only Raman spectra are used.

Table 1. The R² values of 5-fold cross-validation for the prediction of concentration for given Raman, absorption, and Raman-absorption concatenated spectra

	CNN				RF		
Fold	Absorption	Raman	Raman-absorption (double CNN)	Concatenated Raman-absorption (single CNN)	Absorption	Raman	Raman-absorption
1	0.975	0.976	0.981	0.992	0.977	0.974	0.988
2	0.948	0.973	0.980	0.974	0.974	0.983	0.995
3	0.942	0.980	0.977	0.992	0.964	0.979	0.991
4	0.835	0.964	0.931	0.946	0.881	0.953	0.981
5	0.952	0.984	0.980	0.991	0.989	0.982	0.990
Ave	0.930	0.975	0.969	0.979	0.955	0.974	0.989

The differences between the prediction accuracy of the Raman and the absorption spectra can further be understood through principal component analysis (PCA). We use PCA to reduce the dimensionality of the Raman, absorption, and concatenated Raman-absorption spectra to four, where the original sizes of the Raman and absorption spectra are 3194 and 781 points, respectively. Figure 5 demonstrates how the principal component (PC) values differ at different concentration values. The distinction between PCA points at different concentrations is more evident for the Raman-absorption spectrum than for the Raman or absorption spectrum alone. Additionally, we notice that, in most cases, the prediction accuracy does not increase after conducting PCA; moreover, for the Raman and Raman-absorption data, the average R² values slightly decrease when we compare Figures 4 and 5d. Therefore, for the current data set, dimensionality reduction does not improve the prediction accuracy.
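The dimensionality reduction on the concatenated spectra can be sketched with a plain NumPy singular value decomposition. This is an assumption-laden illustration: the arrays contain random placeholder values with the shapes stated in the text (116 samples, 3194 Raman points, 781 absorption points), and `pca_scores` is a hypothetical helper, not a function from the study.

```python
import numpy as np

def pca_scores(X, n_components=4):
    """Project mean-centered rows of X onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)   # fraction of variance per component
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
raman = rng.normal(size=(116, 3194))        # placeholder values, study-sized shape
absorption = rng.normal(size=(116, 781))    # placeholder values, study-sized shape
joint = np.concatenate([raman, absorption], axis=1)   # 3194 + 781 = 3975 features
scores, explained = pca_scores(joint, n_components=4)
```

The four columns of `scores` would then replace the full spectrum as the model input, which is the comparison reported for the RF and SVM methods in the Appendix.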

4 CONCLUSION

In the current study, the possibility of using absorption, Raman, and joint Raman-absorption spectra to determine the concentration of samples containing viral particles was investigated. RF and CNN, as two different machine learning algorithms, were utilized for making predictions, and the prediction accuracy was monitored using 5-fold cross-validation. We demonstrated that, with sufficient preprocessing, both the Raman and absorption spectra could be used to create a surrogate model to predict concentration values. In most cases, the Raman spectrum leads to more accurate predictions compared to the absorption spectrum. Moreover, concatenating the Raman and absorption spectra improves the prediction accuracy for both RF and CNN. Furthermore, PCA sheds light on the advantage of joint spectra over the use of the Raman or absorption spectrum alone, as the points corresponding to different concentrations are further separated. We have demonstrated that the joint utilization of the Raman and absorption spectra paves the way for real-time measurements of the concentration of viral particles in well plates, which can be extended to different static and dynamic settings, such as microfluidic devices with different flow conditions.

The key limitations of this study can be listed as follows. (a) The predictions, in general, work well when the unknown concentration values lie in the range of the training data sets. Given that here we focused on relatively large concentration values (>4 × 10⁵/ml), the predictions for low concentration values (≪4 × 10⁵/ml) are not reliable. (b) The predictions are valid only for ProQuad® samples. Further training data points corresponding to different types of viral particles are required to extend the applicability of the current method.

In future studies, we intend to extend the range of predictions and develop a graphical user interface, which accepts the raw Raman and absorption data as the input and predicts the values of concentrations for different ML methods. Indeed, the current study can serve as a basic block for developing completely automated software that can capture the values of concentration for different types of viral particles using different machine learning algorithms. Furthermore, we aim to extend the predictions to include Raman spectroscopy in microfluidics under different flow conditions.

AUTHOR CONTRIBUTIONS

Miad Boodaghidizaji: Conceptualization (equal); data curation (lead); formal analysis (equal); methodology (lead); software (lead); visualization (lead); writing – original draft (lead); writing – review & editing (lead). Shreya Milind Athalye: Conceptualization (equal); data curation (lead); formal analysis (equal); visualization (lead); writing – original draft (lead); writing – review & editing (supporting). Sukirt Thakur: Conceptualization (equal); formal analysis (equal); investigation (equal); methodology (lead); software (supporting); writing – original draft (supporting). Ehsan Esmaili: Conceptualization (equal); data curation (lead); formal analysis (equal); investigation (equal); writing – original draft (supporting). Mohit S Verma: Conceptualization (equal); funding acquisition (supporting); project administration (lead); supervision (lead); writing – review & editing (lead). Arezoo M Ardekani: Conceptualization (equal); funding acquisition (lead); project administration (lead); supervision (lead); writing – review & editing (lead).

ACKNOWLEDGMENTS

This work was performed under a Project Award Agreement from the National Institute for Innovation in Manufacturing Biopharmaceuticals (NIIMBL) and financial assistance award 70NANB21H085 from the U.S. Department of Commerce, National Institute of Standards and Technology. Miad Boodaghidizaji and Shreya Milind Athalye contributed equally to this work.

CONFLICT OF INTEREST

None declared.

ETHICS STATEMENT

None required.

APPENDIX

Several ML algorithms can be used for making predictions using Raman and absorption spectra. Here, we present how the predictions might be different if one chooses a different ML algorithm, such as the support vector machine (SVM). Table A1 demonstrates the R² coefficients for the values of the 5-fold predictions for the SVM method. The results are very similar to the values listed for the RF method for the Raman and the concatenated Raman-absorption spectrum. However, for the absorption spectrum, we note that the RF predictions are more accurate than SVM.

Table A1. The R² values of 5-fold cross-validation for the prediction of concentration for the Raman, absorption, and concatenated Raman-absorption spectrum using the SVM method

SVM
Fold	Absorption	Raman	Raman-absorption
1	0.664	0.980	0.991
2	0.698	0.984	0.990
3	0.777	0.981	0.989
4	0.601	0.968	0.980
5	0.621	0.984	0.993
Ave	0.672	0.979	0.988

Abbreviations: PCA, principal component analysis; RF, random forest; SVM, support vector machine.

Furthermore, we demonstrate how the prediction R² values change if the input dimension is reduced to four using PCA for the RF and SVM methods. As shown in Table A2, we note that dimensionality reduction, in this case, significantly decreases the R² values, particularly for the SVM method.

Table A2. The R² values of 5-fold cross-validation for the prediction of concentration for the Raman, absorption, and concatenated Raman-absorption spectrum using PCA

	RF			SVM		
Fold	Absorption	Raman	Raman-absorption	Absorption	Raman	Raman-absorption
1	0.958	0.926	0.933	0.664	0.731	0.701
2	0.972	0.935	0.960	0.698	0.746	0.790
3	0.949	0.951	0.940	0.777	0.623	0.578
4	0.873	0.914	0.973	0.601	0.758	0.793
5	0.973	0.957	0.973	0.621	0.695	0.538
Ave	0.945	0.937	0.956	0.672	0.710	0.680

Note: The background noise can affect the Raman and absorption spectra, particularly at low Raman shifts and wavelengths. As a result, in this section, we remove the initial parts of the Raman (Raman shift <300 cm⁻¹) and absorption spectrum (λ < 250 nm). As shown in Table A3, we note that the prediction accuracies do not change significantly with trimming. Therefore, we used the entire spectra for prediction. Indeed, one of the advantages of using machine learning techniques is that these techniques automatically detect which part of the signal is important. Figure A1 demonstrates the values of importance for the Raman and absorption spectra before and after trimming. The important values are obtained automatically from the Sklearn importance attribute for the RF method (Pedregosa et al., 2011). As evident, we do not notice any significant shift in the important regions of the signals.
Abbreviations: PCA, principal component analysis; RF, random forest; SVM, support vector machine.
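
The trimming and importance comparison in the note above can be illustrated with a short sketch. It assumes scikit-learn's `RandomForestRegressor` and its `feature_importances_` attribute. The synthetic Raman data, the axis range, the peak position, and the helper names `trim_spectrum` and `rf_importances` are hypothetical and are not taken from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def trim_spectrum(axis, spectra, lower_cutoff):
    """Drop every point whose axis value lies below lower_cutoff."""
    keep = axis >= lower_cutoff
    return axis[keep], spectra[:, keep]


def rf_importances(X, y, seed=0):
    """Fit a random forest and return its per-feature importance values."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X, y)
    return model.feature_importances_


# Hypothetical Raman data: one concentration-dependent peak near 1000 cm^-1.
rng = np.random.default_rng(0)
raman_shift = np.linspace(100.0, 3200.0, 400)
concentration = rng.uniform(0.0, 1.0, size=150)
peak = np.exp(-((raman_shift - 1000.0) ** 2) / (2 * 40.0 ** 2))
spectra = concentration[:, None] * peak[None, :]
spectra += rng.normal(scale=0.05, size=spectra.shape)

# Importance values before trimming.
imp_full = rf_importances(spectra, concentration)

# Importance values after removing Raman shifts below 300 cm^-1.
shift_trim, spectra_trim = trim_spectrum(raman_shift, spectra, 300.0)
imp_trim = rf_importances(spectra_trim, concentration)

print("most important shift (full):   ", raman_shift[np.argmax(imp_full)])
print("most important shift (trimmed):", shift_trim[np.argmax(imp_trim)])
```

If the most important region stays in the same place after trimming, as the note reports for Figure A1, the removed low-shift region contributed little to the prediction.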

Table A3. The R² values of 5-fold cross-validation for the prediction of concentration for the trimmed Raman, absorption, and concatenated Raman-absorption spectrum using the RF method

RF
Fold	Absorption	Raman	Raman-absorption
1	0.958	0.971	0.987
2	0.972	0.982	0.991
3	0.949	0.967	0.975
4	0.873	0.961	0.955
5	0.973	0.980	0.988
Ave	0.945	0.972	0.979

Note: In this study, we did not subtract the control spectrum of the solvent (sterile water) from the Raman and absorption spectra, to minimize the amount of preprocessing. Here, we show how the prediction accuracies change if the control data are subtracted from all the spectra. Figure A2 compares the Raman and absorption spectra for samples that contain viral particles. The presence of viral particles induces noticeable changes at most Raman shifts, and the absorption signal differs at all wavelengths when viral particles are introduced. Additionally, we present the spectra with the water data subtracted. Table A4 reports the R² values for predictions of the RF method using the spectra with the water data subtracted. The R² values decrease with the subtraction of water data compared to the values presented in Table A3. Therefore, we excluded the water-subtraction step from the preprocessing.
Abbreviation: RF, random forest.
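
The control subtraction discussed in the note above amounts to a pointwise difference between each sample spectrum and the solvent spectrum. The sketch below is one possible form, assuming the control is averaged over its replicate measurements before subtraction. The study does not say how replicates were handled, and the function name `subtract_control` and all numbers are hypothetical.

```python
import numpy as np


def subtract_control(sample_spectra, control_spectra):
    """Subtract the mean control (solvent) spectrum from every sample spectrum.

    sample_spectra: array of shape (n_samples, n_points)
    control_spectra: array of shape (n_points,) or (n_controls, n_points)
    """
    sample_spectra = np.asarray(sample_spectra, dtype=float)
    control = np.atleast_2d(np.asarray(control_spectra, dtype=float))
    control_mean = control.mean(axis=0)  # assumption: average the replicates
    if control_mean.shape[0] != sample_spectra.shape[1]:
        raise ValueError("control and sample spectra must share the same axis")
    return sample_spectra - control_mean


# Hypothetical absorption data on a coarse wavelength grid (nm).
wavelength = np.linspace(200.0, 800.0, 7)
water = np.array([
    [0.30, 0.20, 0.10, 0.05, 0.05, 0.05, 0.05],
    [0.32, 0.22, 0.12, 0.07, 0.07, 0.07, 0.07],
])
virus_profile = np.array([0.5, 0.4, 0.2, 0.1, 0.0, 0.0, 0.0])
concentrations = np.array([0.5, 1.0])

# Each sample is the mean water spectrum plus a concentration-scaled profile.
samples = water.mean(axis=0) + concentrations[:, None] * virus_profile
corrected = subtract_control(samples, water)
print(np.round(corrected, 3))
```

In this idealized case the subtraction recovers the concentration-scaled profile exactly. With real measurements, subtraction also adds the control's noise to every spectrum.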

Table A4. The R² values of 5-fold cross-validation for the prediction of concentration for the trimmed Raman, absorption, and concatenated Raman-absorption spectrum using the RF method with the control data subtracted

RF
Fold	Absorption	Raman	Raman-absorption
1	0.975	0.920	0.981
2	0.982	0.983	0.991
3	0.971	0.931	0.948
4	0.882	0.866	0.917
5	0.989	0.947	0.974
Ave	0.959	0.972	0.962

Abbreviation: RF, random forest.

Open Research

DATA AVAILABILITY STATEMENT

The data sets generated and/or analyzed during the current study are available in the Mendeley data repository at https://doi.org/10.17632/44sgp2jvj5.1.

REFERENCES

Bao, W.-J., Li, J., Li, J., Zhang, Q.-W., Liu, Y., Shi, C.-F., & Xia, X.-H. (2018). Au/ZnSe-based surface enhanced infrared absorption spectroscopy as a universal platform for bioanalysis. Analytical Chemistry, 90(6), 3842–3848.
10.1021/acs.analchem.7b04505
Barnes, S. E., Cygan, Z. T., Yates, J. K., Beers, K. L., & Amis, E. J. (2006). Raman spectroscopic monitoring of droplet polymerization in a microfluidic device. The Analyst, 131(9), 1027–1033.
10.1039/b603693g
Blue, J. T., Sinacola, J. R., & Bhambhani, A. (2015). Process scale-up and optimization of lyophilized vaccine products. In D. Varshney, & M. Singh (Eds.), Lyophilized Biologics and Vaccines (pp. 179–210). Springer. https://doi.org/10.1007/978-1-4939-2383-0_9
Butler, H. J., Ashton, L., Bird, B., Cinque, G., Curtis, K., Dorney, J., Esmonde-White, K., Fullwood, N. J., Gardner, B., Martin-Hirsch, P. L., Walsh, M. J., McAinsh, M. R., Stone, N., & Martin, F. L. (2016). Using Raman spectroscopy to characterize biological materials. Nature Protocols, 11(4), 664–687.
10.1038/nprot.2016.036
Daniel, A., Prakasarao, A., & Ganesan, S. (2018). Near-infrared Raman spectroscopy for estimating biochemical changes associated with different pathological conditions of cervix. Spectrochimica Acta, Part A: Molecular and Biomolecular Spectroscopy, 190, 409–416.
10.1016/j.saa.2017.09.014
Ditta, A., Nawaz, H., Mahmood, T., Majeed, M. I., Tahir, M., Rashid, N., Muddassar, M., Al-Saadi, A. A., & Byrne, H. J. (2019). Principal components analysis of Raman spectral data for screening of hepatitis C infection. Spectrochimica Acta, Part A: Molecular and Biomolecular Spectroscopy, 221, 117173.
10.1016/j.saa.2019.117173
Dzurendová, S., Shapaval, V., Tafintseva, V., Kohler, A., Byrtusová, D., Szotkowski, M., Márová, I., & Zimmermann, B. (2021). Assessment of biotechnologically important filamentous fungal biomass by Fourier transform Raman spectroscopy. International Journal of Molecular Sciences, 22(13), 6710.
10.3390/ijms22136710
González-Viveros, N., Gómez-Gil, P., Castro-Ramos, J., & Cerecedo-Núñez, H. H. (2021). On the estimation of sugars concentrations using Raman spectroscopy and artificial neural networks. Food Chemistry, 352, 129375.
10.1016/j.foodchem.2021.129375
Grömping, U. (2009). Variable importance assessment in regression: Linear regression versus random forest. The American Statistician, 63(4), 308–319.
10.1198/tast.2009.08199
Guo, Z., Wang, M., Barimah, A. O., Chen, Q., Li, H., Shi, J., El-Seedi, H. R., & Zou, X. (2021). Label-free surface enhanced Raman scattering spectroscopy for discrimination and detection of dominant apple spoilage fungus. International Journal of Food Microbiology, 338, 108990.
10.1016/j.ijfoodmicro.2020.108990
Gutiérrez-Granados, S., Gòdia, F., & Cervera, L. (2018). Continuous manufacturing of viral particles. Current Opinion in Chemical Engineering, 22, 107–114.
10.1016/j.coche.2018.09.009
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html
10.1109/CVPR.2016.90
Isaev, I., Trifonov, N., Sarmanova, O., Burikov, S., Dolenko, T., Laptinskiy, K., & Dolenko, S. (2020). Joint application of Raman and optical absorption spectroscopy to determine concentrations of heavy metal ions in water using artificial neural networks. Proc. SPIE 11458, Saratov Fall Meeting 2019: Laser physics, photonic technologies, and molecular modeling, 114580R (Vol. 11458). International Society for Optics and Photonics. https://doi.org/10.1117/12.2564398
Khan, S., Ullah, R., Shahzad, S., Anbreen, N., Bilal, M., & Khan, A. (2018). Analysis of tuberculosis disease through Raman spectroscopy and machine learning. Photodiagnosis and Photodynamic Therapy, 24, 286–291.
10.1016/j.pdpdt.2018.10.014
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Koya, S. K., Brusatori, M., Martin, J. V., Yurgelevic, S., Huang, C., Liberati, D. M., Auner, G. W., & Diebel, L. N. (2018). Rapid detection of Clostridium difficile toxins in serum by Raman spectroscopy. Journal of Surgical Research, 232, 195–201.
10.1016/j.jss.2018.06.031
Kuter, B. J., Hoffman Brown, M. L., Hartzel, J., Williams, W. R., Eves, K. A., Black, S., Shinefield, H., Reisinger, K. S., Marchant, C. D., Sullivan, B. J., Thear, M., Klopfer, S., Xu, J., Gress, J. O., Schödel, F., & ProQuad®, S. (2006). Safety and immunogenicity of a combination: Measles, mumps, rubella and varicella vaccine (ProQuad®). Human Vaccines, 2(5), 205–214.
10.4161/hv.2.5.3246
Maruthamuthu, M. K., Raffiee, A. H., De Oliveira, D. M., Ardekani, A. M., & Verma, M. S. (2020). Raman spectra-based deep learning: A tool to identify microbial contamination. MicrobiologyOpen, 9(11), e1122.
10.1002/mbo3.1122
Maruthamuthu, M. K., Rudge, S. R., Ardekani, A. M., Ladisch, M. R., & Verma, M. S. (2020). Process analytical technologies and data analytics for the manufacture of monoclonal antibodies. Trends in Biotechnology, 38(10), 1169–1186.
10.1016/j.tibtech.2020.07.004
Nitkowski, A., Chen, L., & Lipson, M. (2008). Cavity-enhanced on-chip absorption spectroscopy using microring resonators. Optics Express, 16(16), 11930–11936.
10.1364/OE.16.011930
Olaetxea, I., Valero, A., Lopez, E., Lafuente, H., Izeta, A., Jaunarena, I., & Seifert, A. (2020). Machine learning-assisted Raman spectroscopy for pH and lactate sensing in body fluids. Analytical Chemistry, 92(20), 13888–13895.
10.1021/acs.analchem.0c02625
Pallaoro, A., Hoonejani, M. R., Braun, G. B., Meinhart, C. D., & Moskovits, M. (2015). Rapid identification by surface-enhanced Raman spectroscopy of cancer cells at low concentrations flowing in a microfluidic channel. ACS Nano, 9(4), 4328–4336.
10.1021/acsnano.5b00750
Panneerselvam, R., Sadat, H., Höhn, E.-M., Das, A., Noothalapati, H., & Belder, D. (2022). Microfluidics and surface-enhanced Raman spectroscopy, a win–win combination? Lab on a Chip, 22(4), 665–682.
10.1039/D1LC01097B
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., & Antiga, L. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems (Vol. 32). https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., & Dubourg, V. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830.
Philippidis, A., Poulakis, E., Kontzedaki, R., Orfanakis, E., Symianaki, A., Zoumi, A., & Velegrakis, M. (2020). Application of ultraviolet-visible absorption spectroscopy with machine learning techniques for the classification of Cretan wines. Foods, 10(1), 9.
10.3390/foods10010009
Pian, F., Wang, Q., Wang, M., Shan, P., Li, Z., & Ma, Z. (2022). A shallow convolutional neural network with elastic nets for blood glucose quantitative analysis using Raman spectroscopy. Spectrochimica Acta, Part A: Molecular and Biomolecular Spectroscopy, 264, 120229.
10.1016/j.saa.2021.120229
Probst, J., Borca, C. N., Newton, M. A., van Bokhoven, J., Huthwelker, T., Stavrakis, S., & deMello, A. (2021). In situ X-ray absorption spectroscopy and droplet-based microfluidics: An analysis of calcium carbonate precipitation. ACS Measurement Science Au, 1(1), 27–34.
10.1021/acsmeasuresciau.1c00005
Ralbovsky, N. M., & Lednev, I. K. (2020). Towards development of a novel universal medical diagnostic method: Raman spectroscopy and machine learning. Chemical Society Reviews, 49(20), 7428–7453.
10.1039/D0CS01019G
RENISHAW. https://www.renishaw.com/en/raman-software-9450
Rolinger, L., Rüdt, M., & Hubbuch, J. (2020). A critical review of recent trends, and a future perspective of optical spectroscopy as pat in biopharmaceutical downstream processing. Analytical and Bioanalytical Chemistry, 412(9), 2047–2064.
10.1007/s00216-020-02407-z
Romero-Torres, S., Pérez-Ramos, J. D., Morris, K. R., & Grant, E. R. (2006). Raman spectroscopy for tablet coating thickness quantification and coating characterization in the presence of strong fluorescent interference. Journal of Pharmaceutical and Biomedical Analysis, 41(3), 811–819.
10.1016/j.jpba.2006.01.033
Storey, E. E., & Helmy, A. S. (2019). Optimized preprocessing and machine learning for quantitative Raman spectroscopy in biology. Journal of Raman Spectroscopy, 50(7), 5608.
10.1002/jrs.5608
Strachan, C. J., Rades, T., Gordon, K. C., & Rantanen, J. (2007). Raman spectroscopy for quantitative analysis of pharmaceutical solids. Journal of Pharmacy and Pharmacology, 59(2), 179–192.
10.1211/jpp.59.2.0005
Tong, D., Chen, C., Zhang, J., Lv, G., Zheng, X., Zhang, Z., & Lv, X. (2019). Application of Raman spectroscopy in the detection of hepatitis B virus infection. Photodiagnosis and Photodynamic Therapy, 28, 248–252.
10.1016/j.pdpdt.2019.08.006
Torrisi, S. B., Carbone, M. R., Rohr, B. A., Montoya, J. H., Ha, Y., Yano, J., Suram, S. K., & Hung, L. (2020). Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships. npj Computational Materials, 6(1), 109.
10.1038/s41524-020-00376-6
Tulchinsky, T. H. (2018). Maurice Hilleman: Creator of vaccines that changed the world. Case Studies in Public Health, 443–470. https://doi.org/10.1016/B978-0-12-804571-8.00003-2
Watson, O. J., Barnsley, G., Toor, J., Hogan, A. B., Winskill, P., & Ghani, A. C. (2022). Global impact of the first year of COVID-19 vaccination: A mathematical modelling study. The Lancet Infectious Diseases, 22, 1293–1302.
10.1016/S1473-3099(22)00320-6
Wolf, C., Gaida, D., Stuhlsatz, A., Ludwig, T., McLoone, S., & Bongards, M. (2013). Predicting organic acid concentration from UV/Vis spectrometry measurements: A comparison of machine learning techniques. Transactions of the Institute of Measurement and Control, 35(1), 5–15.
10.1177/0142331211403797
Zhang, J., Ye, S., Zhong, K., Zhang, Y., Chong, Y., Zhao, L., Zhou, H., Guo, S., Zhang, G., Jiang, B., Mukamel, S., & Jiang, J. (2021). A machine-learning protocol for ultraviolet protein-backbone absorption spectroscopy under environmental fluctuations. The Journal of Physical Chemistry B, 125(23), 6171–6178.
10.1021/acs.jpcb.1c03296

Volume 11, Issue 6, December 2022, e1336

This article also appears in:

Methods
