Volume 3, Issue 4 e1164

RESEARCH ARTICLE


Object classification in analytical chemistry via data-driven discovery of partial differential equations

Joshua Lee Padgett (Corresponding Author)

[email protected]

orcid.org/0000-0001-9369-351X

Department of Mathematical Sciences, University of Arkansas, Fayetteville, Arkansas, USA

Center for Astrophysics, Space Physics, and Engineering Research, Baylor University, Waco, Texas, USA

Correspondence: Joshua Lee Padgett, Department of Mathematical Sciences, University of Arkansas, 850 W Dickson St #309, Fayetteville, AR, USA.

Email: [email protected]

Yusup Geldiyev

Department of Mathematics and Statistics, Texas Tech University, Lubbock, Texas, USA

Sakshi Gautam

Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, Texas, USA

Wenjing Peng

Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, Texas, USA

Yehia Mechref

Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, Texas, USA

Akif Ibraguimov

Department of Mathematics and Statistics, Texas Tech University, Lubbock, Texas, USA

First published: 04 April 2021

https://doi.org/10.1002/cmm4.1164

Citations: 1

Funding information: National Institutes of Health, 1R01GM112490-04, 1R01GM130091-01, and 1U01CA225753; National Science Foundation, NSF 1903450

Abstract

Glycans are among the most widely investigated biomolecules, due to their roles in numerous vital biological processes. However, few system-independent, LC-MS/MS (liquid chromatography tandem mass spectrometry) based studies have been developed with this particular goal. Standard approaches generally rely on normalized retention times as well as the m/z (mass-to-charge ratio) values of ions. Due to these limitations, there is a need for quantitative characterization methods which can be used independently of m/z values, thus utilizing only normalized retention times. As such, the primary goal of this article is to construct an LC-MS/MS based classification of the glycans derived from standard glycoproteins and human blood serum using a glucose unit index as the reference frame in the space of compound parameters. For the reference frame, we develop a closed-form analytic formula via the Green's function of a relevant convection-diffusion-absorption equation used to model composite material transport. The aforementioned equation is derived from an Einstein–Brownian motion paradigm, which provides a physical interpretation of the time-dependence at the point of observation for molecular transport in the experiment. The necessary coefficients are determined via a data-driven learning procedure. The methodology is presented abstractly and validated via comparison with experimental mass spectrometer data.

1 INTRODUCTION

The biological significance of glycans is evident from the numerous studies demonstrating their roles in living systems. These molecules alone, as well as in conjunction with other biomolecules, participate in important biological functions. For example, glycosylation is one of the major post-translational modifications^{1, 2} and is known to mediate a broad range of biological processes such as cell recognition,³ cell signaling,^{3, 4} immune response,⁵ and protein stability.⁶ Furthermore, aberrations in glycosylation patterns are found to be related to various diseases, including cancers.^{3, 7-11}

Glycans display high structural complexity owing to the presence of diverse monosaccharide composition, different linkages, and various branching options.¹² Tandem mass spectrometry (MS) has emerged as an effective technique for glycan structural studies.¹ This technique in conjunction with liquid chromatography (LC) provides a powerful tool for studying molecular structures.^{1, 13-16} Additionally, various derivatization techniques are employed to pursue sensitive and efficient investigation of the glycans. Some of the derivatization reagents include 2-aminobenzamide, procainamide, aminoxyTMT, RapiFluor-MS (RFMS) labeling, and iodomethane permethylation.¹⁷ Our particular method of choice for the current study is permethylation as it delivers several advantages over other derivatization techniques.^{13-16} In this technique, methyl groups replace the hydrogen atoms attached to oxygen and nitrogen atoms in a glycan structure. Permethylated glycans have increased hydrophobicity, which makes them ideal for reverse phase separation. Permethylation also prevents fucose (sugar) migration^{18, 19} and sialic acid loss.²⁰ Also, due to increased positive ion efficiency of glycans, ionization efficiency is improved, thereby enhancing the sensitivity.²¹

Despite the availability of sensitive structural investigation techniques, inter-instrument and inter-laboratory variations in data acquisition complicate the identification and characterization of glycans. This motivates the development of universally applicable classification techniques that are independent of both instrument and laboratory. The glucose unit index (GUI) is one such method, as it relies only on the retention time of sample molecules relative to that of the glucose units. Ashwood et al. recently reported the retention time normalization of native glycans based on GUI as well as m/z values.²²

In this study, we develop a method which only utilizes GUI for characterizing permethylated glycans, independent of m/z values of the permethylated glycan structures. Dextrin, which is utilized as a reference frame in this study, consists of a mixture of oligosaccharides of D-glucose units which form linear chains consisting of either $α$-(1 → 4) or $α$-(1 → 6) glycosidic bonds. The retention times of these glucose units are used to calculate the normalized retention times of the reduced and permethylated N-glycans derived from samples. The use of dextrin as an internal standard allows for the elimination of inter-injection variations and improves the accuracy of the measurements. This approach then allows for the development of a mathematical model employing the LC-MS/MS-based data for efficient identification of permethylated N-glycans.

In particular, we employ a so-called data-driven methodology for constructing an associated partial differential equation (PDE) which allows for a straightforward classification procedure. The use of data-driven PDEs and other data-driven methodologies has recently garnered much attention in the literature due to their ability to efficiently learn in relation to dynamical systems and physical processes (see, for instance, References 23-37 and the references therein). The coupling of such approaches with deep learning methods has allowed for the efficient and accurate handling of situations involving quite large data sets. Herein, we consider a modified approach which avoids standard regression methods in favor of a more mathematically informed process for determining the coefficients necessary for object classification. By modifying an approach employed by Einstein (e.g., Reference 38) we are able to deduce more clearly the form of the undetermined PDE, making our approach closer to a supervised learning method (with some distinct differences). Moreover, we demonstrate that performing a single learning procedure on a particular data set will allow for highly accurate classification of unknown data sets of a particular type. This, of course, motivates a wide array of novel questions related to learning procedures in both mathematics and the physical sciences.

This article is organized as follows. Section 2 provides a heuristic description of the algorithm in order to provide the reader clarity regarding our goal and general methodologies. In Section 3, we use a modification of the original Einstein argument for the classical development of standard Brownian motions (see Reference 38, for example) to derive the primary equation of interest. This section also includes a more generalized procedure for the derivation which reduces the necessary assumptions on the system of interest. Section 4 builds on the material in Section 3, allowing for the construction of a closed-form solution to our theoretical model—thus, bypassing the need for numerical approximations and complicated learning procedures. We then use this model in Section 5 to classify unknown samples via our proposed algorithm. This section also clearly outlines the parameters for the physical experiments carried out to produce the data set for classification. Finally, Section 6 provides some concluding remarks and also alludes to possible future endeavors related to the current work.

2 OUTLINE OF PROPOSED ALGORITHM

We briefly outline the ideas behind the newly proposed algorithm, below. Note that the algorithm will be more completely and rigorously described in Main Algorithm (see Section 5). For clarity, we allude to specific aspects of the experiment of interest, whose specific protocol is outlined in Section 5.2.

Proposed Algorithm. Let A denote the sample object of interest (e.g., a standard N-glycan—cf. Section 5.2).

(i) Inject into the sample, A, simple chemicals which will serve as “markers” in the classification process. For our purposes, we use glucose molecules of different types and denote these types by M_i, i ∈ N∗ = {1, 2, … , n∗} (where $n^{*} \in ℕ$). Each molecule M_i, i ∈ N∗, has a linear structure and (possibly) a different length.
(ii) We then “slowly” transport sample A through a short (approximately 10 cm) porous tube. We assume that the transport is one-dimensional and let this transport coincide with the positive x-axis (e.g., as in Figure 1). (Note that the speed of transport was selected through auxiliary experiments in order to maximize the high resolution of the signal-to-noise ratio, prior to our classification procedure.)
(iii) At the point of observation, which we denote by x = L (L < 10 cm), we record all signals which are obtained from the mass spectrometer (with a particular emphasis on the observed peaks in the signals).
(iv) In this situation it is assumed that for all i, j ∈ N∗, such that i ≠ j, it holds that M_i and M_j do not mix (i.e., do not undergo chemical bonding). We identify each M_i, i ∈ N∗, via the peaks in the signals of sample A obtained from the spectrometer. These M_i, i ∈ N∗, will serve as our “marker” molecules.

Note that the family of “marker” molecules M_i, i ∈ N∗, will be a subset of all molecules in any other sample of interest: M_j, j ∈ N = {1, 2, … , n} (where $n^{*} \leq n \in ℕ$). In other words:
${(M_{j})}_{j \in N^{*}} \subseteq {(M_{j})}_{j \in N} .$ (1)
(v) We then classify the peaks that are located between those of the “marker” molecules in the sample A via a so-called classification index. In the current article, this index will depend only on two parameters, which in turn depend only on the retrieval time (the time at which the signal has the largest peak).
(vi) Using the information obtained by completing (i)–(v), we may then classify other samples of interest. That is, for all other samples which have been injected with the same “markers,” we extract data regarding the M_i, i ∈ N∗, in these samples by matching the spectrometer signal peaks which are closest to the “marker” molecules in sample A.
(vii) We can then classify the remaining signal peaks in these samples through associated so-called data-driven PDEs. This is accomplished by constructing appropriate diffusion and absorption coefficients, which will allow us to distinguish differences between the new samples and the original sample A (see the Main Algorithm). (Note that we use the terms diffusion and absorption to mirror the description given in the thought experiment employing compound transport based on Einstein's paradigm of Brownian motion with absorption and drift; see Section 3 for more details.)
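Steps (v) and (vi) can be sketched in a few lines of Python. This is only an illustration: the function names `match_markers` and `interpolated_index` and all retention times are hypothetical, and the linear interpolation between bracketing markers is a simple stand-in for the classification index developed later in the article, which is built from PDE coefficients rather than from interpolation.

```python
from bisect import bisect_right

def match_markers(reference_times, observed_peaks):
    """Step (vi): for each marker time from sample A, pick the closest observed peak."""
    return [min(observed_peaks, key=lambda t: abs(t - ref)) for ref in reference_times]

def interpolated_index(marker_times, peak_time):
    """Step (v), simplified: place a peak between its bracketing markers (numbered 1..n*).

    Linear interpolation is an illustrative assumption, not the article's index.
    marker_times must be sorted in increasing order.
    """
    if not marker_times[0] <= peak_time <= marker_times[-1]:
        raise ValueError("peak lies outside the marker range")
    k = bisect_right(marker_times, peak_time)  # number of markers at or before the peak
    if k == len(marker_times):
        return float(k)  # the peak coincides with the last marker
    lo, hi = marker_times[k - 1], marker_times[k]
    return k + (peak_time - lo) / (hi - lo)

# Hypothetical retention times (minutes): markers in sample A, peaks in a new sample.
reference = [10.0, 14.0, 19.0]
observed = [10.3, 12.1, 13.8, 16.4, 19.2]

markers = match_markers(reference, observed)
indices = {t: interpolated_index(markers, t) for t in observed}
```

Because the index is computed relative to the markers found in the same run, a uniform shift of all retention times between injections leaves it unchanged, which is the motivation for using internal markers in the first place.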

**FIGURE 1** Schematic diagram of the experiment, with a signal peak and retention time, $T_{\max}$. Note that (A) is a rough depiction of a generic sample passing through a tube and past a mass spectrometer sensor. Possible raw data obtained from this sensor are provided in (B). The value of $T_{\max}$ is determined by direct observation of the data obtained from the sensor.

The generic description provided by (i)–(vii) in Proposed Algorithm is meant to provide a rough blueprint of the method we employ herein. However, there are numerous mathematical details needed before we can rigorously formulate the final algorithm. A schematic depiction of Proposed Algorithm (and Main Algorithm) is provided in Figure 1.

As noted in (iv) of Proposed Algorithm, our method is based on the important assumption that for all i, j ∈ N∗, such that i ≠ j, the molecules M_i and M_j are not mixing throughout the experiment. This is formalized in the following assumption.

Assumption 1. For each i, j ∈ N∗ (cf. (i) of Proposed Algorithm), such that i ≠ j, it is assumed that M_i and M_j do not mix. That is, they do not interact to create novel chemical compounds.

Assumption 1 is a vital assumption in our proposed method. However, it is worth noting that Assumption 1 may not be valid in all physical experiments of interest. For our particular situation, empirical evidence (obtained from experiments employing the guidelines outlined in Section 5.2) suggests that Assumption 1 does in fact hold. The process of molecular mixing requires further generalizations of Einstein's paradigm for Brownian motion and will be a focus of forthcoming research. More details on this process are provided in Section 3.

3 EINSTEIN'S PARADIGM FOR MOLECULAR TRANSPORT IN A TUBE

In this section we reformulate Einstein's model for Brownian motion for the case where glucose molecules are transported in a tube filled with porous material. The phenomenon takes its name from Robert Brown's visual observations of the random jumps of pollen grains of the plant Clarkia pulchella suspended in water (see References 38 and 39). Einstein provided a mathematical framework to describe the observed phenomena, which resulted in his model of classical Brownian motion. A less widely appreciated fact is that Einstein's approach can also be applied to many generic processes arising in physics, chemistry, and engineering. This is the key observation employed to derive our novel classification algorithm.

3.1 Assumptions for Einstein's paradigm

In order to employ Einstein's approach in novel situations, one must first understand the key principles which underpinned his work. These principles can be distilled into three main axioms, which we state as additional assumptions.

Assumption 2. For each molecule of interest, say M, we assume that there exists a time interval $τ$, which is small compared to the observable time intervals but large enough that the motions performed by M during two consecutive time intervals of length $τ$ can be considered as mutually independent events.

Assumption 3. Let $ℳ (τ)$ be the set of all possible lengths of non-colliding jumps associated to M in the time interval $τ$ (cf. Assumption 2). We will say that $Δ$ is a possible length of such “free jumps” if $Δ \in ℳ (τ)$.

Assumption 4. All molecular interactions during such time intervals $τ$ (cf. Assumption 2) are restricted to absorption by the surrounding media, which includes other molecules, the porous media, and all possible boundaries.

Remark 1. In general, the time interval of free jumps, $τ$ (cf. Assumption 2), the expected value of the length of the free jumps associated to $τ$, which we denote by $Δ_{e} \in ℳ (τ)$, and the frequency of the free jumps of length $Δ \in ℳ (τ)$, which we denote by $φ (Δ) \in [0, \infty)$, are key parameters in our approach and may depend on underlying properties of the involved molecules (such as rates of change) and the surrounding environment (such as the components of the mixtures and the associated porous media) (see, for example, Reference 39 and the references therein).

We will use Remark 1 for the interpretation of the experimental data, herein.

3.2 Derivation of associated equation with drift and absorption

For the development of our PDE model, we employ the notation used in the original work of Einstein.³⁸ Denote by $n \in ℕ$ the total number of particles of a molecule M present in a unit volume, and let $Δ \in ℳ (τ)$ (cf. Assumption 3). Let $d n \in ℕ$ denote the number of particles located in an arbitrary interval of length $d Δ$ experiencing a displacement of magnitude $Δ$ in the time interval $τ$ (cf. Assumption 2). This value can be expressed via the following equation

$d n = n φ (Δ) \, d Δ,$ (2)

where $φ : ℝ \to [0, 1]$ is the frequency (or probability density) of free jumps with length $Δ$.

Note that in his original work, Einstein assumed that φ was an even function which differs from zero only for very small values of $Δ$ (e.g., [38, Page 13]).

Remark 2. It is worth noting that Einstein's assumptions on φ are natural assumptions in the particular case involving Brownian motion. However, it is important to observe that these assumptions may not be reasonable for all physical phenomena.

Based on Remark 2, it should be clear that our intentions are to consider a more general case than that of Einstein's original work. Therefore, we must clearly indicate what assumptions we will employ (as an arbitrary probability density will be too general to allow for any meaningful data-driven learning). To this end, we assume that there exists $σ \in [0, \infty)$ such that the function φ satisfies

$\int_{- \infty}^{\infty} φ (Δ) \, d Δ = 1,$ (3)

which we refer to as the whole universe axiom,

$\int_{- \infty}^{\infty} Δ \, φ (Δ) \, d Δ = Δ_{e}$ (4)

(cf. Remark 1), which we refer to as the expected length of the free jumps, and

$\int_{- \infty}^{\infty} {(Δ - Δ_{e})}^{2} φ (Δ) \, d Δ = σ^{2}$ (5)

(cf. Remark 1), which we refer to as the standard variance of the free jump lengths.
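As a small worked example of Equations (3)–(5), the sketch below replaces the density φ by a discrete distribution over four jump lengths, so that the integrals become finite sums. The jump lengths, the weights, and the helper name `jump_moments` are hypothetical choices made purely for illustration.

```python
def jump_moments(jumps, weights):
    """Discrete analogues of Equations (3)-(5): total mass, mean jump, variance."""
    total = sum(weights)                                   # Equation (3)
    mean = sum(d * w for d, w in zip(jumps, weights))      # Equation (4)
    var = sum((d - mean) ** 2 * w for d, w in zip(jumps, weights))  # Equation (5)
    return total, mean, var

# Hypothetical jump lengths (arbitrary units) and their frequencies.
jumps = [-1.0, 0.0, 1.0, 2.0]
weights = [0.2, 0.4, 0.3, 0.1]

total, delta_e, sigma2 = jump_moments(jumps, weights)
```

Here the distribution is deliberately not even, so the expected jump length is nonzero. This is the kind of asymmetry that Einstein's original evenness assumption rules out and that produces the drift term later in this section.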

Next, let $f : ℝ \times [0, \infty) \to [0, \infty)$ be the function which represents the number of particles per unit volume present at position $x \in ℝ$ at time t ∈ [0, ∞). With this and Equations (3)–(5), we may now formulate a crucial axiomatic conservation law.

Axiom 1. There exists a continuous function $F : ℝ \to ℝ$ such that the number of particles found at time $t + τ$ (cf. Assumption 2) between two planes perpendicular to the x-axis with abscissas x and $x + δ$ (where $δ \in ℝ$) is given by

$\int_{x}^{x + δ} f (y, t + τ) \, d y = \int_{x}^{x + δ} \int_{- \infty}^{\infty} f (y + Δ, t) φ (Δ) \, d Δ \, d y + \int_{x}^{x + δ} \int_{t}^{t + τ} F (f (y, s)) \, d s \, d y .$ (6)

The second term on the right-hand side of Equation (6) is a result of possible bonding and/or absorption with other molecules in the sample or with the porous media within the tube. In general, for Equation (6) to be well-defined, all one needs is that F is finite and measurable on the codomain of the function f. However, throughout this article, we will assume that there exists $ω \in ℝ$ such that for all $x \in ℝ$, t ∈ [0, ∞) this term is well-approximated by the linear function $ω \cdot f (x, t)$. The conservation law given by Equation (6) is depicted schematically in Figure 2, below.

Remark 3. In Equation (7), we assume that φ has compact support. This assumption is physical in nature, as the ensuing theoretical work follows in a similar fashion if φ does not have compact support.

Note that for all $x \in ℝ$ it holds that

$\int_{- \infty}^{\infty} f (x + Δ, t) φ (Δ) \, d Δ = \int_{- \infty}^{\infty} [f (x + Δ, t) - f (x + Δ_{e}, t) + f (x + Δ_{e}, t)] φ (Δ) \, d Δ$ (7)

(cf. Remark 1). Next, observe that the multi-dimensional Carathéodory Theorem (e.g., Bartle et al. [40, Theorem 6.1.5]) ensures that there exist measurable functions $ψ^{x}, ψ^{t} : ℝ \times [0, \infty) \to ℝ$ such that for all $x \in ℝ$, t ∈ [0, ∞) it holds that

$f (x, t + τ) - f (x + Δ_{e}, t) = τ [ψ^{t} (x, t + τ)] + Δ_{e} [ψ^{x} (x + Δ_{e}, t)],$ (8)

where for all $x \in ℝ$, t ∈ [0, ∞) it holds that

$ψ^{t} (x, t + τ) \approx \frac{\partial f (x, t)}{\partial t} \quad \text{and} \quad ψ^{x} (x + Δ_{e}, t) \approx \frac{\partial f (x, t)}{\partial x} .$ (9)

Furthermore, applying Taylor's theorem to $f (x + Δ, t)$, with respect to x, centered at the point $x + Δ_{e}$, yields (under appropriate smoothness assumptions) that for all $x \in ℝ$, t ∈ [0, ∞) it holds that

$\begin{aligned} f (x + Δ, t) & = f (x + Δ_{e}, t) + \sum_{k = 1}^{\infty} \frac{{(Δ - Δ_{e})}^{k}}{k!} \frac{\partial^{k} f}{\partial x^{k}} (x + Δ_{e}, t) \\ & = f (x + Δ_{e}, t) + (Δ - Δ_{e}) \frac{\partial f}{\partial x} (x + Δ_{e}, t) + \frac{{(Δ - Δ_{e})}^{2}}{2!} \frac{\partial^{2} f}{\partial x^{2}} (x + Δ_{e}, t) \\ & \quad + \frac{{(Δ - Δ_{e})}^{3}}{3!} \frac{\partial^{3} f}{\partial x^{3}} (x + Δ_{e}, t) + 𝒪 (| Δ - Δ_{e} |^{4}) . \end{aligned}$ (10)

Next, we assume that $| Δ - Δ_{e} | ≪ 1$ and that for all $k \in ℕ$ it holds that

$f (x + Δ_{e}, t) \approx f (x, t) \quad \text{and} \quad \frac{\partial^{k} f}{\partial x^{k}} (x + Δ_{e}, t) \approx \frac{\partial^{k} f}{\partial x^{k}} (x, t) .$ (11)

Combining this with Equations (3)–(5), (7), and (10) (after disregarding the higher-order terms in Equation (10), which is justified by the assumption that $| Δ - Δ_{e} | ≪ 1$) and applying straightforward calculus demonstrates that for all $x \in ℝ$, t ∈ [0, ∞) the function f satisfies

$τ \frac{\partial f}{\partial t} (x, t) + Δ_{e} \frac{\partial f}{\partial x} (x, t) = \frac{σ^{2}}{2} \frac{\partial^{2} f}{\partial x^{2}} (x, t) + τ ω f (x, t) .$ (12)

For convenience, we will rewrite Equation (12) in the form:

$\frac{\partial f}{\partial t} (x, t) - D \frac{\partial^{2} f}{\partial x^{2}} (x, t) + γ \frac{\partial f}{\partial x} (x, t) - ω f (x, t) = 0,$ (13)

where for all $x \in ℝ$, t ∈ [0, ∞), $τ \in ℳ (τ)$ (cf. Assumption 2) the diffusion, drift (convection), and absorption terms are given by

$\frac{σ^{2}}{2 τ} = D, \quad \frac{Δ_{e}}{τ} = γ, \quad \text{and} \quad \frac{1}{τ} \int_{t}^{t + τ} F (f (x, s)) \, d s \approx ω f (x, t),$ (14)

respectively. The differential equation given in Equation (13) is the well-known convection-diffusion-absorption equation which arises in the study of numerous physically relevant phenomena. It is worth noting that Equation (13) reduces to Einstein's original equation for Brownian motion (e.g., [38, Equation (10)]) if we set $Δ_{e} = 0$ (no drift) and $ω = 0$ (no absorption).
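The identifications in Equation (14) can be checked on a toy jump process. The sketch below is an illustration under stated assumptions: jumps live on a lattice of spacing `h`, the jump frequencies `phi`, the values of `h` and `tau`, and the helper names are all hypothetical, and absorption is switched off (ω = 0). It composes 50 independent jumps exactly by repeated convolution and compares the accumulated mean and variance with γt and 2Dt.

```python
def convolve(p, q):
    """Distribution of the sum of two independent lattice jumps (dict: site -> probability)."""
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, 0.0) + pa * qb
    return out

def mean_var(p, h):
    """Mean and variance of a lattice distribution with spacing h."""
    mean = sum(k * h * w for k, w in p.items())
    var = sum((k * h - mean) ** 2 * w for k, w in p.items())
    return mean, var

h, tau = 0.01, 0.001                      # hypothetical lattice spacing and jump interval
phi = {-1: 0.2, 0: 0.4, 1: 0.3, 2: 0.1}   # hypothetical jump frequencies (sum to 1)

delta_e, sigma2 = mean_var(phi, h)
D, gamma = sigma2 / (2 * tau), delta_e / tau   # Equation (14), with omega = 0

steps = 50
dist = {0: 1.0}                           # all particles start at the origin
for _ in range(steps):
    dist = convolve(dist, phi)            # independence of consecutive jumps (Assumption 2)

t = steps * tau
mean_t, var_t = mean_var(dist, h)         # expected: gamma * t and 2 * D * t
```

The linear growth of the variance with slope 2D is the discrete counterpart of the diffusion term, and the linear growth of the mean with slope γ is the counterpart of the drift term.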

We conclude this subsection with some final remarks. In the arguments above, we have assumed sufficient smoothness conditions in order to allow for all claims to hold globally (i.e., for all t ∈ [0, ∞)). However, this can easily be circumvented by constructing Equation (13) locally (e.g., for some c ∈ (0, ∞) with t ∈ [0, c)) and then “gluing” the results together. This approach is avoided, herein, for simplicity and ease of exposition. Furthermore, the assumption that $| Δ - Δ_{e} | ≪ 1$ is not an inherently restrictive assumption. It can loosely be thought of as removing the possibility of so-called long-range interactions within the experiment. One can allow for such interactions, but the resulting PDE may involve non-local operators such as the fractional Laplacian (see, for instance, References 41-44 and the references therein). Finally, we note that the assumptions in Equation (11) are purely for convenience, as we can obtain a result similar to Equation (13) through a simple re-scaling of the function f.

Remark 4. For the remainder of our study, we will assume that $Δ$, φ, and $τ$ (cf. Assumption 2) are independent of the function f. In general, these parameters can depend upon both the spatial and temporal variables, x and t, respectively, as well as the underlying porous media. Moreover, in more general situations, these parameters can further depend on the dependent variables' derivatives.

Remark 5. Herein, we assume that $τ$ (cf. Assumption 2) is the same for each of the types of molecules of interest. That is, for each M_i, i ∈ {1, 2, … , n}, we assume that the associated $τ_{i} = τ$. One can always find such a $τ$ by simply choosing the minimum of the set of associated $τ_{i}$, i ∈ {1, 2, … , n}.

3.3 A remark on an alternative derivation

We conclude Section 3 with an alternative derivation of Equation (13). This alternative derivation is not simply an exercise in pure mathematics, but rather, it provides justification that the proposed algorithm may be applied to a wider class of problems than originally expected. First, this alternative derivation helps to resolve a slight inconsistency in the derivation presented in Section 3.2. That is, in Section 3.2 we simultaneously applied the Carathéodory theorem and Taylor's theorem to Equation (6). There is nothing inherently wrong with this approach—it simply seems strange. The approach presented below rectifies this concern. Next, this alternative derivation reduces the smoothness assumptions one imposes on the function f. In practice, it is often assumed that a function of interest is smooth enough to manipulate via expansions, but this assumption excludes numerous physically relevant situations. As such, we present a method for reducing the regularity assumptions needed to obtain Equation (13), which, in turn, increases the applicability of the proposed method.

To that end, we will derive Equation (13) using only the Carathéodory theorem. Throughout this argument, we assume that for all $x \in ℝ$, t ∈ [0, ∞) it holds that $\frac{\partial^{2} f}{\partial x^{2}} (x, t)$ is continuous and bounded (otherwise the strong form of the PDE Equation (13) is not well-defined). This implies that there exists $α \in (0, 2]$ such that for all $h, x \in ℝ$, t ∈ [0, ∞) satisfying $| h | ≪ 1$ it holds that

$\begin{aligned} \frac{\partial^{2} f}{\partial x^{2}} (x, t) & = \frac{f (x + h, t) - 2 f (x, t) + f (x - h, t)}{h^{2}} + 𝒪 (| h |^{α}) \\ & \approx \frac{f (x + h, t) - f (x, t) + f (x - h, t) - f (x, t)}{h^{2}} . \end{aligned}$ (15)
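The central second difference in Equation (15) is easy to probe numerically. The following sketch uses the sine function as a stand-in for f at a fixed time (an assumption made only so that the exact second derivative is known) and records the approximation error for three step sizes; the helper name `second_difference` is hypothetical.

```python
import math

def second_difference(f, x, h):
    """Central second difference appearing in Equation (15)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

x0 = 0.7
exact = -math.sin(x0)   # exact second derivative of sin at x0

# Errors for h = 0.1, 0.01, 0.001; a smooth f corresponds to alpha = 2 in Equation (15).
errors = [abs(second_difference(math.sin, x0, h) - exact) for h in (1e-1, 1e-2, 1e-3)]
```

For this smooth test function each tenfold reduction in h reduces the error by roughly a factor of one hundred, which matches the upper end α = 2 of the range allowed in the text. Less regular functions would show slower decay.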

Next, the Carathéodory theorem and Equation (15) ensure the existence of $ψ_{1} : ℝ \times [0, \infty) \to ℝ$ and $ψ_{i, h} : ℝ \times [0, \infty) \to ℝ$, i ∈ {1, 2}, $h \in ℝ$, such that for all $h, x \in ℝ$, t ∈ [0, ∞) with $| h | ≪ 1$ it holds that

$f (x + h, t) - f (x, t) = h [ψ_{1, h} (x, t)] \quad \text{and} \quad f (x, t) - f (x - h, t) = h [ψ_{1, - h} (x, t)]$ (16)

and

$ψ_{1, h} (x, t) - ψ_{1} (x, t) = h [ψ_{2, h} (x, t)] \quad \text{and} \quad ψ_{1} (x, t) - ψ_{1, - h} (x, t) = h [ψ_{2, - h} (x, t)] .$ (17)

This and the Carathéodory theorem further ensure that there exists $ψ_{2} : ℝ \times [0, \infty) \to ℝ$ such that it holds for all $x \in ℝ$, t ∈ [0, ∞) that

$\frac{\partial f}{\partial x} (x, t) = ψ_{1} (x, t) = \lim_{h \to 0} ψ_{1, h} (x, t) = \lim_{h \to 0} ψ_{1, - h} (x, t)$ (18)

and

$\frac{\partial^{2} f}{\partial x^{2}} (x, t) = \frac{\partial ψ_{1}}{\partial x} (x, t) = ψ_{2} (x, t) = \lim_{h \to 0} (ψ_{2, h} (x, t) + ψ_{2, - h} (x, t)) .$ (19)

This proves that for all $h, x \in ℝ$, t ∈ [0, ∞) such that $| h | ≪ 1$ it holds that

$ψ_{2, h} (x, t) \approx \frac{1}{2} \frac{\partial^{2} f}{\partial x^{2}} (x, t) \quad \text{and} \quad ψ_{2, - h} (x, t) \approx \frac{1}{2} \frac{\partial^{2} f}{\partial x^{2}} (x, t) .$ (20)

We now recapitulate the above arguments as an extension of the classical Carathéodory theorem (compare with, e.g., [40, Theorem 6.1.5]).

Theorem 1. Let $c \in ℝ$, t ∈ [0, ∞) and let $f : ℝ \times [0, \infty) \to ℝ$ be a function. Then f is twice differentiable with respect to the variable x at the point (c, t) if and only if there exist $ψ_{i} : ℝ \times [0, \infty) \to ℝ$, i ∈ {1, 2}, such that it holds that

$f (x, t) - f (c, t) = ψ_{1} (x, t) (x - c) + ψ_{2} (x, t) {(x - c)}^{2} .$ (21)

Moreover, in this case it holds that

$\lim_{x \to c} ψ_{1} (x, t) = \frac{\partial f}{\partial x} (c, t) \quad \text{and} \quad \lim_{x \to c} ψ_{2} (x, t) = \frac{1}{2} \frac{\partial^{2} f}{\partial x^{2}} (c, t) .$ (22)

Proof of Theorem 1. We omit the details of this proof for brevity. However, the result follows from arguments analogous to those used to generate Equations (16)–(20). The proof of Theorem 1 is thus completed.

The alternative derivation of Equation (13) follows directly from Theorem 1 (combined with the previous results outlined in Section 3.2). It is worth noting that the alternative derivation presented above only requires the function f to be twice differentiable with respect to x and once differentiable with respect to t. In fact, Theorem 1 can be generalized to situations where f possesses even less regularity. However, we leave these considerations for future endeavors.
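To make the statement of Theorem 1 concrete, the sketch below builds one admissible pair of functions for the exponential function at a fixed time. The choice made here, freezing the first function at the value of the derivative at c and letting the second absorb the remainder, is an assumption for illustration only; the theorem does not single out this pair, and the helper name `caratheodory_pair` is hypothetical.

```python
import math

def caratheodory_pair(f, dfdx_c, c):
    """One admissible (psi1, psi2) for Equation (21), with psi1 frozen at f'(c).

    psi2 is defined for x != c; its value at c is understood as the limit.
    """
    def psi1(x):
        return dfdx_c

    def psi2(x):
        return (f(x) - f(c) - dfdx_c * (x - c)) / (x - c) ** 2

    return psi1, psi2

c = 0.5
psi1, psi2 = caratheodory_pair(math.exp, math.exp(c), c)

rows = []
for dx in (1e-1, 1e-2, 1e-3):
    x = c + dx
    lhs = math.exp(x) - math.exp(c)                          # left side of Equation (21)
    rhs = psi1(x) * (x - c) + psi2(x) * (x - c) ** 2         # right side of Equation (21)
    rows.append((dx, abs(lhs - rhs), psi2(x)))

half_second_derivative = 0.5 * math.exp(c)                   # limit predicted by Equation (22)
```

As dx shrinks, the third entry of each row approaches `half_second_derivative`, in line with Equation (22), while the identity in Equation (21) holds up to rounding for every dx.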

4 CLOSED-FORM SOLUTIONS TO THE EINSTEIN EQUATION IN APPLICATION TO GUI CLASSIFICATION VIA RETRIEVAL TIME

As should be clear from Section 3, we intend to use the Einstein model of random jumps to interpret the results of experimental observations. This approach stands in stark contrast to the traditional approach of employing Fick's law, flux conservation laws, and the thermodynamical law that density is proportional to the mass concentration function (see, for example, the classical work in Reference 45). The arguments in this section motivate the novel approach with the particular case of GUI classification via retrieval time in mind.

4.1 Development of closed-form solutions to the Einstein equation

To accomplish our task, we assume that the number of molecules which form a compound of interest is proportional to the molecule's mass and that this molecular mass can be adequately represented by a scalar-valued function, say $C : ℝ \times [0, \infty) \to ℝ$. In our particular case, we are interested in the setting where the domain of the process can be modeled via a one-dimensional tube. Moreover, we assume that there are $n \in ℕ$ molecules of interest. For each i ∈ {1, 2, … , n} we assume that the associated molecular weight functions, $C_{i} : ℝ \times [0, \infty) \to ℝ$, satisfy Equation (13) (with f(x, t) ← C_i(x, t) for each i ∈ {1, 2, … , n} in the notation of Equation (13)). Next, we assume that for all i ∈ {1, 2, … , n} it holds that D_i, $ω_{i}$, and $γ_{i}$ are constant. Furthermore, for each i ∈ {1, 2, … , n} we have the associated Green's function which satisfies for all $x \in ℝ$, t ∈ (0, ∞) that

$C_{i} (x, t) = (4 π D_{i})^{- \frac{1}{2}} \exp \left( - \frac{γ_{i}}{2 D_{i}} x + \left( ω_{i} - \frac{γ_{i}^{2}}{4 D_{i}} \right) t - \frac{1}{2} \ln (t) - \frac{| x |^{2}}{4 D_{i} t} \right) .$ (23)

This and straightforward calculus ensure that for all i ∈ {1, 2, … , n}, $x \in ℝ$, t ∈ (0, ∞) it holds that

$\left( \frac{\partial}{\partial t} - D_{i} \frac{\partial^{2}}{\partial x^{2}} + γ_{i} \frac{\partial}{\partial x} - ω_{i} \right) C_{i} (x, t) = 0 .$ (24)

Note that we can interpret the C_i(x, t), i ∈ {1, 2, … , n}, as analytic representations of the concentration function of the molecules which are injected into a sample of N-glycans (which are subsequently transported through the tube for classification via a mass spectrometer).

Next assume that L ∈ (0, ∞) is the point of observation (signal recording), the tube has infinite length (that is, we assume the domain of the problem to be $ℝ$), and that for all i ∈ {1, 2, … , n} the initial concentration is modeled by $δ_{0} (x)$ (the standard delta function). This and Equation (23) demonstrate that for all i ∈ {1, 2, … , n}, t ∈ [0, ∞) it holds that

$C_{i} (L, t) = \begin{cases} (4 π D_{i})^{- \frac{1}{2}} \exp \left( - \frac{γ_{i}}{2 D_{i}} L + \left( ω_{i} - \frac{γ_{i}^{2}}{4 D_{i}} \right) t - \frac{1}{2} \ln (t) - \frac{L^{2}}{4 D_{i} t} \right) & : t > 0 \\ 0 & : t = 0 \end{cases} .$ (25)
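As a numerical aid, Equation (25) can be evaluated directly. The following Python sketch is an illustration rather than the authors' implementation; the function name `concentration_at_observer` and the parameter values used to exercise it are assumptions introduced here.

```python
import math


def concentration_at_observer(t, L, D, gamma=0.0, omega=0.0):
    """Evaluate C_i(L, t) as written in Equation (25) for constant D, gamma, omega."""
    if t <= 0.0:
        return 0.0  # Equation (25) sets the value at t = 0 to zero
    exponent = (
        -gamma / (2.0 * D) * L
        + (omega - gamma**2 / (4.0 * D)) * t
        - 0.5 * math.log(t)
        - L**2 / (4.0 * D * t)
    )
    return math.exp(exponent) / math.sqrt(4.0 * math.pi * D)


# Hypothetical parameters: no drift, no absorption, L = 1, D = 1/2.
value = concentration_at_observer(1.0, L=1.0, D=0.5)
```

With these illustrative parameters the expression reduces to the standard normal density evaluated at 1, which offers a quick sanity check of the implementation.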

4.2 Employing closed-form solutions to the Einstein equation to classify GUI via retrieval time

Recall that we have experimental data from trials employing pure samples of each GUI. Thus, we will use this data to determine explicitly the coefficients D_i, i ∈ {1, 2, … , n}, for all components of the N-glycans and the GUIs. In order to find the extreme values at the point of observation, we differentiate Equation (25) to obtain for each i ∈ {1, 2, … , n}, t ∈ (0, ∞) that

$\frac{\frac{\partial C_{i}}{\partial t} (L, t)}{C_{i} (L, t)} = ω_{i} - \frac{γ_{i}^{2}}{4 D_{i}} + \frac{L^{2}}{4 D_{i} t^{2}} - \frac{1}{2 t} \quad \text{and} \quad \frac{\frac{\partial^{2} C_{i}}{\partial t^{2}} (L, t)}{C_{i} (L, t)} = - \frac{L^{2}}{2 D_{i} t^{3}} + \frac{1}{2 t^{2}} ,$ (26)

where the second identity is understood at critical points of t ↦ C_i(L, t), that is, at those t for which the first expression vanishes.

Combining this with the assumption that $Δ_{e} \approx 0$ (no drift; cf. Remark 1) and the assumption that for all i ∈ {1, 2, … , n} it holds that $ω_{i} = 0$ (no absorption) yields that for all i ∈ {1, 2, … , n} it holds that

$D_{i} = \frac{1}{2} \left( \frac{L^{2}}{t_{\max}^{i}} \right) ,$ (27)

where for each i ∈ {1, 2, … , n} it holds that $t_{\max}^{i} \in (0, \infty)$ is the maximal critical point from Equation (26). Combining Equation (27) with experimental data will allow for the determination of the D_i, i ∈ {1, 2, … , n}, by letting the retrieval time for each pure sample correspond to the $t_{\max}^{i}$, i ∈ {1, 2, … , n}.
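To spell out the step from Equation (26) to Equation (27): with no drift and no absorption, the first identity in Equation (26) vanishes exactly when L²/(4 D_i t²) = 1/(2t), that is, at t = L²/(2 D_i). The Python sketch below illustrates this under those assumptions; the function names, the choice L = 1, and the retrieval times are hypothetical and are not taken from the experimental data.

```python
def diffusion_from_retrieval_time(t_max, L=1.0):
    """Equation (27): D_i = L^2 / (2 t_max^i), with drift and absorption neglected."""
    if t_max <= 0.0:
        raise ValueError("retrieval time must be positive")
    return L**2 / (2.0 * t_max)


def log_derivative(t, L, D, gamma=0.0, omega=0.0):
    """First identity of Equation (26): (dC_i/dt)(L, t) / C_i(L, t)."""
    return omega - gamma**2 / (4.0 * D) + L**2 / (4.0 * D * t**2) - 1.0 / (2.0 * t)


# Hypothetical retrieval times (in minutes) for three pure samples.
retrieval_times = [10.0, 20.0, 40.0]
coefficients = [diffusion_from_retrieval_time(t) for t in retrieval_times]

# Each retrieval time should be a critical point of t -> C_i(L, t).
residuals = [log_derivative(t, 1.0, D) for t, D in zip(retrieval_times, coefficients)]
```

Doubling the retrieval time halves the computed diffusion coefficient, which is the inverse proportionality expressed by Equation (27).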

Note that Figure 3 presents a graph of the actual spectrometer signals, with associated peaks, which correspond to the molecules of GUI passing through the receiver of the mass spectrometer (obtained via the methods outlined in Section 5.2). As we can see, there are eight distinct GUIs and they each have different retrieval times. Figure 4(A) presents the calculated diffusion coefficients D_i, i ∈ {1, 2, … , n}, for each retention time. Figure 4(A,B) together demonstrate the expected inversely proportional relationship between retention time and diffusivity.

The intuition provided by the Einstein paradigm tells us that the molecule's retrieval time should depend on the molecule's time of free travel and that the value should increase proportionally with the molecule's mass. This observation is clearly supported by our data (see Figure 4), as the mass of the GUI molecules increases with their index value. Furthermore, our post-processing (which can be performed using our analytical formula for pure samples) confirms this intuitive observation and Einstein's theory that the diffusion coefficients are inversely proportional to $τ$ (cf. Assumption 2).

Indeed, consider two GUIs, say, GUI₁ and GUI₂. These two GUIs correspond to the molecules M₁ and M₂, respectively. Since these are distinct types of glucose molecules, it is well-known that the molecules will not develop novel chemical bonds throughout the experiment. Therefore, their “free jump” lengths are mutually independent. The “free jumps” of molecule M₁ correspond to $τ_{1}$ and the “free jumps” of molecule M₂ correspond to $τ_{2}$ (cf. Assumption 2). If we let C₁(x, t) and C₂(x, t) denote the concentrations at position x and time t of the molecules M₁ and M₂, respectively, then both of these functions will satisfy Equation (13) (each with their appropriate associated parameters). This intuition combined with Einstein's paradigm and the assumptions outlined above then implies that for all i ∈ {1, 2, … , n} it holds that

$D_{i} = \frac{1}{2 τ_{i}} \int_{- \infty}^{\infty} (Δ - Δ_{e})^{2} φ (Δ) \, d Δ \quad \text{and} \quad γ_{i} = \frac{1}{τ_{i}} \int_{- \infty}^{\infty} Δ φ (Δ) \, d Δ ,$ (28)

where $φ (Δ)$ is the frequency at which free jumps of the length $Δ$ occur. Note that in this formulation we have assumed that the C_i, i ∈ {1, 2, … , n}, are associated to each molecule M_i, i ∈ {1, 2, … , n}, and no novel molecular bonds are formed. Therefore, the $Δ_{i}$, i ∈ {1, 2, … , n}, are the lengths of the associated “free jumps” of the entire compound of molecules of type M_i, i ∈ {1, 2, … , n}. We also assumed that the expected length of free jumps was the same for all i ∈ {1, 2, … , n}; that is, we assumed that $Δ_{i, e} = Δ_{e}$, i ∈ {1, 2, … , n} (cf. Remark 1). Moreover, the molecules are arranged by their molecular weight as their lengths are proportional to i ∈ {1, 2, … , n}. Clearly, a higher molecular mass is associated to smaller “free jump” lengths. Mathematically, this means for all i, j ∈ {1, 2, … , n} with i < j it holds that $τ_{i} < τ_{j}$.

Note that the experimental observations presented in Figure 4 support precisely this claim. Further observe that the velocity of the filtration (drift) is so small that diffusion is the dominant transport property. Indeed, if drift had a larger influence than diffusion, then the decrease of the retrieval time with respect to the GUI mass would be linear. Since Figure 4(B) clearly indicates an inversely proportional relationship, we conclude that diffusion is dominant. This is an important observation as drift mainly depends upon the boundary conditions of a problem whereas diffusion depends only on object versus tube structure properties. This crucial observation is what allows for the proposed classification method to work so well. Finally, we mention that in this preliminary result we have also ignored the associated absorption rates. However, classification methods for complex serum samples will consider these effects.
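The contrast between the diffusion-dominated and drift-dominated regimes can be made concrete. Setting the first identity in Equation (26) to zero with $ω_{i} = 0$ gives the quadratic $γ_{i}^{2} t^{2} + 2 D_{i} t - L^{2} = 0$, whose positive root can be written as $t = L^{2} / (D_{i} + \sqrt{D_{i}^{2} + γ_{i}^{2} L^{2}})$. The Python sketch below is illustrative only and is not part of the experimental analysis; the function name `peak_time` and all numerical values are hypothetical.

```python
import math


def peak_time(L, D, gamma=0.0):
    """Positive root of gamma^2 t^2 + 2 D t - L^2 = 0 (Equation (26) with omega = 0).

    Written in a form that stays finite when gamma = 0.
    """
    return L**2 / (D + math.sqrt(D**2 + (gamma * L) ** 2))


# Diffusion-dominated regime (no drift): recovers t = L^2 / (2 D), cf. Equation (27).
t_diffusion = peak_time(1.0, 0.05)

# Drift-dominated regime (hypothetical values): the peak time approaches L / |gamma|
# and becomes nearly independent of the diffusion coefficient.
t_drift_a = peak_time(1.0, 1e-6, gamma=2.0)
t_drift_b = peak_time(1.0, 2e-6, gamma=2.0)
```

In the first regime the peak time is inversely proportional to the diffusion coefficient, whereas in the second it is essentially fixed by the drift velocity, which is why an inversely proportional trend in the data points to diffusion as the dominant transport property.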

Remark 6. From the Einstein paradigm it follows that the calculated diffusion coefficients are inversely proportional to the “free jump times” $τ$ of each molecule (cf. Equation 2). Together with Equation (27), this provides an explicit relationship between retrieval time and the “free jumps” for molecule M_i, i ∈ {1, 2, … , n}: since the diffusion coefficient is inversely proportional to both the retrieval time and $τ$, smaller retrieval times correspond to smaller $τ$.

Note that further refinement of the classification criteria may come from information which is hidden in the dynamics of the intensity of the signal prior to the retrieval time. In our future research, we intend to generalize our procedure by further incorporating the area under the graph of the signal, prior to retrieval time, for each molecule. Such considerations result in a need for better understanding the following functional

$I (t) = \frac{d}{d t} \ln \left( \int_{0}^{t} C (L, τ) \, d τ \right) = \frac{C (L, t)}{\int_{0}^{t} C (L, τ) \, d τ} .$ (29)

From this it is clear that if T_ret is the retrieval time, then Equation (29) evaluates to (T_ret)⁻¹ if we employ a (very) rough numerical approximation of the integral which uses a single rectangle of width T_ret and height C(L, T_ret). This of course agrees with Equation (27). In this regard, one can see that Equation (29) is a true generalization of Equation (27).
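The functional in Equation (29) can also be approximated on a sampled signal. The following Python sketch is a minimal illustration, assuming the first sample is taken at t = 0 and using the trapezoidal rule for the running integral; the function name `signal_functional` and the test signal are hypothetical.

```python
def signal_functional(times, intensities):
    """Approximate I(t) = C(L, t) / int_0^t C(L, s) ds on a sampled signal.

    Returns a list aligned with times[1:]. The running integral uses the
    trapezoidal rule; entries with zero accumulated area are reported as None.
    """
    values = []
    area = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        area += 0.5 * (intensities[k] + intensities[k - 1]) * dt
        values.append(intensities[k] / area if area > 0.0 else None)
    return values


# A constant (rectangular) signal: the functional equals 1 / t exactly.
rectangle = signal_functional([0.0, 1.0, 2.0], [3.0, 3.0, 3.0])
```

For a constant signal the area up to time t is the height times t, so the functional reduces to 1/t; this is the one-rectangle approximation described above.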

5 SERUM CLASSIFICATION USING GUIS AS MARKERS

We will use the general ideas of Remark 6 as a basis for the development of our classification algorithm using GUIs as markers. The basic idea of the algorithm (which was outlined in the Proposed Algorithm) consists of determining the diffusivity coefficients of “marker” molecules in a base sample of interest (cf. Equation 13). The remaining molecules are classified by grouping them based on diffusivity coefficient ranges and computing the associated absorption coefficients, which serve as a correction term of sorts (cf. Equation 13). We assume throughout that drift coefficients are negligible (cf. Equation 13). This assumption is justified due to the intended use of post-processing in the actual physical experiment.

5.1 Serum classification algorithm

The proposed algorithm for the classification of a given unknown sample using two parameters, diffusivity and absorption, consists of six main steps. Note that throughout this section the primary focus is the classification of N-glycans.

Main Algorithm. Let $N \in ℕ$ and consider the set of samples $𝒜 = \{A_{0}, A_{1}, \dots, A_{N}\}$.

Step 1
Select a base sample, which without loss of generality we assume to be A₀. Let $n_{0} \in ℕ$ represent the number of molecules in A₀. Inject the “marker molecules,” or GUIs (which we designate as GUI₁, GUI₂, … , GUI₈), into the sample A₀. Without loss of generality, assume that the GUIs are indexed with respect to increasing mass. Collect all retention times $T_{i}^{0}$, i ∈ {1, 2, … , n₀}, from the mass spectrometer. Identify (manually) which retention times are associated to the GUIs.
Step 2
Using the retention times from Step 1, calculate the diffusion coefficients for each GUI, which we designate as $D_{i}^{0}$ , i ∈ {1, 2, … , 8}, via Equation (27). Set the associated absorption coefficients, which we designate as $ω_{i}^{0}$ , i ∈ {1, 2, … , 8}, to zero. This completes the baseline classification procedure (if desired, one can classify the remaining objects in A₀).
Step 3
Take a new (unknown) sample, which without loss of generality we assume to be A₁, and again inject the GUI molecules from Step 1. Let $n_{1} \in ℕ$ represent the number of molecules in A₁ (after injection with GUI molecules). Pass A₁ through the mass spectrometer and again collect all retention times, $T_{i}^{1}$ , i ∈ {1, 2, … , n₁}. Using Equation (27) compute all associated diffusivity coefficients, $D_{i}^{1}$ , i ∈ {1, 2, … , n₁}.
Step 4
Using the results from Step 2 and Step 3, find the indices i₁, i₂, … , i₈ ∈ {1, 2, … , n₁} which for each j ∈ {1, 2, … , 8} satisfy that
$i_{j} = \operatorname{arg\,min}_{k \in \{1, 2, \dots, n_{1}\} ∖ \{i_{1}, i_{2}, \dots, i_{j - 1}\}} | D_{k}^{1} - D_{j}^{0} | .$ (30)
Note that Equation (30) is well-defined for all j ∈ {1, 2, … , 8} by construction. The molecules associated with the indices i₁, i₂, … , i₈ are the GUI markers GUI₁, GUI₂, … , GUI₈. Let $𝔻^{1}, 𝕋^{1} \in ℝ^{8}$ satisfy that
$𝔻^{1} = \{D_{i_{1}}^{1}, D_{i_{2}}^{1}, \dots, D_{i_{8}}^{1}\} \quad \text{and} \quad 𝕋^{1} = \{T_{i_{1}}^{1}, T_{i_{2}}^{1}, \dots, T_{i_{8}}^{1}\} .$ (31)
Finally, let $ω_{i_{j}}^{1}$, j ∈ {1, 2, … , 8}, satisfy for all j ∈ {1, 2, … , 8} that $ω_{i_{j}}^{1} = 0$.
Step 5
We now use Equation (31) to classify the remaining molecules from A₁. Note that $\max_{i \in \{1, 2, \dots, n_{1}\}} T_{i}^{1} \in 𝕋^{1}$, that is, the largest retention time belongs to a GUI marker. For all k ∈ {1, 2, … , n₁} $∖$ {i₁, i₂, … , i₈} we compute the associated $ω_{k}^{1}$ as
$ω_{k}^{1} = \begin{cases} (T_{k}^{1})^{- 1} [1 - (2 T_{k}^{1} D_{i_{1}}^{1})^{- 1}] & : 0 < T_{k}^{1} < T_{i_{1}}^{1} \\ (T_{k}^{1})^{- 1} [1 - (2 T_{k}^{1} D_{i_{j + 1}}^{1})^{- 1}] & : T_{i_{j}}^{1} < T_{k}^{1} < T_{i_{j + 1}}^{1}, j \in \{1, 2, \dots, 7\} \end{cases}$ (32)
(cf. Equation 26).
Step 6
Repeat Step 3, Step 4, and Step 5 for the remaining $A_{i} \in 𝒜$ , i ∈ {2, 3, … , N}.
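Steps 2 through 5 of Main Algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes L = 1, uses three hypothetical markers instead of eight, resolves the minimization in Step 4 greedily in marker order, and assumes that the diffusion coefficient in the second case of Equation (32) is that of the next marker, $D_{i_{j + 1}}^{1}$. All function names and retention times are invented for the example.

```python
def diffusion_coefficient(retention_time, L=1.0):
    """Equation (27), with drift and absorption neglected."""
    return L**2 / (2.0 * retention_time)


def match_markers(sample_D, baseline_D):
    """Step 4: assign each baseline marker to the nearest unused coefficient."""
    used = set()
    matches = []
    for d0 in baseline_D:
        candidates = [k for k in range(len(sample_D)) if k not in used]
        best = min(candidates, key=lambda k: abs(sample_D[k] - d0))
        used.add(best)
        matches.append(best)
    return matches


def absorption_coefficients(sample_T, sample_D, marker_idx):
    """Step 5: Equation (32) for non-marker molecules; markers get omega = 0."""
    markers = sorted(marker_idx, key=lambda k: sample_T[k])
    omega = {k: 0.0 for k in marker_idx}
    for k, T in enumerate(sample_T):
        if k in omega:
            continue
        # The first marker with a larger retention time supplies D.
        upper = next((m for m in markers if sample_T[m] > T), None)
        if upper is None:
            continue  # Equation (32) assumes the largest time is a marker
        omega[k] = (1.0 - 1.0 / (2.0 * T * sample_D[upper])) / T
    return omega


# Hypothetical data: marker retention times from A_0 and all times from A_1.
baseline_T = [10.0, 20.0, 40.0]
baseline_D = [diffusion_coefficient(T) for T in baseline_T]
sample_T = [10.5, 15.0, 19.0, 30.0, 41.0]
sample_D = [diffusion_coefficient(T) for T in sample_T]

marker_idx = match_markers(sample_D, baseline_D)
omega = absorption_coefficients(sample_T, sample_D, marker_idx)
```

In this toy data set the molecules at positions 0, 2, and 4 are identified as the markers, and the two remaining molecules receive absorption coefficients computed from the marker that follows them in retention time.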

It is worth noting that only the diffusivity coefficients calculated from Step 1 and Step 2 need to be stored for future use. This data set serves as the baseline learning procedure for the data-driven classification algorithm. Thus, while the initial manual classification can be tedious, it results in an algorithm which can be used to classify a large number of other unknown samples (of appropriate type).

Remark 7. Main Algorithm can be significantly improved if we consider the classification index to be a time series instead. In this case, we may consider Equation (29) as the basis for classification. When considering a serum containing only the “marker” molecules, it follows that t is the retrieval time in Equation (29). An analogous (but improved) algorithm can then be obtained by approximating the integral implicitly (for numerical stability) and employing a Newton–Raphson-type method to solve for the desired parameters. As noted earlier, this will be a focus of forthcoming work.

5.2 Description of the experimental set-up and materials used

In this section, we briefly outline the materials and experimental protocols followed in order to obtain our experimental data, which is used for comparison.

5.2.1 Material

Standard glycoproteins, fetuin and ribonuclease B (RNase B), and pooled human blood serum (HBS) were purchased from Sigma Aldrich (St. Louis, MO). Formic acid (FA), borane-ammonia, dimethyl sulfoxide (DMSO), iodomethane, and sodium hydroxide beads were also obtained from the same vendor. HPLC grade water was obtained from Avantor Performance Materials (Center Valley, PA). HPLC grade acetonitrile (ACN), methanol, and ethyl alcohol were supplied by Fisher Scientific (Fair Lawn, NJ). PNGase F enzyme and 10X G7 buffer (0.5 M phosphate buffer saline) were purchased from New England Biolabs.

5.2.2 Sample preparation

Model Glycoproteins and Dextrin: 20 μg each of fetuin and RNase B were mixed with G7 buffer to get a final concentration of 20 mM for the buffer. The samples were then denatured at 90°C for 30 minutes. Samples were then cooled at room temperature and treated with 1.0 μL of PNGase F, followed by incubation at 37°C for 18 hours. PNGase F digestion was followed by precipitation of de-N-glycosylated proteins with 90% ethanol at −20°C. Reduction of the reducing ends of the purified glycans was done by addition of 10 μL of borane-ammonia complex (10 μg/μL) and incubation in a 60°C water bath for one hour. Methanol was later used to remove borane in the form of borate from the reduced glycan samples. The methanol washing step was repeated three times to ensure the complete removal of borate from the samples. Reduction was then followed by permethylation of the samples, using a previously reported method.¹⁶ For this purpose, reduced and dried glycan samples were resuspended in 1.2 μL and 30 μL of DMSO. Later, 20 μL of iodomethane was added to the samples and they were loaded on DMSO-soaked sodium hydroxide beads packed in spin columns. The spin columns were washed with 200 μL of DMSO, using a centrifuge at 1800 rpm for two minutes, prior to the loading of samples.

Once loaded, the samples were incubated at room temperature for 2 minutes. After 25 minutes, an additional 20 μL of iodomethane was added and the samples were again incubated at room temperature for 15 minutes. Permethylated glycans were then collected by centrifugation at 1800 rpm for two minutes. For complete elution of permethylated glycans, 30 μL of ACN was added to the spin columns and again the eluants were collected by centrifugation. Permethylated glycans were further dried and resuspended in 20% ACN and 0.1% FA. Each of the samples was run in triplicate and 1 μg of sample was injected for each run. Dextrin standard was mixed with the samples prior to reduction and, therefore, was reduced and permethylated with each sample. 1 μg of sample was spiked with 100 ng of dextrin.

5.2.3 Human blood serum

10 μL of human blood serum was mixed with 90 μL of G7 buffer to get a final concentration of 20 mM for the buffer. Proteins from the samples were denatured in a 90°C water bath for 30 minutes. After cooling at room temperature, 1.2 μL of PNGase F was added to the samples. They were then incubated at 37°C for 18 hours. After the completion of the incubation, proteins were precipitated at −20°C for one hour. Reduction and permethylation were then performed as described previously for model glycoproteins. Resuspension was again done in 20% ACN and 0.1% FA. 1 μL of the serum samples was then injected for each of the triplicate runs.

5.2.4 Liquid chromatography conditions

Chromatography was performed on an UltiMate 3000 Nano UHPLC system using a C18 column. The oven temperature was kept at 55°C. A solution of 98% water, 0.2% ACN, and 0.1% FA was utilized as mobile phase A, while 100% ACN with 0.1% FA was mobile phase B. Initially, the gradient was set at 20% mobile phase B. It was then increased to 42% in 11 minutes. After 48 minutes, it was increased to 55% and then changed to 90% at 49 minutes. It remained at 90% until 54 minutes of the total sample run and then dropped to 20% again for equilibration of the column for the final 6 minutes.

5.2.5 Mass spectrometry conditions

LTQ Orbitrap Velos (Thermo Scientific) was used to analyze the samples. The mass spectrometer was set to the positive ion mode with an ESI voltage of 1.6 kV. Full MS was performed at 100,000 resolution with 200–2000 m/z scan range. MS2 was acquired with collision induced dissociation (CID) and higher energy collision dissociation (HCD) with normalized dissociation energies of 30% and 45%, respectively. Activation Q (one of the parameters used in MS methods—the value controls the radio frequency applied to control fragmentation of ions during analysis) was 0.25. Injection time was 10 ms. Repeat count of dynamic exclusion and repeat duration were 2 and 30 s, respectively. The exclusion duration was 60 s. The four most intense ions were selected from the full MS for further CID and HCD based dissociation by applying data-dependent acquisition mode. The precursor ion selection window was 1.50. The MS2 intensity threshold was 5000 counts. Singly charged ions were excluded for MS2.

5.2.6 Data analysis

The extracted ion chromatograms (EIC) of full MS data were used to determine the glycan composition as well as the retention times of reduced and permethylated glycans derived from model glycoproteins and human blood serum, with a mass tolerance of 10 ppm. Retention times of reduced and permethylated glucose units were also determined using the EIC.
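For reference, a mass tolerance expressed in parts per million is a relative error: an observed m/z value matches a theoretical one when their absolute difference, divided by the theoretical value and multiplied by 10⁶, does not exceed the tolerance. The short Python sketch below illustrates this check; the function name `within_ppm` and the example masses are hypothetical and do not come from the data set.

```python
def within_ppm(observed_mz, theoretical_mz, tolerance_ppm=10.0):
    """Return True when the relative mass error is within the tolerance (in ppm)."""
    error_ppm = abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return error_ppm <= tolerance_ppm


# Hypothetical values: roughly a 5 ppm error passes, a 20 ppm error does not.
accepted = within_ppm(1000.005, 1000.0)
rejected = within_ppm(1000.02, 1000.0)
```

At m/z 1000, a 10 ppm window corresponds to an absolute deviation of 0.01.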

5.3 Data classification of an actual experiment using the proposed algorithm

We now demonstrate Main Algorithm and its efficacy through an experimental example. We will use data obtained via the methods and procedures outlined in Section 5.2. To obtain the data in Figures 5-7 below we implemented our algorithm for a particular set of experiments (obtained via the methods outlined in Section 5.2). In the first table (left table) in Figure 5, the GUIs were known (green rows). This data is then used according to Main Algorithm to identify the eight GUI markers in five other (unknown) experiments via the same instrumentation. We later used precise verification methods to determine that Main Algorithm can distinguish GUIs with an error of no more than two percent.

As noted above, the left table in Figure 5 consists of the data obtained from mass spectrometer readings for the sample A₀ (cf. Main Algorithm) and the calculated absorption coefficients for that sample. The first column of the table contains the retention time data. The second column contains the corresponding calculated diffusion coefficients. The last column contains the calculated absorption coefficients. The green rows in each table represent the data corresponding to the injected GUI molecules. As mentioned above, for the sample A₀, manual intervention to the mass spectrometer data is mandatory for finding the GUI retention times. See Section 5.2 for more details.

The second table in Figure 5 contains the data that was obtained from the mass spectrometer after the experiment for sample A₁ (cf. Main Algorithm) and implementation of Main Algorithm for that sample. The first column of the table contains the retention times for the sample. The second column contains the corresponding calculated diffusion coefficients. The third column contains the coefficients of the GUI molecules from sample A₀ (used for classification). The last column contains the calculated absorption coefficients. The tables in Figures 6 and 7 may be interpreted in a similar manner as the second table of Figure 5.

Remark 8. As is well understood, manual classification of molecular structures via the mass spectrometer is a time-consuming and difficult task. However, Main Algorithm is demonstrated to be able to successfully identify GUI molecules in all experimental samples and also classify the remaining molecular structures using only retention times (after the initial classification of the “marker” molecules).

6 CONCLUSIONS AND FUTURE ENDEAVORS

In this article, we developed a novel mathematical method which allows for an efficient classification of experimental samples using a GUI as the reference frame. These interpretations were associated directly to the experimental spectrometer data in particular examples. In order to develop the novel method, we presented a data-driven partial differential equation model based on modified Einstein paradigm arguments. This extends Einstein's original study of Brownian motion to the situation of a more general conservation law. Once this data-driven model is obtained, we develop a closed-form solution of the model in order to avoid numerical approximations. A simple learning procedure is performed in order to determine the solution's coefficients on an initial sample. These coefficients are then used in additional learning procedures for computing the coefficients associated with unknown samples.

In order to further justify our method, we provide physical interpretations of the model coefficients. These coefficients are shown to be related to the experimental retrieval times, as well. This serves as the basis for the novel algorithm used for data classification (cf. Main Algorithm). Moreover, the proposed algorithm is successfully implemented and its efficacy is demonstrated via examples. Through the consideration of six independent samples we demonstrate that our method successfully classifies the unknown samples with errors which do not exceed two percent. Moreover, our novel method can be shown to be ten percent more accurate than the traditional method of using retrieval time only for classification.

It is important to mention that the retrieval time and shape of the peak in the spectrometer data depend on three main parameters: drift (velocity), variance (diffusion), and absorption. In this article we demonstrated that if the drift is fixed, due to reprocessing of the data, then the diffusion and absorption coefficients can accurately classify molecules in an unknown sample. Our future endeavors will focus on including all three characteristics for the N-glycan classification. This inclusion will make use of iterative deep learning algorithms of Kolmogorov-type (see, for instance, References 46-48).

Biographies

Joshua Lee Padgett is an Assistant Professor in the Department of Mathematical Sciences at the University of Arkansas. Prior to joining the University of Arkansas, Josh was a postdoc in the Department of Mathematics and Statistics at Texas Tech University. Josh earned his Ph.D. in mathematics at Baylor University under the supervision of Qin Sheng. Josh received his B.S. in Mathematics from Gardner-Webb University where his undergraduate research focused on the metabolic features of tumor cells and the occurrence of the so-called Warburg effect. While there, Josh was also a member of the Track and Field team (competing in the javelin and hammer throw). Josh is also an affiliated faculty member at the Center for Astrophysics, Space Physics, and Engineering Research and is an honorary adjunct faculty member at Texas Tech University.
Josh Padgett's research lies at the intersection of pure, applied, and computational mathematics. Current research interests include applied mathematics, numerical analysis, geometric and Lie group integration methods, mathematics of deep learning, operator splitting methods, algebraic structures of numerical methods, fractional differential equations, stochastic differential equations, and the use of spectral and operator theory in theoretical and experimental physics.
Yusup Geldiyev completed his B.S. and M.S. degrees in mathematics at TTU. He is currently a first-year Ph.D. student at UTD. He has participated in poster competitions and given numerous seminar talks. In short, his goal is to become a professional mathematician and enjoy the process along the way.
Sakshi Gautam is a PhD candidate in the Department of Chemistry & Biochemistry at Texas Tech University, working in the research group of Dr. Yehia Mechref. Her research focuses on the development of LC-MS based analytical methods for efficient separation and identification of isomeric glycans and glycopeptides from biological samples.
Dr. Wenjing Peng is currently working as a Research Assistant Professor in the Chemistry and Biochemistry Department, Texas Tech University. Before working at Texas Tech University, Dr. Peng completed his undergraduate program and master's degree at Sichuan University, and his Ph.D. degree at the Chengdu Biological Institute, Chinese Academy of Sciences, China. Dr. Peng is focusing on artificial neural network based process optimization and LC-MS based glycomic and glycoproteomic studies for biomarker discovery of cancer diagnosis and progression. He also brings expertise in design of experiments, microbiology, molecular biology, and separation and purification techniques from more than seven years' experience as a Senior Researcher in the biological medicine developing group of Di'ao Pharmaceutical Company. His current projects include breast cancer brain metastasis biomarker discovery, multiple isotopic labeling technology development, and anti-cancer drug development using liquid-chromatography-tandem mass spectrometry by integrated transcriptomics, glycomics, proteomics and glycoproteomics.
Research Interests: Glycomics, Proteomics, Glycoproteomics, Transcriptomics, Isotopic labeling, Separation, Purification, High Performance Liquid Chromatography (HPLC), Mass Spectrometry (MS), Design of Experiments (DOE), Biomarker Discovery, Microbiology, Fermentation.
Prof. Yehia Mechref is a Paul W. Horn Distinguished Professor in the Department of Chemistry and Biochemistry at Texas Tech University. He is also the Chairman of the Department of Chemistry and Biochemistry and the Director of the Center for Biotechnology and Genomics. He received a B.Sc. in chemistry from the American University of Beirut (Beirut, Lebanon) and a Ph.D. with an honorable mention from Oklahoma State University (Stillwater, OK, USA). Dr. Mechref is a world-renowned scientist.
He is a nationally and internationally recognized scientific leader in the development of sensitive biomolecular mass spectrometry methods enabling qualitative and quantitative assessments of the roles of proteins, glycoproteins, and glycans in biological systems. He has developed a fast, highly sensitive chromatographic method that efficiently separates and distinguishes among gangliosides in blood samples from patients with different diseases of the esophagus, including adenocarcinoma, high-grade dysplasia, and Barrett's Esophagus. His method for analyzing N-linked glycans, using multiple-reaction monitoring (MRM) on a triple quadrupole mass spectrometer, permits quantitative comparison of the expression of N-glycans in brain-targeting breast carcinoma cells and metastatic breast cancer cells. The permethylated N-glycan MRM method he developed yields rapid, reliable identification and relative abundances of glycans from varied, complex biological samples. He has further refined MRM to minimize the tissue needed and optimize the process, making it possible for the first time to profile glycans on the surfaces of tissues and cells, rather than whole-tissue samples, thus conserving limited tissue samples for other studies.
He pioneered a liquid chromatography-mass spectrometer (LC-MS) system to analyze glycomic samples in conventional proteomic laboratories, eliminating the need for more specialized facilities and making glycomic analysis more accessible to the broader scientific community. Dr. Mechref has, as well, completed a comprehensive characterization of bovine collagen glycosylation that may be foundational in developing the biochemical modifications necessary for its use in treating human diseases.
Over the past eight years, Dr. Mechref's research program has received over $9M in funding from the NIH, CPRIT, and CH Foundation, and he currently has five active NIH grants (3R01s, U01 and S10). Dr. Mechref is a prolific writer who has published 29 review articles, 14 book chapters, and 185 peer-reviewed research papers. He has guest-edited 7 journal special issues and he is a member of the editorial boards of numerous journals.
Currently, Dr. Mechref's Google Scholar h-index is 60, with 10,936 citations. He has 11 US patents. He has organized and co-organized numerous symposia and conferences, and is a standing member of the NIH EBIT study section. Dr. Mechref is the recipient of the 2019 TTU President's Academic Achievement Award, the 2019 TTU Nancy J. Bell Faculty Excellence in Mentoring Award, the 2016 Barnie E. Rushing, Jr. Faculty Distinguished Research Award, and the 2015 Barnie E. Rushing, Jr. Faculty Outstanding Research Award.
Akif Ibragimov is working as Assistant Professor in the Department of Mathematics, Atma Ram Sanatan Dharam College, University of Delhi. He has completed his Ph.D. in 2009 in Mathematics (celestial mechanics and space dynamics). His observations and comments in this field reflect his interests in nature, society, and nonlinear dynamics. He has more than 30 research articles in his name in reputed national as well as international journals in the field of celestial mechanics. He is also author of books related to his field. He is an individual member of International Astronomical Union (IAU). He obtained his Master's degree in Mathematics in 1975 from the Azerbaijan Oil Academy. He also interned at the Department of Computational Mathematics and Cybernetics (CMC) of Lomonosov Moscow State University (MSU) in 1974 and 1975 and one year later, in June 1976, Akif defended his PhD thesis at the CMC Department of MSU. During the next 10 years his research was focused on the Qualitative Theory of Partial Differential Equations and in 1985 he defended his dissertation at the Steklov Mathematical Institute of the Russian Academy of Sciences in Moscow, and obtained the degree Doctor of Science in Mathematics. Since then his research interests diversified; he worked at several academic and industrial positions in Russia, Azerbaijan and USA. From 2000-2004, he was a Visiting Professor at Texas A&M University, and Research Scientist at Knowledge Based Systems in College Station. Akif joined Texas Tech in September 2004. His research interests cover broad areas of applied mathematics including fluid flow in porous media, mathematical modeling in bio-medicine, fluid structure interaction, and image processing.
Research Interests: Applied Mathematics, Analysis, Industrial Mathematics, Mathematical Biology, Mathematical Physics, Ordinary Differential Equations, and Partial Differential Equations.

REFERENCES

1 Dong X, Huang Y, Cho BG, et al. Advances in mass spectrometry-based glycomics. Electrophoresis. 2018; 39(24): 3063-3081.
10.1002/elps.201800273
2 Zhu R, Zacharias L, Wooding KM, Peng W, Mechref Y. Glycoprotein enrichment analytical techniques: advantages and disadvantages. Methods in Enzymology. Vol 585. Elsevier; 2017: 397-429.
3 Ohtsubo K, Marth JD. Glycosylation in cellular mechanisms of health and disease. Cell. 2006; 126(5): 855-867.
10.1016/j.cell.2006.08.019
4de Vreede G, Morrison HA, Houser AM, et al. A drosophila tumor suppressor gene prevents tonic TNF signaling through receptor N-glycosylation. Developmental Cell. 2018; 45(5): 595-605.
10.1016/j.devcel.2018.05.012
PubMed Web of Science® Google Scholar
5Rudd PM, Wormald MR, Stanfield RL, et al. Roles for glycosylation of cell surface receptors involved in cellular immune recognition. J Molecul Biol. 1999; 293(2): 351-366.
10.1006/jmbi.1999.3104
CAS PubMed Web of Science® Google Scholar
6Solá RJ, Griebenow K. Effects of glycosylation on the stability of protein pharmaceuticals. J Pharmaceut Sci. 2009; 98(4): 1223-1245.
10.1002/jps.21504
CAS PubMed Web of Science® Google Scholar
7Callewaert N, Schollen E, Vanhecke A, Jaeken J, Matthijs G, Contreras R. Increased fucosylation and reduced branching of serum glycoprotein N-glycans in all known subtypes of congenital disorder of glycosylation I. Glycobiology. 2003; 13(5): 367-375.
10.1093/glycob/cwg040
CAS PubMed Web of Science® Google Scholar
8Cho BG, Veillon L, Mechref Y. N-glycan profile of cerebrospinal fluids from Alzheimer's disease patients using liquid chromatography with mass spectrometry. J Proteome Res. 2019; 18(10): 3770-3779.
10.1021/acs.jproteome.9b00504
CAS PubMed Web of Science® Google Scholar
9Gudelj I, Lauc G. Protein N-glycosylation in cardiovascular diseases and related risk factors. Curr Cardiovasc Risk Rep. 2018; 12(6): 16.
10.1007/s12170-018-0579-4
Web of Science® Google Scholar
10Zhang Y, Yu X, Ichikawa M, et al. Autosomal recessive phosphoglucomutase 3 (PGM3) mutations link glycosylation defects to atopy, immune deficiency, autoimmunity, and neurocognitive impairment. J Allergy Clin Immunol. 2014; 133(5): 1400-1409.
10.1016/j.jaci.2014.02.013
CAS PubMed Web of Science® Google Scholar
11 Adamczyk B, Tharmalingam T, Rudd PM. Glycans as cancer biomarkers. Biochim Biophys Acta (BBA)-Gen Subj. 2012; 1820(9): 1347-1353. doi:10.1016/j.bbagen.2011.12.001
12 Stanley P, Taniguchi N, Aebi M. N-glycans. Essentials of Glycobiology [Internet]. 3rd ed. Cold Spring Harbor Laboratory Press; 2017.
13 Zhou S, Dong X, Veillon L, Huang Y, Mechref Y. LC-MS/MS analysis of permethylated N-glycans facilitating isomeric characterization. Anal Bioanal Chem. 2017; 409(2): 453-466. doi:10.1007/s00216-016-9996-8
14 Hu Y, Shihab T, Zhou S, Wooding K, Mechref Y. LC-MS/MS of permethylated N-glycans derived from model and human blood serum glycoproteins. Electrophoresis. 2016; 37(11): 1498-1505. doi:10.1002/elps.201500560
15 Zhou S, Huang Y, Dong X, et al. Isomeric separation of permethylated glycans by porous graphitic carbon (PGC)-LC-MS/MS at high temperatures. Anal Chem. 2017; 89(12): 6590-6597. doi:10.1021/acs.analchem.7b00747
16 Zhou S, Wooding KM, Mechref Y. Analysis of permethylated glycan by liquid chromatography (LC) and mass spectrometry (MS). High-Throughput Glycomics and Glycoproteomics. New York, NY: Springer; 2017: 83-96. doi:10.1007/978-1-4939-6493-2_7
17 Zhou S, Veillon L, Dong X, Huang Y, Mechref Y. Direct comparison of derivatization strategies for LC-MS/MS analysis of N-glycans. Analyst. 2017; 142(23): 4446-4455. doi:10.1039/C7AN01262D
18 Harvey DJ, Mattu TS, Wormald MR, Royle L, Dwek RA, Rudd PM. “Internal residue loss”: rearrangements occurring during the fragmentation of carbohydrates derivatized at the reducing terminus. Anal Chem. 2002; 74(4): 734-740. doi:10.1021/ac0109321
19 Wuhrer M, Koeleman CA, Hokke CH, Deelder AM. Mass spectrometry of proton adducts of fucosylated N-glycans: fucose transfer between antennae gives rise to misleading fragments. Rapid Commun Mass Spectrom. 2006; 20(11): 1747-1754. doi:10.1002/rcm.2509
20 Varki A. Sialic acids in human health and disease. Trends Mol Med. 2008; 14(8): 351-360. doi:10.1016/j.molmed.2008.06.002
21 Kang P, Mechref Y, Novotny MV. High-throughput solid-phase permethylation of glycans prior to mass spectrometry. Rapid Commun Mass Spectrom. 2008; 22(5): 721-734. doi:10.1002/rcm.3395
22 Ashwood C, Pratt B, MacLean BX, Gundry RL, Packer NH. Standardization of PGC-LC-MS-based glycomics for sample specific glycotyping. Analyst. 2019; 144(11): 3601-3612. doi:10.1039/C9AN00486F
23 Bar-Sinai Y, Hoyer S, Hickey J, Brenner MP. Learning data-driven discretizations for partial differential equations. Proc Nat Acad Sci. 2019; 116(31): 15344-15349. doi:10.1073/pnas.1814058116
24 Narasingam A, Kwon JS-I. Data-driven identification of interpretable reduced-order models using sparse regression. Comput Chem Eng. 2018; 119: 101-111. doi:10.1016/j.compchemeng.2018.08.010
25 Flandrin P, Goncalves P. Empirical mode decompositions as data-driven wavelet-like expansions. Int J Wavelets Multi Inf Process. 2004; 2(04): 477-496. doi:10.1142/S0219691304000561
26 Li J, Sun G, Zhao G, Lehman LH. Robust low-rank discovery of data-driven partial differential equations. Paper presented at: Proceedings of the AAAI Conference on Artificial Intelligence; vol. 34, 2020:767-774.
27 Brunton SL, Proctor JL, Kutz JN. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc Nat Acad Sci. 2016; 113(15): 3932-3937. doi:10.1073/pnas.1517384113
28 Rudy SH, Brunton SL, Proctor JL, Kutz JN. Data-driven discovery of partial differential equations. Sci Adv. 2017; 3(4): e1602614. doi:10.1126/sciadv.1602614
29 Xu H, Chang H, Zhang D. DL-PDE: deep-learning based data-driven discovery of partial differential equations from discrete and noisy data; 2019. arXiv preprint arXiv:1908.04463.
30 Raissi M, Perdikaris P, Karniadakis GE. Physics informed deep learning (part I): data-driven solutions of nonlinear partial differential equations; 2017. arXiv preprint arXiv:1711.10561.
31 Xiong B, Fu H, Xu F, Jin Y. Data-driven discovery of partial differential equations for multiple-physics electromagnetic problem; 2019. arXiv preprint arXiv:1910.13531.
32 Raissi M, Karniadakis GE. Hidden physics models: machine learning of nonlinear partial differential equations. J Comput Phys. 2018; 357: 125-141. doi:10.1016/j.jcp.2017.11.039
33 Berg J, Nyström K. Data-driven discovery of PDEs in complex datasets. J Comput Phys. 2019; 384: 239-252. doi:10.1016/j.jcp.2019.01.036
34 Schaeffer H. Learning partial differential equations via data discovery and sparse optimization. Proc Royal Soc A Math Phys Eng Sci. 2017; 473(2197): 20160446. doi:10.1098/rspa.2016.0446
35 Rudy S, Alla A, Brunton SL, Kutz JN. Data-driven identification of parametric partial differential equations. SIAM J Appl Dyn Syst. 2019; 18(2): 643-660. doi:10.1137/18M1191944
36 Raissi M, Perdikaris P, Karniadakis GE. Machine learning of linear differential equations using Gaussian processes. J Comput Phys. 2017; 348: 683-693. doi:10.1016/j.jcp.2017.07.050
37 Long Z, Lu Y, Ma X, Dong B. PDE-Net: learning PDEs from data. Paper presented at: Proceedings of the International Conference on Machine Learning; 2018:3208-3216.
38 Einstein A. On the theory of the Brownian movement. Ann Phys. 1906; 19(4): 371-381. doi:10.1002/andp.19063240208
39 Christov IC, Ibraguimov A, Islam R. Long-time asymptotics of non-degenerate non-linear diffusion equations; 2020. arXiv preprint arXiv:2002.07937.
40 Bartle RG, Sherbert DR. Introduction to Real Analysis. Vol 2. New York, NY: Wiley; 2000.
41 Padgett JL. Analysis of an approximation to a fractional extension problem. BIT Numer Math. 2020; 60: 715-739. doi:10.1007/s10543-019-00787-y
42 Padgett JL. The quenching of solutions to time–space fractional Kawarada problems. Comput Math Appl. 2018; 76(7): 1583-1592. doi:10.1016/j.camwa.2018.07.009
43 Padgett JL, Kostadinova E, Liaw CD, Busse K, Matthews L, Hyde T. Anomalous diffusion in one-dimensional disordered systems: a discrete fractional Laplacian method. J Phys A Math Theoret. 2020; 53(13): 135205. doi:10.1088/1751-8121/ab7499
44 Kostadinova E, Padgett JL, Liaw CD, Matthews LS, Hyde TW. Anomalous diffusion in semi-crystalline polymer structures; 2020. arXiv preprint arXiv:2006.01068.
45 Landau L, Lifshitz E. Fluid Mechanics Pergamon. Vol 61. New York, NY: Springer; 1959.
46 Kolmogorov AN. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk. Vol 114. Russian Academy of Sciences; 1957: 953-956.
47 Tikhomirov VM. Kolmogorov's work on ϵ-entropy of functional classes and the superposition of functions. Russ Math Surv. 1963; 18(5): R04.
48 Montanelli H, Yang H. Error bounds for deep ReLU networks using the Kolmogorov–Arnold superposition theorem. Neural Netw. 2020. doi:10.1016/j.neunet.2019.12.013
