Stability Analysis of Learning Algorithms for Ontology Similarity Computation
Abstract
Ontology, as a useful tool, is widely applied in many areas such as social science, computer science, and medical science. Ontology concept similarity calculation is a key ingredient of the algorithms in these applications. A recent approach, instead of relying on pairwise computations, is based on a function that maps the vertex set of an ontology graph to the real numbers: the vertices are ranked through their scores, and similarity is read off from score differences. Learning such a ranking function from a training sample, which contains a subset of the vertices of the ontology graph, is therefore an important and essential problem, and the k-partite ranking algorithm in particular is well suited to several ontology problems. A good ranking function should make few ranking mistakes and be stable. For ranking algorithms that are stable, we establish generalization bounds via notions of algorithmic stability. We also show that kernel-based ranking algorithms stated as regularization schemes in reproducing kernel Hilbert spaces satisfy these stability conditions and hence have good generalization ability.
1. Introduction and Motivations
The study of ontology deals with questions concerning what entities exist and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences. The developed tools have been widely applied in medicine, biology, and social science. In computer science, ontology is defined as a model for sharing formal concepts and has been applied in intelligent information integration, cooperative information systems, information retrieval, electronic commerce, and knowledge management. After a decade’s development, ontology technology has matured as an effective model of hierarchical structure and semantics for concepts, supported by systematic and comprehensive engineering theory, representation, and construction tools.
Ontology similarity computation is an essential part of many practical applications. In information retrieval, it has been used to compute semantic similarity and to search for concepts. We take a graph-theoretic approach and represent an ontology by a weighted graph G = (V, E, w). In this setting, V = {v1, …, vn} is the (finite) set of vertices corresponding to concepts or objects of the ontology, E ⊂ V × V is a set of edges, and w : E → ℝ+ is a weight function. For two vertices vi and vj representing two concepts, the weight w(vi, vj) measures their similarity in the ontology.
Example 1. In some applications of ontology similarity computation, the weight function w takes values in [0, 1]. The case w(vi, vj) = 1 then means that vi and vj represent the same concept, while w(vi, vj) = 0 means that the two concepts have no similarity. In information retrieval, with a threshold parameter 0 < ϵ < 1, when one tries to find information related to the concept vi, all concepts vj satisfying w(vi, vj) > ϵ are returned; these are the concepts with high similarity to vi.
Traditional methods for ontology similarity computation are based on pairwise similarity calculation. Their computational complexity is high, and they require the selection of many parameters that are not very intuitive. In this paper, we use a learning theory approach. The idea is to learn a scoring function f : V → ℝ and then to determine the similarity between vertices (concepts) vi and vj by the difference |f(vi) − f(vj)|: the smaller the difference, the higher the similarity. Formally, vi and vj are regarded as similar if and only if |f(vi) − f(vj)| falls below a prescribed threshold. This inspiring approach was introduced from the viewpoint of ranking in [1], where a ranking algorithm is used to learn from samples a scoring function f with small ranking error. The method was employed in the ontology setting in [2], which demonstrated its accuracy and efficiency. Another possible way to learn such a function f is via the graph Laplacian, taking an eigenvector associated with its second smallest eigenvalue; see [3–6] for details. That method requires a positive definiteness condition on a similarity matrix which is hard to check in our setting, and when the graph is large its computational complexity is high.
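To make the two viewpoints concrete, the following sketch contrasts pairwise threshold retrieval (Example 1) with score-based retrieval using a learned function f. The graph, the scores, and the thresholds are illustrative assumptions of ours, not data or code from this paper.

```python
# Illustrative sketch (not the paper's implementation): similarity queries on a
# small ontology graph, first via pairwise weights w, then via a scoring function f.

# Hypothetical weights w(vi, vj) in [0, 1]; only listed pairs carry any similarity.
w = {("v1", "v2"): 0.9, ("v1", "v3"): 0.4, ("v2", "v3"): 0.5, ("v3", "v4"): 0.8}

def pairwise_retrieve(v, eps):
    """Return all vertices vj with w(v, vj) > eps (Example 1)."""
    return sorted({b for (a, b), s in w.items() if a == v and s > eps} |
                  {a for (a, b), s in w.items() if b == v and s > eps})

# Hypothetical learned scores f : V -> R; smaller |f(vi) - f(vj)| means higher similarity.
f = {"v1": 0.10, "v2": 0.15, "v3": 0.60, "v4": 0.70}

def score_retrieve(v, delta):
    """Return all vertices vj != v with |f(v) - f(vj)| < delta."""
    return sorted(u for u in f if u != v and abs(f[v] - f[u]) < delta)

print(pairwise_retrieve("v1", eps=0.5))   # ['v2']
print(score_retrieve("v1", delta=0.2))    # ['v2']
```

The score-based query needs only the n values f(v1), …, f(vn) rather than all pairwise weights, which reflects the complexity advantage mentioned above.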
In this paper, we explore the learning theory approach to ontology similarity computation in the setting where the ontology graph is a tree, that is, a connected graph without cycles, so that there is a unique path between any two vertices. The tree structure imposes restrictions on the similarity of vertices (concepts). For example, assign a top vertex vtop, let it be the root, and denote by k the degree (the number of edges incident to a vertex) of the top vertex. Let NG(vtop) = {v1, v2, …, vk} be the neighbor set of vtop. If the path from a vertex to vtop passes through vi, then that vertex belongs to branch i. Thus, the tree has k branches, and no edge joins two vertices belonging to different branches. Concepts in the same branch of the tree should have higher similarity than concepts in different branches. This observation motivates us to apply the k-partite ranking algorithm [7], in which the k parts correspond to k classes of vertices with k rates. The rate values of all classes are decided by experts. Intuitively, a vertex of a higher rate b is ranked higher than any vertex of rate a whenever 1 ≤ a < b ≤ k. Thus, the k-partite ranking algorithm is a reasonable choice for learning a similarity function on an ontology graph with a tree structure.
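The branch decomposition just described is straightforward to compute. The sketch below assigns every vertex of a small, made-up tree ontology to the branch determined by the neighbor of vtop through which its unique path to the root passes; the resulting k parts are the candidate classes for the k-partite ranking setting. The tree itself is a hypothetical example.

```python
# Illustrative sketch: split a tree ontology into the k branches rooted at the
# neighbors of the top vertex; vertices in different branches share no edge.
from collections import deque

# Hypothetical adjacency list of a tree ontology (undirected).
tree = {
    "top": ["a", "b"],
    "a": ["top", "a1", "a2"], "a1": ["a"], "a2": ["a"],
    "b": ["top", "b1"], "b1": ["b", "b2"], "b2": ["b1"],
}

def branches(adj, root="top"):
    """Map each neighbor of the root to the set of vertices in its branch (BFS)."""
    parts = {}
    for child in adj[root]:
        seen, queue, part = {root, child}, deque([child]), {child}
        while queue:
            u = queue.popleft()
            for nxt in adj[u]:
                if nxt not in seen:
                    seen.add(nxt)
                    part.add(nxt)
                    queue.append(nxt)
        parts[child] = part
    return parts

print(branches(tree))
# e.g. {'a': {'a', 'a1', 'a2'}, 'b': {'b', 'b1', 'b2'}}  -> k = 2 branches
```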
The main contribution of this paper is to formulate certain ontology computations as a k-partite ranking problem and to conduct a stability analysis of the algorithms under mild conditions, which leads to useful error bounds for ontology applications.
The rest of this paper is organized as follows. The setting and main results are given in the next section. Notions of stability for k-partite ranking algorithms are introduced in Section 3. Generalization bounds for stable learning algorithms are established in Section 4. The stability and generalization bounds for learning algorithms stated as regularization schemes in reproducing kernel Hilbert spaces are discussed in Section 5.
2. Formal Setting and Main Results
Now, we state our learning algorithm for ontology similarity computation.
Let V = {v1, …, vn} be the finite set of vertices of an ontology graph. It is divided into k disjoint subsets V1, …, Vk corresponding to the k rates. Let 𝒟 be a probability measure on V.
The performance of a ranking function f : V → ℝ can be measured by the following concept.
Definition 2. A ranking loss function is a function l : ℝ^V × V × V → ℝ+ ∪ {0} that assigns, to each f : V → ℝ and v, v′ ∈ V, a nonnegative real number l(f, v, v′) interpreted as the loss of f in its relative ranking of v and v′. The expected k-partite ranking error on the ontology graph of a ranking function f : V → ℝ associated with the ranking loss function l is defined as
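The displayed formula of Definition 2 is not reproduced in this version of the text. A natural reading, consistent with the bipartite case in [1, 8] and with the k-partite setting of [7], is sketched below; here 𝒟a denotes the conditional distribution of 𝒟 on the rate-a part Va, and the exact normalization is our assumption rather than a quotation.

```latex
\operatorname{er}_{\ell}(f) \;=\; \sum_{1 \le a < b \le k}
  \mathbb{E}_{\,v \sim \mathcal{D}_b,\; v' \sim \mathcal{D}_a}
  \bigl[\ell(f, v, v')\bigr],
```

that is, f is penalized on a pair drawn from two different rates whenever it misranks the higher-rate vertex v against the lower-rate vertex v′.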
Example 3. One commonly used ranking loss function is the hinge ranking loss defined as
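The display for the hinge ranking loss is likewise elided. One common form, consistent with the margin function τ(vi, vj) referred to in Section 4 (taking τ ≡ 1 recovers the unit-margin hinge loss), is the following sketch of our reading rather than a quotation:

```latex
\ell_{\mathrm{hinge}}(f, v, v') \;=\;
  \Bigl(\tau(v, v') - \bigl(f(v) - f(v')\bigr)\Bigr)_{+},
\qquad (x)_{+} := \max\{x, 0\},
```

where v is the vertex expected to be ranked higher and τ(v, v′) ≥ 0 is a prescribed margin.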
Learning algorithms are implemented with a sample 𝒯 of size M, called a preference graph, which is assumed here to be drawn independently according to 𝒟. It is divided into k parts {𝒯1, …, 𝒯k}, where 𝒯a = {ti : ti ∈ Va} consists of the sample points of rate a.
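The corresponding empirical l-error of f on the preference graph 𝒯, whose display is also not reproduced, can be read as the average loss over pairs of sample points taken from parts of different rates; the per-pair normalization below is again our assumption (na = |𝒯a|, and ti^a denotes the i-th point of 𝒯a):

```latex
\widehat{\operatorname{er}}_{\ell}(f; \mathcal{T}) \;=\;
  \sum_{1 \le a < b \le k} \frac{1}{n_a n_b}
  \sum_{i=1}^{n_b} \sum_{j=1}^{n_a}
  \ell\bigl(f,\, t^{b}_{i},\, t^{a}_{j}\bigr).
```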
In [8], Agarwal and Niyogi studied algorithmic stability in a general setting where the training examples take labels y ∈ [0, M1] for some M1 > 0, and the goal is a ranking function that ranks future instances with larger labels higher than those with smaller labels. Our setting here is more specific. The learner is given a preference graph 𝒯 consisting of k disjoint parts corresponding to the k classes of vertices, and every part has a rate value. The target ranking function ranks future instances in higher-rate parts above those in lower-rate parts.
One point we need to emphasize is that we slightly abuse terminology for the sake of readability. If the ranking function f is not associated with an RKHS (for instance, in Lemma 9 and Theorems 10 and 12), then the second term on the right-hand side of (5) vanishes.
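Scheme (5) itself is not displayed in this version of the text. Given the remark above that its second term vanishes when f is not associated with an RKHS, a natural reading is the regularized empirical risk minimization

```latex
f_{\mathcal{T}} \;=\; \arg\min_{f \in \mathcal{H}_K}
  \Bigl\{\, \widehat{\operatorname{er}}_{\ell}(f; \mathcal{T})
  \;+\; \lambda \|f\|_{K}^{2} \,\Bigr\},
\qquad \lambda > 0,
```

which is the form analyzed in Section 5.2; for a general hypothesis space ℱ, the penalty λ‖f‖K2 is replaced by λN(f) as in Lemma 13.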
Our error analysis provides a learning rate for algorithm (5) when the ranking loss is σ-admissible.
Definition 4. Let l be a ranking loss, σ > 0, and ℱ a class of real-valued functions on V. We say that l is σ-admissible with respect to ℱ if for any f1, f2 ∈ ℱ and v, v′ ∈ V,
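The admissibility condition itself is elided here. In the ranking-stability literature [8, 17] it is the Lipschitz-type requirement

```latex
\bigl|\ell(f_1, v, v') - \ell(f_2, v, v')\bigr|
\;\le\; \sigma \Bigl( \bigl|f_1(v) - f_2(v)\bigr|
  + \bigl|f_1(v') - f_2(v')\bigr| \Bigr),
```

which we take to be the intended reading.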
Let us state the estimate of learning rates which will be proved in Section 5.
Theorem 5. Let ℋK be an RKHS such that K(v, v) ≤ κ2 < ∞ for all v ∈ V. Let l be a ranking loss, σ-admissible with respect to ℋK, bounded by some B > 0, and such that l(f, v, v′) is convex with respect to f. Let be a fixed function in ℋK satisfying
From Theorem 5, we see that if λ → 0 and (e.g., λ = M−1/4), then erl(f𝒯) converges with confidence to . The quantity is well understood in the learning theory literature (e.g., [11–14]).
3. Stability Analysis
An algorithm is stable if changing a single point in the training set yields only a small change in its output. It is natural to regard a good ranking algorithm as one with good stability; that is, a mild change in the sample should not lead to a large change in the ranking function. Analyses of the stability of ranking algorithms are given in [1, 8, 15].
Let na = |𝒯a| and let be the ia-th element in 𝒯a for 1 ≤ a ≤ k, 1 ≤ ia ≤ na. Let be the sequence obtained by replacing in 𝒯 by a new sampling point ta of rate a. We define some notions of stability for k-partite ranking algorithms.
Definition 6 (uniform loss stability for a k-partite ranking algorithm on an ontology graph). Let A be a k-partite ranking algorithm for ontology whose output on a preference graph 𝒯 = (𝒯1, …, 𝒯k) is denoted by f𝒯. Let l be a ranking loss function and let αa : ℕk → ℝ for 1 ≤ a ≤ k. We say that A has uniform loss stability (α1, …, αk) with respect to l if for all 1 ≤ a ≤ k, na ∈ ℕ, and ia ∈ {1, …, na}, and for all ta ∈ Va and v, v′ ∈ V belonging to different rates,
Definition 7 (uniform score stability for a k-partite ranking algorithm on an ontology graph). Let A be a k-partite ranking algorithm for ontology whose output on a preference graph 𝒯 = (𝒯1, …, 𝒯k) is denoted by f𝒯. Let μa : ℕk → ℝ for 1 ≤ a ≤ k. We say that A has uniform score stability (μ1, …, μk) if for all 1 ≤ a ≤ k, na ∈ ℕ, ia ∈ {1, …, na}, and ta ∈ Va, and for all v ∈ V,
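The defining inequalities of Definitions 6 and 7 are not reproduced above. Following the analogous notions in [8], they can be read as below, where we write 𝒯^{ia,a} for the sample obtained from 𝒯 by replacing its ia-th point of rate a with ta (the symbol is ours, since the original notation is elided):

```latex
\bigl|\ell(f_{\mathcal{T}}, v, v') - \ell(f_{\mathcal{T}^{i_a,a}}, v, v')\bigr|
  \;\le\; \alpha_a(n_1, \ldots, n_k)
  \quad\text{(uniform loss stability)},
\\[4pt]
\bigl|f_{\mathcal{T}}(v) - f_{\mathcal{T}^{i_a,a}}(v)\bigr|
  \;\le\; \mu_a(n_1, \ldots, n_k)
  \quad\text{(uniform score stability)}.
```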
The main tool used here is McDiarmid's inequality, which bounds the deviation from its mean of any function of independent random variables that changes only slightly when a single variable is altered.
Theorem 8 (see [16]). Let X1, …, XN be independent random variables, each taking values in a set A. Let ϕ : A^N → ℝ be such that for each i ∈ {1, …, N}, there exists a constant ci > 0 such that
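The displayed condition and conclusion of Theorem 8 are elided above; in its standard form, McDiarmid's inequality reads:

```latex
\text{If}\quad
\sup_{x_1,\ldots,x_N,\,x_i'}
  \bigl|\phi(x_1,\ldots,x_i,\ldots,x_N) - \phi(x_1,\ldots,x_i',\ldots,x_N)\bigr|
  \;\le\; c_i \quad\text{for each } i,
\\[4pt]
\text{then for every } \varepsilon > 0,\quad
\Pr\Bigl(\phi(X_1,\ldots,X_N) - \mathbb{E}\bigl[\phi(X_1,\ldots,X_N)\bigr] \ge \varepsilon\Bigr)
\;\le\; \exp\!\Bigl(-\frac{2\varepsilon^{2}}{\sum_{i=1}^{N} c_i^{2}}\Bigr).
```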
In what follows, denotes a training sample obtained by replacing in 𝒯 by tk for , and ia ∈ {1, …, na}. Also, αa(n1, …, nk) and μa(n1, …, nk) are simply denoted by αa and μa, respectively. We only consider sample replacements within the same rate: for some ontology graphs, the graph structure is fixed, and hence the numbers of vertices and edges in each branch are fixed.
4. Generalization Bounds for Stable k-Partite Ranking Algorithms on Ontology Graph
From this section on, our stability analysis of k-partite ranking algorithms is stated on an ontology graph, and our presentation follows [8]. In this section, we derive generalization bounds for ranking algorithms that exhibit good stability properties. Our techniques are based on those of [17]. We start with the following technical lemma.
Lemma 9. Let A be a symmetric k-partite ranking algorithm for ontology whose output on a preference graph 𝒯 = (𝒯1, …, 𝒯k) is denoted by f𝒯, and let l be a ranking loss function. Then, for all , ia ∈ {1, …, na}, and ta ∈ Va, tb ∈ Vb, one has
Proof. We have
We are now ready to give our main result of this section, which bounds the expected l-error of a ranking function learned by a k-partite ranking algorithm with good uniform loss stability in terms of its empirical l-error on the training sample. The proof follows [18].
Theorem 10. Let A be a symmetric k-partite ranking algorithm for ontology whose output on a preference graph 𝒯 = (𝒯1, …, 𝒯k) is denoted by f𝒯, and let l be a ranking loss function such that 0 ≤ l(f, v, v′) ≤ B for all f : V → ℝ and v, v′ ∈ V. Let αa : ℕk → ℝ for 1 ≤ a ≤ k be such that A has uniform loss stability (α1, …, αk) with respect to l. Let αmax = max {α1, …, αk}. Then, for any 0 < δ < 1, with confidence at least 1 − δ, one has
Proof. Let ϕ : V^M → ℝ be defined by
Thus, for any ɛ > 0,
For any γ > 0, and any k-partite ranking algorithm with good uniform loss stability with respect to lγ, Theorem 10 can be applied to bound the expected ranking error of a learned ranking function in terms of its empirical lγ-error on the training sample. The following lemma shows that, for every γ > 0, a ranking algorithm with good uniform score stability also has good uniform loss stability with respect to lγ. Using the techniques of Lemma 2 in [17], and taking τ(vi, vj) in Example 3 as 1, the following lemma can be obtained immediately.
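Before stating it, we record the γ ranking loss lγ, whose display is not reproduced above; the clipped form used in [8], which is our reading here, is

```latex
\ell_{\gamma}(f, v, v') \;=\;
\begin{cases}
1, & f(v) - f(v') \le 0,\\[2pt]
1 - \dfrac{f(v) - f(v')}{\gamma}, & 0 < f(v) - f(v') < \gamma,\\[2pt]
0, & f(v) - f(v') \ge \gamma,
\end{cases}
```

where v is the vertex of the higher rate. Under this reading, lγ upper bounds the 0-1 ranking error and is (1/γ)-admissible.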
Lemma 11. Let A be a k-partite ranking algorithm for ontology whose output on a preference graph 𝒯 = (𝒯1, …, 𝒯k) is denoted by f𝒯. Let μa : ℕk → ℝ for 1 ≤ a ≤ k be such that A has uniform score stability (μ1, …, μk). Then, for every γ > 0, A has uniform loss stability with respect to the γ ranking loss lγ, where for all n1, …, nk ∈ ℕ,
Combining Theorem 10 and Lemma 11, we get the following result which bounds the expected ranking error of a learned ranking function in terms of its empirical lγ-error for any ranking algorithm with good uniform score stability.
Theorem 12. Let A be a k-partite ranking algorithm for ontology whose output on a preference graph 𝒯 = (𝒯1, …, 𝒯k) is denoted by f𝒯. Let μa : ℕk → ℝ for 1 ≤ a ≤ k be such that A has uniform score stability (μ1, …, μk), and let γ > 0. Denote μmax = max {μ1, …, μk}. If l is a ranking loss satisfying 0 ≤ l(f, v, v′) ≤ B for all f : V → ℝ and v, v′ ∈ V, then for any 0 < δ < 1, with probability at least 1 − δ,
5. Stable Ranking Algorithms
In this section, we demonstrate the stability of some ranking algorithms in which a ranking function is selected by minimizing a regularized objective function. A general result for regularization-based k-partite ranking algorithms is derived in Section 5.1. In Section 5.2, this result is used to establish the stability of kernel-based k-partite ranking algorithms that perform regularization in a reproducing kernel Hilbert space. These stability results are then used to obtain a consistency theorem for kernel-based k-partite ranking algorithms in Section 5.3.
5.1. General Regularizers
Lemma 13. Let l be a ranking loss such that l(f, v, v′) is convex in f. Let ℱ be a convex set of real-valued functions on V, and let σ > 0 be such that l is σ-admissible with respect to ℱ. Let λ > 0, and let N : ℱ → ℝ+ ∪ {0} be a functional defined on ℱ such that, for a preference graph 𝒯 = (𝒯1, …, 𝒯k), the regularized empirical l-error has a minimum (not necessarily unique) in ℱ. Let A be a k-partite ranking algorithm for ontology defined by (29). Let ta ∈ Va, , and ia ∈ {1, …, na}. For brevity, denote
Proof. Recall that a convex function g satisfies
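The elided display is the standard convexity inequality; in the parametrized form used later (the proof of Theorem 14 applies it with t = 1/2), it reads

```latex
g\bigl(x + t\,(y - x)\bigr) - g(x) \;\le\; t\,\bigl(g(y) - g(x)\bigr),
\qquad 0 \le t \le 1,
```

equivalently, g((1 − t)x + ty) ≤ (1 − t)g(x) + tg(y).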
As we will see below, the above result can be used to establish stability of some regularization-based ranking algorithms.
5.2. Regularization in Reproducing Kernel Hilbert Spaces
Theorem 14. Let ℱ be an RKHS with kernel K such that for all v ∈ V, K(v, v) ≤ κ2 < ∞. Let l be a ranking loss such that l(f, v, v′) is convex in f and l is σ-admissible with respect to ℱ. Let λ > 0, and let N be given by (42). Let A be the k-partite ranking algorithm for ontology that, given a preference graph 𝒯, outputs a ranking function f𝒯 ∈ ℱ defined by (29). Then, A has uniform score stability (μ1, …, μk) with
Proof. Let va ∈ Va and ia ∈ {1, …, na}.
Applying Lemma 13 with t = 1/2, we get (using the notation in the proof of Lemma 13) that
Theorems 12 and 14 give the following generalization bound for kernel-based ranking algorithms.
Corollary 15. Under the conditions of Theorem 14, one has that for any 0 < δ < 1, with probability at least 1 − δ over the draw of 𝒯, the expected ranking error of the ranking function f𝒯 learned by the regularized algorithm associated with the l1 ranking loss is bounded by
The result of Corollary 15 shows that a larger regularization parameter λ leads to better stability and, therefore, a tighter confidence interval in the resulting generalization bound.
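As an illustration of the kernel-based schemes analyzed in this section, the sketch below minimizes a regularized empirical hinge ranking error over functions of the form f(·) = Σm cm K(tm, ·) by full-batch subgradient descent. It is a minimal toy sketch under our own assumptions (a Gaussian kernel, synthetic one-dimensional inputs, a fixed step size); it is not the algorithm defined by (29), whose display is not reproduced here.

```python
# Minimal sketch of a kernel-based k-partite ranking scheme:
#   minimize (1/|P|) * sum_{(i,j) in P} max(0, 1 - (f(t_i) - f(t_j))) + lam * ||f||_K^2
# over f(.) = sum_m c_m K(t_m, .), where P collects pairs (higher rate, lower rate).
import numpy as np

def gaussian_kernel(X, Y, width=1.0):
    # Squared distances for one-dimensional inputs (an assumption of this toy example).
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2.0 * width ** 2))

def train_kpartite_rank(X, rates, lam=0.1, lr=0.05, epochs=500):
    n = len(X)
    K = gaussian_kernel(X, X)
    pairs = [(i, j) for i in range(n) for j in range(n) if rates[i] > rates[j]]
    c = np.zeros(n)                               # kernel expansion coefficients
    for _ in range(epochs):
        f = K @ c                                 # current scores f(t_1), ..., f(t_n)
        grad = 2.0 * lam * (K @ c)                # gradient of lam * c^T K c = lam * ||f||_K^2
        for (i, j) in pairs:                      # subgradient of the hinge ranking term
            if 1.0 - (f[i] - f[j]) > 0.0:
                grad += (K[j] - K[i]) / len(pairs)
        c -= lr * grad                            # full-batch subgradient step
    return c, K

# Synthetic preference graph with rates 1 < 2 < 3 (hypothetical data).
X = np.array([0.1, 0.3, 1.1, 1.4, 2.2, 2.5])
rates = np.array([1, 1, 2, 2, 3, 3])
c, K = train_kpartite_rank(X, rates)
print(np.round(K @ c, 2))   # learned scores; higher-rate vertices should score higher
```

By the representer theorem, the minimizer of such a regularized objective over an RKHS admits a finite kernel expansion over the sample points, so the optimization can be carried out over the coefficient vector c; larger values of lam shrink the scores and, as Corollary 15 indicates, tighten the stability terms in the bound.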
Under the conditions of the above results, a kernel-based ranking algorithm minimizing a regularized empirical l-error also has good uniform loss stability with respect to l; this follows from the following simple lemma.
Lemma 16. Let ℱ be a class of real-valued functions on V, and let A be a k-partite ranking algorithm for ontology that, given a preference graph 𝒯, outputs a ranking function f𝒯 ∈ ℱ. If A has uniform score stability (μ1, …, μk) and l is a ranking loss that is σ-admissible with respect to ℱ, then A has uniform loss stability (α1, …, αk) with respect to l, where for all n1, …, nk ∈ ℕ,
The proof of this result follows that of Lemma 13 in [8]. Using Theorem 14 and Lemma 16, we immediately obtain the following corollary.
Corollary 17. Under the conditions of Theorem 14, A has uniform loss stability (α1, …, αk) with respect to l, where for all na ∈ ℕ,
5.3. Consistency
Lemma 18. Let f : V → ℝ be a fixed ranking function, and let l be a bounded ranking loss function such that 0 ≤ l(f, v, v′) ≤ B for all v, v′ ∈ V. Then, for any 0 < δ < 1, with probability at least 1 − δ,
Proof. Define ϕ as
We are now in a position to prove our main result (Theorem 5).
Proof of Theorem 5. We use Corollary 17 and apply Theorem 10 with δ/2 to get that, with probability at least 1 − δ/2,
6. Conclusion
The main focus of this paper is the stability and generalization properties of the k-partite ranking algorithm used for ontology computation. The algorithm reflects a simple intuition: the vertices of an ontology graph are mapped to points on the real line. In our setting, the vertices of the ontology graph do not carry real-valued labels; instead, the sample is given as a preference graph (pairs of vertices with different ranking rates), which suits the ontology application. We have derived generalization bounds for k-partite ranking algorithms in this setting using the notion of algorithmic stability, and we have shown that k-partite ranking algorithms with good stability properties have good generalization properties. Our results are applied to obtain generalization bounds for kernel-based k-partite ranking algorithms that perform regularization in a reproducing kernel Hilbert space.
Acknowledgments
The authors thank the reviewers for their constructive comments and detailed suggestions for improving the quality of this paper. This work was supported in part by the Key Laboratory of Educational Informatization for Nationalities, Ministry of Education, the National Natural Science Foundation of China (60903131) and Key Science and Technology Research Project of Education Ministry (210210).