Volume 2025, Issue 1 8871109
Research Article
Open Access

Deleting Values May Either Increase or Decrease Variance

David L. Farnsworth (Corresponding Author)

School of Mathematics and Statistics, Rochester Institute of Technology, Rochester, New York 14623, USA (rit.edu)

First published: 21 May 2025
Academic Editor: Shikha Binwal

Abstract

The impact upon variance when a value is deleted is addressed. It is shown that the cutoff for the deleted value yielding an increase or a decrease in variance is approximately one standard deviation from the mean for a univariate random variable with equally distributed probability on a finite set of elements and for a univariate set of observations. The influence of truncation of the domain for such a discrete random variable and for observations is considered.

1. Introduction

The mean and variance of the discrete random variable X that is uniformly distributed over its domain {x₁, x₂, …, xₙ} are
$$\mu(n) = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{1}$$
$$\sigma^2(n) = \frac{1}{n}\sum_{i=1}^{n} \bigl(x_i - \mu(n)\bigr)^2 \tag{2}$$
respectively. The n in parentheses following μ and σ² indicates the size of the population. Deleting xₙ from the domain gives a new probability mass function over the remaining elements of the domain with mean and variance
$$\mu(n-1) = \frac{1}{n-1}\sum_{i=1}^{n-1} x_i \tag{3}$$
$$\sigma^2(n-1) = \frac{1}{n-1}\sum_{i=1}^{n-1} \bigl(x_i - \mu(n-1)\bigr)^2 \tag{4}$$
respectively. For convenience, the values in the domain have been numbered so that the deleted or additional value is the nth one. The n − 1 in parentheses following μ and σ² is reserved for this case where xₙ is deleted. When a value is deleted from the domain, the probability associated with that value is equally distributed among the remaining values. In order to have at least two numbers for computing the variance, assume that n ≥ 2.

The goal is to express the means and variances for n − 1 values and for n values in terms of each other and thereby to find a condition under which variance is reduced by the deletion, i.e., σ²(n − 1) < σ²(n). The analogous situation in a set of observations is addressed in Section 3. Truncation is the topic of Section 4.

2. Deleting an Element From the Domain of a Discrete Random Variable

The effects on the mean and the variance of adding or deleting a single value of the domain for a random variable in which each value is equally likely to occur are contained in Lemmas 1 and 2.

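Lemma 1. For a univariate discrete random variable with equally distributed probability on the finite set of elements {x₁, x₂, …, xₙ},

$$\mu(n) = \frac{(n-1)\,\mu(n-1) + x_n}{n} \tag{5}$$
$$\mu(n-1) = \frac{n\,\mu(n) - x_n}{n-1} \tag{6}$$
$$x_n - \mu(n) = \frac{n-1}{n}\,\bigl(x_n - \mu(n-1)\bigr) \tag{7}$$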

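Proof 1. For (5),

$$\mu(n) = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\Bigl(\sum_{i=1}^{n-1} x_i + x_n\Bigr) = \frac{(n-1)\,\mu(n-1) + x_n}{n}. \tag{8}$$

For (6),

$$\mu(n-1) = \frac{1}{n-1}\sum_{i=1}^{n-1} x_i = \frac{1}{n-1}\Bigl(\sum_{i=1}^{n} x_i - x_n\Bigr) = \frac{n\,\mu(n) - x_n}{n-1}. \tag{9}$$

For (7), using (5) to substitute for μ(n) gives

$$x_n - \mu(n) = x_n - \frac{(n-1)\,\mu(n-1) + x_n}{n} = \frac{n-1}{n}\,\bigl(x_n - \mu(n-1)\bigr). \tag{10}$$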

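Lemma 2. For a univariate discrete random variable with equally distributed probability on the finite set of elements {x₁, x₂, …, xₙ},

$$\sigma^2(n) = \frac{n-1}{n}\,\sigma^2(n-1) + \frac{n-1}{n^2}\,\bigl(x_n - \mu(n-1)\bigr)^2 \tag{11}$$
$$\sigma^2(n-1) = \frac{n}{n-1}\,\sigma^2(n) - \frac{n}{(n-1)^2}\,\bigl(x_n - \mu(n)\bigr)^2 \tag{12}$$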

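Proof 2. For (11),

$$\sigma^2(n) = \frac{1}{n}\sum_{i=1}^{n}\bigl(x_i - \mu(n)\bigr)^2 = \frac{1}{n}\Bigl[\sum_{i=1}^{n-1}\Bigl(\bigl(x_i - \mu(n-1)\bigr) - \bigl(\mu(n) - \mu(n-1)\bigr)\Bigr)^2 + \bigl(x_n - \mu(n)\bigr)^2\Bigr]. \tag{13}$$

Rearranging the terms and using formulas (3) and (4), together with μ(n) − μ(n − 1) = (xₙ − μ(n − 1))/n from (5) and with (7), give

$$\sigma^2(n) = \frac{1}{n}\Bigl[(n-1)\,\sigma^2(n-1) + \frac{n-1}{n^2}\bigl(x_n - \mu(n-1)\bigr)^2 + \frac{(n-1)^2}{n^2}\bigl(x_n - \mu(n-1)\bigr)^2\Bigr] = \frac{n-1}{n}\,\sigma^2(n-1) + \frac{n-1}{n^2}\,\bigl(x_n - \mu(n-1)\bigr)^2. \tag{14}$$

The cross term in the expanded square vanishes because, by (3), the deviations xᵢ − μ(n − 1) sum to zero over i = 1, …, n − 1.

For (12), substituting (7) into (11) gives

$$\sigma^2(n) = \frac{n-1}{n}\,\sigma^2(n-1) + \frac{1}{n-1}\,\bigl(x_n - \mu(n)\bigr)^2. \tag{15}$$

Solving (15) for σ²(n − 1) yields

$$\sigma^2(n-1) = \frac{n}{n-1}\,\sigma^2(n) - \frac{n}{(n-1)^2}\,\bigl(x_n - \mu(n)\bigr)^2. \tag{16}$$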
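The updating formulas of Lemmas 1 and 2 can be checked numerically. The following Python sketch is an illustration rather than part of the derivation: it assumes the formulas take the forms written in the comments, and the function names and the small domain are hypothetical. Exact rational arithmetic is used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction


def mean(values):
    """Mean of equally likely values, as in (1)."""
    return Fraction(sum(values), len(values))


def pvar(values):
    """Population variance of equally likely values, as in (2)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def add_value(mu_prev, var_prev, n, x_n):
    """Go from n - 1 values to n values.

    Assumed forms:
      mu(n)      = ((n - 1) mu(n - 1) + x_n) / n
      sigma^2(n) = (n - 1)/n sigma^2(n - 1) + (n - 1)/n^2 (x_n - mu(n - 1))^2
    """
    mu_n = ((n - 1) * mu_prev + x_n) / n
    var_n = (Fraction(n - 1, n) * var_prev
             + Fraction(n - 1, n * n) * (x_n - mu_prev) ** 2)
    return mu_n, var_n


def delete_value(mu_n, var_n, n, x_n):
    """Go from n values to n - 1 values.

    Assumed forms:
      mu(n - 1)      = (n mu(n) - x_n) / (n - 1)
      sigma^2(n - 1) = n/(n - 1) sigma^2(n) - n/(n - 1)^2 (x_n - mu(n))^2
    """
    mu_prev = (n * mu_n - x_n) / (n - 1)
    var_prev = (Fraction(n, n - 1) * var_n
                - Fraction(n, (n - 1) ** 2) * (x_n - mu_n) ** 2)
    return mu_prev, var_prev


# Hypothetical domain; the last element plays the role of x_n.
domain = [2, 3, 5, 7, 11, 13]
n, x_n, reduced = len(domain), domain[-1], domain[:-1]

# Both directions reproduce the directly computed mean and variance.
assert add_value(mean(reduced), pvar(reduced), n, x_n) == (mean(domain), pvar(domain))
assert delete_value(mean(domain), pvar(domain), n, x_n) == (mean(reduced), pvar(reduced))
```

Using `Fraction` instead of floating point is a design choice for the check only; in practice the same formulas can be applied to floats.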

The locations of a single deleted value, in order for the variance to be increased or decreased, are given in Theorems 1 and 2 for these discrete distributions.

Theorem 1. The variance is made smaller by the deletion of the single value xₙ, i.e., σ²(n − 1) < σ²(n), if and only if the absolute value of the z-score of the deleted value with respect to the unreduced domain’s values is greater than √((n − 1)/n), i.e.,

$$\frac{\lvert x_n - \mu(n) \rvert}{\sigma(n)} > \sqrt{\frac{n-1}{n}}. \tag{17}$$

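Proof 3. From (12) in Lemma 2, set

$$\sigma^2(n-1) = \frac{n}{n-1}\,\sigma^2(n) - \frac{n}{(n-1)^2}\,\bigl(x_n - \mu(n)\bigr)^2 < \sigma^2(n). \tag{18}$$

Then

$$\frac{\bigl(x_n - \mu(n)\bigr)^2}{\sigma^2(n)} > \frac{n-1}{n}, \tag{19}$$
which produces inequality (17).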

In Theorem 2, the cutoff criterion is expressed in terms of the mean and standard deviation of the variable with the domain reduced by one value. The criterion is equivalent to the criterion in Theorem 1.

Theorem 2. The variance is made smaller by the deletion of the single value xₙ, i.e., σ²(n − 1) < σ²(n), if and only if the absolute value of the z-score of the deleted value with respect to the reduced domain’s values is greater than √(n/(n − 1)), i.e.,

$$\frac{\lvert x_n - \mu(n-1) \rvert}{\sigma(n-1)} > \sqrt{\frac{n}{n-1}}. \tag{20}$$

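Proof 4. From (11) in Lemma 2, set

$$\sigma^2(n-1) < \sigma^2(n) = \frac{n-1}{n}\,\sigma^2(n-1) + \frac{n-1}{n^2}\,\bigl(x_n - \mu(n-1)\bigr)^2, \tag{21}$$
which yields
$$\frac{\bigl(x_n - \mu(n-1)\bigr)^2}{\sigma^2(n-1)} > \frac{n}{n-1} \tag{22}$$
and inequality (20).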

If the deleted or added value in Theorems 1 and 2 has a z-score that makes (17) or, equivalently, (20) an equality, then the variances with and without that value are equal, and the ranges of the two sets of values may even be identical.
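The equivalence of the two criteria with an actual decrease in variance can be checked by brute force. The sketch below is an illustration only: it assumes the cutoffs are √((n − 1)/n) for the z-score with respect to the unreduced domain and √(n/(n − 1)) for the z-score with respect to the reduced domain, compares squared quantities to avoid square roots, and uses hypothetical function names and small integer domains.

```python
from fractions import Fraction
from itertools import combinations


def mean(values):
    return Fraction(sum(values), len(values))


def pvar(values):
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def without(values, x):
    """Copy of values with one occurrence of x removed."""
    rest = list(values)
    rest.remove(x)
    return rest


def theorem1_predicts_decrease(values, x):
    # z^2 > (n - 1)/n, with z taken relative to the unreduced domain.
    n = len(values)
    return n * (x - mean(values)) ** 2 > (n - 1) * pvar(values)


def theorem2_predicts_decrease(values, x):
    # z^2 > n/(n - 1), with z taken relative to the reduced domain.
    n = len(values)
    rest = without(values, x)
    return (n - 1) * (x - mean(rest)) ** 2 > n * pvar(rest)


def criteria_agree(universe, size):
    """Compare both criteria with the direct variance comparison."""
    for domain in combinations(universe, size):
        for x in domain:
            actual = pvar(without(domain, x)) < pvar(domain)
            if theorem1_predicts_decrease(domain, x) != actual:
                return False
            if theorem2_predicts_decrease(domain, x) != actual:
                return False
    return True


# Every 3-, 4-, and 5-element subset of {1, ..., 9}, every possible deletion.
all_agree = all(criteria_agree(range(1, 10), size) for size in (3, 4, 5))
print(all_agree)
```

An exhaustive check over small domains is of course not a proof; it only guards against algebra slips in the stated cutoffs.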

Example 1. Consider the equally spaced elements in the domain {1, 2, …, n}, each with the probability 1/n of occurrence. The mean and variance are

$$\mu(n) = \frac{n+1}{2}, \qquad \sigma^2(n) = \frac{n^2-1}{12}. \tag{23}$$

For xₙ = n, using the criterion (17) of Theorem 1, obtain
$$\frac{\lvert n - (n+1)/2 \rvert}{\sqrt{(n^2-1)/12}} = \sqrt{\frac{3(n-1)}{n+1}} > \sqrt{\frac{n-1}{n}}, \tag{24}$$
which becomes 3 > 1 + 1/n. Deleting {n} reduces the variance to (n − 2)n/12, which is less than σ²(n) for all n. On the other hand, for {1, 2, …, 20} and xₙ = 15,
$$\frac{\lvert 15 - 10.5 \rvert}{\sqrt{33.25}} \approx 0.780 < \sqrt{\frac{19}{20}} \approx 0.975, \tag{25}$$
so that deleting {15} leaves a variable with a larger variance. Another viewpoint is that the impact of adding {20} to the domain {1, 2, …, 19} is to increase variance, but adding {15} to the domain {1, 2, …, 13, 14, 16, 17, 18, 19, 20} would decrease variance. For n = 20, the cutoff values for decreasing or increasing variance are
$$\mu(20) \pm \sqrt{\frac{19}{20}}\,\sigma(20) = 10.5 \pm \sqrt{0.95}\,\sqrt{33.25} \approx 10.5 \pm 5.62, \tag{26}$$
i.e., 4.88 and 16.12, so that deleting an element of {1, 2, 3, 4, 17, 18, 19, 20} would decrease variance and deleting an element of {5, 6, …, 16} would increase variance.
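The numbers in Example 1 for n = 20 can be reproduced with Python's standard `statistics` module. This is a minimal sketch, assuming the cutoff √((n − 1)/n) for the z-score with respect to the unreduced domain; the variable names are illustrative.

```python
import math
from statistics import mean, pvariance

n = 20
domain = list(range(1, n + 1))

mu = mean(domain)                      # population mean of {1, ..., 20}
sigma = math.sqrt(pvariance(domain))   # population standard deviation

# Assumed cutoff: |x - mu| / sigma > sqrt((n - 1) / n) means a decrease.
half_width = math.sqrt((n - 1) / n) * sigma
low, high = mu - half_width, mu + half_width

predicted_decrease = [x for x in domain if abs(x - mu) > half_width]
predicted_increase = [x for x in domain if abs(x - mu) < half_width]

# Direct check: delete each value in turn and compare the variances.
actual_decrease = [
    x for x in domain
    if pvariance([v for v in domain if v != x]) < pvariance(domain)
]

print(round(low, 2), round(high, 2))
print(predicted_decrease)
print(predicted_decrease == actual_decrease)
```

The last line compares the prediction from the cutoff with the result of actually deleting each value, so the two approaches can be seen to agree for this domain.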

3. Deleting an Element From a Dataset

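For data, the only changes are that the denominators of the variances are reduced by one, n is the sample size, and the notation changes from μ to x̄ and from σ² to s². The formulas analogous to (5)–(7) in Lemma 1 and (11) and (12) in Lemma 2 are
$$\bar{x}(n) = \frac{(n-1)\,\bar{x}(n-1) + x_n}{n} \tag{27}$$
$$\bar{x}(n-1) = \frac{n\,\bar{x}(n) - x_n}{n-1} \tag{28}$$
$$x_n - \bar{x}(n) = \frac{n-1}{n}\,\bigl(x_n - \bar{x}(n-1)\bigr) \tag{29}$$
$$s^2(n) = \frac{n-2}{n-1}\,s^2(n-1) + \frac{1}{n}\,\bigl(x_n - \bar{x}(n-1)\bigr)^2 \tag{30}$$
$$s^2(n-1) = \frac{n-1}{n-2}\,s^2(n) - \frac{n}{(n-1)(n-2)}\,\bigl(x_n - \bar{x}(n)\bigr)^2 \tag{31}$$
respectively. The proofs follow step-by-step those of Lemmas 1 and 2 with the necessary changes in notation. Also, see [1, 2] for these types of formulas. Equation (31) requires that n ≥ 3.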

Cutoff values for the increase or decrease of sample variance are contained in Theorems 3 and 4, whose proofs parallel those of Theorems 1 and 2. Although the denominators of the sample variances contain n − 1 and n − 2, instead of the population variances’ n and n − 1, those differences in the formulas disappear when the algebra is performed. The right-hand sides of the inequalities in Theorems 3 and 4 are the same as those in Theorems 1 and 2.

Theorem 3. For the observations {x₁, x₂, …, xₙ}, the sample variance is made smaller by the deletion of the single value xₙ, i.e., s²(n − 1) < s²(n), if and only if the absolute value of the z-score of the deleted value with respect to the unreduced dataset is greater than √((n − 1)/n), i.e.,

$$\frac{\lvert x_n - \bar{x}(n) \rvert}{s(n)} > \sqrt{\frac{n-1}{n}}. \tag{32}$$

Theorem 4. For the observations {x₁, x₂, …, xₙ}, the sample variance is made smaller by the deletion of the single value xₙ, i.e., s²(n − 1) < s²(n), if and only if the absolute value of the z-score of the deleted value with respect to the reduced dataset is greater than √(n/(n − 1)), i.e.,

$$\frac{\lvert x_n - \bar{x}(n-1) \rvert}{s(n-1)} > \sqrt{\frac{n}{n-1}}. \tag{33}$$
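The sample-variance criteria can be tried on a small dataset. The sketch below is illustrative only: the data are made up, the function names are hypothetical, and the cutoffs √((n − 1)/n) and √(n/(n − 1)) are assumed to be the same as in the population case, as stated in the text preceding Theorems 3 and 4. Squared quantities are compared so that exact rational arithmetic can be used.

```python
from fractions import Fraction


def xbar(data):
    """Sample mean."""
    return Fraction(sum(data), len(data))


def svar(data):
    """Sample variance with denominator len(data) - 1."""
    m = xbar(data)
    return sum((x - m) ** 2 for x in data) / (len(data) - 1)


def without(data, x):
    """Copy of data with one occurrence of x removed."""
    rest = list(data)
    rest.remove(x)
    return rest


def theorem3_predicts_decrease(data, x):
    # z^2 > (n - 1)/n, with z taken relative to the unreduced dataset.
    n = len(data)
    return n * (x - xbar(data)) ** 2 > (n - 1) * svar(data)


def theorem4_predicts_decrease(data, x):
    # z^2 > n/(n - 1), with z taken relative to the reduced dataset.
    n = len(data)
    rest = without(data, x)
    return (n - 1) * (x - xbar(rest)) ** 2 > n * svar(rest)


# Made-up observations (n = 7, so n >= 3 holds).
data = [12, 15, 9, 30, 14, 11, 13]

# For each datum: (variance actually decreases, Theorem 3, Theorem 4).
report = {
    x: (svar(without(data, x)) < svar(data),
        theorem3_predicts_decrease(data, x),
        theorem4_predicts_decrease(data, x))
    for x in data
}

for x, flags in report.items():
    print(x, flags)
```

In this made-up dataset only the value 30 lies far enough from the mean for its deletion to reduce the sample variance; deleting any of the other observations increases it.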

4. Truncation

More than one value might be deleted. Deleting a set of the lowest and/or the highest values from distributions or sets of observations is called truncation.

Example 1 revisited: Consider truncation from the right of the domain {1, 2, …, n}. Each single step of truncation from the right decreases the variance, so, by Theorem 1, every deleted value lies outside the bound (17) computed for the domain from which it is removed.

Example 2. Consider the elements of a random variable’s domain {1, 2, 3, 4, 5, 117}, each with the probability 1/6 of occurrence, and truncation from the left. As successively larger truncations are taken, Table 1 shows the somewhat surprising fact that the variance of the variable with the remaining domain grows while the range shrinks. Using (17) in Theorem 1, the right-most (most recently deleted) value of each truncated set can be checked for increasing the variance relative to the previous truncation. For example, when {1, 2} is truncated, the absolute value of the z-score of 2 with respect to the previous remaining domain {2, 3, 4, 5, 117} is

$$\frac{\lvert 2 - 26.2 \rvert}{\sqrt{2062.16}} \approx 0.533 < \sqrt{\frac{4}{5}} \approx 0.894, \tag{34}$$

so deleting 2 increases the variance.

Table 1. Sequence of left truncations for Example 2. The mean, range, and variance are those of the remaining values.

Deleted value(s)    Mean     Range    Variance
None                22       116      1806.67
{1}                 26.2     115      2062.16
{1, 2}              32.25    114      2394.69
{1, 2, 3}           42       113      2812.67
{1, 2, 3, 4}        61       112      3136.00

Commonly, the deleted values are reported as the percentage of the original values that are omitted. For the truncation of {1, 2} in Example 2, the deletion percentage is (2/6) × 100% ≈ 33.3%.
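The entries of Table 1 and the single-step checks can be recomputed with a few lines of Python. This is a sketch under the assumption that the criterion of Theorem 1 has the cutoff √((n − 1)/n), where n is the size of the domain before each single deletion; the variable names are illustrative.

```python
import math
from statistics import mean, pvariance

domain = [1, 2, 3, 4, 5, 117]

# One row per truncation: (deleted values, mean, range, variance).
rows = []
for k in range(5):
    remaining = domain[k:]
    rows.append((
        domain[:k],
        mean(remaining),
        max(remaining) - min(remaining),
        round(pvariance(remaining), 2),
    ))

# Single-step check: z-score of the value about to be deleted, relative
# to the domain it is deleted from, against the assumed cutoff.
steps = []
for k in range(4):
    current = domain[k:]
    n = len(current)
    z = abs(current[0] - mean(current)) / math.sqrt(pvariance(current))
    below_cutoff = z < math.sqrt((n - 1) / n)   # True: deletion raises variance
    steps.append((current[0], round(z, 3), below_cutoff))

for row in rows:
    print(row)
for step in steps:
    print(step)

# Deletion percentage for truncating {1, 2} from the six original values.
print(round(2 / 6 * 100, 1))
```

Every single step has a z-score below the cutoff, which matches the steadily increasing variance column of Table 1.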

5. Discussion and Concluding Comments

The right-hand sides of the inequalities in Theorems 1–4 are approximately equal to one for large values of n. For Theorems 1 and 3, the right-hand side √((n − 1)/n) exceeds 0.95 for n ≥ 11. The closeness of the right-hand sides to each other is not surprising. Z-scores have the feature that, as the x-value moves away from the mean, the standard deviation in the denominator increases along with the numerator. Thus, for observations, detecting an outlying datum utilizing z-scores can be problematic [3]. Also, there are upper bounds for the possible z-scores of elements of the set, and those bounds depend upon n [4].
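How quickly the cutoffs approach one can be tabulated directly. The sketch below assumes that the right-hand sides are √((n − 1)/n) for Theorems 1 and 3 and √(n/(n − 1)) for Theorems 2 and 4; the threshold 0.95 and the function names are chosen here for illustration.

```python
import math


def cutoff_unreduced(n):
    """Assumed right-hand side for Theorems 1 and 3."""
    return math.sqrt((n - 1) / n)


def cutoff_reduced(n):
    """Assumed right-hand side for Theorems 2 and 4."""
    return math.sqrt(n / (n - 1))


# Smallest n at which the unreduced cutoff reaches 0.95.
first_n = next(n for n in range(2, 1000) if cutoff_unreduced(n) >= 0.95)
print(first_n)

# The two cutoffs are reciprocals, so they bracket one and converge to it.
for n in (2, 5, 11, 50, 500):
    print(n, round(cutoff_unreduced(n), 4), round(cutoff_reduced(n), 4))
```

Because the two cutoffs are reciprocals of each other, one of them is always slightly below one and the other slightly above, and the gap between them shrinks as n grows.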

There are related ideas. For example, (5), (11), (28), and (30) are shortcuts for computing the mean and variance from the previous values [5, 6]. The related problem of changing the numerical value of one datum that is already in the dataset is explored in [6, 7].

Conflicts of Interest

The author declares no conflicts of interest.

Funding

No funds were received by the author regarding this paper.

Data Availability Statement

No data were used to support this study.
