The impact upon variance when a value is deleted is addressed. It is shown that the cutoff for the deleted value yielding an increase or a decrease in variance is approximately one standard deviation from the mean for a univariate random variable with equally distributed probability on a finite set of elements and for a univariate set of observations. The influence of truncation of the domain for such a discrete random variable and for observations is considered.

1. Introduction

The mean and variance of the discrete random variable X that is uniformly distributed over its domain {x₁, x₂, …, x_n} are

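$$\mu(n) = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{1}$$

$$\sigma^2(n) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu(n)\right)^2 \tag{2}$$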

respectively. The n in parentheses following μ and σ² indicates the size of the population. Deleting x_n from the domain gives a new probability mass function over the remaining elements of the domain with mean and variance

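$$\mu(n-1) = \frac{1}{n-1}\sum_{i=1}^{n-1} x_i \tag{3}$$

$$\sigma^2(n-1) = \frac{1}{n-1}\sum_{i=1}^{n-1}\left(x_i - \mu(n-1)\right)^2 \tag{4}$$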

respectively. For convenience, the values in the domain have been numbered so that the deleted or additional value is the nth one. The n − 1 in parentheses following μ and σ² is reserved for this case where x_n is deleted. When a value is deleted from the domain, the probability associated with that value is equally distributed among the remaining values. In order to have at least two numbers for computing the variance, assume that n ≥ 2.

The goal is to express the means and variances for n − 1 values and for n values in terms of each other and thereby to find a condition under which variance is reduced by the deletion, i.e., σ²(n − 1) < σ²(n). The analogous situation in a set of observations is addressed in Section 3. Truncation is the topic of Section 4.

2. Deleting an Element From the Domain of a Discrete Random Variable

The effects on the mean and the variance of adding or deleting a single value of the domain for a random variable in which each value is equally likely to occur are contained in Lemmas 1 and 2.

Lemma 1. For a univariate discrete random variable with equally distributed probability on the finite set of elements {x₁, x₂, …, x_n},

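$$\mu(n) = \frac{(n-1)\,\mu(n-1) + x_n}{n} \tag{5}$$

$$\mu(n-1) = \frac{n\,\mu(n) - x_n}{n-1} \tag{6}$$

$$x_n - \mu(n) = \frac{n-1}{n}\left(x_n - \mu(n-1)\right) \tag{7}$$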

Proof 1. For (5),

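$$\mu(n) = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\left[\sum_{i=1}^{n-1} x_i + x_n\right] = \frac{(n-1)\,\mu(n-1) + x_n}{n}. \tag{8}$$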

For (6),

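$$\mu(n-1) = \frac{1}{n-1}\sum_{i=1}^{n-1} x_i = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i - x_n\right] = \frac{n\,\mu(n) - x_n}{n-1}. \tag{9}$$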

For (7), using (5) to substitute for μ(n) gives

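$$x_n - \mu(n) = x_n - \frac{(n-1)\,\mu(n-1) + x_n}{n} = \frac{n-1}{n}\left(x_n - \mu(n-1)\right). \tag{10}$$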

Lemma 2. For a univariate discrete random variable with equally distributed probability on the finite set of elements {x₁, x₂, …, x_n},

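$$\sigma^2(n) = \frac{n-1}{n}\,\sigma^2(n-1) + \frac{n-1}{n^2}\left(x_n - \mu(n-1)\right)^2 \tag{11}$$

$$\sigma^2(n-1) = \frac{n}{n-1}\,\sigma^2(n) - \frac{n}{(n-1)^2}\left(x_n - \mu(n)\right)^2 \tag{12}$$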

Proof 2. For (11),

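$$\sigma^2(n) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu(n)\right)^2 = \frac{1}{n}\left[\sum_{i=1}^{n-1}\left(\left(x_i - \mu(n-1)\right) + \left(\mu(n-1) - \mu(n)\right)\right)^2 + \left(x_n - \mu(n)\right)^2\right]. \tag{13}$$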

Rearranging the terms and using formulas (3) and (4) give

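$$\sigma^2(n) = \frac{1}{n}\left[(n-1)\,\sigma^2(n-1) + (n-1)\left(\mu(n-1) - \mu(n)\right)^2 + \left(x_n - \mu(n)\right)^2\right] = \frac{n-1}{n}\,\sigma^2(n-1) + \frac{n-1}{n^2}\left(x_n - \mu(n-1)\right)^2. \tag{14}$$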

For (12), substituting (7) into (11) gives

$$\sigma^2(n) = \frac{n-1}{n}\,\sigma^2(n-1) + \frac{1}{n-1}\left(x_n - \mu(n)\right)^2. \tag{15}$$

Solving (15) for σ²(n − 1) yields

$$\sigma^2(n-1) = \frac{n}{n-1}\,\sigma^2(n) - \frac{n}{(n-1)^2}\left(x_n - \mu(n)\right)^2. \tag{16}$$
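As a numerical check of Lemmas 1 and 2, the following Python sketch updates and downdates the mean and the population variance. It assumes the standard update and downdate identities implied by definitions (1)-(4), written out in the code; the function names `add_value` and `delete_value` are illustrative and do not come from the paper.

```python
from statistics import fmean, pvariance


def add_value(mean_prev, var_prev, x_new, n):
    """Return (mean, population variance) of n values, given the mean and
    variance of the first n - 1 values and the added value x_new."""
    mean_n = ((n - 1) * mean_prev + x_new) / n
    var_n = (n - 1) / n * var_prev + (n - 1) / n**2 * (x_new - mean_prev) ** 2
    return mean_n, var_n


def delete_value(mean_n, var_n, x_del, n):
    """Return (mean, population variance) of the n - 1 values that remain
    after x_del is removed from n values (requires n >= 2)."""
    mean_prev = (n * mean_n - x_del) / (n - 1)
    var_prev = n / (n - 1) * var_n - n / (n - 1) ** 2 * (x_del - mean_n) ** 2
    return mean_prev, var_prev


domain = [1, 2, 3, 4, 5, 117]
n = len(domain)
full = (fmean(domain), pvariance(domain))
reduced = (fmean(domain[:-1]), pvariance(domain[:-1]))

# Each printed pair should agree up to floating-point rounding.
print(add_value(*reduced, domain[-1], n), full)
print(delete_value(*full, domain[-1], n), reduced)
```

The shortcut formulas need only the previous mean, the previous variance, and the value in question, so the remaining values never have to be revisited.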

For these discrete distributions, Theorems 1 and 2 state where a single deleted value must lie for the variance to decrease or to increase.

Theorem 1. The variance is made smaller by the deletion of the single value x_n, i.e., σ²(n − 1) < σ²(n), if and only if the absolute value of the z-score of the deleted value with respect to the unreduced domain’s values is greater than √((n − 1)/n), i.e.,

$$\frac{\left|x_n - \mu(n)\right|}{\sigma(n)} > \sqrt{\frac{n-1}{n}}. \tag{17}$$

Proof 3. From (12) in Lemma 2, set

$$\frac{n}{n-1}\,\sigma^2(n) - \frac{n}{(n-1)^2}\left(x_n - \mu(n)\right)^2 < \sigma^2(n). \tag{18}$$

Then

$$\frac{\left(x_n - \mu(n)\right)^2}{\sigma^2(n)} > \frac{n-1}{n}, \tag{19}$$

which produces inequality (17).

In Theorem 2, the cutoff criterion is expressed in terms of the mean and standard deviation of the variable with the domain reduced by one value. The criterion is equivalent to the criterion in Theorem 1.

Theorem 2. The variance is made smaller by the deletion of the single value x_n, i.e., σ²(n − 1) < σ²(n), if and only if the absolute value of the z-score of the deleted value with respect to the reduced domain’s values is greater than √(n/(n − 1)), i.e.,

$$\frac{\left|x_n - \mu(n-1)\right|}{\sigma(n-1)} > \sqrt{\frac{n}{n-1}}. \tag{20}$$

Proof 4. From (11) in Lemma 2, set

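$$\sigma^2(n-1) < \frac{n-1}{n}\,\sigma^2(n-1) + \frac{n-1}{n^2}\left(x_n - \mu(n-1)\right)^2, \tag{21}$$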

which yields

$$\frac{\left(x_n - \mu(n-1)\right)^2}{\sigma^2(n-1)} > \frac{n}{n-1} \tag{22}$$

and inequality (20).

If the z-score of the deleted or added value in Theorems 1 and 2 turns (17) or (20) into an equality, the variances with and without that value are equal, and the ranges of the values may also be identical.
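The equivalence of the two criteria can be checked numerically. The Python sketch below compares the Theorem 1 test (z-score over all n values against √((n − 1)/n)), the Theorem 2 test (z-score over the remaining n − 1 values against √(n/(n − 1))), and a direct comparison of population variances. The function names are illustrative, and the domain is the one used later in Example 2.

```python
from math import sqrt
from statistics import fmean, pstdev, pvariance


def reduces_by_theorem_1(values, i):
    """Criterion using the z-score of values[i] over all n values."""
    n = len(values)
    z = abs(values[i] - fmean(values)) / pstdev(values)
    return z > sqrt((n - 1) / n)


def reduces_by_theorem_2(values, i):
    """Criterion using the z-score of values[i] over the other n - 1 values."""
    n = len(values)
    rest = values[:i] + values[i + 1:]
    z = abs(values[i] - fmean(rest)) / pstdev(rest)
    return z > sqrt(n / (n - 1))


def reduces_directly(values, i):
    """Compare the population variances with and without values[i]."""
    rest = values[:i] + values[i + 1:]
    return pvariance(rest) < pvariance(values)


domain = [1, 2, 3, 4, 5, 117]
for i, x in enumerate(domain):
    print(x, reduces_by_theorem_1(domain, i),
          reduces_by_theorem_2(domain, i), reduces_directly(domain, i))
```

For this domain the three tests agree on every element. A value lying exactly on the cutoff could give inconsistent answers in floating-point arithmetic, so exact rational arithmetic would be safer there.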

Example 1. Consider the equally spaced elements in the domain {1, 2, …, n}, each with the probability 1/n of occurrence. The mean and variance are

$$\mu(n) = \frac{n+1}{2}, \qquad \sigma^2(n) = \frac{n^2-1}{12}. \tag{23}$$

For x_n = n, using the criterion (17) of Theorem 1, obtain

$$\frac{\left|n - \frac{n+1}{2}\right|}{\sqrt{\frac{n^2-1}{12}}} > \sqrt{\frac{n-1}{n}}, \tag{24}$$

which, after squaring and simplifying, becomes 3 > 1 + 1/n and so holds for every n ≥ 2. Deleting {n} reduces the variance to (n − 2)n/12, which is less than σ²(n) = (n² − 1)/12 for all n. On the other hand, for {1, 2, …, 20} and x_n = 15,

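$$\frac{\left|15 - 10.5\right|}{\sqrt{33.25}} \approx 0.780 < \sqrt{\frac{19}{20}} \approx 0.975, \tag{25}$$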

so that deleting {15} leaves a variable with a larger variance. Another viewpoint is that the impact of adding {20} to the domain {1, 2, …, 19} is to increase variance, but adding {15} to the domain {1, 2, …, 13, 14, 16, 17, 18, 19, 20} would decrease variance. For n = 20, the cutoff values for decreasing or increasing variance are

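$$\mu(20) \pm \sigma(20)\sqrt{\frac{19}{20}} = 10.5 \pm \sqrt{33.25}\,\sqrt{0.95} \approx 10.5 \pm 5.62, \tag{26}$$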

i.e., 4.88 and 16.12, so that deleting an element of {1, 2, 3, 4, 17, 18, 19, 20} would decrease variance and deleting an element of {5, 6, … , 16} would increase variance.
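The cutoffs and the two sets for n = 20 can be confirmed by brute force. The Python sketch below computes the cutoffs in closed form for the domain {1, …, n} and then deletes each value in turn, comparing population variances directly; the function name `cutoffs` is illustrative.

```python
from math import sqrt
from statistics import pvariance


def cutoffs(n):
    """Cutoffs mu - w and mu + w, with w = sigma * sqrt((n - 1) / n),
    for the equally likely domain {1, ..., n}."""
    mu = (n + 1) / 2
    sigma = sqrt((n * n - 1) / 12)
    half_width = sigma * sqrt((n - 1) / n)
    return mu - half_width, mu + half_width


n = 20
domain = list(range(1, n + 1))
low, high = cutoffs(n)
full_var = pvariance(domain)

# Brute force: delete each value in turn and compare variances directly.
decreasing = [x for x in domain
              if pvariance([y for y in domain if y != x]) < full_var]
increasing = [x for x in domain if x not in decreasing]

print(round(low, 2), round(high, 2))
print(decreasing)
print(increasing)
```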

3. Deleting an Element From a Dataset

For data, the only changes are that the denominators of the variances are reduced by one, n is the sample size, and the notation changes from μ to x̄ and from σ² to s². The formulas analogous to (5)–(7) in Lemma 1 and (11) and (12) in Lemma 2 are

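$$\bar{x}(n) = \frac{(n-1)\,\bar{x}(n-1) + x_n}{n} \tag{27}$$

$$\bar{x}(n-1) = \frac{n\,\bar{x}(n) - x_n}{n-1} \tag{28}$$

$$x_n - \bar{x}(n) = \frac{n-1}{n}\left(x_n - \bar{x}(n-1)\right) \tag{29}$$

$$s^2(n) = \frac{n-2}{n-1}\,s^2(n-1) + \frac{1}{n}\left(x_n - \bar{x}(n-1)\right)^2 \tag{30}$$

$$s^2(n-1) = \frac{n-1}{n-2}\,s^2(n) - \frac{n}{(n-1)(n-2)}\left(x_n - \bar{x}(n)\right)^2 \tag{31}$$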

respectively. The proofs follow step-by-step those of Lemmas 1 and 2 with the necessary changes in notation. Also, see [1, 2] for these types of formulas. Equation (31) requires that n ≥ 3.

Cutoff values for the increase or decrease of sample variance are contained in Theorems 3 and 4, whose proofs parallel those of Theorems 1 and 2. Although the denominators of the sample variances contain n − 1 and n − 2, instead of the population variances’ n and n − 1, those differences in the formulas disappear when the algebra is performed. The right-hand sides of the inequalities in Theorems 3 and 4 are the same as those in Theorems 1 and 2.

Theorem 3. For the observations {x₁, x₂, …, x_n}, the sample variance is made smaller by the deletion of the single value x_n, i.e., s²(n − 1) < s²(n), if and only if the absolute value of the z-score of the deleted value with respect to the unreduced dataset is greater than √((n − 1)/n), i.e.,

$$\frac{\left|x_n - \bar{x}(n)\right|}{s(n)} > \sqrt{\frac{n-1}{n}}. \tag{32}$$

Theorem 4. For the observations {x₁, x₂, …, x_n}, the sample variance is made smaller by the deletion of the single value x_n, i.e., s²(n − 1) < s²(n), if and only if the absolute value of the z-score of the deleted value with respect to the reduced dataset is greater than √(n/(n − 1)), i.e.,

$$\frac{\left|x_n - \bar{x}(n-1)\right|}{s(n-1)} > \sqrt{\frac{n}{n-1}}. \tag{33}$$
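The sample-variance case can be checked in the same way. The Python sketch below applies the downdate identity for the sample variance (denominators n − 1 and n − 2, so n ≥ 3) and the Theorem 3 criterion. The observations in `data` are hypothetical and the function names are illustrative.

```python
from math import sqrt
from statistics import fmean, stdev, variance


def sample_downdate(xbar_n, s2_n, x_del, n):
    """Sample mean and sample variance after deleting x_del from n
    observations (requires n >= 3)."""
    xbar_prev = (n * xbar_n - x_del) / (n - 1)
    s2_prev = ((n - 1) / (n - 2) * s2_n
               - n / ((n - 1) * (n - 2)) * (x_del - xbar_n) ** 2)
    return xbar_prev, s2_prev


def reduces_sample_variance(data, i):
    """Theorem 3 criterion: sample z-score of data[i] against sqrt((n - 1) / n)."""
    n = len(data)
    z = abs(data[i] - fmean(data)) / stdev(data)
    return z > sqrt((n - 1) / n)


data = [4.1, 5.0, 5.2, 5.9, 6.3, 9.8]  # hypothetical observations
for i, x in enumerate(data):
    rest = data[:i] + data[i + 1:]
    _, s2_prev = sample_downdate(fmean(data), variance(data), x, len(data))
    # Columns: value, criterion, direct comparison, downdated s^2, recomputed s^2.
    print(x, reduces_sample_variance(data, i),
          variance(rest) < variance(data),
          round(s2_prev, 4), round(variance(rest), 4))
```

The cutoff is the same as in the population case, but the z-score here uses the sample standard deviation, so the same value can fall on different sides of the cutoff depending on which variance is used.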

4. Truncation

More than one value might be deleted. Deleting a set of the lowest and/or the highest values from distributions or sets of observations is called truncation.

Example 1 revisited: Consider truncation from the right. In this example, every deleted value lies beyond the bound in Theorem 1, because the largest remaining value always satisfies (17). Each single step of truncation from the right therefore decreases the variance.

Example 2. Consider the elements of a random variable’s domain {1, 2, 3, 4, 5, 117}, each with the probability 1/6 of occurrence, and truncation from the left. As successively larger truncations are taken, Table 1 shows the somewhat surprising fact that the variance over the remaining domain becomes larger while the range becomes smaller. Using (17) in Theorem 1, the most recently deleted value at each step can be checked for whether its deletion increases the variance relative to the previous truncation. For example, the absolute value of the z-score for {2}, the value added to the truncated set in passing from {1} to {1, 2}, computed with respect to the remaining domain {2, 3, 4, 5, 117}, is

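$$\frac{\left|2 - 26.2\right|}{\sqrt{2062.16}} \approx 0.533 < \sqrt{\frac{4}{5}} \approx 0.894. \tag{34}$$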

Table 1. Sequence of left truncations for Example 2. The mean, range, and variance are for the remaining values.

Deleted value(s)	Mean	Range	Variance
None	22	116	1806.67
{1}	26.2	115	2062.16
{1, 2}	32.25	114	2394.69
{1, 2, 3}	42	113	2812.67
{1, 2, 3, 4}	61	112	3136.00

Commonly, the deleted values are represented by the percentage of the original values that are omitted. For {1, 2} in Example 2, the deletion percentage is (2/6) × 100% ≈ 33.3%.
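The left truncations of Example 2 can be reproduced with a short Python sketch. It lists, for each truncation, the deleted values, the mean, range, and population variance of the remaining values, and the deletion percentage. The generator name `left_truncations` is illustrative.

```python
from statistics import fmean, pvariance


def left_truncations(values):
    """Yield (deleted, mean, range, variance, percent deleted) for successive
    left truncations, stopping while at least two values remain."""
    ordered = sorted(values)
    n = len(ordered)
    for k in range(n - 1):
        rest = ordered[k:]
        yield (ordered[:k], fmean(rest), rest[-1] - rest[0],
               pvariance(rest), 100 * k / n)


rows = list(left_truncations([1, 2, 3, 4, 5, 117]))
for deleted, mean, rng, var, pct in rows:
    print(deleted or "None", round(mean, 2), rng, round(var, 2), f"{pct:.1f}%")
```

The variance column grows at every step even though the range shrinks, because each deleted value lies within the cutoff of the then-current domain.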

5. Discussion and Concluding Comments

The right-hand sides of the inequalities in Theorems 1–4 are approximately equal to one for large values of n. For Theorems 1 and 3, the right-hand side √((n − 1)/n) exceeds 0.95 for n ≥ 11. The closeness of the right-hand sides to each other is not surprising. Z-scores have the feature that, as the x-value moves away from the mean, the standard deviation in the denominator increases along with the numerator. Thus, for observations, detecting an outlying datum by means of z-scores can be problematic [3]. Also, there are upper bounds on the possible z-scores for elements of the set, and those bounds depend upon n [4].
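How quickly the two cutoffs approach one can be tabulated directly. In the Python sketch below, the cutoffs are √((n − 1)/n) for the unreduced z-score and √(n/(n − 1)) for the reduced z-score; the tolerance of 0.05 and the function names are illustrative assumptions rather than values taken from the paper.

```python
from math import sqrt


def threshold_unreduced(n):
    """Cutoff for the z-score computed over all n values."""
    return sqrt((n - 1) / n)


def threshold_reduced(n):
    """Cutoff for the z-score computed over the n - 1 remaining values."""
    return sqrt(n / (n - 1))


def first_n_within(tolerance):
    """Smallest n >= 2 for which both cutoffs lie within `tolerance` of one."""
    n = 2
    while (1 - threshold_unreduced(n) > tolerance
           or threshold_reduced(n) - 1 > tolerance):
        n += 1
    return n


for n in (2, 5, 11, 50, 1000):
    print(n, round(threshold_unreduced(n), 4), round(threshold_reduced(n), 4))
print(first_n_within(0.05))
```

The two cutoffs are reciprocals of each other, so one approaches one from below and the other from above.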

There are related ideas. For example, (5), (11), (28), and (30) are shortcuts for computing the mean and variance from the previous values [5, 6]. The related problem of changing the numerical value of one datum that is already in the dataset is explored in [6, 7].

Conflicts of Interest

The author declares no conflicts of interest.

Funding

No funds were received by the author regarding this paper.

Open Research

Data Availability Statement

No data were used to support this study.

References
