Deleting Values May Either Increase or Decrease Variance
Abstract
The impact upon variance when a value is deleted is addressed. It is shown that the cutoff for the deleted value yielding an increase or a decrease in variance is approximately one standard deviation from the mean for a univariate random variable with equally distributed probability on a finite set of elements and for a univariate set of observations. The influence of truncation of the domain for such a discrete random variable and for observations is considered.
1. Introduction
The goal is to express the means and variances for n − 1 values and for n values in terms of each other and thereby to find a condition under which the variance is reduced by the deletion, i.e., $\sigma^2(n-1) < \sigma^2(n)$. The analogous situation in a set of observations is addressed in Section 3. Truncation is the topic of Section 4.
2. Deleting an Element From the Domain of a Discrete Random Variable
The effects on the mean and the variance of adding or deleting a single value of the domain for a random variable in which each value is equally likely to occur are contained in Lemmas 1 and 2.
Lemma 1. For a univariate discrete random variable with equally distributed probability on the finite set of elements $\{x_1, x_2, \ldots, x_n\}$,
Lemma 2. For a univariate discrete random variable with equally distributed probability on the finite set of elements $\{x_1, x_2, \ldots, x_n\}$,
Proof 2. For (11),
Rearranging the terms and using formulas (3) and (4) give
For (12), substituting (7) into (11) gives
Solving (15) for $\sigma^2(n-1)$ yields
The locations of a single deleted value, in order for the variance to be increased or decreased, are given in Theorems 1 and 2 for these discrete distributions.
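The cutoffs in these theorems can be traced to how the variance changes when one value is removed. The following is a sketch only, assuming that $\mu(k)$ and $\sigma^2(k)$ denote the mean and the population variance of $x_1, \ldots, x_k$ with equal probabilities $1/k$. The notation and the particular form of the identities are assumptions made here for illustration, and no correspondence with the numbered equations of the lemmas is claimed.

```latex
% Assumed notation: \mu(k), \sigma^2(k) = mean and population variance of x_1..x_k.
\begin{align*}
\mu(n-1) &= \frac{n\,\mu(n) - x_n}{n-1},
\qquad
x_n - \mu(n-1) = \frac{n}{n-1}\,\bigl(x_n - \mu(n)\bigr), \\[4pt]
n\,\sigma^2(n) &= (n-1)\,\sigma^2(n-1) + \frac{n}{n-1}\,\bigl(x_n - \mu(n)\bigr)^2, \\[4pt]
\sigma^2(n) - \sigma^2(n-1)
 &= \frac{1}{n-1}\left[\frac{n}{n-1}\,\bigl(x_n - \mu(n)\bigr)^2 - \sigma^2(n)\right] \\
 &= \frac{1}{n}\left[\frac{n-1}{n}\,\bigl(x_n - \mu(n-1)\bigr)^2 - \sigma^2(n-1)\right].
\end{align*}
```

Under these assumptions the difference is positive exactly when $(x_n - \mu(n))^2 / \sigma^2(n) > (n-1)/n$, or equivalently when $(x_n - \mu(n-1))^2 / \sigma^2(n-1) > n/(n-1)$.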
Theorem 1. The variance is made smaller by the deletion of the single value $x_n$, i.e., $\sigma^2(n-1) < \sigma^2(n)$, if and only if the absolute value of the z-score of the deleted value with respect to the unreduced domain’s values is greater than $\sqrt{(n-1)/n}$, i.e.,
$$\left|\frac{x_n - \mu(n)}{\sigma(n)}\right| > \sqrt{\frac{n-1}{n}}. \tag{17}$$
In Theorem 2, the cutoff criterion is expressed in terms of the mean and standard deviation of the variable with the domain reduced by one value. The criterion is equivalent to the criterion in Theorem 1.
Theorem 2. The variance is made smaller by the deletion of the single value $x_n$, i.e., $\sigma^2(n-1) < \sigma^2(n)$, if and only if the absolute value of the z-score of the deleted value with respect to the reduced domain’s values is greater than $\sqrt{n/(n-1)}$, i.e.,
$$\left|\frac{x_n - \mu(n-1)}{\sigma(n-1)}\right| > \sqrt{\frac{n}{n-1}}. \tag{20}$$
If the deleted or added value in Theorems 1 and 2 has a z-score that makes (17) and (20) equalities, the variances with and without that value are equal, and the ranges of the two sets of values may also be identical.
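The two criteria can be compared numerically with a direct recomputation of the variance. The sketch below is illustrative only: the function names are hypothetical, the cutoffs $\sqrt{(n-1)/n}$ and $\sqrt{n/(n-1)}$ are taken as assumptions that follow from the algebra of the variance update, and the domain is the one used later in Example 2. It assumes at least three values that are not all equal, so that both standard deviations are nonzero.

```python
from math import sqrt
from statistics import fmean, pstdev, pvariance


def reduces_by_unreduced_cutoff(values, index):
    """Theorem 1 style check: z-score w.r.t. all n values vs sqrt((n - 1) / n)."""
    n = len(values)
    z = (values[index] - fmean(values)) / pstdev(values)
    return abs(z) > sqrt((n - 1) / n)


def reduces_by_reduced_cutoff(values, index):
    """Theorem 2 style check: z-score w.r.t. the other n - 1 values vs sqrt(n / (n - 1))."""
    n = len(values)
    reduced = values[:index] + values[index + 1:]
    z = (values[index] - fmean(reduced)) / pstdev(reduced)
    return abs(z) > sqrt(n / (n - 1))


def reduces_by_direct_comparison(values, index):
    """Recompute both population variances and compare them."""
    reduced = values[:index] + values[index + 1:]
    return pvariance(reduced) < pvariance(values)


domain = [1, 2, 3, 4, 5, 117]
indices = range(len(domain))
unreduced = [reduces_by_unreduced_cutoff(domain, i) for i in indices]
reduced = [reduces_by_reduced_cutoff(domain, i) for i in indices]
direct = [reduces_by_direct_comparison(domain, i) for i in indices]
print(unreduced)
print(reduced)
print(direct)
```

For this domain all three checks agree: only the deletion of 117 lowers the variance.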
Example 1. Consider the equally spaced elements in the domain {1, 2, …, n}, each with the probability 1/n of occurrence. The mean and variance are $\mu(n) = (n+1)/2$ and $\sigma^2(n) = (n^2-1)/12$.
3. Deleting an Element From a Dataset
Cutoff values for the increase or decrease of sample variance are contained in Theorems 3 and 4, whose proofs parallel those of Theorems 1 and 2. Although the denominators of the sample variances contain n − 1 and n − 2, instead of the population variances’ n and n − 1, those differences in the formulas disappear when the algebra is performed. The right-hand sides of the inequalities in Theorems 3 and 4 are the same as those in Theorems 1 and 2.
Theorem 3. For the observations $\{x_1, x_2, \ldots, x_n\}$, the sample variance is made smaller by the deletion of the single value $x_n$, i.e., $s^2(n-1) < s^2(n)$, if and only if the absolute value of the z-score of the deleted value with respect to the unreduced dataset is greater than $\sqrt{(n-1)/n}$, i.e.,
$$\left|\frac{x_n - \bar{x}(n)}{s(n)}\right| > \sqrt{\frac{n-1}{n}}.$$
Theorem 4. For the observations $\{x_1, x_2, \ldots, x_n\}$, the sample variance is made smaller by the deletion of the single value $x_n$, i.e., $s^2(n-1) < s^2(n)$, if and only if the absolute value of the z-score of the deleted value with respect to the reduced dataset is greater than $\sqrt{n/(n-1)}$, i.e.,
$$\left|\frac{x_n - \bar{x}(n-1)}{s(n-1)}\right| > \sqrt{\frac{n}{n-1}}.$$
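The sample-variance version can be checked in the same way. The following is a minimal sketch, assuming the cutoff $\sqrt{(n-1)/n}$ for the z-score computed from the full dataset; the function names and the five observations are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import fmean, stdev, variance


def sample_cutoff_says_smaller(data, index):
    """Cutoff check: |z| w.r.t. the full dataset exceeds sqrt((n - 1) / n)."""
    n = len(data)
    z = (data[index] - fmean(data)) / stdev(data)  # stdev uses the n - 1 denominator
    return abs(z) > sqrt((n - 1) / n)


def sample_variance_is_smaller(data, index):
    """Direct check: compare the sample variances with and without the value."""
    reduced = data[:index] + data[index + 1:]
    return variance(reduced) < variance(data)


observations = [10, 12, 13, 15, 30]  # hypothetical data
positions = range(len(observations))
by_cutoff = [sample_cutoff_says_smaller(observations, i) for i in positions]
by_direct = [sample_variance_is_smaller(observations, i) for i in positions]
print(by_cutoff)
print(by_direct)
```

Here only the deletion of 30 lowers the sample variance (from 64.5 to about 4.33), and the cutoff check reaches the same verdict for every observation.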
4. Truncation
More than one value might be deleted. Deleting a set of the lowest and/or the highest values from distributions or sets of observations is called truncation.
Example 1 revisited: Consider truncation from the right. In this example, all deleted values are outside the bound in Theorem 1 because each single step of truncation from the right decreases the variance.
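The claim for the rightmost value can be verified in closed form. As a sketch, assume the cutoff $\sqrt{(n-1)/n}$ for Theorem 1 and the standard mean $(n+1)/2$ and variance $(n^2-1)/12$ of the equally likely values 1, 2, …, n; the symbols $\mu(n)$ and $\sigma(n)$ are notation chosen here.

```latex
\[
\left|\frac{n-\mu(n)}{\sigma(n)}\right|
 = \frac{(n-1)/2}{\sqrt{(n^2-1)/12}}
 = \sqrt{\frac{3(n-1)}{n+1}}
 > \sqrt{\frac{n-1}{n}}
 \qquad \text{for } n \ge 2,
\]
% because 3/(n+1) > 1/n is equivalent to 3n > n + 1.
```

The same computation applies after each step, since the remaining domain {1, 2, …, n − 1} has the same form.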
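Example 2. Consider the elements of a random variable’s domain {1, 2, 3, 4, 5, 117}, each with the probability 1/6 of occurrence, and truncation from the left. As successively larger truncations are taken, Table 1 shows the somewhat surprising fact that the variance of the variable with the remaining domain becomes larger, while the range becomes smaller. Using (17) in Theorem 1, the rightmost value in each set of truncated values can be checked for increasing the variance, compared to the previous truncation. For example, the absolute value of the z-score for {2} in the truncated set {1, 2}, computed with respect to the values {2, 3, 4, 5, 117} that remain after the previous truncation, is
$$\left|\frac{2 - 26.2}{\sqrt{2062.16}}\right| \approx 0.533,$$
which is less than the cutoff $\sqrt{(n-1)/n} = \sqrt{4/5} \approx 0.894$ for n = 5, so deleting 2 increases the variance.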
Deleted value(s) | Mean of remaining values | Range of remaining values | Variance of remaining values |
---|---|---|---|
None | 22 | 116 | 1806.67 |
{1} | 26.2 | 115 | 2062.16 |
{1, 2} | 32.25 | 114 | 2394.69 |
{1, 2, 3} | 42 | 113 | 2812.67 |
{1, 2, 3, 4} | 61 | 112 | 3136.00 |
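The entries of the table can be recomputed with a few lines of code. This is a minimal sketch using population variances (denominator equal to the number of remaining values); the helper name `truncate_left` is hypothetical.

```python
from statistics import fmean, pvariance

domain = [1, 2, 3, 4, 5, 117]


def truncate_left(values, k):
    """Return (mean, range, population variance) after deleting the k smallest values."""
    remaining = sorted(values)[k:]
    return fmean(remaining), max(remaining) - min(remaining), pvariance(remaining)


rows = [truncate_left(domain, k) for k in range(5)]
for k, (mean, value_range, var) in enumerate(rows):
    deleted = sorted(domain)[:k]
    print(deleted, round(mean, 2), value_range, round(var, 2))
```

Each successive row has a smaller range and a larger variance, in line with the table.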
Commonly, the deleted values are reported as the percentage of the original values that are omitted. For {1, 2} in Example 2, the deletion percentage is (2/6) × 100% ≈ 33.3%.
5. Discussion and Concluding Comments
The right-hand sides of the inequalities in Theorems 1–4 are approximately equal to one for large values of n. For Theorems 1 and 3, the right-hand side $\sqrt{(n-1)/n}$ exceeds 0.95 for n ≥ 11. The closeness of the right-hand sides to each other is not surprising. Z-scores have the feature that, as an x-value moves away from the mean, the standard deviation in the denominator increases along with the numerator. Thus, for observations, detecting an outlying datum by means of z-scores can be problematic [3]. Also, there are upper bounds on the possible z-scores of the elements of a set, and those bounds depend upon n [4].
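How quickly the cutoffs approach one can be tabulated directly. The sketch below assumes the cutoffs are $\sqrt{(n-1)/n}$ for the unreduced set and $\sqrt{n/(n-1)}$ for the reduced set; the function names are hypothetical.

```python
from math import sqrt


def cutoff_unreduced(n):
    """Cutoff for the z-score computed from all n values."""
    return sqrt((n - 1) / n)


def cutoff_reduced(n):
    """Cutoff for the z-score computed from the remaining n - 1 values."""
    return sqrt(n / (n - 1))


for n in (2, 5, 10, 11, 50, 1000):
    print(n, round(cutoff_unreduced(n), 4), round(cutoff_reduced(n), 4))
```

The two cutoffs are reciprocals of each other, so one approaches one from below and the other from above as n grows.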
There are related ideas. For example, (5), (11), (28), and (30) are shortcuts for computing the mean and variance from the previous values [5, 6]. The related problem of changing the numerical value of one datum that is already in the dataset is explored in [6, 7].
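Update shortcuts of this kind can be sketched for the population mean and variance as follows. The function names and the exact algebraic form are assumptions made for illustration; they are one standard way to write such updates and are not claimed to reproduce the cited equations term by term.

```python
from statistics import fmean, pvariance


def add_value(n_prev, mean_prev, var_prev, x):
    """Update mean and population variance when x joins n_prev existing values."""
    n = n_prev + 1
    mean = mean_prev + (x - mean_prev) / n
    var = (n_prev / n) * var_prev + (n_prev / n**2) * (x - mean_prev) ** 2
    return n, mean, var


def remove_value(n, mean, var, x):
    """Update mean and population variance when x is deleted from n values (n >= 2)."""
    m = n - 1
    mean_reduced = (n * mean - x) / m
    var_reduced = (n / m) * var - (n / m**2) * (x - mean) ** 2
    return m, mean_reduced, var_reduced


domain = [1, 2, 3, 4, 5, 117]
full = (len(domain), fmean(domain), pvariance(domain))

# Delete 117 without revisiting the other values, then add it back.
without_117 = remove_value(*full, 117)
restored = add_value(*without_117, 117)
print(without_117)
print(restored)
```

Deleting 117 gives a mean of 3 and a variance of 2 (up to floating-point rounding), which matches recomputing both quantities from {1, 2, 3, 4, 5}; adding 117 back recovers the original mean of 22 and variance of about 1806.67.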
Conflicts of Interest
The author declares no conflicts of interest.
Funding
No funds were received by the author regarding this paper.
Open Research
Data Availability Statement
No data were used to support this study.