1. Introduction
Sample variance calculation is a fundamental task in many data analysis applications. The basic formula for calculating a sample variance is based on the sum of squared differences from mean. Given that a set of data is
x1,
x2, … ,
xn, the sample variance, denoted as
, is calculated as follows:
()
where
is the sum of squared differences from the mean and
is the sample mean. Von Neumann et al. [
1] pointed out that (
1) does not take into account the order of the observations. They proposed to instead use successive differences of data so that the order can be considered. Specifically, they used
()
where the subscript
i refers to temporal order of the data and
di =
xi −
xi−1. Define {
x0,
d1, …,
dn} as the successive differences of the input data. From computational perspective, von Neumann’s formula is also advantageous as it avoids a mean calculation that may introduce rounding errors.
The problem with is that it is not mathematically equivalent to the basic definition. This problem was independently solved by Eilon and Chowdhury [2] and Joarder [3] where weighted successive differences were used to derive a formula that is mathematically equivalent to the basic definition.
Eilon and Chowdhury [
2] considered a job scheduling problem where they wanted to minimize the variance of the job’s waiting time. Let
yi be the waiting time of the
ith job. By definition,
y1 = 0, as the first job does not have waiting time, and
, for
i = 1,2, …,
n, where
n is the number of jobs and
pj is the processing time of job
j. The objective is to minimize the variance of the waiting time, or equivalently
. For this purpose, there is a need to quickly update SS
n when job
i and
j are swapped. Notice that when job
i and
j are swapped, most of the jobs’ waiting time will change accordingly, and
and SS
n have to be recalculated. To avoid recalculating
when updating SS
n, Eilon and Chowdhury derived a formula to calculate SS
n from successive differences. By definition, the successive differences of the waiting time are the processing time; that is,
pi =
yi+1 −
yi,
i = 1, 2, … ,
n − 1. So,
()
where
()
Equation (
3) is not a general formula for calculating SS
n as
y1 is zero. Vani and Raghavachari [
4] gave a more general formula by considering the job’s completion time rather than waiting time. Let
be the completion time of job
i. They rewrote (
3) as follows:
()
where
()
for
i,
j = 1, 2, … ,
n. In an independent work, Joarder [
3] also derived a formula similar to (
5). He then converted its double sum structure into a quadratic form wherein
()
where
is a vector of the successive differences and
is a weight matrix with
cij as defined in (
6).
One problem with the weight formula in (6) is that it is not in a unified format but has to be represented as two formulas. This deficiency prohibits a compact representation that would facilitate further derivations. To solve this problem, we derive a unified weight formula for sample variance calculations from weighted successive differences. Joarder [3] derived an updating formula to calculate a variance from weighted successive differences. But, his formula contains a dynamically increased number of updating items. Using the unified weight formula, we show [5] that we can improve Joarder’s formula by reducing the updating items to a fixed number of only two items.
2. Main Results
Theorem 1. Given that a temporally order of the observations x1, x2, … , xn the sum of squared differences about the mean can be represented as
()
where
, {
di =
xi −
xi−1}, for
x0 = 0,
i = 1, 2, …,
n, and
are the
n ×
n symmetric matrix with
()
Proof. First write
()
for
. Now,
x can be presented as
()
for
()
that is, the row
i column
j element of
Pn×n is
()
for
i,
j = 1, 2, …,
n.
Next, the mean of x can be written as
()
In vector form this is
()
where
()
the row
i column
j element of
Qn×n is
()
Now observe that
()
Thus we need to obtain expressions for calculating
,
,
, and
.
First
()
where
()
Then
()
where
()
That is,
()
Finally,
()
where
()
We now can see that
and, hence,
()
A direct calculation produces as follows:
()
Thus,
and the proof is complete.
3. Numerical Example
This section gives a numerical example to illustrate sample variance calculation using the nonunified weight formula
cij given in (
6) and the unified formula
wij given in (
9). We take a sample data set
x1 = 5,
x2 = 14,
x3 = 9, and
x4 = 6 from Ross [
6, Page 145] where the data are used to illustrate the variance updating process using the one-pass algorithm proposed in van Reeken [
7]. The successive differences for this data set are
()
and the successive differences vector is
. Using the nonunified weight formula, the weight matrix
is constructed using two formulas: one for the lower triangular matrix and one for the strictly upper triangular matrix. For the lower triangular matrix with
i ≥
j,
cij = (
n −
i + 1)(
j − 1). For example,
c11 = (4 − 1 + 1)(1 − 1) = 0,
c21 = (4 − 2 + 1)(1 − 1) = 0,
c31 = (4 − 3 + 1)(1 − 1) = 0, and
c32 = (4 − 3 + 1)(2 − 1) = 2. For the strictly upper triangular matrix with
i <
j, the weight formula is
cij = (
n −
j + 1)(
i − 1). For example,
c12 = (4 − 2 + 1)(1 − 1) = 0,
c13 = (4 − 3 + 1)(1 − 1) = 0, and
c23 = (4 − 3 + 1)(2 − 1) = 2. Combining the lower triangular matrix and the strictly upper triangular matrix we can get
()
The variance is then calculated as
()
Now with our approach, the weight matrix
is constructed using the unified weight formula given in (
9). For example,
w12 = (4 + 1)(1 + 2 − 1) − 1 × 2 − (4/2)(1 + 2 + |1 − 2|) = 0 and
w21 = (4 + 1)(2 + 1 − 1) − 2 × 1 − (4/2)(2 + 1 + |2 − 1|) = 0. Similarly,
w23 and
w32 are calculated as
w23 = (4 + 1)(2 + 3 − 1) − 2 × 3 − (4/2)(2 + 3 + |2 − 3|) = 2 and
w32 = (4 + 1)(3 + 2 − 1) − 3 × 2 − (4/2)(3 + 2 + |3 − 2|) = 2. The other weights are calculated in a similar manner to produce
()
The variance is then calculated as
()
4. Conclusions
Sample variance calculation using weighted successive differences is advantageous from a computational perspective as it avoids a mean calculation which may introduce rounding errors. However, the weight formula that has been proposed in previous research is not in a unified format. Instead, it has to be represented as two formulas. This deficiency prohibits compact representation of further derivations. This paper derives a unified weight formula for calculating a sample variance from weighted successive differences. We have employed this compute formula to improve variance updating formula in Vani and Raghavachari [4] or Joarder [3].
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.