Abstract

Three methods of temporal data upscaling, which may collectively be called the generalized k-nearest neighbor (GkNN) method, are considered. The accuracy of the GkNN simulation of month by month yield is examined (where the term yield denotes the dependent variable). The notion of an eventually well-distributed time series is introduced and, on the basis of this assumption, some properties of the average annual yield and its variance for a GkNN simulation are computed. The total yield over a planning period is determined. A general framework for considering the GkNN algorithm, based on the notion of stochastically dependent time series, is described, and it is shown that for a sufficiently large training set the GkNN simulation has the same statistical properties as the training data. An example of the application of the methodology is given for the problem of simulating the yield of a rainwater tank given monthly climatic data.

1. Introduction

The k-nearest neighbor method has its origins in the work of Mack [1], Yakowitz and Karlsson [2], and others, e.g., [3, 4]. In this work an estimate for Y_i given an independently and identically distributed (i.i.d.) sequence (X_j, Y_j) of random vectors with X_j ∈ R^p and Y_j ∈ R (where R denotes the set of real numbers) on the basis of {(X_j, Y_j) : j < i} is obtained by taking the average of Y_j over the set {Y_j : j ∈ J}, where J is the set of indices of vectors X_j which form the k nearest neighbors of X_i, in which k > 1.

In later work by Lall and Sharma [5] and Rajagopalan and Lall [6] a related method, also called the k-nearest neighbor method, was used for simulating hydrological stochastic time series (X_i, Y_i). In this method the next value in the simulated time series is chosen randomly according to a probability distribution over the set J of indices j of the k nearest neighbors X_j of X_i in {X_j : j < i}.

More recent work in the area has been carried out by Biau et al. [7], Lee and Ouarda [8], and Zhang [9].

In the present paper we derive some general results about the k-nearest neighbor algorithm and related methods, which we group together as a general class called the generalized k-nearest neighbor method (GkNN method). We do not assume that the time series are i.i.d. [1], null-recurrent Markov [10], or Harris recurrent Markov chains [11]. Instead we introduce the natural notion of a time series being eventually well distributed; when this property holds, some properties of the GkNN algorithm can be deduced.

The generalized k nearest neighbor (GkNN) algorithm is described in Section 2. Section 3 investigates the problem of predicting the month by month yield (where we use the term “yield” to denote the value of the dependent variable Y_i), while Section 4 considers the computation of the average annual yield. Section 5 computes the variance of the average annual yield, while Section 6 considers the behavior of the total yield. Section 7 describes a general framework for viewing the GkNN algorithm and conditions under which this framework is applicable in practice. Section 8 presents the particular example of simulating rainwater tank yield. The paper concludes in Section 9.

2. The Generalized k Nearest Neighbor (GkNN) Method

In the GkNN method we are given a time series {v_t ∈ V : t = 1, …, T} of predictor vectors which may be obtained from, for example, a stochastic simulation of climatic data. Here V denotes the space of predictor vectors. We are also given training data {(w_i, u_i) ∈ V × [0, ∞) : i = 1, …, N}.

We want to assign yields y_t for t = 1, …, T in a meaningful way. We are given a metric μ : V × V → [0, ∞). We are also given a probability distribution {p₁, …, p_N} on {1, …, N}. In the GkNN method the yield time series y_t for t = 1,…, T is computed as follows.

For each t = 1, …, T,

(1) Compute the metric values μ(v_t, w_i) for i = 1, …, N and sort them from lowest to highest. Let π_t be the resulting permutation of {1, …, N}, so that w_{π_t(1)} is the training vector nearest to v_t.
(2) Randomly choose i ∈ {1, …, N} according to the distribution {p₁, …, p_N}. Denote it by i_selected.
(3) Return y_t = u_{π_t(i_selected)}.
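
The three steps above can be sketched in Python as follows. This is a minimal illustration and not the author's implementation: the function name gknn_simulate, the representation of the training data as (w_i, u_i) pairs, and the reading of Step (3) as returning the yield u of the training record ranked i_selected are assumptions made for the example.

```python
import random


def gknn_simulate(predictors, training, metric, probs, rng=None):
    """Run one GkNN simulation and return the list of yields y_1, ..., y_T.

    predictors: sequence of predictor vectors v_t
    training:   sequence of (w_i, u_i) pairs
    metric:     function mu(v, w) returning a non-negative float
    probs:      selection probabilities p_1, ..., p_N over neighbor ranks
    """
    rng = rng or random.Random()
    n = len(training)
    if len(probs) != n:
        raise ValueError("probs must have one entry per training record")
    yields = []
    for v in predictors:
        # Step (1): rank training indices by distance to v, nearest first.
        order = sorted(range(n), key=lambda i: metric(v, training[i][0]))
        # Step (2): draw a rank (0-based here) according to p_1, ..., p_N.
        rank = rng.choices(range(n), weights=probs, k=1)[0]
        # Step (3), as assumed above: return the yield of the record at that rank.
        yields.append(training[order[rank]][1])
    return yields
```

Setting probs to (1, 0, …, 0) reduces this sketch to a deterministic nearest neighbor lookup, while spreading the probability over the first k ranks gives a k-nearest neighbor resampler.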

3. Prediction of the Month by Month Yield by GkNN Simulation

We want to determine by either theoretical calculation or computational experiment how well the GkNN method predicts yields, or at least to find some sense in which it can be said that the GkNN method is predicting yields accurately. Suppose that we have a training set {(w_i, u_i) : i = 1, …, N}. Let {v_t : t = 1, …, T} be a given climatic time series and {z_t : t = 1, …, T} the associated (unknown) actual yields. The GkNN method is a stochastic method for generating a yield time series. Suppose that we run it R times, resulting in a yield time series {y_t^(r) : t = 1, …, T} for run r, where r ∈ {1, …, R}.

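We will first work out how well the GkNN predicted yield approximates the actual yield for any given month. A measure of the error of the predicted yield compared to the actual yield for month t and run r is the square of the deviation, i.e., (y_t^(r) − z_t)², where y_t^(r) is the yield predicted for month t in run r. The expected error for the GkNN computation of the yield for month t is

$$E_t = \lim_{R \to \infty} \frac{1}{R} \sum_{r=1}^{R} \left( y_t^{(r)} - z_t \right)^2. \tag{1}$$

We will show that this expected error exists and is positive. Let 〈y_t〉 denote the expected value of the GkNN prediction of the yield for month t. We will show that 〈y_t〉 exists. Let γ(t, r) denote the index i_selected chosen in Step (2) of the GkNN algorithm for month t and run r. By definition

$$\langle y_t \rangle = \lim_{R \to \infty} \frac{1}{R} \sum_{r=1}^{R} y_t^{(r)} = \lim_{R \to \infty} \frac{1}{R} \sum_{r=1}^{R} u_{\pi_t(\gamma(t,r))} = \sum_{i=1}^{N} p_i \, u_{\pi_t(i)}. \tag{2}$$

Thus 〈y_t〉 exists. Now

$$\left( y_t^{(r)} - z_t \right)^2 = \left( y_t^{(r)} - \langle y_t \rangle \right)^2 + 2 \left( y_t^{(r)} - \langle y_t \rangle \right) \left( \langle y_t \rangle - z_t \right) + \left( \langle y_t \rangle - z_t \right)^2. \tag{3}$$

Therefore, since the cross term averages to zero over the runs,

$$E_t = \lim_{R \to \infty} \frac{1}{R} \sum_{r=1}^{R} \left( y_t^{(r)} - \langle y_t \rangle \right)^2 + \left( \langle y_t \rangle - z_t \right)^2. \tag{4}$$

Now the variance Var(y_t) is given by

$$\mathrm{Var}(y_t) = \lim_{R \to \infty} \frac{1}{R} \sum_{r=1}^{R} \left( y_t^{(r)} - \langle y_t \rangle \right)^2 = \sum_{i=1}^{N} p_i \left( u_{\pi_t(i)} - \langle y_t \rangle \right)^2. \tag{5}$$

Thus

$$E_t = \mathrm{Var}(y_t) + \left( \langle y_t \rangle - z_t \right)^2. \tag{6}$$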

The expected error is the sum of two nonnegative terms. The first term can only be zero if all the points in the neighborhood have associated yields equal to 〈y_t〉, and this is seldom the case. The greater the spread of yields in the neighborhood, the greater the first term and hence E_t will be. Thus in practice the expected error E_t is positive, and the error in the predicted yield for month t in any given run is likely to be nonzero.
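
The split of the expected error into two nonnegative terms can be checked numerically. The sketch below assumes that the first term is the variance of the neighborhood yields under the selection probabilities and that the second is the squared deviation of the mean prediction 〈y_t〉 from the actual yield z_t; the function name and argument layout (yields listed nearest first, so that entry i plays the role of u_{π_t(i)}) are hypothetical.

```python
def expected_error_terms(neighbor_yields, probs, actual_yield):
    """Return (expected_error, variance, squared_bias) for a single month.

    neighbor_yields: training yields ordered nearest first
    probs:           selection probabilities for those ranks (summing to 1)
    actual_yield:    the true yield z_t for the month
    """
    pairs = list(zip(probs, neighbor_yields))
    # Mean GkNN prediction <y_t>.
    mean = sum(p * u for p, u in pairs)
    # First term: spread of the neighborhood yields around the mean.
    variance = sum(p * (u - mean) ** 2 for p, u in pairs)
    # Second term: squared deviation of the mean prediction from the truth.
    squared_bias = (mean - actual_yield) ** 2
    # Expected squared error computed directly from its definition.
    expected_error = sum(p * (u - actual_yield) ** 2 for p, u in pairs)
    return expected_error, variance, squared_bias
```

For yields (2, 4, 6) with probabilities (0.5, 0.25, 0.25) and an actual yield of 3, the direct expected error is 3, which equals the variance 2.75 plus the squared bias 0.25.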

A measure of the total error of the GkNN prediction of yield over the total simulation period for run r is

()

and its expected value is

()

We may write

()

where

()

and

()

We have

()

Now define π : V × {1, …, k}→{1, …, N} by

()

and let, for v ∈ V, i ∈ {1, …, k}, π_v(i) = π(v, i). Then

()

where E : V → [0, ∞) is defined by

()

E : V → [0, ∞) may be called the base error map. We will show that E is bounded over the predictor vector space as follows:

()

where u_max = max⁡{u_i : i = 1, …, N}.

4. Prediction of the Annual Average Yield by GkNN Simulation

Thus the GkNN method does not make accurate detailed month by month predictions of the yield. We would like to determine some way in which the GkNN method gives useful information about the system behavior. We will show that under certain assumptions the GkNN method gives an accurate prediction of the average annual yield and the accuracy of the prediction increases as the total time period of the simulation increases.

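Given a permutation π : {1, …, N}→{1, …, N} let V_π = {v ∈ V : π_v = π}. Let Π denote the set of all permutations of {1, …, N}. Suppose that the simulation is carried out over m years, so T = 12m. The average annual yield for run r is

$$Y^{(r)} = \frac{1}{m} \sum_{t=1}^{12m} y_t^{(r)}, \tag{17}$$

where y_t^(r) is the yield for month t in run r. Therefore the average of the average annual yield over R runs is given by

$$\frac{1}{R} \sum_{r=1}^{R} Y^{(r)} = \frac{1}{m} \sum_{t=1}^{12m} \frac{1}{R} \sum_{r=1}^{R} y_t^{(r)} \to \frac{1}{m} \sum_{t=1}^{12m} \langle y_t \rangle \tag{18}$$

as R → ∞. Therefore the expected value of the predicted average annual yield is given by

$$\langle Y \rangle = \frac{1}{m} \sum_{t=1}^{12m} \sum_{i=1}^{N} p_i \, u_{\pi_{v_t}(i)} = \frac{1}{m} \sum_{\pi \in \Pi} \left| \{ t \le 12m : v_t \in V_\pi \} \right| \sum_{i=1}^{N} p_i \, u_{\pi(i)}. \tag{19}$$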

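If X is a topological space and x = {x_t : t = 1,2, …} is a time series in X then we will say that x is eventually well distributed if

$$\lim_{T \to \infty} \frac{1}{T} \left| \{ t \in \{1, \dots, T\} : x_t \in \Gamma \} \right| \ \text{exists for all } \Gamma \in \mathrm{Borel}(X). \tag{20}$$

(Borel (X) = {Borel sets in X} denotes the sigma algebra generated by the set of open sets in X [12].) This is a natural property for a time series to have. If x is eventually well distributed define its distribution to be the mapping ν : Borel(X)→[0,1] defined by

$$\nu(\Gamma) = \lim_{T \to \infty} \frac{1}{T} \left| \{ t \in \{1, \dots, T\} : x_t \in \Gamma \} \right|. \tag{21}$$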

It is straightforward to show that ν is finitely additive and ν(X) = 1.
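
As an illustration, the finite-horizon version of this limiting frequency is easy to compute. The sketch below assumes that "eventually well distributed" refers to the long-run fraction of time the series spends in a set; the function name empirical_frequency and the month-label example are hypothetical.

```python
def empirical_frequency(series, in_set, horizon):
    """Fraction of the first `horizon` terms of `series` lying in a set.

    series:  list of observations x_1, x_2, ...
    in_set:  predicate returning True when an observation lies in the set
    horizon: number of leading terms T to inspect
    """
    if horizon <= 0 or horizon > len(series):
        raise ValueError("horizon must be between 1 and len(series)")
    hits = sum(1 for x in series[:horizon] if in_set(x))
    return hits / horizon


# Example: a series of month labels cycling through 1, ..., 12.
month_labels = [(t % 12) + 1 for t in range(1200)]
january_share = empirical_frequency(month_labels, lambda x: x == 1, 1200)
```

For this periodic series the frequency of any single month label settles at 1/12, and the frequencies of disjoint sets add, which is the finite additivity noted above.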

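If the climatic time series {v_t : t = 1,2, …} is eventually well distributed with distribution ν then the average annual yield converges to a limit as the number of years m in the simulation increases, given by

$$\lim_{m \to \infty} \langle Y \rangle = 12 \sum_{\pi \in \Pi} \nu(V_\pi) \sum_{i=1}^{N} p_i \, u_{\pi(i)}, \tag{22}$$

where 〈Y〉 denotes the expected value of the average annual yield.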

5. Variance of the Average Annual Yield Predicted by GkNN Simulation

We will now compute the variance of the average annual yield and show that it tends to zero as the number of years m in the simulation increases. We have

()

We may compute

()

Now for

()

as R → ∞ (assuming that the index selection at Step (2) of the GkNN algorithm at time t is independent of its selection at time s). Therefore

()

Also we compute

()

It follows that

()

Therefore

()

where

()

and so the variance of the predicted average annual yield as computed by the GkNN method tends to zero as the total number of years m in the simulation increases. If the time series {v_t : t = 1,2, …} is eventually well distributed with distribution ν then

()
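
The decay of the variance with m can be illustrated by a short exact calculation. The sketch below assumes that the average annual yield is the sum of the 12m monthly yields divided by m, and that the monthly selections are independent as stated above, so that the monthly variances simply add; the function names are hypothetical.

```python
def monthly_variance(neighbor_yields, probs):
    """Variance of one month's GkNN yield under the selection probabilities."""
    mean = sum(p * u for p, u in zip(probs, neighbor_yields))
    return sum(p * (u - mean) ** 2 for p, u in zip(probs, neighbor_yields))


def annual_average_variance(monthly_variances):
    """Variance of the average annual yield over m = len(monthly_variances) / 12 years.

    With independent monthly selections the variances add, and dividing the
    total yield by m scales the variance by 1 / m**2.
    """
    months = len(monthly_variances)
    if months == 0 or months % 12 != 0:
        raise ValueError("expected a whole number of years of monthly variances")
    years = months // 12
    return sum(monthly_variances) / years ** 2
```

With a monthly variance of 1 in every month, this gives 12 for a one-year simulation, 1.2 for ten years, and 0.12 for one hundred years, which is the 1/m decay described above.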

6. Prediction of the Total Yield by GkNN Simulation

Thus the computation of average annual yield using GkNN seems to be well behaved. However, it is perhaps of greater interest to consider the total yield at any month starting from the beginning of the simulation period. The total yield Y_tot over a simulation period of m years is given by

$$Y_{\mathrm{tot}} = m Y, \tag{32}$$

where Y is the average annual yield. Therefore the variance of the total yield is given by

$$\mathrm{Var}(Y_{\mathrm{tot}}) = m^2 \, \mathrm{Var}(Y). \tag{33}$$

If the time series {v_t : t = 1,2, …} is eventually well distributed with distribution ν then Var(Y_tot) = mf(m), where

()

as m → ∞. This limit will be positive for practical applications. Thus, in this case, the variance of Y_tot becomes unbounded as m → ∞.
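
The growth of the variance of the total yield can be seen in a small Monte Carlo experiment. The sketch below is a simplification made for illustration only: it assumes that the predictor is the same in every month, so that each month draws independently from one fixed neighborhood, and the function names, run count, and seed are hypothetical.

```python
import random
import statistics


def simulate_total_yield(neighbor_yields, probs, years, rng):
    """Total yield over `years` years, drawing one neighborhood yield per month."""
    months = 12 * years
    return sum(rng.choices(neighbor_yields, weights=probs, k=months))


def total_yield_variance(neighbor_yields, probs, years, runs=4000, seed=0):
    """Estimate Var(Y_tot) from `runs` independent GkNN-style simulations."""
    rng = random.Random(seed)
    totals = [simulate_total_yield(neighbor_yields, probs, years, rng)
              for _ in range(runs)]
    return statistics.pvariance(totals)
```

With two equally likely yields 0 and 1 the monthly variance is 0.25, so the total yield over m years has variance 3m: the estimate for four years should be roughly four times the estimate for one year, in line with the unbounded growth noted above.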

7. A General Framework for GkNN

Let {v_t : t = 1,2, …} ⊂ V be a time series which may be a realization of some stochastic process and let Z be a topological space. A stochastic process y = {y_t : t = 1,2, …} ⊂ Z will be said to be stochastically dependent on {v_t : t = 1,2, …} if there exists a continuous kernel K : V × Borel(Z)→[0,1] such that

$$\Pr(y_t \in \Gamma) = K(v_t, \Gamma) \quad \text{for all } \Gamma \in \mathrm{Borel}(Z), \ t = 1, 2, \dots \tag{35}$$

The condition that K is a continuous kernel means that for all v ∈ V the mapping taking Γ ∈ Borel(Z) to K(v, Γ) is a probability measure and for all Γ ∈ Borel(Z) the mapping taking v ∈ V to K(v, Γ) is continuous. Equation (35) means that if y^(r) = {y_t^(r) : t = 1,2, …} for r = 1,2, … are a collection of runs (replicates) of the stochastic process {y_t : t = 1,2, …} then

$$\lim_{R \to \infty} \frac{1}{R} \left| \{ r \in \{1, \dots, R\} : y_t^{(r)} \in \Gamma \} \right| = K(v_t, \Gamma). \tag{36}$$

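Consider the GkNN process defined by training data {(w_i, u_i) : i = 1, …, N} ⊂ V × [0, ∞). In this case the space Z is the space [0, ∞). We will show that the process {y₁, y₂, …} is stochastically dependent on the time series {v_t : t = 1,2, …}. In fact we have

$$\Pr(y_t \in \Gamma) = \sum_{i=1}^{N} p_i \, \delta_{u_{\pi_{v_t}(i)}}(\Gamma), \tag{37}$$

where for a ∈ Z, δ_a : Borel(Z)→[0, ∞) denotes the Dirac measure concentrated on a defined by

$$\delta_a(\Gamma) = \begin{cases} 1 & \text{if } a \in \Gamma, \\ 0 & \text{otherwise.} \end{cases} \tag{38}$$

It follows that the GkNN process is stochastically dependent on {v_t : t = 1,2, …} with kernel K defined by

$$K(v, \Gamma) = \sum_{i=1}^{N} p_i \, \delta_{u_{\pi_v(i)}}(\Gamma). \tag{39}$$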
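
The GkNN kernel can be evaluated directly for a given predictor vector and a given set of yields. The sketch below assumes that the kernel places mass p_i on the yield of the i-th nearest training record, which is how the Dirac measures above are read here; the function name gknn_kernel and the use of a predicate to represent the Borel set Γ are hypothetical.

```python
def gknn_kernel(v, training, metric, probs, in_set):
    """Probability that the GkNN yield for predictor v falls in a set of yields.

    training: sequence of (w_i, u_i) pairs
    metric:   function mu(v, w) returning a non-negative float
    probs:    selection probabilities p_1, ..., p_N over neighbor ranks
    in_set:   predicate on yields representing the set Gamma
    """
    # Rank training indices by distance to v, nearest first.
    order = sorted(range(len(training)), key=lambda i: metric(v, training[i][0]))
    # Each rank contributes its probability when its yield lies in the set
    # (this is the Dirac measure at that yield evaluated on the set).
    return sum(p for p, i in zip(probs, order) if in_set(training[i][1]))
```

For a fixed v the result is a probability measure in Γ: it is 1 when in_set accepts every yield and is additive over disjoint sets.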

Now suppose that {v_t : t = 1,2, …} ⊂ V is a time series and {y_t : t = 1,2, …}⊂[0, ∞) is a stochastic process which is stochastically dependent on {v_t : t = 1,2, …} with kernel K where K is defined by a continuous functional kernel ϕ : V × [0, ∞)→[0, ∞), i.e.,

()

Let {z_t : t = 1,2, …} be a realization (replicate) of {y_t : t = 1,2, …} and let K_N be the kernel associated with the GkNN process with training set W_N = {(v_i, z_i) : i = 1, …, N} and probabilities

()

for which k_N → ∞ as N → ∞ but k_N/N → 0 as N → ∞. An example of a sequence k_N satisfying this is

. K_N is given by

()

Therefore for an interval (a, b)

()

Now let ψ : V × [0, ∞]→[0,1] be defined by

()

Let {ρ_t} be defined by ρ_t = ψ(v_t, z_t) for t = 1,2, …. Then {ρ_t} is a uniformly distributed sequence of random numbers and z_t = (ψ(v_t, ·))⁻¹(ρ_t). Thus

()

as N → ∞, assuming that

is small for all i = 1, …, k_N for N large enough (this will follow, if {v_t} is eventually well distributed with positive distribution, given that

is a uniformly distributed sequence).

Thus the GkNN kernel equals the kernel of the dependent process in the sense defined above as long as the training set for the GkNN process is large enough.

8. Example of Temporal Upscaling of (Rainwater) Tank Data

We would like to estimate the month by month yield of a rainwater tank (RWT) given monthly climatic data. This is not straightforward because a monthly time step is too coarse for the RWT simulation model. To obtain reasonably accurate results a daily time step must be used for the RWT simulation [13, 14].

The monthly climatic data arises from the water supply headworks (WSH) model [15] and is usually stochastically generated with a very large time span (e.g., 1,000,000 years). The problem of temporal scaling up would not arise if the climatic data for the WSH model had a daily time step (and also if the RWT simulation algorithm could be executed sufficiently fast).

Temporal downscaling has been used extensively in studying the short term effects of long-term climate models such as models of climate change [16–19]. However in the present paper we are considering the problem of upscaling relatively short records of daily data to generate long term records of monthly data.

Three methods of temporal upscaling are the nearest neighbor (NN) method of Coombes et al. [20], Kuczera’s bootstrap method [21] and the k-nearest neighbor (kNN) method [5, 6, 16].

In each of these methods the RWT month by month yield associated with a WSH climatic time series is estimated using a comparatively short (e.g., 140 years) historical record of daily climatic data. In each case the RWT simulation model or, more generally, the Allotment Water Balance model described in [20] is run on this daily historical record for various RWT parameter settings. In order to do this it is necessary to have a demand model, which is either a simulation or, less commonly, a historical record. The demand simulation will take into account the climatic variables, in particular the temperature.

The upscaling methods can be described in terms of the following general format. Each of the upscaling methods aggregates the daily RWT yields and climatic variables obtained from running the RWT simulation on the historical record into monthly time steps. They then generate a list

of records of the form

()

where N is the number of months in the historical record. The month label is a number in {1, …, 12} determined from the month corresponding to the record. For the method described in [20], n = 3 and

climatic_variable_1 = average_temperature,
climatic_variable_2 = number_of_rainfall_days,
climatic_variable_3 = rainfall_depth.

For Kuczera’s bootstrap method and the kNN method as currently implemented n = 1 and climatic_variable_1 = rainfall_depth.

Now for all three upscaling methods we are given a sequence

of monthly records coming from the WSH model where

()

For each i we want to select a RWT yield to associate with

. The NN method does this by finding the record in

which is closest to

as measured by the metric (a variant of the Manhattan metric) given by

()

where d = n + 1 is the record length (e.g. 2) and w₁, …, w_n are weights which were chosen to be 1 in [20]. The NN method is deterministic.

The kNN method is a stochastic method in which the following steps are carried out.

(1)
Evaluate the distance from each record to using the following metric (a variant of the Euclidean metric):
()
where s_p is the standard deviation of .
(2)
Sort the metric values
(3)
Choose the top (closest) k values ,…,
(4)
Assign a probability to each of the k selected values proportional to 1/t for t = 1, …, k
(5)
Randomly select an index t according to the assigned probabilities and return the as the RWT yield corresponding to
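
The five steps of the kNN method can be sketched as follows. The metric used here, a Euclidean distance on components divided by their standard deviations s_p, is an assumed form of the "variant of the Euclidean metric" mentioned in Step (1); the function name, the record layout, and the fallback for a constant component are likewise hypothetical.

```python
import math
import random
import statistics


def knn_select_yield(history, target, k, rng=None):
    """Pick an RWT yield for `target` from its k nearest historical records.

    history: list of (predictor_tuple, rwt_yield) monthly records
    target:  predictor tuple of the WSH record (same length as in history)
    """
    rng = rng or random.Random()
    dims = range(len(target))
    # s_p: standard deviation of component p over the historical records
    # (1.0 is used for a constant component to avoid division by zero).
    scale = [statistics.pstdev([rec[0][p] for rec in history]) or 1.0
             for p in dims]

    def distance(a, b):
        # Assumed metric: Euclidean distance on standardized components.
        return math.sqrt(sum(((a[p] - b[p]) / scale[p]) ** 2 for p in dims))

    # Steps (1)-(3): evaluate the distances, sort, keep the k closest records.
    nearest = sorted(history, key=lambda rec: distance(rec[0], target))[:k]
    # Step (4): weight the t-th closest record proportionally to 1/t.
    weights = [1.0 / t for t in range(1, len(nearest) + 1)]
    # Step (5): draw one record at random and return its yield.
    return rng.choices(nearest, weights=weights, k=1)[0][1]
```

With k = 1 the sketch always returns the yield of the closest record, so it behaves deterministically; larger k spreads the selection over more neighbors while still favoring the closest ones.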

The bootstrap method is a stochastic method in which a scatter plot of is created. The domain of the plot is divided up into bands of 50 samples per band. Then, given a WSH climatic record the corresponding RWT yield is obtained by finding the band containing , randomly choosing a sample in that band and then returning its RWT yield value.

The bootstrap method of Kuczera can be modified by taking the set of samples associated with any given rainfall value to be the set of samples whose rainfall values are the 50 closest values to the given rainfall value rather than using predefined bands of 50 rainfall values. It can be argued that the modified bootstrap method is superior to the bootstrap method because the closest values are the most appropriate values to use and, for example, if the given rainfall value falls near the boundary of one of the predefined bands then the predicted yield using the bootstrap method will be biased towards the values near the centre of the band.
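
A minimal sketch of the modified bootstrap selection is given below. It assumes that each historical record is a (rainfall_depth, rwt_yield) pair and that a sample is chosen uniformly at random among the closest records; the function name and parameter names are hypothetical.

```python
import random


def modified_bootstrap_yield(history, rainfall, n_closest=50, rng=None):
    """Return the RWT yield of a random record among those closest in rainfall.

    history:   list of (rainfall_depth, rwt_yield) monthly records
    rainfall:  rainfall depth of the WSH record being upscaled
    n_closest: size of the moving neighborhood (50 in the text)
    """
    rng = rng or random.Random()
    # The neighborhood is centered on the given rainfall value, so it moves
    # with the query instead of being fixed by predefined bands.
    nearest = sorted(history, key=lambda rec: abs(rec[0] - rainfall))[:n_closest]
    # Choose one of the closest records uniformly at random.
    return rng.choice(nearest)[1]
```

Because the neighborhood is recomputed for every query, a rainfall value that would sit near the edge of a predefined band is still surrounded by its own closest samples.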

The modified bootstrap method, the Coombes method, and the kNN method are all examples of the GkNN method. For the modified bootstrap method the predictor vectors have one component, the rainfall. For the Coombes method the predictor vectors have three components, the average temperature, the number of rainfall days, and the rainfall depth. For the kNN method the predictor vectors have two components, the month label (an integer in {1, …, 12}) and the rainfall depth. The training data is obtained by running the RWT simulation model using a daily time step over a relatively short period of time (e.g., 100 years) and then upscaling to a monthly time step by aggregation. The GkNN metric μ : V × V → [0, ∞) may be the modified Manhattan metric of the Coombes method or the modified Euclidean metric of the kNN method.

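For the modified bootstrap method the probability distribution on the set of nearest neighbors is given by

$$p_i = \begin{cases} 1/50 & \text{for } i = 1, \dots, 50, \\ 0 & \text{for } i = 51, \dots, N. \end{cases} \tag{50}$$

For the kNN method the distribution is given by

$$p_i = \begin{cases} \dfrac{1}{c \, i} & \text{for } i = 1, \dots, k, \\ 0 & \text{for } i = k + 1, \dots, N, \end{cases} \tag{51}$$

where

$$c = \sum_{j=1}^{k} \frac{1}{j}. \tag{52}$$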
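
The rank probabilities that turn each upscaling method into an instance of the GkNN method can be written out explicitly. The sketch below assumes a uniform distribution over the 50 closest records for the modified bootstrap method, weights proportional to 1/i over the k closest records for the kNN method, and all mass on the closest record for the deterministic NN method; the function names are hypothetical.

```python
def nn_rank_probs(n):
    """Deterministic nearest neighbor: all probability on the closest record."""
    return [1.0] + [0.0] * (n - 1)


def bootstrap_rank_probs(n, band=50):
    """Modified bootstrap: uniform over the `band` closest records."""
    band = min(band, n)
    return [1.0 / band] * band + [0.0] * (n - band)


def knn_rank_probs(n, k):
    """kNN: probability proportional to 1/i for the i-th closest of k records."""
    k = min(k, n)
    # Normalizing constant: the k-th harmonic number.
    c = sum(1.0 / j for j in range(1, k + 1))
    return [(1.0 / i) / c for i in range(1, k + 1)] + [0.0] * (n - k)
```

For example, with k = 3 the kNN probabilities for the three closest records are 6/11, 3/11, and 2/11, and every further record has probability zero.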

9. Conclusion

A generalization of three methods of temporal data upscaling, which we have called the generalized k-nearest neighbor (GkNN) method, has been considered. The accuracy of the GkNN simulation of month by month yield has been considered. The notion of an eventually well distributed time series is introduced and on the basis of this assumption some properties of the average annual yield and its variance for a GkNN simulation are computed. The behavior of the total yield over a planning period has been described. A general framework for considering the GkNN algorithm based on the notion of stochastically dependent time series has been described and it is shown that for a sufficiently large training set the GkNN simulation has the same statistical properties as the training data. An example of the application of the methodology has been given in the problem of simulating the yield of a rainwater tank given monthly climatic data.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

The work described in this paper was partially funded by the Commonwealth Scientific and Industrial Research Organisation (CSIRO, Australia). The author would also like to thank Fareed Mirza, Shiroma Maheepala, and Yong Song for very helpful discussions.

Open Research

Data Availability

The work of the paper is a theoretical study. The author did not implement any code or generate any data relating to the work. Therefore no data were used to support this study.

References

1 Mack Y. P., Local properties of k-NN regression estimates, Society for Industrial and Applied Mathematics. Journal on Algebraic and Discrete Methods. (1981) 2, no. 3, 311–323, https://doi.org/10.1137/0602035, MR627598.
2 Yakowitz S. and Karlsson M., Nearest neighbour methods for time series, with application to rainfall/runoff prediction, Stochastic Hydrology (J. B. Macneill and G. J. Umphrey, eds.), 1987, Reidel Publishing Company, 149–160, https://doi.org/10.1007/978-94-009-4792-4_9.
3 Cover T. and Hart P. E., Nearest neighbour pattern classification, IEEE Transactions on Information Theory. (1967) 13, no. 1, https://doi.org/10.1109/tit.1967.1053964, Zbl0154.44505.
4 Devroye L., On the almost everywhere convergence of nonparametric regression function estimates, The Annals of Statistics. (1981) 9, no. 6, 1310–1319, MR630113, https://doi.org/10.1214/aos/1176345647, Zbl0477.62025.
5 Lall U. and Sharma A., A nearest neighbor bootstrap for resampling hydrologic time series, Water Resources Research. (1996) 32, no. 3, 679–693, https://doi.org/10.1029/95WR02966, 2-s2.0-0029663871.
6 Rajagopalan B. and Lall U., A k-nearest-neighbor simulator for daily precipitation and other weather variables, Water Resources Research. (1999) 35, no. 10, 3089–3101, https://doi.org/10.1029/1999WR900028, 2-s2.0-0032848078.
7 Biau G., Devroye L., Dujmovic V., and Krzyżak A., An affine invariant k-nearest neighbour regression estimate, Journal of Multivariate Analysis. (2012) 112, 24–34, https://doi.org/10.1016/j.jmva.2012.05.020, MR2957283.
8 Lee T. and Ouarda T. B. M. J., Identification of model order and number of neighbors for k-nearest neighbor resampling, Journal of Hydrology. (2011) 404, no. 3-4, 136–145, https://doi.org/10.1016/j.jhydrol.2011.04.024, 2-s2.0-79959202022.
9 Zhang S., Nearest neighbor selection for iteratively kNN imputation, The Journal of Systems and Software. (2012) 85, no. 11, 2541–2552, https://doi.org/10.1016/j.jss.2012.05.073, 2-s2.0-84865249371.
10 Yakowitz S., Nearest neighbor regression estimation for null-recurrent Markov time series, Stochastic Processes and Their Applications. (1993) 48, no. 2, 311–318, https://doi.org/10.1016/0304-4149(93)90050-E, MR1244548.
11 Sancetta A., Nearest neighbor conditional estimation for Harris recurrent Markov chains, Journal of Multivariate Analysis. (2009) 100, no. 10, 2224–2236, https://doi.org/10.1016/j.jmva.2009.06.013, MR2560365.
12 Halmos P. R., Measure Theory, 1974, Springer, New York, NY, USA, https://doi.org/10.1007/978-1-4615-9976-0, MR0033869.
13 Mashford J. and Maheepala S., A general model for the exact computation of yield from a rainwater tank, Applied Mathematical Modelling: Simulation and Computation for Engineering and Environmental Systems. (2015) 39, no. 7, 1929–1940, https://doi.org/10.1016/j.apm.2014.10.004, MR3325588.
14 Mashford J., Maheepala S., Neumann L., and Coultas E., Computation of the expected value and variance of the average annual yield for a stochastic simulation of rainwater tank clusters, Proceedings of the 2011 International Conference on Modeling, Simulation and Visualization Methods, 2011, Las Vegas, Nev, USA, 303–309.
15 Cui L.-J. and Kuczera G., Optimizing urban water supply headworks using probabilistic search methods, Journal of Water Resources Planning and Management. (2003) 129, no. 5, 380–387, 2-s2.0-0141521666, https://doi.org/10.1061/(ASCE)0733-9496(2003)129:5(380).
16 Gangopadhyay S., Clark M., and Rajagopalan B., Statistical downscaling using K-nearest neighbors, Water Resources Research. (2005) 41, no. 2, https://doi.org/10.1029/2004WR003444.
17 Fowler H. J., Blenkinsop S., and Tebaldi C., Linking climate change modelling to impacts studies: recent advances in downscaling techniques for hydrological modelling, International Journal of Climatology. (2007) 27, no. 12, 1547–1578, https://doi.org/10.1002/joc.1556, 2-s2.0-35348933854.
18 Maraun D., Wetterhall F., Ireson A. M. et al., Precipitation downscaling under climate change: Recent developments to bridge the gap between dynamical models and the end user, Reviews of Geophysics. (2010) 48, no. 3, 1–34, https://doi.org/10.1029/2009RG000314, 2-s2.0-77954700859.
19 Erhardt R. J., Band L. E., Smith R. L., and Lopes B. J., Statistical downscaling of precipitation on a spatially dependent network using a regional climate model, Stochastic Environmental Research and Risk Assessment. (2015) 29, no. 7, 1835–1849, 2-s2.0-84939265131, https://doi.org/10.1007/s00477-014-0988-y.
20 Coombes P. J., Kuczera G., Kalma J. D., and Argue J. R., An evaluation of the benefits of source control measures at the regional scale, Urban Water Journal. (2002) 4, no. 4, 307–320, 2-s2.0-0036409577, https://doi.org/10.1016/S1462-0758(02)00028-6.
21 Kuczera G., Urban water supply drought security: a comparative analysis of complimentary centralised and decentralised storage systems, Proceedings of the Water Down Under 2008, 2008, 1532–1543.

Stochastic Temporal Data Upscaling Using the Generalized k-Nearest Neighbor Algorithm
