Stochastic Temporal Data Upscaling Using the Generalized k-Nearest Neighbor Algorithm
Abstract
Three methods of temporal data upscaling, which may collectively be called the generalized k-nearest neighbor (GkNN) method, are considered. The accuracy of the GkNN simulation of month by month yield is considered (where the term yield denotes the dependent variable). The notion of an eventually well-distributed time series is introduced and on the basis of this assumption some properties of the average annual yield and its variance for a GkNN simulation are computed. The total yield over a planning period is determined and a general framework for considering the GkNN algorithm based on the notion of stochastically dependent time series is described and it is shown that for a sufficiently large training set the GkNN simulation has the same statistical properties as the training data. An example of the application of the methodology is given in the problem of simulating yield of a rainwater tank given monthly climatic data.
1. Introduction
The k-nearest neighbor method has its origins in the work of Mack [1], Yakowitz and Karlsson [2], and others, e.g., [3, 4]. In this work an estimate for Yi given an independently and identically distributed (i.i.d.) sequence (Xj, Yj) of random vectors with Xj ∈ Rp and Yj ∈ R (where R denotes the set of real numbers) on the basis of {(Xj, Yj) : j < i} is obtained by taking the average of Yj over the set {Yj : j ∈ J}, where J is the set of indices of vectors Xj which form the k nearest neighbors of Xi, in which k > 1.
In later work by Lall and Sharma [5] and Rajagapolan and Lall [6] a related method, also called the k-nearest neighbor method, was used for simulating hydrological stochastic time series (Xi, Yi). In this method the next value in the simulated time series is chosen randomly according to a probability distribution over the set J of indices j of the k nearest neighbors Xj of Xi in {Xj : j < i}.
More recent work in the area has been carried out by Biau et al. [7], Lee and Ourda [8], and Zhang [9].
In the present paper we derive some general results about the k-nearest neighbor algorithm and related methods which we group together as a general class of methods which we call the generalized k-nearest neighbor method (GkNN method). We do not make the assumption that the time series are i.i.d. [1], null-recurrent Markov [10], or Harris recurrent Markov chains [11]. We introduce the natural notion of a time series being eventually well distributed from which, if satisfied, some properties of the GkNN algorithm can be deduced.
The generalized k nearest neighbor (GkNN) algorithm is described in Section 2. Section 3 investigates the problem of predicting the month by month yield (where we use the term “yield" to denote the value of the dependent variable Yi) while Section 4 considers the computation of the average annual yield. Section 5 computes the variance of the average annual yield while Section 6 considers the behavior of the total yield. Section 7 describes a general framework for viewing the GkNN algorithm and conditions under which this framework is applicable in practice. The eighth section of this paper presents the particular example of the problem of simulating rainwater tank yield. The paper concludes in Section 9.
2. The Generalized k Nearest Neighbor (GkNN) Method
In the GkNN method we are given a time series {vt ∈ V : t = 1, …, T} of predictor vectors which may be obtained from, for example, a stochastic simulation of climatic data. Here V denotes the space of predictor vectors. We are also given training data {(wi, ui) ∈ V × [0, ∞) : i = 1, …, N}.
We want to assign yields yt for t = 1, …, T in a meaningful way. We are given a metric μ : V × V → [0, ∞). We are also given a probability distribution {p1, …, pN} on {1, …, N}. In the GkNN method the yield time series yt for t = 1,…, T is computed as follows.
- (1)
Compute the metric values μ(vt, wi) for i = 1, …, N and sort from lowest to highest. Let πt be the resulting permutation of {1, …, N}.
- (2)
Randomly choose i ∈ {1, …, N} according to the distribution {p1, …, pN}. Denote it by i_selected.
- (3)
Return .
3. Prediction of the Month by Month Yield by GkNN Simulation
We want to determine by either theoretical calculation or computational experiment how well the GkNN method predicts yields, or at least to find some sense in which it can be said that the GkNN method is predicting yields accurately. Suppose that we have a training set {(wi, ui) : i = 1, …, N}. Let {vt : t = 1, …, T} be a given climatic time series and {zt : t = 1, …, T} associated (unknown) yields. The GkNN method is a stochastic method for generating a yield time series. Suppose that we run it R times resulting in a yield time series for run r, where r ∈ {1, …, R}.
The expected error is the sum of two nonnegative terms. The first term can only be zero if all the points in the neighborhood have associated yields equal to 〈yt〉 and this is seldom the case. The greater the distribution of yields in the neighborhood the greater the first term will be and hence, the greater Et will be. Thus the expected error Et is positive and the error in the prediction of the yield during month t for any given run is likely to be positive.
4. Prediction of the Annual Average Yield by GkNN Simulation
Thus the GkNN method does not make accurate detailed month by month predictions of the yield. We would like to determine some way in which the GkNN method gives useful information about the system behavior. We will show that under certain assumptions the GkNN method gives an accurate prediction of the average annual yield and the accuracy of the prediction increases as the total time period of the simulation increases.
5. Variance of the Average Annual Yield Predicted by GkNN Simulation
6. Prediction of the Total Yield by GkNN Simulation
7. A General Framework for GkNN
Thus the GkNN kernel equals the kernel of the dependent process in the sense defined above as long as the training set for the GkNN process is large enough.
8. Example of Temporal Upscaling of (Rainwater) Tank Data
We would like to estimate the month by month yield of a rainwater tank (RWT) given monthly climatic data. This is not straightforward because a monthly time step is too coarse for the RWT simulation model. To obtain reasonably accurate results a daily time step must be used for the RWT simulation [13, 14].
The monthly climatic data arises from the water supply headworks (WSH) model [15] and is usually stochastically generated with a very large time span (e.g., 1,000,000 years). The problem of temporal scaling up would not arise if the climatic data for the WSH model had a daily time step (and also if the RWT simulation algorithm could be executed sufficiently fast).
Temporal downscaling has been used extensively in studying the short term effects of long-term climate models such as models of climate change [16–19]. However in the present paper we are considering the problem of upscaling relatively short records of daily data to generate long term records of monthly data.
Three methods of temporal upscaling are the nearest neighbor (NN) method of Coombes et al. [20], Kuczera’s bootstrap method [21] and the k-nearest neighbor (kNN) method [5, 6, 16].
In each of these methods the RWT month by month yield associated with a WSH climatic time series is estimated using a comparatively short (e.g., 140 years) historical record of daily climatic data. In each case the RWT simulation model or, more generally, the Allotment Water Balance model described in [20] is run on this daily historical record for various RWT parameter settings. In order to do this it is necessary to have a demand model which is either a simulation or, as is unlikely, a historical record. The demand simulation will take into account the climatic variables, in particular, the temperature.
-
climatic_variable_1 = average_temperature,
-
climatic_variable_2 = number_of_rainfall_days,
-
climatic_variable_3 = rainfall_depth.
For Kuczera’s bootstrap method and the kNN method as currently implemented n = 1 and climatic_variable_1 = rainfall_depth.
- (1)
Evaluate the distance from each record to using the following metric (a variant of the Euclidean metric):
() -
where sp is the standard deviation of .
- (2)
Sort the metric values
- (3)
Choose the top (closest) k values ,…,
- (4)
Assign a probability to each of the k selected values proportional to 1/t for t = 1, …, k
- (5)
Randomly select an index t according to the assigned probabilities and return the as the RWT yield corresponding to
The bootstrap method is a stochastic method in which a scatter plot of is created. The domain of the plot is divided up into bands of 50 samples per band. Then, given a WSH climatic record the corresponding RWT yield is obtained by finding the band containing , randomly choosing a sample in that band and then returning its RWT yield value.
The bootstrap method of Kuczera can be modified by taking the set of samples associated with any given rainfall value to be the set of samples whose rainfall values are the 50 closest values to the given rainfall value rather than using predefined bands of 50 rainfall values. It can be argued that the modified bootstrap method is superior to the bootstrap method because the closest values are the most appropriate values to use and, for example, if the given rainfall value falls near the boundary of one of the predefined bands then the predicted yield using the bootstrap method will be biased towards the values near the centre of the band.
The modified bootstrap method, the Coombes method, and the kNN method are all examples of the GkNN method. For the modified bootstrap method the predictor vectors have one component, the rainfall. For the Coombes method the predictor vectors have three components, the average temperature, the number of rainfall days, and the rainfall depth. For the kNN method the predictor vectors have two components, the month label (an integer in {1, …, 12}) and the rainfall depth. The training data is obtained by running the RWT simulation model using a daily time step over a relatively short period of time (e.g., 100 years) and then upscaling to a monthly time step by aggregation. The GkNN metric μ : V × V → [0, ∞) may be the modified Manhattan metric of the Coombes method or the modified Euclidean metric of the kNN method.
9. Conclusion
A generalization of three methods of temporal data upscaling, which we have called the generalized k-nearest neighbor (GkNN) method, has been considered. The accuracy of the GkNN simulation of month by month yield has been considered. The notion of an eventually well distributed time series is introduced and on the basis of this assumption some properties of the average annual yield and its variance for a GkNN simulation are computed. The behavior of the total yield over a planning period has been described. A general framework for considering the GkNN algorithm based on the notion of stochastically dependent time series has been described and it is shown that for a sufficiently large training set the GkNN simulation has the same statistical properties as the training data. An example of the application of the methodology has been given in the problem of simulating the yield of a rainwater tank given monthly climatic data.
Conflicts of Interest
The author declare that they have no conflicts of interest.
Acknowledgments
The work described in this paper was partially funded by the Commonwealth Scientific and Industrial Research Organisation (CSIRO, Australia). Also the author would like to thank Fareed Mirza, Shiroma Maheepala, and Yong Song for very helpful discussions.
Open Research
Data Availability
The work of the paper is a theoretical study. The author did not implement any code or generate any data relating to the work. Therefore no data were used to support this study.