Volume 3, Issue 1 pp. 59-68

RESEARCH ARTICLE

Open Access

Prediction of the water level at the Kien Giang River based on regression techniques

Ta Quang Chieu (corresponding author), Nguyen Thi Phuong Thao, Dao Thi Hue, Nguyen Thi Thu Huong

Ta Quang Chieu, Nguyen Thi Phuong Thao, Nguyen Thi Thu Huong: Faculty of Computer Science and Engineering, Thuyloi University, Dong Da, Hanoi, Vietnam

Dao Thi Hue: Faculty of Water Resources Engineering, Thuyloi University, Dong Da, Hanoi, Vietnam

Correspondence: Ta Quang Chieu, Faculty of Computer Science and Engineering, Thuyloi University, Dong Da, Hanoi, Vietnam. Email: [email protected]. ORCID: orcid.org/0009-0001-5079-7762

First published: 18 January 2024

https://doi.org/10.1002/rvr2.71

Abstract

Model accuracy and runtime are two key issues for flood warnings in rivers. Traditional hydrodynamic models, which have a rigorous physical mechanism for flood routing, have been widely adopted for water level prediction in rivers, lakes, and urban areas. However, these models require various types of data, in-depth domain knowledge, experience with modeling, and intensive computational time, which hinders short-term or real-time prediction. In this paper, we propose a new framework based on machine learning methods to alleviate these limitations. We develop a range of machine learning models, namely linear regression (LR), support vector regression (SVR), random forest regression (RFR), multilayer perceptron regression (MLPR), and light gradient boosting machine regression (LGBMR), to predict the hourly water level at the Le Thuy and Kien Giang stations of the Kien Giang River based on data collected in 2010, 2012, and 2020. Four evaluation metrics, that is, R², Nash–Sutcliffe efficiency, mean absolute error, and root mean square error, are employed to examine the reliability of the proposed models. The results show that the LR model outperforms the SVR, RFR, MLPR, and LGBMR models.

1 INTRODUCTION

The Kien Giang River, one of two major tributaries within the Nhat Le River system, flows through the Le Thuy and Quang Ninh districts in Quang Binh province, Vietnam (Figure 1). The river is approximately 69 km long (Ly et al., 2013), and the area has long been known as a “flood navel” owing to the topography of Le Thuy and Quang Ninh. During the historical flood in October 2020, more than 50,000 houses at the foothills of the Truong Son mountain range were submerged, and thousands of villages and hamlets were isolated. The flood peak at the Le Thuy station reached 4.88 m, exceeding warning level III and standing 0.97 m above the historical flood peak of 1979.

Figure 1. The Kien Giang river system and location of meteorological and hydrological stations.

Accurately forecasting the river water level is critical for early flood warning and flood disaster mitigation. In general, there are two main approaches to predicting the water level. The first relies on physically based models, such as MIKE HYDRO River, HEC-HMS, SOBEK, and EFDC. Although these models have high accuracy, they typically require a variety of data sets, including topographic, meteorological, and hydrological data, and intensive computational time for model simulation. Therefore, physically based models are poorly suited to short-term and real-time prediction. Moreover, developing a physically based model frequently demands in-depth knowledge and expertise in hydrology (Atashi et al., 2022).

An alternative approach is a data-driven model that learns the statistical relationship between input and output data. This approach can help overcome the above-mentioned limitations of physically based models. Machine learning (ML) models have been used for flood forecasting since the 1990s and are among the most popular data-driven frameworks. Recent studies suggest that ML can be a powerful tool for flood forecasting because models can be built quickly without a detailed understanding of the underlying physical process. In addition, ML models offer shorter computational time, faster calibration and validation, and easier use than physically based models (Mekanik et al., 2013).

To our knowledge, no previous studies have applied the ML approach to predict river water levels for the Quang Binh province. The goal of our study is to apply regression methods, including linear regression (LR), support vector regression (SVR), random forest regression (RFR), multilayer perceptron regression (MLPR), and light gradient boosting machine regression (LGBMR) to predict water level at the Le Thuy and the Kien Giang stations.

2 METHODOLOGY AND DATA COLLECTION

2.1 Regression methods

Regression is a mathematical method in statistics used to analyze the relationship between a quantity to be forecasted over time and historical data. In this study, five regression techniques of machine learning are applied to generate data-driven models. The primary process when developing these models is called the “learning phase,” where the relationship between the input and output variables of the system is established (Guo et al., 2021):

$$y = f(x) \tag{1}$$

with the available data:

$$[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)] = \{x_i, y_i\}_{i=1}^{n}, \tag{2}$$

where x is the input vector, y is the output vector, n is the number of observations and f is the regression function.

2.1.1 LR

LR is a supervised machine learning algorithm that models a target value as a function of independent variables. Regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables used. The dependent variable may also be called an outcome variable, a criterion variable, an endogenous variable, or a regressand; the independent variable may be called an exogenous variable, a predictor variable, or a regressor.

In Figure 2a, the input X is the work experience and the output Y is the salary of an individual. In this example, the regression line is the best-fit line for our model. The hypothesis function for LR is as follows:

$$y = \theta_1 + \theta_2 x, \tag{3}$$

where x is the input training data (univariate, that is, one input variable), y is the label of the data (supervised learning), θ₁ is the intercept, and θ₂ is the coefficient of the input x.

Cost function (J): To obtain the best-fit regression line, the model updates the θ₁ and θ₂ values after each iteration so as to minimize the mean squared difference between the predicted value (pred) and the true value (y):

$$\mathrm{minimize}\ \frac{1}{n}\sum_{i=1}^{n}(\mathrm{pred}_i - y_i)^{2}. \tag{4}$$
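As a minimal sketch of this fitting step, the closed-form least-squares solution below minimizes the mean squared error of the cost function for the univariate case. The function names `fit_line` and `mse` and the toy experience/salary values are illustrative assumptions, not data or code from this study; an iterative scheme such as gradient descent would converge to the same minimizer.

```python
def fit_line(xs, ys):
    """Return (theta1, theta2) minimizing the MSE of y = theta1 + theta2 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta2 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    theta1 = mean_y - theta2 * mean_x
    return theta1, theta2


def mse(xs, ys, theta1, theta2):
    """Mean squared error between predictions and true values."""
    return sum((theta1 + theta2 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: years of experience vs. salary (arbitrary units).
experience = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [3.0, 5.0, 7.0, 9.0, 11.0]  # generated from y = 1 + 2x
theta1, theta2 = fit_line(experience, salary)
```

Because the toy labels lie exactly on a line, the fitted intercept and slope recover that line and the cost is zero up to rounding.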

2.1.2 SVR

The SVR approach proposed by Drucker et al. (1996) was employed herein for nonlinear regression. The regression function of SVR can be expressed as follows (Liong & Sivapragasam, 2002; Yu et al., 2006):

$$f^{\mathrm{SVR}}(x) = w^{T}\phi(x) + b, \tag{5}$$

where w is the weight vector, φ is the nonlinear mapping function, and b is the bias term. According to the fundamental concept of structural risk minimization to prevent overfitting, Equation (5) can be further expressed as follows:

$$\min_{w,b,\xi_i,\xi_i^{*}} \frac{1}{2}\Vert w \Vert^{2} + C\sum_{i=1}^{n}(\xi_i + \xi_i^{*}), \tag{6}$$

$$\mathrm{subject\ to}\ \left\{\begin{array}{l} y_i - [w^{T}\phi(x_i) + b] \le \varepsilon + \xi_i \\ {[w^{T}\phi(x_i) + b]} - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i \ge 0,\ \xi_i^{*} \ge 0,\ i = 1, 2, \ldots, n \end{array}\right. \tag{7}$$

where C denotes the cost (penalty) parameter, ξ and ξ∗ are nonnegative slack variables, and ε is the parameter of the ε-insensitive loss function. On the basis of Lagrange multipliers, the optimization problem of SVR can be written in dual form (Wu et al., 2008):

$$f^{\mathrm{SVR}}(x) = \sum_{i=1}^{n}(\alpha_i - \alpha_i^{*})K(x_i, x) + b, \tag{8}$$

where α and α∗ are Lagrange multipliers and K is the kernel function. In this study, the commonly used radial basis function was employed as the kernel function. Detailed descriptions of the SVR methodology can be found in the literature (Brereton & Lloyd, 2010; Chang & Lin, 2011).
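To make the dual form concrete, the sketch below evaluates the SVR prediction as a kernel-weighted sum over support vectors. It assumes the common parameterization K(x_i, x) = exp(−γ‖x_i − x‖²) of the radial basis function; the width parameter `gamma`, the function names, and all numerical values are hypothetical and are not the settings used in this study.

```python
import math


def rbf_kernel(x_i, x, gamma):
    """Radial basis function kernel K(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b.

    dual_coefs[i] holds the difference alpha_i - alpha_i* for support vector i.
    """
    weighted = sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )
    return weighted + b


# Hypothetical values chosen only to exercise the formula.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.5, -0.25]
prediction = svr_predict([0.0, 0.0], support_vectors, dual_coefs, b=2.0, gamma=0.5)
```

In a trained model the multipliers and the bias come from solving the optimization problem; here they are fixed by hand so that only the prediction step is shown.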

2.1.3 RFR

The RFR approach proposed by Breiman (2001) is a tree-based ensemble ML technique that combines bagging (bootstrap aggregation) with the random subspace method. During training, each decision tree is built by the binary recursive partitioning used in classification and regression trees. Once a forest of trees has been constructed, the predictions of the individual trees are aggregated into the final result. The advantages of the RFR approach are its simplicity and its small number of tuning hyperparameters. The RFR algorithm, as shown in Figure 2b, is summarized as follows (Choi et al., 2019; Li et al., 2016; Muñoz et al., 2018; Nguyen et al., 2015):

1. On the basis of the bootstrap method, a subset of samples is randomly drawn with replacement from the original data set.
2. These bootstrap samples are employed to construct regression trees. The optimal split criterion is used to split each node of the regression trees into two descendant nodes. The process on each descendant node is continued recursively until a termination criterion is fulfilled.
3. Each regression tree provides a predicted result. Once all of the regression trees have reached their maximum size, the final prediction is determined as the average of the results from all of the regression trees:

$$f^{\mathrm{RFR}}(x) = \frac{1}{N_{\mathrm{tree}}}\sum_{\mathrm{tr}=1}^{N_{\mathrm{tree}}}\hat{h}_{\mathrm{tr}}(x), \tag{9}$$

where tr is the index of a regression tree, $N_{\mathrm{tree}}$ is the total number of trees, and $\hat{h}_{\mathrm{tr}}$ denotes the prediction of each regression tree. Detailed descriptions of RFR have been provided in previous studies (Biau & Scornet, 2015; Boulesteix et al., 2012).
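The three steps above can be sketched as follows. To keep the example short, it is a simplification under stated assumptions: each "tree" is a one-split regression stump on a single input variable, so the random subspace step (random feature selection) is omitted. All function names and the toy data are hypothetical.

```python
import random


def fit_stump(samples):
    """Fit a one-split regression tree on (x, y) pairs by minimizing squared error."""
    xs = sorted({x for x, _ in samples})
    overall_mean = sum(y for _, y in samples) / len(samples)
    best = (None, overall_mean, overall_mean, float("inf"))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in samples if x <= threshold]
        right = [y for x, y in samples if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if sse < best[3]:
            best = (threshold, left_mean, right_mean, sse)
    return best[:3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    if threshold is None:  # the bootstrap sample had a single distinct x value
        return left_mean
    return left_mean if x <= threshold else right_mean


def fit_forest(samples, n_trees, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample drawn with replacement from the original data.
        bootstrap = [rng.choice(samples) for _ in samples]
        # Step 2: grow one regression tree per bootstrap sample.
        forest.append(fit_stump(bootstrap))
    return forest


def predict_forest(forest, x):
    # Step 3: the final prediction is the average over all trees.
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


# Hypothetical toy data with a step between x = 2 and x = 3.
data = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 5.0), (4.0, 5.0), (5.0, 5.0)]
forest = fit_forest(data, n_trees=25)
low, high = predict_forest(forest, 0.0), predict_forest(forest, 5.0)
```

Averaging leaf means can never produce a value outside the range of the training labels, which is consistent with the tendency of tree ensembles to underestimate peaks that exceed what was seen in training.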

2.1.4 MLPR

MLPR, a feed-forward neural network, comprises three layers: input, hidden, and output layers (Figure 2c). The neural network in MLPR consists of neurons, biases assigned to neurons, connections among neurons, and weights on those connections. Mathematically, the regression function of MLPR can be expressed as follows (Chen et al., 2020; Khan & Coulibaly, 2006):

$$f^{\mathrm{MLPR}}(x) = c_r + \sum_{q} u_{qr}\, a_q(x), \tag{10}$$

where c_r denotes the bias of the rth output neuron, u_qr is the weight connecting the qth neuron in the hidden layer to the rth neuron in the output layer, and a_q(x) represents the output of the qth hidden neuron, which can be expressed in terms of the activation function F:

$$a_q(x) = F\left(d_q + \sum_{p} v_{pq}\, x_p\right), \tag{11}$$

where d_q is the bias of the qth hidden neuron, x_p is the input variable, and v_pq is the weight connecting the pth neuron in the input layer to the qth neuron in the hidden layer. Several types of activation functions can be employed, including linear, sigmoid, hyperbolic tangent (tanh), and rectified linear unit (ReLU) functions. In the training process of MLPR, the back-propagation algorithm is used to adjust the weights connecting neurons to minimize errors (Jhong et al., 2018; Rumelhart et al., 1986). Details regarding the theory of MLPR have been provided in previous studies (Govindaraju & Rao, 2000; Hagan et al., 1996; Haykin, 1998).
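A minimal sketch of the forward pass defined by the two MLPR expressions above is given here for a single output neuron, so the output index r is dropped. The function name `mlp_forward`, the choice of tanh as the default activation, and the weights are hypothetical; in practice the weights would be learned by back-propagation.

```python
import math


def mlp_forward(x, v, d, u, c, activation=math.tanh):
    """Single-hidden-layer forward pass for one output neuron.

    x: inputs x_p; v[p][q]: input-to-hidden weights; d[q]: hidden biases;
    u[q]: hidden-to-output weights; c: output bias.
    """
    hidden = []
    for q in range(len(d)):
        # a_q(x) = F(d_q + sum_p v_pq * x_p)
        z = d[q] + sum(v[p][q] * x[p] for p in range(len(x)))
        hidden.append(activation(z))
    # f(x) = c + sum_q u_q * a_q(x)
    return c + sum(u[q] * hidden[q] for q in range(len(hidden)))


# Hypothetical weights: 2 inputs, 2 hidden neurons, 1 output.
v = [[1.0, 0.0], [0.0, 1.0]]
d = [0.0, 0.0]
u = [2.0, -1.0]
output = mlp_forward([0.0, 0.0], v, d, u, c=0.5)
```

With zero inputs and zero hidden biases every hidden output is tanh(0) = 0, so the prediction reduces to the output bias.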

2.1.5 LGBMR

LGBMR uses four main algorithms to improve computational efficiency and prevent overfitting: gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB), a histogram-based algorithm, and a leaf-wise growth algorithm (He et al., 2022; Ke et al., 2017). As shown in Figure 2d, the leaf-wise growth algorithm identifies the leaf node with the largest split gain while preventing overfitting. In addition, LGBMR adopts the histogram-based decision tree algorithm to divide continuous floating-point features into a number of intervals, which reduces the computational power required for prediction. Moreover, GOSS reduces the number of samples and EFB reduces the number of features, which accelerates the training process of LGBMR.

The objective function of LGBMR can be written as follows:

$$Obj^{t} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{t}\right) + \sum_{i=1}^{t}\Omega(f_i), \tag{12}$$

where l is the loss function, Ω is the regularization term of a decision tree f_i at the tth iteration, y_i is the true (objective) value, and $\hat{y}_i^{t}$ is the predicted value. On the basis of the boosting algorithm, Equation (12) can be further expressed as follows:

$$Obj^{t} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{t-1} + f_t(x_i)\right) + \sum_{i=1}^{t}\Omega(f_i), \tag{13}$$

where $\hat{y}_i^{t-1}$ is the predicted value of the model at step t − 1 and f_t(x_i) denotes the new predicted value at the tth step. To solve the objective function, the Newton method is employed to simplify Equation (13) into the following equation:

$$Obj^{t} = \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i)\right] + \sum_{i=1}^{t}\Omega(f_i), \tag{14}$$

where g_i and h_i are, respectively, the first and second derivatives of the loss function, which can be expressed as follows:

$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right), \quad h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right). \tag{15}$$

Samples in the regression trees are assigned to leaf nodes. The final value of the loss can be determined by accumulating the loss values of the leaf nodes. Thus, with I_j representing the set of samples in leaf j, Equation (14) can be rewritten as

$$Obj^{t} = \sum_{j=1}^{T}\left[\left(\sum_{i\in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i\in I_j} h_i + \lambda\right) w_j^{2}\right], \tag{16}$$

where T is the total number of leaf nodes, w_j is the weight of leaf node j, and λ is the regularization parameter. Thus, the optimal objective function can be obtained by minimizing this quadratic function. Detailed descriptions of LGBMR have been provided in previous studies (He et al., 2022; Ke et al., 2017; Tang et al., 2020).
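Since the final objective is a separate quadratic in each leaf weight, setting its derivative to zero gives the minimizer w_j = −G_j / (H_j + λ), where G_j and H_j are the sums of g_i and h_i over the samples in leaf j. The sketch below checks this for one leaf. It assumes the loss l = ½(y − ŷ)², for which g_i = ŷ_i − y_i and h_i = 1; the function names and sample values are hypothetical.

```python
def leaf_weight(gradients, hessians, lam):
    """Minimizer of G*w + 0.5*(H + lam)*w**2 for one leaf: w* = -G / (H + lam)."""
    G = sum(gradients)
    H = sum(hessians)
    return -G / (H + lam)


def leaf_objective(gradients, hessians, lam, w):
    """Contribution of one leaf to the objective for a given leaf weight w."""
    G = sum(gradients)
    H = sum(hessians)
    return G * w + 0.5 * (H + lam) * w ** 2


# Hypothetical leaf with three samples under the loss l = 0.5 * (y - y_hat)**2,
# for which g_i = y_hat_i - y_i and h_i = 1.
y_true = [2.0, 3.0, 4.0]
y_prev = [0.0, 0.0, 0.0]  # predictions of the model at step t - 1
g = [p - y for p, y in zip(y_prev, y_true)]
h = [1.0] * len(y_true)
w_star = leaf_weight(g, h, lam=1.0)
```

Here G = −9 and H = 3, so with λ = 1 the leaf weight is 9/4 = 2.25, which is smaller than the mean residual of 3 because the regularization term shrinks the weight toward zero.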

2.2 Evaluation metrics

To quantitatively evaluate the performance of the five models, the following evaluation metrics are employed:

Coefficient of determination (R²)

$$R^{2} = \left[\frac{\sum_{i=1}^{n}(H_i^{\mathrm{obs}} - \bar{H}^{\mathrm{obs}})(H_i^{\mathrm{pre}} - \bar{H}^{\mathrm{pre}})}{\sqrt{\sum_{i=1}^{n}(H_i^{\mathrm{obs}} - \bar{H}^{\mathrm{obs}})^{2}\sum_{i=1}^{n}(H_i^{\mathrm{pre}} - \bar{H}^{\mathrm{pre}})^{2}}}\right]^{2}. \tag{17}$$

Nash–Sutcliffe efficiency (NSE)

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}(H_i^{\mathrm{obs}} - H_i^{\mathrm{pre}})^{2}}{\sum_{i=1}^{n}(H_i^{\mathrm{obs}} - \bar{H}^{\mathrm{obs}})^{2}}. \tag{18}$$

Mean absolute error (MAE)

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n}|H_i^{\mathrm{obs}} - H_i^{\mathrm{pre}}|}{n}. \tag{19}$$

Root mean square error (RMSE)

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(H_i^{\mathrm{obs}} - H_i^{\mathrm{pre}})^{2}}{n}}, \tag{20}$$

where $H_i^{\mathrm{obs}}$ and $H_i^{\mathrm{pre}}$ are the observed and predicted water levels, respectively, and $\bar{H}^{\mathrm{obs}}$ and $\bar{H}^{\mathrm{pre}}$ are the means of the observed and predicted water levels, respectively. R² ranges from 0 to 1; a value of 1 indicates a perfect linear relationship between the predicted and observed values. The NSE ranges from $-\infty$ to 1; the closer the NSE is to 1, the better the model result. RMSE and MAE are error metrics with an optimal value of zero.
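The four metrics can be computed directly from their definitions, as in the sketch below. The function names and the toy series are hypothetical. The example uses predictions with a constant bias of 0.5 to illustrate one design point: R², being a squared correlation, stays at 1 under a constant offset, whereas NSE, MAE, and RMSE all penalize it.

```python
import math


def r_squared(obs, pre):
    """Squared Pearson correlation between observed and predicted values."""
    mean_obs = sum(obs) / len(obs)
    mean_pre = sum(pre) / len(pre)
    cov = sum((o - mean_obs) * (p - mean_pre) for o, p in zip(obs, pre))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pre = sum((p - mean_pre) ** 2 for p in pre)
    return (cov / math.sqrt(var_obs * var_pre)) ** 2


def nse(obs, pre):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pre))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def mae(obs, pre):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pre)) / len(obs)


def rmse(obs, pre):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pre)) / len(obs))


# Hypothetical water levels (m); predictions carry a constant bias of 0.5 m.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
scores = {
    "R2": r_squared(observed, predicted),
    "NSE": nse(observed, predicted),
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
}
```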

Two additional metrics are considered, namely the peak error (PE) and the error of time to peak (Δt). Both are critical in flood warning, as they evaluate the model's capacity to forecast the peak water level in terms of magnitude and timing.

$$\mathrm{PE} = H_p^{\mathrm{pre}} - H_p^{\mathrm{obs}}, \tag{21}$$

$$\Delta t = T_p^{\mathrm{pre}} - T_p^{\mathrm{obs}}, \tag{22}$$

where $H_p^{\mathrm{obs}}$ and $H_p^{\mathrm{pre}}$ are the peaks of the observed and predicted water levels, respectively, and $T_p^{\mathrm{obs}}$ and $T_p^{\mathrm{pre}}$ denote the observed and predicted times to peak, respectively. The closer PE and Δt are to zero, the better the model performs.
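A short sketch of these two peak metrics follows. It assumes evenly spaced series (hourly by default) and takes the first occurrence of the maximum as the peak; both choices, the function names, and the toy hydrographs are assumptions made for illustration.

```python
def peak_error(obs, pre):
    """PE: predicted peak level minus observed peak level."""
    return max(pre) - max(obs)


def time_to_peak_error(obs, pre, step_hours=1.0):
    """Delta t: predicted time of peak minus observed time of peak, in hours."""
    t_obs = obs.index(max(obs))  # index of the first occurrence of the peak
    t_pre = pre.index(max(pre))
    return (t_pre - t_obs) * step_hours


# Hypothetical hourly hydrographs (m).
observed = [1.0, 2.5, 4.0, 3.0, 2.0]
predicted = [1.0, 2.0, 3.0, 3.8, 2.5]
pe = peak_error(observed, predicted)
dt = time_to_peak_error(observed, predicted)
```

A negative PE means the peak is underestimated, and a positive Δt means the predicted peak arrives late; in this toy case the model underestimates the peak by about 0.2 m and is 1 h late.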

3 RESULTS AND DISCUSSION

In this study, we predict water levels at the Le Thuy and Kien Giang stations using five regression-based machine learning models. We employ six evaluation metrics, R², NSE, MAE, RMSE, PE, and Δt, to assess the models' performance. On this basis, the most suitable configuration of the input data set and the most effective machine learning method for this problem are selected. The models are implemented in Python 3.7.

3.1 Study area and data

The Nhat Le River basin, which has a total area of 2612 km², includes three main tributaries: Kien Giang, Long Dai, and Nhat Le (Figure 1). Three medium-sized irrigation reservoirs, An Ma, Cam Ly, and Rao Da, have a small flood capacity of 22.1, 6.9, and 11.6 million m³, respectively. Other reservoirs have smaller capacities, so the impact of their operation on downstream water levels is negligible.

In this study, we use hourly rainfall and water level data collected from three stations, namely Kien Giang, Le Thuy, and Dong Hoi. The data set contains the flood season of 2010 as well as the whole years of 2012 and 2020. We split the data set into two parts: training data from 2010 and 2012 (10,297 samples) and testing data from 2020 (8785 samples).

3.2 Input data set to predict water level at Le Thuy and Kien Giang stations

We use the data set of hourly water levels and rainfall recorded at three stations (Kien Giang, Le Thuy, and Dong Hoi) in the years 2010, 2012, and 2020. Previous research and analysis indicate that the water level at a station is influenced by earlier rainfall and water levels at that station and at nearby stations. One problem is determining the length of past input data that gives the highest performance. Because the data are hourly, we can predict the water levels 1, 6, and 12 h ahead. To improve accuracy, we experiment with adding predicted rainfall data from hydro-meteorological stations for the following 1, 6, and 12 h as additional features. The following sections present the findings of these experiments.

Choosing the number of time lags and time leads

To get the optimal historical data length, we configure the water level forecasting model for Le Thuy and Kien Giang stations as follows:

$$\hat{H}_{t+h}^{LT} = f\left(R_{t+h,\ldots,t+1,t,t-1,t-2,\ldots,t-k};\ H_{t,t-1,t-2,\ldots,t-k}\right), \tag{23}$$

$$\hat{H}_{t+h}^{KG} = f\left(R_{t+h,\ldots,t+1,t,t-1,t-2,\ldots,t-k};\ H_{t,t-1,t-2,\ldots,t-k}\right), \tag{24}$$

where $\hat{H}_{t+h}^{LT}$ and $\hat{H}_{t+h}^{KG}$ are the forecast water levels at the Le Thuy and Kien Giang stations, respectively; t is the present time; t + h is the forecast time (h = 1, 6, or 12 h); $R_{t+h,\ldots,t+1,t,t-1,t-2,\ldots,t-k}$ are the rainfalls at the three stations in the past, at present, and in the future (predicted rainfall); $k$ is the number of time lags; and $H_{t,t-1,t-2,\ldots,t-k}$ are the water levels at the three stations (Le Thuy, Kien Giang, and Dong Hoi) at present and in the past.
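The input configuration above amounts to a sliding window over the hourly series. The sketch below shows one possible way to assemble such samples; the function name `build_samples`, the dictionary layout, the station keys, and the toy values are all hypothetical. For simplicity the "future" rainfall is read from the same rainfall series, whereas an operational setup would substitute forecast rainfall.

```python
def build_samples(rain, level, target, k, h):
    """Build (features, label) pairs for an h-step-ahead forecast with k time lags.

    rain, level: dicts mapping a station name to a list of hourly values.
    target: list of hourly water levels at the station being forecast.
    Features at time t: rainfall at t-k..t+h and water level at t-k..t,
    concatenated over all stations; the label is the target level at t+h.
    """
    n = len(target)
    samples = []
    for t in range(k, n - h):
        features = []
        for station in sorted(rain):
            features.extend(rain[station][t - k : t + h + 1])
        for station in sorted(level):
            features.extend(level[station][t - k : t + 1])
        samples.append((features, target[t + h]))
    return samples


# Hypothetical 10-hour series for two stations (keys are illustrative labels).
rain = {"KG": [float(i) for i in range(10)], "LT": [0.0] * 10}
level = {
    "KG": [10.0 + i for i in range(10)],
    "LT": [20.0 + i for i in range(10)],
}
samples = build_samples(rain, level, target=level["LT"], k=3, h=6)
features, label = samples[0]
```

With k = 3 and h = 6, each station contributes 10 rainfall values and 4 water level values per sample, so the feature vector grows linearly with k, h, and the number of stations.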

With the LR model, the number of time lags is varied between 1 and 10 h (k = 1–10) to predict the water level at the Le Thuy station with a time lead of 6 h (h = 6) (Figure 3). The results show that using data from the previous 2 to 10 h yields good results, with R² and NSE both above 0.99. In particular, at k = 3 all four metrics show the best performance, with MAE and RMSE of 2.35 and 0.25 cm, respectively. In addition, the peak error for the 2020 flood is 0.7 cm when k = 3. Therefore, we select the 4-h input data set (at times t, t − 1, t − 2, and t − 3) to predict the water level at time leads of 1, 6, and 12 h (h = 1, 6, 12) in the following section.

3.3 Prediction results of water levels in Kien Giang and Le Thuy

In this experiment, the water level at the Le Thuy station is predicted using five regression models: LR, RFR, SVR, MLPR, and LGBMR. The scatter plot (Figure 4) compares the models' water level predictions at a 6-h time lead. The MLPR model underperforms the LR, RFR, and LGBMR models, while the SVR model performs the worst, with its points lying furthest from the 1:1 line. When the water level is below warning level III (H = 2.7 m), the RFR and LGBMR models produce results close to the 1:1 line. However, the RFR and LGBMR forecasts fall below the observed values near the flood peak, which these models fail to estimate.

Tables 1 and 2 detail how well the models reproduce water levels at the Le Thuy and Kien Giang stations, respectively. Observed water level and rainfall data from 2010 and 2012, together with predicted rainfall, at three locations (Kien Giang, Le Thuy, and Dong Hoi) are used as training data. The models' output is the water level at Le Thuy and Kien Giang for the following 1, 6, and 12 h. The 2020 data set is used to test the models. The SVR model performs poorly, while the LR, RFR, and LGBMR models give good results in terms of NSE, R², MAE, and RMSE.

Table 1. Evaluation metrics of LR, RFR, SVR, MLPR, and LGBMR models at the Le Thuy station.

| Model | Time lead (h) | NSE | R² | MAE (m) | RMSE (m) | PE (m) |
|---|---|---|---|---|---|---|
| LR | 1 | 0.9999 | 0.9999 | 0.0045 | 0.0001 | 0.0226 |
| | 6 | 0.9960 | 0.9961 | 0.0236 | 0.0024 | 0.2143 |
| | 12 | 0.9869 | 0.9876 | 0.0422 | 0.0078 | 0.3933 |
| RFR | 1 | 0.9943 | 0.9952 | 0.0103 | 0.0034 | 1.0807 |
| | 6 | 0.9858 | 0.9875 | 0.0334 | 0.0085 | 1.2022 |
| | 12 | 0.9732 | 0.9733 | 0.0564 | 0.0159 | 1.1532 |
| SVR | 1 | 0.8903 | 0.8976 | 0.0774 | 0.0652 | 2.4954 |
| | 6 | 0.8261 | 0.8318 | 0.0813 | 0.1034 | 3.8098 |
| | 12 | 0.7943 | 0.8031 | 0.0989 | 0.1223 | 3.7720 |
| MLPR | 1 | 0.9938 | 0.9939 | 0.0198 | 0.0037 | 0.1666 |
| | 6 | 0.9808 | 0.9832 | 0.0420 | 0.0114 | 0.7417 |
| | 12 | 0.9628 | 0.9635 | 0.0581 | 0.0221 | 0.6911 |
| LGBMR | 1 | 0.9916 | 0.9926 | 0.0136 | 0.0050 | 1.1828 |
| | 6 | 0.9848 | 0.9856 | 0.0345 | 0.0090 | 1.1766 |
| | 12 | 0.9763 | 0.9764 | 0.0529 | 0.0141 | 1.1107 |

Note: The observed peak is 4.88 m. Bold indicates the best result, which is given by the LR model.
Abbreviations: LGBMR, light gradient boosting machine regression; LR, linear regression; MLPR, multilayer perceptron regression; RFR, random forest regression; SVR, support vector regression.

Table 2. Evaluation metrics of LR, RFR, SVR, MLPR, and LGBMR models at the Kien Giang station.

| Model | Time lead (h) | NSE | R² | MAE (m) | RMSE (m) | PE (m) |
|---|---|---|---|---|---|---|
| LR | 1 | 0.9983 | 0.9983 | 0.0104 | 0.0011 | 0.2215 |
| | 6 | 0.9363 | 0.9385 | 0.0788 | 0.0437 | 1.3489 |
| | 12 | 0.8908 | 0.8983 | 0.1143 | 0.0748 | 0.2867 |
| RFR | 1 | 0.9932 | 0.9936 | 0.0138 | 0.0047 | 2.0925 |
| | 6 | 0.9394 | 0.9407 | 0.0688 | 0.0416 | 2.7378 |
| | 12 | 0.8912 | 0.8926 | 0.1064 | 0.0746 | 3.6833 |
| SVR | 1 | 0.8243 | 0.8397 | 0.0977 | 0.1206 | 7.2003 |
| | 6 | 0.7405 | 0.7604 | 0.1189 | 0.1779 | 7.2043 |
| | 12 | 0.7008 | 0.7273 | 0.1482 | 0.2051 | 7.1874 |
| MLPR | 1 | 0.9770 | 0.9792 | 0.0498 | 0.0158 | 0.6581 |
| | 6 | 0.8336 | 0.8637 | 0.0897 | 0.1141 | 5.0895 |
| | 12 | 0.6689 | 0.7865 | 0.1516 | 0.2269 | 0.6804 |
| LGBMR | 1 | 0.9870 | 0.9877 | 0.0235 | 0.0089 | 2.5578 |
| | 6 | 0.9499 | 0.9502 | 0.0688 | 0.0343 | 2.0842 |
| | 12 | 0.9305 | 0.9311 | 0.0960 | 0.0476 | 2.7698 |

Note: The observed peak is 14.66 m. Bold indicates the best result, which is given by the LR model.
Abbreviations: LGBMR, light gradient boosting machine regression; LR, linear regression; MLPR, multilayer perceptron regression; RFR, random forest regression; SVR, support vector regression.

Forecasting is crucial for flood disaster mitigation and prevention, and a model's validity depends in part on how accurately it reproduces flood peaks and the time to peak. The LR model attains NSE and R² greater than 0.99, an RMSE of 0.007, and an MAE of 0.11; its flood peak error is about 8% at the Le Thuy station (39.3 cm for h = 12 h) and about 2% at the Kien Giang station (28.67 cm for h = 12 h), with ∆t = 1 h. These results indicate that the LR model performs best, which suggests three points. First, the LR model gives the best overall metrics in Tables 1 and 2. Second, the relationship between rainfall and the water level is approximately linear. Third, the water levels at the three stations (Le Thuy, Kien Giang, and Dong Hoi) influence one another approximately linearly.
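The relative peak errors quoted above can be checked with a short calculation from the LR rows at the 12-h lead in Tables 1 and 2. The helper name `relative_peak_error` is a hypothetical label; the peak errors and observed peaks are the tabulated values.

```python
def relative_peak_error(pe_m, observed_peak_m):
    """Peak error expressed as a percentage of the observed peak."""
    return 100.0 * abs(pe_m) / observed_peak_m


# LR model, 12-h lead: PE and observed peak taken from Tables 1 and 2.
le_thuy = relative_peak_error(0.3933, 4.88)      # Le Thuy station
kien_giang = relative_peak_error(0.2867, 14.66)  # Kien Giang station
```

The Le Thuy value comes out at roughly 8.06% and the Kien Giang value at roughly 1.96%, which correspond to the rounded figures of 8% and 2%.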

Figures 5 and 6 show the measured water level and the water level predicted by the LR model with a 6-h time lead during the historic flood from October 6 to October 21, 2020, at the two stations (Le Thuy and Kien Giang). The predicted and measured curves are very close to one another. However, the results of the LR model are still unreliable in places, particularly at the flood peak. The choice of input data, such as the water level at Dong Hoi, may be responsible for this. Because high tides affect the Dong Hoi water level, its water level curve fluctuates from time to time. During training, the model therefore tends to make the Le Thuy water level fluctuate in the same way as the Dong Hoi water level, even though the water level at Le Thuy is probably largely unaffected by high tides. To further improve prediction accuracy, future research should take other variables into account, such as the topography of the area, ground cover, and initial moisture conditions.

4 CONCLUSIONS

In this study, we developed models to predict the water level at the Le Thuy and Kien Giang stations on the Kien Giang River using the LR, RFR, SVR, MLPR, and LGBMR regression techniques. The models are trained and tested using hourly rainfall and water level data sets from three stations in 2010, 2012, and 2020. The findings indicate that the best water level prediction at the Le Thuy station 6 h ahead is obtained when using hourly rainfall and water level data in the period [t − 3, t], together with the predicted rainfall at three hydrometeorological stations in the area. The model can forecast both the trough and the peak of the water level. The evaluation metrics R², NSE, MAE, and RMSE show that the data-driven model is feasible and reliable for predicting water levels, and that LR outperforms the other four regression methods.

In future studies, we will consider adding other inputs, such as flows, tidal levels, rainfall at nearby stations, or forecast rain from the hydro-meteorological center stations (instead of using actual rainfall data measured at timestep t + 1), and constructing water level forecasting models for hydrological stations and other “virtual” stations along the Kien Giang and Nhat Le rivers. In addition, other machine learning techniques, such as deep learning algorithms, will be explored to enhance the accuracy of future water level forecasts.

ACKNOWLEDGMENTS

The authors would like to express their sincere gratitude for the support to complete this research. This research is supported by the Scientific Research and Technology Development Project at ministerial level, namely “Research on Digital Transformation in the flood warning methods for the community: an experimental flood warning system for Nhat Le river basin, Quang Binh province.”

ETHICS STATEMENT

None declared.

Open Research

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available from the corresponding author upon reasonable request.

REFERENCES

Atashi, V., Gorji, H. T., Shahabi, S. M., Kardan, R., & Lim, Y. H. (2022). Water level forecasting using deep learning time-series analysis: A case study of red river of the north. Water, 14(12), 1971. https://doi.org/10.3390/w14121971
Biau, G., & Scornet, E. (2015). A random forest guided tour. https://arxiv.org/abs/1511.05741
Boulesteix, A.-L., Janitza, S., Kruppa, J., & König, I. R. (2012). Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(6), 493–507. https://doi.org/10.1002/widm.1072
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Brereton, R. G., & Lloyd, G. R. (2010). Support vector machines for classification and regression. The Analyst, 135(2), 230–267. https://doi.org/10.1039/B918972F
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–27. https://doi.org/10.1145/1961189.1961199
Chen, C., He, W., Zhou, H., Xue, Y., & Zhu, M. (2020). A comparative study among machine learning and numerical models for simulating groundwater dynamics in the Heihe River Basin, northwestern China. Scientific Reports, 10(1), 3904. https://doi.org/10.1038/s41598-020-60698-9
Choi, C., Kim, J., Han, H., Han, D., & Kim, H. S. (2019). Development of water level prediction models using machine learning in wetlands: A case study of Upo Wetland in South Korea. Water, 12(1), 93. https://doi.org/10.3390/w12010093
Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A., & Vapnik, V. (1996). Support vector regression machines. In M. C. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems (Vol. 9). MIT Press. https://proceedings.neurips.cc/paper_files/paper/1996/file/d38901788c533e8286cb6400b40b386d-Paper.pdf.
Govindaraju, R. S., & Rao, A. R. (2000). Artificial neural networks in hydrology. I: Preliminary concepts. Journal of Hydrologic Engineering, 5(2), 115–123. https://doi.org/10.1061/(ASCE)1084-0699(2000)5:2(115)
Guo, W.-D., Chen, W.-B., Yeh, S.-H., Chang, C.-H., & Chen, H. (2021). Prediction of river stage using multistep-ahead machine learning techniques for a Tidal River of Taiwan. Water, 13(7), 920. https://doi.org/10.3390/w13070920
Hagan, M. T., Demuth, H. B., & Beale, M. (1996). Neural network design. PWS Publishing Co.
Haykin, S. (1998). Neural networks: A comprehensive foundation. Prentice Hall.
He, B., Ye, L., Pei, M., Lu, P., Dai, B., Li, Z., & Wang, K. (2022). A combined model for short-term wind power forecasting based on the analysis of numerical weather prediction data. Energy Reports, 8, 929–939. https://doi.org/10.1016/j.egyr.2021.10.102
Jhong, Y.-D., Chen, C.-S., Lin, H.-P., & Chen, S.-T. (2018). Physical hybrid neural network model to forecast typhoon floods. Water, 10(5), 632. https://doi.org/10.3390/w10050632
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). Lightgbm: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.
Khan, M. S., & Coulibaly, P. (2006). Application of support vector machine in lake water level prediction. Journal of Hydrologic Engineering, 11(3), 199–205. https://doi.org/10.1061/(ASCE)1084-0699(2006)11:3(199)
Li, B., Yang, G., Wan, R., Dai, X., & Zhang, Y. (2016). Comparison of random forests and other statistical methods for the prediction of lake water level: A case study of the Poyang Lake in China. Hydrology Research, 47(S1), 69–83. https://doi.org/10.2166/nh.2016.264
Liong, S.-Y., & Sivapragasam, C. (2002). Flood stage forecasting with support vector machines. JAWRA Journal of the American Water Resources Association, 38(1), 173–186. https://doi.org/10.1111/j.1752-1688.2002.tb01544.x
Ly, N. D., Duong, N. H., & Dai, N. (2013). Khí hậu và thủy văn tỉnh Quảng Bình. Nhà Xuất Bản Khoa Học Kỹ Thuật.
Mekanik, F., Imteaz, M. A., Gato-Trinidad, S., & Elmahdi, A. (2013). Multiple regression and artificial neural network for long-term rainfall forecasting using large scale climate modes. Journal of Hydrology, 503, 11–21. https://doi.org/10.1016/j.jhydrol.2013.08.035
Muñoz, P., Orellana-Alvear, J., Willems, P., & Célleri, R. (2018). Flash-flood forecasting in an Andean Mountain catchment—Development of a step-wise methodology based on the random forest algorithm. Water, 10(11), 1519. https://doi.org/10.3390/w10111519
Nguyen, T.-T., Huu, Q. N., & Li, M. J. (2015). Forecasting time series water levels on Mekong river using machine learning models. In 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE) (pp. 292–297). https://doi.org/10.1109/KSE.2015.53
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
Tang, M., Zhao, Q., Ding, S. X., Wu, H., Li, L., Long, W., & Huang, B. (2020). An improved light GBM algorithm for online fault detection of wind turbine gearboxes. Energies, 13(4), 807. https://doi.org/10.3390/en13040807
Wu, C. L., Chau, K. W., & Li, Y. S. (2008). River stage prediction based on a distributed support vector regression. Journal of Hydrology, 358(1), 96–111. https://doi.org/10.1016/j.jhydrol.2008.05.028
Yu, P.-S., Chen, S.-T., & Chang, I.-F. (2006). Support vector regression for real-time flood stage forecasting. Journal of Hydrology, 328(3), 704–716. https://doi.org/10.1016/j.jhydrol.2006.01.021

Volume 3, Issue 1, February 2024, Pages 59-68
