Applying Semi-AutoML for Vessel Traffic Flow Prediction: Case Studies of Two Ports
Abstract
Accurate prediction of vessel traffic flow plays a crucial role in maritime supply chain operations, enabling efficient resource allocation and timely delivery of goods. This study contributes to the field of maritime logistics and transportation management by employing a semiautomated machine learning (semi-AutoML) approach and presenting a comparative analysis to predict vessel traffic flow in two distinct port settings. The proposed approach involves automatically evaluating the performance of a set of preselected models to identify the best-fitting models for the dataset. This is followed by a manual tuning phase to further optimize the performance of the selected models. The proposed methodology is implemented on limited-scale datasets from two separate case studies: Mohammedia port and Los Angeles port, with the latter serving as a benchmark against an existing study. The performance of the models in predicting vessel traffic flow was evaluated using different metrics. The findings indicate improvements in forecast accuracy, with an RMSE of 2.22 for Mohammedia port and 11.65 for Los Angeles port. The results for Los Angeles port showcase a notable improvement of up to 70.95% in RMSE compared to the outcomes of a previous study, emphasizing the superior efficacy of the proposed methodology in predicting vessel traffic flow. The workflow presented in this study was implemented using the PyCaret framework, and the Python code implementation is publicly available on Colab (https://colab.research.google.com/drive/1Wk5Y_1uFSEYJLx49nXPaoPnGUgLGY6cT).
1. Introduction
Ship traffic flow refers to the movement of ships entering and leaving a port over a specific time period. It plays a critical role in assessing port performance and in capacity planning, and it is also important for reducing maritime accidents and economic losses [1]. Accurate forecasting of vessel traffic volume allows port operators to allocate resources effectively and to develop informed strategies that optimize the flow of goods and mitigate potential disruptions within the logistics chain [2]. Monitoring and forecasting ship traffic flow are also crucial for the sustainable development and competitiveness of a port: they enable stakeholders, including shipping companies, cargo owners, and customs authorities, to plan and optimize their logistics and supply chain activities [3].
Predicting ship traffic flow is a complex task because many factors affect maritime traffic patterns. One such factor is weather conditions, which can greatly impact ship movements and cause delays or diversions [4]. Seasonality must also be taken into consideration, as it can affect the demand for shipping services and the availability of port infrastructure [5]. Additionally, economic trends and geopolitical events can have a major impact on shipping patterns, as they can affect trade flows, shipping costs, and port investments [6]. The availability and quality of data pose further challenges. Vessel traffic data is often limited in volume or temporal resolution and may be outdated. In the case of Mohammedia port, the dataset is aggregated on a weekly basis, while for the Los Angeles case, the dataset is truncated, resulting in a limited number of data points for the comparative study. These limitations can significantly impede the accuracy of predictions [7]. Addressing these challenges is crucial for improving the accuracy and reliability of ship flow predictions, with direct implications for port operations and overall efficiency.
The main contributions of this study are as follows:
1. The literature review provides a comprehensive understanding of the importance of accurate ship traffic flow forecasting in enabling efficient resource allocation and timely goods delivery. This understanding sets the foundation for the study's objectives and relevance.
2. The study introduces a novel methodology for vessel traffic flow prediction by employing a semi-AutoML approach. This approach is applied to two distinct case studies, namely, the Mohammedia port and the Los Angeles port. The comparison not only assesses the performance of the proposed approach in different contexts but also provides valuable insights into its applicability and effectiveness.
3. The study adds to the current body of literature by evaluating the performance of the proposed approach against a previous study for the case of Los Angeles port, demonstrating improvements in performance.
This paper is structured as follows: Section 2 reviews the relevant literature, focusing on two key aspects. Firstly, maritime logistics and operations planning are examined to provide a context for the study and highlight the importance of accurate vessel traffic flow prediction in this domain. Secondly, modeling approaches for vessel traffic flow prediction are discussed, namely statistical models and machine learning models. Section 3 presents the theoretical framework and the study design. Its subsections describe the preselected algorithms, the process of hyperparameter optimization, and the evaluation metrics used to assess the performance of the models. Section 4 describes the datasets and the proposed workflow, presents the results of the semi-AutoML approach for vessel traffic flow prediction in the considered ports, and analyzes the findings. Section 5 summarizes the findings, discusses the implications of the results, and proposes potential avenues for future research.
2. Literature Review
2.1. Maritime Logistics and Operations Planning
Maritime logistics and operations planning involve optimizing the movement of goods and services in the maritime transportation network, which accounts for over 80% of global trade volume [8]. A study conducted by Gülmez et al. [9] is aimed at providing a comprehensive analysis of the literature on maritime logistics. The results suggest a growing focus on enhancing port operations optimization and planning. Academic research in this field is aimed at developing models and algorithms to assist decision-making, driven by the importance of maritime safety and security, advances in machine learning, and the availability of data such as automatic identification system (AIS) [10]. AIS is used in maritime transportation to track the movements of vessels. It includes information about the ship, such as the position, course, speed, and identification. By analyzing AIS data, it is possible to identify patterns in vessel trajectories. Lee et al. [11] conducted an analysis of AIS data obtained from ships that arrived at and departed from Busan New Port, located in South Korea. The study used a machine learning clustering method to derive ship trajectory patterns and identified clusters for port arrival and departure. The findings have implications for the development of novel port maneuvering guidelines for conventional ships as well as autonomous vessels operating within port areas. However, AIS data may not provide sufficient information and may contain errors or omissions due to technical malfunctions or incorrect settings. Marten et al. [12] present a proficient AI solution integrated within a database, utilizing Markov models to forecast future port destinations of vessels using historical AIS data. The proposed method addresses situations where the AIS destination entry of a vessel is ambiguous and demonstrates high accuracy in handling large-scale prediction tasks. 
To incorporate extensive AIS data into maritime Internet of Things (IoT) applications, fast communication networks and supercomputing capabilities are essential. Liu et al. [13] propose utilizing 6G technology for the enhancement of conventional maritime IoT systems. The study suggests a machine learning framework to improve vessel trajectory records from AIS networks in maritime IoT scenarios. The first step uses a density-based clustering method to recognize outliers, while the second step employs a bidirectional long short-term memory (BLSTM) to recover timestamped points affected by outliers. The proposed framework outperformed other methods in experiments conducted on both simulated and real-world datasets and holds the capability to improve intelligent vessel traffic services in maritime IoT systems enabled by 6G technology. An essential aspect of ensuring the safety and security of maritime transportation involves maintaining a persistent situational awareness of the maritime domain with a robust security and monitoring system. Nevertheless, it can prove to be a challenging task to monitor the movements of vessels, particularly in regions with high levels of congestion. Vanneschi et al. [14] propose a new genetic programming (GP) system to predict the position of vessels in a defined time range. The proposed framework is anticipated to provide support in emergency response operations and improve the efficiency of chasing illegal vessels. The proposed system integrates geometric semantic genetic operators and linear scaling to improve GP performance, which outperforms machine learning methods such as support vector machines (SVMs) and linear regression (LR). Dobrkovic et al. [15] focus on the optimization of barge scheduling and waiting times by introducing a novel approach for predicting the future positions of deep-sea vessels. 
By utilizing a directed graph methodology tailored to the unique characteristics of maritime routes in the North Sea region, the study extracts relevant waypoints from traffic data to facilitate accurate vessel position predictions. The effectiveness of the proposed method is demonstrated through the evaluation of real-world AIS data. A recent study conducted by Weerasinghe et al. [16] offers a comprehensive review of maritime logistics and operations planning, highlighting the considerable research focus on areas such as scheduling, simulation and automation, container transportation, and dockside operations. The findings of these studies underscore the significant potential of employing data analysis and machine learning techniques to advance maritime domain awareness, inform effective maritime transportation management practices, and enhance operations planning strategies.
2.2. Modeling Approaches for Vessel Traffic Flow Prediction
The prediction of vessel traffic flow is an important matter in the field of maritime transportation, as it has the potential to enhance the efficiency and safety of port operations. Numerous techniques have been suggested to tackle this problem, such as statistical models and machine learning models. In this section, we provide a concise overview of recent studies on these approaches and their usage in both maritime and other related domains.
2.2.1. Statistical Models
Statistical models are widely used to forecast ship traffic flow due to their ability to capture relationships between variables and provide reliable predictions [17]. Several studies have explored the effectiveness of such models for traffic flow prediction. Kumar and Vanajakshi [7] explore the use of a seasonal autoregressive integrated moving average (SARIMA) for short-term prediction of traffic flow with low-scale data. The study used flow data from a three-lane arterial roadway in India and validated the developed model by performing a 24-h forecast to compare the predicted flow with actual flow values. The same approach was used in a study conducted by Multiningsih et al. [18], which focuses on forecasting the volume of ship passenger flow at Semayang Port. Identifying seasonalities is essential to recognize distinct patterns within the port. Ghosh et al. [19] highlight the importance of considering seasonal and periodic patterns within the data. The study compares classical and Bayesian methods for estimating the parameters of the SARIMA method. The Bayesian approach utilizes the Markov chain Monte Carlo method and provides more accurate forecasts for extreme peaks and rapid fluctuations compared to the classical methods. The article by Farhan and Ong [5] examines the significance of accurate seasonal container throughput forecasts for logistics companies, enabling them to generate reliable predictions of container throughput at major international container ports. These forecasts play a crucial role in assisting port operators in making operational decisions and enhancing their overall efficiency. A study by He et al. [20] proposes a Kalman model customized for short-term vessel traffic flow forecasting, which uses a regression model to replace the Kalman filter transfer equation. 
The model demonstrates better prediction performance than the traditional regression model and highlights the importance of vessel traffic flow forecasting for safety measures in maritime transportation within water regions featuring multiple bridges.
2.2.2. Machine Learning Models
In the context of predicting vessel traffic flow, traditional statistical models may not be sufficient to capture the complex and dynamic dependencies between the various variables involved. This is due to the potential for nonlinearity and dynamism in such relationships. Alternatively, machine learning models can incorporate a broad range of factors that influence vessel traffic, such as weather conditions, port congestion, and vessel size and type. A study by Ogura et al. [4] is aimed at developing a method for predicting vessel arrival time that considers weather conditions for better forecast accuracy. The presented approach in this study utilizes Bayesian learning techniques to estimate vessel routes and voyage speeds by taking into account historical route and operational data in similar future weather conditions. The effectiveness of the method was evaluated using data from vessels engaged in the transportation of goods for the domestic appliance and automotive sectors. The results demonstrate that the proposed approach achieves a mean prediction accuracy of 90%, outperforming other benchmarked methods by a significant margin of 28%.
Recent studies have further advanced the field of vessel traffic flow prediction. Kim and Lee [21] introduce a novel spatiotemporal embedding network (STENet) for the prediction of long- and medium-term traffic in high-risk areas, emphasizing the integration of spatial and temporal information using real AIS sensor data from a Korean port. Experimental results show that STENet exhibits a significantly higher level of performance compared to traditional models. Tang et al. [22] propose a long short-term memory (LSTM) model improved with a random deactivation layer to predict vessel heave motion and alleviate time delay in wave compensation control systems. A mathematical model of vessel motion is employed to solve the heave motion of the vessel, and a wave model is subsequently established. The proposed LSTM model outperforms the backpropagation (BP) for short-term predictions.
Additionally, Lee et al. [23] employ a similar approach with the aim of enhancing the performance of autonomous ships. The study utilizes historical data and applies an LSTM model, incorporating various extracted features from the data. The results of the study demonstrate promising outcomes, indicating a potential application in recognizing maritime traffic conditions for coastal ships navigating complex sea routes. Recent advancements in machine learning for maritime applications are also highlighted in studies such as Man et al. [24], which introduces a spatiotemporal model for vessel traffic prediction. This model integrates graph attention (GAT) networks with BLSTM to capture spatial and temporal dependencies, significantly enhancing prediction accuracy for Yangtze River traffic. Additionally, Jiang et al. [25] present a spatiotemporal multigraph fusion network (STMGF-Net) for vessel trajectory forecasting. STMGF-Net employs multigraph construction and fusion techniques to model intricate vessel interactions, delivering high-precision predictions for intelligent maritime navigation.
Beyond maritime applications, Pan et al. [26] propose a hybrid framework, fundamental diagram (FD)-Markov-LSTM, that merges the FD, Markov chains, and LSTM to estimate and predict traffic states. This hybrid approach combines statistical and deep learning methods, achieving notable improvements in urban traffic flow prediction accuracy. Similarly, Pan et al. [27] develop a hybrid traffic flow model (TFMDL) that integrates deep learning with traffic flow theory. TFMDL combines model-driven and data-driven methods, outperforming traditional approaches in estimation accuracy and data efficiency for highway traffic. Together, these studies underscore the effectiveness of hybrid and spatiotemporal models in advancing the accuracy and robustness of traffic flow prediction across both maritime and urban domains.
The effectiveness of different modeling approaches can vary depending on various factors such as data characteristics, specific port or domain considerations, and desired performance metrics. Therefore, it is important to assess the specific requirements and context of the prediction task in order to choose an appropriate modeling approach. In this regard, the present study proposes a semi-AutoML approach to systematically evaluate the performance of multiple machine learning models. Through experimentation across diverse scenarios, our aim is to identify the most suitable method for predicting vessel traffic flow in the context of the two case studies investigated in this research.
3. Theory and Methodologies
3.1. Study Design
1. The initial stage of the study involves dataset selection. Two case studies are considered: Mohammedia port and Los Angeles port. These datasets serve as the foundation for conducting comparative analyses and investigating vessel traffic flow prediction in different port settings.
2. In the data preparation process, the selected data is examined and processed to handle missing values, outliers, and inconsistencies to ensure data integrity and accuracy. The data is then aggregated to a suitable level, enabling meaningful insights and analysis. In this phase, each dataset is divided into three subsets: training, testing, and validation sets for robust model training, unbiased evaluation, and reliable validation of the predictive models.
3. Following data preparation, a set of algorithms is preselected to obtain an initial evaluation of their performance on the datasets. In this study, a total of 12 models are chosen and evaluated using a semi-AutoML approach with predefined configurations.
4. From the initial evaluation, the top two performing algorithms for each case study are manually trained and fine-tuned to further enhance their performance.
5. To optimize the hyperparameters of the selected models and obtain reliable performance estimates, a combination of random grid search and k-fold cross-validation techniques is employed. This iterative process systematically explores different hyperparameter configurations and evaluates their performance using cross-validation.
6. The performance of all the models is thoroughly evaluated using various metrics to assess their predictive quality. These metrics are considered at each stage of the process, including the semi-AutoML phase for training the models, the hyperparameter tuning phase, and the evaluation on the validation set.
The overall workflow of the study is illustrated in Figure 1, and a comprehensive explanation of each step is outlined in the subsequent sections.

3.2. Preselected Algorithms
To identify the most suitable approach for forecasting vessel traffic flow, a range of candidate models was evaluated. Initially, a set of 12 models from diverse families was chosen and applied to the datasets to assess their performance. Through a comparative analysis, the most suitable models were selected. This section gives a concise overview of the investigated models.
3.2.1. Ensemble Learning Models
Ensemble learning integrates multiple models to enhance overall predictive performance. The three ensemble models utilized in this study are random forest, extra trees, and adaptive boosting (AdaBoost). Random forest combines numerous decision trees, each trained on a bootstrap sample of the available data, and typically considers a random subset of features at each split. The predictions of the individual trees are then aggregated (averaged, in regression) to make the final prediction. Extra trees also constructs multiple decision trees but differs from random forest in two respects: by default each tree is trained on the entire training dataset rather than a bootstrap sample, and the split thresholds at each node are drawn at random instead of being optimized, which introduces additional randomness into the model. In contrast, AdaBoost iteratively constructs a strong learner by training weak learners, specifically decision trees, in a sequential manner. During each iteration, it assigns higher weights to the instances that the previous learner predicted poorly (misclassified instances in classification, instances with large errors in regression). The process is repeated until a preset number of learners is reached or the error can no longer be reduced. A comprehensive examination of these models is conducted by Sagi and Rokach [28].
3.2.2. LR Models
The LR model family encompasses algorithms that seek to establish a linear relationship between input variables and a continuous output variable, thus enabling the prediction of numerical values based on given features. Five models within this family were selected in this study: least absolute shrinkage and selection operator (Lasso), least angle regression (LARS), ridge, Lasso least angle regression (LLAR), and elastic net. The Lasso technique stands out for its ability to perform feature selection and regularization within LR. By imposing a penalty on the absolute magnitudes of the regression coefficients, Lasso encourages sparse solutions. The LARS algorithm is used for both feature selection and regression. It incorporates the predictors into the model one at a time while adjusting their coefficients to minimize the residual error, which allows it to identify and include the most relevant predictors. Ridge regression incorporates Tikhonov regularization to counter overfitting. It imposes a penalty on the sum of squared coefficients, which shrinks their magnitudes. The remaining two models combine these ideas. LLAR is a variant of the Lasso algorithm that computes the Lasso solution using the geometric principles of LARS, accomplishing feature selection and regularization simultaneously. Elastic net is a regularization technique that combines the Lasso and ridge methods by imposing both absolute and squared penalties on the regression coefficients. Doreswamy and Vastrad [29] offer an extensive analysis and comprehensive description of each model, presenting a benchmark study conducted on the oxazoline and oxazole derivative descriptor dataset.
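The three penalties described above can be written compactly. The notation is assumed here for illustration and does not come from the study: $y$ is the response vector, $X$ the feature matrix, $\beta$ the coefficient vector, and $\lambda, \lambda_1, \lambda_2 \ge 0$ the penalty strengths. Software libraries often scale these terms differently.

```latex
\begin{align}
\hat{\beta}_{\text{lasso}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \\
\hat{\beta}_{\text{ridge}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\
\hat{\beta}_{\text{elastic net}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
\end{align}
```

Setting $\lambda_2 = 0$ in the elastic net recovers the Lasso, and setting $\lambda_1 = 0$ recovers ridge regression.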
3.2.3. Gradient Boosting Models
Four models employed in this study are part of the gradient boosting family. Gradient boosting for regression (GBR) is a comprehensive framework used to create ensemble models through the iterative combination of weak models. This process involves sequentially adding models that aim to rectify the errors made by their predecessors, thereby generating a robust predictive model. Categorical boosting (CatBoost) regressor, on the other hand, is a model specifically designed to effectively handle categorical features. It incorporates ordered boosting and gradient-based tree construction to improve its overall performance. Light gradient boosting machine (LightGBM) is a highly efficient gradient boosting framework recognized for its speed and effectiveness. It employs a histogram-based algorithm for expedited training and supports diverse objectives and evaluation metrics. Lastly, extreme gradient boosting (XGBoost) is a gradient boosting library, known for its exceptional performance and adaptability. It integrates optimization techniques, regularization methods, and tree pruning to enhance accuracy and training speed. A comparative analysis of gradient boosting algorithms is provided by Bentéjac et al. [30].
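The idea of sequentially adding models that correct their predecessors' errors can be illustrated with a minimal sketch. It is a toy pure-Python implementation under stated assumptions (squared loss, one-split decision stumps, a single feature, made-up data), not the GBR, CatBoost, LightGBM, or XGBoost implementations used in the study; the function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # drop the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps under squared loss and return a prediction function."""
    base = sum(y) / len(y)  # initial model: the mean of the targets
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        # For squared loss the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Made-up data for illustration only.
x = list(range(10))
y = [3.0, 4.0, 2.0, 5.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0]

model = gradient_boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [model(xi) for xi in x])
print(f"mean-only MSE: {baseline_mse:.3f}, boosted MSE: {boosted_mse:.3f}")
```

Each round fits a weak learner to what the current ensemble still gets wrong, and the learning rate damps each correction so that no single stump dominates.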
3.3. Evaluation Metrics
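RMSE is the metric reported throughout this study. Its standard definition is given below; the notation is assumed here for illustration, with $n$ the number of observations, $y_i$ the observed vessel count, and $\hat{y}_i$ the predicted vessel count for period $i$.

```latex
\begin{equation}
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
\end{equation}
```

RMSE is expressed in the same unit as the target (here, number of vessels per period), and lower values indicate better forecasts.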
4. Experimental Approach
4.1. Data Description
This study utilizes two datasets to analyze and compare vessel traffic flow in different ports. The primary dataset is collected from Mohammedia port in Morocco, which handles an annual traffic volume of 11 million tons, predominantly focused on petroleum products [2]. The dataset includes weekly records of ships served at the port from 2017 to mid-2022, with a specific inclusion criterion of ships measuring between 100 and 180 m in length.
For comparative analysis, this study uses a benchmark dataset from previous research conducted by Zhang et al. [31], which analyzed vessel traffic flow in the Los Angeles port area using AIS data. The Los Angeles dataset, sourced from the US Department of the Interior’s Bureau of Ocean Energy Management (BOEM) and the National Oceanic and Atmospheric Administration (NOAA), spans from 2010 to 2022 and provides daily traffic records. To align with the benchmark study, this work focuses on data from the month of January between 2010 and 2014, aggregated daily, while excluding records from other months. This approach ensures consistency and enables a direct comparison of results.
Both datasets were preprocessed, and no missing data was found. The primary features used in both cases are the date and the corresponding number of vessels recorded on that date. Figure 2 illustrates the vessel traffic flow for the two case studies. For Mohammedia port, the data is aggregated weekly, reflecting trends over time, while the Los Angeles port data is presented daily for January across a 5-year period, highlighting traffic patterns during this specific timeframe. The figure underscores the relatively small size of both datasets, which poses challenges for accurate ship traffic flow prediction. Nevertheless, these datasets provide a robust foundation for evaluating the proposed methodology in distinct port environments.


4.2. Model Training
The datasets from the Mohammedia port and the Los Angeles port were each divided into three distinct sets: the training set, the test set, and the validation set. The training set was used to fit the models, enabling them to identify patterns and correlations in the data. The test set was used to evaluate the performance of the models and to fine-tune their hyperparameters. The models are never fitted on the test set, but because it guides the choice of hyperparameters, it influences them indirectly. Finally, the validation set was reserved for applying the trained models to unseen data, providing an assessment of their performance beyond the training and test sets. Figure 3 illustrates the distribution of the three sets in terms of percentage and number of data points.


In the context of the Mohammedia port case study, the dataset is divided into an 85% training set and a 15% test set. The goal is to predict vessel traffic flow for a future period of 5 months, equivalent to 22 weeks. In the Los Angeles case study, the training data is split into a 90% training set and a 10% test set. Simultaneously, a validation set consisting of 11 days is utilized, aligning with the duration used in the benchmarked study.
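A split of this kind can be sketched as follows. The sketch assumes the split is chronological (the text does not state whether the data was shuffled), and it uses a made-up series of 200 points rather than the real data; the function name `chronological_split` is hypothetical.

```python
def chronological_split(series, holdout, test_fraction):
    """Split a time-ordered sequence into (train, test, validation).

    The last `holdout` points form the validation set. The remainder is
    split so that its final `test_fraction` share forms the test set.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    if not 0 < holdout < len(series):
        raise ValueError("holdout must be positive and shorter than the series")
    modelling, validation = series[:-holdout], series[-holdout:]
    n_test = round(len(modelling) * test_fraction)
    split_at = len(modelling) - n_test
    return modelling[:split_at], modelling[split_at:], validation


# Hypothetical weekly counts; the real series is not reproduced here.
weekly_counts = list(range(200))

# Mohammedia-style settings: 22-week validation horizon, 85/15 train/test.
train, test, validation = chronological_split(
    weekly_counts, holdout=22, test_fraction=0.15
)
print(len(train), len(test), len(validation))  # 151 27 22
```

For the Los Angeles settings described above, the same function would be called with `holdout=11` and `test_fraction=0.10`.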
The models developed in this study were implemented using PyCaret, a Python library designed for machine learning tasks. PyCaret automated the machine learning workflow, allowing for the preselection and comparison of models across the two use cases considered in this research. It was also used for manual model creation and fine-tuning, aiming to optimize the performance of the top-performing models. The experiments were conducted on a computational system equipped with a 12-core Intel i7-9750H processor operating at a base clock speed of 3.3 GHz, 16 GB of RAM, and an Nvidia GeForce GTX 1650 graphics processing unit with 4 GB of VRAM.
The results of the comparative analysis conducted during the semi-AutoML phase are depicted in Figures 4 and 5. Figure 4 shows the evaluation metrics obtained by each model for the two case studies and ranks the models by their RMSE values. AdaBoost was the most effective model for the Mohammedia port dataset, while extra trees achieved the lowest RMSE for the Los Angeles dataset. Random forest ranked second in both scenarios. The training time (TT), in seconds, indicates how long each model took to train. Figure 5 plots the predictions of the preselected models, providing a graphical representation of their performance.
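The ranking step of the semi-AutoML phase can be sketched with scikit-learn alone. This is a stand-in for PyCaret's automated comparison, not the study's code. Several things are assumptions made for illustration: the synthetic seasonal series, the use of four lagged values as features, the chronological 85/15 split, and the restriction to seven of the twelve model families so that scikit-learn is the only dependency.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_squared_error

# Synthetic weekly series (assumption): seasonal signal plus noise.
rng = np.random.default_rng(0)
weeks = np.arange(200)
series = 20 + 5 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 2, size=200)

# Lag features (assumption): predict each week from the previous four.
n_lags = 4
X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
y = series[n_lags:]

# Chronological 85/15 split into training and test sets.
cut = int(len(y) * 0.85)
X_train, X_test, y_train, y_test = X[:cut], X[cut:], y[:cut], y[cut:]

# Candidate models with default configurations.
candidates = {
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Extra trees": ExtraTreesRegressor(random_state=0),
    "Random forest": RandomForestRegressor(random_state=0),
    "Gradient boosting": GradientBoostingRegressor(random_state=0),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "Elastic net": ElasticNet(),
}

# Fit every candidate and score it by RMSE on the test set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

# Rank from lowest to highest RMSE and keep the two best for manual tuning.
ranking = sorted(scores.items(), key=lambda item: item[1])
for name, rmse in ranking:
    print(f"{name:<18} RMSE = {rmse:.2f}")
top_two = [name for name, _ in ranking[:2]]
```

On synthetic data the ranking carries no meaning for the two ports. The sketch only shows the mechanics of scoring a preselected set and passing the top two models on to tuning.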




4.3. Hyperparameter Tuning
This section presents the findings of the hyperparameter tuning phase, which is aimed at enhancing the effectiveness of the chosen models. The top two models from the semi-AutoML training phase were selected for each dataset. The fine-tuning process used a randomized search over a predefined grid of hyperparameter values, coupled with k-fold cross-validation to assess the models' performance, with the primary objective of minimizing the RMSE. To determine a suitable value of k, an experiment was conducted; Figure 6 illustrates the influence of different values of k on the RMSE for the test sets of the two case studies. In the Mohammedia port dataset, the AdaBoost model exhibited the lowest RMSE with k = 3, whereas the random forest model performed best with k = 11. In the case of the Los Angeles port dataset, both the extra trees and random forest models performed best with k = 11. The RMSE values obtained for these two models were very close: 11.65 for extra trees and 11.67 for random forest. This suggests that, among the values tested, 11-fold cross-validation was the most suitable setting for the Los Angeles port dataset.
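The combination of randomized search and k-fold cross-validation can be sketched with scikit-learn. The sketch is illustrative rather than the study's PyCaret code: the data is synthetic, the search space is hypothetical, and the folds are contiguous and unshuffled, which is an assumption since the text does not describe how folds were formed. Only k = 3 and k = 5 are tried to keep the run short; the study's best value, k = 11, can be passed in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic regression data (assumption), standing in for the port datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=120)

# Hypothetical search space; the study's actual grid is not given in the text.
search_space = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 3, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}


def tune(k, n_iter=5):
    """Randomly sample n_iter configurations and score each with k-fold CV."""
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=0),
        param_distributions=search_space,
        n_iter=n_iter,
        scoring="neg_root_mean_squared_error",  # maximised, so RMSE is negated
        cv=KFold(n_splits=k),
        random_state=0,
    )
    search.fit(X, y)
    return -search.best_score_, search.best_params_


# Repeat the search for several values of k and compare the best RMSE.
results = {k: tune(k) for k in (3, 5)}
for k, (rmse, params) in results.items():
    print(f"k = {k}: best CV RMSE = {rmse:.3f} with {params}")
```

Because the scorer is a negated RMSE, the sign is flipped before reporting. Comparing the best RMSE across values of k mirrors the experiment summarized in Figure 6.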


4.4. Results and Discussion
The refined models, obtained by retraining the best two models for each case study with the optimal k values, were evaluated on both the test set and the validation set. The resulting vessel traffic flow predictions, which reflect the effect of the hyperparameter tuning process, are shown in Figure 7. Comparing model performance on the test sets makes the impact of fine-tuning clear, and the models were additionally validated on the unseen validation set. Table 1 lists the parameters of the finalized models after fine-tuning.
In the Mohammedia port case study, AdaBoost achieved the best results for all metrics in both the semi-AutoML phase and the hyperparameter tuning phase. The random forest model performed slightly worse but remained comparable to AdaBoost. In the Los Angeles case study, the improvement from the semi-AutoML phase to the hyperparameter tuning phase is significant and can be attributed to the effect of the chosen k value.
To assess the effectiveness of the proposed methodology, the results on the Los Angeles port dataset were compared with three neural network models from a referenced study [31]: the backpropagation neural network (BPNN), the genetic algorithm–optimized backpropagation neural network (GA-BPNN), and the self-adaptive particle swarm optimization backpropagation neural network (SAPSO-BPNN). Figure 8 shows that the top two models from the semi-AutoML phase achieved a lower RMSE than the BPNN and GA-BPNN approaches. After hyperparameter tuning, the extra trees model achieved an RMSE comparable to that of the SAPSO-BPNN model. It is worth noting that the extra trees model may have achieved this performance with less simulation time.
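The RMSE comparison between a baseline and a tuned model can be made concrete with a short worked sketch. The function names and all numbers below are hypothetical and chosen only so the arithmetic is easy to follow; they are not values from this study or from [31].

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_rmse, new_rmse):
    """Percentage reduction in RMSE relative to a baseline model."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


# Hypothetical vessel counts, for illustration only.
actual = [10, 12, 9, 15]
baseline_pred = [14, 8, 13, 11]  # every prediction is off by 4
tuned_pred = [11, 11, 10, 14]    # every prediction is off by 1

baseline_rmse = rmse(actual, baseline_pred)  # 4.0
tuned_rmse = rmse(actual, tuned_pred)        # 1.0
improvement = relative_improvement(baseline_rmse, tuned_rmse)
print(f"RMSE {baseline_rmse:.2f} -> {tuned_rmse:.2f} ({improvement:.1f}% lower)")
```

Because RMSE is expressed in the units of the target (vessels per period), the percentage form is convenient when comparing against results reported in another study on the same dataset.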


Table 1: Parameters of the finalized models after fine-tuning.
Parameter | Mohammedia port: AdaBoost | Mohammedia port: Random forest | Los Angeles port: Extra trees | Los Angeles port: Random forest |
---|---|---|---|---|
n estimators | 50 | 100 | 100 | 100 |
Learning rate | 1.0 | — | — | — |
Loss | Linear | — | — | — |
Bootstrap | — | True | False | True |
ccp alpha | — | 0.0 | 0.0 | 0.0 |
Criterion | — | Squared error | Squared error | Squared error |
Max depth | — | None | — | None |
Max features | — | 1.0 | — | 1.0 |
Max leaf nodes | — | None | — | None |
Max samples | — | None | — | None |
Min impurity decrease | — | 0.0 | 0.0 | 0.0 |
Min samples leaf | — | 1 | 1 | 1 |
Min samples split | — | 2 | 2 | 2 |
Min weight fraction leaf | — | 0.0 | 0.0 | 0.0 |
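As a reading aid, the tabulated parameters can be written out as estimator constructors. The sketch below assumes the table rows correspond to the scikit-learn regressors of the same names and to their keyword arguments; the variable names are hypothetical, and cells shown as a dash in the table are simply left at the library defaults.

```python
from sklearn.ensemble import (
    AdaBoostRegressor,
    ExtraTreesRegressor,
    RandomForestRegressor,
)

# Settings the table lists for all three forest-type columns.
forest_params = dict(
    n_estimators=100,
    ccp_alpha=0.0,
    criterion="squared_error",
    min_impurity_decrease=0.0,
    min_samples_leaf=1,
    min_samples_split=2,
    min_weight_fraction_leaf=0.0,
)

# Settings the table lists only in the two random forest columns.
random_forest_only = dict(
    bootstrap=True,
    max_depth=None,
    max_features=1.0,
    max_leaf_nodes=None,
    max_samples=None,
)

# Mohammedia port models.
mohammedia_adaboost = AdaBoostRegressor(
    n_estimators=50, learning_rate=1.0, loss="linear"
)
mohammedia_random_forest = RandomForestRegressor(**forest_params, **random_forest_only)

# Los Angeles port models; extra trees rows shown as a dash keep library defaults.
los_angeles_extra_trees = ExtraTreesRegressor(**forest_params, bootstrap=False)
los_angeles_random_forest = RandomForestRegressor(**forest_params, **random_forest_only)

for name, model in [
    ("Mohammedia / AdaBoost", mohammedia_adaboost),
    ("Mohammedia / Random forest", mohammedia_random_forest),
    ("Los Angeles / Extra trees", los_angeles_extra_trees),
    ("Los Angeles / Random forest", los_angeles_random_forest),
]:
    print(name, model.get_params()["n_estimators"])
```

No random seed appears in the table, so none is set here; fixing `random_state` would be a reasonable addition when reproducibility matters.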

5. Conclusions
This research applies a semi-AutoML methodology to forecast vessel traffic flow and presents a comparative analysis of machine learning algorithms. Two case studies with limited-size datasets were considered: Mohammedia port and Los Angeles port. Twelve models from diverse model families were examined and evaluated through a series of experimental analyses. The findings highlight the effectiveness of ensemble learning models, specifically AdaBoost, extra trees, and random forest, which outperformed the other benchmarked models in both case studies. The systematic application of the semi-AutoML approach facilitated the identification of the most suitable models for the given problem, which were further enhanced through hyperparameter tuning. For the Los Angeles port, the comparison with a previous study demonstrated the improved performance of the proposed semi-AutoML approach in terms of the RMSE metric.
However, this study has certain limitations that should be acknowledged. First, the reliance on historical data may restrict the models' ability to adapt to unexpected or rapidly changing situations, such as sudden shifts in weather conditions, geopolitical events, or port disruptions. While the proposed methodology performs well under stable conditions, its effectiveness in dynamic or unforeseen scenarios remains untested. Second, the limited size of the datasets, particularly for Mohammedia port, may constrain the generalizability of the findings to larger or more complex port environments. Additionally, the study's focus on aggregated data (weekly for Mohammedia and daily for Los Angeles) may overlook finer-grained temporal patterns that could improve prediction accuracy.
Future research could address these limitations in two directions:
1. Integrating real-time data streams, specifically AIS data, which could significantly enhance the responsiveness and adaptability of the predictive models in dynamic vessel traffic flow scenarios. The current study relied on aggregated data derived from AIS data, which limited the use of potentially valuable features.
2. Incorporating additional influential factors, such as weather conditions, geopolitical events, and economic trends, when predicting traffic flow for diverse port settings. Comparing the outcomes across different settings would provide insights into the generalizability of the model and highlight the necessity for region-specific adjustments.
We hope the findings of the proposed approach and its applications will shed light on the accuracy and reliability of different forecasting methods for ship flow prediction and provide valuable insights for stakeholders in the maritime industry to make informed decisions and enhance operational efficiency.
Conflicts of Interest
The authors declare no conflicts of interest.
Author Contributions
Abdeltif Boujamza: conceptualization, formal analysis, methodology, and writing—original draft. Mohamed El Hafta: conceptualization and writing—review and editing. Saâd Lissane Elhaq: supervision, resources, and project administration. Ahmed Loukili: supervision and project administration.
Funding
This research did not receive any external funding.
Open Research
Data Availability Statement
The data that support the findings of this study are openly available on Colab at https://colab.research.google.com/drive/1Wk5Y_1uFSEYJLx49nXPaoPnGUgLGY6cT.