Volume 44, Issue 6 pp. 1238-1244
CLINICAL ARTICLE
Open Access

Portable Ultrasound Bladder Volume Measurement Over Entire Volume Range Using a Deep Learning Artificial Intelligence Model in a Selected Cohort: A Proof of Principle Study

Hyun Ju Jeong
Department of Medical Device Development, Seoul National University College of Medicine, Seoul, Republic of Korea

Aeran Seol
Department of Obstetrics and Gynecology, Seoul National University College of Medicine, Seoul National University Hospital, Seoul, Republic of Korea

Seungjun Lee
Department of Obstetrics and Gynecology, Seoul National University College of Medicine, Seoul National University Hospital, Seoul, Republic of Korea
Department of Obstetrics and Gynecology, Dankook University College of Medicine, Cheonan, Republic of Korea

Hyunji Lim
Department of Obstetrics and Gynecology, Seoul National University College of Medicine, Seoul National University Hospital, Seoul, Republic of Korea

Maria Lee
Department of Obstetrics and Gynecology, Seoul National University College of Medicine, Seoul National University Hospital, Seoul, Republic of Korea

Seung-June Oh (Corresponding Author)
Department of Medical Device Development, Seoul National University College of Medicine, Seoul, Republic of Korea
Department of Urology, Seoul National University College of Medicine, Seoul National University Hospital, Seoul, Republic of Korea
Correspondence: Seung-June Oh ([email protected])

First published: 19 May 2025
Citations: 1

ABSTRACT

Objective

We aimed to prospectively investigate whether bladder volume measured using deep learning artificial intelligence (AI) algorithms (AI-BV) is more accurate than that measured using conventional methods (C-BV) when using a portable ultrasound bladder scanner (PUBS).

Patients and Methods

Patients who underwent filling cystometry because of lower urinary tract symptoms between January 2021 and July 2022 were enrolled. Every time the bladder was filled serially with normal saline from 0 mL to maximum cystometric capacity in 50 mL increments, C-BV was measured using PUBS. Ultrasound images obtained during this process were manually annotated to define the bladder contour, which was used to build a deep learning AI model. The true bladder volume (T-BV) for each bladder volume range was compared with C-BV and AI-BV for analysis.

Results

We enrolled 250 patients (213 men and 37 women), and a deep learning AI model was established using 1912 bladder images. There was a significant difference between C-BV (205.5 ± 170.8 mL) and T-BV (190.5 ± 165.7 mL) (p = 0.001), but no significant difference between AI-BV (197.0 ± 161.1 mL) and T-BV (p = 0.081). In the bladder volume ranges of 101–150, 151–200, and 201–300 mL, the percentage of volume difference differed significantly between [C-BV and T-BV] and [AI-BV and T-BV] (p < 0.05), but not when converted to absolute values (p > 0.05). C-BV (R² = 0.91, p < 0.001) and AI-BV (R² = 0.90, p < 0.001) were highly correlated with T-BV. The mean difference between AI-BV and T-BV (6.5 ± 50.4 mL) was significantly smaller than that between C-BV and T-BV (15.0 ± 50.9 mL) (p = 0.001).

Conclusion

Following image pre-processing, deep learning AI-BV more accurately estimated true BV than conventional methods in this selected cohort on internal validation. Determination of the clinical relevance of these findings and performance in external cohorts requires further study.

Trial Registration: The clinical trial was conducted using an approved product for its approved indication, so approval from the Ministry of Food and Drug Safety (MFDS) was not required. Therefore, there is no clinical trial registration number.

1 Introduction

The portable ultrasound bladder scanner (PUBS) is widely used to measure post-void residual volume (PVR) [1, 2]. Previous studies have shown that the measurement accuracy of PUBS is acceptable for clinical practice [3, 4]. However, some reports suggest that the measurements using PUBS tend to overestimate or underestimate the true bladder volume (T-BV), highlighting the need to improve the accuracy of PUBS [5, 6].

Recently, deep-learning-based artificial intelligence (AI) technology has been actively applied in the medical field [7, 8]. Deep learning models may improve clinical workflow and reduce operator dependency for clinicians using PUBS, thereby helping them diagnose lower urinary tract dysfunction more accurately. However, research using deep learning AI to enhance the measurement accuracy of PUBS remains scarce [9, 10]. The existing studies [9, 10] were limited by small sample sizes and the lack of T-BV data obtained via catheterization, raising concerns about the accuracy and reliability of their AI algorithms.

This study aimed to prospectively determine whether bladder volume measured using a deep learning AI algorithm (AI-BV) is more accurate than that measured using conventional methods (C-BV) with PUBS.

2 Materials and Methods

2.1 Patients

We prospectively measured BV in patients who visited the Department of Urology at Seoul National University Hospital for urodynamic study (UDS) owing to voiding symptoms between January 2021 and July 2022. This study was approved by the Institutional Review Board of Seoul National University Hospital (IRB No. 2011-028-1172). The inclusion criteria were adult patients aged ≥ 20 years diagnosed with neurogenic or non-neurogenic bladder conditions. The exclusion criteria were patients with conditions that could affect ultrasound measurements, such as ascites, abdominal wounds, pregnancy, or a history of bladder surgery (cystectomy).

2.2 Study Design

Patients were placed in the supine position, and BV was measured using a PUBS (Biocon-1100, Mcube Technology, Seoul, Korea). The PUBS device used in this study is designed to acquire 12 sectional ultrasound images at 15° intervals by detecting signals reflected from the urinary bladder. It extracts the bladder wall positions from these sectional images and reconstructs a three-dimensional representation of the bladder using a proprietary algorithm to measure the BV. The BV measurements in this study followed the manufacturer's protocol and were carried out by an experienced independent examiner (a clinical laboratory technologist).

During the UDS procedure, serial BV measurements were performed with PUBS during filling cystometry. Normal saline was infused into the bladder in increments of 50 mL until the maximum cystometric capacity was reached. The PUBS was used to measure the BV once at each increment. Since T-BV could be influenced by urine production from the kidneys during UDS, we assumed a constant rate of urine production and uniform time intervals for each ultrasound measurement. The T-BV for each measurement was extrapolated from a starting BV of 0 mL to the final BV, which was calculated as the sum of the voided volume during the pressure-flow study and PVR drained via urethral catheterization [1].
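The extrapolation described above can be read as a linear interpolation between 0 mL and the final BV. The following is a minimal sketch under that reading, not the authors' actual procedure: the function name, its parameters, and the example volumes are hypothetical, and it assumes equally spaced measurements so that infused saline and urine production both accumulate at a constant rate.

```python
def interpolate_true_volumes(voided_ml, pvr_ml, n_steps):
    """Estimate T-BV at each measurement by linear interpolation.

    Assumption (not stated in this form in the text): the bladder starts
    at 0 mL and reaches the final BV after n_steps equally spaced
    measurements, with infusion and urine production both constant.
    """
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    # Final BV = voided volume in the pressure-flow study + catheterized PVR.
    final_ml = voided_ml + pvr_ml
    # Measurement k (k = 0..n_steps) receives the fraction k / n_steps.
    return [final_ml * k / n_steps for k in range(n_steps + 1)]


# Hypothetical patient: 280 mL voided, 40 mL PVR, six filling steps.
volumes = interpolate_true_volumes(voided_ml=280.0, pvr_ml=40.0, n_steps=6)
print([round(v, 1) for v in volumes])
```

In this hypothetical case the nominal infused volume is 300 mL, so the remaining 20 mL is attributed to urine production and spread evenly over the six steps.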

2.3 Training Data Set and Test Data Set

For each patient, 12 ultrasound bladder images were acquired during the C-BV measurement process. Images that were unclear, blurry, or obscured by adjacent bowel structures, rendering the bladder wall invisible, were excluded. For the selected bladder images, a urologist (S.J.O.) manually drew the bladder contours, which were subsequently used to develop a deep learning AI algorithm. Among these images, 91% were randomly assigned to the training set, whereas 9% were used as the test set.

2.4 U-Net Deep Learning and Implementation

We applied the U-Net model to learn features from the input images [11]. After the convolution layers in each encoder block, max pooling layers were used to down-sample the input feature maps, generating low-resolution feature maps. In each decoder block, transpose convolution layers were used to up-sample the input feature maps. The high-resolution feature maps obtained through up-sampling were concatenated with feature maps of the same size from the previous down-sampling process. The model comprehensively learned high- and low-level features to predict the pixel-wise probability of the target class (bladder). We performed three consecutive iterations of down-sampling and up-sampling and compressed the model by setting the number of channels in each layer to 16, 32, 64, and 128.
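To make the encoder and decoder geometry concrete, the sketch below traces feature-map sizes through three down-sampling and three up-sampling steps. It is an illustration only: it assumes the 64 × 224 image size given in the implementation details, 2 × 2 pooling with stride 2, 16 channels at the first level and 128 at the bottleneck, and that each decoder block returns to the channel count of the encoder level it is concatenated with. None of these details beyond the channel list and the number of iterations is confirmed by the text.

```python
def unet_shape_trace(rows=64, cols=224, channels=(16, 32, 64, 128)):
    """Return (encoder, decoder) lists of (rows, cols, channels) tuples.

    Assumes 2x2 max pooling halves each spatial dimension and a stride-2
    transpose convolution doubles it again (standard U-Net behaviour).
    """
    encoder = []
    r, c = rows, cols
    for depth, ch in enumerate(channels):
        encoder.append((r, c, ch))
        if depth < len(channels) - 1:
            if r % 2 or c % 2:
                raise ValueError("spatial size must be divisible by 2")
            r, c = r // 2, c // 2  # down-sampling step
    # Decoder outputs mirror the encoder levels above the bottleneck,
    # because up-sampled maps are concatenated with same-size encoder maps.
    decoder = list(reversed(encoder[:-1]))
    return encoder, decoder


encoder, decoder = unet_shape_trace()
for level in encoder:
    print("encoder", level)
for level in decoder:
    print("decoder", level)
```

With these assumptions the bottleneck feature map is 8 × 28 with 128 channels, and the final decoder level returns to the 64 × 224 input size.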

The input and output image sizes were 64 × 224. Logarithmic scaling, commonly used in image processing, was applied. Additionally, pre-processing steps, such as horizontal flip augmentation, smoothing, sharpening filters, and normalization, were performed on the training images. The model was trained for 200 epochs with a mini-batch size of 48. Each layer was initialized using He normal initialization, and the learning rate was set to 1e-4. Early stopping was adopted with a minimum delta of 0.0009 and patience of 50. The code was developed using TensorFlow and Keras, and the training was conducted on two NVIDIA RTX 3090 GPUs. The development of the deep learning AI algorithm, including the pre-processing steps, was performed by AI engineers at Mcube Technology, Seoul, Korea.
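The early stopping rule can be illustrated with a small pure-Python sketch. It follows the usual patience and minimum-delta logic rather than reproducing the authors' training code: the monitored quantity is not stated in the text and is assumed here to be a validation loss, and the loss curve is synthetic.

```python
def early_stopping_epoch(losses, min_delta=0.0009, patience=50):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if the loss falls below the
    best value so far by more than min_delta. Training stops once
    `patience` consecutive epochs pass without such an improvement.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Synthetic curve: 20 epochs of steady improvement, then a plateau,
# for a total of 200 epochs (the maximum reported in the text).
synthetic_losses = [1.0 - 0.01 * e for e in range(20)] + [0.81] * 180
stop_epoch = early_stopping_epoch(synthetic_losses)
print(stop_epoch)
```

On this synthetic curve the last improvement occurs at epoch 20, so the rule stops training 50 epochs later instead of running all 200 epochs.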

2.5 Statistical Analysis

To evaluate the measurement accuracy of PUBS, we performed four analyses. (1) Paired t-tests assessed whether there were significant differences between C-BV and T-BV and between AI-BV and T-BV. (2) The percentage of difference of volume (PDV) for C-BV and AI-BV was calculated for each BV range and compared using paired t-tests and Wilcoxon signed-rank tests. PDV was calculated as (measured BV − T-BV)/T-BV [1]. (3) Linear regression analysis assessed the correlation between C-BV and T-BV and between AI-BV and T-BV. (4) Bland–Altman analysis determined whether C-BV and AI-BV agreed with T-BV. The differences between C-BV and T-BV and between AI-BV and T-BV were calculated, along with the mean of each measured value (C-BV or AI-BV) and the corresponding T-BV. The 95% limits of agreement were calculated as mean difference ± 1.96 × SD.
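The PDV definition and the grouping by BV range can be sketched as follows. This is a minimal illustration with hypothetical function names and example values. Two details are assumptions, since the text does not specify them: PDV is returned as undefined when T-BV is 0 mL, and each range is treated as inclusive of its upper bound.

```python
# Upper bounds (mL) and labels for the BV ranges used in the analysis.
RANGE_UPPER_BOUNDS = [
    (50, "0-50 mL"),
    (100, "51-100 mL"),
    (150, "101-150 mL"),
    (200, "151-200 mL"),
    (300, "201-300 mL"),
    (400, "301-400 mL"),
]


def pdv(measured_ml, true_ml):
    """Percentage of difference of volume: (measured BV - T-BV) / T-BV."""
    if true_ml == 0:
        return None  # assumption: undefined for an empty bladder
    return (measured_ml - true_ml) / true_ml


def volume_range(true_ml):
    """Assign a T-BV value to its range (upper bound inclusive, assumed)."""
    for upper, label in RANGE_UPPER_BOUNDS:
        if true_ml <= upper:
            return label
    return "> 400 mL"


# Hypothetical readings for one T-BV of 200 mL.
print(volume_range(200.0), pdv(230.0, 200.0), pdv(170.0, 200.0))
```

A positive PDV indicates overestimation relative to T-BV and a negative PDV indicates underestimation, which is why comparisons of signed and absolute PDV can lead to different conclusions.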

Patient demographics were presented as mean ± standard deviation (SD). Statistical significance was set at p < 0.05. All data analyses were performed using SPSS for Windows (Version 27.0, IBM Corp., Armonk, NY, USA). Data analysis and reporting followed the TRIPOD + AI reporting guidance for prediction models that use machine learning [12].

3 Results

This study enrolled 250 patients (213 men, 85.2%; 37 women, 14.8%), with a mean age of 67.7 ± 10.5 years. During the C-BV measurement process, we acquired 2780 ultrasound bladder images. Of these, 684 images were excluded because they were unclear, blurry, or obscured by adjacent bowel structures, which made the bladder wall unidentifiable. The remaining bladder images were used for manual bladder contouring, which served as the basis for developing a deep learning AI algorithm. In total, 1912 images were used for the training set, and 184 images were used for the test set (Figure 1).
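The image counts above are internally consistent, and the random assignment can be sketched in a few lines. The code below is an illustration only: the function name and the seed are hypothetical, and it assumes the split was made per image (as the text describes) rather than per patient.

```python
import random


def split_indices(n_images, n_train, seed=0):
    """Randomly assign image indices to a training and a test set."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # seed is a hypothetical choice
    return indices[:n_train], indices[n_train:]


n_acquired = 2780
n_excluded = 684
n_usable = n_acquired - n_excluded  # images remaining for contouring

train_idx, test_idx = split_indices(n_usable, n_train=1912)
print(n_usable, len(train_idx), len(test_idx))
print(round(100 * len(train_idx) / n_usable), round(100 * len(test_idx) / n_usable))
```

The 2096 usable images split into 1912 and 184, which corresponds to the 91% and 9% proportions stated in the Methods.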

Box plot compares the percentage of difference of volume (PDV) for C-BV and AI-BV across different bladder volume (BV) ranges. C-BV is BV measured by conventional methods. AI-BV is BV measured by applying a deep learning artificial intelligence algorithm. PDV [C-BV] is the percentage of difference of volume for C-BV. PDV [AI-BV] is the percentage of difference of volume for AI-BV. See Table 1 for p-values indicating differences between PDV [C-BV] and PDV [AI-BV] by BV range.

There was a significant difference between C-BV (205.5 ± 170.8 mL) and T-BV (190.5 ± 165.7 mL) (p = 0.001). However, there was no significant difference between AI-BV (197.0 ± 161.1 mL) and T-BV (190.5 ± 165.7 mL) (p = 0.081). When comparing PDV [C-BV] and PDV [AI-BV] across BV ranges (Table 1), there were no significant differences between PDV [C-BV] and PDV [AI-BV] in the 0–50 mL and 51–100 mL ranges (p > 0.05). However, significant differences were observed between PDV [C-BV] and PDV [AI-BV] in the 101–150, 151–200, and 201–300 mL ranges (p < 0.05), although comparisons based on absolute value revealed no differences between the groups (p > 0.05). In the 100–300 mL range, PDV was significantly lower for AI-BV (0.16 ± 0.23) than for C-BV (−0.17 ± 0.27) (p = 0.001). No significant differences were found between PDV [C-BV] and PDV [AI-BV] in the 301–400 mL and > 400 mL ranges (p > 0.05).

Table 1. Comparison of percentage of difference of bladder volume before and after applying deep learning artificial intelligence.

Bladder volume range (n) | PDV [C-BV] | PDV [AI-BV] | p value# (paired t-test) | p value# (Wilcoxon SR test) | p value* (paired t-test) | p value* (Wilcoxon SR test)
0–50 mL (n = 52) | −0.04 ± 0.23 | 0.06 ± 0.30 | 0.160 | 0.109 | 0.115 | 0.109
51–100 mL (n = 10) | −0.45 ± 0.83 | 0.39 ± 0.56 | 0.072 | 0.093 | 0.091 | 0.110
101–150 mL (n = 15) | −0.27 ± 0.32 | 0.32 ± 0.22 | 0.001 | 0.001 | 0.578 | 0.609
151–200 mL (n = 22) | −0.22 ± 0.32 | 0.19 ± 0.28 | 0.003 | 0.003 | 0.039 | 0.036
201–300 mL (n = 40) | −0.11 ± 0.22 | 0.09 ± 0.18 | 0.002 | 0.001 | 0.006 | 0.004
301–400 mL (n = 22) | 0.02 ± 0.24 | −0.06 ± 0.15 | 0.319 | 0.465 | 0.031 | 0.039
> 400 mL (n = 23) | −0.01 ± 0.12 | −0.07 ± 0.19 | 0.366 | 0.605 | 0.102 | 0.121

Note: PDV [C-BV] is the percentage of difference of volume for bladder volume (BV) measured by the conventional method. PDV [AI-BV] is the percentage of difference of volume for BV measured by applying the deep learning artificial intelligence (AI) algorithm. All variables are presented as mean ± standard deviation. The p value# represents a comparison of PDV [C-BV] and PDV [AI-BV] using the paired t-test and Wilcoxon signed-rank (SR) test. The p value* represents a comparison of the absolute values of PDV [C-BV] and PDV [AI-BV] using the paired t-test and Wilcoxon signed-rank test. Bold values indicate statistical significance.

Linear regression analysis demonstrated a high correlation between C-BV and T-BV (R² = 0.91, 95% CI: 0.94–1.03, p < 0.001) and between AI-BV and T-BV (R² = 0.90, 95% CI: 0.88–0.97, p < 0.001) (Figure 2). Bland–Altman analysis revealed that the mean difference between C-BV and T-BV was 15.0 ± 50.9 mL, whereas that between AI-BV and T-BV was significantly smaller at 6.5 ± 50.4 mL (p = 0.001) (Figure 3). For C-BV, the upper limit of agreement (LoA) was 114.8 mL, and the lower LoA was −84.8 mL. For AI-BV, the upper LoA was 105.4 mL, and the lower LoA was −92.3 mL.
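As a worked check, the limits of agreement can be recomputed from the reported mean differences and standard deviations using the formula in the Statistical Analysis section (mean difference ± 1.96 × SD). The function name below is hypothetical, and the inputs are the rounded values quoted in the text, so small discrepancies from the reported limits are expected.

```python
def limits_of_agreement(mean_diff, sd_diff, z=1.96):
    """Return (lower, upper) 95% limits of agreement."""
    half_width = z * sd_diff
    return mean_diff - half_width, mean_diff + half_width


# Reported mean difference and SD (mL) versus T-BV.
c_lower, c_upper = limits_of_agreement(15.0, 50.9)    # conventional (C-BV)
ai_lower, ai_upper = limits_of_agreement(6.5, 50.4)   # deep learning (AI-BV)

print(round(c_lower, 1), round(c_upper, 1))
print(round(ai_lower, 1), round(ai_upper, 1))
```

The recomputed limits agree with the reported values to within about 0.1 mL, which is consistent with rounding of the mean and SD. The width of the interval is nearly the same for both methods, so the improvement with AI-BV lies mainly in the smaller mean bias.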

Linear regression of (a) C-BV versus T-BV and (b) AI-BV versus T-BV. C-BV is bladder volume (BV) measured by conventional methods. AI-BV is BV measured by applying a deep learning artificial intelligence algorithm. T-BV is true BV measured using urethral catheterization.
Bland–Altman plots for (a) C-BV and (b) AI-BV. C-BV is bladder volume (BV) measured by conventional methods. AI-BV is BV measured by applying a deep learning artificial intelligence algorithm. LoA, Limit of agreement.

Among the ultrasound bladder images used in the test set, 152 images from men and 32 images from women were used to perform a subgroup analysis by gender. There was a significant difference between AI-BV (197.4 ± 159.4 mL) and T-BV (189.0 ± 163.6 mL) in men (p = 0.004), but there was no significant difference between AI-BV (195.2 ± 171.5 mL) and T-BV (197.9 ± 177.9 mL) in women (p = 0.87). When each PDV was compared by detailed BV range, the results in men were similar to those in Table 1. In men, both C-BV (R² = 0.94, p < 0.001) and AI-BV (R² = 0.95, p < 0.001) were highly correlated with T-BV. In women, both C-BV (R² = 0.82, p < 0.001) and AI-BV (R² = 0.74, p < 0.001) were less strongly correlated with T-BV than in men. Bland–Altman analysis showed a significant difference between the mean difference of AI-BV from T-BV and the mean difference of C-BV from T-BV in men (p < 0.05), but not in women (p > 0.05).

4 Discussion

PVR measurement is a commonly used diagnostic method for assessing lower urinary tract dysfunction in urology [5]. The measured PVR aids in determining treatment in patients with voiding dysfunction [13]. Although urethral catheterization is considered the gold standard for accurately measuring PVR, it is invasive and can cause discomfort to patients. Consequently, non-invasive PUBS are widely used [5, 14].

To our knowledge, only two studies have applied deep learning AI technology to improve the accuracy of PUBS measurements. Matsumoto et al. reported that AI implementation enhanced the accuracy of measuring small BV [9]. Their study developed an automatic detection tool using 81 paired images as the training set. When applied to BVs of 50 mL and 100 mL, the developed tool achieved a sensitivity and specificity of 88.5% and 100.0%, respectively, for 100 mL, and 94.1% and 100.0%, respectively, for 50 mL. Based on these results, the authors concluded that the newly developed tool could reliably and accurately estimate BV of approximately 50 and 100 mL. However, a critical limitation of this study was the absence of T-BV measurements through catheterization; the authors relied solely on actual voided volume for comparison. Since some participants may have had PVR, the accuracy of the study's findings is questionable.

Maher et al. conducted a prospective non-inferiority study involving 58 patients [10]. Three researchers measured BV (L × W × H × 0.52) using bladder images obtained through PUBS, and these volumes were compared with those measured by AI. The study reported a high level of agreement and consistency, with an intraclass correlation coefficient of 0.91. However, similar to Matsumoto et al., catheterization was not used to measure T-BV in this study, leading to reduced accuracy in the comparison results. Additionally, the size of the test data set used for validation was notably small, which undermines the reliability of the AI algorithm's accuracy.
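The manual formula used as the comparator in that study is the ellipsoid approximation, in which the coefficient 0.52 is a rounded value of π/6. A brief sketch is given below. The function name and the example dimensions are hypothetical, and dimensions are assumed to be in centimetres so that the result is in millilitres.

```python
import math


def ellipsoid_volume_ml(length_cm, width_cm, height_cm, coefficient=0.52):
    """Bladder volume by the ellipsoid formula L x W x H x 0.52 (cm^3 = mL)."""
    return length_cm * width_cm * height_cm * coefficient


# Hypothetical bladder dimensions in centimetres.
volume_ml = ellipsoid_volume_ml(8.0, 7.0, 6.0)
print(round(volume_ml, 2), round(math.pi / 6, 4))
```

Because this formula assumes an ellipsoidal bladder, it serves as a convenient manual estimate rather than a substitute for a catheterized reference volume.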

In contrast, our study obtained T-BV through catheterization and compared the values measured before and after applying AI with T-BV. Additionally, we secured a sufficient number of bladder images to develop an AI model. Furthermore, our study measured BV across a wide range, from 0 mL to maximum cystometric capacity, and analyzed these measurements by specific volume ranges. This allowed us to evaluate the accuracy of BV measurements in low, moderate, and high-volume ranges.

In this study, the analysis by BV range revealed no significant differences between measurements before and after applying AI for BV ≤ 100 mL and ≥ 301 mL. This may be attributed to limitations in obtaining cross-sectional bladder images for small amounts of urine and the variability in bladder shape, which make AI application challenging in these cases [15, 16]. Clinically, improving measurement accuracy in the small BV range may have limited significance. Enhancing accuracy in the range of 200–300 mL is more critical for clinical decision-making, as it is essential to determine the necessity of clean intermittent catheterization for patients with moderate PVR [17, 18]. Within the 101–300 mL range, AI-BV showed significantly lower PDV compared to C-BV. However, as there were no differences in the absolute values of PDV between the two methods, only a directional change in PDV could be observed. Specifically, compared to T-BV, C-BV tended to underestimate BV, whereas AI-BV tended to overestimate it. Considering patient safety, we believe that overestimation is preferable to underestimation when using PUBS. When BV was analyzed broadly in the 100–300 mL range, the PDV for AI-BV (0.16 ± 0.23) was significantly smaller than that for C-BV (−0.17 ± 0.27). These findings suggest that applying a deep learning AI model improves the accuracy of BV measurements.

The subgroup analysis by gender showed that although there was a significant difference between AI-BV and T-BV in men, the average difference significantly improved after applying AI. In contrast, the difference between T-BV and C-BV did not significantly improve even after applying AI in women. This gender difference is thought to be due to anatomical differences such as the presence of an enlarged prostate in men and the presence of the surrounding genital organs in women. However, in this study, the number of bladder images in the test set for women was insufficient, which limited the accuracy of the results. Additional research using more cases in women is needed in the future.

Our study results show modest differences in BV estimation between AI-based and conventional methods, suggesting that the role of AI technology in BV estimation may be limited in actual clinical settings. This may be due to the fact that PUBS already demonstrates high accuracy, as supported by numerous studies [1, 2, 9, 19, 20], leaving limited room for improvement despite the application of deep learning AI. The role of AI is particularly significant in improving measurement accuracy for patients prone to errors with C-BV measurements. Several studies have reported reduced accuracy of BV measurements with PUBS in patients with adjacent abdominal fluid, uterine myomas, and ovarian cysts [21-24]. Future research focusing on the application of AI technology to ultrasound-based BV measurement in these challenging cases is essential. The strength of this study lies in its prospective design, allowing for comparative analysis of accuracy across various BV ranges. Additionally, by obtaining and using T-BV as a reference for comparison, the study achieved a higher level of reliability and accuracy.

4.1 Limitations

This study has several limitations. First, although BV was measured using PUBS by an experienced independent examiner, we did not investigate inter-observer or intra-observer variability. Second, although we took care in pre-processing the training images, unknown errors in this process may still make the algorithm in its current form insufficient for clinical implementation. Last, we did not perform the study in an external validation cohort. Thus, the generalizability of the present results is not established.

5 Conclusions

Following image pre-processing, deep learning AI-BV more accurately estimated true BV than conventional methods in this selected cohort on internal validation. Determination of the clinical relevance of these findings and performance in external cohorts requires further study.

Author Contributions

Conceptualization: Seung-June Oh, Maria Lee. Data creation: Hyun Ju Jeong, Seung-June Oh. Formal analysis: Hyun Ju Jeong, Seung-June Oh. Investigation: Hyun Ju Jeong, Seung-June Oh, Aeran Seol, Seungjun Lee, Hyunji Lim, Maria Lee. Methodology: Hyun Ju Jeong, Seung-June Oh, Aeran Seol, Seungjun Lee, Hyunji Lim, Maria Lee. Supervision: Seung-June Oh. Writing – original draft: Hyun Ju Jeong, Seung-June Oh. Writing – review and editing: Hyun Ju Jeong, Seung-June Oh, Aeran Seol, Seungjun Lee, Hyunji Lim, Maria Lee.

Acknowledgments

Ms. Yu-Kyung Lee assisted with the clinical research coordination. This study is supported by the Korea Medical Device Development Fund (Project Number: 1711135009).

Ethics Statement

This study received approval from the Institutional Review Board of Seoul National University Hospital (IRB No. 2011-028-1172) and was conducted in accordance with the Declaration of Helsinki and Ethical Standards set by the Institutional Review Board. All participants provided written informed consent.

Consent

The authors have nothing to report.

Conflicts of Interest

The authors declare no conflicts of interest.

Data Availability Statement

The data and materials are available from the corresponding author upon reasonable request.
