QuadWindow: A Perspective-Aware Framework for Geometric Window Extraction From Street-View Imagery
Funding: This work was supported by the National Key Research and Development Program of China (grant 2018YFD1100405).
ABSTRACT
Rapid and reliable assessment of building damage is essential for post-disaster response and recovery. As windows often reflect critical structural changes, their automatic extraction from street-view images provides valuable insights for emergency assessment, urban risk modeling, and disaster database updates. Existing methods struggle to leverage the quadrilateral prior of windows due to two main issues: poor handling of perspective distortion and the lack of robust loss functions when precise vector annotations are unavailable. To overcome these challenges, we introduce QuadWindow, a framework specifically designed to handle perspective distortions through a perspective transformation sub-network that predicts transformations from street-view images to frontal views, significantly simplifying window extraction tasks without manual correction. Additionally, we propose a differentiable rendering loss that directly aligns predicted quadrangles with raster-based ground truth, bypassing the need for explicit corner-point annotations. Experimental results demonstrate that QuadWindow outperforms state-of-the-art methods across five façade datasets, with an average F1-score of 87.6% and Intersection over Union (IoU) of 78.03%, achieving 1.47% and 5.2% improvement, respectively.
1 Introduction
Automatic extraction of windows from building façade images is critical for geospatial applications such as urban energy modeling [1-3], daylight simulation [4-6], disaster assessment [7-9], 3D building modeling [10, 11], and building digital documentation [12]. Previous studies [13-20] employed heuristic methods, relying on hand-crafted pattern templates or grammatical rules to match the repetitive and symmetrical features of building façades. However, the limitations of these methods lie in their limited feature representation and heavy reliance on manual priors, making it difficult for them to adapt to diverse façade styles.
Due to the variations in window shape and the presence of perspective distortion in street-view images, accurately extracting window instances remains a challenging task. This task requires not only careful algorithmic design but also substantial human effort for precise annotation. Accordingly, most studies recast window parsing as the extraction of quadrangles, the dominant window shape observed in urban façades [21, 22]. When applied to rectified façade images, this simplification reduces further to the extraction of rectangular windows [16, 23]. While this reformulation sacrifices some representational accuracy, it is justifiable for two reasons: (1) it simplifies the task assumptions, facilitating the development of high-performance models by reducing the complexity of the output space; and (2) in applications such as seismic simulation and daylight analysis, the accuracy provided by quadrangles is often sufficient for practical needs.
Building upon these considerations, this study formulates the window extraction task as a quadrangle prediction problem. However, training a high-performance deep learning model under this simplified framework remains challenging, particularly in leveraging the spatial prior knowledge of window arrangements. Such priors are essential for improving network performance [24-26], especially in occlusion scenarios, where they assist in extracting occluded windows. Furthermore, spatial relationships, such as the alignment and size consistency of windows in the same row or column, provide structural information that can improve predictions. While previous studies have successfully integrated spatial priors [27, 28], their methods often rely on manually corrected façade images with a frontal view. However, in street-view images, perspective distortion disrupts the spatial arrangement of windows, reducing the reliability of existing methods. Although manual rectification can partially mitigate these distortions, it introduces additional complexity, restricting the practical applicability of these methods in large-scale urban analysis.
Another challenge lies in effectively leveraging the quadrangle assumption to ensure robust and accurate predictions. Enforcing a quadrangle constraint significantly simplifies window extraction and yields more refined shapes for quadrangular windows. Wang et al. [29] proposed an approach based on a generalized bounding box, while Li et al. [21] introduced a key point prediction method, both yielding promising results. However, these methods encounter a critical issue in establishing loss functions, as they require explicit ground truth annotations for window corner points. This poses a challenge because many annotations in existing datasets are raster masks, a significant portion of which do not conform to a quadrangle representation. Forcibly converting such annotations into bounding boxes or corner points often leads to substantial accuracy loss, as illustrated in Figure 1. This raises an important question: can we bypass reprocessing arbitrarily shaped window annotations and instead directly establish a loss between vectorized predictions and raster-based ground truth annotations? This would enable the model to predict more regularized window shapes without requiring additional re-annotation efforts.

To address these challenges, we propose an improved deep learning framework for quadrilateral-based window extraction from street-view images. To effectively leverage the quadrangle prior for window extraction, similar to methods that treat objects as key points [31], our approach adopts a multi-branch architecture that predicts quadrangles directly, ensuring geometric consistency despite the presence of perspective distortions. First, we integrate a perspective transformation estimation subnetwork that predicts transformations from street-view façade images to frontal views, adaptively eliminating perspective distortion. To supervise the estimation of perspective matrices, we design two unsupervised loss functions that leverage the inherent structural consistency of window layouts, enabling effective supervision without additional annotations. Additionally, we introduce a differentiable rendering loss that directly aligns quadrangle predictions with raster-based ground truth, bypassing the need for manual simplification or re-annotation. Experimental results demonstrate that our method outperforms state-of-the-art methods across five different datasets, with an average F1-score of 87.6% and an Intersection over Union (IoU) of 78.03%, achieving 1.47% and 5.2% improvement, respectively.
- We propose a subnetwork that predicts transformations from street-view to frontal images, eliminating the need for manual corrections and ensuring the robust integration of priors through attention mechanisms. Additionally, we introduce two unsupervised loss functions, enabling effective supervision of the transformations without additional annotations.
- A differentiable rendering loss is proposed to directly align quadrangle predictions with raster-based ground truth annotations, bypassing the need for manual re-annotation or contour simplification and enabling regularized quadrangle representations.
- We develop a multi-branch network for quadrilateral-based window extraction that leverages spatial priors and attention mechanisms. This framework generates geometrically consistent vectorized window instances in an end-to-end manner.
The remainder of this paper is organized as follows: Section 2 reviews related work and outlines our motivations, followed by Section 3 detailing the proposed method. Sections 4 and 5 present experimental results and ablation studies, respectively, while Section 6 concludes the paper.
2 Related Work
2.1 Window Extraction With Architectural Prior
Windows are typically quadrilaterals arranged in a symmetrical pattern, which constitutes one of the most significant forms of prior knowledge for extracting windows from building façades. This prior knowledge imposes constraints on the extraction process and results, aiding in addressing challenges such as occlusion and false extractions. This has been demonstrated in both grammar-based methods [14, 17, 32, 33] and deep learning-based approaches [26, 27, 34-36].
In deep learning, architectural priors have been integrated more deeply through post-processing or attention mechanisms. For example, Ma and Ma [34] proposed a post-optimization method that clusters and votes for window bounding boxes along horizontal and vertical directions, refining candidate boxes with confidence scores below a predefined threshold. Similarly, Liu et al. [27] and Zhang et al. [36] cluster windows in the same row and column, constraining segmentation results using a variance loss function. Meanwhile, Sun et al. [35] and Zhuo et al. [28] employed attention mechanisms to model long-range dependencies, implicitly encoding spatial relationships between windows. Tao et al. [22] demonstrated that sparse attention mechanisms can significantly enhance network representation capabilities when applied to corrected images.
While previous studies have successfully integrated spatial priors, their methods often rely on manually corrected façade images with a frontal view. Street-view images, however, suffer from perspective distortion, which diminishes the reliability of these approaches. Although manual rectification can mitigate this issue [6, 37], it is not scalable for large datasets, limiting the practicality of such methods in real-world applications. In contrast, our method directly addresses this limitation by learning transformation matrices from raw, distorted images, thereby improving scalability and practicality.
2.2 Vectorized Windows Extraction
To leverage the quadrilateral assumption of windows, traditional grammar-based methods often assume that windows are regular rectangles in rectified images and parse façades using appropriate grammar rules to balance representation accuracy and computational efficiency [13, 16, 26].
In deep learning, researchers have also leveraged the quadrilateral assumption of windows to improve extraction accuracy. For instance, DeepFacade [27] and DAN-PSPNet [36] impose self-regulating rectangle constraints that penalize non-rectangular window predictions by comparing them with their rectangular hulls and bounding boxes. DeepWindows [35] implicitly encodes window shape information by embedding the width and height of windows into high-dimensional embeddings and concatenating them with semantic features. Although these methods improve overall segmentation accuracy, they still suffer from over-smoothing effects due to their reliance on raster-based grid representations, which limit boundary sharpness [28, 38]. To achieve sharper window representations, some researchers have adopted more concise vectorized representations. For example, Wang et al. [29] proposed using two bounding boxes to represent the four corners of a window, enabling sharp representations without resolution constraints. Li et al. [21] introduced a bottom-up approach that first detects window corners in the image and then clusters them into distinct window instances based on pairing relationships, achieving higher computational efficiency.
Vectorizing windows not only reduces computational costs [11, 29] but also accurately captures the geometric shapes of the most common window types. However, these methods typically require explicit ground truth annotations for window corners, while most existing datasets provide mask annotations rather than corner points. This reliance on explicit corner annotations poses a significant challenge, often requiring manual re-annotation or lossy simplification [11, 21]. Therefore, it is essential to develop a method for extracting vectorized windows using raster-based annotations.
In recent years, vectorized representations have gained significant attention in deep learning due to their ability to adapt to complex shapes and provide flexible object representations. These methods model objects as sets of key points [39, 40] or contour points [41-43], enabling better handling of irregular and deformable objects. Among these, RepPoints [31] and its extension, Dense RepPoints [40], represent objects using a set of adaptive points that position themselves over the object to circumscribe its spatial extent and capture semantically significant local areas. This approach offers a unified object representation over different levels of granularity. However, vanilla RepPoints derives bounding boxes by converting the points into rectangular shapes, which is insufficient to fully represent the quadrangular shapes of windows. In contrast, Dense RepPoints requires as many as 729 points to represent an object, making it less efficient for window extraction.
3 Methodology
Following the design of RepPoints [31], the proposed QuadWindow framework uses a set of adaptive points to circumscribe a window's spatial extent. To generate quadrangle predictions, we design two parallel regression branches: one supervised by our proposed differentiable rendering loss to refine four corner positions, and the other retaining the original settings from RepPoints to maintain compatibility with off-the-shelf detectors. Additionally, we integrate the perspective transformation estimation subnetwork into the final feature map generated by the backbone. This subnetwork is supervised using dedicated alignment and perpendicular losses, leveraging the shape priors to eliminate distortions. To refine these corrected feature maps, we employ a lightweight attention mechanism, which captures long-range dependencies and enhances spatial feature representations. The overview of the proposed method is illustrated in Figure 2.

3.1 Perspective Transformation Estimation
Given that perspective distortion is a high-level semantic feature, we incorporate the perspective transformation estimation subnetwork into the final feature map of the backbone's outputs. As illustrated in Figure 3, the feature map is first processed through a 3 × 3 convolution and a ReLU activation [47]. Following this, a 7 × 7 RoIAlign [48] layer is employed to resample the feature map to a fixed size. The resampled feature map is passed through a prediction head comprising sequential layers: a 1 × 1 convolution, an average pooling layer, a LeakyReLU [49] activation, and another 1 × 1 convolution. The final output consists of surrogate parameters, which are used to construct the perspective matrix as previously described.
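The following is a minimal PyTorch sketch of this prediction head. The channel widths, the whole-image RoI, and the assumption of eight surrogate parameters (one per free entry of a 3 × 3 matrix with the bottom-right entry fixed to 1) are illustrative choices, not the exact configuration of our implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align


class PerspectiveHead(nn.Module):
    """Predicts surrogate parameters of a 3 x 3 perspective matrix from a feature map."""

    def __init__(self, in_channels: int = 256, num_params: int = 8):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.fc1 = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.fc2 = nn.Conv2d(in_channels, num_params, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        n = feat.size(0)
        feat = F.relu(self.conv(feat))                      # 3 x 3 convolution + ReLU
        # A whole-image RoI lets RoIAlign resample the map to a fixed 7 x 7 size.
        rois = feat.new_zeros(n, 5)
        rois[:, 0] = torch.arange(n, dtype=feat.dtype, device=feat.device)
        rois[:, 3] = feat.size(3)                           # x2 = feature-map width
        rois[:, 4] = feat.size(2)                           # y2 = feature-map height
        feat = roi_align(feat, rois, output_size=(7, 7))
        # Prediction head: 1 x 1 conv -> average pooling -> LeakyReLU -> 1 x 1 conv.
        x = F.leaky_relu(F.adaptive_avg_pool2d(self.fc1(feat), 1))
        params = self.fc2(x).flatten(1)                     # (n, num_params) surrogate parameters
        # Surrogate parameters -> 3 x 3 perspective matrix (bottom-right entry fixed to 1).
        ones = feat.new_ones(n, 1)
        return torch.cat([params, ones], dim=1).view(n, 3, 3)
```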

After obtaining the perspective matrix for each image, we resample the backbone's feature maps using this matrix, correcting potential geometric distortions. To enhance these corrected feature maps with window arrangement priors, we employ a simple row- and column-based attention mechanism [50], chosen for its computational efficiency. The enhanced feature maps are then resampled back to their original size using the inverse of the perspective matrix. This perspective matrix-assisted attention mechanism helps us leverage the prior knowledge of frontal views, even when dealing with non-frontal view images. This enables the model to capture long-range contextual information and improve overall performance.
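A simplified sketch of this enhancement is shown below (written for a single image). It assumes the perspective matrix is expressed in the normalized [-1, 1] coordinates used by grid_sample, and the axis-wise attention is a bare-bones stand-in for the mechanism of [50]; our implementation additionally reduces the query, key, and value channels with 1 × 1 convolutions, as noted in Section 4.3.

```python
import torch
import torch.nn.functional as F


def warp_by_homography(feat: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
    """Resample feat (1, C, h, w) so that output(x) = feat(M^-1 x), with x in the
    normalized [-1, 1] coordinates expected by grid_sample."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=feat.device),
        torch.linspace(-1, 1, w, device=feat.device),
        indexing="ij",
    )
    pts = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).view(-1, 3)   # output pixels
    src = pts @ torch.inverse(M).transpose(0, 1)                           # pull-back coordinates
    src = src[:, :2] / src[:, 2:].clamp(min=1e-6)                          # assumes w > 0
    return F.grid_sample(feat, src.view(1, h, w, 2), align_corners=True)


def axis_attention(feat: torch.Tensor) -> torch.Tensor:
    """Row- and column-wise self-attention (queries, keys, and values are the raw
    features here; channel-reduction convolutions are omitted for brevity)."""
    n, c, h, w = feat.shape
    rows = feat.permute(0, 2, 3, 1).reshape(n * h, w, c)       # attend along each row
    rows = F.softmax(rows @ rows.transpose(1, 2) / c ** 0.5, dim=-1) @ rows
    cols = feat.permute(0, 3, 2, 1).reshape(n * w, h, c)       # attend along each column
    cols = F.softmax(cols @ cols.transpose(1, 2) / c ** 0.5, dim=-1) @ cols
    rows = rows.view(n, h, w, c).permute(0, 3, 1, 2)
    cols = cols.view(n, w, h, c).permute(0, 3, 2, 1)
    return feat + rows + cols                                   # residual enhancement


def enhance(feat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    frontal = warp_by_homography(feat, H)                       # remove perspective distortion
    frontal = axis_attention(frontal)                           # apply arrangement prior
    return warp_by_homography(frontal, torch.inverse(H))        # map back to the original view
```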
3.2 Alignment and Perpendicular Loss
In the perspective transformation estimation subnetwork, direct annotations for supervised training are unavailable. However, when façades are rectified to frontal views, window shapes progressively become regular: their edges align with the horizontal or vertical axes, and adjacent edges become perpendicular. Based on this shape prior, we propose two losses, an alignment loss and a perpendicular loss, to effectively guide and supervise the perspective transformation estimation subnetwork.
3.2.1 Alignment Loss
The alignment loss penalizes deviations of the transformed window edges from the horizontal and vertical image axes, encouraging the rectified quadrangles to align with the coordinate axes.
3.2.2 Perpendicular Loss
The perpendicular loss penalizes deviations from orthogonality between adjacent edges of the transformed quadrangles, complementing the alignment loss.
The coordinates of the points in both loss functions are derived from homography transformation with the perspective matrix. Before applying this transformation, we detach the gradients of these points. This operation prevents gradients from flowing back to the point coordinates, ensuring effective supervision of the perspective matrix.
Clearly, the two proposed losses are complementary: the perpendicular loss encourages adjacent edges to be perpendicular but does not guarantee that the transformed quadrangle is aligned with the coordinate axes, while the alignment loss forces edges onto the axes but does not ensure that both the x- and y-axes are used after transformation. The potential extreme optimization results when only one of these losses is used are illustrated in Figure 4.
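To make the two losses concrete, the sketch below shows one plausible way to form them from the predicted corners and the estimated perspective matrix; the specific edge-component and dot-product penalties are illustrative choices rather than the exact formulations used in the paper.

```python
import torch


def warp_points(points: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Apply a 3 x 3 homography H to corner points of shape (N, 4, 2)."""
    ones = torch.ones_like(points[..., :1])
    homo = torch.cat([points, ones], dim=-1)               # homogeneous coordinates
    warped = homo @ H.transpose(-1, -2)
    return warped[..., :2] / warped[..., 2:].clamp(min=1e-6)


def perspective_prior_losses(corners: torch.Tensor, H: torch.Tensor):
    """corners: (N, 4, 2) predicted quadrangle corners. Gradients are detached at the
    corners, as described above, so only the perspective matrix is supervised."""
    pts = warp_points(corners.detach(), H)                  # rectified corners
    edges = pts.roll(-1, dims=1) - pts                      # consecutive quadrangle edges
    edges = edges / edges.norm(dim=-1, keepdim=True).clamp(min=1e-6)

    # Alignment: a unit edge parallel to the x- or y-axis has one vanishing component,
    # so the smaller absolute component is penalized.
    align_loss = torch.minimum(edges[..., 0].abs(), edges[..., 1].abs()).mean()

    # Perpendicularity: adjacent unit edges should have a zero dot product.
    dots = (edges * edges.roll(-1, dims=1)).sum(dim=-1)
    perp_loss = dots.abs().mean()
    return align_loss, perp_loss
```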

3.3 Differentiable Rendering Loss for Quadrangles
The proposed differentiable rendering loss is designed to supervise window corner locations using raster mask annotations. Unlike existing methods that rely on explicit window corner annotations, this loss directly aligns quadrangle predictions with raster-based ground truth, eliminating the need for manual re-annotation or contour simplification. Specifically, a closed 2D quadrangle is constructed from the predicted points and rendered into a grid mask. The discrepancy between the rendered mask and the ground truth mask is then quantified using the L2 loss, effectively guiding the model to refine its predictions.
To ensure stable training, several enhancements were introduced. As shown in Figure 5, the network's first four output points are mapped counterclockwise to the window corners, forming a closed quadrangle. This quadrangle is then triangulated, and the resulting vertices (as normalized point coordinates) and triangle vertex indices are passed to the differentiable rendering pipeline [51], which generates the rendered mask.
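As an illustration of this loss, the following is a minimal, self-contained PyTorch sketch. For clarity it replaces the hardware rendering pipeline of [51] with a soft half-plane rasterizer for a convex quadrangle; the function names, the 32 × 32 mask resolution, and the sharpness constant are assumptions for illustration, not the exact implementation.

```python
import torch


def soft_quad_mask(corners: torch.Tensor, size: int = 32, sharpness: float = 50.0) -> torch.Tensor:
    """Render a convex quadrangle (corners of shape (4, 2) in [0, 1], ordered with
    positive signed area in this x-y grid) into a soft occupancy mask (size, size)."""
    coords = (torch.arange(size, dtype=corners.dtype, device=corners.device) + 0.5) / size
    yy, xx = torch.meshgrid(coords, coords, indexing="ij")
    pixels = torch.stack([xx, yy], dim=-1)                       # pixel-centre coordinates

    mask = torch.ones(size, size, dtype=corners.dtype, device=corners.device)
    for i in range(4):
        p0, p1 = corners[i], corners[(i + 1) % 4]
        edge, rel = p1 - p0, pixels - p0
        cross = edge[0] * rel[..., 1] - edge[1] * rel[..., 0]    # signed half-plane term
        mask = mask * torch.sigmoid(sharpness * cross)           # soft inside/outside test
    return mask


def rendering_loss(pred_corners: torch.Tensor, gt_mask: torch.Tensor) -> torch.Tensor:
    """L2 discrepancy between the rendered quadrangle and a raster ground-truth mask."""
    rendered = soft_quad_mask(pred_corners, size=gt_mask.shape[-1])
    return ((rendered - gt_mask) ** 2).mean()


# Gradients flow from the mask discrepancy back to the predicted corner coordinates.
corners = torch.tensor([[0.2, 0.2], [0.8, 0.25], [0.75, 0.8], [0.25, 0.75]], requires_grad=True)
gt = torch.zeros(32, 32)
gt[8:24, 8:24] = 1.0
rendering_loss(corners, gt).backward()
```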

It is worth noting that while the proposed differentiable rendering loss proves effective and empirically converges stably after the initial training epochs, it inevitably introduces additional computational overhead. Nevertheless, this cost is negligible on contemporary deep-learning hardware, since modern GPU architectures were originally designed around highly efficient graphics rendering pipelines. Furthermore, recent advances such as NVIDIA's differentiable rendering framework [51] enable seamless integration of high-performance graphics pipelines with CUDA-based computation, maintaining differentiability throughout the rendering process without substantially increasing the computational load. These factors mitigate potential concerns regarding the efficiency and practicality of our approach.
4 Experiments and Results
4.1 Datasets
Following a prior study [27], we validated our results on five benchmark datasets: the ECP dataset [16], CMP dataset [25], Graz50 dataset [23], ArtDeco dataset [26], and eTRIMS dataset [30]. These datasets, specifically designed for façade parsing, feature diverse architectural styles collected from cities worldwide, offering a comprehensive evaluation of the proposed method. Our research focused exclusively on window extraction, ignoring other categories in these datasets. Detailed dataset information is shown in Table 1, with annotated examples in Figure 6.
Datasets | Count | Rectified | Category |
---|---|---|---|
ECP | 104 | Yes | window, wall, balcony, door, shop, sky, roof, chimney. |
CMP | 606 | Yes | wall, molding, cornice, pillar, window, door, sill, blind, balcony, shop, deco, background. |
Graz50 | 50 | Yes | wall, door, window, sky. |
ArtDeco | 79 | Yes | door, shop, balcony, window, wall, sky, roof. |
eTRIMS | 60 | No | window, wall, door, sky, pavement, vegetation, car, road. |

4.2 Comparative Methods and Evaluation Metrics
We compared the performance of our model with three state-of-the-art methods: DeepFacade [27], DeepWindows [35], and CFBS [28]. To ensure a fair comparison, we used semantic segmentation accuracy metrics, including the F1-score [52], which is the harmonic mean of precision and recall, and the Jaccard index (IoU) [53], to evaluate window extraction accuracy. Pixel accuracy metrics were excluded because they can be misleading due to foreground-background imbalance. For comparison, we converted our method's vector outputs into binary masks through rasterization.
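For reference, the following is a small sketch of how these metrics can be computed from binary masks after rasterizing the predicted quadrangles; variable names are illustrative.

```python
import numpy as np


def f1_and_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """pred, gt: boolean window masks of identical shape (True = window pixel)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)   # harmonic mean
    iou = tp / (tp + fp + fn + eps)                            # Jaccard index
    return f1, iou
```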
4.3 Implementation Details
Our method was implemented using the mmdetection framework [54] and trained on a single 2080Ti GPU. We adopted ResNet-50, pre-trained on ImageNet, as the backbone. Apart from the proposed improvements, other parameters followed those of RepPoints v2 [55]. We trained the model using a synchronized stochastic gradient descent optimizer with two images per minibatch, an initial learning rate of 0.005, a weight decay of 0.0001, and a momentum of 0.9. Training lasted 36 epochs with multi-scale images (1333 × 480 to 1333 × 960), and the learning rate decayed at the 24th and 32nd epochs. Random horizontal flipping was applied during training, and non-maximum suppression (threshold = 0.5) was used during inference for post-processing.
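For illustration, the schedule above corresponds to an mmdetection-style configuration fragment along the following lines (field names follow mmdetection 2.x conventions; the actual configuration files are not reproduced here).

```python
# Optimizer and schedule (field names follow mmdetection 2.x conventions).
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001)
lr_config = dict(policy='step', step=[24, 32])        # decay at the 24th and 32nd epochs
runner = dict(type='EpochBasedRunner', max_epochs=36)
data = dict(samples_per_gpu=2)                        # two images per minibatch
```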
In the differentiable rendering loss, ground-truth masks were resampled to a resolution of 32 × 32 for supervision. During attention enhancement, the query and key channels were reduced by a factor of 32, and the value channel by a factor of 16. After computing row and column-based attention, a 1 × 1 convolution was applied to restore the feature channels to their original dimensions. All losses are summed up to form the final loss.
4.4 Quantitative Evaluation
Table 2 presents a quantitative comparison of our method with state-of-the-art approaches across multiple datasets, with bold values marking the best results for each metric. Our method consistently achieved the highest F1-score and IoU across most datasets, demonstrating superior precision, recall balance, and segmentation accuracy.
| Datasets | Methods | F1 (%) | IoU (%) |
| --- | --- | --- | --- |
| ECP | DeepFacade | — | 80.3 |
| | DeepWindows | 79.6 | 63.8 |
| | CFBS | 90.1 | 82.0 |
| | QuadWindow (Ours) | **90.41** | **82.49** |
| CMP | DeepFacade | 80.0 | — |
| | DeepWindows | 71.3 | 56.7 |
| | CFBS | 82.4 | 64.1 |
| | QuadWindow (Ours) | **91.02** | **83.52** |
| Graz50 | DeepFacade | — | 71.3 |
| | DeepWindows | 70.2 | 59.9 |
| | CFBS | 84.3 | 73.1 |
| | QuadWindow (Ours) | **85.43** | **74.57** |
| ArtDeco | DeepFacade | — | 70.7 |
| | DeepWindows | 74.6 | 60.5 |
| | CFBS | **87.7** | 72.1 |
| | QuadWindow (Ours) | 85.81 | **75.14** |
| eTRIMS | DeepFacade | — | 71.1 |
| | DeepWindows | 85.2 | 74.3 |
| | CFBS | — | — |
| | QuadWindow (Ours) | **85.33** | **74.42** |
| Mean | DeepFacade | 80.0 | 73.35 |
| | DeepWindows | 76.18 | 63.04 |
| | CFBS | 86.13 | 72.83 |
| | QuadWindow (Ours) | **87.6** | **78.03** |
- Note: For both F1-score and IoU, the superior values are shown in bold.
On the CMP dataset, our F1-score reached 91.02% and our IoU reached 83.52%, outperforming all other approaches. Given the CMP dataset's larger sample size, these results particularly underscore the advantage of our method. The slightly lower F1-score compared with the state-of-the-art CFBS method on the ArtDeco dataset is primarily attributed to our lightweight feature enhancement strategy, which prioritizes computational efficiency and may not fully handle the complex occlusion scenarios present in this dataset. The extensive occlusions in the ArtDeco dataset can lead to missed detections of fully occluded windows. While our method effectively estimates the position and shape of partially occluded windows from contextual cues, its performance diminishes for completely occluded windows, negatively affecting the overall results. Nonetheless, the regular shapes predicted by our method keep its performance competitive with state-of-the-art methods.
Results on the ECP and Graz50 datasets further demonstrate the superiority of our method. Our approach achieves the highest average F1-score and IoU across these datasets, highlighting its adaptability to diverse architectural styles. It is noteworthy that CFBS achieves higher accuracy in its second stage, which combines the segmentation results with the original images to re-regress window bounding boxes as the final segmentation. Because this stage introduces stronger assumptions about the façade images and mainly benefits manually rectified datasets, we did not include its second-stage results in our comparison.
The eTRIMS dataset was not manually rectified, meaning the window arrangement priors are no longer valid. As a result, the DeepFacade method, which depends on symmetry loss functions, performs poorly. On the other hand, DeepWindows incorporates an instance-based relationship module and is less reliant on strict façade priors, making it more resilient to perspective distortions and resulting in the second-best performance. Our method, by combining spatial and shape priors through perspective transformation and quadrangle representation of masks, achieves the best results across all metrics.
4.5 Qualitative Comparison and Analysis
In addition to quantitative analysis, we also conducted visual comparisons to demonstrate the distinctive characteristics of our method.
Figure 7 shows the visual comparison results on the ECP dataset. Our method not only produces results with superior regularity but also closely matches the ground truth. In contrast, other methods exhibit excessively smooth corners and noisy boundaries, highlighting our method's advantage in predicting regular shapes and avoiding the over-smoothing effects common in conventional semantic segmentation.

Figure 8 presents the qualitative comparison on the CMP dataset. Similar to the ECP dataset, our method completely avoids noisy segmentation and produces more regular output compared to other methods. The attention mechanism enhancement further ensures that our method does not mistakenly extract irrelevant objects, such as doors and decorations, enhancing the accuracy of our results.

Figure 9 presents the qualitative comparison on the Graz50 dataset. It can be observed that our method closely matches the ground truth, not only detecting all windows but also avoiding the erroneous extraction of doors observed in other methods. Visually, our results are nearly indistinguishable from the ground truth, demonstrating the robustness and accuracy of our approach.

Figure 10 illustrates various occlusion scenarios in the ArtDeco dataset, which pose significant challenges for window extraction. In the first and second columns, showing partial occlusions, our method accurately estimates window positions and shapes, producing regular and precise predictions. In the third column, where a large area is occluded, our method extracts most windows while maintaining their regular shapes. In contrast, while other methods predict window locations correctly, their predicted shapes are often unsatisfactory.

Figure 11 presents a qualitative comparison of the segmentation results between our method and DeepWindows. It is evident that our method achieves superior boundary regularity for windows. DeepWindows produces jagged and irregular boundaries, whereas our method consistently maintains sharp, clean edges, ensuring more precise window extraction.

5 Discussion
In addition to comparing our method with state-of-the-art approaches, we conducted ablation studies to further verify its effectiveness. Specifically, we compared RepPoints v2 and its extension, Dense RepPoints v2 [55], which is designed for instance segmentation using more points, to examine the impact of domain knowledge on deep learning models and assess the effectiveness of our improvements. Furthermore, we visually validated the role of our perspective transformation estimation subnetwork and quantitatively analyzed its contribution to feature enhancement.
5.1 Effectiveness of the Proposed Quadrangle Representation
Accuracy comparison of our method with baseline methods is presented in Table 3. It is worth noting that the RepPoints v2 method is designed for object detection and cannot be directly compared with the other methods; therefore, we generate segmentation results based on its predicted bounding boxes.
| Datasets | Methods | Precision (%) | Recall (%) | F1 (%) | IoU (%) |
| --- | --- | --- | --- | --- | --- |
| ECP | RepPoints v2 | 88.13 | **91.24** | 89.66 | 81.25 |
| | Dense RepPoints v2 | **93.92** | 83.49 | 88.4 | 79.21 |
| | QuadWindow (Ours) | 91.58 | 89.26 | **90.41** | **82.49** |
| CMP | RepPoints v2 | 93.11 | 88.84 | 90.92 | 83.36 |
| | Dense RepPoints v2 | **95.53** | 79.47 | 86.76 | 76.62 |
| | QuadWindow (Ours) | 92.99 | **89.13** | **91.02** | **83.52** |
| Graz50 | RepPoints v2 | 81.59 | **89.31** | 85.28 | 74.33 |
| | Dense RepPoints v2 | **90.17** | 80.18 | 84.88 | 73.74 |
| | QuadWindow (Ours) | 83.45 | 87.51 | **85.43** | **74.57** |
| ArtDeco | RepPoints v2 | 84.57 | **84.95** | 84.19 | 73.26 |
| | Dense RepPoints v2 | 85.54 | 83.52 | 84.52 | 73.18 |
| | QuadWindow (Ours) | **87.62** | 84.07 | **85.81** | **75.14** |
| eTRIMS | RepPoints v2 | 80.74 | **89.06** | 84.7 | 73.46 |
| | Dense RepPoints v2 | **92.39** | 79.07 | 85.21 | 74.23 |
| | QuadWindow (Ours) | 87.85 | 82.96 | **85.33** | **74.42** |
- Note: Bold indicates the highest score for each metric.
From Table 3, it can be observed that, compared to the baseline methods, our approach achieves the best balance between precision and recall. While RepPoints v2 and Dense RepPoints v2 may surpass our method in precision or recall individually, our approach achieves the highest F1-score across all five datasets. Additionally, our method outperforms the baseline methods in terms of IoU.
In addition to accuracy improvements, we further investigate the computational efficiency of our approach to evaluate its scalability in real-world applications. We compared the computational cost and parameter count of our method with Dense RepPoints v2, finding that our method reduces computational cost by 46.83% and parameters by 8.12%. These efficiency gains underscore the suitability of our proposed approach for deployment in resource-constrained environments, such as rapid disaster response and large-scale urban analysis tasks. Overall, these results validate the effectiveness of our quadrangle-based window extraction method and demonstrate the significant advantages of incorporating domain-specific knowledge into deep learning frameworks.
5.2 Effectiveness of Perspective Estimation
To evaluate the effectiveness of perspective estimation, we rectified the original images using the perspective matrix generated by our model and visually compared them. The results, shown in Figure 12, indicate that while the images were not perfectly rectified, the perspective distortion was substantially reduced. This clearly demonstrates the effectiveness of our perspective transformation estimation subnetwork, as well as the perpendicular and alignment losses.

The quantitative evaluation of the effectiveness of the perspective transformation estimation module is presented in Table 4. It is evident that the performance of our method is suboptimal without feature enhancement, likely due to the model's limited receptive field, which restricts its ability to effectively utilize prior knowledge of window spatial arrangements. Incorporating row and column-based attention enhancement improves performance, particularly on the ArtDeco dataset, which contains occlusions. This improvement can be attributed to the network's ability to learn useful architectural priors even with sparse attention.
| Datasets | Methods | Precision (%) | Recall (%) | F1 (%) | IoU (%) |
| --- | --- | --- | --- | --- | --- |
| ECP | w/o AE | 90.38 | 86.49 | 88.39 | 80.61 |
| | w/o PM | 91.56 | 89.11 | 90.32 | 82.45 |
| | w/PM | **91.58** | **89.26** | **90.41** | **82.49** |
| CMP | w/o AE | 91.12 | 87.61 | 89.33 | 81.72 |
| | w/o PM | 92.66 | 89.03 | 90.81 | 83.38 |
| | w/PM | **92.99** | **89.13** | **91.02** | **83.52** |
| Graz50 | w/o AE | **85.04** | 85.74 | 85.39 | 74.46 |
| | w/o PM | 83.56 | **87.97** | **85.71** | **74.64** |
| | w/PM | 83.45 | 87.51 | 85.43 | 74.57 |
| ArtDeco | w/o AE | 87.21 | 80.37 | 83.65 | 72.57 |
| | w/o PM | 87.55 | **84.08** | 85.78 | 74.98 |
| | w/PM | **87.62** | 84.07 | **85.81** | **75.14** |
| eTRIMS | w/o AE | **88.2** | 80.8 | 84.34 | 72.92 |
| | w/o PM | 87.85 | 81.81 | 84.72 | 73.41 |
| | w/PM | 87.85 | **82.96** | **85.33** | **74.42** |
- Note: w/o AE denotes no attention enhancement is used; w/o PM refers to the use of row- and column-based attention enhancement without the assistance of the perspective matrix; w/PM refers to the attention enhancement method proposed in this paper using the perspective matrix. The best performance for each metric is highlighted in bold.
When utilizing the perspective matrix for feature enhancement, our method achieves the best performance across most datasets, though improvements are marginal in some cases. Notably, on the Graz50 dataset, whose façades are highly regular, perspective matrix-assisted feature enhancement led to a slight performance drop. Conversely, on the eTRIMS dataset, which lacks rectification, the enhancement improves model performance by up to 1.1%. These results highlight the effectiveness of our perspective transformation estimation subnetwork and validate the robustness of our feature enhancement strategy.
5.3 Uncertainties and Limitations
Due to the inherent nonlinear complexity involved in accurately estimating perspective transformations, images exhibiting extreme perspective distortion may reduce the accuracy of the estimated perspective matrix. Furthermore, such extreme distortions can result in a significant loss of semantic information, leading to the degradation of window features. This, in turn, makes the precise extraction of windows more difficult. As a result, the proposed method is less effective on images with severe perspective distortion.
Additionally, a limitation of our proposed approach is its primary applicability to quadrilateral window shapes. Although quadrilaterals represent the majority of real-world window geometries, other shapes, such as arches or irregular polygons, exist and may impact extraction accuracy. While extending our approach from quadrilaterals to polygons by increasing the number of key points could enable finer-grained boundary representations, such polygonal representations introduce greater complexity. Ensuring stable training and reliable polygon triangulation thus requires further exploration and rigorous testing. Future research should therefore investigate methodologies capable of effectively handling these more intricate polygonal structures, thereby enhancing the generalizability and robustness of our proposed framework.
It is important to acknowledge that the current validation was performed on relatively clean façade images. In realistic post-disaster scenarios, images may contain occluded, damaged, or misaligned windows. These complex conditions are not well represented in existing benchmarks. Therefore, future work should explore the robustness of the proposed framework on such challenging, disaster-specific datasets to confirm its applicability in operational disaster management systems.
6 Conclusion
We present QuadWindow, a perspective-aware framework that enables direct, vectorized window extraction from street-view images without a separate vectorization step. The framework integrates a perspective transformation estimation subnetwork that predicts perspective matrices to rectify distorted façades, thereby encoding window arrangement priors into the feature space. Two specialized loss functions supervise this module using geometric constraints, without requiring additional annotations. Furthermore, we modify RepPoints into parallel branches and introduce a differentiable rendering loss that aligns predicted quadrangles with raster annotations, enabling end-to-end vectorized predictions without manual re-annotation.
Extensive quantitative and qualitative experiments across five datasets demonstrate that our method outperforms state-of-the-art techniques, achieving an average F1-score of 87.6% and an IoU of 78.03%. Ablation studies further validate the effectiveness of the proposed modules, particularly in addressing perspective distortions and enabling automated vectorized window extraction. These innovations overcome limitations of prior research and enable automated extraction of façade-level features for post-disaster building assessments, rapid risk mapping, and vulnerability analysis. Integrated into natural hazard databases and GIS platforms, QuadWindow can enhance both the spatial and temporal granularity of structural monitoring, contributing to more intelligent and responsive disaster management systems.
Author Contributions
Zhuangqun Niu: data curation, formal analysis, methodology, software, validation, writing – original draft, visualization, investigation. Ke Xi: data curation, validation, writing – original draft, visualization, investigation. Yifan Liao: formal analysis, validation, writing – original draft. Pengjie Tao: conceptualization, methodology, resources, supervision, writing – review and editing, funding acquisition. Tao Ke: conceptualization, methodology, project administration, resources, supervision, writing – review and editing, funding acquisition.
Acknowledgments
This research was funded by the National Key Research and Development Program, grant number 2018YFD1100405.
Conflicts of Interest
The authors declare no conflicts of interest.
Open Research
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.