Recent advances of Transformers in medical image analysis: A comprehensive review
Abstract
Recent works have shown that the Transformer's excellent performance on natural language processing tasks can be maintained on natural image analysis tasks. However, the complicated clinical settings of medical image analysis and varied disease properties bring new challenges to the use of Transformers. The computer vision and medical engineering communities have devoted significant effort to Transformer-based medical image analysis research, with a special focus on scenario-specific architectural variations. In this paper, we comprehensively review this rapidly developing area by covering the latest advances of Transformer-based methods in medical image analysis across different settings. We first introduce the basic mechanisms of the Transformer, including implementations of self-attention and typical architectures. The important research problems across medical image data modalities, clinical visual tasks, organs, and diseases are then reviewed systematically. We carefully collect 276 very recent works and 76 public medical image analysis datasets in an organized structure. Finally, discussions on open problems and future research directions are also provided. We expect this review to be an up-to-date roadmap and serve as a reference source in pursuit of boosting the development of the medical image analysis field.
1 INTRODUCTION
For medical analysis, medical images are among the most abundant data modalities. With the continued development of computer vision (CV), medical image analysis can contribute to the clinical practice of doctors. Specific CV tasks on medical images can, to some extent, guide and aid doctors. For instance, the segmentation task in CV can help the doctor pick out an abnormal region, which reflects the symptoms of a disease and provides preliminary information for the corresponding medical intervention. On the other hand, medical image analysis differs from general CV tasks owing to the properties of medical images. Medical image datasets tend to be relatively small, which makes some frameworks that perform well in CV fall short of expectations on medical image analysis tasks.
Despite the particularity of medical image analysis, there still exists a strong relation between CV and medical image analysis. Hence, shifts in mainstream CV methods have also been reflected in the automatic analysis of medical images. Since deep learning reshaped the development of CV, the convolutional neural network (CNN) has been one of the most influential frameworks in image processing. Correspondingly, previous attempts employed CNNs for detection, segmentation, and other visual tasks on various medical image modalities, such as computed tomography (CT), ultrasound (US), and magnetic resonance imaging (MRI). The convolution operation, on which CNNs are generally based, has proved excellent at local feature extraction.
However, the shortcomings of the convolution operation became evident with the emergence of Transformer structures tailored for images. The Transformer network was initially proposed for natural language processing (NLP). The global dependency modeling of the attention mechanism enabled the Transformer to dominate NLP within a short period. Afterward, the introduction of the Transformer to image processing, namely the vision Transformer, was shown to outperform CNNs by a large margin in image recognition. Beyond image recognition, many other visual tasks, such as image segmentation and image reconstruction, also adopted the Transformer structure. Nevertheless, the Transformer brought a sharp improvement in performance at the cost of far greater demands on data scale and computational resources. Notably, medical image analysis, with its scarcity of data, may suffer from this shortcoming of the Transformer. How to balance the effectiveness of the Transformer against its overwhelming computational cost remains an open question.
In this paper, many attempts to resolve this dilemma are introduced. Considering the complexity of clinical circumstances and disease symptoms, researchers have proposed many innovative methods to utilize the Transformer in medical image analysis. Some fuse the vision Transformer with other networks, while others tailor the vision Transformer to the specific requirements of clinical demand. To present these Transformer-based methods comprehensively and systematically, the attempts are arranged according to their corresponding CV tasks and target diseases. The organization of this paper is as follows. Section 2 introduces the mechanisms of the Transformer and the vision Transformer. Sections 3–9 categorize the different applications of the vision Transformer in medical imaging. Section 10 introduces public medical imaging datasets. Section 11 summarizes the latest Transformer-based works in medical imaging and discusses future development.
2 MECHANISMS OF TRANSFORMER
Initially proposed by Vaswani et al.,1 the Transformer has been shown to perform excellently in NLP. In fact, the Transformer's powerful weapons, global receptive fields and long-range dependency modeling, also apply to vision tasks through the trick of image patching. In this section, the self-attention module, positional encoding, and the Transformer structure are introduced as preparation for the Transformer-based works in Sections 3–9.
2.1 Self-attention
Self-attention is a mechanism adopted in the Transformer (as shown in Figure 1) to achieve sequence labelling. In a sequence-to-sequence task, in which the inputs are sets of vectors instead of a single vector, the model may find it difficult to extract contextual information. Thus, self-attention is introduced with the help of the scaled dot-product. The contextual weight of each vector is obtained by computing the scaled dot-product in pairs followed by a softmax operation.

2.1.1 Scaled dot-product
Given queries $Q$, keys $K$, and values $V$, obtained by linearly projecting the input vectors, the scaled dot-product attention is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimension of the keys. The scaling factor $\sqrt{d_k}$ prevents the dot products from growing so large that the softmax saturates into regions with vanishing gradients.
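To make this concrete, here is a minimal PyTorch sketch of scaled dot-product attention; the tensor shapes and the optional Boolean mask argument are illustrative choices of ours, not code from any specific paper reviewed here.

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    q, k, v: (..., seq_len, d_k) tensors; mask, if given, is a Boolean
    tensor broadcastable to (..., seq_len, seq_len) with True = attend.
    """
    d_k = q.size(-1)
    # Pairwise similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # block disallowed positions
    weights = F.softmax(scores, dim=-1)  # contextual weights of each vector
    return torch.matmul(weights, v)      # weighted sum of the values
```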
2.1.2 Multi-Head Attention
Instead of performing a single attention function, Multi-Head Attention linearly projects $Q$, $K$, and $V$ $h$ times with different learned projections, attends in each subspace in parallel, and concatenates the results:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}).$$

This allows the model to jointly attend to information from different representation subspaces.
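A compact sketch of Multi-Head Attention, reusing the `scaled_dot_product_attention` helper above; the projection layout is the standard one, and the dimensions are illustrative.

```python
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Project Q, K, V into h subspaces, attend in each, then recombine with W^O."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)  # stacked W_i^Q projections
        self.w_k = nn.Linear(d_model, d_model)  # stacked W_i^K projections
        self.w_v = nn.Linear(d_model, d_model)  # stacked W_i^V projections
        self.w_o = nn.Linear(d_model, d_model)  # output projection W^O

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        split = lambda x: x.view(b, -1, self.h, self.d_k).transpose(1, 2)
        heads = scaled_dot_product_attention(
            split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)), mask
        )
        # Concatenate the heads back into (batch, seq, d_model)
        out = heads.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)
```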
2.2 Positional encoding
Positional encoding is an effective method to retain positional information and has also previously been adopted with CNNs and RNNs. Because the Transformer takes in the data as a whole with no regard to distances within sequences, the positional information of the data would otherwise be missing. In the Transformer, positional encoding is therefore not a complement to recurrence or convolution, as it tends to be in CNNs or RNNs, but an essential indicator that carries all of the positional information, whether relative or absolute.
In the vanilla Transformer, the positional encoding is computed with sinusoid functions of different frequencies:

$$PE_{(pos, 2i)} = \sin\!\left(pos/10000^{2i/d_{\mathrm{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos/10000^{2i/d_{\mathrm{model}}}\right),$$

where $pos$ is the position and $i$ is the dimension index. Owing to the product-to-sum property of sinusoid functions, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$, which allows the model to attend to relative positions.
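A short sketch of the sinusoidal table, assuming an even `d_model`; the function name is ours.

```python
import math

import torch


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build the (max_len, d_model) table with sin on even and cos on odd dimensions."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^{2i/d_model} for each pair of dimensions
    div = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # added elementwise to the token embeddings
```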
2.3 Transformer architecture
Designed for sequence-to-sequence tasks, the Transformer adopts an encoder-to-decoder architecture, like other successful neural sequence transduction models. A typical Transformer consists of blocks for Multi-Head Attention, masked Multi-Head Attention, Feed Forward layers, and layer normalization.
2.3.1 Encoder
The encoder is composed of a stack of identical blocks, each containing two sub-layers: a Multi-Head Attention layer and a position-wise Feed Forward network. A residual connection is employed around each sub-layer, followed by layer normalization, so the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
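A minimal post-norm encoder block in the same PyTorch style, reusing the `MultiHeadAttention` sketch from Section 2.1.2; the hyperparameter defaults follow the original Transformer but are otherwise illustrative.

```python
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Self-attention and feed-forward sub-layers, each followed by Add & Norm."""

    def __init__(self, d_model: int = 512, num_heads: int = 8,
                 d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.drop(self.attn(x, x, x, mask)))  # Add & Norm
        return self.norm2(x + self.drop(self.ffn(x)))            # Add & Norm
```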
2.3.2 Decoder
Each decoder block of the Transformer contains three sub-layers. The target of the decoder is to generate the output sequence of the whole Transformer model based on the output of the encoder. Notably, the bottom Multi-Head Attention layer employed in the decoder is masked. As the sequence-to-sequence task takes sequences as both input and output, the mask operation cuts off the influence of subsequent positions. The following Multi-Head Attention layer and Feed Forward layer are similar to those in the encoder.
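The decoder's masking can be realized with a lower-triangular ("look-ahead") mask that is passed to the attention helpers above; a minimal sketch:

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask where True = attend, so that
    position i may only attend to positions j <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


# Usage: scaled_dot_product_attention(q, k, v, mask=causal_mask(q.size(-2)))
```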
In this review, we consider various types of Transformer models used in the field of medical image analysis in different settings.
3 MEDICAL IMAGE SEGMENTATION
Medical images are suitable for revealing the symptoms of disease, which is valuable for both diagnosis and treatment. However, from a pixel-wise perspective, only some parts of a medical image contribute to later diagnosis and treatment, for example, the tumor region of a CT image. Hence, segmenting the targeted region of a medical image, whether an infected area or an abnormal organ, remains an important and challenging task. Many medical image researchers have introduced the framework now prevalent in computer vision, the Transformer, to tackle automatic medical image segmentation. Some of their efforts are summarized in Table 1.
Name | Modality | Organ/Method | Disease/Dimension |
---|---|---|---|
RANT2 | Laryngoscopy | Throat | None |
MBT-Net3 | Fundus | Eye | Corneal |
PCAT-UNet4 | Fundus | Retinal | Vessel |
TransBridge5 | Echocardiography | Cardiac | Left ventricle |
GT-Unet6 | X-ray | Tooth | Root canal |
AGMB-Transformer7 | X-ray | Tooth | Root canal |
Chest l-Transformer8 | X-ray | Chest | None |
GuifangZ9 | X-ray | Catheter | Guide-wire |
MSAM10 | PET-CT | Lung | Lung cancer |
TransDeepLab11 | Multimodality | Pure Transformer | 2D |
TransNorm12 | Multimodality | UNet-based | 2D |
EG-TransUNet13 | Multimodality | UNet-based | 2D |
HRSTNet14 | Multimodality | Others | None |
ScaleFormer15 | Multimodality | CNN-based | None |
TFCNs16 | Multimodality | Others | None |
Karimi17 | Multimodality | Pure Transformer | 3D |
TransUNet18 | Multimodality | UNet-based | 2D |
UTNet19 | Multimodality | UNet-based | 2D |
MedT20 | Multimodality | CNN-based | None |
SwinUnet21 | Multimodality | UNet-based | 2D |
AFTer-UNet22 | Multimodality | UNet-based | 3D |
MissFormer23 | Multimodality | Pure Transformer | 2D |
DS-TransUNet24 | Multimodality | UNet-based | 2D |
PMTrans25 | Multimodality | CNN-based | None |
UTran26 | Multimodality | UNet-based | 2D |
LeViT-UNet27 | Multimodality | Others | None |
CASTformer28 | Multimodality | GAN | None |
HiFormer29 | Multimodality | CNN-based | None |
LViT30 | Multimodality | Others | None |
DFormer31 | Multimodality | UNet-based | 3D |
MCTrans32 | Multimodality | CNN-based | None |
HyLT33 | Multimodality | CNN-based | None |
TransFuse34 | Multimodality | CNN-based | None |
Segtran35 | Multimodality | CNN-based | None |
Li36 | Multimodality | UNet-based | 2D |
TransClaw UNet37 | Multimodality | UNet-based | 2D |
TransAttUNet38 | Multimodality | UNet-based | 2D |
nnFormer39 | Multimodality | Pure Transformer | 3D |
VT-UNet40 | Multimodality | Pure Transformer | 3D |
MSHT41 | Multimodality | CNN-based | None |
USegTransformer42 | Multimodality | CNN-based | None |
TUnet43 | Multimodality | UNet-based | 2D |
ViTBIS44 | Multimodality | Pure Transformer | 2D |
Atlas-ISTN45 | Multimodality | Pure Transformer | 3D |
Shen Jiang46 | Microscopy | Cellular | Tissue |
CellDETR47 | Microscopy | Cell | Cell |
DFANet48 | MRI | Bone | Osteosarcoma |
Liqun Huang49 | MRI | Brain | Glioma |
TransBTS50 | MRI | Brain | Brain tumor |
Swin-UNETR51 | MRI | Brain | Brain tumor |
BiTr-Unet52 | MRI | Brain | Brain tumor |
3D Transformer53 | MRI | Brain | Brain region |
MRA-TUNet54 | MRI | Cardiac | Atrial |
HybridCTrm55 | MRI | Brain | Brain tumor |
Zheyao G56 | MRI | Cardiac | Right ventricle |
TransConver57 | MRI | Brain | Brain tumor |
METran58 | MRI | Brain | Stroke |
SwinBTS59 | MRI | Brain | Brain tumor |
BTSwin-Unet60 | MRI | Brain | Brain tumor |
UTransNet61 | MRI | Brain | Stroke |
TF-Unet62 | MRI | Cardiac | Atrial |
CST63 | MRI | Colorectal | Colorectal cancer |
SpecTr64 | HSI | None | None |
BAT65 | Dermoscopy | Skin | Melanoma |
FAT-Net66 | Dermoscopy | Skin | Melanoma |
Swin-PANet67 | Dermoscopy | Skin | Melanoma |
Polyp-PVT68 | Colonoscopy | Colorectal | Polyp |
SwinE-Net69 | Colonoscopy | Colorectal | Polyp |
CoTr70 | CT | Multiorgan | 3D organ |
PHTrans71 | CT | Multiorgan | Abdominal |
COTRNet72 | CT | Kidney | Kidney cancer |
HT-Net73 | CT | Multiorgan | Cross region |
UCATR74 | CT | Brain | Stroke |
CCAT-net75 | CT | Chest | COVID-19 |
Danfeng76 | CT | Lung | Lung cancer |
CAC-EMVT77 | CT | Cardiac | CAC |
DTNet78 | CT | Bone | Cranio |
Liu79 | ABVS | Breast | Breast tumor |
MS-TransUNet80 | Multimodality | UNet-based | 2D |
3.1 X-ray or radiographic images
3.1.1 Tooth root segmentation
In root canal therapy for periodontitis, both underfilling and overfilling may negatively affect patients, and automatic assessment of root canal therapy must be based on an accurate tooth root segmentation. Owing to the fuzzy boundary of the tooth root, Li et al.6 proposed the AGMB-Transformer to achieve efficient and accurate segmentation of tooth roots. To handle the ambiguous boundaries and low-resolution imaging, the AGMB-Transformer includes an anatomy feature extractor and a multibranch Transformer network. Experimental results showed the AGMB-Transformer's superior performance compared with ResNet, GCNet, and BoTNet. The pipeline of the AGMB-Transformer is shown in Figure 2.

3.1.2 Lung disease segmentation
Despite the popularity of weakly supervised deep learning models, such models may not apply effectively to chest radiographs. On the one hand, the lung image is largely but not strictly symmetrical, which may confuse the models; on the other hand, some regions of the chest may be immune to certain diseases, and such hidden connections tend to be ignored by weakly supervised models. Thus, Gu et al.8 proposed a novel Chest L-Transformer to segment the thoracic disease region and diagnose the disease. Specifically, Chest L-Transformer employed a CNN for local feature extraction and a Transformer with positional embedding to distribute different attention across chest radiograph regions. Experimental results on the SIIM-ACR Pneumothorax Segmentation data set demonstrated Chest L-Transformer's strong performance.
3.2 MRI
3.2.1 Bone tumor segmentation
As one of the malignant bone tumors, osteosarcoma is highly resistant to chemotherapy and bears a high recurrence rate. For the diagnosis of osteosarcoma, MRI reflects soft tissue well, which makes it sensitive to osteosarcoma. However, MRI images come with huge data volumes and severe noise. Thus, Wang et al.48 used an Edge Enhancement based Transformer (Eformer) to denoise the input MRI image and a deep feature aggregation network for real-time semantic segmentation (DFANet) to segment the osteosarcoma from the original MRI image of the bone.
3.2.2 Brain tumor segmentation
Accounting for 80% of malignant brain tumors, glioma is difficult to diagnose automatically due to its changeable appearance and ambiguous boundary. Transformer-based methods for glioma include Refs. 49-52, 55, 57, 59. Jiang et al.59 proposed SwinBTS, which introduces the Swin Transformer into a U-shaped structure to fulfill the task of 3D brain tumor segmentation. With a fusion of convolution operations and the attention mechanism, SwinBTS adopted the Swin Transformer as both encoder and decoder. Besides, Jiang et al. also designed an Enhanced Transformer Block based on self-attention to provide further feature extraction in case the encoder fails to grasp crucial information from the image. SwinBTS reached state-of-the-art results on BraTS 2019, BraTS 2020, and BraTS 2021.
3.2.3 Stroke segmentation
A shortage of blood supply may damage brain tissue and possibly cause an ischemic stroke. For the assessment of brain tissue, a precise segmentation method is needed to delineate the boundary of the lesion area. Wang et al.58 proposed METrans, which extracts multiscale features to improve the segmentation quality of stroke lesion areas. To be specific, Wang et al. introduced an attention-based block, the convolutional block attention module (CBAM), into the encoder-to-decoder structure. Meanwhile, to guarantee the presence of low-level features, Wang et al. supplemented the attention-based modules with three encoders for local details. Experimental results on ISLES2018 and ATLAS proved that METrans outperformed state-of-the-art methods in Dice.
3.2.4 Ventricle segmentation
For the diagnosis of many cardiovascular diseases, the assessment result depends on whether the cardiac structure can be segmented with high accuracy. However, precise segmentation of the right ventricle (RV) structure demands both short-axis (SA) and long-axis (LA) images, posing a challenge for current segmentation methods. Fusing U-Net with the Transformer, Chen et al.54 proposed MRA-TUNet to achieve segmentation of the atrium and ventricles. The performance of MRA-TUNet was confirmed by experimental results on ACDC and the 2018 atrial segmentation challenge. For ventricle segmentation, the Dice score of MRA-TUNet was 0.961 for the left ventricle and 0.911 for the right; for the atrium, the Dice score reached 0.923.
3.3 CT scans
3.3.1 Kidney cancer segmentation
Kidney cancer, one of the most prevalent cancers worldwide, can be cured efficiently if detected at an early stage. For automatic CT diagnosis of kidney cancer, variation in the kidney tumors' location, shape, and other properties poses a challenge for kidney tumor segmentation. Shen et al.72 proposed an end-to-end COTRNet fusing a CNN with the Transformer, with skip connections in the encoder-to-decoder structure. In the 2021 kidney and kidney tumor segmentation challenge (KiTS21), COTRNet placed 22nd with 61.6% average Dice, 49.1% surface Dice, and 50.52% tumor Dice.
3.3.2 Brain stroke segmentation
Among the three main kinds of stroke, it is most urgent to diagnose acute ischemic stroke (AIS), considering its probable deteriorative symptoms. However, the boundary between healthy tissue and AIS lesions is indistinguishable to the naked eye at an early stage, which makes early intervention for AIS quite demanding for doctors. Luo et al.74 proposed a novel UCATR network to segment the target area, namely the AIS region. Fusing the attention mechanism with convolution operations, Luo et al. adopted an encoder-to-decoder structure for UCATR. In the encoder, UCATR combines a CNN and a Transformer to extract both global and local features; in the decoder, UCATR utilizes a Transformer-based network to delineate the lesion area with high accuracy. Experimental results demonstrated that UCATR reached 73.58% in Dice similarity coefficient, outperforming three other methods.
3.3.3 Craniomaxillofacial deformity segmentation
For patients who suffer from craniomaxillofacial deformities, surgery may benefit from accurate segmentation of bone and precise localization of anatomical landmarks. Therefore, Lian et al.78 proposed an end-to-end DTNet to fulfill both the segmentation and localization tasks. With two communicating branches, DTNet can not only retain sufficient local details but also maintain a global receptive field. Besides, a regionalized dynamic learner (RDL) was designed to associate neighboring landmarks. In comparison with other multitask networks, DTNet achieved state-of-the-art performance.
3.3.4 Guide-wire segmentation
A successful cardiovascular interventional therapy requires precise insertion of a guide-wire to build a stent or deliver a drug. However, previous guide-wire segmentation networks were CNN-based and lacked global dependency modeling. Thus, Zhang et al.9 proposed a novel network introducing the Transformer for guide-wire segmentation. Instead of taking a single frame as input, Zhang et al. added several previous frames to the input sequence. A CNN was used to extract features from the input frames while the Transformer developed long-range dependencies. Considering the scarcity of catheter data, the network was tested on datasets from three hospitals, and experimental results showed that the model outperformed other segmentation models.
3.3.5 2D organ segmentation
Datasets in medical imaging tasks are generally orders of magnitude smaller than in other computer vision tasks. Thus, Liu et al.71 proposed PHTrans, which combines the Transformer and a CNN. PHTrans takes advantage of the U-shaped encoder-to-decoder design and arranges a series of Trans&Conv blocks into a Parallel Hybrid Module. Inside the Trans&Conv block, a Transformer-based network and a CNN-based network run in parallel so that global and local features can be processed simultaneously. Experimental results showed PHTrans's superior performance over other state-of-the-art models.
3.3.6 3D organ segmentation
Despite excelling at constructing global dependencies, the pure Transformer is not well suited to 3D medical image segmentation owing to its high computational and spatial complexity. Thus, Xie et al.70 combined a CNN with a deformable Transformer to balance computational cost and accuracy. Evaluated on the BCV data set, which includes 11 major human organs, CoTr outperformed other CNN-based and Transformer-based methods.
3.4 Fundus or optical coherence tomography (OCT)
3.4.1 Corneal endothelial cell segmentation
Zhang et al.3 proposed MBT-Net to address the blurred cell edges in corneal imaging, which are caused by uneven reflection and by tremor and movement of the corneal endothelial cells. Combining the CNN and Transformer architectures, MBT-Net first uses a CNN to extract local features of the corneal endothelial cell image and then performs global analysis through the Transformer and residual connections. Experimental results on TM-EM3000 and Alisarine showed that MBT-Net outperformed UNet and TransUNet in terms of Dice, F1, sensitivity (SE), and specificity (SP). Figure 3 demonstrates MBT-Net's segmentation structure.

3.4.2 Retinal vessel segmentation
Retinal vessel segmentation, if performed with high accuracy, can benefit the diagnosis of both ophthalmic and systemic diseases. However, this segmentation task demands both local details and global information interaction, making a pure CNN or pure Transformer unsuitable. Thus, Chen et al.4 proposed PCAT-UNet, which takes a U-shaped structure with convolution operations to process local features and a Transformer to construct global dependencies. In PCAT-UNet, Chen et al. designed two units, PCAT and FGAM, for the extraction and fusion of features. Experimental results on the DRIVE and STARE datasets demonstrated PCAT-UNet's state-of-the-art performance.
3.5 Other modalities
3.5.1 Dermoscopy
Melanoma segmentation on dermoscopy images suffers from the varied appearance and vague boundaries of melanoma, which require sufficient local details; on the other hand, a large receptive field is required for accurate skin lesion segmentation. Neither a pure CNN nor a pure Transformer can tackle the problem of melanoma segmentation alone. Wu et al.66 proposed FAT-Net, introducing an extra Transformer branch to ensure global context alongside sufficient local information. In FAT-Net, three notable adjustments were made to the classical Transformer: (1) a dual encoder instead of a single encoder; (2) three feature adaptation modules (FAM); and (3) a memory-efficient decoder that combines global context with local information. These adjustments were validated by experimental results on ISIC 2016, ISIC 2017, ISIC 2018, and PH2.
3.5.2 Microscopy
Jiang et al.46 designed a gated position-sensitive axial attention mechanism, which aimed to make Transformer-based networks applicable to small datasets. Unlike the patch division that vision Transformers generally adopt, the proposed method samples the input image iteratively. Besides, a strip convolution module (SCM) and a pyramid pooling module (PPM) were adopted to improve the network's capability of interpreting global context. Experimental results on three datasets showed that this model outperformed other segmentation models in terms of F1 score and IoU. Instance segmentation of single-cell microscopy images otherwise requires much preliminary manual analysis. Building on DETR, Prangemeier et al.47 proposed CellDETR to achieve end-to-end instance segmentation of yeast cells. The main architecture of CellDETR is similar to DETR; however, CellDETR reduced the parameter count of DETR by a factor of 10 and employed learned positional encodings so that the network can fulfill cell-specific instance segmentation with higher efficiency. CellDETR was compared experimentally with Mask R-CNN as well as U-Net and showed improvements in both segmentation accuracy and inference runtime.
3.5.3 Endoscope
Laryngeal disease, whether lesion or tumor, can only be detected through an electronic laryngoscope owing to the larynx's distinctly complex structure. For CV-assisted laryngeal lesion detection, few studies focus on multiobject segmentation of electronic laryngoscope images. Hence, Pan et al.2 proposed a novel RANT, which utilizes both the vision Transformer and a CNN to capture not only global context but also sufficient multiscale details. Specifically, four pyramid vision transformers (PVT) are employed to obtain multiscale features, with skip connections in each PVT layer. Experimental results on two public laryngeal datasets showed that RANT achieved 76.63% and 88.77% mIoU and 83.45% and 93.49% mDSC.
3.5.4 Echocardiography
For left ventricle region segmentation, manual labelling consumes much time and leads to observer bias. Therefore, Deng et al.5 proposed TransBridge, a lightweight Transformer-based model that segments the left ventricle region automatically and efficiently. Combining a CNN and the Transformer, TransBridge extracts features with a CNN encoder-to-decoder architecture and builds long-range dependencies with the Transformer. In comparison with CoTr, TransBridge5 reduced the total number of parameters by 78.7% and improved the Dice coefficient to 91.4%.
3.5.5 Hyperspectral imaging
Unlike other medical imaging methods, hyperspectral imaging is achieved by emitting a wide spectrum of light and analyzing the corresponding reflected and transmitted light, whose bands may be invisible to the naked eye. Thus, Yun et al.64 proposed SpecTr, which introduces the Transformer and a CNN for the segmentation of hyperspectral images. The authors treated the analysis of spectral band representations as a sequence-to-sequence prediction task. Taking a U-shaped structure, they set the Transformer as the encoder with a sparsity constraint tailored to the properties of spectral bands. Convolution operations were utilized in both the encoder and decoder for feature extraction and recovery.
3.6 Multimodality
3.6.1 Pure Transformer-based 2D segmentation
Huang et al.23 proposed MISSFormer, based on a pure Transformer, to fulfill the task of medical image segmentation. A segmentation network based on a pure Transformer is thought to lack local details. To overcome this drawback, Huang et al. made two modifications to the Transformer-based structure: (1) MISSFormer replaces the typical feed-forward network (FFN) with an improved block, the Enhanced Transformer Block, which both boosts the global dependency and retains sufficient local details; (2) an Enhanced Transformer Context Bridge, designed in the paper, is employed to feed multiscale features into the network.
3.6.2 Pure Transformer-based 3D segmentation
Karimi et al.17 proposed a novel convolution-free 3D medical image segmentation method based on a pure Transformer. Specifically, Karimi et al. first divide the input 3D image into 3D patches, which are fed into an attention-based encoder after positional encoding; the predicted patch then represents the spatial distribution of the target region. Through experiments on brain cortical plate, pancreas, and hippocampus datasets, the model outperformed CNN-based methods for 3D medical image segmentation. Moreover, for small training sets, pretraining can further improve the model's performance.
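As a rough, convolution-free illustration of the first step of such a pipeline (not Karimi et al.'s exact implementation), a 3D volume can be carved into cubic patches that are linearly embedded into a token sequence:

```python
import torch
import torch.nn as nn


class PatchEmbed3D(nn.Module):
    """Split a volume into non-overlapping p x p x p patches and embed each linearly."""

    def __init__(self, patch: int = 16, in_ch: int = 1, d_model: int = 512):
        super().__init__()
        self.p = patch
        self.proj = nn.Linear(in_ch * patch ** 3, d_model)

    def forward(self, x):  # x: (batch, ch, D, H, W), spatial dims divisible by p
        p = self.p
        b, c = x.shape[:2]
        # carve the volume into (D/p)*(H/p)*(W/p) cubes of side p
        x = (
            x.unfold(2, p, p).unfold(3, p, p).unfold(4, p, p)  # (b, c, D/p, H/p, W/p, p, p, p)
            .permute(0, 2, 3, 4, 1, 5, 6, 7)
            .reshape(b, -1, c * p ** 3)                        # (b, n_patches, c * p^3)
        )
        return self.proj(x)  # token sequence for the attention-based encoder
```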
3.6.3 CNN-based segmentation
CNNs have intrinsic inductive biases, while Transformers require large datasets; combining the two structures can thus avoid their respective shortcomings. CNN-based Transformer attempts include Refs. 15, 20, 25, 26, 29, 32. From a scale-wise perspective, Huang et al. identified two major problems for those who replace convolution layers with a pure Transformer: intrascale and interscale issues. Targeting these scale-wise problems, Huang et al.15 proposed ScaleFormer. For the intrascale problem, ScaleFormer uses a Dual-Axis MSA module to correlate the local features extracted by the CNN; for the interscale problem, ScaleFormer introduces a novel Transformer-based design that lets regions at different scales communicate. Experimental results on three datasets demonstrated that ScaleFormer surpassed state-of-the-art results.
3.6.4 U-Net based 2D segmentation
The U-Net design naturally lacks long-range dependency modeling, while the Transformer structure is deficient in low-level features. Thus, a combination of U-Net and Transformer can strike a balance between global interaction and sufficient local details. Refs. 12, 13, 18, 19, 21, 24, 26, 43 attempted to integrate the U-shaped network with the Transformer. Cao et al.21 proposed to integrate the Swin Transformer into a U-shaped structure. The resulting network, named SwinUNet, is based on a pure Transformer despite its U-Net-like appearance. Specifically, the network builds on the Swin Transformer, which applies shifted windows to the vanilla Transformer. By substituting Swin Transformer blocks for the convolution modules in a typical U-Net encoder, SwinUNet's capability of grasping global context is largely improved. Meanwhile, the decoder in SwinUNet also employs Swin Transformer blocks to segment the target region through a series of symmetric upsampling steps.
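To make the windowed attention concrete, a generic window-partition helper (an illustration of the shifted-window mechanism, not SwinUNet's actual code) could look like this:

```python
import torch


def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """Split a (batch, H, W, ch) feature map into non-overlapping win x win
    windows so that self-attention is computed within each window."""
    b, h, w, c = x.shape
    x = x.view(b, h // win, win, w // win, win, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, c)


# Shifted windows come from rolling the map before partitioning:
# x_shifted = torch.roll(x, shifts=(-win // 2, -win // 2), dims=(1, 2))
```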
3.6.5 U-Net based 3D segmentation
Yan et al.31 proposed D-Former to achieve 3D medical image segmentation with high precision. D-Former is capable of making full use of the depth information in 3D medical images. Notably, the design of D-Former31 enlarges the receptive field and boosts information interaction while keeping the computation of the self-attention mechanism relatively low. Besides, D-Former performs positional encoding dynamically instead of using a fixed function as in the vanilla Transformer.
3.6.6 GAN-based segmentation
You et al.28 proposed CASTformer, which aims to address prevalent drawbacks of Transformer-based models: a simplistic tokenization scheme, a scarcity of scale variety, and inadequate texture modeling. Based on a GAN, CASTformer consists of a generator and a discriminator. For the generator, You et al.28 employed a pyramid structure for rich multiscale features and proposed class-aware Transformer modules to delineate the target region in the input medical image. For the discriminator, they integrated a ResNet-based encoder and a Transformer-based encoder to produce the discriminative result. Experimental results on three benchmarks showed that CASTformer achieved an absolute improvement of 2.54%–5.88% in Dice coefficient over state-of-the-art results.
3.6.7 Other methods
Refs.14, 16, 27, 30 integrated Transformer with other existing efficient frameworks.
Specifically, Wei et al.14 proposed HRSTNet, which combines HRNet and the Swin Transformer. Li et al.16 proposed TFCNs, adopting the structure of FC-DenseNet joined with a ResLinear-Transformer (RL-Transformer) and a convolutional linear attention block (CLAB). Xu et al.27 proposed LeViT-UNet, which combines LeViT and U-Net. Li et al.30 proposed a novel LViT, standing for "Language meets Vision Transformer," which introduces medical text annotations to supplement limited image data.
4 MEDICAL IMAGE CLASSIFICATION
Medical images, which tend to carry a variety of symptom-specific information, serve as important material for doctors to make diagnoses. Considering that such diagnosis depends largely on the personal interpretation of medical images, the result may inevitably be influenced by subjective factors, such as individual experience and cognitive bias. Thus, accurate and stable medical image classification is required as a supplement to doctors' diagnoses. Some state-of-the-art medical image classification methods can provide both the severity grade of a specific disease and detailed medical judgements. Transformer-based medical image classification works are partially listed in Table 2.
Name | Modality | Organ/Method | Disease/Dimension |
---|---|---|---|
ScoreNet81 | Histopathology | Tissue | Breast cancer |
T2T-ViT82 | Histopathology | Cervical | Cervical cancer |
IL-MCAM83 | Histopathology | Colorectal | Colorectal cancer |
IViT84 | Histopathology | Kidney | pRCC |
Guo85 | Histology | Lung | Lung cancer |
MIL-VIT86 | Fundus | Eye | Retinal disease |
LAT87 | Fundus | Eye | Diabetic retinopathy |
CheXT88 | X-ray | Chest | Abnormality |
ViT89 | X-ray | Bone | Fracture |
Park90 | X-ray | Chest | COVID-19 |
FESTA91 | X-ray | Chest | COVID-19 |
Liu92 | X-ray | Chest | COVID-19 |
Covid-Trans93 | X-ray | Chest | COVID-19 |
Tuan94 | X-ray | Chest | COVID-19 |
MXT95 | X-ray | Chest | COVID-19 |
Park96 | X-ray | Lung | COVID-19 |
Van97 | X-ray | Multiorgan | None |
Verenich98 | X-ray | Lung | Abnormality |
KAT99 | WSI | Pathology | Endometrial |
GTN100 | WSI | Lung | Lung cancer |
TransPath101 | WSI | Pathology | Multiorgan |
ScATNet102 | WSI | Pathology | Skin |
TransMIL103 | WSI | Pathology | MIL |
Gheflati104 | US | Breast | Breast cancer |
POCFormer96 | US | Chest | COVID-19 |
Qayyum105 | Photo | Toe | DFU |
LLCT106 | OCT | Eye | Retina lesion |
RadioTransformer107 | Multimodality | Pure Transformer | 2D |
Matsoukas108 | Multimodality | CNN-based | 2D |
TransMed109 | Multimodality | CNN-based | None |
SEViT110 | Multimodality | CNN-based | None |
M3T111 | Multimodality | CNN-based | 3D |
Islam112 | Microscopy | Blood | Malaria parasite |
ViT-CNN113 | Microscopy | Lymph | Leukemia |
BrainFormer114 | MRI | Brain | Brain disease |
GlobalLocal115 | MRI | Brain | Brain age |
STAGIN116 | MRI | Brain | Brain connectome |
mfTrans117 | MRI | Hepatic | Hepatocellular carcinoma |
MVT118 | Dermoscopy | Skin | Melanoma |
OOD115 | Dermoscopy | Skin | None |
DPE-BoTNeT119 | Dermoscopy | Skin | Melanoma |
CVM-Cervix120 | Cytopathology | Cervical | Cervical cancer |
ViT& DenseNet121 | Colposcopy | Cervical | Cervical cancer |
xViTCOS122 | CT + X-ray | Chest | COVID-19 |
Hsu123 | CT | Chest | COVID-19 |
Zhang124 | CT | Chest | COVID-19 |
COViT-GAN125 | CT | Chest | COVID-19 |
Wu126 | CT | Lung | Emphysema |
costa127 | CT | Lung | COVID-19 |
MIA-COV19D128 | CT | Chest | COVID-19 |
CTNet129 | CT | Chest | COVID-19 |
Scopeformer130 | CT | Brain | Intracranial hemorrhage |
xia131 | CT | Pancreas | Pancreatic cancer |
NoduleSAT132 | CT | Lung | Lung nodule |
TransCNN133 | CT | Lung | COVID-19 |
ParkS134 | CT | Lung | COVID-19 |
covid-ViT135 | CT | Chest | COVID-19 |
Uni4Eye136 | Ophthalmic | Eye | 2D + 3D |
4.1 X-ray or radiographic images
4.1.1 COVID-19 analysis
Considering the serious damage of COVID-19 to global health and the economy, a fast and effective diagnosis of COVID-19 is urgently needed. Thus, the authors of Refs. 75, 90-96, 122-125, 127-129, 133, 135, 137 explored rapid and accurate COVID-19 classification using Transformer-based architectures. Shome et al.93 proposed a Covid-Transformer for automatic examination of COVID-19 from X-ray images. To overcome the scarcity of data, three open-source datasets were amalgamated into a high-quality data set of 30K images. For binary COVID-19 classification, Covid-Transformer93 reached 98% accuracy and a 99% AUC score; for multiclass classification (COVID-19, normal, and pneumonia), it achieved 92% accuracy and a 98% AUC score. Tuan et al.94 proposed a novel network integrating convolution operations and the self-attention mechanism for COVID-19 classification across three classes: normal, pneumonia, and COVID-19. To assess the severity of COVID-19, Tuan et al.94 constructed a data set on which five deep learning models were tested. The results showed the efficiency of automatic chest X-ray diagnosis of COVID-19.
4.1.2 Fracture classification
Musculoskeletal diseases are among the leading causes of disability. To intervene in musculoskeletal diseases as early as possible and work out the corresponding treatment for fractures, Tanzi et al.89 proposed a novel network for automatic classification of fracture subtypes from bone radiographs. Specifically, Tanzi et al.89 collected 4207 manually annotated images, producing the largest labeled data set of proximal femur fractures. Outperforming CNN-based methods, this Transformer-based network reached 0.77 precision, 0.76 recall, and a 0.77 F1-score.
4.1.3 Multiorgan classification
For multiview medical image analysis, images from different views must be combined. Although these images depict the same object, variations in perspective can cause large differences in appearance, posing challenges for registration. When registration cannot be performed, images from multiple angles can only be integrated through a global fusion of feature vectors. Therefore, Van et al.97 proposed a novel network that examines spatial feature maps and associates features extracted from unregistered views. Experiments on multiview mammography and chest X-ray datasets showed that this model outperformed previous methods.
4.2 MRI
4.2.1 Brain disease classification
Brain diseases that leave no obvious structural lesion can be reflected through functional magnetic resonance imaging (fMRI). While functional connectivity has been widely taken as the basic feature in fMRI disease classification, its calculation may rely too heavily on predefined regions of interest and skip voxel-wise details. Thus, Dai et al.114 proposed BrainFormer, which utilizes the Transformer to extract global relations and 3D convolutions to supplement local details. A single-stream model in BrainFormer114 then combines the local and global information.
4.2.2 Brain age classification
With the help of deep learning, brain age can be estimated rapidly from brain MRI. However, previous automatic methods failed to capture global information and concentrated only on local information. Thus, He et al.115 proposed a novel global-local Transformer, which fuses global and local information for brain age estimation. Specifically, He et al.115 designed two pathways, for global and local information respectively, which are integrated through an attention mechanism. Evaluation on eight public datasets proved the global-local Transformer's performance. Figure 4 illustrates the global-local Transformer and multipatch age prediction in He et al.115

4.2.3 Brain connectome analysis
The temporal correlation in functional neuroimaging modalities can reflect the cross-region functional connectivity (FC) within the brain. Given the network-like nature of this connectivity, graph neural networks (GNN) have been introduced to generate graph representations of the brain connectome. However, such attempts fail to incorporate the fluctuating nature of the functional connectivity network. Kim et al.116 proposed STAGIN to model a dynamic graph representation of the brain connectome. Apart from the GNN structure, a Transformer encoder was also used in STAGIN116 to extract global features. The performance of STAGIN was validated on the HCP-Rest and HCP-Task datasets.
4.2.4 Hepatocellular carcinoma classification (HCC)
A preliminary step in the treatment of HCC is to examine its symptoms quantitatively through multiphase contrast-enhanced magnetic resonance imaging (CEMRI). Former CNN-based attempts at HCC measurement lack long-range dependency modeling and multiphase CEMRI information selection. Therefore, Zhao et al.117 proposed a multifunction Transformer regression network (mfTrans-Net), which introduces the attention mechanism for quantitative HCC measurement. To be specific, three parallel CNN-based encoders first extract features from the CEMRI images; a non-local Transformer then grasps the long-range dependencies. A multilevel training strategy was adopted for mfTrans-Net to improve the performance of quantitative HCC measurement.
4.3 CT
4.3.1 Emphysema classification
Emphysema can lead to the enlargement of alveoli, which may damage the lung. Based on CT examination, emphysema is classified into three types: centrilobular emphysema (CLE), panlobular emphysema (PLE), and paraseptal emphysema (PSE). Considering that the three types of emphysema demand different treatments, Wu et al.126 proposed a CT-based emphysema classification model inspired by the structure of the vision Transformer. Wu et al.126 sliced large patches obtained from the original CT images into sequences of patch embeddings, which were fed into a Transformer encoder after positional encoding. Afterward, a softmax layer produced the final classification of the emphysema subtype.
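The described flow is essentially the standard ViT classification recipe; a hedged sketch reusing the Section 2 components (hyperparameters, pooling choice, and the three-class head are illustrative assumptions, not Wu et al.'s implementation):

```python
import torch
import torch.nn as nn


class ViTClassifier(nn.Module):
    """Patch embedding + positional encoding + encoder stack + softmax head."""

    def __init__(self, n_patches: int, patch_dim: int, d_model: int = 512,
                 n_layers: int = 6, n_heads: int = 8, n_classes: int = 3):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)  # linear patch embedding
        self.register_buffer("pe", sinusoidal_positional_encoding(n_patches, d_model))
        self.blocks = nn.ModuleList(
            EncoderBlock(d_model, n_heads) for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, n_classes)  # e.g., CLE / PLE / PSE

    def forward(self, patches):  # patches: (batch, n_patches, patch_dim)
        x = self.embed(patches) + self.pe
        for blk in self.blocks:
            x = blk(x)
        logits = self.head(x.mean(dim=1))  # mean-pool tokens, then classify
        return logits.softmax(dim=-1)      # subtype probabilities
```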
4.3.2 Lung nodule classification
Automatic diagnosis of multiple pulmonary nodules is crucial to the clinical practice of pulmonary nodule treatment. However, previous studies tend to emphasize single nodules, which misses the correlations between nodules. Thus, Yang et al.132 proposed a novel NoduleSAT based on the multiple instance learning (MIL) approach. NoduleSAT132 examines a patient's multiple nodules as a whole and analyzes the relations between them. To be specific, NoduleSAT132 introduced a 3D CNN into the Transformer-based structure and removed the pooling layer. Experiments on LUNA16 and LIDC-IDRI showed that NoduleSAT132 achieved outstanding performance on lung nodule and malignancy classification.
4.3.3 Intracranial hemorrhage classification
For RSNA intracranial hemorrhage classification, Yassine et al.130 proposed Scopeformer to identify different hemorrhage types from CT slices. Fusing a CNN with the Vision Transformer, Scopeformer130 employed an Xception CNN to extract feature maps and the Vision Transformer to establish long-range dependencies among relevant features from different levels. When the CNN module was pretrained, the performance of Scopeformer130 could be improved further.
4.3.4 Pancreatic cancer classification
Pancreatic cancer is rare but fatal. Its fatality makes early intervention urgent, while its rarity means that general screening of the whole population would impose a huge health burden with little positive effect. Therefore, considering the economic cost and complexity, Xia et al.131 proposed a novel model to classify pancreatic ductal adenocarcinoma (PDAC) and other abnormalities (nonPDAC) against normal cases from single-phase noncontrast CT images. Xia et al.131 tested their model on a data set of 1321 patients and reached 95.2% sensitivity and 95.8% specificity.
4.4 Fundus or OCT
4.4.1 Retinal disease classification
Medical imaging tasks, unlike other tasks in CV, may not provide a large data set for training an automatic classifier. Thus, Transformer-based networks, with their huge demand for large-scale training data, may not apply directly to medical imaging tasks. To maintain the Transformer's outstanding performance and adapt it to retinal disease classification, Yu et al.86 proposed MIL-VIT, which pretrains the Transformer model on a large fundus image data set and then fine-tunes the network for retinal disease classification. Additionally, a MIL scheme was employed to improve the model's performance, which was shown on two public datasets to outperform CNN-based methods.
4.4.2 Diabetic retinopathy (DR) classification
As a leading cause of permanent blindness, DR can be recognized at an early stage with the help of automatic classification methods, whose tasks include both DR grading and lesion discovery. Unlike previous methods, Sun et al.87 proposed to perform DR grading and lesion discovery simultaneously and therefore introduced a novel lesion-aware Transformer (LAT). LAT87 adopts an encoder-to-decoder structure containing a pixel-relation-based encoder and a lesion-filter-based decoder. The performance of LAT87 was tested on Messidor-1, Messidor-2, and EyePACS. Figure 5 shows the structure of LAT.87

4.4.3 Retina lesion analysis
Compared with other images, retinal OCT images bear obvious speckle noise, irregularity, and vague features. To tackle these problems, Wen et al.106 proposed a novel lesion-localization convolution Transformer (LLCT), which not only classifies ophthalmic diseases but also localizes the target retinal lesion region. Specifically, LLCT106 employs a CNN to obtain the feature map, which is then reshaped as input for the Transformer-based network. The gradient weights during backward propagation are summed to obtain the lesion location region.
4.5 Histopathology images
4.5.1 Breast cancer classification
Image resolution and the high cost of annotations have hindered progress in digital pathology. For pathology image classification, patch-based MIL is generally adopted, which gives uniform attention to each part of the image even though only a small fraction is useful. Thus, Stegmüller et al.81 proposed ScoreNet to reallocate computational resources according to the distribution of discriminative image regions. With a combination of local and global features, ScoreNet81 achieves efficient classification of target regions. Additionally, ScoreMix,81 a novel data augmentation method, was utilized in ScoreNet. Validated on three breast cancer histology datasets, ScoreNet reached state-of-the-art results.
4.5.2 Cervical cancer classification
There exist only a few public cervical cancer datasets, and their image quality and sample distributions are unsatisfactory. Thus, Zhao et al.82 built on the taming-Transformer design to launch a novel cervical cell image generation model, T2T-ViT, to improve cervical cancer classification results. This Tokens-to-Token Vision Transformer (T2T-ViT) model can provide balanced and sufficient cervical cancer datasets of high quality. With an encoder-to-decoder structure, T2T-ViT introduced an SE-block and MultiRes-block in the encoder, and SMOTE-Tomek Links to adjust the sample numbers and image weights of the data set.
4.5.3 Colorectal cancer classification
Chen et al.83 proposed the IL-MCAM model for the diagnosis of colorectal cancer. Unlike existing approaches focusing on end-to-end classification, the IL-MCAM framework places emphasis on human-computer interaction. Fusing the attention mechanism with interactive learning, IL-MCAM83 can be divided into two stages. In the first stage, automatic learning is performed via three Transformer-based networks and a CNN; in the second stage, misclassified images are interactively rejoined to the training set to improve performance. Experimental results on the HE-NCT-CRC-100K data set demonstrated the superiority of IL-MCAM over other methods.
4.5.4 Renal cell carcinoma (RCC) classification
For papillary renal cell carcinoma (pRCC), the two subtypes, type 1 and type 2, are similar in appearance but carry different information about the symptoms of pRCC, such as cellular and cell-layer level patterns. Considering the CNN's inability to distinguish these two subtypes, Gao et al.84 proposed an instance-based Vision Transformer (IViT), which utilizes a Transformer-based network to classify the two subtypes based on representations of input images. To be specific, after the attention mechanism, the top-K instances are aggregated to obtain cellular and cell-layer information.
4.5.5 Lung cancer classification
Lung cancer accounts for many deaths worldwide. Nonsmall-cell lung cancer (NSCLC) has two subtypes: lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). Histology is generally used by pathologists to classify the lung cancer subtype. To automate this classification, Guo et al.85 proposed a novel framework that employs a pretrained vision Transformer for multilabel lung cancer classification based on histology images. Zheng et al.100 proposed a novel Graph-Transformer framework for processing pathology data (GTP), which makes use of morphological and spatial information in predicting the disease grade. The design of the Transformer-based GTP100 is shown in Figure 6.

4.5.6 Endometrial classification
Despite the wide adoption of Transformers in whole slide image (WSI) classification, limitations in effectiveness and efficiency, caused by the token-wise self-attention design and positional embedding operation of a typical Transformer, have hindered further development of WSI classification. Zheng et al.99 proposed a kernel attention Transformer (KAT), which transmits information tokens via a cross-attention mechanism and uses a set of kernels to represent positional anchors on the WSI. KAT99 can thereby balance detailed contextual information of the WSI against computational complexity.
4.5.7 Melanoma classification
It is extremely challenging to recognize melanocytic lesions from pathology images. Generally, only an experienced dermatopathologist can overcome the intra- and interobserver variability and judge whether a lesion is invasive melanoma. With the digitalization of whole slide images, automatic classification methods have emerged as attempts to replicate pathologists' diagnoses. Wu et al.102 proposed a novel ScATNet to obtain multiscale representations of melanocytic skin lesions in the WSI modality. Experimental results showed ScATNet's superiority to other WSI classification methods.
4.6 Multimodality
4.6.1 Pure Transformer-based
Bhattacharya et al.107 proposed a student–teacher Transformer-based network, called RadioTransformer, to model radiologists' diagnoses on chest radiography. For radiologists, visual information is crucial to the classification of medical images, and with eye-gaze tracking technology the experts' behavior can be captured. RadioTransformer107 makes full use of this rich detail for diagnosis. Specifically, RadioTransformer107 takes a global-local Transformer encoder-to-decoder structure to extract both global and local information for a visual depiction of attention regions.
4.6.2 CNN-based 2D classification
It has been over a decade since CNNs emerged as the dominant method for medical imaging tasks. However, as the Transformer from NLP was adapted to vision tasks, the traditional mainstream CNN has been challenged by the newcomer, the vision Transformer. A comprehensive comparison between the CNN and the vision Transformer is needed, which prompted Matsoukas et al.108 to raise the question: Is it time to replace CNNs with Transformers for medical images?108 To answer this question after a careful examination of both CNN and ViT performance, Matsoukas et al.108 designed experiments to draw conclusions based on concrete quantitative results.
4.6.3 CNN-based 3D classification
Jang et al.111 introduced a multiplane and multislice Transformer (M3T) network to build a three-dimensional model for medical image classification. Targeting Alzheimer's disease, Jang et al.111 integrated 2D and 3D CNNs with a Transformer-based network to classify Alzheimer's disease. These three parts were responsible for different targets: the 2D and 3D CNNs extract local features from 2D and 3D input images, while the Transformer-based network develops long-range relationships over the CNN outputs.
4.7 Other modalities
4.7.1 Dermoscopy
As one of the deadliest diseases worldwide, skin cancer takes thousands of lives each year. To provide intervention for skin cancer at an early stage, deep learning methods are used for skin cancer classification and diagnosis. However, automatic skin cancer classification faces certain challenges: low accuracy, a deficit of labeled data, and poor generalization. To address these challenges, Aladhadh et al.118 proposed a medical vision transformer (MVT), a two-stage framework designed to introduce the attention mechanism for skin cancer classification. Nakai et al.119 proposed a novel bottleneck Transformer network (DPE-BoTNeT), joining a convolutional network with the Transformer design to supplement the initial network with the capability of extracting global dependencies and interpreting positional information.
4.7.2 US
For breast cancer imaging, US imaging can be extremely helpful owing to its low cost and safety. Among automatic classification methods based on US images, CNNs have emerged as the most prevalent structure. However, the CNN's limited receptive field leads to the loss of global context information. Thus, Gheflati et al.104 introduced the vision Transformer into US-based breast cancer classification. In terms of classification accuracy and area under the curve (AUC) on US datasets, this method outperformed state-of-the-art CNNs.
4.7.3 Photo
One out of three diabetic patients may be troubled by diabetic foot ulcers (DFU). With the recent increase in DFU, it is urgent to diagnose DFU at an early stage, before the ischemia and infection that appear as DFU deteriorates. Qayyum et al.105 introduced a novel network combining a CNN and a Transformer to diagnose DFU. Fine-tuning the CNN and Transformer structures on the DFUC-21 data set, Qayyum et al.105 chose two of five Transformers for feature extraction and completed the DFU detection.
4.7.4 Microscopy
As one of the most dangerous diseases that mosquito bites can cause, malaria can have serious consequences, even death. To recognize malaria in time, microscopy is used to examine whether malaria parasites are present in a blood sample. However, this method consumes much time and effort and may not scale to large-scale examination for malaria. Thus, Islam et al.112 proposed a novel method based on the multiheaded attention mechanism to diagnose the malaria parasite. The model reached 96.41%, 96.99%, 95.88%, 96.44%, and 99.11% for accuracy, precision, recall, F1-score, and AUC score on the testing data set. Among lymph diseases, acute lymphocytic leukemia (ALL) is a cancer with high fatality for both adults and children. To give a timely and accurate diagnosis of ALL, Jiang et al.113 proposed a ViT-CNN ensemble model to distinguish cancer cell images from normal cell images. Combining a CNN and the vision Transformer, ViT-CNN113 utilizes the CNN to extract rich features from input images and the Transformer to produce the classification result. Experimental results on the test set demonstrated that ViT-CNN reached 99.03% accuracy in the diagnosis of ALL.
4.7.5 Cytopathology
As the fourth most common female cancer worldwide, cervical cancer can be diagnosed via cervical cytopathology. Automatic classification of cervical cancer based on cytopathology images has developed correspondingly, given the time expense and probable errors of manual screening. Thus, Liu et al.120 proposed a novel CVM-Cervix to provide cervical cell classification with high speed and accuracy. CVM-Cervix sets a CNN module and a vision Transformer module, respectively, for the extraction of local and global features. Eventually, the classification result is produced by fusing local and global features via a multilayer perceptron module.120
4.7.6 Colposcopy
Human papilloma virus (HPV) may cause both cervical lesions and cervical cancer. As the symptoms deteriorate, the precancerous lesion can be classified into three stages: CIN1, CIN2, and CIN3. Classifying these three stages can assist with the treatment of cervical cancer. Therefore, Li et al.121 proposed a novel method fusing the vision Transformer and DenseNet to classify cervical cancer subtypes. Li et al.121 employed fivefold cross-validation to train and fuse the vision Transformer and the DenseNet161 model. Experimental results showed that this model reached an accuracy of 68%.
5 MEDICAL IMAGE DETECTION
Unlike segmentation and classification, medical image detection places emphasis on detecting the occurrence of a specific malady. Considering the conflict between limited medical resources and expanding medical requirements, automatic medical image detection can potentially serve as a powerful tool for disease screening. The automation of medical image detection takes advantage of artificial intelligence, freeing doctors to devote their expertise to more serious diseases. Papers on Transformer-based medical image detection are shown in Table 3.
Name | Modality | Organ/Method | Disease/Dimension |
---|---|---|---|
RDFNet138 | Image | Tooth | Caries |
MA139 | Fundus | Eye | Microaneurysm |
AG-CNN140 | Fundus | Eye | Glaucoma |
TIA-Net141 | Fundus | Eye | Glaucoma |
Koushik142 | X-ray | Chest | COVID-19 |
Duong143 | X-ray | Chest | Tuberculosis |
CLAM144 | WSI | Multiorgan | None |
MCAT145 | WSI | Multiorgan | Cancer |
Gheflati100 | US | Breast | Breast cancer |
UTRAD146 | Multimodality | UNet-based | 2D |
Tomita147 | Micro | Esophagus | Esophagus tissue |
DETR148 | MRI | Lymph | Lymph node |
COTR149 | Colonos | Colorectal | Polyp |
Liu150 | Colonos | Colorectal | Polyp |
Rahhal151 | CT + X-ray | Chest | COVID-19 |
TRACE152 | CT | Kidney | CKD |
SATr153 | CT | Lesion | None |
STCovidNet137 | CT | Lung | COVID-19 |
Chuang154 | CT | Lung | Lung nodule |
Chest X142 | CT | Lung | COVID-19 |
POCFormer155 | CT | Lung | COVID-19 |
SwinFpn156 | CT | Multiorgan | None |
Effinet143 | CT | Lung | Tuberculosis |
Islam157 | CT | Kidney | Cyst, stone, tumor |
Covid-Trans93 | CT | Lung | COVID-19 |
AANet158 | CT | Lung | COVID-19 |
Pesce159 | CT | Lung | Lung nodule |
TR-Net160 | CCTA | Artery | Coronary artery stenosis |
VIEW-DISENTANGLED161 | MRI | Brain | Brain lesion |
5.1 X-ray or radiographic images
5.1.1 COVID-19 detection
For COVID-19, CT-based or X-ray-based detection involves medical professionals, which may restrict the speed and efficiency of detection. Thus, Krishnan et al.142 attempted to ease this problem by proposing an automatic method for detecting COVID-19 from CT or X-ray images. Krishnan et al.142 fine-tuned a vision Transformer to fit the COVID-19 detection task on CT or X-ray images and reached state-of-the-art performance in terms of accuracy, precision, recall, and F1 score. Rahhal et al.151 proposed to perform coronavirus detection based on CT and X-ray images. Specifically, Rahhal et al.151 took a vision Transformer as the backbone and a Siamese encoder for feature extraction. After patch division, input images were processed in the encoder. Evaluated on CT and X-ray datasets, this framework outperformed other methods on five indicators.
5.2 MRI
5.2.1 Lymph node (LN) detection
For researchers who attempt to assess lymphoproliferative diseases, LNs need to be identified in T2 MRI images. The diverse appearance of lymph nodes makes it difficult for radiologists to pick out LNs in T2 MRI images. Therefore, Mathai et al.148 proposed the DEtection TRansformer (DETR) framework for the localization of lymph nodes. A bounding box fusion technique was adopted in DETR148 to reduce the false-positive rate. Experimental results showed that DETR148 reached 65.41% precision and 91.66% sensitivity.
5.2.2 Brain lesion detection
For locating small brain lesions, synthesizing 3D context conflicts with the computational cost. Therefore, Li et al.161 proposed a view-disentangled Transformer for MRI feature extraction to accurately detect tumors. Built on a Transformer backbone with a view-disentangled Transformer module, this framework modeled long-range dependencies among different positions. Multiple 2D slice features were extracted and enhanced in the view-disentangled Transformer module.
5.3 CT
5.3.1 Lung nodule detection
In fact, the earlier lung nodules are detected, the more likely lung cancer patients are to survive. However, the varied appearance and location of lung nodules make automatic computer-aided detection of lung nodules difficult. To diminish the high false-positive rate of nodule detection, Niu et al.154 proposed a 3D Transformer framework for lung nodule detection. Specifically, the input CT images were sliced into a nonoverlapping sequence, each unit of which was analyzed with the selfattention mechanism. Besides, Niu et al.154 chose a region-based contrastive method to train the model and promote the training result.
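The slicing step can be made concrete with the minimal sketch below, which cuts a 3D CT volume into nonoverlapping cubes and runs selfattention over the resulting token sequence. The cubic patch shape, patch size, and embedding dimension are illustrative assumptions, not the configuration of Niu et al.154

```python
# A minimal sketch: a 3D CT volume becomes a nonoverlapping token sequence
# for selfattention; patch size and embedding dim are assumptions.
import torch
import torch.nn as nn

def volume_to_tokens(vol, patch=8):
    # vol: (B, 1, D, H, W) with D, H, W divisible by `patch`
    B, C, D, H, W = vol.shape
    vol = vol.unfold(2, patch, patch).unfold(3, patch, patch).unfold(4, patch, patch)
    # vol is now (B, C, D//p, H//p, W//p, p, p, p): one cube per token
    tokens = vol.reshape(B, C, -1, patch ** 3).transpose(1, 2).flatten(2)
    return tokens                               # (B, N, C * patch^3)

embed = nn.Linear(8 ** 3, 256)                  # linear patch embedding
attn = nn.MultiheadAttention(256, num_heads=8, batch_first=True)

vol = torch.randn(1, 1, 32, 64, 64)
t = embed(volume_to_tokens(vol))                # (1, 256, 256)
out, _ = attn(t, t, t)                          # selfattention over cubes
```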
5.3.2 Tuberculosis detection
For the early detection and analysis of tuberculosis, artificial intelligence technology can benefit the automatic recognition of tuberculosis from chest X-ray images, which has been proved by experimental results on some small chest X-ray datasets. Duong et al.143 aimed to propose a novel framework that maintained this encouraging performance on large datasets. Specifically, Duong et al.143 fused three networks for the detection of tuberculosis: a modified EfficientNet, a modified vision Transformer, and a modified hybrid model.
5.3.3 Chronic kidney disease (CKD) detection
The shortage of positive patients and other risk factors has handicapped the development of automatic CKD detection. Wang et al.152 proposed TRACE (Transformer-RNN Autoencoder-enhanced CKD Detector), which achieved end-to-end CKD prediction. TRACE adopted an autoencoder with both an attention mechanism and an RNN unit. Consequently, TRACE152 can comprehensively analyze patients' medical history information. Experimental results on a data set based on real-world medical information showed that TRACE152 achieved state-of-the-art performance.
5.3.4 Cyst, stone, tumor detection
Although renal failure, with its severe consequences, has aroused much public attention, few attempts have been made to apply artificial intelligence to diagnosing kidney diseases. Targeting kidney stones, cysts, and tumors, the three major renal diseases, Islam et al.157 launched six models, such as EANet, CCT, and Swin Transformer. After a comprehensive comparison, the framework based on Swin Transformer was shown to outperform the other methods in accuracy, F1 score, precision, and recall.
5.3.5 Universal lesion detection (ULD)
For ULD, most methods achieve detection by extracting 3D contextual information from a series of adjacent CT slices. However, these operations affect the globality of the feature representation. Li et al.153 proposed a novel Slice Attention Transformer (SATr), which was joined to a convolution-based structure. This design obtained both long-range dependency and local feature extraction, which was validated by the experimental results on the testing data set.
5.4 Fundus or OCT
5.4.1 Microaneurysm detection
The size and complexity of the retinal fundus make it difficult to detect microaneurysms, which are an early sign of DR. Zhang et al.139 proposed a novel detection model. To be specific, Zhang et al.139 used equalization operations to improve the fundus image quality. Afterward, an attention mechanism was adopted to obtain preliminary features of retinal fundus images. Besides, Zhang et al.139 analyzed the association between microaneurysms and blood vessels from a spatial perspective. Experimental results on the IDRiD_VOC data set showed this method outperformed other attempts in terms of average accuracy and sensitivity.
5.4.2 Glaucoma detection
Associated with vision deprivation, glaucoma has been targeted by many automatic detection researchers, but the high redundancy in fundus images has hindered further accuracy improvement of glaucoma detection. Therefore, Li et al.140 proposed AG-CNN, a novel framework fusing convolution operations with the attention mechanism. Li et al.140 prepared a large glaucoma database with 11,760 fundus images. For AG-CNN, Li et al.140 designed three subnets, which were respectively responsible for attention prediction, pathological area localization, and glaucoma classification.
5.5 Histopathology image
5.5.1 Multiorgan detection
Most computational pathology methods based on deep learning need manual labelling of many whole slide images. To remove this burden of manual effort, Lu et al.144 proposed Clustering-constrained Attention Multiple instance learning (CLAM) to achieve automatic multiorgan detection with both efficiency and interpretability. CLAM144 employed the attention mechanism to identify discriminative subregions, which were then clustered to refine the target region.
5.5.2 Cancer detection
It is of great difficulty to predict survival outcomes from patients' whole slide images (WSIs). Both computational complexity and the data heterogeneity gap pose challenges for attempts to treat WSIs as bags for MIL. Therefore, Chen et al.145 proposed a multimodal co-attention transformer (MCAT) framework to solve the above problems. Mapping the WSI features into an embedding space, MCAT145 mimicked how word embeddings attend to salient objects. The spatial complexity was specifically reduced when extracting the WSI-based features. Experimental results on five cancer datasets demonstrated MCAT's superior performance.
5.6 Other modalities
5.6.1 Coronary CT angiography (CCTA)
Considering the serious consequences of coronary artery disease (CAD), the corresponding automatic diagnosis is of great importance. To overcome the structural complexity that has troubled the modelling of coronary arteries, Ma et al.160 proposed a Transformer network (TR-Net) to detect coronary artery stenosis in CCTA. TR-Net integrated a Transformer-based encoder with convolutional modules to combine the advantages of both. Consequently, TR-Net analyzed the cross-image information and identified the stenosis.
5.6.2 Image-based
Although dental caries is quite widespread, few studies place emphasis on caries detection. Thus, Jiang et al.138 proposed RDFNet, which is suitable for portable caries detection. Utilizing the attention mechanism to extract features from input images, RDFNet adopted the FReLU activation function to accelerate caries detection so that it can run on portable devices. Experimental results showed that RDFNet outperformed other methods in terms of accuracy and speed.
5.6.3 Dermoscopy
In many vision tasks, such as object detection, image classification, and semantic segmentation, Transformer-based models have proved to outperform CNN-based models. However, Transformer-based networks fail to maintain this superior performance when identifying out-of-distribution samples. Thus, Li et al.197 evaluated four Transformers on two open-sourced medical image datasets. The results showed that Transformer-based attempts at out-of-distribution detection were still insufficient.
5.6.4 Microscopy
Tomita et al.147 proposed a novel method for detecting Barrett esophagus (BE) and esophageal adenocarcinoma. To be specific, this method made use of tissue-level annotations based on the histological patterns in microscopy images. Both convolution operations and the attention mechanism were used in this framework. The testing set for this model included 123 images divided into four classes: normal, BE-no-dysplasia, BE-with-dysplasia, and adenocarcinoma.
5.6.5 Colonoscopy
Colonoscopy is widely adopted for the diagnosis of polyp lesions, which may evolve into the second most lethal cancer: colorectal cancer. To spare endoscopists the huge manual effort of screening for polyps, Shen et al.149 proposed an end-to-end polyp detection model named convolution in Transformer (COTR). Inspired by the detection Transformer (DETR), COTR149 utilized a CNN for feature extraction, Transformer encoders to encode and recalibrate the features, and Transformer decoders for object querying. Evaluated on two public polyp datasets, COTR149 outperformed other state-of-the-art methods. Combining the attention mechanism with convolution layers, Liu et al.150 proposed a novel framework for accurate polyp detection. Liu et al.150 used a traditional CNN backbone to give a preliminary 2D representation, which was passed to a Transformer encoder after flattening and positional encoding. After the Transformer decoder, a feedforward network (FFN) took the output embedding of the Transformer decoder to give the detection prediction.
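The pipeline described above follows the general DETR recipe: CNN features are flattened, combined with positional encodings, passed through a Transformer encoder-decoder, and FFN heads predict classes and boxes from learned object queries. The minimal sketch below illustrates that recipe; the stand-in backbone, query count, and head sizes are assumptions, not the exact designs of COTR149 or Liu et al.150

```python
# A minimal DETR-style detection sketch; sizes and backbone are assumptions.
import torch
import torch.nn as nn

class DetrStyleDetector(nn.Module):
    def __init__(self, dim=256, num_queries=20, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in CNN backbone
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU())
        self.pos = nn.Parameter(torch.randn(1, 64 * 64, dim) * 0.02)
        self.transformer = nn.Transformer(d_model=dim, nhead=8,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cls_head = nn.Linear(dim, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(dim, 4)                 # (cx, cy, w, h)

    def forward(self, x):                       # x: (B, 3, 256, 256)
        f = self.backbone(x).flatten(2).transpose(1, 2)   # (B, 4096, dim)
        memory_in = f + self.pos[:, : f.size(1)]          # add positions
        q = self.queries.expand(x.size(0), -1, -1)
        hs = self.transformer(memory_in, q)               # (B, Q, dim)
        return self.cls_head(hs), self.box_head(hs).sigmoid()

cls, boxes = DetrStyleDetector()(torch.randn(1, 3, 256, 256))
```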
5.6.6 Multimodality
Chen et al.146 proposed the U-TRansformer based Anomaly Detection framework (UTRAD) to overcome unstable training and inconsistent criteria for evaluating feature distributions. UTRAD146 employed attention-based autoencoders to describe the pretrained features. By reconstructing the feature distribution rather than the raw images, UTRAD146 stabilized the training process and improved the detection accuracy. A multiscale pyramidal hierarchy was adopted in UTRAD146 for the detection of anomalies. Tested on retinal, brain, and head datasets, UTRAD146 outperformed other methods.
6 MEDICAL IMAGE RECONSTRUCTION
In clinical practice, there may exist image quality deficits in the obtained medical images. After all, medical images, unlike computer simulations, are collected through medical equipment, and the result may be influenced by realistic constraints and accidental factors. Besides, some technologies sacrifice image quality to diminish side effects, as in low-dose computed tomography (LDCT). For low-quality medical images, some researchers have proposed solutions that improve image quality through Transformer-based image reconstruction models. Their trials are collected in Table 4.
Name | Modality | Organ/Method | Disease/Dimension |
---|---|---|---|
E-DSSR162 | Endoscopy | Surgical | Dynamic surgical scene |
MIST-net163 | X-ray | Cardiac | Image quality improvement |
CNN-Transformer164 | X-ray | Bone | Long bone |
SSTrans-3D165 | SPECT | Cardiac | 3D reconstruction |
ADVIT166 | PET | Brain | Alzheimer's Disease |
TransEM167 | PET | Brain | Image quality improvement |
SMIR168 | MRI | Brain | Super-resolution |
SLATER169 | MRI | Multiorgan | Unsupervised reconstruction |
ReconFormer170 | MRI | Multiorgan | Accelerated reconstruction |
DSFormer171 | MRI | Multiorgan | Accelerated reconstruction |
McMRSR172 | MRI | Multiorgan | Super-resolution |
PKT173 | MRI | Multiorgan | Undersample reconstruction |
HUMUS-Net174 | MRI | Multiorgan | Accelerated reconstruction |
Kspace-Trans175 | MRI | Multiorgan | Accelerated reconstruction |
SVoRT176 | MRI | Brain | Fetal brain |
TITLE177 | MRI | Multiorgan | Accelerated reconstruction |
T2Net178 | MRI | Multiorgan | Joint |
MTrans179 | MRI | Multiorgan | Accelerated reconstruction |
SRT180 | MRI | Brain | 2D to 3D |
FedGIMP181 | MRI | Multiorgan | Accelerated reconstruction |
ASMT182 | MRI | Brain | Super-resolution |
SAT-net183 | MRI | Cartilage | Acceleration + image quality |
McSTRA184 | MRI | Multiorgan | Accelerated reconstruction |
KangLin185 | MRI | Multiorgan | Accelerated reconstruction
ASFT182 | MRI | Brain | Super-resolution |
TranSMS186 | MPI | Multiorgan | Super-resolution |
CTformer187 | LDCT | Multiorgan | Denoising |
Liu188 | LDCT | Multiorgan | Degradation |
TransCT189 | LDCT | Multiorgan | Enhancement |
TED-Net190 | LDCT | Liver | Denoising |
transGAN-SDAM191 | l-PET | Brain | Image quality improvement |
Wu192 | CT | Multiorgan | Image quality improvement |
DuDoTrans193 | CT | Multiorgan | Image quality improvement |
TVSRN194 | CT | Multiorgan | Super-resolution |
Sizikova195 | CT | Lung | 3D shape induction |
Eformer196 | CT | Multiorgan | Denoising |
6.1 X-ray or radiography
6.1.1 Cardiac reconstruction
Decreasing projection views to lower the X-ray radiation dose usually causes severe streak artifacts, especially in cardiac X-ray images. To improve image quality from sparse-view data, a multidomain integrative Swin Transformer network (MIST-net) was proposed.163 MIST-net fused rich features from the data, residual-data, image, and residual-image domains using flexible network architectures, where the residual-data and residual-image subnetworks were utilized as data consistency modules to eliminate interpolation and reconstruction errors. A trainable edge enhancement filter was constructed to detect and protect image edges for high-quality reconstruction of global image features. According to the experimental results on numerical and real cardiac clinical datasets with 48 views, MIST-net improved the image quality with more small features and sharper edges than other competitors.
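A trainable edge enhancement filter of this general kind can be sketched as a depthwise convolution initialized from Sobel kernels and left learnable. The sketch below is an assumption about one plausible realization, not the MIST-net163 implementation; the Sobel initialization and additive edge boost are illustrative choices.

```python
# A minimal sketch of a trainable edge-enhancement filter: Sobel-initialized
# depthwise convolutions whose weights remain learnable (assumed design).
import torch
import torch.nn as nn

class TrainableEdgeFilter(nn.Module):
    def __init__(self, channels=1):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.conv_x = nn.Conv2d(channels, channels, 3, padding=1,
                                groups=channels, bias=False)
        self.conv_y = nn.Conv2d(channels, channels, 3, padding=1,
                                groups=channels, bias=False)
        with torch.no_grad():   # start from Sobel, then learn from data
            self.conv_x.weight.copy_(sobel_x.expand(channels, 1, 3, 3))
            self.conv_y.weight.copy_(sobel_x.t().expand(channels, 1, 3, 3))
        self.alpha = nn.Parameter(torch.tensor(0.1))  # edge-boost strength

    def forward(self, x):
        edges = torch.sqrt(self.conv_x(x) ** 2 + self.conv_y(x) ** 2 + 1e-6)
        return x + self.alpha * edges   # image with edges emphasized

enhanced = TrainableEdgeFilter()(torch.randn(1, 1, 128, 128))
```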
6.1.2 Long bone reconstruction
For conventional 3D imaging technologies like CT and MRI, the high radiation dose and the requirement of lying postures can greatly influence the accuracy of reconstructed bones and the diagnosis results. Besides, methods based on bone contours tend to depend on prior knowledge, and precise bone segmentation methods are rare. To address these issues, a novel model based on multiview contours was proposed in Ge et al.164 for bone reconstruction, together with a hybrid CNN-Transformer approach for bone contour segmentation. When tested on 301 bone X-ray images and considering p-value < 0.05, the proposed Trans-Detseg approach outperformed three state-of-the-art models, achieving a Dice similarity coefficient of 0.949 and a Hausdorff distance of 26.17. Figure 7 shows the pipeline of the CNN-Transformer for medical image reconstruction in Ge et al.164
6.2 MRI
6.2.1 Accelerated reconstruction
With under-sampled and noisy input images, deep learning can reconstruct ideal MRI images. Among these methods, both CNN-based networks and Transformer-based models bear their own advantages and drawbacks. Refs. 170, 171, 174, 175, 177, 179, 181, and 184 made corresponding adjustments to parts of the MRI pipeline, including k-space and sampling, to increase the speed of MRI. An MRI scan demands much time to generate the complete k-space matrices. To reduce the scan time to a large extent, Liu et al.177 proposed Transformer-involved trajectory learning (TITLE), a reinforcement learning framework based on Transformer. TITLE177 associated the Q-value in reinforcement learning with the reconstruction quality of the MRI image. Here, TITLE177 predicted the Q-value based on phase-indicator vectors and k-space matrices. Using the widely adopted inverse Fourier transform operation, TITLE177 eventually achieved efficient reconstruction of MRI images.
6.2.2 Cartilage reconstruction
Despite its high image quality, MRI requires a demanding time expense for data acquisition. The introduction of convolutional neural modules does contribute to the acceleration of MRI yet brings a limited receptive field. Thus, Wu et al.183 proposed SAT-net to achieve improvements in both image fidelity and acceleration. While adding an attention mechanism for long-range relationships, Wu et al.183 retained residual convolutional modules in SAT-net. Applied to cartilage MRI, SAT-net183 was trained on 336 3D images and tested on 24 images.
6.2.3 Super-resolution (SR)
Compressed sensing, a traditional method for reconstruction, was generally adopted for down-sampled MRI SR. To overcome the time expense of compressed sensing, Yan et al.168 proposed to introduce the Swin Transformer to brain MRI SR, in a method called SMIR. To be specific, SMIR168 was divided into two modules: a multilevel feature extraction module and a reconstruction module. To preserve the details of the reconstruction, SMIR168 attended to both frequency-domain and spatial-domain losses.
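A combined spatial- and frequency-domain loss of the kind just described can be sketched in a few lines; the L1 spatial term and the relative weighting below are illustrative assumptions, not the SMIR168 loss.

```python
# A minimal sketch of a dual-domain (spatial + frequency) loss; the
# weighting and the choice of L1 are assumptions.
import torch
import torch.nn.functional as F

def dual_domain_loss(pred, target, freq_weight=0.1):
    spatial = F.l1_loss(pred, target)
    # Compare complex spectra so that the frequency content of the
    # reconstruction matches that of the reference image.
    freq = (torch.fft.fft2(pred) - torch.fft.fft2(target)).abs().mean()
    return spatial + freq_weight * freq

loss = dual_domain_loss(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
```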
6.2.4 2D-to-3D reconstruction
For 3D reconstruction in invasive surgeries, convolution-based frameworks are too complex in structure while GAN-based networks are hard to train. Thus, Hu et al.180 proposed the shape reconstruction transformer (SRT) to fuse the selfattention mechanism with a generative design and achieve 3D brain reconstruction with both high speed and accuracy. Hu et al.180 used point clouds to give a 3D description based on the 2D input images. With both a qualitative demonstration and a quantitative experiment, SRT showed superior performance compared with other state-of-the-art methods.
6.2.5 Unsupervised reconstruction
Recent studies on reconstruction methods integrated the imaging operators with untrained MRI priors for the sake of reducing supervision requirements. Korkmaz et al.169 proposed a zero-Shot Learned Adversarial TransformER (SLATER) to fuse the attention mechanism with an adversarial network for unsupervised MRI reconstruction. The pretraining period prepared a high-quality MRI prior for the inference period, in which SLATER169 achieved zero-shot reconstruction via the imaging operator. Experimental results on brain MRI datasets showed SLATER's169 state-of-the-art performance.
6.2.6 Undersampling reconstruction
Inspired by the Transformer network's ability to deal with long-range dependencies in sequence transduction tasks, the authors of Ref. 173 proposed to rearrange the radial spokes into sequential data according to the chronological order of acquisition and to introduce the Transformer network to predict unacquired radial spokes from the acquired data. They proposed a novel data augmentation method, the projection-based k-space transformer (PBKT), to generate a large amount of training data from a limited number of subjects, which can furthermore be applied to different anatomical structures. Experimental results show that PBKT achieves superior performance compared with state-of-the-art deep neural networks.
6.3 CT
6.3.1 3D shape induction
Sizikova et al.195 propose an approach for training an automatic chest CT reconstruction algorithm with X-rays only. The authors augment existing model training on DRR-generated X-ray and CT pairs with a shape induction loss, which makes the model capable of learning from only real input X-rays. This approach allows grasping the variability of real X-ray images and directly incorporating it into the training of the CT generation model. The ability to obtain rich distributions from real X-rays is particularly essential for practical applications where the network is required to adapt to different imaging sensor types and diverse patient anatomy.
6.3.2 Image quality improvement
Although CT reconstruction from X-rays is useful for clinical diagnosis, ionizing radiation during the imaging process induces irreversible injury, leading researchers to focus on sparse-view CT reconstruction, which recovers a high-quality CT image from a sparse set of sinogram views. Iterative models have been presented to alleviate the artifacts that appear in sparse-view CT images, though at a rather expensive computational cost. To overcome the above-mentioned issues, a dual-domain Transformer (DuDoTrans) was proposed in Wang et al.193 to simultaneously restore informative sinograms by modelling the long-range dependency and achieve CT image reconstruction with both the enhanced and raw sinograms. As reported in the work, reconstruction performance on the NIH-AAPM data set and the COVID-19 data set experimentally confirms the effectiveness and generalizability of DuDoTrans with fewer parameters. According to the extensive experiments, DuDoTrans also demonstrates its robustness under different noise-level scenarios for sparse-view CT reconstruction.
6.3.3 SR
In clinical practice, anisotropic volumetric medical images with low through-plane resolution are commonly used owing to short acquisition time and low storage cost. However, the coarse resolution may bring difficulties in medical image diagnosis for either physicians or computer-aided diagnosis algorithms. Deep learning-based volumetric SR methods have arisen as feasible ways to improve resolution, with CNNs at their core. Despite recent progress, these methods are restricted by the inherent properties of convolution operators, which ignore content relevance and fail to effectively model long-range dependencies. Furthermore, most existing methods adopt pseudo-paired volumes for training and evaluation, where pseudo low-resolution (LR) volumes are generated by a basic degradation of their high-resolution (HR) counterparts. However, the domain gap between pseudo- and real-LR volumes leads to unsatisfactory performance of these methods in practice. To address the above issues, the first public real-paired data set, RPLHR-CT, was proposed in Yu et al.194 as a benchmark for volumetric SR, with baseline results provided by re-implementing four state-of-the-art CNN-based methods. To get rid of the inherent shortcomings of CNN, the authors further propose a Transformer volumetric SR network (TVSRN) based on attention mechanisms, dispensing with convolutions entirely. As the first study to use a pure Transformer for CT volumetric SR, TVSRN significantly outperforms all baselines on both PSNR and SSIM, with a better trade-off between image quality, the number of parameters, and running time.
6.3.4 Denoising
Image denoising is a long-standing topic in the CV and image processing communities. In the medical image field, compared with general images in ImageNet, there is much prior knowledge that can be leveraged to enhance a model. In Luthra et al.,196 the authors present an edge-enhancement based model, Eformer, a novel architecture that constructs an encoder-decoder network using Transformer blocks for medical image denoising. Nonoverlapping window-based selfattention is utilized in the Transformer block to reduce the computational burden. This work further incorporates learnable Sobel-Feldman operators to enhance edges in the image and explores an effective way to concatenate them in the intermediate layers of Eformer. The experimental analysis of Eformer compares deterministic learning and residual learning for the task of medical image denoising. In addition, Eformer is evaluated on the AAPM-Mayo Clinic Low-Dose CT Grand Challenge data set and achieves state-of-the-art performance: 43.487 PSNR, 0.0067 RMSE, and 0.9861 SSIM.
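Nonoverlapping window-based selfattention restricts attention to local windows, so the cost grows with the window area rather than the full image size. The minimal sketch below illustrates the partition-attend-merge pattern; the window size, dimensions, and head count are illustrative assumptions, not Eformer's196 settings.

```python
# A minimal sketch of nonoverlapping window selfattention; sizes assumed.
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    def __init__(self, dim=96, heads=4, window=8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                 # x: (B, C, H, W), H,W % window == 0
        B, C, H, W = x.shape
        w = self.window
        # Partition the feature map into (H/w * W/w) nonoverlapping windows.
        t = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(B * (H // w) * (W // w), w * w, C)
        # Attention runs only inside each window, cutting the cost.
        t, _ = self.attn(t, t, t)
        t = t.reshape(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return t.reshape(B, C, H, W)

y = WindowSelfAttention()(torch.randn(1, 96, 32, 32))
```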
6.4 LDCT
6.4.1 Denoising
LDCT is widely applied in clinical practice. However, in comparison with normal-dose CT, LDCT images contain stronger noise and more artifacts, which are obstacles for practical applications. In the past few years, convolution-based end-to-end deep learning methods have been prevalently used for LDCT image denoising. Recently, Transformer has demonstrated superior performance over convolution with more feature interactions, yet its applications in LDCT denoising have not been comprehensively explored. In Wang et al.,190 the authors propose a convolution-free T2T vision Transformer-based Encoder-decoder Dilation network (TED-net) to enrich the family of LDCT denoising algorithms. The model contains no convolution blocks and consists of a symmetric encoder-decoder block based solely on Transformer. The model is evaluated on the AAPM-Mayo Clinic LDCT Grand Challenge data set, and the experimental results show that it outperforms other state-of-the-art models with the highest SSIM value and the smallest RMSE value. For further improvement, the model could be slimmed with a more powerful tokenization without degrading the images.
6.4.2 Multiorgan reconstruction
Compared with normal-dose CT (NDCT), LDCT images are subject to severe noise and artifacts, which leave much to be done by deep learning-based reconstruction methods. Recently, in many studies, vision Transformers have shown superior feature representation ability over CNNs. However, unlike for CNNs, the potential of vision Transformers for LDCT denoising had been far from fully explored. To fill this gap, the authors in Wang et al.187 proposed a Convolution-free Token2Token Dilated Vision Transformer, called CTformer, for low-dose CT denoising. The CTformer uses a more powerful token rearrangement to encompass local contextual information, replacing the role that the convolution operation plays. It also dilates and shifts feature maps to capture longer-range interaction. The authors interpret the CTformer by statically inspecting patterns of its internal attention maps and dynamically tracing the hierarchical attention flow with an explanatory graph. Besides, the authors introduce an overlapped inference mechanism to effectively eliminate the boundary artifacts that are common in encoder-decoder-based denoising models. Experimental results on the Mayo LDCT data set prove that the CTformer outperforms state-of-the-art denoising methods with a low computational overhead.
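Overlapped inference in general works by denoising overlapping tiles and averaging the overlapping predictions, so no tile boundary survives in the output. The sketch below shows this generic mechanism; the patch and stride values are assumptions, and the exact CTformer187 scheme may differ.

```python
# A minimal sketch of overlapped tile inference with averaging;
# patch/stride values are illustrative assumptions.
import torch

def overlapped_inference(model, img, patch=64, stride=48):
    # img: (1, C, H, W); stride < patch creates overlapping tiles whose
    # predictions are averaged, suppressing tile-boundary artifacts.
    _, C, H, W = img.shape
    out = torch.zeros_like(img)
    weight = torch.zeros_like(img)
    ys = list(range(0, H - patch + 1, stride)) + [H - patch]
    xs = list(range(0, W - patch + 1, stride)) + [W - patch]
    for y in ys:
        for x in xs:
            tile = img[:, :, y:y + patch, x:x + patch]
            out[:, :, y:y + patch, x:x + patch] += model(tile)
            weight[:, :, y:y + patch, x:x + patch] += 1
    return out / weight

denoised = overlapped_inference(torch.nn.Identity(), torch.randn(1, 1, 128, 128))
```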
6.4.3 Degradation
Liu et al.188 proposed a weakly supervised method to learn the degradation of low-dose CT from unpaired low-dose and normal-dose CT images. To be specific, low-dose and normal-dose CT images were fed into one shared flow-based model and projected into the latent space. Then, the degradation between low-dose and normal-dose images was modeled in the latent space. Finally, the authors trained the model by minimizing the negative log-likelihood loss with no requirement for paired training data. It should be noted that the authors validated the effectiveness of the generated image pairs on a classic CNN, REDCNN, and a novel Transformer-based model, TransCT. The proposed method reached 24.43 dB mean PSNR and 0.785 mean SSIM on an abdomen CT data set, and 33.88 dB mean PSNR and 0.797 mean SSIM on a chest CT data set, outperforming other advanced CT denoising methods, the same network trained with CycleGAN-generated data, and a novel transfer learning method.
6.4.4 Enhancement
Inspired by the internal similarity of LDCT images, the authors in Zhang et al.189 present a Transformer-based neural network for LDCT, which can explore long-range dependencies between LDCT pixels. To ease the impact of noise on high-frequency texture recovery, the authors employ a Transformer encoder to further excavate the low-frequency part of the latent texture features and then exploit these texture features to restore the high-frequency features from the noisy high-frequency parts of the LDCT image. The final high-quality LDCT image is piece-wise reconstructed by incorporating the low-frequency content and the high-frequency features. Extensive experiments on the Mayo LDCT data set show that TransCT produces superior results and outperforms other methods.
6.5 Other modalities
6.5.1 Magnetic particle imaging (MPI)
MPI is a recent modality that provides exceptional contrast for magnetic nanoparticles (MNP) at high spatio-temporal resolution. A common procedure in MPI starts with a calibration scan to measure the system matrix (SM), which is then used to set up an inverse problem for reconstructing images of the particle distribution during subsequent scans. This calibration enables the reconstruction to account for various system imperfections, yet the time-consuming SM measurements have to be repeated under notable drifts or changes in system properties. Gungor et al.186 introduce a novel deep learning approach for accelerated MPI calibration based on Transformers for SM super-resolution (TranSMS). To be specific, low-resolution SM measurements are performed with large MNP samples for improved signal-to-noise ratio efficiency, and the high-resolution SM is super-resolved via a model-based deep network. TranSMS leverages a vision Transformer module to capture contextual relationships in low-resolution input images, a dense convolutional module for localizing high-resolution image features, and a data-consistency module to ensure consistency with the measurements. Tested on both simulated and experimental data, the results indicate that TranSMS achieves significantly improved SM recovery and image reconstruction in MPI, while enabling up to 64-fold acceleration during two-dimensional calibration.
6.5.2 Positron emission tomography image (PET)
Xing et al.166 proposed a ViT-based architecture called ADVIT, a new model trained on multiple modalities of positron emission tomography images (PET-AV45 and PET-FDG) for Alzheimer's disease (AD) diagnosis. Unlike conventional methods using multimodal 3D/2D CNN architectures, the ADVIT design replaces the CNN with a ViT. Considering the high computational cost of 3D images, ADVIT first employs a 3D-to-2D operation to project the 3D PET images into 2D fusion images. Then, it forwards the fused multimodal 2D images to a parallel ViT model for feature extraction, followed by classification for AD diagnosis. For evaluation, PET images from ADNI were used. The proposed model outperforms several strong baseline models in the reported experiments and achieves 0.91 accuracy and 0.95 AUC.
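The 3D-to-2D step can be illustrated with a minimal sketch that projects a PET volume to 2D views suitable for a 2D ViT. The choice of mean and max intensity projections below is an assumption for illustration; ADVIT's166 actual projection operation may differ.

```python
# A minimal sketch of a 3D-to-2D projection for a PET volume; the use of
# mean/max projections along depth is an illustrative assumption.
import torch

def project_3d_to_2d(vol):
    # vol: (B, 1, D, H, W) -> (B, 2, H, W): mean and max intensity
    # projections along the depth axis, stacked as two channels.
    return torch.cat([vol.mean(dim=2), vol.amax(dim=2)], dim=1)

views = project_3d_to_2d(torch.randn(2, 1, 48, 96, 96))   # (2, 2, 96, 96)
```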
6.5.3 Endoscopy
Long et al.162 proposed E-DSSR, an efficient reconstruction pipeline for highly dynamic surgical scenes that runs at 28 fps. Specifically, the authors designed a Transformer-based stereoscopic depth perception module for efficient depth estimation and a lightweight tool segmentor to handle tool occlusion. Besides, E-DSSR adopts a dynamic reconstruction algorithm which can estimate tissue deformation and camera movement, and aggregate the information over time specifically for surgical scene reconstruction. Evaluated on two datasets, the public Hamlyn Centre Endoscopic Video Data set and an in-house DaVinci robotic surgery data set, the results suggest that E-DSSR can recover the scene obstructed by the surgical tool and deal with camera movement in realistic surgical scenarios effectively at real-time speed.
7 REPORT GENERATION
For doctors with rich medical knowledge and abundant experience, one of the most time-consuming tasks may be writing the diagnosis report. In fact, writing a medical report, including a radiology report, involves medical knowledge but little creativity. With the rapid development of artificial intelligence, this medical task may be assigned to automatic report generation AI models. Despite the many existing difficulties of report generation, developing a report generation framework may save doctors much of the time and effort of writing reports. Table 5 gives a brief collection of Transformer-based report generation methods.
Name | Modality | Organ/Method | Disease/Dimension |
---|---|---|---|
MengYaXu198 | Surgical | None | None |
CIDA199 | Surgical | None | None |
Zhang200 | Surgical | None | None |
PPKED201 | Radio | None | None |
VTI202 | Radio | None | None |
CMN203 | Radio | None | None |
ASGK204 | Radio | None | None |
yixiwang205 | Radio | None | None |
RadBERT206 | Radio | None | None |
CDGPT2207 | Radio | None | None |
fullTrans208 | Radio | None | None |
Jia209 | Radio | None | Rare disease |
Farhad210 | Radio | None | None |
RTMIC211 | Radio | None | None |
Miura212 | Radio | None | None |
Chen213 | Radio | None | None |
KGAE214 | Medical | None | None |
MedSkip215 | Medical | None | None |
Park216 | Medical | None | None |
HoangTN217 | X-ray | None | None |
RATCHET218 | X-ray | Chest | None |
AlignTransformer219 | X-ray | None | None |
KERP220 | X-ray | None | None |
Yan221 | X-ray | None | None |
CEDT222 | X-ray | None | None |
7.1 Radio report
If the radiology report can be generated automatically as the radiological examination finishes, radiologists can be spared the tiring report writing and the possible diagnostic mistakes. Considering the textual property of the radiology report, the automatic generation of radiology reports poses a challenge for deep learning models. Refs. 201-203, 205-207, 212, and 213 aimed to assist radiologists in radiology report generation. Najdenkoska et al.202 proposed the variational topic inference framework (VTI) to overcome the diversity of radiologists' writing styles. VTI202 prepared a topic set, in which different topics serve as the guidance for the sentence arrangement in the report. Experimental results on test data showed that VTI202 achieved state-of-the-art performance in automatic radiology report generation. Wang et al.205 proposed to give a quantitative measurement of the uncertainty, whether visual or textual, to promote the quality of the generated reports. Integrating the information of different modalities, Wang et al.205 analyzed the uncertainty both sentence-by-sentence and as a whole. Experimental results on two public datasets showed that this model outperformed other methods.
7.1.1 Rare disease report
Despite many previous attempts at cross-modal radiology report generation, little attention has been paid to rare disease report generation. In fact, useless pixel redundancy and multimodal decoding failures have handicapped the development of report generation. Thus, Jia et al.209 proposed TransGen, a Transformer-based framework designed for the automatic generation of rare disease reports. TransGen209 utilized a semantic-aware visual learning (SVL) module for target region recognition and a memory augmented semantic enhancement (MASE) module that absorbs historical report sentences to boost report generation.
7.2 Medical report
Owing to the insufficiency of medical data, the expense of supervised training of a report generation framework is relatively high. Liu et al.214 proposed the knowledge graph auto-encoder (KGAE), an unsupervised encoder-to-decoder method to achieve automatic generation of medical reports without strict restrictions on paired training data. Specifically, KGAE214 used the knowledge graph to bridge the gap between the visual and textual modalities. The encoder and decoder of KGAE,214 with knowledge-driven aid, associated the images with the report context in a shared latent space. The framework and outcome of KGAE are shown in Figure 8. At present, existing attempts to generate medical reports suffer from top-down features, which consume much time and are inefficient in comprehending the report. Therefore, Xiang et al.211 proposed an encoder-to-decoder framework, in which the encoder part grasped the visual features while the decoder part performed computation in parallel to improve computational efficiency. This design, named RTMIC,211 was trained via reinforcement learning and achieved performance superior to other state-of-the-art results.
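The encoder-to-decoder pattern common to these report generators can be sketched minimally as follows: visual features act as decoder memory, and report tokens are produced autoregressively. The vocabulary, sizes, special-token ids, and greedy decoding below are illustrative assumptions, not any cited model's configuration.

```python
# A minimal sketch of report generation from visual features; vocabulary,
# sizes, and token ids are assumptions.
import torch
import torch.nn as nn

dim, vocab = 256, 1000
BOS, EOS = 1, 2
embed = nn.Embedding(vocab, dim)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True), 2)
to_vocab = nn.Linear(dim, vocab)

def generate_report(visual_tokens, max_len=30):
    # visual_tokens: (1, N, dim) image features from any visual encoder
    ids = torch.tensor([[BOS]])
    for _ in range(max_len):
        tgt = embed(ids)                                 # (1, T, dim)
        T = ids.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = decoder(tgt, visual_tokens, tgt_mask=mask)   # causal decoding
        next_id = to_vocab(h[:, -1]).argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == EOS:
            break
    return ids                                           # token ids of report

report = generate_report(torch.randn(1, 49, dim))
```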
7.3 X-ray report
Transformer-based attempts to explore the automated generation of accurate and fluent X-ray reports include Refs. 217-222. For X-ray medical report generation, You et al.219 built AlignTransformer to establish the correspondence between visual regions and disease tags. The authors divided the task into two parts: the first is the prediction of disease tags and the extraction of features relating the images to the disease tags, while the second is to produce the medical report based on the extracted information. In practice, the authors launched an align hierarchical attention (AHA) module for the former task and a multigrained Transformer (MGT) for the latter. Tested on public datasets, extensive experiments demonstrate that AlignTransformer outperforms other competing methods, and it received supporting evaluations from professional radiologists.
8 MEDICAL IMAGE REGISTRATION
Information in a single medical image, no matter how rich, is limited to one view due to the two-dimensional restriction of the image modality. However, for different medical images of a shared target under different realistic conditions, artificial intelligence can help fuse the information from the different sources, enlarging the richness and dimensionality of the original image information. The vision Transformer, as one of the most advanced models in CV, has been shown to perform well in image registration, and the corresponding papers are summarized in Table 6.
Name | Modality | Organ/Method | Disease/Dimension |
---|---|---|---|
FPT223 | US | Bone | Spine |
XMorpher224 | Multimodality | Pure Transformer | None |
Yibo Wang225 | Multimodality | UNet-based | None |
C2FViT226 | Multimodality | CNN-based | None |
ViT-V-Net227 | Multimodality | UNet-based | None |
TransMorph228 | Multimodality | CNN-based | None |
LKU-Net229 | Multimodality | UNet-based | None |
TD-Net230 | Multimodality | CNN-based | None |
GraformerDIR231 | Multimodality | CNN-based | None |
CEMSA232 | Multimodality | UNet-based | None |
ADMIR233 | MRI | Brain | Drug addiction |
PC-SwinMorph234 | MRI | Brain | None |
8.1 MRI
8.1.1 Fetal brain registration
Xu et al.176 propose to introduce the Transformer to the slice-to-volume registration task. The authors take multiple MRI slices as a sequence and exploit the attention mechanism's potential for automatically detecting inter-slice relevance and predicting unknown slices. In Xu et al.,176 the authors also give estimations of the 3D volume, which are fed back to the model to improve accuracy. Experimental results show that this framework reduces the registration error and improves the reconstruction quality. To assess to what extent the proposed model can boost the 3D reconstruction quality, the authors conduct extensive experiments on real-world MRI data.
8.1.2 Drug addiction brain registration
Tang et al.233 propose ADMIR (Affine and Deformable Medical Image Registration) as an unsupervised solution to medical image registration. The work in Tang et al.233 consists of three modules: the affine registration module computes the parameters of the affine transformation; the deformable registration module builds the displacement vector field; and the spatial transformer module absorbs the output of the former two modules and generates the final image. The performance of ADMIR is evaluated on MRI data collected from drug-addicted brains and outperforms other competing methods on important indicators of the medical registration task. Notably, ADMIR can be applied to medical registration tasks with high accuracy and speed.
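The spatial transformer step, warping the moving image with a dense displacement field, can be sketched in a few lines with grid sampling. The 2D case and zero-flow test below are illustrative assumptions; ADMIR233 operates on its own field and dimensionality.

```python
# A minimal sketch of warping a moving image with a displacement field
# via grid_sample; the 2D setting is an illustrative assumption.
import torch
import torch.nn.functional as F

def warp(moving, flow):
    # moving: (B, C, H, W); flow: (B, 2, H, W) displacements in pixels.
    B, _, H, W = moving.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()          # (H, W, 2) in pixels
    grid = grid + flow.permute(0, 2, 3, 1)                # add displacements
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    return F.grid_sample(moving, grid, align_corners=True)

# With zero flow the output equals the input, a quick sanity check.
warped = warp(torch.randn(1, 1, 64, 64), torch.zeros(1, 2, 64, 64))
```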
8.1.3 Brain registration
Liu et al.234 intend to achieve both registration and segmentation of medical images through a novel framework named PC-SwinMorph. In an unsupervised manner, this model explores patch representations to boost the ultimate performance. Concretely, the authors introduce a patch-based strategy to capture rich local features and a patch stitching strategy based on a multiattention backbone with a 3D shifted window mechanism. The experimental results provided in Liu et al.234 have shown this model's superiority over other state-of-the-art works.
8.2 Multimodality
8.2.1 Pure Transformer-based
Shi et al.224 present XMorpher to fulfill the medical image registration task with a pure Transformer backbone. The attention mechanism in the Transformer was modified into a cross attention transformer (CAT) in this paper to ensure sufficient interaction between the images being aligned. On the foundation of the CAT block, the authors in Shi et al.224 design a dual network to capture the features of the input images. Then, the multilevel features are incorporated, by means of the fusion module, into a comprehensive feature representation. With the help of the CAT block, this network can exploit the potential of the attention mechanism for aligning different images. As a result, XMorpher achieved progress in computational efficiency and smoothness.
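The core idea of cross attention between an image pair can be sketched minimally: queries come from the moving-image features while keys and values come from the fixed-image features, so every update is conditioned on the image being aligned to. The sizes below are illustrative assumptions, not the XMorpher224 configuration.

```python
# A minimal sketch of cross attention between moving and fixed features;
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, moving_tokens, fixed_tokens):
        # moving_tokens: (B, N, dim), fixed_tokens: (B, M, dim)
        out, _ = self.attn(query=moving_tokens, key=fixed_tokens,
                           value=fixed_tokens)
        return self.norm(moving_tokens + out)   # residual + norm

updated = CrossAttentionBlock()(torch.randn(1, 64, 128), torch.randn(1, 64, 128))
```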
8.2.2 U-Net-based
Some researchers placed emphasis on integrating the Transformer with the U-Net to boost registration performance, including Refs. 225, 227, 229, and 232. Wang et al.225 proposed to join the Transformer with the U-Net structure for medical image registration. Specifically, the authors utilize the distinctive Transformer structure to capture global and local features, both of which are used for the supervised generation of registered images. The design in Wang et al.225 can boost the registration accuracy, which has been validated by the experimental results on brain MRI datasets. According to the extensive experiments on LPBA40 and OASIS-1, the work in Wang et al.225 outperforms other registration frameworks, whether conventional or DL-based, in terms of registration accuracy.
8.2.3 CNN-based
In Refs. 226, 228, 230, and 231, the authors constructed frameworks that combine CNN and Transformer. In medical image registration, affine registration plays an indispensable role. Previous attempts at affine registration target how to boost its speed, and most base their frameworks on CNNs. In Mok and Chung,226 the authors propose a novel Coarse-to-Fine ViT (C2FViT), which exploits both the globality and locality of the model. The work in Mok and Chung226 introduced the convolution operation into the vision Transformer. Experimental results on 3D brain datasets show that C2FViT outperforms other CNN-based works.
9 MEDICAL IMAGE SYNTHESIS
9.1 MRI
MRI is a noninvasive medical modality that carries much crucial detailed information, especially about the structural development of the human brain. Across different stages of life, MRI can be employed for comprehensive analysis of neurodevelopment. Although MRI data is abundant for adults, researchers are highly short of MRI images of infants. Infants tend to resist staying still and keeping concentrated, which strongly affects the collection of MRI images. The authors in Zhang et al.235 noticed the data shortage of infant MRI images and correspondingly proposed a novel pyramid transformer net (PTNet) to achieve MRI synthesis. Specifically, PTNet combines the Transformer layer with a multiscale pyramid design. Experimental results showed that PTNet outperformed other GAN-based models.
9.2 PET
It is challenging to practice the medical image synthesis task on PET images owing to their intensity range and density. For PET images, the intensity values are of great significance for computing reproducible parameters, but the intensity range fluctuates so much that manual intervention is commonly required. Therefore, the authors in Shin et al.236 propose GANBERT, a comprehensive integration of BERT with GAN. In the process of PET synthesis, BERT takes responsibility for predicting masked value images, while the GAN discriminator is based on the "next sentence prediction (NSP)" part of BERT. As a result, the manual effort of adjusting the PET synthesis is replaced by GANBERT. Further development of GANBERT may lie in introducing a U-Net architecture as the generator or the NSP as the GAN discriminator.
9.3 CT
The authors in Ristea et al.237 propose an image translation method for CT scans. Transforming unpaired contrast CT scans into noncontrast CT scans may both supplement the source of contrast CT scans and enable the pairing of contrast and noncontrast CT scans. Therefore, in Ristea et al.,237 the authors propose CyTran, which builds on GAN and Transformer. Fed with unpaired CT images, this neural network employs a cycle-consistency loss to promote the training effect. To handle the high resolution of CT scans, CyTran joins both convolution and multi-head attention mechanisms for the registration of CT scans. Coltea-Lung-CT-100W, a novel data set with 37,290 lung CT images, was specifically built for the training of CyTran. Experimental results showed CyTran's superiority over other competing methods in medical image synthesis.
9.4 OCT
Dye injection, an effective method for tracking the vascular structure of the retina, may lead to serious side effects on health, while color fundus imaging fails to meet the fidelity requirement despite its noninvasive property. As the only noninvasive option for capturing retinal vasculature, optical coherence tomography-angiography (OCTA) can only guarantee stable imaging of rather small areas of the retina, not to mention its relatively high expense. In Kamran et al.,238 the authors introduced a deep learning framework, specifically a GAN, for the synthesis of fluorescein angiography (FA) images from fundus photo input. This network, called VTGAN in Kamran et al.,238 offers both a noninvasive solution to retinal vasculature imaging and an effective prediction tool for detecting retinal abnormalities. Experimental results of VTGAN showed its superiority over state-of-the-art frameworks in terms of fundus-to-angiography synthesis.
9.5 Multimodality
The collection of complementary tissue morphology information can promote the clinical practice of disease diagnosis. On the other hand, the scan cost makes it difficult to widen the acquisition of tissue morphology information. To balance the effect against the expense, medical image synthesis can be applied to this problem. Among the recent methods for medical image synthesis, GAN-based models stand out owing to their excellent ability to concentrate on structural details. However, GANs, with main frameworks based on CNNs, also inherit a locality bias and spatial invariance, which become obstacles to modelling long-range dependencies. Therefore, Hu et al.239 proposed a cross-modal framework for medical image synthesis with a double-scale deep learning method. Concretely, this work239 based the local discriminator on a CNN and the global discriminator on a Transformer and joined them into a double-scale discriminator. Evaluation on the standard benchmark IXI data set showed the promising results in Table 7.
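A double-scale discriminator in this spirit can be sketched as a convolutional branch judging local patches alongside a Transformer branch judging global structure. All sizes and the PatchGAN-style local branch below are illustrative assumptions, not the design of Hu et al.239

```python
# A minimal sketch of a double-scale discriminator: CNN for local patches,
# Transformer for global structure; sizes are assumptions.
import torch
import torch.nn as nn

class DoubleScaleDiscriminator(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.local = nn.Sequential(              # PatchGAN-style CNN branch
            nn.Conv2d(1, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, 2, 1))           # per-patch realness map
        self.embed = nn.Conv2d(1, dim, 16, 16)   # 16x16 patch tokens
        layer = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.global_enc = nn.TransformerEncoder(layer, 2)
        self.global_head = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (B, 1, H, W)
        local_score = self.local(x)              # (B, 1, H/4, W/4)
        tokens = self.embed(x).flatten(2).transpose(1, 2)
        g = self.global_enc(tokens).mean(dim=1)  # pooled global feature
        return local_score, self.global_head(g)  # local map + global score

l, g = DoubleScaleDiscriminator()(torch.randn(1, 1, 128, 128))
```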
10 MEDICAL IMAGE DATASETS
One of the major limitations regarding medical datasets is that not enough data is available to train a Transformer model, especially compared with the CV and natural language processing communities. Recently, this phenomenon has drawn increasing attention, and researchers have made tremendous efforts to construct high-quality datasets. We summarize popular datasets used in various medical image analysis tasks in Table 8. As can be seen, classification and segmentation are the two most studied tasks; datasets designed for other tasks such as synthesis, detection, and reconstruction are relatively in the minority. In practice, however, we recommend researchers make full use of existing datasets across tasks with the help of advanced techniques from the deep learning community such as weakly supervised learning, multimodal learning, multitask learning, transfer learning, and selfsupervised learning. For example, backbone models could be learned on datasets designed for segmentation using selfsupervised learning with carefully designed pretext tasks. This backbone is then used as the input for downstream tasks such as synthesis and detection. In this way, all datasets in the community are fully leveraged for various tasks, as the sketch below illustrates.
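The reuse pattern recommended above can be made concrete with a minimal sketch: a backbone is pretrained with a selfsupervised pretext task and then reused under a downstream head. Rotation prediction is an illustrative pretext-task choice, and all layer sizes are assumptions.

```python
# A minimal sketch of selfsupervised pretraining followed by backbone reuse;
# the rotation pretext task and sizes are illustrative assumptions.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(1, 32, 3, 2, 1), nn.ReLU(),
                         nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

# Pretext stage: predict which of four rotations was applied to each image.
rot_head = nn.Linear(64, 4)
x = torch.randn(8, 1, 64, 64)
k = torch.randint(0, 4, (8,))
x_rot = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                     for img, r in zip(x, k)])
loss = nn.functional.cross_entropy(rot_head(backbone(x_rot)), k)
loss.backward()   # trains the backbone without any manual labels

# Downstream stage: keep the pretrained backbone, swap in a task head.
detector_head = nn.Linear(64, 2)
logits = detector_head(backbone(x))
```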
Name | Modality | Organ/Method | Disease/Property | Task |
---|---|---|---|---|
PPMI242 | Multimodal | Brain | Parkinson | Synthesis |
BRATS243 | MRI | Brain | Brain | Synthesis |
iSeg-2017244 | MRI | Brain | Brain tissue | Segmentation |
BraTS-2020245 | MRI | Brain | Brain tissue | Segmentation |
MRBrainS246 | MRI | Brain | Brain tissue | Segmentation |
UKBB247 | MRI | Brain | Brain tissue | Segmentation |
ERI248 | MRI | Cardiac | Cardiac | Segmentation |
CHAOS249 | MRI | Abdominal | Abdominal | Segmentation
KiTS19250 | CT | Pediatric | Pediatric | Segmentation |
USCD251 | CT | Eye | Drusen | Segmentation |
MSD-01252 | CT | Multiorgan | Multi | Segmentation |
M&Ms253 | MRI | Cardiac | Cardiac | Segmentation
ISIC2017254 | Dermo | Skin | Melanoma | Segmentation |
GlaS255 | Histopath | Colorectal | Cancer | Segmentation |
MoNuSeg256 | Micro | Cell | Nuclear | Segmentation |
Pannuke257 | Micro | Cell | Cell | Segmentation |
NIH Chest258 | X-ray | Lung | Lung | Segmentation |
Clean-CC-CCII259 | CT | Lung | COVID-19 | Segmentation |
Bowl260 | Micro | Cell | Nuclear | Segmentation |
Thorax-85261 | CT | Multiorgan | Multi | Segmentation |
SegTHOR262 | CT | Thoracic | Thoracic | Segmentation |
ACDC263 | MRI | Multiorgan | Multi | Segmentation |
Kvasir-SEG264 | Colon | Gastrointestinal | Polyp | Segmentation |
Clinic DB265 | Colon | Gastrointestinal | Polyp | Segmentation |
EndoScene266 | Colon | Colorectal | Polyp | Segmentation |
ETIS267 | Colon | Colorectal | Cancer | Segmentation |
Choledoch268 | Histopath | Cholangio | Cholangiocarcinoma | Segmentation |
TCIA269 | Multimodal | Multiorgan | Cancer | Segmentation |
HECKTOR270 | PETCT | Headneck | Tumor | Segmentation |
REFUGE20271 | Fundus | Eye | Glaucoma | Segmentation |
CVC272 | Colon | Colorectal | Polyp | Segmentation |
Alizarine273 | Fundus | Eye | Corneal | Segmentation |
EchoNet-Dynamic274 | Echocardiography | Cardiac | Ventricle | Segmentation |
CBIS-DDSM275 | CT | Breast | Mammography | Segmentation |
DIARETDB1276 | Fundus | Eye | Retinal | Segmentation |
STARE1 | Fundus | Eye | Retinal | Segmentation |
IU Chest X-ray277 | X-ray | Chest | Drusen | Report generation |
MIMIC-CXR278 | X-ray | Chest | Drusen | Report generation
PadChest279 | X-ray | Chest | Lung | Report generation |
Ffa-ir280 | Multimodal | Multiorgan | Multi | Report generation |
DeepOpht281 | Fundus | Eye | Retinal | Report generation |
IH-AAPM Mayo282 | PETCT | Abdominal | Abdominal | Reconstruction
Kirby21283 | PETCT | Multi | Multi | Reconstruction |
DIV2K284 | MRI | Multi | Multi | Reconstruction |
fastMRI285 | MRI | Multi | Multi | Reconstruction |
dHCP286 | MRI | Brain | Infant | Reconstruction |
NIH-AAPM287 | LDCT | Liver | Liverlesion | Reconstruction |
Open MPI288 | MPI | Multiorgan | Multi | Reconstruction |
COVIDGR-E289 | X-ray | Lung | COVID-19 | Detection |
IDRiD290 | Fundus | Artery | Microaneurysm | Detection |
COVIDx-CT-2A291 | CT | Lung | COVID-19 | Detection |
Cancer Genome Atlas292 | Histopath | Cancer | Multitype | Detection |
LUNA293 | CT | Lung | Nodule | Classification |
LIDC-IDRI294 | CT | Lung | Nodule | Classification |
Saber295 | CT | Lung | Emphysema | Classification |
COVID-CT296 | CT | Lung | COVID-19 | Classification |
Sars-CoV-2297 | CT | Lung | COVID-19 | Classification |
COVID19-CT-DB298 | CT | Lung | COVID-19 | Classification |
COVID-CTset299 | CT | Lung | COVID-19 | Classification |
BIMCV COVID19300 | X-ray | Lung | COVID-19 | Classification |
PosteriorAnterio301 | X-ray | Lung | COVID-19 | Classification |
COVIDx302 | X-ray | Lung | COVID-19 | Classification |
Color Fundus303 | X-ray | Eye | Retinal | Classification |
Cohen304 | X-ray | Lung | COVID-19 | Classification |
CHOWDHURY305 | X-ray | Lung | COVID-19 | Classification |
Cohen's data set306 | X-ray | Lung | COVID-19 | Classification |
Kather307 | X-ray | Colorectal | Cancer | Classification |
BUSI308 | US | Breast | Cancer | Classification |
Data set B309 | US | Breast | Cancer | Classification |
CAMELYON16310 | Micro | Lymph | Node | Classification |
TCGA-NSCLC311 | CT | Lung | Cancer | Classification |
RFMiD2020312 | Fundus | Eye | Retinal | Classification |
Messidor313 | Fundus | Eye | Retinal | Classification |
EyePACS314 | Fundus | Eye | Retinal | Classification |
CheXpert315 | X-ray | Lung | COVID-19 | Classification |
POCUS316 | X-ray | Lung | COVID-19 | Classification |
Qi317 | X-ray | Lung | COVID-19 | Classification |
11 DISCUSSIONS AND CONCLUSIONS
This paper reviews the literature on Transformer-based models for medical image analysis. We cover the most popular tasks, including classification, detection, segmentation, reconstruction, registration, synthesis, and clinical report generation. Within each task, we review existing works by input modality, for example, X-ray, CT, MRI, fundus, and multimodal data. We also summarize the popular datasets for medical image analysis according to input modality, organ, method, disease, property, and task. We hope these efforts help researchers move forward in the field of medical image analysis, especially with the help of Transformers. To keep pace with the rapid development of Transformers in the deep learning community, we recommend organizing relevant workshops at CV and medical imaging conferences and arranging special issues in prestigious journals to promote research in medical image analysis.
The Transformer, as one of the most powerful models currently in NLP and CV, has been applied to many areas. Its long-range dependency modeling makes it capable of capturing deep features hidden in an image. However, despite the Transformer's success in other areas, its integration with medical images remains challenging. On the one hand, alongside its excellent performance, the Transformer consumes substantial computation and requires a large dataset for training. On the other hand, the specificity of medical images, such as limited data volume and strict real-world constraints, also poses challenges for researchers who attempt to introduce the Transformer to medical tasks. These conflicts, along with other practical obstacles in clinical settings, hinder the deployment of Transformer-based methods on medical images.
Fortunately, through the continuous efforts and remarkable talents of researchers, as our review has shown, many innovative Transformer-based methods have been proposed for medical tasks. From specific diseases to general examinations, many researchers have offered solutions to facilitate computer-aided medical image analysis. As a result of their work, the advantages of the Transformer are preserved on medical tasks: from segmentation to registration, Transformer-based methods have proven effective across medical image-based tasks.
We firmly believe that, as science develops, the integration of different disciplines will play an increasingly important role in the future. These Vision Transformer-based studies in the medical imaging area are further proof that one of the most advanced AI models can be joined with the exploration of frontier clinical problems. With the rapid growth of research using Transformers in the field of medical image analysis, we hope this review provides a road map for researchers to move forward in this field.
AUTHOR CONTRIBUTIONS
Kun Xia: Conceptualization (equal); investigation (equal); resources (equal); writing—original draft (equal); writing—review and editing (equal). Jinzhuo Wang: Conceptualization (equal); funding acquisition (lead); investigation (equal); resources (equal); writing—original draft (equal); writing—review and editing (equal). Both authors have read and approved the final manuscript.
ACKNOWLEDGMENTS
This research was supported by Discipline Development of Peking University (7101302940, 7101303005) and the National Natural Science Foundation of China (62172273).
CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.
ETHICS STATEMENT
This work does not require ethics approval.
Open Research
DATA AVAILABILITY STATEMENT
This work does not contain any data or code.