Evaluating the potential risks of employing large language models in peer review
Lingxuan Zhu, Yancheng Lai, Jiarui Xie and Weiming Mou have contributed equally to this work and share first authorship.
Abstract
Objective
This study aims to systematically investigate the potential harms of Large Language Models (LLMs) in the peer review process.
Background
LLMs are increasingly used in academic processes, including peer review. While they can address challenges like reviewer scarcity and review efficiency, concerns about fairness, transparency and potential biases in LLM-generated reviews have not been thoroughly investigated.
Methods
Claude 2.0 was used to generate peer review reports, rejection recommendations, citation requests and refutations for 20 original, unmodified cancer biology manuscripts obtained from eLife's new publishing model. Artificial intelligence (AI) detection tools (zeroGPT and GPTzero) assessed whether the reviews were identifiable as LLM-generated. All LLM-generated outputs were evaluated for reasonableness by two experts on a five-point Likert scale.
Results
LLM-generated reviews were somewhat consistent with human reviews but lacked depth, especially in detailed critique. The model proved highly proficient at generating convincing rejection comments and could create plausible citation requests, including requests for unrelated references. AI detectors struggled to identify LLM-generated reviews, with 82.8% of responses classified as human-written by GPTzero.
Conclusions
LLMs can be readily misused to undermine the peer review process by generating biased, manipulative, and difficult-to-detect content, posing a significant threat to academic integrity. Guidelines and detection tools are needed to ensure LLMs enhance rather than harm the peer review process.
1 INTRODUCTION
Large language models (LLMs) have gained widespread attention within the academic community for their exceptional text processing capabilities, such as generating paper abstracts and providing assistance with academic writing.1-3 Moreover, there is evidence that LLMs have also been utilised in peer review processes.4, 5 According to a study, between 6.5% and 16.9% of the text in peer review reports for some artificial intelligence (AI) conferences might have been substantially generated or modified by LLMs.6 Another study highlighted that some peer review reports published in academic journals display signs of reviewers using LLMs to compose their reviews.7
As a critical component of academic publishing, peer review serves as a crucial mechanism for ensuring the quality and academic integrity of scientific papers, playing a decisive role in the acceptance or rejection of a research paper. However, peer review faces numerous challenges, including the scarcity of qualified reviewers, variability in evaluation quality, and delays in the review process.8 Therefore, some studies have attempted to integrate the assistance of LLMs into the peer review process,9 as exemplified by ChatReviewer.10 Other research has explored the feasibility of using LLMs for peer review. Liang et al. created an automated pipeline using LLMs to generate peer review reports for published papers and explored researchers' perceptions of these LLM-generated reviews. The findings showed that 82.4% of authors considered LLM reviews to be more beneficial than those provided by human reviewers.11 A study by von Wedel et al. used LLMs to review paper abstracts and make acceptance or rejection decisions, demonstrating that LLMs might help reduce affiliation bias in peer review.12
However, the above-mentioned studies on the use of LLMs in peer review have some limitations. First, existing studies have employed LLMs to review published versions of articles, meaning that the manuscripts submitted to LLMs for review have already been revised based on peer review comments, which differ from the original versions that reviewers assess during the initial review process. Second, due to previous technical limitations, these studies were either unable to input full texts to LLMs12 or relied on external plugins to upload full texts.11 This has led to less accurate assessments of LLMs' performance in peer review. Third, prior research focused on the benefits of using LLMs for peer reviews while neglecting their potential harm. The incorporation of LLMs in peer review could have disruptive consequences for academic publishing, as LLMs have the potential to generate biased peer review reports based on predetermined review inclinations (such as acceptance/rejection) in the prompt. Some malicious individuals might use this to undermine the integrity of the peer review process.
In this study, we used Claude 2.0 (Anthropic) to review the original versions of articles published in eLife under its new publishing model. At the time of our research, Claude was the only AI assistant that allowed users to upload PDF files without payment or third-party plugins, making it more user-friendly than other available AI tools. Under eLife's new publishing model, articles are published as reviewed preprints, accompanied by reviewer reports and the original, unmodified manuscripts used for peer review. The scenarios set in this study therefore closely resemble real-world applications of LLMs for peer review. We explored using LLMs to generate review reports from scratch, generate rejection comments, request that authors cite specific literature, and refute such citation requests. Beyond evaluating the hazards and risks that LLMs introduce into the review process, we also conducted a preliminary exploration of potential solutions to mitigate the drawbacks of using LLMs in peer review.
2 METHODS
2.1 Study design and article acquisition
This study included four tasks: (i) using the LLM to generate review reports from scratch; (ii) using the LLM to recommend rejection; (iii) using the LLM to request a citation; and (iv) using the LLM to refute a citation request (Figure 1). The latest 20 cancer-related research papers, together with their peer review reports, published under the new publishing model in the cancer biology section of eLife between 24 August 2023 and 17 October 2023 were included (Table S1A). We obtained the publicly available peer review reports from the article pages and retrieved the unmodified manuscripts used for peer review via the recorded preprint links. The public release dates of the peer review reports were all later than the cut-off date of the model's training data, ensuring that the review reports were not included in the model's training data. For the ‘Request citation’ task, we collected the titles and abstracts of two articles from each field, totalling six articles from three fields: cancer research, non-oncological medical research and materials science research (Table S1B).

2.2 Collecting responses
The unmodified manuscripts used for peer review were uploaded to the Claude 2.0 model as attachments, using the prompts listed in Table S2. For generating complete peer review reports, we instructed the model to review the uploaded manuscript attachment and write a peer review report comprising three sections: general comments, major comments and minor comments.13 For review comments requesting the citation of specific references, we submitted the original manuscript under review along with the title and abstract of the article to be cited. For rebuttal requests, we uploaded the manuscript, the title and abstract of the article to be cited, and the corresponding review comment requesting citation of that article. Notably, we did not provide the LLM with the actual review comments corresponding to the manuscripts, nor did we provide additional knowledge on how to conduct peer reviews. Each inquiry was repeated to generate three responses,14 with each response generated in an independent session. A researcher who did not participate in the scoring manually deleted irrelevant parts of the responses, such as ‘Here is a draft of the peer review based on the information provided’, retaining only the main body of the review comments generated by the model.
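In the study, manuscripts were uploaded as PDF attachments through the Claude 2.0 interface. For readers who wish to reproduce a similar workflow programmatically, the sketch below illustrates how three independent responses per prompt might be generated via the Anthropic API. It is a minimal illustration under stated assumptions only: the manuscript text is assumed to be already extracted to plain text, the prompt is a paraphrase rather than the exact wording in Table S2, and the model identifier and its continued availability are assumptions rather than a prescription.

```python
# Illustrative sketch only: the study used the Claude 2.0 interface with PDF
# attachments; here we assume the manuscript has already been extracted to text.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

REVIEW_PROMPT = (
    "You are a peer reviewer. Review the manuscript below and write a peer review "
    "report with three sections: general comments, major comments and minor comments.\n\n"
    "MANUSCRIPT:\n{manuscript}"
)

def generate_reviews(manuscript_text: str, n_responses: int = 3) -> list[str]:
    """Generate n review reports, each from a fresh request (no shared history),
    mirroring the study's use of independent sessions."""
    reviews = []
    for _ in range(n_responses):
        message = client.messages.create(
            model="claude-2.0",          # assumed identifier for the model generation used
            max_tokens=1024,
            messages=[{"role": "user",
                       "content": REVIEW_PROMPT.format(manuscript=manuscript_text)}],
        )
        reviews.append(message.content[0].text)
    return reviews
```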
2.3 AI detector
Two popular AI detectors, zeroGPT15 and GPTzero,16 were used to test whether such algorithms could detect that the review comments were generated by an LLM. ZeroGPT uses its proprietary DeepAnalyse deep-learning algorithm, which reportedly distinguishes between human- and LLM-generated text with up to 98% accuracy.15 GPTzero primarily relies on perplexity (the randomness of the text) and burstiness (the variation in perplexity) to determine whether a text was generated by an LLM.16 Both tools return the probability (0–1) that the input was LLM-generated, together with a conclusion (LLM-generated, human-written, or a mix of human and LLM writing) and the confidence level of that conclusion.
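Both detectors' implementations are proprietary. The sketch below only illustrates the general idea of perplexity- and burstiness-based detection using an open language model (GPT-2); the naive sentence splitting and the decision threshold are arbitrary illustrative assumptions, not the detectors' actual logic.

```python
# Illustrative sketch of perplexity/burstiness-style detection; this is NOT
# GPTzero's or zeroGPT's proprietary algorithm, and the cutoff below is arbitrary.
import math
import statistics

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_perplexity(sentence: str) -> float:
    """Perplexity of one sentence under GPT-2 (lower = more predictable text)."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def detect(text: str) -> dict:
    """Score a text by mean perplexity and its sentence-to-sentence variation."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    ppls = [sentence_perplexity(s) for s in sentences]
    return {
        "mean_perplexity": statistics.mean(ppls),   # overall "randomness" of the text
        "burstiness": statistics.pstdev(ppls),      # variation in perplexity across sentences
        "flag_llm": statistics.mean(ppls) < 30,     # arbitrary illustrative threshold
    }
```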
2.4 Evaluation of LLM-generated review comments
Two experts with extensive peer review experience in the field of oncology independently scored the LLM-generated review comments using a five-point Likert scale. For the generated reviews, the reviewers evaluated the degree of consistency between the LLM-generated comments and the actual review comments, with five points representing the highest consistency and one point the lowest. For the remaining tasks, the reviewers assessed the reasonableness of the LLM-generated comments, with five points indicating very reasonable and one point indicating completely unreasonable. For the citation request task, a researcher who did not participate in the scoring selected one reasonable response for each reviewed article; this chosen response was subsequently used to generate the responses refuting the citation request. The two reviewers also scored the level of detail in the actual peer review comments, evaluating whether the real review comments focused on specific details of the article or were merely general remarks. Additionally, the Flesch reading ease score was used to assess the readability of the LLM-generated review comments. This score ranges from 0 to 100, where 10–30 corresponds to a college graduate reading level and 0–10 to a professional level.17
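For reference, the Flesch reading ease score is computed as 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words). The sketch below uses the textstat package as one possible implementation (an assumption; not necessarily the tool used in the study) and maps scores to the bands cited above.

```python
# Readability scoring sketch; textstat is one common implementation of the
# Flesch reading ease formula and is an assumption, not necessarily the study's tool.
import textstat

def readability_band(text: str) -> tuple[float, str]:
    # Flesch reading ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    score = textstat.flesch_reading_ease(text)
    if score < 10:
        band = "professional"
    elif score < 30:
        band = "college graduate"
    else:
        band = "easier than college-graduate level"
    return score, band
```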
2.5 Statistical analysis
Spearman correlation analysis was used to assess associations between variables. The Wilcoxon rank-sum test was used to compare two independent groups. For multiple comparisons, p-values were adjusted using the Benjamini–Hochberg method. Statistical analyses were performed using R (version 4.2.3). The ggplot2 and ggstats R packages were used for visualisation. Two-sided tests with p-values less than .05 were considered statistically significant.
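The analyses were run in R. For illustration, the sketch below shows equivalent Python calls for the named procedures (Spearman correlation, the Wilcoxon rank-sum test via its Mann–Whitney U formulation, and Benjamini–Hochberg correction); the arrays are placeholder data, not values from the study.

```python
# Equivalent Python calls for the analyses described (the study itself used R);
# `lengths`, `scores_a` and `scores_b` are placeholder arrays, not real data.
import numpy as np
from scipy.stats import spearmanr, mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
lengths = rng.integers(150, 450, size=60)   # e.g. response lengths in words
scores_a = rng.integers(1, 6, size=60)      # e.g. Likert scores, group A
scores_b = rng.integers(1, 6, size=60)      # e.g. Likert scores, group B

# Spearman correlation between response length and score
rho, p_spearman = spearmanr(lengths, scores_a)

# Wilcoxon rank-sum (Mann-Whitney U) test between two independent groups
u_stat, p_wilcoxon = mannwhitneyu(scores_a, scores_b, alternative="two-sided")

# Benjamini-Hochberg correction across multiple comparisons
raw_p = [p_spearman, p_wilcoxon, 0.03, 0.20]   # illustrative p-value list
reject, p_adjusted, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
print(rho, p_adjusted)
```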
3 RESULTS
3.1 LLM-generated review comments cannot be identified by AI detectors
The workflow of this study is shown in Figure 1. The LLM was asked to review the unmodified versions of the manuscripts used for peer review from 20 articles in the cancer biology section published under the new publishing model in eLife. The average length of all LLM-generated responses across the four tasks was 218 ± 84 words. The average length of the rejection comments (383 ± 56 words) was significantly greater than that of the directly generated review comments (313 ± 50 words) (Figure 2A, p < .0001). The average length of the comments refuting citation requests (215 ± 48 words) was greater than that of the citation request comments (175 ± 38 words) (Figure 2A, p < .0001). There was a positive correlation between the length of the LLM-generated responses and their scores (Figures S1 and S2). Additionally, the average Flesch reading ease score for each task fell within the range of 10–30 points (Figure 2B), indicating a college graduate level of readability. This suggests that the review comments generated by the LLM employed a professional, academic language style.

Subsequently, we used two AI detectors, GPTzero and zeroGPT, to test whether their algorithms could detect that the review comments were generated by an LLM. For most of the texts, the probability assigned by the algorithms that the text was LLM-generated was below 50% (Figure 3A–D). Specifically, the average LLM-generated probability across all responses was 17.1% for GPTzero and 12.4% for zeroGPT. In terms of the conclusions given by the algorithms, without considering the confidence level, zeroGPT and GPTzero identified 59.7% and 82.8% of the responses as human-written, respectively (Figure 3E,F; Figures S3 and S4). ZeroGPT identified only 27.2% of the responses as LLM-generated and 13.1% as a mix of human and AI writing. GPTzero confidently identified only 13.9% of the responses as LLM-generated and indicated that 3.3% may include parts written by an LLM. In summary, these results suggest that LLMs can already generate text that closely resembles the academic research style and that, in most cases, is difficult for detection tools to identify as LLM-generated.

3.2 LLM can replace human reviewers in some cases
Overall, the two reviewers were not satisfied with the consistency between the AI-generated review comments and the actual review comments (Figure 4A), with only 15.0% and 28.3% of the scores given by the two reviewers being greater than 3. However, we found that the consistency scores of the AI-generated reviews were significantly negatively correlated with the detail scores of the actual review comments (Figure 4B,D): the broader and less detailed the actual review comments, the more closely the AI-generated reviews matched them. Furthermore, based on the scores given by the two reviewers for the original peer review comments, we divided them into two groups (undetailed peer review for scores ≤3, detailed peer review for scores >3). For both Reviewer 1 (p < .0001) and Reviewer 2 (p < .01), the consistency scores in the undetailed peer review group were significantly higher than those in the detailed peer review group (Figure 4C,E). These results suggest that although LLM-generated review comments cannot fully take the place of human reviewers' comments, they can already reach a level comparable to relatively broad and undetailed review comments.

3.3 LLM can provide convincing rejection comments
We next explored the potential harms of using LLM for peer review. LLM demonstrated an unexpected ability to provide rejection comments. The proportion of scores greater than 3 given by the two reviewers was 76.7% and 58.3%, respectively (Figure 4F).
3.4 LLM can generate seemingly reasonable citation requests for unrelated references
It is not uncommon for reviewers to request that authors cite specific references during peer review.18 We additionally collected six articles, two each from journals in cancer research, non-oncological medical research and materials science research, and asked the LLM to generate review comments requesting that the authors cite these references. The scores for requesting citations of cancer research articles were significantly higher than those for the other types of articles (p < .0001, Figure 5A–C). Unsurprisingly, the scores for requesting citations of materials science articles were significantly lower than those for the other types. Notably, some responses requesting citations of materials science articles still received a score of 5. This indicates that the LLM can generate seemingly reasonable and somewhat persuasive citation requests even for irrelevant references, although it may offer rather far-fetched reasons for citing certain articles.

3.5 Using AI to refute unreasonable citation requests
We then attempted to use the LLM to refute its own unreasonable citation requests. Both reviewers gave high scores to all the rebuttal comments, with all scores being ≥3 (Figure 5D). This indicates that LLM can easily generate comments refuting unreasonable citation requests from reviewers, which could play a crucial role in the process of authors responding to reviewers' comments.
4 DISCUSSION
Our study, using Claude as an example, reveals the potential hazards and risks of employing LLMs in peer review. We utilised the unmodified manuscripts used for peer review and obtained the matching actual review comments, making our evaluation closer to real-world scenarios. Our results indicate that LLMs can already generate text consistent with academic research styles and, in most cases, can evade detection by popular AI detection tools. Furthermore, we explored the harmful potential of using LLMs to generate biased review comments. We found that LLMs can easily be used to generate rejection comments and to request that authors cite specific references, presenting a new challenge to the peer review process.
First, our study found that LLM-generated review comments received a certain degree of recognition, but they cannot completely replace human opinions. We observed that LLM-generated comments align more closely with actual review comments only when the actual review comments are relatively broad. While LLMs can replicate the form of academic critique, their grasp of substantive logic and nuanced scientific reasoning may be limited, affecting the true depth of their evaluations. High-quality peer review requires reviewers to have deep background knowledge of the discipline, an understanding of the development trends in the research field, and the experience and insight to identify innovation; these are areas where current LLMs fall short. At present, LLM-generated review comments do not reach the level of high-quality reviews, and such lower-quality reports can neither effectively improve manuscripts nor screen out manuscripts unsuitable for publication, which defeats the primary purpose of peer review. Therefore, over-reliance on LLMs for peer review could compromise the effectiveness of quality control in academic publishing.
Second, LLMs seem to be quite adept at providing reasons for rejecting papers; theoretically, no paper is perfect. While this capability may help researchers identify deficiencies in their manuscripts before submission and thereby improve their quality, it could also be misused. If a malicious reviewer intentionally uses LLMs to generate rejection comments during peer review, research that should have been published could be wrongly rejected. As editors cannot be experts in all fields, a seemingly compelling rejection comment could prompt the editor to reject the manuscript. Even though authors may appeal to reverse the decision, a lengthy appeal process could delay publication, affecting the novelty and impact of the paper. This not only discourages researchers but also hinders the dissemination and development of scientific knowledge. Conversely, paper mills seeking to manipulate peer review might use this capability to write detailed reports recommending that the editor accept the manuscript directly or with minor revisions, thereby achieving their goal of manipulating the peer review process.19
Additionally, we also found that LLMs can generate seemingly reasonable and persuasive citation requests for completely unrelated references, highlighting the potential risks LLMs pose to academic integrity. Some reviewers might use LLMs to generate specific citation requests in an attempt to manipulate citation metrics, such as increasing their own H-index. Such behaviour not only violates academic ethics but also compromises the quality of the reviewed papers by introducing irrelevant citations.
Lastly, our study also explored possible countermeasures, such as using LLMs to refute unreasonable citation requests. This provides a new approach to addressing academic misconduct. Furthermore, by leveraging the powerful text capabilities of LLMs, we might be able to more efficiently identify and respond to inappropriate citation behaviours, thereby maintaining the health of the academic ecosystem.
In summary, our research highlights the potential detrimental effects that the application of LLMs may have on the integrity and reliability of the peer review process. We urge the academic community to recognise the impact of LLMs on peer review and to take swift action. For authors and reviewers, just as authors are required to declare the use of generative AI tools in writing their papers, journals should also require reviewers to disclose whether they have used such tools in preparing their peer review reports; indeed, for data security reasons, it might even be necessary to prohibit reviewers from using AI for peer review to prevent the leakage of unpublished manuscript content. For publishers and editors, it is crucial to develop operational guidelines for handling suspected AI-generated reviews. Such a protocol could use detection tools not for automatic rejection but as a flag for closer scrutiny, allowing editors to assess the review's substance before deciding whether to discount its influence, discard it or contact the reviewer for clarification. For LLM service providers, discussions on the feasibility of technical restrictions are vital; these could include exploring solutions such as digital watermarking for traceability or developing specialised ‘academic modes’ to prevent misuse. At a minimum, clear terms of service prohibiting the input of confidential manuscript data should be established to create ethical and legal guardrails.
Our study has certain limitations. Due to the time-consuming nature of manual peer review, we included only 20 articles. Additionally, we selected research papers from the field of oncology based on our expertise, which may not be representative of all fields. The pros and cons of using LLMs in peer review require further and larger-scale evaluation.
Overall, the potential disruptive impact of LLMs on peer review cannot be ignored. In the future, robust norms and regulatory mechanisms will be needed to maximise the positive contributions of AI technology while effectively mitigating its potential risks. This will require the joint efforts of the academic community, publishers, technology companies and other stakeholders to chart the development path of academic publishing in the LLM era.
AUTHOR CONTRIBUTIONS
Lingxuan Zhu, Yancheng Lai, Jiarui Xie, Weiming Mou: conceptualisation; investigation; writing—original draft; methodology; literature review; writing—review and editing; visualisation. Haoyun Haungli, Chang Qi, Tao Yang, Aiming Jiang, Wenyi Gan, Dongqiang Zeng: conceptualisation; writing—review and editing; investigation. Bufu Tang, Mingjia Xiao, Guangdi Chu, Zaoqu Liu: writing—review and editing; visualisation. Quan Cheng, Anqi Lin, Peng Luo: conceptualisation; literature review; project administration; supervision; resources; writing—review and editing. All authors have read and approved the final manuscript.
CONFLICT OF INTEREST STATEMENT
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
FUNDING INFORMATION
Not applicable.
ETHICS STATEMENT
Not applicable.
DATA AVAILABILITY STATEMENT
Data generated or analysed during this study are included in this published article and its supplementary information files. Any additional information required is available upon request.