Artificial Intelligence–Based Psychotherapeutic Intervention on Psychological Outcomes: A Meta-Analysis and Meta-Regression
Abstract
Background: Artificial intelligence (AI)–based psychotherapeutic interventions may bring a new and viable approach to expanding psychiatric care. However, evidence of their effectiveness remains scarce. We evaluated the efficacy of AI-based psychotherapeutic interventions on depressive, anxiety, and stress symptoms at postintervention and follow-up assessments.
Methods: A three-step comprehensive search via nine electronic databases (PubMed, Embase, CINAHL, Cochrane Library, Scopus, IEEE Xplore, Web of Science, PsycINFO, and ProQuest Dissertations and Theses) was performed.
Results: Thirty randomized controlled trials (RCTs) in 31 publications involving 6100 participants from nine countries were included. The majority (79.1%) of trials with intention-to-treat analysis but fewer than half (48.6%) of trials with per-protocol analysis were graded as low risk of bias. Meta-analyses showed that interventions significantly reduced depressive symptoms at the postintervention assessment (t = −4.40, p = 0.001) with medium effect size (g = −0.54, 95% CI: −0.79 to −0.29) and at 6–12 months of assessment (t = −3.14, p < 0.016) with small effect size (g = −0.23, 95% CI: −0.40 to −0.06) in comparison with comparators. Our subgroup analyses revealed that the depressed participants had a significantly larger effect size in reducing depressive symptoms than participants with stress and other conditions. At postintervention and follow-up assessments, AI-based psychotherapeutic interventions did not significantly alter anxiety, stress, or the total scores of depressive, anxiety, and stress symptoms in comparison with comparators. The random-effects univariate meta-regression did not identify any significant covariates for depressive and anxiety symptoms at postintervention. The certainty of evidence ranged between moderate and very low.
Conclusions: AI-based psychotherapeutic interventions can be used in addition to usual treatments for reducing depressive symptoms. Well-designed RCTs with long-term follow-up data are warranted.
Trial Registration: CRD42022330228
1. Introduction
The World Health Organization (WHO) reported that approximately 1 billion people worldwide struggle with some form of psychological problem, and such problems are the leading cause of years lived with disability [1]. Between 1990 and 2019, the global number of disability-adjusted life years due to mental disorders increased from 80.8 million to 125.3 million [2]. Many people with mental disorders do not receive treatment because of a shortage of therapists, stigmatization, transport costs, and expensive consultation fees [3, 4]. During the coronavirus disease of 2019 (COVID-19) pandemic, an additional 53.2 million cases of major depressive disorders and 76.2 million cases of anxiety disorders were found globally [5]. A systematic review found that the global prevalence of depression, anxiety, and stress among the general population during the COVID-19 pandemic ranged from 25.18% to 29.57% [6]. Restrictions and lockdowns during the pandemic increased the urgency of improving the accessibility of psychotherapy [7]. The WHO found that mental health interventions are insufficient and inadequate globally [1], prompting the utilization of new technology to meet the needs.
In parallel with the advancements in artificial intelligence (AI) technology, psychotherapy has begun to incorporate AI techniques for creating psychotherapeutic interventions [8, 9]. This human–computer interaction technology is believed to be intelligent enough to comprehend the conversation between a patient and a chatbot therapist based on machine learning (ML) algorithms [4, 10]. Such applications could help prevent and treat behavioral and psychiatric issues and prevent relapse [11]. According to Bendig et al. [11], AI chatbots, also known as conversational or relational agents, are machine conversation systems that interact with human users using various AI technologies. Responses can be generated using a rule-based model (predefined rules or decision tree), natural language processing (NLP), or ML through text-based or speech-enabled conversations [4, 7]. AI chatbots try to talk like humans, including the emotional, social, and relational parts of natural conversation [11]. In doing so, they imitate a therapeutic conversational style that can help transfer therapeutic content to users and mirror therapeutic processes [7, 12].
According to Boucher et al. [7] and Vaidyam et al. [4], AI chatbots are thought to possess sufficient intelligence to comprehend conversations with human users using written, spoken, and visual language through an interactive interface. Some scientists have developed an avatar, a computer-generated character, as an embodied conversational agent in an intervention aimed at improving usability and intention to use [13]. An embodied agent can emulate some human interactions, including gaze, speech, hand gestures, and other nonverbal modalities [13]. Different platforms, including websites, mobile applications, short message services, virtual reality, and smart technology, can integrate AI chatbots to perform various functions such as therapy, counseling, monitoring, engagement, adherence, or psychoeducation [7].
Psychotherapeutic interventions can use AI chatbots and tailor them to specific populations [4]. This innovative approach may improve the shortage of therapists and engagement in therapy [4, 13]. AI-based psychotherapeutic interventions offer several advantages when used, including lowering the stigma associated with therapy, fostering a comfortable environment for self-disclosure, being cost-effective, reducing travel time, eliminating geographical restrictions, freeing up human resources, and broadening overall accessibility [4, 10]. Hence, AI-based psychotherapeutic interventions may offer a potential solution to overcome barriers and expand psychiatric care.
Given that AI-based psychotherapeutic intervention is an emerging field, several types of review have been published, including three integrative reviews [4, 7, 10], four scoping reviews [8, 11, 14, 15], one mixed-method review [16], and three systematic reviews [9, 17, 18].
Boucher et al. [7] highlighted the potential integration of AI-based chatbots into digital mental health interventions. Pham, Nabizadeh, and Selek [10] described different AI-based interventions and their clinical practices. Vaidyam et al. [4] explored the roles of conversational agents, or chatbots, in the screening, diagnosis, and treatment of mental illness. This integrative review suggested that an AI-supported intervention could enhance engagement, although its therapeutic effect was underreported [4]. According to these integrative reviews [7, 10], future research should focus on utilizing randomized controlled trials (RCTs) to investigate the efficacy of AI psychotherapeutic interventions.
A mixed-method review [16] aimed to evaluate the use of conversational agent interventions in the treatment of mental health problems. The scoping reviews examined how chatbots have been developed and used in public health [15], how they are used for mental health [8, 14], and how useful, acceptable, and practicable they are in clinical psychology and psychotherapy [11]. Results regarding the practicability, feasibility, and acceptability of AI-supported interventions for mental problems were promising [14, 16], but there was a lack of consensus on reporting and evaluation for chatbots [8] and no direct transferability to the psychotherapeutic context [11]. Hence, more reviews are required to demonstrate their efficacy [14].
We found three systematic reviews [9, 17, 18] relating to the efficacy of AI-based psychotherapeutic intervention in existing literature. Gual-Montolio et al. [9] aimed to use AI-based methods to enhance outcomes in psychological interventions in real-time or close to real-time. Li et al. [17] and Lim et al. [18] examined the feasibility and/or effectiveness of AI-based psychotherapeutic interventions. However, these reviews had certain limitations. These included relying on a limited number of databases [9], combining AI-based and non-AI-based interventions [18], only providing a narrative synthesis [9, 17], focusing only on depressive symptoms as an outcome [18], and incorporating different research designs [9].
Emerging evidence has shown that AI-based psychotherapeutic intervention may improve psychological outcomes. However, relatively few reviews have investigated the long-term effects of interventions. To fill this gap, the current review aims to evaluate the efficacy of AI-based psychotherapeutic interventions on depressive, anxiety, and stress symptoms at postintervention and follow-up assessments.
2. Material and Method
This systematic review was reported following the preferred reporting items for systematic reviews and meta-analyses (PRISMA) (Table S1) [19].
2.1. Eligibility Criteria
Given that RCTs are considered the gold standard for evaluating the effectiveness of interventions [20], only RCTs were included in this review. The population targeted adults aged ≥18 years old with or without medical, psychological, and behavioral problems. The intervention used a conversational (chatbot) interface to deliver any form of psychotherapy with self-guided or therapist support incorporating AI technology. Response generation contained rule-based or other AI technologies. Input and output modalities involved written, spoken, visual, or emoji. The presentation could use either an embodied or nonembodied chatbot. The comparator included treatment as usual, waitlist, placebo control, or any type of intervention. The psychological outcomes included depressive, anxiety, and stress symptoms at postintervention and follow-up assessments. No restrictions were imposed on the population and publication date. This review included published and unpublished trials in the English language [21]. The details of the eligibility criteria can be found in Table S2.
2.2. Search Strategy
A scoping search for existing systematic reviews with similar aims was conducted in the Cochrane Database of Systematic Reviews, Joanna Briggs Institute, and the PROSPERO database to prevent any duplication. An iterative process was used to develop the search terms. An initial keyword search comprising “artificial intelligence” AND “psychological outcomes” was used to conduct a simple search. After the inclusion of potential articles, the search terms and keywords were revised in consultation with a university librarian. The eventual search terms comprising both keywords and index terms for the respective databases can be found in Table S3.
Following the development of the search terms, a three-step search [22] was conducted from inception to February 9, 2023. First, a search was conducted in nine English electronic databases (PubMed, Embase, CINAHL, Cochrane Library, Scopus, IEEE Xplore, Web of Science, PsycINFO, and ProQuest Dissertations and Theses) to locate relevant articles. Second, a search for unpublished trials was conducted in three clinical trial registries (ANZCTR, ISRCTN, and CenterWatch), and an email was sent to all corresponding authors to obtain information about their trials. Finally, a hand search of the reference lists of the selected studies and gray literature was conducted to maximize the search. We contacted the authors via email for additional data when the information included in their publication was insufficient.
2.3. Study Selection
EndNote X20 was used to manage the retrieved citations from the search. Duplicates were removed using automated and manual functions. Two authors (W.W. and S.H.W.) screened all the articles by title and abstract, with reference to the eligibility criteria. When disagreements occurred, a third author (L.Y.) was consulted. Inter-rater reliability was measured using Cohen’s kappa (κ), with −1 suggesting an absence of agreement and 1 indicating perfect agreement [23]. Values greater than 0.75 were considered excellent agreement, whereas values between 0.40 and 0.75 were considered good agreement [24].
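To illustrate how such an agreement statistic is obtained, the following minimal Python sketch computes Cohen's kappa from two raters' screening decisions. The function name `cohens_kappa` and the example decisions are hypothetical; they are not code or data from this review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical decisions on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical title/abstract screening decisions (illustration only).
screen_a = ["include", "include", "exclude", "exclude", "include",
            "exclude", "exclude", "exclude", "include", "exclude"]
screen_b = ["include", "include", "exclude", "exclude", "exclude",
            "exclude", "exclude", "exclude", "include", "exclude"]
kappa = cohens_kappa(screen_a, screen_b)
```

In this made-up example the raters agree on nine of ten records, and the resulting kappa falls above the 0.75 threshold described above.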
2.4. Data Management and Extraction
The data extraction form was designed with reference to the Cochrane Handbook [22]. Two reviewers (W.W. and S.H.W.) extracted all the data independently. The data elements extracted included trial characteristics (number of studies, author, publication year, country, recruitment setting, design, nature of participants, mean age, gender distribution, AI-based psychotherapeutic intervention, name, comparator, sample size, psychological outcomes, measures, attrition rate, intention-to-treat analysis [ITT], missing data management [MDM], protocol, trial registration, and grant support), description of the intervention (intervention content, type of AI chatbot, psychological principle, duration of the intervention, follow-up assessment, frequency of use, and mean amount of time engagement in minutes), and psychological outcomes (depressive, anxiety, and stress symptoms) at postintervention and follow-up assessments (mean, standard deviation, and total numbers).
2.5. Risk of Bias Version 2 (RoB 2.0)
The Cochrane risk-of-bias tool version 2 (RoB 2.0) [25] was used to appraise the methodological quality of all included studies. Two independent reviewers (W.W. and D.A.) assessed the risk of bias using the RoB 2.0 Excel tool. The risk of bias was evaluated against the following five domains of bias: (1) randomization process, (2) deviations from intended intervention, (3) missing outcome data, (4) measurement of the outcome, and (5) selection of the reported result [25]. The two reviewers responded to the signaling questions in each domain by selecting “yes,” “probably yes,” “probably no,” “no,” or “no information.” The RoB 2.0 algorithm then rates the risk of bias as “low,” “high,” or “some concerns” [25].
2.6. Certainty of Evidence
The grading of recommendations, assessment, development, and evaluation (GRADE) criteria was used to assess the overall certainty of evidence [26]. To determine the certainty of evidence, two reviewers (D.A. and L.Y.) independently evaluated the studies based on the following domains: risk of bias, inconsistency, indirectness, imprecision, and effect. The ratings were classified as very low, low, moderate, or high, and the decision was determined based on justifications [26]. Publication bias was determined using the Egger regression test [27] and funnel plot of precision using standardized mean difference [28]. Publication bias was ascertained using a p-value of less than 0.05 from the Egger test and asymmetrical funnel plot [29].
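The Egger test can be sketched as an ordinary least-squares regression of the standardized effect (effect divided by its standard error) on precision (the inverse of the standard error), where an intercept far from zero suggests funnel plot asymmetry. The Python sketch below is illustrative only: the function name `egger_test` and the input values are hypothetical, and it returns the t statistic rather than a p-value, which would additionally require a t distribution with k − 2 degrees of freedom.

```python
import math


def egger_test(effects, standard_errors):
    """Egger regression: regress effect/SE on 1/SE and test the intercept."""
    k = len(effects)
    if k != len(standard_errors) or k < 3:
        raise ValueError("need at least three studies with matching inputs")
    x = [1.0 / se for se in standard_errors]                 # precision
    y = [g / se for g, se in zip(effects, standard_errors)]  # standardized effect
    x_bar, y_bar = sum(x) / k, sum(y) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be identical")
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    df = k - 2
    residual_var = sum((yi - intercept - slope * xi) ** 2
                       for xi, yi in zip(x, y)) / df
    se_intercept = math.sqrt(residual_var * (1.0 / k + x_bar ** 2 / sxx))
    # The t statistic is undefined when the regression fits perfectly.
    t_stat = intercept / se_intercept if se_intercept > 0 else math.nan
    return intercept, slope, t_stat, df


# Hypothetical effect sizes and standard errors (illustration only).
intercept, slope, t_stat, df = egger_test([2.0, 0.5, 0.75], [1.0, 0.5, 0.25])
```

The t statistic would then be compared against a t distribution with `df` degrees of freedom to obtain the p-value that is judged against the 0.05 threshold.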
2.7. Data Synthesis
We used the meta [30] and metafor [31] packages of R software to conduct the meta-analysis, subgroup analysis, and meta-regression analysis. A prediction interval (PI) based on the t-distribution (t) was used to predict the range of true effects for future trials with similar settings [32]. A 95% PI estimates the range within which the effect of the next trial is expected to fall with 95% probability. A statistically significant effect is expected for a future trial if all values of the 95% PI are on the same side of the null of 0, whereas a nonsignificant effect is expected if the 95% PI spans the null of 0 [32]. Hedges’ g was used to express the effect size because of its precision for studies with small sample sizes [33, 34]. A random-effects model was used because the true treatment effect was assumed to vary across studies [35]. The restricted maximum likelihood method was used as the estimator for the random-effects meta-analysis to provide unbiased estimates [36]. The Hartung–Knapp adjustment for random-effects models was selected to prevent counterintuitive effects [37]. A 95% confidence interval was used to communicate the precision of the summary estimate and derive the p-value [22].
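As a worked illustration of the effect size used here, the following Python sketch computes Hedges' g and its approximate sampling variance from the means, standard deviations, and sample sizes of two trial arms. It uses the common small-sample correction factor 1 − 3/(4·df − 1); the function name `hedges_g` and the input values are hypothetical, and the review itself relied on the R packages cited above rather than on this code.

```python
import math


def hedges_g(mean_i, sd_i, n_i, mean_c, sd_c, n_c):
    """Hedges' g (intervention minus comparator) and its approximate variance."""
    df = n_i + n_c - 2
    # Pooled standard deviation across the two arms.
    pooled_sd = math.sqrt(((n_i - 1) * sd_i ** 2 + (n_c - 1) * sd_c ** 2) / df)
    d = (mean_i - mean_c) / pooled_sd           # Cohen's d
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)   # small-sample correction factor
    g = correction * d
    var_d = (n_i + n_c) / (n_i * n_c) + d ** 2 / (2.0 * (n_i + n_c))
    var_g = correction ** 2 * var_d
    return g, var_g


# Hypothetical symptom scores: intervention arm vs. comparator arm.
g, var_g = hedges_g(mean_i=10.0, sd_i=2.0, n_i=20, mean_c=12.0, sd_c=2.0, n_c=20)
```

A negative g in this sketch means that the intervention arm had lower symptom scores than the comparator arm, which matches the sign convention of the pooled estimates reported in this review.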
Heterogeneity was assessed using the Cochran Q test and I² values [22]. A Cochran Q test p-value of <0.01 together with I² > 50% indicated heterogeneity, and the I² value quantified its extent [22]. Additional subgroup and meta-regression analyses were conducted to explore the reasons for heterogeneity [22].
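The heterogeneity statistics can be illustrated with a short Python sketch that computes Cochran's Q, I², and a random-effects pooled estimate. Note the assumptions: for brevity it substitutes the closed-form DerSimonian–Laird estimator of the between-study variance for the restricted maximum likelihood estimator used in this review, it omits the Hartung–Knapp adjustment, and the function name and input values are hypothetical.

```python
import math


def heterogeneity(effects, variances):
    """Cochran's Q, I-squared (%), DerSimonian-Laird tau^2, and pooled effect."""
    k = len(effects)
    if k != len(variances) or k < 2:
        raise ValueError("need at least two studies with matching inputs")
    w = [1.0 / v for v in variances]            # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # DerSimonian-Laird between-study variance (swapped in for REML).
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return {"Q": q, "df": df, "I2": i2, "tau2": tau2,
            "pooled": pooled, "se": se_pooled}


# Hypothetical Hedges' g values and variances for three trials.
result = heterogeneity([-0.2, -0.5, -0.8], [0.04, 0.04, 0.04])
```

The Q statistic would be referred to a chi-square distribution with `df` degrees of freedom to obtain the p-value used in the heterogeneity criterion above.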
Subgroup analyses were conducted for predetermined groups based on the nature of participants (depression ± others, stress/distress ± others, other condition, or healthy), age group (18–30, 31–40, 41–50, or >50), type of AI-based chatbot (Deprexis vs. others), comparator (passive vs. active), type of psychotherapy (cognitive behavioral therapy [CBT] vs. others), type of platform (Internet vs. others), response generation (rule-based vs. NLP), embodiment (yes vs. no), use of ITT/MDM (yes vs. no), and protocol publication/trial registration (yes vs. no). Significant subgroup differences were determined based on the Q statistic with a subgroup effect of p < 0.1 [38]. Meta-regression analyses were conducted to examine the effects of potential covariates (publication year, duration of intervention, sample size, attrition rate, and proportion of males) on the psychological symptoms. The relationships were expressed using the coefficient β, which represents the change in the value of depressive symptoms per unit change in the covariate [39]. A p-value of <0.05 was used to conclude the association between the covariate and outcomes based on effect size [22].
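A univariate meta-regression of this kind can be sketched as a weighted least-squares fit of the study effect sizes on a single covariate, with weights equal to the inverse of the within-study variance plus the between-study variance. The Python sketch below is a simplified illustration under stated assumptions: the between-study variance `tau2` is supplied by the caller rather than estimated, the standard error is the model-based one without the Hartung–Knapp adjustment, and the function name and data are hypothetical.

```python
import math


def meta_regression(effects, variances, covariate, tau2=0.0):
    """Weighted least-squares meta-regression with one covariate.

    Returns the intercept, the slope (beta), and the model-based
    standard error of beta, treating the weights as known.
    """
    k = len(effects)
    if not (k == len(variances) == len(covariate)) or k < 3:
        raise ValueError("need at least three studies with matching inputs")
    w = [1.0 / (v + tau2) for v in variances]
    sum_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, covariate)) / sum_w
    y_bar = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, covariate))
    if sxx == 0:
        raise ValueError("the covariate must vary across studies")
    beta = sum(wi * (xi - x_bar) * (yi - y_bar)
               for wi, xi, yi in zip(w, covariate, effects)) / sxx
    intercept = y_bar - beta * x_bar
    se_beta = math.sqrt(1.0 / sxx)
    return intercept, beta, se_beta


# Hypothetical data: effect size versus intervention duration in weeks.
weeks = [4, 8, 12, 16]
g_values = [-0.3, -0.5, -0.7, -0.9]
g_variances = [0.04, 0.02, 0.05, 0.03]
intercept, beta, se_beta = meta_regression(g_values, g_variances, weeks, tau2=0.01)
```

Here `beta` is the change in effect size per additional week of intervention in the made-up data; dividing it by `se_beta` gives the test statistic behind the p-value threshold described above.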
3. Results
The outcomes of the three-step search are shown in Figure 1. A total of 13,521 articles were retrieved from nine electronic databases and three clinical trial registries. Ten records were found in trial registries and excluded, with the reason for each exclusion provided in Table S4. Following the removal of 2389 duplicates, a total of 11,132 articles were screened based on their title and abstract. Twenty-five records were identified from websites, organizations, and citation searching. Fifty-seven articles from both sources were assessed in full text for eligibility. Twenty-six articles were excluded, and the reasons are documented in Table S5. A total of 30 RCTs in 31 publications with a study number ranging from 1 to 30 [12, 40–68] were included in this systematic review and meta-analysis.

3.1. Trial Characteristics
The characteristics of the 30 trials evaluating the effect of AI-based psychotherapeutic interventions involving 6100 participants can be found in Table 1. The trials were published from 2009 [12] to 2022 [70]. They were conducted in Argentina (n = 1) [71], China (n = 2) [70, 72], Germany (n = 10) [11, 12, 26, 39, 40, 42, 48, 50–52], Italy (n = 2) [7, 44], Korea (n = 2) [16, 47], Romania (n = 1) [73], the United Kingdom (n = 3) [8, 41, 74], the United States (n = 5) [2, 9, 23, 45, 46], and three countries (n = 1) [43]. The participants were recruited from the community (n = 21) (1–3,5,6,8,9,11–16,19,20–23,25,27,28), clinical setting (n = 6) [11, 16, 26, 39, 43, 47], and a mixture of both (n = 3) [48, 50, 52]. Twenty-five trials adopted a two-arm RCT, three trials [40, 49, 72] adopted a three-arm RCT, one trial [7] adopted a four-arm RCT, and one trial [23] adopted a crossover design. The sample sizes of the trials ranged from 21 [44] to 1013 [48]. Half of them reported follow-up outcomes after postassessment, which ranged from 2 weeks [74] to 12 months [48].
Number | Author, year | Country/recruitment | Design | Nature of participants (criteria) | Mean age (gender portion) | AI-based psychotherapeutic intervention (name) | Type of comparator (name) | Sample size | Psychological outcomes (measures) | Follow-up | Attri rate (%) | ITT/MDM | Protocol/registry/grant |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1. | Beevers et al. [40] | United States/community | Two-arm RCT | Adults with depression (QIDS-SR ≥10) | | Internet-based intervention (Deprexis) | | | Depressive symptoms (QIDS-SR and HRSD-17) | No | 20.4 | Y/Y | N/Y/Ya |
2. | Bennion et al. [41] | United Kingdom/community | Two-arm RCT | Older adults (>50) with emotional distress (NR) | | Internet-based conversational agent (Chatbot, MYLO) | | | Depressive, anxiety, and stress symptoms (DASS-21) | 2 weeks | 12.5 | N/N | N/N/Ya |
3. | Berger et al. [42] | Sweden and Germany/community | Three-arm RCT | | | | | | Depressive symptoms (BDI-II) | 6 months | 0 | Y/Y | N/N/Ya |
4. | Berger et al. [43] | Germany/clinical | Two-arm RCT | Adults with depression (BDI-II >13)/unipolar affective disorder | | Internet-based intervention (Deprexis) + psychotherapy | Active control (psychotherapy) | | Depressive (BDI-II), anxiety (GAD-7) symptoms | 6 months | 29.6 | Y/Y | Y/Y/Ya |
5. | Bird et al. [44] | United Kingdom/community | Two-arm RCT | Adults (students and staff in university) with distress | | Internet-based conversational agent (Chatbot, MYLO) | Active control Internet-based conversational agent (Chatbot, ELIZA) | | Depressive, anxiety, and stress symptoms (DASS-21) | 2 weeks | 0 | Y/Y | N/N/Ya |
6. | Bücker et al. [45] | Germany/community | Two-arm RCT | Adults with gambling and mood problems (NR) | | Internet-based intervention (Deprexis) | | | Depressive (PHQ-9), anxiety (GAD-7) symptoms | No | 47.9 | Y/Y | Y/Y/Yb |
7. | Burton et al. [46] | Romania, Spain, and United Kingdom/clinical | Two-arm RCT | Adults with major depressive disorder (NR) | | Embodied virtual agent-based system (Help4Mood) | | | Depressive symptoms (BDI-II, QIDS-SR) | No | 25.0 | Y/N | N/Y/Ya |
8. | Danieli et al. [47] | Italy/community | Two-arm RCT | Adults with distress, anxiety, and depression (NR) | | Mobile-based (TEO) intervention (Chatbot, m-PHA) + SMT-CBT | | | Depressive and anxiety (SCL-90-R), stress (PSS) symptoms | 3 months | 0 | N/N | N/Y/Ya |
9. | Danieli et al. [48] | Italy/community | Four-arm RCT | Older adults stress and anxiety symptoms (NR) | | | | | Depressive (SCL-90-R and PHQ-8), anxiety (SCL-90-R and GAD-7), and stress (PSS) symptoms | 3 months | 5 | N/N | N/Y/Ya |
10. | Fischer et al. [49] | Germany/clinical | Two-arm RCT | Adults with multiple sclerosis and depressive symptoms (NR) | | Internet-based intervention (Deprexis) | Passive control (waitlist) | | Depressive symptoms (BDI) | 3 months | 21.1 | Y/Y | N/Y/Ya |
11. | Fitzpatrick, Darcy, and Vierhile [12] | United States/community | Two-arm RCT | Young adults with anxiety and depressive symptoms (NR) | | Computer-based/mobile-based intervention (Chatbot, Woebot) | Active control (eBook on depression) | | Depressive (PHQ-9), anxiety (GAD-7) symptoms | No | 20.0 | Y/Y | N/N/Yc |
12. | Fitzsimmons-Craft et al. [50] | United States/community | Two-arm RCT | Female young adults with risk of eating disorders (NR) | | Internet-based (StudentBodies) intervention (Chatbot, Tessa) | Passive control (waitlist) | | Depressive (PHQ-8), anxiety (GAD-7) symptoms | 6 months | 37.3 | Y/Y | N/Y/Ya |
13. | Gaffney et al. [51] | United Kingdom/community | Two-arm RCT | Young adults (students in university) with distress | | Internet-based conversational agent (Chatbot, MYLO) | | | Depressive, anxiety, and stress symptoms (DASS-21) | 2 weeks | 10.4 | N/N | N/N/Ya |
14. | Guțu et al. [52] | Romania/community | Two-arm RCT | Young adults from social media | | Computer-based/mobile-based intervention (Chatbot, Woebot) | Active control (psychoeducational daily email) | | Depression, and anxiety symptoms (DASS-21) | No | 55.2 | Y/Y | N/N/NR |
15. | He et al. [53] | China/community | Three-arm RCT | Young adults with depressive symptoms (CSMHSS: 2–3) | | Internet-based intervention (Chatbot, XiaoE) | | | Depressive symptoms (PHQ-9) | 1 month | 15.5 | Y/Y | N/Y/Ya |
16. | Hunt et al. [54] | United States/community | Crossover trial | Adults with IBS (by physician or Rome IV criteria) | | Mobile-based intervention (Chatbot, Zemedy) | Passive control (waitlist) | | Depressive (PHQ-9), and anxiety symptoms (DASS-21) | 3 months | 28.1 | Y/Y | N/Y/Yb |
17. | Jang et al. [55] | Korea/clinical | Two-arm RCT | Adults with attention-deficit (ADHD score: 4/6 items) | | Mobile-based intervention (Chatbot, Todaki) | Active control (self-help information of ADHD) | | Depressive (QIDS-SR), anxiety (SAS), and stress (PSS) symptoms | No | 19.6 | Y/N | N/N/Yc |
18. | Klein et al. [56] | Germany/clinical and community | Two-arm RCT | Adults with depressive symptoms (PHQ-9: 5–14) | | Internet-based intervention (Deprexis) | | | Depressive symptoms (PHQ-9, HDRS-24) | 12 months | 21.6 | Y/N | Y/Y/Ya |
19. | Klos et al. [57] | Argentina/community | Two-arm RCT | Young adults (students from university) | | Internet-based intervention (Chatbot, Tess) | Active control (psychoeducation eBook on affective symptoms) | | Depressive (PHQ-9), anxiety (GAD-7) symptoms | No | 59.7 | N/N | N/N/NR |
20. | Liu et al. [58] | China/community | Two-arm RCT | Young adults with depressive symptoms (PHQ-9 ≥ 9) | | Pipeline-based intervention (Chatbot, XiaoNan) | Active control (self-help bibliotherapy intervention) | | Depressive (PHQ-9), anxiety (GAD-7) symptoms | No | 24.1 | Y/Y | N/N/No |
21. | Ly, Ly, and Andersson [59] | Sweden/community | Two-arm RCT | Young adults (students from universities, website, and social media) | | Mobile-based intervention (Chatbot, Shim) | Passive control (waitlist) | | Stress symptoms (PSS) | No | 0 | Y/N | N/N/Ya |
22. | Maeda et al. [60] | Japan/community | Three-arm RCT | Female young adults who want a baby | | Internet-based intervention (online chatbot for fertility education) | | | Anxiety symptoms (STAI) | No | 0 | Y/N | N/Y/Ya,c |
23. | Meyer et al. [61] | Germany/community | Two-arm RCT | Adults with depressive symptoms (NR) | | Internet-based intervention (Deprexis) | | | Depressive symptoms (BDI) | 6 months | 45.5 | Y/Y | N/N/NR |
24. | Meyer et al. [62] | Germany/clinical and community | Two-arm RCT | Adults with depressive symptoms (PHQ-9: 15–27) | | Internet-based intervention (Deprexis) | | | Depressive (PHQ-9), anxiety (GAD-7) symptoms | 6 months | 17.8 | Y/Y | N/Y/Yc |
25. | Moritz et al. [63] | Germany/community | Two-arm RCT | Adults with depressive symptoms (NR) | | Internet-based intervention (Deprexis) | | | Depressive symptoms (BDI) | NR | 19.0 | Y/N | N/Y/Yc |
26. | Oh et al. [64] | Korea/clinical | Two-arm RCT | Adults with panic symptoms (MINI) | | Mobile-based intervention (Chatbot, Todaki) | Active control (book for panic disorder) | | Depressive and anxiety symptoms (HADS) | No | 8.89 | N/N | N/N/Yc |
27. | Prochaska et al. [65] | United States/community | Two-arm RCT | Adult with substance misuse (CAGE-AID > 1) | | Computer-based/mobile-based intervention (Chatbot, Woebot) | Passive control (waitlist) | | Depressive (PHQ-8), anxiety (GAD-7) symptoms | 8 weeks | 15.6 | Y/N | N/Y/NR |
28. | Sandoval et al. [66] | United States/community | Two-arm RCT | Adults with MDD or dysthymic disorder (DSM-IV-TR, PHQ-9 > 9) | | Interactive media based, computer-delivered depression treatment program (imbPST) | | | Depressive symptoms (BDI-II, HSCL-20-d) | No | 0 | N/N | N/N/Ya |
29. | Schroder et al. [67] | Germany/clinical and community | Two-arm RCT | Adults with epilepsy (PESOS) and depressive symptoms (NR) | | Internet-based intervention (Deprexis) | | | Depressive symptoms (BDI) | No | 26.9 | Y/N | N/Y/NR |
30. | Zwerenz et al. [68, 69] | Germany/clinical | Two-arm RCT | Adults with depressive symptoms (BDI-II >13, ICD-10) | | Internet-based intervention (Deprexis) | | | Depression (BDI-II) | 6 months | 13.5 | Y/Y | Y/Y/Ya |
- Abbreviations: ADHD, attention-deficit/hyperactivity disorder; ADHD score, Attention-Deficit/Hyperactivity Disorder Self-Rating Scale Version 1.1 regardless of psychiatric diagnosis; AI, artificial intelligence; Attri, attrition rate; BDI, Beck Depression Inventory; BDI-II, Beck Depression Inventory-II; C, comparator; CAGE-AID, cut down, annoyed, guilty, eye opener-adapted to include drugs; CSMHSS, College Students Mental Health Screening Scale; DASS-21, Depression, Anxiety, and Stress Scale short form; Deprexis, an Internet-based software platform that provides personalized cognitive behavioral therapy-based support to help improve depression symptoms; DSM-IV-TR, Diagnostic and Statistical Manual of Mental Disorders Text Revision Fourth Edition; ELIZA, a chatbot that mimics a therapist using a humanistic principle; F, female; GAD-7, Generalized Anxiety Disorder 7-item scale; HADS, Hospital Anxiety and Depression Scale; HDRS-17, Hamilton Depression Rating Scale; HDRS-24, Hamilton Depression Rating Scale; Help4Mood, an interactive system with an embodied virtual agent (avatar) to assist in self-monitoring of patients receiving treatment for depression; HSCL-20-d, Hopkins Symptom Checklist 20-Item Depression Scale; I, intervention; IBS, irritable bowel syndrome; ICD-10, International Classification of Diseases 10th Revision; imbPST, interactive media-based, computer-delivered depression treatment program; ITT, intention-to-treat analysis; M, male; MDD, major depressive disorder; MDM, missing data management; MINI, Mini-International Neuropsychiatric Interview; m-PHA, mobile personal health care agent; MYLO, Manage Your Life Online; N, no; NR, not reported; PESOS, an epilepsy-specific inventory, the performance, sociodemographic aspects, subjective estimation; PHQ-8, Patient Health Questionnaire-8-item scale; PHQ-9, Patient Health Questionnaire 9-item scale; PSS, Perceived Stress Scale; QIDS-SR, Quick Inventory of Depressive Symptoms-Self-Report; RCT, randomized controlled trial; ROME IV, ROME IV diagnostic criteria for irritable bowel syndrome; SAS, Self-Rating Anxiety Scale; SCL-90-R, Symptom Checklist-90-Revised; SMT-CBT, stress management training and cognitive behavioral therapy; STAI, State-Trait Anxiety Inventory; T, total; TEO, therapy empowerment opportunity; Xiaoai, a chatbot for small talk with unrestricted content; Y, yes.
- a Grants were not industry sponsored.
- b Grants were industry sponsored but declared that the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- c Grants were industry sponsored.
3.2. Description of AI-Based Psychotherapeutic Interventions
Fourteen types of AI-based chatbot were found, including Deprexis (n = 11) [11, 12, 14, 26, 39, 40, 42, 48, 50–52], Manage Your Life Online (n = 3) [8, 41, 74], Woebot (n = 3) [2, 39, 73], Therapy Empowerment Opportunity (n = 2) [7, 44], and others. The content of the interventions is described in Table 2. Table S6 provides a summary of 14 different types of AI-based chatbots. The psychological principle was largely grounded in CBT (n = 25), and the area of use was mainly for the treatment of various conditions (n = 26). The main functions included counseling (n = 18), therapy (n = 18), and monitoring (n = 20) via the Internet (n = 18). Most interventions (n = 27) were self-guided and four of them [7, 40, 44, 46] were supported by a therapist. Two trials [40, 46] used therapists for counseling and therapy, while others used a self-help version of an AI-based chatbot. Berger et al. [40, 42] adopted one intervention group with a low-intensity therapist-guided self-help version of Deprexis, and Fitzsimmons-Craft et al. [46, 50] relied on human authoring of conversations via a chatbot (Tessa). Two trials [7, 44] used the mobile personal health care agent (m-PHA) to communicate with patients, but the therapist supervised the m-PHA interactions with the patients. The therapists provided support regarding the events mentioned during the therapy sessions, as well as reviewing the notes and recollections [7, 44]. The duration of intervention ranged from one time [41] to 16 weeks [58]. Most trials did not report the frequency and time of usage. Response generation contained rule-based (n = 16), NLP (n = 14), and other AI technologies. Input and output modalities involved written (n = 30), spoken (n = 3) [43, 70, 72], visual (n = 1) [72], and emojis (n = 1) [71]. Seven embodied chatbots [9, 16, 23, 43, 47, 49, 70] were observed.
Number | Author, year | Name of chatbot (Deprexis/other) | Principle (CBT/other) | Area of use (treatment/prevention) | Function | Platform (Internet/other) | Guide | Frequency/Duration/Time use | Response generation (rule-based/other) | Input | Output | Embodied (Yes/No) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1. | Beevers et al. [40] | Deprexis | CBT | Treatment of depression | | Internet | Self | NR/8 weeks/261.6 min | Rule-based | Written | Written | No |
2. | Bennion et al. [41] | MYLO | | Treatment of distress | Counselling | Internet | Self | One time/24.17 min | Rule-based | Written | Written | No |
3. | Berger et al. [42] | Deprexis | CBT | Treatment of depression | | Internet | | | Rule-based | Written | Written | No |
4. | Berger et al. [43] | Deprexis | CBT | Treatment of depression | | Internet | Self | | Rule-based | Written | Written | No |
5. | Bird et al. [44] | MYLO | | Treatment of distress | Counselling | Internet | Self | | Rule-based | Written | Written | No |
6. | Bücker et al. [45] | Deprexis | CBT | Treatment of gambling and mood | | Internet | Self | | Rule-based | Written | Written | No |
7. | Burton et al. [46] | Help4Mood | CBT | Treatment of depression | | | Self | | Natural language processing | Written | | Yes |
8. | Danieli et al. [47] | m-PHA (TEO) | CBT | Treatment of distress, anxiety, depression | Monitoring | | Therapist | | Natural language processing | Written | Written | No |
9. | Danieli et al. [48] | m-PHA (TEO) | CBT | Treatment of distress, anxiety symptoms | Monitoring | | Therapist | NR/8 weeks/NR | Natural language processing | Written | Written | No |
10. | Fischer et al. [49] | Deprexis | CBT | Treatment of depression | | Internet | Self | NR/9 weeks/332 min | Rule-based | Written | Written | No |
11. | Fitzpatrick, Darcy, and Vierhile [12] | Woebot | CBT | Treatment of depression and anxiety | | | Self | | Natural language processing | Written | Written | No |
12. | Fitzsimmons-Craft et al. [50] | Tessa | CBT | Eating disorders | | | Therapist | | Rule-based/algorithm-based | Written | Written | No |
13. | Gaffney et al. [51] | MYLO | | Treatment of distress | Counselling | Internet | Self | One time/19.23 min | Rule-based | Written | Written | No |
14. | Guțu et al. [52] | Woebot | CBT | Prevention | | | Self | | Natural language processing | Written | Written | No |
15. | He et al. [53] | XiaoE | CBT | Treatment of depression | | Internet (WeChat) | Self | | Natural language processing and deep learning | Written | | No |
16. | Hunt et al. [54] | Zemedy | CBT | Irritable bowel syndrome | | | Self | | Natural language processing | Written | Written | Yes |
17. | Jang et al. [55] | Todaki | CBT | Attention-deficit/hyperactivity disorder | | | Self | | Natural language processing | Written | Written | Yes |
18. | Klein et al. [56] | Deprexis | CBT | Treatment of depression | | Internet | Self | | Rule-based | Written | Written | No |
19. | Klos et al. [57] | Tess | CBT, EFT, SFT, MI | Prevention | | | Self | | Natural language processing | | | No |
20. | Liu et al. [58] | XiaoNan | CBT | Treatment of depression | | | Self | | Natural language processing, intention classification, and emotion recognition | | Written | Yes |
21. | Ly, Ly, and Andersson [59] | Shim | CBT, positive psychology | Prevention | | Smartphone app | Self | | Rule-based | Written | Written | No |
22. | Maeda et al. [60] | Chatbot for fertility education | Transtheoretical model | Prevention | | Internet (chat via Google Cloud’s Dialogflow) | Self | | Natural language processing | Written | Written | Yes |
23. | Meyer et al. [61] | Deprexis | CBT | Treatment of depression | | Internet | Self | | Rule-based | Written | Written | No |
24. | Meyer et al. [62] | Deprexis | CBT | Treatment of depression | | Internet | Self | | Rule-based | Written | Written | No |
25. | Moritz et al. [63] | Deprexis | CBT | Treatment of depression | | Internet | Self | | Rule-based | Written | Written | No |
26. | Oh et al. [64] | Todaki | CBT | Panic disorder | | | Self | | Natural language processing | Written | Written | Yes |
27. | Prochaska et al. [65] | Woebot | CBT | Substance use | | | Self | | Natural language processing | Written | Written | No |
28. | Sandoval et al. [66] | imbPST | Problem-solving therapy | Treatment of depression | | Computer software | Self | | Rule-based | Written | Written | Yes |
29. | Schroder et al. [67] | Deprexis | CBT | Treatment of depression | | Internet | Self | | Rule-based | Written | Written | No |
30. | Zwerenz et al. [68, 69] | Deprexis | CBT | Treatment of depression | | Internet | Self | | Rule-based | Written | Written | No |
- Abbreviations: App, application; CBT, cognitive behavioral therapy; EFT, emotion-focused therapy; MI, motivational interviewing; NR, not reported; SFT, solution-focused brief therapy.
3.3. Individual Quality Assessment
A total of 30 RCTs were evaluated against the RoB 2.0 criteria (Figure S1). Twenty-three trials used ITT analysis, and seven trials used per-protocol analysis. A low risk of bias across the five domains was found in the majority (79.1%) of trials with ITT analysis but in less than half (48.6%) of trials with per-protocol analysis. Nine trials (30%) did not provide information on allocation. Seventeen trials (56.7%) were rated as having some concerns for deviations from the intended intervention because the participants and personnel were aware of the assigned intervention. As shown in Table 1, 13 trials (43.3%) did not publish a protocol or register in a clinical trial registry, which raised some concerns about the selection of the reported results. The attrition rate ranged from 0% [9] to 59.7% [71]. The majority (80%) of trials received grants from various sources: 16 received nonindustry-sponsored grants, seven received industry-sponsored grants, and one trial [45] received both. The remaining six trials did not report or had no grant support. Of the seven trials with industry-sponsored grants, two [23, 42] declared that the funders were not involved in data analysis.
3.4. Depressive Symptoms
A total of 29 arms of 26 RCTs [2, 7, 9, 11, 12, 14, 16, 23, 26, 39–48, 50–52, 70, 72–74] among 4349 individuals, eight arms of six RCTs [7, 23, 41, 44, 72, 74] among 418 individuals, and eight arms of seven RCTs [11, 12, 26, 40, 46, 48, 50] among 2268 participants evaluated the effects of AI-based psychotherapeutic interventions at the postintervention assessment and 2 weeks to 3 months and 6–12 months of follow-up assessments (Figure 2). The meta-analyses showed that AI-based psychotherapeutic interventions significantly reduced depressive symptoms at the postintervention assessment (t = −4.40, p = 0.001) with medium effect size (g = −0.54, 95% CI: −0.79 to −0.29) and 6–12 months of follow-up assessment (t = −3.14, p < 0.016) with small effect size (g = −0.23, 95% CI: −0.40 to −0.06) compared with the comparators. No differences were observed between the intervention and comparator at 2 weeks to 3 months of follow-up assessment (t = −0.08, p = 0.936).
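As a rough arithmetic check on how the reported statistics relate to one another, the sketch below back-calculates the standard error from g and t and then the critical value implied by the confidence interval half-width. It assumes t = g / SE and a symmetric interval; the function name is illustrative and is not part of the original analysis, and the inputs are the rounded values reported above, so the results are approximate.

```python
def implied_se_and_critical_value(g, t, ci_low, ci_high):
    """Back-calculate the standard error and the implied critical value.

    Assumes t = g / SE and a symmetric interval g +/- crit * SE.
    """
    se = abs(g / t)
    half_width = (ci_high - ci_low) / 2
    return se, half_width / se


# Reported pooled results for depressive symptoms (rounded inputs)
post = implied_se_and_critical_value(-0.54, -4.40, -0.79, -0.29)
follow_up = implied_se_and_critical_value(-0.23, -3.14, -0.40, -0.06)

print(f"postintervention: SE ~ {post[0]:.3f}, implied critical value ~ {post[1]:.2f}")
print(f"6-12 months:      SE ~ {follow_up[0]:.3f}, implied critical value ~ {follow_up[1]:.2f}")
```

Both implied critical values are larger than the normal-theory value of 1.96, which is what would be expected for t-based intervals of the kind a Hartung–Knapp adjustment produces.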

The 95% PIs were −1.77 to 0.69, −1.27 to 1.24, and −0.64 to 0.19 for the three time points. Because each 95% PI contained values on both sides of the null (0), the intervention may not produce a significant reduction of depressive symptoms in future similar studies. Heterogeneity was substantial (I² = 70%–85%) at the postintervention assessment and at 2 weeks to 3 months of follow-up assessment and moderate (I² = 42%) at 6–12 months of follow-up assessment. Given the substantial heterogeneity at the postintervention assessment, subgroup analyses and meta-regression analyses were conducted to explore its sources.
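To make the prediction interval reasoning concrete, the following sketch applies the commonly used formula g ± t × sqrt(τ² + SE²), where t is the 0.975 quantile of a t distribution with k − 2 degrees of freedom. Several inputs are assumptions rather than reported values: the between-study variance τ² is not given in this section, so the value below is hypothetical, the standard error is back-calculated as g / t, and the critical value is passed in by hand to keep the example free of external libraries.

```python
import math


def prediction_interval(g, se, tau2, t_crit):
    """Approximate 95% prediction interval for the effect in a new study.

    Uses g +/- t_crit * sqrt(tau2 + se**2), where t_crit is the 0.975
    quantile of a t distribution with k - 2 degrees of freedom.
    """
    half_width = t_crit * math.sqrt(tau2 + se ** 2)
    return g - half_width, g + half_width


def crosses_null(interval, null=0.0):
    """Return True when the interval contains the null value."""
    low, high = interval
    return low <= null <= high


# g is the reported pooled estimate; se is back-calculated as g / t.
# tau2 is a HYPOTHETICAL value chosen for illustration (not reported here).
# t_crit = 2.052 is the 0.975 t quantile for 27 degrees of freedom (k = 29 arms).
pi = prediction_interval(g=-0.54, se=0.123, tau2=0.34, t_crit=2.052)
print(f"95% PI: {pi[0]:.2f} to {pi[1]:.2f}; crosses null: {crosses_null(pi)}")
```

A prediction interval that crosses zero while the confidence interval does not indicates that the average effect is distinguishable from zero, yet the effect in an individual future setting could still be null.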
We conducted subgroup analyses, as shown in Table 3 and Figures S2–S26. Significant subgroup differences (p < 0.1) in the reduction of depressive symptoms were found across the three time points for the nature of participants, age group, embodiment, ITT/MDM, and protocol/registration. Trials conducted among participants with depression, or depression combined with other health issues, showed a larger effect on depressive symptoms at postintervention (g = −0.81, 95% CI: −1.16 to −0.45) and at 2 weeks to 3 months of follow-up (g = −0.64, 95% CI: −2.22 to 0.95) than their counterparts. At 2 weeks to 3 months of follow-up, interventions that used an embodied chatbot (g = −0.57, 95% CI: −1.17 to 0.03), enrolled participants aged 31–40 years (g = −0.57, 95% CI: −1.17 to 0.03), or were analyzed using ITT or MDM (g = −0.35, 95% CI: −1.11 to 0.40) showed a greater effect than their counterparts. Trials with a protocol or registration (g = −0.32, 95% CI: −0.65 to 0.02) showed a greater effect at 6–12 months of follow-up than those without. Hence, between-trial heterogeneity could be partially explained by participant characteristics.
Category | Subgroups | Number of arms | Sample size | Effect size (g) | 95% CI | I² | Subgroup difference |
---|---|---|---|---|---|---|---|
Depressive symptoms (postintervention) | | | | | | | |
Nature of participants | Depression ± others | 17 | 2789 | −0.81 | −1.16, −0.45 | 86% | |
Nature of participants | Stress/distress ± others | 6 | 308 | 0.09 | −0.18, 0.35 | 0% | |
Nature of participants | Other condition | 5 | 1040 | −0.34 | −0.87, 0.19 | 76% | |
Nature of participants | Healthy | 1 | 212 | −0.06 | −0.33, 0.21 | NA | |
Age groups | 18–30 years | 10 | 1482 | −0.48 | −0.94, −0.02 | 84% | |
Age groups | 31–40 years | 11 | 1251 | −0.54 | −0.77, −0.30 | 64% | |
Age groups | 41–50 years | 6 | 1571 | −0.69 | −1.95, 0.57 | 94% | |
Age groups | >50 years | 2 | 45 | −0.24 | −2.00, 1.52 | 0% | |
Type of AI chatbot | Deprexis | 13 | 2719 | −0.68 | −1.12, −0.24 | 88% | |
Type of AI chatbot | Others | 16 | 1630 | −0.42 | −0.74, −0.09 | 79% | |
Different comparators | Active control | 6 | 410 | −0.64 | −2.05, 0.76 | 95% | |
Different comparators | Passive control | 23 | 3939 | −0.48 | −0.65, −0.32 | 72% | |
Type of psychotherapy | CBT | 26 | 4103 | −0.54 | −0.79, −0.29 | 84% | |
Type of psychotherapy | Others | 3 | 246 | −0.53 | −3.46, 2.39 | 93% | |
Type of platforms | Internet | 19 | 3686 | −0.61 | −0.93, −0.30 | 87% | |
Type of platforms | Others | 10 | 663 | −0.37 | −0.85, 0.11 | 78% | |
Response generation | Rule-based | 16 | 3453 | −0.65 | −1.06, −0.24 | 90% | |
Response generation | Others | 13 | 896 | −0.41 | −0.70, −0.12 | 69% | |
Embodiment | Yes | 5 | 271 | −0.59 | −1.08, −0.10 | 50% | |
Embodiment | No | 24 | 4078 | −0.53 | −0.84, −0.23 | 87% | |
ITT or MDM | Yes | 22 | 3455 | −0.60 | −0.88, −0.33 | 85% | |
ITT or MDM | No | 7 | 894 | −0.29 | −1.00, 0.42 | 78% | |
Protocol or registration | Yes | 15 | 2534 | −0.55 | −0.97, −0.12 | 87% | |
Protocol or registration | No | 14 | 1815 | −0.51 | −0.84, −0.19 | 81% | |
Depressive symptoms (follow-up assessments at 2 weeks to 3 months) | | | | | | | |
Nature of participants | Depression ± others | 2 | 106 | −0.64 | −2.22, 0.95 | 0% | |
Nature of participants | Stress/distress ± others | 5 | 267 | 0.32 | 0.04, 0.59 | 0% | |
Nature of participants | Other condition | 1 | 45 | −0.57 | −1.17, 0.03 | NA | |
Age groups | 18–30 years | 4 | 307 | −0.17 | −1.01, 0.68 | 78% | |
Age groups | 31–40 years | 1 | 45 | −0.57 | −1.17, 0.03 | NA | |
Age groups | 41–50 years | 1 | 21 | 0.99 | 0.07, 1.90 | NA | |
Age groups | >50 years | 2 | 45 | 0.29 | −1.82, 2.40 | 0% | |
Different comparators | Active control | 5 | 301 | 0.17 | −0.43, 0.78 | 56% | |
Different comparators | Passive control | 3 | 117 | −0.37 | −1.94, 1.19 | 61% | |
Type of psychotherapy | CBT | 6 | 217 | −0.12 | −0.82, 0.59 | 68% | |
Type of psychotherapy | Others | 2 | 201 | 0.26 | 0.15, 0.37 | 0% | |
Type of platforms | Internet | 4 | 307 | −0.17 | −1.01, 0.68 | 78% | |
Type of platforms | Others | 4 | 111 | 0.20 | −0.86, 1.27 | 67% | |
Response generation | Rule-based | 2 | 201 | 0.26 | 0.15, 0.37 | 0% | |
Response generation | Others | 6 | 217 | −0.12 | −0.82, 0.59 | 68% | |
Embodiment | Yes | 1 | 45 | −0.57 | −1.17, 0.03 | NA | |
Embodiment | No | 7 | 373 | 0.07 | −0.47, 0.60 | 68% | |
ITT or MDM | Yes | 4 | 310 | −0.35 | −1.11, 0.40 | 79% | |
ITT or MDM | No | 4 | 108 | 0.41 | −0.12, 0.94 | 0% | |
Protocol or registration | Yes | 4 | 111 | 0.20 | −0.86, 1.27 | 67% | |
Protocol or registration | No | 4 | 307 | −0.17 | −1.01, 0.68 | 78% | |
Depressive symptoms (follow-up assessments at 6 months to 12 months) | | | | | | | |
Nature of participants | Depression ± others | 7 | 1568 | −0.27 | −0.48, −0.07 | 45% | |
Nature of participants | Other condition | 1 | 700 | −0.10 | −0.24, 0.05 | NA | |
Age groups | 18–30 years | 1 | 700 | −0.10 | −0.24, 0.05 | NA | |
Age groups | 31–40 years | 3 | 175 | −0.13 | −0.79, 0.54 | 0% | |
Age groups | 41–50 years | 4 | 1393 | −0.32 | −0.65, 0.02 | 67% | |
Type of AI chatbot | Deprexis | 7 | 1568 | −0.27 | −0.48, −0.07 | 45% | |
Type of AI chatbot | Others | 1 | 700 | −0.10 | −0.24, 0.05 | NA | |
Different comparators | Active control | 1 | 44 | −0.27 | −0.88, 0.33 | NA | |
Different comparators | Passive control | 7 | 2224 | −0.23 | −0.43, −0.03 | 51% | |
ITT or MDM | Yes | 7 | 1568 | −0.27 | −0.48, −0.07 | 45% | |
ITT or MDM | No | 1 | 700 | −0.10 | −0.24, 0.05 | NA | |
Protocol or registration | Yes | 4 | 1393 | −0.32 | −0.65, 0.02 | 67% | |
Protocol or registration | No | 4 | 875 | −0.10 | −0.25, 0.05 | 0% | |
Anxiety symptoms (postintervention) | | | | | | | |
Nature of participants | Depression ± others | 4 | 385 | −0.60 | −1.97, 0.77 | 92% | |
Nature of participants | Stress/distress ± others | 6 | 308 | 0.12 | −0.37, 0.61 | 45% | |
Nature of participants | Other condition | 5 | 1040 | −0.15 | −0.28, −0.02 | 0% | |
Nature of participants | Healthy | 4 | 1189 | −0.31 | −0.61, −0.02 | 48% | |
Age groups | 18–30 years | 10 | 2289 | −0.15 | −0.33, 0.02 | 58% | |
Age groups | 31–40 years | 4 | 335 | −0.25 | −0.45, −0.06 | 0% | |
Age groups | 41–50 years | 3 | 253 | −0.40 | −4.14, 3.35 | 95% | |
Age groups | >50 years | 2 | 45 | −0.36 | −3.79, 3.07 | 0% | |
Type of AI chatbot | Deprexis | 3 | 294 | −0.86 | −3.00, 1.28 | 93% | |
Type of AI chatbot | Other | 16 | 2628 | −0.15 | −0.31, 0.00 | 53% | |
Different comparators | Active control | 12 | 1313 | −0.20 | −0.64, 0.23 | 85% | |
Different comparators | Passive control | 7 | 1609 | −0.24 | −0.36, −0.12 | 0% | |
Type of psychotherapy | CBT | 15 | 1794 | −0.27 | −0.59, 0.04 | 78% | |
Type of psychotherapy | Others | 4 | 1128 | −0.14 | −0.70, 0.42 | 79% | |
Type of platforms | Internet | 10 | 2255 | −0.38 | −0.78, 0.03 | 86% | |
Type of platforms | Others | 9 | 667 | −0.10 | −0.31, 0.12 | 32% | |
Response generation | Rule-based | 6 | 1195 | −0.37 | −1.18, 0.44 | 91% | |
Response generation | Others | 13 | 1727 | −0.23 | −0.40, −0.06 | 43% | |
Embodiment | Yes | 6 | 1177 | −0.36 | −0.48, −0.24 | 0% | |
Embodiment | No | 13 | 1745 | −0.20 | −0.59, 0.19 | 83% | |
ITT or MDM | Yes | 13 | 2723 | −0.31 | −0.60, −0.02 | 82% | |
ITT or MDM | No | 6 | 199 | −0.00 | −0.64, 0.63 | 58% | |
Protocol or registration | Yes | 11 | 2219 | −0.36 | −0.80, 0.07 | 84% | |
Protocol or registration | No | 8 | 703 | −0.03 | −0.21, 0.15 | 1% | |
Anxiety symptoms (follow-up assessments at 2 weeks to 3 months) | | | | | | | |
Nature of participants | Stress/distress ± others | 5 | 175 | 0.38 | 0.05, 0.70 | 0% | |
Nature of participants | Other condition | 1 | 45 | −0.12 | −0.71, 0.47 | NA | |
Age groups | 18–30 years | 2 | 109 | 0.31 | −2.13, 2.74 | 0% | |
Age groups | 31–40 years | 1 | 45 | −0.12 | −0.71, 0.47 | NA | |
Age groups | 41–50 years | 1 | 21 | 0.83 | −0.07, 1.73 | NA | |
Age groups | >50 years | 2 | 45 | 0.35 | −1.36, 2.06 | 0% | |
Different comparators | Active control | 4 | 157 | 0.36 | −0.08, 0.80 | 0% | |
Different comparators | Passive control | 2 | 63 | 0.08 | −3.68, 3.84 | 20% | |
Type of psychotherapy | CBT | 4 | 111 | 0.27 | −0.39, 0.93 | 12% | |
Type of psychotherapy | Others | 2 | 109 | 0.31 | −2.13, 2.74 | 0% | |
Type of platforms | Internet | 2 | 109 | 0.31 | −2.13, 2.74 | 0% | |
Type of platforms | Others | 4 | 111 | 0.27 | −0.39, 0.93 | 12% | |
Response generation | Rule-based | 2 | 109 | 0.31 | −2.13, 2.74 | 0% | |
Response generation | Others | 4 | 111 | 0.27 | −0.39, 0.93 | 12% | |
Embodiment | Yes | 1 | 45 | −0.12 | −0.71, 0.47 | NA | |
Embodiment | No | 5 | 175 | 0.38 | 0.05, 0.70 | 0% | |
ITT or MDM | Yes | 2 | 112 | 0.05 | −1.67, 1.76 | 0% | |
ITT or MDM | No | 4 | 108 | 0.52 | 0.16, 0.88 | 0% | |
Protocol or registration | Yes | 4 | 111 | 0.27 | −0.39, 0.93 | 12% | |
Protocol or registration | No | 2 | 109 | 0.31 | −2.13, 2.74 | 0% | |
Anxiety symptoms (follow-up assessments at 6 months) | | | | | | | |
Nature of participants | Depression ± others | 3 | 380 | −0.37 | −0.70, −0.05 | 0% | |
Nature of participants | Other condition | 1 | 700 | −0.10 | −0.24, 0.05 | NA | |
Age groups | 18–30 years | 1 | 700 | −0.10 | −0.24, 0.05 | NA | |
Age groups | 41–50 years | 3 | 380 | −0.37 | −0.70, −0.05 | 0% | |
Type of AI chatbot | Deprexis | 3 | 380 | −0.37 | −0.70, −0.05 | 0% | |
Type of AI chatbot | Other | 1 | 700 | −0.10 | −0.24, 0.05 | NA | |
Different comparators | Active control | 1 | 44 | −0.22 | −0.82, 0.39 | NA | |
Different comparators | Passive control | 3 | 1036 | −0.25 | −0.74, 0.23 | 65% | |
Stress symptoms (postintervention) | | | | | | | |
Nature of participants | Stress/distress ± others | 5 | 267 | 0.07 | −0.23, 0.37 | 0% | |
Nature of participants | Other condition | 2 | 126 | −0.32 | −4.67, 4.02 | 70% | |
Nature of participants | Healthy | 1 | 28 | −0.85 | −1.63, −0.07 | NA | |
Age groups | 18–30 years | 4 | 275 | −0.03 | −0.66, 0.59 | 46% | |
Age groups | 31–40 years | 1 | 80 | −0.64 | −1.09, −0.19 | NA | |
Age groups | 41–50 years | 1 | 21 | 0.31 | −0.55, 1.17 | NA | |
Age groups | >50 years | 2 | 45 | −0.35 | −3.34, 2.63 | 0% | |
Different comparators | Active control | 4 | 249 | 0.08 | −0.32, 0.49 | 0% | |
Different comparators | Passive control | 4 | 172 | −0.40 | −1.07, 0.27 | 41% | |
Type of psychotherapy | CBT | 6 | 220 | −0.33 | −0.79, 0.12 | 36% | |
Type of psychotherapy | Others | 2 | 201 | 0.14 | −0.15, 0.43 | 0% | |
Type of platforms | Internet | 2 | 201 | 0.14 | −0.15, 0.43 | 0% | |
Type of platforms | Others | 6 | 220 | −0.33 | −0.79, 0.12 | 36% | |
Response generation | Rule-based | 3 | 229 | −0.12 | −1.41, 1.17 | 64% | |
Response generation | Others | 5 | 192 | −0.25 | −0.75, 0.26 | 34% | |
Embodiment | Yes | 2 | 126 | −0.32 | −4.67, 4.02 | 70% | |
Embodiment | No | 6 | 295 | −0.09 | −0.54, 0.35 | 37% | |
ITT or MDM | Yes | 4 | 313 | −0.27 | −1.05, 0.50 | 74% | |
ITT or MDM | No | 4 | 108 | −0.05 | −0.61, 0.51 | 0% | |
Protocol or registration | Yes | 4 | 146 | −0.34 | −1.03, 0.34 | 30% | |
Protocol or registration | No | 4 | 275 | −0.03 | −0.66, 0.59 | 46% | |
Stress symptoms (follow-up assessments at 2 weeks to 3 months) | | | | | | | |
Nature of participants | Stress/distress ± others | 5 | 267 | 0.34 | 0.02, 0.66 | 0% | |
Nature of participants | Other condition | 1 | 45 | −0.20 | −0.78, 0.39 | NA | |
Age groups | 18–30 years | 2 | 201 | 0.31 | 0.27, 0.35 | 0% | |
Age groups | 31–40 years | 1 | 45 | −0.20 | −0.78, 0.39 | NA | |
Age groups | 41–50 years | 1 | 21 | 1.06 | 0.13, 1.98 | NA | |
Age groups | >50 years | 2 | 45 | 0.17 | −3.45, 3.79 | 0% | |
Different comparators | Active control | 4 | 249 | 0.33 | −0.11, 0.76 | 11% | |
Different comparators | Passive control | 2 | 63 | 0.07 | −4.37, 4.51 | 39% | |
Type of psychotherapy | CBT | 4 | 111 | 0.25 | −0.65, 1.16 | 49% | |
Type of psychotherapy | Others | 2 | 201 | 0.31 | 0.27, 0.35 | 0% | |
Type of platforms | Internet | 2 | 201 | 0.31 | 0.27, 0.35 | 0% | |
Type of platforms | Others | 4 | 111 | 0.25 | −0.65, 1.16 | 49% | |
Response generation | Rule-based | 2 | 201 | 0.31 | 0.27, 0.35 | 0% | |
Response generation | Others | 4 | 111 | 0.25 | −0.65, 1.16 | 49% | |
Embodiment | Yes | 1 | 45 | −0.20 | −0.78, 0.39 | NA | |
Embodiment | No | 5 | 267 | 0.34 | 0.02, 0.66 | 0% | |
ITT or MDM | Yes | 2 | 204 | 0.12 | −3.01, 3.25 | 56% | |
ITT or MDM | No | 4 | 108 | 0.38 | −0.29, 1.06 | 13% | |
Protocol or registration | Yes | 4 | 111 | 0.25 | −0.65, 1.16 | 49% | |
Protocol or registration | No | 2 | 201 | 0.31 | 0.27, 0.35 | 0% | |
- Note: I² denotes heterogeneity.
- p < 0.05 ∗, p < 0.01 ∗∗, p < 0.001 ∗∗∗.
A series of random-effects meta-regression analyses were conducted to evaluate the effect of the various covariates on the effect size of depressive symptoms (Table 4). The univariate meta-regression analyses concluded that publication year (β = 0.017, p = 0.617), duration of intervention based on the number of days (β = −0.004, p = 0.270), sample size (β = 0.001, p = 0.383), attrition rate (β = −0.003, p = 0.774), and the portion of males (β = −0.011, p = 0.134) had no effects on depressive symptoms. Thus, the between-trial heterogeneity could not be explained by these covariates.
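Covariates | Depressive symptoms: β | Depressive symptoms: SE | Depressive symptoms: 95% lower | Depressive symptoms: 95% upper | Depressive symptoms: p-Value | Anxiety symptoms: β | Anxiety symptoms: SE | Anxiety symptoms: 95% lower | Anxiety symptoms: 95% upper | Anxiety symptoms: p-Value |
---|---|---|---|---|---|---|---|---|---|---|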
Year of publication | 0.017 | 0.033 | −0.052 | 0.085 | 0.617 | −0.001 | 0.056 | −0.118 | 0.117 | 0.993 |
Duration of intervention (days) | −0.004 | 0.004 | −0.01 | 0.003 | 0.270 | −0.007 | 0.003 | −0.013 | <0.001 | 0.056 |
Sample size | 0.001 | 0.001 | −0.001 | 0.002 | 0.383 | <−0.001 | <0.001 | −0.001 | 0.001 | 0.671 |
Attrition rate | −0.003 | 0.009 | −0.020 | 0.015 | 0.774 | −0.007 | 0.007 | −0.020 | 0.007 | 0.339 |
Portion of males | −0.011 | 0.007 | −0.025 | 0.004 | 0.134 | −0.006 | 0.007 | −0.021 | 0.009 | 0.407 |
- Note: β means regression coefficients.
- Abbreviation: SE, standard error.
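For readers unfamiliar with univariate meta-regression, the sketch below shows the core calculation as a weighted least-squares fit of effect size on a single covariate, with weights 1 / (within-study variance + τ²). It is a simplified illustration and not the authors' procedure: τ² is supplied by hand rather than estimated, no standard errors or p-values are computed, and the trial-level data are hypothetical.

```python
def weighted_meta_regression(effects, variances, covariate, tau2=0.0):
    """Weighted least-squares fit of effect size on one covariate.

    Weights are 1 / (within-study variance + tau2). Returns (intercept, slope).
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, covariate)) / total
    y_bar = sum(w * y for w, y in zip(weights, effects)) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, covariate, effects))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, covariate))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# HYPOTHETICAL trial-level data: Hedges' g, its variance, and duration in days
g_values = [-0.20, -0.40, -0.55, -0.80]
g_variances = [0.04, 0.02, 0.05, 0.03]
duration_days = [14, 28, 56, 84]

intercept, slope = weighted_meta_regression(
    g_values, g_variances, duration_days, tau2=0.05
)
print(f"intercept = {intercept:.3f}, slope per day = {slope:.4f}")
```

The slope plays the role of β in Table 4: it is the estimated change in effect size per unit change in the covariate.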
3.5. Anxiety Symptoms
A total of 19 arms of 17 trials [2, 7, 16, 23, 40–42, 44–47, 49, 50, 70, 71, 73, 74] involving 2922 participants at the postintervention assessment, six arms of five trials [7, 23, 41, 44, 74] including 220 participants at 2 weeks to 3 months of follow-up assessment, and four trials [11, 26, 46, 50] of 1080 participants at 6–12 months of follow-up assessment were found. Meta-analyses showed no differences between the intervention and comparator at the postintervention assessment (t = −1.95, p = 0.067) and 2 weeks to 3 months (t = 2.08, p = 0.093) and 6–12 months (t = −2.82, p = 0.067) of follow-up assessment, as shown in Figure 3.

The 95% PIs were −1.19 to 0.71, −0.11 to 0.65, and −0.99 to 0.51 for the three time points. Hence, the intervention may not produce a significant reduction in anxiety symptoms compared with comparators in future similar studies. Heterogeneity was substantial (I² = 78%) at the postintervention assessment, negligible (I² = 0%) at 2 weeks to 3 months of follow-up assessment, and moderate (I² = 45%) at 6–12 months of follow-up assessment. To explore the sources of heterogeneity, subgroup analyses and meta-regression analyses were performed.
We conducted a series of subgroup analyses for the three time points (Table 3 and Figures S27–S49). Significant subgroup differences (p < 0.1) in anxiety symptoms at 2 weeks to 3 months and 6 months of follow-up were found for the nature of participants, age group, type of AI chatbot, and use of ITT/MDM. Trials that used Deprexis among European participants aged 41–50 years with depression, or depression combined with other health problems, showed a larger effect (g = −0.37, 95% CI: −0.70 to −0.05) on anxiety symptoms at 6 months of follow-up than the other groups. Trials using ITT or MDM (g = 0.05, 95% CI: −1.67 to 1.76) showed a smaller effect size on anxiety symptoms at 2 weeks to 3 months of follow-up than their counterparts.
The univariate meta-regression analyses suggested that the publication year (β = −0.001, p = 0.993), duration of intervention in days (β = −0.007, p = 0.056), sample size (β < −0.001, p = 0.671), attrition rate (β = −0.007, p = 0.339), and portion of males (β = −0.006, p = 0.407) had no effects on anxiety symptoms (Table 4). Therefore, the high heterogeneity could not be explained by these covariates.
3.6. Stress Symptoms
A total of eight arms of seven RCTs [7, 23, 27, 41, 44, 47, 74] among 421 participants at the postintervention assessment and six arms of five RCTs [7, 23, 41, 44, 74] involving 312 participants at 2 weeks to 3 months of follow-up assessment were pooled to evaluate the effect of intervention on stress symptoms (Figure 4). Meta-analyses did not yield any significant differences (t = −1.18 to 2.04, p = 0.098–0.277) between the intervention and comparator at the postintervention assessment and 2 weeks to 3 months of follow-up assessment.

A series of subgroup analyses was performed for the two time points (Table 3 and Figures S50–S67). Significant subgroup differences (p < 0.1) in stress symptoms at postintervention and follow-up were found for the nature of participants, comparator, type of psychotherapy, platform, and embodiment. Trials conducted among participants with conditions other than stress or distress showed a larger effect on stress symptoms at postintervention (g = −0.32, 95% CI: −4.67 to 4.02) and follow-up (g = −0.20, 95% CI: −0.78 to 0.39) than their counterparts. At postintervention, interventions that adopted CBT (g = −0.33, 95% CI: −0.79 to 0.12), used non-Internet platforms (g = −0.33, 95% CI: −0.79 to 0.12), or were compared with passive controls (g = −0.40, 95% CI: −1.07 to 0.27) showed a greater effect than their counterparts. At follow-up, trials conducted outside Europe (g = −0.20, 95% CI: −0.78 to 0.39) and trials using an embodied chatbot (g = −0.20, 95% CI: −0.78 to 0.39) showed a greater effect than their counterparts.
3.7. Depressive, Anxiety, and Stress Symptoms
Three RCTs [8, 41, 74] examined the effect of the intervention on the total scores of depressive, anxiety, and stress symptoms using the 21-item Depression, Anxiety, and Stress Scale [75] in 295 participants at the postintervention assessment and at 2 weeks to 3 months of follow-up assessment (Figure 5). The meta-analyses did not reveal any differences between the two groups (t = 1.34–1.46, p = 0.281–0.311).

3.8. Overall Evidence
The GRADE criteria were used to evaluate 10 outcomes of this review (Table S7), and the certainty of evidence ranged from very low to moderate. The certainty was downgraded for inconsistency, indirectness, and imprecision because of high heterogeneity, varied populations and interventions, small samples, and wide confidence intervals. Because more than 10 trials reported depressive and anxiety symptoms at postintervention, funnel plots and Egger’s test were performed for these outcomes. No evidence of publication bias was found: the funnel plots were symmetrical, and Egger’s tests were nonsignificant (p = 0.091–0.983; Figures S68 and S69).
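As an illustration of the idea behind Egger’s test, the sketch below regresses the standardized effect (g / SE) on precision (1 / SE) using ordinary least squares and returns the intercept. An intercept near zero is compatible with a symmetrical funnel plot. This is a simplified sketch with hypothetical data: it returns only the point estimate and omits the standard error and t test that a formal Egger’s test requires.

```python
def egger_intercept(effects, standard_errors):
    """Intercept of Egger's regression: (effect / SE) regressed on (1 / SE).

    An intercept near zero is compatible with a symmetrical funnel plot.
    Only the point estimate is returned; the significance test is omitted.
    """
    precision = [1.0 / se for se in standard_errors]
    standardized = [g / se for g, se in zip(effects, standard_errors)]
    n = len(effects)
    x_bar = sum(precision) / n
    y_bar = sum(standardized) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(precision, standardized))
    sxx = sum((x - x_bar) ** 2 for x in precision)
    slope = sxy / sxx
    return y_bar - slope * x_bar


# HYPOTHETICAL data: smaller trials (larger SE) report larger effects
g_values = [-0.90, -0.70, -0.50, -0.35, -0.30]
standard_errors = [0.40, 0.30, 0.20, 0.12, 0.10]
print(f"Egger intercept = {egger_intercept(g_values, standard_errors):.2f}")
```

When every trial estimates the same effect regardless of its size, the standardized effect is exactly proportional to precision and the intercept is zero; the hypothetical data above deliberately break that pattern.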
4. Discussion
4.1. Summary of Findings
From 13,546 records identified through 12 databases, three clinical trial registries, and other methods using a three-step comprehensive search, we included 30 RCTs with 6100 participants across nine countries. Our review showed that AI-based psychotherapeutic interventions significantly reduced depressive symptoms at the postintervention assessment with a medium effect size and at 6–12 months of follow-up assessment with a small effect size compared with comparators. No significant effect of AI-based psychotherapeutic interventions was found on anxiety, stress, or the total scores of depressive, anxiety, and stress symptoms at postintervention or at the different follow-up assessments. A series of subgroup analyses revealed significant differences in the reduction of psychological symptoms at various time points based on the nature of participants, age group, type of AI chatbot, type of psychotherapy, type of platform, embodiment, comparator, ITT/MDM, and protocol/registration. The random-effects univariate meta-regression did not detect a significant covariate for depressive and anxiety symptoms at postintervention. The majority (79.1%) of trials with ITT analysis and less than half (48.6%) of trials with per-protocol analysis were rated as having a low risk of bias across five domains using the RoB 2.0 criteria. No publication bias was detected for depressive and anxiety symptoms at postintervention. The certainty of evidence ranged from very low to moderate for 10 psychological outcomes according to the GRADE criteria.
4.2. Depressive Symptoms
In line with previous meta-analytic evidence [18], we found that depressive symptoms were significantly reduced following AI-based psychotherapeutic interventions at postintervention. Our result also indicated a significant effect at 6–12 months of follow-up assessment. Thus, AI-based psychotherapeutic interventions may reduce depressive symptoms both immediately and in the long term. AI chatbots can be designed to deliver various psychotherapies using AI technology according to different psychological principles, such as CBT [40], method of levels therapy [44], or problem-solving therapy [66]. Users may engage with the intervention through text-based or voice-activated conversations [70], and such interactions can offer psychological, relational, and emotional support [76]. Chatbots can also provide initial counseling, guide users to a self-help library, and direct users to appropriate services [74]. Chatbots use AI algorithms to interpret user dialogues and conduct useful interactions. They may have a low attrition rate because of increased engagement and motivation [17]. Therefore, AI-based psychotherapeutic interventions can ameliorate depressive symptoms. However, because only seven trials included a 6–12 month follow-up assessment, a firm conclusion about the long-term effect of the intervention cannot be drawn.
In our review, the majority of the interventions used rule-based response generation and less than half used NLP. Rule-based response generation consists of simple dialogue components based on rules, following a predefined decision tree and communicating in a scripted manner [74]. Conversely, generative-based response generation is more complex and relies on ML to construct its dialogues; AI uses this method to generate possible answers and enhance conversational proficiency [11]. With the increasing integration of AI technology into psychotherapy [77], future interventions can consider using advanced generative deep learning techniques that may allow AI chatbots to interact with users in an empathetic, coherent, and personalized manner [7, 74].
Seven interventions used embodied conversational agents in our review. Our subgroup analysis showed a greater effect size for embodied agents compared with nonembodied agents. An embodied conversational agent is a computer-based dialogue system with a virtual embodiment (full body or face-only) that typically interacts with users using multimodal communication cues of speech, text, animated facial expressions, or gestures [73]. Evidence showed that embodied conversational agents can build trust and rapport and can create a sense of warmth, leading to companionship and long-term usage [13, 78]. Future interventions can consider adopting embodied agents. Only one intervention [57] used emojis (images depicting facial expressions) to share and track the participants’ moods over time. Considering that emojis can be used to express, imitate, and appraise varying degrees of emotion [79], more research is needed to evaluate their effectiveness.
Notably, the intervention failed to demonstrate superior effects at 2 weeks to 3 months of follow-up assessment in eight trials. Most comparators (62.5%) were active control groups, such as another conversational chatbot [44, 51, 53], stress management training and CBT [47, 48], and e-books on depression [53]. This finding aligns with a previous mixed-method review [16] that demonstrated similar patterns. Our review revealed comparable effects between AI-based psychotherapeutic interventions and active comparators. Furthermore, only a minority of the participants (25%) had depressive problems at 2 weeks to 3 months of follow-up assessment. A plausible interpretation is that AI-based psychotherapeutic interventions may not alleviate depressive symptoms in persons who are not depressed. However, we could not draw a definite conclusion about treatment efficacy for depressive symptoms at 2 weeks to 3 months of follow-up assessment.
Our subgroup analyses revealed that the intervention had a greater effect size among participants with depression, or depression combined with other health issues, and among those aged 31–40 years compared with other age groups. One reason could be that younger adults had greater knowledge of AI [72] and more engagement in activities [80] than older adults. Therefore, young adults were more likely to adhere to interventions than older adults. Consistent with a previous review [18], the intervention significantly improved depressive symptoms in participants with depression or depression combined with other health issues. This finding suggests that the interventions were more effective for participants with depression than for those with other health conditions. Hence, the intervention was more beneficial for the young depressive group. Our subgroup results showed significant subgroup differences based on the nature of participants, age groups, embodiment, ITT/MDM, and protocol/registration at follow-up from 2 weeks to 12 months, but these subgroup analyses included only 1–4 trials. Hence, the results should be interpreted with caution, and future trials are recommended to confirm the findings.
4.3. Other Psychological Outcomes
Contrary to our expectations, the meta-analyses revealed that AI-based psychotherapeutic interventions did not improve anxiety symptoms, stress symptoms, or the combined scores of depressive, anxiety, and stress symptoms at postintervention and follow-up assessments compared with comparators. These findings are inconsistent with a previous review [17]. One possible reason is that most comparators for these outcomes were active controls. Another possibility lies in the differences between depressive, anxiety, and stress symptoms [81]. Stress symptoms reflect a sense of being overwhelmed, characterized by chronic nonspecific arousal, tension, agitation, and irritability; anxiety symptoms reflect a sense of fear or dread, centered on autonomic arousal, physical symptoms of anxiety, and the subjective experience of anxious affect; and depressive symptoms reflect a sense of unhappiness or sadness, such as dysphoria, hopelessness, low self-esteem, anhedonia, and loss of interest [82, 83]. These differences in feelings, together with their specific cognitive processes and coping strategies, may explain the different results [83]. At this stage, we can only speculate about the reasons. Hence, conclusions cannot be drawn, and further studies are required.
In our subgroup analyses, we found significant differences between subgroups based on the nature of participants, age groups, type of AI chatbot, psychotherapy, platform, comparators used, and use of ITT/MDM for anxiety and stress symptoms at postintervention and follow-up assessments. However, these subgroup comparisons included only 1–6 trials in each group, and the number of trials was uneven across subgroups. The data should therefore be interpreted cautiously [38], and we advise further research to validate the results.
4.4. Strengths and Limitations
The current systematic review has several strengths. This review was the first to examine the short- and long-term effects of AI-based psychotherapeutic interventions on psychological outcomes. A comprehensive search strategy, including 12 databases and three clinical trial registries, was used to reduce publication bias and identified 30 RCTs. The random-effects meta-analysis applied the restricted maximum likelihood method [36] with Hartung–Knapp adjustment [37]. The 95% PI for the meta-analyses was reported to predict true effects in future settings [32], and the certainty of evidence for each outcome was assessed.
Notwithstanding these strengths, this review had several limitations. First, the psychological outcomes were self-reported, which may introduce social desirability bias. Second, the number of trials included in some meta-analyses was limited, especially for follow-up assessments; thus, statistical power was reduced. Third, the uneven number of trials in the subgroups may have prevented valid estimates [38]. Fourth, the included interventions were designed from a wide variety of psychological principles, and six meta-analyses revealed substantial heterogeneity that restricted the accuracy of the pooled estimates. Fifth, the certainty of the evidence for the six outcomes was either very low or low, which may reduce confidence in implementing AI-based psychotherapeutic interventions. Sixth, some trials did not report the intervention regimen, which limited the comparison of intervention features. Lastly, the majority (n = 18) of the trials were from European countries, which might restrict the generalization of the findings.
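As a rough illustration of the pooling approach named above, the following minimal Python sketch estimates the between-trial variance by restricted maximum likelihood (using the commonly cited fixed-point iteration) and then applies the Hartung–Knapp standard error. It is not the code used in this review. The function names, the three effect sizes, and their variances are hypothetical and chosen only for illustration, and the t critical values are assumed to be supplied by the reader from a t table (k − 1 degrees of freedom for the confidence interval; k − 2 is one common convention for the prediction interval).

```python
from math import sqrt


def reml_tau2(effects, variances, tol=1e-12, max_iter=1000):
    """Estimate the between-trial variance tau^2 by REML (fixed-point iteration)."""
    tau2 = 0.0
    for _ in range(max_iter):
        w = [1.0 / (v + tau2) for v in variances]
        sum_w = sum(w)
        mu = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
        num = sum(wi ** 2 * ((yi - mu) ** 2 - vi)
                  for wi, yi, vi in zip(w, effects, variances))
        # Truncate at zero because a variance cannot be negative.
        new_tau2 = max(0.0, num / sum(wi ** 2 for wi in w) + 1.0 / sum_w)
        if abs(new_tau2 - tau2) < tol:
            return new_tau2
        tau2 = new_tau2
    return tau2


def pool_hartung_knapp(effects, variances):
    """Return (pooled effect, tau^2, Hartung-Knapp standard error)."""
    k = len(effects)
    if k < 2:
        raise ValueError("at least two trials are required")
    tau2 = reml_tau2(effects, variances)
    w = [1.0 / (v + tau2) for v in variances]
    sum_w = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Hartung-Knapp: weighted residual variance scaled by (k - 1).
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, effects))
    se_hk = sqrt(q / ((k - 1) * sum_w))
    return mu, tau2, se_hk


def interval(center, se, t_crit):
    """Symmetric interval: center +/- t_crit * se."""
    return center - t_crit * se, center + t_crit * se


# Hypothetical Hedges' g values and sampling variances (not data from this review).
g = [-0.2, -0.5, -0.8]
v = [0.04, 0.04, 0.04]
mu, tau2, se_hk = pool_hartung_knapp(g, v)  # mu = -0.5, tau2 ~ 0.05, se_hk ~ 0.173

# 95% CI with the t critical value for k - 1 = 2 degrees of freedom (4.303).
ci = interval(mu, se_hk, 4.303)
# 95% PI adds tau^2 to the squared standard error; k - 2 = 1 df (12.706) is assumed.
pi = interval(mu, sqrt(tau2 + se_hk ** 2), 12.706)
```

With so few trials, the prediction interval is far wider than the confidence interval, which is one reason pooled estimates from small sets of trials call for cautious interpretation.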
4.5. Clinical Implications and Future Research
In this review, we found that the interventions had small to medium effects on depressive symptoms at postintervention and at follow-up assessments at 6–12 months. Given that the intervention was tested under variable comparator conditions, the active control groups may have masked its effectiveness [84]. Hence, small to medium effects can be considered either clinically important differences or the minimum clinically important differences [71, 85]. The participants could have experienced meaningful treatment benefits from AI-based psychotherapeutic interventions. However, the certainty of evidence for six outcomes was very low or low; thus, AI-based psychotherapeutic interventions can be supported only as a supplementary intervention. Given the shortage of mental health workers globally, such interventions can be considered adjunctive to the usual treatments during the therapeutic process. Interventions can be incorporated into comprehensive web applications to facilitate access to psychotherapy amid physical distancing requirements, particularly during the ongoing COVID-19 pandemic. However, designing an interface adaptable to diverse user profiles presents certain challenges. Technical challenges arise in interpreting emotions in dialogues and in making chatbot features more human-like. Privacy and security are other important issues that require attention during the development of the intervention.
Although the COVID-19 pandemic has driven the use of AI technology, AI chatbots carry a risk of being used inappropriately. Healthcare research teams should collaborate closely and regularly with computing scientists to modify and upgrade human–computer interactions. The subgroup results suggest that interventions can target populations with depression aged 31–40 years. Substantial heterogeneity exists in some meta-analyses, suggesting that the interventions varied across regimens, settings, and populations. Future interventions should consider using standardized regimens among specific populations in the same setting so that conclusive results can be drawn. In addition, future research should report intervention content and regimen in more detail according to the Template for Intervention Description and Replication Guide [86]. Given the low or very low certainty for the six outcomes, well-designed RCTs are necessary to minimize selection, performance, and reporting biases by reporting allocation concealment, blinding participants, using ITT or MDM, and registering or publishing trial protocols. Future RCTs should recruit large samples in non-European countries to improve the generalizability of the findings.
5. Conclusion
This review revealed that AI-based psychotherapeutic interventions significantly reduced depressive symptoms at the postintervention assessment and at 6–12 months of follow-up assessment. We found comparable effects on anxiety, stress, and combined symptoms between AI-based psychotherapeutic interventions and active comparators. AI-based psychotherapeutic interventions can supplement existing psychiatric care targeting groups with depression aged 31–40. Future studies should improve the transparency of the intervention’s content and regimen. Further investigations should also use methodologically robust approaches with large samples and long-term follow-up assessments to evaluate the sustainability of the intervention.
Conflicts of Interest
The authors declare no conflicts of interest.
Author Contributions
Ying Lau, Kin Sun Chan, Patrick Cheong-Iao Pang, and Sai Ho Wong conceptualized and designed the study. Wei How Darryl Ang and Wen Wei Ang conducted a systematic literature search with the help of a senior librarian. Sai Ho Wong, Wen Wei Ang, and Ying Lau performed the title and abstract screening and data extraction and assessed the quality of selected studies. Ying Lau, Sai Ho Wong, Wei How Darryl Ang, and Wen Wei Ang conducted data management, data analysis, and data synthesis. Ying Lau supervised the systematic review and wrote the article. All authors have read and approved the final version of the article.
Funding
No funding was used in the study.
Acknowledgments
We acknowledge the senior librarian, Suei Nee Wong, for her support in developing the search strategy. We also thank the trial authors for providing supplementary data.
Supporting Information
Additional supporting information can be found online in the Supporting Information section.
Open Research
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.