Volume 2025, Issue 1, Article ID 6091900
Review Article
Open Access

Artificial Intelligence for Text Analysis in the Arabic and Related Middle Eastern Languages: Progress, Trends, and Future Recommendations

Abdullah Y. Muaad (Corresponding Author)
IT Department, Sana’a Community College, Sana’a, Yemen

Md Belal Bin Heyat (Corresponding Author)
CenBRAIN Neurotech Center of Excellence, School of Engineering, Westlake University, Hangzhou, China

Faijan Akhtar (Corresponding Author)
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China

Usman Naseem
School of Computer Science, University of Sydney, Sydney, Australia

Wadeea R. Naji
Department of Computer Science & Information Technology, Ibb University, Ibb, Yemen

Suresha Mallappa
Department of Studies in Computer Science, University of Mysore, Mysore, Karnataka, India

Hanumanthappa J.
Department of Studies in Computer Science, University of Mysore, Mysore, Karnataka, India
First published: 12 July 2025
Academic Editor: Mohamadreza (Mohammad) Khosravi

Abstract

In the last 10 years, the volume of Arabic text has risen sharply, which calls for more capable algorithms to efficiently understand and classify Arabic texts in many applications, such as sentiment analysis. This paper presents a comprehensive review of recent developments in Arabic text classification (ATC) and Arabic text representation (ATR), analyzing the effectiveness of various models and techniques. Our review finds that while deep learning models, particularly transformer-based architectures, are increasingly effective for ATC, challenges such as dialectal variations and insufficient labeled datasets remain key obstacles; developing suitable representation models and designing classification algorithms is still challenging for researchers, especially in Arabic. This survey provides a basic introduction to ATC, covering preprocessing, representation, dimensionality reduction (DR), and classification, along with common evaluation metrics. In addition, the survey includes qualitative and quantitative studies of existing ATC work. Finally, we conclude by exploring the limitations of existing methods and the open challenges related to ATC, helping researchers identify new directions for the field.

1. Introduction

Nearly 447 million people speak Arabic as their first language, making it one of the world’s most widely spoken languages, and it is regarded as the fourth official language of the United Nations (UN) [1, 2]. This growing user base increases the amount of Arabic textual data generated daily, and extracting information from such huge data is a challenging task, especially for Arabic text (AT). Therefore, there is a need to preprocess AT: remove words that carry no significant meaning, reduce words to their roots, and eliminate noise, thereby improving the performance of Arabic text classification (ATC) [3].

The process of cleaning and preparing text for further processing is known as preprocessing; it is the preliminary step in any text classification pipeline. Specific preprocessing methods and algorithms are required to extract useful patterns from unstructured Arabic textual data. Preprocessing for AT includes many techniques, such as white space removal, lemmatization, stemming, and stop-word removal. Several preprocessing techniques have been used to enhance the performance of ATC; however, most of the available techniques still cannot cover all the requirements for preparing AT for further processing, due to the complexity of AT [2].
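As a rough illustration of such a pipeline, the sketch below chains whitespace tokenization, diacritic and punctuation removal, and stop-word filtering in Python; the regular expressions and the use of NLTK's Arabic stop-word list are illustrative assumptions, not the pipeline of any particular cited work.

    # A minimal Arabic preprocessing sketch; the regex rules and NLTK resources
    # are illustrative, not the pipeline used by any specific ATC system.
    import re
    from nltk.corpus import stopwords  # requires nltk.download("stopwords")

    arabic_stopwords = set(stopwords.words("arabic"))

    def preprocess(text):
        text = re.sub(r"[\u064B-\u0652]", "", text)      # strip diacritics (tashkeel)
        text = re.sub(r"[^\u0621-\u064A\s]", " ", text)  # keep Arabic letters only
        tokens = text.split()                            # whitespace tokenization
        return [t for t in tokens if t not in arabic_stopwords]

    print(preprocess("هذا نص عربي للتجربة!"))  # e.g., ['نص', 'عربي', 'للتجربة']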

Representation and feature engineering (selection and extraction) form the second step in the ATC pipeline. The efficiency of subsequent natural language processing (NLP) tasks is strongly influenced by the quality of these techniques [4]. Representation is the process of converting unstructured text documents into a structured equivalent that machine learning (ML) algorithms [5, 6] can understand [7]. Several feature extraction techniques, including bag-of-words (BoW) [8], term frequency–inverse document frequency (TF–IDF) [9], term class relevance (TCR) [10], term class weight–inverse class frequency (TCW–ICF) [10, 11], symbolic representation, and N-gram features, have been used for feature representation. Text can also be represented at different granularities, such as character-level, word-level, and phrase-level representations [8, 9].

Most researchers have used TF–IDF or BoW, which are inherently problematic because they ignore word order and the semantic meaning of a sentence; different sentences can therefore receive the same vector if they contain the same words in a different order, for example, علي مدرس (Ali is a teacher) and أعلي مدرس؟ (Is Ali a teacher?). Although these techniques do not have problems with memory consumption for storage, they lose semantic meaning. To overcome these limitations, many other techniques have been proposed, for example, Word2Vec [10], GloVe (https://nlp.stanford.edu/projects/glove/) [11], and contextualized word representations [12].
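To make the word-order limitation concrete, the short sketch below (a toy example using scikit-learn; the two-document corpus is an assumption for illustration) shows that two sentences built from the same words in a different order receive identical BoW vectors.

    # Sketch: BoW ignores word order, so reordered sentences collide.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["علي مدرس",   # "Ali is a teacher"
            "مدرس علي"]   # same words, different order
    vecs = CountVectorizer().fit_transform(docs).toarray()
    print(np.array_equal(vecs[0], vecs[1]))  # True: identical BoW vectors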

Text categorization is the process of determining whether a text belongs to one of several predefined categories based on its meaning [13, 14]. Once the representation of a given text is achieved, a classifier is needed to assign AT to the various classes [15]. Many ML algorithms, such as decision trees (DT) [16, 17], naive Bayes (NB) [18, 19], support vector machines (SVM) [20, 21], and artificial neural networks (ANN) [21, 22], have been used for ATC. However, achieving high performance is still a real challenge. Therefore, in this survey, we attempt to perform a comprehensive taxonomy study of ATC to find the strengths and weaknesses of existing work.

Given the growing demand for accurate ATC in domains such as healthcare, finance, and e-commerce, it is essential to explore effective techniques, address linguistic challenges, and mitigate ethical concerns. This study aims to provide a comprehensive taxonomy survey of ATC, analyze existing approaches, and highlight open research challenges and future directions to improve the field. Research on ethical considerations and bias in ATC remains limited, making this a key future challenge for researchers. There is therefore a pressing need to advance research on fairness, transparency, and explainability in Arabic NLP systems, to ensure the development of more equitable and accurate models that meet the requirements of various real-world applications.

1.1. Motivation

Due to the increase in the amount of AT on social media, there is a need for a comprehensive study and analysis of the strengths and weaknesses of existing studies on ATC, which would help build efficient, effective, and robust algorithms to represent and classify AT. At the same time, the number of Arabic speakers has grown to more than 447 million, and the growth in the number of Arabic-speaking internet users (9348%) far exceeds that of English-speaking users (742.9%) (https://www.internetworldstats.com/stats7.htm). Therefore, developing tools and applications to handle AT has become mandatory. The following is a list of some motivations for this survey:
  • The increase in the number of Arabic users and in the volume of Arabic text generated in many domains, especially during COVID-19.

  • Many researchers still use traditional representation techniques, such as BoW, that do not work well with huge datasets.

  • Little research has been conducted on AT compared with other languages, such as English.

  • A lack of tools and applications for the Arabic language.

  • Non-Arabs who speak and use Arabic as a second language outnumber native speakers; studying these limitations and finding solutions to these problems and challenges will therefore help many people.

1.2. Contributions

The main contributions of this survey are mentioned in the following list:
  • A comprehensive review of available studies and existing surveys in ATC, focusing on their objectives, scopes, and research gaps.

  • An exploration of the architectures of ATC and ATR.

  • A comparative study of ATC stages such as preprocessing, representation, feature engineering, and classification.

  • A comparative study of seven ATC and ATR models to evaluate their performance through an experimental analysis using the AlKhaleej dataset.

  • A quantitative analysis of the proposed ATC techniques based on publication year and category.

  • A review of the available datasets and open-source libraries.

  • Implementation and discussion of seven models based on preprocessing, feature selection, feature extraction, and classification algorithms, such as NB and SVC, to evaluate their performance.

  • A qualitative analysis of ATC and ATR models based on their strengths and weaknesses.

  • An overview of current challenges and future research directions following the quantitative analysis.

While there have been several surveys on ATC and ATR, most of them focus on limited aspects, such as preprocessing techniques or specific classification algorithms. This work offers a broader perspective by providing a comprehensive taxonomy that encompasses all stages of ATC, including preprocessing, representation, dimensionality reduction (DR), and evaluation. Furthermore, it uniquely combines qualitative and quantitative analyses, offering deeper insights into the strengths and limitations of existing methods. Unlike previous works, this survey also emphasizes the challenges specific to the Arabic language, such as its complex morphology and dialectal variations, and provides actionable recommendations for overcoming them. Such a holistic approach has not been taken in the existing literature, making this study a novel and valuable contribution to the field. This article also aims to make it easier to learn cutting-edge ATC methodologies, and it identifies prospective research gaps, allowing researchers to choose their research routes; we believe it will broaden readers’ perspectives and open the path to new approaches.

1.3. Organization of the Paper

The organization of this survey is as follows: Section 2 studies and compares the existing surveys. Section 3 discusses the background and general architecture of the ATC model. The main steps of ATC are explored and analyzed in Sections 4, 5, 6, 7, and 8. The tools and open-source libraries are presented in Section 9. The quantitative analysis is highlighted in Section 10. The experimental analysis is presented in Section 11. The discussion and open challenges are highlighted in Sections 12 and 13. Finally, we conclude this survey in the conclusion section. For clarity, Figure 1 illustrates this taxonomy with a mind map diagram created in Lucidchart.

Figure 1: Mind map diagram for the taxonomy of Arabic text classification.

2. Existing Surveys

One of the pivotal goals of this article is to examine the existing surveys on ATC. Several reviews and survey articles have been published for ATC; however, most of them do not study each step of the pipeline individually. In this section, prior surveys on ATC are examined, assessed, and compared with this taxonomy, which is what motivated us to survey every step and distinguish this research from existing work. As shown in Table 1, there are various extant reviews and surveys on the state of ATC, but none of them considers all stages of the pipeline, as our study does.

Table 1. Comparative analysis of our survey with existing surveys in ATC.
Ref. Year Preprocessing Feature extraction Classification Qualitative analysis Taxonomy Experimental analysis Quantitative analysis Evaluation metrics
[23] 2016
[24] 2017
[25] 2017
[26] 2018
[27] 2019
[28] 2019
[29] 2019
[30] 2019
[31] 2020
[32] 2020
[2] 2021
[33] 2022
[34] 2023
This work

A critical review of the methodologies revealed that while traditional techniques such as BoW and TF–IDF excel in simplicity and efficiency, they struggle with sparsity and fail to capture semantic relationships in AT. Similarly, deep learning (DL) methods, particularly transformer-based models like BERT, show promising results but require substantial training data, which is often unavailable for dialectal Arabic. The reviewed studies highlight a recurring limitation: the inability of existing models to adapt to Arabic’s morphological complexity and dialectal diversity. Addressing these challenges necessitates the development of more context-aware models and larger annotated datasets.

3. General Architecture of ATC

This part describes the entire ATC workflow, as shown in Figure 2, together with a brief overview of preprocessing, representation, DR and feature engineering, and classification in Sections 3.1, 3.2, 3.3, and 3.4, respectively.

Figure 2: Arabic text classification architecture.

3.1. Preprocessing

The process of cleaning and preparing the text for subsequent processing is known as preprocessing; it is the initial step in the text categorization pipeline [35]. Tokenization, stop-word removal, and stemming are only a few of the methods for text preparation. Tokenization segments a document into tokens while removing white space and special characters. Stop words are common terms that serve a grammatical function but carry minimal meaning and do not reveal the subject matter; many other techniques exist as well [36].

3.2. Representation

Text representation is a crucial stage in any text classification model: it transforms unstructured text into structured documents that ML algorithms [37, 38] can understand. Text can be represented at different levels, such as the character, word, sentence, phrase, and document levels. Representation alone is not the whole story; feature engineering (selection and extraction) is also significant in making an ATC system work efficiently and effectively.

3.3. DR and Feature Engineering

DR is employed to reduce the dimensionality of the input feature space. There are various methods to reduce the size, such as feature selection (wrapper, embedded, and filter methods) as well as ensemble and hybrid techniques. DR can be applied during the preprocessing phase (e.g., stemming) or after representation (e.g., chi-square feature selection).
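As a minimal sketch of filter-based selection, the snippet below scores TF-IDF features with the chi-square statistic using scikit-learn; the toy corpus, labels, and the choice of k are illustrative assumptions.

    # Sketch of filter-based DR: keep only the highest-scoring terms by chi-square.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    docs = ["خبر رياضي عن كرة القدم", "تقرير اقتصادي عن الأسواق",
            "مباراة كرة السلة اليوم", "ارتفاع أسعار النفط"]
    labels = ["sport", "economy", "sport", "economy"]

    X = TfidfVectorizer().fit_transform(docs)        # sparse document-term matrix
    X_reduced = SelectKBest(chi2, k=4).fit_transform(X, labels)
    print(X.shape, "->", X_reduced.shape)            # e.g., (4, 14) -> (4, 4)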

3.4. Classification

Once the representation for a given text collection is created through an optimal set of representation and feature extraction techniques, a classifier has to be trained to learn to classify text into different classes [15]. There are many applications of text classification [39, 40], such as information retrieval (IR), sentiment analysis (SA), recommender systems, and hate speech detection. At the same time, text classification can be utilized in numerous domains, such as health, social sciences, and law [41, 42].
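A minimal sketch of such a trained classifier, assuming a TF-IDF representation and a linear SVM (one of the classical choices named above); the two-document training corpus is purely illustrative.

    # Sketch of a classical ATC pipeline: TF-IDF features + linear SVM.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_docs = ["مباراة كرة القدم انتهت بالتعادل",   # sport
                  "انخفضت أسعار النفط اليوم"]          # economy
    train_labels = ["sport", "economy"]

    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(train_docs, train_labels)
    print(clf.predict(["ارتفعت أسعار الذهب"]))  # expected: ['economy']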

4. Preprocessing

Preprocessing techniques prepare text for further processing by transforming unstructured text into structured data. Many techniques have been used for this task; Figure 3 and Table 2 explore these techniques based on work that has been done for ATC. Each preprocessing method in ATC has its advantages and disadvantages, impacting model performance in different ways. For instance, diacritic removal simplifies text representation and reduces data sparsity, but it may lead to ambiguity, as some words have different meanings depending on diacritics. Stemming and lemmatization help normalize words by reducing them to their root forms, improving generalization; however, stemming can be overly aggressive, cutting words too short and losing meaning, while lemmatization requires linguistic knowledge and is computationally expensive.

Tokenization, especially in Arabic, is challenging due to the absence of clear word boundaries in certain cases, which may lead to errors in splitting words. Stop-word removal helps reduce computational complexity and improve efficiency, but in some contexts stop words carry semantic importance, and their removal can affect classification accuracy. Normalization techniques, such as unifying different forms of Arabic letters (e.g., converting “ي” to “ى”), improve consistency but may lead to unintended modifications of certain words. Therefore, selecting the right preprocessing techniques requires balancing efficiency, linguistic integrity, and task-specific requirements to optimize the performance of ATC.

Figure 3: Preprocessing techniques for Arabic text classification.
Table 2. Comparative analysis of preprocessing techniques.
Ref. Year Objective Method Dataset Evaluation metrics
[43] 2001 Aim to improve normalization and stemming Normalization and stemming Precision
[44] 2002 To design new light stemmers Stemming Precision
[45] 2002 To improve retrieval effectiveness Light stemming approach Provided by the text retrieval conference Precisions and recall
[46] 2007 The goal is to contrast and compare two feature selection techniques. Light stemming vs. stemming Stem vectors and light stem vectors 15,000 documents for three classes F1-score
[47] 2008 To generate index words for AT documents Stemming and weight assignment technique, and an autoindexing method 24 arbitrary texts of different lengths Recall and precision
[48] 2008 To introduce a novel lemmatization algorithm Lemmatization House corpus Recall and precision
[49] 2008 Proposed a new method for stemming AT Stemming techniques
[50] 2008 Design a new stemming algorithm Stemming Arabic words with a dictionary Arabic corpus Accuracy
[51] 2009 Presents and compares three techniques for the reduction Height stemming and word clusters Create dataset Recall and precision
[52] 2010 Sought to determine the effect of 5 measures with two types of preprocessing for R document clustering The Information Science Research Institute stemmer 1680 documents Cosine, Jaccard, Pearson, Euclidean, and DAvg KL
[53] 2010 To create an efficient rule-based light stemmer Light stemmer for the Arabic language
[54] 2010 Aim to present a new dictionary-based Arabic stemmer Local stem The dataset contains 2966 documents Accuracy
[55] 2010 Aim to design Arabic morphological analysis tools Stemming and light stemming Open-Source Arabic Corpus Accuracy
[56] 2011 Aim to work with many techniques for ATC Stop-word removal 2363 documents Recall and precision
[57] 2011 Improved stemming to extract the stem and root of words Dictionary-based stemmer Collected Arabic corpus Accuracy
[58] 2012 Aim to increase accuracy 3 stemmers House corpus collected Accuracy
[59] 2012 Propose the first nonstatistically accurate Arabic lemmatizer algorithm that is suitable for information retrieval (IR) systems An accurate Arabic root-based lemmatizer for information The dataset contains 50 documents Accuracy
[60] 2013 Investigates the relevance of using the roots of words as input features in a sentiment analysis system Tashaphyne stemmer with ISRI stemmer and Khoja stemmer Penn Arabic Treebank with movie corpus Accuracy, recall, precision, and F1-score
[61] 2013 Aim to improve khoja Enhancement of khoja House corpus collected Accuracy
[62] 2014 Aim to design a model for the extraction of the word root Stemmer for feature selection CNN corpus from OSAC Recall, precision, and F1-score
[63] 2014 Aim to design a light stemmer Novel root-based Arabic stemmer Dataset consists 6081 Arabic words Accuracy
[64] 2014 Aim to design an analyzer for dialectal Arabic morphology Analyzer called ADAM SAMA databases
[65] 2014 Aim to compare studies for stemming Khoja stemmer with chi-square CNN corpus from OSAC Recall
[66] 2015 To study and compare the effect of three stemmer algorithms Root extractor, light, and khoja stemmer Arabic WordNet F1-score
[67] 2015 To improve stemming P-stemmer P-stemmer House corpus collected F1-score
[68] 2015 Aim to root extraction using transducers and rational kernels Root extraction Saudi Press Agency dataset Accuracy, recall, precision, and F1-score
[69] 2015 To introduce a new stemming technique Approximate stemming Accuracy and F1-score
[70] 2015 To build a new Arabic light stemmer A new algorithm for light stemming The dataset consists of 6225 Arabic words Accuracy
[71] 2016 To improve accuracy by designing feature selection Normalization and stemming techniques Dataset 1, dataset 2, and dataset 3 collected from the website https://www.aljazeera.net Accuracy, recall, precision, and F1-score
[72] 2016 To study the Khoja stemmer and the light stemmer stemming algorithm Normalization, root base stemming, and light stemming approaches Create a dataset with 750 documents Recall, precision
[73] 2016 To design a software tool for AT stemming Light stemmer
[74] 2016 Aims to highlight the effect of preprocessing tasks on the efficiency of the Arabic DC system Stemming techniques House corpus collected F1-score
[75] 2016 Aim to study a fast and accurate segmenter Arabic segmenter
[76] 2017 To review stemming ATs Effective Arabic stemmer
[77] 2017 To implement a new Arabic light stemmer Light stemmer ARASTEM dataset Using Paice’s parameters
[78] 2017 To design a new morphological model based on regular expressions Morphological model Some Surat from the Holy Quran False positive and false negative rate
[79] 2017 Evaluation study among several preprocessing tools in Arabic TC Among several preprocessing tools Alj-News Dataset and Alj-Mgz Dataset F1-score
[80] 2018 To design the FS technique and improve the accuracy Improved chi-square Open-Source Arabic Corpora (OSAC) and (CNN) Precision, recall, and F1-score
[81] 2018 Conduct a comparative study about the impact of stemming algorithms Stemming CNN-Arabic site and contains 5070 Recall
[82] 2019 To study different stemmers: ARLStem, Information Science Research Institute (ISRI), and Tashaphyne Stemming CNN-Arabic site, containing 5071 documents F1-score
[83] 2019 Aim to extract a root by processing word-stemming levels to remove all additional affixes Root extraction and stemming Collection of 350 documents Accuracy
[84] 2019 Aims to review the state of the retrieval performance of Arabic light stemmers Light stemmers TREC data Accuracy
[85] 2019 To a novel method that detects not only domain-independent stop words Stop word Corpus combines 1261 Facebook comments, 781 tweets, and 32 reviews F1-score
[86] 2020 To discuss the impact of the light stemming algorithm on text classification Study the effects of the light stemming BBC Arabic dataset Recall, precision
[87] 2020 To discuss the impact of a stemming algorithm on word embedding representation Stemming techniques ANT version 1.1 and SPA corpus F1-score
[88] 2021 Design a new method to prepare and analyze the AT Normalization, such as shape repeated letters, non-normal words, and spelling mistakes Collect data character
[33] 2024 Studies how ATC works on hate speech Many methods Survey

4.1. Tokenization

It is the process of segmenting a given text into small units. Alyafeai et al. proposed three novel text tokenization algorithms for AT [36].

4.2. Linguistic Preprocessing

It refers to additional preprocessing such as part-of-speech tagging, which is applied to get additional information about the content of the text, for instance, ADIDA, MADAMIRA, etc. [89].

4.3. Stop-Word Removal

It refers to the elimination of words that do not give meaning to the text. Auxiliary words, prepositions, conjunctions, modal words, and other high-frequency words in diverse publications are all examples of stop words [82].

4.4. Normalization

When the data consist of multiple documents in various formats, normalization transforms them into a standard format such as “.txt”. When the data are a single document, normalization makes all words take the same form, using techniques such as stemming. In general, normalization is driven by rules or regular expressions [71].
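The sketch below expresses normalization rules of the kind described here as regular expressions; the specific rules (diacritic removal, alef unification, squeezing repeated letters) are common conventions chosen for illustration, not the exact rule set of any cited work.

    # Illustrative rule-based normalization for Arabic text.
    import re

    def normalize(text):
        text = re.sub(r"[\u064B-\u0652]", "", text)  # remove diacritics
        text = re.sub(r"[أإآ]", "ا", text)           # unify alef variants
        text = re.sub(r"ى", "ي", text)               # unify alef maqsura / yaa
        text = re.sub(r"ة", "ه", text)               # taa marbuta -> haa
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # squeeze repeated letters
        return text

    print(normalize("رااااائع"))  # -> "راائع"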

4.5. Lemmatization

Lemmatization reduces a word to its simplest form by replacing the suffix or prefix of a word with a different one or removing the suffix or prefix from the word utilizing lexical knowledge [90, 91].

4.6. Stemming

Text stemming is the process of reducing inflected or derived words to their common canonical form. For example, the Arabic words مدرس (teacher), مدرسه (school), and يدرس ((he) studies) can all be reduced to the root درس [90]. There are various types of stemming: root-based stemmers such as Khoja, light stemmers such as Larkey, and statistical stemmers such as N-gram stemmers, as shown in Figure 3.
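As a quick check, the example words can be passed through NLTK's implementation of the ISRI root stemmer; the outputs shown in the comments are the expected roots, though exact results depend on the stemmer's rule set.

    # Root stemming of the example words with NLTK's ISRI stemmer.
    from nltk.stem.isri import ISRIStemmer

    stemmer = ISRIStemmer()
    for word in ["مدرس", "مدرسه", "يدرس"]:       # teacher, school, (he) studies
        print(word, "->", stemmer.stem(word))    # each expected to yield درس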

Larkey and Connell [43] implemented and improved normalization and stemming methods for AT; in addition, they created a dictionary and expanded queries for AT with no prior knowledge of the language. Larkey et al. [44] further developed several light stemmers based on heuristics as well as statistical stemmers for Arabic retrieval; a morphological stemmer that sought to locate the root of each word proved more successful for cross-language retrieval than the best light stemmer. Duwairi et al. applied different FS approaches to an Arabic corpus, comparing stemming and light stemming and concluding that light stemming improves classification accuracy. Three feature reduction methods based on stemming, light stemming, and word clusters were proposed with KNN as the classifier [51]. Mohd et al. described the influence of several metrics, such as cosine similarity, the Jaccard coefficient, Pearson correlation, Euclidean distance, and averaged Kullback–Leibler divergence, on document clustering algorithms with two forms of morphology-based preprocessing [52]. Mansour et al. [47] proposed an autoindexing method for IR that creates index words for AT documents while applying different grammatical rules to extract stems. Al-Shammari and Lin [48] introduced a novel lemmatization algorithm for AT and argued that lemmatization is a superior word normalization approach to stemming. Al-Shargabi et al. [56] applied different preprocessing methods, compared the performance of the SVM, NB, J48, and SMO classifiers, and concluded that SMO outperformed the others.

Hadni et al. [58] implemented an effective hybrid approach for ATC that is reported to supersede the Larkey, Khoja, and N-gram stemmers. Oraby et al. [60] studied the effect of stemming methods on Arabic SA; their accuracy results were 93.2%, 92.6%, 92.6%, and 92.2% for the Tashaphyne stemmer, ISRI stemmer, and Khoja stemmer, respectively. Bahassine et al. [62] studied the effect of the origin stemmer and Khoja’s stemmer on Arabic document classification, using CHI statistics to reduce the number of selected features; their proposed stemming method outperformed Khoja’s stemmer. Al-Kabi et al. [63] proposed a new light stemmer for AT; the empirical evaluation indicated that its accuracy is higher than that of the two well-known Arabic stemmers used as baselines. Salloum and Habash [64] presented an analyzer for dialectal Arabic morphology whose performance is comparable to an Egyptian dialectal morphological analyzer. Yousif et al. [66] presented an ATC system based on NB with a conceptual representation built on Arabic WordNet, assessing the impact of three stemming algorithms: a light stemmer, the Khoja stemmer, and a best-performing root extractor.

Kanan and Fox [67] developed a taxonomy for Arabic news with automatic classification techniques using binary SVM classifiers and a novel Arabic light stemmer called P-Stemmer. Nehar et al. [68, 71] enhanced ATC utilizing an improved feature set, including the BoW and term-frequency approaches, with the frequency ratio accumulation method as the classifier, and provided a new approach to root extraction based on an Arabic pattern stemmer for classifying AT. Nasef and Jakovljević [73] presented the categorization of AT using stemming; their software is based on an open-source version of the Lucene-based light stemmer for Arabic and allows stemming and categorization into 12 classes. Mustafa et al. [76] presented an extensive survey on Arabic stemmers. Abainia et al. [77] designed a unique Arabic light stemmer based on new principles for smartly removing prefixes, suffixes, and infixes; it is also the first work to address the irregular forms of Arabic infixes.

Bahassine et al. [80] improved the accuracy of Arabic document categorization using FS approaches based on IG, MI, and CHI. Boukil et al. [81] proposed classifying Arabic documents using stemming techniques for feature extraction and KNN as a classifier. Alhaj et al. employed various stemmers, including the Information Science Research Institute (ISRI), Tashaphyne, and ARLStem stemmers, for ATC, with SVM as the best-performing classifier. They further studied Arabic document classification utilizing light stemming with FE techniques such as BoW and TF–IDF; moreover, different FS methods, such as CHI, IG, and singular value decomposition (SVD), were used to select the most relevant features [82, 86]. Belal proposed a word-level stemming system that extracts the root by removing all additional affixes; when matching between a word and proper names is available, the affixes are removed using patterns and rules based on root dictionaries [83]. Ouahiba and Othman reviewed the performance of various Arabic light stemmers and concluded that Light10 is the best-performing stemmer [84]. Almuzaini and Azmi [87] discussed the effect of stemming strategies and word embedding on Arabic document classification with different DL models, including CNN, CNN–long short-term memory network (LSTM), gated recurrent units (GRU), and attention-based LSTM, investigated with the Word2Vec representation algorithm. Al-Shammari and Lin produced a novel method for stemming Arabic documents called the educated text stemmer, using stemming weight as an assessment measure to compare the new method’s performance with that of the Khoja stemming algorithm [49]. Ayedh et al. [74] investigated the influence of preprocessing tasks on the efficiency of Arabic document categorization using three classification approaches: NB, KNN, and SVM. Al-Kabi [61] highlighted the flaws in the Khoja stemmer and achieved about 5% improvement in accuracy by adding missing patterns. Nehar [69] developed a novel stemming approach known as “approximate stemming,” based on the usage of Arabic patterns with transducers and without relying on any dictionary. Aljlayl and Frieder proposed rule-based light stemming and demonstrated that it performs better than a root-based algorithm [45]. Kchaou and Kanoun [50] proposed a method for stemming AT that works similarly to Khoja’s strategy, except that it uses two dictionaries, one for roots and another for radicals, addressing the handicapped roots and radicals in Khoja’s stemmer.

Kanan et al. proposed a novel light stemmer for AT and demonstrated its effectiveness in improving search in IR [53]. Al-Shammari proposed a context-dependent stemmer that does not rely on a dictionary and improved ATC by utilizing a new free Arabic stemmer dictionary [54]; the proposed stemmer was compared with root-based and light stemmers and outperformed them. Alhanini and Aziz proposed an improved stemmer for extracting the stem and root of Arabic words to address the shortcomings of light stemming and dictionary-based stemming; however, it does not address the issue of broken (irregular) plurals [57]. El-Shishtawy [59] proposed a nonstatistical lemmatizer that uses several Arabic knowledge resources to produce accurate lemma forms and relevant features that can be utilized in IR systems. Abdelali et al. [75] proposed Farasa, an Arabic segmenter based on SVM ranking with linear kernels whose performance is comparable to the state of the art. Said et al. reviewed several preprocessing tools in ATC and compared raw text against many techniques, such as the Al-Stem stemmer, the Sebawai root extractor, and the RDI MORPHO3 stemmer [79].

Elghannam [92] created a new technique for identifying the domain of a corpus by detecting both domain-independent and domain-dependent stop words. Othman et al. developed a new framework based on regular expressions and Arabic grammar rules to extract and recognize the syntactic analysis of an Arabic sentence [78]. Hegazi et al. [88] designed an approach that provides a framework for building effective applications for analyzing and processing AT on social media.

5. Representation and Feature Engineering

ML algorithms cannot understand unstructured text as humans do unless it is represented in terms of numbers. Hence, text representation is the process of converting unstructured text into a structured equivalent that ML algorithms can understand and interpret. One of the most effective approaches to text representation is word embedding, which captures the semantic and syntactic relationships between words in a continuous vector space; traditional techniques such as BoW and TF–IDF treat words as discrete entities and fail to capture contextual meaning. Machine-readable representations of text can be constructed using various methods. Figure 4 and Table 3 first explore the different levels of representation and then the different feature extraction techniques.

Figure 4: Representation and feature extraction techniques for Arabic text classification.
Table 3. Comparative analysis of representation techniques.
Ref. Year Objective Method Dataset Evaluation metrics
[93, 94] 2008 Aim to use ML for AT documents classification Dice measure for classification and representing by trigram frequency statistics Arabic documents corpus Precision, recall
[95] 2010 Aim to explore the sentiment of AT at two levels: document and sentence Design a novel grammatical approach and semantic orientation of words, documents, and sentences at the document and sentence level 44 documents Accuracy
[56] 2011 To make a comparison of different text classification algorithms Stop-word removal 2363 documents Recall and precision
[96] 2012 Propose a conceptual representation for AT representation Chi-square Corpus of Arabic texts built by Mesleh Precision, recall, and F1-score
[97] 2013 Aim to represent AT using rich semantic graph Graph A small dataset that contains three paragraphs
[98] 2014 Aim to design an algorithm by combining bag-of-words and the bag-of-concepts TF and TF–IDF Arabic 1445 dataset and Saudi newspapers (SNP) dataset Accuracy, recall, precision, and F1-score
[99] 2015 Aim to propose four models for text sentiment classification in Arabic Bag-of-words word embeddings LDC ATB dataset F1-score
[100] 2015 Aim to explore the efficient of word N-grams N-grams Saudi Press Agency dataset Accuracy
[6, 101] 2015 Aim to represent words as vectors while minimizing cosine error Word embeddings: CBOW, Skip-Gram, GloVe Collected ATs Root mean square error and Pearson’s correlation
[102] 2016 To use cosine similarity for ATC Latent semantic indexing (LSI) 4000 documents on 10 topics Accuracy
[7] 2016 Aim to solve binary classifiers and detect subjectivity Word embeddings Collect datasets to create word representations Accuracy
[103] 2016 Aim to study sentiment polarity from the AT Word embeddings Word2Vec 3.4 billion-word corpus. Accuracy
[104] 2016 Aim to explore the character level for discriminating between similar languages and dialects Character-level DSL 2016 shared task Accuracy and F1-score
[105] 2017 Aim to design a new graph-based algorithm for ATC Graph Essex Arabic summaries corpus Recall, precision, and F1-score
[106] 2016 Aim to prove document embeddings better than text preprocessing methods Word vectors and Doc2Vec model BBC, CNN, OSAC, and Arabic Newswire LDC Precision, recall, and F1-score
[107] 2017 Aim to propose pretrained word representation for AT Word embeddings (AraVec) Different resources: Wikipedia, Twitter, and Common Crawl webpages (word embedding) None
[108] 2017 Aim to use various models for word representations to classify AT (CBOW, Skip-Gram, and GloVe) Two datasets: SemEval 2017 and ASTD F1-score
[109] 2017 Aim to work on three problems for Arabic sentiment analysis Word embedding with Word2Vec Syria Tweets dataset Accuracy recall, precision, and F1-score
[110] 2017 Aim to propose a study that minimizes the high dimension TF–IDF Corpus of sport news Precision, recall, and F-measure
[111] 2018 Aim to utilize deep learning for Arabic sentences classification Word embeddings Essex Arabic summaries corpus (EASC) None
[112] 2018 Aim to design graph model for document Graph Arabic dataset Precision, recall, and F1-score
[113] 2018 Aim to distinguish the 5 dialects using char-level representation Character level ADI dataset for the shared task Accuracy and F1-score
[114] 2018 Aim to propose a new representation technique TCR–ICF Collect a new dataset Accuracy
[115] 2018 Aim to study of several word embedding models is conducted, including GloVe, CBOW, and Skip-gram GloVe and Word2Vec Many datasets such as OSAC, LABR, and Abu El-Khair corpus
[116] 2018 Aim to compare pretrained vectors of the word for AT Word embedding (WE) models Collected from Twitter Accuracy of 93.5% with AraFT
[117] 2018 Aim to use word representation for sentiment analysis Word2Vec Language Health Sentiment Dataset Accuracy
[118] 2018 Aim to use term weighting and multiple reducts Term weighting 2700 documents for 9 classes Recall, precision, and F1-score
[119] 2019 Aim to create word embedding models ARWORDVEC models ASTD and ARASENTI Accuracy and F1-score
[92] 2019 Aim to create a new bigram alphabet approach Bigram alphabet Arabic dataset Aljazeera News. Accuracy
[120] 2019 Aim to introduce N-gram embeddings N-gram embeddings Using many western and eastern Arabic datasets Accuracy, precision, recall, and F1-score
[121] 2019 Aim to study word embedding for text representation Char level Merge many datasets Accuracy
[122] 2019 Aim to design an algorithm for a combined document embedding representation Word sense OSAC Precision, recall, and F1-score
[123] 2019 Aim to propose a new representation model based on N-gram N-gram DOSC and HARD datasets Accuracy, precision, recall, and F1-score
[124] 2019 Aim to introduce a graph-based semantic representation model Graph ArbTED Accuracy precision, recall, and F1-score
[125] 2020 Aim to find a technique for the proposed technique by reducing the high dimensionality TF–IDF CNN dataset and Alj-News5 dataset Precision, recall, and F1-score
[126] 2020 Aim to introduce Doc2Vec and machine learning approaches PV–DM and PV–DBOW Five Arabic datasets Accuracy and F1-score
[127] 2020 Aim to use transfer learning as a new technique for representation BERT HARD; ASTD; ArSenTD-Lev; LABR:AJGT Accuracy and F1-score
[128] 2020 Aim to create embeddings vector based on word and character Character and word embeddings TASK Pearson correlation coefficient
[121] 2020 Aim to apply transfer learning for emotion analysis in Arabic Character-level representation Hotel reviews and 1012 tweets Accuracy
[129] 2020 Aim to study the impact of the BERT model on formal and informal AT BERT Two created datasets F1-score
[130] 2020 Aim to study word-level representations to tackle the Romanized alphabet of Tunisian Word2Vec Accuracy
[131] 2020 Aim to study Arabic opinion mining using a different type of representation Unigram, bigram, and trigram HTL and LABR datasets Accuracy
[132] 2020 Aim to use pretrained word embedding for Arabic sentiment ARAVEC and FastText library Arabic Gold Standard Twitter Data for sentiment analysis (ASTD) ROC curve
[133] 2020 Aim to classify text utilizing fine-tuned Word2Vec Word2Vec Movie review dataset Accuracy
[134] 2021 Aim to represent text at the word level and investigate an efficient bidirectional LST for classification Word embedding ASTD ArTwitter LABR MPQA Precision, recall, and F1-score
[135] 2021 Aim to introduce a contextual semantic embedding representation BERT OSAC Accuracy and F1-score
[136] 2020 Aim to propose a model for representation embeddings at the different levels Character, word, and sentence embedding IMDB movie dataset Accuracy, precision, recall, and F1-score
[137] 2024 This work combines the trained Arabic language model ARABERT with the potential of long short-term memory (LSTM) ARABERT 4071 Arabic audio clips Accuracy, word error rate, character error rate, BLEU score, and perplexity

The basic unit of language is the word, from which phrases, sentences, and documents are built. Because of this, word-based representations are the most critical research direction, since the number of distinct words in any language is huge compared to the number of characters.

5.1. Representation Based on Character-Level Methods

Character-level representation is a way of representing text data in which each character is treated as a separate unit of analysis, as opposed to word-level or sentence-level representation, where words or entire sentences are the units. It is commonly used in NLP tasks such as language modeling, TC, and machine translation. In this approach, each character in a text is mapped to a unique numeric representation using techniques such as one-hot encoding or embedding. One advantage of character-level representation is that it can handle out-of-vocabulary (OOV) or rare words not present in a predefined vocabulary, since each character can be mapped to a unique representation even if the word has never been encountered before. However, character-level representation may not capture the semantics of words or phrases and may require more computational resources than word-level or sentence-level representation. Character-level embedding starts by dividing each Arabic word into its basic letter forms and encoding each letter separately. There are two ways to represent text at the character level: encoding every letter alone or using N-grams of one, two, or three characters. The following subsections present the existing work on these representations.

5.1.1. N-Gram Embeddings

N-gram-level embedding divides each Arabic word into its basic letter forms and encodes groups of two or three letters. Petasis et al. [138] proposed a model that deals with high dimensionality for ATC using trigram frequency to represent text; their results demonstrated that trigram text categorization is effective. Al-Thubaity et al. [100] used a neural network to map English vectors from Arabic vectors, developed continuous representations that capture semantic and syntactic features, and tested these vectors using intrinsic and extrinsic evaluations. Elghannam et al. [92] proposed a novel bigram character-based method to represent text for a TC system and evaluated it on the Aljazeera News dataset. Mulki et al. [120] proposed a model that uses N-gram embedding for sentiment in many Arabic dialects. Saeed et al. [123] represented text using the N-gram method in numerous classification algorithms, including rule-based and ML algorithms, to detect spam in Arabic opinion texts. Elzayady et al. [131] proposed an SA model employing CNN for FS and RNN for classification; the method did not address the issue of OOV terms.

5.1.2. Character-Level Embeddings

Character-level embeddings separate each Arabic word into its basic letter forms and then encode each letter separately. Belinkov et al. represented text at the character level using a CNN to distinguish between similar languages and dialects [104]. Ali proposed a CNN-based model to distinguish five dialects of the Arabic language [113]. Omara et al. used a CNN-based model for SA at the character level; the model was further evaluated for emotion identification [121].
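The two encodings discussed in Sections 5.1.1 and 5.1.2 can be sketched in a few lines of Python; the `<` and `>` boundary markers follow the FastText convention and are an illustrative choice.

    # Character-level units vs. character n-grams for one Arabic word.
    def char_ngrams(word, n):
        padded = "<" + word + ">"                 # boundary markers
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(list("محمد"))             # character level: ['م', 'ح', 'م', 'د']
    print(char_ngrams("محمد", 3))   # trigrams: ['<مح', 'محم', 'حمد', 'مد>']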

5.2. Word-Based Embeddings

Word representation refers to the process of encoding words as numeric vectors or embeddings, which can be processed by ML algorithms for various NLP tasks. Word embedding tokenizes a sequence of words at the word level and assigns a vector to each word. In the following subsections, state-of-the-art word embedding methods are discussed.

5.2.1. Weighted Words

At the word level, there are many techniques for representing text using weighted words, such as TF–IDF, which map each word to its number of occurrences in the corpus. The main types are as follows:
  • BoW: BoW is a feature extraction technique that ignores word order in a text document. Al-Radaideh and Al-Abrat proposed a model based on term weighting for ATC that reduces the number of terms used to generate the classification rules [118]. Alahmadi et al. proposed combining BoW with bag-of-concepts to handle semantic relationships between words; still, the approach suffers from sparse matrices and complex preprocessing and does not solve problems such as OOV [98]. Al Sallab et al. proposed three DL models for sentiment classification in AT, each using a different representation method, such as BoW; their experiments were carried out on the LDC ATB dataset [99]. Alnawas introduced Doc2Vec with ML for SA of AT, proposing continuous vector representation models computed with the PV–DM and PV–DBOW architectures; these vectors were used to train four popular ML methods: LR, SVM, KNN, and RF [126].

  • TF–IDF: TF–IDF assigns more weight to less common words in a document. Mahmood and Al-Rufaye applied and improved text mining by decreasing dimensions using k-means clustering algorithms [110]. Al-Taani et al. proposed an FCM approach to classifying AT by lowering the dimensionality of the representation; they employed SVD for DR, which has significant disadvantages such as high time complexity, a high-dimensional space, and a lack of consideration of the semantic level [125].

  • TCW–ICF: TCW–ICF is a newer representation method that has been used for ATC. It works like term frequency but weights terms with respect to classes instead of individual words. Guru et al. proposed TCW–ICF, a novel term weighting scheme for ATC; their method improves results by applying DR [114], and all of their experiments were conducted on a dataset they created (see the weighting sketch after this list).
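The sketch below conveys the general idea of class-based weighting: a term is weighted by its frequency within a class and the inverse of the number of classes containing it. This is a simplified approximation written for intuition, not the published TCW–ICF formulation [114].

    # Rough sketch of class-based term weighting in the spirit of TCW-ICF:
    # weight(t, c) = tf(t, c) * log(num_classes / class_freq(t)).
    import math
    from collections import Counter, defaultdict

    docs = [("خبر كرة القدم", "sport"),
            ("أسعار النفط ترتفع", "economy"),
            ("مباراة كرة السلة", "sport")]

    tf = defaultdict(Counter)            # term frequency per class
    for text, label in docs:
        tf[label].update(text.split())

    cf = Counter()                       # number of classes containing each term
    for label in tf:
        cf.update(set(tf[label]))

    weights = {(t, c): tf[c][t] * math.log(len(tf) / cf[t])
               for c in tf for t in tf[c]}
    print(weights[("كرة", "sport")])     # discriminative term: 2 * ln(2) ≈ 1.39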

5.2.2. Word Embedding

Word embedding converts words to vectors, which can be context-dependent or context-independent. We explore the existing work as follows (a training sketch follows this list):
  • Context-Independent Word Embeddings: In this representation, the meaning of surrounding words is ignored; examples include Word2Vec, GloVe, and FastText.

  • Word2Vec: In 2013, Mikolov et al. at Google introduced the Word2Vec model. It has two architectures, continuous BoW (CBOW) and Skip-Gram, both of which learn a dense vector for each word. Some researchers have used the following methods for representation.

  • Altowayan et al. represented text and created word embeddings for SA tasks in Arabic. They used embedded features for binary classifiers to detect standard and dialectal AT and presented word embedding as an alternative way to extract features for Arabic sentiment classification, with word embeddings of AT as the primary source of features; two types of AT were detected using this representation [7]. Dahou et al. detected sentiment polarity in Arabic reviews and social media, studying corpora from two domains: reviews and tweets [103]. Soliman et al. introduced a pretrained distributed representation called AraVec and released it as open source to support the research community; their model handles syntactic and semantic relations among words [107]. Al-Azani and El-Alfy designed an SA model that tackles three problems: microblogging data, imbalanced classes, and dialectal Arabic; an oversampling technique addressed the imbalanced dataset problem [109]. Sagheer and Sukkar presented the classification of Arabic sentences using CNN models with a representation embedding layer, using AraVec as a pretrained model [111]. Alwehaibi et al. implemented SA for AT using an LSTM model on Arabic tweets, assessing the impact of already-available pretrained word vectors; the experimental findings suggest that the LSTM–RNN model produces acceptable results [116]. Alayba et al. described how they constructed Word2Vec models from a large Arabic corpus obtained from 10 newspapers in different Arab countries; different ML algorithms and a CNN with various FS methods were applied to the health sentiment dataset, increasing accuracy from 91% to 95% [117]. Fouad et al. developed ArWordVec, an effective word embedding built from Arabic tweets, and created a new approach for detecting word similarity; the experimental results suggested that the ArWordVec models outperform previously available models on Arabic Twitter data, and various models (CBOW, Skip-Gram, and GloVe) were applied to obtain the embeddings [119]. Messaoudi et al. presented different word representations with different DL models (CNN and BiLSTM), without any preprocessing step, and showed that CNN with M-BERT reached the best results [130]. Sharma et al. proposed a model to clean the data thoroughly and generate word vectors from a pretrained Word2Vec model [133]. Elfaik and Nfaoui (2021) proposed an ATC model that represents text at the word level and investigated BiLSTM to improve SA of AT; the F1-score was 79.41 on the LABR dataset, but preprocessing and time complexity were high, and the character level, which may solve some problems for Arabic, was not used [134].

  • GloVe: GloVe is an unsupervised learning algorithm for obtaining strong vector representations of words [90]; the approach is similar to Word2Vec. M. A. Z. et al. investigated the effective representation of N-grams as features for ATC; their experiments used the SPA dataset [101]. Gridach et al. implemented various word representation models, such as CBOW, Skip-Gram, and GloVe, utilizing two datasets called ASTD and SemEval [108]. Suleiman and Awajan studied various word embeddings (GloVe and Word2Vec) for representing AT and concluded that Word2Vec outperforms the others [115].

  • FastText: Facebook’s AI Research Lab released FastText, a novel word embedding method that addresses the representation issue by representing each word as a bag of character N-grams. For example, given the word “محمد” and n = 3, FastText produces the character trigrams <مح, محم, حمد, مد>. Kaibi et al. introduced NuSVC classifiers to classify AT using the AraVec and FastText word embedding representations, combining both models by concatenating their vectors and evaluating the result with accuracy metrics [132].

  • Context-Dependent: In this type of representation, the meaning of the context is included; the representation depends on the context of the sentence, which more closely simulates human understanding.

  • AraBERT: Antoun et al. implemented new transfer learning to classify AT. This model, called AraBERT, achieves for Arabic what BERT achieves for English text; they compared multilingual BERT with AraBERT [127]. Chowdhury et al. studied the effect of the BERT model on a mixture of formal and informal texts, applying new Arabic transfer learning to short-text datasets, and showed that it generalizes better than the alternatives [129]. El-Alami et al. presented embedding representations that handle semantic context to improve ATC; this type of representation solves many complex problems, and they implemented and compared their work with AraBERT [135].

  • MarBERT: Abdul-Mageed et al. presented two powerful Transformer-based models built specifically for Arabic, trained on large-to-massive datasets covering different domains [139].
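For the context-independent models above, a minimal training sketch with the gensim library looks as follows; the toy corpus and hyperparameters are assumptions. Context-dependent models such as AraBERT are, by contrast, usually loaded as pretrained checkpoints through the Hugging Face transformers library rather than trained from scratch.

    # Training toy Word2Vec and FastText embeddings with gensim (4.x API).
    from gensim.models import FastText, Word2Vec

    sentences = [["علي", "مدرس"], ["محمد", "طالب"], ["علي", "طالب", "مجتهد"]]

    w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    ft = FastText(sentences, vector_size=50, window=2, min_count=1)

    print(w2v.wv["علي"].shape)   # (50,)
    print(ft.wv["مدرسة"].shape)  # OOV word still gets a vector via char n-grams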

5.3. Document-Level Methods

Mahdaouy et al. introduced a classification system that represents texts and documents in a vector space; their unsupervised document representations carry implicit relationships and semantics between words [106].
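A minimal sketch of this kind of unsupervised document-level representation with gensim's Doc2Vec; the two tagged documents and the hyperparameters are illustrative assumptions.

    # Unsupervised document vectors with Doc2Vec.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(words=["علي", "مدرس"], tags=[0]),
              TaggedDocument(words=["أسعار", "النفط", "ترتفع"], tags=[1])]

    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)
    vec = model.infer_vector(["محمد", "مدرس"])  # vector for an unseen document
    print(vec.shape)                            # (50,)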

5.4. Sentence-Level Methods

Sentence representations are used in many natural language tasks. They aim to encode the semantic information of a whole sentence into a real-valued vector, which can improve understanding of the context of the text. Farra et al. examined Arabic sentiment at two levels, document and sentence, and concluded that the work done for Arabic is still limited; they studied a novel grammatical method and the semantic orientation of words [95].

5.5. Representation Based on Hybrid Methods

Hybrid methods merge more than one text representation method, exploiting an advantage of one method and a different advantage of another. Al-Anzi et al. proposed TC for AT and compared several methods, employing SVD to decrease the dimension and reduce the number of features [102]. El-Alami et al. presented a two-phase method of document embedding and sense disambiguation to improve accuracy, running several experiments on the Open-Source Arabic Corpora dataset; limitations include the sparse TF–IDF representation, complex preprocessing (especially the Khoja stemmer), and a lexicon that covers only part of the vocabulary, which is inappropriate for Arabic given its rich vocabulary and rare words [122]. Alharbi et al. designed a model to classify social media microblogs using word and character representations, presenting a new technique that joins different levels of word embedding [128]. El-Affendi et al. developed a novel multilevel DL model that uses a simple positional binary embedding scheme to compute contextualized embeddings at the character, word, and sentence levels simultaneously; the model is also shown to achieve new state-of-the-art accuracies on two multidomain problems [136].

5.6. Representation Based on Graph Methods

Representing text as a graph is one of the essential preprocessing steps in data and text mining in many domains, such as TC. The graph representation approach represents text documents as graphs to capture features such as semantics [124, 140]. El Bazzi et al. implemented a system to classify documents using a graph model for representation, studying the impact of semantic relations between text tokens on the documents [112]. Ismail et al. presented a system to summarize and classify AT using a rich semantic graph (RSG), a suitable method that supports the development of the Arabic language [97]. Hadni and Gouiouez proposed a new graph approach for representing and classifying AT, accomplished using BabelNet knowledge [105]. Etaiwi and Awajan introduced a graph representation to classify AT, evaluated using different metrics such as precision, accuracy, recall, and F1-score [124].
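As a minimal illustration of the idea, the sketch below builds a simple co-occurrence graph over adjacent tokens with networkx; published graph-based ATC methods construct far richer semantic graphs (e.g., RSG or BabelNet-based) on top of this basic structure.

    # A simple token co-occurrence graph as a text representation.
    import networkx as nx

    tokens = ["علي", "مدرس", "في", "مدرسه", "كبيره"]

    G = nx.Graph()
    for a, b in zip(tokens, tokens[1:]):  # edge between adjacent tokens
        G.add_edge(a, b)

    print(G.number_of_nodes(), G.number_of_edges())  # 5 4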

6. DR

Representation of text in vector space models (VSMs) such as BOW has several limitations, for example, sparse matrices. These methods are expensive in terms of time complexity and memory utilization. Many researchers have applied DR to limit the size of the feature space and address this limitation. Existing DR methods used in AT categorization are discussed in this section and shown in Figure 5 and Table 4.

Figure 5: Dimensionality reduction techniques for Arabic text classification.
Table 4. Comparative analysis of dimensionality reduction technique.
Ref. Year Objective Method Dataset Evaluation metrics
[141] 2007 Aims to implement an SVM with chi-square Chi-square Arabic data Precision, recall, and F1-score
[142] 2007 Aims to explore the effectiveness of different feature selection methods Chi-square Arabic data Precision, recall, and F1-score
[51] 2008 Aim to introduce three feature reduction techniques and compare them Cluster with stemming 15,000 documents Precision and recall
[143] 2009 Aims to study the impact of the NB algorithm with the chi-square Chi-square SPA Recall, precision, and F1-score
[144] 2011 Aim to study a feature reduction algorithm Feature selection synonyms merge House Arabic documents F1-score
[96] 2012 Propose a conceptual representation for AT representation Chi-square Corpus of Arabic texts built by Mesleh Precision, recall, and F1-score
[145] 2012 Aim to introduce the LDA (latent Dirichlet allocation) algorithm LDA (latent Dirichlet allocation) House corpus of ATs F1-score
[146] 2013 This thesis introduces a new algorithm for feature selection called binary particle swarm optimization The feature selection process, the filter wrapper approach Akhbar-Alkhaleej, Arabic Alwatan, Al-Jazeera-News Arabic Recall, precision, and F1-score
[147] 2014 Aim to improve the AT categorization system by reducing the dimension Radial basis function House Arabic documents Precision, recall
[148] 2014 Proposes a new method for ATC in which a document is compared with predefined documents, using the chi-square measure TF–IDF and chi-square House containing 1090 documents
[101] 2015 Aim to improve accuracy by representing a word and decreasing the cosine error Word embeddings CBOW, SKIP-G, GloVe Collect home data
[106] 2016 Aim to prove that representation is better than text preprocessing method Word vectors and Doc2Vec BBC, CNN OSAC corpora2, Arabic Newswire LDC Precision, recall, and F1-score
[110] 2017 Aim to propose a study that minimizes the features TF–IDF 200 sports news corpus Precision, recall, and F1-score
[80] 2018 Aim to improve the chi-square Improve chi Open-Source Arabic Corpus (OSAC) Precision, recall, and F-measure
[149] 2018 Aim to investigate one of the most successful classification algorithms which are C4.5. Chi-square and symmetric uncertainty Arabic dataset Precision, recall
[150] 2018 Aim to propose a new feature selection method Feature selection Open-Source Arabic Corpus (OSAC) Precision, recall, and F1-score
[151] 2019 The proposed feature selection approach improves the accuracy Feature selection Precision, recall, and F1-score
[152] 2019 Propose a solution for the main problem, a large number of involved features Feature selection
[153] 2019 Aim to compare three dimensionality reduction methods (PCA, SVD, and NMF) PCA, SVD, and NMF Two linguistic corpora for English and Arabic
[154] 2019 Aim to design a method for feature selection Feature selection NN, BBC, and OSAC
[155] 2019 Aim to introduce hybridization feature set methods Hybridized feature set Dark Web Forum Portal F1-score and accuracy
[156] 2020 Aim to improve the feature selection method by merging the chi-square and artificial bee colony Hybrid BBC F1-score
[157] 2020 Aim to improve and enhance the wrapper FS called the binary grey wolf optimizer Grey wolf optimizer Alwatan, Akhbar-Alkhaleej, and Al-Jazeera-News Precision, recall, and F1-score
[158] 2012 Aim to strengthen AT categorization system utilizing feature selection Synonyms merge technique House Arabic documents F1-score

6.1. Feature Selection

In general, FS has three categories known as embedded, wrapper, and filtering techniques, but in TC, filters are preferred because of the large number of features. Mesleh applied a TC system using SVM with CHI and suggested other FS algorithms for future work [141]. Mesleh et al. collected an in-house dataset and used six FS techniques for ATC; based on their experiments, they noted that FS is beneficial in increasing the accuracy of ATC [142]. Duwairi discussed three feature reduction approaches to improve accuracy for AT, comparing stemming, light stemming, and word clustering [51]. Bahassine et al. developed a new method for AT classification, applying CHI to improve classification accuracy and decrease feature space size [80]. Larabi Marie-Sainte and Alalyani implemented SVM and FS methods to study ATC, a combination that, due to the complexity of Arabic, had not previously been explored intensively; their experiments were evaluated using metrics such as precision, recall, and F1-score [150]. Rashid et al. implemented FS to increase the accuracy of ATC systems, evaluated with precision, recall, and F1-score [151]. Belazzoug et al. showed that FS is important in enhancing the ATC system; they used BoW for representation, where the main problem was the large number of features [152].
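
The following minimal sketch illustrates filter-based FS with the chi-square test, in the spirit of the CHI-based works above; the toy corpus, labels, and value of k are illustrative assumptions.

```python
# Chi-square filter FS: score each term against the class labels and keep
# only the k most informative terms before training any classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["مباراة كرة القدم", "أسعار النفط ترتفع", "هدف في المباراة"]
labels = [0, 1, 0]                                # 0 = sports, 1 = economy

X = CountVectorizer().fit_transform(docs)         # term-count matrix
X_reduced = SelectKBest(chi2, k=3).fit_transform(X, labels)  # keep top 3 terms
```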

6.2. Feature Extraction

Mohamed applied a new algorithm for extracting features and decreasing the dimension. Principal component analysis (PCA), non-negative matrix factorization (NMF), and SVD were used with clustering approaches. Finally, he evaluated the three well-known techniques to demonstrate the advantages and disadvantages of each [153].
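
A brief sketch applying the three techniques to the same TF–IDF matrix might look as follows; the corpus and number of components are illustrative, and this is not Mohamed's exact protocol [153].

```python
# SVD and NMF accept sparse input directly; PCA requires a dense matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, NMF, TruncatedSVD

docs = ["نص أول", "نص ثان", "نص ثالث", "نص رابع"]
X = TfidfVectorizer().fit_transform(docs)

svd_out = TruncatedSVD(n_components=2).fit_transform(X)   # sparse-friendly
nmf_out = NMF(n_components=2).fit_transform(X)            # non-negative parts
pca_out = PCA(n_components=2).fit_transform(X.toarray())  # needs dense input
```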

6.3. Optimization

Compared to the other DR techniques, only a few optimization-based works have been explored for ATC. Chantar et al. designed a new wrapper-based FS method to improve TC, called the grey wolf optimizer (GWO) [157].
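
Wrapper-based FS evaluates candidate feature subsets with a classifier. The deliberately simplified sketch below uses random binary masks as a stand-in for the GWO (or BPSO) search strategy, so it shows only the wrapper loop, not the optimizer itself; the toy data are fabricated.

```python
# Wrapper FS loop: each candidate mask is scored by cross-validating a
# classifier on the selected columns; a metaheuristic would guide the search.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((40, 12))                   # toy feature matrix
y = rng.integers(0, 2, 40)                 # toy binary labels

best_mask, best_score = None, -1.0
for _ in range(50):                        # GWO/BPSO would replace this loop
    mask = rng.integers(0, 2, 12).astype(bool)
    if not mask.any():
        continue
    score = cross_val_score(LinearSVC(), X[:, mask], y, cv=3).mean()
    if score > best_score:
        best_mask, best_score = mask, score
```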

6.4. Hybrid

Sabbah and Selamat presented a hybrid FS method to improve the TC system. They represented text using TF–IDF and applied further techniques, such as PCA, to decrease the dimension [155]. Hijazi et al. created a novel FS technique that combines the artificial bee colony (ABC) algorithm and CHI in two phases: CHI, which is quick and easy to use, followed by ABC [156]. Chantar et al. proposed an ATC system using KNN and SVM to classify text, hybridized with binary particle swarm optimization (BPSO) to select features [157]. Thabtah et al. introduced TC using the NB algorithm based on CHI feature selection, evaluated with metrics such as macro-F1, recall, and precision [143].

Hussein and Awadalla presented a TC system using different classification algorithms, in which dimensionality was reduced by combining synonyms as a semantic feature selection method [144]. Karima et al. proposed a conceptual representation for ATR, using AWN to map terms to concepts [96]. Zrigui et al. likewise presented a conceptual representation for ATC in which AWN maps terms to concepts [145]. Saad et al. developed a new strategy for reducing the number of features by merging semantic synonyms, enhancing ATC [158]. Zaki et al. proposed an Arabic document system based on traditional models, applying N-grams with TF–IDF representation techniques [147]. Abu-Errub implemented TF–IDF representation techniques to classify documents into the right class, using the CHI method for FS [148].

7. Classification Models

Once representation and optimal feature selection have been carried out for a given text, choosing the classifier is a crucial task in ATC [15]. Many classification algorithms implemented in the ATC literature are shown in Figure 6 and Table 5. One of the significant challenges in applying ML to low-resource languages such as Arabic is the limited availability of high-quality labeled datasets. Unlike widely studied languages such as English, Arabic suffers from data scarcity, particularly in specialized domains. Furthermore, the complexity of Arabic morphology, including rich inflection, derivation, and agglutination, poses additional difficulties in feature extraction and representation. Dialectal variations across regions further complicate text classification, as models trained on modern standard Arabic (MSA) may struggle to generalize across dialects. Additionally, the lack of standardized preprocessing techniques and annotated corpora makes it challenging to fine-tune models effectively. Addressing these issues requires transfer learning approaches, data augmentation techniques, and hybrid models that leverage both supervised and unsupervised learning to enhance performance in low-resource NLP tasks.

Figure 6: Classification techniques for Arabic text.
Table 5. Comparative analysis of classification techniques.
Ref. Year Objective Method Dataset Evaluation metrics
[159] 2004 Aim to apply Arabic web documents classification using NB Naïve Bayes Collected 300 web documents per category Accuracy
[160] 2006 Aim to present a system for ATC Rocchio classifier Collected data corpus
[94] 2006 Aim to identify foreign words using three-classification method Lexicons for AT Collected dataset
[13] 2007 Aim to apply three algorithms for AT text classification techniques KNN, Rocchio, and Naïve Bayes 1445 document Accuracy, precision, recall, and F1-score
[161] 2008 An implementation classification using a recognized statistics technique SVM Different Accuracy
[162] 2008 Investigated different vector space models and used the KNN algorithm KNN Collected F1
[22] 2009 Aim to classify Arabic documents using artificial neural networks SVD and neural networks Hadith corpus Accuracy, precision, recall, and F1-score
[163] 2011 Proposed to classify documents using lexicon and k-NN K-nearest NONE Precision, recall, and F1-score
[164] 2012 Aim to apply different rule-based classification algorithms Rule-based, DT (C4.5), rule induction (RIPPER), hybrid Published corpus Rule-based
[165] 2012 Aim to compare six well-known classifiers after applying feature selection. Naive Bayes without fs and maximum entropy with information gain Arabic datasets Precision, recall, and F1-score
[146] 2013 Aim to apply feature selection to improve accuracy The feature selection process, the filter wrapper approach Akhbar-Alkhaleej, Arabic Alwatan, Al-Jazeera-News Arabic dataset Precision, recall, and F1-score
[166] 2014 Aim to improve accuracy by using a different classification algorithm SVM, NB, and C4.5 Using Arabic Wikipedia Precision, recall, and F1-score
[167] 2014 Implemented the k-nearest neighbor (KNN) algorithm KNN Dataset contains 621 documents Precision and recall
[66] 2015 An implementation of a Naive Bayesian classifier for classification Naive Bayesian classifier BBC Arabic corpus
[9] 2016 Aim to classify text using a graph-based approach KNN, Rocchio, and Naïve Bayes algorithms Corpus of 1084 documents F1-score
[168] 2016 Aim to classify AT utilizing a hybrid method Conditional random field and LSTM NONE Precision, recall, and F1-score
[169] 2017 Aim to classify AT documents using a different algorithm Rules, NB, LR, and AdaBoost with bagging CNN BBC OSAC Accuracy
[108] 2017 Aim to use DL for sentiment analysis CBOW, Skip-Gram, and GloVe ASTD and SemEval 2017 datasets. F1-score
[170] 2017 Aim to use neural networks and SVM and compare them RNN HOTEL DATA Accuracy and F1-score
[8] 2018 Aim to implement convolutional neural network (CNN) to classify AT from large datasets CNN Large dataset collection Accuracy
[171] 2018 Aim to use a combination of CNNs and LSTMs CNN–LSTM Arabic health services (AHS) dataset Accuracy
[172] 2018 Aim to design architectures to improve accuracy CNN–LSTM Task 1’s datasets Accuracy
[173] 2018 Aim to classify text using different classification techniques KNN, Naïve Bayes, and SVM algorithms CNN dataset Precision, recall, and F1-score
[174] 2019 Aim to combine LSTM with CNN LSTM with CNN LABR, ASTD Accuracy
[175] 2019 Aim to classify documents using a convolutional GRU Many models Khaleej, Arabiya, and Akhbarona Accuracy
[176] 2019 Aim to classify Hadith document using different DT, RF, and NB DT, RF, and Naïve Bayes Hadith DATA Accuracy
[174] 2019 Aim to detect dialectal Arabic using deep learning LSTM, CNN LABR, ASTD Accuracy
[177] 2019 Aim to classify text using polynomial neural network Polynomial neural networks Arabic dataset Precision, recall, and F1-score
[178] 2019 Aim to classify text utilizing the narrow structure of CNN Narrow convolutional neural network Twitter datasets for dialect Accuracy, precision, and F1-score
[179] 2020 Aim to represent text as an image-based character to classify a document CNN1D They have created AWT and APD F1-score
[180] 2020 Aim to classify text based on deep auto encoder representations and bag-of-concepts A deep Autoencoder classifier OSAC Precision, recall, and F1-score
[181] 2020 Aim to classify AT documents by a combination of CNN and RNN CNN and RNN OSAC Precision, recall, and F1-score
[182] 2020 Aim to use CNN, LSTM, and their combination for classification CNN and LSTM OSAC F1-score
[183] 2020 Proposed methods to achieve very high accuracy using CNN CNN 15 different Accuracy
[184] 2020 Aim to use the CNN architecture with LSTM to classify AT CNN LABR ASTD ArTwitter Precision, recall, F1-score, and accuracy
[185] 2021 Aim to compare four machine learning algorithms in the task of ATC Artificial neural network, DT, and LR AJGT, ASTD, Twitter Precision, recall, F1-score, and accuracy
[186] 2021 Aim to classify AT utilizing two models, GRU and IAN-BGRU SVM, KNN, J48, and DT based on gated recurrent units and an interactive attention network based on bidirectional GRU Arabic hotel reviews dataset Precision, recall, F1-score, and ROC (%)

7.1. Rule-Based (Lexicon or Dictionary)

Rule-based classifiers make class decisions based on a set of “if…else” rules. Because these rules are simple to understand, such classifiers are commonly used to generate descriptive models. The condition used with “if” is referred to as the antecedent, and the predicted class of each rule as the consequent. Rule-based SA relies on rules produced by language experts: a set of rules (a lexicon or sentiment lexicon) according to which words are classified as either positive or negative. Dictionary-based (lexicon-based) SA uses lists of words, called lexicons, in which the words have been prescored for sentiment.

Different methods have been used under rule-based approaches, such as lexicons and dictionaries. ATC systems apply these rules with string comparisons of text for some tasks, though only a few researchers have used this method. Nwesri et al. introduced various algorithms to identify foreign words utilizing lexicons, patterns, and N-grams, showing that the lexicon approach was the best [94]. Thabtah et al. conducted in-depth research on ATC and evaluated the efficacy of different rule-based classification algorithms [164].
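
A toy lexicon-based sentiment classifier of the kind described above could be sketched as follows; the two-entry lexicons are fabricated purely for illustration.

```python
# Lexicon rule: count matched positive and negative words and compare.
POSITIVE = {"رائع", "ممتاز"}      # "wonderful", "excellent"
NEGATIVE = {"سيء", "ممل"}         # "bad", "boring"

def lexicon_sentiment(tokens):
    """Score = (#positive matches) - (#negative matches)."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment(["الفيلم", "رائع"]))   # -> positive
```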

7.2. Classification Using ML Algorithm

ML and DL approaches achieve state-of-the-art results on ATC. In this section, we explore the related work regarding ATC.

7.2.1. Probability

El Kourdi et al. studied a statistical ML algorithm based on NB to classify nonvocalized AT, evaluating the NB categorizer with cross-validation trials [159]. Yousif et al. applied NB to classify texts, utilizing WordNet for representation and comparing different stemmers [66]. Syiam et al. presented a Rocchio classifier for TC that outperformed KNN, and combined DR techniques such as stemming and FS to reduce the cost of the classification process [160].
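
A minimal NB text classification pipeline, in the spirit of the works above, might look like the following sketch; the two-document training set is an illustrative assumption.

```python
# TF-IDF features feeding a multinomial Naive Bayes classifier.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["مباراة كرة القدم اليوم", "ارتفاع أسعار النفط"]
train_labels = ["sports", "economy"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)
print(clf.predict(["مباراة كرة القدم غدا"]))   # -> ['sports']
```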

7.2.2. Nonprobability

  • Traditional ML: Al-Harbi et al. classified AT documents from seven corpora generated for AT using a recognized statistical technique; their method improved performance by combining FS with SVM and C5.0, and they concluded that C5.0 provides superior accuracy [161]. Mohammad et al. used a polynomial neural network in TC to produce successful outcomes [177]. Harrag and El-Qawasmah built a neural network for ATC with singular value decomposition to improve accuracy and reduce error [22]. Thabtah et al. studied different representation methods, such as term weighting approaches, with the KNN algorithm for classification, using the F1 evaluation metric in their comparison [162]. El-Halees studied combined approaches to classifying Arabic documents, applying three methods in sequence: lexicon, ME, and k-NN [163].

  • DL: Gridach introduced a new architecture that represents text at the character and word levels for named entity recognition. The problem of vanishing gradients arises for long sequences, particularly in tasks like text classification, making it difficult for models to learn long-range dependencies; the OOV problem also remains, because word-level embeddings cannot represent unseen words [168]. Abu Kwaik et al. investigated DL techniques to detect dialectal AT using word-level representations, reporting an accuracy of 81% on the LABR dataset and 85.58% on the ASTD dataset [174]. Abuhaiba and Dawoud proposed combining rules, followed by two classification stages, for ATC [169]. Gridach et al. proposed a DL system for SA using CBoW, Skip-Gram, and GloVe for representation [108]. Alayba et al. combined CNN and LSTM networks for Arabic sentiment categorization and, because of the complexity of Arabic morphology and orthography, investigated the usefulness of applying SA at various levels. Abdullah et al. described a system to detect and classify Arabic tweets utilizing word and document embeddings, with a CNN–LSTM combination for classification [172]. Elnagar et al. used Word2Vec embeddings trained on the Wikipedia corpus for text classification, reporting an accuracy of 91.18% achieved by a convolutional GRU on the SANAD corpus. However, normalization that replaces the letters (أ إ آ) with the letter (ا) can change the meaning in some cases; for example, فأر (meaning “mouse”) becomes فار (meaning “escaped”) [175]. Their work also filters all alphabets to decide whether they belong to Arabic, eliminating non-Arabic alphabets, which introduces confusion for text from other languages, like Urdu. Abu Kwaik et al. proposed a new model for TC that combines LSTM and CNN to detect dialectal AT [174]. Daif et al. presented a DL architecture for AT document classification using image-based characters, representing each Arabic character as a 2D image; they trained their model end to end with a class-weighted loss function to avoid the imbalance issue and produced the AWT and APD datasets to evaluate it [179]. El-Alami et al. proposed an AT categorization method based on bag-of-concepts and deep autoencoder representations that exploits explicit knowledge in semantic vocabularies using Arabic WordNet. Their method combines implicit and explicit semantics and reduces feature space dimensionality, achieving 94% precision and 93% F-measure. However, the method still suffers from complex preprocessing and limited vocabulary coverage, and it does not handle Arabic language ambiguity, which could be addressed with sense embedding techniques [123, 180]. Ameur et al. proposed a combination of CNN and RNN for AT document categorization using static, dynamic, and fine-tuned word embeddings. The CNN model automatically learns the most meaningful representations from the Arabic word embedding space; evaluated on the OSAC dataset, the hybrid model improved overall ATC performance compared with the individual CNN and RNN models. One limitation is normalization that changes some letters to another form and can alter the meaning; for example, كرة (meaning “football”) becomes كره (meaning “hate”) [181]. El-Alami et al. studied a hybrid of DL (CNNs and LSTM) that shows promise for huge datasets, resolving issues such as polysemous terms, and proposed a method for contextual meaning employing embeddings and word sense disambiguation [182]. Alhawarat and Aseeri suggested a CNN model for ATC that produced good results on 15 freely available datasets, although it takes a long time to train compared to classical ML approaches [183]. Ombabi et al. suggested a DL model for Arabic SA that combines a one-layer CNN architecture with two LSTM layers, with the input layer handled by word embeddings and FastText [184]. Al-Smadi et al. proposed an SVM approach that outperforms an RNN approach on Arabic hotel reviews [170]. Alali et al. suggested a CNN that utilizes representations to classify tweets, with a sensitivity study assessing the influence of different combinations of structural features [178].
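
To make the frequently used CNN–LSTM combination concrete, the following is a hedged Keras sketch; the vocabulary size, layer widths, and five-class output are illustrative assumptions rather than the settings of any cited work.

```python
# CNN-LSTM for text: the Conv1D layer extracts local n-gram features from
# token embeddings, and the LSTM models longer-range context before the
# softmax classification layer.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),  # token embeddings
    layers.Conv1D(64, 5, activation="relu"),            # local n-gram features
    layers.MaxPooling1D(2),
    layers.LSTM(64),                                    # long-range context
    layers.Dense(5, activation="softmax"),              # e.g., 5 categories
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Placing the convolution before the LSTM shortens the sequence the recurrent layer must process, which is one reason this ordering is common in the works above.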

7.3. Hybrid

This approach combines more than one method for text classification, for example, rule-based and ML algorithms, to achieve maximum effectiveness. Kanaan et al. demonstrated many classification algorithms for classifying AT, using NB, KNN, and Rocchio; NB was the most effective [13]. Alahmadi et al. proposed a categorization system for AT utilizing a hybrid technique, employing BoW and BoC representations to tackle the semantic problem; however, sparse matrices, preprocessing complexity, and the OOV problem remained unaddressed [166]. Bazzi et al. proposed a classification system using graph-based representation: each document in the collection is first represented as a graph, and term weighting is then performed on the document graph to estimate the significance of each term to the document [9]. Alhaj et al. presented a model for ATC using three classification algorithms, affected by two types of representation, BoW and TF–IDF, on the CNN Arabic corpus, and used CHI to remove unnecessary features [173]. Abdelaal et al. proposed a system for categorizing hadith into different classes based on content; the best three classifiers were DT (0.965), RF (0.956), and NB (0.951) [176]. Daher et al. introduced a simple approach for handling SA by extracting opinions from Arabic tweets using ML [185]. Abdelgwad et al. suggested DL based on GRU and an interactive BiLSTM network for classification [186].

This literature review comprehensively covers an extensive range of studies on ATC and ATR. By categorizing these studies into key themes such as feature extraction methods, classification techniques, and application areas, the review provides a structured understanding of the field. Furthermore, it analyzes recent trends, such as the shift from traditional ML models to DL [187] architectures, and explores underrepresented challenges like dialectal Arabic processing.

8. Datasets

ATC models have utilized various datasets, but only some are publicly available. In addition, one problem for ATC is the lack of a large benchmark dataset. In this section, we list datasets published for ATC, analyzing them by the number of documents, classes, and words, with references for all datasets, to help researchers, as illustrated in Table 6.

Table 6. Summary of dataset and corpus available.
Ref. Year Dataset name Class Word Document Remark Utilization Website
[188] 2010 CNN 6 2,241,348 5070 OSAC 14 https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/
[188] 2010 BBC 7 1,860,786 4763 OSAC 8 https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/
[188] 2010 OSAc 10 18,183,511 22,429 OSAC 15 https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/
[189] 2014 LABR 2&5 8,520,886 63,000 Sentiment analysis/classification 6 https://github.com/mohamedadaly/LABR
[190] 2019 Alkhaleej 7 45,500 SANAD 2 https://data.mendeley.com/datasets/57zpx667y9/2
[190] 2019 NADIA1 24 678,563 NADIA (multi label) 2 https://data.mendeley.com/datasets/hhrb7phdyx/2
[190] 2019 NADIA2 28 678,563 NADIA (multi label) 1 https://data.mendeley.com/datasets/hhrb7phdyx/2
[190] 2019 AKHBARONA 7 78,050 SANAD 2 https://data.mendeley.com/datasets/57zpx667y9/2
[190] 2019 ALARABIYA 6 71,247 SANAD 2 https://data.mendeley.com/datasets/57zpx667y9/2
[191] 2016 Abu El-Khair corpus 1,525,722,252 5,222,973 Corpus NA https://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus
[192] 2017 Tashkeela 75,629,921 Corpus 1 https://tashkeela.sourceforge.net
[8] 2018 M BINIZY 5 319,254,124 111,728 Document 1 https://data.mendeley.com/datasets/v524p5dhpj/2
[193] 2018 AL-HAJ 6 1000 Document 1 https://github.com/yalhag1/Alj-News-Arabic-text-classification-dataset
[194] Multiple datasets by Tamer Elsayed https://qufaculty.qu.edu.qa/telsayed/datasets/
[195] 2020 BRAD-Arabic 2 & 3 39,886,898 510,598 Sentiment analysis/ classification 2 https://github.com/elnagara/BRAD-Arabic-Dataset
[196] 2020 HARD-Arabic 2&3 8,520,886 373,750 Sentiment analysis/ classification 2 https://github.com/elnagara/HARD-Arabic-Dataset
[197] 2015 TALAA 8 14,068,407 57,827 Document https://github.com/saidziani/Arabic-News-Article-Classification
[198] 2022 Masader Document https://arbml.github.io/masader/
1 2018 Arabic corpus 1.9 B words Corpus https://archive.org/details/arabic_corpus
2 2020 arTenTen 10 B words Corpus https://www.sketchengine.eu/artenten-arabic-corpus/
3 GDELT project 9.5 B Corpus https://www.gdeltproject.org/

The field of ATC relies on a variety of datasets, each with unique features and limitations. For instance, the AraSenTi dataset is widely used for SA, containing tweets labeled for polarity. However, it is limited in linguistic diversity, focusing primarily on MSA. Similarly, OSACT datasets emphasize dialectal Arabic but often overrepresent Egyptian and Levantine dialects, introducing bias in model training. A critical evaluation of these datasets reveals common challenges, including unbalanced class distributions and the prevalence of informal text, such as social media posts with spelling errors and code-switching. These issues highlight the need for more comprehensive and diverse datasets to advance the field of ATC.

9. Tools and Open-Source Library

There are different tools and open-source libraries available for ATC models. In addition, one of the problems for ATC is the scarcity of open-source resources. In this section, we list some of these resources for ATC, with names and references, to support researchers, as illustrated in Table 7.

Table 7. Summary of available tools and open-source libraries.
Description Website
The “Rand” library has been launched to generate random ATs https://tahadz.wordpress.com/2020/08/10/arrand/
A specific Arabic language library for Python provides basic functions to manipulate Arabic letters and text https://pypi.org/project/PyArabic/
Fine-tuning BERT models for Arabic dialect detection https://github.com/issam9/finetuning-bert-models-for-arabic-dialect-detection
At QCRI, we are dedicated to promoting the Arabic language in the information age by conducting world-class research in Arabic language technologies https://alt.qcri.org/
Building open-source NLP libraries and tools for the Arabic language https://omdena.com/projects/nlp-arabic/
Arabic language support for Text Blob https://github.com/adhaamehab/textblob-ar
It can be used as a library (see the “Arabic stop words library” section) https://pypi.org/project/Arabic-Stopwords/
Search Gumar for millions of words from Gulf Arabic https://camel.abudhabi.nyu.edu/gumar/
IWAN strives to publish research that serves society and contributes to building a knowledge economy, through establishing a motivating environment, effective placement of technology, and effective local and international partnerships https://iwan.ksu.edu.sa/ar
Arabic NLP Survey Papers Repository (ASPR)—مستودع الأوراق المسحية في معالجة اللغة العربية (أسبر) https://github.com/iwan-rg/ArabicSurvey
The goal of this project is to create an Arabic benchmark for multitask learning, similar to the GLUE benchmark https://www.alue.org/home
The ARABIC NLP TOOLS CATALOGUE is a catalog of 64 tools added by 8 contributors https://arbml.github.io/adawat/
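
As a small usage example for one of the listed resources, the following sketch uses PyArabic's araby module to strip diacritics and tokenize a string; only these two basic calls are assumed.

```python
# PyArabic basics: remove diacritics (tashkeel), then split into word tokens.
from pyarabic import araby

text = "اللُّغَةُ العَرَبِيَّةُ جميلة"
plain = araby.strip_tashkeel(text)   # remove diacritic marks
tokens = araby.tokenize(plain)       # word tokens
print(tokens)
```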

10. Quantitative Analysis

The tables and figures included in this review enhance comprehension by summarizing complex information concisely, so in this section we present quantitative analyses of ATC and ATR. To begin the survey process, we first formulated key research questions focused on the effectiveness of various ATC methods and the challenges specific to AT. The survey was built on peer-reviewed studies from the last 5 years, ensuring that the most recent advances in the field were considered. We used a systematic review methodology, selecting studies based on their relevance to AT processing and evaluating them through a comparative analysis of the methodologies and datasets used; this procedure provided a structured approach to understanding the current state of ATC. In total, 179 articles were surveyed, and our study is organized by the main subcategory of each stage and by publication year. These analyses answer the following questions:
  • How many research articles were published in each subcategory (methods in each stage)?

  • How many research articles were published in the timeline for 2001–2021?

  • Which stage of the ATC models are studied most and the least?

  • What does the distribution of papers look like for each subcategory based on the methods used?

  • What does the distribution of papers look like for each subcategory based on the timeline?

  • What are the available datasets for ATC?

  • What are the advantages and disadvantages of ATC and ATR?

  • What challenges and restrictions do ATC and ATR still have for the future?

The 179 papers in our taxonomy are divided into four stages, and the percentage for each is illustrated in Figure 7. Of the total surveyed articles, including survey papers, 30.32% relate to representation, 29.68% to preprocessing, and 23.87% to classification, whereas the remaining 16.13% relate to DR.

Figure 7: Distribution of ATC surveys and paper publications.

The examined research papers in this taxonomy were quantitatively assessed based on their main category stages to address the aforementioned research questions. Then, each primary category was quantitatively examined by its subcategories. Finally, available datasets were qualitatively studied based on the number of documents, classes, words, and references. We observe that the most studied category is representation and the least studied is DR.

10.1. Preprocessing

In this subsection, we quantitatively analyze the reviewed preprocessing techniques based on their categories and timeline as follows.

10.1.1. Preprocessing Techniques Based on Categories

The total number of reviewed research papers related to preprocessing is 46, which is 29.68% of the total reviewed articles. Figure 8 shows the distribution of published papers among the preprocessing categories: tokenization, stop-word removal, stemming, lemmatization, and hybrid, at 50.55%, 2.2%, 39.56%, 2.2%, and 5.49%, respectively. Tokenization obtained the highest percentage (50.55%) because any text processing must first tokenize the text into characters or words, whereas lemmatization and stop-word removal, at 2.2% each, have the lowest number of publications; the hybrid category obtained 5.49%. Stemming comes second, so setting tokenization aside, it can be concluded that the stemming category has received the most research attention.

Figure 8: Using preprocessing techniques based on categories.

10.1.2. Preprocessing Techniques Based on Timeline

In this subsection, we quantitatively analyze the considered research papers based on the timeline. Figure 9 shows the distribution of published papers from 2001 to 2021. In the period 2001–2010, 28.26% of the papers were published; this span covers 10 years, half of the considered period, which is why it obtained the highest share. Among single years, 2015 and 2016 obtained 10.87% each, the highest percentage, while 2014, 2017, and 2019 share 8.7%, and 2012, 2013, 2018, and 2020 obtained the lowest value of 4.35% each. It can be concluded that preprocessing received the most attention in 2015 and 2016.

Figure 9: Using preprocessing techniques based on timeline.

10.2. Representation

In this subsection, we quantitatively analyze the reviewed representation methods based on their categories and timeline.

10.2.1. Representation Techniques Based on Categories

The total number of reviewed research papers related to representation is 47, which is 30.32% of the total reviewed articles. Figure 10 shows the distribution of published papers among the representation categories: character level, word, sentence, document, and hybrid, at 21.28%, 63.83%, 4.26%, 2.13%, and 8.51%, respectively. The word category obtained the highest percentage (63.83%), whereas the document level, at 2.13%, has the lowest number of publications; the sentence category obtained 4.26%. It can be concluded that the word category has received more research attention than the others.

Figure 10: Using representation techniques based on categories.

10.2.2. Representation Techniques Based on a Timeline

In this subsection, we quantitatively analyze the considered research papers based on the timeline. Figure 11 shows the distribution of published papers from 2001 to 2021. In the period 2001–2010, only 4.26% of the 47 papers were published, despite this span covering 10 years, half of the considered period; the highest percentage occurred in 2020. From 2011 to 2019, the numbers increased year after year, except that 2016 and 2017 were equal. It can be concluded that representation received the most attention in 2020.

Figure 11: Using representation techniques based on timeline.

10.3. DR

In this subsection, we quantitatively analyze the reviewed DR methods based on their categories and timeline.

10.3.1. DR Techniques Based on Categories

The total number of reviewed research papers related to DR is 25, which is 16.13% of the total reviewed articles. Figure 12 shows the distribution of published papers among the DR categories: feature selection, feature extraction, optimization, and hybrid, at 36%, 8%, 12%, and 44%, respectively. The hybrid category obtained the highest percentage (44%), whereas feature extraction, at 8%, has the lowest number of publications; the optimization category obtained 12%. It can be concluded that the hybrid category has received more research attention than the others.

Figure 12: Using dimensionality reduction techniques based on categories.

10.3.2. DR Techniques Based on the Timeline

In this subsection, we quantitatively analyze the considered research papers based on the timeline. Figure 13 shows the distribution of published papers from 2001 to 2021. In the period 2001–2010, 16% of the 25 papers were published; this span covers 10 years, half of the considered period, which is why it obtained the second-highest share after 2019. The year 2019 obtained 20%, the highest percentage, while 2011, 2013, 2015, 2016, 2017, and 2021 each scored 4%, the lowest; 2020 and 2012 share the same percentage, as do 2014 and 2018. Setting aside the aggregated 2001–2010 period, it can be concluded that DR received the most attention in 2019, followed by 2018.

Figure 13: Using dimensionality reduction techniques based on timeline.

10.4. Classification

In this subsection, we quantitatively analyze the reviewed classification methods based on their categories and timeline.

10.4.1. Classification Techniques Based on Categories

The total number of reviewed research papers related to classification is 37. Figure 14 shows the distribution of published papers among the classification categories: rule-based (lexicon), ML, and hybrid, at 5.41%, 72.97%, and 21.62%, respectively. The ML category obtained the highest percentage (72.97%), whereas the rule-based category, at 5.41%, has the lowest number of publications; the hybrid category obtained 21.62%. It can be concluded that the ML category has received more research attention than the others, especially with DL at this time.

Figure 14: Classification techniques based on categories.

10.4.2. Classification Techniques Based on the Timeline

In this subsection, we quantitatively analyze the considered research papers based on the timeline. Figure 15 shows the distribution of published papers from 2001 to 2021. In the period 2001–2010, 18.92% of the 37 papers were published; this span covers 10 years, half of the considered period, which is why it obtained the highest share. Among single years, 2019 and 2020 obtained 16.22% each, the highest percentage; a few years, such as 2011, scored only 2.7%, whereas 2012, 2013, 2014, 2018, and 2021 obtained different intermediate percentages. It can be concluded that classification received the most attention in 2019 and 2020.

Figure 15: Classification techniques based on timeline.

11. Experimental Analysis

We conducted an analysis of various ATC and ATR methods with different ML algorithms, which were experimentally implemented; performance was evaluated in terms of accuracy, precision, recall, and F-measure.

11.1. Metrics Evaluation

There are weighted objective metrics to evaluate the ATC system: recall, precision, accuracy, F1-score, Matthew’s correlation coefficient (MCC), and negative predictive value (NPV) were used [16, 199]. Writing TP, TN, FP, and FN for true positives, true negatives, false positives, and false negatives, the metrics are defined as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$\text{NPV} = \frac{TN}{TN + FN}$$
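
These definitions translate directly into code; the short sketch below computes them from illustrative confusion-matrix counts.

```python
# Metrics from confusion-matrix counts; TP/TN/FP/FN values are illustrative.
import math

TP, TN, FP, FN = 80, 90, 10, 20

recall = TP / (TP + FN)
precision = TP / (TP + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
f1 = 2 * precision * recall / (precision + recall)
npv = TN / (TN + FN)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
```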

12. Discussion

This section describes the findings from the qualitative and quantitative analyses. The qualitative investigation emphasized several results about existing ATC and ATR models, followed by general observations on the merits and demerits of available models. Recent advancements in transformer-based models, such as AraGPT, have shown great promise for Arabic text preprocessing. AraGPT, designed specifically for Arabic, uses attention mechanisms to capture the complexities of the language, including its rich morphology and dialectal diversity. Compared to traditional methods, it demonstrates superior performance in tasks like tokenization, normalization, and segmentation, all of which are crucial for ATC, and incorporating it into the preprocessing pipeline can significantly enhance the accuracy and robustness of ATC systems. This survey explores these advanced techniques and compares them with existing approaches in the field, mirroring similar work for other languages [200–203]. Finally, we discuss several observations as follows.

12.1. Qualitative and Quantitative Analysis

It is clear from Table 3 that many researchers have used ATR models. In addition, Table 5 investigates the existing ATC models and the steps taken to prepare them, while Table 4 covers the existing work on DR, which is smaller in volume than that on representation and classification. Prior research solved numerous issues, discussed in Section 10; on the other hand, DR has clearly received less attention than preprocessing and classification. The quantitative analysis highlights several observations about ATC publications by timeline and by stage, with 2020 the most productive year.

12.2. General Observations

Text classification involves many steps, as mentioned above, and at each stage many algorithms have been used. In our study, we focused on the two steps that most affect the classification task: representation and classification. We summarize our observations in Tables 8 and 9.

Table 8. Observation of representation technique.
Strength/weakness BOW TF–IDF W2V GLOVE GLOVET FAST CONTEXT
Easy to compute
Compute the similarity
Syntactic
Semantics
Capture polysemy
Capture out-of-vocabulary
Memory consumption
Work on only sentence level
Need on huge corpus to train
Context handling
Table 9. Observation of classification technique.
Strength/weakness RA BBA LRA NBA KNN SVM DT CRF RF DL
Easy to implement
Robust
Flexible with feature design
Expensive to train
Finding an efficient architecture is difficult
Is it a fast algorithm
Is it a black-box
Handle online learning
Parallel processing capability
Requires a large amount of data


12.3. Open Issues and Challenges

Although automatic text classification enjoys a rich literature, many challenges remain open to research, including the lack of lexicons, the lack of benchmark corpora, right-to-left reading, and compound phrases and idioms. More effort is needed to apply modern DL methods to ATC systems; we have explored the four AT steps (preprocessing, representation, DR, and classification) separately in Sections 4, 5, 6, and 7. Although some work has been done on ATC, the complexity of the Arabic language and the lack of tools, together with the increasing number of documents, make text processing and analysis a big-data problem, all of which shows that this topic remains a hot area for researchers. Furthermore, problems related to text in general, such as representation and feature extraction and selection, remain open research directions. In the following subsections, we highlight research gaps that can facilitate a deeper understanding of the ATC domain and improve these techniques.

12.4. Challenges Related to Dataset, Lexicons, and Dictionaries

  • The lack of publicly available free Arabic corpora.

  • Lack of available lexicons.

  • Lack of available dictionaries.

  • Lack of data augmentation techniques for AT.

12.5. Challenges Related to Preprocessing

  • The normalization process for some letters changes the meaning and affects accuracy; for example, “alif” (أ إ آ) is normalized to (ا), so فأر (meaning “mouse”) becomes فار (meaning “escaped”); see the sketch after this list.

  • It is difficult to find roots of some words such as Arabized words, which are translated from other languages, for example, programs (برامج).

  • In the Arabic language, one word may have more than one lexical category (noun, verb, adjective, etc.); for example, عين means both the human eye (عين الانسان) and a wellspring (عين الماء), which makes it difficult to understand the meaning of AT.

  • In the Arabic language, the problem of synonyms and broken plural forms is widespread, which makes it difficult to recognize and understand the meaning of such words.

  • The Arabic letter Hamzah or Hamza (ء) can be written in four different forms (أ, ؤ, ئ, ء), so it is prone to mistakes and misuse in many words.

  • Arabic nouns do not start with a capital letter as in English, which is another challenge for automatic AT processing, making it difficult to recognize nouns in Arabic.

  • Stemming problem of AT.
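
As referenced in the first bullet, the following minimal sketch shows how alif normalization, collapsing أ/إ/آ to ا, merges distinct words.

```python
# Alif normalization: a common preprocessing step that can change meaning.
import re

def normalize_alif(text):
    return re.sub("[أإآ]", "ا", text)

print(normalize_alif("فأر"))   # -> "فار": "mouse" becomes spelled like "escaped"
```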

12.6. Challenges Related to Representation and Feature Engineering

  • Curse of dimensionality and sparse vectors.

  • Finding techniques that handle the contextual meaning of AT.

  • The time and memory costs of new representation techniques such as BERT.

12.7. Challenges Related to Difficulties Nature of Arabic Language

  • Orthographic ambiguity and dialectal variation.

  • Many Arabic varieties are neglected, such as Khuzestan Arabic, Khurasan Arabic, Uzbekistan Arabic, the sub-Saharan Arabic of Nigeria and Chad, Djibouti Arabic, Cypriot Arabic, and Maltese.

12.8. Challenges Related to Related Topics

  • Mixed language problem.

  • Multimodal problem and multilanguage (mixed language) problem.

  • For instance, Persian and Urdu both utilize the extended Arabic script, incorporating additional letters such as “پ,” “چ,” “ژ,” and “گ” not found in standard Arabic. This often complicates applying Arabic-trained models to these languages without fine-tuning or transfer learning techniques.

  • Kurdish, especially in its Sorani dialect, and Pashto also use modified Arabic scripts and, like Arabic, suffer from issues such as:

    i. Lack of diacritics in standard writing

    ii. Highly inflected morphology

    iii. Ambiguous word boundaries

    iv. Scarcity of annotated corpora [204–206].

13. Conclusion

This study presented a comprehensive taxonomy review of ATC, focusing on two main parts. First, a detailed analysis of current ATC surveys based on their objectives, functions, and methods was carried out and compared with this study. Then, each topic was surveyed individually, including preprocessing, representation, and classification, and a quantitative analysis was performed for each stage. Finally, the study briefly described the current open research challenges and future directions of ATC systems. Many challenges remain open at every stage, and future research directions are promising, such as multimodal and multilanguage models, as well as difficulties arising from the nature of the Arabic language, including dialects, morphology, and stemming. Based on our understanding, this study helps the research community find gaps and challenges for ATC systems in real scenarios and encourages researchers to develop effective and efficient ATC models in domains such as healthcare, economics, business, and education. Ultimately, it serves as a valuable resource by identifying key gaps and challenges in ATC; by addressing these challenges and exploring innovative approaches, future research can significantly enhance the capabilities of ATC systems, making them more robust and adaptable to real-world applications. In addition to the technical challenges and advancements in ATC, it is crucial to consider the ethical implications of applying ML to AT: limited research has addressed bias and ethical considerations in this field. Future research should therefore focus on mitigating bias, ensuring dataset diversity, and developing explainable AI models to enhance fairness and accountability in ATC. Addressing these ethical considerations will be essential for building more trustworthy and responsible AI systems in this domain.

Nomenclature

  • ANN: Artificial neural networks
  • AT: Arabic text
  • ASA: Arabic sentiment analysis
  • ATC: Arabic text classification
  • ATR: Arabic text representation
  • CNN: Convolutional neural networks
  • DA: Dialect Arabic
  • DR: Dimensionality reduction
  • DT: Decision tree
  • GRU: Gated recurrent units
  • IR: Information retrieval
  • K-NN: K-nearest neighbor
  • LDA: Latent Dirichlet allocation
  • LR: Logistic regression
  • LSTM: Long short-term memory
  • LSVC: Linear support vector classifier
  • ML: Machine learning
  • NB: Naive Bayes
  • NMF: Non-negative matrix factorization
  • OOV: Out-of-vocabulary
  • OSAC: Open-Source Arabic Corpus
  • PCA: Principal component analysis
  • RSG: Rich semantic graph
  • SA: Sentiment analysis
  • SVC: Support vector classifier
  • SVM: Support vector machines
  • TCW–ICF: Term class weight–inverse class frequency
  • TF: Term frequency
  • TF–IDF: Term frequency–inverse document frequency
  • UN: United Nations
  • VSM: Vector space model
  • BoW: Bag-of-words
  • TCR: Term class relevance
  • ISRI: Information Science Research Institute
  • SVD: Singular value decomposition
  • GWO: Grey wolf optimizer
  • ABC: Artificial bee colony
  • BPSO: Binary particle swarm optimization
Conflicts of Interest

The authors declare no conflicts of interest.

Author Contributions

Abdullah Y. Muaad: data curation, formal analysis, visualization, validation, software, and writing – original draft; Md Belal Bin Heyat and Faijan Akhtar: conceptualization, formal analysis, investigation, project administration, and writing – original draft; Usman Naseem and Wadeea R. Naji: data curation, validation, software, and writing – review and editing; Suresha Mallappa and Hanumanthappa J.: conceptualization, funding acquisition, supervision, investigation, and writing – review and editing. All authors read and agreed to the publication.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Acknowledgments

The authors would like to thank Prof. Sawan, Prof. Naseem, Prof. Lai, Prof. Singh, and Prof. Wu for their valuable help and support throughout this work.

Data Availability Statement

The data that support the findings of this study are available within the article.
