Chapter 7

Detection of Malicious Emails and URLs Using Text Mining

Heetakshi Fating

School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu, India

Aditya Narawade

School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu, India

Sandeep Kumar Satapathy

School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu, India

Shruti Mishra

School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu, India

First published: 29 May 2023

Summary

This work builds a combined model from two classifiers: the first decides whether an email is malicious, and any email it passes as non-malicious is then analyzed further to check whether it contains a malicious URL. For the URL model, features are engineered and then filtered with information gain feature selection, while the email model relies on tokenization. For the combined model, a new dataset was created consisting only of non-malicious emails that contain a mix of good and bad URLs; features were created in a similar manner to the URL model to determine whether the emails flagged as non-malicious were entirely benign or contained a malicious URL. For malicious URL detection alone, the best accuracy, 80.7%, was achieved by the Random Forest algorithm; an accuracy of 98.9% was achieved on the email dataset, also with Random Forest. For the final combined model, the Support Vector Machine and Logistic Regression algorithms performed best, with accuracies of 81.88% and 81.49% respectively.
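The two-stage flow described in the summary can be sketched roughly as follows. This is only an illustration of the cascade logic, not the chapter's implementation: the names `extract_urls`, `classify_message`, `toy_email_model`, and `toy_url_model` are hypothetical, the URL regular expression is an assumption, and the two toy rules merely stand in for the trained email and URL classifiers (for example, Random Forest models).

```python
import re
from typing import Callable, List

# Assumed, deliberately simple URL pattern; real extraction would be stricter.
URL_RE = re.compile(r"https?://[^\s<>]+", re.IGNORECASE)


def extract_urls(text: str) -> List[str]:
    """Return every http(s) URL found in the email text."""
    return URL_RE.findall(text)


def classify_message(
    text: str,
    email_is_malicious: Callable[[str], bool],
    url_is_malicious: Callable[[str], bool],
) -> str:
    """Stage 1: email classifier. Stage 2: URL classifier on emails that pass."""
    if email_is_malicious(text):
        return "malicious email"
    # Only emails judged non-malicious reach the URL check.
    if any(url_is_malicious(url) for url in extract_urls(text)):
        return "malicious URL"
    return "clean"


# Hypothetical stand-ins for trained models, used only to exercise the flow.
def toy_email_model(text: str) -> bool:
    return "verify your account" in text.lower()


def toy_url_model(url: str) -> bool:
    return "@" in url or url.count(".") > 3


if __name__ == "__main__":
    for msg in (
        "Please verify your account now",
        "Notes are at http://a.b.c.d.example.com/x",
        "Lunch at noon? See https://example.com/menu",
    ):
        print(classify_message(msg, toy_email_model, toy_url_model))
```

Passing the two classifiers in as callables keeps the cascade independent of how each model is trained, so any fitted classifier with a yes/no prediction could be substituted for the toy rules.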
