Detection of Malicious Emails and URLs Using Text Mining

Heetakshi Fating, Aditya Narawade, Sandeep Kumar Satapathy, Shruti Mishra

School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Tamil Nadu, India

Book Editor(s):

Sachi Nandan Mohanty, School of Computer Science & Engineering, VIT AP University, Amaravati, Andhra Pradesh, India

Rajanikanth Aluvalu, Department of IT, Chaitanya Bharathi Institute of Technology, Hyderabad, India

Sarita Mohanty, Department of Computer Science, Odisha University of Agriculture & Technology, Bhubaneswar, India

First published: 29 May 2023

https://doi.org/10.1002/9781119905172.ch7

Summary

This work builds a combined model from two classifiers: the first decides whether an email is malicious, and any email judged non-malicious is then analyzed further to check whether it contains a malicious URL. For the URL model, features are constructed and then filtered with the information gain feature selection technique, while the email model uses tokenization. For the combined model, a new dataset was created containing only non-malicious emails with a mix of good and bad URLs; features were constructed in the same manner as for the URL model to determine whether the emails flagged as non-malicious were entirely benign or contained a malicious URL. For malicious URL detection alone, the Random Forest algorithm achieved the best accuracy, 80.7%; Random Forest also achieved an accuracy of 98.9% on the email dataset. For the final combined model, the Support Vector Machine and Logistic Regression algorithms gave the highest accuracies among those tested, 81.88% and 81.49% respectively.
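
The two ideas named in the summary, information gain feature selection and the email-then-URL cascade, can be illustrated with a minimal sketch. This is not the chapter's implementation: the three lexical features, the toy URLs and labels, and the function names (`url_features`, `rank_features`, `classify_message`) are assumptions made up for illustration, and the two stage classifiers are passed in as plain callables rather than trained models.

```python
import math
import re
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) minus the weighted entropy of labels within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def url_features(url):
    """Hypothetical lexical features; the chapter's real feature set is not listed here."""
    return {
        "has_ip_host": bool(re.match(r"^https?://\d{1,3}(\.\d{1,3}){3}", url)),
        "has_at_symbol": "@" in url,
        "is_https": url.startswith("https://"),
    }


# Made-up toy data for illustration only: 1 = malicious, 0 = benign.
TOY_URLS = [
    ("http://192.168.10.5/login", 1),
    ("http://203.0.113.7/update", 1),
    ("http://example.com@evil.test/pay", 1),
    ("https://example.org/docs", 0),
    ("https://example.com/about", 0),
    ("http://example.net/news", 0),
]


def rank_features(samples):
    """Return (feature name, information gain) pairs, highest gain first."""
    labels = [label for _, label in samples]
    rows = [url_features(url) for url, _ in samples]
    gains = {name: information_gain([r[name] for r in rows], labels) for name in rows[0]}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


def classify_message(text, email_is_malicious, url_is_malicious):
    """Two-stage cascade: email check first, then URL check on messages that pass."""
    if email_is_malicious(text):
        return "malicious email"
    urls = re.findall(r"https?://\S+", text)
    if any(url_is_malicious(u) for u in urls):
        return "malicious URL"
    return "benign"
```

In a fuller pipeline, the features with the highest gain from `rank_features(TOY_URLS)` would be kept for the URL classifier, and `classify_message` would receive the trained email and URL models in place of the two callables.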

Evolution and Applications of Quantum Computing