Detection of Malicious Emails and URLs Using Text Mining
Heetakshi Fating
School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu, India
Aditya Narawade
School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu, India
Sandeep Kumar Satapathy
School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu, India
Shruti Mishra
School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu, India
Sachi Nandan Mohanty
School of Computer Science & Engineering, VIT AP University, Amaravati, Andhra Pradesh, India
Rajanikanth Aluvalu
Department of IT, Chaitanya Bharathi Institute of Technology, Hyderabad, India
Sarita Mohanty
Department of Computer Science, Odisha University of Agriculture & Technology, Bhubaneswar, India
Summary
This work builds a two-stage model: the first stage classifies an email as malicious or non-malicious, and the second stage checks whether an email flagged as non-malicious contains a malicious URL. For the URL model, features are engineered and then filtered using the information gain feature selection technique, while the email model relies on tokenization. For the combined model, a new dataset was created consisting only of non-malicious emails that contain a mix of good and bad URLs. Features were derived for it in the same manner as for the URL model, to determine whether each flagged non-malicious email was entirely benign or contained a malicious URL. For malicious URL detection alone, the Random Forest algorithm achieved the best accuracy, 80.7%; on the email dataset, Random Forest reached an accuracy of 98.9%. For the final combined model, the Support Vector Machine and Logistic Regression algorithms gave the highest accuracies, 81.88% and 81.49% respectively.
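As a rough illustration of the pipeline summarized above, the following sketch shows how information gain could rank URL features and how the two stages could be chained. It is a minimal sketch under stated assumptions, not the chapter's implementation: the four lexical features and their thresholds in `url_features` are hypothetical, and the two classifiers are passed in as plain callables standing in for whichever trained models are used.

```python
import math
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(email_text):
    """Return all http(s) URLs found in the email body."""
    return URL_PATTERN.findall(email_text)


def url_features(url):
    """Hypothetical lexical features; the chapter's actual feature set may differ."""
    return {
        "long_url": len(url) > 54,  # assumed threshold
        "has_ip_host": bool(re.match(r"https?://\d{1,3}(\.\d{1,3}){3}", url)),
        "has_at_symbol": "@" in url,
        "many_dots": url.count(".") > 3,  # assumed threshold
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG = H(labels) - sum over v of p(v) * H(labels | feature == v)."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


def rank_features(urls, labels):
    """Score every feature by information gain, highest first."""
    rows = [url_features(u) for u in urls]
    gains = {
        name: information_gain([row[name] for row in rows], labels)
        for name in rows[0]
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


def classify(email_text, email_is_malicious, url_is_malicious):
    """Two-stage cascade: URLs are checked only if the email itself passes."""
    if email_is_malicious(email_text):
        return "malicious email"
    if any(url_is_malicious(u) for u in extract_urls(email_text)):
        return "malicious URL"
    return "benign"
```

In practice, the features with the highest scores from `rank_features` would be kept for training the URL classifier, and `classify` would receive the prediction functions of the trained email and URL models.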