Malicious URL Detection: An Evaluation of Feature Extraction and Machine Learning Algorithm

Yichen Wang

doi:10.54097/hset.v23i.3209

Keywords:

malicious URL detection; machine learning; feature extraction; learning algorithms.

Abstract

Cyber attacks are increasing rapidly and have a major impact on network security. Many cyber attacks are carried out via malicious Uniform Resource Locators (URLs), and various approaches have been developed to detect them. Machine learning and deep learning are among the most competitive of these techniques. However, the specific techniques for URL feature extraction and the choice of machine learning algorithm are still under development. This paper aims to provide a reference for selecting feature extraction methods and machine learning algorithms. In the designed experiment, the selected URLs are processed by two different feature extraction methods: (1) tokenization and vectorization, and (2) lexical feature selection. The resulting features form two different datasets (data1 and data2) for machine learning. Two traditional learning algorithms (Logistic Regression and SVM) and three ensemble learning algorithms (Random Forest, Gradient Boosting, and Bagging) are adopted as detection models for both datasets. The experimental results demonstrate that tokenization and vectorization for feature extraction, combined with ensemble learning algorithms, yields good predictive performance in malicious URL detection.
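The two feature extraction routes named in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the tokenizer, the vectorization scheme, or which lexical features were selected, so everything below is an assumption made for illustration: the delimiter set in `tokenize`, the plain bag-of-words counts in `vectorize`, the feature names in `lexical_features`, and the two sample URLs are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def tokenize(url):
    """Split a URL into lowercase tokens on any non-alphanumeric character.

    The delimiter set is an assumption; the paper's tokenizer may differ.
    """
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]


def build_vocabulary(urls):
    """Map each distinct token in the corpus to a column index (sorted order)."""
    tokens = sorted({t for u in urls for t in tokenize(u)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(url, vocab):
    """Bag-of-words count vector for one URL; unknown tokens are ignored."""
    counts = Counter(tokenize(url))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


def lexical_features(url):
    """A few hand-picked lexical features (a hypothetical selection)."""
    # urlsplit only fills in the hostname when a scheme is present.
    host = urlsplit(url if "://" in url else "http://" + url).hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_digits": sum(c.isdigit() for c in url),
        "num_special": sum(not c.isalnum() for c in url),
        "host_is_ipv4": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
    }


# Toy corpus (made-up URLs, not from the paper's dataset).
urls = ["http://example.com/login", "http://192.168.0.1/update.exe?id=7"]

vocab = build_vocabulary(urls)
token_matrix = [vectorize(u, vocab) for u in urls]   # route 1: tokenize + vectorize
lexical_rows = [lexical_features(u) for u in urls]   # route 2: lexical features

print(len(vocab), token_matrix[0])
print(lexical_rows[1])
```

Either feature table, paired with benign/malicious labels, could then be passed to any of the classifiers listed in the abstract. The token route produces a wide, sparse matrix whose width grows with the vocabulary, while the lexical route produces a small fixed number of columns.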
