Intrusion Detection System with Supervised Learning Models

Jihan Zheng

doi:10.54097/hset.v23i.3215

Keywords:

Intrusion Detection Systems (IDSs); network anomaly detection; machine learning algorithms; accuracy score; confusion matrix.

Abstract

Intrusion Detection Systems (IDSs) analyze network activity and detect abnormal behavior, flagging potential attacks based on patterns learned from past attacks. This paper uses four supervised machine learning methods (logistic regression, decision tree, support vector machine, and random forest) to detect such attacks. The dataset used in this paper is from KDDCUP’99, a publicly available dataset for network-based anomaly detection systems. Selected features and the outcome are first extracted from the dataset. The values of the nominal features are converted into dummy variables, and the outcome values are relabeled as either normal or attack. Each of the four algorithms is then trained, and the resulting models are tested to obtain accuracy scores. The logistic regression model has the highest accuracy score, 0.9415; the decision tree, support vector machine, and random forest models score 0.9317, 0.9374, and 0.9202, respectively, so all four models exceed 0.90. The models prove effective at identifying network anomalies in the provided data.
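The workflow described in the abstract (dummy-encode nominal features, relabel the outcome as normal or attack, train four classifiers, then score them) can be sketched with pandas and scikit-learn. This is a minimal illustration under stated assumptions, not the paper's implementation: the data is synthetic, the column names (duration, src_bytes, protocol_type, outcome), the 70/30 split, the feature scaling step, and all hyperparameters are hypothetical choices that the abstract does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in for KDD-style records (assumption: NOT the paper's data).
raw_labels = rng.choice(["normal", "neptune", "smurf"], size=n, p=[0.5, 0.3, 0.2])
is_attack = raw_labels != "normal"
df = pd.DataFrame({
    "duration": rng.exponential(1.0, n) + 2.0 * is_attack,
    "src_bytes": rng.normal(500, 100, n) + 300 * is_attack,
    "protocol_type": np.where(
        is_attack & (rng.random(n) < 0.7), "icmp", rng.choice(["tcp", "udp"], n)
    ),
    "outcome": raw_labels,
})

# Convert the nominal feature into dummy (one-hot) variables.
X = pd.get_dummies(df.drop(columns="outcome"), columns=["protocol_type"], dtype=float)

# Collapse every non-normal outcome into a single "attack" class.
y = np.where(df["outcome"] == "normal", "normal", "attack")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scaling is an added assumption; it mainly helps logistic regression and the SVM.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "support_vector_machine": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
matrices = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    predicted = model.predict(X_test_s)
    scores[name] = accuracy_score(y_test, predicted)
    # Rows are true classes and columns are predicted classes, in the order given.
    matrices[name] = confusion_matrix(y_test, predicted, labels=["normal", "attack"])

for name, score in scores.items():
    print(f"{name}: accuracy = {score:.4f}")
    print(matrices[name])
```

Scores from this sketch reflect only the synthetic data and are not comparable to the accuracy figures reported above. Accuracy alone can hide an imbalance between missed attacks and false alarms, which is why the confusion matrix is printed alongside it.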
