Research on Optimising Data Detection Timing and Anomaly Detection Using K-Means Clustering and Random Forests

Authors

  • Changyang Tang
  • Shaowei Huang
  • Yihan An

DOI:

https://doi.org/10.54097/67f00n76

Keywords:

Pearson Analysis, Regression Models, K-Means Clustering, Random Forests.

Abstract

This study combines Pearson correlation analysis, multiple linear regression, linear mixed models (LMM), K-means clustering, and random forests into a data-driven framework for optimising detection timing and identifying anomalies. It aims to improve both the efficiency of sequencing data processing and the accuracy of anomaly detection. During preprocessing, invalid values and outliers were removed and the features were then standardised. Pearson correlation analysis was used to identify associations among the core features. A multiple linear regression model was first fitted using the least squares method; because its fit was insufficient, random effects were introduced to extend it to an LMM. The LMM lowered both the AIC and BIC values and reduced the prediction error (RMSE) by 22%, improving model fit and predictive accuracy. To optimise detection timing, K-means clustering partitioned the key influencing factors into five intervals, and the median method was used to determine the optimal detection point within each interval, enabling earlier detection while maintaining accuracy. For anomaly detection, the dataset was prepared through data partitioning, mean imputation of missing values, and feature engineering. The resulting random forest model achieved 98% accuracy on the test set with balanced precision and recall, and it identified the most influential features. Through multi-algorithm collaboration and iterative model refinement, the framework coordinates sequencing data processing, detection timing optimisation, and anomaly detection, providing a technical solution for related detection tasks.
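The interval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single scalar influencing factor, uses a plain one-dimensional Lloyd's K-means with a deterministic initialisation, and reads the "median method" as taking the median of the values in each cluster. The function names `kmeans_1d` and `detection_points` and the synthetic data are hypothetical.

```python
import random
import statistics


def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalar values; returns k clusters as lists."""
    data = sorted(values)
    # Assumed initialisation: spread the starting centres evenly over the sorted data.
    centres = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            clusters[nearest].append(v)
        # Recompute each centre as its cluster mean; keep the old centre if a cluster is empty.
        new_centres = [statistics.fmean(c) if c else centres[j]
                       for j, c in enumerate(clusters)]
        if new_centres == centres:  # assignments are stable, so stop
            break
        centres = new_centres
    return clusters


def detection_points(values, k=5):
    """Return (lower bound, upper bound, median) for each non-empty cluster."""
    clusters = [c for c in kmeans_1d(values, k) if c]
    clusters.sort(key=min)
    return [(min(c), max(c), statistics.median(c)) for c in clusters]


# Hypothetical synthetic factor values, for illustration only (not the paper's data).
rng = random.Random(0)
factor = [rng.gauss(mu, 1.0) for mu in (20, 26, 32, 38, 44) for _ in range(40)]
intervals = detection_points(factor, k=5)
for low, high, point in intervals:
    print(f"[{low:.1f}, {high:.1f}] -> detection point {point:.1f}")
```

The median is less sensitive to extreme values inside an interval than the cluster mean, which is one plausible reason to prefer it when choosing a representative detection point.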

Published

31-12-2025

Section

Articles

How to Cite

Tang, C., Huang, S., & An, Y. (2025). Research on Optimising Data Detection Timing and Anomaly Detection Using K-Means Clustering and Random Forests. Mathematical Modeling and Algorithm Application, 7(3), 89-94. https://doi.org/10.54097/67f00n76