PM 2.5 Concentration Prediction Method Based on Temporal Attention Mechanism and CNN-LSTM

Abstract: Accurately predicting PM 2.5 concentration can help reduce the harm that heavily polluted weather causes to human health. To address the non-linearity and time series characteristics of PM 2.5 concentration data and the large errors in multi-step prediction, a method combining a Convolutional Neural Network and a Long Short-Term Memory network with a Temporal Pattern Attention mechanism (TPA-CNN-LSTM) is proposed. The method uses historical PM 2.5 concentration data, historical meteorological data, and data from surrounding stations to predict the PM 2.5 concentration at air quality monitoring stations for the next 6 hours. First, a CNN extracts the spatial characteristics between multiple stations. Second, an LSTM is added after the CNN to extract the temporal changes of the non-linear data. Finally, a Temporal Pattern Attention (TPA) mechanism is added to capture the key features of the temporal information. TPA automatically adjusts weights based on the input at each time step and selects the most relevant time steps for prediction, thereby improving the accuracy of the model. A case study is conducted on measured data from Beijing's air quality stations in 2018, and the model is compared with other mainstream algorithms. The results show that the proposed model achieves higher prediction accuracy and performance.


Introduction
PM 2.5 refers to particulate matter in ambient air with a diameter of 2.5 micrometers or less. Due to their small size, these particles can penetrate deep into the respiratory system and pose a significant threat to human health. PM 2.5 particles often adsorb carcinogenic substances, toxic metals, persistent organic pollutants, and other harmful materials, which can directly affect the lungs when they enter the human body, leading to heavy metal poisoning, increased cancer risk, reproductive harm, and other problems [1].
The fluctuations in PM 2.5 concentration have varying degrees of impact on the respiratory system, crop growth, and tourism. Predicting future PM 2.5 concentrations in advance can provide valuable health information for travelers and serve as a warning for relevant departments to take measures to improve air quality [2]. This research topic has positive significance for the long-term development of public health, the economy, and the ecological environment. At the same time, it presents significant challenges to the accuracy and stability of the models, making it a profound and meaningful research topic.
Against the background of rapid development in machine learning, researchers have conducted many studies on air pollutant concentration prediction based on different algorithms and models. For example, Zhang et al. [3] used principal component analysis (PCA) for feature extraction and fed the extracted features into a BP neural network for prediction, which improved the prediction accuracy and reduced the model's time complexity. Samal et al. [4] proposed the Multi-Time Convolutional Artificial Neural Network (MTCAN) model, which performs feature learning and sequence modeling simultaneously and uses a large amount of past observation data for long-term prediction while minimizing memory requirements and running time. Liu et al. [5] optimized the BP neural network using a genetic algorithm and constructed a feature-based PM 2.5 concentration prediction model. Li [6] proposed an AC-LSTM model composed of a one-dimensional convolutional neural network, a long short-term memory network, and an attention mechanism. This model uses not only air pollutant concentrations but also the PM 2.5 concentrations of neighboring air quality monitoring stations as prediction data. The CNN and LSTM extract the spatio-temporal correlation and interdependence of the multivariate time series data, and the attention mechanism captures how strongly the feature states at different time steps affect future PM 2.5 concentrations. Liu et al. [7] used historical air pollutant and meteorological data for a region to construct an LSTM prediction model that predicts the PM 2.5 concentration 1, 4, 8, and 12 hours ahead.
The above studies have improved PM 2.5 concentration prediction, but they often consider only the features of a single station or ignore the correlation between different time steps and the influence of meteorological factors on PM 2.5 concentration. In fact, PM 2.5 concentration is affected not only by local historical conditions but also by the transport of pollutants from surrounding areas and by meteorological factors. Therefore, this paper proposes a new TPA-CNN-LSTM network, which uses historical pollutant concentrations, surrounding PM 2.5 concentrations, and meteorological factors as prediction features. LSTM mitigates the exploding and vanishing gradient problems of RNNs and can extract the temporal features needed for long-term prediction. CNN can extract the spatiotemporal features in the input data and the relevant features of PM 2.5 concentration between different monitoring stations. The temporal pattern attention mechanism can capture periodic features in time series data and handle multiple time series inputs, improving the accuracy, adaptability, and practicality of the model.

Data Description
The temporal and spatial variations of atmospheric particulate matter (PM) are influenced by multiple factors, including pollution sources and meteorological conditions [14,15]. The variation of PM 2.5 concentration is affected not only by the previous atmospheric state and PM 2.5 concentration but also by the PM 2.5 concentration in adjacent areas [16,17]. Therefore, historical air pollution data and meteorological data need to be considered in the prediction model. In this study, we selected 35 environmental monitoring stations in Beijing and collected hourly historical air quality and meteorological data from January 1, 2018, to December 31, 2018, for a total of 8674 sets of data. The statistical information of the data is shown in Table 1, which also lists the detailed characteristics of the air quality monitoring stations. The locations of the stations are shown in Figure 1. The air quality data include the concentrations of PM 10, PM 2.5, SO2, NO2, O3, and CO. The meteorological data include atmospheric pressure, temperature, wind speed, humidity, and weather. The data come from the Beijing Environmental Protection Monitoring Center and the reanalysis data of the European Centre for Medium-Range Weather Forecasts. Individual missing values were filled by linear interpolation, forward filling, or backward filling, and the data were then normalized.
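The gap filling and normalization described above can be illustrated with a minimal sketch. It assumes that interior gaps are linearly interpolated, that gaps at the start or end of a series copy the nearest observation, and that normalization is min-max scaling to [0, 1]; the text does not state which scaling was used, and the function names and sample values are hypothetical.

```python
from typing import List, Optional


def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Fill None gaps: interpolate inside the series, copy the nearest value at the edges."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    # Leading gap: backward fill with the first observation.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    # Trailing gap: forward fill with the last observation.
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interior gaps: straight line between the two surrounding observations.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale to [0, 1]; min-max scaling is an assumption, not stated in the text."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical hourly PM 2.5 readings with missing entries.
pm25 = [None, 35.0, None, None, 50.0, 47.0, None]
clean = fill_missing(pm25)
scaled = min_max_normalize(clean)
```

The minimum and maximum used for scaling would need to be kept so that predictions can later be mapped back to concentration units.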

TPA mechanism
The proposed model uses a one-dimensional CNN to learn the temporal pattern information of time series data as a local feature learning method in the network; this is referred to as temporal pattern attention (TPA). The structure of the TPA mechanism is shown in Figure 2. The leftmost arrow in Figure 2 represents the processing of variables, and each row represents the time series data of one variable. The temporal pattern matrix $H^C_{i,j}$ of the variable within the range of the convolutional kernel is obtained by convolution. The scoring function calculates the score of the temporal pattern matrix, and the score is normalized using the sigmoid function to obtain the attention weight $\alpha_i$. The context vector $v_t$ is obtained by combining the temporal pattern matrix and the attention weights. The context vector $v_t$ from the encoder and the hidden state $h$ are concatenated and connected with the hidden state $s$ from the decoder. The output prediction value is calculated using the output layer and the softmax function.
The role of the TPA mechanism in the proposed model is described in detail below. First, a one-dimensional CNN is used for the convolution: $k$ filters are set, and the kernel size is $1 \times T$, where $T$ represents the range covered by attention and is usually set to $T = w$. The convolution is calculated along the row vectors of the hidden state matrix $H$ using this kernel to extract the temporal pattern matrix $H^C_{i,j}$ of the variable within the range of the convolutional kernel, as shown in Equation (1). The learned temporal patterns are then scored, and each row of $H^C$ is weighted by its attention weight and summed to obtain the context vector $v_t$ (Equation (5)).
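A common way to write temporal pattern attention is sketched below. The notation is an assumption and may differ from the authors' own equations: $C_j$ is the $j$-th filter, $h_t$ is the current hidden state, $W_a$ is a learned scoring matrix, and $m$ is the number of rows of $H$.

```latex
\begin{align*}
H^{C}_{i,j} &= \sum_{l=1}^{w} H_{i,(t-w-1+l)} \, C_{j,(T-w+l)} \\
f\left(H^{C}_{i}, h_t\right) &= \left(H^{C}_{i}\right)^{\top} W_a \, h_t \\
\alpha_i &= \operatorname{sigmoid}\left(f\left(H^{C}_{i}, h_t\right)\right) \\
v_t &= \sum_{i=1}^{m} \alpha_i \, H^{C}_{i}
\end{align*}
```

Because the weights come from a sigmoid rather than a softmax, several rows can receive a high weight at the same time, which suits inputs where more than one variable is informative for the prediction.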

Convolutional neural network
Convolutional neural networks (CNNs) have achieved excellent performance in computer vision and natural language processing thanks to their powerful feature extraction and recognition capabilities. Typically, one-dimensional convolution is used for time series data, two-dimensional convolution for spatial convolution in images, and three-dimensional convolution for spatial convolution in three-dimensional space. In this paper, the convolution kernel of the CNN is one-dimensional. With a CNN, the multivariate time series data needed for PM 2.5 prediction, such as meteorological data, air pollution data, and data from adjacent stations, can be input through different channels to maximize information retention. A CNN is a feedforward neural network built mainly from three modules: convolutional, pooling, and fully connected layers. Its basic structure consists of an input layer, a convolutional layer, a pooling layer, a fully connected (FC) layer, and an output layer, as shown in Figure 3. A CNN learns data features automatically and is characterized by local connectivity, weight sharing, pooling operations, and a multi-layer structure. These properties significantly reduce the complexity of the PM 2.5 concentration prediction model and, together with gradient descent optimization, help reduce overfitting, thus improving generalization ability. The calculation formulas for the convolution layer, pooling layer, and fully connected layer are:
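As a minimal sketch of these three layer types, assuming their standard forms (a valid one-dimensional convolution summed over input channels with ReLU activation, non-overlapping max pooling, and a single fully connected output), the computation can be written in plain Python. The function names, kernel weights, and input values are illustrative and are not taken from the paper.

```python
from typing import List

Series = List[List[float]]  # shape: channels x time steps


def conv1d(x: Series, kernel: Series, bias: float = 0.0) -> List[float]:
    """Valid 1-D convolution (cross-correlation) summed over all input channels."""
    width = len(kernel[0])
    steps = len(x[0]) - width + 1
    out = []
    for t in range(steps):
        total = bias
        for channel, weights in zip(x, kernel):
            total += sum(w * channel[t + k] for k, w in enumerate(weights))
        out.append(max(0.0, total))  # ReLU activation (assumed)
    return out


def max_pool1d(x: List[float], size: int) -> List[float]:
    """Non-overlapping max pooling with window and stride equal to `size`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def dense(x: List[float], weights: List[float], bias: float = 0.0) -> float:
    """Fully connected layer with a single output unit."""
    return bias + sum(w * v for w, v in zip(weights, x))


# Two hypothetical input channels, e.g. local PM 2.5 and a neighbouring station.
inputs = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
]
kernel = [[-1.0, 0.0, 1.0], [0.5, 0.5, 0.5]]  # one filter of width 3

features = conv1d(inputs, kernel)    # [2.5, 3.0, 2.5, 3.0]
pooled = max_pool1d(features, size=2)  # [3.0, 3.0]
prediction = dense(pooled, weights=[0.5, 0.5])
```

The same filter weights are reused at every time step, which is the weight sharing that keeps the parameter count low.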

LSTM model
The LSTM controls the transmission of state through a gate structure consisting of a forget gate, a selective memory gate, and an output gate, as shown in Figure 4. The forget gate determines the extent to which the previous unit state is forgotten, where $G_t$ denotes the state matrix of the forget gate at time $t$. A memory gate and an activation function are chosen to control the range of new information being added. Through the joint effect of the forget gate and memory gate outputs, the state of the unit is updated.
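A single LSTM time step can be sketched as follows, assuming the standard gate equations (sigmoid gates, a tanh candidate state, and a tanh-squashed output). The parameter names `forget`, `memory`, `candidate`, and `output` are hypothetical labels for the weight and bias pairs and do not come from the paper.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(w: Matrix, b: Vector, v: Vector) -> Vector:
    """Return W v + b for a row-major weight matrix."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Tuple[Matrix, Vector]]) -> Tuple[Vector, Vector]:
    z = h_prev + x  # concatenation [h_{t-1}, x_t]
    forget = [sigmoid(a) for a in affine(*params["forget"], z)]
    memory = [sigmoid(a) for a in affine(*params["memory"], z)]
    candidate = [math.tanh(a) for a in affine(*params["candidate"], z)]
    output = [sigmoid(a) for a in affine(*params["output"], z)]
    # Unit state: keep part of the old state and add the gated new information.
    c_t = [f * c + i * g for f, c, i, g in zip(forget, c_prev, memory, candidate)]
    # Hidden state: the output gate decides how much of the unit state is exposed.
    h_t = [o * math.tanh(c) for o, c in zip(output, c_t)]
    return h_t, c_t


hidden, inputs = 2, 3
# All-zero parameters make every gate exactly 0.5 and the candidate exactly 0.
zeros = ([[0.0] * (hidden + inputs) for _ in range(hidden)], [0.0] * hidden)
params = {name: zeros for name in ("forget", "memory", "candidate", "output")}
h_t, c_t = lstm_step([0.1, 0.2, 0.3], [0.0, 0.0], [1.0, -1.0], params)
```

With all-zero parameters the forget gate halves the previous unit state, so the effect of the gate on the state update can be checked by hand.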

Model Parameter Design
The parameters of the prediction model were set as follows: the number of convolutional layers was 1, the number of filters was 16, the convolutional kernel size was 5 × 5, the learning rate was 0.001, the number of epochs was 100, the batch size was 32, and the optimizer was Adam. The parameters were selected by grid search.
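Grid search evaluates every combination of candidate values and keeps the one with the lowest validation error. The sketch below is illustrative only: the candidate grid is hypothetical, and `toy_error` is a placeholder for training the model and measuring its validation error, chosen only so that the example has a known minimum.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, float]


def grid_search(grid: Dict[str, List[float]],
                validation_error: Callable[[Config], float]
                ) -> Tuple[Optional[Config], float]:
    """Try every combination in `grid` and return the one with the lowest error."""
    names = list(grid)
    best_config: Optional[Config] = None
    best_error = float("inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        error = validation_error(config)
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error


def toy_error(config: Config) -> float:
    # Placeholder objective; it says nothing about the real validation error.
    return (abs(config["filters"] - 16) / 16
            + abs(config["learning_rate"] - 0.001) * 100
            + abs(config["batch_size"] - 32) / 32)


# Hypothetical candidate values for three of the hyperparameters.
grid = {
    "filters": [8, 16, 32],
    "learning_rate": [0.01, 0.001, 0.0001],
    "batch_size": [16, 32, 64],
}
best_config, best_error = grid_search(grid, toy_error)
```

The cost grows with the product of the candidate list lengths (27 evaluations here), which is why grids are usually kept small when each evaluation requires a full training run.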

Selection of Evaluation Indicators
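The root mean square error (RMSE) and mean absolute error (MAE) were chosen as metrics to evaluate the prediction performance of the TPA-CNN-LSTM model at each site and to compare it with other models:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \qquad (15)$$
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (16)$$
where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.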
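Both metrics are straightforward to compute. The following sketch uses their standard definitions; the sample values are hypothetical and serve only to show the calculation.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed and predicted PM 2.5 concentrations.
actual = [30.0, 42.0, 55.0, 61.0]
predicted = [28.0, 45.0, 55.0, 65.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
```

RMSE penalizes large errors more heavily than MAE, so reporting both indicates whether a model's error is dominated by a few large misses.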

TPA-CNN-LSTM prediction model construction
To overcome the limitation that a CNN captures only local information, this paper proposes a model that combines TPA with CNN. The model first performs feature extraction on the input time series samples and then uses the result as the input to the CNN convolutional layer. This combination can reduce the computational complexity of the CNN and speed up feature extraction. After the CNN performs the second feature extraction, the output time series is reduced in dimensionality. Because the LSTM has a complex structure, training takes a long time when the input time series is long. Using the time series extracted and dimensionality-reduced by TPA-CNN as the input to the LSTM can improve the accuracy of LSTM prediction and reduce its running time. The prediction process of the entire TPA-CNN-LSTM model is shown in Figure 5. The main steps of the PM 2.5 concentration prediction model based on TPA-CNN-LSTM are as follows:
Step 1: Correlation analysis. Analyze the factors affecting PM 2.5 concentration changes and determine the input sequence.
Step 2: Data preprocessing. Handle missing values and outliers, and normalize the input time series.
Step 3: Feature extraction. Use TPA-CNN to extract important features from the normalized time series.
Step 4: PM 2.5 concentration prediction. Input the TPA-CNN processed time series into the LSTM network for model training.
Step 5: Output. Inverse-normalize the predicted data and output the results.
Through the above steps, the goal of improving prediction accuracy and computational efficiency is achieved.
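Two of these steps are easy to illustrate in isolation: turning a normalized series into supervised samples with a 6-hour target, and mapping predictions back to concentration units in Step 5. The sketch below assumes min-max scaling and a 12-hour input window; the window length is a hypothetical choice that the text does not specify, and the function names are illustrative.

```python
from typing import List, Tuple


def make_windows(series: List[float], history: int, horizon: int = 6
                 ) -> List[Tuple[List[float], List[float]]]:
    """Pair each `history`-long input window with the next `horizon` target values."""
    samples = []
    for start in range(len(series) - history - horizon + 1):
        split = start + history
        samples.append((series[start:split], series[split:split + horizon]))
    return samples


def inverse_min_max(scaled: List[float], lo: float, hi: float) -> List[float]:
    """Undo min-max scaling using the minimum and maximum of the training data."""
    return [v * (hi - lo) + lo for v in scaled]


# Hypothetical series of 20 hourly values.
series = [float(i) for i in range(20)]
samples = make_windows(series, history=12, horizon=6)
first_inputs, first_targets = samples[0]

# Map normalized predictions back to concentrations, assuming a 10 to 210 range.
restored = inverse_min_max([0.0, 0.5, 1.0], lo=10.0, hi=210.0)
```

The inverse transform must use the same minimum and maximum as the forward normalization, otherwise RMSE and MAE are not reported in the original concentration units.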

Correlation Analysis
This paper uses Pearson correlation analysis to examine the correlation between each variable and PM 2.5 concentration. The Pearson correlation coefficient describes the linear correlation between two variables and is denoted $R$:
$$R = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right)$$
where $\frac{X_i - \bar{X}}{\sigma_X}$, $\bar{X}$, and $\sigma_X$ are respectively the standardized score, sample mean, and sample standard deviation of sample $X_i$, and the corresponding quantities are defined in the same way for $Y_i$. The value of $R$ lies in $[-1, 1]$: the closer its value is to 1 or -1, the stronger the correlation, and the closer its value is to 0, the weaker the correlation. The PM 2.5 concentration of a region can be analyzed from two aspects: historical concentration values and current meteorological conditions. Historical concentration values have a certain impact on the concentration at the next moment, while other pollutants and meteorological factors have a greater impact on the concentration at the next moment.
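The Pearson correlation coefficient can be computed directly from its definition. The sketch below uses hypothetical values for PM 2.5, wind speed, and PM 10 and is not based on the study's data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    # The 1/(n-1) factors of the covariance and standard deviations cancel.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings: wind speed falls as PM 2.5 rises, PM 10 rises with it.
pm25 = [10.0, 20.0, 30.0, 40.0]
wind_speed = [4.0, 3.0, 2.0, 1.0]
pm10 = [12.0, 25.0, 31.0, 44.0]

r_wind = pearson_r(pm25, wind_speed)
r_pm10 = pearson_r(pm25, pm10)
```

A coefficient near -1 or 1 indicates a strong linear relationship only; it does not capture non-linear dependence, which is one reason weakly correlated factors may still be useful model inputs.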

Correlation Analysis of PM 2.5 Concentrations with Other Pollutants
Figure 6 analyzes the correlation between PM 2.5 and other pollutants. It can be seen that PM 10, NO2, CO, and SO2 are positively correlated with PM 2.5, while O3 is negatively correlated with PM 2.5. This indicates that the PM 2.5 concentration tends to rise as the PM 10, NO2, CO, and SO2 concentrations rise, and to fall as the O3 concentration rises.
Figure 7 analyzes the correlation between PM 2.5 and meteorological factors. The figure shows that PM 2.5 is only moderately correlated with meteorological factors overall. The correlations of weather, humidity, and wind speed with PM 2.5 are relatively strong, while those of the other meteorological factors are weak. Weather and humidity are positively correlated with PM 2.5, while wind speed is negatively correlated with it. Although the correlation between the other factors and PM 2.5 is small, it is not negligible. Therefore, this paper selects all of these data for training.

Model Comparison Experiment Results
To verify that the TPA-CNN-LSTM model proposed in this paper has stronger learning capability and produces more accurate predictions, the LSTM and CNN-LSTM models were used to predict PM 2.5 concentration on the same dataset, and the results of the three models were compared, as shown in Figure 8. According to Figure 8, the predictions of all three models follow the trend of the actual values; the predictions of the TPA-CNN-LSTM model are closest to the real values, while those of the LSTM model deviate the most. A CNN learns directly from the original input sequence, avoiding the accumulation of errors caused by manual feature extraction. Adding the TPA network allows important information to be prioritized and optimizes the input data of the LSTM network. Therefore, the TPA-CNN-LSTM network extracts the features of the PM 2.5 concentration time series and its influencing factors more effectively and improves the prediction accuracy.
Table 2 and Table 3 summarize the evaluation indicators for the three prediction models. Although the prediction accuracy of all three models decreases as the prediction horizon grows, the TPA-CNN-LSTM model performs best for every hour, with higher accuracy than the LSTM and CNN-LSTM models. This result is also confirmed by the average over the 6-hour horizon. The tables show that, in the multi-hour prediction task, the TPA-CNN-LSTM model has the lowest MAE and RMSE, so its predicted values are closest to the true values. In addition, in the one-hour PM 2.5 prediction task in Figure 8, the $R^2$ of the TPA-CNN-LSTM model is the highest. After the temporal pattern attention mechanism is added, TPA-CNN-LSTM outperforms LSTM and CNN-LSTM in the multi-scale prediction task.
The results indicate that the proposed TPA-CNN-LSTM model can effectively learn the spatiotemporal correlations of air pollutants and is suitable for related tasks of predicting PM 2.5 concentration.

Conclusions and Future Work
This paper proposes a CNN-LSTM model based on temporal pattern attention for predicting the concentration of PM 2.5 over multiple hours. The model uses air quality data, meteorological data, and PM 2.5 concentrations from neighboring monitoring stations as inputs to capture the spatiotemporal correlation and long-term dependency of PM 2.5. The temporal pattern attention mechanism helps to capture the importance of different feature states and improves the prediction accuracy of the model. Experimental results show that the TPA-CNN-LSTM model performs well in multi-scale prediction tasks. The main conclusions of this paper are as follows:
Analysis of air pollution data shows that PM 2.5 concentrations have a strong spatiotemporal correlation. Because pollutants are carried by air flow, PM 2.5 concentrations in the prediction range are easily affected by PM 2.5 concentrations at neighboring monitoring stations. In addition, because PM 2.5 stays in the air for a long time, past feature states also affect future PM 2.5 concentrations.
Compared with a single model, an ensemble model combines the feature extraction capabilities of multiple models, resulting in more stable training results and stronger generalization ability. The TPA-CNN-LSTM prediction model captures the nonlinear relationship between each input variable and PM 2.5 concentration better than the other models and has higher prediction accuracy.
Although the model achieved good prediction performance, there are still some limitations. The data collected in this study did not consider emissions from factories and vehicle exhaust, which are also important factors in the formation of air pollutants. Additionally, in the future, we will explore the use of this model for large-scale prediction of other air pollutants and incorporate satellite meteorological data into the input of the prediction model.