Research and Application of System-based Clustering and Principal Component Analysis Algorithms

Abstract: The Silk Road was a channel of cultural exchange between China and the West in ancient times, and glass is valuable physical evidence of that early trade. Early Chinese glass was made by absorbing foreign techniques, which led to a distinct chemical composition. Today most glass artifacts are broadly divided into lead-barium glass and high-potassium glass. In studying an artifact, different parts are observed and sampled for analysis, but identifying the composition type is hampered by natural weathering over thousands of years. In view of these problems and of the large number and complexity of the chemical components, we propose to sub-classify the different types of glass artifacts with a model that combines systematic clustering and principal component analysis. The classification is based on the chemical composition content at the different sampling points of each artifact, identified by artifact number, and the classification results are finally evaluated and tested by sensitivity analysis. The classification results can greatly reduce the workload of analyzing and identifying artifact types, and provide a reference and methodological guidance for the identification and classification of artifacts.


Introduction
Principal component analysis is a technique for exploring the structure of high-dimensional data, while systematic cluster analysis is a technique for finding the intrinsic structure among data. Both are applied mostly to data analysis problems, and many scholars in China and abroad have carried out research based on these algorithms. Qin Liyue et al. conducted an in-depth analysis of the comprehensive quality of roasted macadamia nut kernels, providing a theoretical and scientific basis for the processing and development of such foods. Chao Zhonghao et al. presented a model for evaluating the volatile flavor substances of butter hot pot base, which enabled an effective evaluation of its flavor. Since such models are now widely used and studied, we apply systematic clustering and principal component analysis to the identification and analysis of cultural relics, so that the identification work can be optimized to a certain extent, and we refine the analysis steps and the testing of the identification results to ensure that these algorithms solve the problem effectively.

Problem Description
The main raw material of glass is quartz sand, whose main chemical component is SiO2. Because pure quartz sand has a high melting point, fluxes are added during refining to lower the melting temperature. The fluxes commonly used in ancient times include grass ash, natural alkali, saltpeter and lead ore, and limestone is added as a stabilizer, which is converted to CaO on calcination. Different fluxes give the glass a different main chemical composition. For example, lead-barium glass, for which lead ore is added as a flux during firing, has a high content of PbO and BaO; it is usually regarded as a glass variety invented in China, and the glass of the Chu culture is dominated by lead-barium glass. In this context, we analyze the data of an existing batch of ancient glass artifacts, in which both the chemical composition content and the sampling points differ among artifacts. On this basis, for each category we choose appropriate chemical components to divide it into subcategories, give the specific classification method, and analyze the rationality and sensitivity of the classification results.

Model Building and Solving
1) Principal component analysis is a dimensionality reduction algorithm that transforms multiple indicators into a few principal components, i.e., it replaces the old data with fewer new variables. Each principal component is a linear combination of the initial variables, the components are uncorrelated with each other, and together they reflect the information carried by the full set of variables to the greatest extent. The problem is therefore approached first with a principal component analysis, which is performed as follows:
Step1. Standardize the data. Annex 2 of this question involves 69 sampling points on the artifacts, i.e., there are $n = 69$ samples, each described by $m = 14$ indicators. Let the value of the $j$-th indicator of the $i$-th sample be $x_{ij}$. Each indicator is standardized as

$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, \dots, n, \; j = 1, \dots, m,$$

where $\bar{x}_j$ is the mean of the $j$-th indicator and $s_j$ is the standard deviation of the $j$-th indicator. The purpose of standardization is mainly to remove errors in the results caused by the different magnitudes of the indicators.
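Step2. Calculate the correlation coefficient matrix of the standardized sample matrix.
The correlation coefficient matrix is $R = (r_{ij})_{m \times m}$, where $r_{ij}$ is the correlation coefficient between the $i$-th indicator and the $j$-th indicator, with $r_{ii} = 1$ and $r_{ij} = r_{ji}$, and where

$$r_{ij} = \frac{\sum_{k=1}^{n} \tilde{x}_{ki}\,\tilde{x}_{kj}}{n - 1}, \quad i, j = 1, \dots, m.$$

Here $\tilde{x}_{ki}$ is the standardized value of the $i$-th indicator of the $k$-th sample, $n$ is the number of samples and $m$ is the number of indicators.
Step3. Calculate the eigenvalues and eigenvectors with MATLAB.
First calculate the eigenvalues of the correlation coefficient matrix $R$, ordered as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m \ge 0$, together with the corresponding eigenvectors.
Step4. Calculate the principal component contribution rate and the cumulative contribution rate.
Contribution rate:

$$b_j = \frac{\lambda_j}{\sum_{k=1}^{m} \lambda_k}, \quad j = 1, \dots, m \tag{4}$$

Cumulative contribution rate:

$$\alpha_p = \frac{\sum_{k=1}^{p} \lambda_k}{\sum_{k=1}^{m} \lambda_k}$$

Step5. The principal component analysis was performed in MATLAB for each of the two glass categories, and the first, second, third, ..., $p$-th principal components, corresponding to the eigenvalues whose cumulative contribution rate exceeds 85%, were retained. After the principal components were determined, descriptive and statistical analyses of the key variables and data were performed.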
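Steps 1 to 5 can be sketched in plain Python as follows. This is only an illustrative sketch, not the MATLAB code used in the study: the function names, the cyclic Jacobi method used to obtain the eigenvalues, the use of the sample standard deviation (divisor n - 1), and the small `samples` table are all assumptions made for the example. It stops at choosing the number of components and does not compute eigenvectors or component scores.

```python
import math

def standardize(data):
    """Step 1: z-score every column (indicator); assumes no column is constant."""
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / (n - 1))
            for j in range(m)]
    return [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in data]

def correlation_matrix(z):
    """Step 2: r_ij = sum_k z_ki * z_kj / (n - 1) for standardized data z."""
    n, m = len(z), len(z[0])
    return [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(m)]
            for i in range(m)]

def jacobi_eigenvalues(matrix, max_sweeps=100, tol=1e-12):
    """Step 3: eigenvalues of a symmetric matrix, largest first (cyclic Jacobi)."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def contribution_rates(eigenvalues):
    """Step 4: contribution rate of each component and the running total."""
    total = sum(eigenvalues)
    rates = [lam / total for lam in eigenvalues]
    cumulative, running = [], 0.0
    for rate in rates:
        running += rate
        cumulative.append(running)
    return rates, cumulative

def components_needed(cumulative, threshold=0.85):
    """Step 5: smallest p whose cumulative contribution rate exceeds the threshold."""
    for p, value in enumerate(cumulative, start=1):
        if value > threshold:
            return p
    return len(cumulative)

# Hypothetical data: 5 sampling points x 3 indicators (not taken from Annex 2).
samples = [[65.0, 9.0, 1.2], [70.5, 7.5, 0.8], [59.0, 12.3, 2.1],
           [87.0, 0.5, 0.3], [61.7, 10.9, 1.9]]
z = standardize(samples)
R = correlation_matrix(z)
eigenvalues = jacobi_eigenvalues(R)
rates, cumulative = contribution_rates(eigenvalues)
print("components retained:", components_needed(cumulative))
```

Because the data are standardized first, the diagonal of `R` is 1 and the eigenvalues sum to the number of indicators, which gives a quick sanity check on any implementation.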
2) After principal component analysis, each of the two glass classes is divided into subclasses by systematic clustering, using the principal component features as the basis. Systematic clustering first treats every sample as a class of its own, then repeatedly calculates the distances between classes and merges the closest ones, step by step, until all samples are finally merged into one large class.
3) Determine the value of $K$: elbow rule. The elbow rule [5] roughly estimates the optimal number of clusters from a graph. The elbow is the point on the curve at which the improvement in the degree of distortion drops off most sharply, and the degree of distortion is generally used to determine the optimal value of $K$.
The systematic clustering then divides the new data obtained by principal component analysis for high-potassium glass and for lead-barium glass into their respective numbers of clusters $K$, where for each glass type $K$ is the inflection point at which the slope of the elbow curve shows a significant decrease.
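The merging procedure described in 2) can be sketched as below. This is a minimal illustration under stated assumptions: Euclidean distance and average linkage are choices made for the example (the study ran the clustering in SPSS, and its settings are not given here), and the function names and the `points` data are hypothetical.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(points, c1, c2):
    """Mean pairwise distance between two clusters given as lists of indices."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points):
    """Start with one cluster per sample and merge the closest pair each step.

    Returns one (merge_distance, clusters_after_merge) record per merge.
    """
    clusters = [[i] for i in range(len(points))]
    schedule = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters) - 1):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        schedule.append((d, [sorted(c) for c in clusters]))
    return schedule

def cut(schedule, n_points, k):
    """Return the partition that has exactly k clusters (1 <= k <= n_points)."""
    if k == n_points:
        return [[i] for i in range(n_points)]
    return schedule[n_points - k - 1][1]

# Hypothetical principal component scores for five sampling points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
schedule = agglomerate(points)
print(sorted(cut(schedule, len(points), 3)))
```

The merge distances stored in `schedule` play the role of the coefficients that are later plotted against the number of clusters to apply the elbow rule.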

Solution of the model
1) Following the model and steps established above, principal component analysis was first carried out in MATLAB for each of the two glass types, giving the key quantities for each type, including the eigenvalues and the cumulative contribution rates; the data were then processed in Excel. The principal component information for high-potassium glass shows that the cumulative contribution rate reaches about 87% at principal component 5, so the first five principal components were taken as the combined indicators after screening according to the requirements. Principal component 1 and principal component 2 contribute the most to the cumulative contribution rate, which suggests that silica and sodium oxide may be important factors and indicators for dividing high-potassium glass into subclasses.
From the information in Table 2, the cumulative contribution rate at principal component 8 reaches about 88%, i.e., the composite index corresponding to the seven principal components mentioned above is taken according to the requirements. The contribution rates of the first three components are higher and differ little from one another, which suggests that silica, sodium oxide and potassium oxide may be the more important indicators for the later subclassification of lead-barium glass. Based on the above analysis, the final combined index matrices of the screened lead-barium glass and high-potassium glass were obtained.
2) The combined index matrices obtained by principal component analysis for the two glass types were then imported into SPSS for systematic clustering. A dendrogram was obtained for each type, and the number of categories was determined as described next.
3) The aggregation coefficients obtained from SPSS were processed in Excel and plotted in descending order. According to the aggregation coefficient line graph for high-potassium glass, the decline of the curve slows down at $K = 6$, so by the elbow rule high-potassium glass can be divided into $K = 6$ subcategories. According to the aggregation coefficient line graph for lead-barium glass, the decline of the curve slows down at $K = 7$, so by the elbow rule lead-barium glass can be divided into $K = 7$ subcategories. The dendrograms generated by SPSS were then cut at the corresponding number of categories, and the subcategories of the two glass types were read off the cut dendrograms.
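In the study the elbow is read off the line graph by eye. As a rough illustration only, one simple way to locate it numerically is to pick the $K$ at which the drop in the coefficient shrinks the most from one step to the next. The function name, this particular heuristic, and the coefficient values below are assumptions made for the example, not results from the paper.

```python
def elbow(coefficients):
    """Return the K at which the curve bends most sharply.

    coefficients[i] is the aggregation coefficient for K = i + 1 clusters,
    listed in descending order. The bend at K is the drop leading into K
    minus the drop leading out of K; the largest bend marks the elbow.
    """
    if len(coefficients) < 3:
        raise ValueError("need at least three points to locate an elbow")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(coefficients) - 1):
        drop_before = coefficients[i - 1] - coefficients[i]
        drop_after = coefficients[i] - coefficients[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k

# Hypothetical aggregation coefficients for K = 1, ..., 9.
example = [100, 80, 61, 43, 26, 10, 8, 7, 6]
print("elbow at K =", elbow(example))
```

A heuristic like this is sensitive to noise in the curve, so it is best treated as a check on the visual reading of the graph, not a replacement for it.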

Sensitivity analysis of the model
From the analysis of the data above, it can be concluded that, at the 95% confidence level, the contents of silica and potassium oxide have a significant influence on whether the surface is weathered. It is therefore reasonable to suspect that they may cause overlapping or unreasonable categories in the final classification results. To check this, we remove each of the two components in turn, fit the number of categories again with the elbow rule and the aggregation coefficient line graph, and see whether the component has a large influence on the subclassification of high-potassium glass and lead-barium glass. If the difference is large, the component is classified as a sensitivity factor. Following the steps above, we used Excel and SPSS to plot the aggregation coefficient line graphs for both glass types, for the raw data without principal component analysis and for the data after removing the component. The results after removing potassium oxide are as follows:
The results after removing silica are as follows:
Figure 5. Comparison of aggregation coefficients of high-potassium glass before and after removal
Figure 6. Comparison of aggregation coefficients of lead-barium glass before and after removal
Figure 5 shows that the removal of silica has a significant effect on the subclassification of high-potassium glass, with a large difference in trend between the two curves. Figure 6 shows that, although the curves before and after the treatment neither overlap nor lie close together, their trends are similar, and the elbow rule indicates that the two classification results do not differ significantly.
Therefore, removing these components does not have a large impact on the final classification results of lead-barium glass, while for high-potassium glass the silica content can be classified as a sensitivity factor for its results.

Conclusion
This paper focuses on the application of cluster analysis and principal component analysis. The object of study is the identification of the categories of glass artifacts. Using the model above, we sub-classified the different artifacts, dividing high-potassium glass into six categories and lead-barium glass into seven, based on the chemical composition at the artifact sampling points. The classification results were then evaluated and tested by sensitivity analysis, which showed that they are only slightly affected by some of the chemical components. The method provides a valuable reference for identifying the types of cultural relics and can improve the accuracy and efficiency of identification.