Research on Key Substances in the Rating of Strong Aroma Crude Spirits Based on Correlation Algorithms

Abstract: This experiment obtained the substance content of crude spirits at different distillation times using Gas Chromatography-Mass Spectrometry (GC-MS). The relationship between substance content and crude spirits rating was revealed using Spearman's rank correlation coefficient, the Maximal Information Coefficient (MIC), and Principal Component Analysis (PCA). There were 11 substances with a Spearman's coefficient greater than 0.70, 9 substances with a MIC greater than 0.6, and 6 substances in the PCA with an information extraction rate greater than 1.5×10^5. In combination with these three algorithms, a total of 17 substances were found to be related to the crude spirits grading. These substances are: 1,1-diethoxy-3-methylbutane, ethyl valerate, ethyl hexanoate, 2-methyl-1-butanol, ethyl caproate, ethyl lactate, ethyl nonanoate, butyl lactate, 2-hydroxy-4-methylpentanoic acid ethyl ester, isoamyl lactate, ethyl decanoate, butyric acid, (2,2-diethoxyethyl)-benzene, ethyl laurate, ethyl heptadecanoate, ethyl stearate, ethyl linolenate.


Introduction
Baijiu is a traditional Chinese spirit made primarily through solid-state fermentation, a process that has evolved over more than 2000 years [1]. It is typically made from a mixture of grains, most commonly sorghum, along with rice, glutinous rice, wheat, and corn [2]. Compared with other distilled spirits worldwide, baijiu has a more diverse base of ingredients. Its fermentation process is also distinct: the conversion of starch into sugar and of sugar into alcohol happens simultaneously, a result of the diverse microbiological composition of the Qu (a type of fermentation starter), which includes yeast, molds, bacteria, and lactic acid bacteria [3]. These microbes not only convert the ingredients into alcohol but also produce a range of volatile compounds such as esters, acids, aldehydes, and alcohols, with a total concentration of about 2-5 g/L [4]. The interaction of these micro-constituents contributes to the diverse flavor profiles observed in baijiu.
Recent advances have seen the application of Gas Chromatography-Mass Spectrometry (GC-MS), Gas Chromatography, Liquid Chromatography, Spectroscopy, Electronic Nose, and Nuclear Magnetic Resonance in analyzing the chemical composition of baijiu [5][6][7][8][9][10][11][12][13]. Notable studies have shown the efficacy of GC-MS in differentiating various types of baijiu based on their volatile compound profiles [14,15], determining the grading of strong-flavor baijiu [16], and identifying and classifying different brands of baijiu [17]. These findings suggest that GC-MS, combined with correlation algorithms, can effectively detect the trace compounds in baijiu and distinguish between different types of spirits based on these constituents.
However, the majority of these studies have focused on finished baijiu, where the differences between types are often pronounced. There has been less research on the correlation between the composition and the grading of different fractions of crude spirits. Therefore, this study aims to investigate the trace compounds in different distillation periods of baijiu using GC-MS and to establish their relationship with the grading. The content of volatile compounds in the crude spirits is determined using GC-MS, followed by an analysis of how these compounds change during the distillation process. Lastly, the Spearman correlation, MIC, and PCA algorithms are employed to determine the influence of different compounds on the grading of crude spirits, with the aim of identifying the key compounds that affect the grading.

Materials
Crude Spirits (17 batches of 202 bottles of crude spirits produced by a well-known liquor factory in Sichuan in May 2022); 2-Ethylbutyric acid (Chromatographically pure, purchased from Macklin Biochemical Co., Ltd. in Shanghai).

Sample Collection Method
A total of 17 batches of 202 bottles of crude spirits were collected, each batch from a different fermentation pit. Each batch of crude spirits was divided into head, middle, and tail fractions. Sampling varied from batch to batch according to actual field conditions, with samples collected according to alcohol content, distillation time, and field tasting. Because the quality of the head and tail fractions changed markedly and unstably, the head and tail sections were the main focus of collection for the analysis and classification of liquor grades. After collection, five national-level evaluators graded the liquor based on color, aroma, taste, and style, and divided the liquors into final grades (Grade 1: 40 bottles, Grade 2: 90 bottles, Grade 3: 72 bottles). Each sample was labeled in the format "Grade-Position within grade", where grade refers to grades 1-3 and the position within the grade refers to the order of collection within that grade, with higher numbers indicating later collection times; batches are numbered 1-17.

Physical and Chemical Data Detection Method
GC-MS detection conditions: Samples were injected automatically. The chromatographic column was an Agilent DB-WAX (30 m × 320 μm × 0.25 μm) with an FID detector, and the liner was an Agilent 5062-3587 (900 μL). The temperature program was: hold at 60 ℃ for 5 minutes, increase at 10 ℃ per minute to 250 ℃, and hold for 2 minutes. High-purity helium was used as the carrier gas at a flow rate of 2.25 mL/min in splitless mode, with a total flow of 34.5 mL/min; the vaporizer temperature was 250 ℃ and the injection volume was 1 μL. The MS interface temperature was 280 ℃; the electron ionization (EI) source operated at 70 eV, with an ion source temperature of 230 ℃ and a quadrupole temperature of 150 ℃; data were acquired in full scan mode over a scan range of m/z 30-540.
Quantitative method: 2-Ethylbutyric acid was used as the internal standard. Following the national standard GB/T 10345-2007 "Baijiu Analysis Method", the content of each flavor component was calculated from its peak area by the internal standard method.
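The internal standard calculation can be sketched as follows. This is a minimal illustration rather than the exact procedure of GB/T 10345-2007: the function name, the `response_factor` parameter (set to 1.0, i.e. equal detector response assumed), and the peak areas are all hypothetical.

```python
def internal_standard_content(area_analyte, area_is, conc_is, response_factor=1.0):
    """Estimate an analyte concentration by the internal standard method.

    content = response_factor * (area_analyte / area_is) * conc_is

    response_factor is a hypothetical relative correction factor; 1.0 means
    the analyte and the internal standard are assumed to respond equally.
    """
    if area_is <= 0:
        raise ValueError("internal standard peak area must be positive")
    return response_factor * (area_analyte / area_is) * conc_is


# Made-up peak areas; conc_is is the known internal standard concentration (mg/L).
content = internal_standard_content(area_analyte=2.4e6, area_is=1.2e6, conc_is=100.0)
print(content)  # 200.0
```

The result carries the unit of `conc_is`, so the internal standard concentration fixes the unit of every reported content.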
Material analysis software: Basic data analysis was carried out in Excel, and the relationship between substance content and grade was then determined using Python libraries and custom MATLAB programs.

Substance Distribution in Samples
A total of 89 substances were detected in the samples by GC-MS. The data vary slightly between fermentation pits and between segments of the crude spirits, but the overall trends are similar. We selected the 35 substances with the highest detection rates across all samples for analysis; their names and concentrations are shown in Table 1. During distillation, substances with low boiling points and high volatility distill out first, while substances with high boiling points and low volatility distill out later. Figure 2 shows the change in total substance content during distillation, drawn from 10 randomly selected samples out of 27 groups of liquor samples. The horizontal axis represents samples collected at different times (named by the "grade-position within grade" rule). The figure shows that as the crude spirits grade gradually increases, the total substance content decreases in the early stage and tends to stabilize in the later stage.
Combining the information in Table 1 with the content of each substance shows that the content of most esters is negatively correlated with the distillation stage: the content of the esters numbered 2, 3, 5, 6, 7, 8, 11, 12, 21, 27, 29, 30, 31, 32, 33, 34, and 35 in the table gradually decreases as distillation progresses. A few esters do not follow this pattern: the content of the esters numbered 9, 15, 17, 18, 19, 20, 23, 24, 25, and 28 in the table gradually increases as distillation progresses. The content of almost all acids that distill out increases: the acids numbered 14, 22, 24, and 26 in the table gradually increase in content as distillation progresses.
It can be found that the outflow of substances is closely related to the grading of the crude spirits, so further analysis is needed on the correlation between the content of substances and the grade of the crude spirits.

Methods
The content of a substance at a particular moment in the crude spirits distillation process is closely related to the progression of the distillation, implying that both the amount and the proportion of substances can affect the grade of the crude spirit. It is therefore important to quantify accurately the correlation between substances and the grade of the crude spirit. In this study, we employed the Spearman rank correlation coefficient, MIC, and PCA algorithms to investigate the degree of influence of substances on the grade of the crude spirit. Finally, the key volatile substances affecting the grade of the crude spirit were obtained by taking the union of the substances selected by the three methods.

Principle of Spearman Coefficient Algorithm
During the collection of the crude spirits, the outflow time of crude spirits of different grades is not the same, and the content of substances in crude spirits of different grades varies significantly. In this situation, the rank order of substance content better reflects how substances vary with distillation time. The Spearman rank correlation coefficient replaces the values themselves with their ranks, so it can measure the correlation of non-normally distributed and discrete data. Its use of ranks suits both the ordinal grade of the crude spirit and the ordering of substance contents. The calculation formula is as follows:

$$\rho_s = \frac{\operatorname{cov}\big(R(X), R(Y)\big)}{\sigma_{R(X)}\,\sigma_{R(Y)}} \tag{1}$$

Where: $\rho_s$ is the Pearson product-moment correlation coefficient applied to the rank variables $R(X)$ and $R(Y)$, $\operatorname{cov}(R(X), R(Y))$ is the covariance of the rank variables, and $\sigma_{R(X)}$ and $\sigma_{R(Y)}$ are the standard deviations of the rank variables.
According to formula (1), the larger the absolute value of the Spearman rank coefficient, the more strongly the substance is correlated with the grade of the crude spirit. A positive Spearman rank correlation coefficient indicates a positive relationship, while a negative coefficient indicates an inverse relationship.
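The rank-then-correlate idea can be sketched in a few lines of plain Python. This is an illustrative implementation, not the authors' code: the function names are hypothetical, tied values receive the average of their ranks, and the grade and content values are made up.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: a substance whose content falls as the grade number rises.
grades = [1, 1, 2, 2, 3, 3]
content = [9.1, 8.7, 6.2, 5.9, 3.3, 3.0]
rho = spearman(grades, content)
print(round(rho, 3))  # strongly negative
```

Because only ranks enter the calculation, any monotonic relationship, linear or not, yields a coefficient of magnitude 1 when there are no ties.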

Principle of the Maximal Information Coefficient Algorithm
The maximal information coefficient (MIC) algorithm was proposed by Reshef et al. to measure the strength of association between two variables X and Y. It is an improvement on the mutual information score. MIC ranges from 0 to 1; it estimates the joint probability density from a scatter plot partitioned by a grid and then obtains the mutual information value, which can capture both linear and nonlinear associations. Let $X$ and $Y$ represent the grade of the crude spirit and the substance content, respectively; then the joint distribution of grade and content is $f(x, y)$, and the marginal distributions are $f(x)$ and $f(y)$, respectively. The mutual information is the relative entropy between the joint distribution and the product of the marginal distributions, and MIC is calculated as follows:

$$\mathrm{MIC}(X;Y) = \max_{m \times n < B} \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f(x_i, y_j)\,\log_2 \dfrac{f(x_i, y_j)}{f(x_i)\,f(y_j)}}{\log_2 \min(m, n)} \tag{2}$$

Where: $m$ and $n$ are the numbers of grid divisions in the x and y directions, $f(x_i, y_j)$ is the joint probability density function, and $B$ is the upper limit on the number of $m \times n$ grid cells. $B$ is a function of the sample size (written $N$ here to distinguish it from the grid division $n$), $B = N^{0.6}$.
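The grid search behind MIC can be illustrated with a deliberately simplified sketch. It is an approximation and not the algorithm of Reshef et al.: it tries only equal-width grids, whereas the full method optimizes the grid boundaries. The function names are hypothetical, and the cell budget follows the B = N^0.6 rule stated above.

```python
from math import log2


def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [min(k - 1, int((v - lo) / (hi - lo) * k)) for v in values]


def mutual_information(bx, by):
    """Mutual information (in bits) of two sequences of bin labels."""
    n = len(bx)
    joint, px, py = {}, {}, {}
    for a, b in zip(bx, by):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) / (p(a) p(b)) simplifies to c * n / (count_a * count_b).
        mi += (c / n) * log2(c * n / (px[a] * py[b]))
    return mi


def mic_sketch(x, y, exponent=0.6):
    """Largest normalized mutual information over equal-width m x n grids
    with m * n <= N ** exponent. Returns 0.0 if no 2 x 2 grid fits."""
    budget = len(x) ** exponent
    best = 0.0
    m = 2
    while m * 2 <= budget:
        n = 2
        while m * n <= budget:
            mi = mutual_information(equal_width_bins(x, m), equal_width_bins(y, n))
            best = max(best, mi / log2(min(m, n)))
            n += 1
        m += 1
    return best


# Made-up data: 30 samples with an exactly linear relationship.
x = list(range(30))
y = [2 * v + 1 for v in x]
print(mic_sketch(x, y))
```

Equal-width bins keep tied values, such as identical grades, in the same cell. For real analyses a dedicated MIC implementation would be preferable to this sketch.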

Principle of PCA Algorithm
PCA is a statistical analysis method that transforms a set of potentially correlated variables into a set of linearly uncorrelated variables through an orthogonal transformation. These new variables are called principal components. This method can reveal the inherent structure of the data and reduce high-dimensional data to low-dimensional data while retaining the original features [36]. In PCA, the principal components are ordered by the share of the original data variance they explain, from largest to smallest. The first principal component retains the most original information, the second retains slightly less, and so on. The load factor is the contribution of each original variable to a principal component: the larger the load factor, the greater the contribution. The information extraction rate measures how much of an original variable is extracted by the principal components. The higher the information extraction rate, the stronger the correlation between the original variable and the dependent variable.

$$\nu_i^{(m)} = \sum_{j=1}^{m} \rho^2(x_i, y_j) \tag{3}$$

Where:

$$\rho(x_i, y_j) = \frac{\sqrt{\lambda_j}\,a_{ij}}{\sqrt{\sigma_{ii}}} \tag{4}$$

Where: $\nu_i^{(m)}$ represents the extraction rate of the original variable $x_i$ by the first $m$ principal components, $\lambda_j$ is the eigenvalue corresponding to the j-th principal component, $a_{ij}$ is the load factor of the i-th original variable on the j-th principal component, $\sigma_{ii}$ is the variance of the i-th original variable, and $\rho(x_i, y_j)$ is the correlation coefficient of the i-th original variable with the j-th principal component $y_j$.
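The extraction rate can be sketched with NumPy under one stated assumption: that the rate of variable i over the first m components is the sum of λ_j·a_ij²/σ_ii, i.e. the sum of squared variable-component correlations. This is a reading of the definitions above, not necessarily the authors' exact scaling, and the function name and data are hypothetical.

```python
import numpy as np


def extraction_rates(data, m):
    """Extraction rate of each original variable by the first m principal
    components, assuming rate_i = sum_j lambda_j * a_ij**2 / sigma_ii.

    data: array of shape (samples, variables); every variable must have
    non-zero variance.
    """
    cov = np.cov(data, rowvar=False)            # covariance matrix of variables
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:m]       # indices of the m largest
    lam = np.clip(eigvals[order], 0.0, None)    # guard against tiny negatives
    a = eigvecs[:, order]                       # loadings a_ij, one column per PC
    sigma_ii = np.diag(cov)                     # variances of original variables
    rho = np.sqrt(lam) * a / np.sqrt(sigma_ii)[:, None]  # correlation x_i vs PC_j
    return (rho ** 2).sum(axis=1)


# Made-up data: 50 samples of 4 variables on different scales.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4)) * np.array([1.0, 5.0, 0.5, 2.0])
print(extraction_rates(data, 2))  # one value per variable
print(extraction_rates(data, 4))  # all components retained: every rate is 1
```

Under this normalized reading each rate lies between 0 and 1. The threshold of 1.5×10^5 used in the study suggests a different scaling, which this sketch does not attempt to reproduce.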

Correlation Analysis of Substance Content and Crude Spirits Grade
Figure 1 shows the heat map of the 11 substances in Table 1 with a Spearman rank correlation coefficient greater than 0.70. These 11 substances are numbered 1, 2, 6, 9, 17, 18, 19, 23, 31, 32, and 35, corresponding to: 1,1-diethoxy-3-methylbutane, pentanoic acid ethyl ester, hexanoic acid ethyl ester, lactic acid ethyl ester, butyl lactate, 2-hydroxy-4-methyl-pentanoic acid ethyl ester, lactic acid isoamyl ester, (2,2-diethoxyethyl)-benzene, heptadecanoic acid ethyl ester, octadecanoic acid ethyl ester, and linoleic acid ethyl ester. Among them, the contents of substances 1, 2, 6, 31, 32, and 35 decreased with increasing distillation time, while the contents of substances 9, 17, 18, 19, and 23 increased with increasing distillation time. The contents of these substances are not only closely related to the grade of the crude spirits but also highly correlated with each other, indicating that although the absolute contents of substances in the crude spirits differ greatly, the proportions between them are relatively stable.
Figure 2 shows the coefficient distribution of the 9 substances in Table 1 with a MIC greater than 0.60. These 9 substances are numbered 1, 2, 3, 4, 9, 17, 18, 19, and 22, corresponding to: 1,1-diethoxy-3-methylbutane, pentanoic acid ethyl ester, hexanoic acid ethyl ester, 2-methylbutanol, lactic acid ethyl ester, butyl lactate, 2-hydroxy-4-methyl-pentanoic acid ethyl ester, lactic acid isoamyl ester, and butyric acid. According to the previous research, the contents of substances 1, 2, 3, and 4 increased in the first-, second-, and third-grade liquor samples, while the contents of substances 9, 17, 18, 19, and 22 decreased in the first-, second-, and third-grade liquor samples.
Figure 3(a) shows the contribution and cumulative contribution of the top ten PCs after dimension reduction of the crude spirits data.
To prevent the information extraction rate of each PC from being submerged in the total of all PCs' information extraction rates, the information extraction rate formula is used to measure the impact of substance content on the crude spirits grade. The information extraction rates of the top five PCs were calculated separately and are shown in Figure 3(b); highly correlated substances were extracted using 1.5×10^5 as the threshold.
Using the PCA algorithm, six substances related to the grading were selected from Table 1, with serial numbers 6, 16, 21, 23, 25, and 31. Their names are: hexanoic acid ethyl ester, nonanoic acid ethyl ester, decanoic acid ethyl ester, (2,2-diethoxyethyl)-benzene, dodecanoic acid ethyl ester, and heptadecanoic acid ethyl ester. In combination with previous research, it was found that substances 6, 16, and 31 appear in the early stage of distillation, and a spirit sample that does not contain these three substances is likely to be a third-segment spirit.
During the entire distillation process, these 17 substances show a steady upward or downward trend. The content and proportion of substances at different distillation times vary, and only when the proportions of various substances are moderate in the middle of the distillation will the crude spirits present a full-bodied taste.
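The count of 17 key substances can be checked by taking the union of the Table 1 serial numbers reported for each method. The serial numbers below are copied from the text; the variable names are illustrative only.

```python
from collections import Counter

# Table 1 serial numbers selected by each method, as reported in the text.
spearman_ids = {1, 2, 6, 9, 17, 18, 19, 23, 31, 32, 35}
mic_ids = {1, 2, 3, 4, 9, 17, 18, 19, 22}
pca_ids = {6, 16, 21, 23, 25, 31}

# Key substances: selected by at least one of the three methods.
key_ids = sorted(spearman_ids | mic_ids | pca_ids)
print(len(key_ids))  # 17
print(key_ids)

# Substances picked by two or more methods, as a rough robustness check.
votes = Counter()
for selected in (spearman_ids, mic_ids, pca_ids):
    votes.update(selected)
shared_ids = sorted(i for i, count in votes.items() if count >= 2)
print(shared_ids)  # [1, 2, 6, 9, 17, 18, 19, 23, 31]
```

Taking the union keeps any substance flagged by a single method; requiring agreement between methods would be a stricter alternative and leaves 9 substances.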