Sichuan Cuisine Recognition Method based on Residual Neural Network

Abstract: To address issues in Sichuan cuisine recognition methods such as large parameter counts, significant variation among images of similar dishes, weak geometric invariance, and low recognition rates, a lightweight Sichuan cuisine recognition model based on a residual neural network, RGBNet, is proposed. The model employs dilated convolutions to increase the receptive field of the convolutional kernels while keeping the parameter count unchanged, thereby capturing more shallow-level features. An RGB module is constructed using asymmetric convolutions to enhance the model's geometric invariance, non-linear feature expression, and feature extraction capabilities. Finally, the DFC long-range attention mechanism is introduced to effectively capture long-range information, improving adaptive learning capabilities. To validate the model's performance, the classic ChineseFoodNet benchmark dataset is utilized: a MiniChineseFood dataset is created by extracting 30 classes totaling 20,000 images for experimentation. Recognition accuracy is measured with the top-1 metric, achieving a final image recognition accuracy of 96.62%. Compared to models such as EfficientNet, ShuffleNet, FasterNet, and MobileNetV2, RGBNet demonstrates accuracy improvements of 16.57%, 18.52%, 17.12%, and 16.35%, respectively. This presents a novel approach for industrial food recognition.


Introduction
Sichuan cuisine, also known as Sichuan-style cuisine, is a regional culinary style represented by Sichuan Province. Sichuan cuisine has profound effects on human health, nutrition, and various aspects of life. With the increase in per capita consumption levels, more researchers are turning their attention to food science, aiming to achieve health regulation by analyzing the nutritional components and ingredient combinations of dishes [1].
Methods for recognizing Sichuan cuisine have progressed from early wireless RF-signal methods and traditional machine learning methods to deep learning-based recognition methods. Traditional RF methods [2] involve implanting wireless RF chips in the utensils containing dishes to identify and analyze them. Although RF-based methods achieve high accuracy, they require customized utensils in advance, and the process is cumbersome, with limited functionality and poor maintainability. Traditional machine learning methods [3] rely on manually selected features and statistical analysis, with the results fed into classifiers; the resulting accuracy is not ideal. Recognition methods based on deep learning [4] are lossless, real-time, and pollution-free: they can identify dish categories from images captured by cameras. Compared to traditional machine learning methods, which require manual feature extraction, convolutional neural networks automatically learn and extract features. By stacking multiple convolutional and pooling layers, abstract features are extracted layer by layer, achieving more accurate and efficient classification.
Research on food recognition is interdisciplinary, spanning fields such as computer vision [5], new media [6], industrial informatics [7], agriculture, medicine, and nutrition science [8]. The widespread use of portable devices (such as smartphones and cameras) and the development of artificial intelligence have led to extensive applications in Sichuan cuisine image recognition. Therefore, the development of real-time and accurate methods and technologies for Sichuan cuisine recognition has significant practical value. Haiyan Wang et al. [9] improved local skeleton information learning by integrating asymmetric convolutions to enhance dish feature extraction. Deng Zhiliang [10] proposed a dish recognition network model that integrates multiscale features to extract semantic information from deep-level images and calculates inter-class similarity using a triplet loss. Wu Zhengdong [11] introduced a multiscale sampling module to address the limitations that fully connected layers impose on input sizes; additionally, an attention-based bilinear network was proposed, constructing an attention network along both the channel and spatial directions to enhance feature extraction capabilities. Liao Enhong [12] addressed the accuracy errors caused by the large inter-class similarity in Sichuan cuisine images using a maximum inter-class loss function.
Although the aforementioned methods can effectively identify dish categories, they often carry a huge number of parameters and seldom consider lightweight design. Therefore, a lightweight Sichuan cuisine recognition method is proposed by improving the residual neural network model. This method enhances the convolutional neural network backbone based on the characteristics of the Sichuan cuisine image dataset and incorporates attention mechanisms to capture pixel-level long-range relationship information. To validate the model's performance, this study conducted comparative experiments against lightweight network models such as EfficientNet [13], ShuffleNet [14], FasterNet [15], and MobileNetV2 [16].

The Fundamental Principles of CNN
Convolutional Neural Network (CNN) [17] is a deep learning model widely utilized in computer vision and image processing tasks. The core of a CNN is the convolutional layer, which employs convolution operations to extract features from the input data. The convolutional layer effectively captures local spatial features in an image, such as edges and textures. At the same time, it possesses the properties of parameter sharing and sparse connectivity, significantly reducing the number of network parameters and enhancing computational efficiency.
Although increasing the number of network layers improves the model's generalization ability to some extent, the high time and space complexity constrain the application of deep convolutional neural networks in resource-constrained environments such as mobile phones and embedded devices [18]. To address the low computational efficiency of large convolutional network models, a network structure is constructed using a residual neural network-based approach.

Optimizing the Design of CNN

Introducing Dilated Convolution
In dish recognition networks, the first layer's convolution operation is typically employed to extract low-level features from the input image, such as edges and color information. This layer convolves the pixel values of the input image with convolutional kernels, producing a new set of feature maps that better represent the texture information of the input image. The advantage of using dilated convolution [19] for feature extraction in the first layer lies in its ability to increase the receptive field of the convolutional kernel [20] while maintaining the output resolution, which improves the perceptual capability of the network. In comparison to standard convolution, dilated convolution also effectively reduces the number of parameters, mitigates the risk of overfitting, and accelerates convolutional computation, thereby enhancing the operational efficiency of the model. Figure 1 illustrates examples of standard convolution (Figure 1a) and dilated convolution (Figure 1b).
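The receptive-field gain can be made concrete: a single dilated convolution with kernel size k and dilation rate d spans k + (k − 1)(d − 1) input pixels while keeping only k×k parameters. A minimal check in Python (a sketch; the helper name is our own):

```python
def receptive_field(kernel_size: int, dilation: int) -> int:
    # A dilated kernel's taps span k + (k - 1) * (d - 1) input pixels,
    # while the number of learnable weights stays k * k.
    return kernel_size + (kernel_size - 1) * (dilation - 1)

print(receptive_field(3, 1))  # standard 3x3 convolution -> 3
print(receptive_field(3, 2))  # dilated 3x3 convolution, rate 2 -> 5
```

With dilation rate 2, a 3×3 kernel thus covers the same 5×5 region as a standard 5×5 kernel, at roughly a third of the parameters.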

Fusion of Asymmetric Convolution in RGB Bottleneck
Traditional lightweight networks primarily rely on depthwise separable convolution for feature extraction. Although this method enhances computational efficiency, splitting the convolution operation into two parts leads to the loss of some multi-scale feature information.
To address these limitations, a method for improving convolutional neural networks is proposed by introducing asymmetric convolution blocks [21]. The aim is to strengthen the information along the convolutional kernel skeleton, thereby improving the network's ability to model geometric deformations and enhancing its generalization performance.
The RGB Bottleneck block of asymmetric convolution consists of three parallel layers, using convolutional kernels of sizes n×n, 1×n, and n×1 to slide over the input and extract features. After convolution, Batch Normalization is applied to the output of each branch, and the branch outputs are then summed to obtain a rich feature space. Non-square convolution layers, such as 1×d and d×1, are utilized. The additivity property of convolution is leveraged, as shown in Equation (1):

A = Σ_{p=1}^{P} (C ∗ K_p) = C ∗ (Σ_{p=1}^{P} K_p)    (1)

where A represents the equivalent output, C is the input, K_p is a 2D convolutional kernel, and P is the number of kernels. If there exist P size-compatible 2D kernels K_p that, when applied with the same stride to the same input C, generate outputs with the same resolution, and the sum of these outputs is denoted A, then the kernels can be summed position-wise to form an equivalent kernel K. This equivalent kernel K produces the same output A when applied to the same input.
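The additivity property can be checked numerically. The sketch below (NumPy, "valid" cross-correlation; all names are illustrative) pads the 1×3 and 3×1 kernels to 3×3 so the three branches share the same output resolution, then verifies that summing the branch outputs equals convolving once with the summed kernel:

```python
import numpy as np

def conv2d(x, k):
    # "valid" 2D cross-correlation, the operation Equation (1) refers to
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k_sq = rng.standard_normal((3, 3))   # n x n branch
k_h = rng.standard_normal((1, 3))    # 1 x n branch
k_v = rng.standard_normal((3, 1))    # n x 1 branch

# Pad the asymmetric kernels to 3x3 (centered) so all branches are
# size-compatible: same stride, same output resolution.
k_h_p = np.zeros((3, 3)); k_h_p[1, :] = k_h[0, :]
k_v_p = np.zeros((3, 3)); k_v_p[:, 1] = k_v[:, 0]

branch_sum = conv2d(x, k_sq) + conv2d(x, k_h_p) + conv2d(x, k_v_p)
fused = conv2d(x, k_sq + k_h_p + k_v_p)   # single equivalent kernel K
print(np.allclose(branch_sum, fused))     # True
```

This is why the three branches can be fused into one kernel at inference time, removing the extra cost of the parallel paths.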
The RGB bottleneck structure is illustrated in Figure 2, where Figure 2(a) depicts the RGB bottleneck structure with a stride of 1 and Figure 2(b) represents the RGB bottleneck structure with a stride of 2.
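As a sketch of this three-branch design, the module below (PyTorch; the class and parameter names are our assumptions, not the authors' code) runs parallel n×n, 1×n, and n×1 convolutions, batch-normalizes each branch, and sums the results:

```python
import torch
import torch.nn as nn

class ACBlock(nn.Module):
    """Three-branch asymmetric convolution block (illustrative sketch):
    parallel n x n, 1 x n, and n x 1 convolutions, each followed by
    BatchNorm, with the branch outputs summed into one feature map."""
    def __init__(self, channels: int, n: int = 3):
        super().__init__()
        p = n // 2  # padding that keeps the spatial resolution unchanged
        self.square = nn.Sequential(
            nn.Conv2d(channels, channels, (n, n), padding=(p, p)),
            nn.BatchNorm2d(channels))
        self.horizontal = nn.Sequential(
            nn.Conv2d(channels, channels, (1, n), padding=(0, p)),
            nn.BatchNorm2d(channels))
        self.vertical = nn.Sequential(
            nn.Conv2d(channels, channels, (n, 1), padding=(p, 0)),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        # Summing the branches yields the rich feature space described above
        return self.square(x) + self.horizontal(x) + self.vertical(x)

block = ACBlock(16)
y = block(torch.randn(1, 16, 32, 32))
print(y.shape)  # torch.Size([1, 16, 32, 32])
```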

Attention Module to Enhance Geometric Deformation Performance

By introducing a decoupled fully connected (DFC) attention mechanism branch into the RGB bottleneck structure, implemented with asymmetric convolutions that employ unequal horizontal and vertical kernels [22], the model captures richer image features using distinct convolution kernels in different directions. This significantly enhances the model's capability to capture long-range spatial information and its representation power.
For a given input feature Z, which per channel can be regarded as a tensor of size H×W, the attention map produced by an ordinary fully connected layer is expressed in Equation (2):

a_{hw} = Σ_{h',w'} F_{hw,h'w'} ⊙ z_{h'w'}    (2)

where ⊙ denotes element-wise multiplication, F represents the learnable weights of the fully connected layer, A = {a_{hw}} is the obtained attention map, and Z = {z_{hw}} is the original feature. By decoupling Equation (2) along the horizontal and vertical directions, long-range correlations in both directions can be captured, yielding the attention weights. The feature aggregation processes in the two directions are given in Equations (3) and (4):

a'_{hw} = Σ_{h'=1}^{H} F^H_{h,h'w} ⊙ z_{h'w}    (3)

a_{hw} = Σ_{w'=1}^{W} F^W_{w,hw'} ⊙ a'_{hw'}    (4)

where F^H represents the horizontal weights and F^W represents the vertical weights. The DFC (decoupled fully connected) attention mechanism is illustrated in Figure 3.
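A minimal sketch of this decoupling (PyTorch; following the common practice of approximating the horizontal and vertical fully connected layers with depthwise 1×k and k×1 convolutions, with k = 5 and the gating function assumed here, not taken from the paper):

```python
import torch
import torch.nn as nn

class DFCAttention(nn.Module):
    """Decoupled fully connected attention (illustrative sketch):
    the two directional aggregations of Equations (3) and (4) are
    approximated by depthwise 1 x k and k x 1 convolutions, and the
    resulting map re-weights the input element-wise."""
    def __init__(self, channels: int, k: int = 5):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2),
                                    groups=channels, bias=False)
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0),
                                  groups=channels, bias=False)

    def forward(self, z):
        a = self.vertical(self.horizontal(z))  # aggregate along W, then H
        return z * torch.sigmoid(a)            # element-wise re-weighting

attn = DFCAttention(16)
out = attn(torch.randn(2, 16, 32, 32))
print(out.shape)  # torch.Size([2, 16, 32, 32])
```

Because each position's weight mixes information across a full row and a full column, the effective receptive field of the attention map spans the whole feature map at low cost.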

RGBNet Model Architecture
Considering the small inter-class differences and large intra-class differences in the Chinese cuisine dataset, a lightweight recognition approach for Sichuan dishes is proposed, aimed at recognizing Sichuan cuisine images on edge devices such as mobile phones. The model consists of three main parts. First, in the first layer of the network, dilated convolution is employed for feature extraction, significantly reducing the model's parameter count while maintaining accuracy, thereby enhancing computational efficiency. The second part is primarily composed of the ResGhost Bottleneck structure proposed in this paper, which employs cost-effective operations [23] to break down larger convolutional layers into subnetworks with shared weights; residual convolution is utilized to improve the model's generalization ability and efficiency. The third part introduces the decoupled fully connected (DFC) attention mechanism, designed to facilitate long-range communication. It incorporates dynamic normalization parameters to adjust feature values in different regions, ultimately achieving information exchange and feature calibration between regions. The structure of the established RGBNet network is shown in Table 1.

MiniChineseFood Dataset
ChineseFoodNet [24] is a large-scale dataset of food images, comprising 208 classes with a total of 180,000 images featuring culinary styles from different regions of China. Each dish is represented by images capturing significant variations in angle, lighting conditions, and plating. However, because the dataset includes a substantial number of visually distinct dishes, which tend to achieve high scores during network training, it lacks representativeness.
To address this, 30 classes were extracted from the ChineseFoodNet dataset, resulting in a practical set of 20,000 images. Through data augmentation techniques, including random augmentation [25] and random erasing [26], the original image count was expanded to 100,000 images. This extended dataset is named MiniChineseFood, and it was divided into training and test sets in a 4:1 ratio for model training. Some images from the MiniChineseFood dataset are shown in Figure 4. The images selected from the ChineseFoodNet dataset adhere to the following criteria: low intra-class similarity among extracted dish images, such as different shapes of ingredients (a and b), varied plating (c and d), and distinct types of ingredients despite sharing the same name (e and f); and high inter-class similarity among extracted dish images, for instance, different types of ingredients (g and h), different cooking methods for the same ingredient (i and j), and dishes with different ingredients and cooking methods that appear similar but belong to different categories (k and l).
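As an illustration of one of these augmentations, the sketch below implements a minimal random-erasing pass in NumPy (the area fraction and the noise-fill choice are our assumptions; torchvision's RandomEra­sing transform provides a production version):

```python
import numpy as np

def random_erase(img, rng, area_frac=0.1):
    """Minimal random-erasing sketch (parameters assumed): overwrite a
    rectangle covering roughly area_frac of the image with random noise."""
    H, W = img.shape[:2]
    eh = max(1, int(H * area_frac ** 0.5))   # rectangle height
    ew = max(1, int(W * area_frac ** 0.5))   # rectangle width
    y = rng.integers(0, H - eh + 1)
    x = rng.integers(0, W - ew + 1)
    out = img.copy()
    out[y:y + eh, x:x + ew] = rng.uniform(0, 255, size=(eh, ew) + img.shape[2:])
    return out

rng = np.random.default_rng(0)
img = np.zeros((64, 64, 3))
aug = random_erase(img, rng)
print(aug.shape)          # (64, 64, 3)
print((aug != img).any())  # True: a region was erased
```

Erasing random patches forces the network to rely on multiple regions of a dish rather than one dominant cue, which suits the high intra-class variation described above.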

Model Training and Result Analysis
To evaluate the performance of the RGBNet network in recognizing Sichuan cuisine images, this study selected representative lightweight CNN networks for performance comparison, including MobileNetV2, ShuffleNet, EfficientNet, and FasterNet.
In RGBNet, the optimizer is SGD with a learning rate (lr) of 0.045, momentum of 0.9, and weight decay (weight_decay) of 4e-05; gradient clipping was not applied. The learning rate follows a "step" decay strategy, in which it is multiplied by a decay factor gamma (set to 0.98) after each epoch, gradually reducing it to fine-tune the model parameters. The changes in model accuracy and loss values are illustrated in Figure 5.
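These hyperparameters map directly onto a standard PyTorch configuration. The sketch below uses a stand-in model (RGBNet itself is defined elsewhere) and shows the per-epoch step decay with gamma = 0.98:

```python
import torch

# Stand-in model; in the paper's setup this would be RGBNet.
model = torch.nn.Linear(10, 30)

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.045, momentum=0.9, weight_decay=4e-5)
# step_size=1: multiply lr by gamma after every epoch
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.98)

lrs = []
for epoch in range(3):
    optimizer.step()                             # (dummy) update per epoch
    lrs.append(optimizer.param_groups[0]["lr"])  # lr used this epoch
    scheduler.step()                             # decay for the next epoch
print(lrs[0], lrs[1])  # 0.045, then 0.045 * 0.98
```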
From the graph, it can be observed that selecting top-1 accuracy as the evaluation metric, at the beginning of training, RGBNet, like other models, exhibits relatively low accuracy.However, starting from the 10th epoch, RGBNet's accuracy growth rate gradually surpasses other networks and reaches its peak at the 196th epoch.In comparison to other models, which achieve convergence around the 85th epoch, RGBNet converges more slowly.This is attributed to the ResGhost module requiring more data for learning to achieve better model fitting capability.After thorough learning, all performance metrics surpass those of the comparative models, strongly indicating that RGBNet possesses robust feature extraction and fitting capabilities.

Ablation Experiment
To verify the impact of dilated convolution and the DFC attention mechanism on recognition accuracy, RGBNet (conv2d), RGBNet (Dilated Conv), and RGBNet (Dilated Conv+DFC) were selected for ablation experiments. In RGBNet (conv2d), the first layer uses a regular 3×3 convolution; in RGBNet (Dilated Conv), the first layer uses dilated convolution; and in RGBNet (Dilated Conv+DFC), the DFC attention mechanism is incorporated alongside dilated convolution. The experimental results are shown in Table 4. They fully demonstrate that integrating DFC effectively captures long-range information and enhances accuracy, and that combining dilated convolution with the lightweight RGB module achieves better recognition results.

Conclusion
For the Sichuan cuisine recognition task, a lightweight RGBNet network model based on a residual neural network is proposed. In the backbone network, standard convolution is channel-split to merge multiple asymmetric convolutions, acquiring richer features from multiple directions. Subsequently, DFC long-range attention is introduced to capture long-distance dependencies. Compared to existing convolutional neural networks, the proposed model performs better on the MiniChineseFood dataset.
However, there are still areas for improvement in the deep learning approach used by the model. First, given the diversity of Sichuan cuisine and the variations in ingredient types and cooking styles, the data samples extracted in the experiments are relatively limited in variety; in future research, more advanced object detection algorithms will be integrated to handle multi-object cuisine datasets. Second, although RGBNet achieves good performance among lightweight models, its feature representation capability is weaker than that of some complex models; this can be addressed by incorporating more network layers and richer features to enhance the algorithm for practical applications.

Fig 1. Standard Convolution and Dilated Convolution

Fig 2. (a) RGB bottleneck structure with a stride of 1; (b) RGB bottleneck structure with a stride of 2

Table 2 presents the performance of the model on the test set after training. Compared to the other models, RGBNet demonstrates excellent performance, achieving an accuracy of 96.72%, a recall of 96.46%, a precision of 97.39%, and an F1 score of 96.25%.

Table 2. Data Results