Prediction modeling of cigarette ventilation rate based on genetic algorithm backpropagation (GABP) neural network

ventilation rate of the cigarette. This study requires the structure of the produced cigarettes to be divided first and the modeling analysis to be carried out afterwards, so it lags behind cigarette production.
Qu et al. [3] conducted a correlation analysis on the physical indicators of thin cigarettes and identified the key indicators that affect their open resistance, such as ventilation rate, cigarette weight, and cigarette hardness. Wang et al. [4] used correlation analysis and multiple regression to analyze the relationship between the open resistance of conventional cigarette brands and various physical indicators, such as process parameters and cut tobacco structure, and derived a mathematical model relating open resistance to these indicators. Zhang [5] explored the impact of various physical indexes on the total ventilation rate of cigarettes and identified the key physical indicators that affect it, such as weight, circumference, hardness, and open resistance. These studies indicate that physical indicators such as weight, circumference, open resistance, length, hardness, and filter air permeability are significant for predicting cigarette ventilation rate.
Wang et al. [6] established a prior model-based method relating the open resistance and ventilation of cigarettes. This model guides the production of various cigarettes, with good universality and applicability. However, it assumes that the various parts of the cigarette are uniformly distributed, which is difficult to control in actual production, so it has certain limitations. Learning-based methods can automatically learn the implicit relationships of complex physical problems and establish models for simulation and prediction, but they require sufficient data for training [7,8].
This paper used three methods to predict the ventilation rate of cigarettes: multiple linear regression (MLR), backpropagation neural network (BPNN), and genetic algorithm-optimized backpropagation (GABP). MLR [9] is a statistical model that uses the correlation between variables to predict the value of the dependent variable; it is the basic method for predicting ventilation rate. BPNN solves the learning problem of the connection weights of the hidden layers in multilayer neural networks by error backpropagation. It offers adaptability, self-organization, high parallelism, robustness, and fault tolerance, and is especially suitable for nonlinear system modeling [10][11][12]. In this paper, the model was built from the ventilation rate and the physical indicators of cigarettes, and its weights and thresholds were adjusted iteratively to predict the ventilation rate.
Genetic algorithm (GA) [13,14] is a random search and optimization method based on biological natural selection and genetic mechanisms. Since BPNN tends to converge to a local minimum, GA is often used to find the best initial weights and thresholds for BPNN. GA is suitable for complex, nonlinear problems that are difficult to solve with traditional search methods, and for screening variables. It can reduce memory use and improve the predictive performance of BPNN.
This study focused on training and optimizing MLR, BPNN, and GABP models to predict ventilation rate from the physical indicators of cigarettes collected from Xuchang cigarette factory. The ventilation rate and parameter data sets were divided into two groups for model training and verification. Model accuracy was evaluated by comparing the predicted and measured values. The results identify a prediction model of ventilation rate suited to multifactor influence, which can help improve the processing technology and quality level of cigarettes, provide a theoretical basis for cigarette enterprises, and support stability analysis of ventilation rate.

Data sources
The cigarette samples were taken from cigarettes produced by one unit on the production line of Xuchang cigarette factory in Henan Province. The measured indicators were weight, circumference, length, hardness, filter air permeability, open resistance, and total ventilation rate. A total of 900 samples were selected, of which 800 were training samples and the remaining 100 were testing samples (Additional file 1).
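The 800/100 partition can be sketched as follows. This is a minimal illustration only: the paper does not say how the samples were assigned to each group, so the seeded random shuffle, the function name `split_samples`, and the integer sample ids are all assumptions.

```python
import random


def split_samples(samples, n_train, seed=0):
    """Shuffle a copy of the samples and split it into training and testing parts."""
    shuffled = list(samples)
    if not 0 <= n_train <= len(shuffled):
        raise ValueError("n_train must be between 0 and the number of samples")
    # Assumption: a seeded random shuffle; the source does not describe the split.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder ids standing in for the 900 measured cigarettes.
sample_ids = list(range(900))
train_ids, test_ids = split_samples(sample_ids, n_train=800)
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the partition reproducible, so different models can be compared on the same test samples.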

Data preprocessing
The normalization of the data means that the value of each feature is scaled to between 0 and 1 [15].
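A minimal sketch of this scaling, assuming the usual min-max form (x - min) / (max - min) applied to each feature separately. The function name and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw values of a single feature.
raw = [2.0, 4.0, 6.0, 10.0]
scaled = min_max_normalize(raw)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

When a separate test set is used, the minimum and maximum are normally taken from the training data and reused for the test data.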

Correlation analysis between parameters
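Correlation analysis examines two or more related variables in order to measure the degree of correlation between them. Pearson correlation analysis was carried out with the statistical software SPSS. Pearson's correlation coefficient is defined as follows [16,17]:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \tag{1}$$

where $x_i$ and $y_i$ are the paired observations of the two variables, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of samples.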
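For readers without SPSS, Pearson's coefficient can be computed directly from its standard definition. The sketch below is illustrative; the function name and the toy data are hypothetical and are not measurements from this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Toy data: the second variable is an exact linear function of the first.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_positive, r_negative)
```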

Principal component analysis
PCA is a data reduction technique that creates principal components (PCs), which are linear combinations of the original variables and form new, uncorrelated variables [18]. In order to improve predictive performance and efficiency, the principal components obtained by PCA dimensionality reduction were used as model inputs in the experiment.
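The idea can be illustrated with a small standard-library sketch that extracts principal components from the covariance matrix by power iteration with deflation. This is one simple way to compute PCA, not necessarily the procedure used by the authors. The function names and the two-feature toy data are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows` (a list of equal-length lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, vector):
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def principal_components(cov, k, iterations=500):
    """Top-k (eigenvalue, eigenvector) pairs of a symmetric matrix by power iteration."""
    d = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        # Non-uniform start vector; assumed not orthogonal to the dominant eigenvector.
        v = [float(i + 1) for i in range(d)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iterations):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        pairs.append((eigenvalue, v))
        # Deflation: subtract the component just found before looking for the next one.
        work = [[work[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
                for i in range(d)]
    return pairs


# Hypothetical data: two strongly correlated features.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
cov = covariance_matrix(rows)
total_variance = sum(cov[i][i] for i in range(len(cov)))
ratios = [eigenvalue / total_variance for eigenvalue, _ in principal_components(cov, k=2)]
print(ratios)
```

Each entry of `ratios` is the share of total variance explained by one component; their running sum is the cumulative explained variance used to choose how many components to keep.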

Multiple linear regression model
Multiple linear regression describes the mathematical relationship between multiple independent variables and a dependent variable and can be used to predict the change of that variable. The multivariate linear model between the total ventilation rate of cigarettes and the circumference, weight, length, filter air permeability, open resistance, and hardness was established as follows:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 \tag{2}$$

In formula (2), $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, $\beta_5$, and $\beta_6$ are the overall regression coefficients, $Y$ is the total ventilation rate of the cigarette, $x_1$ is the circumference, $x_2$ is the weight, $x_3$ is the length, $x_4$ is the filter air permeability, $x_5$ is the open resistance, and $x_6$ is the hardness.
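As an illustration of how such coefficients can be estimated, the sketch below fits an intercept and slopes by ordinary least squares, solving the normal equations with Gaussian elimination. It is a stand-in for the SPSS fit rather than a reproduction of it. The function names are hypothetical, and the toy data use two predictors instead of six.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_mlr(features, targets):
    """Return [beta_0, beta_1, ...] minimizing the squared error of a linear model."""
    design = [[1.0] + list(row) for row in features]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data generated from y = 1 + 2*x1 + 3*x2 (not measurements from the study).
features = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
targets = [1, 3, 4, 6, 8]
betas = fit_mlr(features, targets)
print(betas)
```

For the six physical indicators, each feature row would simply contain six values instead of two.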

BP neural network model design
BPNN is widely used in signal processing, pattern recognition, machine control, and many other fields [19]. In this study, BPNN was used as the prediction model. A single-hidden-layer BPNN was adopted, and its structure is shown in Fig. 1.
The main steps of the proposed BPNN method were as follows:
(1) Input the training samples and normalize the feature values of the input samples to the range [0,1].
(2) Network initialization. The six nodes of the input layer correspond to the six features (weight, circumference, length, hardness, filter air permeability, and open resistance). The single output node is the total ventilation rate of the cigarette. The number of hidden layer nodes is related to the number of neurons in the input and output layers and is selected according to design experience and experiment. Here, the following empirical formula was used [20]:

$$n_1 = \sqrt{n + m} + a \tag{3}$$

where $n_1$ is the number of neurons in the hidden layer, $n$ is the number of neurons in the output layer, $m$ is the number of neurons in the input layer, and $a$ is a constant between 1 and 10. Nine neurons were selected in the hidden layer. Signals were passed from the input layer to the hidden layer by the tansig function and from the hidden layer to the output layer by the purelin function [21]. The expressions of the two transfer functions are shown in equations (4) and (5):

$$\mathrm{tansig}(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{4}$$

$$\mathrm{purelin}(x) = x \tag{5}$$

The maximum number of training iterations was set to 2000, the inertia coefficient to 0.8, the maximum allowable error to 1e-6, and the learning rate to 0.01. The initial parameters of the backpropagation neural network are shown in Table 1.
(3) During the training of the BPNN, the weights and thresholds were adjusted iteratively until the final result was obtained. The trained network was then used to predict the test samples.
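The steps above can be sketched with a small pure-Python network. This is not the authors' implementation. It assumes the common form n1 = sqrt(n + m) + a of the empirical rule, per-sample gradient descent with momentum (reading the "inertia coefficient" as momentum), and a synthetic target. The class name `TinyBPNN` and all data are hypothetical.

```python
import math
import random


def hidden_size_range(m, n, a_min=1, a_max=10):
    """Bounds on the hidden-layer size from the assumed rule n1 = sqrt(n + m) + a."""
    base = math.sqrt(n + m)
    return base + a_min, base + a_max


def tansig(x):
    # 2 / (1 + exp(-2x)) - 1 is mathematically tanh(x); math.tanh avoids overflow.
    return math.tanh(x)


class TinyBPNN:
    """Single hidden layer with tansig units and one purelin (identity) output."""

    def __init__(self, n_in=6, n_hidden=9, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = rng.uniform(-0.5, 0.5)
        # One momentum buffer per parameter.
        self.v_w1 = [[0.0] * n_in for _ in range(n_hidden)]
        self.v_b1 = [0.0] * n_hidden
        self.v_w2 = [0.0] * n_hidden
        self.v_b2 = 0.0

    def forward(self, x):
        hidden = [tansig(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2  # purelin
        return hidden, output

    def train_step(self, x, target, lr=0.01, momentum=0.8):
        hidden, output = self.forward(x)
        err = output - target  # gradient of 0.5 * err**2 with respect to the output
        for j, h in enumerate(hidden):
            # Backpropagated error for hidden unit j, using w2[j] before its update.
            delta_h = err * self.w2[j] * (1.0 - h * h)
            for i, xi in enumerate(x):
                self.v_w1[j][i] = momentum * self.v_w1[j][i] - lr * delta_h * xi
                self.w1[j][i] += self.v_w1[j][i]
            self.v_b1[j] = momentum * self.v_b1[j] - lr * delta_h
            self.b1[j] += self.v_b1[j]
            self.v_w2[j] = momentum * self.v_w2[j] - lr * err * h
            self.w2[j] += self.v_w2[j]
        self.v_b2 = momentum * self.v_b2 - lr * err
        self.b2 += self.v_b2


def mean_squared_error(net, data):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Synthetic stand-in data: six normalized features and a made-up target.
rng = random.Random(1)
data = []
for _ in range(50):
    x = [rng.random() for _ in range(6)]
    data.append((x, 0.3 * x[0] + 0.7 * x[5]))

low, high = hidden_size_range(m=6, n=1)
net = TinyBPNN(n_in=6, n_hidden=9)
mse_before = mean_squared_error(net, data)
for _ in range(200):
    for x, t in data:
        net.train_step(x, t, lr=0.01, momentum=0.8)
mse_after = mean_squared_error(net, data)
print(low, high, mse_before, mse_after)
```

With six inputs and one output, the assumed rule gives a range of roughly 3.6 to 12.6 hidden neurons, which contains the nine neurons chosen in this study.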

GABP neural network model
The classic BP neural network algorithm optimizes by gradient descent and is therefore prone to converging to a local minimum [22]. After the standard BP neural network was applied to the training set, it was difficult to distinguish the local extreme points from the global extreme points, and the error was large when correcting the weights and thresholds. Therefore, the genetic algorithm was selected to optimize the BP neural network. The GABP flowchart is shown in Fig. 2.
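(1) Enter the training samples and normalize the feature values of the input samples to the range [0,1].
(2) Network initialization. Based on the BPNN, the optimization parameters were set and adjusted continuously through practice. The population size was set to 200 and the number of generations to 500. The generated population was coded, and selection, crossover, and mutation operations were performed on it. The selection factor was set to 0.09, the crossover factor to 0.4, and the variation (mutation) factor to 0.001. The initial parameters of the GABP are shown in Table 2.
(3) During GABP training, the algorithm was optimized iteratively to obtain the optimal parameters, and the model was trained.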
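The genetic search can be sketched as below. The sketch is illustrative only: tournament selection, single-point crossover, Gaussian mutation, and elitism are assumptions, since the paper does not specify its operators. The fitness function is a simple placeholder for the network training error, and the demo uses a smaller population and a larger mutation rate than Table 2 so that it runs quickly.

```python
import random


def genetic_search(fitness, n_genes, pop_size=40, generations=60,
                   crossover_rate=0.4, mutation_rate=0.05, seed=0):
    """Minimize `fitness` over real-valued chromosomes of length `n_genes`."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = min(population, key=fitness)[:]

    def tournament():
        # Assumed selection scheme: the better of two random individuals survives.
        a, b = rng.sample(population, 2)
        return (a if fitness(a) <= fitness(b) else b)[:]

    for _ in range(generations):
        children = [best[:]]  # elitism: the best chromosome is never lost
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if n_genes > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_genes)  # single-point crossover
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                for g in range(n_genes):
                    if rng.random() < mutation_rate:
                        child[g] += rng.gauss(0.0, 0.1)  # small random perturbation
                children.append(child)
        population = children[:pop_size]
        best = min(population, key=fitness)[:]
    return best


def placeholder_error(genes):
    """Stand-in for the BPNN training error as a function of its weights and thresholds."""
    return sum((g - 0.3) ** 2 for g in genes)


initial_best = genetic_search(placeholder_error, n_genes=5, generations=0)
evolved_best = genetic_search(placeholder_error, n_genes=5, generations=60)
print(placeholder_error(initial_best), placeholder_error(evolved_best))
```

In a GABP setting, each chromosome would hold all weights and thresholds of the network, and the best chromosome found would serve as the starting point for backpropagation training.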

Assessment of model performance
Two statistical parameters were used to evaluate the training and prediction performance of the models: the root-mean-square error (RMSE) and the coefficient of determination (R 2 ) [23]. Their definitions are given in formulas (6) and (7), respectively:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2} \tag{6}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{7}$$

where $n$ is the number of samples, $y_i$ is the experimentally measured value, $f(x_i)$ is the value predicted by the model, and $\bar{y}$ is the average of the experimental measurements. The closer RMSE is to zero and the closer R 2 is to 1, the better the model fits and the more accurate the prediction.
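Both metrics are short to compute. The following sketch uses their standard definitions; the function names and the toy values are hypothetical.

```python
import math


def rmse(measured, predicted):
    """Root-mean-square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(measured, predicted)) / n)


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(measured) / len(measured)
    ss_res = sum((y - f) ** 2 for y, f in zip(measured, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot


# Toy values, not data from the study.
measured = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]  # every prediction is off by exactly 1
print(rmse(measured, shifted))        # 1.0
print(r_squared(measured, measured))  # 1.0
```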

Correlation analysis
Table 3 shows that the factors affecting the ventilation rate of cigarettes were correlated to different degrees. Weight had an extremely significant positive correlation (p < 0.01) with circumference, filter air permeability, hardness, and open resistance. Circumference had an extremely significant positive correlation with hardness, filter air permeability, and open resistance, and a significant negative correlation with length. Length had a significant positive correlation (p < 0.05) with hardness and open resistance, and hardness had an extremely significant positive correlation with filter air permeability and open resistance. The total ventilation rate had an extremely significant positive correlation with weight, circumference, hardness, filter air permeability, and open resistance, but its correlation with length was low. Therefore, in cigarette production, the correlations between these indicators can provide theoretical guidance and technical support for studying the stability of the ventilation rate of the unit.

Multiple linear regression
The multiple linear regression model was fitted using SPSS. Table 4 shows that the coefficient of determination R 2 of the model was 0.841, indicating that about 84% of the variation in the total ventilation rate is explained by the selected variables. The Durbin-Watson value was 1.899, close to 2, indicating that autocorrelation in the residuals is acceptably low. In the analysis of variance (ANOVA), the p value (significance) of the model was reported as 0.000, that is, p < 0.001, which shows that the model contains at least one independent variable with a significant impact on the dependent variable Y. There is thus a linear relationship between the dependent variable and the independent variables, and the model is statistically significant.
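The Durbin-Watson statistic reported by SPSS can be reproduced from the regression residuals using its standard definition, the sum of squared successive differences divided by the sum of squared residuals. The sketch below uses hypothetical residuals. Values near 2 indicate little first-order autocorrelation, values near 0 strong positive autocorrelation, and values near 4 strong negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic of an ordered sequence of regression residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals that flip sign at every step (negative autocorrelation).
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Hypothetical residuals that never change (extreme positive autocorrelation).
constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
print(alternating, constant)  # 3.0 0.0
```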

Backpropagation neural networks
The BP neural network was trained with the parameters in Table 1, and the following results were obtained. Figure 3 plots the variation of MSE with the training epoch; the best network performance is achieved at epoch 3. Figure 4 shows the regression plots for the training set, verification set, test set, and overall data. The regression correlation coefficient, R, of all sets is above 0.88, and most of the predicted results are close to the measured values.

Genetic algorithm-optimized backpropagation
The GABP model was trained with the parameters in Table 2, and the following results were obtained. Figure 5 shows the change in MSE with the training epoch; the GABP reaches its best performance at epoch 4. Figure 6 shows the regression plots for the training set, verification set, test set, and overall data. The regression correlation coefficient, R, of all sets is above 0.91, indicating that the GABP has good predictive performance.

Predicted and actual values under different models
Figure 7 shows the scatter of predicted versus actual values for each model. The results of MLR (Fig. 7a), BP (Fig. 7b), and GABP (Fig. 7c) are similar, but GABP is more stable than MLR and BP: its points are evenly distributed between predicted and actual values, without systematic overestimation or underestimation.
As shown in Fig. 8, as the number of principal components increases, the cumulative explained variance increases gradually, while R 2 first increases, then plateaus, and finally rises slowly. In this paper, the number of principal components selected through experiments was 2, for which R 2 reaches 0.792, essentially the same as with 3 or 4 principal components. With 5 principal components, R 2 increases to 0.863, which is similar to the result without PCA.

Model comparison
In this paper, multiple linear regression, BP neural network, and GABP neural network were used to simulate and predict the ventilation rate of cigarettes. Based on the results in Table 5, the following can be concluded:
(1) Compared with the multiple linear regression model, the RMSE of the BPNN decreased by 1.7% and R 2 increased by 0.7%. The BPNN model therefore improved little over the multiple linear regression model in predicting cigarette ventilation rate, probably because BPNN tends to fall into local minima.
(2) After genetic optimization of the BPNN model, RMSE decreased by 6.9% and R 2 increased by 3.8% compared with the multiple linear regression model, a modest improvement in the prediction of cigarette ventilation rate. GABP also improved on BPNN, which implies that GA successfully found better weights and thresholds and thus produced a better-performing GABP model.
(3) After PCA dimensionality reduction with 2 principal components, the RMSE of GABP increased by 35.8% and R 2 decreased by 10.1%, so its predictive performance was worse than that of GABP without principal component analysis.
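The percentage comparisons above are relative changes with respect to a baseline model. A minimal sketch of that arithmetic follows; the function name and the numbers are made up for illustration and are not the Table 5 values.

```python
def percent_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Made-up example: an error metric falling from 4.0 to 3.0 is a 25% decrease.
change = percent_change(baseline=4.0, value=3.0)
print(change)  # -25.0
```

A negative result means the metric decreased relative to the baseline, which is an improvement for RMSE and a deterioration for R 2.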

Conclusion
(1) In this paper, data on cigarettes from Xuchang Cigarette Factory were used to build and test models of cigarette ventilation rate. The results showed that the indicators affecting ventilation rate were correlated with the total ventilation rate: it had an extremely significant positive correlation with weight, circumference, hardness, filter air permeability, and open resistance, but a low degree of correlation with length.
(2) The RMSE and R 2 of the multiple linear regression, BP neural network, and GABP neural network models were compared. The three models differed in predicting cigarette ventilation rate: the genetically optimized BP neural network performed slightly better than the BP neural network and multiple linear regression. These improvements can contribute to the stability of cigarette ventilation rate, which is beneficial for improving the quality of cigarettes.
(3) After PCA dimensionality reduction with 2 principal components, the running time was reduced and efficiency improved, but predictive performance decreased. The number of features may be crucial, and PCA dimensionality reduction loses some important information, which runs against the goal of improving predictive performance. Therefore, GABP without PCA dimensionality reduction was selected.

Fig. 1
Structure of BPNN

Fig. 7
Comparison between predicted values and experimental values of different algorithms: (a) MLR, (b) BPNN, (c) GABP

Table 1
Parameters of the BP model

Table 2
Parameters of the GABP model

Table 3
Correlation analysis
* and ** represent, respectively, a significant correlation at the p < 0.05 level and an extremely significant correlation at the p < 0.01 level

Table 4
Model summary

Table 5
Model evaluation index

In the future, the genetically optimized BP neural network can be further optimized, or other relevant factors affecting the cigarette ventilation rate can be added, to improve the prediction accuracy of the cigarette ventilation rate model.