Optimal guidance whale optimization algorithm and hybrid deep learning networks for land use land cover classification

Satellite image classification provides information about land use land cover (LULC), which is required in many applications such as urban planning and environmental monitoring. Recently, deep learning techniques have been applied for satellite image classification and achieved higher efficiency. The existing techniques in satellite image classification suffer from overfitting because the convolutional neural network (CNN) model generates a large number of features. This research proposes the optimal guidance-whale optimization algorithm (OG-WOA) technique to select the relevant features and reduce the overfitting problem. The optimal guidance technique increases the exploitation of the search by changing the position of the search agent relative to the best fitness value. This increase in exploitation helps to select the relevant features and avoid overfitting. The input images are normalized and applied to the AlexNet–ResNet50 model for feature extraction. The OG-WOA technique is applied to the extracted features to select the relevant ones. Finally, the selected features are processed for classification using Bi-directional long short-term memory (Bi-LSTM). The proposed OG-WOA–Bi-LSTM technique has an accuracy of 97.12% on AID, 99.34% on UCM, and 96.73% on NWPU, whereas the SceneNet model has an accuracy of 89.58% on AID and 95.21% on NWPU.


Literature survey
Recently, CNN-based models have been widely applied for scene classification in satellite images due to their efficiency. Recent CNN techniques in satellite image classification are reviewed here to understand their performance.
Xie et al. [11] applied label augmentation to process the data, and a joint label was assigned to each generated image so that data augmentation and labeling were considered at the same time. The intra-class diversity of the training set was increased by the augmented samples, which were then applied in the classification process. Kullback-Leibler (KL) divergence was applied to constrain the output distributions of two samples of the same category so as to generate consistent output distributions. The KL divergence approach provides considerable performance in satellite image classification but lower efficiency when data overlap. Bazi et al. [12] applied vision transformers, considered state-of-the-art in Natural Language Processing, to satellite image classification. A multi-head attention technique was used as a building block to provide long-range contextual relations between pixels in the images. Position embeddings were applied to the patches to track their positions, and the resulting sequence was fed to the multi-head attention technique. A softmax classification layer was applied to the token sequence for classification. The multi-head attention technique suffers from a vanishing gradient problem. Xu et al. [13] applied a Graph Convolutional Network (GCN) with deep feature aggregation for satellite image classification. A CNN pre-trained on ImageNet is applied for multi-layer feature extraction, and the GCN model is applied to reveal patch-to-patch correlations in the convolutional feature maps, from which more refined features are extracted. Multiple features are integrated using a weighted concatenation method based on three weighting coefficients, and the semantic classes of query images are obtained using a linear classifier. The GCN model's performance was affected by the overfitting problem of the CNN model. Alhichri et al. 
[14] applied a deep attention CNN model for satellite image classification, in which an attention mechanism computes a new feature map as a weighted average of the original feature maps. EfficientNet-B3-Attn-2 is an attention technique built on a pre-trained CNN model for satellite image classification. A dedicated branch was applied to measure the required weights in the CNN, and the end-to-end backpropagation technique was used to learn these weights. The model has lower efficiency compared to state-of-the-art techniques. Ma et al. [15] applied multiobjective neural evolution (SceneNet) for satellite image classification. An evolutionary algorithm was applied for network architecture search and coding, enabling flexible hierarchical feature extraction for satellite image classification. The searched network's performance error and computational complexity are balanced using a multi-objective optimization method, and a Pareto solution set was obtained through competitive neural architectures. The SceneNet model has an overfitting problem in classification due to the generation of many features in the CNN. Naushad et al. [16] applied transfer learning in CNN training by fine-tuning VGG16 and Wide Residual Networks (WRNs); additional layers replace the final layers for LULC classification on the EuroSAT dataset. The developed method was applied with data augmentation, adaptive learning rates, gradient clipping, and early stopping. The VGG16-WRN network has considerable performance but suffers from the limitation of overfitting.
Tang et al. [17] applied a new CNN-based Attention Consistent Network (ACNet) using a Siamese network. The ACNet dual-branch structure takes image pairs produced by spatial rotation as input. Different attention techniques are applied to mine object information from satellite images using similarities and spatial rotation. ACNet was applied to unify salient regions and the semantic categories of satellite images, and the learned features were used for the classification task. Li et al. [18] applied a Gated Recurrent Multi-Attention Neural Network (GRMA-Net) for satellite image classification. Informative features occur at multiple stages of a network, so multi-level attention modules are applied to focus on informative regions for feature extraction. A deep Gated Recurrent Unit (GRU) was used on spatial sequences to capture contextual relationships and long-range dependency. Li et al. [19] applied a locality-preserving deep cross-modal embedding network that fully assimilates pairwise intra-modal and inter-modal relations in an end-to-end manner to handle the inconsistency between two hybrid spaces. Large-scale satellite images were used to evaluate the model's classification performance. Wang et al. [20] applied an enhanced feature pyramid network with Deep Semantic Embedding (DSE) for satellite image classification. The DSE module generates discriminative features based on multi-level and multi-scale features, and a Two-branch Deep Feature Fusion (TDFF) module fuses features at various levels effectively. Zhang et al. [21] applied a meta-learning technique for few-shot classification in which a feature extractor was trained to represent the inputs. The classifier was optimized in metric space in the meta-training stage using cosine distance with a learnable scale parameter. The developed model shows considerable performance on two datasets but has the limitation of overfitting. Zhang et al. 
[22] developed a suitable CNN model, Remote Sensing-DARTS, to find an optimal network architecture for satellite image classification. New techniques were applied in the search phase for a smoother process and better distinction between operators. The optimal cell was stacked to build the final network for classification.
First, both global information and local features are crucial for distinguishing Remote Sensing (RS) images. Existing networks are good at capturing global features thanks to the CNNs' hierarchical structure and nonlinear fitting capacity; however, local features are not always emphasized. Second, to obtain satisfactory classification results, the distances between RS images of the same class should be minimized and those between images of different classes maximized. Nevertheless, these key points of pattern classification do not get the attention they deserve.

Proposed method
The AlexNet and ResNet50 models are applied to extract the features from the input images. The OG-WOA technique is applied to select the relevant features from the extracted features of AlexNet-ResNet50. The overall process of satellite image classification is shown in Fig. 1. The CNN model [22-26] shows improved performance over many classifier techniques such as the Naïve Bayesian classifier, decision tree, and Support Vector Machine (SVM). The CNN learns the feature representation during training and significantly reduces the time required for feature design, i.e., selecting the most distinguishing features. The most important operation in a CNN is convolution, and the convolution layer applies kernel filters on the input during forward propagation. Each convolution layer is assigned random kernel weights that are updated at each iteration from the loss function during network training; the final learned kernels capture certain types of patterns present in the input images. Figure 2 shows three steps: (i) convolution, (ii) stacking, and (iii) the Nonlinear Activation Function (NLAF). Considering an input matrix X and convolutional layer output O, there exists a set of kernels F_j, ∀j ∈ [1, ..., J]; the convolution output C(j) is given in Eq. (1).
The J convolution outputs C(j) are stacked to form the activation map D, as given in Eq. (2).
where J is the total number of filters and S denotes the stacking operation along the channel direction.
The activation map D is passed through the nonlinear activation function, which produces the final output activation map, as in Eq. (3).
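The three steps above can be sketched with plain NumPy. The helper names (`convolve2d`, `conv_layer`) and the tiny input are illustrative only, and ReLU stands in for the NLAF of Eq. (3); like most CNN frameworks, the sketch computes cross-correlation rather than flipped-kernel convolution.

```python
import numpy as np

def convolve2d(X, F):
    """Valid 2-D convolution of input X with kernel F (Eq. 1 sketch)."""
    h = X.shape[0] - F.shape[0] + 1
    w = X.shape[1] - F.shape[1] + 1
    C = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            C[i, j] = np.sum(X[i:i+F.shape[0], j:j+F.shape[1]] * F)
    return C

def conv_layer(X, kernels):
    """Stack the J activation maps along a channel axis (Eq. 2),
    then apply the nonlinear activation function, here ReLU (Eq. 3)."""
    D = np.stack([convolve2d(X, F) for F in kernels])  # shape (J, h, w)
    return np.maximum(D, 0.0)                          # NLAF

X = np.arange(16, dtype=float).reshape(4, 4)
kernels = [np.eye(2), -np.eye(2)]                      # J = 2 toy kernels
O = conv_layer(X, kernels)                             # shape (2, 3, 3)
```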

AlexNet
Deep architectures contain many hidden layers, and these hidden layers extract features in useful ways [27-30]. Deep networks achieve a higher classification rate in image classification than other techniques. AlexNet is a popular model that consists of several hidden layers to extract the features. Numerous enhancements were introduced in training its parameters, and the overall architecture is shown in Fig. 3.
The first improvement concerns the activation function: in classical neural networks, the nonlinearity is limited to the logistic function, tanh, arctan, etc. These activation functions have significant gradient values only in a small range around 0, so they fall into the gradient vanishing problem. The Rectified Linear Unit (ReLU) is a newer activation function applied to overcome this problem: the ReLU gradient equals 1 whenever the input is not less than 0. The training process is accelerated using ReLU, as in Eq. (5).
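A minimal sketch of why ReLU avoids the vanishing-gradient issue of saturating activations; the input values are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Gradient is exactly 1 for positive inputs, so it does not vanish
    # the way saturating functions (tanh, logistic) do for large |x|.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 3.0])
tanh_grad = 1.0 - np.tanh(x) ** 2   # saturates toward 0 as |x| grows
```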
The network can be viewed as many overlapping sub-networks that share the same loss function, and each sub-network can overfit; it is therefore useful to drop out some of the neurons. Some neurons are dropped to avoid overfitting in the network, and the fully connected layers are applied with dropout to improve performance. Each iteration trains only part of the neurons during dropout, which reduces joint adaptation between neurons and enhances generalization. The entire network output is the average over sub-networks, so dropout also increases robustness.
Convolutional layers automatically extract features, which are then reduced by the pooling layer. For an image I of width w and height h and a convolutional kernel m of width c and height b, convolution is defined in Eq. (6). Convolution learns the image features, and parameter sharing reduces model complexity. The extracted features are reduced using pooling layers, which consider neighboring pixels of the feature map and generate representative values. AlexNet uses max pooling to reduce the feature map; for example, max pooling over a 4 × 4 block of the feature map generates a 2 × 2 block containing the maximum values.
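The max-pooling step described above can be sketched as follows; the 4 × 4 feature map and the non-overlapping 2 × 2 window are illustrative assumptions.

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Non-overlapping max pooling: each size x size block is replaced
    by its maximum value."""
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = fmap[i*stride:i*stride+size,
                             j*stride:j*stride+size].max()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 5.],
                 [0., 1., 7., 2.],
                 [3., 6., 1., 1.]])
pooled = max_pool(fmap)  # 4x4 feature map -> 2x2 block of maxima
```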
Cross-channel normalization, a local normalization technique, improves feature generalization. The feature maps are normalized and then passed to the next layers; cross-channel normalization sums the values at the same position across several adjacent maps. Fully connected layers are used for classification, with the softmax activation function, as in Eq. (7).
The softmax output lies in the range 0 to 1, which is its main advantage for interpreting neuron activations; the activation function is used for this purpose. AlexNet is trained using different techniques, and the AlexNet model used for feature extraction is shown in Fig. 3.
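A minimal softmax sketch showing the 0-to-1 output range of Eq. (7); the input scores are illustrative.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; outputs lie in (0, 1)
    # and sum to 1, so they can be read as class probabilities.
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
```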
ResNet50 predicts the delta (residual) required between the prediction of one layer and the next. ResNet provides an alternate shortcut path for the gradient to flow through, which solves the vanishing gradient problem. ResNet50 uses identity mapping, which allows the model to bypass a weight layer of the CNN if the current layer is not necessary; on the training set this mitigates the overfitting problem. The ResNet50 model consists of 50 layers for feature extraction and is shown in Fig. 4.
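The identity-shortcut idea can be sketched as below; `weight_layer` is a hypothetical stand-in for a ResNet weight layer, not the actual ResNet50 block.

```python
import numpy as np

def residual_block(x, weight_layer):
    """Identity shortcut: the block learns only the delta F(x); the
    gradient can always flow through the '+ x' path."""
    return weight_layer(x) + x

# A toy weight layer that has collapsed to zero: the block then
# behaves as an identity mapping, effectively bypassing the layer.
zero_layer = lambda x: 0.0 * x
x = np.array([1.0, -2.0, 3.0])
y = residual_block(x, zero_layer)
```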
AlexNet is one of the most common image classification techniques, but it also offers the following benefits when used for feature extraction. The initial values that the AlexNet features can achieve are close to ideal because it has two parallel CNN lines, trained on two GPUs and connected crosswise, that fit the image easily. ResNet50, in turn, is far deeper than AlexNet, yet its architecture size is significantly smaller because it avoids large fully connected layers. ResNet50 makes it simple to train networks with many layers without raising the training error percentage, whereas AlexNet, being less deep, leads to more architectural errors. The subspace value is close to ideal when ResNet50 is used, but there is a possibility of overlap in the feature sub-space; the subspace error value of particular classes then changes while using those features throughout the training and testing stages. Additionally, ResNet50 often requires more training time, which can make it impractical for real-world applications.
As described in this section, 4096 and 64 features are retrieved from AlexNet and ResNet50, respectively. Optimal values from the AlexNet and ResNet50 models are gathered in order to acquire more useful features. The results from AlexNet and ResNet50 are then combined for a better representation of the object. These extracted features serve as input to the feature selection procedure.
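The fusion step can be sketched as a simple concatenation; the random vectors are placeholders for actual per-image features.

```python
import numpy as np

# Hypothetical per-image feature vectors with the dimensions stated
# above: 4096 from AlexNet and 64 from ResNet50.
rng = np.random.default_rng(0)
alexnet_features = rng.random(4096)
resnet50_features = rng.random(64)

# Concatenation yields the 4160-dimensional vector passed on to the
# OG-WOA feature selection stage.
fused = np.concatenate([alexnet_features, resnet50_features])
```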

Feature selection
Once feature extraction is done, feature selection is carried out using the OG-WOA algorithm. Feature selection is treated as a global combinatorial optimization problem that seeks to reduce noisy and redundant data while maintaining a uniform level of classification accuracy. Current feature selection approaches tend to choose a large number of irrelevant features for classification rather than selecting features adaptively. The Whale Optimization Algorithm (WOA) selects the pertinent features because it learns them adaptively. The discrete search space for WOA consists of all feasible combinations of attributes that can be chosen from the dataset.

Whale optimization algorithm
The Whale Optimization Algorithm (WOA) is inspired by the hunting behavior of humpback whales [36-38]. Humpback whales use the bubble-net hunting technique to encircle and catch prey in small fish groups. The prey position X* is the best whale position in WOA, and the other whales update their positions based on X*. The three whale behaviors, searching for prey (exploration), bubble-net attacking (exploitation), and encircling prey, are modeled as follows.
Encircling prey: In the whales' hunting process the prey is surrounded; whales detect the prey position and encircle it. The current best whale X* is considered the prey or close to the prey, and X* is used to update the positions of all other whales, as in Eqs. (8) and (9):

D = |C · X*(t) − X(t)|  (8)

X(t + 1) = X*(t) − A · D  (9)

where X(t) is the whale position, D is the distance between the whale and the prey X*(t), and t is the iteration counter. The coefficient vectors A and C are calculated using Eqs. (10) and (11):

A = 2a · r − a  (10)

C = 2 · r  (11)

where a is linearly reduced from 2 to 0 over the iterations and r is a random number in the range [0, 1].

Bubble-net attacking: The whales spin around the prey using either the spiral updating position or the shrinking encircling technique, as given in Eq. (12):

X(t + 1) = D′ · e^(bl) · cos(2πl) + X*(t)  (12)

where the spiral updating position (if p > 0.5) or the shrinking encircling technique (if p < 0.5) is used to update the whales, and p is a random number in [0, 1]. A takes random values in the range [−a, a], where a linearly decreases from 2 to 0 throughout the iterations. D′ = |X*(t) − X(t)| denotes the distance between the prey X* and the current whale X in the spiral updating position, the constant b defines the shape of the spiral movement, and l is a random number in [−1, 1].

Searching for prey
Whales perform a global search of the search space to find new prey. This occurs when the absolute value of vector A is greater than or equal to 1; otherwise exploitation is performed. In the exploration phase, the position of a whale is updated relative to a randomly selected whale X_rand instead of the best whale X*, as calculated using Eqs. (13) and (14):

D = |C · X_rand − X|  (13)

X(t + 1) = X_rand − A · D  (14)

where X_rand is a whale randomly selected from the current population.
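A single WOA position update combining the encircling, spiral, and exploration behaviors might be sketched as follows; the scalar parameters and the per-dimension treatment of A are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def woa_update(X, X_best, X_rand, a, b=1.0):
    """One WOA position update (Eqs. 8-14 sketch)."""
    r = rng.random(X.shape)
    A = 2 * a * r - a                      # Eq. (10)
    C = 2 * rng.random(X.shape)            # Eq. (11)
    p = rng.random()
    if p < 0.5:
        if np.all(np.abs(A) < 1):          # exploitation: encircle prey
            D = np.abs(C * X_best - X)     # Eq. (8)
            return X_best - A * D          # Eq. (9)
        D = np.abs(C * X_rand - X)         # exploration, Eq. (13)
        return X_rand - A * D              # Eq. (14)
    l = rng.uniform(-1, 1)                 # spiral update, Eq. (12)
    D_prime = np.abs(X_best - X)
    return D_prime * np.exp(b * l) * np.cos(2 * np.pi * l) + X_best

X = np.zeros(4)
X_new = woa_update(X, X_best=np.ones(4), X_rand=np.full(4, 0.5), a=1.0)
```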

Optimal guidance
In the optimal guidance process, a weight coefficient w, with reference to the PSO algorithm, is applied to improve WOA performance by adaptively changing the weight factor of the algorithm; the randomly followed object in the whale update is changed into the best individual in the swarm. For this reason, each whale gathers more data to better understand its own behavior, which keeps the group diverse and encourages a balance between the exploration and exploitation stages, increasing the algorithm's search efficiency. Consequently, it is possible to enumerate every possible subset of characteristics given the limited number of features. The modified position update equation is given in Eq. (15).
where gbest is the optimal solution at the current iteration, and w(t) changes adaptively according to Eq. (16), in which the maximum initial value of w is denoted as w_max. The maximum weight value of OG-WOA depends mainly on the population size. An adaptive weight coefficient is included in the position update to enhance exploitation, because conventional WOA gradually loses its exploitation ability over the iterations. Here, the value of the weight coefficient depends on w_max, the current iteration, and the maximum iteration value. The incorporation of the adaptive weight coefficient affects the position update of WOA, enhancing search efficiency while choosing the features. The w value is larger in the early stage, which benefits exploration, and becomes smaller in the later stage, which benefits exploitation. In total, 4160 features are extracted, out of which 2560 features are selected and processed for classification. After merging small-, medium-, and large-scale spatial and visual histograms, the Bi-LSTM network is finally used to classify the remote sensing scene images.
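The adaptive weight schedule can be sketched under the assumption of a linear decay from w_max to 0; the exact form of Eq. (16) is not reproduced here, only its stated dependence on w_max, the iteration t, and the maximum iteration t_max.

```python
def adaptive_weight(t, t_max, w_max=0.9):
    """Assumed linear decay of the weight coefficient w(t): large early
    (favours exploration), small late (favours exploitation)."""
    return w_max * (1.0 - t / t_max)

# Sample the schedule at a few iterations of a 100-iteration run.
weights = [adaptive_weight(t, 100) for t in range(0, 101, 25)]
```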

Classification
Once the feature selection is done, the features are processed for classification using Bi-LSTM. It is one of the best deep learning models for image classification which has been shown to accomplish the highest level of accuracy.

Bi-LSTM
A Bi-LSTM network is trained to classify the sequence data. In Bi-LSTM, the OG-WOA selected features are used as inputs, while the number of classes is the output. Since Bi-LSTMs are adept at remembering specific patterns, they perform noticeably better. Earth observation satellites typically take a series of images of the same location, and the interval between subsequent images enhances the temporal resolution. The major reason for utilizing Bi-LSTM is that the temporal pattern of the scenes is exploited across the image time series. Bi-LSTM processes the input sequence, referred to as i = i1, i2, ..., in, in both forward and reverse order. The encoded vector h_t is calculated through the accumulation of the forward and backward outputs.
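The forward/backward accumulation can be sketched as below; a simple tanh recurrence stands in for a full LSTM cell, and all shapes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def recurrent_step(h, x, W):
    # Simplified recurrent cell standing in for a full LSTM cell.
    return np.tanh(W @ np.concatenate([h, x]))

def bilstm_encode(seq, W_f, W_b, hidden=4):
    """Run the sequence forward and backward, then accumulate both
    outputs per step to form the encoded vectors."""
    h_f = np.zeros(hidden)
    h_b = np.zeros(hidden)
    fwd, bwd = [], []
    for x in seq:
        h_f = recurrent_step(h_f, x, W_f)
        fwd.append(h_f)
    for x in reversed(seq):
        h_b = recurrent_step(h_b, x, W_b)
        bwd.append(h_b)
    bwd.reverse()
    return [f + b for f, b in zip(fwd, bwd)]   # accumulate both directions

seq = [rng.standard_normal(3) for _ in range(5)]
W_f = rng.standard_normal((4, 7))              # hidden(4) + input(3) = 7
W_b = rng.standard_normal((4, 7))
encoded = bilstm_encode(seq, W_f, W_b)
```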
The logistic sigmoid function is defined as δ, and the first hidden layer output sequence is stated accordingly.

Datasets
The AID dataset consists of 10,000 images in 30 classes. The spatial resolution ranges from 8 m to 0.5 m, the image size is 600 × 600 pixels, and each class consists of 220 to 420 images.
The NWPU dataset is collected from Google Earth and consists of 31,500 images of size 256 × 256 pixels. The dataset has 45 classes with 700 images in each class, and the spatial resolution ranges from 30 m to 0.2 m.
Metrics: Accuracy, sensitivity, and specificity were measured from the output of the OG-WOA technique, and their formulas are given in Eqs. (20)-(22):

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (20)

Sensitivity = TP / (TP + FN)  (21)

Specificity = TN / (TN + FP)  (22)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
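The three metrics can be computed directly from confusion-matrix counts; the counts below are illustrative.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (20)
    sensitivity = tp / (tp + fn)                 # Eq. (21)
    specificity = tn / (tn + fp)                 # Eq. (22)
    return accuracy, sensitivity, specificity

acc, sen, spe = metrics(tp=90, tn=85, fp=15, fn=10)
```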

Results
The proposed OG-WOA-Bi-LSTM technique is evaluated on three datasets and compared with existing techniques.

AID dataset
The OG-WOA technique is compared with various feature selection and deep learning techniques on the AID dataset. Table 1 presents the accuracy, sensitivity, and specificity of Bi-LSTM with different optimization algorithms used for feature selection on the AID dataset. Feature selection techniques such as Particle Swarm Optimization (PSO), the Firefly Optimization Algorithm (FOA), Gray Wolf Optimization (GWO), and the Whale Optimization Algorithm (WOA) are compared with the OG-WOA technique, as in Table 1 and Fig. 8. The existing optimization techniques PSO, FOA, and GWO have the limitation of a local optima trap, and WOA has the limitation of lower exploitation in feature selection. The optimal guidance technique modifies the position of the search agent based on the best fitness values, which benefits exploitation in the search. Table 1 clearly shows that the OG-WOA technique achieves higher performance: 97.12% accuracy, 97.43% sensitivity, and 97.21% specificity.
Deep learning techniques such as LSTM, CNN, Recurrent Neural Network (RNN), and Generative Adversarial Network (GAN) models are compared with the proposed Bi-LSTM technique on the AID dataset, as shown in Table 2 and Fig. 9. The RNN, GAN, and CNN models have a limitation of overfitting due to the generation of many features in the network. The LSTM model handles sequences of data efficiently but suffers from the vanishing gradient problem. The proposed Bi-LSTM has 97.12% accuracy, 97.43% sensitivity, and 97.21% specificity, while the GAN model has 91.27% accuracy, 93.84% sensitivity, and 93.71% specificity in satellite image classification.

UCM dataset
Deep learning techniques and feature selection techniques were applied to the UCM dataset and compared with the OG-WOA technique. Table 3 presents the accuracy, sensitivity, and specificity of Bi-LSTM with different optimization algorithms used for feature selection on the UCM dataset. Feature selection techniques are applied to the UCM dataset and compared with the OG-WOA technique, as in Table 3 and Fig. 10. PSO, FOA, and GWO have limitations of a local optima trap and slower convergence in feature selection. The WOA model has the limitation of lower exploitation and tends to lose potential solutions in classification. The optimal guidance technique increases exploitation by changing the position of the search agent based on a higher fitness value. The OG-WOA technique has 99.34% accuracy, 99.44% sensitivity, and 99.31% specificity, while the GWO method has 93.22% accuracy, 93.8% sensitivity, and 91.7% specificity.
The deep learning techniques CNN, GAN, RNN, and LSTM are compared with the OG-WOA technique on the UCM dataset, as shown in Table 4 and Fig. 11. The CNN model has an overfitting problem due to the generation of many features in the model.

NWPU dataset
Deep learning techniques and feature selection techniques were applied to the NWPU dataset and compared with the OG-WOA model. Table 5 presents the accuracy, sensitivity, and specificity of Bi-LSTM with different optimization algorithms used for feature selection on the NWPU dataset. Feature selection techniques such as PSO, FOA, GWO, and WOA are compared with the OG-WOA technique, as shown in Table 5 and Fig. 12. The deep learning techniques are compared on the NWPU dataset, as shown in Table 6 and Fig. 13. Overfitting occurs in the CNN model, which fails to distinguish between highly similar categories. The LSTM model has considerable performance in classification but has the limitation of the vanishing gradient problem. The proposed Bi-LSTM has 96.73% accuracy, 97.21% sensitivity, and 97.24% specificity, while the ResNet50 model has 90.99% accuracy, 93.74% sensitivity, and 92.75% specificity.

Comparative analysis
The existing satellite image classification models were compared with the proposed OG-WOA-Bi-LSTM technique. The OG-WOA-Bi-LSTM technique is compared with recent techniques in satellite image classification, as shown in Table 7 and Fig. 14. The existing techniques show considerable performance in satellite image classification. The KL divergence [11] technique has lower efficiency in handling overlapping data and in differentiating highly correlated images. The vision transformer [12] technique suffers from a vanishing gradient problem, which degrades model performance. The GCN [13] and EfficientNet-B3-Attn-2 [14] models have overfitting problems that degrade classification efficiency. The SceneNet [15] technique has an overfitting problem due to the generation of many features in the network. The proposed OG-WOA-Bi-LSTM technique has the advantage of increasing exploitation in the search during feature selection. The incorporation of the adaptive weight coefficient helps to balance exploration and exploitation, which enhances search efficiency during feature selection. Therefore, it achieves an accuracy of 97.12% on AID, 99.34% on UCM, and 96.73% on NWPU, whereas the SceneNet model has an accuracy of 89.58% on AID and 95.21% on NWPU.

Conclusion
Satellite image classification is required for applications such as urban planning and agriculture. The existing techniques suffer from overfitting due to the generation of many features in the CNN model. This research proposes the OG-WOA technique to increase exploitation in feature selection. The AlexNet-ResNet50 model is applied to extract the features from the input images, and the OG-WOA technique is applied to select the relevant features from the extracted ones. The OG-WOA technique changes the position of the search agent relative to the best fitness value to increase exploitation; this helps to escape the local optima trap and increases convergence, whereas the existing techniques have limitations of a local optima trap and slower convergence in feature selection. Finally, the selected features are processed for classification using Bi-LSTM. In classification, the proposed OG-WOA-Bi-LSTM attains a higher accuracy of 97.12% on AID, 99.34% on UCM, and 96.73% on NWPU, whereas the SceneNet model has an accuracy of 89.58% on AID and 95.21% on NWPU. Future work involves applying an attention technique to further improve classification performance.