1.1 Related works
At present, research on applying artificial intelligence techniques to software defect classification and prediction falls into three main types:
1.1.1 (1) Intelligent algorithm-based software testing reuse [2,3,4,5,6]
Existing techniques for reusing software testing knowledge with intelligent algorithms focus mostly on test case reuse: starting from historical test cases, an intelligent algorithm recommends reusable test cases for the current project in order to improve testing efficiency and reduce cost. This includes test case reuse based on document similarity, the classification and intelligent retrieval of reusable test cases, and the automatic generation of software failure modes. Most of these techniques base their predictions on attribute metrics of the software product, the distribution of defects, the number of defects, and similar information. The problem addressed in this paper is different: the reuse prediction of radar software defect data, namely how to predict similar defects from historical defect data sets given the current radar software requirements. Because the existing techniques concentrate on test case reuse, they do not effectively consider the characteristics and failure mechanisms of radar software, and it is therefore difficult for them to predict radar software defects intelligently.
1.1.2 (2) Data text splitting technology
Text segmentation is a prerequisite for accurate classification and prediction of software defect data. Common text segmentation techniques currently fall into three groups. Lexicon-based matching [7] slices the data text according to fixed rules against a constructed dictionary (e.g., the modern Chinese dictionary), using segmentation algorithms such as reverse maximum matching and forward maximum matching. Statistical model-based segmentation [8] transforms the segmentation problem into a sequence annotation problem and solves it with statistical models such as the hidden Markov model (HMM), the conditional random field (CRF) model, and the maximum entropy model. Deep learning-based segmentation [9] uses algorithms such as convolutional neural networks (CNN) and recurrent neural networks (RNN) to learn from an annotated training set, so that text is segmented according to word frequency, contextual relationships, and so on. Radar software test data usually contains many technical terms, such as silent zone, sector, and identification zone. It remains difficult to segment radar-domain terminology accurately; for example, silent zone may be split into two meaningless words, silent and zone.
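To make the idea of treating segmentation as sequence annotation concrete, the following minimal sketch converts a segmented text into per-character tags and back. The B/M/E/S tag scheme (begin, middle, end, single) is a common choice and is an assumption here, since the text does not name a specific scheme; the function names are hypothetical.

```python
def to_bmes_tags(words):
    """Convert a list of words into one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def from_bmes_tags(text, tags):
    """Rebuild the word list from the raw text and its per-character tags."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words
```

A statistical or neural model then only has to predict one tag per character; the segmentation itself is recovered with `from_bmes_tags`.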
1.1.3 (3) Topic model-based text classification techniques
The main idea of a topic model is that a text is composed of two levels of structure, document-topic and topic-word: a document is a probability distribution over several topics, and each topic is in turn a probability distribution over words. Keyword extraction based on topic models trains on historical defect data (i.e., the training set) using Dirichlet allocation and the naive Bayes model, and decomposes the defect data text into multiple topics, each of which consists of one or more words that serve as the keywords of one topic of the text. The most common topic models are the LSA model, the PLSA model, and the LDA model. Based on the acquired topic models, software defect data can be classified.
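For probabilistic topic models such as PLSA and LDA, the two-level structure described above can be written as a mixture. The symbols K (number of topics), z (topic variable), w (word), and d (document) are notation introduced here for illustration and do not come from the original text.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
```

The first factor is the topic-word distribution and the second is the document-topic distribution.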
Among these topic models, LDA is unsupervised and requires no additional annotation or processing of the training set, so it involves less technical difficulty and workload and has been widely studied and applied in text classification [10,11,12]. However, the current LDA topic model gives little consideration to the requirement features and prior statistical features of the defect data. This leads to problems such as the forced assignment of implicit topics, poor alignment between classification results and requirements, and poor interpretability, all of which reduce the accuracy and recall of radar defect data classification.
1.2 Background
This paper builds on existing techniques such as the LDA topic model and the reverse maximum matching algorithm. The basic principles of these techniques are explained as follows:
1.2.1 (1) Principle of LDA topic model
The LDA topic model treats a text as a multi-layer structure of document-topic and topic-word. Each document can be viewed as a probability distribution over several topics, and each topic can be viewed as a probability distribution over several words. The core of the model is a Bayesian estimation process: given Dirichlet priors on the document-topic distribution and the topic-word distribution, the posterior topic distribution of each document is computed from the corpus. After model inference and parameter estimation, the text corpus is decomposed into multiple topic vectors, each consisting of one or more words that characterize a particular topic of the text. The principle of the LDA topic model is shown in Fig. 1 [11].
In Fig. 1, the document corpus (for example, radar software defect data) is assumed to contain D documents and N words, W_{d,n} denotes the nth word in the dth document, and each document is composed of k topics. The topic-word distribution ϕ_k of each topic obeys a Dirichlet distribution with parameter β. θ_d is the document-topic distribution; each document has its own topic distribution, and θ_d obeys a Dirichlet distribution with parameter α. Z_{d,n} denotes the topic assigned to the nth word of defect data d, and Z_{d,n} obeys a multinomial distribution with parameter θ_d.
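The generative process that these distributions describe can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming symmetric Dirichlet priors (a single scalar alpha and beta) and a fixed number of words per document. The function and parameter names are hypothetical, and the sketch covers only sampling, not the inference and parameter estimation mentioned above.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(num_docs, words_per_doc, vocab, num_topics, alpha, beta, seed=0):
    """Sample a synthetic corpus from the LDA generative process."""
    rng = random.Random(seed)
    # phi_k ~ Dirichlet(beta): one topic-word distribution per topic
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # theta_d ~ Dirichlet(alpha): the document-topic distribution
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(words_per_doc):
            # Z_{d,n} ~ Multinomial(theta_d): pick a topic for this word
            z = rng.choices(range(num_topics), weights=theta)[0]
            # W_{d,n} ~ Multinomial(phi_z): pick a word from that topic
            doc.append(rng.choices(vocab, weights=phi[z])[0])
        corpus.append(doc)
    return phi, corpus
```

For example, `generate_corpus(3, 5, ["sector", "zone", "scan"], 2, 0.5, 0.5)` returns two topic-word distributions and three documents of five words each. Training an LDA model amounts to inverting this process: recovering the topic-word and document-topic distributions from observed documents.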
1.2.2 (2) The reverse maximum matching algorithm
The maximum matching algorithm is the main algorithm used for text segmentation and comes in three variants: forward maximum matching, reverse maximum matching, and two-way matching. The main principle is to cut a candidate string from the text and compare it with the lexicon. If the candidate is a word, it is recorded; otherwise a single character is added or removed and the comparison is repeated. The process terminates when only a single character is left, and a string that cannot be matched is treated as an unregistered word.
The reverse maximum matching method, usually abbreviated as RMM, scans and matches from the end of the document being processed, each time taking the last 2i characters (an i-character string) as the matching field. If the match fails, the first character of the matching field is removed and matching is retried. Correspondingly, the segmentation dictionary is a reverse-order dictionary in which each entry is stored in reverse order. In actual processing, the document is first reversed to generate a reverse-order document; the forward maximum matching method is then applied to the reverse-order document using the reverse-order dictionary.
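The reverse maximum matching procedure can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: it scans directly from the end of the text instead of reversing the document and the dictionary, which yields the same segmentation, and it works on unspaced Latin text so that the radar terms mentioned earlier can serve as the example. The function name and the `max_len` parameter are hypothetical.

```python
def reverse_max_match(text, dictionary, max_len=None):
    """Segment `text` by reverse maximum matching against `dictionary`."""
    if max_len is None:
        # the longest dictionary entry bounds the matching field
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    end = len(text)
    while end > 0:
        # take the longest candidate that ends at position `end`
        start = max(0, end - max_len)
        # on a failed match, drop the leading character and retry;
        # a single remaining character is accepted as an unregistered token
        while start < end - 1 and text[start:end] not in dictionary:
            start += 1
        tokens.append(text[start:end])
        end = start
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


general = {"radar", "silent", "zone"}
domain = general | {"silentzone"}
without_term = reverse_max_match("radarsilentzone", general)
with_term = reverse_max_match("radarsilentzone", domain)
```

With only the general entries the text is cut into `radar`, `silent`, and `zone`; once the domain term is added to the dictionary, it is kept whole as `silentzone`.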
1.3 Framework for intelligent classification techniques for radar software defects
This paper proposes an intelligent classification method for radar software defects based on an improved LDA topic model (referred to as RadarDCP), whose technical framework is shown in Fig. 2. Based on the contents of Fig. 2, the overall technical scheme of this paper is described as follows:
Input:
- Historical software defect data set (test problem reports, the FMEA lists, etc.)
- The new project software requirements (such as the function name, interface name, etc.)
Process:
Based on the mechanism of software defect generation and propagation, a standardized software defect data record structure is developed, including related functions, defect cause, defect description, defect impact, impact level, control measures, and other fields. At the same time, each field is checked for content description, grammatical structure, consistency, etc., to ensure that the data format and content contain no duplication, so as to form an iteratively reusable radar software defect data set.
The radar domain standards, requirement documents, historical test data, and other corpus sets are analyzed, and a radar domain dictionary is built from multiple perspectives such as professional terms, synonyms, requirement information, stop words, and abnormal types. Then, with the help of the reverse maximum matching segmentation algorithm, the radar software defect data text can be segmented accurately.
The traditional LDA topic model is an unsupervised learning process, which suffers from the forced assignment of implicit topics: it is impossible to control the category and direction of topic acquisition for defect data, which may lead to uninterpretable classification results. In this paper, we propose an improved LDA topic model incorporating radar requirement features by referring to the Labeled-LDA method [13]. Requirement elements such as the radar software function name, interface data, and interface type are used as extended features and incorporated into the LDA model learning process to adjust the parameter estimation of the distribution function. In this way, the LDA model learning results are guided toward a topic model oriented to radar requirement features.
Using the improved LDA topic model that considers radar requirement features, the historical radar software defect data are used for training to form multiple defect topic models. Based on the correlation between each defect record and the topic models, the defect data are classified by topic, and the set of keywords for each topic model is obtained.
Output: The possible defect data predicted for the new project software requirements (function name, interface name, etc.)
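The data flow of the scheme above can be sketched as follows, under several assumptions. The record fields follow the field list in the text, but the Python names are hypothetical; each defect is assigned to its highest-weight topic; each topic is assumed to carry one requirement label, in the spirit of Labeled-LDA; and the records and topic weights are made up purely for illustration. The training of the improved LDA model itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class DefectRecord:
    """One standardized defect record (hypothetical field names)."""
    related_function: str
    defect_cause: str
    defect_description: str
    defect_impact: str
    impact_level: str
    control_measures: str


def classify_by_topic(doc_topic):
    """Group record indices by their highest-probability topic."""
    groups = {}
    for index, theta in enumerate(doc_topic):
        best_topic = max(range(len(theta)), key=lambda k: theta[k])
        groups.setdefault(best_topic, []).append(index)
    return groups


def predict_defects(requirement, topic_labels, groups, records):
    """Return historical records whose topic is labeled with `requirement`."""
    matches = []
    for topic, label in topic_labels.items():
        if label == requirement:
            matches.extend(records[i] for i in groups.get(topic, []))
    return matches


# Made-up records and document-topic weights, for illustration only.
records = [
    DefectRecord("sector scan", "boundary not checked", "scan exceeds sector limit",
                 "wrong coverage", "major", "add range check"),
    DefectRecord("silent zone", "flag not cleared", "emission inside silent zone",
                 "unwanted emission", "critical", "reset flag on exit"),
    DefectRecord("sector scan", "unit mismatch", "degrees passed where radians expected",
                 "wrong pointing", "major", "unify units"),
]
doc_topic = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]  # one row per record
topic_labels = {0: "sector scan", 1: "silent zone"}  # assumed requirement labels

groups = classify_by_topic(doc_topic)
predicted = predict_defects("sector scan", topic_labels, groups, records)
```

Here `predicted` holds the two historical records grouped under the topic labeled `sector scan`, which would be offered as candidate defects for a new requirement with that function name.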