Analysis of influencing factors on excellent teachers' professional growth based on DB-Kmeans method

Abstract

The Kmeans clustering algorithm is widely used because it is simple and runs efficiently. However, because its cluster centers are not well determined in advance, some discrete points are often assigned to the wrong category. To obtain more accurate clustering results when studying the factors that affect the professional growth of outstanding teachers, this paper proposes an improved algorithm that combines Kmeans with DBSCAN. The clustering results of the influencing factors and the computed evaluation metrics show that the optimized DB-Kmeans algorithm clearly improves the accuracy of the clustering results, and the scatter diagrams show that it handles edge points better than the original algorithms.

1 Introduction

Rejuvenating the country through science and education is an important policy of our country. The professional growth of teachers affects the development and future of national education. For an ordinary teacher to grow into an excellent teacher, in addition to his/her own efforts, he/she also needs to learn useful experience from other excellent teachers, which can effectively help his/her own growth. The interview records of excellent teachers are an effective summary of teachers' professional growth experience. Extracting key information from these interview texts and clustering the influencing factors of excellent teachers' professional growth can systematically provide valuable guidance for the professional development of teachers. It is of great and far-reaching significance to improve the professional quality of teachers and promote the development of education in our country.

How to mine and analyze the interview texts of these excellent teachers is of great significance and research value. Modern text information retrieval places higher demands on clustering algorithms. To obtain accurate and useful information more efficiently, the traditional clustering algorithms need to be optimized so that they can handle all kinds of texts and support deeper analysis. In this paper, to study the professional growth factors in the interview texts of outstanding teachers, the Kmeans and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithms are combined and improved, and DB-Kmeans (DBSCAN-Kmeans) is proposed. The initial value of the clustering is optimized and the method of selecting the cluster centroids is improved, which effectively improves the accuracy of the clustering results.

2 Related works

Kmeans clustering is an unsupervised, partition-based clustering algorithm that is widely used in text analysis and cluster analysis. However, because the Kmeans algorithm has no cluster centers determined in advance, many discrete points are still not assigned to the correct category. The DBSCAN clustering algorithm is a classical density-based spatial clustering algorithm. It starts from a randomly selected core point and recursively gathers the points that satisfy the density requirement, until it obtains the maximal region containing the core points and boundary points. DBSCAN does not need the number of clusters to be specified in advance; it needs only two parameters, Eps (Epsilon) and MinPts (Minimum Points). However, the algorithm is computationally inefficient and slow on relatively large datasets. Moreover, because its clustering quality depends on Eps and MinPts, and a poor choice of parameters directly degrades the result, multiple experiments are needed to find a good set of values. With the continuous development of technology, many scholars have made great improvements to the K-center clustering algorithm and DBSCAN, bringing great benefits to big data mining and information acquisition.

Wu Ying proposed a Canopy-Kmeans clustering algorithm. First, the Canopy algorithm was used to cluster the samples, and then, the initial clustering was obtained. The clustering result was used as the initial center and number of clusters of the Kmeans algorithm to get the result [1]. Gao Xin proposed a DT-Kmeans clustering algorithm, which first randomly selected a cluster center point, and determined the remaining clusters according to the data object density information and the distance information between the data object and the existing cluster center point [2]. Yan Minghui et al. proposed the introduction of Gaussian kernel density estimation to obtain the maximum probability to improve the way of Kmeans cluster center acquisition and finally improved the research effect of the traditional method [3]. Hima Bindu et al. proposed an improved algorithm of Firefly Algorithm (FA) mixed with Kmeans to find the optimal cluster center [4]. Valarmathy proposed a clustering algorithm combining DBSCAN density clustering and K-Distance tree algorithm [5]. Manogaran et al. proposed a modeling method combining Hidden Markov Model (HMM) and DBSCAN with GMM [6, 7]. Zhong Jun et al. proposed a hybrid algorithm of convolutional auto-encoding and Gaussian mixture, which was applied to the feature extraction of ECG signals, and saved a lot of time and effort of manual labeling [8]. Shi Yongge et al. proposed a hybrid algorithm of Kmeans and Extreme Gradient Boosting (XGBoost) to mine designated telecom customers with special behaviors from the vast voice communication records of telecom companies [9].

In view of the shortcomings of the Kmeans and DBSCAN algorithms and the hybrid algorithm idea proposed by the above scholars, this paper proposes an improved algorithm of DBSCAN combined with the Kmeans algorithm. The accuracy of the results in this paper is improved, and it provides data support for our research on the influencing factors of the growth process of outstanding teachers.

3 Kmeans and DBSCAN algorithms

The Kmeans algorithm first randomly selects K objects as cluster centers, then assigns the sample points to the class with the closest centroid according to the Euclidean distance, finally calculates the mean of the sample points in the class and updates the centroids until the results converge.

The specific steps of the algorithm are as follows:

  • Randomly generate K centroids;

  • Calculate the distance between every sample point and each centroid, and assign each point to the cluster of its closest centroid. The distance used is the Euclidean distance, given by formula (1);

    $$d\left( {x_{t} ,z_{k} } \right) = \sqrt {\sum\limits_{i = 1}^{N} {\left( {x_{ti} - z_{ki} } \right)^{2} } }$$
    (1)

    Here, {x1, x2, …, xt} is the sample set, {z1, z2, …, zk} are the K centroids, and the sum runs over the N components of each vector. Each sample point is assigned to the class of its closest centroid [10].

  • Recompute the K centroids as the mean of all points assigned to each class;

  • Repeat steps 2 and 3 until the sum of the distances of all sample points and their corresponding centroids of the class is minimized. The results tend to converge through multiple iterations.
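The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names, the `init_centroids`, `max_iter` and `seed` parameters, the choice of drawing the initial centroids from the samples, and the toy data are all assumptions made for the example.

```python
import math
import random


def euclidean(x, z):
    """Euclidean distance between two equal-length vectors, as in formula (1)."""
    return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def kmeans(points, k, init_centroids=None, max_iter=100, seed=0):
    """Plain Kmeans; returns (labels, centroids).

    init_centroids, max_iter and seed are hypothetical parameters
    introduced for this sketch only.
    """
    rng = random.Random(seed)
    # Step 1: choose K starting centroids (random samples unless given).
    if init_centroids is None:
        centroids = [list(p) for p in rng.sample(points, k)]
    else:
        centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, k=2)
```

On this toy data the two groups end up in different clusters; which group receives label 0 depends on the random initial centroids, which is exactly the sensitivity to initialization discussed in this paper.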

The algorithm flowchart of Kmeans is shown in Fig. 1.

Fig. 1 Flowchart of Kmeans clustering algorithm

DBSCAN assumes that clusters can be determined by how tightly the samples are distributed. Samples in the same category are closely connected; that is, around any sample in a category there must be other samples of the same category. The final clustering result is obtained by dividing all the closely connected words in the sample set into categories and displaying the result as a scatter plot. Different categories are shown in different colors, which makes the result more intuitive to read.

According to [11], for the sample set D = (x1, x2, …, xm), the DBSCAN algorithm relies on 5 core definitions: Eps neighborhood, core object and boundary object, directly density-reachable, density-reachable and density-connected. A cluster can contain one or more core points. If there is only one core point, all other non-core samples in the cluster lie in the Eps neighborhood of this core point. If there are multiple core points, the Eps neighborhood of any core point in the cluster must contain at least one other core point; otherwise the two core points cannot be density-reachable. The set of all samples in the Eps neighborhoods of these core points forms a DBSCAN cluster.
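To make these definitions concrete, the following minimal Python sketch grows clusters from core objects. It is an illustration under stated assumptions rather than the implementation used in this paper: the function names and the toy data are hypothetical, and the Eps neighborhood is taken to include the point itself.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def region_query(points, i, eps):
    """Indices of all points in the Eps neighborhood of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if distance(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:       # not a core object (may become a boundary object later)
            labels[i] = NOISE
            continue
        cluster += 1                       # start a new cluster from core object i
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]   # directly density-reachable from i
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:         # boundary object: reachable but not core
                labels[j] = cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:    # j is also a core object: keep expanding
                queue.extend(j_neighbors)
    return labels


# Toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Under the convention used in this sketch, setting `min_pts` to 1 makes every point a core object, so no point is ever labeled as noise.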

The quality of the DBSCAN clustering algorithm depends on the parameters Eps and MinPts, so it is necessary to conduct multiple experiments to obtain a set of values with better quality. After many experiments, Eps = 1 and MinPts = 1 are selected. Kmeans and DBSCAN have their own advantages and disadvantages in the implementation process. The comparison results are as follows in Table 1.

Table 1 Analysis of advantages and disadvantages of clustering algorithms

The table shows that the Kmeans algorithm has simpler parameters than DBSCAN, is easy to implement and runs quickly. In contrast, the DBSCAN algorithm has more parameters, and they strongly affect the clustering results. However, DBSCAN does not need the number of clusters K to be specified, and its cluster centers are all actual data samples, whereas a Kmeans centroid is either randomly assigned or computed as a mean and is therefore not necessarily a real sample.

4 The optimized DB-Kmeans algorithm

In view of the shortcomings of the DBSCAN and Kmeans algorithms, this paper proposes a hybrid of the two, referred to as DB-Kmeans. The hybrid aims to keep the advantages of both algorithms while reducing, to some extent, the effect of their shortcomings on the clustering results.

First, the DBSCAN algorithm performs a rough clustering to obtain the number of cluster categories and the cluster center points; then the Kmeans algorithm performs further clustering. In this way no K value has to be specified in advance when searching for the global optimal solution, and the results are less affected by the sensitivity of Kmeans to noise points and abnormal points. The clustering steps of DB-Kmeans are as follows:

  • Through the initial clustering of the DBSCAN algorithm, all the data are divided according to the density, and the cluster center point is obtained;

  • Use the cluster center and K value in the above results as the initial centroid and the number of categories, respectively;

  • Get the final clustering result and scatter plot based on Kmeans.

This algorithm addresses the slow clustering speed of the DBSCAN algorithm and can greatly speed up the computation. It also provides a more accurate initial value for the Kmeans clustering algorithm, which is of great significance to the final segmentation result. The algorithm flowchart is shown in Fig. 2. The pseudocode is shown in Table 2 below.
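The three clustering steps of DB-Kmeans can be sketched as follows. This is a hedged illustration, not the authors' code. Two assumptions are not stated in the text: the center of each DBSCAN cluster is taken as the mean of its members, and points that DBSCAN marks as noise are still assigned to their nearest centroid by Kmeans. All function names and the toy data are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mean(group):
    return [sum(col) / len(group) for col in zip(*group)]


def dbscan(points, eps, min_pts):
    """Rough density clustering; labels are 0, 1, ... or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i, p in enumerate(points):
        if labels[i] is not None:
            continue
        seeds = [j for j, q in enumerate(points) if dist(p, q) <= eps]
        if len(seeds) < min_pts:           # not a core object
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:            # former noise becomes a boundary object
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            near = [m for m, q in enumerate(points) if dist(points[j], q) <= eps]
            if len(near) >= min_pts:       # core object: expand the cluster
                seeds.extend(near)
    return labels


def kmeans(points, centroids, max_iter=100):
    """Kmeans started from the given centroids instead of random ones."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
               for p in points]
        if new == labels:                  # converged
            break
        labels = new
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids


def db_kmeans(points, eps, min_pts):
    # Step 1: rough clustering by density.
    rough = dbscan(points, eps, min_pts)
    # Step 2: take K and the initial centroids from the DBSCAN result.
    k = max(rough, default=-1) + 1
    if k == 0:
        raise ValueError("DBSCAN found no cluster; adjust eps or min_pts")
    # Assumption: the center of a DBSCAN cluster is the mean of its members.
    init = [mean([p for p, lab in zip(points, rough) if lab == c]) for c in range(k)]
    # Step 3: refine with Kmeans; noise points join their nearest centroid.
    return kmeans(points, init)


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (7, 7)]
labels, centroids = db_kmeans(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, 1]
```

In this toy run DBSCAN finds two clusters and treats (7, 7) as noise; the Kmeans stage then assigns that edge point to the nearer of the two clusters, which is the kind of edge-point handling the hybrid is meant to provide.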

Fig. 2 Flowchart of DB-Kmeans algorithm

Table 2 Pseudocode of DB-Kmeans algorithm

5 Results of experiments

The experiment is performed on a platform with an Intel(R) Core(TM) i5-1035G1 CPU, 16 GB of Samsung DDR4-3200 memory and the Windows 10 operating system. The development environment is Anaconda3 with Python 3.6. The main data are the results of keyword extraction from 100 long texts, which come from the translation of the interview manuscripts of excellent teachers by laboratory personnel and from teacher interviews in open online resources. A total of 550 keywords are used as the sample for the subsequent keyword clustering.

The manually extracted keywords are vectorized, and the keywords are then clustered as two-dimensional vectors. The scatter plot is shown in Fig. 3 below. In these scattergrams, different colors represent different clusters. Across the many experiments in this research, the clustering result is best when K = 7, so 7 clusters of different colors can be seen in the figure.

Fig. 3 Scatter plot

Looking at the scattergrams, a lot of data in Figs. 1 and 2 that are not clustered together correctly or lie on a category boundary are divided into appropriate categories in Fig. 3. To analyze the results more accurately and objectively, the results of the 6 clustering algorithms used in this paper are labeled according to the category of each cluster and compared with the results of manual clustering. The confusion matrix is exported to show how the real and predicted labels of the classification model correspond. Finally, the standard values for the 7 clusters are calculated from the confusion matrix.
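As a rough illustration of how such a confusion matrix can be built, the sketch below counts label pairs. It assumes that each cluster has already been mapped to one of the manual categories, so that both label lists use the same integer codes; the function name and the small label lists are made up for the example and are not data from the experiments.

```python
def confusion_matrix(true_labels, pred_labels, n_classes):
    """matrix[t][p] counts samples whose manual label is t and predicted label is p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, pred_labels):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels for 8 keywords and 3 categories.
manual = [0, 0, 0, 1, 1, 2, 2, 2]      # labels from manual clustering
predicted = [0, 0, 1, 1, 1, 2, 0, 2]   # labels from a clustering algorithm
matrix = confusion_matrix(manual, predicted, n_classes=3)
for row in matrix:
    print(row)
# [2, 1, 0]
# [0, 2, 0]
# [1, 0, 2]
```

The diagonal holds the keywords placed in the same category by both the algorithm and the manual clustering; every off-diagonal entry is a disagreement.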

6 Evaluations and discussion

Clustering is an unsupervised learning process. To evaluate the clustering effect, we label the clustering results manually and then apply the evaluation indicators commonly used for machine learning classification models: Accuracy, Recall and F1 value (H-mean value).

  • Accuracy The proportion of all correct predictions to the total. The formula is (2).

    $${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
    (2)

  • Recall rate The proportion of correctly predicted positives to all actual positives. The formula is (3).

    $${\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
    (3)

  • F1 value The harmonic mean of precision, TP/(TP + FP), and recall; the larger the better. The formula is (4).

    $$F1 = \frac{{2{\text{TP}}}}{{2{\text{TP}} + {\text{FP}} + {\text{FN}}}}$$
    (4)

Here, TP (True Positive) is the number of samples of the positive class that are predicted as positive; FP (False Positive) is the number of samples of the negative class that are predicted as positive; TN (True Negative) is the number of samples of the negative class that are predicted as negative; FN (False Negative) is the number of samples of the positive class that are predicted as negative.
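For a multi-cluster result, these counts can be read from the confusion matrix one cluster at a time, treating that cluster as the positive class and all others as negative. The sketch below illustrates this one-vs-rest reading; it is not the evaluation script used in this paper, and the function name and the example matrix are hypothetical. Accuracy is computed as the share of correct predictions, (TP + TN) divided by the total.

```python
def per_class_metrics(matrix):
    """One-vs-rest Accuracy, Recall and F1 for each class of a confusion matrix.

    matrix[t][p] counts samples with true label t and predicted label p.
    """
    total = sum(sum(row) for row in matrix)
    results = []
    for c in range(len(matrix)):
        tp = matrix[c][c]                          # class c predicted as c
        fn = sum(matrix[c]) - tp                   # class c predicted as another class
        fp = sum(row[c] for row in matrix) - tp    # another class predicted as c
        tn = total - tp - fn - fp                  # everything else
        accuracy = (tp + tn) / total
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
        results.append({"accuracy": accuracy, "recall": recall, "f1": f1})
    return results


# Hypothetical 3-class confusion matrix (rows: true labels, columns: predictions).
example = [[2, 1, 0],
           [0, 2, 0],
           [1, 0, 2]]
metrics = per_class_metrics(example)
print(metrics[1])  # {'accuracy': 0.875, 'recall': 1.0, 'f1': 0.8}
```

For class 1 in this example, TP = 2, FN = 0, FP = 1 and TN = 5, so Accuracy = 7/8, Recall = 2/2 and F1 = 4/5.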

In order to highlight the superiority of the DB-Kmeans algorithm more clearly, this paper introduces several other commonly used algorithms based on the two comparison algorithms to analyze the same data samples. According to the calculation formulas of the three standards, the evaluation result tables are obtained as Tables 3, 4 and 5, respectively.

Table 3 Accuracy of each cluster
Table 4 Recall of each cluster
Table 5 F1 of each cluster

By analyzing the tables, we find that the Accuracy, Recall and F1 values of DB-Kmeans are mostly the highest, which means that the proportion of words assigned to the correct cluster, the proportion of actual positives that are predicted as positive and the F1 value are mostly the largest. The experimental results indicate that the DB-Kmeans algorithm performs best among the compared methods. To make the evaluation results clearer, line charts are shown in Fig. 4.

Fig. 4 Line chart of results

From the analysis of the line charts, the three evaluation standard values of the DB-Kmeans algorithm are mostly the highest among these algorithms; besides, the accuracy and recall rate can basically reach more than 70%. Therefore, combined with the analysis of the advantages and disadvantages of the algorithms and the comparison of experimental data, this paper finally selects the DB-Kmeans algorithm with the best clustering results to study the influencing factors of the growth process of excellent teachers. Part of the results of the DB-Kmeans algorithm are shown in Table 6.

Table 6 Clustering results

From the table above, we find that the words are clustered into 7 categories:

The first category reflects the spiritual attitude of excellent teachers, such as lifelong learning, teaching, educating people, passion, knowledge ideal and teaching students in accordance with their aptitude. The second category can be interpreted as environmental factors, including school and family factors, such as classroom, external environment, leadership, parents and children. The third, fourth and fifth categories can be roughly distinguished as the impact of figures and events, professional ethics and work contents. For example, excellent teacher, professor and special education can be regarded as the impact of figures and events. Similarly, morality and sense of responsibility fall within professional ethics. The words in the sixth category are mainly about self-awareness and introspection, such as reflection, ideas, thinking and sense of achievement. The seventh category can be identified as professional knowledge and ability, such as the profession, classroom teaching, language ability and organizational management ability.

Compared with the manual clustering results, the seven categories shown in Table 6 basically cover the main factors influencing excellent teachers' professional growth, such as professional spirit, knowledge and ability, and self-awareness and introspection. These can be summarized as a person's inner factors. In contrast, the environmental factors and the impact of figures and events can be summarized as a person's outer factors. This clustering and analysis provide a significant reference for research on the factors influencing excellent teachers' professional growth.

7 Conclusions

In view of the shortcomings of the Kmeans and DBSCAN algorithms, this paper proposes an improved DB-Kmeans algorithm and evaluates the clustering results with three evaluation criteria. Experiments show that the optimized algorithm improves the accuracy of keyword clustering when analyzing, through interview records, the factors influencing excellent teachers' professional growth. However, because the research is restricted to the field of education, the amount of data prepared is relatively limited. As the amount of data increases, it is necessary to further test whether the DB-Kmeans algorithm can still maintain high-speed and effective calculation. Therefore, the next step of the research is to apply this improved algorithm to a wider research field, to process a larger amount of data and to further verify its superiority.

Availability of data and materials

Please contact the authors for data requests.

Abbreviations

DBSCAN: Density-based spatial clustering of applications with noise

Eps: Epsilon

MinPts: Minimum points

DT-Kmeans: Decision Tree-Kmeans

FA: Firefly algorithm

HMM: Hidden Markov model

GMM: Gaussian mixture model

ECG: Electrocardiogram

XGBoost: Extreme gradient boosting

TF-IDF: Term frequency-inverse document frequency

TP: True positive

FP: False positive

TN: True negative

FN: False negative

References

  1. Y. Wu, Research on Passenger Car Passenger Order Scheduling Based on Canopy-Kmeans Algorithm (Shanxi University, 2020)

  2. X. Gao, An Improved K-means Clustering Algorithm and a New Clustering Effectiveness Index Research (Anhui University, 2020)

  3. M. Yan, X. Xie, W. Li, D. Wu, X. Cui, S. Pan, Morphological clustering algorithm of typical load curve based on Gaussian kernel density estimation. Electr. Meas. Instrum. 1–8 (2022). https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CAPJ%26dbname=CAPJLAST%26filename=DCYQ20210316003%26uniplatform=NZKPT%26v=fzYhayA03xb9KgYHcfE22jsZ7B3eNVIqtoqr0ToI60YAoWnfgwuDsWQj7-MOLMZ

  4. G. HimaBindu, Ch. Raghu Kumar, C. H. Hemanand, N. Rama Krishna, Hybrid clustering algorithm to process big data using firefly optimization mechanism. Mater. Today Proc. (2020). https://doi.org/10.1016/j.matpr.2020.10.273

  5. N. Valarmathy, S. Krishnaveni, A novel method to enhance the performance evaluation of DBSCAN clustering algorithm using different distinguished metrics. Mater. Today Proc. (2020). https://doi.org/10.1016/j.matpr.2020.09.623

  6. G. Manogaran, V. Vijayakumar, R. Varatharajan, P.M. Kumar, R. Sundarasekar, C.-H. Hsu, Machine learning based big data processing framework for cancer diagnosis using hidden Markov model and GM clustering. Wirel. Pers. Commun. 102(3), 2099–2116 (2018)

  7. W. Jia, Y. Tan, L. Liu, J. Li, H. Zhang, K. Zhao, Hierarchical prediction based on two-level Gaussian mixture model clustering for bike-sharing system. Knowl.-Based Syst. 178, 84–97 (2019)

  8. J. Zhong, D. Hai, J. Cheng, C. Jiao, S. Gou, Y. Liu, H. Zhou, W. Zhu, Convolutional autoencoding and Gaussian mixture clustering for unsupervised beat-to-beat heart rate estimation of electrocardiograms from wearable sensors. Sensors 21(21), 7163 (2021)

  9. Y. Shi, S. Yan, M. He, X. Li, Hybrid data mining method of telecom customer based on improved Kmeans and XGBoost. J. Phys. Conf. Ser. 2010(1), 120 (2021)

  10. T. Li, Research on Patent Text Clustering Based on Improved K-means Algorithm (Hebei University of Engineering, 2020)

  11. J. Xiaoyun, Ru. Zheng, C. Jingxia, An EEG emotion recognition method based on multi-feature extraction. J. Shaanxi Univ. Sci. Technol. 36(05), 152–158 (2018)

Acknowledgements

Not applicable.

Funding

This work was supported by the Scientific Research Project of Tianjin Educational Committee (2021KJ182).

Author information

Contributions

XG and XD proposed the framework of the whole idea, the structure of the model and the algorithm. TH helped to perform the simulations and to analyze the results. YK provided the interview records, participated in the conception and helped to revise the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiaoming Ding.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Gao, X., Ding, X., Han, T. et al. Analysis of influencing factors on excellent teachers' professional growth based on DB-Kmeans method. EURASIP J. Adv. Signal Process. 2022, 117 (2022). https://doi.org/10.1186/s13634-022-00948-2

  • DOI: https://doi.org/10.1186/s13634-022-00948-2
