LI Jia-nan, GAO Xing-quan, LI Zhuo, et al. Comparative Study of Four Machine Learning Algorithms for Soybean Protein Localization Predicting[J]. Soybean Science, 2022, 41(03): 337-344. [doi:10.11861/j.issn.1000-9841.2022.03.0337]
四种机器学习算法预测大豆蛋白质定位对比研究
- Title:
- Comparative Study of Four Machine Learning Algorithms for Soybean Protein Localization Predicting
- Keywords:
- Support Vector Machine algorithm; Naive Bayes algorithm; Decision Tree algorithm; Random Forest algorithm; soybean protein; missing completely at random; sequence position prediction
- Document code:
- A
- Abstract:
- To identify effective methods for predicting the subcellular localization of soybean proteins under different degrees of missingness, and to improve that prediction ability, this study took 10 000 soybean protein sequences with known subcellular localization as the research object and applied completely random missingness at ratios of 5%, 10%, 15%, 20% and 30%. Four machine learning algorithms (SVM, Naive Bayes, Random Forest and Decision Tree) were used to predict the subcellular location of the missing sequences. Correlation analysis was performed between the original and predicted locations, and the accuracy and performance of the algorithms were compared. The results showed that Random Forest had the highest prediction accuracy, while Naive Bayes ran fastest and used the least memory. When running time and memory were not a concern and high prediction accuracy was required, Random Forest outperformed the other three algorithms; under the same conditions, when memory use was the tighter constraint, Naive Bayes could be preferred. The results show the applicability of different machine learning methods under prediction requirements with different degrees of missingness, and the methods can be applied to localization prediction for soybean protein data.
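The evaluation protocol summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: "missingness" is interpreted here as hiding the localization label of a randomly chosen subset of sequences, the dataset is a synthetic stand-in, and `make_toy_dataset`, `mask_completely_at_random`, `CountNaiveBayes` and `evaluate` are hypothetical names introduced only for this example. Only a toy Naive Bayes is shown; the study's feature encoding and its SVM, Decision Tree and Random Forest setups are not described in the abstract.

```python
import math
import random
import time
import tracemalloc
from collections import Counter, defaultdict

ALPHABET_SIZE = 20  # standard amino acids, used for Laplace smoothing


def make_toy_dataset(n=200, length=60, seed=1):
    """Synthetic stand-in for labeled protein sequences (NOT the paper's data)."""
    rng = random.Random(seed)
    # Hypothetical residue pools: each class draws its letters from its own pool.
    profiles = {"nucleus": "KKKRRRAGLS", "chloroplast": "AAASSSTGLV"}
    names = sorted(profiles)
    sequences, labels = [], []
    for i in range(n):
        label = names[i % len(names)]
        sequences.append("".join(rng.choice(profiles[label]) for _ in range(length)))
        labels.append(label)
    return sequences, labels


def mask_completely_at_random(n_items, ratio, seed=0):
    """Pick indices to hide, independent of sequence content or label (MCAR)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_items), round(n_items * ratio)))


class CountNaiveBayes:
    """Tiny multinomial Naive Bayes over residue counts (illustrative only)."""

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        self.n_train = len(labels)
        self.residue_counts = defaultdict(Counter)
        for seq, label in zip(sequences, labels):
            self.residue_counts[label].update(seq)
        return self

    def _log_score(self, seq, label):
        counts = self.residue_counts[label]
        denom = sum(counts.values()) + ALPHABET_SIZE
        prior = math.log(self.class_counts[label] / self.n_train)
        return prior + sum(math.log((counts[aa] + 1) / denom) for aa in seq)

    def predict(self, sequences):
        return [max(self.class_counts, key=lambda lab: self._log_score(seq, lab))
                for seq in sequences]


def evaluate(model, sequences, labels, ratio, seed=0):
    """Hide a fraction of labels, predict them, and record accuracy, time, memory."""
    masked = mask_completely_at_random(len(labels), ratio, seed)
    hidden = set(masked)
    kept = [i for i in range(len(labels)) if i not in hidden]
    tracemalloc.start()
    start = time.perf_counter()
    model.fit([sequences[i] for i in kept], [labels[i] for i in kept])
    predicted = model.predict([sequences[i] for i in masked])
    seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    correct = sum(p == labels[i] for p, i in zip(predicted, masked))
    return {"ratio": ratio, "n_masked": len(masked),
            "accuracy": correct / len(masked),
            "seconds": seconds, "peak_bytes": peak_bytes}


sequences, labels = make_toy_dataset()
# Missingness ratios taken from the abstract.
results = [evaluate(CountNaiveBayes(), sequences, labels, ratio)
           for ratio in (0.05, 0.10, 0.15, 0.20, 0.30)]
for row in results:
    print(row)
```

Any other classifier exposing the same `fit`/`predict` pair could be passed to `evaluate` in place of `CountNaiveBayes`, which is how the four algorithms would be compared on accuracy, running time and peak memory at each missingness ratio.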
Memo
Received: 2021-10-22