
[1] LI Jia-nan, GAO Xing-quan, LI Zhuo, et al. Comparative Study of Four Machine Learning Algorithms for Soybean Protein Localization Predicting[J]. Soybean Science, 2022, 41(03): 337-344. [doi:10.11861/j.issn.1000-9841.2022.03.0337]


Comparative Study of Four Machine Learning Algorithms for Soybean Protein Localization Predicting

Soybean Science [ISSN: 1000-9841 / CN: 23-1227/S] Volume: 41; Issue: 2022, No. 03; Pages: 337-344; Publication date: 2022-05-20

Author(s): LI Jia-nan 1,2; GAO Xing-quan 2; LI Zhuo 1; TENG Xiao-hua 1; HUANG Bin 1; ZHANG Ji-cheng 3; TANG You 1,2 (1. Electrical and Information Engineering College, Jilin Agricultural Science and Technology University, Jilin 132101, China; 2. School of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 132000, China; 3. College of Electrical and Information, Northeast Agricultural University, Harbin 150030, China)

Keywords: Support Vector Machine algorithm; Naive Bayes algorithm; Decision Tree algorithm; Random Forest algorithm; soybean protein; missing completely at random; sequence position prediction

DOI: 10.11861/j.issn.1000-9841.2022.03.0337

Document code: A

Abstract: To identify effective methods for predicting the subcellular localization of soybean proteins under different degrees of missingness, and to improve prediction performance, this study took 10,000 soybean protein sequences with known subcellular localizations as the research object. Missingness was introduced completely at random at rates of 5%, 10%, 15%, 20% and 30%. Four machine learning algorithms (Support Vector Machine, Naive Bayes, Random Forest and Decision Tree) were then used to predict the subcellular locations of the missing sequences. The original and predicted locations were compared by correlation analysis, and the accuracy and performance of the algorithms were compared. The results showed that Random Forest had the highest prediction accuracy, while Naive Bayes ran fastest and used the least memory. When running time and memory are not a concern and high prediction accuracy is required, Random Forest outperforms the other three algorithms; under the same conditions, when memory is constrained, Naive Bayes may be preferred. The results indicate the suitability of different machine learning methods for prediction needs at different degrees of missingness, and the methods can be applied to localization prediction for soybean protein data.
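The evaluation protocol described in the abstract (mask a fixed fraction of entries completely at random, predict them, then compare accuracy, running time and memory) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state the sequence features or software used, so the sketch assumes that the localization labels are what gets masked, uses raw amino-acid counts as features, and substitutes a tiny hand-written multinomial Naive Bayes for the four algorithms compared in the study. The data are synthetic, and all names (`mask_mcar`, `TinyNaiveBayes`, `toy_dataset`, `evaluate`, `loc_A`, `loc_B`) are hypothetical.

```python
import math
import random
import time
import tracemalloc
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mask_mcar(n_items, fraction, seed=0):
    """Choose round(fraction * n_items) indices uniformly at random (MCAR)."""
    rng = random.Random(seed)
    return set(rng.sample(range(n_items), round(fraction * n_items)))


class TinyNaiveBayes:
    """Multinomial Naive Bayes over residue counts with Laplace smoothing.
    A stand-in classifier; the study compared SVM, NB, RF and DT."""

    def fit(self, seqs, labels):
        self.class_counts = Counter(labels)
        self.residue_counts = defaultdict(Counter)
        for seq, label in zip(seqs, labels):
            self.residue_counts[label].update(seq)
        return self

    def predict_one(self, seq):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.residue_counts[label]
            denom = sum(counts.values()) + len(AMINO_ACIDS)
            score = math.log(n / total)  # log prior
            for aa in seq:  # add smoothed log likelihood of each residue
                score += math.log((counts[aa] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def toy_dataset(n=200, length=60, seed=1):
    """Synthetic sequences: each made-up class favours a few residues."""
    rng = random.Random(seed)
    seqs, labels = [], []
    for i in range(n):
        label = "loc_A" if i % 2 == 0 else "loc_B"
        favoured = "AGLSV" if label == "loc_A" else "KREDQ"
        seq = "".join(
            rng.choice(favoured) if rng.random() < 0.5 else rng.choice(AMINO_ACIDS)
            for _ in range(length)
        )
        seqs.append(seq)
        labels.append(label)
    return seqs, labels


def evaluate(seqs, labels, fraction, seed=0):
    """Mask labels MCAR, train on the rest, predict the masked entries.
    Returns (accuracy, seconds, peak_bytes)."""
    masked = mask_mcar(len(seqs), fraction, seed)
    if not masked:
        raise ValueError("fraction too small: nothing was masked")
    train = [i for i in range(len(seqs)) if i not in masked]
    # tracemalloc adds overhead, so a careful benchmark would measure
    # time and memory in separate runs.
    tracemalloc.start()
    start = time.perf_counter()
    model = TinyNaiveBayes().fit([seqs[i] for i in train],
                                 [labels[i] for i in train])
    predicted = {i: model.predict_one(seqs[i]) for i in masked}
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    accuracy = sum(predicted[i] == labels[i] for i in masked) / len(masked)
    return accuracy, elapsed, peak


if __name__ == "__main__":
    seqs, labels = toy_dataset()
    for fraction in (0.05, 0.10, 0.15, 0.20, 0.30):
        acc, secs, peak = evaluate(seqs, labels, fraction)
        print(f"missing={fraction:.0%} accuracy={acc:.3f} "
              f"time={secs:.4f}s peak_memory={peak} bytes")
```

To compare several classifiers in the spirit of the study, one could pass any object with the same `fit` and `predict_one` methods into `evaluate` and tabulate the three metrics per missing rate; results on this synthetic data say nothing about the figures reported in the paper.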


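Memo

Received: 2021-10-22

Funding: Jilin Province Distinctive High-Level Discipline, Emerging Interdisciplinary Program "Digital Agriculture" (2018); Jilin Province Smart Agriculture Engineering Research Center Project (2016); National Natural Science Foundation of China (31801441).

First author: LI Jia-nan (born 1995), male, master's student, working mainly on bioinformatics. E-mail: rate_ljn@163.com.

Corresponding author: TANG You (born 1979), male, Ph.D., professor, senior engineer, working mainly on bioinformatics and agricultural informatization. E-mail: tangyou@neau.edu.cn.

Last Update: 2022-06-30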