
Comparative Study of Four Machine Learning Algorithms for Soybean Protein Localization Predicting

Soybean Science [ISSN: 1000-9841 / CN: 23-1227/S]

Issue:
2022, No. 03
Page:
337-344
Research Field:
Publishing date:

Info

Title:
Comparative Study of Four Machine Learning Algorithms for Soybean Protein Localization Predicting
Author(s):
LI Jia-nan(1,2), GAO Xing-quan(2), LI Zhuo(1), TENG Xiao-hua(1), HUANG Bin(1), ZHANG Ji-cheng(3), TANG You(1,2)
(1.Electrical and Information Engineering College, Jilin Agricultural Science and Technology University, Jilin 132101, China; 2.School of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 132000, China; 3.College of Electronic and Information, Northeast Agricultural University, Harbin 150030, China)
Keywords:
Support Vector Machines algorithm; Naive Bayesian algorithm; Decision Tree algorithm; Random Forest algorithm; soybean protein; completely random missing; sequence position prediction
PACS:
-
DOI:
10.11861/j.issn.1000-9841.2022.03.0337
Abstract:
To identify an effective method for predicting the subcellular localization of soybean proteins with different degrees of deletion, and to improve prediction ability for soybean protein subcellular localization, this study took 10 000 soybean protein sequences with known subcellular localization as the research object and introduced random missingness at rates of 5%, 10%, 15%, 20% and 30%. Four machine learning methods, namely the Support Vector Machine (SVM), Naive Bayes, Random Forest and Decision Tree algorithms, were used to predict the subcellular position of the sequences with missing data. Correlation analysis was performed between the original and predicted positions, and the accuracy and performance of the algorithms were compared. The results showed that the Random Forest algorithm had the highest prediction accuracy, while the Naive Bayes algorithm ran fastest and used the least memory. When running time and memory are not limiting factors and high prediction accuracy is required, the Random Forest algorithm is preferable to the other three algorithms. Under the same conditions, if memory is tightly constrained, the Naive Bayes algorithm may be preferred. The results indicate which machine learning methods suit prediction under different degrees of missingness, and the methods can be applied to localization prediction for soybean protein data.
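
The evaluation protocol summarized in the abstract (random missingness at a fixed rate, then a comparison of accuracy, running time and memory) can be sketched as below. This is a minimal illustration, not the authors' pipeline. It assumes that missingness is applied per residue within each sequence and marked with a placeholder `X`, and that amino-acid composition serves as the feature vector; the abstract specifies neither. The names `mask_at_random`, `composition`, `benchmark` and `majority_class` are hypothetical, and `majority_class` is only a stand-in for where an SVM, Naive Bayes, Decision Tree or Random Forest implementation would be plugged in.

```python
import random
import time
import tracemalloc
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mask_at_random(sequence, rate, rng, placeholder="X"):
    """Replace round(len * rate) residues, chosen uniformly at random.

    Assumption: missingness is per residue; the abstract does not say.
    """
    n_missing = round(len(sequence) * rate)
    missing = set(rng.sample(range(len(sequence)), n_missing))
    return "".join(
        placeholder if i in missing else residue
        for i, residue in enumerate(sequence)
    )


def composition(sequence):
    """Amino-acid composition over observed (non-masked) residues.

    Assumption: a simple 20-dimensional feature vector for illustration.
    """
    counts = Counter(r for r in sequence if r in AMINO_ACIDS)
    total = sum(counts.values()) or 1
    return [counts[aa] / total for aa in AMINO_ACIDS]


def benchmark(fit_predict, train_X, train_y, test_X, test_y):
    """Return (accuracy, seconds, peak_bytes) for one classifier callable."""
    tracemalloc.start()
    start = time.perf_counter()
    predicted = fit_predict(train_X, train_y, test_X)
    seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    correct = sum(p == t for p, t in zip(predicted, test_y))
    return correct / len(test_y), seconds, peak_bytes


def majority_class(train_X, train_y, test_X):
    """Placeholder classifier: always predicts the most common label."""
    label = Counter(train_y).most_common(1)[0][0]
    return [label] * len(test_X)


if __name__ == "__main__":
    rng = random.Random(0)
    sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
    for rate in (0.05, 0.10, 0.15, 0.20, 0.30):
        print(rate, mask_at_random(sequence, rate, rng))
```

Running `benchmark` once per classifier and per missingness rate would give the three quantities the abstract compares. Peak memory reported by `tracemalloc` covers only allocations made through Python's allocator, so it is a rough proxy for the "running memory" discussed in the study.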


Memo

Memo:
-
Last Update: 2022-06-30