YU He-long, LIU Yu-fan, ZHANG Ji-cheng, et al. Comparative Research for Imputation of Soybean Genome Missing Values Based on Various Machine Learning Methods[J]. Soybean Science, 2021, 40(01): 122-129. [doi:10.11861/j.issn.1000-9841.2021.01.0122]
Comparative Research for Imputation of Soybean Genome Missing Values Based on Various Machine Learning Methods
- Title:
- Comparative Research for Imputation of Soybean Genome Missing Values Based on Various Machine Learning Methods
- Article ID:
- 2021,40(1):122-129
- Keywords:
- Missing soybean genome data; K-nearest neighbours algorithm; SoftImpute algorithm; Random forest algorithm; Genome-wide association analysis
- Document code:
- A
- Abstract:
- To identify effective ways of imputing soybean genome sequencing data with different degrees of missingness, and to strengthen downstream data analysis, this study used the genome genotype data for two soybean traits, plant height and leaf area. Genotype data were artificially removed at rates of 5%, 10%, and 20%, and three machine learning methods (the K-nearest neighbours algorithm, the SoftImpute algorithm, and the random forest algorithm) were used to impute the missing values. The accuracy and performance of each method were then analysed. Genome-wide association analyses were performed on both the original and the imputed data, and the results were compared. In terms of accuracy, the random forest algorithm was the most accurate. In terms of running time, the SoftImpute algorithm was the fastest. In terms of memory, the SoftImpute algorithm used the least, except that once the data size reached 10 000×1 000 the K-nearest neighbours imputation algorithm used the least. In summary, when running time and memory are not a concern and high imputation accuracy is required, the random forest algorithm imputes better than the K-nearest neighbours and SoftImpute algorithms. When running time matters and the data set is large, the SoftImpute algorithm should be selected. In the same situation, if memory is the main constraint, the K-nearest neighbours imputation algorithm can be given priority. The results indicate how well each machine learning method suits imputation needs at different levels of missingness, and the methods can be applied to the handling of missing soybean genome data.
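
The evaluation design described in the abstract (mask known genotypes at 5%, 10%, and 20%, impute, then score the imputed calls against the held-back truth) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the simulated 0/1/2 genotype matrix, the mismatch-based distance, the majority vote, `k=5`, and every function name are assumptions made for the example, and only a simplified K-nearest neighbours imputer is shown.

```python
import random
from collections import Counter

MISSING = None  # marker for a masked genotype call (illustrative choice)


def simulate_genotypes(n_samples, n_markers, seed=0):
    """Toy 0/1/2 genotype matrix: each row copies one of two founder
    patterns with a little noise, so that neighbouring rows are informative."""
    rng = random.Random(seed)
    founders = [[rng.choice((0, 2)) for _ in range(n_markers)] for _ in range(2)]
    data = []
    for _ in range(n_samples):
        base = rng.choice(founders)
        data.append([g if rng.random() > 0.05 else rng.choice((0, 1, 2)) for g in base])
    return data


def mask_random(data, rate, seed=0):
    """Return a copy with a fraction `rate` of cells set to MISSING,
    together with the list of masked (row, column) positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(data)) for j in range(len(data[0]))]
    masked = rng.sample(cells, round(rate * len(cells)))
    out = [row[:] for row in data]
    for i, j in masked:
        out[i][j] = MISSING
    return out, masked


def distance(a, b):
    """Fraction of mismatching calls over markers observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(data, k=5):
    """Fill each missing call with the most common call among the k nearest
    rows that have that marker observed (falls back to 0 if none do)."""
    n = len(data)
    out = [row[:] for row in data]
    for i in range(n):
        if MISSING not in data[i]:
            continue
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: distance(data[i], data[j]))
        for m, g in enumerate(data[i]):
            if g is MISSING:
                votes = [data[j][m] for j in order if data[j][m] is not MISSING][:k]
                out[i][m] = Counter(votes).most_common(1)[0][0] if votes else 0
    return out


def accuracy(truth, imputed, masked):
    """Share of masked cells whose imputed call equals the original call."""
    return sum(truth[i][j] == imputed[i][j] for i, j in masked) / len(masked)


truth = simulate_genotypes(60, 40)
results = {}
for rate in (0.05, 0.10, 0.20):
    observed, masked = mask_random(truth, rate, seed=1)
    results[rate] = accuracy(truth, knn_impute(observed), masked)
    print(f"missing {rate:.0%}: accuracy {results[rate]:.3f}")
```

The same `mask_random` and `accuracy` helpers could wrap any other imputer, which is what makes a like-for-like comparison across methods possible. Running time and memory, the other two criteria in the abstract, would need to be measured separately.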
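Memo
National Natural Science Foundation of China (U19A2061); Jilin Province Science and Technology Development Plan (20190301024NY, 20200301047RQ); Jilin Provincial Development and Reform Commission Project (2020C005).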