基因家族扩张与收缩分析及物种进化树构建（上）基因家族扩张与收缩分析及物

一、数据准备首先，选取不同物种的Protein数据集：Arabidopsis_thaliana.fa；Citrus_grandis.fa；Dimocarpus_longan.fa；Durio_zibethinus.fa；Prunus_persica.fa； Vitis_vinifera.fa；Citrus_clementina.fa；Citrus_sinensis.fa；Diospyros_oleifera.fa；Malus_domestica.fa；Oryza_sativa.fa；Pyrus_communis.fa
然后进行数据处理，去冗余，只保留最长转录本，去除可变剪切：
python3 removeRedundantProteins.py -i input.fa -o output.fa
removeRedundantProteins.py

import sys,getopt def usage(): print('usage:python3 removeRedundantProteins.py -i -o <-h>') return def removeRedundant(in_file,out_file): gene_dic = {} flag = '' with open (in_file) as in_fasta: for line in in_fasta: if '>' in line: line1 = line.strip('>\n') line2 = line1.split('.') li = line2[0] flag = li try: gene_dic[li] except KeyError: gene_dic[li] = [line] else: gene_dic[li].append(line) else: gene_dic[flag][-1] += line with open (out_file,'w') as out_fasta: for k,v in gene_dic.items(): if len(v) == 1: out_fasta.write(gene_dic[k][0]) else: trans_max = '' for trans in gene_dic[k]: a = len(list(trans)) b = len(list(trans_max)) if a > b: trans_max = trans out_fasta.write(trans_max) def main(argv): try: opts, args = getopt.getopt(argv,'hi:o:') except getopt.GetoptError: usage() sys.exit() for opt, arg in opts: if opt == '-h': usage() sys.exit() elif opt == '-i': in_fasta_name = arg elif opt == '-o': outfile_name = arg try: removeRedundant(in_fasta_name,outfile_name) except UnboundLocalError: usage() return if __name__ == '__main__': main(sys.argv[1:])

将处理好的数据置于一个文件夹中“Dataset”
二、使用OrthoFinder查找直系同源基因 OrthoFinder这个软件，之前有一篇文章已经介绍过了，这里就不在赘述，这个软件安装十分友好，直接conda安装即可；
nohup orthofinder -f Dataset -M msa -S diamond -T iqtree -t 24 -a 24 2> orthofinder.log &
orthofinder参数详情：
-t 并行序列搜索线程数（默认= 16）
-a 并行分析线程数（默认值= 1）
-M 基因树推断方法。可选：dendroblast和msa（默认= dendroblast）
-S 序列搜索程序（默认= blast）选项：blast，mmseqs,，blast_gz，diamond（推荐使用diamond，比对速度很给力）
-A 多序列联配方式，需要添加参数-M msa时才有效；（默认= mafft）可选择：muscle，mafft
-T 建树方法，需要添加参数-M msa时才有效，（默认 = fasttree）可选：iqtree，raxml-ng，fasttree，raxml
-s <文件> 可指定特定的根物种树
-I 设定MCL的通胀参数（默认 = 1.5）
-x Info用于以othoXML格式输出结果
-p 将临时pickle文件写入到
-l 只执行单向序列搜索
-n 名称以附加到结果目录
-h 打印帮助文本
如果只需要查找直系同源基因，只需接“-f” 参数即可；此步也可建树，采用默认的建树方法fasttree，为无根树。
nohup orthofinder -f Dataset &
如果添加-M msa -T iqtree设定制定参数，可按照设定的参数使用最大似然法构建有根的物种进化树，构建的树为STAG树。
nohup orthofinder -f Dataset -M msa -S diamond -T iqtree -t 24 -a 24 2> orthofinder.log &
关于构建系统进化树，有很多种做法，常见的有利用物种全部的蛋白序列，构建STAG物种树；也有使用单拷贝直系同源基因构建的物种进化树，关于这一点，OrthoFinder查找同源基因，可以输出直系单拷贝同源基因的序列结果，后续也可使用其他构树软件及算法进行进化树构建。关于建树方法，则有距离矩阵法、最大简约法、最大似然法以及贝叶斯；当然目前主流采用的基本为最大似然法和贝叶斯，其中贝叶斯算法计算量巨大，耗时最久，其构建的树也认为最为“逼真”，但文章中使用较多的还是最大似然法，其耗时也需蛮久。
三、OrthoFinder输出结果解读 OrthoFinder输出的结果会在OrthoFinder文件夹下面的以日期命名的文件夹中，如：~/OrthoFinder/Results_May08

基因家族扩张与收缩分析及物种进化树构建（上）

文章图片
orthofinder输出结果
其中Orthogroups文件夹下面为查找的同源基因分组结果：
Orthogroups.GeneCount.tsv：每一行为直系同源基因组对应的基因数目；
Orthogroups_SingleCopyOrthologues.txt
Orthogroups.tsv：每一行为直系同源基因组对应的基因；
Orthogroups.txt：类似Orthogroups.tsv，输出格式为OrthoMCL;
Orthogroups_UnassignedGenes.tsv：物种特异性基因
Single_Copy_Orthologue_Sequences里为单拷贝直系同源基因
Comparative_Genomics_Statistics的相关结果文件：
Orthogroups_SpeciesOverlaps.csv：不同物种间的同源基因的交集；
SingleCopyOrthogroups.txt：单拷贝基因组的编号；
Statistics_Overall.csv：总体统计；
Statistics_PerSpecies.csv：分物种统计；
Gene_Trees: 每个直系同源基因基因组里的基因树；
SpeciesTree_rooted.txt: 从所有包含STAG支持的直系同源组推断的STAG物种树；
SpeciesTree_rooted_node_labels.txt: 同上，只不过多了一个标签信息，用于解释基因重复数据；
其中，我们可以用Orthogroups.GeneCount.tsv来作为CAFE的输入文件，分析基因家族的扩张与收缩；使用SpeciesTree_rooted.txt作为推断的物种树，并使用r8s，从中提取超度量树（ultrametric tree）即时间树；
四、用推断的物种树提取超度量树（ultrametric tree） 1、软件准备（r8s）

###Install: wget http://ceiba.biosci.arizona.edu/r8s/r8s1.81.tar.gz tar -xzvfr8s1.81.tar.gz cd r8s1.81/src make -f Makefile.linux

2、准备r8s输入文件r8s_ctl_file.txt

python cafetutorial_prep_r8s.py -i SpeciesTree_rooted.txt -o r8s_ctl_file.txt -s 6650255 -p 'Oryza_sativa,Arabidopsis_thaliana' -c '152'

参数：
-i path_tree_file: path to .txt file containing tree in NEWICK format
-s n_sites: number of sites in alignment that was used to infer species tree
-p list_of_spp_tuples: list of tuples (each tuple being two species IDs whose mrca's age we are constraining; e.g., [('ENSG00','ENSPTR'),('ENSFCA','ENSECA')]
-c list_of_spp_cal_points: list of flats, one for each tuple in list_of_spp_tuples (e.g., [6.4,80])
-s 即用于推断物种树的比对序列碱基数目；
-p 已知物种树中的一对物种；
-c 已知一对物种的分化年限：
可在timetree网站查询：为152 mya

基因家族扩张与收缩分析及物种进化树构建（上）

文章图片
Pairwise Divergence Time For Oryza sativa & Arabidopsis thaliana
cafetutorial_prep_r8s.py

__author__ = "Fabio H. K. Mendes" import os import argparse def r8s_prep(path_tree_file, sites_n, list_of_spp_tuples, list_of_spp_cal_points, path_output_file): tree_str = str() with open(path_tree_file, "r") as tree_file: tree_str = tree_file.readline().rstrip() fixage_lines = list() with open(path_output_file, "w") as output_file: output_file.write( "#NEXUS\nbegin trees; \ntree nj_tree = [&R] " + tree_str + "\nEnd; \nbegin rates; \nblformat nsites=" + \ sites_n + " lengths=persite ultrametric=no; \ncollapse; \n") for idx, pair in enumerate(list_of_spp_tuples): node_name = pair[0][-3:] + pair[1][-3:] output_file.write( "mrca " + node_name + " " + pair[0] + " " + pair[1] + "; \n") fixage_lines.append("fixage taxon=" + node_name + " age=" + list_of_cal_points[idx] + "; \n") for line in fixage_lines: output_file.write(line) output_file.write( "divtime method=pl algorithm=tn cvStart=0 cvInc=0.5 cvNum=8 crossv=yes; \ndescribe plot=chronogram; \ndescribe plot=tree_description; \nend; " ) if __name__ == "__main__": parser = argparse.ArgumentParser(description=__doc__, prog="mcl2rawcafe.py") parser.add_argument("-i", "--input-file", action="store", dest="input_file", required=True, type=str, help="full path to .txt file containing tree in NEWICK format") parser.add_argument("-o", "--output-file", action="store", dest="output_file", required=True, type=str, help="full path to file to be written (r8s input file)") parser.add_argument("-s", "--sites-n", action="store", dest="sites_n", required=True, type=str, help="number of sites in alignment used to estimate species tree") parser.add_argument("-p", "--pairs-species", action="store", dest="spp_pairs", required=True, type=str, help="") parser.add_argument("-c", "--calibration-points", action="store", dest="cal_points", required=True, type=str, help="") args = parser.parse_args() list_of_spp_tuples = list() for pair in args.spp_pairs.split(" "): list_of_spp_tuples.append(tuple(pair.split(","))) list_of_cal_points = args.cal_points.split(",") print "\nRunning cafetutorial_clade_and_size_filter.py as a standalone...\n" r8s_prep(args.input_file, args.sites_n, list_of_spp_tuples, list_of_cal_points, args.output_file)

3、运行r8s并提取超度量树

/home/Tools/r8s1.81/src/r8s -b -f r8s_ctl_file.txt > r8s_tmp.txt #提取超度量树 $tail -n 1 r8s_tmp.txt | cut -c 16- > twelve_spp_r8s_ultrametric.txt

五、基因家族扩张与收缩分析 1、软件安装（CAFE） conda install cafe
2、cafe输入数据处理

awk -F"\t" '{print "Desc""\t"$1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13}' Orthogroups.GeneCount.tsv > cafe-input.data sed -e '1 s/(null)/Desc/' -e '1 s/Orthogroup/Family/' -i cafe-input.data python cafetutorial_clade_and_size_filter.py -i cafe-input.data -o filtered_cafe_input.txt -s 2> filtered.log

cafetutorial_clade_and_size_filter.py

__author__ = "Fabio H. K. Mendes" import os import argparse def clade_filter(mcl_dump, clade_str): lines_to_keep_list = list() spp_idx_dict = dict() clades_list = list() if clade_str: # if clade filter was specified clades_list = clade_str.split(" ") with open(mcl_dump, "r") as input_file: for line_n, line in enumerate(input_file): line = line.rstrip() tokens = line.split("\t") spp_info = tokens[2:] if line.startswith("Desc"): spp_idx_dict = dict((sp, idx) for idx,sp in enumerate(spp_info)) continue if clades_list: clades_ok_list = list() for clade in clades_list: spp_list = clade.split(",") clade_count = sum(1 for sp in spp_list if int(spp_info[spp_idx_dict[sp]]) >= 1) if clade_count >= 2: clades_ok_list.append(1) if sum(clades_ok_list) == len(clades_list): lines_to_keep_list.append(line_n) # just keeping lines where >=2 species (among all of them) have gene copies clade_count = sum(1 for sp_count in spp_info if int(sp_count) >= 1) if clade_count >= 2: lines_to_keep_list.append(line_n) return set(lines_to_keep_list) def size_filter(mcl_dump, lines_to_keep_set): lines_to_remove_set = set() size_cutoff = 100 fam_size = int() with open(mcl_dump, "r") as input_file: for line_n, line in enumerate(input_file): line = line.rstrip() if line.startswith("Desc"): continue elif line_n not in lines_to_keep_set and len(lines_to_keep_set) > 0: continue tokens = line.split("\t") spp_info = tokens[2:] for gene_count in spp_info: if int(gene_count) >= size_cutoff: lines_to_separate_set.add(line_n) lines_to_keep_set -= lines_to_separate_set return lines_to_keep_set, lines_to_separate_set def filter_print(mcl_dump, lines_to_keep_set, lines_to_separate_set, output_file_name): if len(lines_to_keep_set) == 0 and len(lines_to_separate_set) == 0: exit("No filtering was done! Exiting...\n") with open(output_file_name, "w") as output_file: with open("large_"+output_file_name, "w") as output_file2: with open(mcl_dump, "r") as input_file: for line_n, line in enumerate(input_file): line = line.rstrip() + "\n" if line_n == 0: output_file.write(line) output_file2.write(line) elif line_n in lines_to_keep_set and len(lines_to_keep_set) >= 1: output_file.write(line) elif line_n not in lines_to_separate_set and len(lines_to_keep_set) == 0: output_file.write(line) # has to be if, not elif if line_n in lines_to_separate_set: output_file2.write(line) # cleaning up in case size filtering was not done if len(lines_to_separate_set) == 0: os.unlink("large_"+output_file_name) if __name__ == "__main__": parser = argparse.ArgumentParser(description=__doc__, prog="cafetutorial_clade_and_size_filter.py") parser.add_argument("-i", "--input-file", action="store", dest="input_file", required=True, type=str, help="full path to mcl's output dump file") parser.add_argument("-o", "--output-file", action="store", dest="output_file", required=True, type=str, help="full path to file to be written") parser.add_argument("-cl", "--clade-filter", action="store", dest="clade_str", default=None, required=False, type=str, help="list of clades (separated by white spaces) comprised of species identifiers (separated by comma) that must have at least two species with gene copies for a given gene family") parser.add_argument("-s", "--size-filter", action="store_true", dest="size_filter", required=False, help="option to perform size filtering") args = parser.parse_args() lines_to_keep_set, lines_to_separate_set = set(), set() # applying size filter (if no groups are specified, just the lines where just 1 species has gene counts are removed lines_to_keep_set = clade_filter(args.input_file, args.clade_str) if args.size_filter: lines_to_keep_set, lines_to_separate = size_filter(args.input_file, lines_to_keep_set) filter_print(args.input_file, lines_to_keep_set, lines_to_separate_set, args.output_file) # .add(0) to get header back

3、配置cafe文件 vim cafetutorial_run.sh

基因家族扩张与收缩分析及物种进化树构建（上）

文章图片
cafetutorial_run.sh
tree即为r8s提取的超度量树；
4、运行cafe

mkdir -p reports cafe cafetutorial_run.sh

5、结果统计 python cafetutorial_report_analysis.py -i reports/report_run.cafe -o reports/summary_run
summary_run_node.txt：统计每个节点中扩张，收缩的基因家族数目；
summary_run_fams.txt：具体发生变化的基因家族
六、CAFE_fig可视化 python3 /home/Tools/CAFE_fig/CAFE_fig.py resultfile.cafe -pb 0.05 -pf 0.05 --dump test/ -g svg --count_all_expansions
输出svg格式的文件，可导入AI编辑美化；
CAFE_fig运行报错：（module 'ete3' has no attribute 'TreeStyle'）
报错解决：
vim /home/Tools/CAFE_fig/CAFE_fig.py

基因家族扩张与收缩分析及物种进化树构建（上）

文章图片
CAFE_fig.py
修改脚本，添加from PyQt5 import QtGui
程序还在运行，后续贴出结果图。
参考 OrthoFinder
timetree
http://www.chenlianfu.com/?tag=genomecomparison
https://www.jianshu.com/p/146093c91e2b
r8s
【OrthoFinder】
Emms, D.M., Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol 16, 157 (2015) (https://doi.org/10.1186/s13059-015-0721-2)
Emms, D.M., Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 20, 238 (2019) https://doi.org/10.1186/s13059-019-1832-y)
【CAFE v.4.2.1】
Han, M. V., Thomas, G. W. C., Lugo-Martinez, J., and Hahn, M. W. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Molecular Biology and Evolution 30, 8 (2013)
【iqtree v. 1.6.12】
Lam-Tung Nguyen, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh (2015) IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies. Mol Biol Evol, 32:268-274. https://doi.org/10.1093/molbev/msu300
【modelFinder】
Subha Kalyaanamoorthy, Bui Quang Minh, Thomas KF Wong, Arndt von Haeseler, and Lars S Jermiin (2017) ModelFinder: Fast model selection for accurate phylogenetic estimates. Nature Methods, 14:587–589. https://doi.org/10.1038/nmeth.4285
【R8s v. 1.81】
Sanderson M J. R8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics, 2003, 19(2): 301-302.
【STAG tree】
Emms D.M. & Kelly S. STAG: Species Tree Inference from All Genes (2018), bioRxiv https://doi.org/10.1101/267914
补充材料 【基因家族扩张与收缩分析及物种进化树构建（上）】直系同源低拷贝核基因（orthologous low-copy nuclear genes, LCN）：在进化过程中，新基因通常来自事先存在的基因，新基因的功能从先前基因的功能进化而来。新基因的原材料来自基因组区域的重复，这种重复可包括一个或多个基因。作为物种形成的伴随事件而被重复，并继续保持相同功能的基因，称为直系同源基因（orthologous）。新的基因功能可由在单个物种的基因组中发生的重复引起的。在一个基因组内部的重复导致旁系同源基因（paralogous gene）。
最大似然法（maximum likelihood method）：使用概率模型，寻找能够以较高概率产生观察数据的系统发生树。
外群的选择：大多数的种系发生重建方法会产生无根树，但是观察树的拓扑结构无法识别树根应在哪一分支上。实际上，对于要证实哪一个分类单元的分支先于其他的分类单元，树根必须确定。在无根树中设定一个根，最简单的方法是在数据集中增加一个外群（outgroup）。外群是一种分类操作单元，且有外部信息表明外群在所有分类群之前就已分化。研究演化历史，一般选择比目标序列具有较早进化历史的序列作为外类群。
Bootstrap support: bootstrap是统计学上一种非参数统计方法，通过有放回的随机抽样，构建分类回归树。Jackknife与bootstrap类似，只是每次抽样时会去除几个样本，像小刀一样切去一部分。所谓bootstrap法就是从整个序列的碱基（氨基酸）中任意选取一半，剩下的一半序列随机补齐组成一个新的序列。这样，一个序列就可以变成许多序列，一个序列组也就可以变成许多个序列组。根据某种算法（距离矩阵法、最大简约法、最大似然法），每个多序列组都可以生成一个进化树。将生成的许多进化树进行比较，按照多数规则（majority-rule）就会得到一个最“逼真”的进化树。