6|6 GATK4完整流程

0定义变量

source activate wes #GATK=~/biosoft/gatk/gatk-4.1.2.0/gatk ref=/mnt/f/kelly/bioTree/server/wesproject/hg38/Homo_sapiens_assembly38.fasta snp=/mnt/f/kelly/bioTree/server/wesproject/hg38/dbsnp_146.hg38.vcf.gz indel=/mnt/f/kelly/bioTree/server/wesproject/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

1 标记PCR重复reads
sample=SRR7696207 echo $sample gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates \ -I $sample.bam -O ${sample}_marked.bam \ -M $sample.metrics \ 1>log.mark 2>&1

运行结束后的文件如下
├── [ 17K]log.mark ├── [3.8G]SRR7696207.bam ├── [5.0G]SRR7696207_marked.bam ├── [3.3K]SRR7696207.metrics

2 FixMateInformation
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" FixMateInformation \ -I ${sample}_marked.bam \ -O ${sample}_marked_fixed.bam \ -SO coordinate \ 1>${sample}_log.fix 2>&1

这样就得到marked_fixed.bam文件。
接着进行index
samtools index ${sample}_marked_fixed.bam

3 BaseRecalibrator 碱基矫正
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./"BaseRecalibrator \ -R $ref\ -I ${sample}_marked_fixed.bam\ --known-sites $snp \ --known-sites $indel \ -O ${sample}_recal.table \ 1>${sample}_log.recal 2>&1

gatk --java-options "-Xmx20G -Djava.io.tmpdir=./"ApplyBQSR \ -R $ref\ -I ${sample}_marked_fixed.bam\ -bqsr ${sample}_recal.table \ -O ${sample}_bqsr.bam \ 1>${sample}_log.ApplyBQSR2>&1

此时文件结构如下
├── [7.2M]SRR7696207_bqsr.bai ├── [8.1G]SRR7696207_bqsr.bam ├── [ 13K]SRR7696207_log.ApplyBQSR ├── [24]SRR7696207_log.fix ├── [ 30K]SRR7696207_log.HC ├── [ 39K]SRR7696207_log.recal ├── [5.0G]SRR7696207_marked.bam ├── [5.0G]SRR7696207_marked_fixed.bam ├── [3.3M]SRR7696207_marked_fixed.bam.bai ├── [3.3K]SRR7696207.metrics ├── [246K]SRR7696207_recal.table

这里同样包含了两个步骤:
第一步,BaseRecalibrator,这里计算出了所有需要进行重校正的read和特征值,然后把这些信息输出为一份校准表文件(sample_name.recal_data.table)
第二步,PrintReads,这一步利用第一步得到的校准表文件(sample_name.recal_data.table)重新调整原来BAM文件中的碱基质量值,并使用这个新的质量值重新输出一份新的BAM文件。
注意,因为BQSR实际上是为了(尽可能)校正测序过程中的系统性错误,因此,在执行的时候是按照不同的测序lane或者测序文库来进行的,这个时候@RG信息(BWA比对时所设置的)就显得很重要了,算法就是通过@RG中的ID来识别各个独立的测序过程,这也是我开始强调其重要性的原因。
第4节构建WGS主流程
4 HaplotypeCaller命令
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller \ -R $ref\ -I ${sample}_bqsr.bam \ --dbsnp $snp \ -O ${sample}_raw.vcf \ 1>${sample}_log.HC 2>&1

...... 13:37:48.966 INFOProgressMeter - Starting traversal 13:37:48.966 INFOProgressMeter -Current LocusElapsed MinutesRegions ProcessedRegions/Minute 13:37:58.966 INFOProgressMeter -chr1:12211250.2511030660.0 13:38:09.200 INFOProgressMeter -chr1:17052540.3767022743.9 13:38:19.204 INFOProgressMeter -chr1:28999740.51330026390.6 13:38:29.224 INFOProgressMeter -chr1:39509420.71860027721.2 13:38:39.236 INFOProgressMeter -chr1:52891410.82538030292.4 13:38:49.236 INFOProgressMeter -chr1:67078381.03208031936.3 13:38:59.241 INFOProgressMeter -chr1:82925451.23978033963.7 13:39:09.243 INFOProgressMeter -chr1:99394441.34769035644.1 13:39:19.261 INFOProgressMeter -chr1:115171561.55525036713.0 13:39:29.284 INFOProgressMeter -chr1:128452001.76147036765.1 13:39:39.418 INFOProgressMeter -chr1:133712711.86363034565.2 13:39:49.440 INFOProgressMeter -chr1:151962742.07231036013.0 13:39:59.443 INFOProgressMeter -chr1:161675582.27706035436.1 ......

这样就得到vcf文件 以上可以批量进行
#设置环境和变量 source activate wes #如果把gatk加到了环境变量 就直接按下面走,否则 #gatk=~/biosoft/gatk/gatk-4.1.2.0/gatk #下面gatk改为$gatk #下面都设置为你自己的路径 ref=/mnt/f/kelly/bioTree/server/wesproject/hg38/Homo_sapiens_assembly38.fasta snp=/mnt/f/kelly/bioTree/server/wesproject/hg38/dbsnp_146.hg38.vcf.gz indel=/mnt/f/kelly/bioTree/server/wesproject/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz```for sample in {file1.sam,file2.sam,file3.sam...} do echo $sample #mark dupulicates gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates \ -I $sample.bam -O ${sample}_marked.bam \ -M $sample.metrics \ 1>log.mark 2>&1 #fixmateinformation gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" FixMateInformation \ -I ${sample}_marked.bam \ -O ${sample}_marked_fixed.bam \ -SO coordinate \ 1>log.fix 2>&1 #index samtools index ${sample}_marked_fixed.bam #baserecalibrator gatk --java-options "-Xmx20G -Djava.io.tmpdir=./"BaseRecalibrator \ -R $ref\ -I ${sample}_marked_fixed.bam\ --known-sites $snp \ --known-sites $indel \ -O ${sample}_recal.table \ 1>${sample}_log.recal 2>&1 gatk --java-options "-Xmx20G -Djava.io.tmpdir=./"ApplyBQSR \ -R $ref\ -I ${sample}_marked_fixed.bam\ -bqsr ${sample}_recal.table \ -O ${sample}_bqsr.bam \ 1>${sample}_log.ApplyBQSR2>&1 ## 使用GATK的HaplotypeCaller命令 gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller \ -R $ref\ -I ${sample}_bqsr.bam \ --dbsnp $snp \ -O ${sample}_raw.vcf \ 1>${sample}_log.HC 2>&1 done

可以把上面内容写入脚本,比如
cat gatk4.sh

【6|6 GATK4完整流程】然后运行就可以了

    推荐阅读