What to do with multi-batch WES data
Batches are often unavoidable. For example, the paper Biomed Res Int. 2014, doi: 10.1155/2014/319534, notes:
"In large WES studies, some samples are occasionally sequenced twice or even more times due to a variety of reasons, for example, insufficient coverage in the first experiment, sample duplication, and the rest. It is challenging how to best utilize these duplicated exomes for SNP discovery and genotype calling, especially with batch effects taken into consideration."
The authors happen to have exactly this kind of data: exome data from 92 subjects (51 cases and 41 controls) in the Shanghai Breast Cancer Study (SBCS) dataset. Libraries were prepared with the QIAmp DNA kit and Illumina TruSeq; the resulting fastq data went through the standard GATK pipeline, yielding 184 BAM files (two per subject).
Three strategies can be compared:
- M strategy: merge each subject's duplicates into one dataset
- group H: for each subject, the duplicate with the higher sequencing depth
- group L: for each subject, the duplicate with the lower sequencing depth
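The H/L split can be sketched as follows: for each subject, the replicate with the higher mean depth goes to group H and the other to group L. This is only a minimal sketch; the function name `split_by_depth`, the input layout, and the sample values are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

def split_by_depth(
    depths: Dict[str, List[Tuple[str, float]]]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Assign each subject's two replicates to group H and group L.

    `depths` maps a subject ID to a list of (bam_name, mean_depth) pairs.
    Returns (group_h, group_l), each mapping subject ID to a BAM name.
    """
    group_h: Dict[str, str] = {}
    group_l: Dict[str, str] = {}
    for subject, replicates in depths.items():
        if len(replicates) != 2:
            raise ValueError(f"{subject}: expected exactly 2 replicates")
        # Sort ascending by depth: first is the lower, second the higher.
        low, high = sorted(replicates, key=lambda r: r[1])
        group_h[subject] = high[0]
        group_l[subject] = low[0]
    return group_h, group_l

# Hypothetical input: file names and depths are made up for illustration.
example = {
    "S1": [("S1_run1.bam", 36.2), ("S1_run2.bam", 44.0)],
    "S2": [("S2_run1.bam", 41.5), ("S2_run2.bam", 35.1)],
}
group_h, group_l = split_by_depth(example)
```

For the M strategy, the two BAM files of a subject would instead be combined before variant calling, for instance with a tool such as `samtools merge`; the exact merging procedure is not described in this summary.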
The following hard filters were used to flag low-quality SNP calls:
- (1) ≥ 3 SNPs detected within 10 bp distance;
- (2) > 10% alignments mapped ambiguously;
- (3) SNPs having a quality score < 50;
- (4) variant confidence/quality by depth < 1.5;
- (5) strand bias score calculated by GATK > −1
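The five criteria above can be expressed as a simple predicate over per-SNP annotations. The sketch below is an illustration, not the GATK implementation: the `Snp` class and its field names are hypothetical, the cluster rule is interpreted as "three SNPs spanning at most 10 bp", and the strand bias cutoff is exposed as a parameter because its sign is garbled in the source text (the default of -1.0 is an assumption).

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Snp:
    chrom: str
    pos: int
    qual: float          # SNP quality score
    qd: float            # variant confidence / quality by depth
    mq0_fraction: float  # fraction of ambiguously mapped alignments (0..1)
    sb: float            # strand bias score

def clustered(snps: List[Snp], min_snps: int = 3, window: int = 10) -> Set[int]:
    """Return indices of SNPs lying in a run of >= min_snps SNPs within `window` bp."""
    flagged: Set[int] = set()
    order = sorted(range(len(snps)), key=lambda i: (snps[i].chrom, snps[i].pos))
    for k in range(len(order) - min_snps + 1):
        first = snps[order[k]]
        last = snps[order[k + min_snps - 1]]
        if first.chrom == last.chrom and last.pos - first.pos <= window:
            flagged.update(order[k:k + min_snps])
    return flagged

def passes(snp: Snp, in_cluster: bool, sb_max: float = -1.0) -> bool:
    """True if the SNP triggers none of the five filters.

    sb_max is an assumed cutoff; adjust it to the value actually used.
    """
    return not (
        in_cluster                   # (1) SNP cluster
        or snp.mq0_fraction > 0.10   # (2) ambiguous alignments
        or snp.qual < 50             # (3) low quality score
        or snp.qd < 1.5              # (4) low quality by depth
        or snp.sb > sb_max           # (5) strand bias
    )

def filter_snps(snps: List[Snp], sb_max: float = -1.0) -> List[Snp]:
    bad = clustered(snps)
    return [s for i, s in enumerate(snps) if passes(s, i in bad, sb_max)]

# Made-up records: only chr1:900 should survive all five filters.
snps = [
    Snp("chr1", 100, 80, 5.0, 0.0, -10.0),  # cluster member
    Snp("chr1", 104, 80, 5.0, 0.0, -10.0),  # cluster member
    Snp("chr1", 109, 80, 5.0, 0.0, -10.0),  # cluster member
    Snp("chr1", 500, 40, 5.0, 0.0, -10.0),  # low quality score
    Snp("chr1", 900, 80, 5.0, 0.0, -10.0),  # passes
    Snp("chr2", 50, 80, 1.0, 0.0, -10.0),   # low quality by depth
    Snp("chr2", 300, 80, 5.0, 0.2, -10.0),  # ambiguous alignments
    Snp("chr2", 700, 80, 5.0, 0.0, 0.5),    # strand bias
]
kept = filter_snps(snps)
```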
The comparison of the called SNPs is somewhat simplistic:
- heterozygous-homozygous ratio (Hete/Homo)
- transition-transversion ratio (Ti/Tv)
- overlap rate with the 1000 Genomes Project
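These three metrics are straightforward to compute from a list of SNP calls. The sketch below assumes a hypothetical tuple layout `(chrom, pos, ref, alt, genotype)` and a set of known sites standing in for the 1000 Genomes Project; none of these names or values come from the paper.

```python
from typing import Dict, Iterable, Set, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref: str, alt: str) -> bool:
    """A<->G and C<->T are transitions; every other SNP is a transversion."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else float("nan")

def summarize(
    calls: Iterable[Tuple[str, int, str, str, str]],
    known_sites: Set[Tuple[str, int]],
) -> Dict[str, float]:
    """Compute Hete/Homo, Ti/Tv, and the fraction of calls found in known_sites.

    The genotype field is 'het' or 'hom' (a hypothetical encoding).
    """
    het = hom = ti = tv = known = total = 0
    for chrom, pos, ref, alt, genotype in calls:
        total += 1
        if genotype == "het":
            het += 1
        else:
            hom += 1
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
        if (chrom, pos) in known_sites:
            known += 1
    return {
        "het_hom": ratio(het, hom),
        "ti_tv": ratio(ti, tv),
        "known_overlap": ratio(known, total),
    }

# Made-up calls for illustration only.
calls = [
    ("chr1", 100, "A", "G", "het"),
    ("chr1", 200, "C", "T", "hom"),
    ("chr1", 300, "A", "C", "het"),
    ("chr2", 400, "G", "A", "het"),
]
known_sites = {("chr1", 100), ("chr1", 300), ("chr2", 400)}
stats = summarize(calls, known_sites)
```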
Sequencing statistics for the H and L groups, respectively, include:
- an average of 64.0 and 57.2 million reads per exome
- mean depths of 43.4× and 36.0× across the target regions
- 98.23% and 98.65% of the reads aligned to the human reference genome
- 49.70% and 49.11% of the reads mapped to the target regions
- approximately 86.16% and 86.14% of the reads with mapping quality ≥ 20
Table S1: Data production by 92 duplicated WES subjects.
Table S2: Number of variants observed across the on-target and off-target regions.
First, the sequencing details:
MQ <10 (%) | MQ >= 20 (%) | Mapped bases (e9) | Mapping rate (%) | Target mapping (%) | Total reads (e6) | Group |
---|---|---|---|---|---|---|
10.71 | 85.75 | 1.83 | 98.26 | 49.14 | 45.18 | L |
11.08 | 86.41 | 2.24 | 99.1 | 49.27 | 55.01 | H |
10.62 | 86.74 | 2.63 | 98.93 | 53.49 | 59.8 | L |
10.57 | 86.68 | 3.53 | 98.86 | 53.87 | 79.93 | H |
9.88 | 87.53 | 2.5 | 98.95 | 54.97 | 55.71 | L |
9.86 | 87.49 | 3.61 | 98.91 | 55.37 | 79.82 | H |
10.59 | 85.86 | 1.82 | 98.26 | 49.33 | 44.76 | L |
10.92 | 86.59 | 2.25 | 99.11 | 49.48 | 54.89 | H |
10.13 | 87.37 | 2.11 | 99.03 | 52.87 | 48.31 | L |
10 | 87.35 | 3.01 | 98.91 | 53.38 | 68.63 | H |
10.95 | 86.49 | 2.15 | 99.02 | 53.39 | 48.78 | L |
10.82 | 86.51 | 3.04 | 98.95 | 53.87 | 68.62 | H |
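To compare the two groups at a glance, per-group means can be computed from rows like the ones above. The sketch below hard-codes only the Group, Total reads, and Target mapping columns of the twelve rows shown here; the helper name `group_means` is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (group, total reads in millions, target mapping %), copied from the rows above.
rows: List[Tuple[str, float, float]] = [
    ("L", 45.18, 49.14), ("H", 55.01, 49.27),
    ("L", 59.80, 53.49), ("H", 79.93, 53.87),
    ("L", 55.71, 54.97), ("H", 79.82, 55.37),
    ("L", 44.76, 49.33), ("H", 54.89, 49.48),
    ("L", 48.31, 52.87), ("H", 68.63, 53.38),
    ("L", 48.78, 53.39), ("H", 68.62, 53.87),
]

def group_means(table: List[Tuple[str, float, float]]) -> Dict[str, Tuple[float, float]]:
    """Return {group: (mean total reads, mean target mapping %)}."""
    reads = defaultdict(list)
    target = defaultdict(list)
    for group, total_reads, target_pct in table:
        reads[group].append(total_reads)
        target[group].append(target_pct)
    return {g: (mean(reads[g]), mean(target[g])) for g in reads}

means = group_means(rows)
for group in sorted(means):
    avg_reads, avg_target = means[group]
    print(f"{group}: {avg_reads:.2f} M reads, {avg_target:.2f}% on target")
```

Because these rows are only part of the full supplementary table, the resulting means need not match the cohort-wide averages quoted earlier.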
Next, the details of the called SNPs:
Sample index | # SNPs on-target | # SNPs off-target | # unique SNPs | Calling strategy |
---|---|---|---|---|
1 | 46645 | 102680 | 24554 | Merge |
1 | 44377 | 75767 | 1310 | High |
1 | 42880 | 64470 | 1139 | Low |
2 | 47409 | 105395 | 18742 | Merge |
2 | 46259 | 85611 | 1445 | High |
2 | 44916 | 74509 | 1076 | Low |
3 | 47100 | 103724 | 20247 | Merge |
3 | 46087 | 82940 | 1424 | High |
3 | 44681 | 69129 | 1051 | Low |
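One way to read this table is to ask how many more on-target SNPs the Merge strategy calls relative to High or Low for the same sample. The sketch below uses only the on-target counts from the three samples shown; the helper name `relative_gain` is hypothetical.

```python
from typing import Dict

# On-target SNP counts per sample and strategy, copied from the table above.
on_target: Dict[int, Dict[str, int]] = {
    1: {"Merge": 46645, "High": 44377, "Low": 42880},
    2: {"Merge": 47409, "High": 46259, "Low": 44916},
    3: {"Merge": 47100, "High": 46087, "Low": 44681},
}

def relative_gain(counts: Dict[int, Dict[str, int]], baseline: str) -> Dict[int, float]:
    """Fractional increase of Merge over `baseline` ('High' or 'Low') per sample."""
    return {
        sample: (c["Merge"] - c[baseline]) / c[baseline]
        for sample, c in counts.items()
    }

gain_vs_high = relative_gain(on_target, "High")
gain_vs_low = relative_gain(on_target, "Low")
for sample in sorted(on_target):
    print(
        f"sample {sample}: "
        f"+{gain_vs_high[sample]:.1%} vs High, +{gain_vs_low[sample]:.1%} vs Low"
    )
```

A larger raw count is not by itself evidence of better calls; the quality metrics listed earlier (Hete/Homo, Ti/Tv, overlap with known sites) are what indicate whether the extra SNPs are plausible.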