利用GATK对RNA-seq数据做call SNP 或 INDEL分析
GATK是一个很全面的工具,不仅可以做DNA-seq的数据分析,也可以用来做RNA-seq的数据分析,此为用GATK做RNA-seq的数据分析(snps 和indels)
conda activate GATK
cd GATK/rna-seq
1.数据下载
2、质控过滤(optional)
fastqc *.gz -o ./qc
- 由于fastqc报告结果总体合格,未进行过滤。
- 利用trimmomatic过滤低质量fastq片段代码参考如下
trimmomatic PE -phred33 -trimlog logfile sample_1.fastq.gz sample_2.fastq.gz out.read_1.fq.gz \
out.trim.read_1.fq.gz out.read_2.fq.gz out.trim.read_2.fq.gz \
ILLUMINACLIP:/home/shensuo/biosoft/Trimmomatic/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 \
SLsampleINGWINDOW:5:20 LEADING:5 TRAILING:5 MINLEN:50
3、star比对
此处比对与重测序数据的比对存在差别,重测序数据比对软件利用的是Bwa mem
mkdir ../bam/star_log ; cd ../bam
star_index=/home/data/gma49/GATK/ref/hg38/star-index/GRCh38_gencode_v22_CTAT_lib_Apr032020.plug-n-play/ctat_genome_lib_build_dir/ref_genome.fa.star.idxcat > SRR.list
SRR11618713
SRR11618726
SRR11618727
cat SRR.list | while read sample
do
cd star_log
fq1=../../raw/${sample}_1.fastq.gz
fq2=../../raw/${sample}_2.fastq.gz
STAR --runThreadN 4 --genomeDir $star_index \
--twopassMode Basic --outReadsUnmapped None --chimSegmentMin 12 \
--alignIntronMax 100000 --chimSegmentReadGapMax parameter 3 \
--alignSJstitchMismatchNmax 5 -1 5 5 \
--readFilesCommand zcat --readFilesIn $fq1 $fq2 --outFileNamePrefix ${sample}_
cd ../ ;mv ./star_log/${sample}_Aligned.out.sam ./${sample}.sam
samtools sort -o ${sample}.bam ${sample}.sam
samtools index $sample.bam
samtools flagstat $sample.bam > $sample.flagstat
rm ${sample}.sam
done
4.去除重复
这一步也没有用GATK的工具,用的是比较常用的picard,这个和dna-seq分析一样,都用的是picard的markduplicates命令。命令如下:
$java -jar $picard AddOrReplaceReadGroups I=$inputbam O=${outputbam}_rg.tmp RGID=id RGLB=library RGPL=ILLUMINA RGPU=machine RGSM=sample
$java -jar $picard MarkDuplicates I=${outputbam}_rg.tmp O=${outputbam}_markdup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=${outputbam}.metrics
5、SplitNCigarReads
-
这一步是RNA-seq特异性的一步。因为mRNA转录本是主要由DNA的外显子exon可变剪切组合而成
- 把转录信息比对到基因组上时,比对到intron的区域均为N,直接根据此结果会出现假阳性variant结果。
- 所以需要将跨越了n个intron的read片段,切割为n+1个子read片段。
cat SRR.list | while read sample
do
gatk SplitNCigarReads -R $GENOME \
-I ./${sample}_rmd.bam \
-O ${sample}_rmd_split.bam
gatk AddOrReplaceReadGroups -I ${sample}_rmd_split.bam \
-O ${sample}_rmd_split_add.bam -LB $sample -PL ILLUMINA -PU $sample -SM $sample
done
AddOrReplaceReadGroups
与之前bwa比对时的-R
参数,添加必要的group信息
6. 碱基质量分数重校准
碱基质量分数重校准,就是利用机器学习的方式调整原始碱基的质量分数。从而可以检测碱基质量分数中的系统错误。所谓的变异位点,就是与参考基因组不同的部分,假设原始数据中就存在着一些由于测序仪器产生的系统性误差,那么变异位点识别过程中找到的variant,就会存在大量的假阳性。因为碱基个数较多,即便机器说他识别的5亿个碱基有99%的概率是对,那么也就说有5千万可能是错的,数量级也是比较大的。
cat SRR.list | while read sample
do
gatk BaseRecalibrator \
-I ${sample}_rmd_split_add.bam \
-R $GENOME \
--known-sites /home/data/gma49/GATK/ref/hg38/var_ref/dbsnp_146.hg38.vcf.gz \
--known-sites /home/data/gma49/GATK/ref/hg38/var_ref/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-sites /home/data/gma49/GATK/ref/hg38/var_ref/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-O ${sample}_recal.table
gatk ApplyBQSR \
--bqsr-recal-file ${sample}_recal.table \
-R /home/data/gma49/GATK/ref/hg38/star-index/GRCh38_gencode_v22_CTAT_lib_Apr032020.plug-n-play/ctat_genome_lib_build_dir/ref_genome.fa \
-I ${sample}_rmd_split_add.bam \
-O ${sample}_recal.bam
done
6、Variant Calling
- 为每个样本生成 g.vcf
mkdir ../vcf
cat SRR.list | while read sample
do
gatk HaplotypeCaller \
--native-pair-hmm-threads 10 \
-R $GENOME \
-D /home/data/gma49/GATK/ref/hg38/var_ref/dbsnp_146.hg38.vcf.gz \
-I ${sample}_recal.bam \
--minimum-mapping-quality 30 \
-ERC GVCF \
-O ../vcf/${sample}.g.vcf
done
- 合并 g.vcf
cd ../vcf
gatk CombineGVCFs \
-R $GENOME \
--variant SRR11618713.g.vcf \
--variant SRR11618726.g.vcf \
--variant SRR11618727.g.vcf \
-O SRR3.all.g.vcfgatk GenotypeGVCFs \
-R $GENOME \
-V SRR3.all.g.vcf \
-O SRR3.vcf
过滤 filter variation
#提取SNP
gatk \
--java-options "-Xmx4g -Djava.io.tmpdir=./tmp" \
SelectVariants \
-R ../01.ref/genome.fasta \
-V all.merge_raw.vcf \
--select-type SNP \
-O all.raw.snp.vcf### 过滤SNP(Filter列加标记)
gatk --java-options "-Xmx4g -Djava.io.tmpdir=./tmp" \
VariantFiltration \
-R ../01.ref/genome.fasta \
-V all.raw.snp.vcf \
--filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 \
|| SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name 'SNP_filter' \
-O all.filter.snp.vcf
### 提取过滤好的SNP(将含有Filter过滤标签的SNP删除)
$ gatk \
--java-options "-Xmx4g -Djava.io.tmpdir=./tmp" \
SelectVariants \
-R ../01.ref/genome.fasta \
-V all.filter.snp.vcf \
--exclude-filtered \
-O all.filtered.snp.vcf
7.对SNP进行注释
###snp和indel分别用conver2annovar.pl把vcf文件转化成table_annovar.pl需要的格式
$convert2annovar --format vcf4 ${result}_fil_snp.vcf > ${result}_fil_snp.avinput
$convert2annovar --format vcf4 ${result}_fil_indel.vcf > ${result}_fil_indel.avinput
###把基因名字等注释出来
$table_annovar ${result}_fil_indel.avinput $mmdb -otherinfo --build mm10 -out ${result}_fil_indel.anno -protocol refGene,cytoBand -operation g,r -remove -nastring '.'
$table_annovar ${result}_fil_snp.avinput $mmdb -otherinfo --build mm10 -out ${result}_fil_snp.anno -protocol refGene,cytoBand -operation g,r -remove -nastring '.'
参考来源:
SNPs calling流程(GATK4) - 简书 (jianshu.com)
GATK4流程学习之RNA-Seq variant calling(SNP+INDEL) - 简书 (jianshu.com)
用GATK对RNA-seq做call SNP and INDEL分析 - 知乎 (zhihu.com)