转录组上游分析

Song Xudong2024-11-162024-11-16

创建环境

conda create -n rna_p3 python=3    sra-tools               
conda env list                                         #查看环境
conda activate rna_p3                                  #进入conda 环境，每次开始分析都要进入环境！！
conda deactivate                                    #退出当前conda环境

上游分析软件下载

conda install -y 软件名=版本号

例如sra-tools

conda install bioconda::sra-tools

具体下载可在conda官网找到 https://anaconda.org/

下载

质控清洗：fastqc multiqc trim-galore
比对计数： hisat2 subread samtools=1.6 salmon

建议一个个安装，顺利完成

示例文件下载

参考文章

https://www.nature.com/articles/s41422-021-00477-x#data-availability

找到文章中

All data generated in the current study are available in the Gene Expression Omnibus with accession number GSE154290.

在NCBI网站中：https://www.ncbi.nlm.nih.gov/

选择GEO DataSets

查找GSE154290

PAIRED 表示为双端测序

SRA.txt

SRR12207279
SRR12207280
SRR12207283
SRR12207284

prefetch下载

单个下载

prefetch SRR1482462

批量下载

prefetch -f no -p --option-file SRA.txt

后台下载

nohup prefetch SRR12207279 &

##使用超算无法通过提交sbatch的方式下载，怀疑是sbatch任务的网络问题

后台批量下载

nohup prefetch -f no --option-file SRA.txt &

$ nohup: ignoring input and appending output to ‘nohup.out’ 并不是报错，按回车继续

查看后台任务

jobs

#或者
ps -f

会显示

UID PID PPID C STIME TTY TIME CMD
dk_szy 195230 145297 4 15:11 pts/2 00:00:00 prefetch -f no –option-file SRA.txt

删除后台任务

kill -9 1

下载完成

会在nohub.out文件中显示下载的进展，可查看是否完整下载

SRA文件转为FASTQ格式

单个转格式

慢的转换，太慢了，不建议使用

#将当前
fastq-dump --split-3 --gzip ./SRR12207279

使用这个转换，多线程转换格式，输出的为fq的文件

fasterq-dump --split-3 ./SRR12207283

批量转格式

小命令：删除当前目录SRR文件夹里的所有分文件夹，只保留其文件

find ./SRR -mindepth 1 -type d -exec sh -c 'mv {}/* ./SRR; rmdir {}' \;

fasterq-dump进行批量转换，将所有 .sra 文件都放在SRR文件夹里

# 设置输入目录
sra_dir="./SRR"

# 设置输出目录
output_dir="./fastq-result"

# 遍历目录中的所有.sra文件
for sra_file in $sra_dir/*.sra
do
    # 获取不带路径的文件名
    filename=$(basename "$sra_file" .sra)
    
    # 使用fasterq-dump处理每个文件
    fasterq-dump --outdir "$output_dir" --split-3 "$sra_file"
done

生成在当前路径 ./fastq-result 下的 fastq 文件

md5检查文件完整性

这样检验似乎并不是很准确，需要用数据来源处所给的 md5 值进行比对

生成md5值

md5sum *gz >md5.txt
cat md5.txt

检查，要在当前文件夹下

md5sum -c md5.txt

原始数据质量查看

fastqc

ls ./fastq-result/* | xargs fastqc -t 12 -o   ./fastq-result/

multiqc

multiqc ./fastq-result/ -o   ./fastq-result/

质控

trim_galore

参考： https://mp.weixin.qq.com/s?search_click_id=2253394900344533860-1633341597788-628761&sub=&__biz=MzAxMDkxODM1Ng==&mid=2247503527&idx=4&sn=261be2c7ff2a7f80b2e6ab139028d780&chksm=9b4b8e1cac3c070a963f59fcc55095fe0ae37d1eecde3e67f6682779c91d0e907a8fd7eb8f84&scene=3&subscene=10000&clicktime=1633341597&enterid=1633341597&ascene=0&devicetype=android-30&version=28000f35&nettype=cmnet&abtest_cookie=AAACAA%3D%3D&lang=zh_CN&exportkey=AdXQxZkOzbrJK%2BUwprsoMEk%3D&pass_ticket=pvvLsQPcUT6hjrE3KSHbHTaZKL%2FVtNBVq%2BLbY9r0hucz%2FDdbU3NgO9ofB9mtC3fS&wx_header=1

单个质控

-q 25 质量

–phred33 测序类型

–length 35 最短长度

–stringency 3 设定可以忍受的前后adapter重叠的碱基数，默认为1（非常苛刻）。可以适度放宽，因为后一个adapter几乎不可能被测序仪读到。

–paired 双端测序

-o ./clean_data/ 设置输出目录

./fastq-result/SRR12207279_1.fastq ./fastq-result/SRR12207279_2.fastq 为输入文件

trim_galore -q 25 --phred33 --length 35 --stringency 3 --paired -o ./clean_data/ ./fastq-result/SRR12207279_1.fastq ./fastq-result/SRR12207279_2.fastq

批量质控

运行目录下包含 ./fastq-result/ 这个文件夹，存放测序数据 .fastq或 .fastq.gz

方法一：这个是标准的单线程循环方式，较慢。4个双端数据，大概半个小时，慢但是不容易出错

# 设置工作目录为fastq文件所在的目录
cd ./fastq-result/

# 遍历所有以_1.fastq结尾的文件
for file1 in *_1.fastq
do
    # 从文件名中提取没有_1的部分
    base=$(basename "$file1" _1.fastq)
    
    # 构建对应的_2.fastq文件名
    file2="${base}_2.fastq"
    
    # 检查对应的_2.fastq文件是否存在
    if [ -e "$file2" ]; then
        # 如果存在，执行trim_galore命令
        trim_galore -q 25 --phred33 --length 35 --stringency 3 --paired -o ../clean_data/ "$file1" "$file2"
    else
        # 如果不存在，打印错误信息
        echo "Error: No matching file found for $file1"
    fi
done

fq.gz格式文件是处理后得到的数据，txt格式文件是样品处理的结果报告，也包括软件运行的参数信息

方法二：多任务同时进行，4个双端数据，10分钟

# 设置工作目录为fastq文件所在的目录
cd ./fastq-result/

# 遍历所有以_1.fastq结尾的文件
for file1 in *_1.fastq; do
    # 从文件名中提取没有_1的部分
    base=$(basename "$file1" _1.fastq)
    
    # 构建对应的_2.fastq文件名
    file2="${base}_2.fastq"
    
    # 在后台执行检查和trim_galore命令
    (
        # 检查对应的_2.fastq文件是否存在
        if [ -e "$file2" ]; then
            # 如果存在，执行trim_galore命令
            trim_galore -q 25 --phred33 --length 35 --stringency 3 --paired -o ../clean_data/ "$file1" "$file2"
        else
            # 如果不存在，打印错误信息
            echo "Error: No matching file found for $file1"
        fi
    ) &
done

# 等待所有后台进程完成
wait

结果

再次查看样品质量

XXXXXX_val_1.fq即为清洗后的序列

fastp质控

此处为扩展学习

conda下载fastp

# note: the fastp version in bioconda may be not the latest
conda install -c bioconda fastp

单个

输入 -i -I 双端测序文件，输出 -o -O 质控处理后文件，和 json文件，fastp.html结果

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz

批量

# 创建清理后的文件夹
mkdir  clean-fastp

# 设置工作目录为fastq文件所在的目录
cd ./fastq-result/

# 遍历所有以_1.fastq结尾的文件
for file1 in *_1.fastq; do
    # 从文件名中提取没有_1的部分
    base=$(basename "$file1" _1.fastq)
    
    # 构建对应的_2.fastq文件名
    file2="${base}_2.fastq"
fileoo1="${base}_1.fq"
fileoo2="${base}_2.fq"
jsono="${base}.json"
htmlo="${base}.html"

    
    # 在后台执行检查和trim_galore命令
    (
        # 检查对应的_2.fastq文件是否存在
        if [ -e "$file2" ]; then
            # 如果存在，执行trim_galore命令
		fastp -i "$file1"  -o ../clean-fastp/"$fileoo1" -I "$file2" -O ../clean-fastp/"$fileoo2"  --json  ../clean-fastp/"$jsono"  --html  ../clean-fastp/"$htmlo"
                    else
            # 如果不存在，打印错误信息
            echo "Error: No matching file found for $file1"
        fi
    ) &
done

# 等待所有后台进程完成
wait

下载参考基因组

参考：http://www.360doc.com/content/21/0708/21/44561002_985728537.shtml

Hisat2官网的 UCSC参考基因下载地址 https://daehwankimlab.github.io/hisat2/download/#h-sapiens

官网提供了人和小鼠的索引文件下载，压缩包有make_grch38_tran.sh文件，详细记录了创建索引的过程。

一般要先构建索引然后进行比对，但是人和鼠的索引在Hisat2官网可直接下载使用

构建索引参考：https://www.bilibili.com/video/BV1mt411J7v8/?spm_id_from=333.337.search-card.all.click&vd_source=b938c9620af06f4224f5fd4db315cbd4

https://blog.csdn.net/qq_74093550/article/details/131915068

下载并解压所需的 mm10 或 grcm38 的index文件

下载UCSC mm10，小鼠参考基因，创建文件夹并下载到 ./reference 测试中使用这个

wget https://genome-idx.s3.amazonaws.com/hisat/mm10_genome.tar.gz
tar -zxvf *tar.gz

另一个参考基因组GRCm38，GRCm38下载后为同一个文件？小鼠还是直接用UCSC mm10

wget https://cloud.biohpc.swmed.edu/index.php/s/grcm38/download -O GRCm38.tar.gz

参考序列及注释下载汇总

直接可用的索引和注释

UCSC下载，推荐 https://hgdownload.soe.ucsc.edu/downloads.html

选择物种，下载chromFa.tar.gz

人

##基因组  UCSC hg38
	https://genome-idx.s3.amazonaws.com/hisat/hg38_genome.tar.gz
##参考注释
	https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz

小鼠

##基因组  UCSC mm10
	https://genome-idx.s3.amazonaws.com/hisat/mm10_genome.tar.gz
##参考注释
	https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.refGene.gtf.gz

自己构建参考序列索引（可选）

下载参考：https://blog.csdn.net/u011262253/article/details/117486244

如果不使用UCSC则需要这一步

Ensembl数据库

参考：https://blog.csdn.net/flashan_shensanceng/article/details/115705200

在Ensembl数据库：https://asia.ensembl.org/index.html

找到我们数据的物种

参考基因组和注释文件都有很多的版本，需要我们根据实际情况进行选择

Ensemble提供两种组装形式和3种重复序列处理方式的参考基因组，分别是primary、toplevel 、unmasked(dna) 、soft-masked(dna_sm) 和masked(dna_rm) 。

一般选择 .dna.primary或.dna_sm.primary！！！！！！

没有的话选择 .dna.toplevel.fa.gz 也可

分别包含三种类型的.gtf（general tranfer format）和.gff（general feature format）注释文件，根据自己需求选择合适注释信息

gtf：全部的注释信息，选择这个就好
chr：染色体注释信息
abinitio：预测基因集注释信息

如选择：

Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Homo_sapiens.GRCh38.112.gtf.gz

NCBI数据库

https://www.ncbi.nlm.nih.gov/datasets/genome/

我们下载

Genome sequences (FASTA)

Annotation features (GTF)

GENCODE数据库

如果只涉及人类和小鼠，极力推荐 GENCOE，这里有着相较其他数据库，最新最全的基因组和其注释信息。

一般选择

Genome sequence (GRCh38.p14)	ALL
Comprehensive gene annotation	CHR

UCSC数据库

因为只有人和鼠的是已经有构建好索引的文件可以直接用，但是其他物种还是需要自己进行构建

在官网中找到自己的物种 https://hgdownload.soe.ucsc.edu/downloads.html

自己选择合适的文件

下载结果

构建方法

参考 https://blog.csdn.net/weixin_40640700/article/details/116891230

例如Ensembl数据库中下载小鼠的参考序列，用 hisat2 构建，

注意，使用什么软件构建索引，就使用什么软件进行比对，并使用序列对应的注释

下载参考序列和注释 https://asia.ensembl.org/Mus_musculus/Info/Index

选择下载

参考基因 Mus_musculus.GRCm39.dna.primary_assembly.fa.gz

注释文件 Mus_musculus.GRCm39.112.gtf.gz

使用命令下载

#基因文件
wget -c https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
#注释文件
wget  -c https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz

构建命令，需要等待的时间较长，可能1-2h

#解压文件
gzip -d Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
#hisat2构建索引
hisat2-build -p 20  Mus_musculus.GRCm39.dna.primary_assembly.fa mousegenome

得到8个索引文件，前面的 MusGRCm39 为自己设置的索引名字，结尾为 ht2

使用自己构建的索引和官方提供的索引进行比较

自己索引比对率和官方比对率比较，相差无几

暂未进行较多尝试

samtools sort -@ 12 -o ./688.bam ./6668888.sam
samtools sort -@ 12 -o ./677.bam ./666777.sam

gtf='/public/home/dk_szy/songxudong/rna-test/reference/mouse-Ensembl-bowtie2/gft/Mus_musculus.GRCm39.112.gtf.gz'
gtf0='/public/home/dk_szy/songxudong/rna-test/reference/mouse-UCSC-mm10/gft/mm10.refGene.gtf.gz'


featureCounts -T  20  -p  -a  $gtf  -o  counts.txt  ./688.bam

featureCounts -T  20  -p  -a  $gtf0  -o  counts0.txt  ./677.bam

值得注意的是自己从Ensembl数据库中下载的对应注释，生成的count矩阵需要geneid为 ENSMUSG00000104478，之后需要进行ID转换

hisat2比对基因

hisat2单个比对

一个样，花了8分钟，多样品记得调整运行时间

index='/public/home/dk_szy/songxudong/rna-test/reference/mm10/genome'

    id=234234214

    file1="./clean_data/SRR12207279_1_val_1.fq"
	
file2="./clean_data/SRR12207279_2_val_2.fq"

echo $file1
echo $file2
echo "${id}.sam" 
        hisat2 -t -p 12 -x $index \
    -1 "$file1" \
    -2 "$file2" \
    -S "${id}.sam"

hisat2批量比对

hisat2

比对将双端的fq文件转为》sam文件》bam文件

sam太大，命令中转为bam就自动删除了

结果在 ./align/ 文件夹中

25行 hisat2 -t -p 20 -x $index \ 中和通过调整 -p 20 的线程数加速运行


mkdir -p ./align/flag
cd ./align/
pwd

##参考基因组的位置
index='/public/home/dk_szy/songxudong/rna-test/reference/mm10/genome'

# 假设你的fastq文件在fastq-result文件夹中
fastq_dir="../clean_data"

# 遍历fastq-result文件夹中的所有1.fastq文件
for file1 in $fastq_dir/*_1_val_1.fq; do
    # 从1.fastq文件名中提取ID
    id=$(basename "$file1" .fq | sed 's/_1_val_1//')
    
    # 查找对应的2.fastq文件
    file2="$fastq_dir/${id}_2_val_2.fq"
    
    # 检查2.fastq文件是否存在
    if [ -f "$file2" ]; then
        echo "333#  ${id}  ！！！！！ is on the hisat2 Working !!!"
        
        # 使用hisat2进行比对，并指定输出目录为当前目录（./align/）
        hisat2 -t -p 20  -x $index \
            -1 "$file1" \
            -2 "$file2"  -S  "${id}.sam" 
        
        # sam2bam and remove sam，指定输出目录为当前目录（./align/）
        echo -e " ${id} sam2bam and remove sam   "
        samtools sort -@ 12 -o "./${id}_sorted.bam" "./${id}.sam"
        rm "./${id}.sam"
    else
        echo "No matching 2.fastq file found for $file1"
    fi
done

查看运行结果

grep alignment slurm-555095.out 为运行的报告文件，提取关于alignment的结果

grep alignment slurm-555095.out

生成counts表达矩阵

使用featureCounts 计数得到

下载参考基因注释

UCSC mm10 为例

https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/

gtf=’/public/home/dk_szy/songxudong/rna-test/reference/gft/mm10.refGene.gtf.gz’ 为注释文件，要和比对中的参考基因对应！！！

date

gtf='/public/home/dk_szy/songxudong/rna-test/reference/gft/mm10.refGene.gtf.gz'

mkdir  -p  ./counts

cd ./counts

pwd

featureCounts -T  20  -p  -a  $gtf  -o  counts.txt  ../align/*.bam

multiqc ./

echo -e " \n \n \n ALL WORK DONE !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!  \n "
date

结果在./counts中

count矩阵 counts.txt

注释结果解读 multiqc_report.html

类型转换

featureCounts的结果包含基因长度，可直接使用进行转换

整理为带基因长度的文件


#我们使原始的count数据进行转换
exprSet<-read.csv("count-改名.csv",row.names = 1)

#删除表达量=0的行
exprSet1 <- exprSet[rowSums(exprSet[,2:ncol(exprSet)]) > 0,]
dim(exprSet)
names(exprSet)

##提取Count值
count <- exprSet1[,2:ncol(exprSet)]
write.csv(count,"song_count.csv")
##提取基因长度，基因长度需要转化成kb
gene_length_kb <- exprSet1$Length / 1000
head(gene_length_kb)

#CPM
cpm = log2(edgeR::cpm(count)+1)
write.csv(cpm,"song_CPM.csv")


##TPM
### 每千碱基reads（per million scaling factor)长度标准化
data_rpk <- count /gene_length_kb
##每百万
TPM <- t(t(data_rpk) / colSums(data_rpk) * 1000000)
head(TPM)
## 求均值，看一看
avg_tmp <- data.frame(avg_tmp = rowMeans(TPM))
head(avg_tmp)
##保存数据
write.csv(TPM,"song_TPM.csv")

##FPKM
FPKM <- t(t(data_rpk) / colSums(count ) * 10^6)
write.csv(TPM,"song_FPKM.csv")

转录本相关

未深入了解使用

基于PASA软件

简单介绍的视频：https://www.bilibili.com/video/BV1KE421g7kT?t=3053.0&p=7

样品的重命名和分组

这部分可省略

counts.txt 中根据分组进行修改，样品少则手动在 Excel 中修改也一样

下载的数据从ncbi中下载 Metadata

在R语言中进行修改

需要

Metadata文件 SraRunTable.txt

count文件 counts.txt

###环境设置
rm(list=ls())
options(stringsAsFactors = F) 
library(tidyverse) # ggplot2 stringer dplyr tidyr readr purrr  tibble forcats
library(data.table) #多核读取文件
setwd("C:/Users/Lenovo/Desktop/test")

#### 对counts进行处理筛选得到表达矩阵 ####
a1 <- fread('./counts.txt',
            header = T,data.table = F)#载入counts，第一列设置为列名
colnames(a1)
counts <- a1[,7:ncol(a1)] #截取样本基因表达量的counts部分作为counts 
rownames(counts) <- a1$Geneid #将基因名作为行名
#更改样品名
colnames(counts)
colnames(counts) <- gsub('../align/','', #删除样品名前缀
                         gsub('_sorted.bam','',  colnames(counts))) #删除样品名后缀


#### 导入或构建样本信息,  进行列样品名的重命名和分组####
b <- read.csv('./SraRunTable.txt')
b
name_list <- b$source_name[match(colnames(counts),b$Run)]; name_list  #选择所需要的样品信息列
nlgl <- data.frame(row.names=colnames(counts),
                   name_list=name_list,
                   group_list=name_list)
fix(nlgl)  #手动编辑构建样品名和分组信息
name_list <- nlgl$name_list
colnames(counts) <- name_list #更改样品名
group_list <- nlgl$group_list
gl <- data.frame(row.names=colnames(counts), #构建样品名与分组对应的数据框
                 group_list=group_list)

write.csv(counts,file = "count-改名.csv") #保存
write.csv(nlgl,file = "分组.csv") #保存

用于下游分析的文件 count-改名.csv
分组信息文件分组.csv

文件目录

test ：测试文件

SRR：下载的源文件

fastq-result：解压后的双端文件

clean_data：质控后的序列文件

align：比对后未注释的文件

counts：对比后的count文件，需进行样品重命名

reference：参考序列和注释文件

.sh 按步骤的分析操作文件 sbatch XXX.sh