連續(xù)兩次求賢令:曾經(jīng)我給你帶來(lái)了十萬(wàn)用戶,但現(xiàn)在祝你倒閉,以及 生信技能樹(shù)知識(shí)整理實(shí)習(xí)生招募,讓我走大運(yùn)結(jié)識(shí)了幾位優(yōu)秀小伙伴!大家開(kāi)始根據(jù)我的ngs組學(xué)視頻進(jìn)行一系列公共數(shù)據(jù)集分析實(shí)戰(zhàn),其中幾個(gè)小伙伴讓我非常驚喜,不需要怎么溝通和指導(dǎo),就默默的完成了一個(gè)實(shí)戰(zhàn)!
他前面的分享是:
下面是他對(duì)我們b站轉(zhuǎn)錄組視頻課程的詳細(xì)筆記
承接上節(jié):RNA-seq入門實(shí)戰(zhàn)(四):差異分析前的準(zhǔn)備——數(shù)據(jù)檢查,以及 RNA-seq入門實(shí)戰(zhàn)(五):差異分析——DESeq2 edgeR limma的使用與比較
本節(jié)概覽:1.獲取DEG結(jié)果的上下調(diào)差異基因2.bitr()函數(shù)轉(zhuǎn)化基因名為entrez ID3.利用clusterProfiler進(jìn)行KEGG與GO富集4.用enrichplot進(jìn)行富集結(jié)果可視化:pathview goplot barplot dotplot cnetplot emapplot treeplot heatplot upsetplot
載入上節(jié)RNA-seq入門的簡(jiǎn)單實(shí)戰(zhàn)(五)中保存的三種差異分析結(jié)果數(shù)據(jù),這里示范選取DESeq2的結(jié)果數(shù)據(jù),進(jìn)行篩選條件設(shè)置后獲取上下調(diào)基因名
rm(list=ls())
options(stringsAsFactors = F)
library(clusterProfiler)
library(enrichplot)
library(tidyverse)
# library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(DOSE)
library(pathview) #BiocManager::install("pathview",ask = F,update = F)
##數(shù)據(jù)載入和目錄設(shè)置
setwd("C:/Users/Lenovo/Desktop/test")
source('FUNCTIONS.R')
load(file = '1.counts.Rdata')
load(list.files(path = "./3.DEG",pattern = 'DEG_results.Rdata',full.names = T))
dir.create("4.Enrichment_KEGG_GO")
setwd("4.Enrichment_KEGG_GO")
##### 篩選條件設(shè)置 #######
log2FC_cutoff = log2(10)
pvalue_cutoff = 0.001
padj_cutoff = 0.001
####獲取DEG結(jié)果的上下調(diào)基因 ####
#DEG_DESeq2[,c(2,5,6)] DEG_limma_voom[,c(1,4,5)] DEG_edgeR[,c(1,4,5)]
need_DEG <- DEG_DESeq2[,c(2,5,6)]
head(need_DEG)
colnames(need_DEG) <- c('log2FoldChange','pvalue','padj')
gene_up=rownames(need_DEG[with(need_DEG,log2FoldChange>log2FC_cutoff & pvalue<pvalue_cutoff & padj<padj_cutoff),])
gene_down=rownames(need_DEG[with(need_DEG,log2FoldChange < -log2FC_cutoff & pvalue<pvalue_cutoff & padj<padj_cutoff),])
在進(jìn)行差異基因富集分析前,需要先將基因名為entrez ID。
轉(zhuǎn)化ID前要載入org.Hs.eg.db\org.Mm.eg.db,其包含著各大主流數(shù)據(jù)庫(kù)的數(shù)據(jù),如entrez ID和ensembl等等,使用keytypes(org.Mm.eg.db) 可查看所有支持及可轉(zhuǎn)化類型。
一般利用clusterProfiler包的bitr()函數(shù)或者mapIds()函數(shù)進(jìn)行轉(zhuǎn)換,用法如下:
#### 轉(zhuǎn)化基因名為entrez ID #####org.Hs.eg.db\org.Mm.eg.db包含著各大主流數(shù)據(jù)庫(kù)的數(shù)據(jù),如entrez ID和ensembl,#keytypes(org.Hs.eg.db) #查看所有支持及可轉(zhuǎn)化類型 常用 "ENTREZID","ENSEMBL","SYMBOL"gene_up_entrez <- as.character(na.omit(bitr(gene_up, #數(shù)據(jù)集 fromType="SYMBOL", #輸入格式 toType="ENTREZID", # 轉(zhuǎn)為ENTERZID格式 OrgDb="org.Mm.eg.db")[,2])) #"org.Hs.eg.db" "org.Mm.eg.db"gene_down_entrez <- as.character(na.omit(bitr(gene_down, #數(shù)據(jù)集 fromType="SYMBOL", #輸入格式 toType="ENTREZID", # 轉(zhuǎn)為ENTERZID格式 OrgDb="org.Mm.eg.db")[,2])) #"org.Hs.eg.db" "org.Mm.eg.db"gene_diff_entrez <- unique(c(gene_up_entrez ,gene_down_entrez ))##或者使用mapIds# gene_up_ENTREZID <- as.character(na.omit(mapIds(x = org.Mm.eg.db,# keys = gene_up,# keytype = "SYMBOL",# column = "ENTREZID")))
有了上一步所得上下調(diào)差異基因的entrez ID,我們就可以利用clusterProfiler包的enrichKEGG()和enrichGO()函數(shù)進(jìn)行富集分析了。在這里僅示范對(duì)上調(diào)基因進(jìn)行富集,實(shí)際應(yīng)用中可以將上下調(diào)和合并的基因都分別進(jìn)行富集查看結(jié)果。需要注意以下事項(xiàng):
函數(shù)的參數(shù)pvalueCutoff 默認(rèn)為 0.05,qvalueCutoff 默認(rèn)為 0.2,可根據(jù)實(shí)際情況自行調(diào)整大一些
enrichGO()中要設(shè)置OrgDb = "org.Mm.eg.db";readable = TRUE表示將entrez ID轉(zhuǎn)化為gene Symbol;ont 參數(shù)可以選擇"BP", "MF" "CC" 或 "ALL",表示對(duì)細(xì)胞組分(cellular component, CC)、分子功能(molecular function, MF)、生物過(guò)程(biological process, BP)或全部的基因通路進(jìn)行富集,可以根據(jù)實(shí)際需求選擇
enrichKEGG()要設(shè)置organism = "mmu",由于其沒(méi)有readable參數(shù),因此之后還要用DOSE包的setReadable進(jìn)行轉(zhuǎn)化entrez ID為gene Symbol
#### KEGG、GO富集 ####
kegg_enrich_results <- enrichKEGG(gene = gene_up_entrez,
organism = "mmu", #物種人類 hsa 小鼠mmu
pvalueCutoff = 0.05,
qvalueCutoff = 0.2)
kegg_enrich_results <- DOSE::setReadable(kegg_enrich_results,
OrgDb="org.Mm.eg.db",
keyType='ENTREZID')#ENTREZID to gene Symbol
write.csv(kk_read@result,'KEGG_gene_up_enrichresults.csv')
save(kegg_enrich_results, file ='KEGG_gene_up_enrichresults.Rdata')
go_enrich_results <- enrichGO(gene = gene_up_entrez,
OrgDb = "org.Mm.eg.db",
ont = "ALL" , #One of "BP", "MF" "CC" "ALL"
pvalueCutoff = 0.05,
qvalueCutoff = 0.2,
readable = TRUE)
write.csv(go_enrich_results@result, 'GO_gene_up_BP_enrichresults.csv')
save(go_enrich_results, file ='GO_gene_up_enrichresults.Rdata')
在富集到通路后就要進(jìn)行可視化展示了,參見(jiàn)clusterprofiler說(shuō)明書(shū)?? Introduction | Biomedical Knowledge Mining using GOSemSim and clusterProfiler (yulab-smu.top),其中enrichplot包可以對(duì)富集結(jié)果進(jìn)行超級(jí)豐富的可視化。
下面就來(lái)總結(jié)介紹一下clusterprofiler說(shuō)明書(shū)介紹的各種富集結(jié)果可視化方法,其中4.1pathview 只針對(duì)KEGG, 4.2goplot只針對(duì)GO結(jié)果,其他可視化方法則對(duì)兩者都通用
將KEGG通路進(jìn)行可視化一般有以下三種方法:
*使用函數(shù) browseKEGG(enrich_results, select_pathway)進(jìn)行網(wǎng)頁(yè)查看相關(guān)通路,如
browseKEGG(kegg_enrich_results, 'mmu04310') #網(wǎng)頁(yè)查看通路
網(wǎng)頁(yè)版的pathview https://pathview.uncc.edu/guest-home,可以上傳數(shù)據(jù)進(jìn)行在線可視化
pathview包:pathview()函數(shù)中需要輸入含log2FC信息的gene.data、pathway.id 和species物種信息,會(huì)生成含有基因上下調(diào)信息的基因通路圖。
注意函數(shù)中有一個(gè)重要參數(shù)kegg.native :若TRUE則輸出完整gene pathway的png文件,若為FALSE則輸出只含輸入基因信息的pdf文件 。
參數(shù)limit可以對(duì)圖例color bar范圍(即log2FC范圍)進(jìn)行調(diào)整。
其他參數(shù)使用詳見(jiàn)官方說(shuō)明:pathview.pdf (bioconductor.org),再推薦一篇簡(jiǎn)要中文教程:可視化kegg通路-pathview包 | KeepNotes blog (bioinfo-scrounger.com)下面我們選取富集排名第3(富集結(jié)果是按pvalue由小到大排序的)的"Wnt signaling pathway" 進(jìn)行示范。
#構(gòu)建含log2FC信息的genelist
genelist <- as.numeric(DEG_DESeq2[,2])
names(genelist) <- row.names(DEG_DESeq2)
# genelist ID轉(zhuǎn)化
genelist_entrez <- genelist
names(genelist_entrez) <- bitr(names(genelist),
fromType="SYMBOL",toType="ENTREZID",
OrgDb=OrgDb)[,2]
genelist_entrez <- genelist_entrez[!is.na(names(genelist_entrez))]
##查看與選擇所需通路
kk_read@result$Description[1:10] #查看前10通路
i=3
select_pathway <- kk_read@result$ID[i] #選擇所需通路的ID號(hào)
pathview(gene.data = genelist_entrez,
pathway.id = select_pathway,
species = 'mmu' , # 人類hsa 小鼠mmu
kegg.native = T,# TRUE輸出完整pathway的png文件,F(xiàn)輸出基因列表的pdf文件
new.signature = F, #pdf是否顯示pathway標(biāo)注
limit = list(gene=2.5, cpd=1) #圖例color bar范圍調(diào)整
)
pathview(gene.data = genelist_entrez,
pathway.id = select_pathway,
species = 'mmu' , # 人類hsa 小鼠mmu
kegg.native = F,# TRUE輸出完整pathway的png文件,F(xiàn)輸出基因列表的pdf文件
new.signature = F, #pdf是否顯示pathway標(biāo)注
limit = list(gene=2.5, cpd=1) #圖例color bar范圍調(diào)整
參數(shù)kegg.native = T,所得如下:
參數(shù)kegg.native = F,所得如下:
注意當(dāng)enrichGO()的ont為'ALL'時(shí)不能運(yùn)行該圖,會(huì)報(bào)錯(cuò)。以下 goplot展示的是enrichGO()的ont='BP'時(shí)的go_enrich_results
### goplot : Gene Ontology is organized as a directed acyclic graph.有向無(wú)環(huán)圖gop <- goplot(go_enrich_results, showCategory = 10)ggsave(gop, filename = "goplot.pdf", width=10, height=10)
barplot與dotplot展示富集通路的p.adj,generatio,count信息。
如果enrichGO的ont為'ALL',barplot與dotplot還可以設(shè)置使不同ont項(xiàng)目通路分隔開(kāi)展示
### barplot
barp <- barplot(go_enrich_results, font.size=14, showCategory=10)+
theme(plot.margin=unit(c(1,1,1,1),'lines'))
#如果enrichGO的ont為'ALL'
if (F) {
barp <- barplot(go_enrich_results, split= "ONTOLOGY", font.size =14)+
facet_grid(ONTOLOGY~., scale="free")+
theme(plot.margin=unit(c(1,1,1,1),'lines')) #調(diào)整圖形四周留白大小
}
ggsave(barp,filename = paste0(fn,'.pdf'), width=10, height=10)
### dotplot
dotp <- enrichplot::dotplot(go_enrich_results,font.size =14)+
theme(legend.key.size = unit(10, "pt"),#調(diào)整圖例大小
plot.margin=unit(c(1,1,1,1),'lines'))#調(diào)整四周留白大小
#如果enrichGO的ont為'ALL'
if (F) {
dotp <- enrichplot::dotplot(go_enrich_results,font.size =8,split = 'ONTOLOGY')+
facet_grid(ONTOLOGY~., scale="free")+
theme(legend.key.size = unit(10, "pt"),#調(diào)整圖例大小
plot.margin=unit(c(1,1,1,1),'lines'))#調(diào)整四周留白大小
}
ggsave(dotp,filename = paste0(fn,'.pdf'),width =10,height =10)
cnetplot展示了各通路下的基因網(wǎng)絡(luò)。繪制cnetplot有兩種展現(xiàn)方式, 更改參數(shù)circular 為 F(默認(rèn))或T可以分別得到散布狀和圈狀分布的cnetplot;cnetplot還可以輸入含log2FC信息的genelist ,會(huì)將log2FC信息展現(xiàn)在圖中
### cnetplot: Gene-Concept Network
#構(gòu)建含log2FC信息的genelist
genelist <- as.numeric(DEG_DESeq2[,2])
names(genelist) <- row.names(DEG_DESeq2)
cnetp1 <- cnetplot(go_enrich_results, foldChange = genelist,
showCategory = 6,
colorEdge = T,
node_label = 'all',
color_category ='steelblue')
cnetp2 <- cnetplot(go_enrich_results, foldChange = genelist,
showCategory = 6,
node_label = 'gene',
circular = T,
colorEdge = T)
ggsave(cnetp1,filename ='cnetplot.pdf', width =12,height =10)
ggsave(cnetp2,filename = 'cnetplot_cir.pdf', width =15,height =10)
emapplot富集圖將被富集的術(shù)語(yǔ)組織成一個(gè)邊緣連接重疊基因集的網(wǎng)絡(luò),相互重疊的基因集往往會(huì)聚集在一起,因此有助于我們識(shí)別功能模塊。
注意使用emapplot前還需要用pairwise_termsim()處理富集結(jié)果
### emapplot :Enrichment Mappt <- pairwise_termsim(go_enrich_results)emapp <- emapplot(pt, showCategory = 30, node_label = 'category') # 'category', 'group', 'all' and 'none'.)ggsave(emapp,filename = 'emapplot.pdf',width =10,height =10)
與cnetplot展示內(nèi)容類似,但是將不同富集通路的相同基因聚在了一起,有助于我們更好識(shí)別基因模塊
## heatplot: Heatmap-like functional classification
#構(gòu)建含log2FC信息的genelist
genelist <- as.numeric(DEG_DESeq2[,2])
names(genelist) <- row.names(DEG_DESeq2)
heatp <- heatplot(go_enrich_results, foldChange = genelist,
showCategory = 5)
ggsave(heatp, filename ='heatplot.pdf', width=20, height=10)
upsetplot展示不同富集通路間的重疊基因數(shù)量。
## upsetplot # emphasizes the gene overlapping among different gene setsupsetp <- upsetplot(go_enrich_results, n = 10)+ theme(plot.margin=unit(c(1,1,1,1),'lines')) #調(diào)整圖形四周留白大小ggsave(upsetp, filename = 'upsetplot.pdf', width=15, height=10)
treeplot對(duì)富集結(jié)果術(shù)語(yǔ)進(jìn)行層次聚類, 并使用高頻詞標(biāo)記,有助于我們從繁多的富集結(jié)果中快速提取有用關(guān)鍵信息。
## Tree plot #pt <- pairwise_termsim(go_enrich_results)treep <- treeplot(pt, showCategory = 30)ggsave(treep, filename = 'treeplot.pdf', width=18, height=10)
GO、KEGG富集與可視化到此就結(jié)束啦
如果所得顯著差異基因很少,或無(wú)法富集到有生物學(xué)意義的通路時(shí)又該怎么辦呢,下節(jié)將介紹一種不依賴于差異基因篩選的富集分析方法——GSEA
參考資料
?? Introduction | Biomedical Knowledge Mining using GOSemSim and clusterProfiler (yulab-smu.top)pathview.pdf (bioconductor.org)
可視化kegg通路-pathview包 | KeepNotes blog (bioinfo-scrounger.com)
GitHub - jmzeng1314/GEO
【生信技能樹(shù)】轉(zhuǎn)錄組測(cè)序數(shù)據(jù)分析_嗶哩嗶哩_bilibili
【生信技能樹(shù)】GEO數(shù)據(jù)庫(kù)挖掘_嗶哩嗶哩_bilibili
聯(lián)系客服