clusterProfiler 是業(yè)界大神Y叔寫的一個(gè)R包,可以用來做各種富集分析,如GO、KEGG、DO(Disease Ontology analysis)、Reactome pathway analysis以及GSEA富集分析等。而除了富集分析,他還可以非常方便的對富集分析結(jié)果進(jìn)行可視化。
這里使用clusterProfiler進(jìn)行GO、KEGG以及GSEA富集分析。
安裝clusterProfiler:
> source("http://bioconductor.org/biocLite.R")> biocLite('clusterProfiler')
對于沒有轉(zhuǎn)換的gene ID,clusterProfiler也提供了bitr
方法進(jìn)行轉(zhuǎn)換ID:
Usage: bitr(geneID, fromType, toType, OrgDb, drop = TRUE)Arguments geneID input gene id fromType input id type toType output id type OrgDb annotation db drop drop NA or not# example:> x <- c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "HCLS1", "SOD2", "HSPA2", "ORM1", "IGFBP1", "PTHLH", "GPC3", "IGFBP3","TOB1", "MITF", "NDRG1", "NR1H4", "FGFR3", "PVR", "IL6", "PTPRM", "ERBB2", "NID2", "LAMB1", "COMP", "PLS3", "MCAM", "SPP1", "LAMC1", "COL4A2", "COL4A1", "MYOC", "ANXA4", "TFPI2", "CST6", "SLPI", "TIMP2", "CPM", "GGT1", "NNMT", "MAL", "EEF1A2", "HGD", "TCN2", "CDA", "PCCA", "CRYM", "PDXK", "STC1", "WARS", "HMOX1", "FXYD2", "RBP4", "SLC6A12", "KDELR3", "ITM2B")> eg <- bitr(x, fromType="SYMBOL", toType=c("ENTREZID","ENSEMBL"), OrgDb="org.Hs.eg.db"); head(eg)'select()' returned 1:many mapping between keys and columns SYMBOL ENTREZID ENSEMBL1 GPX3 2878 ENSG000002114452 GLRX 2745 ENSG000001732213 LBP 3929 ENSG000001299884 CRYAB 1410 ENSG000001098465 DEFB1 1672 ENSG000001648256 HCLS1 3059 ENSG00000180353
可以看到,這里轉(zhuǎn)換ID的對應(yīng)文件來源于"org.Hs.eg.db"這個(gè)包。
在開始富集分析之前先看看GO和KEGG富集分析的方法以及參數(shù):
enrichGO GO Enrichment Analysis of a gene set. Given a vector of genes, this function will return the enrichment GO categories after FDR control.Usage: enrichGO(gene, OrgDb, keyType = "ENTREZID", ont = "MF", pvalueCutoff = 0.05, pAdjustMethod = "BH", universe, qvalueCutoff = 0.2, minGSSize = 10, maxGSSize = 500, readable = FALSE, pool = FALSE)Arguments: gene a vector of entrez gene id. OrgDb OrgDb keyType keytype of input gene ont One of "MF", "BP", and "CC" subontologies or 'ALL'. pvalueCutoff Cutoff value of pvalue. pAdjustMethod one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none" universe background genes qvalueCutoff qvalue cutoff minGSSize minimal size of genes annotated by Ontology term for testing. maxGSSize maximal size of genes annotated for testing readable whether mapping gene ID to gene Name pool If ont=’ALL’, whether pool 3 GO sub-ontologies
導(dǎo)入數(shù)據(jù),這是一個(gè)整合數(shù)據(jù),在這里我們要用到的只是entrez ID列和最后一列(logFC):
> library(clusterProfiler)> degenes <- read.csv('D:/TCGA/microarray_analysis/intersectgenes_logFC_broad.txt',header = T,stringsAsFactors = F,sep = '\t')> head(degenes) Symbol GSM450153 GSM450154 GSM450155 ID P.Value Q.Value adj.P.Val RefSeq.ID Entrez.ID Fold.Change1 AIM1 -0.7155 -0.8391 -2.3808 ENSG00000112297 4.518321e-02 0.287315306 0.287315306 NA 202 2.5553542 AK5 -1.9269 -0.6967 -0.5628 ENSG00000154027 7.907520e-05 0.005415775 0.005415775 NA 26289 1.3500513 ANXA3 4.2110 0.8687 -0.1016 ENSG00000138772 3.122002e-02 0.229588586 0.229588586 NA 306 -2.3287364 ARHGAP15 -0.0725 -1.5821 -2.0469 ENSG00000075884 6.321948e-05 0.004553170 0.004553170 NA 55843 5.0641835 ASGR2 1.5563 1.4054 1.2066 ENSG00000161944 1.474010e-02 0.146976591 0.146976591 NA 433 2.0615356 ATM -1.4344 -0.4961 -1.9324 ENSG00000149311 1.403235e-02 0.142613364 0.142613364 NA 472 2.313447> genelist <- degenes$Entrez.ID# 檢查是否有重復(fù)> genelist[duplicated(genelist)]integer(0)
由于clusterProfiler富集分析推薦的輸入文件是Entrez ID,因此這里提取的是Entrez ID,接下來就可以進(jìn)行富集分析了:
> go <- enrichGO(genelist, OrgDb = org.Hs.eg.db, ont='ALL',pAdjustMethod = 'BH',pvalueCutoff = 0.05, qvalueCutoff = 0.2,keyType = 'ENTREZID')> head(go) ONTOLOGY ID Description GeneRatio BgRatio pvalueGO:1903039 BP GO:1903039 positive regulation of leukocyte cell-cell adhesion 20/121 226/17381 9.074966e-17GO:0050870 BP GO:0050870 positive regulation of T cell activation 19/121 215/17381 5.824805e-16GO:0001819 BP GO:0001819 positive regulation of cytokine production 24/121 409/17381 7.118707e-16GO:0007159 BP GO:0007159 leukocyte cell-cell adhesion 23/121 374/17381 1.118047e-15GO:0022409 BP GO:0022409 positive regulation of cell-cell adhesion 20/121 259/17381 1.289708e-15GO:1903037 BP GO:1903037 regulation of leukocyte cell-cell adhesion 22/121 343/17381 2.095005e-15 p.adjust qvalueGO:1903039 2.474743e-13 1.898101e-13GO:0050870 6.470905e-13 4.963113e-13GO:0001819 6.470905e-13 4.963113e-13GO:0007159 7.034066e-13 5.395051e-13GO:0022409 7.034066e-13 5.395051e-13GO:1903037 9.521797e-13 7.303114e-13 geneIDGO:1903039 84433/6352/1236/915/916/959/972/9308/3113/3117/3119/3122/3458/3565/22914/3932/8013/6504/7494/6375GO:0050870 84433/6352/1236/915/916/959/972/9308/3113/3117/3119/3122/3458/3565/22914/3932/6504/7494/6375GO:0001819 330/84433/834/1236/914/916/959/972/9308/56253/2205/2597/3113/8692/64135/3458/3565/22914/27035/8013/6504/7100/7494/6375GO:0007159 84433/6352/1236/914/915/916/959/972/9308/2213/3113/3117/3119/3122/3458/3565/22914/3932/8013/6280/6504/7494/6375GO:0022409 84433/6352/1236/915/916/959/972/9308/3113/3117/3119/3122/3458/3565/22914/3932/8013/6504/7494/6375GO:1903037 84433/6352/1236/914/915/916/959/972/9308/2213/3113/3117/3119/3122/3458/3565/22914/3932/8013/6504/7494/6375 CountGO:1903039 20GO:0050870 19GO:0001819 24GO:0007159 23GO:0022409 20GO:1903037 22> dim(go)[1] 524 10> dim(go[go$ONTOLOGY=='BP',])[1] 482 10> dim(go[go$ONTOLOGY=='CC',])[1] 26 10> dim(go[go$ONTOLOGY=='MF',])[1] 16 10# 看來這些差異基因主要還是富集到BP中了# 進(jìn)行簡單的可視化> barplot(go,showCategory=20,drop=T)> dotplot(go,showCategory=50)# 還可以繪制GO的網(wǎng)絡(luò)關(guān)系圖,但是值得注意的是這里的數(shù)據(jù)只能是富集一個(gè)GO通路(BP、CC或MF)的數(shù)據(jù)> go.BP <- enrichGO(genelist, OrgDb = org.Hs.eg.db, ont='BP',pAdjustMethod = 'BH',pvalueCutoff = 0.05, qvalueCutoff = 0.2,keyType = 'ENTREZID')> plotGOgraph(go.BP)
barplot
dotplot
plotGOgraph
KEGG通路富集函數(shù)用法與GO富集分析方法類似:
enrichKEGG KEGG Enrichment Analysis of a gene set. Given a vector of genes, this function will return the enrichment KEGG categories with FDR control.Usage: enrichKEGG(gene, organism = "hsa", keyType = "kegg", pvalueCutoff = 0.05, pAdjustMethod = "BH", universe, minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2, use_internal_data = FALSE)Arguments: gene a vector of entrez gene id. organism supported organism listed in ’http://www.genome.jp/kegg/catalog/org_list.html’ keyType one of "kegg", ’ncbi-geneid’, ’ncib-proteinid’ and ’uniprot’ pvalueCutoff Cutoff value of pvalue. pAdjustMethod one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none" universe background genes minGSSize minimal size of genes annotated by Ontology term for testing. maxGSSize maximal size of genes annotated for testing qvalueCutoff qvalue cutoff use_internal_data logical, use KEGG.db or latest online KEGG data
我們繼續(xù)使用上面的數(shù)據(jù)進(jìn)行KEGG富集分析:
> kegg <- enrichKEGG(genelist, organism = 'hsa', keyType = 'kegg', pvalueCutoff = 0.05,pAdjustMethod = 'BH', minGSSize = 10,maxGSSize = 500,qvalueCutoff = 0.2,use_internal_data = FALSE)> head(kegg) ID Description GeneRatio BgRatio pvalue p.adjust qvaluehsa04658 hsa04658 Th1 and Th2 cell differentiation 11/76 92/7404 1.750537e-09 2.765849e-07 2.026938e-07hsa05310 hsa05310 Asthma 7/76 31/7404 1.956862e-08 1.433518e-06 1.050546e-06hsa04672 hsa04672 Intestinal immune network for IgA production 8/76 49/7404 2.721869e-08 1.433518e-06 1.050546e-06hsa05168 hsa05168 Herpes simplex infection 13/76 185/7404 3.744992e-08 1.479272e-06 1.084077e-06hsa05140 hsa05140 Leishmaniasis 9/76 74/7404 5.040165e-08 1.592692e-06 1.167196e-06hsa05330 hsa05330 Allograft rejection 7/76 38/7404 8.868232e-08 2.335301e-06 1.711413e-06 geneID Counthsa04658 915/916/2353/3113/3117/3119/3122/3458/3565/3932/6775 11hsa05310 959/2205/3113/3117/3119/3122/3565 7hsa04672 959/7852/3113/3117/3119/3122/3565/608 8hsa05168 6352/972/54205/1936/2353/3113/3117/3119/3122/64135/3434/3458/5187 13hsa05140 2209/2212/2353/3113/3117/3119/3122/3458/3565 9hsa05330 959/3113/3117/3119/3122/3458/3565 7> dim(kegg)[1] 30 9# 簡單可視化> dotplot(kegg, showCategory=30)
dotplot
clusterProfiler provides
enricher
function for hypergeometric test andGSEA
function for gene set enrichment analysis that are designed to accept user defined annotation.
這里使用clusterProfiler里面的GSEA
函數(shù)進(jìn)行GSEA富集分析,并與使用超幾何分布富集(enricher
函數(shù))的結(jié)果進(jìn)行簡單比較,enricher
函數(shù)與GSEA
函數(shù)用法基本相同,因此這里只給出GSEA
的用法及參數(shù)。
GSEA a universal gene set enrichment analysis toolsUsage: GSEA(geneList, exponent = 1, nPerm = 1000, minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, pAdjustMethod = "BH", TERM2GENE, TERM2NAME = NA, verbose = TRUE, seed = FALSE, by = "fgsea")Arguments:geneList order ranked geneListexponent weight of each stepnPerm number of permutationsminGSSize minimal size of each geneSet for analyzingmaxGSSize maximal size of genes annotated for testingpvalueCutoff pvalue cutoffpAdjustMethod p value adjustment methodTERM2GENE user input annotation of TERM TO GENE mapping, a data.frame of 2 column with term and geneTERM2NAME user input of TERM TO NAME mapping, a data.frame of 2 column with term and nameverbose logicalseed logicalby one of ’fgsea’ or ’DOSE’
在進(jìn)行富集分析之前需要對數(shù)據(jù)做一個(gè)預(yù)處理——排序。
> library(dplyr)> geneList <- select(degenes, Entrez.ID, Fold.Change); head(geneList) Entrez.ID Fold.Change1 202 2.5553542 26289 1.3500513 306 -2.3287364 55843 5.0641835 433 2.0615356 472 2.313447> geneList.sort <- arrange(geneList, desc(Fold.Change)); head(geneList.sort) Entrez.ID Fold.Change1 3512 36.473322 3117 35.586853 3113 24.101514 916 14.897635 5996 14.674176 3119 11.12233> gene <- geneList.sort$Entrez.ID
這里使用的是broad GSEA提供的gene sets 來提供TERM2GENE:
> gmtfile <- system.file("extdata", "c5.cc.v5.0.entrez.gmt", package="clusterProfiler")> c5 <- read.gmt(gmtfile)> head(c5) ont gene1 NUCLEOPLASM 31902 NUCLEOPLASM 25473 NUCLEOPLASM 261734 NUCLEOPLASM 94395 NUCLEOPLASM 575086 NUCLEOPLASM 6837
萬事俱備,只欠東風(fēng)。現(xiàn)在可以開始分析了,先進(jìn)行超幾何分布的富集分析:
### 先使用基于超幾何分布的富集分析> enrich <- enricher(gene, TERM2GENE=c5); head(enrich) ID Description GeneRatio BgRatio pvalue p.adjustIMMUNOLOGICAL_SYNAPSE IMMUNOLOGICAL_SYNAPSE IMMUNOLOGICAL_SYNAPSE 2/55 11/5270 0.00553797 0.2547466LYSOSOME LYSOSOME LYSOSOME 3/55 61/5270 0.02528071 0.3336995LYTIC_VACUOLE LYTIC_VACUOLE LYTIC_VACUOLE 3/55 61/5270 0.02528071 0.3336995VACUOLE VACUOLE VACUOLE 3/55 69/5270 0.03472762 0.3336995LIPID_RAFT LIPID_RAFT LIPID_RAFT 2/55 29/5270 0.03627169 0.3336995PROTEIN_SERINE_THREONINE_PHOSPHATASE_COMPLEX PROTEIN_SERINE_THREONINE_PHOSPHATASE_COMPLEX PROTEIN_SERINE_THREONINE_PHOSPHATASE_COMPLEX 1/55 10/5270 0.09967809 0.7338789 qvalue geneID CountIMMUNOLOGICAL_SYNAPSE 0.2390071 50852/84433 2LYSOSOME 0.3130819 3122/9098/8692 3LYTIC_VACUOLE 0.3130819 3122/9098/8692 3VACUOLE 0.3130819 3122/9098/8692 3LIPID_RAFT 0.3130819 3932/84433 2PROTEIN_SERINE_THREONINE_PHOSPHATASE_COMPLEX 0.6885363 54205 1
再做GSEA富集分析,在此之前需要對輸入gene list做一下處理,包括三步:
## assume 1st column is ID## 2nd column is FC> head(geneList) Entrez.ID Fold.Change1 202 2.5553542 26289 1.3500513 306 -2.3287364 55843 5.0641835 433 2.0615356 472 2.313447## feature 1: numeric vector> glist <- geneList[,2];head(glist)[1] 2.555354 1.350051 -2.328736 5.064183 2.061535 2.313447## feature 2: named vector> names(glist) <- as.character(geneList[,1]);head(glist) 202 26289 306 55843 433 472 2.555354 1.350051 -2.328736 5.064183 2.061535 2.313447 ## feature 3: decreasing order> glist <- sort(glist,decreasing = T); head(glist) 3512 3117 3113 916 5996 3119 36.47332 35.58685 24.10151 14.89763 14.67417 11.12233
輸入文件準(zhǔn)備好了盡可以進(jìn)行GSEA富集分析了:
> gsea <- GSEA(glist, TERM2GENE=c5, verbose=FALSE, pvalueCutoff = 0.8); head(gsea) ID Description setSize enrichmentScore NES pvalue p.adjust qvalues rank leading_edgeMEMBRANE MEMBRANE MEMBRANE 30 0.4525068 1.2643778 0.1896024 0.6880521 0.6880521 29 tags=37%, list=23%, signal=37%PLASMA_MEMBRANE PLASMA_MEMBRANE PLASMA_MEMBRANE 28 0.4358283 1.2085736 0.2435766 0.6880521 0.6880521 29 tags=36%, list=23%, signal=35%CYTOPLASMIC_PART CYTOPLASMIC_PART CYTOPLASMIC_PART 12 -0.2809599 -1.1642120 0.2481203 0.6880521 0.6880521 34 tags=67%, list=27%, signal=54%MEMBRANE_PART MEMBRANE_PART MEMBRANE_PART 27 0.3603900 0.9960600 0.4928131 0.6880521 0.6880521 68 tags=70%, list=53%, signal=42%PLASMA_MEMBRANE_PART PLASMA_MEMBRANE_PART PLASMA_MEMBRANE_PART 24 0.3593591 0.9832720 0.5092593 0.6880521 0.6880521 68 tags=71%, list=53%, signal=41%INTEGRAL_TO_MEMBRANE INTEGRAL_TO_MEMBRANE INTEGRAL_TO_MEMBRANE 22 0.3618337 0.9738809 0.5191710 0.6880521 0.6880521 68 tags=73%, list=53%, signal=41% core_enrichmentMEMBRANE 916/5996/3122/7852/972/969/6402/2205/56253/10225/3738PLASMA_MEMBRANE 916/5996/3122/7852/969/6402/2205/56253/10225/3738CYTOPLASMIC_PART 8692/54205/1936/2879/9531/293/23787MEMBRANE_PART 916/7852/972/969/6402/2205/10225/3738/3932/30061/9308/608/51266/50852/51348/29121/1235/84433/22914PLASMA_MEMBRANE_PART 916/7852/969/6402/2205/10225/3738/3932/30061/9308/51266/50852/51348/29121/1235/84433/22914INTEGRAL_TO_MEMBRANE 7852/972/969/6402/2205/10225/3738/30061/9308/608/51266/50852/51348/29121/1235/22914
不要問我為什么要pvalueCutoff設(shè)置0.8,因?yàn)橐恢闭{(diào)大到0.8才富集到結(jié)果。。??梢姅?shù)據(jù)應(yīng)該是有問題的,但這里作為一個(gè)實(shí)踐就不管那么多了。
而且,clusterProfiler還支持GSEA的GO、KEGG富集。
A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)(Subramanian et al. 2005) directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
> gsea.go <- gseGO(glist,OrgDb = org.Hs.eg.db, pvalueCutoff = 0.5); head(gsea.go)preparing geneSet collections...GSEA analysis...leading edge analysis...done... ID Description setSize enrichmentScore NES pvalue p.adjust qvalues rank leading_edgeGO:0031294 GO:0031294 lymphocyte costimulation 11 0.7832645 1.872828 0.001186240 0.1049915 0.09363642 10 tags=45%, list=8%, signal=46%GO:0031295 GO:0031295 T cell costimulation 11 0.7832645 1.872828 0.001186240 0.1049915 0.09363642 10 tags=45%, list=8%, signal=46%GO:0002376 GO:0002376 immune system process 74 0.6032615 1.738943 0.002004008 0.1049915 0.09363642 60 tags=64%, list=47%, signal=80%GO:0042110 GO:0042110 T cell activation 23 0.6722496 1.825647 0.002143623 0.1049915 0.09363642 24 tags=48%, list=19%, signal=47%GO:0050851 GO:0050851 antigen receptor-mediated signaling pathway 11 0.7582504 1.813018 0.002372479 0.1049915 0.09363642 10 tags=45%, list=8%, signal=46%GO:0050852 GO:0050852 T cell receptor signaling pathway 10 0.7963700 1.872212 0.002380952 0.1049915 0.09363642 10 tags=50%, list=8%, signal=50% core_enrichmentGO:0031294 3117/3113/916/3119/3122GO:0031295 3117/3113/916/3119/3122GO:0002376 3512/3117/3113/916/5996/3119/914/10964/3122/7852/972/4068/6402/6352/84636/915/2205/51176/4069/2213/56253/10225/55619/4332/2212/6375/7100/3434/8638/3932/2209/51316/30061/9308/64135/8320/10578/3458/10875/608/83666/51266/50852/3437/51348/2353/9111GO:0042110 3117/3113/916/3119/914/3122/972/6352/915/51176/2213GO:0050851 3117/3113/916/3119/3122GO:0050852 3117/3113/916/3119/3122
順便再利用上面處理好的glist進(jìn)行一下KEGG富集到的某一條通路的可視化:
> library(pathview)> pathview(gene.data = glist, pathway.id = 'hsa04658',species="hsa", limit=list(gene=max(abs(glist)), cpd=1))
上面的命令會在當(dāng)前目錄生成3個(gè)文件:一個(gè)原始KEGG通路圖片,一個(gè)標(biāo)注了上下調(diào)基因的,最后一個(gè)文本文件則是一些KEGG通路信息。
image.png
hsa04658.pathview.png
聯(lián)系客服