大家應(yīng)該對通路富集分析都很熟悉,如DAVID、超幾何富集分析等。都是在大量文章中常見的通路富集方法,給大家介紹一個更加復(fù)雜的通路富集分析的前期數(shù)據(jù)處理包GSVA(gene set variation analysis)與gsea選擇。GSVA是一種非參數(shù)的無監(jiān)督分析方法,主要用來評估芯片核轉(zhuǎn)錄組的基因集富集結(jié)果。主要是通過將基因在不同樣品間的表達(dá)量矩陣轉(zhuǎn)化成基因集在樣品間的表達(dá)量矩陣,從而來評估不同的通路在不同樣品間是否富集。具體的一個分析流程如下:
這是相對簡單粗暴一些的基因富集分析方法,不需要輸入基因的表達(dá)值,只需要通過自己設(shè)定的閾值(|log2FC| >1 和Pvalue < 0.05)篩選得到的差異基因名。進(jìn)行統(tǒng)計檢驗后返回顯著富集的功能基因集。如DESeq2 (獲得差異基因信息),clusterProfiler(ID轉(zhuǎn)換+富集分析+作圖一站式神包?。?/p>
相比于第一種簡單粗暴的用硬閾值截取+往籃子里塞雞蛋的方法,GSEA同時考慮了基因在整個表達(dá)譜中所處的FoldChange rank以及同一基因集中的基因在表達(dá)譜rank中的距離。通俗來講,GSEA基于如下假設(shè):一個基因集中的基因如果在表達(dá)譜中所處的rank越極端(高/低FoldChange),而且基因之間的距離越短(rank相近),則這個基因集越可能顯著。所以GSEA需要輸入的是【所有納入差異分析的完整基因list和它們的FC值,并已經(jīng)預(yù)先排序(pre-ranked)】。
具體的原理參考,https://cloud.tencent.com/developer/article/1426130
ssGSEA顧名思義是一種特殊的GSEA,它主要針對單樣本無法做GSEA而提出的一種實現(xiàn)方法,原理上與GSEA是類似的,不同的是GSEA需要準(zhǔn)備表達(dá)譜文件即gct,根據(jù)表達(dá)譜文件計算每個基因的rank值,再進(jìn)行后續(xù)的統(tǒng)計分析。ssGSEA是為無重復(fù)的樣本進(jìn)行,geneset enrichment analysis準(zhǔn)備的,所以不同于上方以組別為單位(cancer vs normal)的GSEA分析,通過ssGSEA,每個樣本都可以得到相應(yīng)基因集的評分。
a. 首先假設(shè)有一個樣本的表達(dá)數(shù)據(jù),那么他應(yīng)該是這樣的,第一列為基因,第二列為表達(dá)值,這樣的兩列的數(shù)據(jù)矩陣。
首先對樣本的所有基因的表達(dá)水平進(jìn)行排序獲得其在所有基因中的秩次rank,這些基因的集合為BG。
b. 假設(shè)要對其進(jìn)行KEGG的分析,首先我們需要在GSEA官網(wǎng)找到KEGG對應(yīng)的gmt文件。
gmt文件主要格式是:每行表示一個通路,第一列為通路ID,第二列為通路對應(yīng)的描述,第三列開始到最后一列為該通路中的基因。
c. 那么對于任意的一個通路A,我們可以拿到這個通路的基因列表GL,從GL中尋找BG里存在的基因并計數(shù)為NC,并將這些基因的表達(dá)水平加和為SG。
開始計算ES:對于任意一個表達(dá)譜中的基因 G,如果G是集合GL中的基因則他的ES等于該基因的表達(dá)水平除以SG,否則記該基因的ES等于1除以(基因集合BG總個數(shù)減去NC)
依次計算每個BG中的基因的ES值,找到其中絕對值最大的ES作為通路A的A.ES
d. 到此通路A的ES計算完畢,需要一個統(tǒng)計學(xué)方法來評估該ES是否是顯著的,即非隨機(jī)的。
按照上述計算ES的方法,先隨機(jī)打亂表達(dá)譜中基因的表達(dá)順序,然后再依次計算ES值,如此重復(fù)一千次,得到一千個ES值,我們根據(jù)這一千個ES值的分布,來計算A.ES在這個分布中所處的位置及出現(xiàn)在該位置時的概率即得到了p值 。依次分別計算每個通路的ES及p值,然后使用多重檢驗矯正得到每個通路的FDR
此外,對于多個樣本或多組樣本,也可以利用T檢驗或方差檢驗對結(jié)果進(jìn)行p值的計算與相應(yīng)的FDR值。
以上即是整個ssGSEA算法的整體思路。
R語言,GSVA包(進(jìn)行GSVA/ssGSEA分析),limma包(差異pathway的篩選),pheatmap包(熱圖繪制)。
library(GSEABase)
library(GSVA)
#讀取基因集文件
geneSets <- getGmt("test.geneset")
#讀取表達(dá)量文件并去除重復(fù)
mydata <- read.table(file = "all.genes.fpkm.xls",header=T)
name=mydata[,1]
index <- duplicated(mydata[,1])
fildup=mydata[!index,]
exp=fildup[,-1]
row.names(exp)=name
#將數(shù)據(jù)框轉(zhuǎn)換成矩陣
mydata= as.matrix(exp)
#使用gsva方法進(jìn)行分析,默認(rèn)mx.diff=TRUE,min.sz=1,max.zs=Inf,這里是設(shè)置最小值和最大值
res_es <- gsva(mydata, geneSets, min.sz=10, max.sz=500, verbose=FALSE, parallel.sz=1)
pheatmap(res_es)
Usage
## S4 method for signature 'ExpressionSet,list'
gsva(expr, gset.idx.list, annotation,method=c("gsva", "ssgsea", "zscore", "plage"),
kcdf=c("Gaussian", "Poisson", "none"),abs.ranking=FALSE,min.sz=1,max.sz=Inf,
parallel.sz=0,parallel.type="SOCK",mx.diff=TRUE,tau=switch(method, gsva=1, ssgsea=0.25, NA),ssgsea.norm=TRUE,verbose=TRUE)
Arguments
expr
Gene expression data which can be given either as an ExpressionSet object or as a matrix of expression values where rows correspond to genes and columns correspond to samples.
gset.idx.list
Gene sets provided either as a list object or as a GeneSetCollection object.
annotation
In the case of calling gsva() with expression data in a matrix and gene sets as a GeneSetCollection object, the annotation argument can be used to supply the name of the Bioconductor package that contains annotations for the class of gene identifiers occurring in the row names of the expression data matrix. By default gsva() will try to match the identifiers in expr to the identifiers in gset.idx.list just as they are, unless the annotation argument is set.
method
Method to employ in the estimation of gene-set enrichment scores per sample. By default this is set to gsva (H?nzelmann et al, 2013) and other options are ssgsea (Barbie et al, 2009), zscore (Lee et al, 2008) or plage (Tomfohr et al, 2005). The latter two standardize first expression profiles into z-scores over the samples and, in the case of zscore, it combines them together as their sum divided by the square-root of the size of the gene set, while in the case of plage they are used to calculate the singular value decomposition (SVD) over the genes in the gene set and use the coefficients of the first right-singular vector as pathway activity profile.
kcdf
Character string denoting the kernel to use during the non-parametric estimation of the cumulative distribution function of expression levels across samples when method="gsva". By default, kcdf="Gaussian" which is suitable when input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to kcdf="Poisson".
abs.ranking
Flag used only when mx.diff=TRUE. When abs.ranking=FALSE (default) a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When abs.ranking=TRUE the original Kuiper statistic that sums the largest positive and negative random walk deviations, is used. In this latter case, gene sets with genes enriched on either extreme (high or low) will be regarded as 'highly' activated.
min.sz
Minimum size of the resulting gene sets.
max.sz
Maximum size of the resulting gene sets.
parallel.sz
Number of processors to use when doing the calculations in parallel. This requires to previously load either the parallel or the snow library. If parallel is loaded and this argument is left with its default value (parallel.sz=0) then it will use all available core processors unless we set this argument with a smaller number. If snow is loaded then we must set this argument to a positive integer number that specifies the number of processors to employ in the parallel calculation.
parallel.type
Type of cluster architecture when using snow.
mx.diff
Offers two approaches to calculate the enrichment statistic (ES) from the KS random walk statistic. mx.diff=FALSE: ES is calculated as the maximum distance of the random walk from 0. mx.diff=TRUE (default): ES is calculated as the magnitude difference between the largest positive and negative random walk deviations.
tau
Exponent defining the weight of the tail in the random walk performed by both the gsva (H?nzelmann et al., 2013) and the ssgsea (Barbie et al., 2009) methods. By default, this tau=1 when method="gsva" and tau=0.25 when method="ssgsea" just as specified by Barbie et al. (2009) where this parameter is called alpha.
ssgsea.norm
Logical, set to TRUE (default) with method="ssgsea" runs the SSGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper. When ssgsea.norm=FALSE this last normalization step is skipped.
verbose
Gives information about each calculation step. Default: FALSE.
聯(lián)系客服