花式分析GEO数据集

【花式分析GEO数据集】首先,我参照GEO2R的在线工具,构建了一个函数,传递进GSE编号、平台号和分组信息,它就会返回给你差异分析的结果。
与GEO2R类似,但是有很多参数可以更改,也可以知道更多细节。代码如下:

# Version info: R 3.4.3, Biobase 2.30.0, GEOquery 2.40.0, limma 3.26.8 # R scripts generatedFri May 4 00:38:36 EDT 2018 source("https://bioconductor.org/biocLite.R") # biocLite()#########################1.函数GEOAnaly####################################### #Differential expression analysis with limma library(Biobase) library(GEOquery) library(limma)GEOAnaly <- function(geo, gpl, groups){ #这三个参数分别传递geo芯片编号、平台名、样本选择和分组信息 # load series and platform data from GEO geodir = paste0('GEOdataDownloads/', geo) dir.create(geodir, recursive = T) gset = getGEO(GEO = geo, GSEMatrix =TRUE, AnnotGPL=TRUE, destdir = geodir) if (length(gset) > 1) idx = grep(gpl, attr(gset, "names")) else idx = 1 gset = gset[[idx]]# make proper column names to match toptable fvarLabels(gset) = make.names(fvarLabels(gset))# group names for all samples gsms = groups sml = c() for (i in 1:nchar(gsms)) { sml[i] = substr(gsms,i,i) }# eliminate samples marked as "X" sel = which(sml != "X") sml = sml[sel] gset = gset[ ,sel]# log2 transform ex = exprs(gset) qx = as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T)) LogC = (qx[5] > 100) || (qx[6]-qx[1] > 50 && qx[2] > 0) || (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2) if (LogC) { ex[which(ex <= 0)] = NaN exprs(gset) = log2(ex) } # set up the data and proceed with analysis sml = paste("G", sml, sep="")# set group names fl = as.factor(sml) gset$description = fl design = model.matrix(~ description + 0, gset) colnames(design) = levels(fl) fit = lmFit(gset, design) cont.matrix = makeContrasts(G1-G0, levels=design) fit2 = contrasts.fit(fit, cont.matrix) fit2 = eBayes(fit2, 0.01) tT = topTable(fit2, adjust="fdr", sort.by="B", number=10000)genesymbol = colnames(tT)[grep('symbol', colnames(tT), ignore.case = T)]#想想为什么加这一句 tT = subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC",genesymbol)) write.table(tT, file=stdout(), row.names=F, sep="\t") return(tT) }

随便找了一个例子来进行分析。

花式分析GEO数据集
文章图片
image.png 这个数据集有96个样本,解剖部位有舌、口底、加补、牙槽骨等,我们只想分析舌鳞癌的肿瘤和正常样本,分别标记为“1”和“0”,其它标记为x,于是这些就形成第三个参数。
gse31056 <- GEOAnaly('GSE31056', 'GPL10526', paste0("XXXXXXXXXXXXXXXXXXXX01XX01XXXXXXXX", "01XXXXXXX01XXX01X1X0XXX0XXXXXXXXXXX10X10X01XXX01XX01X0XXX1XXXX")) head(gse31056)

结果是出来了。但是其实还是有问题的:
如果不是正常比病变的对照设计怎么办?
如果有lncRNA与mRNA需要区分怎么办?
拿到结果后可以进行下游分析了。
花式分析GEO数据集
文章图片
image.png
截图实在困难。就凑合着看吧!结果展示在坐下区域。

    推荐阅读