OSCA教程1|OSCA教程1 | SingleCellExperiment对象 OSCA教程1|SingleCellExperiment对象

教程来源：http://bioconductor.org/books/release/OSCA/data-infrastructure.html#background
用一张图展示SingleCellExperiment的结构：

文章图片
image-20210413210641826.png SingleCellExperiment对象中每一个数据代表一个分离的slot（来源于S4对象）。假如我们将SingleCellExperiment比作一艘货船，那么slot可以理解为单个的装载不同货物的boxes，比如有的专门存放数值类型的矩阵，另外一些则单独存放数据框。
在本次学习中，我们讨论可以获得哪些slot，他们的特定格式，我们怎样与他们进行交互。
厉害的人可能早就发现了SingleCellExperiment与SummarizedExperiment对象是一样。
1.存储主要的实验数据
1.1 assay slot
如果只创建一个基本的SingleCellExperiment对象，我们只需要赋值assay 数据槽就可以了（上图中的蓝色框框）。这个slot包含了主要的数据如：counts 矩阵。我们来随便生成一个具有三个细胞和10个基因的count矩阵进行测试。

counts_matrix <- data.frame(cell_1 = rpois(10, 10), cell_2 = rpois(10, 10), cell_3 = rpois(10, 30)) rownames(counts_matrix) <- paste0("gene_", 1:10) counts_matrix <- as.matrix(counts_matrix) # must be a matrix object!# 生成的矩阵，rpois为随机生成一个具有泊松分布特征的数据 counts_matrix cell_1 cell_2 cell_3 gene_191133 gene_210632 gene_39828 gene_412734 gene_581329 gene_6141224 gene_771226 gene_841227 gene_99926 gene_108729

现在，我们可以开始创建SingleCellExperiment对象了，并将数据命名：counts

sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

我们可以直接在命令行输入sce来查看初步的主要信息。

sceclass: SingleCellExperiment dim: 10 3 metadata(0): assays(1): counts rownames(10): gene_1 gene_2 ... gene_9 gene_10 rowData names(0): colnames(3): cell_1 cell_2 cell_3 colData names(0): reducedDimNames(0): altExpNames(0):

有两种方法可以获取counts值：

assay(sce, "counts") ，这是最常用的方法，第二个参数使用assay的name，就是刚刚我们命名的这个名字：counts
counts(sce) ，这个是上面方法的简写，但是旨在assay具有特殊名字的数据才有效"counts"。

counts(sce) cell_1 cell_2 cell_3 gene_191133 gene_210632 gene_39828 gene_412734 gene_581329 gene_6141224 gene_771226 gene_841227 gene_99926 gene_108729

1.2 添加更多的assays
assays数据槽非常强大的原因是它可以存储主要数据的不同格式。这在这个时候非常有用：我想保存原始count矩阵，还想保存标准化后的normalized 版本。现在我们使用scater包来计算标准化并log转换后的数据。

sce <- scater::logNormCounts(sce)

在做单细胞数据分析的时候，你可能已经注意到了我们每次都是对同一个对象如sce进行赋值，那为什么原有数据没有被覆盖掉呢？

# sce对象的assays变成嘞counts和logcounts sce class: SingleCellExperiment dim: 10 3 metadata(0): assays(2): counts logcounts rownames(10): gene_1 gene_2 ... gene_9 gene_10 rowData names(0): colnames(3): cell_1 cell_2 cell_3 colData names(1): sizeFactor reducedDimNames(0): altExpNames(0):

sce中此时多了一个assays，原始的counts并没有被覆盖掉。这也是为什么SingleCellExperiment对象特殊的地方，每次返回结果包含了原来的结果，新的结果是增加在对象中而不是替换。
与counts相似，我们也可以使用同样的方法取标化后的值

logcounts(sce) assay(sce,'logcounts')cell_1cell_2cell_3 gene_14.126532 3.701210 3.987843 gene_24.456647 3.497187 4.135947 gene_33.855265 4.644657 4.312379 gene_43.855265 4.542172 3.568449 gene_53.697747 3.497187 4.270252 gene_64.641056 3.701210 4.135947 gene_74.126532 4.830075 4.270252 gene_83.319303 4.431846 3.987843 gene_94.725113 3.701210 4.270252 gene_10 3.855265 3.879924 4.182118

查看对象中包含的所有assay

assays(sce)List of length 2 names(2): counts logcounts

上面的功能告诉我们，我们可以自动添加assay到sce对象中，但是更多的时候是使用我们自己的计算方式，但是这个时候返回的并不是SingleCellExperiment对象，不能将结果自动添加到assay中。这个时候想将新计算的结果添加进去怎么办呢？
使用以下方法

counts_100 <- counts(sce) + 100 assay(sce, "counts_100") <- counts_100 # assign a new entry to assays slot assays(sce) # new assay has now been added.List of length 3 names(3): counts logcounts counts_100

2.处理metadata
2.1 On the columns
为了注释SingleCellExperiment对象，我们需要增加以下metadata来描述我们的主要数据的列，比如实验的样本或者细胞类型描述。这个数据就保存在colData数据槽中，通常是一个data.frame或者DataFrame，行为细胞，列为对应的元数据如治疗信息，批次信息。
现在，让我们往sce中添加一些细胞信息在colData slot中。

cell_metadata <- data.frame(batch = c(1, 1, 2)) rownames(cell_metadata) <- paste0("cell_", 1:3)cell_metadata batch cell_11 cell_21 cell_32

可以使用两种方式将细胞信息添加到sce对象中去

sce <- SingleCellExperiment(assays = list(counts = counts_matrix), colData = https://www.it610.com/article/cell_metadata)sce class: SingleCellExperiment dim: 10 3 metadata(0): assays(1): counts rownames(10): gene_1 gene_2 ... gene_9 gene_10 rowData names(0): colnames(3): cell_1 cell_2 cell_3 colData names(1): batch reducedDimNames(0): altExpNames(0):

提取colData信息

colData(sce)DataFrame with 3 rows and 1 column batch cell_11 cell_21 cell_32# 更简单的取值方式 sce$batch

scater 的addPerCellQC()可以自动计算一些细胞指标并添加到colData数据槽中

sce <- scater::addPerCellQC(sce) colData(sce)DataFrame with 3 rows and 8 columns batchsumdetected percent_top_50 percent_top_100 percent_top_200 cell_1111010100100100 cell_219610100100100 cell_3228810100100100 percent_top_500total cell_1100110 cell_210096 cell_3100288

手动添加更多colData信息

sce$more_stuff <- runif(ncol(sce)) colnames(colData(sce))[1] "batch""sum""detected""percent_top_50""percent_top_100" [6] "percent_top_200" "percent_top_500" "total""more_stuff"

使用colData取子集

sce[, sce$batch == 1]class: SingleCellExperiment dim: 10 2 metadata(0): assays(1): counts rownames(10): gene_1 gene_2 ... gene_9 gene_10 rowData names(0): colnames(2): cell_1 cell_2 colData names(9): batch sum ... total more_stuff reducedDimNames(0): altExpNames(0):

2.2 On the rows
存储feature水平的注释为rowData数据槽，rowData是一个DataFrame，行对应基因，保存的信息如：转录本长度，基因名。还有一个rowRanges数据槽保存GRanges或GRangesList对象的基因组坐标。rowRanges保存基因的染色体，起始位置，终止位置。
这两个数据槽可以使用rowRanges()和rowData()获取。
在此处，sce中的rowRanges数据槽没有保存信息，运行会返回一个空值。

rowRanges(sce) # empty

在rowData中添加信息

sce <- scater::addPerFeatureQC(sce) rowData(sce)DataFrame with 10 rows and 2 columns meandetected gene_114.6667100 gene_216.3333100 gene_318.6667100 gene_413.6667100 gene_515.3333100 gene_617.3333100 gene_719.6667100 gene_814.6667100 gene_918.6667100 gene_1015.6667100

与colData相似，rowData在创建SingleCellExperiment对象的时候就已经初始化保存在对象中。具体还要取决于物种，比对和定量使用的注释信息等。
如，使用Ensembl ID，我们可能会使用AnnotationHub 资源获得Ensembl注释对象并提取基因body信息保存在我们的SingleCellExperiment对象的rowRanges中。

library(AnnotationHub) edb <- AnnotationHub()[["AH73881"]] # Human, Ensembl v97. genes(edb)[,2]

如何在基因/feature水平提取子集？类似于行操作。

sce[c("gene_1", "gene_4"), ] sce[c(1, 4), ] # same as above in this case## class: SingleCellExperiment ## dim: 2 3 ## metadata(0): ## assays(1): counts ## rownames(2): gene_1 gene_4 ## rowData names(2): mean detected ## colnames(3): cell_1 cell_2 cell_3 ## colData names(5): batch sum detected total more_stuff ## reducedDimNames(0): ## altExpNames(0):

2.3 Other metadata
还有一些数据信息不适合存储在colData或者rowData里面，那么可以保存在metadata数据槽中。
它可以是任何你想放的信息。
比如，我们有一些高变基因像保存在sce的slot中，我们就可以加入到metadata中。

my_genes <- c("gene_1", "gene_5") metadata(sce) <- list(favorite_genes = my_genes) metadata(sce)## $favorite_genes ## [1] "gene_1" "gene_5"

我们还可以简单的通过$添加更多信息

your_genes <- c("gene_4", "gene_8") metadata(sce)$your_genes <- your_genes metadata(sce)## $favorite_genes ## [1] "gene_1" "gene_5" ## ## $your_genes ## [1] "gene_4" "gene_8"

3.单细胞特异的fields
总结前面的，我们了解了SingleCellExperiment中的assays，colData，rowData/rowRanges以及metadata数据槽。
这些slots实际上继承自它的parent：SummarizedExperiment。
那么SingleCellExperiment对象还有一些自己的特有的数据槽（slots）。
3.1 Dimensionality reduction results
reducedDims数据槽保存通过PCA或t-SNE降维后的数据，行对应primary data数据的列即cells，列代表维度。由于这个数据槽以list形式保存数据，对同一个数据集，我们可以保存多个PCA/t-SNE/etc。
下面，我们使用来在scater包的runPCA()计算PCA

sce <- scater::logNormCounts(sce) sce <- scater::runPCA(sce) reducedDim(sce, "PCA")##PC1PC2 ## cell_1 -0.6690868 -0.2484418 ## cell_20.7974507 -0.1451026 ## cell_3 -0.12836390.3935444 ## attr(,"varExplained") ## [1] 0.5500410 0.1188276 ## attr(,"percentVar") ## [1] 82.23453 17.76547 ## attr(,"rotation") ##PC1PC2 ## gene_30.590643220.12776255 ## gene_9-0.52370773 -0.15070567 ## gene_40.37398222 -0.44373578 ## gene_5-0.34759518 -0.23398435 ## gene_7-0.172232720.53514989 ## gene_100.198983910.44548482 ## gene_8-0.158687580.34292404 ## gene_20.08684966 -0.21813365 ## gene_1-0.11744576 -0.01881143 ## gene_6-0.02022031 -0.24277404

同样，使用runTSNE()计算t-SNE。

sce <- scater::runTSNE(sce, perplexity = 0.1) reducedDim(sce, "TSNE")##[,1][,2] ## cell_15694.636-88.68314 ## cell_2 -2769.3044975.60635 ## cell_3 -2925.333 -4886.92321

我们可以使用reducedDims(sce)查看sce的降维数据列表，注意与reducedDim()的区别。

reducedDims(sce)## List of length 2 ## names(2): PCA TSNE

同样，可以手动添加对象到reducedDims()数据槽中。
使用uwot包的umap()函数，生成UMAP坐标保存到reducedDims中去。

u <- uwot::umap(t(logcounts(sce)), n_neighbors = 2) reducedDim(sce, "UMAP_uwot") <- u reducedDims(sce) # Now stored in the object.## List of length 3 ## names(3): PCA TSNE UMAP_uwotreducedDim(sce, "UMAP_uwot") ##[,1][,2] ## cell_10.692158950.07642523 ## cell_2 -0.599221710.26388137 ## cell_3 -0.09293724 -0.34030660 ## attr(,"scaled:center") ## [1]-0.6138766 -11.2867896

3.2 可选的Experiments
这个地方可以保存如 spike-in等的信息。
如果我们有可选的feature信息，我们可以保存在 SingleCellExperiment中。

spike_counts <- cbind(cell_1 = rpois(5, 10), cell_2 = rpois(5, 10), cell_3 = rpois(5, 30)) rownames(spike_counts) <- paste0("spike_", 1:5) spike_se <- SummarizedExperiment(list(counts=spike_counts)) spike_se## class: SummarizedExperiment ## dim: 5 3 ## metadata(0): ## assays(1): counts ## rownames(5): spike_1 spike_2 spike_3 spike_4 spike_5 ## rowData names(0): ## colnames(3): cell_1 cell_2 cell_3 ## colData names(0):

然后使用altExp()保存在sce对象中

altExp(sce, "spike") <- spike_se altExps(sce)## List of length 1 ## names(1): spike

提取

altExp(sce, "spike") <- spike_se altExps(sce)## List of length 1 ## names(1): spike

取子集

sub <- sce[,1:2] # retain only two samples. altExp(sub, "spike")## class: SummarizedExperiment ## dim: 5 2 ## metadata(0): ## assays(1): counts ## rownames(5): spike_1 spike_2 spike_3 spike_4 spike_5 ## rowData names(0): ## colnames(2): cell_1 cell_2 ## colData names(0):

所有的SummarizedExperiment对象都可以保存在Experiments中，甚至是SingleCellExperiment。
3.3 Size factors
sizeFactors()返回每一个细胞的标化因子组成的数值型向量，用于后续的标准化。
一般是自动生成。
如，使用scran包生成。