实验记录5（.h5文件解析为.tsv和.mtx格式）实验记录5：.h5文件解析为.ts

这篇主要针对Seurat包装不上hdf5r包的bug而无法使用Read10X_h5这个函数的读者。
在使用Seurat包读取10Xgenomics测序数据时，我发现github上有很多人面临着与我同样的问题：hdf5r包无法安装，因此在调用Read10X_h5函数时会出现以下报错信息：

Error in Read10X_h5("/bone_marrow/ica_bone_marrow_h5.h5") : Please install the 'hdf5r' package

在用包的时候，若数据格式不是.h5则用不到这个函数。但是偏巧，当你需要用Seurat分析这个数据，而数据格式正好是.h5怎么办呢？
因此我们需要先将.h5解析为Read10X这个函数能够读取的格式。
目标：将.h5文件转化为10X outs中的barcodes.tsv, genes.tsv, matrix.mtx三个文件

文件	格式
barcodes.tsv	1列，为barcode名
genes.tsv	2列，第1列为ENS编号，第2列为基因名
matrix.mtx	3列，第1列为基因编号，第2列为细胞编号，第3列为对应的reads数

梗概：

这是大文件的文本处理，要注意内存和磁盘空间的控制
主要是用R的hdf5r包处理，Seurat的依赖包
tsv文件与csv文件类似，不同点在于分隔符是\t而不是逗号
mtx格式是一个矩阵

Step1：需要一台能装上hdf5r包的设备本人MacOS系统
R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)
能够成功安装hdf5r包。
Step2：查看数据

setwd("/data/immune_census/bone_marrow/data") library(hdf5r) bm <- H5File$new("./ica_bone_marrow_h5.h5", mode = "r") bm Class: H5File Filename: /immune_census/bone_marrow/data/ica_bone_marrow_h5.h5 Access type: H5F_ACC_RDONLY Attributes: TITLE, CLASS, VERSION, FILTERS, PYTABLES_FORMAT_VERSION Listing: nameobj_type dataset.dims dataset.type_class GRCh38 H5I_GROUP

bm$ls() namelink.typeobj_type num_attrs group.nlinks group.mounted dataset.rank 1 GRCh38 H5L_TYPE_HARD H5I_GROUP470NA dataset.dims dataset.maxdims dataset.type_class dataset.space_class committed_type 1

将GRCh38导出来查看后发现共7组信息。

GRCh38 <- bm[["GRCh38"]] GRCh38$ls() namelink.typeobj_type num_attrs group.nlinks group.mounted dataset.rank dataset.dims 1barcodes H5L_TYPE_HARD H5I_DATASET3NANA1378000 2data H5L_TYPE_HARD H5I_DATASET3NANA1303795547 3 gene_names H5L_TYPE_HARD H5I_DATASET3NANA133694 4genes H5L_TYPE_HARD H5I_DATASET3NANA133694 5indices H5L_TYPE_HARD H5I_DATASET3NANA1303795547 6indptr H5L_TYPE_HARD H5I_DATASET3NANA1378001 7shape H5L_TYPE_HARD H5I_DATASET3NANA12 dataset.maxdims dataset.type_class dataset.space_class committed_type 1378000H5T_STRINGH5S_SIMPLE 2303795547H5T_INTEGERH5S_SIMPLE 333694H5T_STRINGH5S_SIMPLE 433694H5T_STRINGH5S_SIMPLE 5303795547H5T_INTEGERH5S_SIMPLE 6378001H5T_INTEGERH5S_SIMPLE 716384H5T_INTEGERH5S_SIMPLE

将所有的信息导出来

### 1细胞barcode： barcodes <- GRCh38[["barcodes"]] barcodes[1:5] [1] "MantonBM1_HiSeq_1-AAACCTGAGCAGGTCA-1" "MantonBM1_HiSeq_1-AAACCTGCACACTGCG-1" [3] "MantonBM1_HiSeq_1-AAACCTGCACCGGAAA-1" "MantonBM1_HiSeq_1-AAACCTGCATAGACTC-1" [5] "MantonBM1_HiSeq_1-AAACCTGCATCGATGT-1"### 2基因ENS编号 genes <- GRCh38[["genes"]] genes[1:10] [1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009" "ENSG00000239945" [6] "ENSG00000239906" "ENSG00000241599" "ENSG00000279928" "ENSG00000279457" "ENSG00000228463"### 3基因名 gene_names <- GRCh38[["gene_names"]] gene_names[1:10] [1] "RP11-34P13.3""FAM138A""OR4F5""RP11-34P13.7""RP11-34P13.8" [6] "RP11-34P13.14" "RP11-34P13.9""FO538757.3""FO538757.2""AP006222.2"### 4 基因索引号 indices <- GRCh38[["indices"]] indices[1:10] [1] 33664 33662 33661 33659 33658 33657 33655 33654 33653 33652### 5 reads数： data <- GRCh38[["data"]] data[1:10] [1]55 14 12 359 15 26 175### 6 indptr <- GRCh38[["indptr"]] indptr[1:10] [1]0785 1466 2010 2942 4605 5569 5678 6374 7096### 7 矩阵shape（33694个基因，378000个细胞） shape <- GRCh38[["shape"]] shape[1:2] [1]33694 378000

Step3：转化为R可操作的dataframe
现在仍然是h5的compounds数据类型，要把它转化为R可操作的dataframe

### 读取数据###数据个数（用length查看） barcodes <- bm$open("/GRCh38/barcodes")#378000 genes <- bm$open("/GRCh38/genes")#33694 gene_names <- bm$open("/GRCh38/gene_names")#33694 indices <- bm$open("/GRCh38/indices")#303795547 data <- bm$open("/GRCh38/data")#303795547 indptr <- bm$open("/GRCh38/indptr")#378001 shape <- bm$open("/GRCh38/shape")#[1]33694 378000####### 文件1 barcodes.tsv 的生成 ####### barcodes <- barcodes[] write.csv(barcodes, file = "barcodes.csv") #再手动重命名为.tsv格式####### 文件2 genes.tsv 的生成 ####### genes <- genes[] gene_names <- gene_names[] df <- data.frame(gene_names, row.names = c(genes)) write.csv(df, file = "genes.csv") #再手动重命名为.tsv格式

文件3 matrix.mtx 比较特殊，列名是基因的索引号，表示该基因在该细胞中的reads数。第二列没有现成的变量可以一次性填进去，需要进行一些转化和运算：

第一列是indices，作为row.names;
第三列是data，也就是reads数;
第二列是细胞的编号，如1，1，1，2，2，2，2，2，2，3，4，……，33694。这些数字表示即第1个细胞至第33694个细胞。如第一行为33688 1 2 则表示33688号基因在细胞1中的reads数为2。

indptr是一个递增数列，a2-a1 = 细胞cell1的数量。因此也就是说数字1要重复a2-a1次；数字2要重复a3-a2次，以此类推。

indices <- indices[] indptr <- indptr[] data <- data[]####### 文件3 matrix.csv 的生成 ####### # 创建两个相差一个元素的数列，相减得到细胞数量列表 indptr1 <- indptr[2:length(indptr)] indptr2 <- indptr[1:length(indptr)-1] cell_numb <- indptr1 - indptr2column2_of_file3 <- rep(c(1:length(indptr1)),c(cell_numb)) # rep函数万岁df <- data.frame(indices, column2_of_file3, data) save(df, file = "df_marrow_bone.RData") # 10分钟后运行结束，文件大小672Mb

这里仅仅只是把变量保存，由于磁盘限制没有写入csv中。
这里是利用另一个数据的处理过程，和之前的过程一样，不展开说明：

setwd("/Users/shinianyike/Desktop/zll/data/immune_census/cord_blood/data") library(hdf5r) bm <- H5File$new("./ica_cord_blood_h5.h5", mode = "r") GRCh38 <- bm[["GRCh38"]]### 读取数据###数据个数 barcodes <- bm$open("/GRCh38/barcodes")#384000 genes <- bm$open("/GRCh38/genes")#33694 gene_names <- bm$open("/GRCh38/gene_names")#33694 indices <- bm$open("/GRCh38/indices")#260473471 data <- bm$open("/GRCh38/data")#260473471 indptr <- bm$open("/GRCh38/indptr")#384001 shape <- bm$open("/GRCh38/shape")#[1]33694 384000#### 将.h5格式转化为R可操作的list #### 注意data <- GRCh38[["data"]] 这种读取方式仍然得到的是.h5的格式 barcodes <- barcodes[] genes <- genes[] gene_names <- gene_names[] indices <- indices[] indptr <- indptr[] data <- data[]####### 文件1 barcodes.tsv 的生成 ####### write.csv(barcodes, file = "barcodes.csv", row.names = FALSE, col.names = FALSE) #再手动将文件重命名为.tsv结尾####### 文件2 genes.tsv 的生成 ####### df <- data.frame(gene_names, row.names = c(genes)) write.csv(df, file = "genes.csv", row.names = FALSE, col.names = FALSE) #再手动将文件重命名为.tsv结尾####### 文件3 matrix.tsv 的生成 ####### # 创建两个相差一个元素的数列，相减得到细胞数量列表 indptr1 <- indptr[2:length(indptr)] indptr2 <- indptr[1:length(indptr)-1] cell_numb <- indptr1 - indptr2column2_of_file3 <- rep(c(1:length(indptr1)),c(cell_numb)) # rep函数万岁df <- data.frame(indices, column2_of_file3, data) save(df, file = "df_cord_blood.RData")

Step4：将数据转换为mtx格式
上述的文件3仅仅只是生成了一个csv文件，然而与输入的mtx格式是不同的。
mtx文件描述：引自：http://people.sc.fsu.edu/~jburkardt/data/mm/mm.html

A file in the Matrix Market format comprises four parts:

Header line: contains an identifier, and four text fields;

Comment lines: allow a user to store information and comments;

Size line: specifies the number of rows and columns, and the number of nonzero elements;

Data lines: specify the location of the matrix entries (implicitly or explicitly) and their values.

The header line has the form:
%%MatrixMarket object format field symmetry

The header line must be the first line of the file, and the header line must begin with the string %%MatrixMarket. The four fields that follow that string are

object is usually matrix, and that is the case we will consider here. Another legal value is vector, whose format is similar, but with some obvious simplifications.

format is either coordinate or array;

field is either real, double, complex, integer or pattern.

symmetry is either general (legal for real, complex, integer or pattern fields), symmetric (real, complex, integer or pattern), skew-symmetric (real, complex or integer), or hermitian (complex only).

例子：人外周血10Xgenomics数据比对方法所得RNA-seq数据的mtx文件（下图）以%%MatrixMarket字符串开头，object为matrix，format为coordinate，field为real，symmetry为general。分隔符为空格。
第一行为%%MatrixMarket matrix coordinate real general
第二行为%
第三行为基因总数，细胞总数，以及reads总数（第三列的和）
第三行的前两个数字已经有了，只需要计算一下reads总数。

文章图片
pbmc.mtx
脾脏数据的第一行则为
%%MatrixMarket matrix coordinate integer general
载入R变量

> setwd("/Users/shinianyike/Desktop/zll/data/immune_census/bone_marrow/") > load("./data/10X/df_marrow_bone.RData") > head(df) indices column2_of_file3 data 13366415 23366215 333661114 433659112 533658135 63365719# reads总数： sum(df$data) [1] 1179704196 # 所以mtx文件的第三行为33694 378000 1179704196

总的来说，要做的事情是在df数据框前添加三行：
第一行：%%MatrixMarket matrix coordinate real general
第二行：%
第三行：33694 378000 303795547
由于第一行和第二行是一行字符，并不像数据框那样行列元素个数能一一对应，因此要考虑其他的文本处理方法。
另外，R在第一行添加元素的操作比较繁杂，尤其是大文件。需要创建新空白文件，写入第一行，再把剩下的一行一行写入，缺点在于慢。若一次性写入又会产生内存不够的问题。
但是，利用linux的sed命令就可以轻松解决这个问题。
正式开始操作：
将R变量传上服务器，载入R变量，写入csv文件：

load("./df_marrow_bone.RData") head(df) indices column2_of_file3 data 13366415 23366215 333661114 433659112 533658135 63365719write.csv(df, file = "matrix.csv", row.names = FALSE, col.names = FALSE) # 用时15min左右Warning message: In write.csv(df, file = "matrix.csv", row.names = FALSE, col.names = FALSE) : quit()

### 查看文件 head matrix.csv "indices","column2_of_file3","data" 33664,1,5 33662,1,5 33661,1,14 33659,1,12 33658,1,35 33657,1,9 33655,1,15 33654,1,26 33653,1,17### 处理文件格式 sed -i 's/,/ /g' matrix.csv ###将分隔符逗号改为空格 sed -i '1d' matrix.csv###将第一行删除 sed -i '1i %%MatrixMarket matrix coordinate real general\ %\ 33694 378000 303795547' matrix.csv ###添加前面三行内容### 检查文件 head matrix.csv %%MatrixMarket matrix coordinate real general % 33694 378000 303795547 33664 1 5 33662 1 5 33661 1 14 33659 1 12 33658 1 35 33657 1 9 33655 1 15### 更改后缀名 mv ./matrix.csv ./matrix.mtxwc -l matrix.mtx##查看行数 303795550 matrix.mtx ##行数正确

Step6：genes.tsv和barcodes.tsv的处理

sed -i 's/ENSG/\nENSG/g' genes.tsv ##添加空行 sed -i '$a \' genes.tsv sed -i '1d' genes.tsv wc -l genes.tsv 33694 genes.tsv

barcodes同理
参考： hdf5r包的介绍：https://cran.r-project.org/web/packages/hdf5r/vignettes/hdf5r.html#conversion-of-datatypes
mtx文件描述：http://people.sc.fsu.edu/~jburkardt/data/mm/mm.html
https://github.com/cran/Matrix/blob/master/R/HBMM.R
后续读取数据出现报错

bm_data <- Read10X(data.dir ='./10X/')Error: readMM(): rowvalues 'i' are not in 1:nr #6：12

【实验记录5（.h5文件解析为.tsv和.mtx格式）】这里说明第一列的数据中不在1：nrow的范围内，检查后发现有3行0的存在（这是什么坑爹数据啊。。。。），只好手动修改。
2月28日

library(Matrix) head(df) indices column2_of_file3 data [1,]3366415 [2,]3366215 [3,]33661114 [4,]33659112 [5,]33658135 [6,]3365719which(df[,1]=="0") [1]2446541 121241013 212008491 # 这里就是报错来源。 df[2446540,] #indices column2_of_file3data #12830141 df[121241012,] #indices column2_of_file3data #121557051 df[212008490,] #indices column2_of_file3data #722592111 df[2446541,1]=45 df[121241013,1]=8 df[212008491,1]=20sparseM <- sparseMatrix(i=df[,1],j=df[,2],x=df[,3]) writeMM(obj = sparseM, file="matrix.mtx")