leveldb sstable min max区间搜索源码分析(1) druidleveldb

作者：王东阳
leveldb 中min_max搜索分析
前言
leveldb是一个写性能十分优秀的存储引擎，是典型的LSM树(Log Structured-Merge Tree)实现。LSM树的核心思想是将随机写转化为连续写，从而提升写操作的吞吐能力，整体架构如下：

文章图片

虽然在内存中，所有的数据都是按序排列的，但是当多个memetable数据持久化到磁盘后，对应的不同的sstable之间是存在交集的，在读操作时，需要对所有的sstable文件进行遍历，严重影响了读取效率。因此leveldb后台会“定期“整合这些sstable文件，该过程也称为compaction。随着compaction的进行，sstable文件在逻辑上被分成若干层，由内存数据直接dump出来的文件称为level0层文件，后期整合而成的文件为level i层文件，这也是leveldb这个名字的由来。 https://www.bookstack.cn/read...

在leveldb的压缩过程中, 需要向下层搜索和互相重叠的sstable进行合并，如下图所示：

文章图片

上层数据如何快速的查询下一层中和当前sstable文件重叠的文件列表呢？由上图可以看出每个sstable中都有一个最小值min和最大值max，表示当前文件所包含key的最大值和最小值，可以当成一个数据区间，Level0层的key区间互相有重叠，剩余其它层中，每一层中的sstable文件，所包含的key区间，互相不重叠，所以对它们排序后，无论是最大值还是最小值都是严格递增的。
除了文件压缩，leveldb在执行查询的过程中，也会利用sstable的min,max的属性信息进行快速查找请求key所在的文件。
本文主要基于leveldb:table.go 和 leveldb_rs:table.rs介绍sstable区间搜索相关接口代码的实现算法。
tFile
Go
文件的结构如下，主要包含文件的描述信息fd，文件中数据的大小size，以及文件中包含key的最小，最小值 (imin,imax). 在本文中，我们只需要关注imin,imax就可以了。

// tFile holds basic information about a table. type tFile struct { fdstorage.FileDesc seekLeftint32 sizeint64 imin, imax internalKey }

接下来是tFile中的方法，需要注意的是下面的方法中都有一个参数icmp *iComparer ,我们可以理解为一个比较器，用于判断key大小时候使用的。
首先是tFile的after方法，判断给定的key是否在这个文件的后面 ,这个方法我们可以这么理解，在一个水平数轴上，当前tFile的imin,imax对应数轴上的一个区间，判断给定的key是否在这个区间的后面，也就是判断给定ukey是否比当前文件中的最大值imax要大。

after 方法示意图：││ ││ukey ────────▼──────────▼─────────▲───────? iminimax│ │

对应代码如下

// Returns true if given key is after largest key of this table. func (t *tFile) after(icmp *iComparer, ukey []byte) bool { return ukey != nil && icmp.uCompare(ukey, t.imax.ukey()) > 0 }

同理，tFile的before方法用来判断给定的ukey是否在在当前文件的前面，也就是判断ukey是否比当前文件中的最小值imin要小。

// Returns true if given key is before smallest key of this table. func (t *tFile) before(icmp *iComparer, ukey []byte) bool { return ukey != nil && icmp.uCompare(ukey, t.imin.ukey()) < 0 }

overlaps用来判断指定区间[umin,umax]是否和当前文件的区间[imin,imax]有重叠。我们可以反向来考虑什么情况下两个区间不重叠。

tFile区间的最小值imin大于另外一个区间的最大值umax
tFile区间的最大值imax小于另外一个区间的最大值umin

overlaps 方法示意图││ ││uminumax ────▼───────▼──────────▲──────▲───────? iminimax│ tFile│ ││││ uminumax││ ──────▲──────▲─────▼──────▼──────? │tFile │iminimax ││ // Returns true if given key range overlaps with this table key range. func (t *tFile) overlaps(icmp *iComparer, umin, umax []byte) bool { return !t.after(icmp, umin) && !t.before(icmp, umax) }

Rust
rust代码中tFile的定义如下，跟go中定义非常相似，我们这里也只关注里面的imin，imax就可以。

#[derive(Clone)] pub struct tFile { fd: storage::FileDesc, seek_left: Arc, size: i64, imin: internalKey, imax: internalKey, }

接下来是tFile的方法实现，同样所有的方法都有一个参数icmp: &IComparer，用于比较key。首先是after方法，判断指定ukey是否在当前tFile的后面，跟Go版本一样，只需要判断ukey是否比imax大就可以了。

impl tFile { // Returns true if given key is after largest key of this table. pub fn after(&self, icmp: &IComparer, ukey: &[u8]) -> bool where T: Comparer, { return ukey.len() > 0 && icmp.u_compare(ukey, self.imax.ukey()) == Greater; }

before方法,判断指定ukey是否在当前tFile的前面，跟Go版本一样，只需要判断ukey是否比imin小就可以了。

// Returns true if given key is before smallest key of this table. pub fn before(&self, icmp: &IComparer, ukey: &[u8]) -> bool where T: Comparer, { return ukey.len() > 0 && icmp.u_compare(ukey, self.imin.ukey()) > Greater; }

overlaps用来判断指定区间[umin,umax]是否和当前文件的区间[imin,imax]有重叠，跟GO本版一样，不多说。

// Returns true if given key range overlaps with this table key pub fn overlaps(&self, icmp: &IComparer, umin: &[u8], umax: &[u8]) -> bool where T: Comparer, { return !self.after(icmp, umin) && !self.before(icmp, umax); }

tFiles
tFiles用来表示SST中每一层的文件的集合。
Go
在goleveldb中，tFiles的定义如下

// tFiles hold multiple tFile. type tFiles []*tFile

为了支持排序，tFiles需要实现下面的方法

func (tf tFiles) Len() int{ return len(tf) } func (tf tFiles) Swap(i, j int) { tf[i], tf[j] = tf[j], tf[i] }func (tf tFiles) nums() string { x := "[ " for i, f := range tf { if i != 0 { x += ", " } x += fmt.Sprint(f.fd.Num) } x += " ]" return x }

接下来会讲tFiles里面比较核心的一些方法。 searchMin，searchMax，searchMinUkey，searchMaxUkey这四个方法调用的前提是tFiles中的所有文件都已经按照升序排好序，且tFiles中的所有tFile的key区间互相不重叠，所以此时tFiles中的所有文件无论是按照imin,还是按照imax都是单调递增的. 举例来说，如果三个互相不重叠的区间 [5,7], [1,4]，[9,11], 无论基于imin还是imax，排序的结果都是
[1,4] [5,7] [9,11]
searchMin
searchMin用于在tFiles中查询imin大于等于ikey的，且imin最小的tFile的索引，也就是在排序后的tFiles中搜索满足条件的最左边的的tFile。

// Searches smallest index of tables whose its smallest // key is after or equal with given key. func (tf tFiles) searchMin(icmp *iComparer, ikey internalKey) int { return sort.Search(len(tf), func(i int) bool { return icmp.Compare(tf[i].imin, ikey) >= 0 }) }

举例来说，对于三个区间
[1,4] [5,7] [9,11]
如果ikey等于3，那么搜索imin大于等于ikey的最左边的区间就是 [5,7].
searchMax
searchMax 用于在tFiles中查询 imax 大于等于 ikey 的，且 imax 最小的tFile的索引，也就是在排序后的tFiles中搜索满足条件的最左边的的tFile。

// Searches smallest index of tables whose its largest // key is after or equal with given key. func (tf tFiles) searchMax(icmp *iComparer, ikey internalKey) int { return sort.Search(len(tf), func(i int) bool { return icmp.Compare(tf[i].imax, ikey) >= 0 }) }

举例来说，对于三个区间
[1,4] [5,7] [9,11]
如果ikey等于3，那么搜索imin大于等于ikey的最左边的区间就是 [5,7].
searchMinUkey
searchMinUkey 跟searchMin类似，用于在tFiles中查询 imin.ukey() 大于等于 umin 的，且 imin 最小的tFile的索引，也就是在排序后的tFiles中搜索满足条件的最左边的的tFile。

// Searches smallest index of tables whose its smallest // key is after the given key. func (tf tFiles) searchMinUkey(icmp *iComparer, umin []byte) int { return sort.Search(len(tf), func(i int) bool { return icmp.ucmp.Compare(tf[i].imin.ukey(), umin) > 0 }) }

searchMaxUkey
searchMaxUkey 跟searchMax类似，用于在tFiles中查询 imax.ukey() 大于等于 umax 的，且 imax 最小的tFile的索引，也就是在排序后的tFiles中搜索满足条件的最左边的的tFile。

// Searches smallest index of tables whose its largest // key is after the given key. func (tf tFiles) searchMaxUkey(icmp *iComparer, umax []byte) int { return sort.Search(len(tf), func(i int) bool { return icmp.ucmp.Compare(tf[i].imax.ukey(), umax) > 0 }) }

overlaps
overlaps 判断区间[umin,umax]是否和tFiles中某个文件的区间有重叠

// Returns true if given key range overlaps with one or more // tables key range. If unsorted is true then binary search will not be used. func (tf tFiles) overlaps(icmp *iComparer, umin, umax []byte, unsorted bool) bool { if unsorted { // Check against all files. for _, t := range tf { if t.overlaps(icmp, umin, umax) { return true } } return false }i := 0 if len(umin) > 0 { // Find the earliest possible internal key for min. i = tf.searchMax(icmp, makeInternalKey(nil, umin, keyMaxSeq, keyTypeSeek)) } if i >= len(tf) { // Beginning of range is after all files, so no overlap. return false } return !tf[i].before(icmp, umax) }

这个方法中的第三个参数 unsorted 表示当前tFiles是否是排序的，为什么要进行区分呢？因为在SST中，Level 0层的不同文件之间key是有重叠的，无法进行排序。对于这种没有排序的tFiles，就只能使用遍历的方法一个一个的tFile进行比较，时间复杂度是O(n)。对于已经排序的tFiles，我们可以使用二分法进行搜索，

首先在tFiles中搜索imax大于等于umin的最左边的区间。
如果搜索不到，就表示[umin,umax]大于所有的tFile，肯定不重叠
如果搜索到了，还需要再判断搜索到的tFile.imin.ukey()是否小于等于umax
结合上面步骤1和3，如果都满足，也就是tFile.imax.ukey() >= umin 且 tFile.imin.ukey() <= umax ，满足两个区间重叠的充分必要条件

getOverlaps
getOverlaps 用于在tFiles中求和指定区间[umin,umax]有重叠的区间列表，在本文中我们主要关注如何在排序的tFiles快速求取的算法

if !overlapped { var begin, end int // Determine the begin index of the overlapped file if umin != nil { index := tf.searchMinUkey(icmp, umin) if index == 0 { begin = 0 } else if bytes.Compare(tf[index-1].imax.ukey(), umin) >= 0 { // The min ukey overlaps with the index-1 file, expand it. begin = index - 1 } else { begin = index } } // Determine the end index of the overlapped file if umax != nil { index := tf.searchMaxUkey(icmp, umax) if index == len(tf) { end = len(tf) } else if bytes.Compare(tf[index].imin.ukey(), umax) <= 0 { // The max ukey overlaps with the index file, expand it. end = index + 1 } else { end = index } } else { end = len(tf) } // Ensure the overlapped file indexes are valid. if begin >= end { return nil } dst = make([]*tFile, end-begin) copy(dst, tf[begin:end]) return dst }

这里算法也特别巧妙，通过两次二分法快速确定与指定区间[umin,umax]重叠的最左边区间以及与指定区间[umin,umax]重叠的最右边区间的下一个区间。
1，调用 tf.searchMinUkey(icmp, umin) 在tFiles中搜索imin.ukey()大于等于umin的最左边的tFile，

searchMinUkey 方法示意图：┌─────────┐┌───────────┐┌───────────┐ ││││││ │ tFile1││tFile2││tFile3│ ││││││ ────────┴────▲────┴──▲─┴───────────┴────┴───────────┴────────────────────────? ││ ││ uminumax

如图， tf.searchMinUkey(icmp, umin) 搜索得到tFile2，但是tFile1也有可能和[umin,umax]重叠，所以需要额外判断一下

} else if bytes.Compare(tf[index].imin.ukey(), umax) <= 0 { // The max ukey overlaps with the index file, expand it. end = index + 1 } else {

2，调用 tf.searchMaxUkey(icmp, umax) 在tFiles中搜索imax.ukey()大于等于umax的最左边的tFile。这里也有两种情况需要考虑，一种是搜索到的tFile3不和umax重叠，也就是tFile3完全在[umin,umax]的右边

searchMaxUkey 情况一示意图：┌────────┐┌──────────┐┌──────────┐ ││││││ │ tFil1││tFile2││tFile3│ └────────┴────┴──────▲───┴─▲─┴──────────┴────? ││ ││ uminumax

第二种就是搜索到的fFile3刚好和[umin,umax]重叠

searchMaxUkey 情况二示意图：┌────────┐┌────────────┐ ┌──────────┐ ││││ ││ │ tFil1││tFile2│ │tFile3│ └────────┴────┴─────────▲──┴─┴─────▲────┴─────────? ││ ││ uminumax

这种情况也要单独处理下

} else if bytes.Compare(tf[index].imin.ukey(), umax) <= 0 { // The max ukey overlaps with the index file, expand it. end = index + 1

3，最终基于1，2步骤的两个索引begin和end，取中间的部分[begin,end)就是要求的结果。这个算法另外一个巧妙的地方就在于[begin,end) 是一个左闭右开区间，所以如果begin==end的话，就可以表示找不到重叠的区间了。
总结
leveldb中通过在sstable中记录min、max的信息，可以在合并sstable时避免读取整个sstable来判断是否发生重叠。通过min_max索引，算法可以更高效地实现二分查找。这种min_max索引方式在数据库项目中被广泛应用，提升查询效率。
【leveldb sstable min max区间搜索源码分析(1)】同时，本文介绍了min_max索引在go和rust两种语言中的不同实现，来帮助读者从代码层面了解min_max索引的不同实现方案。