几种常见的hash函数

概览 最近在看redis源码,发现redis采用了几种不同的算法来计算Hash Code; 因此打算借此整理下JDK中的实现,加深理解;
Redis Thomas Wang's 32 bit Mix Function 关于该算法的具体内容,可以参考这篇文章,算法源码如下:

public int hash32shift(int key) { key = ~key + (key << 15); // key = (key << 15) - key - 1; key = key ^ (key >>> 12); key = key + (key << 2); key = key ^ (key >>> 4); key = key * 2057; // key = (key + (key << 3)) + (key << 11); key = key ^ (key >>> 16); return key; }

可以看到它是计算int类型的hash code,返回结果也是int类型; 另外它还提供了64bit的算法,有兴趣的可以自行查阅:
根据Thomas Wang所描述,由于上述代码可以利用CPU的native指令,在HP 9000 workstations机器上只需要11个时钟周期,速度很快;
比较费解这边的常量值是怎么确定的,无奈算法基础比较差,如果有了解的同学,欢迎答疑;
Austin Appleby's MurmurHash2 MurmurHash是一种非加密型哈希函数,由Austin Appleby在2008年发明,并且有多个变种; 该算法的作者2011去了google,开发出来了新的算法CityHash;
unsigned int dictGenHashFunction(const void *key, int len) { /* 'm' and 'r' are mixing constants generated offline. They're not really 'magic', they just happen to work well.*/ uint32_t seed = dict_hash_function_seed; const uint32_t m = 0x5bd1e995; const int r = 24; /* Initialize the hash to a 'random' value */ uint32_t h = seed ^ len; /* Mix 4 bytes at a time into the hash */ const unsigned char *data = https://www.it610.com/article/(const unsigned char *)key; while(len>= 4) { uint32_t k = *(uint32_t*)data; k *= m; k ^= k >> r; k *= m; h *= m; h ^= k; data += 4; len -= 4; }/* Handle the last few bytes of the input array*/ switch(len) { case 3: h ^= data[2] << 16; case 2: h ^= data[1] << 8; case 1: h ^= data[0]; h *= m; }; /* Do a few final mixes of the hash to ensure the last few * bytes are well-incorporated. */ h ^= h >> 13; h *= m; h ^= h >> 15; return (unsigned int)h; }

Murmur可以计算字符串的hash code,基本思想就是把key分成n组,每组4个字符,把这4个字符看成是一个uint_32,进行n次运算,得到一个h,然会在对h进行处理,得到一个相对离散的hash code;
DJB Hash
unsigned int DJBHash(char *str) { unsigned int hash = 5381; while (*str){ hash = ((hash << 5) + hash) + (*str++); /* times 33 */ } hash &= ~(1 << 31); /* strip the highest bit */ return hash; }

Redis对其进行了部分调整,不区分大小写:
unsigned int dictGenCaseHashFunction(const unsigned char *buf, int len) { unsigned int hash = (unsigned int)dict_hash_function_seed; while (len--) hash = ((hash << 5) + hash) + (tolower(*buf++)); /* hash * 33 + c */ return hash; }

JDK Austin Appleby's MurmurHash3 从JDK7开始,JDK引入了一种新的计算Hash code的算法,可以在如下的集合对象中使用:
  • HashMap
  • Hashtable
  • HashSet
  • LinkedHashMap
  • LinkedHashSet
  • WeakHashMap
  • ConcurrentHashMap
final int hash(Object k) { int h = hashSeed; if (0 != h && k instanceof String) { return sun.misc.Hashing.stringHash32((String) k); }h ^= k.hashCode(); // This function ensures that hashCodes that differ only by // constant multiples at each bit position have a bounded // number of collisions (approximately 8 at default load factor). h ^= (h >>> 20) ^ (h >>> 12); return h ^ (h >>> 7) ^ (h >>> 4); }

可以看到只有key为字符类型,而且hashSeed不为0时才采用新的哈希算法;
sun.misc.Hashing.stringHash32实际上调用的是String.hash32方法:
int hash32() { int h = hash32; //h==0表示未计算过hash code //可以看到hash code只会计算一次 if (0 == h) { //HASHING_SEED是根据时间戳等计算出来的随机数 h = sun.misc.Hashing.murmur3_32(HASHING_SEED, value, 0, value.length); //确保结果非0,避免重新计算 h = (0 != h) ? h : 1; hash32 = h; }return h; }

可以看到最终调用的是MurmurHash3算法,那么什么情况下hashSeed不为0呢?
final boolean initHashSeedAsNeeded(int capacity) { boolean currentAltHashing = hashSeed != 0; boolean useAltHashing = sun.misc.VM.isBooted() && (capacity >= Holder.ALTERNATIVE_HASHING_THRESHOLD); boolean switching = currentAltHashing ^ useAltHashing; if (switching) { hashSeed = useAltHashing ? sun.misc.Hashing.randomHashSeed(this) : 0; } return switching; }

可以看到是否采用新算法和Holder.ALTERNATIVE_HASHING_THRESHOLD有关;实际上JDK定义了一个新的属性jdk.map.althashing.threshold(默认值-1),只有当系统容量大于该属性值时,才会采用新的算法;因此可以将通过-Djdk.map.althashing.threshold=0设置该属性为0,启用新的哈希函数;
String.hashCode
public int hashCode() { int h = hash; if (h == 0 && value.length > 0) { char val[] = value; for (int i = 0; i < value.length; i++) { h = 31 * h + val[i]; } hash = h; } return h; }

Object.hashCode 之前曾经在Java对象结构 中介绍了Java的MarkWord,其中记录了对象的hashCode; 如果对象计算过hashCode,则会存储到对象头中;如果是第一次计算则是调用
static inline intptr_t get_next_hash(Thread * Self, oop obj) { intptr_t value = https://www.it610.com/article/0 ; if (hashCode == 0) { // This form uses an unguarded global Park-Miller RNG, // so it's possible for two threads to race and generate the same RNG. // On MP system we'll have lots of RW access to a global, so the // mechanism induces lots of coherency traffic. value = https://www.it610.com/article/os::random() ; } else if (hashCode == 1) { // This variation has the property of being stable (idempotent) // between STW operations.This can be useful in some of the 1-0 // synchronization schemes. intptr_t addrBits = intptr_t(obj)>> 3 ; value = https://www.it610.com/article/addrBits ^ (addrBits>> 5) ^ GVars.stwRandom ; } else if (hashCode == 2) { value = https://www.it610.com/article/1 ; // for sensitivity testing } else if (hashCode == 3) { value = ++GVars.hcSequence ; } else if (hashCode == 4) { value = intptr_t(obj) ; } else { // Marsaglia's xor-shift scheme with thread-specific state // This is probably the best overall implementation -- we'll // likely make this the default in future releases. unsigned t = Self->_hashStateX ; t ^= (t << 11) ; Self->_hashStateX = Self->_hashStateY ; Self->_hashStateY = Self->_hashStateZ ; Self->_hashStateZ = Self->_hashStateW ; unsigned v = Self->_hashStateW ; v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ; Self->_hashStateW = v ; value = https://www.it610.com/article/v ; }value &= markOopDesc::hash_mask; if (value == 0) value = 0xBAD ; assert (value != markOopDesc::no_hash,"invariant") ; TEVENT (hashCode: GENERATE) ; return value; }

可以看到,这边提供了多种不同的实现,具体采用哪种,取决于hashCode的值,在Linux机器上,hashCode默认为0:
几种常见的hash函数
文章图片
hashCode.png 因此是通过os::random计算hashCode,可以参考Java对象结构 ;
其它 【几种常见的hash函数】Character、Integer、Long、Double的hashCode方法返回包装的基本类型;

    推荐阅读