A Deep Dive into Java hashCode

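What is a hash?

A hash (also called a digest) is the fixed-length output that a hash algorithm produces from an input of arbitrary length (the input is also called the pre-image). Put simply, a hash function compresses a message of any length into a message digest of a fixed length.
Properties:

  • If two hash values differ, the inputs that produced them must differ as well
  • If two hash values are equal, the inputs are very likely equal, but this is not guaranteed, because different inputs can produce the same value (a hash collision)
  • If you hash some data, change part of the input, and hash it again, a hash function with strong mixing produces a completely different hash value
Main use cases:
  • File integrity checks
  • Digital signatures
  • Authentication protocols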
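The properties above can be observed with the JDK's own MessageDigest class. The following is a minimal sketch, not part of the original article: the class name DigestDemo and the helper sha256Hex are made up for illustration, and SHA-256 is chosen only as a convenient example of a hash function with a fixed-length output.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestDemo {
    /** Returns the SHA-256 digest of the given text as a lowercase hex string. */
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Inputs of different lengths always give 64 hex characters (32 bytes).
        System.out.println(sha256Hex("a"));
        System.out.println(sha256Hex("hello world"));
        // Changing a single character gives an unrelated-looking digest.
        System.out.println(sha256Hex("hello worle"));
    }
}
```

The output length never depends on the input length, and the last two digests share no obvious structure even though the inputs differ in one character.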


What is hashCode?

The most direct answer is the documentation of the hashCode method in Object.java:
Returns a hash code value for the object. This method is supported for the benefit of hash tables such as those provided by java.util.HashMap.
The general contract of hashCode is:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
... ...
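In short, hashCode is a method declared in Object.java that returns the hash code value of the object. It exists to support hash tables such as HashMap.
Properties:
  • Within one run of an application, an object's hash code stays the same as long as the information used by equals is not modified
  • If two objects are equal according to equals, their hash codes must be equal
  • If two objects are not equal according to equals, their hash codes are not required to differ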
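The second and third properties can be checked with plain strings. This is a small sketch, assuming nothing beyond the standard String class; the class name HashContractDemo and the helper collides are invented for the example.

```java
final class HashContractDemo {
    /** True when x and y are NOT equal but still report the same hash code. */
    static boolean collides(Object x, Object y) {
        return !x.equals(y) && x.hashCode() == y.hashCode();
    }

    public static void main(String[] args) {
        String a = new String("hello");
        String b = new String("hello");
        // Two distinct but equal objects must report equal hash codes.
        System.out.println(a != b && a.equals(b) && a.hashCode() == b.hashCode());

        // Unequal objects may share a hash code: 65*31 + 97 == 66*31 + 66 == 2112.
        System.out.println("Aa".hashCode() + " " + "BB".hashCode());
        System.out.println(collides("Aa", "BB"));
    }
}
```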


How is hashCode generated?

Start with Object.java:

    public native int hashCode();

    public String toString() {
        return getClass().getName() + "@" + Integer.toHexString(hashCode());
    }

hashCode() is a native method. Its registration can be found in Object.c:

    static JNINativeMethod methods[] = {
        {"hashCode",    "()I",                    (void *)&JVM_IHashCode},
        {"wait",        "(J)V",                   (void *)&JVM_MonitorWait},
        {"notify",      "()V",                    (void *)&JVM_MonitorNotify},
        {"notifyAll",   "()V",                    (void *)&JVM_MonitorNotifyAll},
        {"clone",       "()Ljava/lang/Object;",   (void *)&JVM_Clone},
    };

    JNIEXPORT void JNICALL
    Java_java_lang_Object_registerNatives(JNIEnv *env, jclass cls)
    {
        (*env)->RegisterNatives(env, cls,
                                methods, sizeof(methods)/sizeof(methods[0]));
    }

    JNIEXPORT jclass JNICALL
    Java_java_lang_Object_getClass(JNIEnv *env, jobject this)
    {
        if (this == NULL) {
            JNU_ThrowNullPointerException(env, NULL);
            return 0;
        } else {
            return (*env)->GetObjectClass(env, this);
        }
    }

The function that the JVM_IHashCode pointer refers to is defined in jvm.cpp:

    JVM_ENTRY(jint, JVM_IHashCode(JNIEnv* env, jobject handle))
      JVMWrapper("JVM_IHashCode");
      // as implemented in the classic virtual machine; return 0 if object is NULL
      return handle == NULL ? 0 : ObjectSynchronizer::FastHashCode (THREAD, JNIHandles::resolve_non_null(handle)) ;
    JVM_END

ObjectSynchronizer::FastHashCode is implemented in synchronizer.cpp:

    intptr_t ObjectSynchronizer::FastHashCode (Thread * Self, oop obj) {
      if (UseBiasedLocking) {
        // NOTE: many places throughout the JVM do not expect a safepoint
        // to be taken here, in particular most operations on perm gen
        // objects. However, we only ever bias Java instances and all of
        // the call sites of identity_hash that might revoke biases have
        // been checked to make sure they can handle a safepoint. The
        // added check of the bias pattern is to avoid useless calls to
        // thread-local storage.
        if (obj->mark()->has_bias_pattern()) {
          // Box and unbox the raw reference just in case we cause a STW safepoint.
          Handle hobj (Self, obj) ;
          // Relaxing assertion for bug 6320749.
          assert (Universe::verify_in_progress() ||
                  !SafepointSynchronize::is_at_safepoint(),
                 "biases should not be seen by VM thread here");
          BiasedLocking::revoke_and_rebias(hobj, false, JavaThread::current());
          obj = hobj() ;
          assert(!obj->mark()->has_bias_pattern(), "biases should be revoked by now");
        }
      }

      // hashCode() is a heap mutator ...
      // Relaxing assertion for bug 6320749.
      assert (Universe::verify_in_progress() ||
              !SafepointSynchronize::is_at_safepoint(), "invariant") ;
      assert (Universe::verify_in_progress() ||
              Self->is_Java_thread() , "invariant") ;
      assert (Universe::verify_in_progress() ||
             ((JavaThread *)Self)->thread_state() != _thread_blocked, "invariant") ;

      ObjectMonitor* monitor = NULL;
      markOop temp, test;
      intptr_t hash;
      markOop mark = ReadStableMark (obj);

      // object should remain ineligible for biased locking
      assert (!mark->has_bias_pattern(), "invariant") ;

      if (mark->is_neutral()) {
        hash = mark->hash();              // this is a normal header
        if (hash) {                       // if it has hash, just return it
          return hash;
        }
        hash = get_next_hash(Self, obj);  // allocate a new hash code
        temp = mark->copy_set_hash(hash); // merge the hash code into header
        // use (machine word version) atomic operation to install the hash
        test = (markOop) Atomic::cmpxchg_ptr(temp, obj->mark_addr(), mark);
        if (test == mark) {
          return hash;
        }
        // If atomic operation failed, we must inflate the header
        // into heavy weight monitor. We could add more code here
        // for fast path, but it does not worth the complexity.
      } else if (mark->has_monitor()) {
        monitor = mark->monitor();
        temp = monitor->header();
        assert (temp->is_neutral(), "invariant") ;
        hash = temp->hash();
        if (hash) {
          return hash;
        }
        // Skip to the following code to reduce code size
      } else if (Self->is_lock_owned((address)mark->locker())) {
        temp = mark->displaced_mark_helper(); // this is a lightweight monitor owned
        assert (temp->is_neutral(), "invariant") ;
        hash = temp->hash();              // by current thread, check if the displaced
        if (hash) {                       // header contains hash code
          return hash;
        }
        // WARNING:
        //   The displaced header is strictly immutable.
        // It can NOT be changed in ANY cases. So we have
        // to inflate the header into heavyweight monitor
        // even the current thread owns the lock. The reason
        // is the BasicLock (stack slot) will be asynchronously
        // read by other threads during the inflate() function.
        // Any change to stack may not propagate to other threads
        // correctly.
      }

      // Inflate the monitor to set hash code
      monitor = ObjectSynchronizer::inflate(Self, obj);
      // Load displaced header and check it has hash code
      mark = monitor->header();
      assert (mark->is_neutral(), "invariant") ;
      hash = mark->hash();
      if (hash == 0) {
        hash = get_next_hash(Self, obj);
        temp = mark->copy_set_hash(hash); // merge hash code into header
        assert (temp->is_neutral(), "invariant") ;
        test = (markOop) Atomic::cmpxchg_ptr(temp, monitor, mark);
        if (test != mark) {
          // The only update to the header in the monitor (outside GC)
          // is install the hash code. If someone add new usage of
          // displaced header, please update this code
          hash = test->hash();
          assert (test->is_neutral(), "invariant") ;
          assert (hash != 0, "Trivial unexpected object/monitor header usage.");
        }
      }
      // We finally get the hash
      return hash;
    }

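The part of this function that matters here is the following. If the object header already stores a hash, that value is returned; otherwise a new hash is generated and merged into the header:

    hash = mark->hash();
    if (hash) {
      return hash;
    }
    hash = get_next_hash(Self, obj);  // allocate a new hash code
    temp = mark->copy_set_hash(hash); // merge the hash code into header
    ...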
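Because the hash is computed once and then stored, repeated calls return the same value for the same object. From Java this can be observed through System.identityHashCode, which returns what the default Object.hashCode would return even for classes that override hashCode. The sketch below is illustrative only; the names IdentityHashDemo, Constant and stableAcrossGc are hypothetical, and System.gc() is merely a hint that the JVM is free to ignore.

```java
final class IdentityHashDemo {
    /** A class that overrides hashCode(), so hashCode() no longer reports the identity hash. */
    static final class Constant {
        @Override
        public int hashCode() {
            return 42;
        }
    }

    /** Reads the identity hash before and after a GC hint and reports whether it stayed the same. */
    static boolean stableAcrossGc(Object o) {
        int before = System.identityHashCode(o);
        System.gc(); // a hint only; the object may or may not be moved
        int after = System.identityHashCode(o);
        return before == after;
    }

    public static void main(String[] args) {
        Object plain = new Object();
        System.out.println(stableAcrossGc(plain));
        // For a class that does not override hashCode, both calls agree.
        System.out.println(plain.hashCode() == System.identityHashCode(plain));

        Constant c = new Constant();
        System.out.println(c.hashCode());               // the overridden value
        System.out.println(System.identityHashCode(c)); // the JVM-generated value
    }
}
```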

The hash value itself is computed in the get_next_hash function:

    static inline intptr_t get_next_hash(Thread * Self, oop obj) {
      intptr_t value = 0 ;
      if (hashCode == 0) {
         // This form uses an unguarded global Park-Miller RNG,
         // so it's possible for two threads to race and generate the same RNG.
         // On MP system we'll have lots of RW access to a global, so the
         // mechanism induces lots of coherency traffic.
         value = os::random() ;
      } else if (hashCode == 1) {
         // This variation has the property of being stable (idempotent)
         // between STW operations.  This can be useful in some of the 1-0
         // synchronization schemes.
         intptr_t addrBits = cast_from_oop<intptr_t>(obj) >> 3 ;
         value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ;
      } else if (hashCode == 2) {
         value = 1 ;            // for sensitivity testing
      } else if (hashCode == 3) {
         value = ++GVars.hcSequence ;
      } else if (hashCode == 4) {
         value = cast_from_oop<intptr_t>(obj) ;
      } else {
         // Marsaglia's xor-shift scheme with thread-specific state
         // This is probably the best overall implementation -- we'll
         // likely make this the default in future releases.
         unsigned t = Self->_hashStateX ;
         t ^= (t << 11) ;
         Self->_hashStateX = Self->_hashStateY ;
         Self->_hashStateY = Self->_hashStateZ ;
         Self->_hashStateZ = Self->_hashStateW ;
         unsigned v = Self->_hashStateW ;
         v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
         Self->_hashStateW = v ;
         value = v ;
      }

      value &= markOopDesc::hash_mask;
      if (value == 0) value = 0xBAD ;
      assert (value != markOopDesc::no_hash, "invariant") ;
      TEVENT (hashCode: GENERATE) ;
      return value;
    }

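As the code shows, there are six generation algorithms, and the one actually used depends on the value of the hashCode VM flag:
  • 0 - a Park-Miller pseudo-random number generator
  • 1 - the memory address, shifted and XORed with a random number
  • 2 - the fixed value 1
  • 3 - a global incrementing sequence
  • 4 - the memory address of the object
  • 5 - a random number generated with the Xorshift algorithm
The hashCode flag is declared in globals.hpp:

    // jdk6:
    product(intx, hashCode, 0,
            "(Unstable) select hashCode generation algorithm" )

    // jdk7:
    product(intx, hashCode, 0,
            "(Unstable) select hashCode generation algorithm" )

    // jdk8:
    product(intx, hashCode, 5,
            "(Unstable) select hashCode generation algorithm")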

So the default hashCode scheme in JDK 8 is:

    // Marsaglia's xor-shift scheme with thread-specific state
    // This is probably the best overall implementation -- we'll
    // likely make this the default in future releases.
    unsigned t = Self->_hashStateX ;
    t ^= (t << 11) ;
    Self->_hashStateX = Self->_hashStateY ;
    Self->_hashStateY = Self->_hashStateZ ;
    Self->_hashStateZ = Self->_hashStateW ;
    unsigned v = Self->_hashStateW ;
    v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
    Self->_hashStateW = v ;
    value = v ;

The fields _hashStateX, _hashStateY, _hashStateZ and _hashStateW are initialized in thread.cpp:

    // thread-specific hashCode stream generator state - Marsaglia shift-xor form
    _hashStateX = os::random() ;
    _hashStateY = 842502087 ;
    _hashStateZ = 0x8767 ;    // (int)(3579807591LL & 0xffff) ;
    _hashStateW = 273326509 ;

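Note:
Xorshift is a family of pseudo-random number generators invented by George Marsaglia. Each step produces the next number in the sequence by XORing a value with logically shifted copies of itself, which makes generating a pseudo-random sequence very fast.
In short, with the default settings in JDK 6/7/8, hashCode has nothing to do with the memory address of the object.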
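To make the xor-shift step easier to follow, here is a rough Java transcription of the C++ snippet quoted above. It is a sketch under stated assumptions, not HotSpot code: the class name XorShiftHash is invented, the seed is passed in instead of coming from os::random(), and the 31-bit HASH_MASK is an assumption standing in for markOopDesc::hash_mask, whose width depends on the platform.

```java
final class XorShiftHash {
    // Assumption: a 31-bit mask in place of markOopDesc::hash_mask.
    private static final int HASH_MASK = 0x7FFFFFFF;

    // Per-generator state; the three constants are the ones shown in thread.cpp.
    private int x;
    private int y = 842502087;
    private int z = 0x8767;
    private int w = 273326509;

    XorShiftHash(int seed) {
        this.x = seed; // HotSpot uses os::random() here; a caller-supplied seed is an assumption
    }

    /** Advances the state by one step and returns the next non-zero masked value. */
    int next() {
        int t = x;
        t ^= (t << 11);
        x = y;
        y = z;
        z = w;
        int v = w;
        // >>> is Java's logical shift, matching the shifts on C++ 'unsigned'.
        v = (v ^ (v >>> 19)) ^ (t ^ (t >>> 8));
        w = v;
        int value = v & HASH_MASK;
        return value == 0 ? 0xBAD : value; // 0 is reserved to mean "no hash yet"
    }

    public static void main(String[] args) {
        XorShiftHash gen = new XorShiftHash(12345);
        for (int i = 0; i < 5; i++) {
            System.out.println(Integer.toHexString(gen.next()));
        }
    }
}
```

Note that the object is never consulted: the value depends only on the generator state, which is why the resulting hash is unrelated to the address.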


What is hashCode used for?

We saw earlier that hashCode exists to support hash tables. How exactly does it support them?
Let's look first at how a HashMap is initialized:

    public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);
        this.loadFactor = loadFactor;
        this.threshold = tableSizeFor(initialCapacity);
    }

    /**
     * When an initial capacity is passed to new HashMap, HashMap hands cap to tableSizeFor.
     * The function returns the smallest power of two that is >= cap, as an int.
     */
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

tableSizeFor guarantees that threshold, which at this point holds the initial table capacity, is always a power of two.
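The rounding behaviour is easy to try in isolation. The sketch below copies the tableSizeFor logic into a standalone class; the class name TableSizeDemo is made up, and MAXIMUM_CAPACITY is set to 1 << 30, the value used by java.util.HashMap.

```java
final class TableSizeDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /** Smallest power of two that is >= cap (clamped to [1, MAXIMUM_CAPACITY]). */
    static int tableSizeFor(int cap) {
        int n = cap - 1;   // so that an exact power of two maps to itself
        n |= n >>> 1;      // each step smears the highest set bit further right ...
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;     // ... until every bit below it is set
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        for (int cap : new int[] {0, 1, 10, 16, 17}) {
            System.out.println(cap + " -> " + tableSizeFor(cap));
        }
    }
}
```

For the inputs 0, 1, 10, 16 and 17 this yields 1, 1, 16, 16 and 32.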
Now the implementation of get:

    static final int hash(Object key) {
        int h;
        // note the following line
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public V get(Object key) {
        Node<K,V> e;
        return (e = getNode(hash(key), key)) == null ? null : e.value;
    }

    final Node<K,V> getNode(int hash, Object key) {
        Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
        if ((tab = table) != null && (n = tab.length) > 0 &&
            // note the following line
            (first = tab[(n - 1) & hash]) != null) {
            if (first.hash == hash && // always check first node
                ((k = first.key) == key || (key != null && key.equals(k))))
                return first;
            if ((e = first.next) != null) {
                if (first instanceof TreeNode)
                    return ((TreeNode<K,V>)first).getTreeNode(hash, key);
                do {
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        return e;
                } while ((e = e.next) != null);
            }
        }
        return null;
    }



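Look at the first line we flagged: first = tab[(n - 1) & hash]
When n is a power of two, this is equivalent to first = tab[hash % n] for a non-negative hash, but it is cheaper to compute and still yields a valid index when hash is negative.
This is why the capacity of a HashMap has to be a power of two. For example:
  • With a length of 16: 3 & (16-1) = 3 and 2 & (16-1) = 2, so the two hashes land in different slots
  • With a length of 15: 3 & (15-1) = 2 and 2 & (15-1) = 2, so both land in the same slot and collide
So hashCode is what makes lookups fast: put simply, it is used to compute the array index.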
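The masking trick can be checked directly. In the sketch below, the class IndexMaskDemo and its two helpers are hypothetical names; Math.floorMod is used as the reference because, unlike %, it never returns a negative result for a positive modulus.

```java
import java.util.HashSet;
import java.util.Set;

final class IndexMaskDemo {
    /** Bucket index the way HashMap computes it; meaningful only when n is a power of two. */
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    /** Counts how many distinct slots the mask (n - 1) can ever select. */
    static int reachableSlots(int n) {
        Set<Integer> seen = new HashSet<>();
        for (int hash = 0; hash < 1024; hash++) {
            seen.add((n - 1) & hash);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Same result as floorMod, even for a negative hash.
        System.out.println(indexFor(-7, 16) + " " + Math.floorMod(-7, 16));
        // The plain % operator would give a negative, unusable index here.
        System.out.println(-7 % 16);
        // With length 16 all 16 slots are reachable; with length 15 only 8 are.
        System.out.println(reachableSlots(16) + " " + reachableSlots(15));
    }
}
```

With a length of 15 the mask is 1110 in binary, so the lowest bit of the hash is discarded and every odd slot stays empty.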


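Now the other line we flagged: (h = key.hashCode()) ^ (h >>> 16);
For example:
  • Object A has hashCode 96, binary: 0110 0000
  • Object B has hashCode 80, binary: 0101 0000
With an array length of 16 (n - 1 = 15, binary: 1111), A and B both map to index 0, because only the low 4 bits take part in the index and the two values differ only in higher bits. That is clearly not what we want.
HashMap therefore shifts the hashCode right by 16 bits, that is, half the width of an int, and XORs the result with the original value, so that the high 16 bits also influence the low bits that select the index. This helps keys whose hash codes differ only in the high 16 bits. Values as small as 96 and 80 have no high bits set, so this step leaves them unchanged and they still collide in a table of 16.
In short, the goal is to spread keys more evenly over the table indexes and reduce hash collisions.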
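A short experiment shows where the XOR step makes a difference. This is an illustrative sketch: SpreadDemo, spread and index are invented names, spread repeats the expression from HashMap.hash, and the sample values 0x10000 and 0x20000 were picked because they differ only in the high 16 bits.

```java
final class SpreadDemo {
    /** The mixing step from HashMap.hash: fold the high 16 bits into the low 16 bits. */
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    /** Bucket index for a table whose length n is a power of two. */
    static int index(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int a = 0x10000;
        int b = 0x20000;
        // Without spreading, both keys fall into bucket 0 of a 16-slot table.
        System.out.println(index(a, 16) + " " + index(b, 16));
        // With spreading they land in buckets 1 and 2.
        System.out.println(index(spread(a), 16) + " " + index(spread(b), 16));
        // Small values have no high bits, so 96 and 80 are unchanged and both stay in bucket 0.
        System.out.println(index(spread(96), 16) + " " + index(spread(80), 16));
    }
}
```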


When do you need to override hashCode?

Suppose we use a custom Person object as a key:

    public class Person {
        private String name;

        public Person(String name) {
            this.name = name;
        }

        // getters and setters omitted

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            }
            if (obj instanceof Person) {
                Person person = (Person) obj;
                return this.name.equals(person.name);
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Person p1 = new Person("Tom");
        Person p2 = new Person("Tom");
        System.out.println("p1.equals(p2): " + p1.equals(p2));

        Map<Person, String> map = new HashMap<>();
        map.put(p1, "Tom");
        System.out.println("get(p1): " + map.get(p1));
        System.out.println("get(p2): " + map.get(p2));
    }

    // Output:
    // p1.equals(p2): true
    // get(p1): Tom
    // get(p2): null

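p1 and p2 are equal, but Person still inherits hashCode from Object, so the two instances almost certainly get different hash codes and the lookup with p2 misses. Now we override hashCode in the Person class:

    // Cached hash; caching is only safe as long as name never changes.
    private int hash;

    @Override
    public int hashCode() {
        int h = hash;
        int length = this.name == null ? 0 : this.name.length();
        if (h == 0 && length > 0) {
            char val[] = name.toCharArray();
            for (int i = 0; i < length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

    // Output:
    // p1.equals(p2): true
    // get(p1): Tom
    // get(p2): Tom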

So whenever a custom class is used in a hash-based structure and overrides equals, it must override hashCode as well.
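In everyday code the pair is often written with the helpers in java.util.Objects rather than a hand-rolled loop. The following is one possible sketch, not the article's implementation: PersonKey is a hypothetical immutable variant of the Person class, and Objects.equals / Objects.hashCode are used so that a null name is handled without extra checks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class PersonKey {
    private final String name; // immutable, so the hash code can never go stale

    PersonKey(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof PersonKey)) {
            return false;
        }
        return Objects.equals(name, ((PersonKey) obj).name);
    }

    @Override
    public int hashCode() {
        // Derived from exactly the fields that equals compares.
        return Objects.hashCode(name);
    }

    public static void main(String[] args) {
        Map<PersonKey, String> map = new HashMap<>();
        map.put(new PersonKey("Tom"), "Tom");
        // A different but equal key finds the entry.
        System.out.println(map.get(new PersonKey("Tom")));
    }
}
```

The design point is that hashCode reads the same fields as equals, which is what keeps the two methods consistent.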
Here is how String.java handles it:

    public boolean equals(Object anObject) {
        if (this == anObject) {
            return true;
        }
        if (anObject instanceof String) {
            String anotherString = (String)anObject;
            int n = value.length;
            if (n == anotherString.value.length) {
                char v1[] = value;
                char v2[] = anotherString.value;
                int i = 0;
                while (n-- != 0) {
                    if (v1[i] != v2[i])
                        return false;
                    i++;
                }
                return true;
            }
        }
        return false;
    }

    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

String is not the only case. Many hashCode computations in Java use 31, for example AbstractList:

    public int hashCode() {
        int hashCode = 1;
        for (E e : this)
            hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
        return hashCode;
    }

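Why 31?
Effective Java explains:
The value 31 was chosen because it is an odd prime. If the multiplier were even and the multiplication overflowed, information would be lost, because multiplying by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional to use one when computing hash results. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this optimization automatically.
When HashMap computes hashes, the goal is to avoid collisions as far as possible. If too many keys share the same hash, the linked list (or red-black tree) in that bucket grows longer and lookups slow down.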
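The shift-and-subtract identity holds in Java's wrapping int arithmetic as well, which the sketch below checks. The class name Multiplier31Demo and the helper polyHash are made up for the example; polyHash recomputes the String hash using the shifted form instead of the multiplication.

```java
final class Multiplier31Demo {
    /** Same polynomial as String.hashCode, with 31 * h written as (h << 5) - h. */
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h << 5) - h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // The identity also holds when the multiplication overflows.
        int i = Integer.MAX_VALUE;
        System.out.println(31 * i == (i << 5) - i);
        // The shifted form reproduces String.hashCode().
        System.out.println(polyHash("hello") == "hello".hashCode());
    }
}
```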


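Finally

Digging any deeper takes us entirely into HashMap territory, and the job of hashCode ends here. HashMap is an undisputed star of interviews, so its source code is well worth the time to read.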
