字符串模式匹配算法简析（2）—— Rabin Karp 算法

前情提要

之前写过一篇：https://www.kmahyyg.xyz/archives/105 简要介绍了大学课本中常见的 KMP 和 BruteForce 子字符串模式匹配算法，同时根据 WikiPedia 的内容扩展了 Sunday 算法。

今天在完成 LeetCode 的过程中 https://leetcode.com/problems/implement-strstr 发现 builtin 的 strings 的库速度最快，引起了我的好奇。

本科阶段几乎都学过 KMP 算法，然而 KMP 算法的 next 数组的生成难于理解，生成算法也不是很方便，没有通用性、记忆的性价比也不够高。不论是实际应用还是面试，我想都不是一个好的选择。

常见的几个语言中： Java String 类中的 indexOf() 方法，C++ 的 strstr() 函数，使用了 O(n*m) 的暴力比较；Golang strings 包中的 Index() 方法中使用了下述 Rabin-Karp 算法，时间复杂度 O(n+m)，同样没有使用 KMP。Golang 使用 R-K 配合上其对 Bytes 序列的优化，达成了较好的运算速度。

Rabin Karp 依赖 —— Rolling Hash

Rabin Karp 的 O(m+n) 快速最重要的原因就是使用了 Rolling Hash Function。什么意思呢？这个函数可以被下列的两个公式定义：

$$ H_{0,n} = S[0] * p^{n-1} + S[1] * p^{n-2} + … + S[n-1] * p^{0} $$

$$ H_{i,j} = H_{0,j} – H_{0,i} * p^{j-i} $$

其中 P 为某个特定的质数Base，S 为字符串，j、i 为对应字符的 Index，最终为了防止溢出，对 P 取计算结果的模数。

Rolling Hash 在 R-K 算法中的 Golang 实现

由于一个 ASCII 字符是 uint8，而 Golang 默认的编码是 UTF-8，意味着可能最长会面临一个字符为 4 个字节表示的情况。加之为了保证基于除留余数法的 Rolling Hash 尽可能减少碰撞，Rabin Karp 的实现和 FNV Hash 的实现使用了相同的 Prime Base const PrimeRK = 16777619 ，长度为 32 位 uint32。至于这个 Prime Base 怎么来的，我也不知道从何考证。值得一提的是：16777619 = 2^24 + 2^8 + 0x93

这样的一个 Rolling Hash 作为一个 Non-Cryptographic Hash ，相对来说存在碰撞的概率可能比 Cryptographic Hash 要大一些，所以我们也不能直接用 Hash 来判断两个字符串相等，因此即使 Hash 相等，也应当再进行一次字符串严格相等的判定。更多的内容请参见代码注释。

// primeRK is the prime base used in Rabin-Karp algorithm.
const primeRK = 16777619

// hashStr returns the hash and the appropriate multiplicative
// factor for use in Rabin-Karp algorithm.
func hashStr(sep string) (uint32, uint32) {
    // initialize the 32bit val.
	hash := uint32(0)
    // calculate string hash
	for i := 0; i < len(sep); i++ {
		hash = hash*primeRK + uint32(sep[i])
	}
	// sq here as the positional notation
	// which means here is primeRK notation
	var pow, sq uint32 = 1, primeRK
	for i := len(sep); i > 0; i >>= 1 {
        // traverse the binary bit to simplify power factor calculation
        // reduce the calculation times, only calc when LSB == 1 
		if i&1 != 0 {
            // just like decimal: 10 -> 100 -> 1000
			pow *= sq
		}
		sq *= sq
	}
	// use uint32 as storage, which can be seen as the module operation with 2^32-1
	// the pow value will be used later for fast-rolling
	return hash, pow
}

Rabin Karp 的 Golang 实现

// Index returns the index of the first instance of substr in s, or -1 if substr is not present in s.
func Index(s, substr string) int {
	n := len(substr)
    // special cases first
	switch {
	case n == 0:   // substr == ""
		return 0
	case n == 1:   // return if the first byte is equal, golang optimized speed for bytes
		return IndexByte(s, substr[0])
	case n == len(s):      // if same length, just check equal or not
		if substr == s {
			return 0
		}
		return -1
	case n > len(s):       // if longer than parent one, never can match
		return -1
	case n <= bytealg.MaxLen:      // use internal method written by assembly, speed up when bruteforce
		// Use brute force when s and substr both are small
		if len(s) <= bytealg.MaxBruteForce {
			return bytealg.IndexString(s, substr)
		}
		c0 := substr[0]
		c1 := substr[1]
		i := 0
		t := len(s) - n + 1
		fails := 0    // set a failure count
		for i < t {
			if s[i] != c0 {
				// IndexByte is faster than bytealg.IndexString, so use it as long as
				// we're not getting lots of false positives.
				o := IndexByte(s[i:t], c0)
				if o < 0 {
					return -1
				}
				i += o
			}
			if s[i+1] == c1 && s[i:i+n] == substr {
				return i
			}
			fails++      // if not match when using indexbyte
			i++
			// Switch to bytealg.IndexString when IndexByte produces too many false positives.
			if fails > bytealg.Cutover(i) {
				r := bytealg.IndexString(s[i:], substr)
				if r >= 0 {
					return r + i
				}
				return -1
			}
		}
		return -1
	}
	c0 := substr[0]    // repeat above again to check bruteforce working or not
	c1 := substr[1]
	i := 0
	t := len(s) - n + 1
	fails := 0
	for i < t {
		if s[i] != c0 {
			o := IndexByte(s[i:t], c0)
			if o < 0 {
				return -1
			}
			i += o
		}
		if s[i+1] == c1 && s[i:i+n] == substr {
			return i
		}
		i++
		fails++
		if fails >= 4+i>>4 && i < t {
			// Give up on IndexByte, it isn't skipping ahead
			// far enough to be better than Rabin-Karp.
			// Experiments (using IndexPeriodic) suggest
			// the cutover is about 16 byte skips.
			// TODO: if large prefixes of sep are matching
			// we should cutover at even larger average skips,
			// because Equal becomes that much more expensive.
			// This code does not take that effect into account.
			j := indexRabinKarp(s[i:], substr)
			if j < 0 {
				return -1
			}
			return i + j
		}
	}
	return -1
}

func indexRabinKarp(s, substr string) int {
	// Rabin-Karp search
	hashss, pow := hashStr(substr)         // if calculate substr hash and remember power factor
	n := len(substr)
	var h uint32
	for i := 0; i < n; i++ {           // calculate the first n char of s
		h = h*primeRK + uint32(s[i])
	}
	if h == hashss && s[:n] == substr {    // if the first n char matched
		return 0
	}
	for i := n; i < len(s); {
		h *= primeRK              // position notation ++
		h += uint32(s[i])         // add current char
		h -= pow * uint32(s[i-n])     // remove the last char with corresponding power, which was defined as rolling
		i++       // next byte
		if h == hashss && s[i-n:i] == substr {        // if hash match, to prevent collision
			return i - n
		}
	}
	// if not match, return -1
	return -1
}