前情提要
之前写过一篇:https://www.kmahyyg.xyz/archives/105 简要介绍了大学课本中常见的 KMP 和 BruteForce 子字符串模式匹配算法,同时根据 WikiPedia 的内容扩展了 Sunday 算法。
今天在完成 LeetCode 的过程中 https://leetcode.com/problems/implement-strstr 发现 builtin 的 strings 的库速度最快,引起了我的好奇。
本科阶段几乎都学过 KMP 算法,然而 KMP 算法的 next 数组的生成难于理解,生成算法也不是很方便,没有通用性、记忆的性价比也不够高。不论是实际应用还是面试,我想都不是一个好的选择。
常见的几个语言中: Java String 类中的 indexOf() 方法,C++ 的 strstr() 函数,使用了 O(n*m) 的暴力比较;Golang strings 包中的 Index() 方法中使用了下述 Rabin-Karp 算法,时间复杂度 O(n+m),同样没有使用 KMP。Golang 使用 R-K 配合上其对 Bytes 序列的优化,达成了较好的运算速度。
Rabin Karp 依赖 —— Rolling Hash
Rabin Karp 的 O(m+n) 快速最重要的原因就是使用了 Rolling Hash Function。什么意思呢?这个函数可以被下列的两个公式定义:
$$ H_{0,n} = S[0] * p^{n-1} + S[1] * p^{n-2} + … + S[n-1] * p^{0} $$
$$ H_{i,j} = H_{0,j} – H_{0,i} * p^{j-i} $$
其中 P 为某个特定的质数Base,S 为字符串,j、i 为对应字符的 Index,最终为了防止溢出,对 P 取计算结果的模数。
Rolling Hash 在 R-K 算法中的 Golang 实现
由于一个 ASCII 字符是 uint8
,而 Golang 默认的编码是 UTF-8,意味着可能最长会面临一个字符为 4 个字节表示的情况。加之为了保证基于除留余数法的 Rolling Hash 尽可能减少碰撞,Rabin Karp 的实现和 FNV Hash 的实现使用了相同的 Prime Base const PrimeRK = 16777619
,长度为 32 位 uint32
。至于这个 Prime Base 怎么来的,我也不知道从何考证。值得一提的是:16777619 = 2^24 + 2^8 + 0x93
这样的一个 Rolling Hash 作为一个 Non-Cryptographic Hash ,相对来说存在碰撞的概率可能比 Cryptographic Hash 要大一些,所以我们也不能直接用 Hash 来判断两个字符串相等,因此即使 Hash 相等,也应当再进行一次字符串严格相等的判定。更多的内容请参见代码注释。
// primeRK is the prime base used in Rabin-Karp algorithm. const primeRK = 16777619 // hashStr returns the hash and the appropriate multiplicative // factor for use in Rabin-Karp algorithm. func hashStr(sep string) (uint32, uint32) { // initialize the 32bit val. hash := uint32(0) // calculate string hash for i := 0; i < len(sep); i++ { hash = hash*primeRK + uint32(sep[i]) } // sq here as the positional notation // which means here is primeRK notation var pow, sq uint32 = 1, primeRK for i := len(sep); i > 0; i >>= 1 { // traverse the binary bit to simplify power factor calculation // reduce the calculation times, only calc when LSB == 1 if i&1 != 0 { // just like decimal: 10 -> 100 -> 1000 pow *= sq } sq *= sq } // use uint32 as storage, which can be seen as the module operation with 2^32-1 // the pow value will be used later for fast-rolling return hash, pow }
Rabin Karp 的 Golang 实现
// Index returns the index of the first instance of substr in s, or -1 if substr is not present in s. func Index(s, substr string) int { n := len(substr) // special cases first switch { case n == 0: // substr == "" return 0 case n == 1: // return if the first byte is equal, golang optimized speed for bytes return IndexByte(s, substr[0]) case n == len(s): // if same length, just check equal or not if substr == s { return 0 } return -1 case n > len(s): // if longer than parent one, never can match return -1 case n <= bytealg.MaxLen: // use internal method written by assembly, speed up when bruteforce // Use brute force when s and substr both are small if len(s) <= bytealg.MaxBruteForce { return bytealg.IndexString(s, substr) } c0 := substr[0] c1 := substr[1] i := 0 t := len(s) - n + 1 fails := 0 // set a failure count for i < t { if s[i] != c0 { // IndexByte is faster than bytealg.IndexString, so use it as long as // we're not getting lots of false positives. o := IndexByte(s[i:t], c0) if o < 0 { return -1 } i += o } if s[i+1] == c1 && s[i:i+n] == substr { return i } fails++ // if not match when using indexbyte i++ // Switch to bytealg.IndexString when IndexByte produces too many false positives. if fails > bytealg.Cutover(i) { r := bytealg.IndexString(s[i:], substr) if r >= 0 { return r + i } return -1 } } return -1 } c0 := substr[0] // repeat above again to check bruteforce working or not c1 := substr[1] i := 0 t := len(s) - n + 1 fails := 0 for i < t { if s[i] != c0 { o := IndexByte(s[i:t], c0) if o < 0 { return -1 } i += o } if s[i+1] == c1 && s[i:i+n] == substr { return i } i++ fails++ if fails >= 4+i>>4 && i < t { // Give up on IndexByte, it isn't skipping ahead // far enough to be better than Rabin-Karp. // Experiments (using IndexPeriodic) suggest // the cutover is about 16 byte skips. // TODO: if large prefixes of sep are matching // we should cutover at even larger average skips, // because Equal becomes that much more expensive. // This code does not take that effect into account. j := indexRabinKarp(s[i:], substr) if j < 0 { return -1 } return i + j } } return -1 } func indexRabinKarp(s, substr string) int { // Rabin-Karp search hashss, pow := hashStr(substr) // if calculate substr hash and remember power factor n := len(substr) var h uint32 for i := 0; i < n; i++ { // calculate the first n char of s h = h*primeRK + uint32(s[i]) } if h == hashss && s[:n] == substr { // if the first n char matched return 0 } for i := n; i < len(s); { h *= primeRK // position notation ++ h += uint32(s[i]) // add current char h -= pow * uint32(s[i-n]) // remove the last char with corresponding power, which was defined as rolling i++ // next byte if h == hashss && s[i-n:i] == substr { // if hash match, to prevent collision return i - n } } // if not match, return -1 return -1 }
参考文献
- https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
- https://en.wikipedia.org/wiki/Rolling_hash
- https://blog.cyeam.com/golang/2015/01/15/go_index
- https://golang.org/src/hash/fnv/fnv.go
- https://www.ctolib.com/topics-6819.html
- https://www.cnblogs.com/golove/p/3234673.html
- https://golang.org/src/strings/strings.go
- https://en.wikipedia.org/wiki/Rabin_fingerprint
- http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html
- https://ti-tanium.github.io/2019/07/10/%E5%AD%97%E7%AC%A6%E4%B8%B2%E5%8C%B9%E9%85%8D%E7%AE%97%E6%B3%95%E2%80%94%E2%80%94%E2%80%94Sunday/
- https://segmentfault.com/a/1190000004222741
- @hxt_tg thanks to his contribution for excellent analysis.