Name Description Size
byte_frequencies.rs 4431
genericsimd.rs 10776
mod.rs ! This module provides forward and reverse substring search routines. Unlike the standard library's substring search routines, these work on arbitrary bytes. For all non-empty needles, these routines will report exactly the same values as the corresponding routines in the standard library. For the empty needle, the standard library reports matches only at valid UTF-8 boundaries, where as these routines will report matches at every position. Other than being able to work on arbitrary bytes, the primary reason to prefer these routines over the standard library routines is that these will generally be faster. In some cases, significantly so. # Example: iterating over substring matches This example shows how to use [`find_iter`] to find occurrences of a substring in a haystack. ``` use memchr::memmem; let haystack = b"foo bar foo baz foo"; let mut it = memmem::find_iter(haystack, "foo"); assert_eq!(Some(0), it.next()); assert_eq!(Some(8), it.next()); assert_eq!(Some(16), it.next()); assert_eq!(None, it.next()); ``` # Example: iterating over substring matches in reverse This example shows how to use [`rfind_iter`] to find occurrences of a substring in a haystack starting from the end of the haystack. *NOTE:** This module does not implement double ended iterators, so reverse searches aren't done by calling `rev` on a forward iterator. ``` use memchr::memmem; let haystack = b"foo bar foo baz foo"; let mut it = memmem::rfind_iter(haystack, "foo"); assert_eq!(Some(16), it.next()); assert_eq!(Some(8), it.next()); assert_eq!(Some(0), it.next()); assert_eq!(None, it.next()); ``` # Example: repeating a search for the same needle It may be possible for the overhead of constructing a substring searcher to be measurable in some workloads. In cases where the same needle is used to search many haystacks, it is possible to do construction once and thus to avoid it for subsequent searches. This can be done with a [`Finder`] (or a [`FinderRev`] for reverse searches). ``` use memchr::memmem; let finder = memmem::Finder::new("foo"); assert_eq!(Some(4), finder.find(b"baz foo quux")); assert_eq!(None, finder.find(b"quux baz bar")); ``` 44571
prefilter
rabinkarp.rs This module implements the classical Rabin-Karp substring search algorithm, with no extra frills. While its use would seem to break our time complexity guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only ever use RK on a constant subset of haystacks. The main point here is that RK has good latency properties for small needles/haystacks. It's very quick to compute a needle hash and zip through the haystack when compared to initializing Two-Way, for example. And this is especially useful for cases where the haystack is just too short for vector instructions to do much good. The hashing function used here is the same one recommended by ESMAJ. Another choice instead of Rabin-Karp would be Shift-Or. But its latency isn't quite as good since its preprocessing time is a bit more expensive (both in practice and in theory). However, perhaps Shift-Or has a place somewhere else for short patterns. I think the main problem is that it requires space proportional to the alphabet and the needle. If we, for example, supported needles up to length 16, then the total table size would be len(alphabet)*size_of::<u16>()==512 bytes. Which isn't exactly small, and it's probably bad to put that on the stack. So ideally, we'd throw it on the heap, but we'd really like to write as much code without using alloc/std as possible. But maybe it's worth the special casing. It's a TODO to benchmark. Wikipedia has a decent explanation, if a bit heavy on the theory: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm But ESMAJ provides something a bit more concrete: http://www-igm.univ-mlv.fr/~lecroq/string/node5.html Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases: https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs 8315
rarebytes.rs 6208
twoway.rs 32520
util.rs 3778
vector.rs 3998
wasm.rs 2466
x86