byte_frequencies.rs |
|
4431 |
genericsimd.rs |
|
10776 |
mod.rs |
!
This module provides forward and reverse substring search routines.
Unlike the standard library's substring search routines, these work on
arbitrary bytes. For all non-empty needles, these routines will report exactly
the same values as the corresponding routines in the standard library. For
the empty needle, the standard library reports matches only at valid UTF-8
boundaries, where as these routines will report matches at every position.
Other than being able to work on arbitrary bytes, the primary reason to prefer
these routines over the standard library routines is that these will generally
be faster. In some cases, significantly so.
# Example: iterating over substring matches
This example shows how to use [`find_iter`] to find occurrences of a substring
in a haystack.
```
use memchr::memmem;
let haystack = b"foo bar foo baz foo";
let mut it = memmem::find_iter(haystack, "foo");
assert_eq!(Some(0), it.next());
assert_eq!(Some(8), it.next());
assert_eq!(Some(16), it.next());
assert_eq!(None, it.next());
```
# Example: iterating over substring matches in reverse
This example shows how to use [`rfind_iter`] to find occurrences of a substring
in a haystack starting from the end of the haystack.
*NOTE:** This module does not implement double ended iterators, so reverse
searches aren't done by calling `rev` on a forward iterator.
```
use memchr::memmem;
let haystack = b"foo bar foo baz foo";
let mut it = memmem::rfind_iter(haystack, "foo");
assert_eq!(Some(16), it.next());
assert_eq!(Some(8), it.next());
assert_eq!(Some(0), it.next());
assert_eq!(None, it.next());
```
# Example: repeating a search for the same needle
It may be possible for the overhead of constructing a substring searcher to be
measurable in some workloads. In cases where the same needle is used to search
many haystacks, it is possible to do construction once and thus to avoid it for
subsequent searches. This can be done with a [`Finder`] (or a [`FinderRev`] for
reverse searches).
```
use memchr::memmem;
let finder = memmem::Finder::new("foo");
assert_eq!(Some(4), finder.find(b"baz foo quux"));
assert_eq!(None, finder.find(b"quux baz bar"));
```
|
44571 |
prefilter |
|
|
rabinkarp.rs |
This module implements the classical Rabin-Karp substring search algorithm,
with no extra frills. While its use would seem to break our time complexity
guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only
ever use RK on a constant subset of haystacks. The main point here is that
RK has good latency properties for small needles/haystacks. It's very quick
to compute a needle hash and zip through the haystack when compared to
initializing Two-Way, for example. And this is especially useful for cases
where the haystack is just too short for vector instructions to do much good.
The hashing function used here is the same one recommended by ESMAJ.
Another choice instead of Rabin-Karp would be Shift-Or. But its latency
isn't quite as good since its preprocessing time is a bit more expensive
(both in practice and in theory). However, perhaps Shift-Or has a place
somewhere else for short patterns. I think the main problem is that it
requires space proportional to the alphabet and the needle. If we, for
example, supported needles up to length 16, then the total table size would be
len(alphabet)*size_of::<u16>()==512 bytes. Which isn't exactly small, and it's
probably bad to put that on the stack. So ideally, we'd throw it on the heap,
but we'd really like to write as much code without using alloc/std as possible.
But maybe it's worth the special casing. It's a TODO to benchmark.
Wikipedia has a decent explanation, if a bit heavy on the theory:
https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
But ESMAJ provides something a bit more concrete:
http://www-igm.univ-mlv.fr/~lecroq/string/node5.html
Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases:
https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs
|
8315 |
rarebytes.rs |
|
6208 |
twoway.rs |
|
32520 |
util.rs |
|
3778 |
vector.rs |
|
3998 |
wasm.rs |
|
2466 |
x86 |
|
|