Creating a sliding window iterator of slices of chars from a String
The problem you are facing is that String is really represented as something like a Vec<u8> under the hood, with some APIs to let you access chars. In UTF-8, the representation of a code point can be anywhere from 1 to 4 bytes long, and the encoded code points are packed together for space efficiency.
The only slice you could get directly of an entire String, without copying everything, would be a &[u8], but you wouldn't know whether the bytes corresponded to whole code points or just parts of them.
The char type corresponds to a single Unicode scalar value (any code point except a surrogate), and therefore has a size of 4 bytes, so that it can accommodate any possible value. So, if you build a slice of char by copying from a String, the result could be up to 4 times larger.
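To make the size difference concrete, here is a small sketch that compares the bytes a string occupies as UTF-8 with the bytes the same text would occupy as a buffer of char values. The helper name storage_sizes is made up for this illustration.

```rust
use std::mem::size_of;

/// Hypothetical helper: returns (bytes used as UTF-8, bytes used as a buffer of `char`).
fn storage_sizes(s: &str) -> (usize, usize) {
    let utf8_bytes = s.len(); // `len()` counts bytes, not chars
    let char_bytes = s.chars().count() * size_of::<char>(); // every `char` is 4 bytes
    (utf8_bytes, char_bytes)
}

fn main() {
    for s in ["abc", "日本語"].iter() {
        let (utf8, chars) = storage_sizes(s);
        println!("{:?}: {} bytes as UTF-8, {} bytes as chars", s, utf8, chars);
    }
}
```

For pure ASCII text the char buffer is 4 times larger; for text made of 3-byte code points the overhead is smaller, but the copy is still an extra allocation.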
To avoid making a potentially large, temporary memory allocation, you should consider a lazier approach: iterate through the String, making slices at exactly the char boundaries. Something like this:
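fn char_windows<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
    // win_size must be at least 1, otherwise `win_size - 1` underflows.
    src.char_indices().flat_map(move |(from, _)| {
        // Find the last char of the window that starts at byte offset `from`.
        src[from..]
            .char_indices()
            .nth(win_size - 1)
            // `to` is relative to `from`; extend the slice to cover the whole last char.
            .map(|(to, c)| &src[from..from + to + c.len_utf8()])
    })
}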
This will give you an iterator where the items are &str, each containing win_size chars (3 in this example):
let tst = String::from("abcdefg");
let windows = char_windows(&tst, 3);
for win in windows {
    println!("{:?}", win);
}
The nice thing about this approach is that it doesn't do any copying at all: each &str produced by the iterator is still a slice into the original source String.
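As a quick check of the approach described above, the following sketch runs a window iterator over a string that mixes 1-, 2- and 3-byte characters and verifies that every window lies inside the source buffer. The function is restated here only so that the example compiles on its own, and it assumes win_size is at least 1.

```rust
// Restated from the answer so this sketch is self-contained; assumes win_size >= 1.
fn char_windows<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
    src.char_indices().flat_map(move |(from, _)| {
        src[from..]
            .char_indices()
            .nth(win_size - 1)
            .map(|(to, c)| &src[from..from + to + c.len_utf8()])
    })
}

fn main() {
    let src = String::from("aé日本z");
    let wins: Vec<&str> = char_windows(&src, 3).collect();
    println!("{:?}", wins);

    // Each window is a sub-slice of `src`: its bytes live inside src's buffer.
    let start = src.as_ptr() as usize;
    let end = start + src.len();
    for w in &wins {
        let p = w.as_ptr() as usize;
        assert!(p >= start && p + w.len() <= end);
    }
}
```

If the source has fewer chars than the window size, the iterator simply yields nothing.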
All of that complexity is because Rust strings are always UTF-8 encoded. If you know for certain that your input string doesn't contain any multi-byte characters, you can treat it as ASCII bytes, and taking slices becomes easy:
let tst = String::from("abcdefg");
let inter = tst.as_bytes();
let windows = inter.windows(3);
However, you now have slices of bytes, and you'll need to turn them back into strings to do anything with them:
for win in windows {
println!("{:?}", String::from_utf8_lossy(win));
}
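Since the byte-window shortcut is only safe for ASCII input, one way to guard it is to check str::is_ascii first. The sketch below does that and converts each window back to a &str without copying; the helper name ascii_windows is made up for this illustration.

```rust
/// Hypothetical helper: returns the windows as `&str`, or `None` if `src` is not pure ASCII.
fn ascii_windows(src: &str, win_size: usize) -> Option<Vec<&str>> {
    if !src.is_ascii() {
        return None;
    }
    Some(
        src.as_bytes()
            .windows(win_size) // note: slice::windows panics if win_size is 0
            // Any run of ASCII bytes is valid UTF-8, so this conversion cannot fail here.
            .map(|w| std::str::from_utf8(w).expect("ASCII bytes are valid UTF-8"))
            .collect(),
    )
}

fn main() {
    println!("{:?}", ascii_windows("abcdefg", 3));
    println!("{:?}", ascii_windows("日本語", 2));
}
```

Using std::str::from_utf8 instead of String::from_utf8_lossy keeps each window as a borrowed slice rather than a possibly owned value.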
You can use itertools to walk over windows of any iterator, up to a width of 4:
extern crate itertools; // 0.7.8

use itertools::Itertools;

fn main() {
    let input = "日本語";

    // The tuple arity in the pattern selects the window width (2 here).
    for (a, b) in input.chars().tuple_windows() {
        println!("{}, {}", a, b);
    }
}
See also:
- Are there equivalents to slice::chunks/windows for iterators to loop over pairs, triplets etc?
This solution will work for your purpose. (playground)
fn main() {
    let tst = String::from("abcdefg");
    let inter = tst.chars().collect::<Vec<char>>();
    let mut windows = inter.windows(3);

    // prints ['a', 'b', 'c']
    println!("{:?}", windows.next().unwrap());
    // prints ['b', 'c', 'd']
    println!("{:?}", windows.next().unwrap());
    // etc...
    println!("{:?}", windows.next().unwrap());
}
A String can iterate over its chars, but it's not a slice of chars, so you have to collect them into a Vec<char>, which then coerces into a slice.
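If you need strings rather than &[char] windows, each window can be collected back into a String. This is a sketch of that extra step, with the made-up helper name owned_char_windows; note that it allocates one String per window on top of the Vec<char>.

```rust
/// Hypothetical helper: collects the chars once, then turns each window into an owned String.
fn owned_char_windows(src: &str, win_size: usize) -> Vec<String> {
    let chars: Vec<char> = src.chars().collect();
    chars
        .windows(win_size) // note: slice::windows panics if win_size is 0
        .map(|w| w.iter().collect::<String>())
        .collect()
}

fn main() {
    for win in owned_char_windows("日本語だ", 2) {
        println!("{}", win);
    }
}
```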