| Crates.io | regex-chunker |
| lib.rs | regex-chunker |
| version | 0.3.0 |
| created_at | 2023-07-14 17:21:41.203438+00 |
| updated_at | 2023-07-23 00:54:53.769023+00 |
| description | Iterate over the data in a `Read` type in a regular-expression-delimited way. |
| homepage | https://github.com/d2718/regex-chunker |
| repository | https://github.com/d2718/regex-chunker |
| max_upload_size | |
| id | 916516 |
| size | 87,919 |
Splitting output from `Read` types with regular expressions.

The chief type in this crate is the `ByteChunker`, which wraps a type that implements `Read` and iterates over chunks of its byte stream delimited by a supplied regular expression. The following example reads from the standard input and prints word counts:
```rust
use std::collections::BTreeMap;
use std::error::Error;

use regex_chunker::ByteChunker;

fn main() -> Result<(), Box<dyn Error>> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    let stdin = std::io::stdin();

    // The regex is a stab at something matching strings of
    // "between-word" characters in general English text.
    let chunker = ByteChunker::new(stdin, r#"[ "\r\n.,!?:;/]+"#)?;

    for chunk in chunker {
        let word = String::from_utf8_lossy(&chunk?).to_lowercase();
        *counts.entry(word).or_default() += 1;
    }

    println!("{:#?}", &counts);
    Ok(())
}
```
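Conceptually, a chunker of this sort buffers reads from its source and yields the bytes preceding each delimiter match. The idea can be sketched with plain `std`, using a fixed single-byte delimiter standing in for the regex (the `ByteSplitter` type below is purely illustrative and is not part of this crate's API):

```rust
use std::io::Read;

// Illustrative sketch (not the crate's API): iterate over chunks of a
// `Read` stream delimited by a single byte, buffering input much as a
// regex-delimited chunker must.
struct ByteSplitter<R: Read> {
    source: R,
    buf: Vec<u8>,
    delim: u8,
    done: bool,
}

impl<R: Read> ByteSplitter<R> {
    fn new(source: R, delim: u8) -> Self {
        ByteSplitter { source, buf: Vec::new(), delim, done: false }
    }
}

impl<R: Read> Iterator for ByteSplitter<R> {
    type Item = Vec<u8>;

    fn next(&mut self) -> Option<Vec<u8>> {
        let d = self.delim;
        loop {
            // If the buffer already holds a delimiter, split there.
            if let Some(pos) = self.buf.iter().position(|&b| b == d) {
                let mut chunk: Vec<u8> = self.buf.drain(..=pos).collect();
                chunk.pop(); // drop the delimiter itself
                return Some(chunk);
            }
            if self.done {
                // Source exhausted: flush any remainder, then stop.
                if self.buf.is_empty() {
                    return None;
                }
                return Some(std::mem::take(&mut self.buf));
            }
            // Otherwise, pull more bytes from the source.
            let mut block = [0u8; 1024];
            match self.source.read(&mut block) {
                Ok(0) | Err(_) => self.done = true,
                Ok(n) => self.buf.extend_from_slice(&block[..n]),
            }
        }
    }
}

fn main() {
    let input: &[u8] = b"alpha beta gamma";
    let words: Vec<String> = ByteSplitter::new(input, b' ')
        .map(|c| String::from_utf8_lossy(&c).into_owned())
        .collect();
    println!("{:?}", words); // ["alpha", "beta", "gamma"]
}
```

The real `ByteChunker` differs in the details (regex matching, error handling, configurable match disposition), but the buffer-search-drain loop is the essential shape.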
The `async` feature enables the `stream` submodule, which contains an asynchronous version of `ByteChunker` that wraps a `tokio::io::AsyncRead` type and produces a `Stream` of byte chunks.
If you want to run the tests for the async features, you need to first build `src/bin/slowsource.rs` with the `async` and `test` features enabled:

```shell
$ cargo build --bin slowsource --all-features
```
Some of the [stream] module tests run it in a subprocess and use it as
a source of bytes.
This is, as yet, an essentially naive implementation. Remaining questions and tasks:

  * What can be done to optimize performance?
  * Is there room to tighten up the `RcErr` type?
  * When non-overlapping blanket impls (1672, maybe 20400) land, remove both of the `SimpleCustomChunker` types.