# warc-parquet

🗄️ A utility for converting WARC to Parquet.

## 📦 Install

The binary may be installed via `cargo`:

```sh
$ cargo install warc-parquet
```

To use the crate in your project, add the following to your `Cargo.toml` file:

```
[dependencies]
warc-parquet = "0.6.1"
```

## 🤸 Usage

### The Binary

Once installed, the `warc-parquet` utility can be used to transform WARC into Parquet:

```sh
$ wget --warc-file example 'https://example.com'
$ cat example.warc.gz | warc-parquet --gzipped > example.zstd.parquet
```

`warc-parquet` is meant to fit organically into the UNIX ecosystem. As such, processing multiple WARCs at once is straightforward:

```sh
$ wget --warc-file github 'https://github.com'
$ cat example.warc.gz github.warc.gz | warc-parquet --gzipped > combined.zstd.parquet
```

It's also simple to preprocess via standard UNIX piping:

```sh
$ cat example.warc.gz | gzip -d | warc-parquet > example.zstd.parquet
```

Various compression options, including the option to forgo compression altogether, are also available:

```sh
$ cat example.warc.gz | warc-parquet --gzipped --compression gzip > example.gz.parquet
```

> 💡 `warc-parquet --help` displays complete options and usage information.

### The Crate

Refer to [the docs](https://docs.rs/warc-parquet) for more details about how to use the `Reader` within your own programs.

### DuckDB

There are any number of ways to consume Parquet once you have it. However, a natural fit might be [DuckDB](https://duckdb.org):

```
$ duckdb
v0.3.3 fe9ba8003
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select type, id from 'example.zstd.parquet';
┌──────────┬─────────────────────────────────────────────────┐
│   type   │                       id                        │
├──────────┼─────────────────────────────────────────────────┤
│ warcinfo │                                                 │
│ request  │                                                 │
│ response │                                                 │
│ metadata │                                                 │
│ resource │                                                 │
│ resource │                                                 │
└──────────┴─────────────────────────────────────────────────┘
D describe select * from 'example.zstd.parquet';
┌─────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│       column_name       │ column_type │ null │ key │ default │ extra │
├─────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ id                      │ VARCHAR     │ YES  │     │         │       │
│ content_length          │ UINTEGER    │ YES  │     │         │       │
│ date                    │ TIMESTAMP   │ YES  │     │         │       │
│ type                    │ VARCHAR     │ YES  │     │         │       │
│ content_type            │ VARCHAR     │ YES  │     │         │       │
│ concurrent_to           │ VARCHAR     │ YES  │     │         │       │
│ block_digest            │ VARCHAR     │ YES  │     │         │       │
│ payload_digest          │ VARCHAR     │ YES  │     │         │       │
│ ip_address              │ VARCHAR     │ YES  │     │         │       │
│ refers_to               │ VARCHAR     │ YES  │     │         │       │
│ target_uri              │ VARCHAR     │ YES  │     │         │       │
│ truncated               │ VARCHAR     │ YES  │     │         │       │
│ warc_info_id            │ VARCHAR     │ YES  │     │         │       │
│ filename                │ VARCHAR     │ YES  │     │         │       │
│ profile                 │ VARCHAR     │ YES  │     │         │       │
│ identified_payload_type │ VARCHAR     │ YES  │     │         │       │
│ segment_number          │ UINTEGER    │ YES  │     │         │       │
│ segment_origin_id       │ VARCHAR     │ YES  │     │         │       │
│ segment_total_length    │ UINTEGER    │ YES  │     │         │       │
│ body                    │ BLOB        │ YES  │     │         │       │
└─────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
```
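
### Mixed Inputs

The preprocessing pipeline from the usage section (`gzip -d | warc-parquet`) can be extended to a set of WARCs where only some files are gzipped. The following is a minimal sketch, not part of `warc-parquet` itself: `warc_cat` is a hypothetical helper name, and it assumes compressed files can be recognized by a `.gz` suffix.

```sh
# Hypothetical helper (not shipped with warc-parquet): write every given
# WARC to stdout, decompressing files whose names end in .gz and passing
# all other files through unchanged.
warc_cat() {
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc -- "$f" || return 1 ;;
            *)    cat -- "$f" || return 1 ;;
        esac
    done
}

# Usage (assumes warc-parquet is installed; the stream is already
# decompressed, so --gzipped is not passed):
#   warc_cat example.warc.gz other.warc | warc-parquet > combined.zstd.parquet
```

Decompressing before the pipe keeps the conversion step identical for every input, at the cost of one extra `gzip` process per compressed file.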