archive-to-parquet

Crates.ioarchive-to-parquet
lib.rsarchive-to-parquet
version
sourcesrc
created_at2024-11-28 22:51:22.007523
updated_at2024-12-11 14:36:41.26667
descriptionConverts archive files to parquet files
homepage
repositoryhttps://github.com/orf/archive-to-parquet
max_upload_size
id1465042
Cargo.toml error:TOML parse error at line 17, column 1 | 17 | autolib = false | ^^^^^^^ unknown field `autolib`, expected one of `name`, `version`, `edition`, `authors`, `description`, `readme`, `license`, `repository`, `homepage`, `documentation`, `build`, `resolver`, `links`, `default-run`, `default_dash_run`, `rust-version`, `rust_dash_version`, `rust_version`, `license-file`, `license_dash_file`, `license_file`, `licenseFile`, `license_capital_file`, `forced-target`, `forced_dash_target`, `autobins`, `autotests`, `autoexamples`, `autobenches`, `publish`, `metadata`, `keywords`, `categories`, `exclude`, `include`
size0
Tom Forbes (orf)

documentation

README

archive-to-parquet

This is a small tool that reads a number of archive files and writes the content to a parquet file.

Features:

  • Supports zip, tar, tar.gz archives
  • Archive members are hashed with SHA256, which is included in the output
  • Recursive extraction of archives within archives

Example: extracting all files within a Docker image

$ skopeo copy docker://python:latest oci:docker-image/ --all
$ archive-to-parquet output.parquet docker-image/blobs/**/*
2024-11-28T22:45:52.885030Z  INFO extract: archive_to_parquet::formats: Output 5 records from docker-image/blobs/sha256/84bd722ec005c4b9a8d4ce74d1547245ee36e178a58fbca74ea8a88b83557a2a depth=0 self=tar.gz
...
2024-11-28T22:45:59.885030Z  INFO All done. Wrote 234263 rows

Usage

$ archive-to-parquet --help
Usage: archive-to-parquet [OPTIONS] <OUTPUT> [PATHS]...

Arguments:
  <OUTPUT>    Output Parquet file to create
  [PATHS]...  Input paths to read

Options:
  -m, --max-depth <MAX_DEPTH>  Recursion depth How many times to recurse into nested archives
      --min-size <MIN_SIZE>    Min file size to output. Files below this size are skipped [default: 300b]
      --max-size <MAX_SIZE>    Max file size to output. Files above this size are skipped
      --unique                 Only output unique files by hash
      --only-text              Only output text files, skipping binary files
  -h, --help                   Print help
Commit count: 22

cargo fmt