| Crates.io | zahirscan |
| lib.rs | zahirscan |
| version | 0.2.3 |
| created_at | 2026-01-24 01:38:16.996338+00 |
| updated_at | 2026-01-24 01:38:16.996338+00 |
| description | Token-efficient content compression for AI analysis using probabilistic template mining |
| homepage | https://github.com/thicclatka/zahirscan |
| repository | https://github.com/thicclatka/zahirscan |
| max_upload_size | |
| id | 2065884 |
| size | 483,155 |
"Others will dream that I am mad, while I dream of the Zahir." — JL Borges, Labyrinths
A high-performance Rust CLI tool that extracts templates and patterns from unstructured content and converts them into compact structured formats while preserving essential information. It also extracts metadata from media and document files (images, video, audio, CSV, DOCX, XLSX).
Note: This project is currently a work in progress, so use with caution.
ZahirScan uses probabilistic template mining to extract essential structure and patterns from content. The tool automatically adapts to different content types:
Supported Formats:
All outputs reduce size by 80-95% compared to raw content while preserving essential information.
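To make the idea of a template concrete, here is a minimal sketch of template extraction over log lines. It is a deliberately simplified, deterministic stand-in (any token containing a digit becomes a placeholder), not ZahirScan's probabilistic algorithm. The function names and the `[VAR]` placeholder are hypothetical. The last function shows the arithmetic behind a reduction percentage.

```rust
use std::collections::BTreeMap;

/// Toy heuristic (an assumption, not ZahirScan's rule): a token that
/// contains a digit is treated as variable and replaced by a placeholder.
fn mask_token(token: &str) -> String {
    if token.chars().any(|c| c.is_ascii_digit()) {
        "[VAR]".to_string()
    } else {
        token.to_string()
    }
}

/// Turn one line into a template pattern by masking its variable tokens.
fn to_pattern(line: &str) -> String {
    line.split_whitespace()
        .map(mask_token)
        .collect::<Vec<_>>()
        .join(" ")
}

/// Count how many lines share each pattern.
fn mine_templates(lines: &[&str]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for line in lines {
        *counts.entry(to_pattern(line)).or_insert(0) += 1;
    }
    counts
}

/// Percentage of tokens saved, in the range 0.0-100.0.
fn reduction_percent(original_tokens: usize, compressed_tokens: usize) -> f64 {
    if original_tokens == 0 {
        return 0.0;
    }
    let saved = original_tokens.saturating_sub(compressed_tokens);
    saved as f64 * 100.0 / original_tokens as f64
}
```

With these definitions, two lines such as `2026-01-24 01:38:16 ERROR: disk full` and `2026-01-25 02:00:00 ERROR: disk full` collapse into the single pattern `[VAR] [VAR] ERROR: disk full` with a count of 2, and 1,000 original tokens compressed to 100 corresponds to a 90% reduction.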
memmap2) to handle files larger than available RAM
Rust (stable toolchain)
ffprobe (optional, for video/audio metadata extraction): ffprobe is distributed with FFmpeg. Install FFmpeg: https://ffmpeg.org/download.html
Note: If ffprobe is not installed, ZahirScan will still work for text, log, and image files. Video and audio files will be processed but metadata extraction will be skipped.
# Build from source
cargo build --release
# Process log files
zahirscan -i app.log -o output/
zahirscan -i logs/*.log -o output/
zahirscan -i app.log -o output/ -f # Full metadata mode
# Process text/markdown files (extracts templates and writing footprint)
zahirscan -i document.md -o output/
zahirscan -i docs/*.txt docs/*.md -o output/
# Extract image metadata (dimensions, format, compression, chroma subsampling)
zahirscan -i images/*.jpg images/*.png -o output/ -f
# Extract video metadata (requires ffprobe: codec, resolution, bitrate, frame_rate, etc.)
zahirscan -i videos/*.mp4 -o output/ -f
# Extract audio metadata (codec, bitrate, sample_rate, channels, bit_rate_mode for MP3)
zahirscan -i audio/*.mp3 -o output/ -f
# Extract CSV metadata (row/column counts, data types, statistics)
zahirscan -i data/*.csv -o output/ -f
# Extract DOCX metadata (word count, character count, title, author, dates, revision)
zahirscan -i documents/*.docx -o output/ -f
# Extract XLSX metadata (sheet count, sheet names, row/column counts, core properties)
zahirscan -i spreadsheets/*.xlsx -o output/ -f
# Process multiple file types at once
zahirscan -i logs/*.log docs/*.md images/*.jpg data/*.csv documents/*.docx spreadsheets/*.xlsx -o output/ -f
# Skip media metadata for faster processing
zahirscan -i logs/*.log -o output/ -n
# Redact file paths in output (privacy)
zahirscan -i sensitive.log -o output/ -f -r
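When the CLI is driven from another Rust program rather than from a shell, globs such as logs/*.log are not expanded (the shell does that), so explicit paths have to be passed. The following is a minimal sketch using std::process::Command, assuming the zahirscan binary is on PATH. The helper name zahirscan_command is hypothetical, and only flags shown in the examples above are used.

```rust
use std::path::Path;
use std::process::Command;

/// Build (but do not run) a `zahirscan` invocation.
/// Assumption: the binary is installed and reachable as `zahirscan`.
fn zahirscan_command(
    inputs: &[&Path],
    output_dir: &Path,
    full: bool,
    redact: bool,
) -> Command {
    let mut cmd = Command::new("zahirscan");
    // -i accepts several input files; each path is passed explicitly.
    cmd.arg("-i");
    cmd.args(inputs);
    cmd.arg("-o").arg(output_dir);
    if full {
        cmd.arg("-f"); // full metadata mode
    }
    if redact {
        cmd.arg("-r"); // redact file paths in the output
    }
    cmd
}
```

Calling `.status()` on the returned Command would run the tool; the sketch stops short of that so it can be inspected without the binary installed.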
$ zahirscan --help
Text file and log file parser using probabilistic template mining
Usage: zahirscan [OPTIONS]
Options:
-i, --input <INPUT>...
Input file(s) to parse (can specify multiple)
-o, --output <OUTPUT>
Output folder path (defaults to temp file if not specified).
Creates filename.zahirscan.out in the folder for each input file
-f, --full
Output mode: full metadata (for development/debugging).
Default is templates-only mode (minimal JSON with templates, writing footprint, and media metadata)
-d, --dev
Development mode: enables debug logging.
Default is production mode (info level only)
-r, --redact
Redact file paths in output (show only filename as ***/filename.ext).
Useful for privacy when sharing output JSON
-n, --no-media
Skip media metadata extraction (audio, video, image).
Faster processing when metadata is not needed
-h, --help
Print help
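The ***/filename.ext form described for --redact can be approximated as follows. This is a sketch of the documented behavior, not ZahirScan's implementation. The function name redact_path and the fallback for paths that have no final component are assumptions.

```rust
use std::path::Path;

/// Keep only the final path component and prefix it with `***/`.
fn redact_path(path: &str) -> String {
    match Path::new(path).file_name() {
        Some(name) => format!("***/{}", name.to_string_lossy()),
        // Assumption: paths without a file name (e.g. "/") are fully masked.
        None => "***".to_string(),
    }
}
```

For example, a path like /home/user/logs/sensitive.log would be reported as ***/sensitive.log.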
Output formats:
ZahirScan can be used as a Rust library to extract schemas (templates and metadata) from files programmatically.
The extract_schema() function accepts flexible input types via the ToPathIter trait:
Single file: &str, &String, or String
Multiple files: &[&str], Vec<&str>, &[String], Vec<String>, or arrays like [&str; N]
use zahirscan::{extract_schema, OutputMode};
// Process a single file (accepts &str, &String, or String)
let outputs = extract_schema("app.log", OutputMode::Full)?;
println!("Found {} templates", outputs[0].templates.len());
// Process multiple files (accepts slices, vectors, or arrays)
let files = vec!["file1.log", "file2.log", "file3.log"];
let outputs = extract_schema(files.as_slice(), OutputMode::Full)?;
for output in outputs {
    println!("File: {:?}", output.source);
    println!("Templates: {}", output.templates.len());
}
For a complete working example, see examples/basic_usage.rs. Run it with:
cargo run --example basic_usage -- <input-file>
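One way a single function can accept both one path and many is a small conversion trait implemented for each input shape. The sketch below illustrates that pattern only. PathInputs, into_paths, and collect_inputs are hypothetical names, and the actual definition of ToPathIter in the crate may differ.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for a trait like `ToPathIter`.
trait PathInputs {
    fn into_paths(self) -> Vec<PathBuf>;
}

fn path_of(s: &str) -> PathBuf {
    PathBuf::from(s)
}

// Single-path inputs.
impl PathInputs for &str {
    fn into_paths(self) -> Vec<PathBuf> {
        vec![path_of(self)]
    }
}
impl PathInputs for String {
    fn into_paths(self) -> Vec<PathBuf> {
        vec![PathBuf::from(self)]
    }
}
impl PathInputs for &String {
    fn into_paths(self) -> Vec<PathBuf> {
        vec![path_of(self)]
    }
}

// Multi-path inputs: slices, vectors, and fixed-size arrays of
// anything string-like (&str or String).
impl<S: AsRef<str>> PathInputs for &[S] {
    fn into_paths(self) -> Vec<PathBuf> {
        self.iter().map(|s| path_of(s.as_ref())).collect()
    }
}
impl<S: AsRef<str>> PathInputs for Vec<S> {
    fn into_paths(self) -> Vec<PathBuf> {
        self.iter().map(|s| path_of(s.as_ref())).collect()
    }
}
impl<S: AsRef<str>, const N: usize> PathInputs for [S; N] {
    fn into_paths(self) -> Vec<PathBuf> {
        self.iter().map(|s| path_of(s.as_ref())).collect()
    }
}

/// A function generic over the trait accepts every shape above.
fn collect_inputs<P: PathInputs>(input: P) -> Vec<PathBuf> {
    input.into_paths()
}
```

The design choice here is to normalize every input to a list of paths at the API boundary, so the rest of the processing code only has to deal with one type.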
The extract_schema() function returns Result<Vec<Output>>. Each Output object contains:
Always Present:
templates: Vec<Template> - Extracted template patterns
Mode 2 (Full) Only (all optional):
source: Option<String> - Source file path
file_type: Option<String> - Detected file type (e.g., "log", "text", "image", "video")
line_count: Option<usize> - Number of lines in file
byte_count: Option<usize> - File size in bytes
token_count: Option<usize> - Estimated token count
processing_time_ms: Option<f64> - Processing duration
is_binary: Option<bool> - Whether file is binary
compression: Option<CompressionStats> - Compression metrics
Conditional Fields (present when applicable):
writing_footprint: Option<WritingFootprint> - Writing analysis for text/markdown files
image_metadata: Option<ImageMetadata> - Image metadata (dimensions, format, etc.)
video_metadata: Option<VideoMetadata> - Video metadata (codec, resolution, bitrate, etc.)
audio_metadata: Option<AudioMetadata> - Audio metadata (codec, bitrate, sample rate, etc.)
csv_metadata: Option<CsvMetadata> - CSV metadata (row/column counts, data types, statistics)
pdf_metadata: Option<PdfMetadata> - PDF metadata (page count, document properties, etc.)
docx_metadata: Option<DocumentMetadata> - DOCX/XLSX metadata (word count, sheet count, title, author, dates, etc.)
Each Template contains:
pattern: String - Template pattern with placeholders (e.g., "[DATE] [TIME] ERROR: [MESSAGE]")
count: usize - Number of lines matching this template
examples: BTreeMap<String, Vec<String>> - Example values for each placeholder
WritingFootprint (for text/markdown files) contains:
vocabulary_richness: f64 - Unique words / total words (0.0-1.0)
avg_sentence_length: f64 - Average sentence length in words
punctuation: PunctuationMetrics - Punctuation usage statistics
template_diversity: usize - Number of unique template patterns
avg_entropy: f64 - Average entropy across templates (0.0-1.0)
svo_analysis: Option<SVOAnalysis> - Sentence structure analysis
word_universe: Option<WordUniverse> - Per-document vocabulary corpus for enhanced writing analysis (future enhancement)
Word Universe (when enabled) provides detailed vocabulary analysis:
CompressionStats contains:
original_tokens: usize - Original content token count
compressed_tokens: usize - Compressed template token count
reduction_percent: f64 - Percentage reduction (0.0-100.0)
See config.toml for configuration.
Adaptive Defaults:
max_workers = 0 uses a sensible default based on CPU cores
max_workers
memmap2)
image crate), videos/audio (via ffprobe)
zip and quick_xml crates, calamine for XLSX row/column counts)
ZahirScan implements non-invasive file operations:
This project is dual-licensed under MIT OR Apache-2.0; see the LICENSE-MIT and LICENSE-APACHE files for details.