rclean

Crates.io: rclean
lib.rs: rclean
version: 0.1.1
created_at: 2025-07-16 21:21:36.828371+00
updated_at: 2025-07-16 21:21:36.828371+00
description: A high-performance Rust-based disk cleanup tool that finds duplicates and storage outliers
homepage: https://github.com/paiml/rclean
repository: https://github.com/paiml/rclean
id: 1756639
size: 257,245
Noah Gift (noahgift)

documentation: https://docs.rs/rclean

README

RClean

A high-performance Rust-based disk cleanup tool that finds duplicate files and storage outliers.

Features

  • Duplicate Detection: Find duplicate files using MD5 hashing with parallel processing
  • Similar File Detection: Identify similar files using fuzzy matching algorithms
  • Storage Outliers: Detect large files, hidden space consumers, and file patterns
  • Cluster Analysis: Find groups of similar large files using DBSCAN clustering
  • Fast Performance: Leverages Rust's parallelization with Rayon
  • Multiple Output Formats: Table, JSON, CSV reports
  • MCP Support: Can be used as an MCP (Model Context Protocol) server
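
To make the hash-based duplicate detection concrete, here is a minimal Python sketch of the idea. It is an illustration only, not RClean's Rust implementation: the function names `md5_of` and `find_duplicates` are hypothetical, and the parallel hashing that RClean delegates to Rayon is left out for brevity.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each MD5 digest to the sorted list of files under root that share it."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[md5_of(path)].append(path)
    # Only digests seen more than once describe duplicate groups.
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}
```

Calling `find_duplicates(".")` returns one entry per group of byte-identical files. A real tool would typically compare file sizes first so that files with a unique size are never hashed.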

Installation

# From source
git clone https://github.com/paiml/rclean.git
cd rclean
cargo install --path .

# Or directly from GitHub
cargo install --git https://github.com/paiml/rclean.git

Quick Start

# Scan current directory for duplicates
rclean

# Scan specific directory
rclean /path/to/directory

# Filter by pattern
rclean ~/Documents --pattern "*.pdf" --pattern-type glob

# Generate CSV report
rclean . --csv duplicate_report.csv

# Find similar files (fuzzy matching) with a 70% similarity threshold
rclean ~/Documents --similarity 70

Storage Outliers Detection (NEW!)

Find files that are consuming disproportionate disk space:

# Find large file outliers
rclean outliers /path --min-size 100MB

# Find hidden space consumers (node_modules, .git, etc.)
rclean outliers ~ --check-hidden --format json

# Find file patterns (backups, logs, etc.)
rclean outliers . --check-patterns

# Export outliers report
rclean outliers . --csv outliers_report.csv

# Combine all features
rclean outliers ~ --min-size 50MB --check-hidden --check-patterns --top 50

# Enable clustering to find groups of similar large files
rclean outliers /path --cluster --cluster-similarity 80 --min-cluster-size 3
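
For readers unfamiliar with DBSCAN, the following Python sketch shows the clustering idea on a toy input. It is not RClean's code. The function `dbscan`, the use of plain file sizes as input, and the mapping of `--cluster-similarity` and `--min-cluster-size` onto `eps` and `min_samples` are all assumptions made for illustration.

```python
def dbscan(items, distance, eps, min_samples):
    """Return one label per item: 0, 1, ... for clusters and -1 for noise."""
    n = len(items)
    labels = [None] * n  # None means "not visited yet"
    # Precompute eps-neighborhoods; every point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if distance(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, so keep expanding
    return labels


# Toy input: file sizes in MB, compared by absolute difference.
sizes_mb = [100, 101, 102, 500, 501, 502, 900]
labels = dbscan(sizes_mb, lambda a, b: abs(a - b), eps=2, min_samples=3)
```

Here the two groups of three near-equal sizes form clusters 0 and 1, and the isolated 900 MB file is labeled as noise (-1).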

Outliers Detection Features:

  • Statistical Analysis: Flags files whose size is more than a given number of standard deviations above the mean
  • Hidden Consumers: Detects node_modules, .git, .cache, and other known space hogs
  • Pattern Detection: Finds groups of similar files (backup-001, backup-002, etc.)
  • Cluster Analysis: Uses DBSCAN to find clusters of similar large files (e.g., different versions of the same document)
  • Smart Recommendations: Provides cleanup suggestions for each type of outlier
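
The statistical analysis item can be illustrated with a z-score check. This is a minimal Python sketch under stated assumptions: the function name `size_outliers` and the default threshold of 2.0 standard deviations are hypothetical, and RClean's actual cutoff is not documented here.

```python
from statistics import mean, pstdev


def size_outliers(sizes, threshold=2.0):
    """Return (name, z_score) pairs, largest first, for unusually large files.

    `sizes` maps a file name to its size in bytes. The default `threshold`
    is an assumed value, not one taken from RClean.
    """
    values = list(sizes.values())
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every file has the same size, so nothing stands out
    scored = [(name, (size - mu) / sigma) for name, size in sizes.items()]
    flagged = [(name, z) for name, z in scored if z > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

With nine 100-byte files and one 10,000-byte file, the mean is 1,090 and the population standard deviation is 2,970, so the large file scores exactly 3.0 and is the only one flagged.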

Fuzzy Matching (Similarity Detection)

Find files that are similar but not identical:

# Find files with 70% or higher similarity
rclean ~/Documents --similarity 70

# Find similar Python files
rclean ~/code --pattern "*.py" --pattern-type glob --similarity 80

# Generate CSV report including similar files
rclean . --similarity 60 --csv similarity_report.csv

Use Cases:

  • Different versions of documents (v1, v2, draft, final)
  • Slightly modified code files
  • Images with minor edits
  • Reports with small updates
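
As a rough picture of what a percentage similarity threshold means, the Python sketch below scores pairs of texts with the standard library's `difflib.SequenceMatcher`. The README does not say which fuzzy algorithm RClean uses, so this is a stand-in technique, and the names `similarity_percent` and `similar_pairs` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity_percent(a, b):
    """Return a 0-100 similarity score for two strings."""
    # autojunk=False keeps scores stable for long inputs with repeated characters.
    return SequenceMatcher(None, a, b, autojunk=False).ratio() * 100


def similar_pairs(documents, threshold=70):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold."""
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(documents.items()), 2):
        score = similarity_percent(text_a, text_b)
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 1)))
    return pairs
```

Comparing every pair is quadratic in the number of files, which is one reason to narrow the candidate set first, for example with `--pattern`.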

Advanced Pattern Matching

RClean supports ripgrep-style pattern matching:

Pattern Types

  • Literal (default): Simple substring matching

    rclean search --path . --pattern ".txt"
    
  • Glob: Shell-style patterns

    rclean search --path . --pattern "*.txt" --pattern-type glob
    rclean search --path . --pattern "**/*.rs" --pattern-type glob
    
  • Regex: Full regular expression support

    rclean search --path . --pattern "test_.*\.rs$" --pattern-type regex
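
The difference between the three pattern types is easiest to see side by side. The Python sketch below is an approximation rather than RClean's matcher: the function `matches` is hypothetical, and Python's `fnmatch` does not treat `**` or `/` the way ripgrep-style globs do.

```python
import fnmatch
import re


def matches(name, pattern, pattern_type="literal"):
    """Return True if name matches pattern under the given pattern type."""
    if pattern_type == "literal":
        return pattern in name  # plain substring test
    if pattern_type == "glob":
        return fnmatch.fnmatchcase(name, pattern)  # must match the whole name
    if pattern_type == "regex":
        return re.search(pattern, name) is not None  # may match anywhere
    raise ValueError(f"unknown pattern type: {pattern_type}")
```

Note that the literal pattern `.txt` also matches `notes.txt.bak`, while the glob `*.txt` does not, because a glob has to match the whole name.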
    

Additional Options

  • --hidden: Include hidden files
  • --no-ignore: Do not respect .gitignore rules (files they exclude are scanned too)
  • --max-depth <N>: Maximum directory depth to traverse
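
The effect of the hidden-file and depth options on a directory walk can be sketched in Python as follows. The function `walk_files` and its parameters are hypothetical, the depth convention (files directly inside the root are at depth 1) is an assumption borrowed from ripgrep, and .gitignore handling is omitted entirely.

```python
import os


def walk_files(root, include_hidden=False, max_depth=None):
    """Yield file paths under root, honoring hidden-file and depth settings."""
    root = os.path.abspath(root)
    base = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        # Files directly inside root are at depth 1, one level down at depth 2, ...
        file_depth = dirpath.rstrip(os.sep).count(os.sep) - base + 1
        if not include_hidden:
            # Prune in place so os.walk never descends into hidden directories.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        if max_depth is not None and file_depth >= max_depth:
            dirnames[:] = []  # depth limit reached: do not descend further
        if max_depth is not None and file_depth > max_depth:
            continue
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Pruning `dirnames` in place is what keeps large hidden trees such as `.git` from being traversed at all, rather than being walked and filtered afterwards.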

MCP Server Mode

RClean can run as an MCP server for integration with AI assistants:

# Run as MCP server
rclean  # Will auto-detect MCP mode when piped
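
One common way for a CLI to notice that it is being piped is to check whether standard input is a terminal. The Python sketch below shows that idea only; whether RClean detects MCP mode this way is an assumption, and `select_mode` with its mode names is hypothetical.

```python
import sys


def select_mode(stdin_is_tty):
    """Pick a run mode: normal CLI on a terminal, MCP server when input is piped."""
    return "cli" if stdin_is_tty else "mcp-server"


if __name__ == "__main__":
    # sys.stdin.isatty() is False when another process feeds stdin through a pipe.
    print(select_mode(sys.stdin.isatty()))
```

In Rust, the analogous check is available through the standard library's `std::io::IsTerminal` trait.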

Building and Development

Quality Standards ✅

All lint checks now pass! The project follows PMAT (Production Manufacturing and Assembly Technology) quality standards with zero tolerance for warnings.

# Build and test
make all

# Development commands
make format        # Format code
make lint          # Run clippy linting (FIXED - passes cleanly!)
make lint-extreme  # Run extreme linting with PMAT standards
make test          # Run all tests
make test-examples # Run example tests (NEW!)

# Build variants
make build-release # Release build for production

# Quality assurance
make quality-gate  # Run all quality checks
make format-check  # Verify formatting

Recent Improvements (v0.1.1)

  • Fixed all clippy warnings - make lint now passes without errors
  • Added example test coverage - New make test-examples target
  • Improved code quality - Reduced function complexity with better abstractions
  • Enhanced CI/CD readiness - All quality gates pass consistently

OS X Install

  • Install Rust via rustup
  • Add the following to ~/.cargo/config:

    [target.x86_64-apple-darwin]
    rustflags = [
      "-C", "link-arg=-undefined",
      "-C", "link-arg=dynamic_lookup",
    ]

    [target.aarch64-apple-darwin]
    rustflags = [
      "-C", "link-arg=-undefined",
      "-C", "link-arg=dynamic_lookup",
    ]

  • Run make all in the rclean directory

License

MIT
