dumpfs

Crate metadata (crates.io):

  • version: 0.1.0
  • created_at: 2025-08-07
  • description: A tool for dumping codebase information for LLMs efficiently and effectively
  • homepage: https://github.com/kkharji/dumpfs
  • repository: https://github.com/kkharji/dumpfs
  • size: 403,984
  • owner: kkharji
  • documentation: https://docs.rs/dumpfs

README

Dumpfs

TL;DR: A tool for dumping codebase information for LLMs efficiently and effectively.

It analyzes a codebase and generates a structured representation that can be fed to large language models (LLMs). It supports local directories, individual files, and remote Git repositories, even scoped to a specific subdirectory.

Status: It's safe to say it's ready for daily use, as I've been using it for a while now.

Installation

Cargo (Rust)

cargo install dumpfs

NPM (Node.js/JavaScript)

npm install @kkharji/dumpfs

Usage

CLI Examples

# Basic usage (current directory, output to stdout)
dumpfs gen

# Scan a specific directory with output to a file
dumpfs gen /path/to/project -o project_dump.md

# Scan with specific output format
dumpfs gen . -o output.xml -f xml

# Copy the generated content to the clipboard
dumpfs gen . --clip

# Filter files using ignore patterns
dumpfs gen . -i "*.log,*.tmp,node_modules/*"

# Include only specific files
dumpfs gen . -I "*.rs,*.toml"

# Show additional metadata in output
dumpfs gen . -smp

# Skip file contents, show only structure
dumpfs gen . --skip-content

# Scan a remote Git repository
dumpfs gen https://github.com/username/repo -o repo_dump.md

# Generate shell completions (e.g. for zsh)
dumpfs completion zsh ~/.config/zsh/completions/_dumpfs

Node.js/JavaScript Library Examples

import { scan } from '@kkharji/dumpfs';

// Basic usage - scan current directory
const result = await scan('.');
const llmText = await result.llmText();
console.log(llmText);

// With options - scan with custom settings
const detailed = await scan('/path/to/project', {
  maxDepth: 3,
  ignorePatterns: ['node_modules/**', '*.log'],
  includePatterns: ['*.js', '*.ts', '*.json'],
  skipContent: false,
  model: 'gpt4' // Enable token counting
});

// Generate different output formats
const markdownOutput = await detailed.llmText();

// Customize output options
const customOutput = await detailed.llmText({
  showPermissions: true,
  showSize: true,
  showModified: true,
  includeTreeOutline: true,
  omitFileContents: false
});

Recent Changes & New Features

v0.1.0+ Updates

🚀 Node.js Bindings (NAPI)

  • Full JavaScript/TypeScript library with async support
  • Cross-platform native modules for optimal performance
  • Type definitions included for better development experience
  • Available on npm as @kkharji/dumpfs

🧠 Token Counting & LLM Integration

  • Built-in token counting for popular LLM models (GPT-4, Claude Sonnet, Llama, Mistral)
  • Model-aware content analysis and optimization
  • Caching system for efficient repeated tokenization
  • Support for content-based token estimation

⚡ Enhanced CLI

  • Output to stdout for better shell integration
  • Clipboard support for seamless workflow
  • Improved progress reporting and error handling
  • Better filtering and ignore patterns

🔧 Performance Improvements

  • Optimized parallel processing with configurable thread counts
  • Enhanced file type detection and text classification
  • Better memory management for large codebases
  • Improved handling of symlinks and permissions

Key Features

The architecture supports several important features:

  1. Parallel Processing: Uses worker threads for efficient filesystem traversal and processing
  2. Flexible Input: Handles both local and remote code sources uniformly
  3. Smart Filtering: Provides multiple ways to filter content:
    • File size limits
    • Modified dates
    • Permissions
    • Gitignore patterns
    • Custom include/exclude patterns
  4. Token Counting & LLM Integration:
    • Built-in tokenization for major LLM models (GPT-4, Claude, Llama, Mistral)
    • Implements caching for efficient tokenization
    • Model-aware content analysis and optimization
  5. Performance Optimization:
    • Uses efficient buffered I/O
    • Provides progress tracking
    • Supports cancelation
  6. Extensibility:
    • Modular design for adding new tokenizers
    • Support for multiple output formats
    • Pluggable formatter system
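The include/exclude filtering described above can be sketched with a simple glob-to-regex translation (illustrative only; the crate also honors gitignore files and other criteria such as size and dates):

```javascript
// Convert a simple glob ("*.rs", "node_modules/*") into a RegExp.
function globToRegExp(glob) {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\]/g, '\\$&') // escape regex specials except * and ?
    .replace(/\*/g, '.*')
    .replace(/\?/g, '.');
  return new RegExp('^' + escaped + '$');
}

// Ignore patterns win; if include patterns are given, a path must match one.
function shouldInclude(path, { include = [], ignore = [] } = {}) {
  if (ignore.some((g) => globToRegExp(g).test(path))) return false;
  if (include.length === 0) return true;
  return include.some((g) => globToRegExp(g).test(path));
}

console.log(shouldInclude('debug.log', { ignore: ['*.log'] })); // prints false
```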

Data Flow

  1. User input (path/URL) → Subject

  2. Subject initializes appropriate source (local/remote)

  3. Scanner traverses files with parallel workers

  4. Files are processed according to type and options

  5. Results are collected into a tree structure

  6. Formatter converts tree to desired output format

  7. Results are saved or displayed
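The seven steps above can be mirrored in a minimal synchronous sketch. All names here are hypothetical and the scan step is stubbed with a fixed file list; the point is only the shape of the pipeline:

```javascript
function runPipeline(input) {
  // Steps 1-2: user input becomes a Subject with a local or remote source.
  const subject = { kind: input.startsWith('http') ? 'remote' : 'local', root: input };
  // Step 3: traversal (stubbed here with one hard-coded file).
  const files = [{ path: 'src/main.rs', content: 'fn main() {}' }];
  // Steps 4-5: processed results collected into a tree structure.
  const tree = { root: subject.root, children: files };
  // Step 6: a formatter converts the tree to the desired output.
  return tree.children
    .map((f) => '### ' + f.path + '\n' + f.content)
    .join('\n\n');
  // Step 7: the caller saves or displays the returned string.
}
```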

Architecture Overview

dumpfs is organized into several key modules that work together to analyze and format codebase content for LLMs:

Core Modules

1. Subject (src/subject.rs)

  • Acts as the central coordinator for processing input sources
  • Handles both local directories and remote Git repositories
  • Provides high-level API for scanning and formatting operations

2. Filesystem Scanner (src/fs/)

  • Handles recursive directory traversal and file analysis
  • Implements parallel processing via worker threads for performance
  • Detects file types and extracts content & metadata
  • Manages filtering based on various criteria (size, date, permissions)

3. Git Integration (src/git/)

  • Parses and validates remote repository URLs
  • Extracts repository metadata (owner, name, branch)
  • Manages cloning and updating of remote repositories
  • Handles authentication and credentials
  • Provides access to repository contents for scanning
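URL parsing of the kind described can be sketched for GitHub-style URLs. This is illustrative only; the actual parser in src/git/ presumably handles more hosts and URL forms:

```javascript
// Extract owner, name, and optional branch/subdirectory from a GitHub-style URL.
function parseRepoUrl(url) {
  const m = url.match(
    /^https:\/\/github\.com\/([^/]+)\/([^/]+?)(?:\.git)?(?:\/tree\/([^/]+)(?:\/(.*))?)?$/
  );
  if (!m) return null;
  return { owner: m[1], name: m[2], branch: m[3] || null, subdir: m[4] || null };
}
```

The optional `/tree/<branch>/<subdir>` tail is what lets a scan target a specific directory inside a remote repository, as mentioned in the introduction.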

4. Token Counter (src/tk/)

  • Implements token counting for various LLM models
  • Supports multiple providers (OpenAI, Anthropic, HuggingFace)
  • Includes caching to avoid redundant tokenization
  • Tracks statistics for optimization

5. Formatters (src/fs/fmt/)

  • Converts scanned filesystem data into LLM-friendly formats
  • Supports multiple output formats (Markdown, XML, JSON)
  • Handles metadata inclusion and content organization
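A pluggable formatter of this shape can be sketched as a map from format name to render function (hypothetical structure; the crate's formatters in src/fs/fmt/ also handle metadata and tree outlines):

```javascript
// Render a scanned file list as Markdown or XML via a pluggable formatter table.
const formatters = {
  markdown: (files) =>
    files.map((f) => '## ' + f.path + '\n\n' + f.content).join('\n\n'),
  xml: (files) =>
    '<files>\n' +
    files.map((f) => `  <file path="${f.path}">${f.content}</file>`).join('\n') +
    '\n</files>',
};

function formatTree(files, format) {
  const fmt = formatters[format];
  if (!fmt) throw new Error('unknown format: ' + format);
  return fmt(files);
}
```

Adding a new output format is then a matter of registering one more entry in the table, which is the extensibility property the Key Features section describes.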

Supporting Modules

Error Handling (src/error.rs)

  • Provides a centralized error type system
  • Implements custom error conversion and propagation
  • Ensures consistent error handling across modules

Cache Management (src/cache.rs)

  • Manages persistent caching of tokenization results
  • Provides cache location and naming utilities

CLI Interface (src/cli/)

  • Implements command-line interface using clap
  • Processes user options and coordinates operations
  • Provides progress feedback and reporting