butterfly-dl

Crates.iobutterfly-dl
lib.rsbutterfly-dl
version2.0.0
created_at2025-06-26 20:56:56.161573+00
updated_at2025-06-27 08:23:21.323915+00
descriptionButterfly-dl - Optimized OpenStreetMap data downloader with HTTP support
homepagehttps://github.com/butterfly-osm/butterfly-osm
repositoryhttps://github.com/butterfly-osm/butterfly-osm
max_upload_size
id1727820
size160,585
Pierre Warnier (pierre-warnier)

documentation

https://docs.rs/butterfly-dl

README

Butterfly-dl ๐Ÿฆ‹

A high-performance, memory-efficient OpenStreetMap data downloader with intelligent source routing, resilient networking, and beautiful progress display.

Features

  • ๐Ÿš€ Optimized for Large Files: <1GB RAM usage regardless of file size (including 81GB planet.osm.pbf)
  • ๐ŸŽจ Enhanced Progress Display: Beautiful tqdm-style progress bars with smooth Unicode blocks
  • ๐Ÿ›ก๏ธ Network Resilience: Intelligent retry with exponential backoff and smart resume from interruption points
  • ๐Ÿ“ File Safety: Comprehensive overwrite protection with prompts and CLI flags
  • ๐Ÿง  Smart Source Routing: HTTP with parallel downloads optimized by file size
  • ๐Ÿ” Semantic Error Intelligence: Advanced fuzzy matching that understands semantic intent and geographic relationships
  • ๐ŸŒ Dynamic Source Loading: Automatically fetches latest available regions from Geofabrik
  • ๐Ÿ“ก HTTP Protocol: Advanced HTTP with range requests and connection pooling
  • ๐Ÿ’ง Streaming Support: Direct stdout streaming for pipeline integration
  • โšก Performance Optimized: Auto-tuning connections, Direct I/O for large files
  • ๐Ÿ”ง Curl-like Interface: Simple positional arguments, stderr logging

Installation

Pre-built Binaries (Recommended)

Download the latest release for your platform:

# Linux (x86_64)
wget https://github.com/butterfly-osm/butterfly-osm/releases/latest/download/butterfly-dl-v2.0.0-x86_64-linux.tar.gz
tar -xzf butterfly-dl-v2.0.0-x86_64-linux.tar.gz
sudo mv butterfly-dl-v2.0.0-x86_64-linux/butterfly-dl /usr/local/bin/

# macOS (Intel)
wget https://github.com/butterfly-osm/butterfly-osm/releases/latest/download/butterfly-dl-v2.0.0-x86_64-macos.tar.gz
tar -xzf butterfly-dl-v2.0.0-x86_64-macos.tar.gz
sudo mv butterfly-dl-v2.0.0-x86_64-macos/butterfly-dl /usr/local/bin/

# Windows
# Download butterfly-dl-v2.0.0-x86_64-windows.zip from releases page

Package Managers

Debian/Ubuntu

wget https://github.com/butterfly-osm/butterfly-osm/releases/latest/download/butterfly-dl_2.0.0_amd64.deb
sudo dpkg -i butterfly-dl_2.0.0_amd64.deb

Cargo (Rust)

cargo install butterfly-dl

Build from Source

git clone https://github.com/butterfly-osm/butterfly-osm
cd butterfly-osm
cargo build --release -p butterfly-dl

Workspace Structure

This repository is organized as a Rust workspace containing multiple OSM tools:

  • butterfly-common: Shared utilities and error handling
  • tools/butterfly-dl: The main downloader tool
  • Future tools: butterfly-shrink, butterfly-extract, butterfly-serve

Building Individual Tools

# Build all tools
cargo build --workspace --release

# Build specific tool
cargo build --release -p butterfly-dl

# Test specific tool
cargo test -p butterfly-dl

Usage

Basic Examples

# Download planet file from HTTP (81GB) 
butterfly-dl planet

# Download continent from HTTP
butterfly-dl europe

# Download country/region from HTTP  
butterfly-dl europe/belgium

# Stream to stdout for processing
butterfly-dl europe/monaco - | gzip > monaco.pbf.gz

# Save to custom file name
butterfly-dl planet planet-backup.pbf

# Verbose output with source info
butterfly-dl --verbose europe/belgium

Enhanced Features

๐ŸŽจ Beautiful Progress Display

# Smooth tqdm-style progress bars with comprehensive information
butterfly-dl europe/belgium
# 75%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š     | 450MB/600MB [01:30<00:30, 25.2MB/s]

๐Ÿ›ก๏ธ Network Resilience & Recovery

# Automatic retry with smart resume - no lost progress
butterfly-dl europe/belgium
# โš ๏ธ Network error (attempt 1): operation timed out. Retrying in 1000ms...
# โš ๏ธ Stream interrupted at 300MB, resuming...
# โœ… Download completed!

๐Ÿ“ File Overwrite Protection

# Interactive prompts for existing files
butterfly-dl europe/belgium
# โš ๏ธ File already exists: belgium-latest.osm.pbf
# Overwrite? [y/N]: n
# โŒ Download cancelled

# Force overwrite without prompting
butterfly-dl europe/belgium --force
# โš ๏ธ Overwriting existing file: belgium-latest.osm.pbf

# Never overwrite, fail if file exists
butterfly-dl europe/belgium --no-clobber
# Error: File already exists: belgium-latest.osm.pbf (use --force to overwrite)

Source Resolution

Input Source Description
planet HTTP Planet file from https://planet.openstreetmap.org/pbf/planet-latest.osm.pbf
europe HTTP Continent from https://download.geofabrik.de/europe-latest.osm.pbf
europe/belgium HTTP Country from https://download.geofabrik.de/europe/belgium-latest.osm.pbf

Intelligent Error Handling

butterfly-dl includes smart error correction with fuzzy matching:

# Semantic intent recognition
butterfly-dl austrailia
# Error: Source 'austrailia' not found. Did you mean 'australia-oceania'?

# Typo correction
butterfly-dl antartica
# Error: Source 'antartica' not found. Did you mean 'antarctica'?

# Geographic accuracy  
butterfly-dl antartica/belgium
# Error: Source 'antartica/belgium' not found. Did you mean 'europe/belgium'?

# Standalone country recognition
butterfly-dl luxembourg
# Error: Source 'luxembourg' not found. Did you mean 'europe/luxembourg'?

# Smart continent suggestions
butterfly-dl plant
# Error: Source 'plant' not found. Did you mean 'planet'?

Features:

  • Semantic Intelligence: Hybrid fuzzy matching that understands semantic intent, not just character distance
  • Dynamic Source Discovery: Automatically fetches available regions from Geofabrik JSON API
  • Contextual Scoring: Prioritizes meaningful matches like "australia-oceania" over "austria" for "austrailia"
  • Geographic Intelligence: Knows Belgium belongs to Europe, not Antarctica
  • Fallback Protection: Works offline with comprehensive fallback region list

Output Options

  • No output argument: Auto-generated filename (e.g., belgium-latest.osm.pbf)
  • Filename: Save to specified file
  • -: Stream to stdout (logs go to stderr)

Performance Features

Memory Efficiency

  • Fixed 64KB buffers: Memory usage independent of file size
  • Ring buffer ordering: Small memory footprint for parallel downloads
  • Direct I/O: Bypasses OS page cache for files >1GB (Unix systems)
  • Streaming writes: No intermediate accumulation

Download Optimization

  • HTTP: Single optimized stream for maximum network utilization
  • HTTP: Auto-tuned parallel range requests (2-16 connections based on file size)
  • Fallback: Graceful degradation for servers without range support
  • Progress tracking: Real-time progress bars to stderr

Intelligent Defaults

  • Connection scaling: Based on file size and CPU count
  • Protocol selection: Optimal source for each data type
  • Error handling: Robust retry and fallback mechanisms

Technical Architecture

Memory Usage Breakdown

Connection buffers:    16 ร— 64KB = 1MB
Ring buffer:          64MB (max)
HTTP client overhead: ~50MB
Runtime:             ~50MB
Total:               ~215MB (well under 1GB limit)

Direct I/O Support

Automatically enabled for files >1GB on Unix systems:

  • Bypasses OS page cache
  • Reduces memory pressure
  • Optimizes large sequential writes
  • Falls back gracefully if not available

CLI Reference

Downloads single OpenStreetMap files efficiently:
  butterfly-dl planet              # Download planet file (81GB) from HTTP
  butterfly-dl europe              # Download Europe continent from HTTP
  butterfly-dl europe/belgium      # Download Belgium from HTTP
  butterfly-dl europe/monaco -     # Stream Monaco to stdout

Usage: butterfly-dl [OPTIONS] <SOURCE> [OUTPUT]

Arguments:
  <SOURCE>  Source to download: "planet" (HTTP), "europe" (continent), or "europe/belgium" (country/region)
  [OUTPUT]  Output file path, or "-" for stdout

Options:
      --dry-run  Show what would be downloaded without downloading
  -v, --verbose  Enable verbose logging
  -h, --help     Print help
  -V, --version  Print version

Examples

Planet Download (81GB)

# Download planet file (uses HTTP, single stream, Direct I/O)
butterfly-dl planet

# Stream planet to compressed archive
butterfly-dl planet - | gzip > planet.pbf.gz

Regional Downloads

# Download all of Europe (parallel HTTP ranges)
butterfly-dl europe

# Download specific country
butterfly-dl europe/germany

# Download to custom location
butterfly-dl asia/japan japan-$(date +%Y%m%d).pbf

Pipeline Integration

# Stream and process immediately
butterfly-dl europe/monaco - | osmium fileinfo -

# Compress on the fly
butterfly-dl europe/switzerland - | bzip2 > switzerland.pbf.bz2

# Chain with other tools
butterfly-dl planet - | osmium extract --bbox 2.3,46.8,2.4,46.9 -o monaco-bbox.pbf -

Development

Building

cargo build --release

Testing

# Run all tests
cargo test

# Run with verbose output
cargo test -- --nocapture

Performance Testing

# Test with small file
time butterfly-dl europe/monaco

# Test streaming
butterfly-dl europe/monaco - | wc -c

Version Management

The project uses a centralized version management system to maintain consistency across all components:

๐Ÿ“„ Single Source of Truth:

  • VERSION file contains the current version number (e.g., 1.0.0)
  • All other files automatically read from this central location

๐Ÿ”ง Automatic Version Propagation:

  • CLI tool: Uses env!("BUTTERFLY_VERSION") from build script
  • HTTP User-Agent: Dynamically includes version in requests
  • Library exports: Version available via build-time environment
  • C bindings: pkg-config file includes correct version
  • Documentation: Version stays in sync automatically

๐Ÿ”„ Build Integration:

  • build.rs reads VERSION file and sets environment variables
  • Any change to VERSION triggers automatic rebuild
  • Build system tracks version file as dependency

๐Ÿ“ Updating Version:

# Update version for new release
echo "1.1.0" > VERSION

# Rebuild automatically picks up new version
cargo build --release

# All components now use 1.1.0
./target/release/butterfly-dl --version  # Shows 1.1.0

Note: Cargo.toml version must still be updated manually due to Cargo limitations.

Architecture

  • Rust + Tokio: Async/await for concurrent downloads
  • HTTP Client: Advanced reqwest client with connection pooling
  • Reqwest: HTTP client with connection pooling and range requests
  • Indicatif: Progress bars to stderr
  • Ring buffer: Maintains chunk ordering with minimal memory

Performance Benchmarks

butterfly-dl includes comprehensive benchmarking against industry-standard tools to validate performance claims:

Benchmark Suite

# Run benchmarks against curl and aria2
./benchmarks/bench.sh europe/monaco    # Small file (~1MB)
./benchmarks/bench.sh europe/belgium   # Medium file (~43MB)  
./benchmarks/bench.sh europe/france    # Large file (~3.5GB)

Sample Results

All benchmarks conducted on actual hardware over real network conditions

Small Files (Monaco ~0.6MB)

Tool         Duration(s)  Speed(MB/s)  Memory     Status    
----------------------------------------------------------
curl         0.459        1.32         ~10MB      โœ… Success
butterfly-dl 0.612        0.99         ~215MB     โœ… Success
aria2        0.643        0.94         ~50MB      โœ… Success

For very small files, curl's lightweight design provides startup advantages

Medium Files (Luxembourg ~43MB)

Tool         Duration(s)  Speed(MB/s)  Memory     Status    
----------------------------------------------------------
butterfly-dl 3.037        14.07        ~215MB     โœ… Success
aria2        5.447        7.84         ~120MB     โœ… Success
curl         9.349        4.57         ~10MB      โœ… Success

butterfly-dl excels with 3x faster than curl and 79% faster than aria2

Key Performance Insights

  • ๐ŸŽฏ Sweet Spot: Medium to large files (>10MB) where butterfly-dl delivers 79-200% speed improvements
  • ๐Ÿ“Š Memory Consistency: Fixed ~215MB usage regardless of file size (vs aria2's scaling memory)
  • โšก Speed Scaling: 14.07 MB/s on 43MB files vs aria2's 7.84 MB/s and curl's 4.57 MB/s
  • ๐Ÿ”ง Smart Strategy: Automatically uses single connection for small files, optimized parallel connections for larger files
  • ๐Ÿ† Performance Leader: On medium files, butterfly-dl consistently outperforms both aria2 and curl significantly

Benchmark Features

  • ๐Ÿค– Automatic Tool Detection - Only tests available tools (curl, aria2, butterfly-dl)
  • ๐Ÿ“‹ Comprehensive Metrics - Duration, speed, memory usage, file integrity validation
  • ๐Ÿ”’ MD5 Verification - Ensures all tools download identical, uncorrupted files
  • ๐Ÿงน Clean Testing - Automatic cleanup of temporary benchmark files
  • ๐Ÿ“ˆ Fair Comparison - Same network conditions, same target files, same validation

Running Your Own Benchmarks

# Clone and build
git clone https://github.com/butterfly-osm/butterfly-osm
cd butterfly-osm
cargo build --release

# Test with any supported region
./benchmarks/bench.sh <region>

# Examples covering different file sizes
./benchmarks/bench.sh europe/monaco      # Small: ~1MB
./benchmarks/bench.sh europe/luxembourg  # Small: ~2MB  
./benchmarks/bench.sh europe/belgium     # Medium: ~43MB
./benchmarks/bench.sh europe/netherlands # Large: ~580MB
./benchmarks/bench.sh europe/france      # Large: ~3.5GB

Comparison with Alternatives

Tool Memory Usage Parallel Downloads HTTP Features Streaming Speed (43MB file)
butterfly-dl ~215MB Yes (Smart) Advanced Yes 14.07 MB/s
aria2c ~50-500MB+ Yes Basic Limited 7.84 MB/s
curl ~10MB No Basic Yes 4.57 MB/s
wget ~10MB No Basic No ~4 MB/s

License

MIT License - see LICENSE file for details.

Contributing

This project follows XP pair programming with human + AI collaboration. See CLAUDE.md for development guidelines.

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Who

Butterfly Project built by Pierre pierre@warnier.net for the broader OpenStreetMap community.


Butterfly-dl: The optimal tool for downloading large OpenStreetMap datasets efficiently.

Commit count: 0

cargo fmt