Rust Web Crawler

A high-performance, concurrent web crawler implemented in Rust that demonstrates different approaches to web crawling with varying levels of concurrency and synchronization.

🚀 Features

  • Multiple Crawling Strategies: Three different implementations showcasing various concurrency patterns
  • Thread-Safe Design: Proper synchronization using mutexes and channels
  • Extensible Architecture: Trait-based fetcher interface for easy testing and extension
  • Comprehensive Testing: Full test suite covering edge cases and error scenarios
  • Performance Optimized: Concurrent implementations for improved crawling speed

📋 Project Structure

rust_web_crawler/
├── src/
│   └── main.rs          # Main implementation with three crawler variants
├── Cargo.toml           # Project dependencies and metadata
├── Cargo.lock           # Locked dependency versions
└── README.md            # This file

🔧 Implementation Approaches

1. Serial Crawler

  • Function: serial_crawler()
  • Approach: Sequential, depth-first traversal (see the sketch below)
  • Use Case: Simple crawling, debugging, or when traversal order matters
  • Performance: Slower, but predictable and easy to understand
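
A minimal sketch of the idea (illustrative names, not the crate's exact API), using the Fetcher trait shown under Architecture below:

use std::collections::HashSet;

// Sequential, depth-first crawl: visit each URL once, then recurse into its links.
fn serial_crawl(url: &str, fetcher: &dyn Fetcher, visited: &mut HashSet<String>) {
    if !visited.insert(url.to_string()) {
        return; // already visited
    }
    match fetcher.fetch(url) {
        Ok(links) => {
            for link in links {
                serial_crawl(&link, fetcher, visited); // depth-first recursion
            }
        }
        Err(e) => eprintln!("failed to fetch {url}: {e}"),
    }
}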

2. Concurrent Mutex Crawler

  • Function: concurrent_mutex_crawler()
  • Approach: Multi-threaded, with shared state protected by a mutex (see the sketch below)
  • Use Case: When you need fine-grained control over shared state
  • Performance: Good concurrency with thread safety
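
A minimal sketch of the shared-state pattern (illustrative, not the crate's code): the visited set lives behind an Arc<Mutex<..>> and each discovered link is crawled on its own scoped thread.

use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

fn mutex_crawl(url: String, fetcher: Arc<dyn Fetcher>, visited: Arc<Mutex<HashSet<String>>>) {
    // Hold the lock only long enough to test-and-insert the URL.
    if !visited.lock().unwrap().insert(url.clone()) {
        return;
    }
    if let Ok(links) = fetcher.fetch(&url) {
        // Scoped threads let the call wait for the whole subtree before returning.
        thread::scope(|s| {
            for link in links {
                let fetcher = Arc::clone(&fetcher);
                let visited = Arc::clone(&visited);
                s.spawn(move || mutex_crawl(link, fetcher, visited));
            }
        });
    }
}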

3. Concurrent Channel Crawler

  • Function: concurrent_channel_crawler()
  • Approach: Worker-pool pattern using channels for communication (see the sketch below)
  • Use Case: High-performance crawling with better resource management
  • Performance: Optimal concurrency with reduced lock contention
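
A minimal sketch of the channel pattern (illustrative, not the crate's code): each URL is fetched on a worker thread, results come back over an mpsc channel, and the main loop deduplicates URLs and tracks how many jobs are still in flight.

use std::collections::HashSet;
use std::sync::{mpsc, Arc};
use std::thread;

fn channel_crawl(start: String, fetcher: Arc<dyn Fetcher>) -> HashSet<String> {
    let (tx, rx) = mpsc::channel::<Vec<String>>();
    let mut visited = HashSet::new();
    let mut in_flight = 0usize;
    let mut frontier = vec![start];

    loop {
        // Dispatch every not-yet-seen URL in the frontier to its own worker.
        for url in frontier.drain(..) {
            if visited.insert(url.clone()) {
                in_flight += 1;
                let tx = tx.clone();
                let fetcher = Arc::clone(&fetcher);
                thread::spawn(move || {
                    let links = fetcher.fetch(&url).unwrap_or_default();
                    let _ = tx.send(links);
                });
            }
        }
        if in_flight == 0 {
            break; // nothing left to wait for
        }
        // Wait for one worker to finish and feed its links back into the frontier.
        frontier = rx.recv().expect("a worker still holds a sender");
        in_flight -= 1;
    }
    visited
}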

🛠️ Architecture

Core Components

Fetcher Trait

trait Fetcher: Send + Sync + 'static {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String>;
}

  • Defines the interface for fetching URLs
  • Returns a list of discovered URLs or an error
  • Thread-safe and can be shared across threads

FakeFetcher

  • Test implementation that simulates web crawling
  • Pre-defined responses for consistent testing
  • Simulates network latency for realistic performance testing
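
For illustration, a fake fetcher can be as simple as the sketch below (the page map and latency value are assumptions, not the crate's exact fixture):

use std::collections::HashMap;
use std::thread;
use std::time::Duration;

// Map each URL to the links its "page" should return.
struct FakeFetcher {
    pages: HashMap<String, Vec<String>>,
}

impl Fetcher for FakeFetcher {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String> {
        thread::sleep(Duration::from_millis(10)); // simulated network latency
        self.pages
            .get(url)
            .cloned()
            .ok_or_else(|| format!("not found: {url}"))
    }
}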

🚀 Getting Started

Prerequisites

  • Rust 1.85+ (required for edition 2024)
  • Cargo package manager

Installation

  1. Clone the repository

    git clone https://github.com/nabil-Tounarti/rust-web-crawler.git
    cd rust-web-crawler
    
  2. Build the project

    cargo build
    
  3. Run the crawler

    cargo run
    

Running Tests

# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test
cargo test test_serial_crawler_happy_path

📊 Performance Comparison

The project demonstrates three different crawling approaches:

| Approach | Concurrency | Thread Safety   | Performance | Complexity |
|----------|-------------|-----------------|-------------|------------|
| Serial   | None        | N/A             | Slow        | Low        |
| Mutex    | High        | Mutex-protected | Medium      | Medium     |
| Channel  | High        | Channel-based   | Fast        | High       |
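
To compare the variants yourself, a small timing harness is enough; the sketch below is illustrative, and the call inside the comment stands in for whichever crawler entry point you run:

use std::time::Instant;

// Run a closure and print how long it took.
fn time_it<F: FnOnce()>(label: &str, f: F) {
    let start = Instant::now();
    f();
    println!("{label}: {:?}", start.elapsed());
}

// Example: time_it("serial", || { /* call the serial crawler here */ });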

🧪 Testing

The project includes comprehensive tests covering:

  • Happy Path: Basic crawling functionality
  • Edge Cases: Non-existent URLs, fetch errors
  • Concurrency: Thread safety and race condition prevention
  • Error Handling: Proper error propagation and handling

Test Categories

  1. Basic Functionality Tests

    • test_serial_crawler_happy_path
    • test_concurrent_mutex_crawler_happy_path
    • test_concurrent_channel_crawler_happy_path
  2. Error Handling Tests

    • test_start_with_nonexistent_url
    • test_crawler_with_fetch_error
  3. Edge Case Tests

    • test_single_url_no_links
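
For a sense of what a happy-path test can look like, here is a sketch built on the FakeFetcher and serial_crawl sketches above (the fixture URLs are made up, and the test name does not match the crate's):

use std::collections::{HashMap, HashSet};

#[test]
fn serial_crawler_visits_every_reachable_page() {
    // "a" links to "b" and "c"; neither links anywhere else.
    let fetcher = FakeFetcher {
        pages: HashMap::from([
            ("a".to_string(), vec!["b".to_string(), "c".to_string()]),
            ("b".to_string(), vec![]),
            ("c".to_string(), vec![]),
        ]),
    };
    let mut visited = HashSet::new();
    serial_crawl("a", &fetcher, &mut visited);
    assert_eq!(visited.len(), 3);
}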

🔄 Extending the Project

Adding Real HTTP Fetcher

To implement a real web crawler, create a new fetcher. The sketch below uses the blocking reqwest client plus the scraper crate for link extraction (both would need to be added to Cargo.toml; this is one possible shape, not the crate's implementation):

use reqwest::blocking::Client;
use scraper::{Html, Selector};

struct HttpFetcher {
    client: Client,
}

impl Fetcher for HttpFetcher {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String> {
        // Fetch the page body, mapping network and decoding errors to a String.
        let body = self.client.get(url).send().map_err(|e| e.to_string())?
            .text().map_err(|e| e.to_string())?;
        // Parse the HTML and collect the href of every <a> tag as a discovered URL.
        let document = Html::parse_document(&body);
        let selector = Selector::parse("a[href]").map_err(|e| format!("{e:?}"))?;
        Ok(document
            .select(&selector)
            .filter_map(|a| a.value().attr("href").map(str::to_owned))
            .collect())
    }
}

Adding Configuration

Extend the project with configuration options:

use std::time::Duration;

struct CrawlerConfig {
    max_depth: usize,
    max_concurrent_requests: usize,
    request_timeout: Duration,
    user_agent: String,
}
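
A Default implementation keeps call sites simple; the values below are placeholders, not recommendations:

impl Default for CrawlerConfig {
    fn default() -> Self {
        Self {
            max_depth: 3,
            max_concurrent_requests: 8,
            request_timeout: Duration::from_secs(10),
            user_agent: "rust_web_crawler/0.1".to_string(),
        }
    }
}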

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

🎯 Learning Objectives

This project serves as an excellent learning resource for:

  • Rust Concurrency: Understanding threads, mutexes, and channels
  • Trait-based Design: Building extensible and testable code
  • Error Handling: Proper error propagation in Rust
  • Testing: Comprehensive test coverage and mocking
  • Performance Optimization: Comparing different concurrency approaches

🔗 Dependencies

  • tokio: Async runtime for high-performance networking (currently unused but ready for HTTP implementation)
  • Standard Library: Threading, collections, and synchronization primitives

📈 Future Enhancements

  • Real HTTP fetcher implementation
  • Rate limiting and politeness controls
  • URL filtering and domain restrictions
  • Persistent storage for crawled data
  • Configuration file support
  • Metrics and monitoring
  • Distributed crawling capabilities

Note: This is currently a demonstration project using a fake fetcher. For production use, implement a real HTTP fetcher with proper error handling and rate limiting.
