| Crates.io | gpu-scatter-gather |
| lib.rs | gpu-scatter-gather |
| version | 1.8.0 |
| created_at | 2026-01-16 22:34:27.92411+00 |
| updated_at | 2026-01-17 02:37:38.715312+00 |
| description | World's fastest wordlist generator using GPU acceleration with multi-GPU support |
| homepage | |
| repository | https://github.com/tehw0lf/gpu-scatter-gather |
| max_upload_size | |
| id | 2049419 |
| size | 1,698,583 |
The world's fastest wordlist generator using GPU acceleration
Read the Technical Whitepaper - Comprehensive algorithm design, formal proofs, and performance evaluation
Status: v1.7.0 Released - Published on crates.io!
Production-ready library with 4-15× speedup over CPU tools (maskprocessor, cracken). Complete C FFI API with 24 functions (17 single-GPU + 7 multi-GPU), 3 output formats, formal validation, and integration guides. See the Development Log for detailed progress.
GPU Scatter-Gather is a GPU-accelerated wordlist generator that achieves 365-771 M words/second (depending on password length), 4-15× faster than CPU tools, using a novel scatter-gather algorithm based on mixed-radix arithmetic.
Instead of traditional sequential odometer iteration, this generator uses direct index-to-word mapping:
Index → Mixed-Radix Decomposition → Word
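For example, with charsets ?0 = "abc" and ?1 = "123" under mask ?0?1, index 5 decomposes right-to-left as 5 mod 3 = 2 ('3'), then 5 / 3 = 1 and 1 mod 3 = 1 ('b'), yielding the word "b3".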
This enables:
Target Hardware: NVIDIA RTX 4070 (5,888 CUDA cores)
Actual Hardware Tested: NVIDIA RTX 4070 Ti SUPER (8,448 CUDA cores)
| Tool | 8-char Speed | 16-char Speed | Speedup (16-char) |
|---|---|---|---|
| GPU Scatter-Gather | 771 M/s | 365 M/s | 15.3× |
| cracken (CPU) | 201 M/s | 43 M/s | 1.0× (baseline) |
| maskprocessor (CPU) | 100-142 M/s | ~50-60 M/s | ~6-7× |
Note: Performance advantage increases with password length due to GPU parallelism scaling better than CPU sequential processing. See Competitive Results for detailed benchmarks.
# Add to your Cargo.toml
[dependencies]
gpu-scatter-gather = "1.7"
# Or install as command-line tool
cargo install gpu-scatter-gather
# Clone the repository
git clone https://github.com/tehw0lf/gpu-scatter-gather
cd gpu-scatter-gather
# Build the project (compiles CUDA kernels automatically)
cargo build --release
# Or build without CUDA support (CPU-only reference implementation)
cargo build --release --no-default-features
Prerequisites:
This crate supports the following Cargo features:
cuda (enabled by default) - GPU acceleration support with CUDA
Without GPU support:
[dependencies]
gpu-scatter-gather = { version = "1.7", default-features = false }
This provides a CPU-only reference implementation for development/testing without GPU hardware.
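GPU-dependent code paths can then be gated in the usual Cargo way. A generic sketch of feature gating, not necessarily this crate's internal structure (generate_words is a hypothetical function name):
#[cfg(feature = "cuda")]
fn generate_words() { /* CUDA-accelerated path */ }

#[cfg(not(feature = "cuda"))]
fn generate_words() { /* CPU reference fallback */ }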
Core Features (Production Ready):
Recent Improvements (v1.3.0-1.7.0):
These features (e.g., generate_batch_with()) await community interest and contributions:
New to the project? See QUICKSTART.md for a 5-minute setup guide!
Have questions? Check FAQ.md for common questions and troubleshooting.
See EXAMPLES.md for a complete guide to all 16 examples with detailed explanations!
use gpu_scatter_gather::gpu::GpuContext;
use std::collections::HashMap;
fn main() -> anyhow::Result<()> {
// Create GPU context
let gpu = GpuContext::new()?;
// Define character sets
let mut charsets = HashMap::new();
charsets.insert(0, b"abc".to_vec());
charsets.insert(1, b"123".to_vec());
// Create mask pattern: ?0?1
let mask = vec![0, 1];
// Generate 9 words
let output = gpu.generate_batch(&charsets, &mask, 0, 9, 2)?;
// Parse results
let word_length = mask.len();
for i in 0..(output.len() / word_length) {
let start = i * word_length;
let end = start + word_length;
let word = String::from_utf8_lossy(&output[start..end]);
println!("{}", word);
}
Ok(())
}
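With the mixed-radix ordering described above, this prints the nine combinations: a1, a2, a3, b1, b2, b3, c1, c2, c3.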
Run the beginner example:
cargo run --release --example simple_basic
Run the comprehensive API tour:
cargo run --release --example simple_rust_api
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/types.h> // ssize_t
#include <wordlist_generator.h>
int main() {
// Create multi-GPU generator (uses all GPUs automatically)
wg_multigpu_handle_t gen = wg_multigpu_create();
printf("Using %d GPU(s)\n", wg_multigpu_get_device_count(gen));
// Configure charsets
wg_multigpu_set_charset(gen, 1, "abcdefghijklmnopqrstuvwxyz", 26);
wg_multigpu_set_charset(gen, 2, "0123456789", 10);
// Set mask: ?1?1?1?1?2?2?2?2 (4 letters + 4 digits)
int mask[] = {1, 1, 1, 1, 2, 2, 2, 2};
wg_multigpu_set_mask(gen, mask, 8);
wg_multigpu_set_format(gen, WG_FORMAT_PACKED);
// Generate 100M words across all GPUs
uint8_t* buffer = malloc(100000000 * 8);
ssize_t bytes = wg_multigpu_generate(gen, 0, 100000000, buffer, 100000000 * 8);
printf("Generated %zd bytes\n", bytes);
free(buffer);
wg_multigpu_destroy(gen);
return 0;
}
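A typical build line against the C header might look like the following (illustrative; the header location and library name depend on how the crate's C artifacts are built and installed):
gcc -o multigpu_demo main.c -Iinclude -Ltarget/release -lgpu_scatter_gather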
Multi-GPU Features:
See Multi-GPU Benchmarking Results for detailed performance data.
# Generate wordlist and pipe to hashcat
cargo run --release --example benchmark_stdout | hashcat -m 2500 capture.hccapx
See examples/benchmark_john_pipe.rs for John the Ripper integration.
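The same pattern works for John the Ripper (hashes.txt below is a placeholder hash file):
# Pipe into John the Ripper
cargo run --release --example benchmark_stdout | john --stdin hashes.txt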
Given a mask pattern with varying charset sizes, we convert an index directly to a word:
fn index_to_word(index: u64, mask: &[usize], charsets: &[&[u8]], output: &mut [u8]) {
let mut remaining = index;
// Process positions from right to left
for pos in (0..mask.len()).rev() {
let charset_id = mask[pos];
let charset = charsets[charset_id];
let charset_size = charset.len() as u64;
let char_idx = (remaining % charset_size) as usize;
output[pos] = charset[char_idx];
remaining /= charset_size;
}
}
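A minimal driver for the reference function above (a sketch; the charsets and mask mirror the quick-start example, and main is added here purely for illustration):
fn main() {
    // Same setup as the quick start: charset 0 = "abc", charset 1 = "123", mask ?0?1
    let charsets: [&[u8]; 2] = [b"abc", b"123"];
    let mask = [0usize, 1];
    let mut word = [0u8; 2];
    for index in 0..9u64 {
        index_to_word(index, &mask, &charsets, &mut word);
        println!("{}", String::from_utf8_lossy(&word)); // a1, a2, a3, b1, ... c3
    }
}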
__global__ void generate_words_kernel(
const char* charset_data,
const int* charset_offsets,
const int* charset_sizes,
const int* mask_pattern,
unsigned long long start_idx,
int word_length,
char* output_buffer,
unsigned long long batch_size
) {
unsigned long long tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= batch_size) return;
unsigned long long idx = start_idx + tid;
char* word = output_buffer + (tid * (word_length + 1));
// Convert index to word (same algorithm as CPU)
unsigned long long remaining = idx;
for (int pos = word_length - 1; pos >= 0; pos--) {
int charset_id = mask_pattern[pos];
int cs_size = charset_sizes[charset_id];
int char_idx = remaining % cs_size;
word[pos] = charset_data[charset_offsets[charset_id] + char_idx];
remaining /= cs_size;
}
word[word_length] = '\n';
}
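On the host side, one thread is launched per word. A sketch of the launch configuration (the d_* names are assumed device buffers for illustration, not this library's actual variables):
// One thread per word; round the grid size up to cover the whole batch
int threads_per_block = 256;
unsigned long long blocks = (batch_size + threads_per_block - 1) / threads_per_block;
generate_words_kernel<<<(unsigned int)blocks, threads_per_block>>>(
    d_charset_data, d_charset_offsets, d_charset_sizes, d_mask_pattern,
    start_idx, word_length, d_output_buffer, batch_size);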
Key Properties:
For detailed mathematical proofs and formal specification, see docs/design/FORMAL_SPECIFICATION.md.
Hardware: NVIDIA GeForce RTX 4070 Ti SUPER
PACKED Format Performance (50M batch):
| Password Length | Throughput | PCIe Bandwidth | Notes |
|---|---|---|---|
| 8-char | 771 M words/s | 6.2 GB/s | Peak performance |
| 10-char | 576 M words/s | 5.8 GB/s | |
| 12-char | 526 M words/s | 6.3 GB/s | |
| 16-char | 365 M words/s | 5.8 GB/s | Competitive baseline |
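Because PACKED output carries exactly word-length bytes per word, PCIe bandwidth ≈ throughput × password length (e.g., 771 M words/s × 8 B ≈ 6.2 GB/s), which suggests host transfer bandwidth, not kernel time, is the limiting factor.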
Competitive Comparison (16-char passwords):
| Tool | Speed | Speedup |
|---|---|---|
| GPU Scatter-Gather | 365 M/s | 15.3× |
| cracken (CPU, fastest) | 43 M/s | 1.0× |
| maskprocessor (CPU) | ~50-60 M/s | ~6-7× |
Validation:
See docs/benchmarking/ for detailed results and methodology.
gpu-scatter-gather/
├── src/
│   ├── lib.rs          # Core library and API
│   ├── ffi.rs          # C FFI (24 functions)
│   ├── multigpu.rs     # Multi-GPU coordination
│   ├── gpu/            # GPU module (CUDA integration)
│   ├── charset.rs      # Charset management
│   ├── keyspace.rs     # Keyspace calculation and index-to-word
│   └── mask.rs         # Mask pattern parsing
├── kernels/
│   └── wordlist_poc.cu # CUDA kernels (3 variants)
├── examples/           # 16+ comprehensive examples
├── tests/              # Integration tests (55 tests)
├── docs/
│   ├── api/            # C API & FFI documentation
│   ├── design/         # Architecture and formal specification
│   ├── validation/     # Correctness validation
│   ├── benchmarking/   # Performance measurement
│   ├── guides/         # User and integration guides
│   └── development/    # Internal development docs
└── build.rs            # CUDA kernel compilation
# Run all tests (55 tests)
cargo test --lib
# Run with output
cargo test -- --nocapture
# Run specific test
cargo test test_index_to_word_complex_pattern
# GPU production benchmark (realistic performance)
cargo run --release --example benchmark_production
# Multi-GPU benchmark
cargo run --release --example benchmark_multigpu
# Competitive comparison with cracken
cargo run --release --example benchmark_cracken_comparison
The build script automatically compiles kernels for multiple architectures:
The correct kernel is loaded at runtime based on your GPU.
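Conceptually this is equivalent to an nvcc fat-binary build such as the following (illustrative only; the exact architecture list is set in build.rs):
nvcc -fatbin kernels/wordlist_poc.cu -o wordlist_poc.fatbin \
    -gencode arch=compute_75,code=sm_75 \
    -gencode arch=compute_86,code=sm_86 \
    -gencode arch=compute_89,code=sm_89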
⚠️ Ethical Use Only: This tool is intended for defensive security research, testing, and auditing. Unauthorized access to systems is illegal. Always obtain proper authorization before testing.
This project represents the third iteration of wordlist generation by the author:
| Implementation | Language | Algorithm | Performance | Speedup | Repository |
|---|---|---|---|---|---|
| wlgen | Python | itertools.product + recursive | 210K-1.6M words/s | 1× | github.com/tehw0lf/wlgen (PyPI) |
| wlgen-rs | Rust | Odometer (CPU) | ~150M words/s | ~100× | github.com/tehw0lf/wlgen-rs |
| gpu-scatter-gather | Rust+CUDA | Mixed-radix direct indexing | 365-771M words/s | 285-3600× | This project (crates.io) |
Key insight: Traditional approaches (Python itertools, Rust odometer) cannot leverage GPU parallelism. The mixed-radix direct indexing algorithm (AI-proposed) enables true GPU acceleration.
Our Advantages:
cracken strengths:
Our Advantages:
Maskprocessor strengths:
Our Advantages:
Our Advantages:
Phase 1: Foundation (COMPLETE)
Phase 2: Production Kernel (COMPLETE)
Phase 3: Core Features (COMPLETE)
Phase 4: Production Release (COMPLETE)
The project is feature-complete for its core purpose. Future enhancements depend on community interest:
Language Bindings:
Platform Support:
Advanced Features:
Contributing: See CONTRIBUTING.md for guidelines on adding features.
This is a human-AI collaborative research project that serves two purposes:
The core innovation, mixed-radix direct indexing, was autonomously proposed by Claude Code (AI assistant).
When asked "What algorithm would you suggest for a GPU-based approach that would outshine existing solutions?", the AI independently proposed abandoning the traditional odometer approach and using direct index-to-word mapping via mixed-radix arithmetic. This algorithmic choice enabled:
The human developer (tehw0lf) had minimal Rust experience prior to this project. The entire implementation (Rust codebase, CUDA kernels, build system, and integration) was developed with AI guidance. The AI taught Rust concepts (Result types, lifetimes, RAII, borrowing) while implementing the algorithm, demonstrating AI's capability to:
The entire development, from algorithm design through Rust/CUDA implementation, mathematical proofs, validation, and documentation, represents genuine human-AI pair programming in systems research, where the human provides direction, domain expertise, and validation while the AI provides implementation and formalization.
Full transparency: See docs/development/DEVELOPMENT_PROCESS.md for detailed methodology and contribution breakdown.
Contributions are welcome! This project benefits from both human and AI collaboration.
Areas where help is needed:
Development philosophy:
See CONTRIBUTING.md for detailed guidelines.
Dual-licensed under either:
Choose whichever license suits your use case.
Made with 🦀 Rust + ⚡ CUDA + 🤖 AI
Building the world's fastest wordlist generator, one kernel at a time.