heisenberg-data-processing

Crates.io: heisenberg-data-processing
lib.rs: heisenberg-data-processing
version: 0.2.0
created_at: 2025-10-28 18:00:18.938184+00
updated_at: 2025-10-28 18:00:18.938184+00
description: Data processing pipeline for the Heisenberg location enrichment library. Includes data downloading, extraction, transformation, and loading of the raw data into a queryable format that gets embedded into the Heisenberg library.
repository: https://github.com/SamBroomy/heisenberg
id: 1905347
size: 1,632,413
Broomy (SamBroomy)


Heisenberg

Location enrichment library for converting unstructured location data into structured administrative hierarchies.

Heisenberg transforms incomplete location data into complete administrative hierarchies using the GeoNames dataset. It resolves ambiguous place names, fills missing administrative context, and handles alternative names across 11+ million global locations.

Features

  • Embedded dataset: Ships with data included, no downloads required
  • Fast full-text search with Tantivy indexing
  • Complete administrative hierarchy resolution (country → state → county → place)
  • Multiple data sources (cities15000, cities5000, etc.) with smart fallback
  • Batch processing for high-throughput applications
  • Python and Rust APIs
  • Configurable search behavior and scoring
  • Alternative name resolution (e.g., "Deutschland" → "Germany")
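
To make the alternative-name feature concrete, here is a minimal sketch of the idea in plain Python. It is not Heisenberg's implementation: the `ALTERNATE_NAMES` table and the `canonical_name` helper are hypothetical, and the library itself draws alternative names from the GeoNames dataset rather than a hand-written dictionary.

```python
# Hypothetical alias table for illustration only; Heisenberg uses GeoNames data.
ALTERNATE_NAMES = {
    "deutschland": "Germany",
    "allemagne": "Germany",
    "nippon": "Japan",
}


def canonical_name(raw: str) -> str:
    """Return the canonical name for a known alias, else the trimmed input."""
    cleaned = raw.strip()
    # casefold() gives a case-insensitive lookup key.
    return ALTERNATE_NAMES.get(cleaned.casefold(), cleaned)


print(canonical_name("Deutschland"))  # Germany
print(canonical_name("Tokyo"))        # Tokyo
```

A real resolver also has to rank several candidates that share a name, which a single dictionary lookup cannot do.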

Quick Start

Python

pip install heisenberg

import heisenberg

# Create searcher instance
searcher = heisenberg.LocationSearcher()

# Simple search
results = searcher.find("Tokyo")
print(f"Found: {results[0].name}")

# Multi-term search (largest to smallest: Country, City)
results = searcher.find(["France", "Paris"])
print(f"Found: {results[0].name}")

# Resolve complete administrative hierarchy (largest to smallest: State, City)
resolved = searcher.resolve_location(["California", "San Francisco"])
context = resolved[0].context

print(f"Country: {context.admin0.name}")  # United States
print(f"State: {context.admin1.name}")    # California
print(f"County: {context.admin2.name}")   # San Francisco County
print(f"City: {context.place.name}")      # San Francisco

Rust

[dependencies]
heisenberg = "0.1"

use heisenberg::{LocationSearcher, DataSource};

// Create searcher using embedded data (fastest, no downloads)
let searcher = LocationSearcher::new_embedded()?;

// Or use specific data source with smart fallback
let searcher = LocationSearcher::initialize(DataSource::Cities15000)?;

// Simple search
let results = searcher.search(&["Tokyo"])?;
println!("Found: {}", results[0].name().unwrap_or("Unknown"));

// Resolve complete hierarchy (largest to smallest: Country, City)
let resolved = searcher.resolve_location(&["Germany", "Berlin"])?;
let context = &resolved[0].context;

if let Some(country) = &context.admin0 {
    println!("Country: {}", country.name());
}
if let Some(place) = &context.place {
    println!("City: {}", place.name());
}

Examples

The problem: inconsistent and incomplete location data.

Input (largest → smallest)    Output
"Florida"                     United States → Florida
["France", "Paris"]           France → Île-de-France → Paris
["CA", "San Francisco"]       United States → California → San Francisco County → San Francisco
"Deutschland"                 Germany (resolves alternative names)

Administrative Levels

  • Admin0: Countries
  • Admin1: States/Provinces
  • Admin2: Counties/Regions
  • Admin3: Local administrative divisions
  • Admin4: Sub-local administrative divisions
  • Places: Cities, towns, landmarks
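
One way to picture how these levels fit together is a small record with one optional slot per level. The sketch below is illustrative only: `LocationContext` and its `path` method are hypothetical names, not the types Heisenberg returns, although the field names mirror the `admin0`, `admin1`, `admin2`, and `place` attributes used in the Quick Start.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocationContext:
    """Illustrative stand-in for a resolved hierarchy; not the library's type."""
    admin0: Optional[str] = None  # country
    admin1: Optional[str] = None  # state / province
    admin2: Optional[str] = None  # county / region
    admin3: Optional[str] = None  # local administrative division
    admin4: Optional[str] = None  # sub-local administrative division
    place: Optional[str] = None   # city, town, landmark

    def path(self) -> str:
        """Join the levels that are present, largest to smallest."""
        levels = [self.admin0, self.admin1, self.admin2,
                  self.admin3, self.admin4, self.place]
        return " → ".join(name for name in levels if name)


ctx = LocationContext(
    admin0="United States",
    admin1="California",
    admin2="San Francisco County",
    place="San Francisco",
)
print(ctx.path())
# United States → California → San Francisco County → San Francisco
```

Levels that are missing are simply skipped, which is why a country-only result still renders as a valid path.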

Usage Examples

Batch Processing

# Note: Input order is largest to smallest (Country, City)
queries = [["Japan", "Tokyo"], ["UK", "London"], ["USA", "New York"]]
batch_results = searcher.find_batch(queries)
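
Batch input often starts life as loosely filled records rather than ready-made lists. The helper below is a sketch of one way to build largest-to-smallest queries from such records before handing them to `find_batch`. The record keys (`country`, `state`, `county`, `city`) and the `build_query` function are assumptions made for this example, not part of the Heisenberg API.

```python
# Hypothetical record keys, ordered largest to smallest.
LEVELS = ("country", "state", "county", "city")


def build_query(record):
    """Return the non-empty fields of a record, largest to smallest."""
    query = []
    for field in LEVELS:
        # Treat missing keys, None, and whitespace-only strings as absent.
        value = (record.get(field) or "").strip()
        if value:
            query.append(value)
    return query


records = [
    {"city": "Tokyo", "country": "Japan"},
    {"country": "UK", "state": "", "city": "London"},
    {"country": "USA", "state": None, "city": " New York "},
]
queries = [build_query(r) for r in records]
print(queries)
# [['Japan', 'Tokyo'], ['UK', 'London'], ['USA', 'New York']]

# The result has the same shape as the `queries` list above:
# batch_results = searcher.find_batch(queries)
```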

Configuration

# Fast search (fewer results, optimized for speed)
config = heisenberg.SearchConfigBuilder.fast().build()
results = searcher.find("Berlin", config)

# Comprehensive search (more results, higher accuracy)
config = heisenberg.SearchConfigBuilder.comprehensive().build()
results = searcher.find("Cambridge", config)

See examples/ for complete Rust examples and python/examples/ for Python examples.

Installation

Python

pip install heisenberg

Rust

[dependencies]
heisenberg = "0.1"

Data

Embedded by Default: Heisenberg ships with the Cities15000 dataset embedded (~25MB compressed), providing instant startup with no downloads required.

Multiple Data Sources: Choose from different datasets based on your needs:

  • Cities15000: Cities with population > 15,000 (default, embedded)
  • Cities5000: Cities with population > 5,000
  • Cities1000: Cities with population > 1,000
  • Cities500: Cities with population > 500
  • AllCountries: Complete GeoNames dataset (~1GB)

Smart Fallback: When requesting non-embedded datasets, Heisenberg automatically downloads and processes data on first use, then caches locally.
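
The download-once-then-cache behaviour described above follows a common pattern, sketched here in plain Python. This is not Heisenberg's code: `load_dataset`, the `.bin` file naming, and the `fake_fetch` stand-in for a network download are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_dataset(name: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> bytes:
    """Return cached data if present; otherwise fetch it and cache it."""
    cached = cache_dir / f"{name}.bin"
    if cached.exists():
        return cached.read_bytes()
    data = fetch(name)  # only reached on first use
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return data


calls = []


def fake_fetch(name: str) -> bytes:
    """Stand-in for a real download; records each call."""
    calls.append(name)
    return f"rows for {name}".encode()


with tempfile.TemporaryDirectory() as tmp:
    first = load_dataset("cities5000", Path(tmp), fake_fetch)
    second = load_dataset("cities5000", Path(tmp), fake_fetch)

print(len(calls))  # 1, the second call was served from the cache
```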

Development:

# Use embedded test data for development
USE_TEST_DATA=true cargo test

# Force regeneration of embedded data at build time
GENERATE_EMBEDDED_DATA=1 cargo build

# Use specific data source
EMBEDDED_DATA_SOURCE=cities5000 cargo build

Performance

  • Instant startup: Using embedded data (no download/processing time)
  • Search: ~1ms per query
  • Batch processing: 10-100x faster than individual queries
  • Memory: ~200MB RAM
  • Storage: ~25MB embedded + indexes, or ~1GB for larger datasets
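
These figures will vary with hardware and dataset, so it can be worth measuring them locally. The sketch below shows one simple way to estimate mean per-query latency. `mean_latency_ms` is a hypothetical helper and `fake_search` is a stand-in; in practice you would pass a callable that wraps `searcher.find`.

```python
import time
from typing import Callable, Sequence


def mean_latency_ms(search: Callable[[str], object],
                    queries: Sequence[str],
                    repeats: int = 5) -> float:
    """Average wall-clock time per call, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            search(query)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / (repeats * len(queries))


def fake_search(query: str) -> list:
    """Stand-in for a real search call."""
    return [query.upper()]


latency = mean_latency_ms(fake_search, ["Tokyo", "Berlin"])
print(f"{latency:.4f} ms per query")
```

Running the same harness over a single batched call versus a loop of individual calls is one way to check the batch speedup on your own data.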

License

MIT License - see LICENSE for details.
