| Crates.io | heisenberg-data-processing |
| lib.rs | heisenberg-data-processing |
| version | 0.2.0 |
| created_at | 2025-10-28 18:00:18.938184+00 |
| updated_at | 2025-10-28 18:00:18.938184+00 |
| description | Data processing pipeline for Heisenberg location enrichment library. Includes data downloading, extraction, transformation, and loading the raw data into a queryable format that gets embedded into the Heisenberg library. |
| homepage | |
| repository | https://github.com/SamBroomy/heisenberg |
| max_upload_size | |
| id | 1905347 |
| size | 1,632,413 |
Location enrichment library for converting unstructured location data into structured administrative hierarchies.
Heisenberg transforms incomplete location data into complete administrative hierarchies using the GeoNames dataset. It resolves ambiguous place names, fills missing administrative context, and handles alternative names across 11+ million global locations.
pip install heisenberg
import heisenberg
# Create searcher instance
searcher = heisenberg.LocationSearcher()
# Simple search
results = searcher.find("Tokyo")
print(f"Found: {results[0].name}")
# Multi-term search (largest to smallest: Country, City)
results = searcher.find(["France", "Paris"])
print(f"Found: {results[0].name}")
# Resolve complete administrative hierarchy (largest to smallest: State, City)
resolved = searcher.resolve_location(["California", "San Francisco"])
context = resolved[0].context
print(f"Country: {context.admin0.name}") # United States
print(f"State: {context.admin1.name}") # California
print(f"County: {context.admin2.name}") # San Francisco County
print(f"City: {context.place.name}") # San Francisco
[dependencies]
heisenberg = "0.1"
use heisenberg::{LocationSearcher, DataSource};
// Create searcher using embedded data (fastest, no downloads)
let searcher = LocationSearcher::new_embedded()?;
// Or use specific data source with smart fallback
let searcher = LocationSearcher::initialize(DataSource::Cities15000)?;
// Simple search
let results = searcher.search(&["Tokyo"])?;
println!("Found: {}", results[0].name().unwrap_or("Unknown"));
// Resolve complete hierarchy (largest to smallest: Country, City)
let resolved = searcher.resolve_location(&["Germany", "Berlin"])?;
let context = &resolved[0].context;
if let Some(country) = &context.admin0 {
    println!("Country: {}", country.name());
}
if let Some(place) = &context.place {
    println!("City: {}", place.name());
}
The problem: inconsistent and incomplete location data.
| Input (largest → smallest) | Output |
|---|---|
| "Florida" | United States → Florida |
| ["France", "Paris"] | France → Île-de-France → Paris |
| ["CA", "San Francisco"] | United States → California → San Francisco County → San Francisco |
| "Deutschland" | Germany (resolves alternative names) |
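The rows in the table above can be read as a walk down an administrative tree: the last input term is the target, and any broader terms must be among its ancestors. The following is a minimal, self-contained sketch of that idea over a tiny hand-written gazetteer. It is not Heisenberg's actual algorithm or API; `TOY_PLACES`, `ALT_NAMES`, `canonical`, `ancestors`, and `resolve` are hypothetical names invented for this illustration.

```python
# Toy illustration only: a hand-written gazetteer, not GeoNames data
# and not Heisenberg's real resolution logic.

# Each place maps to its parent (None for a top-level country).
TOY_PLACES = {
    "United States": None,
    "Florida": "United States",
    "California": "United States",
    "San Francisco County": "California",
    "San Francisco": "San Francisco County",
    "France": None,
    "Île-de-France": "France",
    "Paris": "Île-de-France",
    "Germany": None,
}

# Assumed alternative names and abbreviations mapped to canonical names.
ALT_NAMES = {"CA": "California", "Deutschland": "Germany"}


def canonical(term):
    """Map an alternative name to its canonical form, if one is known."""
    return ALT_NAMES.get(term, term)


def ancestors(name):
    """Return the chain from the top-level country down to `name`."""
    chain = []
    while name is not None:
        chain.append(name)
        name = TOY_PLACES[name]
    return chain[::-1]


def resolve(terms):
    """Resolve terms (largest to smallest) into a full hierarchy, or None."""
    if isinstance(terms, str):
        terms = [terms]
    names = [canonical(t) for t in terms]
    target = names[-1]
    if target not in TOY_PLACES:
        return None
    chain = ancestors(target)
    # Every broader term must be a strict ancestor of the target, in order.
    position = 0
    for broader in names[:-1]:
        try:
            position = chain.index(broader, position) + 1
        except ValueError:
            return None
    if position >= len(chain):
        return None
    return chain


print(" → ".join(resolve(["CA", "San Francisco"])))
# United States → California → San Francisco County → San Francisco
```

A real resolver also has to rank competing candidates that share a name, which this sketch deliberately leaves out.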
# Note: Input order is largest to smallest (Country, City)
queries = [["Japan", "Tokyo"], ["UK", "London"], ["USA", "New York"]]
batch_results = searcher.find_batch(queries)
# Fast search (fewer results, optimized for speed)
config = heisenberg.SearchConfigBuilder.fast().build()
results = searcher.find("Berlin", config)
# Comprehensive search (more results, higher accuracy)
config = heisenberg.SearchConfigBuilder.comprehensive().build()
results = searcher.find("Cambridge", config)
See examples/ for complete Rust examples and python/examples/ for Python examples.
Embedded by Default: Heisenberg ships with the Cities15000 dataset embedded (~25MB compressed), providing instant startup with no downloads required.
Multiple Data Sources: Choose from different datasets based on your needs:
- Cities15000: Cities with population > 15,000 (default, embedded)
- Cities5000: Cities with population > 5,000
- Cities1000: Cities with population > 1,000
- Cities500: Cities with population > 500
- AllCountries: Complete GeoNames dataset (~1GB)

Smart Fallback: When requesting non-embedded datasets, Heisenberg automatically downloads and processes data on first use, then caches locally.
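The embedded-first, download-once-then-cache behaviour described above might look roughly like the sketch below. It illustrates the general pattern under stated assumptions and is not Heisenberg's implementation: `load_dataset`, `EMBEDDED`, the cache file layout, and the `fetch` callback are all hypothetical, and no real download takes place.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for data compiled into the library.
EMBEDDED = {"cities15000": b"embedded-cities15000"}


def load_dataset(name, cache_dir, fetch):
    """Return dataset bytes: embedded first, then local cache, then fetch.

    `fetch` is a caller-supplied function (name -> bytes) standing in for
    the download-and-process step; it runs only on a cache miss.
    """
    if name in EMBEDDED:
        return EMBEDDED[name]  # no I/O needed

    cache_file = Path(cache_dir) / f"{name}.bin"
    if cache_file.exists():
        return cache_file.read_bytes()  # written by an earlier call

    data = fetch(name)  # first use: download and process
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(data)
    return data


# Usage with a fake fetcher that records how often it is called.
calls = []


def fake_fetch(name):
    calls.append(name)
    return f"processed-{name}".encode()


with tempfile.TemporaryDirectory() as tmp:
    first = load_dataset("cities5000", tmp, fake_fetch)      # cache miss
    second = load_dataset("cities5000", tmp, fake_fetch)     # served from cache
    default = load_dataset("cities15000", tmp, fake_fetch)   # embedded

print(len(calls))  # 1
```

Passing the fetch step in as a callback keeps the caching logic testable without network access.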
Development:
# Use embedded test data for development
USE_TEST_DATA=true cargo test
# Force regeneration of embedded data at build time
GENERATE_EMBEDDED_DATA=1 cargo build
# Use specific data source
EMBEDDED_DATA_SOURCE=cities5000 cargo build
MIT License - see LICENSE for details.