| Crates.io | datafold |
| lib.rs | datafold |
| version | 0.1.55 |
| created_at | 2025-11-04 16:47:16.883182+00 |
| updated_at | 2026-01-20 20:35:21.603624+00 |
| description | A personal database for data sovereignty with AI-powered ingestion |
| homepage | https://datafold.ai |
| repository | https://github.com/shiba4life/fold_db |
| max_upload_size | |
| id | 1916560 |
| size | 4,458,142 |
A Rust-based distributed data platform with schema-based storage, AI-powered ingestion, and real-time data processing capabilities. DataFold provides a complete solution for distributed data management with automatic schema generation, field mapping, and extensible ingestion pipelines.
Download the latest release for your platform from GitHub Releases:
# macOS (Intel)
curl -LO https://github.com/shiba4life/fold_db/releases/latest/download/datafold_http_server-macos-x86_64-VERSION
chmod +x datafold_http_server-macos-x86_64-VERSION
sudo mv datafold_http_server-macos-x86_64-VERSION /usr/local/bin/datafold_http_server
# macOS (Apple Silicon)
curl -LO https://github.com/shiba4life/fold_db/releases/latest/download/datafold_http_server-macos-aarch64-VERSION
chmod +x datafold_http_server-macos-aarch64-VERSION
sudo mv datafold_http_server-macos-aarch64-VERSION /usr/local/bin/datafold_http_server
# Linux
curl -LO https://github.com/shiba4life/fold_db/releases/latest/download/datafold_http_server-linux-x86_64-VERSION
chmod +x datafold_http_server-linux-x86_64-VERSION
sudo mv datafold_http_server-linux-x86_64-VERSION /usr/local/bin/datafold_http_server
Replace VERSION with the actual version number (e.g., 0.1.5).
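To confirm the binary is installed and executable, you can check it directly (the --help flag is an assumption; most CLIs expose it):
# Verify the install
command -v datafold_http_server
datafold_http_server --help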
Add DataFold to your Cargo.toml:
[dependencies]
datafold = "0.1.0"
Or install the CLI tools:
cargo install datafold
This provides three binaries:
- datafold_cli - Command-line interface
- datafold_http_server - HTTP server with web UI
- datafold_node - P2P node server
The crate ships without generating TypeScript artifacts by default so it can compile cleanly in any environment. If you need the auto-generated bindings for the web UI, enable the ts-bindings feature when building or testing:
cargo build --features ts-bindings
The feature keeps the ts-rs dependency optional and writes the generated
definitions to the existing bindings/ directory just like the repository
version.
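If you consume datafold as a library dependency instead, the same feature can be enabled in Cargo.toml (a minimal sketch; the version number is illustrative):
[dependencies]
datafold = { version = "0.1.0", features = ["ts-bindings"] }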
use datafold::{DataFoldNode, IngestionCore, Schema};
use serde_json::json;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Initialize a DataFold node
let node = DataFoldNode::new_with_defaults().await?;
// Create an ingestion pipeline
let config = datafold::IngestionConfig::from_env_allow_empty();
let ingestion = IngestionCore::new(config)?;
// Process JSON data with automatic schema generation
let data = json!({
"name": "John Doe",
"email": "john@example.com",
"age": 30,
"preferences": {
"theme": "dark",
"notifications": true
}
});
let response = ingestion.process_json_ingestion(
datafold::IngestionRequest { data }
).await?;
println!("Ingestion result: {:?}", response);
Ok(())
}
# Start the HTTP server with web UI
datafold_http_server --port 9001
Then visit http://localhost:9001 for the web interface.
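To confirm the server is up without opening a browser, a plain header request against the root is enough (it simply fetches the web UI):
curl -I http://localhost:9001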
DataFold uses dynamic schemas that define data structure and operations:
use datafold::{Schema, Operation};
// Load a schema
let schema_json = std::fs::read_to_string("my_schema.json")?;
let schema: Schema = serde_json::from_str(&schema_json)?;
// Execute operations; `query_data` stands in for a query payload
// built elsewhere (e.g. deserialized from a JSON file)
let operation = Operation::Query(query_data);
let result = node.execute_operation(operation).await?;
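The on-disk schema format is defined by the crate; purely as an illustration of the kind of file my_schema.json might be, here is a hypothetical sketch (the field layout below is not the crate's actual schema grammar):
{
"name": "UserProfile",
"fields": {
"username": { "type": "string" },
"email": { "type": "string" }
}
}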
Automatically analyze and ingest data from any source:
use datafold::{IngestionConfig, IngestionCore};
// Configure with OpenRouter API
let config = IngestionConfig {
openrouter_api_key: Some("your-api-key".to_string()),
openrouter_model: "anthropic/claude-3.5-sonnet".to_string(),
..Default::default()
};
let ingestion = IngestionCore::new(config)?;
// Process any JSON data
let result = ingestion.process_json_ingestion(request).await?;
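Alternatively, the API key can come from the environment instead of being hard-coded, as in the quick start above. A minimal sketch, assuming OPENROUTER_API_KEY is exported before the process starts:
use datafold::{IngestionConfig, IngestionCore};
// Reads OPENROUTER_API_KEY and related settings from the environment,
// falling back to empty defaults when unset
let config = IngestionConfig::from_env_allow_empty();
let ingestion = IngestionCore::new(config)?;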
Connect nodes in a P2P network:
use datafold::{NetworkConfig, NetworkCore};
let network_config = NetworkConfig::default();
let network = NetworkCore::new(network_config).await?;
// Start networking
network.start().await?;
// Discover peers
let peers = network.discover_peers().await?;
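The discovered peers can then be inspected or logged; a small sketch, assuming the returned collection is iterable and its elements implement Debug:
// Log each discovered peer
for peer in &peers {
    println!("discovered peer: {:?}", peer);
}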
DataFold includes a comprehensive React frontend with a unified API client architecture that provides type-safe, standardized access to all backend operations.
The frontend uses specialized API clients that eliminate boilerplate code and provide consistent error handling, caching, and authentication:
import { schemaClient, securityClient, systemClient } from '../api/clients';
// Schema operations with automatic caching
const response = await schemaClient.getSchemas();
if (response.success) {
const schemas = response.data; // Fully typed SchemaData[]
}
// System monitoring with intelligent caching
const status = await systemClient.getSystemStatus(); // 30-second cache
// Security operations with built-in validation
const verification = await securityClient.verifyMessage(signedMessage);
import {
isNetworkError,
isAuthenticationError,
isSchemaStateError
} from '../api/core/errors';
try {
const response = await schemaClient.approveSchema('users');
} catch (error) {
if (isAuthenticationError(error)) {
redirectToLogin();
} else if (isSchemaStateError(error)) {
showMessage(`Schema "${error.schemaName}" is ${error.currentState}`);
} else {
showMessage(error.toUserMessage());
}
}
# Start the backend server
cargo run --bin datafold_http_server -- --port 9001
# In another terminal, start the React frontend
cd src/datafold_node/static-react
npm install
npm run dev
The frontend will be available at http://localhost:5173 with hot-reload.
DataFold supports ingesting data from various sources with the new adapter-based architecture:
See SOCIAL_MEDIA_INGESTION_PROPOSAL.md for the complete ingestion architecture.
DataFold provides three ways to ingest files:
1. Traditional File Upload
curl -X POST http://localhost:9001/api/ingestion/upload \
-F "file=@/path/to/local/file.json" \
-F "autoExecute=true"
2. S3 File Path (No Re-upload Required)
curl -X POST http://localhost:9001/api/ingestion/upload \
-F "s3FilePath=s3://my-bucket/path/to/file.json" \
-F "autoExecute=true"
3. Programmatic API (for Lambda/Rust code)
use datafold::ingestion::{ingest_from_s3_path_async, S3IngestionRequest};
// Async ingestion (returns immediately with progress_id)
let request = S3IngestionRequest::new("s3://bucket/file.json".to_string());
let response = ingest_from_s3_path_async(&request, &state).await?;
println!("Started: {}", response.progress_id.unwrap());
// Or sync ingestion (waits for completion)
use datafold::ingestion::ingest_from_s3_path_sync;
let response = ingest_from_s3_path_sync(&request, &state).await?;
println!("Complete: {} mutations", response.mutations_executed);
The S3 file path option allows you to process files already stored in S3 without uploading them again, saving bandwidth and time. This is particularly useful when the data already lives in S3, for example in AWS Lambda-driven pipelines.
Requirements for S3 file paths:
- Upload storage configured for S3 (DATAFOLD_UPLOAD_STORAGE_MODE=s3)
- s3:GetObject permissions on the referenced bucket
See the S3 File Path Ingestion Guide for complete documentation and the Lambda example for AWS Lambda integration.
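If the HTTP response mirrors the Rust response type above (an assumption; check your deployment's actual payload), the progress id from an async upload can be pulled out with jq:
curl -s -X POST http://localhost:9001/api/ingestion/upload \
  -F "s3FilePath=s3://my-bucket/path/to/file.json" \
  -F "autoExecute=true" | jq .progress_id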
# Clone the repository
git clone https://github.com/shiba4life/fold_db.git
cd fold_db
# Install dependencies
sudo apt install rustup
rustup default stable  # Installs the stable toolchain (includes cargo)
sudo apt install openssl libssl-dev pkg-config
# Build all components
cargo build --release --workspace
# Run tests
cargo test --workspace
For development with hot-reload:
# Start the Rust backend
cargo run --bin datafold_http_server -- --port 9001
# In another terminal, start the React frontend
cd src/datafold_node/static-react
npm install
npm run dev
The UI will be available at http://localhost:5173.
DataFold can run in serverless environments like AWS Lambda using S3-backed storage:
use datafold::{FoldDB, S3Config};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Configure S3 storage
let config = S3Config::new(
"my-folddb-bucket".to_string(),
"us-west-2".to_string(),
"production".to_string(),
);
// Database automatically downloads from S3 on startup
let db = FoldDB::new_with_s3(config).await?;
// Use normally - all operations are local
// ... queries, mutations, transforms ...
// Sync back to S3
db.flush_to_s3().await?;
Ok(())
}
Environment variable configuration:
# Database storage (Sled with S3 sync)
export DATAFOLD_STORAGE_MODE=s3
export DATAFOLD_S3_BUCKET=my-folddb-bucket
export DATAFOLD_S3_REGION=us-west-2
# Upload storage (for file ingestion)
export DATAFOLD_UPLOAD_STORAGE_MODE=s3
export DATAFOLD_UPLOAD_S3_BUCKET=my-uploads-bucket
export DATAFOLD_UPLOAD_S3_REGION=us-west-2
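If you construct the storage config yourself instead of relying on environment detection, the same variables map directly onto S3Config::new; a sketch, assuming the third argument is the key prefix as in the example above:
use datafold::S3Config;
// Build the storage config from the environment variables above
let bucket = std::env::var("DATAFOLD_S3_BUCKET")?;
let region = std::env::var("DATAFOLD_S3_REGION")?;
let config = S3Config::new(bucket, region, "production".to_string());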
See S3 Configuration Guide for complete setup instructions, AWS Lambda deployment, and cost optimization.
DataFold provides first-class support for AWS Lambda with a multi-tenant DynamoDB backend. This allows you to build serverless, user-isolated applications without managing servers.
Add the lambda feature to your Cargo.toml:
[dependencies]
datafold = { version = "0.1.0", features = ["lambda"] }
Initialize the LambdaContext with LambdaStorage::DynamoDb:
use datafold::lambda::{LambdaConfig, LambdaContext, LambdaStorage, LambdaLogging};
use datafold::storage::{DynamoDbConfig, ExplicitTables};
// Using ExplicitTables::from_prefix for convenience
let config = LambdaConfig::new(
LambdaStorage::DynamoDb(DynamoDbConfig {
region: "us-east-1".to_string(),
tables: ExplicitTables::from_prefix("MyApp"), // Creates: MyApp-main, MyApp-schemas, etc.
auto_create: true,
user_id: None,
}),
LambdaLogging::Stdout,
);
LambdaContext::init(config).await?;
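A typical binary initializes the context once per cold start and then serves events. A minimal sketch using the separate lambda_runtime crate (that crate and the JSON payload shape are assumptions, not part of datafold):
use lambda_runtime::{run, service_fn, Error, LambdaEvent};
use serde_json::{json, Value};

async fn handler(_event: LambdaEvent<Value>) -> Result<Value, Error> {
    // LambdaContext has already been initialized in main; handle the event here
    Ok(json!({ "ok": true }))
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Initialize DataFold once per cold start, with the config shown above
    // LambdaContext::init(config).await?;
    run(service_fn(handler)).await
}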
The system requires and automatically manages 11 tables per deployment. Using ExplicitTables::from_prefix("MyApp"), they are:
- MyApp-main (Data)
- MyApp-metadata
- MyApp-node_id_schema_permissions
- MyApp-transforms
- MyApp-orchestrator_state
- MyApp-schema_states
- MyApp-schemas
- MyApp-public_keys
- MyApp-transform_queue_tree
- MyApp-native_index
- MyApp-process (Process Tracking)
DataFold automatically handles multi-tenancy. When you pass a user_id to ingestion or node retrieval methods, operations are scoped to that user within the DynamoDB tables.
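For instance, scoping an entire context to one tenant can be done through the user_id field shown in the configuration above; per-call user_id parameters on ingestion and node retrieval methods follow the same idea (the user id value here is illustrative):
use datafold::storage::{DynamoDbConfig, ExplicitTables};

let storage = DynamoDbConfig {
    region: "us-east-1".to_string(),
    tables: ExplicitTables::from_prefix("MyApp"),
    auto_create: true,
    // Scope every operation made through this context to one tenant
    user_id: Some("user-123".to_string()),
};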
# Use the CLI to load a schema
datafold_cli load-schema examples/user_schema.json
# Query data
datafold_cli query examples/user_query.json
# Execute mutations
datafold_cli mutate examples/user_mutation.json
See the examples/ directory for more complete, runnable examples.
// Quick example: Ingest S3 file in Lambda
use datafold::ingestion::{ingest_from_s3_path_async, S3IngestionRequest};
let request = S3IngestionRequest::new("s3://bucket/data.json".to_string());
let response = ingest_from_s3_path_async(&request, &state).await?;
See datafold_api_examples/ for Python scripts demonstrating common API workflows.
DataFold uses JSON configuration files. Default config:
{
"storage_path": "data/db",
"default_trust_distance": 1,
"network": {
"port": 9000,
"enable_mdns": true
},
"ingestion": {
"enabled": true,
"openrouter_model": "anthropic/claude-3.5-sonnet"
}
}
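The configuration file path can come from DATAFOLD_CONFIG (listed below), and the file itself is plain JSON. A crate-agnostic sketch that just reads and parses it with serde_json (the config.json fallback path is illustrative):
use serde_json::Value;
// Resolve the config path, defaulting to a local file when unset
let path = std::env::var("DATAFOLD_CONFIG").unwrap_or_else(|_| "config.json".to_string());
let raw = std::fs::read_to_string(&path)?;
let config: Value = serde_json::from_str(&raw)?;
println!("storage_path = {}", config["storage_path"]);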
Environment variables:
- OPENROUTER_API_KEY - API key for AI-powered ingestion
- DATAFOLD_CONFIG - Path to configuration file
- DATAFOLD_LOG_LEVEL - Logging level (trace, debug, info, warn, error)
DataFold stores registered Ed25519 public keys in the sled database. When the node starts it loads all saved keys, and new keys are persisted as soon as they are registered. This keeps authentication intact across restarts. See the PBI SEC-8 documentation for implementation details.
We welcome contributions! Please see our contributing guidelines.
cargo test --workspace
This project is licensed under either of:
at your option.
DataFold - Distributed data platform for the modern world 🚀