Crates.io | aqueducts-utils |
lib.rs | aqueducts-utils |
version | |
source | src |
created_at | 2024-05-21 12:32:34.457614 |
updated_at | 2024-12-09 14:41:16.47582 |
description | Framework to build ETL data pipelines declaratively |
homepage | https://github.com/vigimite/aqueducts |
repository | https://github.com/vigimite/aqueducts |
max_upload_size | |
id | 1246806 |
Cargo.toml error: | TOML parse error at line 18, column 1 | 18 | autolib = false | ^^^^^^^ unknown field `autolib`, expected one of `name`, `version`, `edition`, `authors`, `description`, `readme`, `license`, `repository`, `homepage`, `documentation`, `build`, `resolver`, `links`, `default-run`, `default_dash_run`, `rust-version`, `rust_dash_version`, `rust_version`, `license-file`, `license_dash_file`, `license_file`, `licenseFile`, `license_capital_file`, `forced-target`, `forced_dash_target`, `autobins`, `autotests`, `autoexamples`, `autobenches`, `publish`, `metadata`, `keywords`, `categories`, `exclude`, `include` |
size | 0 |
Aqueducts is a framework to write and execute ETL data pipelines declaratively.
Features:
This framework builds on the fantastic work done by projects such as:
Please show these projects some support :heart:!
You can find the docs at https://vigimite.github.io/aqueducts
Change log: CHANGELOG
To define and execute an Aqueduct pipeline there are a couple of options
You can check out some examples in the examples directory. Here is a simple example defining an Aqueduct pipeline using the yaml config format link:
sources:
# Register a local file source containing temperature readings for various cities
- type: File
name: temp_readings
file_type:
type: Csv
options: {}
# use built-in templating functionality
location: ./examples/temp_readings_${month}_${year}.csv
#Register a local file source containing a mapping between location_ids and location names
- type: File
name: locations
file_type:
type: Csv
options: {}
location: ./examples/location_dict.csv
stages:
# Query to aggregate temperature data by date and location
- - name: aggregated
query: >
SELECT
cast(timestamp as date) date,
location_id,
round(min(temperature_c),2) min_temp_c,
round(min(humidity),2) min_humidity,
round(max(temperature_c),2) max_temp_c,
round(max(humidity),2) max_humidity,
round(avg(temperature_c),2) avg_temp_c,
round(avg(humidity),2) avg_humidity
FROM temp_readings
GROUP by 1,2
ORDER by 1 asc
# print the query plan to stdout for debugging purposes
explain: true
# Enrich aggregation with the location name
- - name: enriched
query: >
SELECT
date,
location_name,
min_temp_c,
max_temp_c,
avg_temp_c,
min_humidity,
max_humidity,
avg_humidity
FROM aggregated
JOIN locations
ON aggregated.location_id = locations.location_id
ORDER BY date, location_name
# print 10 rows to stdout for debugging purposes
show: 10
# Write the pipeline result to a parquet file (can be omitted if you don't want an output)
destination:
type: File
name: results
file_type:
type: Parquet
options: {}
location: ./examples/output_${month}_${year}.parquet
This repository contains a minimal example implementation of the Aqueducts framework which can be used to test out pipeline definitions like the one above:
cargo install aqueducts-cli
aqueducts --file examples/aqueduct_pipeline_example.yml --param year=2024 --param month=jan