| Crates.io | xml2arrow |
| lib.rs | xml2arrow |
| version | 0.10.0 |
| created_at | 2024-12-29 13:47:59.40204+00 |
| updated_at | 2025-05-29 16:14:00.153593+00 |
| description | Efficiently convert XML data to Apache Arrow format for high-performance data processing |
| homepage | |
| repository | https://github.com/mluttikh/xml2arrow |
| max_upload_size | |
| id | 1498232 |
| size | 131,024 |
A Rust crate for efficiently converting XML data to Apache Arrow format.
A Python version of this library is also available on GitHub: https://github.com/mluttikh/xml2arrow-python
xml2arrow provides a high-performance way to transform XML documents into Apache Arrow tables. It uses the quick-xml parser for XML processing and the arrow crate for building Arrow data structures. This makes it well suited to handling large XML datasets and feeding them into data processing pipelines built on the Arrow ecosystem.
xml2arrow converts XML data to Apache Arrow format using a YAML configuration file.
The YAML configuration defines the mapping between your XML structure and Arrow tables and fields.
tables:
  - name: <table_name>         # The name of the resulting Arrow table
    xml_path: <xml_path>       # The XML path to the *parent* element of the table's row elements
    levels:                    # Index levels for nested XML structures
      - <level1>
      - <level2>
    fields:
      - name: <field_name>     # The name of the Arrow field
        xml_path: <field_path> # The XML path to the field within a row
        data_type: <data_type> # The Arrow data type (see below)
        nullable: <true|false> # Whether the field can be null
        scale: <number>        # Optional scaling factor for floats
        offset: <number>       # Optional offset for floats
  - name: ...                  # Define additional tables as needed
- tables: A list of table configurations. Each entry defines a separate Arrow table.
  - name: The name of the resulting Arrow RecordBatch (table).
  - xml_path: An XPath-like string specifying the parent element of the row elements. For example, for <library><book>...</book><book>...</book></library>, the xml_path would be /library.
  - levels: An array of strings representing parent tables for creating indexes in nested structures. For /library/shelves/shelf/books/book, use levels: ["shelves", "books"]. This creates indexes named <shelves> and <books>.
  - fields: A list of field configurations (columns) for the Arrow table.
    - name: The name of the field in the Arrow schema.
    - xml_path: An XPath-like string selecting the field's value. Use @ to select attributes (e.g., /library/book/@id).
    - data_type: The Arrow data type. Supported types:
      - Boolean (false, true, 0 or 1)
      - Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64
      - Float32, Float64
      - Utf8 (strings)
    - nullable (Optional): Whether the field can be null (defaults to false).
    - scale (Optional): A scaling factor for float fields.
    - offset (Optional): An offset value for float fields.

use std::fs::File;
use std::io::BufReader;
use xml2arrow::{Config, parse_xml};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = Config::from_yaml_file("config.yaml")?;
    let file = File::open("data.xml")?;
    let reader = BufReader::new(file);
    let record_batches = parse_xml(reader, &config)?;
    // Process the record batches...
    Ok(())
}
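The scale and offset field options described above can be illustrated with a small standalone sketch. The helper name `apply_scale_offset` is hypothetical and is not part of xml2arrow's API, and the order of operations (scale first, then offset) is an assumption made here for illustration; the crate's own behavior should be checked when both options are set on the same field.

```rust
/// Hypothetical helper, not part of xml2arrow's API: applies an optional
/// scale and an optional offset to a parsed float value.
/// Assumption: the scale is applied before the offset.
fn apply_scale_offset(raw: f64, scale: Option<f64>, offset: Option<f64>) -> f64 {
    raw * scale.unwrap_or(1.0) + offset.unwrap_or(0.0)
}

fn main() {
    // Offset only: degrees Celsius to kelvin.
    let kelvin = apply_scale_offset(25.0, None, Some(273.15));
    // Scale only: hectopascal to pascal.
    let pascal = apply_scale_offset(1013.25, Some(100.0), None);
    println!("{kelvin} K, {pascal} Pa");
}
```

When only one of the two options is set, as in this sketch, the order of operations does not affect the result.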
This example demonstrates how to convert meteorological station data from XML to Arrow format.
XML data (stations.xml):
<report>
  <header>
    <title>Meteorological Station Data</title>
    <created_by>National Weather Service</created_by>
    <creation_time>2024-12-30T13:59:15Z</creation_time>
  </header>
  <monitoring_stations>
    <monitoring_station id="MS001">
      <location>
        <latitude>-61.39110459389277</latitude>
        <longitude>48.08662749089257</longitude>
        <elevation>547.1050788360882</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature unit="C">35.486545480326114</temperature>
          <pressure unit="hPa">950.439973486407</pressure>
          <humidity unit="%">49.77716576844861</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature unit="C">29.095166644493865</temperature>
          <pressure unit="hPa">1049.3215015450517</pressure>
          <humidity unit="%">32.5687148391251</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Arctic Tundra area, used for Scientific Research.</description>
        <install_date>2024-03-31</install_date>
      </metadata>
    </monitoring_station>
    <monitoring_station id="MS002">
      <location>
        <latitude>11.891496388319311</latitude>
        <longitude>135.09336983543022</longitude>
        <elevation unit="m">174.53349357280004</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature unit="C">24.791842953632283</temperature>
          <pressure unit="hPa">989.4054287187706</pressure>
          <humidity unit="%">57.70794884397625</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature unit="C">15.153690541845911</temperature>
          <pressure unit="hPa">1001.413052919951</pressure>
          <humidity unit="%">45.45094598045342</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:49:15Z</timestamp>
          <temperature unit="C">-4.022555715139081</temperature>
          <pressure unit="hPa">1000.5225751769922</pressure>
          <humidity unit="%">70.40117458947834</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:54:15Z</timestamp>
          <temperature unit="C">25.852920542644185</temperature>
          <pressure unit="hPa">953.762785698162</pressure>
          <humidity unit="%">42.62088244545566</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Desert area, used for Weather Forecasting.</description>
        <install_date>2024-01-17</install_date>
      </metadata>
    </monitoring_station>
  </monitoring_stations>
</report>
YAML configuration (stations.yaml):
tables:
  - name: report
    xml_path: /
    levels: []
    fields:
      - name: title
        xml_path: /report/header/title
        data_type: Utf8
        nullable: false
      - name: created_by
        xml_path: /report/header/created_by
        data_type: Utf8
        nullable: false
      - name: creation_time
        xml_path: /report/header/creation_time
        data_type: Utf8
        nullable: false
  - name: stations
    xml_path: /report/monitoring_stations
    levels:
      - station
    fields:
      - name: id
        xml_path: /report/monitoring_stations/monitoring_station/@id # Path to an attribute
        data_type: Utf8
        nullable: false
      - name: latitude
        xml_path: /report/monitoring_stations/monitoring_station/location/latitude
        data_type: Float32
        nullable: false
      - name: longitude
        xml_path: /report/monitoring_stations/monitoring_station/location/longitude
        data_type: Float32
        nullable: false
      - name: elevation
        xml_path: /report/monitoring_stations/monitoring_station/location/elevation
        data_type: Float32
        nullable: false
      - name: description
        xml_path: /report/monitoring_stations/monitoring_station/metadata/description
        data_type: Utf8
        nullable: false
      - name: install_date
        xml_path: /report/monitoring_stations/monitoring_station/metadata/install_date
        data_type: Utf8
        nullable: false
  - name: measurements
    xml_path: /report/monitoring_stations/monitoring_station/measurements
    levels:
      - station # Link to the 'stations' table by element order
      - measurement
    fields:
      - name: timestamp
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/timestamp
        data_type: Utf8
        nullable: false
      - name: temperature
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/temperature
        data_type: Float64
        nullable: false
        offset: 273.15 # Convert from Celsius to Kelvin
      - name: pressure
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/pressure
        data_type: Float64
        nullable: false
        scale: 100.0 # Convert from hPa to Pa
      - name: humidity
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/humidity
        data_type: Float64
        nullable: false

use std::fs::File;
use std::io::BufReader;
use xml2arrow::{Config, parse_xml};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = Config::from_yaml_file("stations.yaml")?;
    let file = File::open("stations.xml")?;
    let reader = BufReader::new(file);
    let record_batches = parse_xml(reader, &config)?;
    // Accessing the record batches (example)
    for (name, batch) in record_batches {
        // Process the record batches...
    }
    Ok(())
}
- report:
┌─────────────────────────────┬──────────────────────────┬──────────────────────┐
│ title                       ┆ created_by               ┆ creation_time        │
│ ---                         ┆ ---                      ┆ ---                  │
│ str                         ┆ str                      ┆ str                  │
╞═════════════════════════════╪══════════════════════════╪══════════════════════╡
│ Meteorological Station Data ┆ National Weather Service ┆ 2024-12-30T13:59:15Z │
└─────────────────────────────┴──────────────────────────┴──────────────────────┘
- stations:
┌───────────┬───────┬────────────┬────────────┬────────────┬────────────────────────┬──────────────┐
│ <station> ┆ id    ┆ latitude   ┆ longitude  ┆ elevation  ┆ description            ┆ install_date │
│ ---       ┆ ---   ┆ ---        ┆ ---        ┆ ---        ┆ ---                    ┆ ---          │
│ u32       ┆ str   ┆ f32        ┆ f32        ┆ f32        ┆ str                    ┆ str          │
╞═══════════╪═══════╪════════════╪════════════╪════════════╪════════════════════════╪══════════════╡
│ 0         ┆ MS001 ┆ -61.391106 ┆ 48.086628  ┆ 547.105103 ┆ Located in the Arctic  ┆ 2024-03-31   │
│           ┆       ┆            ┆            ┆            ┆ Tundra a…              ┆              │
│ 1         ┆ MS002 ┆ 11.891497  ┆ 135.093369 ┆ 174.533493 ┆ Located in the Desert  ┆ 2024-01-17   │
│           ┆       ┆            ┆            ┆            ┆ area, us…              ┆              │
└───────────┴───────┴────────────┴────────────┴────────────┴────────────────────────┴──────────────┘
- measurements:
┌───────────┬───────────────┬──────────────────────┬─────────────┬───────────────┬───────────┐
│ <station> ┆ <measurement> ┆ timestamp            ┆ temperature ┆ pressure      ┆ humidity  │
│ ---       ┆ ---           ┆ ---                  ┆ ---         ┆ ---           ┆ ---       │
│ u32       ┆ u32           ┆ str                  ┆ f64         ┆ f64           ┆ f64       │
╞═══════════╪═══════════════╪══════════════════════╪═════════════╪═══════════════╪═══════════╡
│ 0         ┆ 0             ┆ 2024-12-30T12:39:15Z ┆ 308.636545  ┆ 95043.997349  ┆ 49.777166 │
│ 0         ┆ 1             ┆ 2024-12-30T12:44:15Z ┆ 302.245167  ┆ 104932.150155 ┆ 32.568715 │
│ 1         ┆ 0             ┆ 2024-12-30T12:39:15Z ┆ 297.941843  ┆ 98940.542872  ┆ 57.707949 │
│ 1         ┆ 1             ┆ 2024-12-30T12:44:15Z ┆ 288.303691  ┆ 100141.305292 ┆ 45.450946 │
│ 1         ┆ 2             ┆ 2024-12-30T12:49:15Z ┆ 269.127444  ┆ 100052.257518 ┆ 70.401175 │
│ 1         ┆ 3             ┆ 2024-12-30T12:54:15Z ┆ 299.002921  ┆ 95376.27857   ┆ 42.620882 │
└───────────┴───────────────┴──────────────────────┴─────────────┴───────────────┴───────────┘
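
The <station> and <measurement> index columns in the measurements table follow element order: the parent index counts monitoring_station elements, and the child index restarts at 0 for each station. A minimal standalone sketch of that numbering is shown here. The function `level_indexes` is hypothetical and only illustrates the pattern visible in the output; it is not xml2arrow's implementation.

```rust
/// Hypothetical helper, not xml2arrow's implementation: given the number of
/// child rows under each parent element, returns (parent_index, child_index)
/// pairs in document order. The child index restarts at 0 for every parent.
fn level_indexes(children_per_parent: &[u32]) -> Vec<(u32, u32)> {
    let mut rows = Vec::new();
    for (parent, &count) in children_per_parent.iter().enumerate() {
        for child in 0..count {
            rows.push((parent as u32, child));
        }
    }
    rows
}

fn main() {
    // In stations.xml, MS001 has 2 measurements and MS002 has 4.
    for (station, measurement) in level_indexes(&[2, 4]) {
        println!("<station>={station} <measurement>={measurement}");
    }
}
```

Joining the measurements table to the stations table on the <station> column then recovers which station each measurement belongs to.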