Crates.io | arff |
lib.rs | arff |
version | 0.3.0 |
source | src |
created_at | 2018-05-24 20:15:59.394522 |
updated_at | 2018-06-20 21:15:11.044217 |
description | ARFF file format serializer and deserializer |
repository | https://github.com/mbillingr/arff |
id | 66927 |
size | 158,391 |
ARFF file format serializer and deserializer
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. Its main use is in data science to store tabular data: rows are instances and columns are attributes. Metadata such as attribute (column) names, data types, and comments is part of the file.
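To make the layout concrete, here is a minimal sketch that renders a purely numeric table as ARFF text using only the standard library. The function `render_numeric_arff` is a hypothetical helper written for illustration; it is not part of the arff crate, and the exact whitespace the crate emits may differ.

```rust
/// Renders a table of numeric columns as ARFF text.
/// Hypothetical helper for illustration; not part of the `arff` crate.
fn render_numeric_arff(relation: &str, columns: &[&str], rows: &[Vec<f64>]) -> String {
    // The header starts with the name of the data set.
    let mut out = format!("@RELATION {}\n\n", relation);
    // One @ATTRIBUTE line per column carries the column name and type.
    for name in columns {
        out.push_str(&format!("@ATTRIBUTE {} NUMERIC\n", name));
    }
    out.push_str("\n@DATA\n");
    // Each instance (row) becomes one comma-separated line.
    for row in rows {
        let cells: Vec<String> = row.iter().map(|v| v.to_string()).collect();
        out.push_str(&cells.join(", "));
        out.push('\n');
    }
    out
}
```

Calling `render_numeric_arff("Data", &["a", "b"], &[vec![42.0, 9.0], vec![7.0, 5.0]])` produces a header with two NUMERIC attributes followed by two data lines.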
The ARFF crate uses Serde to serialize and deserialize certain Rust types. The file format is relatively simple, so not all Rust types are supported. As a rule of thumb, data must be represented as a sequence of rows, and each row is either a struct with named columns or a sequence of static length.
[dependencies]
arff = "0.3"
extern crate arff;

#[macro_use]
extern crate serde_derive;

fn main() {
    let input = "
@RELATION Data
@ATTRIBUTE a NUMERIC
@ATTRIBUTE b NUMERIC
@DATA
42, 9
7, 5";

    #[derive(Debug, Deserialize)]
    struct NamedRow {
        b: i32, // order of fields does not matter
        a: i32,
    }

    // Rows as structs: columns are matched to fields by name.
    let named_data: Vec<NamedRow> = arff::from_str(input).unwrap();
    println!("{:?}", named_data);

    // Rows as fixed-length arrays: columns are taken in file order.
    let unnamed_data: Vec<[i32; 2]> = arff::from_str(input).unwrap();
    println!("{:?}", unnamed_data);
}
The tabular ARFF data is represented in Rust as a sequence of rows. Columns may have different data types, but every row must have the same type.
If Row is a valid type that can be serialized into a data row, the following types can be serialized as ARFF data sets:
Vec<Row>
&[Row]
[Row; N]
(Row, Row, Row, ...)
By default the data set is named unnamed_data. You can give the data set a different name by wrapping it in a newtype struct. For example, MyData(Vec<Row>) is represented in ARFF as
@RELATION MyData
...
A tuple struct is serialized like a tuple wrapped in a newtype struct. MoreData((Row, Row, Row)) is equivalent to MoreData(Row, Row, Row).
Valid types for the Row data format are:
#[derive(Serialize)] struct Row { ... }
type Row = (i32, f64, bool, String, ...);
type Row<T> = [T; N];
It is possible to have nested sequences in rows. These will be flattened during serialization. For example, [[i32; 2]; 2], (i32, [i32; 2], i32), and [i32; 4] result in equivalent serializations.
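The flattening rule can be sketched with the standard library alone. The helpers `flatten_row` and `data_line` below are hypothetical and only illustrate the idea; the crate performs the flattening inside its Serde serializer, and the `", "` separator is an assumption.

```rust
/// Flattens a row of nested fixed-size arrays into a single list of columns.
/// Illustrative only; not the crate's implementation.
fn flatten_row(nested: &[[i32; 2]; 2]) -> Vec<i32> {
    nested.iter().flat_map(|inner| inner.iter().copied()).collect()
}

/// Formats already-flattened cells as one comma-separated data line.
/// The separator is an assumption made for this sketch.
fn data_line(cells: &[i32]) -> String {
    cells
        .iter()
        .map(|c| c.to_string())
        .collect::<Vec<_>>()
        .join(", ")
}
```

With these helpers, the nested row `[[1, 2], [3, 4]]` and the flat row `[1, 2, 3, 4]` yield the same data line, which is the sense in which the nested and flat types are equivalent.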
Nested structs are currently not supported because they could result in ambiguous column names.
ARFF supports NUMERIC, STRING, and NOMINAL data types. The serializer performs the following mappings from Rust types to ARFF types:
NUMERIC: i8, i16, i32, i64, u8, u16, u32, u64, f32, f64
STRING: String, &str
NOMINAL: enum
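As an illustration of how the three ARFF types appear in a file header, the following sketch renders `@ATTRIBUTE` declarations. `ArffType` and `attribute_line` are hypothetical names introduced here, not items exported by the crate, and the exact text the crate writes may differ. The brace-enclosed value list for nominal attributes follows the usual ARFF syntax.

```rust
/// The three ARFF attribute kinds named in the text.
/// Hypothetical type for illustration; not part of the `arff` crate.
enum ArffType {
    Numeric,
    String,
    /// A nominal attribute lists its allowed values.
    Nominal(Vec<&'static str>),
}

/// Renders an `@ATTRIBUTE` declaration for one column.
fn attribute_line(name: &str, ty: &ArffType) -> String {
    let ty_text = match ty {
        ArffType::Numeric => "NUMERIC".to_string(),
        ArffType::String => "STRING".to_string(),
        // Nominal values are written as a brace-enclosed list.
        ArffType::Nominal(values) => format!("{{{}}}", values.join(", ")),
    };
    format!("@ATTRIBUTE {} {}", name, ty_text)
}
```

For example, a Rust enum with variants `red` and `green` would correspond to `ArffType::Nominal(vec!["red", "green"])` in this sketch.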
Missing values are encoded as ? in ARFF. Option::None is mapped to ?, while Option::Some(T) is unwrapped and serialized according to the rules above.
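The Option mapping can be sketched in a few lines of standard-library Rust. `encode_cell` and `encode_row` are hypothetical helpers for illustration only, and the `", "` separator is an assumption rather than the crate's documented output.

```rust
/// Encodes an optional cell: `None` becomes `?`, `Some(v)` is written as `v`.
/// Hypothetical helper; not part of the `arff` crate.
fn encode_cell<T: std::fmt::Display>(value: &Option<T>) -> String {
    match value {
        Some(v) => v.to_string(),
        None => "?".to_string(),
    }
}

/// Encodes a whole row of optional cells as one data line.
fn encode_row<T: std::fmt::Display>(row: &[Option<T>]) -> String {
    row.iter().map(encode_cell).collect::<Vec<_>>().join(", ")
}
```

A row such as `[Some(1), None, Some(3)]` would then be written with a `?` in the middle column.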
If Row is a valid type that can be deserialized into a data row, the following types can be deserialized from ARFF:
Vec<Row>
[Row; N]
(Row, Row, Row, ...)
The data set name is ignored during deserialization. It is possible to wrap the above types in a newtype struct, but the name of the type is not checked against the data set name.
Valid types for deserializing a data row are:
#[derive(Deserialize)] struct Row { ... }
type Row<T> = [T; N];
type Row = (i32, f64, String, bool, ...);
type Row<T> = Vec<T>;
Nested sequences will be flattened, similar to serialization.
The deserializer ignores the ARFF type description and tries to parse each value as the Rust type expected by the targeted data structure. If this is not possible, an Error is returned.
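The idea that the target type, not the declared ARFF type, drives parsing can be sketched with `str::parse`. `parse_line` is a hypothetical helper, not the crate's parser; it ignores quoting and escaping and uses the standard library's parse errors instead of the crate's Error type.

```rust
/// Parses every cell of one data line as the type `T` requested by the caller.
/// The declared ARFF attribute type is never consulted.
/// Hypothetical helper; quoting and escaping are not handled.
fn parse_line<T: std::str::FromStr>(line: &str) -> Result<Vec<T>, T::Err> {
    line.split(',')
        .map(|cell| cell.trim().parse::<T>())
        // Collecting into `Result` stops at the first cell that fails to parse.
        .collect()
}
```

The same line `42, 9` can be read as `Vec<i32>` or as `Vec<f64>` depending on the requested type, while a line such as `42, abc` fails for either numeric type.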
Columns that can contain missing values need to be wrapped in an Option, so that an encoded ? is parsed as None.
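A small sketch of why the Option wrapper is needed: without it, `?` is just text that fails to parse as a number. `parse_optional` is a hypothetical standard-library helper used for illustration; it is not the crate's deserializer.

```rust
/// Parses one cell, treating the ARFF missing-value marker `?` as `None`.
/// Hypothetical helper; not part of the `arff` crate.
fn parse_optional<T: std::str::FromStr>(cell: &str) -> Result<Option<T>, T::Err> {
    let cell = cell.trim();
    if cell == "?" {
        // A missing value maps to `None` instead of a parse error.
        Ok(None)
    } else {
        cell.parse::<T>().map(Some)
    }
}
```

Here `parse_optional::<i32>("?")` succeeds with `None`, whereas `"?".parse::<i32>()` is an error, which mirrors the difference between an `Option<i32>` column and a plain `i32` column.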
The ARFF crate is licensed under either of
Apache License, Version 2.0
MIT license
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the ARFF crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.