Crates.io | array-object |
lib.rs | array-object |
version | 0.2.2 |
source | src |
created_at | 2024-10-13 19:59:43.874578 |
updated_at | 2024-12-01 13:24:29.626636 |
description | Self-describing binary format for arrays of integers, real numbers, complex numbers and strings, designed for object storage, database and single file |
homepage | |
repository | https://github.com/YShoji-HEP/ArrayObject |
max_upload_size | |
id | 1407645 |
size | 122,758 |
Self-describing binary format for arrays of integers, real numbers, complex numbers and strings, designed for object storage, database and single file.
ArrayObject
is a part of dbgbb
project.
Vec<_>
, [T; N]
, ndarray
and nalgebra
are supported.Encoding and decording:
use array_object::*;
fn main() {
// Convert data into binary
let original = vec![1u32, 2, 3, 4];
let obj: ArrayObject = original.clone().try_into().unwrap();
let packed = obj.pack(); // This converts the data into Vec<u8>.
// Restore data
let unpacked = ArrayObject::unpack(packed).unwrap();
let inflated: Vec<u32> = unpacked.try_into().unwrap();
assert_eq!(original, inflated);
}
One can also use the macros to write and read a file:
use array_object::*;
fn main() {
// Save into a file
let original = vec![1f64, 2.2, -1.1, 5.6];
export_obj!("testdata.bin", original.clone()); // The type has to be known at this point. If the file exists, this will overwrite the existing file.
// Load from a file
let restored: Vec<f64> = import_obj!("testdata.bin"); // The type annotation is required.
assert_eq!(original, restored);
}
Feature | Description |
---|---|
allow_float_down_convert |
Allow implicit conversion such as from f64 to f32 . |
ndarray_15 |
Enable ndarray support. The compatible version is 0.15.x. |
ndarray_16 |
Enable ndarray support. The compatible version is 0.16.x. |
nalgebra |
Enable nalgebra support. Confirmed to work with version 0.33.0. |
The data format is automatically selected to minimize the datasize.
Integer is either unsigned or signed, which is determined when [ArrayObject
] is constructed. The zigzag encoding is used for signed integers. When restored to a variable, data is automatically converted into the desired integer type if the ranges overlap.
Short Integer (5bit)
The data is stored in the same byte as the footer. Thus the total data size is only one byte.
Variable Length (8 x n bit)
The integer is shortened to 8bit x (smallest number).
Currently 32bit and 64bit floating numbers are supported.
Only UTF-8 string is allowed, in particular, the non-UTF value of 0xFF is used internally and should be avoided.
A simple case is to store a 2D array of numbers similar to a CSV file. Unlike a CSV file, ArrayObject
can store arrays of more than two-dimensions, has a smaller filesize, and is fast to read and write.
Instead, it is not possible to append data to ArrayObject
or to have different types of data in a single ArrayObject
. The definition of ArrayObject
is a chunk of data that is not separable. Hence there is no sense in appending data or having different types. In other words, ArrayObject
is the data that we want to load into memory simultaneously, not line by line.
Another use is an object for object storage, which is actually what the ArrayObject
is for. The object storage does the things that ArrayObject
cannot: it allows you to append the data and store different types of data in a storage. This is why ArrayObject
should not have these capabilities: having multiple options to do the same thing would complicate the system.
Instead, with a minimal footer, it does what the object storage cannot: type abstraction, forced type checking and type-dependent compression.
A raw binary file has no type checking system. Integers, for example, can be unsigned or signed, 32bit or 64bit, little endian or big endian, etc. Without type checking, one has to check that both reader and writer are using the same format.
When it comes to an array, it becomes more complicated: the raw memory order can be row major or column major depending on the programing language, and the real part and the imaginary part of complex numbers can be assigned to the most inner or the most outer index.
ArrayObject
has a well defined spec and one does not need to worry about these things.
ArrayObject
is a simple, compact and portable data format for a single array, but is not a database. Thus, it does not have complicated structure like dataset or group. Even so, ArrayObject
is self-contained: it knows how to inflate itself and we do not need to feed any information to read the data.
Instead, ArrayObject
does not have information of datasize, name, timestamp, permission, etc, which are supposed to be managed by object storage or filesystem. ArrayObject
is rather closer to a CSV file: the program knows how to read it, but does not know what the data is about, when it is created, or what are the column and row, without accessing the metadata stored outside of the file.
ArrayObject
forbids nested structures: it can be an array of numbers or strings, but not of ArrayObject
s themselves. It also cannot be a tuple. Such structures are supposed to be provided by the storage.
Serialization typically adds the datasize to each data to indicate the boundaries of the data. This information is not necessary for storing data in a file or a object storage because it is already managed by the storage. Instead, ArrayObject
stores the information about the shape of the array.
A more technical difference is that ArrayObject
adds a footer instead of a header, allowing metadata to be separated at no cost.
The relation between complex numbers and real numbers is the same as that of real numbers and integers. A subset of complex numbers is real numbers and there is a well-defined map betwen them where they overlap. This makes a difference between a complex number and a vague array of length 2.
Practically, in dynamically typed languages, the results of functions like sqrt or log yield either real numbers or complex numbers depending on the sign of the argument. It is thus useful to manifestly indicate complex numbers as a type and keep the same array shape.
In addition, it is somewhat cumbersome to convert an array of complex numbers into an array of real numbers. ArrayObject
provides a handy export/import option for complex numbers.
Vec<Vec<_>>
?Vec<Vec<_>>
may contain vectors of different lengths and does not fit into ArrayObject
.
In general, if you work with an fixed length array, it is much efficient to use a crate like ndarray or nalgebra.
If necessary, for ArrayObject
s having the same size, shape and type, .try_concat()
method is available for Vec<ArrayObject>
, which generates a one-dimensional higher array.
On the other hand, if the lengths are different, it can be stored as multiple ArrayObject
s in an object storage. See dbgbb
for such an implementation.