# An ORC reader for Rust [![Rust build status](https://img.shields.io/github/workflow/status/travisbrown/orcrs/rust-ci.svg?label=rust)](https://github.com/travisbrown/orcrs/actions) [![Java build status](https://img.shields.io/github/workflow/status/travisbrown/orcrs/java-ci.svg?label=java)](https://github.com/travisbrown/orcrs/actions) [![Coverage status](https://img.shields.io/codecov/c/github/travisbrown/orcrs/main.svg)](https://codecov.io/github/travisbrown/orcrs)

This project contains tools for working with [Apache ORC][apache-orc] files from the [Rust programming language][rust]. ORC is an open source data format that lets you represent tables of data efficiently (think [CSV][csv], but with types, compression, indexing, etc.).

Please note that this software is **not** "open source", but the source is available for use and modification by individuals, non-profit organizations, and worker-owned businesses (see the [license section](#license) below for details).

## Example use case

I've recently been working with the [Twitter Stream Grab][twitter-stream-grab], a data set published by the [Archive Team][archive-team] and the [Internet Archive][internet-archive] that includes billions of tweets and Twitter user profiles collected between 2011 and 2021.

The Twitter Stream Grab is 5.2 terabytes of compressed JSON data, and around 50 terabytes uncompressed. It takes many hundreds of hours of computing time to parse this data, which makes repeated processing impractical for personal projects or for projects by activist groups with limited resources. Storing this much data can also be impractical: I personally spent several hundred dollars just getting a copy from the Internet Archive's servers to Berlin, and storing a (compressed) copy in [S3][s3] currently costs about $122 per month.

There are many kinds of derived data sets and products you might want to build from data like the Twitter Stream Grab. One example is this [collection of several million Twitter user profile snapshots][stop-the-steal] for accounts that were active in spreading false claims about voter fraud in 2020. I'm also running a [web service][memory-lol] that allows users to look up past screen names for Twitter accounts.

I'm using the ORC format to make it more practical to build projects like these from this data. The basic idea is that instead of re-processing the entire 50 terabytes of JSON data for each application, you parse it once to extract the user profiles (and other information) into a set of ORC tables. This intermediate representation is considerably more compact: for example, the original compressed data for December 2020 takes up about 60 gigabytes, but the ORC table I've built for that month's data only takes up about 21 gigabytes. This means that storing the ORC representation of the full ten years of data only costs around $40 per month using a service like S3, but more importantly it means that it's much, much cheaper and easier to process or query the data.
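As a rough sanity check on that $40 figure (this is back-of-the-envelope arithmetic, assuming the December 2020 compression ratio is representative of the whole data set, and the per-gigabyte S3 pricing implied by the $122 figure above, about $0.023 per gigabyte-month):

```math
5{,}200\ \text{GB} \times \frac{21\ \text{GB}}{60\ \text{GB}} \approx 1{,}820\ \text{GB}, \qquad 1{,}820\ \text{GB} \times \$0.023/\text{GB-month} \approx \$42\ \text{per month}
```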
AWS's [Athena][athena] lets you run SQL queries directly against ORC files stored in S3, for example. You can also use Athena to process CSV files in S3, but running _any_ SQL query against compressed CSV files for the entire Twitter Stream Grab would cost at least $2.50 (since all of the two or three terabytes of compressed data have to be scanned), while querying ORC in Athena generally costs a tiny fraction of that, since the ORC format makes it possible to avoid scanning data that isn't relevant to the query.

Products like Athena are useful for exploring data like the Twitter Stream Grab, and ORC makes this practical in terms of cost and time, but it's also possible to process the ORC files directly, so that instead of spending hundreds of hours of computing time to build a relational database of Twitter user info from the raw JSON data, for example, you can extract the data from the ORC files in a few hours.

## Why this project?

The ORC format was developed to be a native storage format for [Apache Hive][hive], which is built on [Hadoop][hadoop], which is firmly in the Java ecosystem. I personally find Hive extremely annoying and painful to work with, and I'd prefer not to write Java. There is also a [C++ API for ORC][orc-cpp], but I already have a fair amount of related tooling written in Rust, and I wanted to learn more about the internals of the ORC spec, so I decided to put together this implementation, which only took a couple of days.

## Use

The project currently provides a single command-line tool with two subcommands:

```
$ target/release/orcrs --help
orcrs 0.1.0
Travis Brown

USAGE:
    orcrs [OPTIONS] <SUBCOMMAND>

OPTIONS:
    -h, --help       Print help information
    -v, --verbose    Level of verbosity
    -V, --version    Print version information

SUBCOMMANDS:
    export    Export the contents of the ORC file
    help      Print this message or the help of the given subcommand(s)
    info      Dump raw info about the ORC file
```

To list all profiles for verified Twitter accounts from the provided sample data, for example:

```bash
$ target/release/orcrs -vvv export --header --columns 0,3,9 examples/ts-10k-2020-09-20.orc | egrep -v "(false|,)$"
id,screen_name,verified
561595762,morinaga_pino,true
1746230882607849472,weareoneEXO,true
29363584,Sandi,true
2067989391190130694,WayV_official,true
36764368,AdamParkhomenko,true
53970806,stephengrovesjr,true
15327404,fox32news,true
1678598579585548288,Mippcivzla,true
158278844,fadlizon,true
79721594,alfredodelmazo,true
```

This tool can currently export around 10 million rows of this data from an 886 megabyte ORC file (representing one day from 2020) in about 6 seconds:

```bash
$ time target/release/orcrs -vvv export --header --columns 0,3,9 /data/tsg/users/v2/2020-09-20.orc | wc
 9705227  9705227 314287998

real    0m5.088s
user    0m6.048s
sys     0m0.349s
```

This is currently completely unoptimized and could be made at least a little faster.

## Features

This project currently only supports _reading_ ORC files (writing will probably stay out of scope unless I switch to using bindings to the ORC C++ API at some point).

| Feature               | Status             | Notes                                     |
|-----------------------|--------------------|-------------------------------------------|
| Integer types         | :heavy_check_mark: |                                           |
| String types          | :heavy_check_mark: |                                           |
| Floating point types  | ❌                 | Coming soon                               |
| Date types            | ❌                 |                                           |
| Compound types        | ❌                 |                                           |
| Zlib compression      | :heavy_check_mark: |                                           |
| Zstandard compression | :heavy_check_mark: |                                           |
| Snappy compression    | ❌                 | Probably trivial                          |
| Column encryption     | ❌                 | Almost certainly permanently out of scope |

Also note that right now these tools don't use the indices: you see every row in the file. So far this is fast enough for the things I need to do, but that will probably change in the future.
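For a sense of what the low-level pieces behind the supported features look like, here's a minimal sketch of the byte run-length decoding described in the [ORCv1 spec][orc-spec] (the first test value is the spec's own example). This is an illustration of the encoding, not this crate's actual implementation:

```rust
/// Decode ORC's byte run-length encoding (see the "Byte Run Length Encoding"
/// section of the ORCv1 spec). A header byte in 0..=127 encodes a run of
/// (header + 3) copies of the byte that follows; a header byte in 128..=255
/// (i.e. negative as an i8) encodes (256 - header) literal bytes that follow.
fn decode_byte_rle(input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let header = input[i];
        i += 1;
        if header < 128 {
            // Run: repeat the next byte (header + 3) times.
            let value = *input.get(i)?;
            i += 1;
            out.extend(std::iter::repeat(value).take(header as usize + 3));
        } else {
            // Literals: copy (256 - header) bytes verbatim.
            let count = 256 - header as usize;
            out.extend_from_slice(input.get(i..i + count)?);
            i += count;
        }
    }
    Some(out)
}

fn main() {
    // The spec's example: header 0x61 (= 100 - 3) followed by 0x00
    // decodes to 100 copies of 0x00.
    assert_eq!(decode_byte_rle(&[0x61, 0x00]), Some(vec![0x00; 100]));
    // Header 0xfd (-3 as an i8) followed by three literal bytes.
    assert_eq!(decode_byte_rle(&[0xfd, 0x44, 0x45, 0x46]), Some(b"DEF".to_vec()));
}
```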
## Known issues

This software is largely untested, undocumented, and unoptimized.

## Developing

You'll need to install [Rust and Cargo][cargo] to build the project. Once you've got them, you can check out this repository and run `cargo test` (to run the tests) and `cargo build --release` (to build the command-line tool, which will be available as `target/release/orcrs`).

~~The [Protobuf schemas for the metadata in the ORC file][orc-proto] are not distributed with this repository, but they will be downloaded to `$OUT_DIR/proto/` during the build. You can update this file as needed either manually or by changing the commit in `build.rs`.~~ I got frustrated after 15 minutes of trying to figure out how to make the Protobuf code generation work properly with the build, so it's gone. You'll need to copy the `scripts/build.rs` file into the project directory in order to update the Protobuf schemas (but this shouldn't be necessary very often).

This repository also includes a Java project with some code that I used for generating ORC test data during development.

## Previous work

There's a partial implementation of a few pieces of an ORC reader for Rust [here][scritchley-orcrs]. I've borrowed a couple of test cases for the byte run-length encoding reader, but my implementation is otherwise unrelated.

## Future work

I'll probably continue to add support for ORC format features as I need them. Eventually it'd be nice to have Rust bindings for the C++ API, and I may end up doing that here.

## License

This software is published under the [Anti-Capitalist Software License][acsl] (v. 1.4).

[acsl]: https://anticapitalist.software/
[apache-orc]: https://orc.apache.org/
[archive-team]: https://wiki.archiveteam.org/
[athena]: https://aws.amazon.com/athena
[cargo]: https://doc.rust-lang.org/cargo/getting-started/installation.html
[csv]: https://en.wikipedia.org/wiki/Comma-separated_values
[hadoop]: https://en.wikipedia.org/wiki/Apache_Hadoop
[hive]: https://en.wikipedia.org/wiki/Apache_Hive
[internet-archive]: https://archive.org/
[memory-lol]: https://twitter.com/travisbrown/status/1466414144261918721
[orc-cpp]: https://orc.apache.org/docs/core-cpp.html
[orc-proto]: https://github.com/apache/orc/blob/main/proto/orc_proto.proto
[orc-spec]: https://orc.apache.org/specification/ORCv1/
[rust]: https://www.rust-lang.org/
[s3]: https://aws.amazon.com/s3/
[scritchley-orcrs]: https://github.com/scritchley/orcrs
[stop-the-steal]: https://github.com/travisbrown/stop-the-steal
[twitter-stream-grab]: https://archive.org/details/twitterstream