Crates.io | lib-ruby-parser |
lib.rs | lib-ruby-parser |
version | 4.0.6+ruby-3.1.2 |
source | src |
created_at | 2020-11-11 14:39:53.160254 |
updated_at | 2024-04-02 00:14:19.270035 |
description | Ruby parser |
homepage | |
repository | https://github.com/lib-ruby-parser/lib-ruby-parser |
max_upload_size | |
id | 311250 |
size | 1,230,147 |
lib-ruby-parser is a Ruby parser written in Rust.
Basic usage:
use lib_ruby_parser::{Parser, ParserOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let options = ParserOptions {
        buffer_name: "(eval)".to_string(),
        ..Default::default()
    };
    let mut parser = Parser::new(b"2 + 2".to_vec(), options);
    println!("{:#?}", parser.do_parse());
    Ok(())
}
TL;DR: it's fast, it's precise, and it has a beautiful interface.
Comparison with Ripper/RubyVM::AST:
- It is based on MRI's parse.y, and so it returns exactly the same sequence of tokens.
- It has been tested against the rubyspec and ruby/ruby repos, and there's no difference with Ripper.lex.
- It is faster than Ripper: Ripper parses 4M LOC in ~24s, while lib-ruby-parser does it in ~4.5s. That's ~950K LOC/s. You can find benchmarks in the bench/ directory; they don't include any IO or GC.
Comparison with whitequark/parser:
Testing corpus has 4,176,379
LOC and 170,114,575
bytes so approximate parsing speed on my local machine is:
Parser | Total time | Bytes per second | Lines per second |
---|---|---|---|
lib-ruby-parser | ~4.4s | ~38,000,000 | ~950,000 |
ripper | ~24s | ~7,000,000 | ~175,000 |
whitequark/parser | ~245s | ~700,000 | ~17,000 |
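The per-second columns follow from dividing the corpus size by the total time. A minimal sketch of that arithmetic, using the corpus figures and the rounded times from the table (the `Throughput` struct and `throughput` function are illustrative names, not part of the library):

```rust
/// Throughput of one parser over the whole corpus.
struct Throughput {
    bytes_per_sec: f64,
    lines_per_sec: f64,
}

/// Divides the corpus size by the total wall-clock time in seconds.
fn throughput(total_bytes: u64, total_lines: u64, seconds: f64) -> Throughput {
    Throughput {
        bytes_per_sec: total_bytes as f64 / seconds,
        lines_per_sec: total_lines as f64 / seconds,
    }
}

fn main() {
    // Corpus size quoted in the text above.
    const BYTES: u64 = 170_114_575;
    const LINES: u64 = 4_176_379;

    // Approximate total times taken from the table.
    let timings = [
        ("lib-ruby-parser", 4.4),
        ("ripper", 24.0),
        ("whitequark/parser", 245.0),
    ];

    for (name, secs) in timings {
        let t = throughput(BYTES, LINES, secs);
        println!(
            "{name}: {:.0} bytes/s, {:.0} lines/s",
            t.bytes_per_sec, t.lines_per_sec
        );
    }
}
```

The table rounds these results, so the printed numbers will differ slightly from the listed ones.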
lib-ruby-parser follows MRI/master. There are no plans to support multiple versions like it's done in whitequark/parser.
Ruby version | lib-ruby-parser version |
---|---|
3.0.0 | 3.0.0+ |
3.1.0 | 4.0.0+ruby-3.1.0 |
Starting from 4.0.0, lib-ruby-parser follows SemVer. The base version increments according to API changes, while the build metadata matches the current Ruby version, i.e. X.Y.Z+ruby-A.B.C means:
- X.Y.Z is the base version
- A.B.C is the Ruby version
Both versions are bumped separately.
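To make the version scheme concrete, here is a small sketch that splits such a version string into its two parts. `split_version` is a hypothetical helper written for this illustration; it is not provided by the library.

```rust
/// Splits a version such as "4.0.6+ruby-3.1.2" into the base version
/// and the Ruby version carried in the build metadata.
/// Returns `None` if the "+ruby-" metadata is missing.
fn split_version(version: &str) -> Option<(&str, &str)> {
    let (base, metadata) = version.split_once('+')?;
    let ruby = metadata.strip_prefix("ruby-")?;
    Some((base, ruby))
}

fn main() {
    match split_version("4.0.6+ruby-3.1.2") {
        Some((base, ruby)) => println!("base version {base}, Ruby version {ruby}"),
        None => println!("no Ruby metadata"),
    }
}
```

Under SemVer, build metadata is ignored when comparing versions, so only the X.Y.Z part takes part in dependency resolution.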
By default lib-ruby-parser can only parse source files encoded in UTF-8 or ASCII-8BIT/BINARY.
It's possible to pass a decoder function in ParserOptions that takes a recognized (by the library) encoding and a byte array. It must return a UTF-8 encoded byte array or an error:
use lib_ruby_parser::source::{Decoder, DecoderResult, InputError};
use lib_ruby_parser::{LocExt, Parser, ParserOptions, ParserResult};

// Decoder callback: receives the encoding name and the raw bytes,
// and must return UTF-8 bytes or an error.
fn decode(encoding: String, input: Vec<u8>) -> DecoderResult {
    if "US-ASCII" == encoding.to_uppercase() {
        // reencode and return Ok(result)
        return DecoderResult::Ok(b"# encoding: us-ascii\ndecoded".to_vec());
    }
    DecoderResult::Err(InputError::DecodingError(
        "only us-ascii is supported".to_string(),
    ))
}

fn main() {
    let options = ParserOptions {
        decoder: Some(Decoder::new(Box::new(decode))),
        ..Default::default()
    };
    let mut parser = Parser::new(b"# encoding: us-ascii\n3 + 3".to_vec(), options);
    let ParserResult { ast, input, .. } = parser.do_parse();
    assert_eq!(ast.unwrap().expression().source(&input).unwrap(), "decoded".to_string());
}
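The example above returns a fixed string instead of actually re-encoding. As a sketch of what a real re-encoding step might look like, the function below converts ISO-8859-1 input to UTF-8 using only the standard library. `decode_latin1` is a hypothetical helper; it returns a plain `Result` rather than the library's `DecoderResult` so that it stays self-contained, and the encoding name the library would pass for such a file is an assumption.

```rust
/// Hypothetical helper: re-encodes ISO-8859-1 input as UTF-8.
/// Uses a plain `Result` instead of the library's `DecoderResult`.
fn decode_latin1(encoding: &str, input: &[u8]) -> Result<Vec<u8>, String> {
    // Assumption: the encoding arrives as a name such as "ISO-8859-1".
    if !encoding.eq_ignore_ascii_case("ISO-8859-1") {
        return Err(format!("unsupported encoding: {encoding}"));
    }
    // Every ISO-8859-1 byte maps to the Unicode code point with the same value.
    let decoded: String = input.iter().map(|&byte| char::from(byte)).collect();
    Ok(decoded.into_bytes())
}

fn main() {
    // 0xE9 is "é" in ISO-8859-1; in UTF-8 it becomes the two bytes 0xC3 0xA9.
    let utf8 = decode_latin1("iso-8859-1", b"caf\xE9").unwrap();
    assert_eq!(utf8, "café".as_bytes());
    println!("{:?}", utf8);
}
```

Inside a real decoder callback the `Ok` and `Err` branches would map onto `DecoderResult::Ok` and `DecoderResult::Err(InputError::DecodingError(..))`, as in the example above.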
Ruby doesn't require string literals to be valid in their encodings. This is why the following code is valid:
# encoding: utf-8
"\xFF"
The byte 255 (0xFF) is invalid in UTF-8, but MRI ignores it.
Not all languages support this, which is why string and symbol nodes encapsulate a custom StringValue instead of a plain String.
If your language supports invalid strings, you can use the raw .bytes of this StringValue. For example, a Ruby wrapper for this library could do that.
If your language doesn't support them, it is better to call .to_string_lossy(), which replaces all unsupported chars with a special U+FFFD REPLACEMENT CHARACTER (�).
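The difference between strict and lossy conversion can be seen with the standard library alone. This sketch uses `String::from_utf8` and `String::from_utf8_lossy` from `std` as an analogy; it does not touch `StringValue`, and the `lossy` helper is an illustrative name.

```rust
/// Converts bytes to a `String`, replacing invalid UTF-8 with U+FFFD.
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // The byte 0xFF can never appear in well-formed UTF-8.
    let raw: Vec<u8> = vec![b'a', 0xFF, b'b'];

    // Strict conversion rejects the input.
    assert!(String::from_utf8(raw.clone()).is_err());

    // Lossy conversion substitutes U+FFFD for the invalid byte.
    let text = lossy(&raw);
    assert_eq!(text, "a\u{FFFD}b");
    println!("{text}");
}
```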
Ruby constructs regexes from literals during parsing, in part to validate them.
To mirror this behavior, lib-ruby-parser uses Oniguruma to compile, validate and parse regex literals.
This feature is disabled by default, but you can turn it on by enabling the "onig" feature.
The grammar of lib-ruby-parser is built using a custom bison skeleton that was written for this project.
For development you need the latest version of Bison installed locally. Of course, it's not necessary for release builds from crates.io (because the compiled parser.rs is included in the release build AND the build.rs that converts it is excluded).
If you use it from GitHub directly you also need Bison (because parser.rs is under gitignore).
You can use the parse example:
$ cargo run --bin parse --features=bin-parse -- --print=N --run-profiler --glob "blob/**/*.rb"
A codebase of 4M LOC can be generated using the download.rb script:
$ ruby gems/download.rb
Then, run a script that compares Ripper and lib-ruby-parser (attached results are from Mar 2024):
$ ./scripts/bench.sh
Running lib-ruby-parser
Run 1:
Time taken: 4.4287733330 (total files: 17895)
Run 2:
Time taken: 4.4292764170 (total files: 17895)
Run 3:
Time taken: 4.4460961250 (total files: 17895)
Run 4:
Time taken: 4.4284508330 (total files: 17895)
Run 5:
Time taken: 4.4695665830 (total files: 17895)
--------
Running MRI/ripper
Run 1:
Time taken: 24.790103999897838 (total files: 17894)
Run 2:
Time taken: 23.145863000303507 (total files: 17894)
Run 3:
Time taken: 25.50493900012225 (total files: 17894)
Run 4:
Time taken: 24.570900999940932 (total files: 17894)
Run 5:
Time taken: 26.0963700003922 (total files: 17894)
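To summarize the runs above, the following sketch averages each set of timings and derives the speedup. The values are copied from the output and rounded to four decimals; `mean` is an illustrative helper, not part of the benchmark scripts.

```rust
/// Arithmetic mean of a non-empty slice of samples.
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

fn main() {
    // Timings in seconds, copied from the benchmark output (rounded).
    let lib_ruby_parser = [4.4288, 4.4293, 4.4461, 4.4285, 4.4696];
    let ripper = [24.7901, 23.1459, 25.5049, 24.5709, 26.0964];

    let speedup = mean(&ripper) / mean(&lib_ruby_parser);
    println!("lib-ruby-parser mean: {:.2}s", mean(&lib_ruby_parser));
    println!("ripper mean: {:.2}s", mean(&ripper));
    println!("speedup: {:.1}x", speedup);
}
```

For these five runs the ratio comes out at roughly 5.6x.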
First, make sure to switch to nightly:
$ rustup default nightly
Then install cargo-fuzz:
$ cargo install cargo-fuzz
And run the fuzzer (change the number of --jobs as needed, or remove the flag to run a single process):
$ RUST_BACKTRACE=1 cargo fuzz run parse --jobs=8 -- -max_len=50