| Field | Value |
| --- | --- |
| Crate (crates.io / lib.rs) | sas-lexer |
| version | 1.0.0-beta.3 |
| source | src |
| created_at | 2024-11-05 05:42:08.086514 |
| updated_at | 2024-11-08 02:59:07.386186 |
| description | Ultra fast "correct" static context-aware parsing SAS code lexer. |
| repository | https://github.com/mishamsk/sas-lexer |
| id | 1436054 |
| size | 382,748 |
Ultra fast "correct" static context-aware parsing SAS code lexer.

Available in two flavors: a Rust crate and a Python package.
The key limitation is that the lexer is static, meaning it does not execute the code. It is possible to write SAS code that cannot be statically tokenized the same way the SAS scanner would tokenize it; hence the need for some heuristics. However, you are unlikely to run into these limitations in practice. For example:
- `%let v='01jan87'd;` will lex `'01jan87'd` as a `DateLiteral` token instead of a `MacroString`.
- `%mcall( arg value )` will have a `MacroString` token with the text `arg value`.
- In real SAS, `%macro $bad` will cause whatever follows, up to `%mend`, to be skipped. This lexer does not do this, and will instead try to recover and continue lexing.
- SAS recovers from a missing `=` in `%let a 1;`, but it will not recover from a missing `)` in `%macro a(a=1;`, while this lexer will recover from both.

SAS has thousands of keywords, and none of them are reserved. All fans of columns named `when`, rejoice: you can finally execute SQL that looks like this: `select case when when = 42 then then else else end from table`!

Thus the selection of keywords that are lexed as a dedicated token type vs. as an identifier is somewhat arbitrary, based on personal experience of writing parsers for SAS code.
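To observe the first divergence above for yourself, here is a small sketch that reuses the exact API from the Rust usage example further down, just with the `%let` snippet as input:

```rust
use sas_lexer::{lex_program, LexResult, TokenIdx};

fn main() {
    // The date-literal edge case: this lexer emits a DateLiteral token
    // for '01jan87'd, where SAS itself would see macro text.
    let source = "%let v='01jan87'd;";
    let LexResult { buffer, .. } = lex_program(&source).unwrap();

    let tokens: Vec<TokenIdx> = buffer.into_iter().collect();

    // Print the raw source text behind each token.
    for token in tokens {
        println!("{:?}", buffer.get_token_raw_text(token, &source));
    }
}
```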
You can add the Rust crate as a dependency via Cargo:

```sh
cargo add sas-lexer
```

For Python, install the package using pip:

```sh
pip install sas-lexer
```
Example usage from Rust:

```rust
use sas_lexer::{lex_program, LexResult, TokenIdx};

fn main() {
    let source = "data mydata; set mydataset; run;";

    // Lex the full program; `buffer` holds the tokenized output.
    let LexResult { buffer, .. } = lex_program(&source).unwrap();

    let tokens: Vec<TokenIdx> = buffer.into_iter().collect();

    // Print the raw source text behind each token.
    for token in tokens {
        println!("{:?}", buffer.get_token_raw_text(token, &source));
    }
}
```
The Rust crate has the following optional Cargo features (see the `cargo add` example after this list):

- `macro_sep`: Enables a special virtual `MacroSep` token that is emitted between open code and macro statements when there is no "natural" separator, or when a semicolon is missing between two macro statements (a coding error). This may be used by a downstream parser as a reliable terminating token for dynamic open code, and thus avoid doing lookaheads. Dynamic means that the statement has macro statements in it, like `data %if cond %then %do; t1 %end; %else %do; t2 %end;;`.
- `serde`: Enables serialization and deserialization of the `ResolvedTokenInfo` struct using the `serde` library. For an example of usage, see the Python bindings crate `sas-lexer-py`.
- `opti_stats`: Enables some additional statistics during lexing, used for performance tuning. Not intended for general use.
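Features are enabled the usual Cargo way; for example, to turn on `macro_sep` and `serde` when adding the dependency (a minimal sketch; swap in whichever features you need):

```sh
cargo add sas-lexer --features macro_sep,serde
```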
From Python:

```python
from sas_lexer import lex_program_from_str

tokens, errors, str_lit_buf = lex_program_from_str(
    "data mydata; set mydataset; run;"
)

for token in tokens:
    print(token)
```
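Note that `lex_program_from_str` also returns the collected lexing errors. The exact shape of the error objects is not documented here, so this sketch relies only on their string representation:

```python
# Illustrative only: report any errors collected by the call above.
if errors:
    for error in errors:
        print(f"lexing error: {error}")
```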
Whether it is because the Dragon Book had not been published when the language was conceived, or due to the deep and unwavering love of its users, the SAS language allows for almost anything, except perhaps brewing your coffee in the morning. Although I wouldn't be surprised if that turned out to be another undocumented feature.
If you think I am exaggerating, read on.
THIS SECTION IS WIP. PLANNED CONTENT:

- `%sysfunc`/`%syscall` function-aware lexing

Why build a modern lexer specifically for the SAS language? Mostly for fun! SAS is possibly the most complicated programming language in the world to parse statically. I have worked with it for many years as part of my day job, which eventually included a transpiler from SAS to PySpark. I wanted to see how fast a complex context-aware lexer could theoretically be, and SAS seemed like a perfect candidate for this experiment.
This project is licensed under the AGPL-3.0. If you are interested in using the lexer for commercial purposes, please reach out to me for further discussion.
We welcome contributions in the form of issues, feature requests, and feedback! However, due to licensing complexities, we are not currently accepting pull requests. Please feel free to open an issue for any proposals or suggestions.
SAS code samples used for testing are included in the `tests` directory without modifications.