# scanlex - a simple lexical scanner.

## The Problem of Input

It is easier to write things out than to read them in, since more things can go wrong when reading. The read may fail, the text may not be valid UTF-8, and a number may be malformed or simply out of range.

## Lexical Scanners

Lexical scanners split a stream of characters into _tokens_. Tokens are returned by repeatedly calling the `get` method of `Scanner` (which will return `Token::End` if no tokens are left) or by iterating over the scanner. They represent numbers, characters, identifiers, or single/double quoted strings. There is also `Token::Error` to indicate a badly formed token.

This lexical scanner makes some assumptions; for example, a number may not be directly followed by a letter. No attempt is made in this version to decode C-style escape codes in strings. All whitespace is ignored. It's intended for processing generic structured data, rather than code.

For example, the string "hello 'dolly' * 42" will be broken into four tokens:

- an _identifier_ 'hello'
- a quoted string 'dolly'
- a character '*'
- and a number 42

```rust
extern crate scanlex;
use scanlex::{Scanner,Token};

let mut scan = Scanner::new("hello 'dolly' * 42");
assert_eq!(scan.get(),Token::Iden("hello".into()));
assert_eq!(scan.get(),Token::Str("dolly".into()));
assert_eq!(scan.get(),Token::Char('*'));
assert_eq!(scan.get(),Token::Int(42));
assert_eq!(scan.get(),Token::End);
```

To extract the values, use code like this:

```rust
let greeting = scan.get_iden()?;
let person = scan.get_string()?;
let op = scan.get_char()?;
let answer = scan.get_integer()?; // i64
```

`Scanner` implements `Iterator`. If you just want to extract the words from a string, then filtering with `as_iden` will do the trick, since it returns `Option`.

```rust
let s = Scanner::new("bonzo 42 dog (cat)");
let v: Vec<_> = s.filter_map(|t| t.as_iden()).collect();
assert_eq!(v,&["bonzo","dog","cat"]);
```

Using `as_number` instead, the same strategy extracts all the numbers from a document, ignoring all other structure. The `scan.rs` example shows you the tokens that would be generated by parsing the given string on the command-line.

This iterator only stops at `Token::End` - you can handle `Token::Error` yourself.

Usually it's important _not_ to ignore structure. Say we have input strings that look like this "(WORD) = NUMBER":

```rust
scan.skip_chars("(")?;
let word = scan.get_iden()?;
scan.skip_chars(")=")?;
let num = scan.get_number()?;
```

_Any_ of these calls may fail!

It is a common pattern to create a scanner for each line of text read from a readable source. The `scanline.rs` example shows how to use `ScanLines` to accomplish this.

```rust
let f = File::open("scanline.rs").expect("cannot open scanline.rs");
let mut iter = ScanLines::new(&f);
while let Some(s) = iter.next() {
    let mut s = s.expect("cannot read line");
    // show the first token of each line
    println!("{:?}",s.get());
}
```

A more serious example (taken from the tests) is parsing JSON:

```rust
use std::collections::HashMap;
use scanlex::{Scanner,Token,ScanError};

type JsonArray = Vec<Box<Value>>;
type JsonObject = HashMap<String,Box<Value>>;

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
    Arr(JsonArray),
    Obj(JsonObject),
    Null
}

fn scan_json(scan: &mut Scanner) -> Result<Value,ScanError> {
    use Value::*;
    match scan.get() {
        Token::Str(s) => Ok(Str(s)),
        Token::Num(x) => Ok(Num(x)),
        Token::Int(n) => Ok(Num(n as f64)),
        Token::End => Err(scan.scan_error("unexpected end of input",None)),
        Token::Error(e) => Err(e),
        // the only bare words JSON allows are null, true and false
        Token::Iden(s) =>
            if s == "null" {Ok(Null)}
            else if s == "true" {Ok(Bool(true))}
            else if s == "false" {Ok(Bool(false))}
            else {Err(scan.scan_error(&format!("unknown identifier '{}'",s),None))},
        Token::Char(c) =>
            if c == '[' {
                // array: values separated by ',' and closed by ']'
                let mut ja = Vec::new();
                let mut ch = c;
                while ch != ']' {
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',',']'])?;
                    ja.push(Box::new(o));
                }
                Ok(Arr(ja))
            } else if c == '{' {
                // object: "key": value pairs separated by ',' and closed by '}'
                let mut jo = HashMap::new();
                let mut ch = c;
                while ch != '}' {
                    let key = scan.get_string()?;
                    scan.get_ch_matching(&[':'])?;
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',','}'])?;
                    jo.insert(key,Box::new(o));
                }
                Ok(Obj(jo))
            } else {
                Err(scan.scan_error(&format!("bad char '{}'",c),None))
            }
    }
}
```

(This is of course an Illustrative Example. JSON is a solved problem.)

## Options

With `no_float` you get a barebones parser that does not recognize floats, just integers, strings, chars and identifiers. This is useful if the existing rules are too strict - e.g. "2d" is fine in `no_float` mode, but an error in the default mode. [chrono-english](https://github.com/stevedonovan/chrono-english) uses this mode to parse date expressions.

With `line_comment` you provide a character; after this character, the rest of the current line will be ignored.
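
The effect of a line comment character can be pictured with a small standalone sketch. It does not use scanlex at all: `strip_line_comment` and `uncommented_lines` are hypothetical helpers introduced only for illustration, and they naively cut at the first occurrence of the character without regard to quoted strings, which may well differ from how the crate treats them.

```rust
/// Hypothetical helper (not part of scanlex): returns the part of `line`
/// before the first occurrence of `comment`, or the whole line if the
/// comment character does not appear.
fn strip_line_comment(line: &str, comment: char) -> &str {
    match line.find(comment) {
        Some(pos) => &line[..pos],
        None => line,
    }
}

/// Hypothetical helper: applies `strip_line_comment` to every line of
/// `text`, trims what is left, and drops lines that end up empty.
fn uncommented_lines(text: &str, comment: char) -> Vec<&str> {
    text.lines()
        .map(|line| strip_line_comment(line, comment).trim())
        .filter(|kept| !kept.is_empty())
        .collect()
}
```

Each string that `uncommented_lines` returns could then be handed to its own scanner, in the same one-scanner-per-line pattern described earlier.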