25 July, 2007
Lexical analyzer (fa_lex)
Introduction
The lexical analyzer (the lexer) takes a sequence of characters and returns a sequence of tokens, where every token is a meaningful unit identified by its type and its boundaries.
Everything that is not meaningful is normally discarded, such as spaces or new-line symbols for C++. The tokens cannot overlap or include each other; in other words, each character belongs to at most one token. Depending on the language, the definition of the tokens can differ, as the examples below show:
For C++:
Input: if(++i==0) { j = 0; }
Output: if/OP (/LBR ++/OP i/VAR ==/OP 0/NUM )/RBR {/LCBR j/VAR =/OP 0/NUM ;/OP }/RCBR
Where {OP, LBR, RBR, VAR, NUM, LCBR, RCBR} is a possible set of token types for C++.
For English:
Input: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Output: Pierre/WORD Vinken/WORD ,/PUNKT 61/CD years/WORD old/WORD ,/PUNKT will/WORD join/WORD the/WORD board/WORD as/WORD a/WORD nonexecutive/WORD director/WORD Nov./WORD 29/CD ./EOS
Where {WORD, PUNKT, CD, EOS} is a possible set of token types for English.
Grammar of fa_lex rules
The lexer uses rules to identify the boundaries and types of the tokens. Each rule describes one token in a context. The rules are based on character regular expressions.
Each rule consists of an optional left context description, the token description, an optional right context description, and the token type. The token description is enclosed in angle brackets. The left context, right context, and token descriptions are character regular expressions. However, the context descriptions must not be cyclic (i.e., they must not accept strings of infinite length) and the token description must not be empty (i.e., it must not accept the zero-length string). All rules are combined together by an "or" operator. The following grammar in Backus-Naur form formally describes the syntax of the lexer rules.
GRAMMAR ::= RULES
GRAMMAR ::= FUNCTIONS
RULES ::= RULE
RULES ::= RULE\n RULES
RULE ::= CONDITION --> ACTION
CONDITION ::= Regexp* < Regexp > Regexp*
CONDITION ::= < Regexp > Regexp*
CONDITION ::= Regexp* < Regexp >
CONDITION ::= < Regexp >
ACTION ::= Tag
ACTION ::= _call FUNCTION_NAMES
ACTION ::= Tag _call FUNCTION_NAMES
FUNCTION_NAMES ::= _main
FUNCTION_NAMES ::= FnName
FUNCTION_NAMES ::= FnName FUNCTION_NAMES
FUNCTIONS ::= FUNCTION
FUNCTIONS ::= FUNCTION\n FUNCTIONS
FUNCTION ::= _function FnName\n RULES\n _end
Regexp -- non-empty character-based regular expression
Regexp* -- acyclic character-based regular expression
Tag -- tag name (token type name)
FnName -- function name; can be one of the tags or a new name
_function -- a keyword indicating the beginning of a function
_end -- a keyword indicating the end of a function
_call -- a keyword indicating a function call
_main -- a special function name referring to the main rule set
Example 1:
< [0-9]+ > --> NUM
[0-9] < [-+*/] > [-]?[0-9] --> OP
< [[:alpha:]]+ > --> VAR
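For instance, applied to the input below, these rules would be expected to produce the following output (a sketch of the behavior; note that the "*" before "x" is not tagged, because the OP rule requires an optional minus followed by a digit in the right context, and unmatched characters are discarded):
Input: 3+4*x
Output: 3/NUM +/OP 4/NUM x/VAR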
Example 2:
< ([A-Za-z\x00C0-\x00D6\x00D8-\x00F6\x00F8-\x00FF\x0152\x0153])+[+-] > [0-9] --> WORD
< ([0]?[0-9]|(1)[0-9]|(2)[0-4])[:]([0-5][0-9])([:]([0-5][0-9]))? > [^0-9] --> HHMM
< ([\x0024\x00A2-\x00A5\x09F2\x09F3\x0E3F\x20A0\x20A2\x20A3\x20A4\x20A6-\x20AF])[\x0020\t]*
((0)|[1-9][0-9]*)[\x0020\t]*((
(AED)|(ARP)|(ATS)|(AUD)|(BBD)|(BEF)|(BGL)|(BHD)|(BMD)|(BRR)|(BRL)|(BSD)
| (CAD)|(CHF)|(CLP)|(CNY)|(CSK)|(CYP)|(DEM)|(DKK)|(DJF)|(DZD)|(EGP)|(ESP)
| (EUR)|(FIM)|(FJD)|(FRF)|(GBP)|(GRD)|(HKD)|(HUF)|(IDR)|(IEP)|(ILS)|(INR)
| (IQD)|(ISK)|(ITL)|(JMD)|(JOD)|(JPY)|(KRW)|(KWD)|(LBP)|(LUF)|(LYD)|(MAD)
| (MRO)|(MXP)|(MYR)|(NLG)|(NOK)|(NZD)|(OMR)|(PHP)|(PKR)|(PLN)|(PTE)|(QAR)
| (ROL)|(RUR)|(SAR)|(SDD)|(SEK)|(SGD)|(SKK)|(SOS)|(SYP)|(SUR)|(THB)|(TND)
| (TRL)|(TRY)|(TTD)|(TWD)|(USD)|(VEB)|(XEC)|(YER)|(ZAR)|(ZMK)|(DM)|(FF)
| (\x20AC((uro)|(URO))[s]?)
)) > ([^.,0-9A-Za-z\x00C0-\x00D6\x00D8-\x00F6\x00F8-\x00FF\x0152\x0153]) --> CURR
The following rules are incorrect:
< [-+*/] > [-]?[0-9]+ --> OP ; the context must be acyclic
< [-+*/]* > [-]?[0-9] --> OP ; the token description must not allow empty tokens
[-]?[0-9]+ --> CD ; the token description must be enclosed in angle brackets
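For comparison, corrected versions of these rules might look as follows (a sketch; the cyclic part of the context is dropped, the token is made non-empty, and the token is bracketed):
< [-+*/] > [-]?[0-9] --> OP
< [-+*/]+ > [-]?[0-9] --> OP
< [-]?[0-9]+ > --> CD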
The following are equivalent rule-sets:
1. The rule
< [-+*/] > [-]?[0-9] --> OP
is equivalent to the pair
< [-+*/] > [-][0-9] --> OP
< [-+*/] > [0-9] --> OP
2. The rule
< [-+*/] > [[:alpha:]]|[[:digit:]] --> OP
is equivalent to the pair
< [-+*/] > [[:alpha:]] --> OP
< [-+*/] > [[:digit:]] --> OP
3. The rule
< [-+*/] > [-]? --> OP
is equivalent to
< [-+*/] > --> OP
4. The rule
< [-+*/] > [^a] --> OP
is equivalent to the pair
< [-+*/] > [-] --> OP
< [-+*/] > [^a] --> OP
5. The rule
< [-+*/] > . --> OP
is equivalent to the pair
< [-+*/] > [^a] --> OP
< [-+*/] > [^b] --> OP
Description of Functions:
Functions are isolated named sets of rules in fa_lex syntax.
If the action of a rule R contains the _call keyword followed by a function name FnName or by the _main keyword, then each time R extracts a token, the rule set of FnName or _main is applied to the token span. If _call is followed by one function name, then the function's rule set extracts all possible non-overlapping tokens from the span. If _call is followed by more than one function name, then each corresponding rule set extracts just one token, one after another in a sequence, with the exception of the main rule set _main (it always extracts all possible tokens).
As the formal grammar defines, there are three types of actions: a) tag assignment, b) function call, c) tag assignment and function call. If the action is a function call without a tag assignment, then no token corresponding to the span itself is extracted. In this case fa_lex returns whatever the called function outputs. It is possible that the called function returns nothing.
Functions are optional. They may be used for hierarchical token extraction (such as a date as a whole with day, month, and year as its parts); they may also be used for wide context description and conflict resolution.
Examples of Functions:
The following function HY_WORD is called to split a hyphenated word into segments. Input: “out-of-date”, output: “out/WORD -/WORD of/WORD -/WORD date/WORD”. No nested tokens are created.
_function HY_WORD
< [A-Za-z]+ > --> WORD
< [-] > --> WORD
_end
< [A-Za-z]+([-][A-Za-z]+)+ > --> _call HY_WORD
In the following example the tag ACR is assigned and the function ACR is called (it is fine for a function and a tag to have the same name). Input: “A.B.C.”, output: “A.B.C./ACR A./WORD B./WORD C./WORD”.
_function ACR
< [A-Z][.] > --> WORD
_end
< ([A-Z][.])+ > --> ACR _call ACR
The following functions DAY, MONTH, YEAR are called in a sequence one after another; each of them extracts just one token within the span selected by the calling rule. Input: “12/13/2006 13/12/2006”, output: “12/13/2006/DATE_US 12/MONTH 13/DAY 2006/YEAR 13/12/2006/DATE_EU 13/DAY 12/MONTH 2006/YEAR”. For the cases when the input token matches both the DATE_US and DATE_EU rules, fa_lex prefers the tag which has the smaller value, so depending on the tagset definition DATE_US may be preferred to DATE_EU and vice versa (see “Conflict resolution for matching rules”).
_function MONTH
< [0-9][0-9] > --> MONTH
_end
_function DAY
< [0-9][0-9] > --> DAY
_end
_function YEAR
< [0-9][0-9][0-9]?[0-9]? > --> YEAR
_end
< [01][0-9][/][0123][0-9][/][0-9][0-9][0-9]?[0-9]? > --> DATE_US _call MONTH DAY YEAR
< [0123][0-9][/][01][0-9][/][0-9][0-9][0-9]?[0-9]? > --> DATE_EU _call DAY MONTH YEAR
Extra syntax notes:
1. Blank characters do not mean anything to fa_lex; they are simply ignored. In order to match any of these characters, the following constructions can be used: \t, \n, \r, \f, \v, [ ], [\t], [\n], [\r], [\f], [\v], \x20, \x09, \x0D, \x0A, [\x20], [\x09], [\x0D], [\x0A], [[:blank:]], [[:space:]] and so on (see the example after these notes).
2. The left (^) and right ($) anchors are ordinary symbols for the lexer; they can be included in both contexts as well as in the token definition. If possible, including them in the token definition is preferable. The “any” symbol (i.e. .) matches both of the anchors, and the negation of a character (e.g. [^a]) also matches either of the anchors.
3. Character classes:
[:alnum:] [:alpha:] [:lower:] [:xdigit:]
[:digit:] [:space:] [:upper:] [:print:]
[:punct:] [:blank:] [:cntrl:] [:graph:]
are defined as in the POSIX "C" locale and have to be extended for the Unicode range, if necessary.
4. See the POSIX 1003.2 standard for more details on the regular expression syntax (http://www.unusualresearch.com/regex/regexmanpage.htm).
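For illustration, here are two hypothetical rules using these constructions (ACR and BOS_WORD are assumed tag names, not part of any shipped tagset). The first matches “U. S.” with a literal space written as [\x20], since bare blanks in a rule are ignored; the second anchors a capitalized word to the beginning of the text by placing ^ inside the token definition, as recommended above:
< U[.][\x20]S[.] > --> ACR
< ^[A-Z][a-z]+ > --> BOS_WORD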
Compilation:
The lexer compiler fa_build_lex takes two input files: a rule-set and a tagset. The tagset is a list of symbolic names of token types, each of which has a numerical value associated with it. The tagset can be shared with other modules, such as POS tagging, in the case of NL analysis.
Suppose the file lex_rules.utf8 contains the following rule-set:
[0-9] < [-+*/] > [-]?[0-9] --> PUNKT
< [-]?[0-9]+ > --> CD
< [-+*/]+ > --> WORD
The file tagset.txt contains the following tagset:
CD 1
PUNKT 2
WORD 3
The following command will compile the rule-set:
fa_build_lex --in=lex_rules.utf8 --out=lex_rules.dump \
--build-dump --tagset=tagset.txt
The --build-dump parameter produces a memory-dump representation of the compiled rule-set; without this parameter the compiled rule-set is stored in the textual representation. See the description of all switches by typing: fa_build_lex --help
If there were no compilation errors the output file lex_rules.dump will be created.
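For example, a textual (non-dump) compilation might look as follows (lex_rules.txt is a hypothetical output file name):
fa_build_lex --in=lex_rules.utf8 --out=lex_rules.txt --tagset=tagset.txt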
Lexical analysis
As stated above, lexical analysis is the process of converting a sequence of characters into a sequence of tokens, where each token is a meaningful unit identified by its type and its boundaries. Everything that is not a token is ignored. The output tokens cannot overlap or include each other; in other words, each character of the input text belongs to at most one token. In order to guarantee this condition, it is necessary to be able to prefer one match over another when more than one rule matches at a given character of the input text. This is addressed by the conflict resolution rules (see below).
Conflict resolution for matching rules:
The following is the order in which fa_lex selects the rule to execute if more than one matches the text:
1. The leftmost rule,
2. The rule with the longest span,
3. The rule with the smallest left context,
4. The rule with the smallest right context,
5. The rule with no tag assignment (just a function call),
6. The rule with the smaller tag value,
7. The rule with the lexicographically smaller list of function names (based on their values).
Conflict resolution rules #1 and #2 are common to many lexical analyzer implementations (including lex and flex). Rules #3 -- #7 are specific to fa_lex. Unlike lex/flex, fa_lex gives the rule order no role in conflict resolution; thus it absolutely does not matter in which order the rules are specified.
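For example, to see rule #6 at work, recall the DATE_US/DATE_EU rules above: the input “12/12/2006” matches both rules with the same span and contexts, so the tag values decide. Assuming a hypothetical tagset such as:
DATE_US 1
DATE_EU 2
fa_lex would tag the token DATE_US, because 1 < 2.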
Runtime Execution:
The lexical analysis of a text can be performed by the stand-alone program fa_lex. It takes two obligatory parameters, a compiled rule-set and a tagset; reads the raw text from stdin or from an input file; and prints the extracted tokens to stdout or to an output file in the tagged-text format. The output can be redirected to programs like fa_ts2ps, fa_gcd, fa_ts2stat or any other program that understands this format.
The following commands perform a lexical analysis of the input text with respect to the grammar defined and compiled in the previous section:
$ echo 23452345+34534 | fa_lex --tagset=tagset.txt --stage=lex_rules.dump
> 23452345/CD +/PUNKT 34534/CD
$ echo 23452345+34534+ | fa_lex --tagset=tagset.txt --stage=lex_rules.dump
> 23452345/CD +/PUNKT 34534/CD +/WORD
In the second run the trailing "+" is tagged WORD rather than PUNKT, because the PUNKT rule requires a digit in its right context and there is none.
FAQ
1. Why use fa_lex?
The fa_lex lexical analyzer does not require rule authors to write C/C++ code or even to have a C/C++ compiler installed. In fa_lex, the rule-sets are purely declarative. This allows authors (usually linguists) to focus on the linguistic aspects of the problem and be isolated from the actual implementation. By design, the rules cannot contain hard-to-understand logic; they are more independent of each other and thus easier to maintain than the rules of other lexer programs.
2. How is fa_lex different from flex?
Efficiency aspects:
Authoring aspects:
Other aspects:
3. How would you use a function for context description and conflict resolution?
As for context description: you can detect that a span contains something (for example, quotes at the beginning and the end of the span) and then apply a specific rule-set to this span (for example, separate rules for the left quote and the right quote, and the same rule-set for the rest of the span). Needless to say, this is impossible to do without functions; a sketch follows.
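For instance, a sketch of the quote case (QUOTE and WORD are assumed tag names; the outer rule selects a whole quoted span and the called function re-tokenizes it):
_function QUOTED
< ["] > --> QUOTE
< [^"\x20]+ > --> WORD
_end
< ["][^"]+["] > --> _call QUOTED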
As for conflict resolution, the key is that a function call is always preferred to a tag assignment. So if some particular span needs special or exceptional treatment and tag priorities do not work, we can simply call a function for that span and override the behavior.
For example, consider ruleset 1:
< [^\s\t\r\n]+ > --> WORD
< [A-Za-z][.]([A-Za-z][.])+ > --> ACR
< e[.]g[.] > --> WORD
And the tagset:
ACR 1
WORD 2
The intention of the author is to mark sequences of non-space characters with the tag WORD, sequences of two or more letter-dot pairs with the tag ACR, but the word "e.g." as a WORD. The author wants to choose ACR whenever possible and to use the WORD tag only for the remaining cases; that is why the WORD tag has the biggest value in the tagset. The word "e.g.", however, belongs both to the /[A-Za-z][.]([A-Za-z][.])+/ language and to the /e[.]g[.]/ language, so there is a conflict. This conflict cannot be resolved by changing the tag values; it should be resolved either by adding more context to the /e[.]g[.]/ rule or by using a function call as follows (see ruleset 2).
Ruleset 2:
< [^\s\t\r\n]+ > --> WORD
< [A-Za-z][.]([A-Za-z][.])+ > --> ACR
< e[.]g[.] > --> _call WORD
_function WORD
< ^ .+ $ > --> WORD
_end
The function WORD assigns the tag WORD to the entire span. According to the priority rules, the "e.g." rule has the highest priority (a function call is preferred to a tag assignment). See the “Conflict resolution for matching rules” section.
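For instance, with ruleset 2 one would expect the following behavior (a sketch, assuming space-separated input):
Input: e.g. A.B. word
Output: e.g./WORD A.B./ACR word/WORD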
4. How to obtain fa_lex?
The fa_lex lexical analyzer is available in the NLG-MAIN enlistment. The program fa_build_lex builds rule-sets, and fa_lex is a stand-alone program that performs lexical analysis; see the “Compilation” and “Runtime Execution” sections.