# Clex Language

Clex is a generator language that produces random numbers and strings from a given set of grammar rules.

> [!NOTE]
> Clex does not, and will not, support arithmetic, logical, or any other relationships between values except back-references. Test cases that depend on such relationships between the generated values therefore cannot be expressed in Clex.

The AST of a given Clex expression is always the same, while the string generated from it varies. For example, `S[4,@CH_UPPER@]` can generate "GAHS", "JHAS", and so on.

## Rules for grammar

```txt
ClexLanguage ::= UnitExpression*
UnitExpression ::= CapturingGroup | NonCapturingGroup | DataType | EOF
CapturingGroup ::= "(" "N" PositiveRange? ")"
NonCapturingGroup ::= "(?:" UnitExpression* ")" Quantifiers?
DataType ::= "N" Range? Quantifiers?
           | "F" Range? Quantifiers?
           | "S" StringModifier? Quantifiers?
StringModifier ::= "[" PositiveReference? "," CharacterSet? "]"
Range ::= "[" Reference? "," Reference? "]"
PositiveRange ::= "[" PositiveReference? "," PositiveReference? "]"
Quantifiers ::= "{" PositiveReference "}"
Reference ::= "\\" GroupNo | i64
PositiveReference ::= "\\" GroupNo | u64
GroupNo ::= u64
CharacterSet ::= "'" ASCII_CHARACTER_SET+ "'" | "@" Character "@"
Character ::= "CH_ALPHA" | "CH_NUM" | "CH_NEWLINE" | "CH_ALNUM" | "CH_UPPER" | "CH_LOWER" | "CH_ALL"
ASCII_CHARACTER_SET ::=
```

## Semantic Meaning of each expression in the grammar

### ASCII_CHARACTER_SET

Denotes the set of all representable ASCII characters.

### Character

Denotes a named set of characters from which the generator randomly draws when it builds a string.

| Character  | Meaning                                                |
|------------|--------------------------------------------------------|
| CH_ALPHA   | Set of alphabetical characters                         |
| CH_NUM     | Set of numeric characters                              |
| CH_ALNUM   | Set of alphanumeric characters (default)               |
| CH_UPPER   | Set of uppercase letters                               |
| CH_NEWLINE | Newline character                                      |
| CH_LOWER   | Set of lowercase letters                               |
| CH_ALL     | Set of letters, numbers, and some special characters   |

### CharacterSet

Either a _Character_ name enclosed in at signs (`@`), or a custom string of characters enclosed in single quotes, representing the character set to draw from.

### GroupNo

Represents the group number used for back-referencing. One notable thing about the Clex language is its support for dynamic back-references, as compared to the static ones found in regex. Each _CapturingGroup_ captures and stores a value, and the groups are indexed from 1. A _GroupNo_ can't be greater than the number of _CapturingGroup_ occurrences present in the _ClexLanguage_.

### Reference

_Reference_ is either a back-reference to a capturing group (_GroupNo_) or a literal numeric value (i64). It is used in _Range_ to specify the bounds. If a bound is not specified, its default value is used.

The main purpose of _Reference_ is to act as an abstraction layer that holds either a literal value or a reference to a value that is guaranteed to be available by the time it is used. A back-reference is written as `"\\" GroupNo`. On use, the value stored in that group is dereferenced and substituted in place. Clex uses 1-based indexing for back-references, so the first capturing group has _GroupNo_ 1.

### PositiveReference

_PositiveReference_ is similar to _Reference_ but ensures that the referenced value is non-negative. It is used in _PositiveRange_, _StringModifier_, and _Quantifier_.

### Quantifier

_Quantifier_ specifies the number of occurrences of the preceding expression. The _PositiveReference_ inside `{ ... }` denotes that number. If no quantifier is given, the associated expression occurs exactly once. The number of occurrences can't be negative, which is why a _PositiveReference_ is required here.

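The way a _Reference_ is dereferenced, and how a _Quantifier_ uses it, can be sketched roughly as below. This is only an illustration of the rules described above, not the actual Clex implementation: the names `Literal`, `GroupRef`, `resolve`, `resolve_positive`, and `repeat_count`, and the use of a plain list as the group register, are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Literal:
    """A literal numeric value written directly in the expression (hypothetical name)."""
    value: int


@dataclass(frozen=True)
class GroupRef:
    """A back-reference to a capturing group, 1-based (hypothetical name)."""
    group_no: int


Reference = Union[Literal, GroupRef]


def resolve(ref: Reference, groups: List[int]) -> int:
    """Return the concrete value of a reference.

    `groups` is an assumed group register: groups[0] holds the value
    captured by group 1, groups[1] the value of group 2, and so on.
    """
    if isinstance(ref, Literal):
        return ref.value
    # GroupNo is 1-based and cannot exceed the number of captured groups.
    if not 1 <= ref.group_no <= len(groups):
        raise ValueError(f"group {ref.group_no} is not available")
    return groups[ref.group_no - 1]


def resolve_positive(ref: Reference, groups: List[int]) -> int:
    """Resolve a reference that must be non-negative (PositiveReference)."""
    value = resolve(ref, groups)
    if value < 0:
        raise ValueError("a PositiveReference must be non-negative")
    return value


def repeat_count(quantifier: Optional[Reference], groups: List[int]) -> int:
    """Number of occurrences for an expression; 1 when no quantifier is given."""
    if quantifier is None:
        return 1
    return resolve_positive(quantifier, groups)


groups = [3]  # pretend the first capturing group generated the value 3
print(repeat_count(GroupRef(1), groups))  # 3
print(repeat_count(None, groups))         # 1
```

Keeping literals and back-references behind one `resolve` function mirrors the "abstraction layer" role described for _Reference_: the caller does not need to know whether a bound or count was written directly or captured earlier.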
### Range

_Range_ specifies the domain of values for a numeric _DataType_ (Integer or Float) from which its value is drawn during generation. It consists of a _Reference_ for the lower bound and a _Reference_ for the upper bound of the number to be generated. If a bound is not specified, the default values (INT64_MIN, INT64_MAX) are used. Both bounds are always integers, even when the range is defined for a float data type. A range is always inclusive, so `[m, n]` means the value can be anywhere from `m` up to and including `n`.

### PositiveRange

_PositiveRange_ is similar to _Range_ but ensures that the specified bounds are non-negative by using _PositiveReference_. It consists of a _PositiveReference_ for the lower bound and one for the upper bound of the number to be generated. If a bound is not specified, the default values (0, INT64_MAX) are used. Both bounds are always non-negative integers.

### StringModifier

_StringModifier_ is an optional modifier for the String ("S") _DataType_ that specifies additional properties for generating strings. It consists of a _PositiveReference_ for the length of the string and a _CharacterSet_ for the set of characters from which the string is generated.

### DataType

_DataType_ represents the different types of data that can be generated. It includes "N" for integers, "F" for floating-point numbers, "S" for strings, and "C" for characters. Each data type can have an optional range, string modifier, and quantifiers, depending on its type.

### NonCapturingGroup

A _NonCapturingGroup_ is a _UnitExpression_ that groups other expressions without capturing anything, i.e. no entry is kept for it in the group register. The "(?:" and ")" denote the start and end of the non-capturing group. It can contain other unit expressions and may have an associated quantifier. A _NonCapturingGroup_ can be nested and can also contain a _CapturingGroup_.
However, it's worth mentioning that if a _NonCapturingGroup_ is repeated using a _Quantifier_ and there is a _CapturingGroup_ inside it, then that _CapturingGroup_ has only one group number, not a separate one for each iteration.

Example: in `(?:(N)){3}`, the group number of `(N)` is always 1, irrespective of how many times it is evaluated. The iterations are not numbered 1, 2, 3.

### CapturingGroup

A _CapturingGroup_ is a special _UnitExpression_ that captures and stores a single non-negative number, which can later be used through a back-reference. By design it captures non-negative values only, because its main use is in quantifiers, where a negative value has no meaning.

### UnitExpression

_UnitExpression_ is the fundamental building block of the Clex language, representing a single element or a group of elements in the expression. It can be a _CapturingGroup_, a _NonCapturingGroup_, a _DataType_, or _EOF_.

### EOF

Denotes the end of the Clex language expression. It has no written form in the language itself, but is inserted manually during the lexical phase as an intermediate step.

### ClexLanguage

_ClexLanguage_ is the top-level structure of the Clex language. It consists of multiple _UnitExpression_ elements and represents the complete expression pattern.

## Constants in Language

- MAX_STRING_SIZE = 12
- DEFAULT_CHARSET = CharacterSet::AlphaNumeric

## Common Rules while deriving a language

- Whitespace introduced at any stage is consumed completely by the lexer, so spaces are treated the same way comments typically are in other languages.
- The dereferenced value of a _PositiveReference_ must always be non-negative.
  This rule is enforced by ensuring that the value generated in any _CapturingGroup_ is always a non-negative integer.
- At any given point, _GroupNo_ CANNOT EXCEED the **total number of occurrences of _CapturingGroup_** in that specific language. So, if there are only three capturing groups in the language, a _GroupNo_ greater than 3 is not allowed.

## Defaults

- For a _CapturingGroup_, if the _PositiveRange_ is not given, its bounds default to the defaults of _PositiveRange_.
- In every case, if the _Quantifier_ is not given, the associated expression occurs only once.
- For a _DataType_ ("N" or "F"), if the _Range_ is not present, its bounds default to the defaults of _Range_.
- For a _DataType_ ("S"), if the _StringModifier_ is not given, both of its parts take the _StringModifier_ defaults listed below.
- In a _StringModifier_, if the _PositiveReference_ is not given, the length defaults to the constant **MAX_STRING_SIZE**, i.e. 12.
- In a _StringModifier_, if the _CharacterSet_ is not given, it defaults to the constant **DEFAULT_CHARSET**, i.e. the alphanumeric set (`@CH_ALNUM@`).
- If a _Reference_ in a _Range_ is not given, the missing lower bound defaults to INT64_MIN and the missing upper bound to INT64_MAX.
- If a _PositiveReference_ in a _PositiveRange_ is not given, the missing lower bound defaults to 0 and the missing upper bound to INT64_MAX.

## Examples

- `N{2}` : Generates two random integers.
- `(N) (?:N){\\1}` : Generates a random integer, then that many additional integers.
- `(N) (?:S[\\1,])` : Generates a random integer, then a string of that length.
- `(N) (?:S[\\1,@CH_UPPER@])` : Generates a random integer followed by a random string of uppercase letters whose length equals the generated integer.
- `N S C` : Generates a random integer, string, and character.
- `F[-100,100]` : Generates a random floating-point number between -100 and 100.
- `(N[1,100]) (?:N[1,1000]){\\1} N[1,10000]` : Captures a random integer between 1 and 100, then generates that many integers between 1 and 1000, followed by another integer between 1 and 10000.

## References

For more details on the `clex` language and advanced usage, you can refer to the following references:

- [Back-references in repetition construct regex](https://stackoverflow.com/questions/3407696/using-a-regex-back-reference-in-a-repetition-construct-n)
- [Back-references S.O.](https://stackoverflow.com/questions/29728622/regex-with-backreference-as-repetition-count)
- [Possible solution using Code Call-out](https://stackoverflow.com/questions/29728622/regex-with-backreference-as-repetition-count/61898415#61898415)