Lexical
The first step in processing a Shard source file is lexical analysis, which splits the source code into a sequence of tokens.
Shard supports only source files encoded as 7-bit ASCII or UTF-8 without a BOM (byte order mark). Other encodings are not supported.
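The encoding rule can be sketched as a small decoding step that runs before tokenization. This is only an illustration: the function name `decode_source` is hypothetical, and rejecting the input with a `ValueError` is an assumption, since the text does not say how an unsupported encoding is reported.

```python
def decode_source(data: bytes) -> str:
    """Decode raw source bytes as UTF-8 (7-bit ASCII is a subset of UTF-8).

    Assumption: unsupported input is reported by raising ValueError.
    """
    # A UTF-8 byte order mark is the byte sequence EF BB BF.
    if data.startswith(b"\xef\xbb\xbf"):
        raise ValueError("source files with a BOM are not supported")
    # Strict decoding raises UnicodeDecodeError (a ValueError subclass)
    # for any byte sequence that is not valid UTF-8.
    return data.decode("utf-8", errors="strict")
```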
The lexer stops processing the source code when one of the following conditions is met:
- The physical end of the source file is reached.
- A U+0000 character is found.
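A minimal sketch of the two stop conditions, assuming the source is already decoded into a string. The helper name `effective_length` is hypothetical and not part of Shard.

```python
def effective_length(source: str) -> int:
    """Return the index at which the lexer would stop reading."""
    # str.find returns -1 when no U+0000 character is present.
    nul = source.find("\u0000")
    # Stop at the first U+0000, otherwise at the physical end of the input.
    return len(source) if nul == -1 else nul
```

With this helper, everything from the first U+0000 onward is never looked at, so `effective_length("ab\0cd")` is 2.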
An end of line is a sequence of one or two characters, so that EOL marks from different platforms (Linux, Windows and Mac) are supported:
- U+000A (Linux & macOS)
- U+000D U+000A (Windows)
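One way to recognize these sequences is to test for the two-character form first. The function name `match_eol` is hypothetical, and treating a lone U+000D as not being an end of line is an assumption based on the list above, which names only the two sequences.

```python
def match_eol(source: str, pos: int) -> int:
    """Return the length of the end-of-line sequence at pos, or 0 if none."""
    # Check the Windows form first so U+000D U+000A counts as one EOL.
    if source.startswith("\r\n", pos):
        return 2
    if source.startswith("\n", pos):
        return 1
    # Assumption: a lone U+000D is not listed, so it is not treated as EOL.
    return 0
```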
Whitespace is a sequence of characters that separates other tokens; it is not required between them. Whitespace carries no meaning for higher layers, so no token is generated for it.
Any sequence of the following characters is whitespace:
- U+0020 (space)
- U+0009 (horizontal tab)
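Skipping whitespace without emitting a token could look like the sketch below. The names `WHITESPACE` and `skip_whitespace` are hypothetical.

```python
# The two whitespace characters listed above: space and horizontal tab.
WHITESPACE = {"\u0020", "\u0009"}


def skip_whitespace(source: str, pos: int) -> int:
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(source) and source[pos] in WHITESPACE:
        pos += 1
    # No token is produced; the caller simply continues from the new position.
    return pos
```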
There are two types of comments: line and block.
A line comment starts with the // sequence; everything after it is ignored until an end of line is found.
A block comment starts with /* and ends with */. Everything in between is part of the comment. Block comments do not nest: any /* inside a block comment is ignored, and the comment ends at the first */.
/* block /* block */ end */
                     ^
                     not a comment
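The comment rules, including the non-nesting behaviour of block comments, can be sketched as follows. The function name `skip_comment` is hypothetical, and raising `ValueError` for an unterminated block comment is an assumption, since the text does not describe that case.

```python
def skip_comment(source: str, pos: int) -> int:
    """Return the index just past the comment at pos, or pos if there is none."""
    if source.startswith("//", pos):
        pos += 2
        # A line comment runs up to, but does not include, the end of line.
        while pos < len(source) and not (
            source.startswith("\n", pos) or source.startswith("\r\n", pos)
        ):
            pos += 1
        return pos
    if source.startswith("/*", pos):
        # Block comments do not nest: the first */ closes the comment.
        end = source.find("*/", pos + 2)
        if end == -1:
            # Assumption: an unterminated block comment is an error.
            raise ValueError("unterminated block comment")
        return end + 2
    return pos
```

Applied to the example above, the comment ends after the first `*/`, and the remaining ` end */` is tokenized as ordinary source text.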
Any sequence of characters that matches the following rules is an identifier:
- Starts with an alphabetic character (a-z, A-Z) or _.
- Contains only alphanumeric characters (a-z, A-Z, 0-9) or _.
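The two identifier rules translate directly into a scanner. In this sketch the names `IDENT_START`, `IDENT_CONT` and `scan_identifier` are hypothetical; explicit ASCII sets are used instead of `str.isalpha`, which would also accept non-ASCII letters.

```python
import string
from typing import Optional

# First character: a-z, A-Z or underscore.
IDENT_START = set(string.ascii_letters + "_")
# Following characters: a-z, A-Z, 0-9 or underscore.
IDENT_CONT = IDENT_START | set(string.digits)


def scan_identifier(source: str, pos: int) -> Optional[str]:
    """Return the identifier starting at pos, or None if there is none."""
    if pos >= len(source) or source[pos] not in IDENT_START:
        return None
    end = pos + 1
    while end < len(source) and source[end] in IDENT_CONT:
        end += 1
    return source[pos:end]
```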
Literals are special tokens that represent an immutable value.
A number literal is a sequence of characters that can represent a number:
- Starts with a numeric character (0-9).
- Contains only alphanumeric characters (a-z, A-Z, 0-9).
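A sketch of the number rules, with the hypothetical name `scan_number`. It only delimits the token; interpreting the text as a value (for example, deciding what a prefix such as `0x` means) is assumed to happen in a later stage, because the rules above do not cover it.

```python
import string
from typing import Optional

DIGITS = set(string.digits)
ALNUM = set(string.ascii_letters + string.digits)


def scan_number(source: str, pos: int) -> Optional[str]:
    """Return the number token text starting at pos, or None if there is none."""
    # A number must start with 0-9.
    if pos >= len(source) or source[pos] not in DIGITS:
        return None
    end = pos + 1
    # It continues while the characters are a-z, A-Z or 0-9.
    while end < len(source) and source[end] in ALNUM:
        end += 1
    return source[pos:end]
```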
A character literal is any Unicode character surrounded by single quote ' (U+0027) characters.
A string literal is a sequence of characters surrounded by double quote " (U+0022) characters.
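Finding the closing quote of a string literal needs to account for escaped quotes inside it. The sketch below uses the hypothetical name `scan_string_literal`, returns the raw, still-escaped body, and assumes that an unterminated literal is reported with a `ValueError`; none of this is prescribed by the text.

```python
from typing import Optional, Tuple


def scan_string_literal(source: str, pos: int) -> Optional[Tuple[str, int]]:
    """Return (raw body, index after the closing quote), or None if no literal."""
    if not source.startswith('"', pos):
        return None
    i = pos + 1
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            # Skip the backslash and the character it escapes, so that
            # an escaped double quote does not close the literal.
            i += 2
            continue
        if ch == '"':
            return source[pos + 1:i], i + 1
        i += 1
    # Assumption: running out of input inside a literal is an error.
    raise ValueError("unterminated string literal")
```

A character literal could be handled the same way with `'` as the delimiter.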
Within a character or string literal, an escape sequence can be used. This is handy when the required character, such as a single or double quote, cannot be written directly.
- Special characters prefixed by a backslash \ (U+005C) character: \0 (null character), \\ (backslash), \t (horizontal tab), \n (line feed), \r (carriage return), \" (double quote) and \' (single quote).
- A Unicode code point value written as \un, where n is the hexadecimal number of the code point. The number must be in the range supported by UTF-8 encoding (U+0000-U+10FFFF).
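Decoding a single escape sequence might look like the sketch below. The names `SIMPLE_ESCAPES` and `decode_escape` are hypothetical. The function receives only the text that follows the backslash, because the text does not state how many hexadecimal digits a \u escape consumes; that decision is left to the caller here. Reporting an invalid escape with `ValueError` is also an assumption.

```python
import string

# The special characters listed above, keyed by the character after the backslash.
SIMPLE_ESCAPES = {
    "0": "\0", "\\": "\\", "t": "\t", "n": "\n", "r": "\r", '"': '"', "'": "'",
}


def decode_escape(body: str) -> str:
    """Decode the text following a backslash, e.g. 'n' or 'u1F600'."""
    if body in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[body]
    digits = body[1:]
    if body.startswith("u") and digits and all(c in string.hexdigits for c in digits):
        codepoint = int(digits, 16)
        # Only U+0000 through U+10FFFF is allowed.
        if codepoint > 0x10FFFF:
            raise ValueError("code point out of range: " + body)
        return chr(codepoint)
    raise ValueError("unknown escape sequence: " + body)
```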
Other tokens have no special meaning to the tokenizer, but they might have one for the user of the tokenizer. The resulting token is one printable character long.
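The fallback rule can be sketched as a scanner that always yields exactly one character. The name `scan_other` is hypothetical, and "printable" is assumed here to mean the visible ASCII range U+0021 to U+007E, which the text does not define.

```python
from typing import Optional


def scan_other(source: str, pos: int) -> Optional[str]:
    """Return a one-character token at pos, or None if it is not printable."""
    if pos >= len(source):
        return None
    ch = source[pos]
    # Assumption: printable means visible ASCII, from '!' to '~'.
    if "!" <= ch <= "~":
        return ch
    return None
```

Under this rule a sequence such as `+=` produces two tokens, `+` and `=`; combining them into one operator is left to the user of the tokenizer.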