# Lexer

The FlavorLang lexer is responsible for breaking the input source code into fundamental components, called tokens. These tokens then serve as the input for the parser, enabling FlavorLang to interpret and execute your code.


## Table of Contents

- [Overview](#overview)
- [How the Lexer Works](#how-the-lexer-works)
- [Tokenizing Key Constructs](#tokenizing-key-constructs)
- [Debugging Tokens](#debugging-tokens)
- [Error Handling](#error-handling)
- [Future Enhancements](#future-enhancements)
- [License](#license)


## Overview

Below are the core token types recognized by the lexer:

| Token Type   | Examples                          | Description                                                        |
| ------------ | --------------------------------- | ------------------------------------------------------------------ |
| `KEYWORD`    | `serve`, `create`, `deliver`, `if` | Reserved language keywords (e.g., `let`, `while`, `burn`, etc.).   |
| `IDENTIFIER` | `cake`, `temperature`             | Names for variables, functions, or parameters.                     |
| `NUMBER`     | `42`, `200`, `3.14`               | Integer or floating-point numeric literals.                        |
| `STRING`     | `"Hello World!"`                  | Text enclosed in double quotes.                                    |
| `OPERATOR`   | `=`, `==`, `+`, `-`, `>=`         | Single- or multi-character operators used in expressions.          |
| `DELIMITER`  | `,`, `:`, `(`, `)`, `;`, `{`, `}` | Punctuation symbols that mark statement boundaries or grouping.    |
| `COMMENT`    | `#`                               | Everything from `#` to the end of a line is ignored.               |
| `EOF`        | End of input                      | Signifies that there are no more tokens to read.                   |
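
These categories map naturally onto an enum in a C implementation. The sketch below is illustrative only: apart from `TOKEN_EOF`, which the lexer appends at end of input, the constant names are assumptions rather than FlavorLang's actual definitions (note that the debug output later in this document distinguishes `INTEGER` and `FLOAT` rather than a single `NUMBER` kind).

```c
// Hypothetical token kinds; only TOKEN_EOF is named in this document.
typedef enum {
    TOKEN_KEYWORD,
    TOKEN_IDENTIFIER,
    TOKEN_INTEGER,
    TOKEN_FLOAT,
    TOKEN_STRING,
    TOKEN_OPERATOR,
    TOKEN_DELIMITER,
    TOKEN_EOF
} TokenType;
```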

## How the Lexer Works

  1. Input & Initialization

    • The lexer reads the entire file into a string.
    • It creates a token buffer (dynamic array) that grows as needed.
  2. Character Classification

    • Whitespace is skipped, incrementing line if \n.
    • Comments begin with # and continue until end of line.
    • Numbers are sequences of digits (0-9), optionally including one decimal point.
    • Strings are enclosed in double quotes ("), supporting multiple characters.
    • Identifiers or Keywords are alphanumeric sequences that either match a keyword (e.g., let) or become IDENTIFIER.
    • Operators (e.g. =, ==, >=) may be single or multi-character.
    • Delimiters (e.g. ,, ;, () are tokenized in a straightforward manner.
  3. Token Construction

    • Each recognized piece of text is turned into a token:
      • Type: The token kind (keyword, number, operator, etc.).
      • Lexeme: The exact substring from the source code.
      • Line Number: The line on which the token appears.
  4. End of File

    • After scanning all characters, the lexer appends a TOKEN_EOF to signify there are no more tokens.
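
As a rough illustration of steps 1 and 3, here is a minimal sketch of a token record and a growable token buffer, assuming the `TokenType` enum sketched above. The names (`Token`, `TokenBuffer`, `append_token`) are hypothetical and not taken from the FlavorLang source.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    TokenType type;  // token kind (keyword, number, operator, ...)
    char *lexeme;    // exact substring copied from the source code
    int line;        // line on which the token appears
} Token;

typedef struct {
    Token *tokens;   // dynamic array of tokens
    size_t count;
    size_t capacity;
} TokenBuffer;

// Append one token, doubling the buffer when it is full.
// Exits on allocation failure, matching the behavior described under Error Handling.
static void append_token(TokenBuffer *buf, TokenType type,
                         const char *start, size_t length, int line) {
    if (buf->count == buf->capacity) {
        size_t new_capacity = buf->capacity ? buf->capacity * 2 : 8;
        Token *grown = realloc(buf->tokens, new_capacity * sizeof(Token));
        if (!grown) exit(EXIT_FAILURE);
        buf->tokens = grown;
        buf->capacity = new_capacity;
    }
    Token *t = &buf->tokens[buf->count++];
    t->type = type;
    t->line = line;
    t->lexeme = malloc(length + 1);   // copy the lexeme out of the source
    if (!t->lexeme) exit(EXIT_FAILURE);
    memcpy(t->lexeme, start, length);
    t->lexeme[length] = '\0';
}
```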

## Tokenizing Key Constructs

- **Comments**: Start at `#` and continue until `\n`. The lexer ignores them entirely.
- **Numbers**: A run of digits becomes an `INTEGER`, or a `FLOAT` if a decimal point is found (see the sketch below).
- **Strings**: Start and end with `"`. Unterminated strings trigger an error.
- **Identifiers/Keywords**: Any valid identifier start (a letter or `_`) followed by letters/digits forms an identifier, which is cross-checked against the keyword list to decide whether it is a `KEYWORD`.
- **Operators**: Single-character (`+`, `-`, `=`) or multi-character (`==`, `>=`, `<=`) operators.
- **Delimiters**: For punctuation like `,`, `(`, `)`, `;`, the lexer directly appends a token.
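
To make the number and identifier/keyword rules concrete, here is a small sketch in the same hypothetical C style, again assuming the `TokenType` enum above. The keyword table is illustrative (drawn from the examples in this document), and the helper names are assumptions, not FlavorLang's real functions.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

// Illustrative keyword list taken from the examples above; the real list may differ.
static const char *KEYWORDS[] = { "let", "if", "while", "serve", "create", "deliver", "burn" };

// Decide whether a scanned identifier is actually a reserved keyword.
static bool is_keyword(const char *start, size_t length) {
    for (size_t k = 0; k < sizeof(KEYWORDS) / sizeof(KEYWORDS[0]); k++) {
        if (strlen(KEYWORDS[k]) == length && strncmp(KEYWORDS[k], start, length) == 0)
            return true;
    }
    return false;
}

// Scan a numeric literal starting at index i: digits, optionally one decimal point.
// Reports INTEGER or FLOAT via *type and returns the index just past the literal.
static size_t scan_number(const char *src, size_t i, TokenType *type) {
    *type = TOKEN_INTEGER;
    while (isdigit((unsigned char)src[i])) i++;
    if (src[i] == '.' && isdigit((unsigned char)src[i + 1])) {
        *type = TOKEN_FLOAT;
        i++;
        while (isdigit((unsigned char)src[i])) i++;
    }
    return i;
}
```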

## Debugging Tokens

Use the `--debug` flag to print all generated tokens (with their types, lexemes, and line numbers). This helps diagnose lexical issues quickly.

### Example Debug Output

```
[Line 1] Token Type: KEYWORD | Lexeme: let
[Line 1] Token Type: IDENTIFIER | Lexeme: x
[Line 1] Token Type: OPERATOR | Lexeme: =
[Line 1] Token Type: INTEGER | Lexeme: 10
...
```
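
A dump in this format could be produced by a loop of roughly the following shape, again assuming the hypothetical `Token`/`TokenBuffer` types sketched earlier; this is not the actual FlavorLang debug code.

```c
#include <stdio.h>

// Map a token kind to the label used in the debug output.
static const char *token_type_name(TokenType type) {
    switch (type) {
        case TOKEN_KEYWORD:    return "KEYWORD";
        case TOKEN_IDENTIFIER: return "IDENTIFIER";
        case TOKEN_INTEGER:    return "INTEGER";
        case TOKEN_FLOAT:      return "FLOAT";
        case TOKEN_STRING:     return "STRING";
        case TOKEN_OPERATOR:   return "OPERATOR";
        case TOKEN_DELIMITER:  return "DELIMITER";
        case TOKEN_EOF:        return "EOF";
        default:               return "UNKNOWN";
    }
}

// Print every token with its line number, type, and lexeme.
static void print_tokens(const TokenBuffer *buf) {
    for (size_t i = 0; i < buf->count; i++) {
        const Token *t = &buf->tokens[i];
        printf("[Line %d] Token Type: %s | Lexeme: %s\n",
               t->line, token_type_name(t->type), t->lexeme);
    }
}
```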

## Error Handling

- **Unexpected characters**: If the lexer encounters an unrecognized symbol, it raises an error and terminates.
- **Unterminated strings**: If a `"` is opened and never closed, the lexer reports an error with the line number (sketched below).
- **Memory allocation failures**: If resizing the token buffer or creating a token fails, the lexer exits.
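
For example, the unterminated-string case could be detected roughly like this (a sketch under the same assumptions as the earlier snippets; the error message wording is illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

// Scan a string literal starting at the opening quote at index i.
// Errors out with the line number if the closing quote never appears.
static size_t scan_string(const char *src, size_t i, int line, TokenBuffer *buf) {
    size_t start = ++i;                      // skip the opening "
    while (src[i] != '"') {
        if (src[i] == '\0') {
            fprintf(stderr, "[Line %d] Lexer error: unterminated string\n", line);
            exit(EXIT_FAILURE);
        }
        i++;
    }
    append_token(buf, TOKEN_STRING, src + start, i - start, line);
    return i + 1;                            // skip the closing "
}
```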

## Future Enhancements

- **Additional operators**: e.g., `!=`, `++`, or typed operators if the language evolves.
- **Better error recovery**: Instead of exiting immediately, attempt to skip malformed input and continue.
- **Performance tweaks**: More efficient reading and token building for very large `.flv` files.

## License

This project is licensed under the Apache 2.0 License — see the LICENSE file for details.

© 2024-2025 Kenneth Oliver. All rights reserved.