Ara is a lightweight compiler/interpreter project.
The lexer (or "tokenizer") is the first stage of the Ara pipeline. Its job is to take raw source code (a long string of characters) and transform it into a sequence of meaningful units called Tokens.
The lexer maintains several pointers to keep track of its progress through the input string:
input: The raw source code string.position: Points to the current character being examined (ch).readPosition: Points to the next character in the input (used for "peeking" or advancing).ch: The actual character currently being inspected.
The readChar method is the "heartbeat" of the lexer. It:
- Checks if it has reached the end of the input.
- If not, it updates
chwith the character atreadPosition. - Advances both
positionandreadPosition.
When the parser asks for the next unit of code, NextToken is called. It follows these steps:
A. Skip Whitespace
It calls skipWhitespace to ignore spaces, tabs, and newlines. The lexer doesn't care about these unless they are part of a literal (like a string).
B. Identify Simple Tokens
It checks the current character (l.ch) against known single-character operators and punctuation:
- Operators:
+,-,*,/ - Punctuation:
(,),, - EOF (End Of File): A special token used to tell the parser nothing is left.
C. Identify Complex Tokens (Identifiers and Numbers)
If the character doesn't match an operator, it checks if it's the start of something more complex:
- Identifiers: If it's a letter (or
_), it callsreadIdentifier. This continues reading until it hits a non-letter/non-digit character. These are used for variable names or keywords. - Numbers: If it's a digit, it calls
readNumber. It consumes characters as long as they are digits or a decimal point..
D. Error Handling
If a character is encountered that doesn't fit any rule (like a random symbol @ or $), it generates a TOKEN_ERROR.
If you feed the lexer the input (1 + foo), it performs the following transformations:
| Character(s) | Action | Resulting Token |
|---|---|---|
( |
Direct Match | Token(Type: LPAREN, Literal: "(") |
1 |
isDigit -> readNumber |
Token(Type: NUMBER, Literal: "1") |
+ |
Direct Match | Token(Type: PLUS, Literal: "+") |
foo |
isLetter -> readIdentifier |
Token(Type: IDENT, Literal: "foo") |
) |
Direct Match | Token(Type: RPAREN, Literal: ")") |
| (End) | Empty Input | Token(Type: EOF, Literal: "") |
Here is a breakdown of the most common ones and why we are choosing a Pratt Parser for Ara.
| Parser Type | How it Works | Pros | Cons |
|---|---|---|---|
| Recursive Descent | Uses a set of recursive functions that mirror the language's grammar rules. | Very easy to write and debug by hand. | Struggles with operator precedence (e.g., * vs +) without creating a deep, complex hierarchy of functions. |
| LL/LR (Generators) | Uses tools like Yacc, Bison, or ANTLR to generate code from a formal grammar file. | Extremely powerful; handles very complex languages efficiently. | Hard to debug; the generated code is often unreadable "spaghetti." |
| Pratt Parser | Associates parsing functions with token types and uses "binding power" (precedence levels) to decide how tokens stick together. | Perfect for expressions. Handles operator precedence and associativity elegantly with very little code. | Can be slightly more abstract to understand initially than basic recursive descent. |
Since Ara is an expression engine, its primary job is to evaluate mathematical and logical formulas (like 1 + 2 * 3 or a > 10 && b < 5).
In a standard recursive descent parser, you would need a separate function for every level of priority (parseAddition, parseMultiplication, parseComparison, etc.). With a Pratt Parser, we simply assign a number to each operator:
+: Precedence 10*: Precedence 20
The parser then automatically knows that the * "pulls" the numbers closer than the + does. This makes the code much cleaner, faster, and easier to extend when we add new operators later.
Ara currently supports the following expressions:
- Arithmetic:
+,-,*,/ - Comparisons:
==,!=,<,>,<=,>= - Logical:
&&,|| - Booleans:
true,false - Grouping: Parentheses
( )for precedence control - Identifiers: Variable names (e.g.
price,volume) - Numbers: Integer literals (e.g.
100,42)