The Lexer

How produceToken() works?

produceToken() is a wrapper around LookaheadDFA(). Its role is to handle groups. Group mode is a special lexer mode that treats everything between a group start symbol and a group end symbol as the content of a single token; it is most often used for comments. For example, given a comment group with the start symbol "/*" and the end symbol "*/", we want a single comment token for everything between them rather than many separate tokens. In group mode the lexer therefore accumulates the text of the inner tokens into the comment token.

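First we produce a token with LookaheadDFA() (named produceTokenDFA() in the listing below), then decide what to do with it based on the group stack: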

while (true)
{
	Token token = produceTokenDFA();

	//Deciding whether a group should be nested requires looking at both the top of
	//the stack and the symbol's linked group. Either of these can be unset, so we
	//compute a Boolean here to avoid errors and use it in the logic chain below.
	bool nestGroup = false;
	if (token.SymbolType == GrammarSymbolType.GroupStart)
	{
		nestGroup = _GroupStack.Count == 0 || 
			_GroupStack.Peek().SymbolGroup.Nesting.Contains(token.SymbolGroup.TableIndex);
	}

	//TOKEN IS GROUP START-ING
	//If the token starts a group, we can enter that group by pushing the
	//token on the group stack. We are allowed to do so if the group stack
	//is empty (we are not in a group), or if it is not empty (we are in a
	//group) and that group's Nesting allows the new group to be nested in it.
	if (nestGroup)
	{
		trimBuffer((token.Data as string).Length);
		_GroupStack.Push(token);
	}

	//TOKEN IS NOT GROUP-START OR IS NOT NESTABLE GROUP-START
	//WE ARE NOT IN A GROUP
	else if (_GroupStack.Count == 0)
	{
		trimBuffer((token.Data as string).Length);
		return token;
	}

	//TOKEN IS NOT GROUP-START OR IS NOT NESTABLE GROUP-START
	//WE ARE IN A GROUP
	//TOKEN IS GROUP END-ING
	else if (_GroupStack.Peek().SymbolGroup.End == token.Symbol)
	{
		//End the current group
		Token pop = _GroupStack.Pop();

		//if we have GrammarGroupEndingMode.Closed then add the end
		//token, otherwise don't
		if (pop.SymbolGroup.Ending == GrammarGroupEndingMode.Closed)
		{
			pop.Data = (pop.Data as string) + (token.Data as string);
			trimBuffer((token.Data as string).Length);
		}

		//We are out of the group. Return pop'd token (which contains all the group text)
		if (_GroupStack.Count == 0)
		{
			pop.Symbol = pop.SymbolGroup.Container; //Change symbol to symbol of the group
			return pop;
		}
		else
		{
			Token top = _GroupStack.Peek();
			top.Data = (top.Data as string) + (pop.Data as string);
		}
	}

	//TOKEN IS NOT GROUP-START OR IS NOT NESTABLE GROUP-START
	//WE ARE IN A GROUP
	//TOKEN IS NOT GROUP END-ING
	//TOKEN IS EOF - "End Of File"
	else if (token.SymbolType == GrammarSymbolType.End)
	{
		return token;
	}

	//TOKEN IS NOT GROUP-START OR IS NOT NESTABLE GROUP-START
	//TOKEN IS NOT GROUP END-ING
	//WE ARE IN A GROUP
	//TOKEN IS NOT EOF - "End Of File"
	//
	//We are in a group and none of the above applies. If that group uses
	//TokenAdvance, we append the whole token text to the Data of the group
	//token on top of the stack; if it uses CharacterAdvance, we append just
	//the first character. We then continue with the loop.
	else
	{
		Token previousToken = _GroupStack.Peek();
		if (previousToken.SymbolGroup.Advance == GrammarGroupAdvanceMode.Token)
		{
			previousToken.Data = (previousToken.Data as string) + (token.Data as string);
			trimBuffer((token.Data as string).Length);
		}
		else
		{
			previousToken.Data = (previousToken.Data as string) + (token.Data as string)[0];
			trimBuffer(1);
		}
	}
}
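
The group handling above can be tried in isolation with a much smaller model. The following is a minimal sketch, not the engine's code: GroupSketch and Collapse are hypothetical names, tokens are plain (Kind, Text) tuples, and it assumes that every group may nest in every other group, that the ending mode is Closed, and that the advance mode is Token.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical toy model of group mode; none of these names exist in the engine.
public static class GroupSketch
{
    // Collapses everything between a "GroupStart" and its matching "GroupEnd"
    // into a single "Comment" token. Assumes unrestricted nesting.
    public static List<(string Kind, string Text)> Collapse(
        IEnumerable<(string Kind, string Text)> input)
    {
        var output = new List<(string Kind, string Text)>();
        var stack = new Stack<StringBuilder>();   // one text accumulator per open group

        foreach (var t in input)
        {
            if (t.Kind == "GroupStart")
            {
                stack.Push(new StringBuilder(t.Text));    // enter (or nest) a group
            }
            else if (stack.Count == 0)
            {
                output.Add(t);                            // ordinary token outside any group
            }
            else if (t.Kind == "GroupEnd")
            {
                // Closed ending: the end text becomes part of the group text.
                string text = stack.Pop().Append(t.Text).ToString();
                if (stack.Count == 0)
                    output.Add(("Comment", text));        // left the outermost group
                else
                    stack.Peek().Append(text);            // inner text flows into the outer group
            }
            else
            {
                stack.Peek().Append(t.Text);              // token advance: append the whole token
            }
        }
        return output;
    }
}
```

Calling GroupSketch.Collapse on the tokens a, /*, x, y, */, b yields three tokens, the middle one being a Comment token with the text /*xy*/. Tokens still on the stack when the input ends are simply dropped in this sketch.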

How LookaheadDFA() works?

This function implements the DFA for the parser's lexer and generates the actual tokens.

A. We declare a variable "Token result" that we will eventually return.
B. We use Lookahead(1) to get the first character from the lookahead buffer.

B.1. If Lookahead(1) cannot read a character (it returns an empty string, ""), we have reached the end of the file, so we set the result token to the EOF token and proceed to F.

C. Otherwise, we set up a loop:

C.1. We declare four int variables: "currentDfa", "lastAcceptState", "lastAcceptPosition", and "currentPosition". "currentDfa" stores the DFA state the lexer is currently in, and "currentPosition" stores the position of the current character in the lookahead buffer. "lastAcceptState" and "lastAcceptPosition" are assigned once we pass through an accepting state; until then both are -1. currentDfa starts as FAStateList.InitialState, and currentPosition starts at 1, because we begin with the first character of the lookahead buffer. Note that currentPosition is 1-based, not 0-based.

C.2. We declare a "bool done = false" flag and start a while loop that runs while "done" is false. After the loop exits, we proceed to F.

D. Inside the loop, we first get a character with "Lookahead(currentPosition)". If it is not empty (an empty string would mean that the end of the file has been reached), we get the current state, "FAState currentFAState = FAStateList[currentDfa]", and loop through its edges. If an edge's character set contains our character, we set "target" to that edge's target.

E. If we have a target (target > -1), we check whether the target state has an Accept, and if so we set "lastAcceptState = target; lastAcceptPosition = currentPosition;". A state that has an Accept is an accepting state. Accept is a Symbol, and if the lexer stops here it generates a token with that Accept Symbol. After that, we set "currentDfa = target;" and increment currentPosition.

If we have no target (target == -1), we set done to true, which ends the loop. Before leaving the loop body we check whether we passed any accepting state for this token. If "lastAcceptState == -1" we did not, so we create an error token that contains one character. Otherwise, we create a token with the Accept symbol of lastAcceptState and the first lastAcceptPosition characters. In both cases we read the text with LookaheadBuffer() and assign the newly created token to "result". In the listings on this page LookaheadBuffer() only copies the text; produceToken() removes it from the buffer afterwards with trimBuffer().

F. Now outside the while loop, we copy _SysPosition to our "result" token and return it.

1)
Token result = new Token();
int currentDfa = FAStateList.InitialState;
int lastAcceptState = -1;		//DFA state of the last accept; -1 means none yet
int lastAcceptPosition = -1;            //We have not yet accepted a character string
int currentPosition = 1;
bool done = false;

string ch = Lookahead(1);               // Get first symbol from the lookahead buffer
if (ch != "" && Convert.ToInt32(ch[0]) != 65535) // End of file is not reached
{
	while (done == false)
	{
		ch = Lookahead(currentPosition);//Get the character at currentPosition (1-based)
		int target = -1;        
		if (!string.IsNullOrEmpty(ch))               
		{
			GrammarFAState currentFAState = GrammarTable.GrammarFAStateList[currentDfa];
			for (int n = 0; n < currentFAState.Edges.Count(); n++)
			{
				GrammarFAEdge fAEdge = currentFAState.Edges[n];
				if (fAEdge.Characters.Contains(Convert.ToInt32(ch[0])))
				{
					target = fAEdge.Target;
					break;
				}
			}
		}

		if (target > -1) //that means that target state is found
		{
			if (GrammarTable.GrammarFAStateList[target].Accept != null)
			{
				lastAcceptState = target;
				lastAcceptPosition = currentPosition;
			}
			currentDfa = target;
			currentPosition++;
		}
		else
		{
			done = true;
			if (lastAcceptState == -1)  // Lexer cannot recognize symbol
			{
				result.Parent = GrammarTable.GrammarSymbolList
					.GetFirstOfType(GrammarSymbolType.Error);
				result.Data = LookaheadBuffer(1);
			}
			else //Lexer can recognize symbol
			{
				result.Parent = GrammarTable.GrammarFAStateList[lastAcceptState].Accept;
				result.Data = LookaheadBuffer(lastAcceptPosition);  
			}
		}
	}
}
else // End of file reached, create End Token
{
	result.Data = "";
	result.Parent = GrammarTable.GrammarSymbolList.GetFirstOfType(GrammarSymbolType.End);
}
result.Position().Copy(_SysPosition);
return result;
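
The longest-match rule at the heart of this function can be tried without the grammar tables. The following is a minimal sketch under stated assumptions: DfaSketch, Step, and Next are hypothetical names, the three-state DFA (start, Integer, Identifier) is hard-coded instead of being read from the tables, and the input is a plain string instead of the lookahead buffer.

```csharp
using System;

// Hypothetical toy DFA, not the engine's tables:
// state 0 = start, state 1 = Integer, state 2 = Identifier.
public static class DfaSketch
{
    // Accept symbol per state; null means the state is not accepting.
    static readonly string[] Accept = { null, "Integer", "Identifier" };

    // Returns the target state for (state, c), or -1 when no edge matches.
    static int Step(int state, char c)
    {
        switch (state)
        {
            case 0: return char.IsDigit(c) ? 1 : char.IsLetter(c) ? 2 : -1;
            case 1: return char.IsDigit(c) ? 1 : -1;
            case 2: return char.IsLetterOrDigit(c) ? 2 : -1;
            default: return -1;
        }
    }

    // Longest-match scan from the start of text; returns (symbol, lexeme).
    public static (string Symbol, string Text) Next(string text)
    {
        if (text.Length == 0) return ("EOF", "");

        int currentDfa = 0;
        int lastAcceptState = -1;
        int lastAcceptPosition = -1;
        int currentPosition = 1;                 // 1-based, as in the steps above

        while (true)
        {
            int target = currentPosition <= text.Length
                ? Step(currentDfa, text[currentPosition - 1])
                : -1;                            // ran out of input: no edge
            if (target == -1) break;

            if (Accept[target] != null)          // remember the longest accept so far
            {
                lastAcceptState = target;
                lastAcceptPosition = currentPosition;
            }
            currentDfa = target;
            currentPosition++;
        }

        return lastAcceptState == -1
            ? ("Error", text.Substring(0, 1))    // one-character error token
            : (Accept[lastAcceptState], text.Substring(0, lastAcceptPosition));
    }
}
```

With this toy DFA, DfaSketch.Next("42x") returns the Integer token "42", and DfaSketch.Next("?x") returns an Error token holding the single character "?".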

How Lookahead(int charIndex) works?

Lookahead() returns one character from _LookaheadBuffer as a string, namely _LookaheadBuffer[charIndex - 1] (charIndex is 1-based). If _LookaheadBuffer is too short, it first reads more characters from the _Source TextReader.

First we check whether we must read characters from the stream, and read them if needed (1). If the buffer is still shorter than the index, we have reached the end of the text; in that case we return an empty string (2).

1)
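if (charIndex > _LookaheadBuffer.Length)
{
	//Compute the count once: the buffer grows inside the loop
	int missing = charIndex - _LookaheadBuffer.Length;
	for (int i = 0; i < missing; i++)
	{
		int x = _Source.Read();
		if (x == -1) break;	//End of text: do not append a bogus character
		_LookaheadBuffer += ((char)x).ToString();
	}
}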

2)
if (charIndex <= _LookaheadBuffer.Length)
	return _LookaheadBuffer[charIndex - 1].ToString();
return "";
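
To experiment with the fill-on-demand behaviour outside the lexer, the two steps can be combined into a small stand-alone class. This is only a sketch: LookaheadSketch and its Buffer property are hypothetical names, the source is a StringReader over a fixed string, and charIndex is assumed to be at least 1.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the lexer's buffer fields; not the engine's class.
public class LookaheadSketch
{
    private readonly TextReader _source;
    private string _buffer = "";

    public LookaheadSketch(string text)
    {
        _source = new StringReader(text);
    }

    // Exposed only so the effect of a peek on the buffer can be inspected.
    public string Buffer => _buffer;

    // 1-based peek: fills the buffer on demand and never removes anything.
    public string Lookahead(int charIndex)
    {
        while (_buffer.Length < charIndex)
        {
            int x = _source.Read();
            if (x == -1) break;              // end of text: stop filling
            _buffer += (char)x;
        }
        return charIndex <= _buffer.Length
            ? _buffer[charIndex - 1].ToString()
            : "";                            // index lies past the end of the text
    }
}
```

For the text "ab", Lookahead(2) returns "b" and leaves "ab" in the buffer, while Lookahead(3) returns an empty string.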

How LookaheadBuffer(int count) works?

LookaheadBuffer() returns the first "count" characters of _LookaheadBuffer. If we ask for more characters than the buffer contains, the method returns the whole buffer instead. Note that the code below only copies the characters; it does not remove them from the buffer. Removal is done separately (see ConsumeBuffer()).

1)
if (count > _LookaheadBuffer.Length)
	count = _LookaheadBuffer.Length;
return _LookaheadBuffer.Substring(0, count);

How ConsumeBuffer(int charCount) works?

ConsumeBuffer() advances _SysPosition over the first charCount characters of _LookaheadBuffer and then removes those characters from the buffer. A line feed increments the line and resets the column to 0, a carriage return changes nothing, and every other character increments the column. If charCount is larger than the buffer, the method does nothing.

1)
if (charCount > _LookaheadBuffer.Length) return;
for (int n = 0; n <= charCount - 1; n++)
{
	switch (_LookaheadBuffer[n])
	{
		case '\n':
			_SysPosition.Line ++;
			_SysPosition.Column = 0;
			break;

		case '\r':          //carriage return
			break;

		case '\v':          //vertical tab			   
		case '\f':          //form feed
		default:
			_SysPosition.Column ++;
			break;
	}
}
_LookaheadBuffer = _LookaheadBuffer.Remove(0, charCount);
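
The position bookkeeping is easy to check by hand with a pure function. The following is a minimal sketch with the same rules as the listing above; PositionSketch and Consume are hypothetical names, and the line and column are passed in and returned as plain integers instead of living in _SysPosition.

```csharp
using System;

// Hypothetical pure-function version of the position tracking; not the engine's code.
public static class PositionSketch
{
    // Advances (line, column) over the first charCount characters of buffer
    // and returns the new position together with the remaining buffer.
    public static (int Line, int Column, string Rest) Consume(
        string buffer, int charCount, int line, int column)
    {
        if (charCount > buffer.Length) return (line, column, buffer);   // nothing to do

        for (int n = 0; n < charCount; n++)
        {
            switch (buffer[n])
            {
                case '\n':          // line feed: next line, column restarts
                    line++;
                    column = 0;
                    break;
                case '\r':          // carriage return: position unchanged
                    break;
                default:            // everything else advances the column
                    column++;
                    break;
            }
        }
        return (line, column, buffer.Remove(0, charCount));
    }
}
```

Starting at line 0, column 0, consuming 5 characters of "ab\r\ncd" ends at line 1, column 1 with "d" left in the buffer: "a" and "b" move the column to 2, "\r" is ignored, "\n" starts line 1 at column 0, and "c" moves the column to 1.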

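Runaway group: the end of the file is reached inside a group, before that group's end symbol. In produceToken() this is the branch that returns the EOF token while the group stack is not empty.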
