adamziel commented Nov 2, 2025

After skimming through https://tc39.es/ecma262/#sec-unicode-format-control-characters, I realized that tokenizing JavaScript is not that difficult. This PR doodles on such a tokenizer. It's largely AI-produced and probably wrong on at least a few counts, but it seems to produce useful results for a few normative scripts I've fed it. I don't have a good use case for it yet, other than maybe rewriting URLs in JavaScript snippets, so I'm posting this PR and forgetting about it for now.

$tokenizer = new JavaScriptTokenizer( 'const \u{0061}mazing = "string contents"; console.log(amazing)' );

while ( $tokenizer->next_token() ) {
	echo str_pad($tokenizer->get_type(), 15) . ' "' . $tokenizer->get_text() . '"' . PHP_EOL;
}
/*
IdentifierName  "const"
IdentifierName  "\u"
Punctuator      "{"
NumericLiteral  "0061"
Punctuator      "}"
IdentifierName  "mazing"
Punctuator      "="
StringLiteral   ""string contents""
Punctuator      ";"
IdentifierName  "console"
Punctuator      "."
IdentifierName  "log"
Punctuator      "("
IdentifierName  "amazing"
Punctuator      ")"
*/

cc @dmsnell @sirreal @brandonpayton in case it makes you think of some problem you've seen somewhere.

Simplify the tokenizer

Simplify the tokenizer

Simplify parse_punctuator

Document