JavaScript tokenizer #208

adamziel · 2025-11-02T23:30:40Z

After skimming through https://tc39.es/ecma262/#sec-unicode-format-control-characters, I realized tokenizing JavaScript is not that difficult. This PR doodles on such a tokenizer. It's largely AI-produced and probably wrong on at least a few counts, but it seems to produce useful results for a few normative scripts I've fed it. I don't have a good use case for it yet, other than maybe rewriting URLs in JavaScript snippets, so I'm posting this PR and forgetting about it for now.

$tokenizer = new JavaScriptTokenizer( 'const \u{0061}mazing = "string contents"; console.log(amazing)' );

while ( $tokenizer->next_token() ) {
	echo str_pad($tokenizer->get_type(), 15) . ' "' . $tokenizer->get_text() . '"' . PHP_EOL;
}
/*
IdentifierName  "const"
IdentifierName  "\u"
Punctuator      "{"
NumericLiteral  "0061"
Punctuator      "}"
IdentifierName  "mazing"
Punctuator      "="
StringLiteral   ""string contents""
Punctuator      ";"
IdentifierName  "console"
Punctuator      "."
IdentifierName  "log"
Punctuator      "("
IdentifierName  "amazing"
Punctuator      ")"
*/
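
The `\u{0061}mazing` lines in the output above look like one of the wrong counts: the ECMAScript IdentifierName grammar allows a `\` UnicodeEscapeSequence as an identifier start or part, so the whole sequence should probably come out as a single IdentifierName. Below is a minimal sketch of how that could be scanned. `scan_identifier_name()` and `decode_identifier_escapes()` are hypothetical helpers, not methods of the tokenizer in this PR, and the character classes only approximate the spec's ID_Start and ID_Continue sets. The decoder assumes the mbstring extension is available.

```php
<?php
/**
 * Hypothetical helper: matches one IdentifierName starting at byte offset $at.
 * Returns the raw matched text, or null if no identifier starts there.
 */
function scan_identifier_name( string $js, int $at ): ?string {
	// One escape: \uXXXX or \u{X...}.
	$escape = '\\\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]+\})';
	// Approximation of ID_Start: letters and letter numbers, plus $ and _.
	$start = '(?:[$_\p{L}\p{Nl}]|' . $escape . ')';
	// Approximation of ID_Continue: the above plus marks, digits,
	// connector punctuation, ZWNJ, and ZWJ.
	$part = '(?:[$_\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\x{200C}\x{200D}]|' . $escape . ')';

	// \G anchors the match at the requested offset.
	if ( 1 !== preg_match( '/\G' . $start . $part . '*/u', $js, $m, 0, $at ) ) {
		return null;
	}
	return $m[0];
}

/**
 * Hypothetical helper: replaces \uXXXX and \u{X...} escapes with the
 * characters they denote. It does not check that the decoded code point
 * is itself a valid identifier character, which the spec requires.
 */
function decode_identifier_escapes( string $raw ): string {
	return preg_replace_callback(
		'/\\\\u(?:([0-9A-Fa-f]{4})|\{([0-9A-Fa-f]+)\})/',
		function ( $m ) {
			// Group 1 is empty when the braced form matched.
			$hex = '' !== $m[1] ? $m[1] : $m[2];
			return mb_chr( hexdec( $hex ), 'UTF-8' );
		},
		$raw
	);
}

$js  = 'const \u{0061}mazing = "string contents";';
$raw = scan_identifier_name( $js, 6 );
echo $raw . PHP_EOL;                              // \u{0061}mazing
echo decode_identifier_escapes( $raw ) . PHP_EOL; // amazing
```

Keeping the raw text and the decoded name separate would let something like `get_text()` keep returning the exact source bytes, which matters if the goal is to rewrite snippets in place.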

cc @dmsnell @sirreal @brandonpayton in case it makes you think of some problem you've seen somewhere.

Simplify the tokenizer Simplify the tokenizer Simplify parse_punctuator Document

Draft JS tokenizer

f145d73

