feat: enrich AST with structured word spans and assignment detection by mpecan · Pull Request #11 · mpecan/rable

mpecan · 2026-03-25T11:22:28Z

Summary

Moves expansion boundary tracking into the lexer and eliminates duplicate re-parsing in the sexp/format layers. The lexer now records WordSpans for all 14 bash expansion types with full quoting context, and all downstream code uses spans instead of re-scanning word value strings.

Key changes

- WordBuilder (word_builder.rs): New lexer type bundling word value + spans + quoting context stack. All 14 lexer word-reading functions converted from `&mut String` to `&mut WordBuilder`.
- 14 span kinds recorded: CommandSub, ArithmeticSub, ParamExpansion, SimpleVar, AnsiCQuote, LocaleString, ProcessSub, SingleQuoted, DoubleQuoted, Backtick, BracketSubscript, Extglob, DeprecatedArith, Escape.
- QuotingContext tracked per span: None, DoubleQuote, ParamExpansion, CommandSub, Backtick. This enables context-sensitive ANSI-C handling (`$'...'` inside `${...}` vs `"..."`).
- `segments_from_spans()` replaces `parse_word_segments()`: converts spans to segments without re-parsing, with top-level span filtering and context-aware formatting.
- Assignment detection: Lexer emits AssignmentWord tokens; parser populates `Command.assignments`.
- Array parsing at lex time: `read_array_elements()` parses array content as words directly, replacing post-hoc `normalize_array_content` string manipulation.
- Token spans threaded to all nodes: `word_node_from_token()` and `cond_term_from_token()` preserve lexer spans; `write_redirect()` uses spans instead of string searches.
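To make the WordBuilder idea concrete, here is a minimal sketch of a builder that accumulates the word value, records spans as byte offsets, and stamps each span with the quoting context active when it closes. The field names, method names (`begin_span`, `end_span`, `enter`, `leave`, `finish`), and the reduced set of span kinds are assumptions made for illustration; they are not taken from `word_builder.rs`.

```rust
/// Illustrative subset of the 14 span kinds listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WordSpanKind {
    CommandSub,
    ParamExpansion,
    SimpleVar,
    AnsiCQuote,
    DoubleQuoted,
}

/// The five quoting contexts named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuotingContext {
    None,
    DoubleQuote,
    ParamExpansion,
    CommandSub,
    Backtick,
}

/// A half-open byte range [start, end) into the word value (hypothetical layout).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WordSpan {
    pub start: usize,
    pub end: usize,
    pub kind: WordSpanKind,
    pub context: QuotingContext,
}

/// Bundles the word value, its spans, and a quoting context stack.
#[derive(Debug, Default)]
pub struct WordBuilder {
    value: String,
    spans: Vec<WordSpan>,
    context_stack: Vec<QuotingContext>,
}

impl WordBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Appends one character to the word value.
    pub fn push(&mut self, c: char) {
        self.value.push(c);
    }

    /// Returns the current byte offset, to be passed to `end_span` later.
    pub fn begin_span(&mut self) -> usize {
        self.value.len()
    }

    /// Records a span from `start` to the current offset, tagged with the
    /// innermost quoting context (or `None` when the stack is empty).
    pub fn end_span(&mut self, start: usize, kind: WordSpanKind) {
        let context = self.current_context();
        let end = self.value.len();
        self.spans.push(WordSpan { start, end, kind, context });
    }

    pub fn enter(&mut self, ctx: QuotingContext) {
        self.context_stack.push(ctx);
    }

    pub fn leave(&mut self) {
        self.context_stack.pop();
    }

    /// Consumes the builder, yielding the value and its spans.
    pub fn finish(self) -> (String, Vec<WordSpan>) {
        (self.value, self.spans)
    }

    fn current_context(&self) -> QuotingContext {
        self.context_stack.last().copied().unwrap_or(QuotingContext::None)
    }
}
```

In this sketch, lexing `"$x"` would record a SimpleVar span over bytes 1..3 with context DoubleQuote, and a DoubleQuoted span over bytes 0..4 with context None, so downstream code can tell which expansions sit inside quotes without re-scanning the string.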

Dead code removed (~540 lines)

| Function | File | Lines |
|---|---|---|
| `parse_word_segments()` + helpers | sexp/word.rs | ~195 |
| `extract_paren_content()` | sexp/mod.rs | ~107 |
| `try_normalize_array()` + `normalize_array_content()` | sexp/word.rs | ~107 |
| `needs_word_processing()`, `write_word_value()`, `needs_value_path()` | sexp/ | ~25 |
| `should_format_from_value()`, `parts_to_segments()` | sexp/word.rs | ~40 |
| `skip_single/double/backtick`, `is_backslash_escaped`, `read_balanced_delim` | context.rs | ~110 |

Numbers

| Metric | Value |
|---|---|
| Files changed | 17 (+ 2 new) |
| Lines added | ~960 |
| Lines removed | ~880 |
| Net | ~+80 (includes 22 new span tests + 6 assignment tests) |
| Oracle | 167/181 (was 165 at baseline; gained 2 from eliminating buggy re-parsing) |

Test plan

- 130 tests pass (87 unit + 37 integration + 6 doc)
- Oracle: 167/181 (+2 improvement)
- `cargo clippy --all-targets -- -D warnings` clean
- S-expression output identical to before (Parable compatibility)
- 16 span recording tests verify byte offsets for all expansion types
- 6 assignment tests verify lexer + parser behavior
- 7 word decomposition tests use real lexer spans (not re-parsing fallback)

🤖 Generated with Claude Code

Move expansion boundary tracking into the lexer to eliminate ~300 lines of duplicate re-parsing in the sexp formatting layer.

Key changes:

- **WordBuilder**: New lexer type that bundles word value string with expansion spans and quoting context. All 14 lexer word-reading functions converted from `&mut String` to `&mut WordBuilder`.
- **WordSpan tracking**: Lexer records spans for all 14 expansion types (CommandSub, AnsiCQuote, LocaleString, ProcessSub, SingleQuoted, DoubleQuoted, ParamExpansion, SimpleVar, ArithmeticSub, Backtick, BracketSubscript, Extglob, DeprecatedArith, Escape) with quoting context (None, DoubleQuote, ParamExpansion, CommandSub, Backtick).
- **segments_from_spans**: New span-based segment extraction replaces `parse_word_segments` re-parsing for words with spans. Filters to sexp-relevant span kinds, handles nested spans via top-level collection, and uses quoting context for correct ANSI-C behavior.
- **Assignment detection**: Lexer emits AssignmentWord tokens for `NAME=`, `NAME+=`, `NAME[...]=` patterns. Parser populates `Command.assignments` field.
- **Word.parts**: Populated via `decompose_word()` with structured AST nodes (WordLiteral, CommandSubstitution, ProcessSubstitution, AnsiCQuote, LocaleString).
- **Array parsing**: New `read_array_elements`/`read_array_element` in lexer parses array content as individual words at lex time, replacing post-hoc `normalize_array_content` string manipulation.
- **Dead code removed**: `try_normalize_array`, `normalize_array_content`, `read_balanced_delim`, `should_format_from_value`, `parts_to_segments`.

All 138 tests pass (95 unit + 37 integration + 6 doc). Oracle: 165/181 (unchanged from baseline). S-expression output identical (Parable compatibility preserved).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
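The assignment patterns named in this commit (`NAME=`, `NAME+=`, `NAME[...]=`) could be recognized roughly as follows. This is a simplified sketch: the function name `assignment_prefix_len` is hypothetical, and it ignores quoting and expansions inside the subscript, which a real lexer working with spans would have to account for.

```rust
/// Returns the byte length of the assignment prefix (up to and including `=`)
/// if `word` starts with `NAME=`, `NAME+=`, or `NAME[...]=` (optionally `+=`).
/// Hypothetical helper; quoting inside the subscript is not handled.
pub fn assignment_prefix_len(word: &str) -> Option<usize> {
    let bytes = word.as_bytes();

    // NAME must start with a letter or underscore.
    let first = *bytes.first()?;
    if !(first.is_ascii_alphabetic() || first == b'_') {
        return None;
    }

    // Consume the rest of NAME: letters, digits, underscores.
    let mut i = 1;
    while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
        i += 1;
    }

    // Optional subscript: skip a balanced [...] group.
    if bytes.get(i) == Some(&b'[') {
        let mut depth = 0usize;
        loop {
            // Running off the end means the bracket was never closed.
            match bytes.get(i)? {
                b'[' => depth += 1,
                b']' => {
                    depth -= 1;
                    if depth == 0 {
                        i += 1; // step past the closing bracket
                        break;
                    }
                }
                _ => {}
            }
            i += 1;
        }
    }

    // Optional `+` for append-assignment, then the mandatory `=`.
    if bytes.get(i) == Some(&b'+') {
        i += 1;
    }
    if bytes.get(i) == Some(&b'=') {
        Some(i + 1)
    } else {
        None
    }
}
```

Returning the prefix length rather than a boolean lets a caller split the word into name and value at a known byte offset, which fits the span-based approach of avoiding later string searches.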

Eliminate ~400 lines of duplicate parsing code in the sexp/format layers by ensuring all Word and CondTerm nodes carry lexer spans.

- Thread token spans via `word_node_from_token()` (9 call sites updated)
- Add spans to CondTerm nodes via `cond_term_from_token()` (4 call sites)
- Rewrite `write_redirect()` to use span-based segments instead of string searches (`needs_word_processing` removed)
- Update `process_word_value()` in format module to use spans
- Replace `decompose_word()` with `decompose_word_with_spans()` (span path) and `decompose_word_literal()` (synthetic nodes)
- Word Display for span-less synthetic nodes uses `write_escaped_word` directly instead of re-parsing through `parse_word_segments`

Deleted dead code:

- `parse_word_segments()` + `flush_literal` + `extract_ansi_c_content` + `extract_locale_content` (~195 lines in sexp/word.rs)
- `extract_paren_content()` (~107 lines in sexp/mod.rs)
- `needs_word_processing()`, `write_word_value()`, `needs_value_path()`
- `skip_single_quoted/double/backtick`, `is_backslash_escaped` (~70 lines in context.rs)
- 8 tests for deleted `parse_word_segments` function

Oracle improved: 167/181 (was 165). Removing the buggy re-parsing fixed 2 edge cases where the two parsers disagreed.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

… code

- Fix O(n²) `collect_top_level_sexp_spans`: sort by start offset then single-pass sweep (was using `ptr::eq` containment check)
- Move `word_node_from_token`/`cond_term_from_token` to take owned Token, eliminating String+Vec clones per word
- Remove stale `#[allow(dead_code)]` annotations on `Token.spans`, `Token::with_spans`, and blanket module-level allow on word_builder
- Remove unused `WordBuilder::push_str` method
- Fix CondTerm Display to handle all segment types via `write_redirect_segments` instead of debug-formatting unknown segments
- Fix locale string context handling for `$"..."` inside words

Oracle improved: 169/181 (was 167).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
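The "sort by start offset then single-pass sweep" approach to collecting top-level spans can be sketched as below. The `Span` type and the `top_level_spans` name are stand-ins for illustration, not the project's definitions, and the sketch assumes spans are properly nested (no partial overlaps).

```rust
/// Hypothetical half-open byte range [start, end).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Keeps only spans that are not contained in another span.
/// Assumes spans are either disjoint or strictly nested.
pub fn top_level_spans(mut spans: Vec<Span>) -> Vec<Span> {
    // Sort by start ascending; for equal starts put the longer span first,
    // so an outer span is always visited before the spans it contains.
    spans.sort_by(|a, b| a.start.cmp(&b.start).then(b.end.cmp(&a.end)));

    let mut out = Vec::new();
    // End offset of the most recently accepted top-level span.
    let mut covered_until = 0usize;

    for span in spans {
        if span.start >= covered_until {
            // Starts after the previous top-level span ended: top-level.
            covered_until = span.end;
            out.push(span);
        }
        // Otherwise the span lies inside the previous top-level span; skip it.
    }
    out
}
```

The sort dominates the cost at O(n log n), and the sweep itself is linear, compared with the pairwise containment check that the commit message describes as O(n²).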

Replace the non-asserting oracle test with one that tracks known failures explicitly. The test now fails on:

- Regressions (previously passing test now fails)
- Newly passing tests (update `KNOWN_ORACLE_FAILURES` to track progress)

12 known failures documented with root cause comments. Oracle: 169/181.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
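The two failure conditions described in this commit amount to comparing actual results against an explicit list of known failures. A minimal sketch of that comparison follows; `OracleReport` and `check_oracle` are hypothetical names, and only `KNOWN_ORACLE_FAILURES` appears in the commit message.

```rust
use std::collections::BTreeSet;

/// Outcome of comparing oracle results with the known-failure list.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct OracleReport {
    /// Cases that fail but are not listed as known failures.
    pub regressions: Vec<String>,
    /// Cases that pass but are still listed as known failures.
    pub newly_passing: Vec<String>,
}

/// `results` pairs each case name with whether it passed.
/// `known_failures` plays the role of a KNOWN_ORACLE_FAILURES list.
pub fn check_oracle(results: &[(&str, bool)], known_failures: &[&str]) -> OracleReport {
    let known: BTreeSet<&str> = known_failures.iter().copied().collect();
    let mut report = OracleReport::default();

    for &(name, passed) in results {
        match (passed, known.contains(&name)) {
            // Failing and not expected to fail: a regression.
            (false, false) => report.regressions.push(name.to_string()),
            // Passing but still listed as failing: the list is stale.
            (true, true) => report.newly_passing.push(name.to_string()),
            // Expected pass or expected failure: nothing to report.
            _ => {}
        }
    }
    report
}
```

A test built on this would assert that both vectors are empty, so the known-failure list has to shrink whenever a fix lands and can never silently grow.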

Each oracle .tests file now has its own test function generated by the `oracle_test!` macro (e.g., `oracle_ansi_c_escapes`, `oracle_heredoc_formatting`). Each test individually asserts no regressions and no newly passing tests. This replaces the single `oracle_test_suite` function, giving specific line-level feedback on which oracle category passes or fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
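A macro that generates one test function per oracle file might look like the sketch below. Only the macro name `oracle_test!` and the two example test names come from the commit message. The macro's argument shape, the `run_oracle_file` helper, and the category strings are assumptions, and the helper is a stub that stands in for actually running a `.tests` file.

```rust
/// Hypothetical runner: a real version would load the named oracle file,
/// run each case, and compare against the known-failure list.
/// Returns (regressions, newly_passing). This stub reports nothing.
fn run_oracle_file(category: &str) -> (Vec<String>, Vec<String>) {
    let _ = category;
    (Vec::new(), Vec::new())
}

/// Generates one `#[test]` function per oracle category, so a failure
/// names the category directly in the test output.
macro_rules! oracle_test {
    ($fn_name:ident, $category:literal) => {
        #[test]
        fn $fn_name() {
            let (regressions, newly_passing) = run_oracle_file($category);
            assert!(
                regressions.is_empty(),
                "{}: regressions: {:?}",
                $category,
                regressions
            );
            assert!(
                newly_passing.is_empty(),
                "{}: newly passing, update the known-failure list: {:?}",
                $category,
                newly_passing
            );
        }
    };
}

oracle_test!(oracle_ansi_c_escapes, "ansi_c_escapes");
oracle_test!(oracle_heredoc_formatting, "heredoc_formatting");
```

With one function per category, `cargo test` can report and filter each oracle file by name instead of surfacing a single aggregate failure.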

🤖 I have created a release *beep* *boop*

---

## [0.1.8](rable-v0.1.7...rable-v0.1.8) (2026-03-25)

### Features

* enrich AST with structured word spans and assignment detection ([9163d24](9163d24))
* enrich AST with structured word spans and assignment detection ([#11](#11)) ([3b58c38](3b58c38))

### Bug Fixes

* CTLESC byte doubling for bash-oracle compatibility (179/181) ([72bc381](72bc381))
* heredoc trailing newline at EOF with backslash (180/181) ([4af8d91](4af8d91))
* resolve 11 oracle test failures (180/181) ([#13](#13)) ([69d6bc8](69d6bc8))
* resolve 3 more oracle failures (177/181) ([8aca953](8aca953))
* resolve 6 oracle test failures ([0496222](0496222))
* resolve 6 oracle test failures (175/181) ([1708884](1708884))

### Documentation

* comprehensive documentation update ([#14](#14)) ([6abfb20](6abfb20))
* comprehensive documentation update for better DX ([61114b0](61114b0))

### Code Refactoring

* remove sexp re-parsing by threading spans through all nodes ([54db8c7](54db8c7))
* simplify span collection, move to owned tokens, remove dead code ([cec7e8e](cec7e8e))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

mpecan and others added 5 commits March 25, 2026 12:22

mpecan merged commit 3b58c38 into main Mar 25, 2026
5 checks passed

mpecan deleted the feat/enrich-ast-and-lexer-spans branch March 25, 2026 13:02

repository-butler bot mentioned this pull request Mar 25, 2026

chore(main): release rable 0.1.8 #12

Merged
