
feat: enrich AST with structured word spans and assignment detection #11

Merged
mpecan merged 5 commits into main from feat/enrich-ast-and-lexer-spans
Mar 25, 2026

mpecan commented Mar 25, 2026

Summary

Moves expansion boundary tracking into the lexer and eliminates duplicate re-parsing in the sexp/format layers. The lexer now records WordSpans for all 14 bash expansion types with full quoting context, and all downstream code uses spans instead of re-scanning word value strings.

Key changes

  • WordBuilder (word_builder.rs): New lexer type bundling word value + spans + quoting context stack. All 14 lexer word-reading functions converted from &mut String to &mut WordBuilder
  • 14 span kinds recorded: CommandSub, ArithmeticSub, ParamExpansion, SimpleVar, AnsiCQuote, LocaleString, ProcessSub, SingleQuoted, DoubleQuoted, Backtick, BracketSubscript, Extglob, DeprecatedArith, Escape
  • QuotingContext tracked per span: None, DoubleQuote, ParamExpansion, CommandSub, Backtick — enables context-sensitive ANSI-C handling ($'...' inside ${...} vs "...")
  • segments_from_spans() replaces parse_word_segments() — converts spans to segments without re-parsing, with top-level span filtering and context-aware formatting
  • Assignment detection: Lexer emits AssignmentWord tokens; parser populates Command.assignments
  • Array parsing at lex time: read_array_elements() parses array content as words directly, replacing post-hoc normalize_array_content string manipulation
  • Token spans threaded to all nodes: word_node_from_token() and cond_term_from_token() preserve lexer spans; write_redirect() uses spans instead of string searches
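
As a rough illustration of the WordBuilder idea described above, the sketch below bundles a word value, recorded spans, and a quoting context stack. It is not the code from `word_builder.rs`: the field names, the `open_span`/`close_span`/`finish` methods, and the reduced `SpanKind` enum (4 of the 14 kinds) are all assumptions made for the example.

```rust
/// A few of the span kinds named in the PR; the real enum has 14 variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpanKind {
    CommandSub,
    ParamExpansion,
    AnsiCQuote,
    DoubleQuoted,
}

/// Quoting contexts as listed in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuotingContext {
    None,
    DoubleQuote,
    ParamExpansion,
    CommandSub,
    Backtick,
}

/// Byte range of one expansion inside the word value (layout is assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WordSpan {
    pub start: usize,
    pub end: usize,
    pub kind: SpanKind,
    pub context: QuotingContext,
}

/// Hypothetical builder: word value + finished spans + context stack.
#[derive(Debug, Default)]
pub struct WordBuilder {
    value: String,
    spans: Vec<WordSpan>,
    contexts: Vec<QuotingContext>,
    // (start offset, kind, context at open, whether a context was pushed)
    open: Vec<(usize, SpanKind, QuotingContext, bool)>,
}

impl WordBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn push(&mut self, c: char) {
        self.value.push(c);
    }

    /// Innermost quoting context, or `None` at the top level of the word.
    pub fn context(&self) -> QuotingContext {
        self.contexts.last().copied().unwrap_or(QuotingContext::None)
    }

    /// Start a span at the current byte offset; `inner` is the context that
    /// applies to nested spans, if this construct introduces one.
    pub fn open_span(&mut self, kind: SpanKind, inner: Option<QuotingContext>) {
        self.open.push((self.value.len(), kind, self.context(), inner.is_some()));
        if let Some(ctx) = inner {
            self.contexts.push(ctx);
        }
    }

    /// Finish the most recently opened span at the current byte offset.
    pub fn close_span(&mut self) {
        if let Some((start, kind, context, pushed)) = self.open.pop() {
            if pushed {
                self.contexts.pop();
            }
            self.spans.push(WordSpan { start, end: self.value.len(), kind, context });
        }
    }

    pub fn finish(self) -> (String, Vec<WordSpan>) {
        (self.value, self.spans)
    }
}
```

With a builder like this, lexing `${x:-$'a'}` would record the inner `$'a'` as an `AnsiCQuote` span whose context is `ParamExpansion`, which is the information the PR says is needed to treat `$'...'` inside `${...}` differently from `$'...'` inside `"..."`.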

Dead code removed (~540 lines)

| Function | File | Lines |
| --- | --- | --- |
| `parse_word_segments()` + helpers | sexp/word.rs | ~195 |
| `extract_paren_content()` | sexp/mod.rs | ~107 |
| `try_normalize_array()` + `normalize_array_content()` | sexp/word.rs | ~107 |
| `needs_word_processing()`, `write_word_value()`, `needs_value_path()` | sexp/ | ~25 |
| `should_format_from_value()`, `parts_to_segments()` | sexp/word.rs | ~40 |
| `skip_single/double/backtick`, `is_backslash_escaped`, `read_balanced_delim` | context.rs | ~110 |

Numbers

| Metric | Value |
| --- | --- |
| Files changed | 17 (+ 2 new) |
| Lines added | ~960 |
| Lines removed | ~880 |
| Net | ~+80 (includes 22 new span tests + 6 assignment tests) |
| Oracle | 167/181 (was 165 at baseline; gained 2 from eliminating buggy re-parsing) |

Test plan

  • 130 tests pass (87 unit + 37 integration + 6 doc)
  • Oracle: 167/181 (+2 improvement)
  • cargo clippy --all-targets -- -D warnings clean
  • S-expression output identical to before (Parable compatibility)
  • 16 span recording tests verify byte offsets for all expansion types
  • 6 assignment tests verify lexer + parser behavior
  • 7 word decomposition tests use real lexer spans (not re-parsing fallback)

🤖 Generated with Claude Code

mpecan and others added 5 commits March 25, 2026 12:22
Move expansion boundary tracking into the lexer to eliminate ~300 lines
of duplicate re-parsing in the sexp formatting layer.

Key changes:

- **WordBuilder**: New lexer type that bundles word value string with
  expansion spans and quoting context. All 14 lexer word-reading
  functions converted from `&mut String` to `&mut WordBuilder`.

- **WordSpan tracking**: Lexer records spans for all 14 expansion types
  (CommandSub, AnsiCQuote, LocaleString, ProcessSub, SingleQuoted,
  DoubleQuoted, ParamExpansion, SimpleVar, ArithmeticSub, Backtick,
  BracketSubscript, Extglob, DeprecatedArith, Escape) with quoting
  context (None, DoubleQuote, ParamExpansion, CommandSub, Backtick).

- **segments_from_spans**: New span-based segment extraction replaces
  `parse_word_segments` re-parsing for words with spans. Filters to
  sexp-relevant span kinds, handles nested spans via top-level
  collection, and uses quoting context for correct ANSI-C behavior.

- **Assignment detection**: Lexer emits AssignmentWord tokens for
  `NAME=`, `NAME+=`, `NAME[...]=` patterns. Parser populates
  `Command.assignments` field.

- **Word.parts**: Populated via `decompose_word()` with structured AST
  nodes (WordLiteral, CommandSubstitution, ProcessSubstitution,
  AnsiCQuote, LocaleString).

- **Array parsing**: New `read_array_elements`/`read_array_element` in
  lexer parses array content as individual words at lex time, replacing
  post-hoc `normalize_array_content` string manipulation.

- **Dead code removed**: `try_normalize_array`, `normalize_array_content`,
  `read_balanced_delim`, `should_format_from_value`, `parts_to_segments`.
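
The three assignment shapes listed above (`NAME=`, `NAME+=`, `NAME[...]=`) can be recognised with a short prefix scan. The function below is only a sketch of that check, not the lexer's actual code: the name `is_assignment_word` is hypothetical, and it ignores quoting inside subscripts, which a real lexer would have to handle.

```rust
/// Returns true if `word` starts with `NAME=`, `NAME+=`, or `NAME[...]=`
/// (optionally `NAME[...]+=`). Simplified sketch: quotes and expansions
/// inside the subscript are not interpreted, only bracket nesting is.
pub fn is_assignment_word(word: &str) -> bool {
    let bytes = word.as_bytes();

    // NAME must start with an ASCII letter or underscore.
    match bytes.first() {
        Some(b) if b.is_ascii_alphabetic() || *b == b'_' => {}
        _ => return false,
    }

    // Consume the rest of NAME.
    let mut i = 1;
    while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
        i += 1;
    }

    // Optional subscript: skip to the matching ']' while tracking nesting.
    if bytes.get(i) == Some(&b'[') {
        let mut depth = 0usize;
        loop {
            match bytes.get(i) {
                Some(b'[') => depth += 1,
                Some(b']') => {
                    depth -= 1;
                    if depth == 0 {
                        i += 1;
                        break;
                    }
                }
                Some(_) => {}
                None => return false, // unterminated subscript
            }
            i += 1;
        }
    }

    // Optional '+' followed by the required '='.
    if bytes.get(i) == Some(&b'+') {
        i += 1;
    }
    bytes.get(i) == Some(&b'=')
}
```

A lexer using a check like this would emit an AssignmentWord token when it returns true and an ordinary word token otherwise.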

All 138 tests pass (95 unit + 37 integration + 6 doc).
Oracle: 165/181 (unchanged from baseline).
S-expression output identical (Parable compatibility preserved).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Eliminate ~400 lines of duplicate parsing code in the sexp/format layers
by ensuring all Word and CondTerm nodes carry lexer spans.

- Thread token spans via `word_node_from_token()` (9 call sites updated)
- Add spans to CondTerm nodes via `cond_term_from_token()` (4 call sites)
- Rewrite `write_redirect()` to use span-based segments instead of
  string searches (`needs_word_processing` removed)
- Update `process_word_value()` in format module to use spans
- Replace `decompose_word()` with `decompose_word_with_spans()` (span
  path) and `decompose_word_literal()` (synthetic nodes)
- Word Display for span-less synthetic nodes uses `write_escaped_word`
  directly instead of re-parsing through `parse_word_segments`
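
To show what "span-based segments instead of string searches" can look like, here is a minimal sketch that slices a word value by byte offsets taken from already-recorded spans. The `Segment` type, the function name `split_by_spans`, and the two-variant `SpanKind` are assumptions for the example; it also assumes the spans passed in are top-level, sorted by start, and non-overlapping.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpanKind {
    CommandSub,
    AnsiCQuote,
}

#[derive(Debug, Clone, Copy)]
pub struct WordSpan {
    pub start: usize,
    pub end: usize,
    pub kind: SpanKind,
}

/// Either plain text between spans or the text covered by one span.
#[derive(Debug, PartialEq, Eq)]
pub enum Segment<'a> {
    Literal(&'a str),
    Span(SpanKind, &'a str),
}

/// Cut `value` into segments using span offsets only; the word text is
/// never re-scanned for quotes or `$` characters.
pub fn split_by_spans<'a>(value: &'a str, spans: &[WordSpan]) -> Vec<Segment<'a>> {
    let mut out = Vec::new();
    let mut pos = 0;
    for span in spans {
        // Text between the previous span and this one is literal.
        if span.start > pos {
            out.push(Segment::Literal(&value[pos..span.start]));
        }
        out.push(Segment::Span(span.kind, &value[span.start..span.end]));
        pos = span.end;
    }
    // Trailing literal after the last span.
    if pos < value.len() {
        out.push(Segment::Literal(&value[pos..]));
    }
    out
}
```

Because the offsets come from the lexer, there is only one place that decides where an expansion begins and ends, which is the disagreement between two parsers that the commit message says was removed.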

Deleted dead code:
- `parse_word_segments()` + `flush_literal` + `extract_ansi_c_content`
  + `extract_locale_content` (~195 lines in sexp/word.rs)
- `extract_paren_content()` (~107 lines in sexp/mod.rs)
- `needs_word_processing()`, `write_word_value()`, `needs_value_path()`
- `skip_single_quoted/double/backtick`, `is_backslash_escaped` (~70
  lines in context.rs)
- 8 tests for deleted `parse_word_segments` function

Oracle improved: 167/181 (was 165) — removing buggy re-parsing fixed
2 edge cases where the two parsers disagreed.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

… code

- Fix O(n²) collect_top_level_sexp_spans: sort by start offset then
  single-pass sweep (was using ptr::eq containment check)
- Move word_node_from_token/cond_term_from_token to take owned Token,
  eliminating String+Vec clones per word
- Remove stale #[allow(dead_code)] annotations on Token.spans,
  Token::with_spans, and blanket module-level allow on word_builder
- Remove unused WordBuilder::push_str method
- Fix CondTerm Display to handle all segment types via
  write_redirect_segments instead of debug-formatting unknown segments
- Fix locale string context handling for $"..." inside words
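
The "sort by start offset then single-pass sweep" approach from the first bullet can be sketched as follows. This is an illustration under assumptions, not the body of `collect_top_level_sexp_spans`: the `Span` type and the function name `top_level_spans` are hypothetical, and spans are assumed to be properly nested (no partial overlaps).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Keep only spans that are not contained in another span.
/// Runs in O(n log n) for the sort plus one linear pass.
pub fn top_level_spans(mut spans: Vec<Span>) -> Vec<Span> {
    // Sort by start; for equal starts the longer (outer) span comes first.
    spans.sort_by(|a, b| a.start.cmp(&b.start).then(b.end.cmp(&a.end)));

    let mut out = Vec::new();
    let mut covered_until = 0usize;
    for span in spans {
        // A span starting before `covered_until` lies inside a span that
        // was already kept, so it is nested and skipped.
        if span.start >= covered_until {
            covered_until = span.end;
            out.push(span);
        }
    }
    out
}
```

Compared with checking every span against every other span for containment, the sweep only needs to remember how far the last kept span reaches.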

Oracle improved: 169/181 (was 167).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Replace the non-asserting oracle test with one that tracks known
failures explicitly. The test now fails on:
- Regressions (previously passing test now fails)
- Newly passing tests (update KNOWN_ORACLE_FAILURES to track progress)

12 known failures documented with root cause comments.
Oracle: 169/181.
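
The two failure conditions described above can be expressed as a small comparison between actual results and a known-failure set. The sketch below is an assumption about the shape of that check, not the repository's test code: `OracleReport` and `check_oracle` are hypothetical names, and the set passed in plays the role of `KNOWN_ORACLE_FAILURES`.

```rust
use std::collections::BTreeSet;

/// Cases that should make the oracle test fail, split by reason.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct OracleReport {
    /// Not in the known-failure set, but failed.
    pub regressions: Vec<String>,
    /// In the known-failure set, but passed; the set needs updating.
    pub newly_passing: Vec<String>,
}

/// Compare `(case name, passed)` results against the known failures.
pub fn check_oracle(results: &[(&str, bool)], known_failures: &BTreeSet<&str>) -> OracleReport {
    let mut report = OracleReport::default();
    for &(name, passed) in results {
        match (passed, known_failures.contains(name)) {
            (false, false) => report.regressions.push(name.to_string()),
            (true, true) => report.newly_passing.push(name.to_string()),
            // Passing as expected, or failing as documented: nothing to report.
            _ => {}
        }
    }
    report
}
```

A test built on this would assert that both lists are empty, so a fix that makes a known failure pass forces the known-failure list to be updated in the same change.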

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Each oracle .tests file now has its own test function generated by the
oracle_test! macro (e.g., oracle_ansi_c_escapes, oracle_heredoc_formatting).
Each test individually asserts no regressions and no newly passing tests.

This replaces the single oracle_test_suite function, giving specific
line-level feedback on which oracle category passes or fails.
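
A macro that generates one test function per oracle file could look roughly like the sketch below. Only the macro name `oracle_test!` and the two function names come from the commit message; the macro's arguments, the `.tests` file names, the stub `failing_cases` runner, and the case name `heredoc_eof_backslash` are all invented for illustration.

```rust
/// Hypothetical stand-in for the real oracle runner: returns the names of
/// the cases in `file` whose output differs from the oracle.
fn failing_cases(file: &str) -> Vec<&'static str> {
    match file {
        "heredoc_formatting.tests" => vec!["heredoc_eof_backslash"],
        _ => vec![],
    }
}

/// Generate one test function per oracle file. Each function fails on a
/// regression (unexpected failure) and on a newly passing known failure.
macro_rules! oracle_test {
    ($name:ident, $file:expr, [$($known:expr),* $(,)?]) => {
        #[cfg_attr(test, test)]
        fn $name() {
            let known: &[&str] = &[$($known),*];
            let failing = failing_cases($file);
            for case in &failing {
                assert!(known.contains(case), "regression in {}: {}", $file, case);
            }
            for case in known {
                assert!(
                    failing.contains(case),
                    "newly passing in {}: {} (update the known-failure list)",
                    $file,
                    case
                );
            }
        }
    };
}

oracle_test!(oracle_ansi_c_escapes, "ansi_c_escapes.tests", []);
oracle_test!(oracle_heredoc_formatting, "heredoc_formatting.tests", ["heredoc_eof_backslash"]);
```

Because each file gets its own function, a failure is reported under a name such as `oracle_heredoc_formatting` rather than under one suite-wide test.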

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
mpecan merged commit 3b58c38 into main Mar 25, 2026
5 checks passed
mpecan deleted the feat/enrich-ast-and-lexer-spans branch March 25, 2026 13:02
mpecan added a commit that referenced this pull request Mar 25, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.8](rable-v0.1.7...rable-v0.1.8) (2026-03-25)


### Features

* enrich AST with structured word spans and assignment detection
([9163d24](9163d24))
* enrich AST with structured word spans and assignment detection
([#11](#11))
([3b58c38](3b58c38))


### Bug Fixes

* CTLESC byte doubling for bash-oracle compatibility (179/181)
([72bc381](72bc381))
* heredoc trailing newline at EOF with backslash (180/181)
([4af8d91](4af8d91))
* resolve 11 oracle test failures (180/181)
([#13](#13))
([69d6bc8](69d6bc8))
* resolve 3 more oracle failures (177/181)
([8aca953](8aca953))
* resolve 6 oracle test failures
([0496222](0496222))
* resolve 6 oracle test failures (175/181)
([1708884](1708884))


### Documentation

* comprehensive documentation update
([#14](#14))
([6abfb20](6abfb20))
* comprehensive documentation update for better DX
([61114b0](61114b0))


### Code Refactoring

* remove sexp re-parsing by threading spans through all nodes
([54db8c7](54db8c7))
* simplify span collection, move to owned tokens, remove dead code
([cec7e8e](cec7e8e))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).