feat: enrich AST with structured word spans and assignment detection (#11)
Merged
Conversation
Move expansion boundary tracking into the lexer to eliminate ~300 lines of duplicate re-parsing in the sexp formatting layer.

Key changes:

- **WordBuilder**: New lexer type that bundles the word value string with expansion spans and quoting context. All 14 lexer word-reading functions converted from `&mut String` to `&mut WordBuilder`.
- **WordSpan tracking**: The lexer records spans for all 14 expansion types (CommandSub, AnsiCQuote, LocaleString, ProcessSub, SingleQuoted, DoubleQuoted, ParamExpansion, SimpleVar, ArithmeticSub, Backtick, BracketSubscript, Extglob, DeprecatedArith, Escape) with quoting context (None, DoubleQuote, ParamExpansion, CommandSub, Backtick).
- **segments_from_spans**: New span-based segment extraction replaces `parse_word_segments` re-parsing for words that carry spans. It filters to sexp-relevant span kinds, handles nested spans via top-level collection, and uses quoting context for correct ANSI-C behavior.
- **Assignment detection**: The lexer emits AssignmentWord tokens for `NAME=`, `NAME+=`, and `NAME[...]=` patterns. The parser populates the `Command.assignments` field.
- **Word.parts**: Populated via `decompose_word()` with structured AST nodes (WordLiteral, CommandSubstitution, ProcessSubstitution, AnsiCQuote, LocaleString).
- **Array parsing**: New `read_array_elements`/`read_array_element` lexer functions parse array content as individual words at lex time, replacing the post-hoc `normalize_array_content` string manipulation.
- **Dead code removed**: `try_normalize_array`, `normalize_array_content`, `read_balanced_delim`, `should_format_from_value`, `parts_to_segments`.

All 138 tests pass (95 unit + 37 integration + 6 doc). Oracle: 165/181 (unchanged from baseline). S-expression output identical (Parable compatibility preserved).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
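A minimal sketch of the WordBuilder idea described above: accumulate the word's text while recording expansion spans and quoting context as they are lexed. All type and method names here are illustrative assumptions, not the crate's actual API.

```rust
// Illustrative sketch only; the real WordBuilder lives in the lexer and
// tracks all 14 expansion kinds and 5 quoting contexts.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SpanKind { CommandSub, SingleQuoted, DoubleQuoted, SimpleVar }

#[derive(Debug, Clone, Copy, PartialEq)]
enum QuoteCtx { None, DoubleQuote, CommandSub }

#[derive(Debug, Clone, Copy, PartialEq)]
struct WordSpan { kind: SpanKind, start: usize, end: usize, ctx: QuoteCtx }

/// Accumulates a word's text while recording expansion spans, so downstream
/// layers can use recorded boundaries instead of re-scanning the string.
#[derive(Default)]
struct WordBuilder {
    value: String,
    spans: Vec<WordSpan>,
    ctx_stack: Vec<QuoteCtx>,
}

impl WordBuilder {
    fn push(&mut self, c: char) { self.value.push(c); }
    /// Remember where a construct begins in the accumulated value.
    fn mark(&self) -> usize { self.value.len() }
    fn current_ctx(&self) -> QuoteCtx {
        *self.ctx_stack.last().unwrap_or(&QuoteCtx::None)
    }
    /// Record a span covering `start..` up to the current end of the value.
    fn close_span(&mut self, kind: SpanKind, start: usize) {
        let (end, ctx) = (self.value.len(), self.current_ctx());
        self.spans.push(WordSpan { kind, start, end, ctx });
    }
}

fn main() {
    // Lexing the word `pre$(date)` pushes chars and records one span.
    let mut w = WordBuilder::default();
    for c in "pre".chars() { w.push(c); }
    let start = w.mark();
    for c in "$(date)".chars() { w.push(c); }
    w.close_span(SpanKind::CommandSub, start);
    assert_eq!(w.spans, vec![WordSpan {
        kind: SpanKind::CommandSub, start: 3, end: 10, ctx: QuoteCtx::None,
    }]);
}
```

Converting the word-reading functions to take `&mut WordBuilder` instead of `&mut String` means span recording happens at the only point where expansion boundaries are unambiguously known.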
Eliminate ~400 lines of duplicate parsing code in the sexp/format layers by ensuring all Word and CondTerm nodes carry lexer spans.

- Thread token spans via `word_node_from_token()` (9 call sites updated)
- Add spans to CondTerm nodes via `cond_term_from_token()` (4 call sites)
- Rewrite `write_redirect()` to use span-based segments instead of string searches (`needs_word_processing` removed)
- Update `process_word_value()` in the format module to use spans
- Replace `decompose_word()` with `decompose_word_with_spans()` (span path) and `decompose_word_literal()` (synthetic nodes)
- Word Display for span-less synthetic nodes uses `write_escaped_word` directly instead of re-parsing through `parse_word_segments`

Deleted dead code:

- `parse_word_segments()` + `flush_literal` + `extract_ansi_c_content` + `extract_locale_content` (~195 lines in sexp/word.rs)
- `extract_paren_content()` (~107 lines in sexp/mod.rs)
- `needs_word_processing()`, `write_word_value()`, `needs_value_path()`
- `skip_single_quoted/double/backtick`, `is_backslash_escaped` (~70 lines in context.rs)
- 8 tests for the deleted `parse_word_segments` function

Oracle improved: 167/181 (was 165); removing the buggy re-parsing fixed 2 edge cases where the two parsers disagreed.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
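The span-based segmentation this commit relies on can be sketched as slicing the word value at the recorded boundaries; gaps between spans become literal segments. The `Span`/`Segment` shapes below are simplified assumptions — the real code also filters span kinds and carries quoting context.

```rust
// Minimal sketch of span-based segment extraction (no re-parsing).
#[derive(Debug, PartialEq)]
enum Segment { Literal(String), CommandSub(String) }

struct Span { start: usize, end: usize, is_command_sub: bool }

/// Slice the word value at recorded span boundaries rather than re-scanning
/// the string; text between spans becomes literal segments.
fn segments_from_spans(value: &str, spans: &[Span]) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut pos = 0;
    for s in spans {
        if s.start > pos {
            out.push(Segment::Literal(value[pos..s.start].to_string()));
        }
        let text = value[s.start..s.end].to_string();
        out.push(if s.is_command_sub {
            Segment::CommandSub(text)
        } else {
            Segment::Literal(text)
        });
        pos = s.end;
    }
    if pos < value.len() {
        out.push(Segment::Literal(value[pos..].to_string()));
    }
    out
}

fn main() {
    let segs = segments_from_spans(
        "pre$(date)post",
        &[Span { start: 3, end: 10, is_command_sub: true }],
    );
    assert_eq!(segs, vec![
        Segment::Literal("pre".into()),
        Segment::CommandSub("$(date)".into()),
        Segment::Literal("post".into()),
    ]);
}
```

Because the lexer already resolved quoting when it recorded the spans, this path cannot disagree with the lexer the way a second string scan could.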
… code

- Fix O(n²) `collect_top_level_sexp_spans`: sort by start offset, then a single-pass sweep (was using a `ptr::eq` containment check)
- Move `word_node_from_token`/`cond_term_from_token` to take an owned Token, eliminating String + Vec clones per word
- Remove stale `#[allow(dead_code)]` annotations on `Token.spans`, `Token::with_spans`, and the blanket module-level allow on word_builder
- Remove the unused `WordBuilder::push_str` method
- Fix CondTerm Display to handle all segment types via `write_redirect_segments` instead of debug-formatting unknown segments
- Fix locale string context handling for `$"..."` inside words

Oracle improved: 169/181 (was 167).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
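The sort-then-sweep idea in the first bullet can be sketched as follows, assuming properly nested (non-crossing) spans, which is what a lexer produces. Sorting by start offset (ties broken by longer span first) lets a running `max_end` decide containment in one pass, for O(n log n) overall instead of O(n²) pairwise checks. The `Span` type here is a stand-in for the real span struct.

```rust
// Hypothetical span with byte offsets; the real WordSpan also carries a
// kind and quoting context.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span { start: usize, end: usize }

/// Collect spans not contained in any other span. After sorting by start
/// (longer span first on ties), any span ending at or before the running
/// `max_end` must be nested inside an earlier-starting span.
fn collect_top_level(spans: &[Span]) -> Vec<Span> {
    let mut sorted = spans.to_vec();
    sorted.sort_by(|a, b| a.start.cmp(&b.start).then(b.end.cmp(&a.end)));
    let mut top = Vec::new();
    let mut max_end = 0usize;
    for s in sorted {
        if s.end > max_end {
            top.push(s);
            max_end = s.end;
        }
    }
    top
}

fn main() {
    let spans = [
        Span { start: 5, end: 9 },   // nested inside 0..10
        Span { start: 0, end: 10 },
        Span { start: 12, end: 20 },
    ];
    let top = collect_top_level(&spans);
    assert_eq!(top, vec![
        Span { start: 0, end: 10 },
        Span { start: 12, end: 20 },
    ]);
}
```

Comparing offsets also removes the fragility of identity-based (`ptr::eq`) checks, which break silently once spans are cloned or moved.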
Replace the non-asserting oracle test with one that tracks known failures explicitly. The test now fails on:

- Regressions (a previously passing test now fails)
- Newly passing tests (update KNOWN_ORACLE_FAILURES to track progress)

12 known failures are documented with root-cause comments. Oracle: 169/181.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
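The bookkeeping above amounts to a two-sided check: an unexpected failure is a regression, and a known failure that now passes is a stale list entry. A hedged sketch (names are assumptions, not the repo's actual test code):

```rust
// Illustrative known-failure tracking; the real list has 12 entries with
// root-cause comments.
const KNOWN_FAILURES: &[&str] = &["heredoc_eof_backslash", "ctlesc_doubling"];

/// Err on a regression (unexpected failure) or a stale entry (a known
/// failure that now passes), so the list must always match reality.
fn check_oracle(name: &str, passed: bool) -> Result<(), String> {
    let known = KNOWN_FAILURES.contains(&name);
    match (passed, known) {
        (false, false) => Err(format!("regression: {name} now fails")),
        (true, true) => Err(format!("{name} now passes; remove it from KNOWN_FAILURES")),
        _ => Ok(()),
    }
}

fn main() {
    assert!(check_oracle("simple_pipeline", true).is_ok());
    assert!(check_oracle("heredoc_eof_backslash", false).is_ok());
    assert!(check_oracle("simple_pipeline", false).is_err());      // regression
    assert!(check_oracle("heredoc_eof_backslash", true).is_err()); // stale entry
}
```

Failing on newly passing tests looks strict, but it forces the known-failures list to shrink in the same commit that fixes a case, so the oracle score in the list never drifts from the truth.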
Each oracle .tests file now has its own test function generated by the `oracle_test!` macro (e.g., `oracle_ansi_c_escapes`, `oracle_heredoc_formatting`). Each test individually asserts no regressions and no newly passing tests. This replaces the single `oracle_test_suite` function, giving specific line-level feedback on which oracle category passes or fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
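Per-file test generation is a natural fit for a declarative macro. A minimal sketch, with assumed names throughout; the real `oracle_test!` presumably also embeds the .tests file contents and the known-failure bookkeeping, and emits `#[test]` (omitted here so the sketch runs as a plain binary):

```rust
// Placeholder for: parse the .tests file, format each case, and assert
// no regressions and no untracked newly passing tests.
fn run_oracle_file(path: &str) {
    assert!(path.ends_with(".tests"));
}

/// Generate one named test function per oracle category file.
/// The real macro would add a #[test] attribute to the generated fn.
macro_rules! oracle_test {
    ($name:ident, $file:expr) => {
        fn $name() {
            run_oracle_file($file);
        }
    };
}

oracle_test!(oracle_ansi_c_escapes, "ansi_c_escapes.tests");
oracle_test!(oracle_heredoc_formatting, "heredoc_formatting.tests");

fn main() {
    oracle_ansi_c_escapes();
    oracle_heredoc_formatting();
}
```

The payoff is in the test runner output: a failure now names the category (`oracle_heredoc_formatting`) instead of one monolithic `oracle_test_suite`.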
mpecan added a commit that referenced this pull request on Mar 25, 2026
🤖 I have created a release *beep* *boop*

## [0.1.8](rable-v0.1.7...rable-v0.1.8) (2026-03-25)

### Features

* enrich AST with structured word spans and assignment detection ([9163d24](9163d24))
* enrich AST with structured word spans and assignment detection ([#11](#11)) ([3b58c38](3b58c38))

### Bug Fixes

* CTLESC byte doubling for bash-oracle compatibility (179/181) ([72bc381](72bc381))
* heredoc trailing newline at EOF with backslash (180/181) ([4af8d91](4af8d91))
* resolve 11 oracle test failures (180/181) ([#13](#13)) ([69d6bc8](69d6bc8))
* resolve 3 more oracle failures (177/181) ([8aca953](8aca953))
* resolve 6 oracle test failures ([0496222](0496222))
* resolve 6 oracle test failures (175/181) ([1708884](1708884))

### Documentation

* comprehensive documentation update ([#14](#14)) ([6abfb20](6abfb20))
* comprehensive documentation update for better DX ([61114b0](61114b0))

### Code Refactoring

* remove sexp re-parsing by threading spans through all nodes ([54db8c7](54db8c7))
* simplify span collection, move to owned tokens, remove dead code ([cec7e8e](cec7e8e))

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Summary
Moves expansion boundary tracking into the lexer and eliminates duplicate re-parsing in the sexp/format layers. The lexer now records WordSpans for all 14 bash expansion types with full quoting context, and all downstream code uses spans instead of re-scanning word value strings.

Key changes
- **WordBuilder** (`word_builder.rs`): New lexer type bundling word value + spans + quoting context stack. All 14 lexer word-reading functions converted from `&mut String` to `&mut WordBuilder`
- **WordSpan tracking**: Spans recorded for all 14 expansion types, with quoting context distinguishing e.g. `$'...'` inside `${...}` vs `"..."`
- **segments_from_spans()**: Replaces `parse_word_segments()`; converts spans to segments without re-parsing, with top-level span filtering and context-aware formatting
- **Assignment detection**: Lexer emits AssignmentWord tokens; parser populates `Command.assignments`
- **Array parsing**: `read_array_elements()` parses array content as words directly, replacing post-hoc `normalize_array_content` string manipulation
- **Span threading**: `word_node_from_token()` and `cond_term_from_token()` preserve lexer spans; `write_redirect()` uses spans instead of string searches

Dead code removed (~540 lines)
- `parse_word_segments()` + helpers
- `extract_paren_content()`
- `try_normalize_array()` + `normalize_array_content()`
- `needs_word_processing()`, `write_word_value()`, `needs_value_path()`
- `should_format_from_value()`, `parts_to_segments()`
- `skip_single/double/backtick`, `is_backslash_escaped`, `read_balanced_delim`

Numbers
Test plan
- `cargo clippy --all-targets -- -D warnings` clean

🤖 Generated with Claude Code
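The `NAME=` / `NAME+=` / `NAME[...]=` assignment-word detection described in the key changes above can be sketched as a simple prefix check. This is illustrative only: the real lexer works on its token stream and only treats a word as an assignment in the positions bash does, and this sketch does not handle nested or quoted subscripts.

```rust
/// Recognize `NAME=`, `NAME+=`, and `NAME[...]=` prefixes, where NAME is
/// [A-Za-z_][A-Za-z0-9_]*. Byte inspection only; no nesting in `[...]`.
fn is_assignment_prefix(word: &str) -> bool {
    let bytes = word.as_bytes();
    let mut i = 0;
    // NAME must start with a letter or underscore.
    if i >= bytes.len() || !(bytes[i].is_ascii_alphabetic() || bytes[i] == b'_') {
        return false;
    }
    i += 1;
    while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
        i += 1;
    }
    // Optional [...] subscript (flat scan for the closing bracket).
    if i < bytes.len() && bytes[i] == b'[' {
        match bytes[i..].iter().position(|&b| b == b']') {
            Some(off) => i += off + 1,
            None => return false,
        }
    }
    // `=` or `+=` terminates the assignment prefix.
    if i < bytes.len() && bytes[i] == b'+' {
        i += 1;
    }
    i < bytes.len() && bytes[i] == b'='
}

fn main() {
    assert!(is_assignment_prefix("FOO=bar"));
    assert!(is_assignment_prefix("COUNT+=1"));
    assert!(is_assignment_prefix("arr[2]=x"));
    assert!(!is_assignment_prefix("3BAD=1"));
    assert!(!is_assignment_prefix("echo"));
}
```

Emitting a distinct AssignmentWord token at lex time lets the parser fill `Command.assignments` directly instead of re-inspecting word strings later.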