Skip to content

Conversation

h-a-n-a
Copy link
Contributor

@h-a-n-a h-a-n-a commented Oct 11, 2025

Description:

This PR reapplied #10987 and replaced lone_surrogates: bool mark with Wtf8Atom. lone_surrogates introduced in #10987 is more of an implicit mark, whereas the use of Wtf8Atom makes it more explicit where users need to explicitly call to_string_lossy to get the lossy UTF-8 result, making a huge difference from the original implementation where users need to convert \uFFFDxxx into the correct unicode.

BREAKING CHANGE:

Both cooked in TemplateElement and value in StringLiteral are replaced with Wtf8Atom introduced in #11104.

Wtf8Atom does not expose a to_string method like the old Atom does. Internally, it stores the code points of the characters. You can call code_points() to get an iterator of the code points or call to_string_lossy() to get an lossy string in which all unpaired surrogates are replaced with U+FFFD(Replacement Character).

Related issue (if exists):

@changeset-bot
Copy link

changeset-bot bot commented Oct 11, 2025

⚠️ No Changeset found

Latest commit: 54cfd1c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@socket-security
Copy link

socket-security bot commented Oct 11, 2025

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action Severity Alert  (click "▶" to expand/collapse)
Warn Critical
[email protected] has a Critical CVE.

CVE: GHSA-fjxv-7rqg-78g4 form-data uses unsafe random function in form-data for choosing boundary (CRITICAL)

Affected versions: < 2.5.4; >= 3.0.0 < 3.0.4; >= 4.0.0 < 4.0.4

Patched version: 2.5.4

From: ?npm/[email protected]

ℹ Read more on: This package | This alert | What is a critical CVE?

Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at [email protected].

Suggestion: Remove or replace dependencies that include known critical CVEs. Consumers can use dependency overrides or npm audit fix --force to remove vulnerable dependencies.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore npm/[email protected]. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

Warn High
[email protected] has Obfuscated code.

Confidence: 0.96

Location: Package overview

From: ?npm/[email protected]npm/[email protected]

ℹ Read more on: This package | This alert | What is obfuscated code?

Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at [email protected].

Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore npm/[email protected]. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

View full report

@h-a-n-a h-a-n-a force-pushed the surrogates branch 6 times, most recently from f3c6816 to 6d05f37 Compare October 15, 2025 09:36
@codspeed-hq
Copy link

codspeed-hq bot commented Oct 15, 2025

CodSpeed Performance Report

Merging #11144 will create unknown performance changes

Comparing h-a-n-a:surrogates (54cfd1c) with main (bd55d30)1

Summary

⚠️ No benchmarks were detected in both the base of the PR and the PR.
Please ensure that your benchmarks are correctly instrumented with CodSpeed.

Check out the benchmarks creation guide

⏩ 141 skipped2

Footnotes

  1. No successful run was found on main (4848301) during the generation of this report, so bd55d30 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

  2. 141 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@h-a-n-a h-a-n-a force-pushed the surrogates branch 7 times, most recently from aede87f to 3362795 Compare October 20, 2025 08:16
@socket-security
Copy link

socket-security bot commented Oct 20, 2025

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedcopyfiles@​2.4.110010010078100
Addedwebpack-cli@​3.3.129710010083100
Addedwebpack-dev-server@​3.11.3968810084100
Addedwebpack@​4.47.09710010095100

View full report

@h-a-n-a h-a-n-a force-pushed the surrogates branch 2 times, most recently from 9df7324 to e9740e5 Compare October 20, 2025 09:05
@h-a-n-a h-a-n-a marked this pull request as ready for review October 20, 2025 09:38
@h-a-n-a h-a-n-a requested review from a team as code owners October 20, 2025 09:38
@Copilot Copilot AI review requested due to automatic review settings October 20, 2025 09:38
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves handling of unpaired Unicode surrogates in JavaScript/TypeScript by replacing the implicit lone_surrogates: bool marker with an explicit Wtf8Atom type. This breaking change requires users to call to_string_lossy() or as_str() to convert WTF-8 atoms to valid UTF-8, making surrogate handling more explicit and preventing silent data loss.

Key Changes

  • Replaced Atom with Wtf8Atom for string literal values and template element cooked values
  • Added WTF-8 handling utilities and conversion functions throughout the codebase
  • Updated all consumers to handle potential surrogate pair scenarios explicitly

Reviewed Changes

Copilot reviewed 152 out of 205 changed files in this pull request and generated no comments.

Show a summary per file
File Description
Multiple package.json files Fixed missing newline before closing brace
packages/helpers/package.json Added export entry for new TypeScript helper
crates/swc_typescript/src/fast_dts/ Updated to handle Wtf8Atom with fallback conversion
crates/swc_ecma_utils/src/unicode.rs Added Unicode surrogate pair utilities
crates/swc_ecma_utils/src/lib.rs Added WTF-8 string handling functions
crates/swc_ecma_transforms_*/ Updated string operations to use WTF-8 APIs
crates/swc_ecma_minifier/ Updated string optimization to handle surrogates
crates/swc_ecma_parser/ Core lexer changes to produce Wtf8Atom tokens
Test files Updated expected outputs and added surrogate tests

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

**Description:**

This PR fixed an issue related to lone surrogates handling in Rust.

This fix's credits all go to Oxc team
swc-project#10978 (comment).
What I'm doing is porting the fix that was made in Oxc and make it
working under SWC.

The problem is related to the fundamental difference between how Rust
and JavaScript handle Unicode, especially lone surrogates.

**JavaScript's Unicode Model**

```javascript
// JavaScript allows this - lone surrogates are stored in UTF-16
let str = "\uD800";  // High surrogate alone - technically invalid Unicode
let obj = { "\uD800": "value" };  // Works fine in JS
```
JavaScript uses UTF-16 internally and tolerates invalid Unicode
sequences:
- Strings are UTF-16 code unit sequences, not Unicode scalar sequences
- Lone surrogates (U+D800-U+DFFF) are allowed and preserved
- No validation that surrogates come in proper high/low pairs
- Engine just stores the raw UTF-16 code units

**Rust's Unicode Model**

```rust
// This CANNOT exist in Rust:
let s = "\u{D800}";  // ❌ COMPILE ERROR - not a valid Unicode scalar
let c: char = '\u{D800}';  // ❌ COMPILE ERROR - char excludes surrogates
```
Rust enforces strict Unicode validity:
- String is UTF-8 and must contain valid Unicode scalar values
- char represents Unicode scalar values (U+0000-U+D7FF, U+E000-U+10FFFF)
- Surrogate code points (U+D800-U+DFFF) are explicitly excluded
- No way to represent lone surrogates in Rust's standard string types

1. AST Structure: Added `lone_surrogates: bool` field to `Str` and
`TplElement` structs to track when strings contain lone surrogates
2. Encoding Strategy: Lone surrogates are encoded using \u{FFFD}
(replacement character) followed by the original hex digits for internal
representation
3. Code Generation: Modified string output to properly escape lone
surrogates back to \uXXXX format during codegen
4. Test: Also fixed some cases related to member expression
optimizations and string concatenation optimizations

1. Add support for serializing and deserializing literals with lone
surrogates in `swc_estree_compat`
2. Reflect AST changes in `binding` crates

Breaks the AST by adding `lone_surrogates` field to `Str` and
`TplElement` and breaks the `value` and `cooked` respectly in `Str` and
`TplElement`. Both of the field is using `\u{FFFD}` (Replacement
Character) as an escape if `lone_surrogates` set to `true`.

To consume the real value, you need to first check if `lone_surrogates`
is `true`, then unescape it by removing the char and construct it with
the four trailing hexs(from `\u{FFFD}D800` to `\uD800`).

**Related issue:**

 - Closes swc-project#10978
 - Closes swc-project#10353

Fixed a regression of swc-project#7678
@Copilot Copilot AI review requested due to automatic review settings October 22, 2025 06:46
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 154 out of 207 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (5)

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@h-a-n-a
Copy link
Contributor Author

h-a-n-a commented Oct 22, 2025

CI is failed here https://github.com/swc-project/swc/actions/runs/18711211814/job/53359904217?pr=11144 due to different layout of ArchivedString and ArchivedVec<u8>

@kdy1
Copy link
Member

kdy1 commented Oct 22, 2025

@h-a-n-a If so, I think you can reflect the difference in the AST, too. I mean, CI failure seems to be related to mix of Vec and String, but I don't think we need to mix it, because this is a breaking change regardless

@h-a-n-a
Copy link
Contributor Author

h-a-n-a commented Oct 22, 2025

@h-a-n-a If so, I think you can reflect the difference in the AST, too. I mean, CI failure seems to be related to mix of Vec and String, but I don't think we need to mix it, because this is a breaking change regardless

@kdy1 I thought the breaking change would be only on the API side and not the ABI side, but for now I think it's a breaking change for all scenarios. You're right. We don't need to mix these things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants