Skip to content

fix: preserve JSONL rows with unicode separators#62

Merged
steipete merged 2 commits into
steipete:mainfrom
uwe-schwarz:fix/jsonl-line-separators
Jun 19, 2026
Merged

fix: preserve JSONL rows with unicode separators#62
steipete merged 2 commits into
steipete:mainfrom
uwe-schwarz:fix/jsonl-line-separators

Conversation

@uwe-schwarz

@uwe-schwarz uwe-schwarz commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

  • escape literal U+2028 / U+2029 when writing JSONL backup values
  • read backup JSONL using physical \n records instead of Node readline logical lines
  • cover round-trip export/import with Unicode line and paragraph separators in collection raw JSON

Fixes #61.

Checks

  • pnpm test src/lib/backup.test.ts
  • pnpm run format:check
  • pnpm run lint
  • pnpm run typecheck
  • pnpm test

Proof

❯ cd ~/dev/forks/birdclaw
git switch fix/jsonl-line-separators
git pull --ff-only
echo "branch: $(git branch --show-current)"
echo "commit: $(git rev-parse --short HEAD)"
pnpm exec tsx src/cli.ts backup sync --repo ~/Documents/Backup/X

Already on 'fix/jsonl-line-separators'
Your branch is up to date with 'origin/fix/jsonl-line-separators'.
Already up to date.
branch: fix/jsonl-line-separators
commit: b30b218
{
  "ok": true,
  "repoPath": "/Users/uwe/Documents/Backup/X",
  "pulled": true,
  "imported": true,
  "importResult": {
    "ok": true,
    "repoPath": "/Users/uwe/Documents/Backup/X",
    "mode": "merge",
    "manifest": {
      "app": "birdclaw",
      "backupHash": "22905214e9205837da8afb1ff0fef7bc737f6d69d7eea020c2c74a263b217a1e",
      "counts": {
        "accounts": 1,
        "collections_bookmarks": 145,
        "collections_likes": 4510,
        "dm_conversations": 262,
        "dm_messages": 458,
        "follow_edges": 1290,
        "follow_events": 1290,
        "follow_snapshot_members": 1290,
        "follow_snapshots": 2,
        "link_occurrences": 3,
        "profile_bio_entities": 1713,
        "profile_snapshots": 1417,
        "profiles": 1999,
        "timeline_edges_authored": 1860,
        "timeline_edges_home": 1723,
        "timeline_edges_mention": 300,
        "timeline_edges_unknown": 397,
        "tweets": 7139,
        "url_expansions": 3
      },
      "files": [
 …
      ],
      "generatedAt": "2026-06-18T14:26:09.124Z",
      "schemaVersion": 1
    },
    "validation": {
      "ok": true,
      "repoPath": "/Users/uwe/Documents/Backup/X",
      "files": [
…
      ],
      "counts": {
…
      },
      "backupHash": "22905214e9205837da8afb1ff0fef7bc737f6d69d7eea020c2c74a263b217a1e",
      "errors": []
    },
    "fingerprint": {
      "counts": {
…
      },
      "hash": "dabe4fb775078768c7c411eca07a0987110df1c2626ca0bd3d8f47f8cb77f29f"
    }
  },
  "exportResult": {
    "ok": true,
    "repoPath": "/Users/uwe/Documents/Backup/X",
    "manifest": {
      "app": "birdclaw",
      "schemaVersion": 2,
      "generatedAt": "2026-06-18T14:50:10.341Z",
      "counts": {
…
      },
      "files": [
…
      ],
      "backupHash": "13f216dcba398a73d9e1cb30563c8206bfba40cccc736d850b5859505fdcd658"
    },
    "validation": {
      "ok": true,
      "repoPath": "/Users/uwe/Documents/Backup/X",
      "files": [
…
      ],
      "counts": {
…
      },
      "backupHash": "13f216dcba398a73d9e1cb30563c8206bfba40cccc736d850b5859505fdcd658",
      "errors": []
    },
    "git": {
      "committed": true,
      "pushed": true,
      "commit": "2032747cc45cd42852a116630a6294e0534f0373"
    }
  }
}

I removed files[] and counts[] to make it a little bit shorter and more readable.

@clawsweeper

clawsweeper Bot commented Jun 18, 2026

Copy link
Copy Markdown

Codex review: needs maintainer review before merge. Reviewed June 18, 2026, 10:59 AM ET / 14:59 UTC.

Summary
The PR updates backup JSONL export/import to escape U+2028/U+2029, split import records on physical LF records, and adds round-trip backup coverage.

Reproducibility: yes. at source level: current main uses node:readline for backup JSONL and a Node probe showed createInterface splitting one physical JSON row at U+2028. I did not run a full backup import because this review is read-only.

Review metrics: none identified.

Root-cause cluster
Relationship: fixed_by_candidate
Canonical: #61
Summary: This PR is the candidate fix for the linked backup JSONL Unicode separator bug.

Members:

Proposal only: this assessment does not dispatch repair, suppress jobs, mutate sibling items, close, or merge anything.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • none.

Risk before merge

  • [P1] I did not run the full Birdclaw test suite in this read-only review; validation relies on source inspection plus the PR-provided checks and terminal proof.

Maintainer options:

  1. Decide the mitigation before merge
    Land this narrow reader/writer fix after normal maintainer review and checks; keep the linked issue open until the PR merges.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • [P2] No repair lane is needed; this PR is already the focused candidate fix and has no concrete review finding for automation to address.

Security
Cleared: The diff only changes local backup JSONL parsing/serialization and tests; it does not touch dependencies, workflows, secrets, permissions, or downloaded code.

Review details

Best possible solution:

Land this narrow reader/writer fix after normal maintainer review and checks; keep the linked issue open until the PR merges.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: current main uses node:readline for backup JSONL and a Node probe showed createInterface splitting one physical JSON row at U+2028. I did not run a full backup import because this review is read-only.

Is this the best way to solve the issue?

Yes, the PR targets the shared JSONL reader/writer, preserves parsed values, and adds focused round-trip coverage without adding config or a migration.

AGENTS.md: not found in the target repository.

Codex review notes: model internal, reasoning high; reviewed against 7498414d9e17.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes redacted terminal output from the PR branch running backup sync with successful import, export, and validation results on the reporter's backup repository.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🦞 diamond lobster.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes redacted terminal output from the PR branch running backup sync with successful import, export, and validation results on the reporter's backup repository.
  • remove rating: 🧂 unranked krab: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
  • remove status: 📣 needs proof: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

  • P2: The PR addresses a real backup import/sync failure with a limited code path and no evidence of security impact, data loss, or a broad outage.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes redacted terminal output from the PR branch running backup sync with successful import, export, and validation results on the reporter's backup repository.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes redacted terminal output from the PR branch running backup sync with successful import, export, and validation results on the reporter's backup repository.
Evidence reviewed

What I checked:

  • Target policy check: No AGENTS.md exists under the Birdclaw repository root, so no target-specific AGENTS.md policy was applied.
  • Current main writer can emit literal separators: Current main serializes backup JSONL values with plain JSON.stringify before joining records with LF. (src/lib/backup.ts:219, 7498414d9e17)
  • Current main reader uses node readline: Current main feeds backup JSONL through createInterface, which splits U+2028/U+2029 in a string into multiple logical lines on the runtime probe. (src/lib/streaming-ingestion.ts:25, 7498414d9e17)
  • Runtime primitive reproduces the failure mode: A Node createInterface probe over one physical JSONL row containing U+2028 split the row into fragments, matching the linked parse failure path.
  • PR writer fix: The PR escapes literal U+2028/U+2029 in serialized JSONL values before writing backup rows. (src/lib/backup.ts:221, b30b21808619)
  • PR reader fix: The PR replaces readline with a physical LF splitter that preserves U+2028/U+2029 inside JSON strings while still trimming CRLF line endings. (src/lib/streaming-ingestion.ts:24, b30b21808619)

Likely related people:

  • steipete: Git blame attributes the current JSONL writer and readline reader to the v0.8.3 release commit by Peter Steinberger, and recent backup refactor history also touched the same backup/test area. (role: introduced behavior and recent area contributor; confidence: high; commits: a030afb8ed96, 0ff0317c9afe; files: src/lib/backup.ts, src/lib/streaming-ingestion.ts, src/lib/backup.test.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 18, 2026
@steipete steipete merged commit 7622e77 into steipete:main Jun 19, 2026
2 checks passed
@steipete

Copy link
Copy Markdown
Owner

Landed in 7622e77.

Tested:

  • pnpm test src/lib/backup.test.ts src/lib/streaming-ingestion.test.ts — 22 passed.
  • Isolated CLI export/replace-import E2E — seeded literal U+2028/U+2029, confirmed export escaped both separators on disk, validated the backup, and confirmed import restored both original code points.
  • Isolated legacy CLI import E2E — rewrote the generated fixture with literal U+2028/U+2029 and confirmed replace-import restored both code points.
  • pnpm check — formatting, lint, and types passed.
  • pnpm coverage — 141 files and 1,171 tests passed; 90.84% lines / 80.41% branches.
  • pnpm build and pnpm pack:smoke — production artifacts built; package smoke passed with 98 files.
  • BIRDCLAW_PLAYWRIGHT_PORT=3017 pnpm e2e — 12 passed.
  • Autoreview on local additions and the complete branch — no accepted/actionable findings.
  • PR CI — green, including checks, coverage, build, and Playwright.

Caveat: the legacy E2E intentionally used --no-validate after mutating a generated shard because its manifest checksum was then stale; this directly exercised the historical import parser path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P2 Normal priority bug or improvement with limited blast radius. proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

backup import fails on JSONL rows containing Unicode line separators

2 participants