Claude Code 400 "No Low Surrogate": Repairing a Permanently Broken Session Transcript
You are twenty turns deep into a complex agentic run. Claude Code has been processing CI logs and worker reports for the better part of an hour, accumulating tool results, reasoning chains, and intermediate decisions. Then every subsequent request returns:
400: not valid JSON: no low surrogate in string
Not just the next turn. Every turn. At the same byte offset. Restarting the process does nothing — the error comes back immediately, pointing at the same location. If you take the path most developers reach for first — abandoning the session and starting fresh — you lose everything the model built up during that run.
This is not a transient API hiccup. It is structural damage baked into your session transcript, and understanding why requires examining an architectural decision in Claude Code that most users have never had reason to think about.
How Claude Code Actually Stores Your Sessions
Most developers assume chat APIs work the way their documentation describes: you send a list of messages, the model responds, and each request is self-contained. Claude Code works differently under the hood.
Every session is persisted as a JSONL file (one JSON object per line) stored under a projects directory on disk. The critical behavior is the replay semantics: on every single turn, Claude Code reads the entire JSONL file and re-transmits the complete conversation history to the API. The API itself is stateless; the session state lives on disk. Every tool call output, every model response, every user message accumulated in the session is serialized to that file and replayed from line one on the next request.
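To make the replay behavior concrete, here is a toy model of an append-and-replay transcript. It is an illustrative sketch only: `append_turn`, `build_request`, and the `role`/`content` record shape are hypothetical and do not reflect Claude Code's actual implementation or on-disk schema.

```python
import json
import tempfile
from pathlib import Path


def append_turn(transcript: Path, role: str, content: str) -> None:
    """Append one message to the log as a single JSON line (hypothetical schema)."""
    with transcript.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"role": role, "content": content}) + "\n")


def build_request(transcript: Path) -> list:
    """Rebuild the full message history by replaying the log from line one."""
    with transcript.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "session.jsonl"
    append_turn(log, "user", "run the tests")
    append_turn(log, "assistant", "3 passed")
    append_turn(log, "user", "now lint")
    history = build_request(log)
    # Every request carries the whole history, not just the newest message.
    print(len(history))
```

In this model, anything written on an early line is re-read and re-sent on every later turn, which is the property the rest of this article depends on.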
This design has genuine advantages. It produces a complete, human-inspectable audit trail. It enables claude --resume to pick up any session from any point, surviving process crashes cleanly. The JSONL transcript functions as a write-once append log — a pattern familiar from event sourcing and database journals.
That last property is precisely what bites you here.
The UTF-16 Surrogate Problem, Exactly
Code points above U+FFFF (emoji, mathematical alphanumeric symbols, and many historic scripts, all outside the Basic Multilingual Plane) require two 16-bit code units when encoded in UTF-16. These are called a surrogate pair: a high surrogate (U+D800–U+DBFF) followed immediately by a low surrogate (U+DC00–U+DFFF). This encoding scheme is a historical artifact of UTF-16's 16-bit origins and has no relevance to UTF-8, where every Unicode scalar value gets its own unambiguous byte sequence.
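The pair arithmetic is small enough to work through by hand. The sketch below splits one code point into its two surrogates and cross-checks the result against Python's own UTF-16 encoder; `to_surrogate_pair` is a name introduced here for illustration.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the 20-bit offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the 20-bit offset
    return high, low


high, low = to_surrogate_pair(0x1F9ED)  # U+1F9ED COMPASS
print(f"{high:04X} {low:04X}")          # D83E DDED

# Cross-check against Python's built-in UTF-16 encoder.
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert "\U0001F9ED".encode("utf-16-be") == expected
```

Either half on its own (D83E or DDED here) encodes nothing; only the pair together identifies a character.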
The problem arises at truncation boundaries. When a large tool output — a CI log full of emoji status markers, a worker report with Unicode symbols, any sufficiently dense text — gets cut mid-character, and the truncated character spans a UTF-16 surrogate pair, the high surrogate can be written to disk without its corresponding low surrogate. In Python's internal string representation, you now have a string containing a code point in the range 0xD800–0xDFFF. These are not valid Unicode scalar values — they are encoding machinery that is only meaningful when paired.
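The truncation hazard can be reproduced in a few lines. This is a simulation under an assumption: that the cut is measured in UTF-16 code units, as it would be in a runtime whose strings are UTF-16. The helper names are hypothetical, and this is not a claim about where Claude Code truncates.

```python
def truncate_utf16_units(text: str, max_units: int) -> str:
    """Cut text after max_units UTF-16 code units."""
    raw = text.encode("utf-16-le")[: max_units * 2]
    # surrogatepass keeps an orphaned surrogate instead of raising on decode.
    return raw.decode("utf-16-le", errors="surrogatepass")


def has_lone_surrogate(text: str) -> bool:
    """True if the string holds any code point in U+D800-U+DFFF."""
    return any(0xD800 <= ord(c) <= 0xDFFF for c in text)


log_line = "ok \U0001F9ED"  # three BMP characters, then one emoji (two code units)
clean_cut = truncate_utf16_units(log_line, 3)  # cut lands before the emoji
bad_cut = truncate_utf16_units(log_line, 4)    # cut lands inside the pair
print(has_lone_surrogate(clean_cut), has_lone_surrogate(bad_cut))
```

A limit one unit shorter or longer would have been harmless; only the cut that falls between the two halves leaves an orphan behind.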
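Python's standard json.dumps serializes such lone surrogates willingly, writing something like \ud83d into the JSON string. RFC 8259 (which obsoletes RFC 7159) acknowledges that the JSON grammar technically permits an unpaired surrogate escape, but warns that the behavior of software receiving one is unpredictable. The Anthropic API's JSON parser takes the strict position and rejects a high surrogate that is not followed by a low surrogate. Hence: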
400: not valid JSON: no low surrogate in string
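The gap between a lenient writer and a strict reader is easy to observe locally. The sketch below uses only the standard library; strict UTF-8 encoding stands in for a strict JSON parser here, which is an analogy and not the API's actual validation path.

```python
import json

lone_high = "\ud83d"  # a high surrogate with no low surrogate after it
payload = json.dumps({"output": "log " + lone_high})
print(payload)  # the escape \ud83d is written out verbatim

# Python's own parser is lenient and round-trips the lone surrogate.
assert json.loads(payload)["output"].endswith(lone_high)

# Strict UTF-8 encoding refuses the same string.
try:
    lone_high.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

Because the lenient side never complains, nothing fails at write time. The first error appears only when a stricter consumer reads the data back.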
Now recall the replay architecture. The corrupt character is embedded in, say, line 47 of your JSONL transcript. Every subsequent request re-transmits the entire transcript, including line 47. The API rejects it at the same byte offset every time. The session is permanently broken until the transcript file itself is repaired.
The asymmetry here is what makes this failure mode so disorienting: one corrupt write creates unbounded future failure. It does not fail once and recover. It fails on turn 48, turn 49, turn 50, and every turn afterward, for the rest of the session's lifetime.
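Before changing anything, it can help to confirm which lines are affected. The following read-only sketch reports the 1-based line numbers of JSONL records that contain a lone surrogate. `contains_surrogate` and `find_surrogate_lines` are hypothetical helper names, and the function assumes one JSON object per line as described above.

```python
import json
from pathlib import Path


def contains_surrogate(obj) -> bool:
    """True if any string nested inside obj holds a code point in U+D800-U+DFFF."""
    if isinstance(obj, str):
        return any(0xD800 <= ord(c) <= 0xDFFF for c in obj)
    if isinstance(obj, list):
        return any(contains_surrogate(item) for item in obj)
    if isinstance(obj, dict):
        return any(contains_surrogate(k) or contains_surrogate(v)
                   for k, v in obj.items())
    return False


def find_surrogate_lines(path: Path) -> list:
    """Return 1-based numbers of JSONL lines that contain a lone surrogate."""
    text = path.read_text(encoding="utf-8", errors="surrogatepass")
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue
        try:
            if contains_surrogate(json.loads(line)):
                hits.append(number)
        except json.JSONDecodeError:
            pass  # structurally broken lines are a separate problem
    return hits
```

Because Python's JSON parser merges correctly paired escapes into a single code point, only orphaned surrogates are left in the range being tested. The file is never written, so this is safe to run against a live session.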
The Surgical Repair
You do not need to abandon the session. You need to find the offending JSONL line and strip exactly the characters whose Unicode ordinal falls in 0xD800–0xDFFF from every string value it contains. Here is the complete repair script:
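import json
import sys
from pathlib import Path


def strip_surrogates(obj):
    """Recursively remove surrogate code points (U+D800-U+DFFF) from every string."""
    if isinstance(obj, str):
        return ''.join(c for c in obj if not (0xD800 <= ord(c) <= 0xDFFF))
    if isinstance(obj, list):
        return [strip_surrogates(item) for item in obj]
    if isinstance(obj, dict):
        # Keys are strings too, so clean them along with the values.
        return {strip_surrogates(k): strip_surrogates(v) for k, v in obj.items()}
    return obj


def repair_transcript(path: Path):
    # surrogatepass decodes raw surrogate bytes instead of raising UnicodeDecodeError.
    text = path.read_text(encoding='utf-8', errors='surrogatepass')
    # Split on '\n' only: str.splitlines() also breaks on U+2028 and similar
    # characters, which may legally appear unescaped inside a JSON string.
    lines = text.split('\n')
    repaired = []
    changed = 0
    for i, line in enumerate(lines):
        if not line.strip():
            repaired.append(line)
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as e:
            print(f"  Warning: line {i + 1} is not valid JSON: {e}")
            repaired.append(line)
            continue
        cleaned = strip_surrogates(obj)
        if cleaned != obj:
            # Re-serialize only the lines that actually changed.
            repaired.append(json.dumps(cleaned, ensure_ascii=False))
            changed += 1
            print(f"  Repaired line {i + 1}")
        else:
            repaired.append(line)
    # surrogatepass on write leaves unparseable lines exactly as they were read.
    path.write_text('\n'.join(repaired), encoding='utf-8', errors='surrogatepass')
    print(f"Done. {changed} line(s) repaired.")


if __name__ == '__main__':
    repair_transcript(Path(sys.argv[1]))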
Run it against the offending transcript:
python repair_transcript.py ~/.claude/projects/<project-hash>/<session-id>.jsonl
claude --resume
Several design decisions in this script are deliberate.
errors='surrogatepass' on read. Python's default UTF-8 decoder raises UnicodeDecodeError when it encounters byte sequences ED A0 80 through ED BF BF — the technically-illegal UTF-8 encoding of surrogate code points. The surrogatepass handler lets Python read these bytes and represent them as surrogate code points, making them available for stripping.
Recursive walk. A JSONL transcript object can be arbitrarily nested — tool results frequently contain deeply nested JSON structures. A shallow pass over top-level fields would miss surrogates buried in nested arrays or objects.
ensure_ascii=False on output. This preserves legitimate non-ASCII characters in their native form rather than escaping everything to \uXXXX sequences, which keeps repaired lines compact and readable.
Why emoji are safe. The 0xD800–0xDFFF range is entirely reserved for UTF-16 encoding machinery and contains no characters that should appear in valid UTF-8 content. Emoji above U+FFFF — like 🧭 (U+1F9ED, COMPASS) — are single Python code points with ordinals well above 0xDFFF, encoded in UTF-8 as four bytes. They are completely untouched by this operation.
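Both the surrogatepass behavior and the emoji-safety claim can be checked directly. This sketch builds a byte string holding a real emoji followed by the illegal UTF-8 encoding of a lone U+D83D; `strip_surrogates_from_text` is a hypothetical single-string version of the filter used in the repair script.

```python
def strip_surrogates_from_text(text: str) -> str:
    """Drop code points in U+D800-U+DFFF and keep everything else."""
    return "".join(c for c in text if not (0xD800 <= ord(c) <= 0xDFFF))


# A valid four-byte emoji, then ED A0 BD: the encoded form of lone U+D83D.
raw = "ok \U0001F9ED ".encode("utf-8") + b"\xed\xa0\xbd"

try:
    raw.decode("utf-8")
    strict_decodes = True
except UnicodeDecodeError:
    strict_decodes = False

lenient = raw.decode("utf-8", errors="surrogatepass")
cleaned = strip_surrogates_from_text(lenient)
print(strict_decodes, cleaned == "ok \U0001F9ED ")
```

The strict decode fails, the lenient decode exposes the surrogate as a code point, and the filter removes it while leaving the compass emoji intact.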
Scanning at Scale: The 3x Pre-Filter
If you are running Claude Code in an automation context, you may want to scan your entire transcript directory on session start and repair files proactively. Active pipelines can push transcript directories to 174+ files. A naive Python scan that loads and fully parses each JSONL takes around 3.4 seconds at that scale. A C-level byte pre-filter cuts that to 1.1 seconds — a 3x improvement — by checking for tell-tale byte patterns before invoking the JSON parser at all.
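Lone surrogates encoded as UTF-8 via surrogatepass produce three-byte sequences whose first byte is 0xED and whose second byte falls in 0xA0–0xBF. The JSON escape form begins with the literal bytes \ud (5C 75 64), or \uD (5C 75 44) when written in uppercase. The check is deliberately conservative: it also flags correctly paired surrogate escapes and escapes in the \ud000–\ud7ff range, and the full parse then clears those files. A fast pre-filter: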
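def has_surrogate_bytes(raw: bytes) -> bool:
    # JSON escape form: a backslash-u escape whose first hex digit is d or D.
    if b'\\ud' in raw or b'\\uD' in raw:
        return True
    # Raw form: 0xED followed by a byte in 0xA0-0xBF.
    idx = 0
    while (idx := raw.find(b'\xed', idx)) != -1:
        if idx + 1 < len(raw) and 0xA0 <= raw[idx + 1] <= 0xBF:
            return True
        idx += 1
    return False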
Files for which has_surrogate_bytes returns False are clean and skip the full parse. Only flagged files get the parse-and-repair treatment. At 174 files, this byte-level pass achieves the 3x wall-clock reduction because the vast majority of transcripts, at any given moment, are clean: you pay the full parse cost only for the files that might need it.
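Wiring the pre-filter into a directory scan might look like the sketch below. `files_needing_repair` is a hypothetical name, the `*.jsonl` pattern is an assumption about how transcripts are named, and the pre-filter is repeated so the block runs on its own.

```python
from pathlib import Path


def has_surrogate_bytes(raw: bytes) -> bool:
    """Conservative byte-level check; may flag files that turn out to be clean."""
    if b"\\ud" in raw or b"\\uD" in raw:
        return True
    idx = 0
    while (idx := raw.find(b"\xed", idx)) != -1:
        if idx + 1 < len(raw) and 0xA0 <= raw[idx + 1] <= 0xBF:
            return True
        idx += 1
    return False


def files_needing_repair(directory: Path) -> list:
    """Return transcripts whose raw bytes suggest a surrogate may be present."""
    flagged = []
    for path in sorted(directory.rglob("*.jsonl")):
        if has_surrogate_bytes(path.read_bytes()):
            flagged.append(path)
    return flagged
```

Only the returned paths need to go through the full parse-and-repair pass; every other file is read once as bytes and never parsed.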
The Architectural Lesson: Transcripts Are Write-Once Logs
The Unicode edge case is not the actual story. Lone surrogates leaking into JSON is a known hazard wherever UTF-16 is round-tripped through a UTF-8 pipeline — it has burned developers with every major JSON library in every language, and the mitigation is well-documented.
The story is what Claude Code's replay architecture implies about data quality discipline.
Most developers treat tool outputs as transient: you send them, you get a response, the data recedes. Claude Code's JSONL transcript is not a scratch buffer. It is a write-once append log where every entry accumulates cost on every API call for the session's lifetime. A bad write on turn 47 does not affect turn 47. It affects every turn from 48 onward, permanently, with compounding cost as the session grows longer.
The fixed byte offset in the 400 error is genuinely useful diagnostic information — the API is telling you exactly where in the transcript the corruption lives. But the more important signal is what that fixed offset reveals architecturally: the system has no error correction, no checkpointing, no ability to skip a bad line and continue. Session state is linear and unforgiving.
This reframes how the input validation problem should be understood. Every tool call output entering a Claude Code session is effectively a database insert. You have one opportunity to write clean data. A dirty write does not fail in isolation — it corrupts every subsequent read for the duration of the session. The appropriate discipline is the same discipline you apply to append-only event logs or schema migrations: validate at the point of write, before the data reaches the transcript, not downstream at repair time.
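For a log you control, validation at the point of write can be as simple as refusing any record that does not survive strict UTF-8 encoding. The sketch below is one way to do that; `append_validated` and `DirtyWriteError` are hypothetical names, and this guards your own append-only log, not Claude Code's internal writer.

```python
import json
from pathlib import Path


class DirtyWriteError(ValueError):
    """Raised when a record cannot be encoded as strict UTF-8."""


def append_validated(log: Path, record: dict) -> None:
    """Append record as one JSON line, refusing anything that is not strict UTF-8."""
    line = json.dumps(record, ensure_ascii=False)
    try:
        encoded = line.encode("utf-8")  # strict mode raises on lone surrogates
    except UnicodeEncodeError as exc:
        raise DirtyWriteError(f"refusing to append: {exc.reason}") from exc
    with log.open("ab") as fh:
        fh.write(encoded + b"\n")
```

A rejected record raises before any bytes reach the file, so the failure surfaces once, at the write that caused it, and does not recur on every later read.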
What to Do, Depending on Your Context
The right response depends on whether you are running Claude Code interactively or in an automated pipeline.
For automated pipelines: do not use repair scripts as your primary strategy. Apply UTF-8 sanitization at the tool output boundary, before content reaches the JSONL transcript. Replace lone surrogates with U+FFFD at serialization time:
def sanitize_for_transcript(text: str) -> str:
    # Replace each lone surrogate with U+FFFD (REPLACEMENT CHARACTER).
    return ''.join(
        '\ufffd' if 0xD800 <= ord(c) <= 0xDFFF else c
        for c in text
    )
This is strictly preferable to post-hoc repair for two reasons. First, prevention eliminates the failure mode rather than recovering from it. Second, the repair introduces a subtle correctness problem: stripping a surrogate silently corrupts the model's memory of that tool output. If the orphaned surrogate was embedded in a filename, a path, or an identifier that the model later reasons about, it now holds subtly wrong context with no indication that anything changed. Silent context corruption is significantly harder to debug than a visible serialization failure.
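One way to keep a substitution from being silent is to count replacements as they happen, so the caller can log or flag the affected tool output. The sketch below extends the same U+FFFD replacement to nested data; `sanitize_with_count` is a hypothetical helper, not part of any library.

```python
def sanitize_with_count(value):
    """Replace lone surrogates with U+FFFD in nested data.

    Returns a (cleaned_value, replacements_made) tuple.
    """
    if isinstance(value, str):
        chars = []
        replaced = 0
        for c in value:
            if 0xD800 <= ord(c) <= 0xDFFF:
                chars.append("\ufffd")
                replaced += 1
            else:
                chars.append(c)
        return "".join(chars), replaced
    if isinstance(value, list):
        total = 0
        items = []
        for item in value:
            cleaned, n = sanitize_with_count(item)
            items.append(cleaned)
            total += n
        return items, total
    if isinstance(value, dict):
        total = 0
        result = {}
        for key, item in value.items():
            clean_key, n_key = sanitize_with_count(key)
            clean_item, n_item = sanitize_with_count(item)
            result[clean_key] = clean_item
            total += n_key + n_item
        return result, total
    return value, 0  # numbers, booleans, and None pass through unchanged
```

A non-zero count is a cue to record which tool output was altered, so a later discrepancy in a path or identifier can be traced back to the substitution.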
For long-running agentic pipelines ingesting CI logs, ingestion reports, or worker telemetry, a single bad log line should not be capable of bricking the session. Gate content at the source.
For interactive sessions: keep the repair script accessible. When you see the same fixed-offset 400 appearing on consecutive turns, the session is structurally broken, not transiently broken. Locate your projects directory (typically ~/.claude/projects/), identify the session file by modification time, run the repair, and resume. The surgical approach preserves your accumulated context — on a multi-hour interactive run, that context is worth preserving.
One important caveat: the repair script assumes the JSONL object boundary is intact and only a string value is malformed. If the original truncation happened mid-JSON-object rather than cleanly inside a string, stripping D800–DFFF code points will not fix the parse error. If the script reports Warning: line N is not valid JSON after the repair pass, that line is structurally corrupt and may need to be excised entirely — a more destructive intervention requiring careful re-validation of the surrounding context to ensure the model's understanding of adjacent turns remains coherent.
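To see which lines fall into this structurally corrupt category before deciding what to excise, a read-only partition is enough. This is a minimal sketch; `partition_lines` is a hypothetical name, and it only separates lines without judging whether dropping one is safe for the surrounding turns.

```python
import json


def partition_lines(text: str):
    """Split JSONL text into parseable lines and (line_number, raw_line) rejects."""
    kept = []
    rejected = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            rejected.append((number, line))
        else:
            kept.append(line)
    return kept, rejected


sample = '{"turn": 1}\n{"turn": 2, "output": "trunc\n{"turn": 3}\n'
kept, rejected = partition_lines(sample)
print(len(kept), [number for number, _ in rejected])
```

Keeping the rejected lines together with their original line numbers makes it possible to review what would be lost before removing anything.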
Retention: transcript directories grow. 174 files is already large enough that a naive scan is noticeably slow. A year of active automation work will push the corpus much further. Establish a retention policy — archive or purge transcripts older than 30–60 days — or the startup pre-filter cost will creep back up as the corpus expands.
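A retention pass can start as a listing of candidates with no deletion at all. The sketch below selects transcripts by modification time; `stale_transcripts` is a hypothetical name, the `*.jsonl` pattern is an assumption, and the 30-day default simply mirrors the lower bound suggested above.

```python
import time
from pathlib import Path
from typing import Optional


def stale_transcripts(directory: Path, max_age_days: int = 30,
                      now: Optional[float] = None) -> list:
    """Return transcripts whose modification time is older than max_age_days."""
    reference = time.time() if now is None else now
    cutoff = reference - max_age_days * 86400  # seconds per day
    return sorted(
        path for path in directory.rglob("*.jsonl")
        if path.stat().st_mtime < cutoff
    )
```

Returning paths and leaving deletion to the caller keeps the decision to archive or purge explicit and reviewable.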
The Takeaway
Claude Code's 400: not valid JSON: no low surrogate in string is not a curiosity. For any team feeding emoji-rich or Unicode-dense content into Claude Code — CI logs, data pipeline outputs, creative text, anything with non-BMP characters — this is a repeatable failure mode that can hit any session, permanently, with no obvious recovery path unless you know the architecture.
The surgical fix is safe and targeted: strip code points in 0xD800–0xDFFF from the offending JSONL line, re-serialize with json.dumps, resume. Emoji are untouched; valid characters never live in the surrogate range. The byte pre-filter makes proactive session-directory scanning viable at real-world corpus sizes.
The deeper prescription, though, is not a repair script. It is treating every tool output that enters a Claude Code session with the validation discipline you would apply to any append-only system: you get one shot to write clean data. The transcript does not forgive dirty writes, and neither does the architecture that replays it. For production automation, the corruption should never reach the transcript at all. The surgical repair exists for interactive sessions where you cannot control upstream input — and it should stay there.
Sources & Editorial Disclosure
This article was researched and written with AI assistance (Claude by Anthropic) as part of StackRadar's automated editorial pipeline. Content was synthesised from the following public developer community sources: Dev.to.
All technical claims, version numbers, benchmarks, and project details should be independently verified against official documentation or the original sources listed above. StackRadar analyses and synthesises publicly available information and does not claim original authorship of the underlying events, projects, or research described. Mention of any project, product, or organisation does not constitute an endorsement by StackRadar. This content is provided for informational purposes only — 2026-06-29.