fix: align the LLM's conversational view with the SMS transcript#37
Merged
Conversation
Merge consecutive same-role turns, prepend a placeholder user turn when history starts with an assistant message, and drop empty/None content. Converse requires user-first, strictly alternating, non-empty turns; owning that constraint in the engine lets build_chat_history stay a faithful transcript. No-op for current alternating histories. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The LLM now sees exactly what was exchanged over SMS: hub opening messages (texet_hub_initial) stay in history as assistant turns, bot messages count only once delivered (sent) — dropping failed/queued — and moderated exchanges remain withheld on both sides. Previously only the first-ever opening was injected into the system prompt; since conversations merged to one-per-user (#35) every later daily opening was invisible to the model, which then hallucinated their content or denied having context. Remove the [Opening message] injection and get_opening_message entirely; Bedrock's user-first/ alternating constraint is owned by the engine-boundary normalization. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
History spanning multiple days was an undifferentiated blob: the model could not map 'what we talked about this week' onto its context, and stale time references in old replies contradicted [User's Local Time]. - build_chat_history(annotate_days=True) prefixes the first message of each user-local calendar day with a [Tuesday, June 9] marker; the offset comes from per-utterance user_local_time meta (bot rows via their generation snapshot), backfilled for leading messages, UTC fallback. Only the LLM view is annotated — stored text and exports are untouched, and the moderation-email caller keeps the default. - compose_instruction_prompt always appends a code-owned [Conversation history conventions] section telling the model what its context actually is: a real SMS thread since Sunday, day-marked, with its own openings included and safety-withheld messages absent. - texet_generation snapshot version bumped to 2 (history semantics changed). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
compose_instruction_prompt takes day_number and labels the daily section [Today's Activity (Day N)] so the model can tie the curriculum to the study day. docs/prompts/charla-system-prompt-v2.md is the deployable base prompt (paste via admin console; latest row wins). It adds what v1 lacked: a memory self-knowledge section (the model sees this week's real SMS thread + last week's summary — never deny it, never invent beyond it), usage guidance for the activity/summary sections, SMS length and anti-repetition rules, stale-time handling, and instruction privacy decoupled from memory denial. Also recommends moving off Llama 4 Maverick 17B. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The autouse kani_stub bypassed kani entirely, so nothing proved an assistant-first history survives a real chat round. Two new e2e tests restore the real _generate_reply: one drives a capture engine through the full Kani round (hub opening reaches the engine assistant-first, day-marked, reply persisted and sent), the other drives a stubbed BedrockEngine and asserts two back-to-back openings reach the Converse payload merged behind the placeholder user turn. scripts/replay_generation.py loads a bot utterance's texet_generation snapshot and prints unified diffs of the snapshot system prompt/history vs what current code would build — read-only, for replaying prod generations like eb02e4ed against context changes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Prod utterance
eb02e4ed(Bedrock / Llama 4 Maverick) showed the bot repeatedly denying it had chat history it actually had, hallucinating the content of a hub opening it couldn't see, and confusing times across days. Root causes: hub openings were stripped from history (only the first-ever one was injected into the system prompt — stale since #35), failed sends stayed in history, multi-day history had no day boundaries, and the prompt never told the model what it remembers.What
Five test-gated steps, in deploy-safe order:
normalize_converse_messagesmerges consecutive same-role turns, prepends a[start of conversation]user placeholder for assistant-first history, drops empty/None content. Owns the Converse user-first/alternating constraint at the engine boundary so the transcript layer doesn't have to lie. No-op for current traffic.get_opening_messageremoved), bot messages only whenstatus=sent(failed/queued dropped), moderated excluded on both sides.[Tuesday, June 9](offset from per-utteranceuser_local_time, UTC fallback); every system prompt ends with a code-owned[Conversation history conventions]section telling the model its context is a real multi-day SMS thread it does remember.texet_generationsnapshot version → 2.[Today's Activity (Day N)];docs/prompts/charla-system-prompt-v2.mdis the deployable base prompt (memory self-knowledge, SMS constraints, anti-repetition, instruction privacy without amnesia claims) — paste via admin console after this deploys.scripts/replay_generation.pyto diff any stored generation snapshot against current-code context.Testing
make test), mypy clean, no new lint errors vs main.eb02e4edon prod and inspect the context diff before any prompt/model change.🤖 Generated with Claude Code