From 3874d3d0686b315c32b1c75dc97f8fe97909713d Mon Sep 17 00:00:00 2001
From: Hoang Nguyen <hoangnn93@gmail.com>
Date: Sun, 28 Jun 2026 20:00:50 +0000
Subject: [PATCH] docs(design): propose telegram message entities

---
 .../feature-telegram-entities-design.md       | 259 ++++++++++++++++++
 1 file changed, 259 insertions(+)
 create mode 100644 docs/ai/design/feature-telegram-entities-design.md

diff --git a/docs/ai/design/feature-telegram-entities-design.md b/docs/ai/design/feature-telegram-entities-design.md
new file mode 100644
index 00000000..ca5e337b
--- /dev/null
+++ b/docs/ai/design/feature-telegram-entities-design.md
@@ -0,0 +1,259 @@
+---
+phase: design
+title: "Telegram MessageEntity Rich Messages Design"
+description: Design for replacing Telegram HTML parse mode with plain text plus Bot API MessageEntity spans
+---
+
+# System Design: Telegram MessageEntity Rich Messages
+
+## Architecture Overview
+
+Telegram outbound messages currently render markdown to Telegram-compatible HTML and send each chunk with `parse_mode: 'HTML'`. The long-term design is to render markdown directly into plain text plus Telegram Bot API `MessageEntity` ranges, then send each chunk with `entities` instead of `parse_mode`.
+
+This is design-only. It does not reopen or modify the short-term HTML fallback work from PR #122.
+
+```mermaid
+graph TD
+    Assistant[Assistant markdown output]
+    Adapter[TelegramAdapter.sendMessage]
+    Renderer[telegramEntities renderer]
+    Chunker[entity-aware chunker]
+    API[Telegraf sendMessage]
+    Telegram[Telegram Bot API]
+
+    Assistant --> Adapter
+    Adapter --> Renderer
+    Renderer -->|plain text plus entities| Chunker
+    Chunker -->|adjusted chunk entities| API
+    API -->|text, entities| Telegram
+```
+
+### Key Principles
+
+- Render from markdown tokens to plain text and entity ranges in one pass.
+- Do not generate HTML and parse it back into entities.
+- Keep the current HTML renderer available during rollout.
+- Preserve message delivery over formatting fidelity when Telegram rejects an entity payload.
+- Keep `ChannelAdapter.sendMessage(chatId, text)` unchanged; rich rendering remains an internal Telegram concern.
+
+## Current State
+
+Affected current files:
+
+| File | Current responsibility |
+|---|---|
+| `packages/channel-connector/src/adapters/TelegramAdapter.ts` | Calls `markdownToTelegramHtml`, chunks rendered HTML, sends `parse_mode: 'HTML'`, and retries parse failures as plain text |
+| `packages/channel-connector/src/utils/telegramHtml.ts` | Uses `marked` custom renderers to emit Telegram HTML |
+| `packages/channel-connector/src/__tests__/adapters/TelegramAdapter.test.ts` | Covers HTML send options, chunking, and parse-entities fallback |
+| `packages/channel-connector/src/__tests__/utils/telegramHtml.test.ts` | Covers markdown-to-HTML behavior |
+
+Local Telegraf types already support the target API:
+
+```typescript
+sendMessage(chatId, text, { entities });
+```
+
+`@telegraf/types` defines `MessageEntity.offset` and `MessageEntity.length` as UTF-16 code unit positions.
+
+## Data Models
+
+Add a new utility module next to the existing HTML renderer:
+
+```typescript
+import type { MessageEntity } from '@telegraf/types';
+
+export interface TelegramRichText {
+  text: string;
+  entities?: MessageEntity[];
+}
+
+export function markdownToTelegramEntities(markdown: string): TelegramRichText;
+
+export function chunkTelegramRichText(
+  message: TelegramRichText,
+  maxLength?: number
+): TelegramRichText[];
+```
+
+Recommended module path:
+
+```text
+packages/channel-connector/src/utils/telegramEntities.ts
+```
+
+Keep `telegramHtml.ts` during incremental rollout so `TelegramAdapter` can switch between renderers and retain a known fallback.
+
+## Renderer Design
+
+The renderer should continue to use `marked`, but it should walk tokens and append to a mutable plain-text buffer. Each formatted token records the buffer length before and after rendering its children.
+
+```typescript
+const start = buffer.length;
+renderChildren(token.tokens);
+const length = buffer.length - start;
+entities.push({ type: 'bold', offset: start, length });
+```
+
+JavaScript string indexes and `.length` are UTF-16 code units, matching Telegram offsets. The implementation should still make this explicit with helper names and tests so future changes do not switch to code point or grapheme counts accidentally.
+
+### Markdown Mapping
+
+| Markdown input | Plain text output | Entity |
+|---|---|---|
+| Heading | Heading text plus blank line | `bold` |
+| Strong | Text | `bold` |
+| Emphasis | Text | `italic` |
+| Strikethrough | Text | `strikethrough` |
+| Inline code | Code text | `code` |
+| Fenced code block | Code text plus blank line | `pre`, with `language` when present |
+| Link | Link label | `text_link` with `url` |
+| Image | Alt text, or URL when alt text is empty | `text_link` with `url` |
+| Blockquote | Quote text plus blank line | `blockquote` |
+| Unordered list | `- item` or bullet text | No entity |
+| Ordered list | `1. item` | No entity |
+| Table | Padded ASCII table | `pre` |
+| Horizontal rule | Plain divider text | No entity |
+| Raw HTML | Dropped | No entity |
+
+Lists should remain plain text because Telegram has no list entity. The exact marker can preserve current visible output, but ASCII `-` is simpler than a bullet when strict ASCII output is preferred.
+
+## Entity Constraints
+
+Telegram allows nested message entities only under these constraints:
+
+- If two entities share characters, one must fully contain the other.
+- `bold`, `italic`, `underline`, `strikethrough`, and `spoiler` can contain or be contained by other entities except `pre` and `code`.
+- `blockquote` entities cannot be nested.
+- Other entity types cannot contain each other.
+
+The renderer should include a validation or normalization step before returning entities:
+
+1. Drop zero-length entities.
+2. Sort by `offset`, then by descending `length` for containing entities.
+3. Reject partial overlaps.
+4. Remove style or link entities that fall inside `code` or `pre`.
+5. Avoid nested `blockquote`; keep the outermost quote.
+6. Prefer dropping the incompatible inner entity over failing the full message.
+
+This preserves the user's text even when some nested markdown cannot be represented exactly.
+
+## UTF-16 Offset Handling
+
+Telegram offsets are UTF-16 code units. The renderer and chunker should use JavaScript string lengths directly:
+
+```typescript
+const utf16Offset = text.length;
+const utf16Length = rendered.length;
+```
+
+Important cases:
+
+- Emoji outside the BMP, such as U+1F600, count as two UTF-16 code units.
+- ZWJ emoji sequences count as multiple UTF-16 code units.
+- Combining marks and variation selectors count as separate UTF-16 code units.
+- CJK text usually counts as one UTF-16 code unit per character.
+
+The implementation should not use `[...text].length`, `Array.from(text).length`, `Intl.Segmenter`, or byte lengths for Telegram entity offsets.
+
+## Chunking Design
+
+Telegram `sendMessage` accepts 1-4096 characters after entity parsing. With entities, chunking should happen after markdown rendering and before sending:
+
+1. Split on plain text, not HTML.
+2. Prefer paragraph boundaries, then single newlines, then hard splits.
+3. Do not split inside a UTF-16 surrogate pair.
+4. For each chunk, keep only entities that intersect the chunk.
+5. Adjust retained offsets by subtracting the chunk start offset.
+6. For entities crossing a boundary:
+   - Split simple style entities such as `bold`, `italic`, `strikethrough`, and `blockquote`.
+   - Drop crossing `text_link`, `code`, and `pre` entities unless the chunk contains the full original entity.
+7. Re-run entity normalization for each chunk.
+
+Hard splitting should include a helper that backs up one code unit when `text.charCodeAt(splitAt - 1)` is a high surrogate and `text.charCodeAt(splitAt)` is a low surrogate.
+
+## TelegramAdapter Integration
+
+Add a renderer mode behind an internal option or feature flag:
+
+```typescript
+type TelegramRichMessageMode = 'html' | 'entities';
+
+interface TelegramAdapterOptions {
+  botToken: string;
+  richMessageMode?: TelegramRichMessageMode;
+}
+```
+
+Initial default should remain `html`. The `entities` path should call:
+
+```typescript
+const rendered = markdownToTelegramEntities(text);
+for (const chunk of chunkTelegramRichText(rendered, TELEGRAM_MAX_MESSAGE_LENGTH)) {
+  await bot.telegram.sendMessage(chatId, chunk.text, { entities: chunk.entities });
+}
+```
+
+When `chunk.entities` is empty, omit the extra options object or omit `entities` to keep plain text sends simple.
+
+## Fallback Behavior
+
+Fallback should continue to favor message delivery:
+
+- If entity rendering throws, send the original markdown as plain text chunks with no entities.
+- If Telegram rejects a chunk with a parse-entities error, retry that chunk as plain text with no entities.
+- If Telegram rejects for a non-parse error, propagate the error as today.
+- If the `entities` mode is enabled and a structural invariant fails locally, log or debug the renderer issue and fall back to plain text rather than falling back through HTML.
+
+The fallback text for entity mode is already plain text, so it does not need HTML tag stripping or entity decoding.
+
+## Rollout Phases
+
+1. Add `telegramEntities.ts` and focused tests without changing `TelegramAdapter` behavior.
+2. Add `richMessageMode: 'html' | 'entities'` with default `html`.
+3. Wire `TelegramAdapter` to the entity renderer behind the option and keep existing HTML tests.
+4. Add adapter tests proving entity sends use `{ entities }` and do not set `parse_mode`.
+5. Exercise the option in a non-default environment or manual Telegram bot smoke test.
+6. Make `entities` the default after confidence is built.
+7. Remove the HTML renderer after one release cycle if no fallback dependency remains.
+
+## Test Plan
+
+Add `packages/channel-connector/src/__tests__/utils/telegramEntities.test.ts` with cases for:
+
+- Plain text pass-through with no entities.
+- Bold, italic, strikethrough, nested style spans, and sorted offsets.
+- Inline code excluding nested formatting.
+- Fenced code block with `language`.
+- Links and image alt-text links as `text_link`.
+- Blockquotes and nested blockquote normalization.
+- Ordered, unordered, and nested lists as readable plain text.
+- Tables as ASCII `pre`.
+- Raw HTML stripping.
+- Emoji and Unicode before, inside, and after formatted ranges, asserting UTF-16 offsets.
+- Chunking at paragraph, newline, hard limit, and surrogate-pair boundaries.
+- Entity offset adjustment after chunking.
+- Crossing entity behavior for style, `text_link`, `code`, and `pre`.
+
+Update `TelegramAdapter.test.ts` with cases for:
+
+- Entity mode sends `{ entities }` and no `parse_mode`.
+- Empty entity arrays are omitted.
+- Renderer failure sends original markdown as plain text.
+- Telegram parse-entities rejection retries the chunk as plain text.
+- Non-parse errors still propagate.
+- HTML mode remains available during rollout.
+
+## Risks and Trade-offs
+
+- Entity rendering is more precise but more complex than HTML rendering.
+- Some markdown nesting cannot be represented by Telegram entities; the design drops incompatible inner formatting to preserve delivery.
+- Splitting long code blocks may remove `pre` formatting from split chunks, but avoids invalid entity spans.
+- Feature-flagged rollout temporarily keeps two renderers, increasing test matrix size.
+- UTF-16 correctness is easy to regress if future code uses code point or grapheme counts.
+
+## Non-Functional Requirements
+
+- Rendering and chunking should be deterministic and synchronous.
+- Runtime behavior should not require new dependencies beyond existing `marked` and Telegraf types.
+- No secrets, chat IDs, or message content should be logged by default.
+- The public channel connector API should remain stable.