Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 80 additions & 0 deletions docs/ai/design/2026-06-28-feature-telegram-markdown-chunking.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
---
phase: design
title: Telegram Markdown-first Chunking Design
description: Use marked tokens to split source Markdown before rendering Telegram HTML
---

# Telegram Markdown-first Chunking Design

## Architecture Overview

```mermaid
graph TD
A[TelegramAdapter.sendMessage markdown text] --> B[chunkMarkdownForTelegram]
B --> C[marked lexer top-level tokens]
C --> D[Group tokens by rendered HTML length]
D --> E[Split oversized token]
E --> F[Render each markdown chunk with markdownToTelegramHtml]
F --> G[sendMessage parse_mode HTML]
G --> H[parse-entities fallback to plain text]
```

The adapter moves chunking ahead of rendering. A new chunking helper uses `marked` lexer tokens to keep source Markdown boundaries, renders each candidate with `markdownToTelegramHtml`, and only emits chunks whose rendered HTML fits the Telegram limit. `TelegramAdapter` then sends those already-valid HTML chunks using the existing parse mode and fallback path.

## Data Models

- Markdown input: raw string passed to `TelegramAdapter.sendMessage`.
- Marked token: top-level `Tokens.Generic` from `marked.lexer(markdown)`.
- Markdown chunk: source Markdown string that can be rendered independently.
- Rendered chunk: Telegram-compatible HTML string produced by `markdownToTelegramHtml(markdownChunk)`.

## API Design

Internal helper:

- `chunkMarkdownForTelegram(markdown: string, maxLen: number): string[]`
- Returns rendered Telegram HTML chunks.
- Throws only if the renderer/lexer fails before fallback can handle it.
- Does not expose new public package APIs.

Adapter flow:

1. Try Markdown-first chunking.
2. Send each rendered chunk with `{ parse_mode: 'HTML' }`.
3. If Telegram rejects a chunk with `can't parse entities`, send plain text derived from that rendered chunk.
4. If Markdown lexing/rendering throws before chunks are produced, preserve the existing source/plain text fallback in max-length chunks.

## Component Breakdown

- `packages/channel-connector/src/adapters/TelegramAdapter.ts`
- Replace rendered-HTML chunking with Markdown-first chunking.
- Keep `htmlToPlainText`, parse-entities detection, and plain text fallback.
- `packages/channel-connector/src/utils/telegramHtml.ts`
- Keep existing renderer unchanged.
- Export or reuse `marked` lexer only if it helps avoid duplicate configuration.
- `packages/channel-connector/src/__tests__/adapters/TelegramAdapter.test.ts`
- Add behavior tests for long code fences, nested list code, paragraphs, Unicode/emoji, and unchanged normal markdown.

## Design Decisions

- Chosen: use `marked` lexer/token raw source and render candidate Markdown chunks for validation.
- Trade-off: simple and aligned with current dependency, but requires recursive split heuristics for oversized tokens.
- Alternative: chunk rendered HTML with an HTML parser.
- Rejected because the user asked to chunk Markdown/source before rendering and because Telegram HTML validity depends on rendering each chunk independently.
- Alternative: convert to Telegram MessageEntity.
- Rejected by explicit requirement.

## Splitting Strategy

- Top-level token grouping: append token raw source to the current candidate when its rendered HTML fits.
- Oversized code token: split by lines, wrapping every chunk in a fenced block using the original language.
- Oversized list token: split by list items, reusing item raw source where available; if an item remains oversized, split that item recursively.
- Oversized paragraph/text token: split raw paragraph content by newline, then sentence punctuation, then word, while validating rendered length.
- Fallback: hard split source/plain text if rendering a chunk still cannot fit, then send without parse mode only for that fallback path.

## Non-Functional Requirements

- Reliability: each parse-mode send is independently rendered HTML.
- Performance: rendering candidates is acceptable because Telegram sends are already network-bound and messages are small relative to process memory.
- Security: continue escaping HTML through the existing renderer; do not pass raw HTML through.
- Maintainability: keep chunking local to the Telegram adapter and use `marked` tokens rather than ad hoc Markdown parsing.
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
---
phase: implementation
title: Telegram Markdown-first Chunking Implementation
description: Implementation notes for marked-token chunking in TelegramAdapter
---

# Telegram Markdown-first Chunking Implementation

## Development Setup

- Active worktree: `/home/ubuntu/code/ai-devkit/.worktrees/feature-telegram-markdown-chunking`
- Branch: `feature-telegram-markdown-chunking`
- Bootstrap: `npm ci`
- Package: `@ai-devkit/channel-connector`

## Code Structure

- `packages/channel-connector/src/adapters/TelegramAdapter.ts`: Telegram send flow and chunking helpers.
- `packages/channel-connector/src/utils/telegramHtml.ts`: existing Markdown-to-Telegram-HTML renderer, retained as-is for rendering.
- `packages/channel-connector/src/__tests__/adapters/TelegramAdapter.test.ts`: mocked adapter behavior tests.

## Implementation Notes

### Core Features

- Implemented `chunkMarkdownForTelegram` in `TelegramAdapter.ts`.
- Uses `Marked.lexer` to obtain top-level Markdown tokens.
- Groups tokens by rendering candidate source Markdown through `markdownToTelegramHtml`.
- Splits oversized code tokens by lines while wrapping every emitted part in the original fenced code marker and language.
- Splits oversized lists by list item where possible, then falls back to recursive text/code splitting for oversized items.
- Splits oversized paragraphs/text by newline, sentence, word, then code point fallback.
- Sends rendered chunks with Telegram HTML parse mode only after they fit.

### Patterns & Best Practices

- Keep renderer behavior unchanged.
- Keep fallback behavior local to `TelegramAdapter.sendMessage`.
- Prefer source Markdown chunk boundaries over rendered HTML manipulation.

## Integration Points

- No public API changes.
- No Telegram Bot API contract changes.
- Telegraf remains mocked in tests.

## Error Handling

- If Markdown chunk generation fails, fall back to source/plain text chunks.
- If Telegram rejects parse-mode HTML with `can't parse entities`, send plain text derived from that rendered chunk.
- Non-parse Telegram send errors continue to propagate.

## Performance Considerations

- Candidate rendering is repeated during grouping and splitting; this is bounded by Telegram message size and channel send frequency.
- Avoid large dependency changes or a custom parser.

## Security Notes

- The existing renderer continues to escape user content.
- Raw Markdown HTML remains dropped by the renderer.
- No secrets or new config are introduced.

## Validation Results

- `npx ai-devkit@latest lint --feature telegram-markdown-chunking`: exited 0.
- `npm --workspace @ai-devkit/channel-connector test -- src/__tests__/adapters/TelegramAdapter.test.ts`: exited 0, 26 tests passed.
- `npm --workspace @ai-devkit/channel-connector test`: exited 0, 62 tests passed.
- `npm --workspace @ai-devkit/channel-connector run typecheck`: exited 0.
- `npm --workspace @ai-devkit/channel-connector run lint`: exited 0.
- Final rerun of `npx ai-devkit@latest lint --feature telegram-markdown-chunking`: exited 0.
- Commit hook rerun after direct workspace package builds: repo lint exited 0 with existing warnings; repo tests exited 0 with 70 files and 821 tests passed.
- Post-fetch targeted validation: `npm --workspace @ai-devkit/channel-connector test -- src/__tests__/adapters/TelegramAdapter.test.ts` exited 0, 26 tests passed.

## Deviations and Follow-ups

- No design deviations.
- Plain/source fallback remains available if a rendered chunk still cannot fit after semantic splitting.
- PR opened: https://github.com/codeaholicguy/ai-devkit/pull/125.
80 changes: 80 additions & 0 deletions docs/ai/planning/2026-06-28-feature-telegram-markdown-chunking.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
---
phase: planning
title: Telegram Markdown-first Chunking Plan
description: Implementation tasks for semantic Markdown chunking before Telegram HTML rendering
---

# Telegram Markdown-first Chunking Plan

## Milestones

- [x] Milestone 1: Requirements, design, and tests describe Markdown-first chunking.
- [x] Milestone 2: Adapter chunks Markdown source with marked tokens and sends independently rendered HTML chunks.
- [x] Milestone 3: Targeted tests, typecheck, lifecycle lint, review, commit, and PR are complete.

## Task Breakdown

### Phase 1: Documentation and Existing Behavior

- [x] Task 1.1: Capture requirements, design, testing scenarios, and implementation plan.
- Outcome: lifecycle docs explain scope, non-goals, splitting strategy, and validation.
- Validation: `npx ai-devkit@latest lint --feature telegram-markdown-chunking`.
- Related tests: all testing doc scenarios.
- [x] Task 1.2: Inspect current Telegram adapter, renderer, package scripts, and existing tests.
- Outcome: implementation reuses local patterns and dependencies.
- Validation: source references recorded in implementation notes.

### Phase 2: TDD and Core Implementation

- [x] Task 2.1: Add failing tests for long fenced code, nested list code, paragraphs, Unicode/emoji, and unchanged normal markdown.
- Outcome: tests fail against rendered-HTML chunking for the right reasons.
- Validation: targeted Vitest run exits non-zero before production changes.
- [x] Task 2.2: Implement Markdown-first chunking with `marked` lexer tokens.
- Outcome: each rendered HTML chunk is independently valid and within the Telegram max length.
- Validation: targeted Vitest run exits zero.
- [x] Task 2.3: Preserve fallbacks for renderer failures and Telegram parse-entities errors.
- Outcome: existing fallback tests still pass.
- Validation: adapter test suite exits zero.

### Phase 3: Verification and Review

- [x] Task 3.1: Run typecheck and targeted package tests.
- Outcome: changed package validates locally.
- Validation: command output recorded in implementation/testing docs.
- [x] Task 3.2: Review implementation against design and update lifecycle docs.
- Outcome: docs reflect actual files, decisions, deviations, and risks.
- Validation: lifecycle lint passes.
- [x] Task 3.3: Commit, push, and open PR.
- Outcome: branch `feature-telegram-markdown-chunking` has a PR ready for review.
- Validation: commit SHA and PR URL reported.

## Dependencies

- Depends on existing `marked` dependency in `@ai-devkit/channel-connector`.
- Depends on existing `markdownToTelegramHtml` renderer remaining stable.
- No external Telegram API dependency for automated tests.

## Timeline & Estimates

- Documentation and code discovery: small.
- TDD and chunking implementation: medium, because recursive splitting must avoid malformed HTML and preserve fallbacks.
- Verification, review, PR: small to medium depending on CI/local runtime.

## Risks & Mitigation

- Risk: marked token `raw` values may differ across token kinds.
- Mitigation: use `raw` where available and fall back to token text for known oversized splitters.
- Risk: rendered length may exceed source length due to HTML wrappers/entities.
- Mitigation: validate by rendering every candidate before sending.
- Risk: plain text hard fallback could lose formatting.
- Mitigation: use it only after semantic splitting and rendering cannot fit.

## Resources Needed

- Repo-local tests and typecheck.
- `npx ai-devkit@latest` docs/lint commands.
- GitHub CLI or configured forge CLI for PR creation.

## Progress Summary

Implementation tasks are complete through package verification, local review, commit, push, and PR creation. PR: https://github.com/codeaholicguy/ai-devkit/pull/125.
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
---
phase: requirements
title: Telegram Markdown-first Chunking
description: Chunk Telegram markdown source by semantic boundaries before rendering HTML
---

# Telegram Markdown-first Chunking

## Problem Statement

AI DevKit's Telegram adapter currently renders an entire Markdown message to Telegram-compatible HTML, then chunks the rendered HTML string. This can split HTML tags or entities, especially for long fenced code blocks and nested-list content that renders into `<pre><code>...</code></pre>`. Telegram then receives invalid HTML for a chunk and may reject the send.

Affected users are people using `ai-devkit channel start telegram` to read long agent responses in Telegram. The current workaround is a parse-entities fallback that strips formatting for rejected chunks, but that still sends partial rendered fragments and loses formatting.

## Goals & Objectives

- Chunk Markdown/source before rendering, using `marked` lexer tokens rather than a new Markdown parser.
- Preserve the existing `markdownToTelegramHtml` renderer and Telegram HTML parse mode.
- Ensure every Telegram HTML send receives an independently rendered chunk with valid Telegram-compatible HTML.
- Split oversized content by sensible semantic boundaries:
- top-level Markdown tokens first
- code fences by lines while preserving fences and language
- lists by list item where possible
- paragraphs by newline, then sentence, then word
- source/plain text fallback only when rendering still fails
- Maintain normal markdown output for messages that already fit.

## Non-goals

- Do not implement Telegram `MessageEntity` conversion.
- Do not replace or rewrite the existing Markdown-to-Telegram-HTML renderer.
- Do not implement a custom Markdown parser.
- Do not change Telegram authorization, channel setup, polling, or send retry behavior beyond chunk preparation.

## User Stories & Use Cases

- As a Telegram channel user, I want long fenced code blocks containing literal strings like `<code>tag</code>` to arrive as multiple valid formatted code chunks, so Telegram does not reject malformed HTML.
- As a Telegram channel user, I want nested lists that include long code blocks to chunk at list item or code-line boundaries, so structure and readable formatting are preserved.
- As a Telegram channel user, I want long paragraphs to be split at readable boundaries, so responses remain understandable within Telegram's 4096 character limit.
- As a Telegram channel user, I want emoji and other Unicode text to be counted consistently with JavaScript/Telegram send limits, so chunks do not exceed the adapter's configured limit.
- As a maintainer, I want existing short and normal Markdown messages to remain unchanged.

## Success Criteria

- Each HTML send from `TelegramAdapter.sendMessage` is at most `TELEGRAM_MAX_MESSAGE_LENGTH` characters.
- Each HTML send is generated by independently calling `markdownToTelegramHtml` on a Markdown chunk.
- Tests cover long fenced code containing literal `<code>tag</code>` text without splitting rendered `<pre>`, `<code>`, or HTML entities.
- Tests cover nested list content with long fenced code.
- Tests cover long paragraphs split by sensible boundaries.
- Tests cover Unicode/emoji length behavior.
- Tests cover normal Markdown unchanged in a single send.
- Existing parse-entities fallback behavior remains available when Telegram rejects a rendered chunk.

## Constraints & Assumptions

- Telegram max message length remains represented by `TELEGRAM_MAX_MESSAGE_LENGTH = 4096`.
- JavaScript string length is the existing counting model for this adapter; this feature must not introduce a new byte-based or grapheme-count limit.
- `marked` is already available in `@ai-devkit/channel-connector`; use its lexer/tokens.
- Rendering a candidate chunk is the authoritative validation because the rendered HTML, not raw Markdown length, determines Telegram payload size.
- If a single semantic unit still cannot be split cleanly under the limit, a plain/source fallback is acceptable to preserve delivery.

## Questions & Open Items

- No material open requirements. The user explicitly decided to keep the existing HTML renderer and not implement MessageEntity conversion.
62 changes: 62 additions & 0 deletions docs/ai/testing/2026-06-28-feature-telegram-markdown-chunking.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
---
phase: testing
title: Telegram Markdown-first Chunking Testing
description: Verify semantic Markdown chunking before Telegram HTML rendering
---

# Telegram Markdown-first Chunking Testing

## Test Coverage Goals

- Unit test coverage target: all new Telegram chunking branches added in `TelegramAdapter`.
- Integration scope: mocked Telegraf send calls from `TelegramAdapter.sendMessage`.
- End-to-end scope: not required; no live Telegram API calls for this change.
- Acceptance criteria map directly to the requirements edge cases.

## Unit Tests

### TelegramAdapter.sendMessage

- [x] Long fenced code containing literal `<code>tag</code>` is split into multiple parse-mode HTML sends, each within 4096 characters and each containing balanced `<pre><code...>` wrappers.
- [x] Nested list with long fenced code is split into multiple parse-mode HTML sends without a malformed partial HTML code block.
- [x] Long paragraphs are split into multiple parse-mode HTML sends at readable boundaries and stay within limit.
- [x] Unicode/emoji content respects JavaScript string length limits for chunk size.
- [x] Normal markdown that fits is sent once and renders unchanged.
- [x] Existing parse-entities retry still falls back to plain text.
- [x] Existing renderer-throws fallback still sends source/plain text chunks.

## Integration Tests

- [x] Mocked `telegraf.telegram.sendMessage` calls always receive `{ parse_mode: 'HTML' }` for successful rendered chunks.
- [x] Plain text fallback calls omit parse mode.

## End-to-End Tests

- Not planned. The behavior is deterministic and covered through the adapter boundary with Telegraf mocked.

## Test Data

- Fenced TypeScript code block with repeated lines containing `<code>tag</code>` and `&`.
- Nested unordered list with a child fenced code block large enough to exceed Telegram length after rendering.
- Paragraphs containing sentence punctuation, long words, and emoji.
- Short markdown sample: `**bold** and *italic* and `code``.

## Test Reporting & Coverage

- Red command: `npm --workspace @ai-devkit/channel-connector test -- src/__tests__/adapters/TelegramAdapter.test.ts` exited 1 with 4 expected failures before production changes.
- Targeted adapter command: `npm --workspace @ai-devkit/channel-connector test -- src/__tests__/adapters/TelegramAdapter.test.ts` exited 0 with 26 tests passed.
- Package test command: `npm --workspace @ai-devkit/channel-connector test` exited 0 with 4 files and 62 tests passed.
- Typecheck command: `npm --workspace @ai-devkit/channel-connector run typecheck` exited 0.
- Lint command: `npm --workspace @ai-devkit/channel-connector run lint` exited 0.

## Manual Testing

- Not required for this non-UI adapter change.

## Performance Testing

- No dedicated benchmark required. Tests should avoid pathological runtime by using representative 5k to 12k character inputs.

## Bug Tracking

- Regressions should be added as adapter tests with inputs that previously generated malformed rendered HTML chunks.
Loading
Loading