MarkdownV2Parser.parse ignores backslash escapes and deviates from the spec in several places #830

Description

@zeynalnia

Problem

MarkdownV2Parser.parse (in gramjs/extensions/markdownv2.ts) departs from
the Telegram MarkdownV2 spec in several ways:

  1. Backslash escapes are not honored. Per the spec, \X for any X in
    _*[]()~`>#+-=|{}.! becomes the literal X, and \\ becomes \. Today
    the backslash is not treated as an escape:

    • Input 1\.5 → output text 1\.5 (expected 1.5).
    • Input \*not bold\* → output text <b>not bold</b> (expected literal
      text *not bold* with no entity, since the delimiters are escaped).

  2. Italic uses - instead of _. The current code matches -text- for
    italic. The spec uses _text_ and reserves __text__ for underline.

  3. No blockquote support. Per spec, lines beginning with > form a
    blockquote, and a final line ending in || marks it as expandable
    (MessageEntityBlockquote.collapsed = true). Today these are emitted as
    literal > chars.

  4. Per-region escape rules are not applied. Inside pre and code only
    \\ and \` should unescape; inside the (URL) of a link or custom
    emoji only \\ and \) should unescape. There's no pass to apply this
    selectively.

  5. HTML special characters in plain text confuse the downstream HTML
    parser. A user-typed < is not escaped before being handed to
    HTMLParser.parse, so <b>not bold</b> typed into MarkdownV2 input is
    incorrectly interpreted as bold.

  6. HTMLParser is missing some Telegram HTML-spec tags that
    htmlToMarkdownV2 (and external callers) need to round-trip cleanly:
    <tg-spoiler>, <span class="tg-spoiler">, <ins> (underline
    alternative), and <strike> (strikethrough alternative). In addition,
    HTMLParser.unparse emits the library-internal <spoiler> tag (not in
    the spec) and drops the expandable attribute on collapsed blockquotes,
    so the flag doesn't survive round-trips.
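
To make items 1 and 4 concrete, here is a minimal sketch of region-aware unescaping. The names `unescapeRegion` and `Region` are hypothetical and do not exist in gramjs; the escapable character set is the one quoted in item 1. It only illustrates which backslash sequences collapse in which region, not how the parser should be structured.

```typescript
// Characters listed in item 1 as escapable in ordinary MarkdownV2 text.
const ESCAPABLE = "_*[]()~`>#+-=|{}.!";

// Hypothetical region kinds: plain text, pre/code bodies, and the (URL) part
// of a link or custom emoji.
type Region = "text" | "code" | "url";

// Hypothetical helper: resolve backslash escapes according to the region.
function unescapeRegion(input: string, region: Region): string {
  // Characters that may follow a backslash and be unescaped in this region.
  const allowed =
    region === "code" ? "\\`" : region === "url" ? "\\)" : "\\" + ESCAPABLE;
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    const next = input[i + 1];
    if (ch === "\\" && next !== undefined && allowed.indexOf(next) !== -1) {
      out += next; // drop the backslash, keep the escaped character
      i++; // skip the character we just consumed
    } else {
      out += ch; // anything else, including a stray backslash, stays as-is
    }
  }
  return out;
}

console.log(unescapeRegion("1\\.5", "text")); // 1.5
console.log(unescapeRegion("\\*not bold\\*", "text")); // *not bold*
console.log(unescapeRegion("a\\.b\\`c", "code")); // a\.b`c
```

Unescaping plain text this eagerly would reintroduce live delimiters before markup runs, which is why the proposal below masks escapes first and only restores them at the end.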

Proposal

Rewrite the markdown→HTML transform inside the existing
markdown → HTML → HTMLParser pipeline as a staged process: extract
protected regions (pre/code/link/emoji) up front with their own escape
rules; mask remaining backslash-escapes; HTML-escape & and < in user
content; run span and blockquote markup; unmask; restore protected regions.
Switch italic to _ per spec. Expose markdownV2ToHtml and
htmlToMarkdownV2 as standalone functions so external callers can convert
between formats. Patch HTMLParser to recognize the missing tag forms and
to preserve the expandable attribute on round-trip.
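
The staged process could look roughly like the sketch below. It is not the gramjs implementation: `markdownV2ToHtmlSketch` and `escapeHtmlSketch` are hypothetical names, only inline code is treated as a protected region, only bold, underline and italic are handled, and the private-use characters U+E000 and U+E001 are assumed never to appear in user input.

```typescript
// Hypothetical helper: escape the two characters the proposal names.
function escapeHtmlSketch(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// Hypothetical sketch of the staged markdown -> HTML transform.
function markdownV2ToHtmlSketch(input: string): string {
  const protectedRegions: string[] = [];
  const maskedChars: string[] = [];

  // Stage 1: extract inline code. The first alternative skips escape pairs
  // so that an escaped backtick in plain text cannot open a code span.
  // Inside the span only \\ and \` are unescaped.
  let s = input.replace(
    /\\[\s\S]|`((?:\\[\s\S]|[^`\\])*)`/g,
    (m: string, body: string | undefined) => {
      if (body === undefined) return m; // an escape pair, handled in stage 2
      const text = body.replace(/\\([\\`])/g, "$1");
      protectedRegions.push("<code>" + escapeHtmlSketch(text) + "</code>");
      return "\uE000" + (protectedRegions.length - 1) + "\uE000";
    }
  );

  // Stage 2: mask remaining backslash escapes so markup cannot see them.
  s = s.replace(/\\([_*\[\]()~`>#+\-=|{}.!\\])/g, (_m: string, ch: string) => {
    maskedChars.push(ch);
    return "\uE001" + (maskedChars.length - 1) + "\uE001";
  });

  // Stage 3: HTML-escape & and < in user content.
  s = escapeHtmlSketch(s);

  // Stage 4: span markup. Underline (__) must run before italic (_).
  s = s
    .replace(/\*([^*]+)\*/g, "<b>$1</b>")
    .replace(/__([^_]+)__/g, "<u>$1</u>")
    .replace(/_([^_]+)_/g, "<i>$1</i>");

  // Stage 5: unmask escaped characters, then restore protected regions.
  s = s.replace(/\uE001(\d+)\uE001/g, (_m: string, i: string) => maskedChars[Number(i)]);
  s = s.replace(/\uE000(\d+)\uE000/g, (_m: string, i: string) => protectedRegions[Number(i)]);
  return s;
}

console.log(markdownV2ToHtmlSketch("\\*not bold\\* but *bold*"));
// *not bold* but <b>bold</b>
```

Masking before markup and unmasking afterwards is what keeps an escaped delimiter literal, and escaping < before any tags are generated is what keeps user-typed <b> from reaching HTMLParser.parse as a real tag.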
