feat: accept language arg on add_pattern / update_pattern MCP tools #63

Merged: attila merged 9 commits into main from feat/mcp-language-arg on May 19, 2026

attila commented May 19, 2026

Owner

Summary

Closes the MCP authoring-surface gap from PR #50: add_pattern and update_pattern now accept an optional language argument (scalar or array), write a canonical language: [...] frontmatter line, and surface unknown tokens via stderr and a new language_warnings array on the lore-metadata fence. update_pattern mirrors tags's three-way semantics (omit preserves, [] clears, non-empty replaces). append_to_pattern does not accept language by design — appends are body-only; the schema test pins this and fails fast if reverted.

Closes the inbox-branch advisory blackout in the same diff — the short-circuit now invokes parse_frontmatter_language_list directly so agent-submitted patterns get the same warning surface as locally-indexed ones.

Plan: docs/plans/2026-05-19-001-feat-mcp-language-arg-plan.md.

Test plan

  • just ci passes locally (fmt + clippy + 777 lib tests + 10 integration tests + deny + doc)
  • Real-binary UAT via lore serve over JSON-RPC from an isolated XDG environment
    • Scalar language: "rust" → file written with language: [rust], fence language_warnings: []
    • Array language: ["java","kotlin","groovy"] → order preserved
    • Unknown token ["rust","objectiv-c"] → both surfaces fire (stderr Warning: line + fence language_warnings: ["objectiv-c"]); file still written (warn-and-proceed)
    • update_pattern body-only against language: [swift, objectivec] → frontmatter preserved verbatim
    • tools/list confirms language absent on append_to_pattern, present on the other two
  • lore status post-UAT reflects all writes in the Languages: breakdown

attila added 7 commits May 19, 2026 19:54
Establishes the per-unit breakdown, key technical decisions, and risk
analysis for accepting a language argument on the pattern-authoring MCP
tools. The plan is the input artefact for the four implementation units
that follow.
Add the optional `language` field to both write-tool input schemas
(`oneOf [string, array<string>]`) and parse it at the MCP boundary.
`parse_language_arg` coerces a scalar to a one-element vec and preserves
the absent / empty-array / non-empty distinction that U2 will route into
`update_pattern`'s preserve / clear / replace branches.

`check_language_limit` caps the serialised JSON shape at 8 KB to keep
the validation cheap and bounded, mirroring `check_tags_limit`.

The parsed value is intentionally unused in this commit — `_language`
binding is consumed by U2 when the ingest signatures grow the parameter
and `build_file_content` learns to render `language:` frontmatter.
Validation still runs end-to-end, so a malformed argument fails with a
structured error before any write lock is taken.

Tests cover the seven argument-shape cases (absent, scalar, array,
duplicates, empty, non-string scalar, array with non-string entries),
the size-limit guard, and three schema-shape pins on the tool catalogue:
language present with the right `oneOf` shape on add_pattern and
update_pattern, language absent on append_to_pattern (load-bearing
regression guard for the schema-honesty decision).
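
A minimal sketch of the scalar-or-array coercion described above, not the actual implementation. The `Json` enum is a hypothetical stand-in for whatever JSON value type the server uses, and the signature and error strings are assumptions; only the name `parse_language_arg` and the absent / empty / non-empty distinction come from the commit message. Duplicate handling is left out because the message does not say how duplicates are treated.

```rust
/// Hypothetical stand-in for a JSON value type (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
}

/// Assumed shape of `parse_language_arg`:
/// - key absent        -> Ok(None)         (update_pattern would preserve)
/// - scalar string     -> Ok(Some(vec![s]))
/// - array of strings  -> Ok(Some(vec)),   possibly empty (would clear)
/// - anything else     -> Err(message)
pub fn parse_language_arg(arg: Option<&Json>) -> Result<Option<Vec<String>>, String> {
    match arg {
        None => Ok(None),
        Some(Json::String(s)) => Ok(Some(vec![s.clone()])),
        Some(Json::Array(items)) => {
            let mut out = Vec::with_capacity(items.len());
            for item in items {
                match item {
                    Json::String(s) => out.push(s.clone()),
                    other => {
                        return Err(format!("language entries must be strings, got {other:?}"))
                    }
                }
            }
            Ok(Some(out))
        }
        Some(other) => Err(format!(
            "language must be a string or an array of strings, got {other:?}"
        )),
    }
}
```

Returning `Option<Vec<String>>` rather than a bare `Vec<String>` is what keeps "absent" distinguishable from "empty array" further down the call chain.
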
Extend `build_file_content` to render a canonical `language: [...]`
flow-list line after `tags:` so the on-disk frontmatter shape matches
what `parse_frontmatter_language_list` already accepts. Single-token
inputs render as one-element flow lists (`[rust]`) rather than scalars
so the on-disk shape stays uniform regardless of how the agent passed
the value through MCP.
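
The rendering rule can be sketched as follows. `render_language_line` is a hypothetical helper name; the real logic lives inside `build_file_content`, whose signature is not shown in this PR. The `, ` separator is inferred from the `language: [swift, objectivec]` example in the test plan.

```rust
/// Hypothetical helper: renders the canonical `language: [...]` flow-list line.
/// An empty slice yields `None`, so no `language:` line is written at all.
pub fn render_language_line(language: &[&str]) -> Option<String> {
    if language.is_empty() {
        None
    } else {
        // A single token still renders as a one-element flow list, e.g. `[rust]`.
        Some(format!("language: [{}]", language.join(", ")))
    }
}
```
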

`ingest::add_pattern` gains a `language: &[&str]` parameter — empty
slice writes no `language:` line, preserving the pre-feature shape
byte-for-byte. `ingest::update_pattern` gains a matching
`language: Option<&[&str]>` parameter with the same three-way
semantics as `tags` (absent preserves the existing list via
`parse_frontmatter_language_list`, `Some(&[])` clears, `Some(&[..])`
replaces). The handlers in `server.rs` thread the parsed argument
through.
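
A sketch of the three-way semantics, under the assumption that the existing list has already been read from the file's frontmatter. `resolve_language` is a hypothetical name used only for illustration; in the PR this branching happens inside `ingest::update_pattern`.

```rust
/// Hypothetical helper showing the preserve / clear / replace branches.
/// `existing` is the list already on disk; `requested` is the MCP argument.
pub fn resolve_language(existing: &[String], requested: Option<&[&str]>) -> Vec<String> {
    match requested {
        // Argument omitted: keep whatever the file already declares.
        None => existing.to_vec(),
        // Explicit empty list: clear the field.
        Some(tokens) if tokens.is_empty() => Vec::new(),
        // Non-empty list: replace wholesale, preserving the caller's order.
        Some(tokens) => tokens.iter().map(|t| t.to_string()).collect(),
    }
}
```
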

Tests pin all five U2 contracts end-to-end against the DB row's
`language_json` value (the slice-shape-vs-pipeline-tests learning):
flow-list rendering, multi-token order preservation, the
tags-before-language frontmatter ordering, the absent case writing no
frontmatter, and update_pattern's preserve / clear / replace branches
with the preserve-on-absent case as the load-bearing regression guard
against the de-language footgun.

The two `#[allow(clippy::too_many_arguments)]` annotations on the
write functions are intentional — both are the canonical MCP write
paths and refactoring them into a parameter struct would obscure the
call sites rather than help readers.
Extend `WriteResult` with `language_warnings: Vec<String>` and populate
it from the chunking parser's `malformed_language` advisories so MCP
callers can see exactly which tokens warn-and-proceeded. The metadata
fence on `add_pattern`, `update_pattern`, and `append_to_pattern` now
renders the array unconditionally (empty when valid, never omitted) so
agents can pattern-match on the key without first checking existence.

Closes the inbox-branch advisory gap (Residual 2). Before this commit
the inbox-branch short-circuit skipped `index_single_file` entirely,
silently swallowing unknown-language-token warnings on the dominant
agent-submission path — exactly the dropped-argument failure mode this
plan exists to fix, just on a different code path. Each of the three
short-circuits (add, update, append) now invokes
`parse_frontmatter_language_list` directly on the about-to-be-written
content; the shared `collect_language_warnings` helper emits one stderr
line per unique unknown token and populates `WriteResult`. Stderr and
metadata-fence advisories now match across every write path.
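
A rough sketch of what a helper like `collect_language_warnings` might do. The signature, the `LANGUAGES` subset, and the exact stderr wording are assumptions; only the behaviour (one stderr line per unique unknown token, returned as a vec that is empty when everything is valid) comes from the commit message.

```rust
use std::collections::HashSet;

/// Hypothetical subset of the canonical language table.
const LANGUAGES: &[&str] = &["rust", "java", "kotlin", "groovy", "swift", "objectivec"];

/// Returns each unknown token once, in first-seen order, and emits one
/// stderr line per unique unknown token. The write itself still proceeds.
pub fn collect_language_warnings(tokens: &[String]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut warnings = Vec::new();
    for token in tokens {
        let known = LANGUAGES.contains(&token.as_str());
        // `insert` returns false for a repeat, which suppresses duplicate warnings.
        if !known && seen.insert(token.as_str()) {
            eprintln!("Warning: unknown language token `{token}`");
            warnings.push(token.clone());
        }
    }
    warnings
}
```

Because the function always returns a vec, the caller can render `language_warnings` unconditionally, which matches the "present, not omitted" contract.
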

Integration tests in `tests/branch_push.rs` pin the inbox-branch
behaviour against a real bare-remote setup: unknown tokens collect into
the warnings vec, all-valid input returns an empty vec (present, not
omitted), and update_pattern's inbox path mirrors add_pattern's. Unit
tests in `src/ingest.rs` cover the normal-write equivalents including
the preserve-on-absent path firing for unknown tokens already on disk.
The `add_pattern` metadata-fence test in `src/server.rs` pins the
end-to-end MCP-side contract for both the all-valid and warn cases.
Tighten the public-API contract for the new `language` field across
the three pattern-authoring tools' top-level descriptions:

  - `add_pattern` mentions `language` and the warn-and-proceed
    semantics so agents discover the capability from the schema alone.
  - `update_pattern` documents the three-way `tags`-style semantics
    (omit preserves, `[]` clears, non-empty replaces) so callers do not
    have to test the boundary themselves to learn the contract.
  - `append_to_pattern` points at `update_pattern` for frontmatter
    changes — the discoverability surface for Decision 3's schema-honesty
    posture (no `language` field on the body-only tool).

Add unit tests that pin each description-prose substring above as a
drift guard. Substring matches keyed on stable nouns — `language`,
`update_pattern`, the three preserve/clear/replace keywords — survive
innocent wording tweaks while still failing fast if the contract
documentation regresses.
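
The drift guard can be illustrated with a small substring check. Both the description text and the `missing_keywords` helper below are hypothetical; the real descriptions and tests live in the server source and are not reproduced in this PR.

```rust
/// Hypothetical description text, for illustration only.
pub const UPDATE_PATTERN_DESCRIPTION: &str =
    "Update a pattern. Omitting language preserves the existing list, \
     an empty array clears it, and a non-empty array replaces it.";

/// Returns the keywords that do not appear in the description.
/// A drift-guard test would assert that this list is empty.
pub fn missing_keywords<'a>(description: &str, keywords: &[&'a str]) -> Vec<&'a str> {
    keywords
        .iter()
        .copied()
        .filter(|k| !description.contains(*k))
        .collect()
}
```

Matching on a handful of stable nouns rather than on whole sentences is what lets the wording change freely while the contract stays pinned.
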

Move the ROADMAP entry from `## Up Next` to `## Completed` and add a
one-sentence CHANGELOG bullet under `Unreleased` › `Added`. The
PR number on the CHANGELOG line is left as `(#?)` until the PR is
opened; the owner can substitute the real number before review.
Apply two findings from the correctness review:

Reject tokens containing `,`, `[`, `]`, newlines, or control
characters at the MCP boundary. Embedded delimiters would render as
`language: [a,b]` and the chunking parser would split the value into
two distinct tokens on read-back, producing a silent round-trip
mismatch the agent cannot easily detect. Hard-erroring in
`parse_language_arg` surfaces the mistake before any write or push.
Three new tests pin the comma, bracket, and newline cases.
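
A minimal sketch of the boundary check, assuming a per-token validator. The function name and error text are hypothetical; in the PR the rejection happens inside `parse_language_arg`.

```rust
/// Hypothetical per-token validator: rejects characters that would corrupt
/// the `language: [a, b]` flow list on read-back.
pub fn validate_language_token(token: &str) -> Result<(), String> {
    // Newlines are control characters, so `is_control` covers them too.
    let offending = token
        .chars()
        .find(|&c| matches!(c, ',' | '[' | ']') || c.is_control());
    match offending {
        Some(bad) => Err(format!(
            "invalid character {bad:?} in language token {token:?}"
        )),
        None => Ok(()),
    }
}
```
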

Document the intentional asymmetry between `tags` and `language` in
the `update_pattern` preserve-on-`None` branch: the language parser
lowercases tokens at read time, so a body-only update against an
existing `language: [Rust]` file rewrites it as `language: [rust]`.
The DB has stored the lowercased form since PR #50; the file now
converges to match the canonical form. `tags`'s preserve path keeps
casing because the tags parser is case-preserving — language tokens
are validated against the canonical `LANGUAGES` table where lowercase
IS the canonical form. Pinned by a new regression test so a future
parser change in either direction surfaces at the call site.
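
The convergence can be seen in a toy round trip. Both functions below are hypothetical stand-ins: the real parser is `parse_frontmatter_language_list`, whose signature is not shown here. The sketch assumes only that tokens are lowercased at read time and re-rendered as a flow list.

```rust
/// Hypothetical stand-in for the language parser: reads a `language: [...]`
/// line and lowercases each token at read time.
pub fn parse_language_flow_list(line: &str) -> Vec<String> {
    let inner = line
        .trim()
        .strip_prefix("language:")
        .map(str::trim)
        .and_then(|v| v.strip_prefix('['))
        .and_then(|v| v.strip_suffix(']'))
        .unwrap_or("");
    inner
        .split(',')
        .map(|t| t.trim().to_lowercase())
        .filter(|t| !t.is_empty())
        .collect()
}

/// Hypothetical renderer matching the canonical flow-list shape.
pub fn render_language(tokens: &[String]) -> String {
    format!("language: [{}]", tokens.join(", "))
}
```

Under these assumptions, preserving `language: [Rust]` through a body-only update amounts to `render_language(&parse_language_flow_list(..))`, which is why the file comes back with the lowercase form.
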
@attila attila marked this pull request as ready for review May 19, 2026 19:34
attila added 2 commits May 19, 2026 20:37
Add the `pr:` frontmatter field to the plan so the in-flight mapping
between plan and pull request is discoverable. Status stays `active`
until merge per the existing convention.

Condense the CHANGELOG entry to a single assertive sentence per the
project's two-rule CHANGELOG convention; the detail lives in the plan
and PR body.
Three knowledge-track solution docs derived from this PR's planning,
implementation, and review cycle:

- best-practices/schema-and-description-prose-are-testable-surface —
  treat MCP tool schemas, descriptions, and any contract artefact as
  testable surface. Substring and structural assertions catch a
  regression class nothing else covers. Triggered by the U4 plan-review
  exchange that reframed 'documentation deliverable' as testable
  contract.

- best-practices/sibling-code-paths-can-reintroduce-fixed-failure-modes
  — when a fix establishes a contract on a public surface, every
  sibling code path that returns through that surface owes the contract.
  Audit sibling paths against the canonical failure signal and fold the
  fix into the current PR when cheap; otherwise hazard-pin per the
  composition-cascades pattern. Triggered by the R4 inbox-branch gap
  becoming U3 inside the same PR rather than a follow-up.

- design-patterns/preserve-branch-canonicalisation-asymmetry — when
  two sibling fields share three-way preserve/clear/replace semantics
  but their parsers normalise differently, the preserve branches will
  diverge in observable behaviour. Pin intentional asymmetry with a
  regression test and an in-tree comment. Triggered by the correctness
  review of update_pattern's language preserve branch versus the tags
  precedent.

All three cross-reference adjacent prior docs (slice-shape-tests,
composition-cascades, round-trip-discriminator) and demarcate their
distinct angle.
@attila attila merged commit 0011502 into main May 19, 2026
10 checks passed
@attila attila deleted the feat/mcp-language-arg branch May 19, 2026 19:57
@attila attila mentioned this pull request May 19, 2026
attila added a commit that referenced this pull request May 19, 2026
* doc: mark merged plans complete with PR links

Four plan docs were left with `status: active` after their PRs merged.
Flip to `status: complete`, add `completed:` dates matching merge dates,
and link the `pr:` URL — matching the frontmatter shape used by other
completed plans that carry a PR link.

- coverage-check skill (#32, merged 2026-04-08)
- language-detection architecture (#50, merged 2026-05-14)
- track-2 observability (#59, merged 2026-05-16)
- mcp language arg (#63, merged 2026-05-19)

* chore(release): cut v0.4.0