feat(omni): guide VLM to use pet names from home profile in caption #298

Open
zackzmai wants to merge 2 commits into XiaoMi:main from zackzmai:feat/omni-pet-naming-from-profile

Conversation

@zackzmai

Contributor

Summary

Add a pet-naming rule to the CAPTION field spec so that the VLM refers to a pet by its registered name when its appearance clearly matches the home profile description.

Change

One line added to field_registry.py CAPTION spec:

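涉及宠物:若「# 家庭档案」记录了宠物及其外貌特征(颜色/品种/体型),且画面中宠物的外观与其中某只明确吻合,用该宠物名(如"小黑在沙发上");不确定或档案无记录时用泛称("一只猫"/"一只狗")

(English: Pets: if "# 家庭档案" (Home Profile) records a pet together with its appearance (color/breed/size), and the pet in frame clearly matches one of the recorded pets, use that pet's name (e.g. "小黑在沙发上", "Xiaohei is on the sofa"); if uncertain, or if the profile has no record, use a generic label ("一只猫" / "一只狗", "a cat" / "a dog").)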

Motivation

  • Home profile already supports pet info (member_persona with pet descriptions)
  • VLM already sees pets in frame and describes them in caption
  • This bridges the gap: VLM can now use the pet's name when the visual match is unambiguous
  • No new models needed — leverages VLM's existing visual understanding

Design Decisions

  • Opt-in behavior: Only activates when home profile has pet entries with appearance descriptions
  • Conservative rule: Must be a "clear match"; uncertain cases fall back to generic labels such as "一只猫" ("a cat")
  • Consistent with human naming rule: Humans use identities; pets use profile appearance matching (since pets have no identity mechanism yet)

Tests

All 148 omni tests pass. No runtime behavior change for users without pet profile entries.

Add a caption spec rule for pet naming: when the home profile records
a pet with distinctive appearance (color/breed/size) and the pet in
frame clearly matches, use its name instead of a generic label.

This enables personalized pet descriptions (e.g. '小黑在沙发上',
"Xiaohei is on the sofa", instead of '一只猫在沙发上', "a cat is on the
sofa") without requiring ReID models, leveraging the VLM's visual
understanding against the family profile context.

@github-actions


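👋 Thanks for submitting this PR, @zackzmai! A maintainer will review it as soon as possible.

Before submitting, please confirm:

  • CI is fully green (test / lint / build)
  • The change is focused on a single topic, so it is easy to review
  • If dependencies changed (lockfile / pyproject.toml / package.json), a maintainer must comment /allow-dependencies-change <current head SHA> to approve it (any later push needs re-approval)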

…e-profile

Complement the caption pet-naming rule with upstream data flow:
- home-observe: add pet appearance to 'worth recording' table; guide
  extraction of color/breed/size when pets first appear in perception logs
- home-profile: note that pet appearance descriptions enable the
  perception system to identify and name pets in camera feeds
@HCl8

HCl8 commented Jun 20, 2026

Collaborator

/review

@HCl8 HCl8 requested a review from ExWang June 24, 2026 07:42

@ExWang ExWang left a comment

Collaborator

This PR lands on something worth digging into. In #296 you classified pet identity as not-yet-implemented, noting that cross-frame association would likely need a dedicated ReID model. #298 goes the other way: no tracker, no ReID, no track — it just has the VLM match a pet's appearance against the home-profile descriptions, and that's enough to name the pet. It gets a capability working over a much lighter path than the one #296 framed as needing a ReID model first — so pet naming may not need to replicate the full human-identity pipeline at all.

But it surfaces something we should settle first: #296 and #298 point at two mutually exclusive answers to "where does a pet's identity/name come from." Until that's reconciled I'd rather not merge #298 alone — I'd like to take it with #297 as the design of a "pet-identity subsystem." The two routes I see, plus my open questions:

Route 1 — replicate the human-identity pipeline (a Tracker maintains identity). Recognize once, then hold identity via the track (no per-frame re-matching), with the state machine for denoising/reliability. Versus the human version we could likely drop TierC (recent-sample) and TierU (stranger-sample) management — pets don't change clothes, so those layers are probably unnecessary. The cost is standing up the whole "dedicated pet Tracker + cross-frame association" stack (the path you noted in #296 that might need a ReID model). I lean toward Route 2: for pets a text description may be enough, and it's uncertain whether vision/ReID can reliably tell "this cat" from "that cat" — which needs data to settle (see below).

Route 2 — inject home-profile info and match on text (what #298 does today). I'm inclined to favor this, consistent with #298. A few things feel under-specified:

  1. Same-breed / look-alike pets. Two tabby cats, or two Samoyeds — does caption fall back to a generic "a cat / a dog", or "a [appearance] cat/dog" (distinguish by appearance without forcing a name)? The rule needs to define the "profile has multiple pets, this one can't be uniquely matched" case.
  2. Active-registration flow (web UI). E.g. on registering a pet, call omni once to observe it and auto-generate a default appearance description, then let the user edit/confirm before submitting — rather than writing it from scratch. That keeps the profile text aligned with what the model sees, which makes matching stable.
  3. Presenting pet vs human members in the web UI. Pets are non-human subjects (empty subject_id) — how should the UI show the difference while still unifying them under "family members"?
  4. Is caption the only field that needs this? Only the caption spec changes here. Should suggestions and matched_rules also name/distinguish pets — e.g. when a rule names a specific pet?
  5. Does pet detection/tracking still have a role here? If naming relies entirely on text-matching, not track continuity, what do detection/tracking buy us in Route 2? One possible use — please make the case rather than assume it — is reducing the caption prompt's complexity: inject the pet-matching rule only when a pet is detected on screen, omit it otherwise. This also ties into how we handle #296.
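The gating idea in point 5 can be sketched as follows. This is a minimal illustration, not code from the repository: `PET_NAMING_RULE`, `build_caption_spec`, and both boolean inputs are hypothetical names, and the rule text is an English paraphrase of the line added in #298.

```python
# Hypothetical English paraphrase of the caption rule added in #298.
PET_NAMING_RULE = (
    "Pets: if the home profile records a pet with appearance details and the "
    "pet in frame clearly matches one entry, use that pet's name; otherwise "
    "use a generic label."
)


def build_caption_spec(base_rules, pet_detected, profile_has_pets):
    """Return the caption rules, adding the pet rule only when it can apply.

    base_rules:       list of rule strings that are always included
    pet_detected:     assumed output of an upstream pet detector
    profile_has_pets: whether the home profile contains any pet entry
    """
    rules = list(base_rules)  # copy so the caller's list is not mutated
    if pet_detected and profile_has_pets:
        rules.append(PET_NAMING_RULE)
    return "\n".join(rules)
```

With this shape, frames without a detected pet get the shorter prompt, and naming itself still happens through text matching rather than track continuity.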

Two cross-cutting asks:

  • Data validation. Human recognition was validated on a large amount of real data before shipping; pet recognition hasn't been validated at all. Could we collect a small amount of real data and quantitatively check the text-matching approach's reliability (and compare it against an image-based approach)? That's what would let us pick Route 1 vs 2 on evidence rather than intuition.
  • Experimental flag. Until it's validated, could we put this behind an "experimental — pet identity recognition" switch (off by default, clearly labeled), shipping it as an experimental feature rather than pushing unvalidated behavior to everyone?

None of this rejects #298's direction; I think the Route 2 it points to may well be right. But "where a pet's identity comes from" is foundational and worth settling (with #297) in one design discussion before we decide which PR lands how, and whether to ship behind an experimental flag first. Want to walk through these here or in a dedicated issue / RFC? Thanks for taking this so far.

@zackzmai

Contributor Author

Thanks for the detailed review — a few thoughts:

Background: The earlier test PRs were adding test coverage for the existing code. #298 came later from actual usage — I found that VLM text-matching against home-profile appearance descriptions was enough to name pets, no tracker/ReID/track needed. I tested it at home with a Ragdoll and an Abyssinian cat (admittedly quite distinct), and accuracy was high — it correctly distinguished and named both. Small sample, distinct breeds, so this doesn't say much about look-alike reliability.

Q1 Look-alike pets: Let prompt rules guide the VLM's own visual judgment — clear match → name; multiple similar → appearance-based generic ("the orange cat"); can't tell → fully generic ("a cat"). Look-alike scenarios do need more real-data validation. That said, most multi-pet households have visually distinct pets in practice — we could note on the feature toggle that it may not correctly distinguish similar-looking ones.
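The three-tier fallback in Q1 can be written out as plain logic to make the intended outcomes explicit. This is only a sketch under stated assumptions: in the proposal the VLM makes this judgment from the prompt, whereas here exact string equality on a hypothetical `appearance` field stands in for the visual match, and `PetProfile` and `caption_label` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PetProfile:
    """Hypothetical, simplified view of one pet entry in the home profile."""
    name: str
    species: str     # e.g. "cat"
    appearance: str  # e.g. "orange"


def caption_label(observed_species, observed_appearance, profiles):
    """Pick the label for a pet in frame using the three-tier fallback."""
    matches = [
        p for p in profiles
        if p.species == observed_species and p.appearance == observed_appearance
    ]
    if len(matches) == 1:
        # Clear, unique match: use the registered name.
        return matches[0].name
    if len(matches) > 1:
        # Several look-alike pets: describe by appearance, do not force a name.
        return f"the {observed_appearance} {observed_species}"
    # No match, or appearance could not be determined: fully generic label.
    return f"a {observed_species}"
```

For example, with one black cat and two orange cats in the profile, a black cat in frame gets its name, an orange cat becomes "the orange cat", and a cat whose appearance is unclear becomes "a cat".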

Q2 Registration flow: Calling omni once to observe and auto-generate a description is a good idea. But text input (OpenClaw setup guide or web UI manual entry) seems simpler interaction-wise and more consistent with how the home profile works today.

Q3 Web UI presentation: #298 was meant as a lightweight implementation, so it didn't touch the frontend. Looking at the current code, pet entries land in the "shared info" group in HomeKnowledgePanel — pets have member_persona type with empty subject_id (not in the identity library), and the UI buckets all unmatched member_* entries there. Not ideal — pets get mixed in with genuinely shared family info. A dedicated group or a pet marker would be cleaner, but that's a frontend change for a later phase.

Q4 Which fields: All fields share the same home-profile pet records; caption, suggestions, and matched_rules each reference them as needed. Naming happens in the existing fused call, with no extra omni invocation.

Q5 Role of detection/tracking: I think naming doesn't depend on SORT. The pure VLM approach is sufficient — the VLM matches appearance each time it sees the frame. SORT could optionally provide gating and cross-frame reuse once it lands, but those are nice-to-have, not prerequisites. The naming layer can ship independently.

Data validation: Agreed. I can collect real videos and quantitatively measure accuracy and fallback rates. What I have now is only a small sample.
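One possible way to compute those numbers from hand-labeled clips is sketched below. The input format and the function name `naming_metrics` are assumptions for illustration; any label that is not a registered pet name (for example "a cat" or "the orange cat") is counted as a fallback.

```python
def naming_metrics(samples, known_names):
    """Summarize pet-naming quality over labeled samples.

    samples:     list of (true_name, predicted_label) pairs
    known_names: set of pet names registered in the home profile
    """
    total = len(samples)
    if total == 0:
        raise ValueError("need at least one labeled sample")
    # Correct: the caption used exactly the right registered name.
    correct = sum(1 for truth, pred in samples if pred == truth)
    # Fallback: the caption used a generic or appearance-based label.
    fallback = sum(1 for _, pred in samples if pred not in known_names)
    # Misnamed: the caption used a registered name, but the wrong one.
    misnamed = total - correct - fallback
    return {
        "accuracy": correct / total,
        "fallback_rate": fallback / total,
        "misname_rate": misnamed / total,
    }
```

Separating the misname rate from the fallback rate matters here, since a generic label is a safe outcome while a wrong name is the failure the conservative rule is meant to avoid.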

Experimental flag: Agreed. But "is it usable" might matter more than "is it 100% stable" — we could consider shipping early with a clearly-labeled experimental flag. Miloco/VLM already have their share of visual/audio hallucinations today, and vision models are also iterating fast. An experimental feature that collects real-user feedback while users understand the risk could converge faster than waiting for full validation.

Direction: On technical direction, architecture, and product judgment, the official team has the fuller picture. I'd suggest the pet identity direction be led by the official team going forward — I'm happy to help with data validation.
