Commit b15c068

rsasaki0109 and claude committed
Add SayCan-style affordance grounding (embodied_ai/39)
A language model is a good planner and a bad robot: asked to "wipe the table" it proposes "pick up the sponge" without knowing whether the robot is near the sponge. SayCan (Ahn et al., 2022, "Do As I Can, Not As I Say") grounds the model by scoring every skill twice and multiplying:

    score(skill) = p_LLM(skill furthers the instruction) * p_affordance(works now)

so the greedy argmax walks out a feasible plan with no separate planner and never commands a skill whose preconditions are unmet.

The repo had no foundation-model loop; this adds the smallest honest one and ties it to the existing clarifying-question / conformal-ask-for-help line.

The contrast is built into the same file via a `ground` flag (mirroring MCL's `augment`):

    grounded:   go_to_sponge -> pick_sponge -> go_to_table -> wipe (goal in 4-5 steps)
    ungrounded: argmax LLM = pick_sponge from the wrong place, forever -> timeout

The "LLM" is a small, transparent scorer conditioned on the running facts (the history-conditioned query SayCan makes) but deliberately blind to physical preconditions, which is exactly what the affordance term grounds.

- self-contained KitchenWorld (two locations, five stochastic skills with preconditions/affordances) + a SayCanAgent and a References section
- three smoke tests: grounded walks the feasible plan with no affordance violations; grounded retries a stochastic skill_slip and still wins; ungrounded language-only loops on affordance_violation and times out while grounded succeeds on the same seed
- examples index + embodied_ai README section; example 41->42, tests 115->118

Verified across seeds 0-7 (grounded cleans every seed in 4-5 steps, retrying slips; ungrounded never cleans).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 6e62001 commit b15c068
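
The scoring rule in the commit message can be checked by hand for the very first decision. The sketch below is not part of the commit: it copies the "no sponge yet" scores from `language_scores()` and the affordances the start state would produce (robot at the table, hands empty, 0.02 for an unmet precondition), then compares the two argmaxes. The variable names are illustrative only.

```python
# Language-model relevance scores for the "no sponge yet" branch,
# copied from language_scores() in the example file.
llm = {
    "pick_sponge": 0.45,
    "go_to_sponge": 0.25,
    "wipe_table": 0.15,
    "go_to_table": 0.10,
    "done": 0.05,
}

# Affordances in the start state (robot at the table, hands empty):
# the base success rate where the precondition holds, 0.02 where it does not.
affordance = {
    "pick_sponge": 0.02,   # not standing at the sponge
    "go_to_sponge": 1.0,
    "wipe_table": 0.02,    # not holding the sponge
    "go_to_table": 1.0,
    "done": 1.0,
}

# SayCan score: relevance ("Say") times feasibility ("Can").
combined = {name: llm[name] * affordance[name] for name in llm}

ungrounded_choice = max(llm, key=llm.get)
grounded_choice = max(combined, key=combined.get)

for name in llm:
    print(f"{name:<13} say={llm[name]:.2f} can={affordance[name]:.2f} "
          f"say*can={combined[name]:.4f}")
print("ungrounded:", ungrounded_choice)
print("grounded:  ", grounded_choice)
```

Under these numbers the language-only argmax is `pick_sponge`, which is not executable from the table, while the product drops it to 0.009 and promotes `go_to_sponge` at 0.25.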

6 files changed

Lines changed: 388 additions & 3 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ star helps others find it.
 
 ## Status
 
-41 runnable examples · 38 README GIFs · 115 smoke / regression tests ·
+42 runnable examples · 38 README GIFs · 118 smoke / regression tests ·
 5 Gymnasium-style adapters · CI green on Python 3.10, 3.11, and 3.12.
 
 See `docs/status.md` for the implementation snapshot, `docs/plan.md` for the

docs/status.md

Lines changed: 2 additions & 2 deletions
@@ -5,10 +5,10 @@ see what exists, what is verified, and what should come next.
 
 ## Snapshot
 
-- Runnable examples: 41
+- Runnable examples: 42
 - Learning-path roadmap examples: 20
 - README GIFs: 38
-- Smoke and regression tests: 115 (102 example/adapter/static + 13 planning)
+- Smoke and regression tests: 118 (105 example/adapter/static + 13 planning)
 - Colab notebooks: 5
 - Core dependencies: `numpy`, `matplotlib`
 - Contributor extra: `pip install -e ".[dev]"`

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ Run any example headless with its `--no-render` flag when available.
 | `embodied_ai/33_inverse_reward_from_demo.py` | `python examples/embodied_ai/33_inverse_reward_from_demo.py` | demo feature expectation -> learned weights -> shaped A* to new goal |
 | `embodied_ai/35_clarifying_question.py` | `python examples/embodied_ai/35_clarifying_question.py "pick the block" --answer red` | ambiguous command -> ask question -> answer -> act |
 | `embodied_ai/36_household_task_agent.py` | `python examples/embodied_ai/36_household_task_agent.py "put the block away" --answer red` | clarify -> plan -> safety check -> retry -> human replan |
+| `embodied_ai/39_saycan_affordance_grounding.py` | `python examples/embodied_ai/39_saycan_affordance_grounding.py` | LLM score x affordance -> feasible skill -> retry slip -> goal |
 
 ## World Models
 
examples/embodied_ai/39_saycan_affordance_grounding.py

Lines changed: 298 additions & 0 deletions
@@ -0,0 +1,298 @@
+"""Ground a language model's plan in affordances: say what helps, do what is possible.
+
+A language model is a good planner and a bad robot. Asked to "wipe the table" it
+will confidently propose *pick up the sponge* (the right idea) without knowing
+whether the robot is anywhere near the sponge. SayCan (Ahn et al., 2022, "Do As I
+Can, Not As I Say") fixes this by scoring every skill twice and multiplying:
+
+    score(skill) = p_LLM(skill furthers the instruction) * p_affordance(skill works now)
+
+The language term ("Say") ranks skills by relevance to the goal; the affordance
+term ("Can") is the robot's own estimate that the skill will succeed from the
+current state. Their product is high only for a skill that is both useful *and*
+executable, so the greedy argmax walks out a feasible plan with no separate
+planner, and never commands a skill whose preconditions are unmet.
+
+This example runs the same kitchen task two ways via the ``ground`` flag:
+
+* ``ground=True`` (SayCan): language x affordance -> go to the sponge, pick it
+  up (retrying a slip), carry it to the table, wipe. Goal reached.
+* ``ground=False`` (language only): the argmax of the raw LLM scores commands
+  "pick the sponge" while standing at the table, the precondition is unmet, and
+  the robot repeats that affordance_violation until it times out. Ungrounded
+  language is not executable.
+
+The "LLM" here is a small, transparent stand-in for a language-model call: it
+scores skills by relevance to the instruction given the running facts (what is
+held, what is done), exactly the history-conditioned query SayCan makes, but it
+is deliberately blind to physical preconditions, which is the whole point of
+grounding it.
+
+Success: the table is wiped clean.
+Failure: affordance_violation (recoverable - a skill was commanded with its
+precondition unmet), skill_slip (recoverable - an afforded skill stochastically
+missed and is retried), and timeout (terminal).
+
+References:
+  * M. Ahn et al., "Do As I Can, Not As I Say: Grounding Language in Robotic
+    Affordances," CoRL 2022. arXiv:2204.01691. https://say-can.github.io/
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+
+ROOT = Path(__file__).resolve().parents[2]
+if str(ROOT) not in sys.path:
+    sys.path.insert(0, str(ROOT))
+
+from pir.core.random import make_rng
+from pir.core.types import Failure, StepResult, Trace
+
+SKILLS = ("go_to_sponge", "go_to_table", "pick_sponge", "wipe_table", "done")
+
+
+@dataclass
+class KitchenState:
+    location: str = "table"  # robot starts at the dirty table, sponge is elsewhere
+    holding_sponge: bool = False
+    table_clean: bool = False
+
+
+@dataclass
+class Skill:
+    """A primitive with a precondition, an affordance (base success), and an effect."""
+
+    name: str
+    precondition: Any  # state -> bool
+    base_success: float  # p(success) when the precondition is met
+    effect: Any = None  # state -> None, applied on success
+
+
+def _build_skills() -> dict[str, Skill]:
+    def at(loc: str):
+        return lambda s: s.location == loc
+
+    skills = {
+        "go_to_sponge": Skill("go_to_sponge", lambda s: True, 1.0,
+                              lambda s: setattr(s, "location", "sponge")),
+        "go_to_table": Skill("go_to_table", lambda s: True, 1.0,
+                             lambda s: setattr(s, "location", "table")),
+        "pick_sponge": Skill("pick_sponge", lambda s: at("sponge")(s) and not s.holding_sponge,
+                             0.8, lambda s: setattr(s, "holding_sponge", True)),
+        "wipe_table": Skill("wipe_table", lambda s: at("table")(s) and s.holding_sponge,
+                            0.85, lambda s: setattr(s, "table_clean", True)),
+        "done": Skill("done", lambda s: True, 1.0, None),
+    }
+    return skills
+
+
+class KitchenWorld:
+    """A two-location kitchen; skills enforce preconditions and may slip."""
+
+    def __init__(self, *, seed: int | None = 0, max_steps: int = 20) -> None:
+        self.skills = _build_skills()
+        self.max_steps = max_steps
+        self.seed = seed
+        self.reset(seed=seed)
+
+    def reset(self, seed: int | None = None) -> dict[str, Any]:
+        if seed is not None:
+            self.seed = seed
+        self.rng = make_rng(self.seed)
+        self.state = KitchenState()
+        self.time = 0
+        return self.observe()
+
+    def observe(self) -> dict[str, Any]:
+        s = self.state
+        return {
+            "time": self.time,
+            "location": s.location,
+            "holding_sponge": s.holding_sponge,
+            "table_clean": s.table_clean,
+            "affordances": {name: self.affordance(name) for name in SKILLS},
+        }
+
+    def affordance(self, skill_name: str) -> float:
+        """The robot's estimate that the skill succeeds from the current state.
+
+        High when the precondition holds (the skill's base success rate), near
+        zero when it does not. This is the grounding signal SayCan multiplies in.
+        """
+        skill = self.skills[skill_name]
+        return skill.base_success if skill.precondition(self.state) else 0.02
+
+    def step(self, action: dict[str, Any]) -> StepResult:
+        self.time += 1
+        name = action.get("skill", "done")
+        skill = self.skills[name]
+        info: dict[str, Any] = {
+            "time": self.time,
+            "skill": name,
+            "affordance": self.affordance(name),
+            "success": False,
+        }
+
+        if name == "done":
+            done = True
+            info["success"] = self.state.table_clean
+            return StepResult(self.observe(), 1.0 if self.state.table_clean else -0.2, done, info)
+
+        if not skill.precondition(self.state):
+            # The commanded skill is not executable here: the failure that
+            # grounding is meant to prevent.
+            info["failure"] = Failure(
+                "affordance_violation", f"{name} precondition unmet in {self.state.location}", True
+            )
+            done = self.time >= self.max_steps
+            if done:
+                info["failure"] = Failure("timeout", "ran out of steps", False)
+            return StepResult(self.observe(), -0.2, done, info)
+
+        if self.rng.random() < skill.base_success:
+            if skill.effect is not None:
+                skill.effect(self.state)
+            info["success"] = self.state.table_clean
+            reward = 1.0 if self.state.table_clean else 0.05
+            done = self.state.table_clean or self.time >= self.max_steps
+            if not self.state.table_clean and self.time >= self.max_steps:
+                info["failure"] = Failure("timeout", "ran out of steps", False)
+            return StepResult(self.observe(), reward, done, info)
+
+        # Afforded but stochastically slipped (e.g. the grasp missed): retry next.
+        info["failure"] = Failure("skill_slip", f"{name} was afforded but missed", True)
+        done = self.time >= self.max_steps
+        if done:
+            info["failure"] = Failure("timeout", "ran out of steps", False)
+        return StepResult(self.observe(), -0.1, done, info)
+
+
+def language_scores(instruction: str, obs: dict[str, Any]) -> dict[str, float]:
+    """A transparent stand-in for an LLM call: p(skill furthers the instruction).
+
+    It conditions on the running facts (held / clean) the way SayCan re-prompts
+    the model with the plan so far, and ranks skills by *relevance to the goal*,
+    but it never checks physical preconditions (it does not know where the robot
+    is standing). That blindness is exactly what the affordance term grounds.
+    """
+    _ = instruction  # one task here; kept to mirror a real LLM prompt signature
+    if obs["table_clean"]:
+        scores = {"done": 0.70, "go_to_table": 0.10, "wipe_table": 0.08,
+                  "go_to_sponge": 0.06, "pick_sponge": 0.06}
+    elif obs["holding_sponge"]:
+        # Has the sponge -> the model says "go wipe the table" (relevant, maybe
+        # infeasible from here).
+        scores = {"wipe_table": 0.45, "go_to_table": 0.30, "done": 0.10,
+                  "pick_sponge": 0.08, "go_to_sponge": 0.07}
+    else:
+        # No sponge yet -> the model says "pick up the sponge" (relevant, and
+        # infeasible unless already standing at it).
+        scores = {"pick_sponge": 0.45, "go_to_sponge": 0.25, "wipe_table": 0.15,
+                  "go_to_table": 0.10, "done": 0.05}
+    return {name: scores.get(name, 0.0) for name in SKILLS}
+
+
+class SayCanAgent:
+    """Pick argmax over p_LLM(skill) * p_affordance(skill); drop the affordance to ablate."""
+
+    def __init__(self, instruction: str = "wipe the table", ground: bool = True) -> None:
+        self.instruction = instruction
+        self.ground = ground
+
+    def reset(self) -> None:
+        self.last_scores: dict[str, dict[str, float]] = {}
+
+    def act(self, obs: dict[str, Any]) -> dict[str, Any]:
+        llm = language_scores(self.instruction, obs)
+        affordance = obs["affordances"]
+        if self.ground:
+            combined = {name: llm[name] * affordance[name] for name in SKILLS}
+        else:
+            combined = dict(llm)  # language only: ignore whether the skill is possible
+        chosen = max(SKILLS, key=lambda name: combined[name])
+        self.last_scores = {"llm": llm, "affordance": affordance, "combined": combined}
+        return {"skill": chosen}
+
+    def update(self, obs: dict[str, Any], reward: float, info: dict[str, Any]) -> None:
+        name = info.get("skill")
+        if name and self.last_scores:
+            info["llm_score"] = round(self.last_scores["llm"][name], 4)
+            info["combined_score"] = round(self.last_scores["combined"][name], 4)
+        info["grounded"] = self.ground
+
+
+def run(
+    seed: int = 0,
+    render: bool = True,
+    max_steps: int = 20,
+    ground: bool = True,
+    instruction: str = "wipe the table",
+) -> Trace:
+    world = KitchenWorld(seed=seed, max_steps=max_steps)
+    obs = world.reset(seed=seed)
+    agent = SayCanAgent(instruction=instruction, ground=ground)
+    agent.reset()
+    trace = Trace()
+
+    for _ in range(max_steps):
+        action = agent.act(obs)
+        result = world.step(action)
+        obs, reward, done, info = result.as_tuple()
+        agent.update(obs, reward, info)
+        trace.append(obs, action, reward, info)
+
+        if render:
+            _render(info)
+
+        if done:
+            break
+
+    return trace
+
+
+def _render(info: dict[str, Any]) -> None:
+    failure = info.get("failure")
+    tag = f" [{failure.kind}]" if failure else ""
+    print(
+        f" t={info['time']:2d} skill={info['skill']:<13} "
+        f"affordance={info['affordance']:.2f} combined={info.get('combined_score', 0):.3f}{tag}"
+    )
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--seed", type=int, default=0)
+    parser.add_argument("--max-steps", type=int, default=20)
+    parser.add_argument("--instruction", type=str, default="wipe the table")
+    parser.add_argument("--no-render", action="store_true")
+    parser.add_argument(
+        "--no-ground", action="store_true", help="language only (no affordance grounding)"
+    )
+    args = parser.parse_args()
+
+    if not args.no_render:
+        print(f'instruction: "{args.instruction}" (grounded={not args.no_ground})')
+    trace = run(
+        seed=args.seed,
+        render=not args.no_render,
+        max_steps=args.max_steps,
+        ground=not args.no_ground,
+        instruction=args.instruction,
+    )
+    final = trace.infos[-1]
+    failures = sorted({f.kind for f in trace.failures()})
+    print(
+        f"cleaned={final.get('success', False)} steps={len(trace.actions)} "
+        f"failures={failures} grounded={not args.no_ground}"
+    )
+
+
+if __name__ == "__main__":
+    main()
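
The file above depends on the repository's `pir` package, so it cannot be run in isolation. As a rough, standalone sketch of the same decision loop, the code below re-implements the preconditions and score tables with two stated simplifications that are not in the commit: an afforded skill always succeeds (no `skill_slip`), and the "table already clean" branch of the scorer is omitted because the episode ends as soon as the table is wiped. The names `State`, `precondition`, `llm_scores`, `apply`, and `rollout` are illustrative only.

```python
from dataclasses import dataclass

SKILLS = ("go_to_sponge", "go_to_table", "pick_sponge", "wipe_table", "done")
BASE_SUCCESS = {"go_to_sponge": 1.0, "go_to_table": 1.0,
                "pick_sponge": 0.8, "wipe_table": 0.85, "done": 1.0}


@dataclass
class State:
    location: str = "table"
    holding_sponge: bool = False
    table_clean: bool = False


def precondition(skill: str, s: State) -> bool:
    if skill == "pick_sponge":
        return s.location == "sponge" and not s.holding_sponge
    if skill == "wipe_table":
        return s.location == "table" and s.holding_sponge
    return True  # moving and "done" are always possible


def affordance(skill: str, s: State) -> float:
    # Base success rate when the precondition holds, near zero otherwise.
    return BASE_SUCCESS[skill] if precondition(skill, s) else 0.02


def llm_scores(s: State) -> dict[str, float]:
    # Same numbers as language_scores() for the two pre-goal branches.
    if s.holding_sponge:
        return {"wipe_table": 0.45, "go_to_table": 0.30, "done": 0.10,
                "pick_sponge": 0.08, "go_to_sponge": 0.07}
    return {"pick_sponge": 0.45, "go_to_sponge": 0.25, "wipe_table": 0.15,
            "go_to_table": 0.10, "done": 0.05}


def apply(skill: str, s: State) -> None:
    """Assumption: an afforded skill always succeeds (slips are ignored)."""
    if not precondition(skill, s):
        return  # affordance violation: the state does not change
    if skill == "go_to_sponge":
        s.location = "sponge"
    elif skill == "go_to_table":
        s.location = "table"
    elif skill == "pick_sponge":
        s.holding_sponge = True
    elif skill == "wipe_table":
        s.table_clean = True


def rollout(ground: bool, max_steps: int = 20) -> tuple[list[str], bool]:
    s, plan = State(), []
    for _ in range(max_steps):
        llm = llm_scores(s)
        # Grounded: multiply by the affordance; ungrounded: language only.
        score = {k: llm[k] * (affordance(k, s) if ground else 1.0) for k in SKILLS}
        skill = max(SKILLS, key=score.get)
        plan.append(skill)
        if skill == "done":
            break
        apply(skill, s)
        if s.table_clean:
            break
    return plan, s.table_clean


grounded_plan, grounded_clean = rollout(ground=True)
ungrounded_plan, ungrounded_clean = rollout(ground=False)
print(grounded_plan, grounded_clean)
print(set(ungrounded_plan), len(ungrounded_plan), ungrounded_clean)
```

With slips removed, the grounded rollout reproduces the four-skill plan named in the commit message, and the ungrounded rollout commands `pick_sponge` from the table on every one of its 20 steps.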

examples/embodied_ai/README.md

Lines changed: 44 additions & 0 deletions
@@ -482,3 +482,47 @@ follow shaped path -> compare scenic visits across demo, baseline, learned
   collapses to the baseline path.
 - Provide a second demo trajectory and average the two feature
   expectations before subtracting the uniform baseline.
+
+## `39_saycan_affordance_grounding.py`
+
+### What this teaches
+
+A language model is a good planner and a bad robot: asked to "wipe the table" it
+proposes *pick up the sponge* without knowing whether the robot is near the
+sponge. SayCan (Ahn et al., 2022) grounds it by scoring every skill twice, as
+`p_LLM(skill furthers the instruction) * p_affordance(skill works now)`, and
+taking the argmax. The product is high only for a skill that is both relevant and
+executable, so the greedy choice walks out a feasible plan with no separate
+planner. Run with `--no-ground` to drop the affordance term and watch the raw LLM
+argmax command an unexecutable skill until it times out.
+
+### Run
+
+```bash
+python examples/embodied_ai/39_saycan_affordance_grounding.py
+python examples/embodied_ai/39_saycan_affordance_grounding.py --no-ground  # language only
+```
+
+### Key loop
+
+```text
+LLM score x affordance -> argmax feasible skill -> execute -> slip ? retry : advance -> goal
+```
+
+### Simplifications
+
+- a tiny two-location kitchen and five discrete skills
+- the "LLM" is a transparent hand-written scorer conditioned on the running facts
+  (held / clean), standing in for a history-conditioned language-model call
+- affordance is the skill's base success rate when its precondition holds, near
+  zero when it does not
+- skills are stochastic (an afforded pick or wipe can slip and is retried)
+
+### Things to try
+
+- Toggle `--no-ground` and compare: grounding turns the same LLM scores from an
+  `affordance_violation` loop into an executable plan.
+- Lower a skill's `base_success` and watch `skill_slip` retries grow.
+- Start the robot at the sponge (`KitchenState(location="sponge")`) and watch the
+  first grounded skill change.
+- Add a second tool whose skill the LLM ranks highly but that is never afforded.
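
For the "lower a skill's `base_success`" suggestion, a back-of-the-envelope estimate may help set expectations before running anything. The sketch below is not from the commit. It assumes each attempt of an afforded skill succeeds independently with its `base_success`, so the number of attempts is geometric with mean `1 / p`, and it ignores the `max_steps` cap. The function names are hypothetical.

```python
def expected_attempts(p_success: float) -> float:
    """Mean number of tries until the first success of a repeated trial."""
    return 1.0 / p_success


def expected_grounded_steps(p_pick: float = 0.8, p_wipe: float = 0.85) -> float:
    # Two moves that always succeed (go_to_sponge, go_to_table),
    # plus the pick and the wipe, each retried until it succeeds.
    return 2.0 + expected_attempts(p_pick) + expected_attempts(p_wipe)


baseline = expected_grounded_steps()               # defaults from the example
weaker_grasp = expected_grounded_steps(p_pick=0.4)  # hypothetical lower value
print(f"baseline: {baseline:.2f} steps, weaker grasp: {weaker_grasp:.2f} steps")
```

With the example's defaults this gives roughly 4.4 expected steps, which is in line with the 4-5 steps the commit message reports across seeds 0-7; halving the pick success rate adds 1.25 expected steps.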
