Skip to content

feat(_universal): network-aware bridge create + cloud-init (slices 4/5)#90

Open
pparage wants to merge 13 commits into
devfrom
feature/network-provisioning
Open

feat(_universal): network-aware bridge create + cloud-init (slices 4/5)#90
pparage wants to merge 13 commits into
devfrom
feature/network-provisioning

Conversation

@pparage

@pparage pparage commented Jun 11, 2026

Copy link
Copy Markdown
Member

What

Makes the _universal scenario honor topology-defined networks instead of hardcoding the 192.168.{bridge_base + team_id} scheme.

Slice 4 — network_create_idempotent.yml: replaces the hard assert (which blocked deploys unless an operator had pre-created every bridge) with create-if-missing:

  • Reads backend-resolved bridge/cidr/gateway from r42_network_map (inventory all.vars).
  • Stages the bridge via the proxmox-controller network_add_interfaces_node API, applies it via the Proxmox API reload (PUT), and ensures NAT through the idempotent ansible.builtin.iptables module.
  • Falls back to the legacy r42_*_for_team scheme when a network carries no explicit values → existing topologies unchanged.

Slice 5 — vm_provision.yml: cloud-init ip/netmask/gateway/bridge now read the backend-resolved values from inventory hostvars (ansible_host / r42_ci_netmask / r42_ci_gateway / r42_net_bridge), with a legacy fallback. Also FQCN-normalizes the existing include_role calls.

Why this shape (single source of truth)

The playbook previously had its own expansion (r42_expand_per_team + r42_*_for_team filters) that ignored the topology's cidr/gateway — a third copy of the expansion logic alongside the TS and Python expand_replication. This PR moves resolution to the backend (inventory_writer), so the playbook only consumes resolved values. No 4th renderer.

Dependency (cross-repo)

Requires range42-backend-api changes that emit r42_network_map + per-host r42_ci_* vars (branch feature/network-provisioning, builds on the #84 schema/expander reconciliation and #73 inventory CIDR-derivation). Merge/validate that first.

⚠️ NEEDS LIVE VALIDATION before merge

These tasks touch live Proxmox host networking and cannot be tested offline. A reversible API spike on pve01 already proved bridge create + reload works and that a naive post-up iptables -A re-appends duplicate MASQUERADE rules on reload (hence the idempotent iptables module here). Still to verify on a real 2-VM deploy:

  • Fresh bridge: create → reload applies it without disturbing vmbr0/existing bridges.
  • NAT MASQUERADE present once (no duplicates) and VM reaches the internet.
  • Cloud-init IP equals the inventory ansible_host (so Ansible can reach the VM).
  • A canvas-drawn custom CIDR (e.g. 10.x.0.0/16) is honored end-to-end.
  • Open question — NAT persistence: the iptables rule is applied at runtime; persistence across a host reboot is a tracked follow-up (the existing bridges persist NAT via post-up, which is the non-idempotent path we're avoiding).

Offline verification done

  • ansible-lint (production profile): 0 failures on both changed files.
  • ansible-playbook --syntax-check on _universal/main.yml: passes.

Philippe Parage and others added 9 commits May 8, 2026 11:41
Six filters used by scenarios/_universal/main.yml:
- r42_bridge_for_team / r42_subnet_for_team: per-team network derivation
- r42_vmid_for_node / r42_ip_for_node: per-team VMID and IP allocation
- r42_expand_per_team: replicate items across teams (shared first, then per-team)
- r42_attachments_for: resolve a node's attachments by inventory hostname

Tests: 14 unit tests in filter_plugins/__tests__/ covering shared/per-team
scopes, multi-segment hostnames, and defensive handling of empty/missing input.

ansible.cfg: explicit filter_plugins path so DetachedRunner picks the plugins
up after symlinking playbooks into private_data_dir/project/.
…pyc files

Follow-up to 0423a48; the previous commit accidentally included compiled
Python artifacts because .gitignore lacked Python-specific entries.
Slice 4 (network_create_idempotent.yml): replace the hard assert that
required operators to pre-create bridges with create-if-missing, using the
backend-resolved r42_network_map (bridge/cidr/gateway). Applies via the
Proxmox API reload and adds NAT idempotently through the iptables module
(avoids the duplicate-MASQUERADE accumulation seen with post-up iptables -A
on reload). Falls back to the legacy 192.168.{bridge_base+team} scheme.

Slice 5 (vm_provision.yml): cloud-init ip/netmask/gateway/bridge now read the
backend-resolved values from inventory hostvars (r42_ci_*/r42_net_bridge),
keeping a legacy fallback. Single source of truth = inventory_writer.

Also FQCN-normalizes the existing include_role calls in vm_provision.
Depends on backend inventory_writer changes (r42_network_map + per-host ci
vars). NEEDS LIVE 2-VM VALIDATION before merge (see PR body).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46b243aa82

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

tasks:
- name: Network presence check (loop)
include_tasks: tasks/network_create_idempotent.yml
loop: "{{ topology.nodes | selectattr('kind', 'equalto', 'network') | list | r42_expand_per_team(team_count | int) }}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate topology beyond localhost

This loop runs in a later play on proxmox-cli, but topology and the defaulted team_count are created with set_fact only for localhost in the first play. Unless those same variables are also supplied as inventory/extravars, the first provisioning play fails with topology undefined before any network or VM work starts; use hostvars['localhost'], play vars, or delegate/cache the facts for the target hosts.

Useful? React with 👍 / 👎.

Comment on lines +50 to +51
- name: "Network: create bridge when missing"
when: _expected_bridge not in (existing_bridges | default([]))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reapply NAT when a bridge already exists

Because the NAT task is inside the when: _expected_bridge not in existing_bridges block, reruns skip MASQUERADE for any bridge that already exists. This breaks the documented create-if-missing path when an operator pre-created the bridge, a previous attempt created it but failed before NAT, or the host rebooted and lost the runtime iptables rule; the idempotent iptables task should run regardless of whether the bridge had to be created.

Useful? React with 👍 / 👎.

Comment on lines +39 to +44
_expected_bridge: >-
{{ _net.bridge | default((topology.bridge_base | default(140))
| r42_bridge_for_team(net_item.team_id | default(0))) }}
_net_cidr: >-
{{ _net.cidr | default((topology.bridge_base | default(140))
| r42_subnet_for_team(net_item.team_id | default(0))) }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle shared networks in legacy fallback

For a shared network without explicit r42_network_map values, net_item.team_id is None, and Ansible's default(0) does not replace defined null values before the custom filters run. That sends None into r42_bridge_for_team / r42_subnet_for_team, whose int(team_id) raises, so the promised legacy fallback fails for shared-scope networks; use an explicit is none branch or default(0, true).

Useful? React with 👍 / 👎.

Replace hardcoded validate_certs: false with proxmox_validate_certs | default(false)
so a hardened deployment can enable verification without editing the playbook.
Default preserves repo-wide behavior (lab uses self-signed certs; the
proxmox_controller role disables verification too). Systemic TLS pinning is
tracked in backend #85. Addresses automated commit-review HIGH finding.
@pparage

pparage commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

@hyde-repo I will test before your check

…defaults)

- Relay topology+team_count to the proxmox-cli plays (set_fact on localhost is
  host-scoped; the provisioning plays would otherwise fail with topology
  undefined). Also fixes the attachment-dispatch loop fallback.
- Apply the idempotent NAT MASQUERADE unconditionally instead of only when the
  bridge is created — pre-existing bridges (and post-reboot) need it too.
- Use default(0, true)/default('shared', true) for team_id so shared-scope
  networks/VMs don't pass None into the int-based r42_*_for_team filters.
@pparage

pparage commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

Addressed the Codex review in 3144f52: (1) relay topology/team_count to the proxmox-cli plays (set_fact on localhost is host-scoped) + backend now passes team_count as an extravar; (2) idempotent NAT MASQUERADE now runs unconditionally (pre-existing bridges + post-reboot need it); (3) default(0, true)/default('shared', true) so shared-scope networks don't pass None into the int-based filters. Validating live next.

@pparage

pparage commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

hey ben quick context on why these playbook changes

the old _universal just asserted the bridges already existed and made you pre-create vmbr140-151 by hand before any deploy. and it hardcoded every ip/gateway/bridge to the 192.168.{bridge_base+team} scheme, so whatever network you actually drew in the ui (custom cidr etc) got thrown away. whole point here is to make the canvas mean something.

instead of teaching the playbook to render the network templates itself which would've been a 3rd copy of the expand logic sitting next to the ts and python ones, basically guaranteed to drift the backend now resolves everything and hands it over. there's an r42_network_map in the inventory plus per-host r42_ci_* vars, and the playbook just reads those. single source of truth.

so network_create_idempotent now: if the bridge is missing it creates it from the resolved cidr/gateway, applies it via the proxmox api reload, then ensures the NAT rule. the NAT goes through the ansible iptables module on purpose i ran a spike on pve01 and the old post-up iptables -A ... MASQUERADE re-appends a duplicate rule on every reload. the host was already sitting at ~7 dupe masquerade rules per subnet. the iptables module is idempotent so it stops that.

vm_provision: cloud-init ip/netmask/gw/bridge now come from the inventory hostvars the backend wrote instead of recomputing the hardcoded scheme. falls back to the old scheme if a topology doesn't carry the values, so nothing existing breaks.

the last commit is review fixes (thanks codex): topology was only set_fact'd on localhost so the proxmox-cli plays couldn't see it and would've died right away, the NAT task was stuck inside the create-only block so it skipped pre-existing bridges, and shared networks crashed passing None into the int filters.

still needs a real 2-vm deploy to actually validate the create+reload+nat path, can't test that offline. NAT persistence across a host reboot is a separate follow-up (runtime rule for now). backend half is range42-backend-api#98.

I will test it and once OK i'll tell you and have it reviewed by you let me know your thoughts!

@pparage pparage self-assigned this Jun 11, 2026
Live test on pve01: slice 4 creates the bridge + NAT correctly and re-deploys
onto existing bridges are clean, but creating a NEW bridge triggers the Proxmox
API reload which re-runs the other bridges' legacy post-up iptables -A hooks,
appending +N duplicate MASQUERADE rules. Document accurately; systemic fix is
backend #82/#85 (make host bridge NAT idempotent).
@pparage

pparage commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

Live validation on pve01 (slice 4)

Ran network_create_idempotent.yml directly against the real Proxmox host with a throwaway bridge (vmbr177, 10.77.0.0/24), via r42_network_map. Fully reverted afterwards (host restored to baseline, net change zero).

Works ✅

  • Bridge created from the resolved map: vmbr177 came up 10.77.0.1/24.
  • API reload applied it; vmbr0/vmbr142 and all VMs untouched.
  • NAT MASQUERADE added for the subnet (correct comment/source).
  • Re-deploy idempotency: re-running with the bridge already present skips create and the iptables module adds no duplicate → MASQUERADE count unchanged. The "NAT always runs, unconditionally" review fix is confirmed.

Caught a real issue ⚠️ (corrected the code comment in 76c3614)

Infra gaps that block the full backend-driven deploy (pre-existing, not this PR)

  • The deployer VM has no root SSH key to pve01, so the proxmox-cli plays can't run from the backend.
  • The generated/static inventory carries no real Proxmox API token (CHANGE_ME).
  • No Project/Source data is set up for a _universal deployment.

So slice 4's logic is validated against real Proxmox; the end-to-end backend path needs the above wiring before it can run unattended.

Two fact-name mismatches (found via subagent audit, verified live on pve01)
broke idempotency:
- main.yml read 'network_list' but the role sets 'network_list_interfaces_node'
  -> existing_bridges was always [] -> the create-block (and its NAT-dup
  triggering reload) fired on EVERY deploy, even for existing bridges.
- vm_provision read selectattr('vmid') but vm_list stores 'vm_id' -> the
  r42-deployed skip-guard never matched -> VMs always re-cloned.
Live-validated: existing_bridges now lists vmbr140-151; VM 1021 found via vm_id.
@pparage

pparage commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Re-targeted from feature/gamenet-authoring-v1 (stale: last commit May 8, superseded by dev) to dev, so the current _universal network-provisioning state can be deployed and tested directly.

Pairs with range42-backend-api#98 (also into dev) — that one emits r42_network_map + per-host r42_ci_* vars that these tasks consume, so deploy both together. The NEEDS LIVE VALIDATION checklist in the description above is the hands-on test plan (fresh bridge create/reload, single MASQUERADE, cloud-init IP == inventory ansible_host, custom CIDR honored end-to-end).

Note: this is the vmbr + iptables 'first draft' approach, not the SDN direction under discussion.

@pparage pparage changed the base branch from feature/gamenet-authoring-v1 to dev June 12, 2026 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant