feat(_universal): network-aware bridge create + cloud-init (slices 4/5)#90
feat(_universal): network-aware bridge create + cloud-init (slices 4/5)#90pparage wants to merge 13 commits into
Conversation
Six filters used by scenarios/_universal/main.yml: - r42_bridge_for_team / r42_subnet_for_team: per-team network derivation - r42_vmid_for_node / r42_ip_for_node: per-team VMID and IP allocation - r42_expand_per_team: replicate items across teams (shared first, then per-team) - r42_attachments_for: resolve a node's attachments by inventory hostname Tests: 14 unit tests in filter_plugins/__tests__/ covering shared/per-team scopes, multi-segment hostnames, and defensive handling of empty/missing input. ansible.cfg: explicit filter_plugins path so DetachedRunner picks the plugins up after symlinking playbooks into private_data_dir/project/.
…pyc files Follow-up to 0423a48; the previous commit accidentally included compiled Python artifacts because .gitignore lacked Python-specific entries.
…ne/file/external)
Slice 4 (network_create_idempotent.yml): replace the hard assert that
required operators to pre-create bridges with create-if-missing, using the
backend-resolved r42_network_map (bridge/cidr/gateway). Applies via the
Proxmox API reload and adds NAT idempotently through the iptables module
(avoids the duplicate-MASQUERADE accumulation seen with post-up iptables -A
on reload). Falls back to the legacy 192.168.{bridge_base+team} scheme.
Slice 5 (vm_provision.yml): cloud-init ip/netmask/gateway/bridge now read the
backend-resolved values from inventory hostvars (r42_ci_*/r42_net_bridge),
keeping a legacy fallback. Single source of truth = inventory_writer.
Also FQCN-normalizes the existing include_role calls in vm_provision.
Depends on backend inventory_writer changes (r42_network_map + per-host ci
vars). NEEDS LIVE 2-VM VALIDATION before merge (see PR body).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 46b243aa82
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| tasks: | ||
| - name: Network presence check (loop) | ||
| include_tasks: tasks/network_create_idempotent.yml | ||
| loop: "{{ topology.nodes | selectattr('kind', 'equalto', 'network') | list | r42_expand_per_team(team_count | int) }}" |
There was a problem hiding this comment.
Propagate topology beyond localhost
This loop runs in a later play on proxmox-cli, but topology and the defaulted team_count are created with set_fact only for localhost in the first play. Unless those same variables are also supplied as inventory/extravars, the first provisioning play fails with topology undefined before any network or VM work starts; use hostvars['localhost'], play vars, or delegate/cache the facts for the target hosts.
Useful? React with 👍 / 👎.
| - name: "Network: create bridge when missing" | ||
| when: _expected_bridge not in (existing_bridges | default([])) |
There was a problem hiding this comment.
Reapply NAT when a bridge already exists
Because the NAT task is inside the when: _expected_bridge not in existing_bridges block, reruns skip MASQUERADE for any bridge that already exists. This breaks the documented create-if-missing path when an operator pre-created the bridge, a previous attempt created it but failed before NAT, or the host rebooted and lost the runtime iptables rule; the idempotent iptables task should run regardless of whether the bridge had to be created.
Useful? React with 👍 / 👎.
| _expected_bridge: >- | ||
| {{ _net.bridge | default((topology.bridge_base | default(140)) | ||
| | r42_bridge_for_team(net_item.team_id | default(0))) }} | ||
| _net_cidr: >- | ||
| {{ _net.cidr | default((topology.bridge_base | default(140)) | ||
| | r42_subnet_for_team(net_item.team_id | default(0))) }} |
There was a problem hiding this comment.
Handle shared networks in legacy fallback
For a shared network without explicit r42_network_map values, net_item.team_id is None, and Ansible's default(0) does not replace defined null values before the custom filters run. That sends None into r42_bridge_for_team / r42_subnet_for_team, whose int(team_id) raises, so the promised legacy fallback fails for shared-scope networks; use an explicit is none branch or default(0, true).
Useful? React with 👍 / 👎.
Replace hardcoded validate_certs: false with proxmox_validate_certs | default(false) so a hardened deployment can enable verification without editing the playbook. Default preserves repo-wide behavior (lab uses self-signed certs; the proxmox_controller role disables verification too). Systemic TLS pinning is tracked in backend #85. Addresses automated commit-review HIGH finding.
|
@hyde-repo I will test before your check |
…defaults)
- Relay topology+team_count to the proxmox-cli plays (set_fact on localhost is
host-scoped; the provisioning plays would otherwise fail with topology
undefined). Also fixes the attachment-dispatch loop fallback.
- Apply the idempotent NAT MASQUERADE unconditionally instead of only when the
bridge is created — pre-existing bridges (and post-reboot) need it too.
- Use default(0, true)/default('shared', true) for team_id so shared-scope
networks/VMs don't pass None into the int-based r42_*_for_team filters.
|
Addressed the Codex review in 3144f52: (1) relay |
|
hey ben quick context on why these playbook changes the old _universal just asserted the bridges already existed and made you pre-create vmbr140-151 by hand before any deploy. and it hardcoded every ip/gateway/bridge to the 192.168.{bridge_base+team} scheme, so whatever network you actually drew in the ui (custom cidr etc) got thrown away. whole point here is to make the canvas mean something. instead of teaching the playbook to render the network templates itself which would've been a 3rd copy of the expand logic sitting next to the ts and python ones, basically guaranteed to drift the backend now resolves everything and hands it over. there's an r42_network_map in the inventory plus per-host r42_ci_* vars, and the playbook just reads those. single source of truth. so network_create_idempotent now: if the bridge is missing it creates it from the resolved cidr/gateway, applies it via the proxmox api reload, then ensures the NAT rule. the NAT goes through the ansible iptables module on purpose i ran a spike on pve01 and the old vm_provision: cloud-init ip/netmask/gw/bridge now come from the inventory hostvars the backend wrote instead of recomputing the hardcoded scheme. falls back to the old scheme if a topology doesn't carry the values, so nothing existing breaks. the last commit is review fixes (thanks codex): topology was only set_fact'd on localhost so the proxmox-cli plays couldn't see it and would've died right away, the NAT task was stuck inside the create-only block so it skipped pre-existing bridges, and shared networks crashed passing None into the int filters. still needs a real 2-vm deploy to actually validate the create+reload+nat path, can't test that offline. NAT persistence across a host reboot is a separate follow-up (runtime rule for now). backend half is range42-backend-api#98. I will test it and once OK i'll tell you and have it reviewed by you let me know your thoughts! |
Live test on pve01: slice 4 creates the bridge + NAT correctly and re-deploys onto existing bridges are clean, but creating a NEW bridge triggers the Proxmox API reload which re-runs the other bridges' legacy post-up iptables -A hooks, appending +N duplicate MASQUERADE rules. Document accurately; systemic fix is backend #82/#85 (make host bridge NAT idempotent).
Live validation on pve01 (slice 4)Ran Works ✅
Caught a real issue
Infra gaps that block the full backend-driven deploy (pre-existing, not this PR)
So slice 4's logic is validated against real Proxmox; the end-to-end backend path needs the above wiring before it can run unattended. |
Two fact-name mismatches (found via subagent audit, verified live on pve01)
broke idempotency:
- main.yml read 'network_list' but the role sets 'network_list_interfaces_node'
-> existing_bridges was always [] -> the create-block (and its NAT-dup
triggering reload) fired on EVERY deploy, even for existing bridges.
- vm_provision read selectattr('vmid') but vm_list stores 'vm_id' -> the
r42-deployed skip-guard never matched -> VMs always re-cloned.
Live-validated: existing_bridges now lists vmbr140-151; VM 1021 found via vm_id.
|
Re-targeted from Pairs with range42-backend-api#98 (also into Note: this is the vmbr + iptables 'first draft' approach, not the SDN direction under discussion. |
What
Makes the
_universalscenario honor topology-defined networks instead of hardcoding the192.168.{bridge_base + team_id}scheme.Slice 4 —
network_create_idempotent.yml: replaces the hardassert(which blocked deploys unless an operator had pre-created every bridge) with create-if-missing:bridge/cidr/gatewayfromr42_network_map(inventoryall.vars).network_add_interfaces_nodeAPI, applies it via the Proxmox API reload (PUT), and ensures NAT through the idempotentansible.builtin.iptablesmodule.r42_*_for_teamscheme when a network carries no explicit values → existing topologies unchanged.Slice 5 —
vm_provision.yml: cloud-initip/netmask/gateway/bridgenow read the backend-resolved values from inventory hostvars (ansible_host/r42_ci_netmask/r42_ci_gateway/r42_net_bridge), with a legacy fallback. Also FQCN-normalizes the existinginclude_rolecalls.Why this shape (single source of truth)
The playbook previously had its own expansion (
r42_expand_per_team+r42_*_for_teamfilters) that ignored the topology'scidr/gateway— a third copy of the expansion logic alongside the TS and Pythonexpand_replication. This PR moves resolution to the backend (inventory_writer), so the playbook only consumes resolved values. No 4th renderer.Dependency (cross-repo)
Requires range42-backend-api changes that emit
r42_network_map+ per-hostr42_ci_*vars (branchfeature/network-provisioning, builds on the #84 schema/expander reconciliation and #73 inventory CIDR-derivation). Merge/validate that first.These tasks touch live Proxmox host networking and cannot be tested offline. A reversible API spike on
pve01already proved bridge create + reload works and that a naivepost-up iptables -Are-appends duplicate MASQUERADE rules on reload (hence the idempotent iptables module here). Still to verify on a real 2-VM deploy:vmbr0/existing bridges.ansible_host(so Ansible can reach the VM).10.x.0.0/16) is honored end-to-end.post-up, which is the non-idempotent path we're avoiding).Offline verification done
ansible-lint(production profile): 0 failures on both changed files.ansible-playbook --syntax-checkon_universal/main.yml: passes.