[linux-cp/dpdk] Fix sub-port and LAG sub-port datapath on sonic-platform-vpp#237
[linux-cp/dpdk] Fix sub-port and LAG sub-port datapath on sonic-platform-vpp#237lunyue-ms wants to merge 8 commits into
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
b4d1855 to
30f3586
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
I am not sure about this. vpp has api to do that. in vlan-bvi HLD (#228), it has this.
4.7 Promiscuous Mode on Every Physical Interface
File: SwitchVppHostif.cpp
At LCP pair creation for each physical port:
configure_lcp_interface(hwif_name, dev, true);
interface_set_promiscuous(hwif_name, true); // <-- addedThere was a problem hiding this comment.
Good idea. interface_set_promiscuous() is listed in HLD #228 but isn't in master yet — the underlying VPP API sw_interface_set_promisc exists, so it's a thin shim.
I'll add it into sairedis PR #1907: add the wrapper, call it from SwitchVppHostif::createHostif after configure_lcp_interface(), and remove vpp_promisc.sh from this PR. Do you agree?
| # Why not support sub interfaces | ||
| linux-cp { | ||
| lcp-auto-subint | ||
| # lcp-auto-subint disabled: auto-pair "BondEthernet<id>.<vlan>" <-> "be<id>.<vlan>" |
There was a problem hiding this comment.
This is a big change. Please make sure you run full sonic-mgmt tests to avoid breakage
There was a problem hiding this comment.
Yes, it is. I wil run a roll of ful regression tests to make sure it doesn't break any other tests.
There was a problem hiding this comment.
I ran additional regression testing and found that arp/test_neighbor_mac.py is failing intermittently with this change set.
Initial debugging shows the failure is not caused by VPP DPDK promisc being off — bobm0 stays promiscuous during the failure window. Packet capture indicates that in failing cases the ARP reply for 20.0.0.1 is sent with the Linux/LCP tap MAC instead of the expected port/router MAC, so the following ICMP ping from PTF fails. I am still debugging the LCP/hostif MAC handling and will update the PR once I have a fix or a clearer root cause.
lunyue-ms
left a comment
There was a problem hiding this comment.
Thanks @yue-fred-gao for your comments. Please check if my thoughts is feasible.
| # Why not support sub interfaces | ||
| linux-cp { | ||
| lcp-auto-subint | ||
| # lcp-auto-subint disabled: auto-pair "BondEthernet<id>.<vlan>" <-> "be<id>.<vlan>" |
There was a problem hiding this comment.
Yes, it is. I wil run a roll of ful regression tests to make sure it doesn't break any other tests.
There was a problem hiding this comment.
Good idea. interface_set_promiscuous() is listed in HLD #228 but isn't in master yet — the underlying VPP API sw_interface_set_promisc exists, so it's a thin shim.
I'll add it into sairedis PR #1907: add the wrapper, call it from SwitchVppHostif::createHostif after configure_lcp_interface(), and remove vpp_promisc.sh from this PR. Do you agree?
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1e0deb0 to
ce80ca0
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
ce80ca0 to
203db4b
Compare
|
/azp run |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
203db4b to
d2437f8
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Sub-port host kernel netdevs (e.g. Ethernet64.20, PortChannel1.20) were created in DOWN state because VPP linux-cp plugin sets host-side link state to DOWN by default when the lcp itf-pair is created, overriding the prior 'ip link set <subif> up' that IntfMgr executed. As a result, host kernel ignored incoming sub-port traffic and PTF-driven sub-port tests (e.g. test_packet_routed_with_valid_vlan) failed because the host ICMP stack never replied to echo-request. Enable 'lcp-sync' in the linux-cp startup configuration so that VPP propagates sub-interface admin state changes to the corresponding host netdev. When sairedis brings the VPP sub-interface up via SAI API, lcp-sync now auto-ups the host counterpart, restoring the standard sub-port datapath. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Lun Yue <17232861+lunyue-ms@users.noreply.github.com>
VPP DPDK virtio ports default to 'unicast off all-multicast on', which
drops broadcast frames such as ARP requests targeting newly-created L3
sub-port interfaces. Real ASICs are implicitly promiscuous for routed
ports, so PTF tests that rely on post-boot sub-port creation fail at
the very first ARP from the peer.
Manual reproduction on vlab-vpp-01:
- Create Ethernet64.40 with IP 172.16.0.9/30
- From PTF eth16.40 (172.16.0.10) ping 172.16.0.9
- 0 packets arrive at bobm16 (verified via vppctl trace)
- After 'vppctl set interface promiscuous on bobm16', ping succeeds
VPP's startup-config directive cannot fix this because CLI commands run
*before* the DPDK plugin finishes initialising its ports, so
'set interface promiscuous on bobm0' silently fails at that point.
A previous attempt forked a background subshell from vpp_init.sh to
wait for VPP CLI and re-apply the CLI commands. That subshell did not
survive supervisord's process-group handling, so promisc was never
actually enabled at runtime. Worse, the syncd container is restarted
by sonic-mgmt PTF fixtures during test setup ('config reload') -- any
one-shot solution loses its effect after that restart.
This change replaces the fragile subshell with a dedicated supervisord
program 'vpp_promisc.sh' that:
1. Waits up to 180s for the VPP CLI to come up and at least one
bobm[0-9]+ interface to appear.
2. Iterates every discovered bobm* DPDK port and enables promisc.
3. Stays alive in a 2s monitor loop so promisc is re-applied after
in-container VPP restarts.
The new program depends on vpp_init.sh:running via
supervisord_dependent_startup, mirroring the existing pattern used by
syncd / rsyslogd / start. Supervisord auto-restarts the script if it
exits non-zero (e.g. timeout waiting for bobm0).
Applied to both docker-syncd-vpp and docker-sonic-vpp.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Lun Yue <17232861+lunyue-ms@users.noreply.github.com>
…ision When VPP linux-cp lcp-auto-subint is on, VPP auto-generates host tap "be<id>.<vlan>" for BondEthernet<id>.<vlan>, but SONiC IntfMgr creates "PortChannel<id>.<vlan>" (teamd vlan child) as the actual host netdev that holds the L3 IP. The two netdevs are different, so ARP replies land on the auto-tap, not on PortChannel<X>.<vlan>, causing addNeighbor 0/8 success rate for sub-port-over-LAG. Disable lcp-auto-subint so sairedis is the sole producer of sub-port LCP pairs and can explicitly call configure_lcp_interface(BondEthernet<id>.<vlan>, PortChannel<id>.<vlan>) in SwitchVpp::vpp_create_router_interface for SAI_ROUTER_INTERFACE_TYPE_SUB_PORT. Verified on vlab-vpp-01 PTF test_packet_routed_with_valid_vlan: - LCP pair table now shows itf-pair: [46] BondEthernet1.20 tap4137.20 PortChannel1.20 - No more 'configure_lcp_interface ... returned -81 EEXIST' - [port] sub-test still PASS (no regression on non-LAG sub-port) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Lun Yue <17232861+lunyue-ms@users.noreply.github.com>
Two related hardening fixes to vpp_promisc.sh / vpp_promisc.conf
applied symmetrically to docker-syncd-vpp and docker-sonic-vpp:
1. Restart storm on missing DPDK ports (phase-1 timeout).
The script previously exited 1 after 180s of polling if no
bobm* interface ever appeared. Combined with autorestart=true
and startsecs=5, supervisord respawned the script forever:
180s wait + 'timed out' log + restart, every cycle. Two
changes:
- phase-1 timeout now exits 0 with a single explanatory
log line ('no DPDK ports to manage')
- autorestart=true -> autorestart=unexpected, so a clean
exit leaves the program in EXITED state and only real
crashes (segfault, bash bug) get restarted.
2. Fragile promisc state check.
'grep -q "promiscuous: unicast on"' matches an exact
substring. Minor format changes between VPP releases (extra
space, key reordering) would silently break the 'already on'
check and cause vppctl to re-issue 'set promiscuous on' every
2s. Use a whitespace-tolerant regex
'promiscuous:[[:space:]]+unicast[[:space:]]+on' instead.
No functional change for the happy path (DPDK present, VPP
recent): script still polls every 2s and enables promiscuous on
new bobm* ports as before.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Lun Yue <17232861+lunyue-ms@users.noreply.github.com>
Document the LAG sub-port (PortChannel<id>.<vlan>) plumbing introduced in sonic-sairedis PR #1907 and the matching platform-vpp changes (startup.conf 'lcp-auto-subint = disabled', vpp_promisc.sh monitor loop). vpp-lag.md: - TOC, Revisions (v0.3) and Scope: add 'LAG Subinterface' entries. - New section 'LAG Subinterface' explaining the 3-tuple naming (PortChannel<id>.<vlan> / BondEthernet<id>.<vlan> / be<id>.<vlan>), the bidirectional 'tc mirred' bridge between LCP host and SONiC L3 netdev, why 'lcp-auto-subint=disabled' is required, and the RIF IP / ARP handling (idempotent -105, arp_accept=1, neighbor pre-warm). - Status row updated to reflect 'Hwif naming + explicit LCP pair + tc bridge + RIF IP idempotency + arp_accept/pre-warm + basic [port_in_lag] PTF PASS', with a follow-up note for DUT-originated ARP on a few PTF cases (root cause to be diagnosed). SONICVPP-HLD.md: - Subinterface row: align with current behaviour (explicit configure_lcp_interface from sairedis with 'lcp-auto-subint=disabled'). - New LAG Subinterface row covering BondEthernet<id>.<vlan> + be<id>.<vlan> LCP pair and the bidirectional tc bridge to PortChannel<id>.<vlan>. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Lun Yue <17232861+lunyue-ms@users.noreply.github.com>
Both docker-syncd-vpp and docker-sonic-vpp now disable lcp-auto-subint and carry the same rationale comment, so sairedis is the sole producer of LAG sub-port LCP pairs in either deployment model (multi-container VM image or single-container sonic-vpp image). Also rewrites the existing rationale comment in docker-syncd-vpp to be more accurate: configure_lcp_interface pairs BondEthernet<id>.<vlan> with be<id>.<vlan> (not with PortChannel<id>.<vlan>); the bridge to PortChannel<id>.<vlan> is a separate tc mirred step in the same SAI RIF create. ARP replies do reach the correct LCP host tap; the kernel drops them because the IP lives on a different netdev. Signed-off-by: Lun Yue <17232861+lunyue-ms@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Promiscuous mode on the VPP DPDK physical interfaces is now configured directly through the sairedis VPP API (interface_set_promiscuous, called at LCP pair creation and after bond member enslave) instead of the supervisord-managed vpp_promisc.sh polling script. Remove the now-redundant vpp_promisc.sh / vpp_promisc.conf and their COPY lines from both docker-sonic-vpp and docker-syncd-vpp. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Lun Yue <17232861+lunyue-ms@users.noreply.github.com>
Signed-off-by: Lun Yue <17232861+lunyue-ms@users.noreply.github.com>
d2437f8 to
37a4d40
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
I agree we disable lcp-auto-subint. I see some cases (with VLAN bridge) subinterface is not desirable |
|
@yue-fred-gao I included the changes of platform_vpp and sairedis. The regression test got passed. So, "disable lcp-auto-subint" + "enable lcp-sync" will not break other tests. Since it replys on sonic-net/sonic-sairedis#1907 to make sure other tests get passed, can you help to review that PR before merging this one(#237)? Thanks! |
|
|
||
| ### Plumbing | ||
|
|
||
| For a non-LAG sub-port (`Ethernet0.10`) the LCP host can simply be the SONiC L3 netdev itself, because it sits on top of a VPP-created tap (`tap<N>`). That does not work for a LAG sub-port: `PortChannel<id>.<vlan>` is a teamd VLAN child, not a VPP tap, and VPP linux-cp can only punt to a kernel netdev backed by a VPP tap. |
There was a problem hiding this comment.
This solution probably works but I don't think it is same as other hw platforms. Does sonic expects packets to be punted from member interfaces instead of directly to the port channel interface? Same for outgoing path. The way implemented here bypasses linux bond group.
There was a problem hiding this comment.
Thanks for the point. Yes, the current modification bypasses the linux bond group by redirecting PortChannelX.Y directly to the beX.Y. This is a workaround. I think that your idea(adding a vpp node to handle TX packets) in the following comments is the right way.
|
|
||
| For a non-LAG sub-port (`Ethernet0.10`) the LCP host can simply be the SONiC L3 netdev itself, because it sits on top of a VPP-created tap (`tap<N>`). That does not work for a LAG sub-port: `PortChannel<id>.<vlan>` is a teamd VLAN child, not a VPP tap, and VPP linux-cp can only punt to a kernel netdev backed by a VPP tap. | ||
|
|
||
| The design pairs the VPP sub-interface with `be<id>.<vlan>` (a VLAN child of the dummy tap `be<id>` that already exists for the parent LAG), and then bridges that LCP host to the SONiC L3 netdev `PortChannel<id>.<vlan>` with a bidirectional `tc mirred` redirect. The ingress redirect (`be<id>.<vlan>` → `PortChannel<id>.<vlan>`) carries VPP-punted ARP/ICMP frames to the netdev that holds the IP, so the kernel handles them normally. The egress redirect (`PortChannel<id>.<vlan>` → `be<id>.<vlan>`) carries kernel-originated frames into VPP via the LCP tap, where the linux-cp cross-connect forwards them to `BondEthernet<id>.<vlan>`. Both directions are mandatory: a one-direction-only bridge silently breaks LAG sub-port ping, and failure of either direction rolls back the LCP pair and VPP sub-interface so the SAI RIF create fails cleanly. |
There was a problem hiding this comment.
Does it not work without egress redirect? packets sent to the pc sub interface will be added with vlan tag then to the bond driver. bond driver sends it to a member interface, which is injected to vpp and transmitted from a front panel port through linux cp cross connect.
There was a problem hiding this comment.
no it doesn't. I think the problem is the member interface doesn't have sub interface. Packet injected to the member interface will be dropped with "unknown vlan". I am implementing a new node that bypasses ethernet-input to send the packet directly from egress port. I think this fits sonic design. Packets exits from EthernetX netdev are fully formatted with proper l2 header. They can be sent out directly from physical port without further processing. Is this correct?
There was a problem hiding this comment.
Thanks a lot. Please correct me if my understanding is wrong:
I think that your idea is a proper one and follows the sonic design principle. So the new node can solve the TX direction via the following path. Is it?
PortChannelX.Y -> Linux bond/team -> Ethernet member -> VPP new direct-TX node -> wire
For the RX direction of host/CP packets, do you expect to keep the current design?
VPP BondEthernetX.Y -> beX.Y -> tc -> PortChannelX.Y
Or are you also planning a different RX path?
Description
Description of PR
Summary:
Three platform-level fixes that, together with the companion
sonic-sairedisPR, restore the sub-port and LAG-sub-port datapathon
sonic-platform-vpp. After this PR, PTFtest_sub_port_interfaces.py::TestSubPorts::test_packet_routed_with_valid_vlanpasses on
vlab-vpp-01for both[port]and[port_in_lag]parametrisations.
Fixes: sonic-net/sonic-platform-vpp#227
Companion PR:
the PASS result)
Type of change
Approach
What is the motivation for this PR?
On
sonic-platform-vpp, three independent issues keep the sub-portdatapath broken: VPP linux-cp holds the sub-port host netdev DOWN after
each itf-pair (re)create, DPDK virtio ports come up non-promiscuous and
drop broadcast ARP on freshly-created L3 sub-ports, and
lcp-auto-subintraces sairedis to install the LAG sub-port LCP pair(producing the wrong host tap). Together with the companion sairedis
changes, fixing these three restores sub-port and LAG sub-port
datapath end-to-end.
How did you do it?
Changes are confined to
docker-syncd-vpp/docker-sonic-vppconfigand
docs/HLD/:lcp-syncenabled — VPP propagates sub-interface admin state tothe host tap; without it linux-cp keeps the host netdev DOWN after
itf-pair (re)create, even if IntfMgr already ran
ip link set up.vpp_promisc.shpolls for DPDK port appearance and runsvppctl set interface promiscuous on bobm*. Real ASICs areimplicitly promiscuous on routed ports; VPP's
startup-configdirective cannot fix this because it runs before the DPDK plugin
initialises ports.
lcp-auto-subintdisabled — sairedis is now the sole producerof sub-port LCP pairs and atomically wires up
configure_lcp_interface(BondEthernet<id>.<vlan>, be<id>.<vlan>)plus the bidirectional
tc mirredbridge betweenbe<id>.<vlan>and the SONiC L3 netdev
PortChannel<id>.<vlan>in the same SAIRIF create.
autorestart=unexpected(no respawn storm on virtio-only platforms),and a whitespace-tolerant regex avoids re-issuing
set promiscuous onevery 2s on minor VPP output format changes.vpp-lag.mdgains a new "LAG Subinterface" sectiondescribing the 3-tuple naming (
PortChannel<id>.<vlan>/BondEthernet<id>.<vlan>/be<id>.<vlan>) and the bidirectionaltc mirredbridge;SONICVPP-HLD.mdadds a LAG Subinterface rowto the "Interface Create Flow by Type" table.
How did you verify/test it?
Result: 22 passed / 10 failed in ~32 min.
Failure breakdown:
test_routing_between_sub_ports_and_port[*-svi]— blocked by VLAN BVI / SVI routing not implemented in
vslib/vpp(sonic-buildimage #26936).
test_tunneling_between_sub_ports[*]— blocked by sairedis PR #1860 (IPinIP encap/decap; VPP 2510 enum
needs
MP2P->MPalias).test_routing_between_sub_ports_and_port[port_in_lag-l3],test_routing_between_sub_ports_unaffected_by_sub_ports_removal[port_in_lag-different],test_balancing_sub_ports[port_in_lag],test_max_numbers_of_sub_ports[port_in_lag].DUT-originated ARP on a LAG sub-port can still fail to install a
neighbor into the VPP sub-interface RIF in these stress / cross-port
cases. The
[port]counterparts PASS in the same run, isolating thebug to the LAG-specific plumbing (tc bridge or kernel → VPP
propagation on top of it). Tracked as a follow-up PR.
Any platform specific information?
Changes only touch
sonic-platform-vpp(docker-syncd-vpp/docker-sonic-vppconfig + HLD). No impact on other ASIC platforms.Both syncd and sonic images are updated symmetrically. The companion
sonic-sairedisPR is required to reproduce the PASS result.Documentation
Updated
docs/HLD/vpp-lag.md(new "LAG Subinterface" section, v0.3revision, status row) and
docs/HLD/SONICVPP-HLD.md(Subinterface rowaligned with current behaviour; new LAG Subinterface row).