LFX Mentorship — Summer 2026: DNS resolution & localhost bridging for urunc #758

ananos (Maintainer) · 2026-06-10T15:23:09Z

alimx07 · 2026-06-14T19:37:33Z

Note first: this covers only the host-namespace design, that is, how a packet from the guest reaches a localhost service on the host and how the reply comes back. The per-unikernel side (routing inside the guest, rewriting the resolver if needed) will be discussed later.

This is just a brief document before our sync. It will be updated as needed and finalized through our sync.

IP Tables

As shown in the previous sync, iptables can solve the DNS problem in a Docker network as a proof of concept.

The idea:

- punch a hole in the TC mirror for a specific IP
- use ip rule + a custom routing table (priority below the local table) to route DNS traffic to the resolver/localhost
- fwmark the packets in the mangle OUTPUT chain so the rule matches them
- flip some kernel flags (e.g. route_localnet, since the host-ns IP is the same as the tap IP)
- bind ARP to the tap so the replies come back. If we use an IP outside the subnet, anything that doesn't match the hole just gets dropped instead of leaking out, which is fine.

For the Docker case we add one more rule in nat to forward our dummy IP into localhost (basically DNAT dummyIP:53 → 127.0.0.11:53).

This works, but as discussed, we need a more dynamic solution. So I keep it as the baseline/fallback only.

IPVS

IPVS (IP Virtual Server) is an efficient in-kernel load balancer built on netfilter that sits in front of a cluster of servers. It works in three modes (NAT, Direct Routing, and IP Tunneling), but it's not our solution either, for a simple reason:

It's a load balancer (one VIP → many backends), which isn't our problem. We just want one destination's packets to hit one local resolver and the reply to come home. DR mode only rewrites the dst MAC and wants the real servers on the same L2 sharing the VIP on loopback; NAT mode drags back the static-IP / return-path problems we're avoiding. (It also lives in the netfilter path we're deliberately bypassing.)

So it adds a lot of machinery for nothing here. Not pursued.

TC

In our current implementation of the dynamic network we rely on a simple idea: the TC ingress hook is triggered before the IP stack, so we mirror all traffic from the container veth into our tap, and we do the same on the tap ingress in the other direction. This gives us a bidirectional flow between the two:

ali-mohamed@Ali-PC:~/contrib/urunc$ sudo nsenter --net=$NETNS -- tc filter show dev tap0_urunc ingress
filter parent ffff: protocol all pref 49152 u32 chain 0 
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1 
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid not_in_hw 
  match 00000000/00000000 at 0
	action order 1: mirred (Egress Redirect to device eth0) stolen
	index 1 ref 1 bind 1

ali-mohamed@Ali-PC:~/contrib/urunc$ sudo nsenter --net=$NETNS -- tc filter show dev eth0 ingress
filter parent ffff: protocol all pref 49152 u32 chain 0 
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1 
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid not_in_hw 
  match 00000000/00000000 at 0
	action order 1: mirred (Egress Redirect to device tap0_urunc) stolen
	index 2 ref 1 bind 1
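
Filters like the two shown above could be installed with tc commands along these lines. This is only a sketch of equivalent CLI commands, not necessarily how urunc sets them up; the `run` dry-run helper and the `mirror_all` function are names made up for illustration, and the commands would have to run inside the container's netns (e.g. through the same nsenter call).

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command instead of running it.
# Set RUN=1 (as root, inside the target netns) to execute for real.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Redirect every frame arriving on $1 to the egress of $2.
mirror_all() {
    src=$1
    dst=$2
    # The ingress qdisc always gets handle ffff:, as seen in the output above.
    run tc qdisc add dev "$src" ingress
    # "match u32 0 0" matches everything (listed as 00000000/00000000 at 0).
    run tc filter add dev "$src" parent ffff: protocol all \
        u32 match u32 0 0 \
        action mirred egress redirect dev "$dst"
}

mirror_all tap0_urunc eth0
mirror_all eth0 tap0_urunc
```

Without RUN=1 the script only prints the four tc commands, so it is safe to read through before applying anything.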

This also explains why manipulating ip rule as in the first solution has no effect on packets arriving from outside in our workload: they are stolen at ingress and never reach routing. There's another important hook, egress, which is triggered right after the IP stack (ip rule, routing, netfilter, etc.). We don't use it in our code yet, but we'll need it for our case.

So for the Docker case as a PoC we can depend on only TC and drop all the routing stuff. The point is that loopback (lo) itself is an interface, so it also has ingress and egress hooks.

The proposal is simple: with the same hole on a specific IP, we redirect the packets into lo ingress, because we need them to go through the IP stack so that the Docker NAT rules apply. On the lo egress hook we send the reply back into tap0_urunc egress. It has to be egress, not ingress: a reply injected at the tap's ingress would be mirrored to eth0 by the first rule, so our packets would go outside.
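
A minimal sketch of the two rules in this proposal, assuming flower filters and the mirred action. The function names and the `run` dry-run helper are hypothetical, 180.0.0.1 is the dummy IP seen in the dumps, and restricting the match to UDP port 53 is an assumption made for the DNS case.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command instead of running it.
# Set RUN=1 (as root, inside the target netns) to execute for real.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Steer DNS queries for the dummy IP from the tap's ingress into lo's ingress.
punch_hole() {
    tap=$1
    dummy_ip=$2
    # Lower pref values are checked first, so pref 10 wins over the
    # catch-all mirror (pref 49152).
    run tc filter add dev "$tap" ingress pref 10 protocol ip \
        flower dst_ip "$dummy_ip" ip_proto udp dst_port 53 \
        action mirred ingress redirect dev lo
}

# Send replies leaving lo straight to the tap's egress.
reply_to_tap() {
    tap=$1
    dummy_ip=$2
    # clsact gives lo an egress hook in addition to ingress.
    run tc qdisc add dev lo clsact
    run tc filter add dev lo egress pref 10 protocol ip \
        flower src_ip "$dummy_ip" ip_proto udp src_port 53 \
        action mirred egress redirect dev "$tap"
}

punch_hole tap0_urunc 180.0.0.1
reply_to_tap tap0_urunc 180.0.0.1
```

Everything that does not match the hole still falls through to the catch-all mirror, so ordinary traffic keeps flowing between the tap and eth0.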

The tcpdump output below shows this working end to end. The reply shows up as Out on tap0_urunc, as proposed:

ali-mohamed@Ali-PC:~/contrib/urunc$ sudo nsenter --net=$NETNS -- tcpdump -ni any
tap0_urunc P   IP 172.19.0.2.44919 > 180.0.0.1.53: 31989+ A? github.com. (28)
lo         In  IP 172.19.0.2.44919 > 180.0.0.1.53: 31989+ A? github.com. (28)
tap0_urunc Out IP 180.0.0.1.53 > 172.19.0.2.44919: 31989 1/0/0 A 140.82.121.3 (44)

One last thing: loopback writes both the src and dst MAC as zero, which makes our guest drop the packets. So in the TC rule on lo we make sure to write the dst MAC, which I think will be a single MAC for all our taps (in the future), as we will share eth0's MAC.

ali-mohamed@Ali-PC:~/contrib/urunc$ sudo nsenter --net=$NETNS -- tcpdump -ni tap0_urunc -e
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on tap0_urunc, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:16:56.291288 0a:4a:bc:53:a8:86 > 7a:10:88:c5:38:55, ethertype IPv4 (0x0800), length 70: 172.19.0.2.58995 > 180.0.0.1.53: 1574+ A? github.com. (28)
23:16:56.291326 0a:4a:bc:53:a8:86 > 7a:10:88:c5:38:55, ethertype IPv4 (0x0800), length 70: 172.19.0.2.58995 > 180.0.0.1.53: 1575+ AAAA? github.com. (28)
23:16:56.294414 00:00:00:00:00:00 > 0a:4a:bc:53:a8:86, ethertype IPv4 (0x0800), length 135: 180.0.0.1.53 > 172.19.0.2.58995: 1575 0/1/0 (93)
23:16:56.299536 00:00:00:00:00:00 > 0a:4a:bc:53:a8:86, ethertype IPv4 (0x0800), length 86: 180.0.0.1.53 > 172.19.0.2.58995: 1574 1/0/0 A 140.82.121.4 (44)
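
One way to get the destination MAC seen in this dump is to chain a pedit action in front of the redirect on lo egress. The text does not say which action is actually used, so treat pedit as an assumption; the `reply_with_mac` function and the `run` helper are hypothetical names, and the MAC is the guest's address taken from the dump.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command instead of running it.
# Set RUN=1 (as root, inside the target netns) to execute for real.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# lo egress rule: rewrite the destination MAC, then redirect to the tap.
# Assumes a clsact qdisc already exists on lo.
reply_with_mac() {
    tap=$1
    dummy_ip=$2
    guest_mac=$3
    run tc filter add dev lo egress pref 10 protocol ip \
        flower src_ip "$dummy_ip" ip_proto udp src_port 53 \
        action pedit ex munge eth dst set "$guest_mac" pipe \
        action mirred egress redirect dev "$tap"
}

reply_with_mac tap0_urunc 180.0.0.1 0a:4a:bc:53:a8:86
```

Only the destination MAC is rewritten here, which matches the dump: the source MAC of the replies stays all zeros.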

/ # nslookup github.com
Server:		180.0.0.1
Address:	180.0.0.1:53
Name:	github.com
Address: 140.82.121.3

How we could generalize

Given this small idea, now if we generalize to N taps (N containers) in our main namespace, we'll have N+1 interfaces (the taps + lo) to write rules on.

Note: if every container ran in its own isolated namespace, two containers listening on the same port would work in theory. In practice (the multi-container pod with a shared netns) the MAC and IP are shared, so we should enforce that every container listens on a unique port so that the routing from lo to the right tap works. That's the only idea I have for now in this mode, since MAC and IP are shared.

Approach A: the same idea. Punch a hole on every tap into the lo interface and add N TC rules on lo (one per port/tap) so we can route into different containers, while everything else (not destined to our dummyIP) is routed to eth0 as normal. For inbound traffic we also narrow the mirror per port (easy with flower in TC), so there are N rules on eth0 ingress instead of just 1.
So the workflow for two containers on 8080/9090 looks like:
- 2 rules on eth0 for the two taps, everything else to the host namespace.
- 2 rules on lo egress to route back into the taps' egress.
- 1 rule per tap to route into lo ingress.
Approach B: an extension of A. Instead of routing all localhost traffic into lo and then to the other tap, each tap gets the N rules itself (e.g. container1 on 8080 routes directly to container2 on 9090), which gives slightly faster tap-to-tap communication, and anything else is routed to localhost (e.g. an envoy sidecar) on lo (benchmarking needed).
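
The Approach A workflow for two containers on 8080/9090 could be sketched as follows. Several things here are assumptions: TCP as the protocol, `tap1_urunc` as the name of the second tap, 180.0.0.1 as the dummy IP, and the helper function names. The MAC rewrite on lo egress is left out to keep the sketch short, and the catch-all mirror on eth0 is assumed to have been replaced by the per-port rules.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command instead of running it.
# Set RUN=1 (as root, inside the target netns) to execute for real.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

DUMMY_IP=180.0.0.1   # dummy IP, as in the dumps

# eth0 ingress: forward one listening port to its tap; the rest stays on the host.
inbound_to_tap() {
    run tc filter add dev eth0 ingress pref 10 protocol ip \
        flower ip_proto tcp dst_port "$2" \
        action mirred egress redirect dev "$1"
}

# lo egress: localhost traffic for that port goes out through the tap's egress.
lo_to_tap() {
    run tc filter add dev lo egress pref 10 protocol ip \
        flower ip_proto tcp dst_port "$2" \
        action mirred egress redirect dev "$1"
}

# tap ingress: traffic for the dummy IP enters the host stack via lo ingress.
tap_to_lo() {
    run tc filter add dev "$1" ingress pref 10 protocol ip \
        flower dst_ip "$DUMMY_IP" \
        action mirred ingress redirect dev lo
}

run tc qdisc add dev lo clsact
for pair in tap0_urunc:8080 tap1_urunc:9090; do
    tap=${pair%%:*}    # text before the colon
    port=${pair##*:}   # text after the colon
    inbound_to_tap "$tap" "$port"
    lo_to_tap "$tap" "$port"
    tap_to_lo "$tap"
done
```

The listening port is the only key that distinguishes the taps in these rules, which is why the unique-port constraint matters in a shared netns.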

The catch (return path): the solutions above work with the unique-port rule for the inbound/listening direction, where the listening port is a stable key. But for traffic the guest initiates (like DNS), the reply comes back to the guest's ephemeral source port (e.g. 58995 in the dumps above), not the listening port. We don't know that port ahead of time, so a static lo egress rule has no stable key to pick the right tap. So static rules can't handle guest-initiated traffic in this case.

Two ways out:

- A dummy IP per container.
- State that remembers which tap a flow came from, which is what the eBPF map below gives us.

TC + eBPF

The static-rule approach above is enough for the single-container PoC or multiple containers with only listening ports. The more powerful option for the general case: instead of static TC rules, use eBPF programs at the hooks (ingress & egress), so we get all the power an eBPF program gives us, at the cost of C code + another dependency on something like cilium/ebpf for loading and attaching the kernel programs.

The key thing eBPF buys us is state. A simple eBPF map, accessible from a user program if needed and shared between all the programs, can hold the ports + interfaces info, and (more importantly) we can record at ingress which tap a flow came from and look it up at egress to send the reply back to the right tap. That's exactly the return-path problem static rules can't solve in a shared netns. We can also store IPs, MACs, etc. (even while the programs run), so even NATing (like the Docker DNS case) can be done at this level instead of with iptables.

So it's TC with extra cost and a functional gain: the moment we go multi-container and need the guest-initiated return path, the map state is the thing that actually makes it work.
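
For quick experiments, eBPF programs can also be attached to the same hooks with the tc CLI rather than a loader library such as cilium/ebpf. The sketch below only shows the attach step: the object files `record_flow.o` and `lookup_flow.o`, their `tc` section name, and the `attach_bpf` helper are all hypothetical, and sharing the map between the two programs (e.g. by pinning it) is not covered.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command instead of running it.
# Set RUN=1 (as root, inside the target netns) to execute for real.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Attach one eBPF object in direct-action mode to a device hook.
# $1 = device, $2 = ingress|egress, $3 = object file (hypothetical names below)
attach_bpf() {
    run tc filter add dev "$1" "$2" pref 5 protocol all \
        bpf direct-action obj "$3" sec tc
}

# lo needs clsact for its egress hook.
run tc qdisc add dev lo clsact
# Program that would record which tap a flow came from.
attach_bpf tap0_urunc ingress record_flow.o
# Program that would look the flow up and redirect the reply to that tap.
attach_bpf lo egress lookup_flow.o
```

In direct-action mode the program's return value is the verdict, so a redirect can be decided inside the program from the map lookup instead of by a separate mirred action.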

ananos Jun 10, 2026 Maintainer

Background & motivation

Goal

Plan (by week)

Phase 0 — Onboarding & analysis (Weeks 1–2, Jun 8–21)

Phase 1 — Design & PoC (Weeks 3–5, Jun 22–Jul 12)

Phase 2 — Midterm milestone (Week 6, Jul 13–19)

Phase 3 — Hardening & generalization (Weeks 7–9, Jul 20–Aug 9)

Phase 4 — Merge & wrap-up (Weeks 10–12, Aug 10–29)

References
