fix(packetparser): Fix under reporting of TCP flags and packet metrics, improve scalability by mmckeen · Pull Request #1665 · microsoft/retina

mmckeen · 2025-06-06T17:21:44Z

Description

Previously packetparser in high dataAggregationLevel would report (mostly) every single packet since important flags were observed over the lifetime of the connection.

This changes that behavior to only observe the important flags on individual packets and report when necessary.

This means fewer packets are reported. However, it also adds back weighting for bytes, packets, and TCP flags so that metrics remain accurate compared to before.
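To illustrate the weighting idea, here is a minimal sketch in Python rather than the eBPF C the plugin uses. It assumes that skipped packets are folded into a per-connection accumulator and that each reported event carries the accumulated totals. `ConnState`, `observe`, and the event layout are hypothetical names for illustration, not Retina's actual structures.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ConnState:
    """Hypothetical per-connection accumulator (not Retina's actual struct)."""
    packets: int = 0
    byte_count: int = 0
    flags: Counter = field(default_factory=Counter)


def observe(state, size, flags, report):
    """Fold one packet into the accumulator; flush the totals when it is reported."""
    state.packets += 1
    state.byte_count += size
    state.flags.update(flags)
    if not report:
        return None
    event = {
        "packets": state.packets,
        "bytes": state.byte_count,
        "flags": dict(state.flags),
    }
    # Reset so the next reported event only carries what happened since this one.
    state.packets = 0
    state.byte_count = 0
    state.flags.clear()
    return event


# Five packets on one connection; only the first and last are reported.
state = ConnState()
traffic = [
    (60, ["SYN"], True),
    (1500, ["ACK"], False),
    (1500, ["ACK", "PSH"], False),
    (1500, ["ACK"], False),
    (60, ["FIN", "ACK"], True),
]
events = [e for size, flags, report in traffic if (e := observe(state, size, flags, report))]
total_packets = sum(e["packets"] for e in events)
total_bytes = sum(e["bytes"] for e in events)
```

Only two events are emitted, yet summing their weights recovers all five packets and every byte, which is the property that keeps the metrics accurate when fewer packets are reported.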

I also noticed the current docs for the TCP flags metrics are inaccurate: we only report a subset of the supported flags. I'm not sure whether this is intentional; however, supporting more flags will put more memory pressure on conntrack as well as performance pressure on packet reporting. With sampling in place, this should be more than worth it, but there may be repercussions for the performance of low dataAggregationLevel.

Checklist

I have read the contributing documentation.
I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
I have correctly attributed the author(s) of the code.
I have tested the changes locally.
I have followed the project's style guidelines.
I have updated the documentation, if necessary.
I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed

eBPF objects compile and load as expected.

`main` Branch

This Branch

Additional Notes

#1628 will be a follow-up to this to add additional sampling functionality.

Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.

mmckeen · 2025-06-20T14:56:51Z

@nddq @SRodi this is ready for review 🙇

nddq · 2025-06-26T20:42:30Z

@mmckeen sorry for the delay, I just got back from a break. I’ve gone through your proposed change a couple of times, and it looks solid to me. In fact, it addresses something we initially overlooked when conntrack was introduced. First, a few points to make sure we're aligned:

> Previously packetparser in high dataAggregationLevel would report (mostly) every single packet since important flags were observed over the lifetime of the connection.

That’s correct. For any given packet, we currently:

Report it if it contains a flag set we haven’t seen before for this specific connection
Report it if a certain amount of time has passed since the last reported packet for this connection (default: 30 seconds; applies to both dataAggregation levels)
Otherwise, we skip it

In a typical TCP connection, the reported events would likely look like:
SYN, SYN-ACK, ACK, PSH, PSH-ACK, (30 secs), PSH, PSH-ACK, ... FIN, FIN-ACK, FIN-ACK

As a result, we ignore all packets during those 30-second windows, which skews the reported packet, byte, and TCP flag counts from the actual values.
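The decision rule described above can be sketched as follows. This is a simplified, single-direction model in Python, not the actual conntrack code: the `conn` dict, the function name, and the reading of "a flag set we haven't seen before" as "any flag not yet seen on this connection" are assumptions.

```python
REPORT_INTERVAL_S = 30  # default reporting interval mentioned above


def should_report_old(conn, now, flags):
    """Report on a not-yet-seen flag or once the interval has elapsed; else skip."""
    new_flags = flags - conn["seen_flags"]
    conn["seen_flags"] |= flags
    if new_flags or now - conn["last_report"] >= REPORT_INTERVAL_S:
        conn["last_report"] = now
        return True
    return False


# One packet per second for a minute: a SYN at t=0, plain ACKs afterwards.
conn = {"seen_flags": set(), "last_report": 0}
packets = [(t, {"SYN"} if t == 0 else {"ACK"}) for t in range(60)]
reported = [t for t, flags in packets if should_report_old(conn, t, flags)]
skipped = len(packets) - len(reported)
```

In this toy trace only the packets at t=0, t=1, and t=31 are reported. If each report counts as a single packet, the other 57 never reach the metrics, which is the skew described above.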

That said:

> This changes that behavior to only observe the important flags on individual packets and report when necessary.

Could you clarify this part? From what I see, conntrack already behaves this way today — so I’m not sure this change introduces new behavior?

> This will mean less packets are reported. However, it also adds back weighting for bytes, packets, and TCP flags so that metrics remain accurate versus before.

This is great — it addresses the gap I mentioned earlier around ignored packets. So in that sense, this feels more like a bug fix than a new feature 🙂

mmckeen · 2025-06-26T20:53:03Z

I think it's a bit of both a bug fix and a new feature.

This change also always reports a packet that carries an important flag such as TCP_URG or TCP_ECE, but only the packet with that flag, not the rest of the connection.

But yes, overall I think this is more of a bug fix. The behavior change is minor and just reflects the expected functionality: the 30-second reporting window is respected for connections without new flags.
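A rough sketch of the per-packet rule described in this reply, in Python for brevity. The membership of `IMPORTANT_FLAGS` beyond URG and ECE, the `conn` dict, and the choice to restart the interval on every report are assumptions for illustration, not the code in `conntrack.c`.

```python
REPORT_INTERVAL_S = 30  # reporting window discussed in this thread

# URG and ECE are named in this thread; SYN, FIN and RST are assumed here.
IMPORTANT_FLAGS = frozenset({"URG", "ECE", "SYN", "FIN", "RST"})


def should_report_new(conn, now, flags):
    """Report a packet only for its own important flags or an elapsed interval."""
    if flags & IMPORTANT_FLAGS or now - conn["last_report"] >= REPORT_INTERVAL_S:
        conn["last_report"] = now  # assumption: every report restarts the window
        return True
    return False


def flags_at(t):
    """Toy trace: SYN at t=0, one URG packet at t=10, plain ACKs otherwise."""
    if t == 0:
        return {"SYN"}
    if t == 10:
        return {"ACK", "URG"}
    return {"ACK"}


conn = {"last_report": 0}
reported = [t for t in range(60) if should_report_new(conn, t, flags_at(t))]
```

Here the URG packet at t=10 is reported by itself, the ACK at t=11 is not, and the next report happens only once the 30-second window has elapsed at t=40.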

mmckeen · 2025-07-10T16:33:03Z

@nddq @SRodi any further thoughts on this?

nddq

Overall lgtm, besides a minor comment. May I suggest renaming this PR to something along the lines of "fix underreporting of TCP metrics due to conntrack reporting logic"? @SRodi for another pair of eyes

mmckeen · 2025-07-15T15:53:21Z

@nddq added the remaining flags, appreciate one more review when you get the chance 🙇

SRodi

@mmckeen just a couple of very minor nits. One question: would it be worth adding this to our documentation?

mmckeen · 2025-07-16T18:42:13Z

Probably, happy to follow up on that with a future PR 🙇

SRodi

LGTM @mmckeen - please update the branch prior to merging

…s, improve scalability (#1665)

# Description

Previously `packetparser` in `high` `dataAggregationLevel` would report (mostly) every single packet since important flags were observed over the lifetime of the connection.

This changes that behavior to only observe the important flags on individual packets and report when necessary.

This will mean less packets are reported. However, it also adds back weighting for bytes, packets, and TCP flags so that metrics remain accurate versus before.

I also noticed the current docs for the TCP flags metrics are inaccurate, we only report a subset of the supported flags. Not sure if this is intentional, however supporting more flags will put more memory pressure on both conntrack as well as performance pressure on packet reporting. With sampling in place, this should be more than worth it but there may be repercussions for the performance of `low` `dataAggregationLevel`.

## Checklist

- [X] I have read the [contributing documentation](https://retina.sh/docs/Contributing/overview).
- [X] I signed and signed-off the commits (`git commit -S -s ...`). See [this documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification) on signing commits.
- [X] I have correctly attributed the author(s) of the code.
- [X] I have tested the changes locally.
- [X] I have followed the project's style guidelines.
- [X] I have updated the documentation, if necessary.
- [X] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

eBPF objects compile and load as expected.

# `main` Branch

<img width="1463" alt="tcpflags main" src="https://github.com/user-attachments/assets/167908f0-7c37-4498-a7f6-20a41110d925" />
<img width="1463" alt="prometheus packets retina main" src="https://github.com/user-attachments/assets/de8ad834-ed91-4673-8643-aa9cc51b3451" />
<img width="1463" alt="prometheus bytes retina main" src="https://github.com/user-attachments/assets/420f54dc-fbbe-4a7b-a290-24c5d6666518" />

# This Branch

<img width="1463" alt="tcpflags patched" src="https://github.com/user-attachments/assets/88460ee0-f769-4992-b9be-392b38a64b19" />
<img width="1463" alt="prometheus packets retina patched" src="https://github.com/user-attachments/assets/304bbb97-8c57-477f-9c19-91bb1510b9c7" />
<img width="1463" alt="prometheus bytes retina patched" src="https://github.com/user-attachments/assets/ef917562-487e-45be-8fbb-c9573fa708c1" />

## Additional Notes

#1628 will be a follow-up to this to add additional sampling functionality.

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more information on how to contribute to this project.

Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>

# Description

This PR allows for optional sampling of packet reporting when in high data aggregation level for `packetparser`.

By default, all packets are reported but optionally `1 out of n` packets are sampled by random chance with the exception of certain important control flags or when hitting the reporting interval.

This allows Retina to scale to high network volume environments at the trade-off of some reporting granularity. The performance impact of this is mostly for workloads with lots of new connections, connections already tracked in the conntrack table rely on #1665 for scalability. The behavior added in #1665 allows for accurate reporting of metrics despite sampling being in place.

## Related Issue

#1760

## Checklist

- [X] I have read the [contributing documentation](https://retina.sh/docs/Contributing/overview).
- [X] I signed and signed-off the commits (`git commit -S -s ...`). See [this documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification) on signing commits.
- [X] I have correctly attributed the author(s) of the code.
- [X] I have tested the changes locally.
- [X] I have followed the project's style guidelines.
- [X] I have updated the documentation, if necessary.
- [X] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

## Main

<img width="1487" height="860" alt="Screenshot 2025-07-22 at 4 51 24 PM" src="https://github.com/user-attachments/assets/72bc7b42-b280-4d10-aa7b-d114b460cd73" />

## After the change (with default sampling rate of 1)

<img width="1487" height="860" alt="Screenshot 2025-07-22 at 4 57 36 PM" src="https://github.com/user-attachments/assets/6c115205-3068-4e97-ac51-9980c088890d" />

## After the change (with sampling rate of 1000)

<img width="1487" height="856" alt="Screenshot 2025-07-22 at 5 04 22 PM" src="https://github.com/user-attachments/assets/b5e6cd5e-9c44-446f-bc1d-996044820f16" />

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more information on how to contribute to this project.

Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>
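The `1 out of n` sampling with its two exceptions could be sketched like this. It is a minimal Python illustration, not the eBPF implementation: `passes_sampling`, its parameters, and the contents of `IMPORTANT_FLAGS` are hypothetical, and where exactly the check sits relative to conntrack is not modeled.

```python
import random

# Assumed set of "important control flags"; the description does not list them.
IMPORTANT_FLAGS = frozenset({"SYN", "FIN", "RST", "URG", "ECE"})


def passes_sampling(flags, interval_elapsed, sampling_rate=1, rng=random):
    """Return True if the packet should still be considered for reporting."""
    if flags & IMPORTANT_FLAGS or interval_elapsed:
        return True  # the two exceptions in the description are never sampled out
    # randrange(n) == 0 holds with probability 1/n; n == 1 keeps every packet.
    return rng.randrange(sampling_rate) == 0


rng = random.Random(0)  # seeded only to make the sketch repeatable
kept_default = sum(passes_sampling({"ACK"}, False, 1, rng) for _ in range(10_000))
kept_sampled = sum(passes_sampling({"ACK"}, False, 1000, rng) for _ in range(10_000))
kept_syn = sum(passes_sampling({"SYN"}, False, 1000, rng) for _ in range(100))
```

With the default rate of 1 every packet passes. With a rate of 1000 only a small fraction of plain ACK packets pass, while SYN packets always do.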

mmckeen requested a review from a team as a code owner June 6, 2025 17:21

mmckeen requested review from QxBytes and vipul-21 June 6, 2025 17:21

mmckeen mentioned this pull request Jun 6, 2025

feat(packetparser): Allow sampling of packets #1628

Closed

7 tasks

mmckeen force-pushed the reportImportantPacketsTweak branch 6 times, most recently from 4384e47 to e6e6161 Compare June 12, 2025 23:31

nddq requested review from SRodi and nddq and removed request for QxBytes and vipul-21 June 13, 2025 15:12

mmckeen force-pushed the reportImportantPacketsTweak branch 2 times, most recently from 5d502a9 to 11be8f2 Compare June 18, 2025 14:37

nddq previously approved these changes Jul 13, 2025

Comment thread pkg/plugin/conntrack/_cprog/conntrack.c

mmckeen changed the title ~~feat(packetparser): Only report important packets~~ fix(packetparser): Fix under reporting of TCP flags and packet metrics, improve scalability Jul 14, 2025

mmckeen dismissed nddq’s stale review via 056e9bf July 15, 2025 15:50

mmckeen force-pushed the reportImportantPacketsTweak branch from 11be8f2 to 056e9bf Compare July 15, 2025 15:50

mmckeen requested a review from nddq July 15, 2025 15:53

mmckeen force-pushed the reportImportantPacketsTweak branch 3 times, most recently from a7fe72e to ef753d9 Compare July 16, 2025 15:38

SRodi reviewed Jul 16, 2025

Comment thread pkg/plugin/conntrack/_cprog/conntrack.c Outdated

Comment thread pkg/plugin/conntrack/_cprog/conntrack.c Outdated

mmckeen force-pushed the reportImportantPacketsTweak branch from ef753d9 to ccf1a2e Compare July 16, 2025 18:47

mmckeen requested a review from SRodi July 16, 2025 18:47

fix(packetparser): Fix under reporting of TCP flags and packet metric…

c66937f

…s, improve scalability Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>

mmckeen force-pushed the reportImportantPacketsTweak branch from ccf1a2e to c66937f Compare July 16, 2025 20:16

SRodi approved these changes Jul 17, 2025

Merge branch 'main' into reportImportantPacketsTweak

26c3b3a

nddq approved these changes Jul 17, 2025

nddq added this pull request to the merge queue Jul 17, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 17, 2025

nddq added this pull request to the merge queue Jul 17, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 17, 2025

nddq added this pull request to the merge queue Jul 17, 2025

Merged via the queue into microsoft:main with commit 95a48c1 Jul 17, 2025
31 checks passed

mmckeen mentioned this pull request Jul 21, 2025

Retina-agent high CPU usage #1760

Closed

mmckeen deleted the reportImportantPacketsTweak branch July 22, 2025 21:05

mmckeen mentioned this pull request Jul 22, 2025

feat(packetparser): Allow sampling of packets #1767

Merged

7 tasks

nddq merged 2 commits into microsoft:main from mmckeen:reportImportantPacketsTweak
