fix(packetparser): Fix under reporting of TCP flags and packet metrics, improve scalability#1665

Merged: nddq merged 2 commits into microsoft:main from mmckeen:reportImportantPacketsTweak on Jul 17, 2025

Conversation

@mmckeen

@mmckeen mmckeen commented Jun 6, 2025

Contributor

Description

Previously packetparser in high dataAggregationLevel would report (mostly) every single packet since important flags were observed over the lifetime of the connection.

This changes that behavior to only observe the important flags on individual packets and report when necessary.

This will mean fewer packets are reported. However, it also adds back weighting for bytes, packets, and TCP flags, so metrics remain accurate relative to before.
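
A minimal sketch of the weighting idea, assuming each connection entry accumulates the weight of whatever was skipped since its last report and hands it off with the next reported packet. The names here (`ConnEntry`, `observe`, the `previously_observed_*` keys) are hypothetical and chosen for illustration; they are not taken from the Retina source.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ConnEntry:
    """Hypothetical per-connection state; not the actual conntrack struct."""
    skipped_packets: int = 0
    skipped_bytes: int = 0
    skipped_flags: Counter = field(default_factory=Counter)


def observe(entry, size, flags, report):
    """Process one packet; return an event dict if reported, else None."""
    if not report:
        # Skipped packet: remember its weight so the next report can carry it.
        entry.skipped_packets += 1
        entry.skipped_bytes += size
        entry.skipped_flags.update(flags)
        return None
    event = {
        "bytes": size,
        "flags": tuple(flags),
        "previously_observed_packets": entry.skipped_packets,
        "previously_observed_bytes": entry.skipped_bytes,
        "previously_observed_flags": dict(entry.skipped_flags),
    }
    # Reset the accumulators once their contents have been handed off.
    entry.skipped_packets = 0
    entry.skipped_bytes = 0
    entry.skipped_flags = Counter()
    return event


entry = ConnEntry()
observe(entry, 100, ["ACK"], report=False)
observe(entry, 200, ["PSH", "ACK"], report=False)
event = observe(entry, 50, ["ACK"], report=True)

# A consumer adds the carried weight to the reported packet's own values.
total_packets = 1 + event["previously_observed_packets"]
total_bytes = event["bytes"] + event["previously_observed_bytes"]
```

In this sketch the consumer recovers the true packet and byte totals even though only one of the three packets produced an event.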

I also noticed that the current docs for the TCP flags metrics are inaccurate: we only report a subset of the supported flags. I'm not sure whether this is intentional; however, supporting more flags will put more memory pressure on conntrack and more performance pressure on packet reporting. With sampling in place, this should be more than worth it, but there may be repercussions for the performance of low dataAggregationLevel.

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed

eBPF objects compile and load as expected.

main Branch

[Screenshots: tcpflags (main); Prometheus packets, Retina (main); Prometheus bytes, Retina (main)]

This Branch

[Screenshots: tcpflags (patched); Prometheus packets, Retina (patched); Prometheus bytes, Retina (patched)]

Additional Notes

#1628 will be a follow-up to this to add additional sampling functionality.


Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.

@mmckeen mmckeen requested a review from a team as a code owner June 6, 2025 17:21
@mmckeen mmckeen requested review from QxBytes and vipul-21 June 6, 2025 17:21
@mmckeen mmckeen force-pushed the reportImportantPacketsTweak branch 6 times, most recently from 4384e47 to e6e6161 Compare June 12, 2025 23:31
@nddq nddq requested review from SRodi and nddq and removed request for QxBytes and vipul-21 June 13, 2025 15:12
@mmckeen mmckeen force-pushed the reportImportantPacketsTweak branch 2 times, most recently from 5d502a9 to 11be8f2 Compare June 18, 2025 14:37
@mmckeen

mmckeen commented Jun 20, 2025

Contributor Author

@nddq @SRodi this is ready for review 🙇

@nddq

nddq commented Jun 26, 2025

Member

@mmckeen sorry for the delay, I just got back from a break. I’ve gone through your proposed change a couple of times, and it looks solid to me. In fact, it addresses something we initially overlooked when conntrack was introduced. First, a few points to make sure we're aligned:

> Previously packetparser in high dataAggregationLevel would report (mostly) every single packet since important flags were observed over the lifetime of the connection.

That’s correct. For any given packet, we currently:

  • Report it if it contains a flag set we haven’t seen before for this specific connection
  • Report it if a certain amount of time has passed since the last reported packet for this connection (default: 30 seconds; applies to both dataAggregation levels)
  • Otherwise, we skip it

In a typical TCP connection, the reported events would likely look like:
SYN, SYN-ACK, ACK, PSH, PSH-ACK, (30 secs), PSH, PSH-ACK, ... FIN, FIN-ACK, FIN-ACK

As a result, we ignore all packets during those 30-second windows, which skews the reported packet, byte, and TCP flag counts from the actual values.
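
The rules listed above can be sketched roughly as follows. This is an illustration of the logic as described in this comment, not the conntrack code itself: the `Conn` class and `should_report` function are hypothetical, and "a flag set we haven't seen before" is interpreted here as "at least one flag not yet seen on this connection", which is an assumption.

```python
REPORT_INTERVAL_SECS = 30  # default interval mentioned in the comment above


class Conn:
    """Hypothetical connection state, for illustration only."""

    def __init__(self):
        self.flags_seen = set()   # flags observed over the connection's lifetime
        self.last_report = None   # timestamp of the last reported packet


def should_report(conn, flags, now):
    """Report on a previously unseen flag or an elapsed interval; otherwise skip."""
    new_flags = set(flags) - conn.flags_seen
    interval_elapsed = (
        conn.last_report is None
        or now - conn.last_report >= REPORT_INTERVAL_SECS
    )
    conn.flags_seen |= set(flags)
    if new_flags or interval_elapsed:
        conn.last_report = now
        return True
    return False


conn = Conn()
packets = [
    (0, ["SYN"]),
    (1, ["ACK"]),
    (2, ["PSH", "ACK"]),
    (3, ["ACK"]),         # nothing new, interval not elapsed: skipped
    (10, ["PSH", "ACK"]), # skipped as well
    (32, ["ACK"]),        # 30 seconds since the last report: reported
    (40, ["FIN", "ACK"]), # FIN is new: reported
]
reported = [should_report(conn, flags, now) for now, flags in packets]
```

The two skipped packets contribute nothing to the reported counts, which is the under-reporting gap discussed here.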

That said:

> This changes that behavior to only observe the important flags on individual packets and report when necessary.

Could you clarify this part? From what I see, conntrack already behaves this way today — so I’m not sure this change introduces new behavior?

> This will mean less packets are reported. However, it also adds back weighting for bytes, packets, and TCP flags so that metrics remain accurate versus before.

This is great — it addresses the gap I mentioned earlier around ignored packets. So in that sense, this feels more like a bug fix than a new feature 🙂

@mmckeen

mmckeen commented Jun 26, 2025

Contributor Author

I think it's a bit of both a bug fix and a new feature.

This will also always report a packet when it carries an important flag such as TCP_URG or TCP_ECE, but only that packet, not the rest of the connection.

But yes, overall I think this is more of a bug fix. The rest is a minor change to reflect the expected functionality: the 30-second reporting window is respected for connections without new flags.
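
A rough sketch of the per-packet behavior described in this reply, under stated assumptions: the `IMPORTANT_FLAGS` set contains only the two flags named above (the actual set in the PR may be larger), the function names are hypothetical, and any reported packet is assumed to restart the reporting interval.

```python
REPORT_INTERVAL_SECS = 30
# Assumed set of "important" flags; the reply names only URG and ECE as examples.
IMPORTANT_FLAGS = {"URG", "ECE"}


def should_report(flags, last_report, now):
    """Per-packet check: only this packet's own flags are inspected."""
    if IMPORTANT_FLAGS & set(flags):
        return True
    return last_report is None or now - last_report >= REPORT_INTERVAL_SECS


def trace(packets):
    """Run the check over (timestamp, flags) pairs for a single connection."""
    decisions = []
    last_report = None
    for now, flags in packets:
        report = should_report(flags, last_report, now)
        if report:
            last_report = now  # assumption: any report restarts the interval
        decisions.append(report)
    return decisions


decisions = trace([
    (0, ["ACK"]),          # first packet of the connection: reported
    (5, ["URG", "ACK"]),   # carries an important flag: reported
    (6, ["ACK"]),          # URG on the previous packet does not carry over: skipped
    (20, ["ACK"]),         # still inside the 30-second window: skipped
    (35, ["ACK"]),         # 30 seconds since the last report: reported
])
```

Unlike a check against flags accumulated over the connection's lifetime, the important flag on the second packet has no effect on the packets that follow it.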

@mmckeen

mmckeen commented Jul 10, 2025

Contributor Author

@nddq @SRodi any further thoughts on this?

nddq
nddq previously approved these changes Jul 13, 2025

@nddq nddq left a comment

Member


Overall lgtm, aside from a minor comment. May I suggest renaming this PR to something along the lines of "fix underreporting of TCP metrics due to conntrack reporting logic"? @SRodi for another pair of eyes

Comment thread pkg/plugin/conntrack/_cprog/conntrack.c
@mmckeen mmckeen changed the title from "feat(packetparser): Only report important packets" to "fix(packetparser): Fix under reporting of TCP flags and packet metrics, improve scalability" Jul 14, 2025
@mmckeen mmckeen force-pushed the reportImportantPacketsTweak branch from 11be8f2 to 056e9bf Compare July 15, 2025 15:50
@mmckeen mmckeen requested a review from nddq July 15, 2025 15:53
@mmckeen

mmckeen commented Jul 15, 2025

Contributor Author

@nddq added the remaining flags, appreciate one more review when you get the chance 🙇

@mmckeen mmckeen force-pushed the reportImportantPacketsTweak branch 3 times, most recently from a7fe72e to ef753d9 Compare July 16, 2025 15:38

@SRodi SRodi left a comment

Member


@mmckeen just a couple of very minor nits. One question: would it be worth adding this to our documentation?

Comment thread pkg/plugin/conntrack/_cprog/conntrack.c Outdated
Comment thread pkg/plugin/conntrack/_cprog/conntrack.c Outdated
@mmckeen

mmckeen commented Jul 16, 2025

Contributor Author

> @mmckeen just a couple of very minor nits. One question: would it be worth adding this to our documentation?

Probably, happy to follow up on that with a future PR 🙇

@mmckeen mmckeen force-pushed the reportImportantPacketsTweak branch from ef753d9 to ccf1a2e Compare July 16, 2025 18:47
@mmckeen mmckeen requested a review from SRodi July 16, 2025 18:47
…s, improve scalability

Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>
@mmckeen mmckeen force-pushed the reportImportantPacketsTweak branch from ccf1a2e to c66937f Compare July 16, 2025 20:16

@SRodi SRodi left a comment

Member


LGTM @mmckeen - please update the branch prior to merging

@nddq nddq added this pull request to the merge queue Jul 17, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Jul 17, 2025
…s, improve scalability (#1665)

# Description

Previously `packetparser` in `high` `dataAggregationLevel` would report
(mostly) every single packet since important flags were observed over
the lifetime of the connection.

This changes that behavior to only observe the important flags on
individual packets and report when necessary.

This will mean fewer packets are reported. However, it also adds back
weighting for bytes, packets, and TCP flags, so metrics remain
accurate relative to before.

I also noticed that the current docs for the TCP flags metrics are
inaccurate: we only report a subset of the supported flags. I'm not sure
whether this is intentional; however, supporting more flags will put more
memory pressure on conntrack and more performance pressure on packet
reporting. With sampling in place, this should be more than worth it, but
there may be repercussions for the performance of `low`
`dataAggregationLevel`.

## Checklist

- [X] I have read the [contributing
documentation](https://retina.sh/docs/Contributing/overview).
- [X] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [X] I have correctly attributed the author(s) of the code.
- [X] I have tested the changes locally.
- [X] I have followed the project's style guidelines.
- [X] I have updated the documentation, if necessary.
- [X] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

eBPF objects compile and load as expected.

# `main` Branch

<img width="1463" alt="tcpflags main"
src="https://github.com/user-attachments/assets/167908f0-7c37-4498-a7f6-20a41110d925"
/>
<img width="1463" alt="prometheus packets retina main"
src="https://github.com/user-attachments/assets/de8ad834-ed91-4673-8643-aa9cc51b3451"
/>
<img width="1463" alt="prometheus bytes retina main"
src="https://github.com/user-attachments/assets/420f54dc-fbbe-4a7b-a290-24c5d6666518"
/>

# This Branch

<img width="1463" alt="tcpflags patched"
src="https://github.com/user-attachments/assets/88460ee0-f769-4992-b9be-392b38a64b19"
/>
<img width="1463" alt="prometheus packets retina patched"
src="https://github.com/user-attachments/assets/304bbb97-8c57-477f-9c19-91bb1510b9c7"
/>
<img width="1463" alt="prometheus bytes retina patched"
src="https://github.com/user-attachments/assets/ef917562-487e-45be-8fbb-c9573fa708c1"
/>

## Additional Notes

#1628 will be a follow-up to
this to add additional sampling functionality.

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more
information on how to contribute to this project.

Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 17, 2025
@nddq nddq added this pull request to the merge queue Jul 17, 2025
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 17, 2025
@nddq nddq added this pull request to the merge queue Jul 17, 2025
Merged via the queue into microsoft:main with commit 95a48c1 Jul 17, 2025
31 checks passed
@mmckeen mmckeen deleted the reportImportantPacketsTweak branch July 22, 2025 21:05
mereta pushed a commit that referenced this pull request Dec 2, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Dec 3, 2025
# Description

This PR allows for optional sampling of packet reporting when in high
data aggregation level for `packetparser`.

By default, all packets are reported. Optionally, `1 out of n` packets
are sampled at random; packets carrying certain important control flags,
or arriving when the reporting interval is hit, are always reported.

This allows Retina to scale to high network volume environments at the
trade-off of some reporting granularity.

The performance impact of this is mostly for workloads with lots of new
connections; connections already tracked in the conntrack table rely on
#1665 for scalability.

The behavior added in #1665
allows for accurate reporting of metrics despite sampling being in
place.
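
A small sketch of why weighting keeps totals accurate under sampling, for a
single connection. Everything here is an assumption made for illustration:
the `run` function, the event fields, and the `IMPORTANT_FLAGS` set are
hypothetical and not taken from the Retina source.

```python
import random

REPORT_INTERVAL_SECS = 30
IMPORTANT_FLAGS = {"SYN", "FIN", "RST"}  # assumed set, for illustration


def run(packets, sample_rate, rng):
    """Return reported events; each carries the weight of packets skipped before it."""
    events = []
    skipped_packets = skipped_bytes = 0
    last_report = None
    for now, size, flags in packets:
        important = bool(IMPORTANT_FLAGS & set(flags))
        interval_hit = last_report is None or now - last_report >= REPORT_INTERVAL_SECS
        # 1-in-n chance; always true when sample_rate is 1.
        sampled = rng.randrange(sample_rate) == 0
        if important or interval_hit or sampled:
            events.append({
                "packets": 1 + skipped_packets,
                "bytes": size + skipped_bytes,
            })
            skipped_packets = skipped_bytes = 0
            last_report = now
        else:
            skipped_packets += 1
            skipped_bytes += size
    return events


packets = (
    [(0, 60, ["SYN"])]
    + [(t, 1000, ["ACK"]) for t in range(1, 21)]
    + [(21, 60, ["FIN", "ACK"])]
)
sampled_events = run(packets, sample_rate=1000, rng=random.Random(0))
unsampled_events = run(packets, sample_rate=1, rng=random.Random(0))
```

Because the final FIN packet is always reported in this sketch, it flushes
the accumulated weight, so the summed `packets` and `bytes` match the real
traffic at any sampling rate even though far fewer events are emitted.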

## Related Issue

#1760

## Checklist

- [X] I have read the [contributing
documentation](https://retina.sh/docs/Contributing/overview).
- [X] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [X] I have correctly attributed the author(s) of the code.
- [X] I have tested the changes locally.
- [X] I have followed the project's style guidelines.
- [X] I have updated the documentation, if necessary.
- [X] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

## Main

<img width="1487" height="860" alt="Screenshot 2025-07-22 at 4 51 24 PM"
src="https://github.com/user-attachments/assets/72bc7b42-b280-4d10-aa7b-d114b460cd73"
/>

## After the change (with default sampling rate of 1)

<img width="1487" height="860" alt="Screenshot 2025-07-22 at 4 57 36 PM"
src="https://github.com/user-attachments/assets/6c115205-3068-4e97-ac51-9980c088890d"
/>

## After the change (with sampling rate of 1000)

<img width="1487" height="856" alt="Screenshot 2025-07-22 at 5 04 22 PM"
src="https://github.com/user-attachments/assets/b5e6cd5e-9c44-446f-bc1d-996044820f16"
/>

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more
information on how to contribute to this project.

Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>
3 participants