fix(packetparser): Fix under reporting of TCP flags and packet metrics, improve scalability#1665
Conversation
4384e47 to
e6e6161
Compare
5d502a9 to
11be8f2
Compare
|
@mmckeen sorry for the delay, I just got back from a break. I’ve gone through your proposed change a couple of times, and it looks solid to me. In fact, it addresses something we initially overlooked when conntrack was introduced. First, a few points to make sure we're aligned:
That’s correct. For any given packet, we currently:
In a typical TCP connection, the reported events would likely look like: As a result, we ignore all packets during those 30-second windows, which skews the reported packet, byte, and TCP flag counts from the actual values. That said:
Could you clarify this part? From what I see, conntrack already behaves this way today — so I’m not sure this change introduces new behavior?
This is great — it addresses the gap I mentioned earlier around ignored packets. So in that sense, this feels more like a bug fix than a new feature 🙂 |
I think it's a bit of both a bug fix and a new feature. This will also always report packets if we hit important flags like But yes, I think overall this is more a bug fix and that is just a minor change to reflect expected functionality that the 30 second reporting window is respected for connections without new flags. |
11be8f2 to
056e9bf
Compare
|
@nddq added the remaining flags, appreciate one more review when you get the chance 🙇 |
a7fe72e to
ef753d9
Compare
Probably, happy to follow up on that with a future PR 🙇 |
ef753d9 to
ccf1a2e
Compare
…s, improve scalability Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>
ccf1a2e to
c66937f
Compare
…s, improve scalability (#1665) # Description Previously `packetparser` in `high` `dataAggregationLevel` would report (mostly) every single packet since important flags were observed over the lifetime of the connection. This changes that behavior to only observe the important flags on individual packets and report when necessary. This will mean less packets are reported. However, it also adds back weighting for bytes, packets, and TCP flags so that metrics remain accurate versus before. I also noticed the current docs for the TCP flags metrics are inaccurate, we only report a subset of the supported flags. Not sure if this is intentional, however supporting more flags will put more memory pressure on both conntrack as well as performance pressure on packet reporting. With sampling in place, this should be more than worth it but there may be repercussions for the performance of `low` `dataAggregationLevel`. ## Checklist - [X] I have read the [contributing documentation](https://retina.sh/docs/Contributing/overview). - [X] I signed and signed-off the commits (`git commit -S -s ...`). See [this documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification) on signing commits. - [X] I have correctly attributed the author(s) of the code. - [X] I have tested the changes locally. - [X] I have followed the project's style guidelines. - [X] I have updated the documentation, if necessary. - [X] I have added tests, if applicable. ## Screenshots (if applicable) or Testing Completed eBPF objects compile and load as expected. # `main` Branch <img width="1463" alt="tcpflags main" src="https://github.com/user-attachments/assets/167908f0-7c37-4498-a7f6-20a41110d925" /> <img width="1463" alt="prometheus packets retina main" src="https://github.com/user-attachments/assets/de8ad834-ed91-4673-8643-aa9cc51b3451" /> <img width="1463" alt="prometheus bytes retina main" src="https://github.com/user-attachments/assets/420f54dc-fbbe-4a7b-a290-24c5d6666518" /> # This Branch <img width="1463" alt="tcpflags patched" src="https://github.com/user-attachments/assets/88460ee0-f769-4992-b9be-392b38a64b19" /> <img width="1463" alt="prometheus packets retina patched" src="https://github.com/user-attachments/assets/304bbb97-8c57-477f-9c19-91bb1510b9c7" /> <img width="1463" alt="prometheus bytes retina patched" src="https://github.com/user-attachments/assets/ef917562-487e-45be-8fbb-c9573fa708c1" /> ## Additional Notes #1628 will be a follow-up to this to add additional sampling functionality. --- Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more information on how to contribute to this project. Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>
…s, improve scalability (#1665) # Description Previously `packetparser` in `high` `dataAggregationLevel` would report (mostly) every single packet since important flags were observed over the lifetime of the connection. This changes that behavior to only observe the important flags on individual packets and report when necessary. This will mean less packets are reported. However, it also adds back weighting for bytes, packets, and TCP flags so that metrics remain accurate versus before. I also noticed the current docs for the TCP flags metrics are inaccurate, we only report a subset of the supported flags. Not sure if this is intentional, however supporting more flags will put more memory pressure on both conntrack as well as performance pressure on packet reporting. With sampling in place, this should be more than worth it but there may be repercussions for the performance of `low` `dataAggregationLevel`. ## Checklist - [X] I have read the [contributing documentation](https://retina.sh/docs/Contributing/overview). - [X] I signed and signed-off the commits (`git commit -S -s ...`). See [this documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification) on signing commits. - [X] I have correctly attributed the author(s) of the code. - [X] I have tested the changes locally. - [X] I have followed the project's style guidelines. - [X] I have updated the documentation, if necessary. - [X] I have added tests, if applicable. ## Screenshots (if applicable) or Testing Completed eBPF objects compile and load as expected. # `main` Branch <img width="1463" alt="tcpflags main" src="https://github.com/user-attachments/assets/167908f0-7c37-4498-a7f6-20a41110d925" /> <img width="1463" alt="prometheus packets retina main" src="https://github.com/user-attachments/assets/de8ad834-ed91-4673-8643-aa9cc51b3451" /> <img width="1463" alt="prometheus bytes retina main" src="https://github.com/user-attachments/assets/420f54dc-fbbe-4a7b-a290-24c5d6666518" /> # This Branch <img width="1463" alt="tcpflags patched" src="https://github.com/user-attachments/assets/88460ee0-f769-4992-b9be-392b38a64b19" /> <img width="1463" alt="prometheus packets retina patched" src="https://github.com/user-attachments/assets/304bbb97-8c57-477f-9c19-91bb1510b9c7" /> <img width="1463" alt="prometheus bytes retina patched" src="https://github.com/user-attachments/assets/ef917562-487e-45be-8fbb-c9573fa708c1" /> ## Additional Notes #1628 will be a follow-up to this to add additional sampling functionality. --- Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more information on how to contribute to this project. Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>
# Description This PR allows for optional sampling of packet reporting when in high data aggregation level for `packetparser`. By default, all packets are reported but optionally `1 out of n` packets are sampled by random chance with the exception of certain important control flags or when hitting the reporting interval. This allows Retina to scale to high network volume environments at the trade-off of some reporting granularity. The performance impact of this is mostly for workloads with lots of new connections, connections already tracked in the conntrack table rely on #1665 for scalability. The behavior added in #1665 allows for accurate reporting of metrics despite sampling being in place. ## Related Issue #1760 ## Checklist - [X] I have read the [contributing documentation](https://retina.sh/docs/Contributing/overview). - [X] I signed and signed-off the commits (`git commit -S -s ...`). See [this documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification) on signing commits. - [X] I have correctly attributed the author(s) of the code. - [X] I have tested the changes locally. - [X] I have followed the project's style guidelines. - [X] I have updated the documentation, if necessary. - [X] I have added tests, if applicable. ## Screenshots (if applicable) or Testing Completed ## Main <img width="1487" height="860" alt="Screenshot 2025-07-22 at 4 51 24 PM" src="https://github.com/user-attachments/assets/72bc7b42-b280-4d10-aa7b-d114b460cd73" /> ## After the change (with default sampling rate of 1) <img width="1487" height="860" alt="Screenshot 2025-07-22 at 4 57 36 PM" src="https://github.com/user-attachments/assets/6c115205-3068-4e97-ac51-9980c088890d" /> ## After the change (with sampling rate of 1000) <img width="1487" height="856" alt="Screenshot 2025-07-22 at 5 04 22 PM" src="https://github.com/user-attachments/assets/b5e6cd5e-9c44-446f-bc1d-996044820f16" /> --- Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more information on how to contribute to this project. Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>
Description
Previously
packetparserinhighdataAggregationLevelwould report (mostly) every single packet since important flags were observed over the lifetime of the connection.This changes that behavior to only observe the important flags on individual packets and report when necessary.
This will mean less packets are reported. However, it also adds back weighting for bytes, packets, and TCP flags so that metrics remain accurate versus before.
I also noticed the current docs for the TCP flags metrics are inaccurate, we only report a subset of the supported flags. Not sure if this is intentional, however supporting more flags will put more memory pressure on both conntrack as well as performance pressure on packet reporting. With sampling in place, this should be more than worth it but there may be repercussions for the performance of
lowdataAggregationLevel.Checklist
git commit -S -s ...). See this documentation on signing commits.Screenshots (if applicable) or Testing Completed
eBPF objects compile and load as expected.
mainBranchThis Branch
Additional Notes
#1628 will be a follow-up to this to add additional sampling functionality.
Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.