Improvements to the ACR IPC protocol#1855
Conversation
Protocol-level messages: - Hello - HelloResp - ConfigData - CheckDataMsg - CheckDataAck - CheckResultMsg Add Histogram and EventPlatformEvent data types to the CheckData oneof. Extend MetricType with MonotonicCount and Historate. Remove gRPC service definition as it is no longer needed.
Restore the AcrIpc gRPC service with four RPCs: - Handshake (unary) - SendCheckData (unary with ACK) - SendCheckResult (unary) - StreamConfig (server streaming for config push). Re-enable server and client codegen for the checks proto compilation.
…init_config extend Log: - Add fields to the Log message to support all the metadata the agent and ADP support - The fields map 1:1 onto ADP's internal Log struct Histogram / sketch: - Renamed histogram.proto to sketch.proto and replaced its contents with the DDSketch wire format used by the agent's SketchPayload - CheckData oneof slot 5 is repurposed to carry pre-aggregated sketches Note: lib/saluki-components/src/sources/checks_ipc/mod.rs is left broken — it still imports the old Checks service. init_config: - The check API has both `config` and `init_config`, and many integrations rely on the init_config block for shared defaults - The previous CheckInstance message only carried the per-instance config, so an ACR-scheduled check would receive an empty init_config openmetrics: - Add support for the new openmetrics histogram bucket type
|
| message SendCheckResultResponse {} | ||
|
|
||
| // gRPC service for ACR <-> ADP IPC. | ||
| service AcrIpc { |
There was a problem hiding this comment.
Could we just call this CheckRunner, and drop the IPC part?
More generic, fits the fact that realistically anything can implement it, avoids the stuttering of "IPC" in the name when that's already implied by how it's used, etc.
| rpc Handshake(Hello) returns (HelloResp); | ||
|
|
||
| // Check data submission with acknowledgement. | ||
| rpc SendCheckData(CheckDataMsg) returns (CheckDataAck); | ||
|
|
||
| // Check result reporting. | ||
| rpc SendCheckResult(CheckResultMsg) returns (SendCheckResultResponse); | ||
|
|
||
| // Server pushes config updates to the client as a stream. | ||
| rpc StreamConfig(StreamConfigRequest) returns (stream ConfigData); |
There was a problem hiding this comment.
I realize that this is unwritten, but: can we follow the existing naming scheme for gRPC request/response message types in the Agent ecosystem? FooRequest for requests and FooResponse for responses, such that they're always paired by prefix for the same RPC operation.
As well, don't truncate words like request, response, or message.
| // Slot 5 was previously `histogram` (an unused stub). Repurposed | ||
| // to carry pre-aggregated DDSketches; the wire bytes from any | ||
| // legacy producer of the old Histogram message would be | ||
| // mis-parsed here, so don't rely on cross-version compatibility | ||
| // for slot 5 specifically. |
There was a problem hiding this comment.
Nit: comments that describe a change are a code smell of not reviewing LLM output.
We document to explain the current state of things, Git history is for everything that came before, unless there's a strong reason to couple them together.
There was a problem hiding this comment.
Yup, there are some commits that are mostly AI generated. Most of the code has been reviewed piecemeal, but I opened the PR as a draft so that I could do a higher-level review (I like the ability to mark files as reviewed because it helps me remember where I left off). Sorry for sending you the slop!
There was a problem hiding this comment.
It is in draft so I definitely don't see it as you asking us/me to review slop. 😅
If nothing else, this should mostly just serve as a reminder that LLMs love to do this sort of thing (code comments as a description of what they changed, rather than the thing they changed something to) and that we need to be vigilant in cleaning that up.
| // Distinguishes the two sender upcalls. False (the default) | ||
| // routes to `Sender.OpenmetricsBucket`; true routes to | ||
| // `Sender.HistogramBucket`. Python `submit_histogram_bucket` | ||
| // always lands as false. |
There was a problem hiding this comment.
Nit: opaque comment that requires implicit cross-repository knowledge.
It's also... confusing at face value. The naming implies multiple buckets, but this type is, definitionally, a value representing a single bucket.
| /// AcrIpc protocol version this server speaks. Bumped on | ||
| /// wire-incompatible changes; today the protocol is wire-additive so | ||
| /// we stay on 1. | ||
| const PROTOCOL_VERSION: u32 = 1; |
There was a problem hiding this comment.
Nit: we should probably define this as an enum in the checks proto stuff so that we have a consistent baseline of what protocol versions are valid or not.
|
|
||
| /// Identifier surfaced in the Handshake response so a client can tell | ||
| /// which server flavor it connected to. | ||
| const SERVER_ID: &str = "saluki-checks-ipc"; |
There was a problem hiding this comment.
We should probably be deriving the server ID through the information exposed in saluki_metadata (similarly for client ID on the ACR side) so that we can get things like the actual version of ADP for free.
Summary
Implement improvements to the IPC protocol between ADP and ACR.
Adds new types (sketches for histograms, the new openmetrics histogram bucket, extend log type). Also tries to make the protocol more robust and build failure detection into it.
How did you test this PR?
Ran ACR against both ADP and the core agent to confirm data types are correctly sent