Logs export / Observability

Currently logging is expensive in production and hence we need to be careful and 'minimal' in the logs which we do generate.
The idea here is that we want to simplify the logging story using more 'standard-practice' techniques rather than just logging on stdout, and hence also free up logging budget for observability.
The eventual aim is to enable in production tracing of: 'occasional' state machines (node startup, snapshotting, reconfiguration), per-request profiling, active trace validation of ccfraft.tla; however that is all quite far away still.

For now I propose the following rough logging structure to make it easy to enable telemetry in production and simplify and improve the current state, primarily by adding log export to fluentd.

```mermaid
flowchart LR
 subgraph Node["CCF node"]
 PROD["Producer threads LOG_*_FMT macros"]
 TRACE["Producer threads TRACE(<tag>, data)"]
 CONS["Console sink (stdout / journald)"]
 FLU["Fluentd sink per-thread queues + host reader thread"]
 PROD --> CONS
 PROD --> FLU
 TRACE --> CONS
 TRACE --> FLU
 end
 FLUENTD["fluentd in_forward :24224"]
 MDSD["mdsd"]
 GENEVA["Geneva Logs"]
 FLU -->|msgpack Forward over TCP| FLUENTD
 FLUENTD --> MDSD
 MDSD --> GENEVA
```

## Configuration UX

This is what the config could roughly look like:
```
"logging": {
 "sinks": [
 { "type": "console", "format": "Text", "level": "Info" },
 {
 "type": "fluentd_forward",
 "level": "Debug",
 "fluentd_tag": "ccf.node.1",
 "address": "127.0.0.1:24224",
 "thread_queue_capacity_bytes": "128KB",
 "reconnect_backoff": "1s",
 "connect_timeout": "2s",
 }
 ]
}
```

The default should be to preserve compatibility with what we have currently.

Note: log level should still be controlled by the CLI but that should be the most verbose level that we allow and per-sink levels are clamped silently to that.
That allows us to have just FAIL to STDOUT but INFO or DEBUG to fluentd.

## Constraints
- Performance impact should be minimal
 - Backpressure from fluentd cannot block the cchost process
 - Likely only option is to drop packets, but log to stderr that it happened (w/ debounce and aggregate count)
- Log level captured at CLI
- How to log before fluentd connected? Just buffer?

## Rough implementation plan
- Micro-msgpack library for emitting the trace messages we need
- Export to fluentd
- Performance overhead test with trace logging of requests, for landmark style performance metrics

## Non-goals
- Logging semantics
 - Message ordering vs timestamp 
 - Just sort the flow by timestamp
 - Clock sync in a distributed deployment
 - We're considering adding vector clocks or lamport clocks to the raft messages to allow a total order over these that respects causality.
 - Other cases MUST NOT rely on cross instance timestamps
 - Dynamic export sinks
- Private data in loglines controlled on a per-log-level basis, CLI arg takes care of this
- Migration of existing json logging to msgpack
 - I think we want json as a midpoint which is machine and human readable
- TLS between cchost and fluentd
 - There is support for TLS for fluentd `in_forward` but as its all on the same vnet (or localhost) this is unnecessary for v0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Logs export / Observability #7858

Configuration UX

Constraints

Rough implementation plan

Non-goals

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Logs export / Observability #7858

Description

Configuration UX

Constraints

Rough implementation plan

Non-goals

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions