issue-#151 #166

Merged
andrewnguyen22 merged 5 commits into main from issue-#151
Apr 29, 2025

@andrewnguyen22
Collaborator

Add Prometheus Metrics for Guard Rails Documentation

Description

This PR introduces Prometheus metrics for monitoring key blockchain performance indicators. The updated Guard Rails documentation now includes limits and alert recommendations for these metrics.

Why?

  • Provides real-time monitoring of node health.
  • Helps identify performance bottlenecks.
  • Enables alerting for critical blockchain operations.

Closes

closes #151

Changes Summary

  • Added Prometheus metrics for:

    • NodeStatus
    • TotalPeers
    • LastHeightTime
    • ValidatorStatus
    • BFTRound
    • BFTElectionTime
    • BFTElectionVoteTime
    • BFTProposeTime
    • BFTProposeVoteTime
    • BFTPrecommitTime
    • BFTPrecommitVoteTime
    • BFTCommitTime
    • BFTCommitProcessTime
    • NonSignerPercent
    • LargestTxSize
    • BlockSize
    • BlockProcessingTime
    • BlockVDFIterations
    • RootChainInfoTime
    • DBPartitionTime
    • DBPartitionEntries
    • DBPartitionSize
    • DBCommitTime
    • DBCommitEntries
    • DBCommitSize
    • MempoolSize
    • MempoolCount
  • Added documentation to reflect Prometheus-compatible monitoring

// GUARD RAILS DOCUMENTATION:
// *************************************************************************************************************
// This section describes 1) hard limits and 2) soft limit alert recommendations for health related metrics
//
// Metric Name          | Hard Limit  | Soft Limit | Note
// --------------------------------------------------------------------------------------------------------------------------------------
// NodeStatus           | 0           | n/a        |
// TotalPeers           | 0 peers     | 1 peer     |
// LastHeightTime       | n/a         | 5 min      | Just over 3 rounds at 20s blocks
// ValidatorStatus      | n/a         | not 1      | Monitor unexpected Pause or Unstaking
// BFTRound             | n/a         | 3 rounds   | Soft = Just below the 'LastHeight' time
// BFTElectionTime      | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing
// BFTElectionVoteTime  | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing
// BFTProposeTime       | 4 secs      | 3 secs     | Hard = config, Soft = 75% of config timing
// BFTProposeVoteTime   | 4 secs      | 3 secs     | Hard = config, Soft = 75% of config timing
// BFTPrecommitTime     | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing
// BFTPrecommitVoteTime | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing
// BFTCommitTime        | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing
// BFTCommitProcessTime | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing
// NonSignerPercent     | 33%         | 10%        | Hard = BFT upper bound
// LargestTxSize        | 4KB         | 3KB        | Hard = default mempool config, Soft = 75% of hard
// BlockSize            | 1MB-1652B   | 750KB      | Hard = param - MaxBlockHeader, Soft = 75% of param
// BlockProcessingTime  | 4 secs      | 3 secs     | Hard = MIN(ProposeTimeoutMS, ProposeVoteTimeoutMS)
// BlockVDFIterations   | n/a         | 0          | Soft = unexpected behavior
// RootChainInfoTime    | 2 secs      | 1 sec      | Hard = 10% of block time
// DBPartitionTime      | 10 min      | 5 min      | Hard = arbitrary / high likelihood of interruption
// DBPartitionEntries   | 2,000,000   | 1,500,000  | Hard = Badger default limit (configurable)
// DBPartitionSize      | 128MB       | 75MB       | Hard = Badger set limit (configurable)
// DBCommitTime         | 3 secs      | 2 secs     | Hard = soft of BlockProcessingTime
// DBCommitEntries      | 2,000,000   | 1,500,000  | Hard = Badger default limit (configurable)
// DBCommitSize         | 128MB       | 10MB       | Hard = Badger set limit (configurable)
// MempoolSize          | 10MB        | 2MB        | Hard = default config, Soft = 2 blocks
// MempoolCount         | 5,000       | 3,500      | Hard = default config, Soft = 75% of hard
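
Several Note entries above describe the soft limit as 75% of the configured hard limit. A minimal sketch of that arithmetic, assuming limits are held in milliseconds; the `Limit` type, `FromConfig`, and `Exceeded` are illustrative names and are not part of this PR, and treating "reached" as "exceeded" is also an assumption:

```go
package main

// Limit pairs a hard limit with its derived soft limit, both in milliseconds.
// Hypothetical type for illustration only; it does not exist in the PR.
type Limit struct {
	HardMS int
	SoftMS int
}

// FromConfig derives the soft limit as 75% of the configured hard limit,
// matching the "Soft = 75% of config timing" notes in the table.
func FromConfig(hardMS int) Limit {
	return Limit{HardMS: hardMS, SoftMS: hardMS * 3 / 4}
}

// Exceeded reports which limits an observed duration has reached.
// Assumption: a value equal to the limit counts as reaching it.
func (l Limit) Exceeded(observedMS int) (soft, hard bool) {
	return observedMS >= l.SoftMS, observedMS >= l.HardMS
}
```

With a 4000 ms propose timeout this yields a 3000 ms soft limit, and a 2000 ms timeout yields 1500 ms, which agrees with the BFT rows above.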

@andrewnguyen22
Collaborator Author

andrewnguyen22 commented Apr 25, 2025

@aqt01 The prioritization you asked for:

// GUARD RAILS DOCUMENTATION:
// *************************************************************************************************************
// This section describes 1) hard limits and 2) soft limit alert recommendations for health related metrics
//
// Metric Name          | Hard Limit  | Soft Limit | Note                                                     | Low Limit   | Priority
// --------------------------------------------------------------------------------------------------------------------------------------
// NodeStatus           | 0           | n/a        |                                                          | n/a         | High
// TotalPeers           | 0 peers     | 1 peer     |                                                          | 2 peers     | Low
// LastHeightTime       | n/a         | 5 min      | Just over 3 rounds at 20s blocks                         | 25 secs     | High
// ValidatorStatus      | n/a         | not 1      | Monitor unexpected Pause or Unstaking                    |             |
// BFTRound             | n/a         | 3 rounds   | Soft = Just below the 'LastHeight' time                  | Round 1     | Medium
// BFTElectionTime      | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing               | 1 sec       | Medium
// BFTElectionVoteTime  | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing               | 1 sec       | Medium
// BFTProposeTime       | 4 secs      | 3 secs     | Hard = config, Soft = 75% of config timing               | 2 sec       | High
// BFTProposeVoteTime   | 4 secs      | 3 secs     | Hard = config, Soft = 75% of config timing               | 2 sec       | High
// BFTPrecommitTime     | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing               | 1 sec       | Medium
// BFTPrecommitVoteTime | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing               | 1 sec       | Medium
// BFTCommitTime        | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing               | 1 sec       | Medium
// BFTCommitProcessTime | 2 secs      | 1.5 secs   | Hard = config, Soft = 75% of config timing               | 1 sec       | Medium
// NonSignerPercent     | 33%         | 10%        | Hard = BFT upper bound                                   | 5%          | High
// LargestTxSize        | 4KB         | 3KB        | Hard = default mempool config, Soft = 75% of hard        | 2KB         | Medium
// BlockSize            | 1MB-1652B   | 750KB      | Hard = param - MaxBlockHeader, Soft = 75% of param       | 500KB       | Medium
// BlockProcessingTime  | 4 secs      | 3 secs     | Hard = MIN(ProposeTimeoutMS, ProposeVoteTimeoutMS)       | 2 secs      | Medium
// BlockVDFIterations   | n/a         | 0          | Soft = unexpected behavior                               | n/a         | Medium
// RootChainInfoTime    | 2 secs      | 1 sec      | Hard = 10% of block time                                 | 700ms       | Medium
// DBPartitionTime      | 10 min      | 5 min      | Hard = arbitrary / high likelihood of interruption       | 2 min       | Low
// DBPartitionEntries   | 2,000,000   | 1,500,000  | Hard = Badger default limit (configurable)               | 1,000,000   | Medium
// DBPartitionSize      | 128MB       | 75MB       | Hard = Badger set limit (configurable)                   | 10 MB       | Medium
// DBCommitTime         | 3 secs      | 2 secs     | Hard = soft of BlockProcessingTime                       | 1.5 sec     | Medium
// DBCommitEntries      | 2,000,000   | 1,500,000  | Hard = Badger default limit (configurable)               | 1,000,000   | Medium
// DBCommitSize         | 128MB       | 10MB       | Hard = Badger set limit (configurable)                   | 1 MB        | High
// MempoolSize          | 10MB        | 2MB        | Hard = default config, Soft = 2 blocks                   | 500 KB      | Low
// MempoolCount         | 5,000       | 3,500      | Hard = default config, Soft = 75% of hard                | 1,000       | Low

HardLimit=Documentation only
SoftLimit=PagerDuty
LowLimit=Discord Notification
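
The mapping above can be sketched as a small routing function. This is only an illustration under stated assumptions: the `route` function and channel constants are hypothetical, a value equal to a limit is treated as reaching it, and the comparison only suits metrics where a larger value is worse (a metric such as TotalPeers, where fewer is worse, would need the comparisons inverted). The hard limit is documentation only, so it is not an input.

```go
package main

// Hypothetical channel names for illustration; not identifiers from the PR.
const (
	ChannelNone      = "none"
	ChannelDiscord   = "discord"
	ChannelPagerDuty = "pagerduty"
)

// route picks a notification channel for a metric where larger is worse.
// lowLimit and softLimit are the "Low Limit" and "Soft Limit" table columns.
func route(value, lowLimit, softLimit float64) string {
	switch {
	case value >= softLimit:
		// Soft limit reached: page someone.
		return ChannelPagerDuty
	case value >= lowLimit:
		// Low limit reached: post a Discord notification.
		return ChannelDiscord
	default:
		return ChannelNone
	}
}
```

For BFTProposeTime (low 2 secs, soft 3 secs), an observed 2.5 secs would go to Discord and 3.2 secs would go to PagerDuty.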

andrewnguyen22 added the Medium Priority (Medium priority issue) and Multi-Module (Issue spans over multiple modules) labels on Apr 25, 2025
Collaborator

rem1niscence left a comment

All good! Just a single comment about the nil check on the metrics methods.

Comment thread lib/metrics.go
// UpdateMempoolMetrics() updates mempool telemetry
func (m *Metrics) UpdateMempoolMetrics(txCount, size int) {
	// exit if empty
	if m == nil {
Collaborator

I didn't catch this in the last review, but in what case could m be nil? In theory you could call the method on a nil m, but I don't see a case where that would happen.

Collaborator Author

If Metrics is set to nil (for example in tests, or when metrics are disabled), these calls won't fail.

It's an alternative to nil-checking every time you call into metrics.
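
The pattern described here relies on Go allowing a pointer-receiver method to be called on a nil pointer; the call only panics if the method dereferences the receiver. A minimal sketch, assuming a stand-in `Metrics` struct whose fields are hypothetical (the real type lives in lib/metrics.go and its body is not shown in this thread):

```go
package main

// Metrics is a stand-in for the telemetry type in lib/metrics.go.
// The fields are hypothetical and exist only to illustrate the pattern.
type Metrics struct {
	mempoolTxCount int
	mempoolSize    int
}

// UpdateMempoolMetrics records mempool telemetry.
// The nil guard makes the call a no-op when metrics are disabled,
// so callers do not need their own nil checks.
func (m *Metrics) UpdateMempoolMetrics(txCount, size int) {
	// exit if metrics are not configured
	if m == nil {
		return
	}
	m.mempoolTxCount = txCount
	m.mempoolSize = size
}
```

A caller holding `var metrics *Metrics` (nil) can invoke `metrics.UpdateMempoolMetrics(10, 2048)` safely, while a non-nil `*Metrics` records the values.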

andrewnguyen22 merged commit c14d0cf into main on Apr 29, 2025
andrewnguyen22 deleted the issue-#151 branch on April 29, 2025 15:38

Development

Successfully merging this pull request may close these issues.

[FEATURE] Add additional health metrics to Prometheus
