-
Notifications
You must be signed in to change notification settings - Fork 2
feat: supporting docker #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
25 commits
Select commit
Hold shift + click to select a range
f36dd8a
feat: supporting docker compose setup
NikolaMilosa 7eb3b40
adding multiservice discovery
NikolaMilosa 7f58c1f
making a template
NikolaMilosa e9b3842
configuring prometheus scraping
NikolaMilosa 30b5b8f
extending template
NikolaMilosa 812a437
script for expanding prometheus config
NikolaMilosa bf2ef0f
Adding readme for docker tools
NikolaMilosa aedae54
Adding a way to override sd url
NikolaMilosa 63ec25c
adding grafana
NikolaMilosa 6531b76
Adding prometheus datasource
NikolaMilosa d44b3bb
Adding sample node exporter dashboards
NikolaMilosa 35a8d4c
Adding alerting templates and notifications
NikolaMilosa 90e342e
fixing contact points
NikolaMilosa 91f6e21
adding explanations
NikolaMilosa 9810d58
updating readme
NikolaMilosa 9ff0f8b
fixing alerts
NikolaMilosa 1c45213
proper down for orchestrator
NikolaMilosa 509ffb2
Adding some node exporter alerts
NikolaMilosa 73b8f27
Refactoring some alerts
NikolaMilosa dc77de7
started readme
NikolaMilosa ae94b22
Adding docs
NikolaMilosa 54355e5
fixing default port
NikolaMilosa 9791a14
network mode host by default
NikolaMilosa 225d5da
changing the defaults in the script
NikolaMilosa 82bcd55
removing the need for chmod
NikolaMilosa File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| volumes |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,262 @@ | ||
| # IC observability stack | ||
|
|
||
| This collection of software lets you use your own equipment to observe the health and | ||
| operation of the nodes in your datacenter that are part of the Internet Computer. It | ||
| is a collection of docker containers that, when deployed and configured, will collect | ||
| metrics about the performance of selected nodes and send alerts if something is off. | ||
|
|
||
| ## How does this work? | ||
|
|
||
| The stack selects IC nodes and collects metrics from each node, saving them | ||
| in a local Prometheus database every few seconds. This database is | ||
| queryable through a Grafana deployed side-to-side with Prometheus. | ||
|
|
||
| Nodes of the Internet Computer all make available to the public a series | ||
| of metrics (in [Prometheus text format](https://github.com/Prometheus/docs/blob/777846211d502a287ab2b304cb515dc779de3474/content/docs/instrumenting/exposition_formats.md#text-based-format)) | ||
| that can be collected and analyzed by software such as this. | ||
|
|
||
| The stack has to be deployed within the same *network subnet* as the nodes so that | ||
| the scraping of the nodes can take place. This practically means that the nodes and the | ||
| machine that is running this stack have to be within the same data center and connected | ||
| to the same router. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| * IPv6 connectivity. | ||
| * Root-equivalent access on your workstation, to deploy the software this stack | ||
| needs to be set up. | ||
| * Hardware. | ||
| * It is recommended to run this stack on a machine with at least 16GB ram and | ||
| 80-100GB storage. | ||
| * SSH access to the machine. | ||
| * This is needed to later remotely connect to the services and inspect issues. | ||
| * Setting up ssh server [guide](https://documentation.ubuntu.com/server/how-to/security/openssh-server/). | ||
|
|
||
| ## Preparation | ||
|
|
||
| To start preparing your workstation for scraping and observing your nodes you have to | ||
| `ssh` into the machine (or be physically next to it in order to run commands on the | ||
| machine). | ||
|
|
||
| Install [Docker](https://docs.docker.com/engine/install/) for your machine. There | ||
| shouldn't be visible differences across different operating systems. This setup has | ||
| been tested on Mac OS and Manjaro OS (Linux flavour) but hasn't been tested on | ||
| Windows machines. | ||
|
|
||
| Once you can run docker commands and have `docker compose` you can proceed to the next step. | ||
|
|
||
| ## Configuration | ||
|
|
||
| Now that you have `docker` you can proceed to configure the stack. | ||
|
|
||
| ### Scraping targets | ||
|
|
||
| First thing that you have to configure is your scraping targets. To do that, find your | ||
| [principal id](https://support.dfinity.org/hc/en-us/articles/7365913875988-What-is-a-principal) | ||
| that correclates to your node provider id. You can find that from the [public dashboard](https://dashboard.internetcomputer.org/network/providers). After that you should find the id of the data center to which | ||
| you are deploying this stack. You can find that information also on the [public dashboard](https://dashboard.internetcomputer.org/network/providers). | ||
|
|
||
| With that you can run the following command: | ||
| ```bash | ||
| # NOTE: you have to be within the same directory as this README! | ||
| docker compose -f ./docker-compose.tools.yaml run --rm prom-config-builder tools/prom-config-builder/prom_config_builder.py --node-provider-id <node-provider-id> --dc-id <dc-id> | ||
| ``` | ||
|
|
||
| Example command for node provider Dfinity Stiftung for data center se1: | ||
| ```bash | ||
| docker compose -f ./docker-compose.tools.yaml run --rm prom-config-builder tools/prom-config-builder/prom_config_builder.py --node-provider-id bvcsg-3od6r-jnydw-eysln-aql7w-td5zn-ay5m6-sibd2-jzojt-anwag-mqe --dc-id se1 | ||
| ``` | ||
|
|
||
| Once that executes, you should be able to see a new file at `./config/prometheus/config.yaml`. | ||
| This file contains the definitions for the scraping targets. It will be slightly different | ||
| for each node provider and each data center. It is not versione controlled and you can | ||
| always recreate it with running the above command if you lose it, or delete it. | ||
|
|
||
| ### Contact points | ||
|
|
||
| This stack uses [Grafana](https://grafana.com/) to present the dashboards and to send | ||
| alerts. Sending alerts is done using the [contact points](https://grafana.com/docs/grafana/latest/alerting/fundamentals/notifications/contact-points/) which need to be configured. Grafana supports | ||
| various contact points, some of them can be used for free, some of them are paid. | ||
| This stack was tested with the following contact points: | ||
| * Discord [free] | ||
| * Slack | ||
| * Google chat | ||
|
|
||
| To setup your prefered contact point copy over the template with the following command: | ||
| ```bash | ||
| cp ./config/grafana/templates/template_contact_points.yaml ./config/grafana/provisioning/alerting/contact_points.yaml | ||
| ``` | ||
|
|
||
| After that, edit the new file on path `./config/grafana/provisioning/alerting/contact_points.yaml` | ||
| and uncomment your prefered contact point and configure it (to _uncomment_ you should | ||
| remove the initial `#` from the contact point definition. | ||
|
|
||
| ## Running the stack | ||
|
|
||
| To deploy the stack run the following command: | ||
| ```bash | ||
| # This will spawn the containers | ||
| docker compose -f ./docker-compose.yaml up -d | ||
| ``` | ||
|
|
||
| It will take some time for the services to start and to sync between one another. | ||
| You can monitor if containers are failing by running `docker ps` and see if | ||
| there are some restarts happening in the containers. To see more about Troubleshooting | ||
| read *Troubleshooting* | ||
|
|
||
| ## Usage | ||
|
|
||
| Once started, you will see the following applications: | ||
| * Prometheus - http://localhost:9090 | ||
| * Grafana - http://localhost:3000 - default creds can be see in `./config/grafana/grafana.ini` | ||
| * Service discovery - http://localhost:8000 | ||
|
|
||
| After 5-10 minutes you should see targets discovered in prometheus on the [targets page](http://localhost:9090/targets?search=). Initially, they might apear in red and if you keep monitoring they should | ||
| slowly start getting blue, which means that the targets are successfuly scraped. | ||
|
|
||
| You should also see some data incoming in the grafana [sample dashboard for the | ||
| node exporter](http://localhost:3000/d/1/node-exporter?orgId=1&from=now-3h&to=now&timezone=utc&var-datasource=prometheus&var-instance=$__all&var-diskdevice=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B). | ||
| Here, you can see various information about the performance of nodes and their health. | ||
|
|
||
| [Alerting tab](http://localhost:3000/alerting/list) will show `obs alert evaluations` | ||
| which will contain a list of preconfigured alerts. Here you can see if any of the | ||
| alerts are in in a problematic state. Most of the time they should be in `Normal` | ||
| state, this means that everything is fine. Some of them may occasionally go into a | ||
| `Pending` state, which means that they crossed the threshold for an alert but still | ||
| isn't happening for long enough to consider this a problem. When a `Pending` alert | ||
| is happening for long enough it will go into `Error` mode and you should receive an | ||
| alert on your preconfigured contact point. | ||
|
|
||
| *NOTE*: Not all alerts are configured to send notifications if they fire, usually | ||
| because they are just warnings and cannot be acted upon. To see which are and | ||
| which aren't you can see `./config/grafana/provisioning/alerting/alerts.yaml` and | ||
| see which have the lavel `severity: critical` attached to them because only they | ||
| will send alerts to your contact point. If you wish to send all of them, just replace | ||
| the other ones that contain `severity: warning` with `severity: critical`. | ||
|
|
||
| ### Access to the services of the stack | ||
|
|
||
| To access the stack remotely you can do the following: | ||
| ```bash | ||
| ssh -L 3000:localhost:3000 -L 9090:localhost:9090 -L 8000:localhost:8000 <machine-with-obs-stack> | ||
| ``` | ||
|
|
||
| Example command with all parameters: | ||
| ```bash | ||
| ssh -L 3000:localhost:3000 -L 9090:localhost:9090 -i ~/.ssh/priv_key.pem myuser@192.168.15.15 | ||
| ``` | ||
|
|
||
| ## Extending | ||
|
|
||
| Extending this stack usually means adding new alerts or dashboards. Those might | ||
| be your own modifications or the ones that come from someone else. | ||
|
|
||
| ### Dashboards | ||
|
|
||
| Building grafana dashboards is usually done via [grafana ui](http://localhost:3000/dashboard/new?orgId=1&from=now-6h&to=now&timezone=browser). You can follow | ||
| [this tutorial](https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/create-dashboard/) from grafana to make your custom dashboards. | ||
|
|
||
| After creating the dashboard it is suggested to export it and save it | ||
| `./config/grafana/provisioning/dashboards/`. This will make sure that | ||
| if you later need to restore the dashboard or do a fresh deployment | ||
| of grafana it will be persisted. | ||
|
|
||
| *NOTE*: Storing a dashboard and restarting can make lead to errors | ||
| due to overlapping dashboard `uids`. If you export the dashboard and | ||
| save it you should, be sure to do a clean deployment of grafana. | ||
|
|
||
| ### Alerts | ||
|
|
||
| Adding grafana alerts can also be done via [grafana ui](https://grafana.com/docs/grafana/latest/alerting/alerting-rules/create-grafana-managed-rule/). | ||
|
|
||
| Similarly to dashboards, it is suggested to export alerts and save them | ||
| in `./config/grafana/provisioning/alerting/` which will make sure that | ||
| they persist after full grafana redeployments. | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### Service discovery failing | ||
|
|
||
| It is possible that usually due to networking issues, service discovery component | ||
| may fail. To debug why you can run the following command to inspect the logs: | ||
| ```bash | ||
| # From the same folder of this README | ||
| docker compose logs multiservice-discovery | ||
| ``` | ||
|
|
||
| You may see something along the lines of: | ||
| ```bash | ||
| multiservice-discovery-1 | Nov 10 13:25:07.044 WARN Failed to sync registry for mercury @ interval Instant { tv_sec: 9050, tv_nsec: 707832535 }: SyncWithNnsFailed { failures: [("targets", RegistryTransportError { source: UnknownError("Failed to query get_certified_changes_since on canister rwlgt-iiaaa-aaaaa-aaaaa-cai: Request failed for http://[2606:fb40:201:1001:6801:2fff:fef5:b129]:8080/api/v2/canister/rwlgt-iiaaa-aaaaa-aaaaa-cai/query: hyper_util::client::legacy::Error(Connect, ConnectError(\"tcp connect error\", [2606:fb40:201:1001:6801:2fff:fef5:b129]:8080, Os { code: 101, kind: NetworkUnreachable, message: \"Network is unreachable\" }))") })] } | ||
| ``` | ||
|
|
||
| This is usually a transient error and may happen from time to time. What this | ||
| means is that the discovery cannot sync with the registry canister and may be | ||
| serving stale targets. Usually this is acceptable but if that happens when | ||
| deploying and the initial sync fails you may be unable to see any of your | ||
| nodes. As long as you can see your nodes on the following link you can safely | ||
| ignore the transient failures. | ||
| ```bash | ||
| curl http://localhost:8000/prom/targets?node_provider_id=<node-provider-id>&dc_id=<dc-id> | ||
| ``` | ||
|
|
||
| *NOTE*: The initial sync of service discovery may take up to 15 minutes! Syncing | ||
| will be clearly logged in the multiservice discovery. | ||
|
|
||
| ### Prometheus | ||
|
|
||
| #### No targets visible in targets view | ||
|
|
||
| If you don't see anything in the [prometheus targets view](http://localhost:9090/targets?search=), | ||
| that means that prometheus failed to receive targets from the service discovery. | ||
|
|
||
| To check the logs run: | ||
| ```bash | ||
| # From the same folder of this README | ||
| docker compose logs prometheus | ||
| ``` | ||
|
|
||
| Check if you can see your nodes by running the following command: | ||
| ```bash | ||
| curl http://localhost:8000/prom/targets?node_provider_id=<node-provider-id>&dc_id=<dc-id> | ||
| ``` | ||
|
|
||
| #### Targets visible but are being shown in read | ||
|
|
||
| You should now see 4 jobs: | ||
| * `host_node_exporter` | ||
| * `node_exporter` | ||
| * `orchestrator` | ||
| * `replica` | ||
|
|
||
| If any of them are shown in read it means that some of the targets (or all of them) | ||
| are failing to be scraped. You can see that from the logs as well: | ||
| ```bash | ||
| # From the same folder of this README | ||
| docker compose logs prometheus | ||
| ``` | ||
|
|
||
| This means that the prometheus scraper cannot reach the nodes it is trying | ||
| to scrape. It can be because the workstation for this observability stack | ||
| isn't in the same network subnet as the nodes, or due to other network | ||
| issues. | ||
|
|
||
| ### Stack restart | ||
|
|
||
| To make a full clean restart (or partial) you can do the following: | ||
| * Ensure that everything you created through grafana ui is exported | ||
| * Dashboards (save them into `./config/grafana/provisioning/dashboards/`) | ||
| * Alerts (save them into `./config/grafana/provisioning/alerting/`) | ||
| * Contact points (save them into `./config/grafana/provisioning/alerting/`) | ||
| * Message templates (save them into `./config/grafana/provisioning/alerting/`) | ||
| * Notification policies (save them into `./config/grafana/provisioning/alerting/`) | ||
| * If you haven't created the resources, or don't mind losing them | ||
| you can proceed. | ||
| * Stop the stack: `docker compose -f ./docker-compose.yaml down` | ||
| * Clean the volumes. You don't have to clean everything, pick just | ||
| ones that you wish to restart fully: | ||
| * prometheus: `rm -rf ./volumes/prometheus/` | ||
| * grafana: `rm -rf ./volumes/grafana/` | ||
| * multiservice discovery: `rm -rf ./volumes/msd/` | ||
| * Reset the folder structure: `git checkout -- ./volumes/` | ||
| * Run the stack again: `docker compose -f ./docker-compose.yaml up -d` | ||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| [server] | ||
| enable_gzip = true | ||
|
|
||
| [analytics] | ||
| reporting_enabled = false | ||
|
|
||
| [security] | ||
| admin_user = root | ||
| admin_password = password | ||
| admin_email = admin@localhost | ||
|
|
||
| secret_key = strongPassword@!!123 | ||
|
NikolaMilosa marked this conversation as resolved.
|
||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| contact_points.yaml |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.