These files are generated artifacts. Do not edit them directly. The source of truth is each service's
docker-compose.yml (plus its swarm.fragment.yml for Swarm-specific config).

To regenerate:
./stackctl.sh generate
To check for drift:
./stackctl.sh sync

- infrastructure.yml: Traefik, Portainer, APISIX (gateway + etcd + dashboard), Postgres, Mongo, Redis
- observability.yml: Prometheus, Grafana, Loki, Tempo, OTel Collector
- platform.yml: GrowthBook (dashboard + proxy), AniTrend apps/services (anitrend, on-the-edge, edge-graphql)
- Separation of concerns: docker-compose.yml carries Compose-only concerns (container_name, restart, build, image, volumes, ports, labels). Swarm-specific customizations (deploy scheduling, network aliases, DNS overrides, and any key that only makes sense under docker stack deploy) belong in the sibling swarm.fragment.yml. Do not put Swarm config in docker-compose.yml.
- Shared overlay network: traefik-public (external, attachable). Create once per swarm host.
- No Compose-only keys: do not use container_name, restart, or build in stacks.
- Use deploy for scheduling (mode, placement, resources) and env_file for configuration.
- All exposed services must attach to traefik-public and define Traefik labels for routing.
- Persist critical data via named volumes. Mark volumes as external: true to reuse existing data.
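
The "no Compose-only keys" rule above can be checked mechanically. The following is a minimal sketch and not part of stackctl.sh: find_compose_only_keys and check_stack_keys are hypothetical helper names. It is a plain text match rather than a YAML parse, so it assumes one key per line and could also flag matching text inside multi-line strings.

```shell
# Hypothetical helpers (not part of stackctl.sh).

# Print the lines of a stack file that use keys forbidden in Swarm stacks.
find_compose_only_keys() {
  # Matches "container_name:", "restart:", or "build:" at any indentation.
  # "restart_policy:" (valid under deploy) does not match because the colon
  # must follow the key name directly.
  grep -nE '^[[:space:]]*(container_name|restart|build):' "$1"
}

# Return 1 (and report on stderr) if the stack file contains forbidden keys.
check_stack_keys() {
  if find_compose_only_keys "$1"; then
    echo "Compose-only keys found in $1" >&2
    return 1
  fi
  return 0
}

# Example (assumes the file exists):
#   check_stack_keys stacks/infrastructure.yml
```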
./stackctl.sh is the canonical deployment path. It handles preflight checks, stack
regeneration, variable rendering, and deployment in one workflow. Committed
stacks/*.yml files intentionally contain ${VAR} placeholders that must be
resolved before Swarm can use them. stackctl.sh calls tools/render_compose.py
to substitute service-local env_file values into a gitignored .rendered/
copy, then deploys the rendered file.
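
To make the render step concrete, here is a simplified sketch of placeholder substitution. It is not how tools/render_compose.py is implemented: render_placeholders is a hypothetical function, it reads a single KEY=VALUE file instead of resolving each service's env_file, and it assumes values contain no `|`, `&`, or backslash characters.

```shell
# Hypothetical sketch; tools/render_compose.py is the real renderer.
# Usage: render_placeholders <env-file> <stack-file>
# Prints the stack file to stdout with ${KEY} replaced for each KEY=VALUE pair.
render_placeholders() {
  env_file=$1
  content=$(cat "$2")
  # Read KEY=VALUE pairs; the "|| [ -n ... ]" handles a missing final newline.
  while IFS='=' read -r key value || [ -n "$key" ]; do
    case $key in
      ''|'#'*) continue ;;   # skip blank lines and comments
    esac
    # Replace every literal ${KEY}; placeholders without a value stay as-is.
    content=$(printf '%s\n' "$content" | sed "s|\${$key}|$value|g")
  done < "$env_file"
  printf '%s\n' "$content"
}

# Example (paths are illustrative):
#   render_placeholders traefik/.env stacks/infrastructure.yml > /tmp/check.rendered.yml
```

Placeholders with no matching key are left untouched, which makes them easy to spot in the output.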
Prerequisites:
- Docker Engine with Swarm enabled (single-node is fine)
- The external overlay network traefik-public
- Python render dependencies installed in tools/.venv
- Optional: local TLS certs in traefik/certs/ for *.docker.localhost
Quick start:
# 0) Install the stack generation/render toolchain once per host
python3 -m venv tools/.venv
tools/.venv/bin/python -m pip install --upgrade pip
tools/.venv/bin/python -m pip install -r tools/requirements.txt
# 1) Validate your environment (safe to run repeatedly). Add --fix-network to auto-create the overlay network.
./stackctl.sh doctor --fix-network
# 2) Optionally ensure external named volumes exist before deploying
./stackctl.sh doctor --fix-volumes
# 3) Deploy all stacks and follow key logs (Traefik, Prometheus, Loki)
./stackctl.sh up
# Or deploy a subset
./stackctl.sh up -s infrastructure,observability
# Check status
./stackctl.sh status
# Tail logs for specific services
./stackctl.sh logs infrastructure_traefik observability_prometheus
# Remove stacks (keeps volumes); add --remove-network to also remove traefik-public
./stackctl.sh down -y
# For encrypted secrets, decrypt, render, deploy, and clean up in one step:
./stackctl.sh secrets deploy

Notes:
- stackctl.sh finds stack files from stacks/*.yml. The rendered output is written to .rendered/ (gitignored); committed source stacks are never modified.
- The doctor command validates Compose syntax for each stack and reminds you to create .env files where a .env.example exists. For encrypted secrets, use ./stackctl.sh secrets deploy instead (see Managing Secrets).
- If you use local HTTPS, make sure traefik/certs/local-cert.pem and traefik/certs/local-key.pem exist; see below for generation.
- To check for drift between compose sources and committed stacks: ./stackctl.sh sync
These commands deploy unresolved ${VAR} placeholders. Committed stacks/*.yml files are not directly deployable without pre-rendering or equivalent shell environment setup. Docker Swarm does not load env_file for Compose variable interpolation; that is a Compose CLI feature only.
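
One way to guard against deploying an unrendered file is to scan it for leftover placeholders first. This is a sketch under assumptions: find_unresolved_placeholders and require_rendered are hypothetical helpers, not stackctl.sh commands, and the text match also flags escaped `$${VAR}` sequences.

```shell
# Hypothetical pre-deploy guard (not part of stackctl.sh).

# Print lines that still contain a ${VAR}-style placeholder.
# Exit status 0 means at least one placeholder was found.
find_unresolved_placeholders() {
  grep -n '\${[A-Za-z_]' "$1"
}

# Return 1 (and report on stderr) if the file is not fully rendered.
require_rendered() {
  if find_unresolved_placeholders "$1"; then
    echo "refusing to deploy $1: unresolved placeholders" >&2
    return 1
  fi
}

# Example (paths are illustrative):
#   require_rendered .rendered/infrastructure.rendered.yml &&
#     docker stack deploy -c .rendered/infrastructure.rendered.yml infrastructure
```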
If you must deploy manually, ensure all variables are resolved first (e.g., via
tools/render_compose.py or envsubst). For debugging or inspection:
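# 1) Initialize Swarm (skipped if this node is already part of a swarm)
docker info --format '{{.Swarm.LocalNodeState}}' | grep -qx active || docker swarm init
# 2) Create shared overlay network (skipped if it already exists)
docker network inspect traefik-public >/dev/null 2>&1 || docker network create --driver=overlay --attachable traefik-public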
# 3) Generate stacks from compose sources
./stackctl.sh generate
# 4) Render variables into .rendered/ files
# (handled automatically by stackctl.sh up; shown here for manual inspection)
python3 tools/render_compose.py -i stacks/infrastructure.yml -o .rendered/infrastructure.rendered.yml --repo-root .
# 5) Deploy the rendered file (not the source stack); repeat steps 4 and 5 for observability and platform
docker stack deploy -c .rendered/infrastructure.rendered.yml infrastructure
# 6) Verify
docker stack services infrastructure
docker stack services observability
docker stack services platform
# 7) Teardown (keeps volumes)
docker stack rm platform
docker stack rm observability
docker stack rm infrastructure

stackctl.sh pre-renders variables into gitignored .rendered/ copies before
deployment. Naming follows ${stack_name}.rendered.yml:
- stacks/infrastructure.yml → .rendered/infrastructure.rendered.yml
- stacks/observability.yml → .rendered/observability.rendered.yml
- stacks/platform.yml → .rendered/platform.rendered.yml
These files are ignored by Git and safe to regenerate at any time. Inspect without committing:
./stackctl.sh up --dry-run # validates and logs render paths
# or render manually:
python3 tools/render_compose.py -i stacks/infrastructure.yml -o /tmp/check.rendered.yml --repo-root .

- Ensure each service folder has a .env available. For local development, copy from .env.example; for production, use ./stackctl.sh secrets deploy (see Managing Secrets).
- APISIX dashboard uses apisix/api-dashboard/config/conf.yaml (generated from conf.example.yml).
- Healthchecks have been added for Prometheus and APISIX. Consider adding them for other services as needed.
- Stacks set conservative deploy.resources reservations/limits to avoid runaway memory/CPU. Adjust in ±128–256MiB steps based on telemetry.
- Services use the local logging driver with rotation (max-size=10m, max-file=3) to reduce JSON log churn. If you prefer a global default, set it in /etc/docker/daemon.json and restart Docker.
- Prometheus: 3d retention (--storage.tsdb.retention.time=3d), --query.max-concurrency=10; scrape intervals relaxed to 30s for most jobs.
- Loki: retention 72h, chunk target ~1.5MiB, moderate ingestion rate, compactor retention enabled.
- Tempo: local backend with 48h retention from config; single-replica by default.
- GrowthBook: Node heap capped via NODE_OPTIONS=--max-old-space-size=512.
- Traefik: access logs disabled by default; enable temporarily if debugging.
- Verify per-stack services: docker stack services <stack> and docker service logs <stack>_<service>.
- If Traefik can't reach a service, confirm it's attached to traefik-public and labels point to the correct server.port and host.
- For noisy logs or high disk writes, ensure the local driver is in effect and service-level logging options are applied.
For local development with HTTPS on domains like grafana.docker.localhost, Traefik is configured with a local certificatesResolver and a file provider for TLS certificates.
What this means:
- ACME/Let’s Encrypt will not issue for .localhost domains. Instead, generate a local development certificate and key, and place them in traefik/certs/ as local-cert.pem and local-key.pem.
- The dynamic config (traefik/config/dynamic.yml) already references these files and declares the docker.localhost SANs, including *.docker.localhost.
- Set CERT_RESOLVER=local in traefik/.env (and any service labels that reference it) to use the local resolver while Traefik serves the file-based certs.
Generate a dev cert (example using mkcert):
mkcert -install
mkcert -cert-file traefik/certs/local-cert.pem -key-file traefik/certs/local-key.pem "docker.localhost" "*.docker.localhost"

Notes:
- traefik/certs/.gitignore prevents committing private keys or ACME storage files.
- Browsers trust mkcert’s local CA after mkcert -install. If not using mkcert, you may need to trust your self-signed CA manually.
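
After generating the certificate, a quick check that both files are in place can save a failed Traefik start. This is a minimal sketch: check_local_certs is a hypothetical helper, and it only tests that the files exist and are non-empty. It does not validate the certificate contents or SANs.

```shell
# Hypothetical helper (not part of stackctl.sh).
# Usage: check_local_certs [cert-dir]   (defaults to traefik/certs)
# Returns 1 and reports on stderr if either PEM file is missing or empty.
check_local_certs() {
  cert_dir=${1:-traefik/certs}
  missing=0
  for f in local-cert.pem local-key.pem; do
    if [ ! -s "$cert_dir/$f" ]; then
      echo "missing or empty: $cert_dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   check_local_certs traefik/certs && echo "local TLS files present"
```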