
[ci-coach] CI/CD pipeline optimization opportunities #390

@willvelida

Summary

18 deployment pipelines and 10 reusable templates analyzed. Found 22 optimization opportunities: 3 High, 5 Medium, 14 Low impact.


High Impact

❌ Missing NuGet Package Caching

Affected: All 4 test templates (template-dotnet-run-unit-tests.yml, template-dotnet-run-contract-tests.yml, template-dotnet-run-e2e-tests.yml, template-aca-api-integration-tests.yml)

Current state: Each template runs dotnet restore without caching. A typical API pipeline invokes 3 test templates, so it performs 3 independent restores per run. Across all 18 pipelines, that's ~54 uncached restore operations per full CI cycle.

Recommended fix: Add actions/cache for ~/.nuget/packages with a key based on **/packages.lock.json or **/*.csproj hash in each test template.

Estimated savings: 30-60 seconds per job × 54 jobs = 27-54 minutes of aggregate runner time per full cycle.
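A minimal sketch of the caching step described above, assuming the projects commit packages.lock.json files. The step names are illustrative, and the version tags are placeholders that would need to be replaced with pinned commit SHAs to match this repo's action-pinning convention.

```yaml
# Hypothetical steps inside a test template's job.
steps:
  - uses: actions/checkout@v4            # placeholder tag; pin to a SHA
  - name: Cache NuGet packages
    uses: actions/cache@v4               # placeholder tag; pin to a SHA
    with:
      path: ~/.nuget/packages
      # Key changes only when a lock file changes.
      key: ${{ runner.os }}-nuget-${{ hashFiles('**/packages.lock.json') }}
      # Fall back to the most recent cache for this OS on a key miss.
      restore-keys: |
        ${{ runner.os }}-nuget-
  - name: Restore dependencies
    run: dotnet restore
```

If lock files are not in use, hashFiles('**/*.csproj') is the alternative key source the recommendation mentions; it invalidates the cache on any project file change, not only on dependency changes.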

❌ Missing Docker Layer Caching

Affected: template-acr-push-image.yml, template-ai-bom-push-image.yml

Current state: Both templates set up Docker Buildx (docker/setup-buildx-action) but use docker build directly without leveraging Buildx cache backends. No cache-from or cache-to parameters.

Recommended fix: Switch to docker buildx build with --cache-from type=gha --cache-to type=gha,mode=max to use GitHub Actions cache for Docker layers.

Estimated savings: 1-3 minutes per container build. With 15+ container-building pipelines, potential savings of 15-45 minutes per full cycle.
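A sketch of the cache settings above, with one substitution stated plainly: it uses docker/build-push-action rather than a raw docker buildx build command, because that action supplies the GitHub cache credentials the gha backend needs, whereas a plain run: step would need them exposed separately. The input names (registry, image-name) are hypothetical, and the version tags are placeholders to be pinned to SHAs.

```yaml
# Hypothetical steps for template-acr-push-image.yml.
steps:
  - uses: docker/setup-buildx-action@v3    # placeholder tag; pin to a SHA
  - name: Build and push image
    uses: docker/build-push-action@v6      # placeholder tag; pin to a SHA
    with:
      context: .
      push: true
      # Hypothetical template inputs; substitute the template's real ones.
      tags: ${{ inputs.registry }}/${{ inputs.image-name }}:${{ github.sha }}
      # Read layers from, and write all intermediate layers to, the Actions cache.
      cache-from: type=gha
      cache-to: type=gha,mode=max
```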

❌ Missing Concurrency Groups on All Deploy Pipelines

Affected: All 18 deploy-*.yml pipelines

Current state: Zero pipelines define a concurrency block. Per conventions, deployment workflows must have concurrency: { group: deploy-{service}-${{ github.ref }}, cancel-in-progress: false } to prevent parallel deploys to the same environment.

Recommended fix: Add a concurrency block to each pipeline.

Risk: Without this, simultaneous PRs can trigger parallel deploys, causing race conditions in infrastructure provisioning.
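For illustration, the block from the convention applied to one pipeline might look like the following. The service name food-api is taken from deploy-food-api.yml as an example.

```yaml
# Top level of deploy-food-api.yml (illustrative).
concurrency:
  group: deploy-food-api-${{ github.ref }}
  # Queue the newer run instead of cancelling a deploy that is mid-flight.
  cancel-in-progress: false
```

Because the group name includes github.ref, this serializes runs on the same ref only. Runs triggered from different branches or PRs fall into different groups, so if the goal is one deploy per environment at a time, the group key may need to drop the ref.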


Medium Impact

⚠️ Missing timeout-minutes on All Jobs

Affected: All 18 deploy pipelines and all 10 templates (0 have timeout-minutes)

Current state: No job in any pipeline or template specifies timeout-minutes. GitHub's default is 360 minutes (6 hours), meaning stuck jobs consume runner minutes silently.

Recommended fix: Add timeout-minutes: 15 for test jobs, timeout-minutes: 20 for container builds, timeout-minutes: 10 for Bicep operations.
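A sketch of where the setting goes, using a hypothetical job in a test template. As far as I'm aware, timeout-minutes is not among the keywords allowed on a caller job that uses a reusable workflow, so the value belongs on the jobs inside the templates rather than on the uses: jobs in the deploy pipelines.

```yaml
# Inside a template such as template-dotnet-run-unit-tests.yml (job name is hypothetical).
jobs:
  run-unit-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 15          # fail fast instead of running up to the 360-minute default
    steps:
      - uses: actions/checkout@v4   # placeholder tag; pin to a SHA
      - run: dotnet test
```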

⚠️ Missing Path Filter Self-Reference (8 pipelines)

Affected: deploy-activity-api.yml, deploy-activity-service.yml, deploy-core-infra.yml, deploy-food-api.yml, deploy-food-service.yml, deploy-sleep-api.yml, deploy-ui.yml, deploy-vitals-api.yml

Current state: These pipelines don't include their own workflow file in the paths: filter, meaning changes to the pipeline itself won't trigger a validation run.

Recommended fix: Add .github/workflows/deploy-{service}.yml to each pipeline's paths: array.
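As a sketch, the trigger for one of the affected pipelines could look like this. The source path and branch are assumptions; only the workflow file entry comes from the recommendation.

```yaml
# Trigger section of deploy-food-api.yml (illustrative).
on:
  pull_request:
    branches: [main]             # assumed branch
    paths:
      - 'src/Food.Api/**'        # hypothetical source path
      - '.github/workflows/deploy-food-api.yml'   # the pipeline's own file
```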

⚠️ Missing Bicep What-If Preview (3 pipelines)

Affected: deploy-chat-api.yml, deploy-reporting-api.yml, deploy-ui.yml

Current state: These pipelines deploy Bicep infrastructure but skip the template-bicep-whatif stage, meaning infrastructure changes deploy without a preview of what will change.

Recommended fix: Add a preview job using template-bicep-whatif.yml before the deploy stage.
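A rough sketch of the added job follows. The job names (validate, preview, deploy), the deploy template name, and the use of secrets: inherit are all assumptions, and the real template-bicep-whatif.yml will require its own inputs, which are not shown in this report.

```yaml
jobs:
  preview:
    needs: validate                      # hypothetical upstream job
    uses: ./.github/workflows/template-bicep-whatif.yml
    secrets: inherit                     # assumption; pass named secrets if the template declares them
  deploy:
    needs: preview                       # deploy only after the what-if preview has run
    uses: ./.github/workflows/template-bicep-deploy.yml   # hypothetical template name
    secrets: inherit
```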

⚠️ Inconsistent Coverage Threshold

Affected: deploy-reporting-api.yml (threshold: 60%, all others: 70%)

Current state: One pipeline uses a lower threshold than the documented 70% minimum.

Recommended fix: Raise to 70% or document the exception.

⚠️ env-setup Job Overhead

Affected: All 17 non-infra deploy pipelines

Current state: A separate env-setup job runs on its own runner solely to propagate the DOTNET_VERSION env var as a job output. This adds ~15-30 seconds of runner spin-up overhead per pipeline.

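Recommended fix: Remove the dedicated job and pass the version directly as a literal string to template inputs, or store it in a repository configuration variable (vars.DOTNET_VERSION). A workflow-level env var is not an option here: the env context is not available in the with: block of a reusable-workflow call, and caller-level env values are not propagated to called workflows.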
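A sketch of the call without the env-setup job, assuming the template takes an input named dotnet-version (hypothetical) and that a repository variable DOTNET_VERSION has been created. The vars context can be read in a caller job's with: block, while env cannot.

```yaml
jobs:
  run-unit-tests:
    uses: ./.github/workflows/template-dotnet-run-unit-tests.yml
    with:
      # Hypothetical input name. A literal such as '8.0.x' would also work here.
      dotnet-version: ${{ vars.DOTNET_VERSION }}
```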


Low Impact

14 low-impact findings

Redundant ACR Server Lookup

Affected: All API/service deploy pipelines with retrieve-container-image-dev job

The template-acr-push-image.yml already queries the ACR login server. The separate retrieve-container-image-dev job duplicates this lookup on a fresh runner just to pass the server name to Bicep deploy stages.

Sequential Bicep Stages

Affected: All pipelines with lint → validate → preview → deploy chain

lint and validate have no data dependency and could run in parallel.
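A sketch of the fan-out, assuming lint and validate live in their own templates. The names template-bicep-lint.yml and template-bicep-validate.yml are hypothetical; only template-bicep-whatif.yml is named in this report.

```yaml
jobs:
  lint:
    uses: ./.github/workflows/template-bicep-lint.yml       # hypothetical name
  validate:
    uses: ./.github/workflows/template-bicep-validate.yml   # hypothetical name
  preview:
    # No needs: between lint and validate, so they start together;
    # preview waits for both to succeed.
    needs: [lint, validate]
    uses: ./.github/workflows/template-bicep-whatif.yml
```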

Duplicate dotnet restore + dotnet build Across Test Tiers

Unit tests, contract tests, and E2E tests each independently restore and build the same solution. A shared build artifact could eliminate 2 of 3 build cycles.

Coverage Threshold Default Repetition

Most pipelines pass coverage-threshold: 70 which is already the template default. These explicit values add noise without benefit.

No Workflow File Self-Reference (additional patterns)

Some pipelines that DO have self-references are missing the template files they depend on in their paths filter.

Template template-aca-api-integration-tests.yml Has Unused Secrets

Defines client-id, tenant-id, subscription-id secrets but the E2E template already handles Azure auth.

Action Pin Comment Inconsistency

Some actions have version comments (# v5, # v2) while others don't. Minor readability issue.

Reporting Service Cadence Pipelines Are Nearly Identical

deploy-reporting-service-weekly.yml, deploy-reporting-service-monthly.yml, and deploy-reporting-service-yearly.yml are structurally identical pipelines that could potentially be consolidated into one parameterized workflow.
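One possible shape for the consolidation is a matrix over the cadence, calling a shared reusable workflow. The template name and its cadence input are hypothetical.

```yaml
jobs:
  deploy:
    strategy:
      fail-fast: false                   # one cadence failing should not cancel the others
      matrix:
        cadence: [weekly, monthly, yearly]
    uses: ./.github/workflows/template-deploy-reporting-service.yml   # hypothetical
    with:
      cadence: ${{ matrix.cadence }}     # hypothetical input
```

The trade-off is that a single workflow shares one paths: filter, so a change to one cadence's code would redeploy all three unless the filtering moves into the jobs.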

docker build Instead of docker buildx build

The ACR push template sets up Buildx but then calls plain docker build, not leveraging multi-platform or advanced Buildx features.

No Artifact Sharing Between Test Jobs

Each test tier (unit, contract, E2E) checks out code, restores, and builds independently. A shared build artifact step could reduce duplication.

E2E Template Cosmos Emulator Wait Is Redundant

The service container already has a health check configured. The additional Wait for Cosmos DB Emulator step with 30 retries may be unnecessary.

deploy-core-infra.yml Has No Test Stages

This is an infra-only pipeline, so the absence of test stages is expected, but it could benefit from a Bicep test stage (e.g., the Bicep CLI's experimental bicep test command).

SBOM Generation Uses continue-on-error: true

Trivy SBOM and dependency graph steps silently swallow failures. Consider removing continue-on-error or adding failure notifications.
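If the step must stay non-blocking, one option is to keep continue-on-error but surface the failure as a warning annotation. This is a sketch: the step id and the script path are hypothetical stand-ins for the existing Trivy step.

```yaml
steps:
  - name: Generate SBOM
    id: sbom
    continue-on-error: true
    run: ./scripts/generate-sbom.sh      # hypothetical; stands in for the existing Trivy step
  - name: Flag SBOM failure
    # outcome holds the step result before continue-on-error is applied.
    if: steps.sbom.outcome == 'failure'
    run: echo "::warning title=SBOM generation failed::See the Generate SBOM step log."
```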

No Matrix Strategy for Similar Pipelines

18 nearly identical pipelines could potentially use a matrix strategy with a shared workflow, reducing maintenance burden.


Run Performance (last 7 days — Deploy Activity Api sample)
| Run | Conclusion | Duration | Attempt |
|---|---|---|---|
| §25777862846 | ✅ success | 12.4 min | 1 |
| §25750348087 | ✅ success | 82.4 min | 1 |
| §25532065233 | ✅ success | 11.5 min | 1 |

Typical pipeline duration: 11-13 minutes. Outlier at 82 minutes suggests runner queuing or a slow E2E test run.

Conventions Compliance
| Convention | Status | Notes |
|---|---|---|
| Action pinning (SHA) | ✅ Pass | All third-party actions pinned to SHA |
| Permissions scoped | ✅ Pass | Workflow-level permissions correctly set |
| Concurrency groups | ❌ Fail | 0/18 pipelines have concurrency blocks |
| Path filter self-reference | ❌ Fail | 8/18 pipelines missing self-reference |
| Timeout-minutes | ❌ Fail | 0/28 workflow files (18 pipelines + 10 templates) set timeout-minutes on any job |
| OIDC authentication | ✅ Pass | All Azure auth uses OIDC via azure/login |
| Template reference style | ✅ Pass | All use ./.github/workflows/template-*.yml |
| Secret handling | ✅ Pass | No secrets echoed or passed as CLI args |

Generated by CI Optimization Coach
