Increase tpa-pgsql-bee memory limits to fix OOMKills #23
The PostgreSQL pod (tpa-pgsql-bee) is repeatedly OOM-killed: with a 25GB database and 50 concurrent connections it consistently uses 900-980MB, leaving almost no headroom under the 1Gi limit. This causes SBOM upload failures with "Connection pool timed out" errors.

Increase the memory limit from 1Gi to 4Gi and the request from 512Mi to 2Gi. Also bump the CPU limit/request slightly, as observed usage (350m) was already above the old 250m request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Walkthrough
Resource allocations for the tpa-pgsql-bee container are increased in the Kubernetes Deployment manifest. CPU limits doubled from 1 to 2, memory limits increased from 1Gi to 4Gi, CPU requests doubled from 250m to 500m, and memory requests increased from 512Mi to 2Gi.
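For reference, the values described in the walkthrough would look roughly like the fragment below. This is a sketch reconstructed from the summary, not copied from the diff: the key order and the quoting of the CPU limit are assumptions, and the surrounding Deployment fields are omitted.

```yaml
# Sketch of the tpa-pgsql-bee container's resources block after this change.
# Values come from the walkthrough; layout and quoting are assumptions.
resources:
  limits:
    cpu: "2"        # was 1
    memory: 4Gi     # was 1Gi
  requests:
    cpu: 500m       # was 250m; observed usage was about 350m
    memory: 2Gi     # was 512Mi
```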
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 3 passed
🧹 Nitpick comments (1)
components/trust-apps/tpa/infrastructure.yaml (1)
82-86: Add post-rollout guardrails to confirm right-sizing.
After deploy, monitor restart count and memory working set for tpa-pgsql-bee for a few days; if stable, consider a VPA recommendation loop to keep requests/limits data-driven over time.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/trust-apps/tpa/infrastructure.yaml` around lines 82 - 86, After deploying the resource with the current cpu/memory limits and requests for tpa-pgsql-bee, add a post-rollout guardrail: monitor the Pod/Deployment tpa-pgsql-bee for restartCount and container_memory_working_set_bytes for several days and create alerting rules (e.g., prometheus alerts) for unusual restarts or sustained memory pressure; if metrics remain stable for the observation window, enable a VPA recommendation loop to propose adjustments to the cpu/memory requests/limits and automate a safe rollout (dry-run VPA or PR-based changes) to keep size data-driven over time.
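The guardrails suggested above could be sketched as two manifests: a Prometheus alerting rule and a recommendation-only VPA. This is a minimal sketch, not part of the PR. It assumes the Prometheus Operator, kube-state-metrics, and the VPA CRDs are installed in the cluster, that the Deployment is named tpa-pgsql-bee, and that the object names, alert thresholds, and time windows shown here (all hypothetical) suit the workload.

```yaml
# Assumption: Prometheus Operator is installed (PrometheusRule CRD).
# Rule names, thresholds, and windows are illustrative, not from the PR.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tpa-pgsql-bee-guardrails   # hypothetical name
spec:
  groups:
    - name: tpa-pgsql-bee.rules
      rules:
        # Fires if any container in the pod restarted during the last hour.
        - alert: TpaPgsqlBeeRestarting
          expr: increase(kube_pod_container_status_restarts_total{pod=~"tpa-pgsql-bee.*"}[1h]) > 0
          labels:
            severity: warning
        # Fires when the working set stays above 80% of the memory limit.
        - alert: TpaPgsqlBeeMemoryPressure
          expr: |
            max by (pod) (container_memory_working_set_bytes{pod=~"tpa-pgsql-bee.*", container!=""})
              /
            max by (pod) (kube_pod_container_resource_limits{pod=~"tpa-pgsql-bee.*", resource="memory"})
              > 0.8
          for: 15m
          labels:
            severity: warning
---
# Assumption: the Vertical Pod Autoscaler CRDs are installed.
# updateMode "Off" only publishes recommendations; it never evicts pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: tpa-pgsql-bee-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tpa-pgsql-bee            # assumed Deployment name
  updatePolicy:
    updateMode: "Off"
```

Keeping the VPA in recommendation-only mode matches the "dry-run VPA or PR-based changes" approach: its suggestions can be read from the object's status and applied through a reviewed change to the manifest rather than automatically.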