
fix: add auth self-healing after repeated unlock failures#131

Merged
Lerentis merged 4 commits into Lerentis:main from pwojciechowski:fix/auth-recovery-threshold on Apr 1, 2026

Conversation

@pwojciechowski

Contributor

Summary

This PR adds bounded auth self-healing to the Bitwarden CRD operator and documents the new tuning option in Helm/root docs.

Problem

In some environments, the operator can enter a persistent auth failure loop:

  • You are not logged in.
  • Failed to unlock vault
  • repeated update_managed_secret failures

When this happens, periodic retries continue but may not recover without a manual pod restart.

What changed

Auth recovery logic

  • Added consecutive auth failure tracking in bitwarden_signin.
  • Added configurable threshold via env var:
    • BW_AUTH_FAILURE_THRESHOLD (default: 3)
  • When threshold is reached, operator now performs recovery:
    1. clears BW_SESSION
    2. runs bw logout (best effort)
    3. removes local Bitwarden CLI cache file ~/.config/Bitwarden CLI/data.json
    4. retries login + unlock
  • Counter resets after successful authentication.
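The counter-and-threshold flow described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the operator's actual `bitwarden_signin`: `AuthRecovery`, `sign_in`, `run_logout` and `best_effort_logout` are hypothetical names, and `sign_in` stands in for the real login + unlock sequence, returning True on success. Only the env var default (3), the `BW_SESSION` variable, `bw logout` and the cache path come from the PR description.

```python
import os
import subprocess
from pathlib import Path
from typing import Callable

# Cache file path as listed in the PR description.
DEFAULT_DATA_FILE = Path.home() / ".config" / "Bitwarden CLI" / "data.json"


def best_effort_logout() -> None:
    """Hypothetical helper: run `bw logout`, ignoring a non-zero exit or a missing binary."""
    try:
        subprocess.run(["bw", "logout"], capture_output=True, check=False)
    except FileNotFoundError:
        pass


class AuthRecovery:
    """Hypothetical helper: counts consecutive auth failures and resets
    local CLI state once the threshold is reached."""

    def __init__(
        self,
        sign_in: Callable[[], bool],
        threshold: int = 3,
        data_file: Path = DEFAULT_DATA_FILE,
        run_logout: Callable[[], None] = best_effort_logout,
    ) -> None:
        self.sign_in = sign_in          # assumed: login + unlock, True on success
        self.threshold = threshold
        self.data_file = Path(data_file)
        self.run_logout = run_logout
        self.failures = 0
        self.recoveries = 0

    def recover(self) -> None:
        self.recoveries += 1
        os.environ.pop("BW_SESSION", None)      # 1. clear BW_SESSION
        self.run_logout()                       # 2. bw logout (best effort)
        self.data_file.unlink(missing_ok=True)  # 3. remove data.json if present

    def attempt(self) -> bool:
        if self.sign_in():
            self.failures = 0                   # success resets the counter
            return True
        self.failures += 1                      # a failed login or unlock both count
        if self.failures < self.threshold:
            return False                        # below threshold: just retry next cycle
        self.recover()
        if self.sign_in():                      # 4. retry login + unlock
            self.failures = 0
            return True
        return False
```

Passing `sign_in` and `run_logout` in as callables keeps the threshold behaviour testable without a real `bw` binary.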

Documentation and chart values

  • Added BW_AUTH_FAILURE_THRESHOLD to:
    • root README environment variable table + example
    • chart README example
    • chart values.yaml env snippet comments

Why this approach

  • Avoids brittle stderr-string matching for one specific error message.
  • Uses deterministic recovery based on repeated auth/unlock failures.
  • Improves resilience across multiple poisoned-state scenarios (stale session, broken local CLI state, etc.).

Files changed

  • src/bitwardenCrdOperator.py
  • tests/test_bitwarden_signin_recovery.py
  • README.md
  • charts/bitwarden-crd-operator/values.yaml
  • charts/bitwarden-crd-operator/README.md

Tests

Added unit tests for:

  • success resets failure counter
  • below-threshold failures do not trigger recovery
  • threshold-triggered recovery path
  • recovery clears session + cache file
  • unlock failures count toward auth failures

Local verification run:
ruff check src/bitwardenCrdOperator.py tests/test_bitwarden_signin_recovery.py
python -m unittest discover -s tests
python -m compileall src

Lerentis left a comment

Owner


Hi @pwojciechowski ,
Thanks for the PR. I have some minor concerns; once those are addressed, it can be merged.
EDIT: Please make sure to write a changelog in the Chart.yaml and bump the versions there

cease_continuous_run = threading.Event()

class ScheduleThread(threading.Thread):
    @classmethod

Owner


Please add back this decorator. We should keep this thread confined to its own object.

Contributor Author


Good catch. I restored @classmethod on ScheduleThread.run.
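For context, the `ScheduleThread` fragment in this thread resembles the `run_continuously` recipe from the `schedule` library's documentation: a `threading.Thread` subclass is defined inside a function, and its `run` is a classmethod that loops until a closed-over `Event` is set. Below is a stdlib-only sketch of that pattern, not the operator's code; `tick` is a hypothetical stand-in for `schedule.run_pending()`, and `daemon=True` is an addition of this sketch.

```python
import threading
import time
from typing import Callable


def run_continuously(tick: Callable[[], None], interval: float = 1.0) -> threading.Event:
    """Call `tick` every `interval` seconds on a background thread.

    Returns an Event; set it to stop the loop.
    """
    cease_continuous_run = threading.Event()

    class ScheduleThread(threading.Thread):
        @classmethod
        def run(cls):
            # classmethod: the loop only uses closure state, never instance state
            while not cease_continuous_run.is_set():
                tick()
                time.sleep(interval)

    # daemon=True (added in this sketch) so the thread never blocks interpreter exit
    ScheduleThread(daemon=True).start()
    return cease_continuous_run
```

Because `Thread.start()` invokes `run` through the instance, a classmethod `run` still works; the stop flag lives in the enclosing function, so each call gets its own thread and its own Event.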

Comment thread src/bitwardenCrdOperator.py Outdated
configured_threshold = os.environ.get(
    "BW_AUTH_FAILURE_THRESHOLD", str(AUTH_FAILURE_THRESHOLD)
)
try:

Owner


This try/except block seems to be used to manage program flow. Please refactor it into an if/else statement and don't use try/except for that.

Contributor Author


Agreed. I refactored the auth threshold/signin path to explicit return-value and if/else logic. Rookie mistake ;)
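A sketch of reading the threshold with explicit if/else checks instead of try/except, in the spirit of the review request. `read_auth_failure_threshold` is a hypothetical name, and falling back to the default on an empty, non-numeric or non-positive value is an assumption; the merged code may treat invalid input differently.

```python
import os
from typing import Mapping

AUTH_FAILURE_THRESHOLD = 3  # default documented in the PR


def read_auth_failure_threshold(environ: Mapping[str, str] = os.environ) -> int:
    """Return BW_AUTH_FAILURE_THRESHOLD as a positive int, else the default."""
    raw = environ.get("BW_AUTH_FAILURE_THRESHOLD", "").strip()
    # isdecimal() rejects "", "-2", "abc" and "1.5", so int() below cannot raise
    if raw.isdecimal() and int(raw) > 0:
        return int(raw)
    return AUTH_FAILURE_THRESHOLD
```

Validating before converting means `int()` is only called on strings it is guaranteed to accept, so no exception handling is needed for control flow.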

Comment thread src/bitwardenCrdOperator.py Outdated
    auth_failures = 0
    logger.info("Authentication recovery succeeded")
except BitwardenCommandException as exc:
    logger.error(f"Authentication recovery failed: {exc}")

Owner


Shouldn't we exit in this case and let Kubernetes retry with a fresh run?

Contributor Author


Yes, implemented: if auth recovery still fails after threshold + recovery attempt, the operator now logs the failure and exits with sys.exit(1) so Kubernetes restarts it cleanly.
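The exit-on-failed-recovery behaviour can be sketched as below. `handle_auth_result`, its `recovery_attempted` flag and the logger name are hypothetical; the sketch only illustrates the decision described in the reply. A non-zero exit ends the container process, and with the usual `restartPolicy: Always` of a Deployment, Kubernetes then starts a fresh container.

```python
import logging
import sys

logger = logging.getLogger("bitwarden-crd-operator")  # hypothetical logger name


def handle_auth_result(authenticated: bool, recovery_attempted: bool) -> None:
    """Exit the process if authentication still fails after a recovery attempt."""
    if authenticated:
        return
    if recovery_attempted:
        # Threshold reached and recovery did not help: give up on this process.
        logger.error("Authentication recovery failed; exiting so Kubernetes restarts the container")
        sys.exit(1)
    # Below threshold: keep running and retry on the next scheduled cycle.
    logger.warning("Authentication failed; retrying on the next cycle")
```

Exiting rather than looping forever hands the "fresh run" to Kubernetes, which also resets any in-memory state the operator might still hold.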

pwojciechowski requested a review from Lerentis on March 30, 2026 16:56
@Lerentis

Owner

@pwojciechowski Please make sure to write a changelog in the Chart.yaml and bump the versions there

@pwojciechowski

Contributor Author

@pwojciechowski Please make sure to write a changelog in the Chart.yaml and bump the versions there

Done. Chart.yaml updated.

Lerentis left a comment

Owner


LGTM

Thanks again for sending this PR

Lerentis merged commit 4e9fd81 into Lerentis:main on Apr 1, 2026
2 checks passed