
fix(security): validate clone host exactly across providers and block /help_docs repo overrides (#2445) #2450

Draft
naorpeled wants to merge 1 commit into main from fix/2445-help-docs-clone-url-validation

Conversation

@naorpeled

Member

Summary

Fixes #2445.

Git providers validated the clone target with a substring check (e.g. github_com not in repo_url_to_clone) before embedding an auth token into the clone URL/header. A host that merely contained the allowed host string — e.g. github.com.attacker.tld/o/r or https://github.com@attacker.tld/o/r — passed validation, so the token was sent to the attacker host. Via /help_docs, an untrusted PR commenter could set --pr_help_docs.repo_url=... and exfiltrate GITHUB_TOKEN.
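To make the failure mode concrete, here is a minimal sketch comparing a substring check with an exact comparison of the parsed hostname. The helper names `substring_check` and `exact_host_check` are illustrative only and do not appear in PR-Agent.

```python
from urllib.parse import urlparse

ALLOWED_HOST = "github.com"

candidates = [
    "https://github.com/o/r",               # legitimate
    "https://github.com.attacker.tld/o/r",  # lookalike host
    "https://github.com@attacker.tld/o/r",  # allowed host smuggled in as userinfo
]


def substring_check(url: str) -> bool:
    # Weak pattern: passes whenever the allowed host appears anywhere in the URL.
    return ALLOWED_HOST in url


def exact_host_check(url: str) -> bool:
    # Compare the parsed hostname, which excludes userinfo and port.
    return (urlparse(url).hostname or "").lower() == ALLOWED_HOST


for url in candidates:
    print(f"{url}: substring={substring_check(url)} exact={exact_host_check(url)}")
```

All three URLs pass the substring check, while only the first passes the exact hostname comparison.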

The same weak pattern existed in all five providers (GitHub, GitLab, Gitea, Bitbucket Cloud, Bitbucket Server), each leaking its own token, so this is fixed as a class rather than a single call site.

Changes

Sink — new shared validator GitProvider._validate_clone_url_and_extract_path(repo_url, allowed_host):

  • requires the URL host to match the configured host exactly (case-insensitive)
  • rejects embedded credentials (userinfo) and non-http(s) schemes
  • rejects paths containing whitespace (argument-injection guard)
  • returns only the path; each provider rebuilds the clone URL from its trusted authority (host[:port]), so the token can never reach a different host, and self-hosted ports/context paths are preserved
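The rules listed above could be sketched roughly as follows. This is a standalone approximation written for illustration, not the actual body of `GitProvider._validate_clone_url_and_extract_path`; the function name without the leading underscore and the exact order of the checks are assumptions.

```python
from typing import Optional
from urllib.parse import urlparse


def validate_clone_url_and_extract_path(repo_url: str, allowed_host: str) -> Optional[str]:
    """Return the URL path if repo_url targets allowed_host exactly, else None."""
    if not repo_url or not allowed_host:
        return None
    try:
        parsed = urlparse(repo_url)
    except ValueError:
        # urlparse raises on some malformed authorities (e.g. broken IPv6 brackets).
        return None
    # Only http(s) clone URLs are accepted.
    if parsed.scheme not in ("http", "https"):
        return None
    # Reject embedded credentials (userinfo), e.g. https://github.com@attacker.tld/...
    if parsed.username is not None or parsed.password is not None:
        return None
    # Exact, case-insensitive host match; the port is not part of hostname.
    if (parsed.hostname or "").lower() != allowed_host.lower():
        return None
    path = parsed.path
    # A repository path is required.
    if not path or path == "/":
        return None
    # Whitespace in the path could be re-split into extra git arguments.
    if any(ch.isspace() for ch in path):
        return None
    return path
```

A caller would then discard everything except the returned path and prepend its own configured scheme and `host[:port]`, so no part of the untrusted authority survives into the final clone URL.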

Applied in GitHub, GitLab, Gitea, Bitbucket Cloud and Bitbucket Server. For Bitbucket Server, the git command is now built as an argv list instead of shlex.split of an f-string, so the repo URL can't be re-parsed into extra git arguments (-c ... injection against the bearer-token clone).
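A small sketch of why the argv list matters. The f-string below is an assumed reconstruction of the old pattern, not the original source line, and the host, destination folder and token are placeholders.

```python
import shlex

bearer_token = "TEST_TOKEN"  # placeholder, not a real credential
dest_folder = "/tmp/dest"    # placeholder destination
# Attacker-influenced URL carrying an extra git option after a space.
repo_url = "https://bitbucket.example.com/scm/p/r.git -c core.sshCommand=evil"

# Old pattern (assumed shape): interpolate into one string, then split it.
split_args = shlex.split(
    f"git clone -c http.extraHeader='Authorization: Bearer {bearer_token}' "
    f"--filter=blob:none --depth 1 {repo_url} {dest_folder}"
)

# New pattern: each value occupies exactly one argv slot.
list_args = [
    "git", "clone",
    "-c", f"http.extraHeader=Authorization: Bearer {bearer_token}",
    "--filter=blob:none", "--depth", "1",
    repo_url, dest_folder,
]

print(split_args.count("-c"))  # the injected "-c" became a second git option
print(list_args.count("-c"))   # only the intended "-c" is present
```

With the list form, the malicious string reaches git as a single (invalid) URL argument instead of being re-parsed into a second `-c` option; the whitespace check in the shared validator rejects such input even earlier.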

Source — runtime override blocklist: added pr_help_docs.repo_url, pr_help_docs.repo_default_branch and pr_help_docs.docs_path to the forbidden CLI args, so untrusted comment commands can no longer override the clone target (these must come from maintainer-controlled config).
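As an illustration of the blocklist idea, the sketch below checks `--key=value` arguments from a comment command against the three new keys. It is not the code in `pr_agent/algo/cli_args.py` (which stores its list base64-encoded); the function name `first_forbidden_arg` and the `__` separator normalization are assumptions.

```python
from typing import Iterable, Optional

# The three keys this PR adds to the forbidden list.
FORBIDDEN_KEYS = (
    "pr_help_docs.repo_url",
    "pr_help_docs.repo_default_branch",
    "pr_help_docs.docs_path",
)


def first_forbidden_arg(args: Iterable[str]) -> Optional[str]:
    """Return the first forbidden setting key found in args, or None."""
    for arg in args:
        if not arg.startswith("--"):
            continue
        key = arg[2:].split("=", 1)[0].strip().lower()
        # Assumption: treat a double underscore as an alternate section separator.
        key = key.replace("__", ".")
        if key in FORBIDDEN_KEYS:
            return key
    return None


print(first_forbidden_arg(["--pr_help_docs.repo_url=https://attacker.tld/o/r"]))
```

A command handler would refuse to run when this returns a key, leaving the clone target to come from maintainer-controlled configuration only.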

Tests

tests/unittest/test_github_clone_url_validation.py — 20 regression tests covering lookalike hosts, userinfo injection, scheme restrictions, whitespace/arg-injection, missing path, self-hosted hosts and non-standard ports, all five providers, and the new blocklist entries. Full unit suite passes (536 tests).

Notes / follow-up

  • The token is still embedded in the clone URL for GitHub/GitLab/Gitea/Bitbucket Cloud, but now only ever sent to the validated host. Switching to credential-helper/header auth (so no token is ever placed in the URL) is reasonable hardening and can be a separate follow-up.
  • Gating /help_docs on author_association is a property of the consumer's GitHub Actions workflow, not PR-Agent code, so it is out of scope here.
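For the header-auth follow-up mentioned in the first note, one possible shape is sketched below. It is an assumption-laden illustration rather than a proposal for the actual providers: the function name, the `x-access-token` username and the use of a Basic header are all assumptions, and each provider may expect a different header format.

```python
import base64
from typing import List
from urllib.parse import urlparse


def build_header_auth_clone_argv(
    clone_url: str, dest_folder: str, token: str, username: str = "x-access-token"
) -> List[str]:
    """Build a git clone argv that passes the token in a header, not in the URL."""
    parsed = urlparse(clone_url)
    # Scope the header to the clone origin so it is not sent to other hosts.
    origin = f"{parsed.scheme}://{parsed.netloc}/"
    basic = base64.b64encode(f"{username}:{token}".encode("utf-8")).decode("ascii")
    return [
        "git",
        # Top-level "-c" applies only to this invocation and is not written to .git/config.
        "-c", f"http.{origin}.extraHeader=Authorization: Basic {basic}",
        "clone", "--filter=blob:none", "--depth", "1",
        clone_url, dest_folder,
    ]


argv = build_header_auth_clone_argv("https://github.com/o/r.git", "/tmp/dest", "TEST_TOKEN")
print(argv[-2])  # the clone URL carries no credentials
```

This keeps the token out of the URL and out of the stored remote, but the header is still visible in the process argument list, so a credential helper may be preferable where that matters.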

… /help_docs repo overrides (#2445)

Providers validated the git clone target with a substring check (e.g.
`github_com not in repo_url_to_clone`) before embedding an auth token into the
clone URL/header. A host that merely *contained* the allowed host string, such
as `github.com.attacker.tld/o/r` or `https://github.com@attacker.tld/o/r`,
passed validation and the token was sent to the attacker host. The same weak
pattern existed in the GitHub, GitLab, Gitea, Bitbucket Cloud and Bitbucket
Server providers.

Add a shared `GitProvider._validate_clone_url_and_extract_path` helper that:
- requires the URL host to match the configured host exactly (case-insensitive)
- rejects embedded credentials (userinfo) and non-http(s) schemes
- rejects paths containing whitespace (argument-injection guard)
- returns only the path, so each provider rebuilds the clone URL from its
  trusted authority (host[:port]) and the token can never reach another host

Apply it in all five providers. For Bitbucket Server, also build the git argv
as a list instead of shlex-splitting an f-string, so the repo URL can no longer
be re-parsed into extra git arguments.

Additionally, block untrusted comment commands from overriding the clone target
by adding pr_help_docs.repo_url, pr_help_docs.repo_default_branch and
pr_help_docs.docs_path to the forbidden CLI args list.

Adds regression tests covering lookalike hosts, userinfo injection, scheme
restrictions, whitespace/arg-injection, self-hosted hosts/ports, all five
providers, and the new blocklist entries.
@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented Jun 16, 2026

Contributor

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (1) 📎 Requirement gaps (1)

Context used



Action required

1. GitLab clone URL embeds token 📎 Requirement gap ⛨ Security
Description
The code constructs clone URLs for multiple git providers (GitLab, GitHub, Gitea, and Bitbucket
Cloud) by embedding provider access tokens directly in the URL (e.g., https://oauth2:<token>@... /
https://<token>@... / https://x-token-auth:<token>@...). This violates the requirement to avoid
placing secrets in clone URLs where they can leak via logs, process arguments, or stored git remote
strings.
Code

pr_agent/git_providers/gitlab_provider.py[R977-979]

+        # Rebuild from the trusted authority (host[:port]) so the token cannot reach a different host.
+        scheme = (base_parsed.scheme or "https") + "://"
+        clone_url = f"{scheme}oauth2:{access_token}@{base_parsed.netloc}{repo_full_name}"
Evidence
PR Compliance ID 10 prohibits token-in-URL cloning and embedding secrets into clone URLs. The cited
implementations explicitly format provider tokens into the clone_url string, including GitLab’s
oauth2:{access_token}@..., GitHub’s {github_token}@..., Gitea’s {gitea_token}@..., and
Bitbucket Cloud’s x-token-auth:{token}@..., demonstrating that secrets are being placed directly
into the URL in each case.

Do not embed GITHUB_TOKEN (or other secrets) into git clone URLs
pr_agent/git_providers/gitlab_provider.py[958-980]
pr_agent/git_providers/github_provider.py[1218-1224]
pr_agent/git_providers/gitea_provider.py[724-734]
pr_agent/git_providers/bitbucket_provider.py[633-645]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Several provider implementations build git clone URLs by embedding authentication tokens directly in the URL (GitLab `oauth2:<token>@...`, GitHub/Gitea `<token>@...`, Bitbucket Cloud `x-token-auth:<token>@...`), which violates compliance requirements.

## Issue Context
Compliance (PR Compliance ID 10) requires that tokens/secrets are not embedded in URLs used for git operations because they can be exposed via logs, process lists/args, or persisted git remote strings. Prefer authentication approaches that keep secrets out of URLs, such as git credential helpers or `http.extraHeader`-based auth.

## Fix Focus Areas
- pr_agent/git_providers/gitlab_provider.py[977-980]
- pr_agent/git_providers/github_provider.py[1220-1223]
- pr_agent/git_providers/gitea_provider.py[731-734]
- pr_agent/git_providers/bitbucket_provider.py[633-645]



2. Tests contain token-like literals 📘 Rule violation ⛨ Security
Description
New unit tests include hardcoded token-like strings (e.g., ghp_SECRETTOKEN, glpat-TOKEN) which
can trigger secret scanners or be mistaken for real credentials. This violates the requirement to
avoid hardcoded secrets/tokens in the repository.
Code

tests/unittest/test_github_clone_url_validation.py[R11-18]

+def _make_stub(token="ghp_SECRETTOKEN", base_url_html="https://github.com", deployment_type="user"):
+    """Build a real GithubProvider instance without running __init__, setting only the
+    attributes used by _prepare_clone_url_with_token (inherited base helpers stay available)."""
+    stub = object.__new__(GithubProvider)
+    stub.auth = SimpleNamespace(token=token)
+    stub.base_url_html = base_url_html
+    stub.deployment_type = deployment_type
+    return stub
Evidence
PR Compliance ID 12 forbids committing secrets/tokens in code/tests. The added test file hardcodes
multiple token-like strings, including a GitHub PAT-like prefix ghp_.

AGENTS.md: No Hardcoded Secrets or Tokens in Code or Configuration
tests/unittest/test_github_clone_url_validation.py[11-18]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The new tests hardcode token-like values (e.g., GitHub `ghp_...` style), which violates the no-hardcoded-secrets policy and may trip secret scanning.

## Issue Context
Even in tests, committed token-like strings should be avoided; use clearly non-secret placeholders (e.g., `TEST_TOKEN`), fixtures, or environment-provided dummy values.

## Fix Focus Areas
- tests/unittest/test_github_clone_url_validation.py[11-18]
- tests/unittest/test_github_clone_url_validation.py[31-45]
- tests/unittest/test_github_clone_url_validation.py[97-134]




Remediation recommended

3. Hardcoded GitHub clone scheme 🐞 Bug ☼ Reliability
Description
GithubProvider._prepare_clone_url_with_token always prefixes the clone URL with "https://" while the
PR removed the previous validation that the configured base URL uses that scheme. If
GITHUB.BASE_URL/base_url_html is configured as http:// for an internal GitHub Enterprise, host
validation will pass but cloning will still use https and fail.
Code

pr_agent/git_providers/github_provider.py[R1211-1223]

+        # Determine the single host we are allowed to clone from (e.g. 'github.com' or 'github.<org>.com')
+        # and validate the requested url against it, rather than a weak substring check (issue #2445).
+        base_parsed = urlparse(github_base_url)
+        repo_full_name = self._validate_clone_url_and_extract_path(repo_url_to_clone, base_parsed.hostname)
        if not repo_full_name:
-            get_logger().error(f"url to clone: {repo_url_to_clone} is malformed")
            return None

+        # Always rebuild the clone url from the trusted authority (host[:port]) so the token can never
+        # reach a different host, while preserving any non-standard port of the configured base url.
        clone_url = scheme
        if self.deployment_type == 'app':
            clone_url += "git:"
-        clone_url += f"{github_token}@{github_com}{repo_full_name}"
+        clone_url += f"{github_token}@{base_parsed.netloc}{repo_full_name}"
Evidence
The provider constructs clone URLs with a hardcoded scheme = "https://" but parses the host/netloc
from base_url_html, which is derived from the configurable GITHUB.BASE_URL setting and may be
http://... in some deployments. Since scheme checking was removed, such configurations now proceed
to rebuild a clone URL that forces HTTPS and can break cloning.

pr_agent/git_providers/github_provider.py[40-42]
pr_agent/git_providers/github_provider.py[1197-1224]
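One way to address this is to take the scheme from the configured base URL, mirroring the `(base_parsed.scheme or "https")` pattern used in the GitLab change. The sketch below is a simplified illustration: `rebuild_clone_url` is a hypothetical helper, and it assumes the base URL always includes a scheme.

```python
from urllib.parse import urlparse


def rebuild_clone_url(base_url: str, token: str, repo_path: str) -> str:
    """Rebuild a clone URL from the configured base URL and a validated path."""
    base = urlparse(base_url)
    # Follow the configured scheme; fall back to https only if none was parsed.
    scheme = base.scheme or "https"
    # netloc keeps any non-standard port of the configured base URL.
    return f"{scheme}://{token}@{base.netloc}{repo_path}"


print(rebuild_clone_url("http://ghe.internal:8080", "TEST_TOKEN", "/o/r"))
```

Whether plain `http://` should be accepted at all when a token is attached is a separate policy question for the maintainers.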

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`GithubProvider._prepare_clone_url_with_token` hardcodes the clone URL scheme to `https://` but no longer enforces that the configured `base_url_html` uses HTTPS. This can cause clones to fail on self-hosted deployments that intentionally use `http://`.

### Issue Context
The refactor now uses `urlparse(github_base_url)` for host validation and netloc reconstruction, but still uses a fixed `scheme = "https://"` when rebuilding the clone URL.

### Fix Focus Areas
- pr_agent/git_providers/github_provider.py[1197-1224]




Informational

4. Unused shlex import 🐞 Bug ⚙ Maintainability
Description
BitbucketServerProvider no longer uses shlex.split after switching to an argv list, but the module
still imports shlex. This leaves dead code and can break builds in environments enforcing
unused-import checks.
Code

pr_agent/git_providers/bitbucket_server_provider.py[R571-578]

+        # Build argv as a list (not shlex.split of an f-string) so neither the token nor the repo url
+        # can be re-parsed into extra git arguments (issue #2445 hardening).
+        cli_args = [
+            "git", "clone",
+            "-c", f"http.extraHeader=Authorization: Bearer {bearer_token}",
+            "--filter=blob:none", "--depth", "1",
+            repo_url, dest_folder,
+        ]
Evidence
After the refactor, the only remaining occurrence of shlex is the import and a comment; the actual
clone command is built as a Python list, so shlex is unused.

pr_agent/git_providers/bitbucket_server_provider.py[1-12]
pr_agent/git_providers/bitbucket_server_provider.py[565-578]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`pr_agent/git_providers/bitbucket_server_provider.py` still imports `shlex`, but the refactor replaced `shlex.split(...)` with an argv list and no longer references `shlex`.

### Issue Context
This is now an unused import (only referenced in a comment), which can fail lint/CI or confuse future maintainers.

### Fix Focus Areas
- pr_agent/git_providers/bitbucket_server_provider.py[1-12]
- pr_agent/git_providers/bitbucket_server_provider.py[565-578]




@github-actions github-actions Bot added the bug label Jun 16, 2026
@qodo-free-for-open-source-projects

Contributor

PR Summary by Qodo

Fix clone URL host validation across providers and block help_docs overrides
🐞 Bug fix 🧪 Tests 🕐 20-40 Minutes


Walkthroughs

Description
• Enforce exact clone host validation to prevent auth token exfiltration via lookalike URLs.
• Rebuild clone URLs from trusted configured authorities across GitHub/GitLab/Gitea/Bitbucket.
• Block /help_docs runtime overrides of repo/branch/path from untrusted comment commands.
Diagram
graph TD
  PRC["Untrusted comment command (/help_docs)"] --> CLI["CliArgs.validate_user_args"] --> PRV["Provider _prepare_clone_url_with_token"] --> VAL["GitProvider._validate_clone_url_and_extract_path"] --> URL["Rebuilt clone URL (trusted host[:port])"] --> GIT["git clone (argv list / subprocess)"] --> HOST["Remote git host (configured)"]
High-Level Assessment

The following are alternative approaches to this PR:

1. Use header/credential-helper auth (avoid tokens in URLs)
  • ➕ Tokens never appear in clone URLs (reduces log/telemetry leakage risk)
  • ➕ Consistent approach across providers (e.g., git -c http.extraHeader=...)
  • ➖ More provider-specific behavior/compatibility work (especially GitHub App vs PAT flows)
  • ➖ May require extra git config handling for different hosts and redirects
2. Centralize cloning into a single hardened clone helper
  • ➕ Single place to enforce argv-only subprocess usage, URL validation, and logging redaction
  • ➕ Reduces duplicated provider logic over time
  • ➖ Larger refactor touching provider interfaces and tests
  • ➖ Higher short-term risk than the current targeted fix

Recommendation: Merge this PR’s approach: exact hostname validation + rebuilding URLs from trusted configured authority is the correct minimal-risk remediation across providers. Consider a follow-up to migrate providers away from embedding tokens in clone URLs (prefer headers/credential helper), but that’s a larger compatibility change and is reasonable to do separately.


File Changes

Bug fix (6)
cli_args.py Block /help_docs repo/branch/path overrides in forbidden CLI args +1/-1

• Extends the base64-encoded forbidden argument list to include pr_help_docs.repo_url, pr_help_docs.repo_default_branch, and pr_help_docs.docs_path. Prevents untrusted comment commands from overriding the help-docs clone target and related settings at runtime.

pr_agent/algo/cli_args.py


bitbucket_provider.py Harden Bitbucket Cloud clone URL validation and rebuilding +7/-9

• Replaces substring-based host validation with a shared structural validator enforcing an exact bitbucket.org hostname match. Rebuilds the clone URL from the trusted host + validated path while embedding the appropriate token.

pr_agent/git_providers/bitbucket_provider.py


bitbucket_server_provider.py Validate Bitbucket Server clone targets and prevent git argv injection +23/-7

• Validates the clone URL host exactly against the configured Bitbucket Server base URL and rebuilds the returned clone URL using the trusted authority. Switches git invocation to an argv list to prevent repo URL/token re-parsing into additional git arguments.

pr_agent/git_providers/bitbucket_server_provider.py


gitea_provider.py Use shared validator and rebuild Gitea clone URLs from trusted authority +8/-13

• Removes substring checks against base_url and instead validates repo_url_to_clone via the shared helper against the configured Gitea hostname. Reconstructs clone URL using base_url netloc (preserving ports) plus the validated path.

pr_agent/git_providers/gitea_provider.py


github_provider.py Enforce exact GitHub hostname match before embedding token in clone URL +8/-13

• Replaces substring host checks with structural URL parsing and exact hostname validation against base_url_html. Rebuilds the clone URL using the configured netloc (preserving non-standard ports) so the token cannot be sent to a different host.

pr_agent/git_providers/github_provider.py


gitlab_provider.py Validate GitLab clone host exactly and rebuild URL from configured host[:port] +12/-8

• Stops relying on the presence of "gitlab." in the URL and instead validates the requested clone URL host exactly against gitlab_url. Rebuilds the final clone URL from the trusted authority while embedding the OAuth/private token.

pr_agent/git_providers/gitlab_provider.py


Refactor (1)
git_provider.py Add shared clone URL validator with exact host match and scheme/userinfo checks +43/-0

• Introduces GitProvider._validate_clone_url_and_extract_path to parse and validate clone URLs: http(s) only, no embedded credentials, exact hostname match (case-insensitive), non-empty path, and no whitespace. Returns only the repository path so callers can rebuild the final clone URL from a trusted configured host.

pr_agent/git_providers/git_provider.py


Tests (1)
test_github_clone_url_validation.py Add regression tests for clone URL validation and help_docs override blocklist +172/-0

• Adds unit tests covering lookalike hosts, userinfo injection, scheme restrictions, missing paths, whitespace/argument-injection guards, and port preservation across GitHub/GitLab/Gitea/Bitbucket providers. Verifies new CLI arg blocklist entries prevent overriding /help_docs clone target settings.

tests/unittest/test_github_clone_url_validation.py



Comment on lines +977 to +979
# Rebuild from the trusted authority (host[:port]) so the token cannot reach a different host.
scheme = (base_parsed.scheme or "https") + "://"
clone_url = f"{scheme}oauth2:{access_token}@{base_parsed.netloc}{repo_full_name}"

Contributor


Action required

1. GitLab clone URL embeds token 📎 Requirement gap ⛨ Security

The code constructs clone URLs for multiple git providers (GitLab, GitHub, Gitea, and Bitbucket
Cloud) by embedding provider access tokens directly in the URL (e.g., https://oauth2:<token>@... /
https://<token>@... / https://x-token-auth:<token>@...). This violates the requirement to avoid
placing secrets in clone URLs where they can leak via logs, process arguments, or stored git remote
strings.
Agent Prompt
## Issue description
Several provider implementations build git clone URLs by embedding authentication tokens directly in the URL (GitLab `oauth2:<token>@...`, GitHub/Gitea `<token>@...`, Bitbucket Cloud `x-token-auth:<token>@...`), which violates compliance requirements.

## Issue Context
Compliance (PR Compliance ID 10) requires that tokens/secrets are not embedded in URLs used for git operations because they can be exposed via logs, process lists/args, or persisted git remote strings. Prefer authentication approaches that keep secrets out of URLs, such as git credential helpers or `http.extraHeader`-based auth.

## Fix Focus Areas
- pr_agent/git_providers/gitlab_provider.py[977-980]
- pr_agent/git_providers/github_provider.py[1220-1223]
- pr_agent/git_providers/gitea_provider.py[731-734]
- pr_agent/git_providers/bitbucket_provider.py[633-645]


Comment on lines +11 to +18
def _make_stub(token="ghp_SECRETTOKEN", base_url_html="https://github.com", deployment_type="user"):
    """Build a real GithubProvider instance without running __init__, setting only the
    attributes used by _prepare_clone_url_with_token (inherited base helpers stay available)."""
    stub = object.__new__(GithubProvider)
    stub.auth = SimpleNamespace(token=token)
    stub.base_url_html = base_url_html
    stub.deployment_type = deployment_type
    return stub

Contributor


Action required

2. Tests contain token-like literals 📘 Rule violation ⛨ Security

New unit tests include hardcoded token-like strings (e.g., ghp_SECRETTOKEN, glpat-TOKEN) which
can trigger secret scanners or be mistaken for real credentials. This violates the requirement to
avoid hardcoded secrets/tokens in the repository.
Agent Prompt
## Issue description
The new tests hardcode token-like values (e.g., GitHub `ghp_...` style), which violates the no-hardcoded-secrets policy and may trip secret scanning.

## Issue Context
Even in tests, committed token-like strings should be avoided; use clearly non-secret placeholders (e.g., `TEST_TOKEN`), fixtures, or environment-provided dummy values.

## Fix Focus Areas
- tests/unittest/test_github_clone_url_validation.py[11-18]
- tests/unittest/test_github_clone_url_validation.py[31-45]
- tests/unittest/test_github_clone_url_validation.py[97-134]


@naorpeled naorpeled marked this pull request as draft June 16, 2026 18:56
Development

Successfully merging this pull request may close these issues.

PR-Agent /help_docs allows untrusted commenters to override git clone target and exfiltrate GITHUB_TOKEN via weak hostname validation
