Skip to content

Fix SSRF redirect bypass in scraping fetches#6331

Open
theCyberTech wants to merge 3 commits into
mainfrom
fix/scrape-redirect-ssrf
Open

Fix SSRF redirect bypass in scraping fetches#6331
theCyberTech wants to merge 3 commits into
mainfrom
fix/scrape-redirect-ssrf

Conversation

@theCyberTech

@theCyberTech theCyberTech commented Jun 25, 2026

Copy link
Copy Markdown
Member

Summary

Fixes the worker-side SSRF redirect bypass in scraping and URL-backed RAG loaders by routing direct URL fetches through a redirect-aware safe HTTP helper.

Changes

  • Adds safe_get() to validate the initial URL and every redirect target before following redirects.
  • Replaces direct worker-side requests.get() calls in ScrapeWebsiteTool, ScrapeElementFromWebsiteTool, and URL-backed RAG loaders.
  • Adds regression tests proving a public-looking URL that redirects to 127.0.0.1 is blocked before the second request is made.

Validation

  • uv run pytest lib/crewai-tools/tests/utilities/test_safe_requests.py lib/crewai-tools/tests/utilities/test_safe_path.py lib/crewai-tools/tests/rag/test_webpage_loader.py lib/crewai-tools/tests/rag/test_docx_loader.py lib/crewai-tools/tests/rag/test_json_loader.py lib/crewai-tools/tests/rag/test_csv_loader.py lib/crewai-tools/tests/rag/test_xml_loader.py lib/crewai-tools/tests/rag/test_mdx_loader.py -q → 97 passed
  • uv run ruff check ... on touched files → passed
  • uv run mypy ... on touched files → passed

Notes

This closes the redirect-following bypass for worker-side direct fetches. Full DNS TOCTOU/IP pinning remains a separate lower-level hardening follow-up.

Summary by CodeRabbit

  • Security
    • Web content, PDF, DOCX, and documentation-site downloads now use safer URL fetching to reduce the risk of unsafe redirects and restricted-network access.
    • Redirects are handled with stricter safety checks, including limiting redirect chains and sanitizing credential-related data on cross-origin redirects.
  • Bug Fixes
    • Improved reliability and more consistent failure behavior when fetching remote content and following redirects.
  • Tests
    • Expanded coverage for redirect safety and credential/cookie handling during redirects.

Copilot AI review requested due to automatic review settings June 25, 2026 04:04
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The PR adds safe_get, a redirect-aware HTTP GET helper, and switches several loaders and website scraping tools to use it. It also adds DNS fixtures and tests covering private IP rejection, redirect validation, safe redirects, and redirect limits.

Changes

Redirect-safe HTTP fetching

Layer / File(s) Summary
safe_get redirect handling
lib/crewai-tools/src/crewai_tools/security/safe_requests.py
safe_get validates the initial URL, disables automatic redirects, validates redirect targets, caps redirect hops, and returns responses with redirect history.
RAG loader fetch calls
lib/crewai-tools/src/crewai_tools/rag/loaders/docs_site_loader.py, lib/crewai-tools/src/crewai_tools/rag/loaders/docx_loader.py, lib/crewai-tools/src/crewai_tools/rag/loaders/pdf_loader.py, lib/crewai-tools/src/crewai_tools/rag/loaders/utils.py, lib/crewai-tools/src/crewai_tools/rag/loaders/webpage_loader.py
The docs site, DOCX, PDF, URL utility, and webpage loader modules import safe_get and replace direct requests.get calls while keeping parsing, temporary-file, and error-handling flows unchanged.
Website scraping tools
lib/crewai-tools/src/crewai_tools/tools/scrape_element_from_website/scrape_element_from_website.py, lib/crewai-tools/src/crewai_tools/tools/scrape_website_tool/scrape_website_tool.py
The website scraping tools import safe_get and replace the prior validate_url plus requests.get fetch path in _run before HTML parsing and element extraction continue.
safe_get tests and DNS fixture
lib/crewai-tools/tests/rag/conftest.py, lib/crewai-tools/tests/utilities/test_safe_requests.py
The test fixture pins DNS for example hosts, and the new tests cover direct private-IP rejection, redirect target validation, safe relative redirects, and redirect limit failures.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main security fix: blocking SSRF redirect bypasses in scraping-related fetches.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/scrape-redirect-ssrf

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@corridor-security corridor-security Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Issues

  • Credential leakage across redirects
    The new safe_get() manually follows redirects but reuses the original headers and cookies for each redirected host. A user-controlled URL or open redirect can cause Authorization/Cookie values intended for one origin to be sent to an attacker-controlled public host.

Summary: This PR adds redirect-aware URL validation for scraping and URL-backed RAG fetches, reducing the direct SSRF redirect bypass but introducing credential forwarding risk during redirects.

Risk: Medium risk. The new public URL fetch helper crosses an external trust boundary and can leak configured authentication headers or cookies when following attacker-controlled cross-origin redirects.

Recommendations:

  • Strip Authorization, Cookie, and explicit cookies when the redirect target changes origin, or reject cross-origin redirects whenever credentials are present.
  • Add regression tests for redirects from an authenticated trusted origin to a different public host to ensure credentials are not forwarded.

Comment thread lib/crewai-tools/src/crewai_tools/security/safe_requests.py

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses an SSRF redirect bypass in worker-side scraping and URL-backed RAG loaders by introducing a redirect-aware HTTP helper that re-validates each redirect target before issuing the next request.

Changes:

  • Added safe_get() in crewai_tools.security.safe_requests to validate the initial URL and each redirect target (manual redirect handling with allow_redirects=False).
  • Replaced direct requests.get() usage with safe_get() across scraping tools and multiple RAG loaders.
  • Added regression tests and test DNS fixtures to ensure redirects to internal IPs are blocked before the second request.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
lib/crewai-tools/src/crewai_tools/security/safe_requests.py Adds safe_get() to validate URLs across redirects.
lib/crewai-tools/src/crewai_tools/tools/scrape_website_tool/scrape_website_tool.py Routes website fetch through safe_get() instead of requests.get().
lib/crewai-tools/src/crewai_tools/tools/scrape_element_from_website/scrape_element_from_website.py Routes element-scrape fetch through safe_get() instead of requests.get().
lib/crewai-tools/src/crewai_tools/rag/loaders/webpage_loader.py Uses safe_get() for webpage loading.
lib/crewai-tools/src/crewai_tools/rag/loaders/utils.py Uses safe_get() for shared URL-loading helper.
lib/crewai-tools/src/crewai_tools/rag/loaders/pdf_loader.py Uses safe_get() for PDF URL downloads.
lib/crewai-tools/src/crewai_tools/rag/loaders/docx_loader.py Uses safe_get() for DOCX URL downloads.
lib/crewai-tools/src/crewai_tools/rag/loaders/docs_site_loader.py Uses safe_get() for docs-site fetching.
lib/crewai-tools/tests/utilities/test_safe_requests.py Adds redirect-focused SSRF regression tests for safe_get().
lib/crewai-tools/tests/rag/conftest.py Adds autouse DNS fixture to stabilize URL validation in RAG tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/crewai-tools/src/crewai_tools/security/safe_requests.py
Comment thread lib/crewai-tools/src/crewai_tools/rag/loaders/docs_site_loader.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/crewai-tools/src/crewai_tools/rag/loaders/utils.py`:
- Line 37: The redirect handling in safe_get currently carries the original
headers across every hop, which can leak secrets to different domains. Update
safe_get in safe_requests.py to track the previous and next URL netlocs during
redirect resolution, and when the host changes, strip sensitive request headers
before the next requests.get call. Make sure the header cleanup covers
Authorization, Authorization-*, Cookie, and any other secret-bearing custom
headers used by callers such as the RAG loaders and scrape tools.

In `@lib/crewai-tools/src/crewai_tools/security/safe_requests.py`:
- Around line 19-25: The redirect handling in safe_requests.py is reusing
request_kwargs unchanged across every hop, which can leak Authorization/Cookie
state and ignore Set-Cookie responses. Update the redirect flow in the request
loop to use a requests.Session, or otherwise rebuild per-hop kwargs in a way
that strips sensitive headers on cross-origin redirects and carries cookies
forward safely; keep the logic localized around the current_url loop and the
request_kwargs/history/redirects_followed handling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1c30100e-5382-43c6-bf0a-38da30d36328

📥 Commits

Reviewing files that changed from the base of the PR and between 01fc389 and cd3fa72.

📒 Files selected for processing (10)
  • lib/crewai-tools/src/crewai_tools/rag/loaders/docs_site_loader.py
  • lib/crewai-tools/src/crewai_tools/rag/loaders/docx_loader.py
  • lib/crewai-tools/src/crewai_tools/rag/loaders/pdf_loader.py
  • lib/crewai-tools/src/crewai_tools/rag/loaders/utils.py
  • lib/crewai-tools/src/crewai_tools/rag/loaders/webpage_loader.py
  • lib/crewai-tools/src/crewai_tools/security/safe_requests.py
  • lib/crewai-tools/src/crewai_tools/tools/scrape_element_from_website/scrape_element_from_website.py
  • lib/crewai-tools/src/crewai_tools/tools/scrape_website_tool/scrape_website_tool.py
  • lib/crewai-tools/tests/rag/conftest.py
  • lib/crewai-tools/tests/utilities/test_safe_requests.py

Comment thread lib/crewai-tools/src/crewai_tools/rag/loaders/utils.py
Comment thread lib/crewai-tools/src/crewai_tools/security/safe_requests.py
@github-actions github-actions Bot added size/L and removed size/M labels Jun 25, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
lib/crewai-tools/src/crewai_tools/security/safe_requests.py (1)

38-48: 🔒 Security & Privacy | 🟠 Major

Strip non-header credentials before the next hop.

auth remains in request_kwargs after header/cookie stripping, causing credentials to be sent to cross-origin redirect targets. params is also replayed to redirect targets, which can leak initial-query secrets that should not persist across hops.

  • Remove auth on all redirect hops, and drop first-hop params before following any redirect.
 def _strip_cross_origin_credentials(request_kwargs: dict[str, Any]) -> dict[str, Any]:
     sanitized = {**request_kwargs}
     headers = sanitized.get("headers")
     if headers:
         sanitized["headers"] = {
             key: value
             for key, value in headers.items()
             if not _is_sensitive_header(str(key))
         }
     sanitized.pop("cookies", None)
+    sanitized.pop("auth", None)
     return sanitized
-        if not _same_origin(current_url, redirect_url):
-            request_kwargs = _strip_cross_origin_credentials(request_kwargs)
+        next_request_kwargs = {**request_kwargs}
+        next_request_kwargs.pop("params", None)
+        if not _same_origin(current_url, redirect_url):
+            next_request_kwargs = _strip_cross_origin_credentials(next_request_kwargs)
+        request_kwargs = next_request_kwargs

Applies to: 83–84 as well.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai-tools/src/crewai_tools/security/safe_requests.py` around lines 38
- 48, Update _strip_cross_origin_credentials in safe_requests.py so it also
removes non-header credentials before forwarding a redirected request: strip
auth on every cross-origin hop, and clear params as well as cookies so initial
query secrets are not replayed. Keep the existing header filtering logic, but
make sure the sanitized request_kwargs returned by
_strip_cross_origin_credentials no longer contains auth or params when building
the next request.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@lib/crewai-tools/src/crewai_tools/security/safe_requests.py`:
- Around line 38-48: Update _strip_cross_origin_credentials in safe_requests.py
so it also removes non-header credentials before forwarding a redirected
request: strip auth on every cross-origin hop, and clear params as well as
cookies so initial query secrets are not replayed. Keep the existing header
filtering logic, but make sure the sanitized request_kwargs returned by
_strip_cross_origin_credentials no longer contains auth or params when building
the next request.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e695953c-0583-47eb-809c-da84e2c9478a

📥 Commits

Reviewing files that changed from the base of the PR and between cd3fa72 and 7455e51.

📒 Files selected for processing (2)
  • lib/crewai-tools/src/crewai_tools/security/safe_requests.py
  • lib/crewai-tools/tests/utilities/test_safe_requests.py

@corridor-security corridor-security Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Issues

  • Server-Side Request Forgery (SSRF) via DNS Rebinding (TOCTOU)
    The new safe_get() function validates the hostname/IP via validate_url() but then performs the actual HTTP request with a fresh DNS resolution (requests.get(current_url, ...)). An attacker controlling DNS for a supplied URL can return a public IP during validation and then immediately rebind to a private/reserved IP for the actual connection, enabling internal network access (e.g., 127.0.0.1, 169.254.169.254). This time-of-check/time-of-use gap exists for the initial request and each redirect hop, allowing a realistic SSRF bypass despite validation.

Recommendations:

  • Pin the resolved IP address from validate_url() and perform the connection to that IP while setting the Host header (and SNI for TLS) to the original hostname, or
  • Use a custom requests/urllib3 adapter that enforces a trusted DNS resolution result for the lifetime of the request and redirect chain, rejecting any re-resolution to private/reserved ranges.
  • Maintain the same pinned-IP set across redirects (re-validate targets and re-pin before requesting), or abort on cross-origin redirects that cannot be safely pinned.

redirects_followed = 0

while True:
response = requests.get(current_url, timeout=timeout, **request_kwargs)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation step checks the URL against private/reserved IP ranges, but the subsequent request performs a new DNS resolution, creating a TOCTOU window for DNS rebinding that can re-route the connection to an internal address.

current_url = validate_url(url)
...
response = requests.get(current_url, timeout=timeout, **request_kwargs)

An attacker who controls the DNS for a supplied URL can return a public IP at validation time and then immediately rebind to 127.0.0.1 or another blocked IP for the actual connection. This enables SSRF into internal services despite the validation.

Remediation:

  • Pin the resolved IP from validate_url() and perform the request to that IP while setting the Host header (and SNI for TLS) to the original hostname, or
  • Mount a custom urllib3/requests adapter that uses a trusted DNS resolution result for the entire request (including redirects) and refuses connections if the resolved IP falls into a private/reserved range.
  • Apply the same IP pinning logic to each validated redirect target before following it; abort when safe pinning cannot be maintained.

For more details, see the finding in Corridor.

Provide feedback: Reply with whether this is a valid vulnerability or false positive to help improve Corridor's accuracy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants