Fix SSRF redirect bypass in scraping fetches#6331
Conversation
📝 WalkthroughWalkthroughThe PR adds ChangesRedirect-safe HTTP fetching
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Security Issues
- Credential leakage across redirects
The newsafe_get()manually follows redirects but reuses the originalheadersandcookiesfor each redirected host. A user-controlled URL or open redirect can cause Authorization/Cookie values intended for one origin to be sent to an attacker-controlled public host.
Summary: This PR adds redirect-aware URL validation for scraping and URL-backed RAG fetches, reducing the direct SSRF redirect bypass but introducing credential forwarding risk during redirects.
Risk: Medium risk. The new public URL fetch helper crosses an external trust boundary and can leak configured authentication headers or cookies when following attacker-controlled cross-origin redirects.
Recommendations:
- Strip
Authorization,Cookie, and explicitcookieswhen the redirect target changes origin, or reject cross-origin redirects whenever credentials are present. - Add regression tests for redirects from an authenticated trusted origin to a different public host to ensure credentials are not forwarded.
There was a problem hiding this comment.
Pull request overview
This PR addresses an SSRF redirect bypass in worker-side scraping and URL-backed RAG loaders by introducing a redirect-aware HTTP helper that re-validates each redirect target before issuing the next request.
Changes:
- Added
safe_get()increwai_tools.security.safe_requeststo validate the initial URL and each redirect target (manual redirect handling withallow_redirects=False). - Replaced direct
requests.get()usage withsafe_get()across scraping tools and multiple RAG loaders. - Added regression tests and test DNS fixtures to ensure redirects to internal IPs are blocked before the second request.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| lib/crewai-tools/src/crewai_tools/security/safe_requests.py | Adds safe_get() to validate URLs across redirects. |
| lib/crewai-tools/src/crewai_tools/tools/scrape_website_tool/scrape_website_tool.py | Routes website fetch through safe_get() instead of requests.get(). |
| lib/crewai-tools/src/crewai_tools/tools/scrape_element_from_website/scrape_element_from_website.py | Routes element-scrape fetch through safe_get() instead of requests.get(). |
| lib/crewai-tools/src/crewai_tools/rag/loaders/webpage_loader.py | Uses safe_get() for webpage loading. |
| lib/crewai-tools/src/crewai_tools/rag/loaders/utils.py | Uses safe_get() for shared URL-loading helper. |
| lib/crewai-tools/src/crewai_tools/rag/loaders/pdf_loader.py | Uses safe_get() for PDF URL downloads. |
| lib/crewai-tools/src/crewai_tools/rag/loaders/docx_loader.py | Uses safe_get() for DOCX URL downloads. |
| lib/crewai-tools/src/crewai_tools/rag/loaders/docs_site_loader.py | Uses safe_get() for docs-site fetching. |
| lib/crewai-tools/tests/utilities/test_safe_requests.py | Adds redirect-focused SSRF regression tests for safe_get(). |
| lib/crewai-tools/tests/rag/conftest.py | Adds autouse DNS fixture to stabilize URL validation in RAG tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/crewai-tools/src/crewai_tools/rag/loaders/utils.py`:
- Line 37: The redirect handling in safe_get currently carries the original
headers across every hop, which can leak secrets to different domains. Update
safe_get in safe_requests.py to track the previous and next URL netlocs during
redirect resolution, and when the host changes, strip sensitive request headers
before the next requests.get call. Make sure the header cleanup covers
Authorization, Authorization-*, Cookie, and any other secret-bearing custom
headers used by callers such as the RAG loaders and scrape tools.
In `@lib/crewai-tools/src/crewai_tools/security/safe_requests.py`:
- Around line 19-25: The redirect handling in safe_requests.py is reusing
request_kwargs unchanged across every hop, which can leak Authorization/Cookie
state and ignore Set-Cookie responses. Update the redirect flow in the request
loop to use a requests.Session, or otherwise rebuild per-hop kwargs in a way
that strips sensitive headers on cross-origin redirects and carries cookies
forward safely; keep the logic localized around the current_url loop and the
request_kwargs/history/redirects_followed handling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 1c30100e-5382-43c6-bf0a-38da30d36328
📒 Files selected for processing (10)
lib/crewai-tools/src/crewai_tools/rag/loaders/docs_site_loader.pylib/crewai-tools/src/crewai_tools/rag/loaders/docx_loader.pylib/crewai-tools/src/crewai_tools/rag/loaders/pdf_loader.pylib/crewai-tools/src/crewai_tools/rag/loaders/utils.pylib/crewai-tools/src/crewai_tools/rag/loaders/webpage_loader.pylib/crewai-tools/src/crewai_tools/security/safe_requests.pylib/crewai-tools/src/crewai_tools/tools/scrape_element_from_website/scrape_element_from_website.pylib/crewai-tools/src/crewai_tools/tools/scrape_website_tool/scrape_website_tool.pylib/crewai-tools/tests/rag/conftest.pylib/crewai-tools/tests/utilities/test_safe_requests.py
There was a problem hiding this comment.
♻️ Duplicate comments (1)
lib/crewai-tools/src/crewai_tools/security/safe_requests.py (1)
38-48: 🔒 Security & Privacy | 🟠 MajorStrip non-header credentials before the next hop.
authremains inrequest_kwargsafter header/cookie stripping, causing credentials to be sent to cross-origin redirect targets.paramsis also replayed to redirect targets, which can leak initial-query secrets that should not persist across hops.
- Remove
authon all redirect hops, and drop first-hopparamsbefore following any redirect.def _strip_cross_origin_credentials(request_kwargs: dict[str, Any]) -> dict[str, Any]: sanitized = {**request_kwargs} headers = sanitized.get("headers") if headers: sanitized["headers"] = { key: value for key, value in headers.items() if not _is_sensitive_header(str(key)) } sanitized.pop("cookies", None) + sanitized.pop("auth", None) return sanitized- if not _same_origin(current_url, redirect_url): - request_kwargs = _strip_cross_origin_credentials(request_kwargs) + next_request_kwargs = {**request_kwargs} + next_request_kwargs.pop("params", None) + if not _same_origin(current_url, redirect_url): + next_request_kwargs = _strip_cross_origin_credentials(next_request_kwargs) + request_kwargs = next_request_kwargsApplies to: 83–84 as well.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/crewai-tools/src/crewai_tools/security/safe_requests.py` around lines 38 - 48, Update _strip_cross_origin_credentials in safe_requests.py so it also removes non-header credentials before forwarding a redirected request: strip auth on every cross-origin hop, and clear params as well as cookies so initial query secrets are not replayed. Keep the existing header filtering logic, but make sure the sanitized request_kwargs returned by _strip_cross_origin_credentials no longer contains auth or params when building the next request.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@lib/crewai-tools/src/crewai_tools/security/safe_requests.py`:
- Around line 38-48: Update _strip_cross_origin_credentials in safe_requests.py
so it also removes non-header credentials before forwarding a redirected
request: strip auth on every cross-origin hop, and clear params as well as
cookies so initial query secrets are not replayed. Keep the existing header
filtering logic, but make sure the sanitized request_kwargs returned by
_strip_cross_origin_credentials no longer contains auth or params when building
the next request.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: e695953c-0583-47eb-809c-da84e2c9478a
📒 Files selected for processing (2)
lib/crewai-tools/src/crewai_tools/security/safe_requests.pylib/crewai-tools/tests/utilities/test_safe_requests.py
There was a problem hiding this comment.
Security Issues
- Server-Side Request Forgery (SSRF) via DNS Rebinding (TOCTOU)
The newsafe_get()function validates the hostname/IP viavalidate_url()but then performs the actual HTTP request with a fresh DNS resolution (requests.get(current_url, ...)). An attacker controlling DNS for a supplied URL can return a public IP during validation and then immediately rebind to a private/reserved IP for the actual connection, enabling internal network access (e.g., 127.0.0.1, 169.254.169.254). This time-of-check/time-of-use gap exists for the initial request and each redirect hop, allowing a realistic SSRF bypass despite validation.
Recommendations:
- Pin the resolved IP address from
validate_url()and perform the connection to that IP while setting theHostheader (and SNI for TLS) to the original hostname, or - Use a custom
requests/urllib3 adapter that enforces a trusted DNS resolution result for the lifetime of the request and redirect chain, rejecting any re-resolution to private/reserved ranges. - Maintain the same pinned-IP set across redirects (re-validate targets and re-pin before requesting), or abort on cross-origin redirects that cannot be safely pinned.
| redirects_followed = 0 | ||
|
|
||
| while True: | ||
| response = requests.get(current_url, timeout=timeout, **request_kwargs) |
There was a problem hiding this comment.
The validation step checks the URL against private/reserved IP ranges, but the subsequent request performs a new DNS resolution, creating a TOCTOU window for DNS rebinding that can re-route the connection to an internal address.
current_url = validate_url(url)
...
response = requests.get(current_url, timeout=timeout, **request_kwargs)An attacker who controls the DNS for a supplied URL can return a public IP at validation time and then immediately rebind to 127.0.0.1 or another blocked IP for the actual connection. This enables SSRF into internal services despite the validation.
Remediation:
- Pin the resolved IP from
validate_url()and perform the request to that IP while setting theHostheader (and SNI for TLS) to the original hostname, or - Mount a custom urllib3/requests adapter that uses a trusted DNS resolution result for the entire request (including redirects) and refuses connections if the resolved IP falls into a private/reserved range.
- Apply the same IP pinning logic to each validated redirect target before following it; abort when safe pinning cannot be maintained.
For more details, see the finding in Corridor.
Provide feedback: Reply with whether this is a valid vulnerability or false positive to help improve Corridor's accuracy.
Summary
Fixes the worker-side SSRF redirect bypass in scraping and URL-backed RAG loaders by routing direct URL fetches through a redirect-aware safe HTTP helper.
Changes
safe_get()to validate the initial URL and every redirect target before following redirects.requests.get()calls inScrapeWebsiteTool,ScrapeElementFromWebsiteTool, and URL-backed RAG loaders.127.0.0.1is blocked before the second request is made.Validation
uv run pytest lib/crewai-tools/tests/utilities/test_safe_requests.py lib/crewai-tools/tests/utilities/test_safe_path.py lib/crewai-tools/tests/rag/test_webpage_loader.py lib/crewai-tools/tests/rag/test_docx_loader.py lib/crewai-tools/tests/rag/test_json_loader.py lib/crewai-tools/tests/rag/test_csv_loader.py lib/crewai-tools/tests/rag/test_xml_loader.py lib/crewai-tools/tests/rag/test_mdx_loader.py -q→ 97 passeduv run ruff check ...on touched files → passeduv run mypy ...on touched files → passedNotes
This closes the redirect-following bypass for worker-side direct fetches. Full DNS TOCTOU/IP pinning remains a separate lower-level hardening follow-up.
Summary by CodeRabbit