fix: check deny list in webscrappers #6559
Code Review
This pull request integrates checkDenyList into the Cheerio, Playwright, and Puppeteer document loaders to prevent unauthorized URL access. However, the review highlights critical security concerns regarding Server-Side Request Forgery (SSRF) and DNS rebinding, as checking only the initial URL does not prevent bypasses via HTTP redirects. For Cheerio, the reviewer suggests overriding CheerioWebBaseLoader.prototype.scrape to use secureFetch. For Playwright and Puppeteer, which execute in separate browser processes, the reviewer recommends implementing network-level isolation or configuring a secure forward proxy to block intranet access.
import { SelectorType } from 'cheerio'
import { ICommonObject, INodeOutputsValue, IDocument, INode, INodeData, INodeParams } from '../../../src/Interface'
import { handleEscapeCharacters, webCrawl, xmlScrape } from '../../../src/utils'
import { checkDenyList } from '../../../src/httpSecurity'
To prevent Server-Side Request Forgery (SSRF) via HTTP redirects or DNS rebinding, we should override CheerioWebBaseLoader.prototype.scrape to use secureFetch instead of the default fetch. secureFetch already resolves and validates all URLs in the redirect chain against the deny list and pins the resolved IP address.
import { checkDenyList, secureFetch } from '../../../src/httpSecurity'
import { CheerioWebBaseLoader } from '@langchain/community/document_loaders/web/cheerio'
import { load } from 'cheerio'

// Route the loader's fetch through secureFetch so every redirect hop is checked against the deny list.
// @ts-ignore
CheerioWebBaseLoader.prototype.scrape = async function (this: any) {
    const response = await secureFetch(this.webPath)
    const html = await response.text()
    return load(html)
}
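To illustrate what validating the whole redirect chain involves, here is a minimal sketch that follows redirects manually and re-checks every hop. It is not the actual secureFetch implementation: `fetchWithCheckedRedirects`, `UrlCheck`, `FetchLike`, and `MAX_REDIRECTS` are hypothetical names, and the `check` callback only stands in for checkDenyList. The sketch does not pin the resolved IP address, so on its own it does not address DNS rebinding.

```typescript
// Hypothetical stand-in for a deny-list check; it is assumed to throw when a URL is denied.
type UrlCheck = (url: string) => Promise<void>

// Minimal structural subset of a fetch function, so the sketch can be exercised with a fake.
type FetchLike = (
    url: string,
    init: { redirect: 'manual' }
) => Promise<{
    status: number
    headers: { get(name: string): string | null }
    text(): Promise<string>
}>

// Illustrative limit, not taken from the PR.
const MAX_REDIRECTS = 5

async function fetchWithCheckedRedirects(startUrl: string, check: UrlCheck, fetchImpl: FetchLike): Promise<string> {
    let current = startUrl
    for (let hop = 0; hop <= MAX_REDIRECTS; hop++) {
        // Validate every URL in the chain, not just the first one.
        await check(current)
        const res = await fetchImpl(current, { redirect: 'manual' })
        const location = res.headers.get('location')
        if (res.status >= 300 && res.status < 400 && location) {
            // Resolve relative Location headers against the URL that produced them.
            current = new URL(location, current).toString()
            continue
        }
        return res.text()
    }
    throw new Error(`Too many redirects (more than ${MAX_REDIRECTS})`)
}
```

On Node 18+, the global `fetch` could be passed as `fetchImpl`, and checkDenyList as `check`, assuming it accepts a URL string and rejects on a denied target.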
async function playwrightLoader(url: string): Promise<Document[] | undefined> {
    try {
        await checkDenyList(url)
The checkDenyList(url) call here only validates the initial URL. Because Playwright launches a full browser that follows HTTP redirects automatically, a malicious URL could redirect to an internal IP address (e.g., http://127.0.0.1 or http://169.254.169.254), bypassing this check entirely and leading to Server-Side Request Forgery (SSRF). The check is also vulnerable to DNS rebinding, because the browser resolves the hostname again after checkDenyList has already passed.
Since Playwright runs in a separate browser process, request interception or network-level isolation is required to fully secure it. If possible, run the browser scraper in an isolated network environment (e.g., a container/sandbox with firewall rules blocking private IP ranges) or configure a secure forward proxy (like Squid) that blocks intranet access.
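The forward-proxy option could be wired into the Playwright launch roughly as follows. This is a sketch under assumptions: `buildProxiedLaunchOptions`, `ProxiedLaunchOptions`, and the `SCRAPER_PROXY_URL` environment variable are hypothetical, and the proxy itself (for example Squid) is assumed to be configured separately to refuse private, loopback, and link-local destinations. Only the `proxy.server` launch option comes from Playwright.

```typescript
// Structural subset of Playwright's launch options; only the fields this sketch sets.
interface ProxiedLaunchOptions {
    headless: boolean
    proxy: { server: string }
}

// Builds launch options that send all browser traffic through a forward proxy.
// Fails closed: a missing or malformed proxy URL throws instead of launching unproxied.
function buildProxiedLaunchOptions(proxyUrl: string): ProxiedLaunchOptions {
    const parsed = new URL(proxyUrl) // throws a TypeError on malformed or empty input
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        throw new Error(`Unsupported proxy scheme: ${parsed.protocol}`)
    }
    return { headless: true, proxy: { server: parsed.origin } }
}

// Usage (requires the `playwright` package; SCRAPER_PROXY_URL is a hypothetical variable):
//   import { chromium } from 'playwright'
//   const browser = await chromium.launch(buildProxiedLaunchOptions(process.env.SCRAPER_PROXY_URL ?? ''))
```

Enforcing the policy in the proxy keeps the decision at connection time, which is where redirects and re-resolved hostnames actually take effect.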
async function puppeteerLoader(url: string): Promise<Document[] | undefined> {
    try {
        await checkDenyList(url)
The checkDenyList(url) call here only validates the initial URL. Because Puppeteer launches a full browser that follows HTTP redirects automatically, a malicious URL could redirect to an internal IP address (e.g., http://127.0.0.1 or http://169.254.169.254), bypassing this check entirely and leading to Server-Side Request Forgery (SSRF). The check is also vulnerable to DNS rebinding, because the browser resolves the hostname again after checkDenyList has already passed.
Since Puppeteer runs in a separate browser process, request interception or network-level isolation is required to fully secure it. If possible, run the browser scraper in an isolated network environment (e.g., a container/sandbox with firewall rules blocking private IP ranges) or configure a secure forward proxy (like Squid) that blocks intranet access.
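For Puppeteer, the request-interception option could look roughly like the sketch below. Assumptions: `isInternalAddress`, `isAllowedUrl`, `guardRequest`, `Resolver`, and the `InterceptedRequest` interface are hypothetical names (the interface mirrors only the `url()`, `abort()`, and `continue()` methods of Puppeteer's HTTPRequest), the address ranges are illustrative rather than exhaustive, and the sketch relies on the `request` event firing for each hop of a redirect chain once `page.setRequestInterception(true)` is set. Because the browser resolves DNS again after this check, a rebinding window remains, so network-level isolation is still the stronger control.

```typescript
import { lookup } from 'node:dns/promises'
import { isIP } from 'node:net'

// Structural subset of Puppeteer's HTTPRequest, so the sketch compiles without Puppeteer installed.
interface InterceptedRequest {
    url(): string
    abort(errorCode?: 'blockedbyclient'): Promise<void>
    continue(): Promise<void>
}

type Resolver = (hostname: string) => Promise<string[]>

// Resolves a hostname to every address the system resolver returns.
const resolveAll: Resolver = async (hostname) => (await lookup(hostname, { all: true })).map((entry) => entry.address)

// Illustrative, non-exhaustive check for loopback, private, link-local, and unspecified addresses.
function isInternalAddress(ip: string): boolean {
    const family = isIP(ip)
    if (family === 0) return true // not an IP literal: fail closed
    if (family === 4) {
        const [a, b] = ip.split('.').map(Number)
        return (
            a === 0 || a === 10 || a === 127 ||
            (a === 169 && b === 254) ||
            (a === 172 && b >= 16 && b <= 31) ||
            (a === 192 && b === 168)
        )
    }
    const v6 = ip.toLowerCase()
    // IPv4-mapped IPv6 in dotted form, e.g. ::ffff:127.0.0.1
    if (v6.startsWith('::ffff:') && isIP(v6.slice(7)) === 4) return isInternalAddress(v6.slice(7))
    return v6 === '::1' || v6 === '::' || /^fe[89ab][0-9a-f]:/.test(v6) || /^f[cd][0-9a-f]{2}:/.test(v6)
}

// Allows only http(s) URLs whose host resolves exclusively to non-internal addresses.
async function isAllowedUrl(rawUrl: string, resolve: Resolver = resolveAll): Promise<boolean> {
    try {
        const { protocol, hostname } = new URL(rawUrl)
        if (protocol !== 'http:' && protocol !== 'https:') return false
        const host = hostname.replace(/^\[|\]$/g, '') // strip brackets from IPv6 literals
        const addresses = isIP(host) ? [host] : await resolve(host)
        return addresses.length > 0 && !addresses.some((address) => isInternalAddress(address))
    } catch {
        return false // unparsable URL or failed lookup: fail closed
    }
}

// Continues or aborts one intercepted request based on its destination.
async function guardRequest(request: InterceptedRequest, resolve: Resolver = resolveAll): Promise<void> {
    if (await isAllowedUrl(request.url(), resolve)) {
        await request.continue()
    } else {
        await request.abort('blockedbyclient')
    }
}

// Usage (requires the `puppeteer` package):
//   await page.setRequestInterception(true)
//   page.on('request', (request) => void guardRequest(request))
```

The resolver is injectable so the decision logic can be tested without real DNS lookups.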
No description provided.