fix: check deny list in webscrappers #6559

Open
0xi4o wants to merge 1 commit into main from fix/flowise-394

0xi4o commented Jun 24, 2026

No description provided.

gemini-code-assist (bot) left a comment

Code Review

This pull request integrates checkDenyList into the Cheerio, Playwright, and Puppeteer document loaders to prevent unauthorized URL access. However, the review highlights critical security concerns regarding Server-Side Request Forgery (SSRF) and DNS rebinding, as checking only the initial URL does not prevent bypasses via HTTP redirects. For Cheerio, the reviewer suggests overriding CheerioWebBaseLoader.prototype.scrape to use secureFetch. For Playwright and Puppeteer, which execute in separate browser processes, the reviewer recommends implementing network-level isolation or configuring a secure forward proxy to block intranet access.


import { SelectorType } from 'cheerio'
import { ICommonObject, INodeOutputsValue, IDocument, INode, INodeData, INodeParams } from '../../../src/Interface'
import { handleEscapeCharacters, webCrawl, xmlScrape } from '../../../src/utils'
import { checkDenyList } from '../../../src/httpSecurity'


Severity: high (security)

To prevent Server-Side Request Forgery (SSRF) via HTTP redirects or DNS rebinding, we should override CheerioWebBaseLoader.prototype.scrape to use secureFetch instead of the default fetch. secureFetch already resolves and validates all URLs in the redirect chain against the deny list and pins the resolved IP address.

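import { checkDenyList, secureFetch } from '../../../src/httpSecurity'
import { CheerioWebBaseLoader } from '@langchain/community/document_loaders/web/cheerio'
import { load } from 'cheerio'

// Replace the loader's default fetch-based scrape with secureFetch, so that every
// redirect hop is checked against the deny list and the resolved IP is pinned.
// @ts-ignore -- patching a library prototype method
CheerioWebBaseLoader.prototype.scrape = async function (this: any) {
    const response = await secureFetch(this.webPath)
    const html = await response.text()
    // Parse the fetched HTML into a Cheerio document, which is what scrape() returns.
    return load(html)
}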
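
To illustrate what "validating all URLs in the redirect chain" means in practice, here is a minimal sketch that follows redirects by hand and re-runs a check on every hop. It is not the actual secureFetch implementation: `fetchWithCheckedRedirects`, `UrlCheck`, and `ManualFetch` are hypothetical names, and the sketch assumes the check function rejects for a denied URL. It does not pin the resolved IP, so on its own it would not address DNS rebinding.

```typescript
// Minimal response shape this sketch needs; the standard fetch Response satisfies it.
interface RedirectAwareResponse {
    status: number
    headers: { get(name: string): string | null }
}

// Hypothetical types: a check that rejects for denied URLs, and a fetch that
// is asked NOT to follow redirects on its own.
type UrlCheck = (url: string) => Promise<void>
type ManualFetch<R extends RedirectAwareResponse> = (url: string, init: { redirect: 'manual' }) => Promise<R>

async function fetchWithCheckedRedirects<R extends RedirectAwareResponse>(
    url: string,
    check: UrlCheck,
    fetchImpl: ManualFetch<R>,
    maxRedirects = 5
): Promise<R> {
    let current = url
    for (let hop = 0; hop <= maxRedirects; hop++) {
        // Validate before every request, including each redirect target.
        await check(current)
        const response = await fetchImpl(current, { redirect: 'manual' })
        const location = response.headers.get('location')
        const isRedirect = response.status >= 300 && response.status < 400
        if (!isRedirect || !location) return response
        // Resolve relative Location headers against the URL that produced them.
        current = new URL(location, current).toString()
    }
    throw new Error(`Too many redirects (more than ${maxRedirects})`)
}
```

Under these assumptions, a call could look like `fetchWithCheckedRedirects(url, checkDenyList, fetch)`; passing the fetch implementation in also makes the redirect handling easy to unit test with a fake.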


async function playwrightLoader(url: string): Promise<Document[] | undefined> {
    try {
        await checkDenyList(url)


Severity: high (security)

The checkDenyList(url) check here only validates the initial URL. Because Playwright launches a full browser that follows HTTP redirects automatically, a malicious URL could redirect to an internal IP address (e.g., http://127.0.0.1 or http://169.254.169.254), bypassing this check entirely and leading to Server-Side Request Forgery (SSRF). Additionally, it is vulnerable to DNS rebinding attacks.

Since Playwright runs in a separate browser process, request interception or network-level isolation is required to fully secure it. If possible, run the browser scraper in an isolated network environment (e.g., a container/sandbox with firewall rules blocking private IP ranges) or configure a secure forward proxy (like Squid) that blocks intranet access.
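
As a rough sketch of the forward-proxy option, the browser could be launched with all traffic routed through an egress proxy that enforces the private-range blocking. `buildProxiedLaunchOptions` and the proxy address in the usage comment are hypothetical; the only Playwright feature assumed here is the `proxy: { server }` launch option. The proxy's own ACLs, not this code, would do the actual blocking.

```typescript
// Shape of the launch options this sketch produces. Playwright's launch()
// accepts a `proxy` object with a `server` field; other fields are omitted here.
interface ProxiedLaunchOptions {
    headless: boolean
    proxy: { server: string }
}

// Hypothetical helper: validate the proxy address up front so that a typo
// fails loudly instead of silently launching an unproxied browser.
function buildProxiedLaunchOptions(proxyServer: string): ProxiedLaunchOptions {
    const parsed = new URL(proxyServer) // throws on malformed input
    const allowedSchemes = ['http:', 'https:', 'socks5:']
    if (allowedSchemes.indexOf(parsed.protocol) === -1) {
        throw new Error(`Unsupported proxy scheme: ${parsed.protocol}`)
    }
    return { headless: true, proxy: { server: proxyServer } }
}

// Possible usage inside playwrightLoader (proxy address is a placeholder):
//   const browser = await chromium.launch(buildProxiedLaunchOptions('http://egress-proxy.internal:3128'))
```

Routing through a proxy moves the decision to the network layer, so redirects and name resolution performed by the browser are subject to the same rules as the initial request.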


async function puppeteerLoader(url: string): Promise<Document[] | undefined> {
    try {
        await checkDenyList(url)


Severity: high (security)

The checkDenyList(url) check here only validates the initial URL. Because Puppeteer launches a full browser that follows HTTP redirects automatically, a malicious URL could redirect to an internal IP address (e.g., http://127.0.0.1 or http://169.254.169.254), bypassing this check entirely and leading to Server-Side Request Forgery (SSRF). Additionally, it is vulnerable to DNS rebinding attacks.

Since Puppeteer runs in a separate browser process, request interception or network-level isolation is required to fully secure it. If possible, run the browser scraper in an isolated network environment (e.g., a container/sandbox with firewall rules blocking private IP ranges) or configure a secure forward proxy (like Squid) that blocks intranet access.
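
The request-interception option could be sketched roughly as follows. `makeDenyListRequestGuard` and `InterceptedRequestLike` are hypothetical names; the interface is a structural subset of Puppeteer's `HTTPRequest`, and the sketch assumes the check function (for example `checkDenyList`) rejects for a denied URL. Whether every redirect hop surfaces as its own intercepted request should be verified against the Puppeteer version in use, and interception alone does not address DNS rebinding, because the browser resolves hostnames itself.

```typescript
// Structural subset of Puppeteer's HTTPRequest that this sketch relies on.
interface InterceptedRequestLike {
    url(): string
    isInterceptResolutionHandled(): boolean
    continue(): Promise<void>
    abort(errorCode?: string): Promise<void>
}

// Hypothetical factory: builds a 'request' event handler that re-runs the
// deny-list check on every request the page makes, not just the first URL.
function makeDenyListRequestGuard(check: (url: string) => Promise<void>) {
    return async (request: InterceptedRequestLike): Promise<void> => {
        // Another handler may already have continued or aborted this request.
        if (request.isInterceptResolutionHandled()) return
        try {
            await check(request.url())
        } catch (err) {
            // Denied (or uncheckable) URL: block it instead of letting it load.
            await request.abort('blockedbyclient')
            return
        }
        await request.continue()
    }
}

// Possible usage inside puppeteerLoader, before page.goto(url):
//   await page.setRequestInterception(true)
//   page.on('request', makeDenyListRequestGuard(checkDenyList))
```

Failing closed in the `catch` branch means URLs the check cannot parse are blocked as well, which is the safer default for a scraper that accepts user-supplied input.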
