2 changes: 2 additions & 0 deletions packages/components/nodes/documentloaders/Cheerio/Cheerio.ts
@@ -6,6 +6,7 @@ import { parse } from 'css-what'
import { SelectorType } from 'cheerio'
import { ICommonObject, INodeOutputsValue, IDocument, INode, INodeData, INodeParams } from '../../../src/Interface'
import { handleEscapeCharacters, webCrawl, xmlScrape } from '../../../src/utils'
import { checkDenyList } from '../../../src/httpSecurity'

security-high high

To prevent Server-Side Request Forgery (SSRF) via HTTP redirects or DNS rebinding, override CheerioWebBaseLoader.prototype.scrape so that it uses secureFetch instead of the default fetch. secureFetch already resolves every URL in the redirect chain, validates each one against the deny list, and pins the resolved IP address.

import { checkDenyList, secureFetch } from '../../../src/httpSecurity'
import { CheerioWebBaseLoader } from '@langchain/community/document_loaders/web/cheerio'
import { load } from 'cheerio'

// @ts-ignore
CheerioWebBaseLoader.prototype.scrape = async function (this: any) {
    const response = await secureFetch(this.webPath)
    const html = await response.text()
    return load(html)
}
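
To illustrate what "validates all URLs in the redirect chain" means in practice, here is a minimal sketch of following redirects by hand and checking each hop before it is requested. It is not the implementation of secureFetch; `fetchWithCheckedRedirects`, `HopResponse`, `FetchLike`, and `UrlValidator` are hypothetical names introduced only for this example, and IP pinning is left out.

```typescript
// Minimal structural types so the sketch does not depend on a specific fetch implementation.
interface HopResponse {
    status: number
    headers: { get(name: string): string | null }
}
type FetchLike = (url: string, init: { redirect: 'manual' }) => Promise<HopResponse>
type UrlValidator = (url: string) => Promise<void>

// Hypothetical helper: follows redirects manually and validates every hop before requesting it.
async function fetchWithCheckedRedirects(
    startUrl: string,
    validate: UrlValidator,
    fetchImpl: FetchLike,
    maxRedirects = 5
): Promise<HopResponse> {
    let current = startUrl
    for (let hop = 0; hop <= maxRedirects; hop++) {
        // The validator is expected to throw when the URL is on the deny list.
        await validate(current)
        const response = await fetchImpl(current, { redirect: 'manual' })
        const location = response.headers.get('location')
        if (response.status < 300 || response.status >= 400 || location === null) {
            return response // not a redirect, so this is the final response
        }
        // Resolve relative Location headers against the URL that produced them.
        current = new URL(location, current).toString()
    }
    throw new Error(`Too many redirects (limit ${maxRedirects})`)
}
```

A validator such as checkDenyList could be passed as `validate`. The point of the design is that a redirect target is never fetched until it has passed the same check as the initial URL.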


class Cheerio_DocumentLoaders implements INode {
    label: string
@@ -148,6 +149,7 @@ class Cheerio_DocumentLoaders implements INode {

async function cheerioLoader(url: string): Promise<any> {
    try {
        await checkDenyList(url)
        let docs: IDocument[] = []
        if (url.endsWith('.pdf')) {
            if (process.env.DEBUG === 'true')
@@ -10,6 +10,7 @@ import { test } from 'linkifyjs'
import { omit } from 'lodash'
import { handleEscapeCharacters, INodeOutputsValue, webCrawl, xmlScrape } from '../../../src'
import { ICommonObject, INode, INodeData, INodeParams } from '../../../src/Interface'
import { checkDenyList } from '../../../src/httpSecurity'

class Playwright_DocumentLoaders implements INode {
    label: string
@@ -189,6 +190,7 @@ class Playwright_DocumentLoaders implements INode {

async function playwrightLoader(url: string): Promise<Document[] | undefined> {
    try {
        await checkDenyList(url)

security-high high

checkDenyList(url) validates only the initial URL. Because Playwright launches a full browser that follows HTTP redirects automatically, a malicious URL could redirect to an internal IP address (e.g., http://127.0.0.1 or http://169.254.169.254), bypassing this check entirely and leading to Server-Side Request Forgery (SSRF). The check is also vulnerable to DNS rebinding: the hostname can resolve to a public address when it is checked and to an internal one when the browser connects.

Since Playwright runs in a separate browser process, request interception or network-level isolation is required to fully secure it. If possible, run the browser scraper in an isolated network environment (e.g., a container/sandbox with firewall rules blocking private IP ranges) or configure a secure forward proxy (like Squid) that blocks intranet access.
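
As a rough illustration of the request-interception option, the sketch below aborts any browser request whose URL names an internal host. It assumes the handler is registered with Playwright's `page.route('**/*', handler)`; `isInternalHost`, `blockInternalRequests`, `RouteLike`, and `RequestLike` are hypothetical names for this example. It inspects only the literal hostname in each URL (IPv4 literals, localhost, and [::1]), so it covers redirects to internal addresses but does not by itself stop DNS rebinding.

```typescript
// Returns true for "localhost", IPv6 loopback, and loopback/private/link-local IPv4 literals.
function isInternalHost(hostname: string): boolean {
    if (hostname === 'localhost' || hostname === '[::1]') return true
    const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(hostname)
    if (!match) return false // not an IPv4 literal; ordinary hostnames need DNS-level checks
    const a = Number(match[1])
    const b = Number(match[2])
    return (
        a === 0 || // 0.0.0.0/8
        a === 10 || // 10.0.0.0/8
        a === 127 || // loopback
        (a === 169 && b === 254) || // link-local, includes 169.254.169.254
        (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
        (a === 192 && b === 168) // 192.168.0.0/16
    )
}

// Structural stand-ins for the parts of Playwright's Route and Request used here.
interface RouteLike {
    abort(): Promise<void>
    continue(): Promise<void>
}
interface RequestLike {
    url(): string
}

// Hypothetical route handler: abort requests to internal hosts, let everything else through.
async function blockInternalRequests(route: RouteLike, request: RequestLike): Promise<void> {
    let hostname: string
    try {
        hostname = new URL(request.url()).hostname
    } catch {
        return route.abort() // unparseable URL: fail closed
    }
    return isInternalHost(hostname) ? route.abort() : route.continue()
}

// Assumed wiring inside the loader, before navigation:
//   await page.route('**/*', blockInternalRequests)
```

Because the handler runs for every request the page issues, including each redirect hop, it closes the gap left by checking only the initial URL. Network-level isolation is still the stronger control.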

        let docs = []

        const executablePath = process.env.PLAYWRIGHT_EXECUTABLE_PATH
@@ -6,6 +6,7 @@ import { omit } from 'lodash'
import { PuppeteerLifeCycleEvent } from 'puppeteer'
import { handleEscapeCharacters, INodeOutputsValue, webCrawl, xmlScrape } from '../../../src'
import { ICommonObject, INode, INodeData, INodeParams } from '../../../src/Interface'
import { checkDenyList } from '../../../src/httpSecurity'

class Puppeteer_DocumentLoaders implements INode {
    label: string
@@ -180,6 +181,7 @@ class Puppeteer_DocumentLoaders implements INode {

async function puppeteerLoader(url: string): Promise<Document[] | undefined> {
    try {
        await checkDenyList(url)

security-high high

checkDenyList(url) validates only the initial URL. Because Puppeteer launches a full browser that follows HTTP redirects automatically, a malicious URL could redirect to an internal IP address (e.g., http://127.0.0.1 or http://169.254.169.254), bypassing this check entirely and leading to Server-Side Request Forgery (SSRF). The check is also vulnerable to DNS rebinding: the hostname can resolve to a public address when it is checked and to an internal one when the browser connects.

Since Puppeteer runs in a separate browser process, request interception or network-level isolation is required to fully secure it. If possible, run the browser scraper in an isolated network environment (e.g., a container/sandbox with firewall rules blocking private IP ranges) or configure a secure forward proxy (like Squid) that blocks intranet access.
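
For the forward-proxy option, one possible shape is to route all browser traffic through the proxy via Chromium launch flags. The sketch below only builds the `args` array. `buildProxyLaunchOptions` and the `SCRAPER_PROXY_URL` environment variable are hypothetical names, and the proxy itself (for example Squid with rules denying private ranges) is assumed to exist. The `<-loopback>` bypass rule follows my reading of Chromium's proxy documentation, which describes it as removing the implicit bypass for localhost and link-local addresses; verify it against the Chromium version in use.

```typescript
interface ProxyLaunchOptions {
    args: string[]
}

// Hypothetical helper: turns a proxy URL into Chromium flags suitable for puppeteer.launch({ args }).
function buildProxyLaunchOptions(proxyUrl: string | undefined, baseArgs: string[] = []): ProxyLaunchOptions {
    if (!proxyUrl) return { args: [...baseArgs] } // no proxy configured: leave args untouched
    const parsed = new URL(proxyUrl) // throws on a malformed value
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        throw new Error(`Unsupported proxy protocol: ${parsed.protocol}`)
    }
    return {
        args: [
            ...baseArgs,
            `--proxy-server=${parsed.protocol}//${parsed.host}`,
            // Assumption: without this, Chromium connects directly to loopback and
            // link-local targets, so the proxy would never see those requests.
            '--proxy-bypass-list=<-loopback>'
        ]
    }
}

// Assumed usage inside the loader (SCRAPER_PROXY_URL is a made-up variable name):
//   const browser = await puppeteer.launch({
//       ...buildProxyLaunchOptions(process.env.SCRAPER_PROXY_URL),
//       executablePath
//   })
```

With this arrangement the deny rules live in the proxy, so they apply to redirects and to addresses resolved at connect time, which a pre-flight check on the URL cannot guarantee.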

        let docs: Document[] = []

        const executablePath = process.env.PUPPETEER_EXECUTABLE_PATH