fix: check deny list in webscrappers #6559
base: main
@@ -10,6 +10,7 @@ import { test } from 'linkifyjs'
 import { omit } from 'lodash'
 import { handleEscapeCharacters, INodeOutputsValue, webCrawl, xmlScrape } from '../../../src'
 import { ICommonObject, INode, INodeData, INodeParams } from '../../../src/Interface'
+import { checkDenyList } from '../../../src/httpSecurity'

 class Playwright_DocumentLoaders implements INode {
     label: string
@@ -189,6 +190,7 @@ class Playwright_DocumentLoaders implements INode {

         async function playwrightLoader(url: string): Promise<Document[] | undefined> {
             try {
+                await checkDenyList(url)
Contributor
Since Playwright runs in a separate browser process, request interception or network-level isolation is required to fully secure it. If possible, run the browser scraper in an isolated network environment (e.g., a container or sandbox with firewall rules blocking private IP ranges) or configure a secure forward proxy (such as Squid) that blocks intranet access.
                 let docs = []

                 const executablePath = process.env.PLAYWRIGHT_EXECUTABLE_PATH
@@ -6,6 +6,7 @@ import { omit } from 'lodash'
 import { PuppeteerLifeCycleEvent } from 'puppeteer'
 import { handleEscapeCharacters, INodeOutputsValue, webCrawl, xmlScrape } from '../../../src'
 import { ICommonObject, INode, INodeData, INodeParams } from '../../../src/Interface'
+import { checkDenyList } from '../../../src/httpSecurity'

 class Puppeteer_DocumentLoaders implements INode {
     label: string
@@ -180,6 +181,7 @@ class Puppeteer_DocumentLoaders implements INode {

         async function puppeteerLoader(url: string): Promise<Document[] | undefined> {
             try {
+                await checkDenyList(url)
Contributor
Since Puppeteer runs in a separate browser process, request interception or network-level isolation is required to fully secure it. If possible, run the browser scraper in an isolated network environment (e.g., a container or sandbox with firewall rules blocking private IP ranges) or configure a secure forward proxy (such as Squid) that blocks intranet access.
                 let docs: Document[] = []

                 const executablePath = process.env.PUPPETEER_EXECUTABLE_PATH
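The request interception suggested in the review comments on the Playwright and Puppeteer loaders could look roughly like the sketch below. It is not part of this PR and does not reproduce `checkDenyList`; the names `RouteLike`, `parseIPv4`, `isPrivateIPv4`, `shouldBlock` and `guardRoute` are hypothetical. It only inspects the hostname written in the URL, so a public DNS name that resolves to a private address would still pass. That case needs the network-level isolation or forward proxy the comments mention.

```typescript
// Hypothetical helpers for illustration only; not Flowise's checkDenyList.

/** Minimal shape of a Playwright Route that this sketch relies on. */
interface RouteLike {
    request(): { url(): string }
    abort(): Promise<void>
    continue(): Promise<void>
}

/** Parse a dotted-quad IPv4 literal into its four octets, or return null. */
function parseIPv4(host: string): number[] | null {
    const parts = host.split('.')
    if (parts.length !== 4) return null
    const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN))
    return octets.every((o) => o >= 0 && o <= 255) ? octets : null
}

/** True for unspecified, loopback, RFC 1918 private and link-local IPv4 addresses. */
function isPrivateIPv4(octets: number[]): boolean {
    const [a, b] = octets
    return (
        a === 0 ||
        a === 10 ||
        a === 127 ||
        (a === 169 && b === 254) ||
        (a === 172 && b >= 16 && b <= 31) ||
        (a === 192 && b === 168)
    )
}

/** Decide whether a request URL should be blocked before the browser sends it. */
function shouldBlock(rawUrl: string): boolean {
    let parsed: URL
    try {
        parsed = new URL(rawUrl)
    } catch {
        return true // unparseable URLs are rejected
    }
    // Assumption: only http(s) traffic is allowed through in this sketch.
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return true
    const host = parsed.hostname.toLowerCase()
    if (host === 'localhost' || host.endsWith('.localhost')) return true
    if (host.startsWith('[')) return true // IPv6 literals are rejected outright here
    const octets = parseIPv4(host)
    return octets !== null && isPrivateIPv4(octets)
}

/** Route handler: abort requests to blocked targets, let everything else through. */
async function guardRoute(route: RouteLike): Promise<void> {
    if (shouldBlock(route.request().url())) {
        await route.abort()
    } else {
        await route.continue()
    }
}
```

With Playwright, a handler of this shape would typically be registered via `page.route('**/*', guardRoute)` before navigation. With Puppeteer, the equivalent is to call `page.setRequestInterception(true)` and then abort or continue each request inside a `page.on('request', ...)` listener.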
To prevent Server-Side Request Forgery (SSRF) via HTTP redirects or DNS rebinding, we should override `CheerioWebBaseLoader.prototype.scrape` to use `secureFetch` instead of the default fetch. `secureFetch` already resolves and validates all URLs in the redirect chain against the deny list and pins the resolved IP address.
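To illustrate why redirects matter here, the following is a minimal sketch of per-hop validation. It is not the `secureFetch` implementation, whose signature is not shown in this PR; `fetchWithValidatedRedirects`, `FetchLike`, `Validator` and `MinimalResponse` are hypothetical names. It follows redirects manually and runs a validator (standing in for a deny-list check) on every URL before requesting it. It does not pin the resolved IP address, so it does not address DNS rebinding by itself.

```typescript
// Hypothetical sketch of redirect-chain validation; not Flowise's secureFetch.

/** The subset of a fetch Response this sketch needs. */
interface MinimalResponse {
    status: number
    headers: { get(name: string): string | null }
}

/** A fetch-like function that is asked not to follow redirects itself. */
type FetchLike = (url: string, init: { redirect: 'manual' }) => Promise<MinimalResponse>

/** Throws (rejects) when the URL is not allowed, e.g. a deny-list check. */
type Validator = (url: string) => Promise<void>

async function fetchWithValidatedRedirects(
    startUrl: string,
    validate: Validator,
    fetchImpl: FetchLike,
    maxRedirects = 5
): Promise<MinimalResponse> {
    let current = startUrl
    for (let hop = 0; hop <= maxRedirects; hop++) {
        // Validate every hop, not just the URL the user supplied.
        await validate(current)
        const res = await fetchImpl(current, { redirect: 'manual' })
        const location = res.headers.get('location')
        if (res.status < 300 || res.status >= 400 || location === null) {
            return res // not a redirect: hand the response back to the caller
        }
        // Location may be relative, so resolve it against the current URL.
        current = new URL(location, current).toString()
    }
    throw new Error(`Too many redirects (limit ${maxRedirects})`)
}
```

In a Node.js runtime with a global `fetch`, that function could be passed as `fetchImpl`, with a deny-list check such as `checkDenyList` as the validator, assuming it rejects for denied URLs.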