From 322e23d486156802b1129d91a5312f652e57a789 Mon Sep 17 00:00:00 2001
From: Claude
Date: Tue, 19 May 2026 16:01:31 +0000
Subject: [PATCH] README: fix CSV column names + restructure Troubleshooting/Usage flow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Factual fixes:
- Output section listed CSV columns as "Post Title / Post URL / Broken
  Link / Error Type", but the code writes the snake_case form
  (post_title, post_url, broken_link, error_type via csv.DictWriter in
  checker.py). Replaced with the actual column names plus
  human-readable descriptions in a second table column.
- "Example Output" block contained per-post "Checking N links..." and
  "Found N broken links in this post" lines which only appear when
  --verbose is passed (they're emitted via self._log without
  force=True). Relabeled the block as `$ ... --verbose ...` and added a
  one-liner noting which lines are suppressed without --verbose.

Structural:
- Moved Troubleshooting to after Usage / Example Output (was before
  Usage, which interrupted the "how do I use this?" reading flow).
- Renamed "CLI Options" to "`check` Subcommand Options" and added a
  pointer to `substack-link-checker --help` for the other subcommands,
  whose flags weren't documented anywhere.
- Removed run_link_checker.ps1 from the Subcommands table (it's a
  separate PowerShell wrapper, not a subcommand). Added a small
  "Scheduled / automated runs" subsection describing it.
- Dropped "(Optional)" from the Authentication section heading; bot
  protection rejects most unauthenticated scans in practice, so the
  auth flow is more recommended than optional. Added a one-line note
  explaining when it's needed.

Additions:
- Quick Start now mentions `substack-link-checker demo` as an optional
  install smoke-test.
- One-line mention of `--version` flag below the options table.
- New "Importing Previous Results" subsection in Usage with example
  invocations for both .xlsx and .csv (the `import` subcommand was
  documented in the Subcommands table but never demonstrated).

Docs only — no code or test changes. 38 pytest tests still pass.
---
 README.md | 213 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 131 insertions(+), 82 deletions(-)

diff --git a/README.md b/README.md
index cd8c016..a0d49d3 100644
--- a/README.md
+++ b/README.md
@@ -29,6 +29,9 @@ Checking broken links in your newsletter archive shouldn't cost $100+/month for
 # Install (provides the `substack-link-checker` CLI)
 pip install git+https://github.com/jcddc83/substack-broken-link-checker.git

+# Smoke-test your install against a handful of known-good/bad URLs
+substack-link-checker demo
+
 # Check all posts from 2024
 substack-link-checker check --base-url https://YOUR.substack.com --year 2024

@@ -36,6 +39,8 @@ substack-link-checker check --base-url https://YOUR.substack.com --year 2024
 substack-link-checker check --base-url https://YOUR.substack.com --url-file posts.txt
 ```

+Confirm the installed version with `substack-link-checker --version`.
+
 ## Installation

 ```bash
@@ -80,9 +85,11 @@ because its name collides with the new package — use
 `substack-link-checker check ...` or `python -m substack_link_checker check ...`
 instead.

-## Authentication (Optional)
+## Authentication

-If Substack blocks your requests or you need to check paywalled content, use your session cookie:
+Optional in principle, but **usually needed in practice** — Substack's
+bot protection rejects most unauthenticated archive scans. Use your
+session cookie:

 1. Log into your Substack in a browser
 2. Open Developer Tools (F12) → Application → Cookies
@@ -105,77 +112,6 @@ so it does not end up in your shell history or in `ps aux`. See

 **Note:** Your session cookie expires after a few weeks. If you start getting 403 errors, get a fresh cookie from your browser.

-## Troubleshooting
-
-Common failure modes and how to fix them:
-
-### `HTTP 403 Forbidden` when fetching the sitemap or post pages
-
-Substack's bot protection is rejecting unauthenticated requests. In
-order of likelihood:
-
-1. Set `SUBSTACK_COOKIE` (see [Authentication](#authentication-optional)
-   above) so you're requesting as a logged-in user.
-2. If you had a cookie set: it has probably expired (Substack rotates
-   session cookies every few weeks). Grab a fresh one from DevTools.
-3. If both are current: lower `--concurrency` (try `--concurrency 3`)
-   so you look less bot-like.
-
-### `Sitemap returns no posts for --year YYYY`
-
-The year-specific sitemap (e.g. `/sitemap-2024.xml`) doesn't exist for
-your Substack — some accounts only expose a single combined sitemap.
-Fall back to scraping the archive page:
-
-```bash
-substack-link-checker fetch-archive https://YOUR.substack.com 2024
-# Produces archive_urls_2024.txt
-substack-link-checker check --base-url https://YOUR.substack.com \
-    --url-file archive_urls_2024.txt
-```
-
-### `DNS Failure` or `Timeout` for links that work in your browser
-
-The target site is rate-limiting or geo-blocking the checker, not
-actually broken. Add it to `--skip-domains` so it's assumed OK:
-
-```bash
-substack-link-checker check ... --skip-domains rate-limited.example.com
-```
-
-For a recurring list, put one domain per line in a file and pass
-`--skip-domains-file path/to/file.txt`.
-
-### `Connection Error: ...ssl:default` / `SSL Error`
-
-The target host is using an old TLS version Python's `ssl` module no
-longer accepts by default. Usually the right call is to flag the
-domain as broken (it really is unreachable from a modern client):
-
-```bash
-substack-link-checker check ... --broken-domains old-tls.example.com
-```
-
-### Many `Soft 404 (page title indicates error)` results that look fine
-
-The detector matches phrases like "page not found" in the page `<title>`.
-If a legitimate post happens to have one of those phrases in its title,
-it'll be misflagged. Open the report, eyeball the URL, and if it's
-genuinely live, ignore those rows.
-
-### The CSV report file is empty / has only a header
-
-Either no broken links were found (look for "No broken links found!"
-in the summary) or the run was interrupted before report generation.
-The tool only writes the CSV on a successful completion of all posts.
-
-### `--only-new` is not skipping anything
-
-Make sure `--history-file` points at the same JSON file you used on
-the previous run. The history file is the source of truth for which
-posts have already been checked; without it `--only-new` has nothing
-to compare against.
-
 ## Usage

 ### Basic Usage
@@ -228,10 +164,30 @@ substack-link-checker check --base-url https://example.substack.com \
     --url-file unchecked_posts.txt --history-file checked_posts.json
 ```

-## Example Output
+### Importing Previous Results
+
+If you have an existing Excel or CSV file from a prior scan (or another
+tool), `import` extracts unique post URLs into the history file so
+`--only-new` will skip them on future runs.

+The input file must have a column whose header contains "Post URL"
+(case-insensitive, also matches `post_url`). Other columns are ignored.
+
+```bash
+# From an Excel report
+substack-link-checker import previous_report.xlsx --history-file checked_posts.json
+
+# Or from a CSV
+substack-link-checker import previous_report.csv --history-file checked_posts.json
 ```
-$ substack-link-checker check --base-url https://example.substack.com --year 2024
+
+Excel imports require `pandas` and `openpyxl`, which are installed
+automatically as part of the package.
+
+## Example Output (`--verbose`)
+
+```
+$ substack-link-checker check --base-url https://example.substack.com --year 2024 --verbose

 Substack Broken Link Checker
 ==================================================
@@ -268,7 +224,85 @@ Generating report: broken_links_report.csv
 Report generated with 5 broken links
 ```

-## CLI Options
+Without `--verbose`, the per-post "Checking N links…" and "Found N
+broken links in this post" lines are suppressed; the header, progress
+counter, and SUMMARY block are always shown.
+
+## Troubleshooting
+
+Common failure modes and how to fix them:
+
+### `HTTP 403 Forbidden` when fetching the sitemap or post pages
+
+Substack's bot protection is rejecting unauthenticated requests. In
+order of likelihood:
+
+1. Set `SUBSTACK_COOKIE` (see [Authentication](#authentication)
+   above) so you're requesting as a logged-in user.
+2. If you had a cookie set: it has probably expired (Substack rotates
+   session cookies every few weeks). Grab a fresh one from DevTools.
+3. If both are current: lower `--concurrency` (try `--concurrency 3`)
+   so you look less bot-like.
+
+### `Sitemap returns no posts for --year YYYY`
+
+The year-specific sitemap (e.g. `/sitemap-2024.xml`) doesn't exist for
+your Substack — some accounts only expose a single combined sitemap.
+Fall back to scraping the archive page:
+
+```bash
+substack-link-checker fetch-archive https://YOUR.substack.com 2024
+# Produces archive_urls_2024.txt
+substack-link-checker check --base-url https://YOUR.substack.com \
+    --url-file archive_urls_2024.txt
+```
+
+### `DNS Failure` or `Timeout` for links that work in your browser
+
+The target site is rate-limiting or geo-blocking the checker, not
+actually broken. Add it to `--skip-domains` so it's assumed OK:
+
+```bash
+substack-link-checker check ... --skip-domains rate-limited.example.com
+```
+
+For a recurring list, put one domain per line in a file and pass
+`--skip-domains-file path/to/file.txt`.
+
+### `Connection Error: ...ssl:default` / `SSL Error`
+
+The target host is using an old TLS version Python's `ssl` module no
+longer accepts by default. Usually the right call is to flag the
+domain as broken (it really is unreachable from a modern client):
+
+```bash
+substack-link-checker check ... --broken-domains old-tls.example.com
+```
+
+### Many `Soft 404 (page title indicates error)` results that look fine
+
+The detector matches phrases like "page not found" in the page `<title>`.
+If a legitimate post happens to have one of those phrases in its title,
+it'll be misflagged. Open the report, eyeball the URL, and if it's
+genuinely live, ignore those rows.
+
+### The CSV report file is empty / has only a header
+
+Either no broken links were found (look for "No broken links found!"
+in the summary) or the run was interrupted before report generation.
+The tool only writes the CSV on a successful completion of all posts.
+
+### `--only-new` is not skipping anything
+
+Make sure `--history-file` points at the same JSON file you used on
+the previous run. The history file is the source of truth for which
+posts have already been checked; without it `--only-new` has nothing
+to compare against.
+
+## `check` Subcommand Options
+
+The options below apply to `substack-link-checker check`. For other
+subcommands, run `substack-link-checker <subcommand> --help`.

 | Option | Short | Description |
 |--------|-------|-------------|
@@ -289,6 +323,9 @@ Report generated with 5 broken links
 | `--verbose` | `-v` | Show detailed progress |
 | `--limit` | `-l` | Max posts to check |

+Top-level flags: `--version` prints the installed version; `--help`
+lists all subcommands.
+
 ## Subcommands

 | Command | Purpose |
@@ -298,15 +335,27 @@ Report generated with 5 broken links
 | `substack-link-checker import` | Import previous results from Excel/CSV into history |
 | `substack-link-checker fetch-archive` | Extract URLs from the `/archive` page (fallback when the sitemap doesn't work) |
 | `substack-link-checker demo` | Self-contained demo against a handful of known-good/bad URLs |
-| `run_link_checker.ps1` | Windows Task Scheduler automation (PowerShell) |
+
+### Scheduled / automated runs
+
+`run_link_checker.ps1` (at the repo root) is a PowerShell wrapper meant
+for Windows Task Scheduler. It runs `compare` to find new posts, then
+`check` to scan them, writing reports to `reports/` with a timestamped
+filename. Set `$SUBSTACK_URL` and `$PROJECT_DIR` at the top of the
+script before first use.

 ## Output

-The tool generates a CSV report with columns:
-- **Post Title**: Title of the post containing the broken link
-- **Post URL**: URL of the post
-- **Broken Link**: The broken URL
-- **Error Type**: What went wrong (HTTP 404, DNS Failure, SSL Error, etc.)
+The tool generates a CSV report with the following columns (header row
+is written by `csv.DictWriter`, so the names below are exactly what
+appears in the file):
+
+| Column | Description |
+|---|---|
+| `post_title` | Title of the post containing the broken link |
+| `post_url` | URL of the post |
+| `broken_link` | The broken URL |
+| `error_type` | What went wrong (e.g. `HTTP 404`, `DNS Failure`, `SSL Error`) |

 ## Error Types Detected