From 322e23d486156802b1129d91a5312f652e57a789 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 19 May 2026 16:01:31 +0000
Subject: [PATCH] README: fix CSV column names + restructure
 Troubleshooting/Usage flow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Factual fixes:
- Output section listed CSV columns as "Post Title / Post URL /
  Broken Link / Error Type", but the code writes the snake_case form
  (post_title, post_url, broken_link, error_type via csv.DictWriter
  in checker.py). Replaced with the actual column names plus
  human-readable descriptions in a second table column.
- "Example Output" block contained per-post "Checking N links..." and
  "Found N broken links in this post" lines which only appear when
  --verbose is passed (they're emitted via self._log without
  force=True). Relabeled the block as `$ ... --verbose ...` and added
  a one-liner noting which lines are suppressed without --verbose.
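
  For the CSV column fix, a minimal sketch of why the header comes out
  in snake_case. The field list below is an assumption that mirrors the
  names cited above; it is not copied from checker.py, and the row
  values are illustrative only.

```python
import csv
import io

# Hypothetical field list mirroring the column names cited in this
# commit message; the real list lives in checker.py.
FIELDNAMES = ["post_title", "post_url", "broken_link", "error_type"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()  # the header row is the fieldnames, written verbatim
writer.writerow({
    "post_title": "Example post",
    "post_url": "https://example.substack.com/p/example",
    "broken_link": "https://example.com/missing",
    "error_type": "HTTP 404",
})

# First line of the generated CSV is the header row.
header_line = buffer.getvalue().splitlines()[0]
print(header_line)
```

  DictWriter does no title-casing or relabeling of its own, so any
  "Post Title"-style header would have to be written explicitly.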

Structural:
- Moved Troubleshooting to after Usage / Example Output (was before
  Usage, which interrupted the "how do I use this?" reading flow).
- Renamed "CLI Options" to "`check` Subcommand Options" and added a
  pointer to `substack-link-checker <subcommand> --help` for the
  other subcommands, whose flags weren't documented anywhere.
- Removed run_link_checker.ps1 from the Subcommands table (it's a
  separate PowerShell wrapper, not a subcommand). Added a small
  "Scheduled / automated runs" subsection describing it.
- Dropped "(Optional)" from the Authentication section heading;
  bot protection rejects most unauthenticated scans in practice, so
  auth is effectively recommended rather than optional. Added a
  one-line note explaining when it's needed, and updated the
  Troubleshooting link anchor to `#authentication` to match.

Additions:
- Quick Start now mentions `substack-link-checker demo` as an
  optional install smoke-test.
- One-line mentions of the `--version` flag after Quick Start and
  below the options table.
- New "Importing Previous Results" subsection in Usage with example
  invocations for both .xlsx and .csv (the `import` subcommand was
  documented in the Subcommands table but never demonstrated).

Docs only — no code or test changes. 38 pytest tests still pass.
---
 README.md | 213 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 131 insertions(+), 82 deletions(-)
diff --git a/README.md b/README.md
index cd8c016..a0d49d3 100644
--- a/README.md
+++ b/README.md
@@ -29,6 +29,9 @@ Checking broken links in your newsletter archive shouldn't cost $100+/month for
 # Install (provides the `substack-link-checker` CLI)
 pip install git+https://github.com/jcddc83/substack-broken-link-checker.git
 
+# Smoke-test your install against a handful of known-good/bad URLs
+substack-link-checker demo
+
 # Check all posts from 2024
 substack-link-checker check --base-url https://YOUR.substack.com --year 2024
 
@@ -36,6 +39,8 @@ substack-link-checker check --base-url https://YOUR.substack.com --year 2024
 substack-link-checker check --base-url https://YOUR.substack.com --url-file posts.txt
 ```
 
+Confirm the installed version with `substack-link-checker --version`.
+
 ## Installation
 
 ```bash
@@ -80,9 +85,11 @@ because its name collides with the new package — use
 `substack-link-checker check ...` or `python -m substack_link_checker check ...`
 instead.
 
-## Authentication (Optional)
+## Authentication
 
-If Substack blocks your requests or you need to check paywalled content, use your session cookie:
+Optional in principle, but **usually needed in practice** — Substack's
+bot protection rejects most unauthenticated archive scans. Use your
+session cookie:
 
 1. Log into your Substack in a browser
 2. Open Developer Tools (F12) → Application → Cookies
@@ -105,77 +112,6 @@ so it does not end up in your shell history or in `ps aux`. See
 
 **Note:** Your session cookie expires after a few weeks. If you start getting 403 errors, get a fresh cookie from your browser.
 
-## Troubleshooting
-
-Common failure modes and how to fix them:
-
-### `HTTP 403 Forbidden` when fetching the sitemap or post pages
-
-Substack's bot protection is rejecting unauthenticated requests. In
-order of likelihood:
-
-1. Set `SUBSTACK_COOKIE` (see [Authentication](#authentication-optional)
-   above) so you're requesting as a logged-in user.
-2. If you had a cookie set: it has probably expired (Substack rotates
-   session cookies every few weeks). Grab a fresh one from DevTools.
-3. If both are current: lower `--concurrency` (try `--concurrency 3`)
-   so you look less bot-like.
-
-### `Sitemap returns no posts for --year YYYY`
-
-The year-specific sitemap (e.g. `/sitemap-2024.xml`) doesn't exist for
-your Substack — some accounts only expose a single combined sitemap.
-Fall back to scraping the archive page:
-
-```bash
-substack-link-checker fetch-archive https://YOUR.substack.com 2024
-# Produces archive_urls_2024.txt
-substack-link-checker check --base-url https://YOUR.substack.com \
-    --url-file archive_urls_2024.txt
-```
-
-### `DNS Failure` or `Timeout` for links that work in your browser
-
-The target site is rate-limiting or geo-blocking the checker, not
-actually broken. Add it to `--skip-domains` so it's assumed OK:
-
-```bash
-substack-link-checker check ... --skip-domains rate-limited.example.com
-```
-
-For a recurring list, put one domain per line in a file and pass
-`--skip-domains-file path/to/file.txt`.
-
-### `Connection Error: ...ssl:default` / `SSL Error`
-
-The target host is using an old TLS version Python's `ssl` module no
-longer accepts by default. Usually the right call is to flag the
-domain as broken (it really is unreachable from a modern client):
-
-```bash
-substack-link-checker check ... --broken-domains old-tls.example.com
-```
-
-### Many `Soft 404 (page title indicates error)` results that look fine
-
-The detector matches phrases like "page not found" in the page `<title>`.
-If a legitimate post happens to have one of those phrases in its title,
-it'll be misflagged. Open the report, eyeball the URL, and if it's
-genuinely live, ignore those rows.
-
-### The CSV report file is empty / has only a header
-
-Either no broken links were found (look for "No broken links found!"
-in the summary) or the run was interrupted before report generation.
-The tool only writes the CSV on a successful completion of all posts.
-
-### `--only-new` is not skipping anything
-
-Make sure `--history-file` points at the same JSON file you used on
-the previous run. The history file is the source of truth for which
-posts have already been checked; without it `--only-new` has nothing
-to compare against.
-
 ## Usage
 
 ### Basic Usage
@@ -228,10 +164,30 @@ substack-link-checker check --base-url https://example.substack.com \
     --url-file unchecked_posts.txt --history-file checked_posts.json
 ```
 
-## Example Output
+### Importing Previous Results
+
+If you have an existing Excel or CSV file from a prior scan (or another
+tool), `import` extracts unique post URLs into the history file so
+`--only-new` will skip them on future runs.
 
+The input file must have a column whose header contains "Post URL"
+(case-insensitive, also matches `post_url`). Other columns are ignored.
+
+```bash
+# From an Excel report
+substack-link-checker import previous_report.xlsx --history-file checked_posts.json
+
+# Or from a CSV
+substack-link-checker import previous_report.csv --history-file checked_posts.json
 ```
-$ substack-link-checker check --base-url https://example.substack.com --year 2024
+
+Excel imports require `pandas` and `openpyxl`, which are installed
+automatically as part of the package.
+
+## Example Output (`--verbose`)
+
+```
+$ substack-link-checker check --base-url https://example.substack.com --year 2024 --verbose
 
 Substack Broken Link Checker
 ==================================================
@@ -268,7 +224,85 @@ Generating report: broken_links_report.csv
 Report generated with 5 broken links
 ```
 
-## CLI Options
+Without `--verbose`, the per-post "Checking N links…" and "Found N
+broken links in this post" lines are suppressed; the header, progress
+counter, and SUMMARY block are always shown.
+
+## Troubleshooting
+
+Common failure modes and how to fix them:
+
+### `HTTP 403 Forbidden` when fetching the sitemap or post pages
+
+Substack's bot protection is rejecting unauthenticated requests. In
+order of likelihood:
+
+1. Set `SUBSTACK_COOKIE` (see [Authentication](#authentication)
+   above) so you're requesting as a logged-in user.
+2. If you had a cookie set: it has probably expired (Substack rotates
+   session cookies every few weeks). Grab a fresh one from DevTools.
+3. If both are current: lower `--concurrency` (try `--concurrency 3`)
+   so you look less bot-like.
+
+### `Sitemap returns no posts for --year YYYY`
+
+The year-specific sitemap (e.g. `/sitemap-2024.xml`) doesn't exist for
+your Substack — some accounts only expose a single combined sitemap.
+Fall back to scraping the archive page:
+
+```bash
+substack-link-checker fetch-archive https://YOUR.substack.com 2024
+# Produces archive_urls_2024.txt
+substack-link-checker check --base-url https://YOUR.substack.com \
+    --url-file archive_urls_2024.txt
+```
+
+### `DNS Failure` or `Timeout` for links that work in your browser
+
+The target site is rate-limiting or geo-blocking the checker, not
+actually broken. Add it to `--skip-domains` so it's assumed OK:
+
+```bash
+substack-link-checker check ... --skip-domains rate-limited.example.com
+```
+
+For a recurring list, put one domain per line in a file and pass
+`--skip-domains-file path/to/file.txt`.
+
+### `Connection Error: ...ssl:default` / `SSL Error`
+
+The target host is using an old TLS version that Python's `ssl` module
+no longer accepts by default. Usually the right call is to flag the
+domain as broken (it really is unreachable from a modern client):
+
+```bash
+substack-link-checker check ... --broken-domains old-tls.example.com
+```
+
+### Many `Soft 404 (page title indicates error)` results that look fine
+
+The detector matches phrases like "page not found" in the page `<title>`.
+If a legitimate post happens to have one of those phrases in its title,
+it'll be misflagged. Open the report, eyeball the URL, and if it's
+genuinely live, ignore those rows.
+
+### The CSV report file is empty / has only a header
+
+Either no broken links were found (look for "No broken links found!"
+in the summary) or the run was interrupted before report generation.
+The tool only writes the CSV on a successful completion of all posts.
+
+### `--only-new` is not skipping anything
+
+Make sure `--history-file` points at the same JSON file you used on
+the previous run. The history file is the source of truth for which
+posts have already been checked; without it `--only-new` has nothing
+to compare against.
+
+## `check` Subcommand Options
+
+The options below apply to `substack-link-checker check`. For other
+subcommands, run `substack-link-checker <subcommand> --help`.
 
 | Option | Short | Description |
 |--------|-------|-------------|
@@ -289,6 +323,9 @@ Report generated with 5 broken links
 | `--verbose` | `-v` | Show detailed progress |
 | `--limit` | `-l` | Max posts to check |
 
+Top-level flags: `--version` prints the installed version; `--help`
+lists all subcommands.
+
 ## Subcommands
 
 | Command | Purpose |
@@ -298,15 +335,27 @@ Report generated with 5 broken links
 | `substack-link-checker import` | Import previous results from Excel/CSV into history |
 | `substack-link-checker fetch-archive` | Extract URLs from the `/archive` page (fallback when the sitemap doesn't work) |
 | `substack-link-checker demo` | Self-contained demo against a handful of known-good/bad URLs |
-| `run_link_checker.ps1` | Windows Task Scheduler automation (PowerShell) |
+
+### Scheduled / automated runs
+
+`run_link_checker.ps1` (at the repo root) is a PowerShell wrapper meant
+for Windows Task Scheduler. It runs `compare` to find new posts, then
+`check` to scan them, writing reports to `reports/` with a timestamped
+filename. Set `$SUBSTACK_URL` and `$PROJECT_DIR` at the top of the
+script before first use.
 
 ## Output
 
-The tool generates a CSV report with columns:
-- **Post Title**: Title of the post containing the broken link
-- **Post URL**: URL of the post
-- **Broken Link**: The broken URL
-- **Error Type**: What went wrong (HTTP 404, DNS Failure, SSL Error, etc.)
+The tool generates a CSV report with the following columns (the header
+row is written by `csv.DictWriter`, so the names below appear in the
+file exactly as shown):
+
+| Column | Description |
+|---|---|
+| `post_title` | Title of the post containing the broken link |
+| `post_url` | URL of the post |
+| `broken_link` | The broken URL |
+| `error_type` | What went wrong (e.g. `HTTP 404`, `DNS Failure`, `SSL Error`) |
 
 ## Error Types Detected