diff --git a/README.md b/README.md
index cd8c016..a0d49d3 100644
--- a/README.md
+++ b/README.md
@@ -29,6 +29,9 @@ Checking broken links in your newsletter archive shouldn't cost $100+/month for
# Install (provides the `substack-link-checker` CLI)
pip install git+https://github.com/jcddc83/substack-broken-link-checker.git
+# Smoke-test your install against a handful of known-good/bad URLs
+substack-link-checker demo
+
# Check all posts from 2024
substack-link-checker check --base-url https://YOUR.substack.com --year 2024
@@ -36,6 +39,8 @@ substack-link-checker check --base-url https://YOUR.substack.com --year 2024
substack-link-checker check --base-url https://YOUR.substack.com --url-file posts.txt
```
+Confirm the installed version with `substack-link-checker --version`.
+
## Installation
```bash
@@ -80,9 +85,11 @@ because its name collides with the new package — use
`substack-link-checker check ...` or `python -m substack_link_checker check ...`
instead.
-## Authentication (Optional)
+## Authentication
-If Substack blocks your requests or you need to check paywalled content, use your session cookie:
+Optional in principle, but **usually needed in practice** — Substack's
+bot protection rejects most unauthenticated archive scans. Use your
+session cookie:
1. Log into your Substack in a browser
2. Open Developer Tools (F12) → Application → Cookies
@@ -105,77 +112,6 @@ so it does not end up in your shell history or in `ps aux`. See
**Note:** Your session cookie expires after a few weeks. If you start getting 403 errors, get a fresh cookie from your browser.
-## Troubleshooting
-
-Common failure modes and how to fix them:
-
-### `HTTP 403 Forbidden` when fetching the sitemap or post pages
-
-Substack's bot protection is rejecting unauthenticated requests. In
-order of likelihood:
-
-1. Set `SUBSTACK_COOKIE` (see [Authentication](#authentication-optional)
- above) so you're requesting as a logged-in user.
-2. If you had a cookie set: it has probably expired (Substack rotates
- session cookies every few weeks). Grab a fresh one from DevTools.
-3. If both are current: lower `--concurrency` (try `--concurrency 3`)
- so you look less bot-like.
-
-### `Sitemap returns no posts for --year YYYY`
-
-The year-specific sitemap (e.g. `/sitemap-2024.xml`) doesn't exist for
-your Substack — some accounts only expose a single combined sitemap.
-Fall back to scraping the archive page:
-
-```bash
-substack-link-checker fetch-archive https://YOUR.substack.com 2024
-# Produces archive_urls_2024.txt
-substack-link-checker check --base-url https://YOUR.substack.com \
- --url-file archive_urls_2024.txt
-```
-
-### `DNS Failure` or `Timeout` for links that work in your browser
-
-The target site is rate-limiting or geo-blocking the checker, not
-actually broken. Add it to `--skip-domains` so it's assumed OK:
-
-```bash
-substack-link-checker check ... --skip-domains rate-limited.example.com
-```
-
-For a recurring list, put one domain per line in a file and pass
-`--skip-domains-file path/to/file.txt`.
-
-### `Connection Error: ...ssl:default` / `SSL Error`
-
-The target host is using an old TLS version Python's `ssl` module no
-longer accepts by default. Usually the right call is to flag the
-domain as broken (it really is unreachable from a modern client):
-
-```bash
-substack-link-checker check ... --broken-domains old-tls.example.com
-```
-
-### Many `Soft 404 (page title indicates error)` results that look fine
-
-The detector matches phrases like "page not found" in the page `<title>`.
-If a legitimate post happens to have one of those phrases in its title,
-it'll be misflagged. Open the report, eyeball the URL, and if it's
-genuinely live, ignore those rows.
-
-### The CSV report file is empty / has only a header
-
-Either no broken links were found (look for "No broken links found!"
-in the summary) or the run was interrupted before report generation.
-The tool only writes the CSV on a successful completion of all posts.
-
-### `--only-new` is not skipping anything
-
-Make sure `--history-file` points at the same JSON file you used on
-the previous run. The history file is the source of truth for which
-posts have already been checked; without it `--only-new` has nothing
-to compare against.
-
## Usage
### Basic Usage
@@ -228,10 +164,30 @@ substack-link-checker check --base-url https://example.substack.com \
--url-file unchecked_posts.txt --history-file checked_posts.json
```
-## Example Output
+### Importing Previous Results
+
+If you have an existing Excel or CSV file from a prior scan (or another
+tool), `import` extracts unique post URLs into the history file so
+`--only-new` will skip them on future runs.
+The input file must have a column whose header contains "Post URL"
+(case-insensitive, also matches `post_url`). Other columns are ignored.
+
+```bash
+# From an Excel report
+substack-link-checker import previous_report.xlsx --history-file checked_posts.json
+
+# Or from a CSV
+substack-link-checker import previous_report.csv --history-file checked_posts.json
```
-$ substack-link-checker check --base-url https://example.substack.com --year 2024
+
+Excel imports require `pandas` and `openpyxl`, which are installed
+automatically as part of the package.
+
+## Example Output (`--verbose`)
+
+```
+$ substack-link-checker check --base-url https://example.substack.com --year 2024 --verbose
Substack Broken Link Checker
==================================================
@@ -268,7 +224,85 @@ Generating report: broken_links_report.csv
Report generated with 5 broken links
```
-## CLI Options
+Without `--verbose`, the per-post "Checking N links…" and "Found N
+broken links in this post" lines are suppressed; the header, progress
+counter, and SUMMARY block are always shown.
+
+## Troubleshooting
+
+Common failure modes and how to fix them:
+
+### `HTTP 403 Forbidden` when fetching the sitemap or post pages
+
+Substack's bot protection is rejecting unauthenticated requests. In
+order of likelihood:
+
+1. Set `SUBSTACK_COOKIE` (see [Authentication](#authentication)
+ above) so you're requesting as a logged-in user.
+2. If you had a cookie set: it has probably expired (Substack rotates
+ session cookies every few weeks). Grab a fresh one from DevTools.
+3. If both are current: lower `--concurrency` (try `--concurrency 3`)
+ so you look less bot-like.
+
+### `Sitemap returns no posts for --year YYYY`
+
+The year-specific sitemap (e.g. `/sitemap-2024.xml`) doesn't exist for
+your Substack — some accounts only expose a single combined sitemap.
+Fall back to scraping the archive page:
+
+```bash
+substack-link-checker fetch-archive https://YOUR.substack.com 2024
+# Produces archive_urls_2024.txt
+substack-link-checker check --base-url https://YOUR.substack.com \
+ --url-file archive_urls_2024.txt
+```
+
+### `DNS Failure` or `Timeout` for links that work in your browser
+
+The target site is rate-limiting or geo-blocking the checker, not
+actually broken. Add it to `--skip-domains` so it's assumed OK:
+
+```bash
+substack-link-checker check ... --skip-domains rate-limited.example.com
+```
+
+For a recurring list, put one domain per line in a file and pass
+`--skip-domains-file path/to/file.txt`.
+
+### `Connection Error: ...ssl:default` / `SSL Error`
+
+The target host is using an old TLS version Python's `ssl` module no
+longer accepts by default. Usually the right call is to flag the
+domain as broken (it really is unreachable from a modern client):
+
+```bash
+substack-link-checker check ... --broken-domains old-tls.example.com
+```
+
+### Many `Soft 404 (page title indicates error)` results that look fine
+
+The detector matches phrases like "page not found" in the page `<title>`.
+If a legitimate post happens to have one of those phrases in its title,
+it'll be misflagged. Open the report, eyeball the URL, and if it's
+genuinely live, ignore those rows.
+
+### The CSV report file is empty / has only a header
+
+Either no broken links were found (look for "No broken links found!"
+in the summary) or the run was interrupted before report generation.
+The tool only writes the CSV on a successful completion of all posts.
+
+### `--only-new` is not skipping anything
+
+Make sure `--history-file` points at the same JSON file you used on
+the previous run. The history file is the source of truth for which
+posts have already been checked; without it `--only-new` has nothing
+to compare against.
+
+## `check` Subcommand Options
+
+The options below apply to `substack-link-checker check`. For other
+subcommands, run `substack-link-checker --help`.
| Option | Short | Description |
|--------|-------|-------------|
@@ -289,6 +323,9 @@ Report generated with 5 broken links
| `--verbose` | `-v` | Show detailed progress |
| `--limit` | `-l` | Max posts to check |
+Top-level flags: `--version` prints the installed version; `--help`
+lists all subcommands.
+
## Subcommands
| Command | Purpose |
@@ -298,15 +335,27 @@ Report generated with 5 broken links
| `substack-link-checker import` | Import previous results from Excel/CSV into history |
| `substack-link-checker fetch-archive` | Extract URLs from the `/archive` page (fallback when the sitemap doesn't work) |
| `substack-link-checker demo` | Self-contained demo against a handful of known-good/bad URLs |
-| `run_link_checker.ps1` | Windows Task Scheduler automation (PowerShell) |
+
+### Scheduled / automated runs
+
+`run_link_checker.ps1` (at the repo root) is a PowerShell wrapper meant
+for Windows Task Scheduler. It runs `compare` to find new posts, then
+`check` to scan them, writing reports to `reports/` with a timestamped
+filename. Set `$SUBSTACK_URL` and `$PROJECT_DIR` at the top of the
+script before first use.
## Output
-The tool generates a CSV report with columns:
-- **Post Title**: Title of the post containing the broken link
-- **Post URL**: URL of the post
-- **Broken Link**: The broken URL
-- **Error Type**: What went wrong (HTTP 404, DNS Failure, SSL Error, etc.)
+The tool generates a CSV report with the following columns (the header
+row is written by `csv.DictWriter`, so the names below appear in the
+file exactly as shown):
+
+| Column | Description |
+|---|---|
+| `post_title` | Title of the post containing the broken link |
+| `post_url` | URL of the post |
+| `broken_link` | The broken URL |
+| `error_type` | What went wrong (e.g. `HTTP 404`, `DNS Failure`, `SSL Error`) |
## Error Types Detected