213 changes: 131 additions & 82 deletions README.md
@@ -29,13 +29,18 @@ Checking broken links in your newsletter archive shouldn't cost $100+/month for
# Install (provides the `substack-link-checker` CLI)
pip install git+https://github.com/jcddc83/substack-broken-link-checker.git

# Smoke-test your install against a handful of known-good/bad URLs
substack-link-checker demo

# Check all posts from 2024
substack-link-checker check --base-url https://YOUR.substack.com --year 2024

# Check posts from a file
substack-link-checker check --base-url https://YOUR.substack.com --url-file posts.txt
```

Confirm the installed version with `substack-link-checker --version`.

## Installation

```bash
@@ -80,9 +85,11 @@ because its name collides with the new package; use
`substack-link-checker check ...` or `python -m substack_link_checker check ...`
instead.

## Authentication

Optional in principle, but **usually needed in practice**: Substack's
bot protection rejects most unauthenticated archive scans, and checking
paywalled content requires a logged-in session. Use your session cookie:

1. Log into your Substack in a browser
2. Open Developer Tools (F12) → Application → Cookies
@@ -105,77 +112,6 @@ so it does not end up in your shell history or in `ps aux`. See

**Note:** Your session cookie expires after a few weeks. If you start getting 403 errors, get a fresh cookie from your browser.

## Usage

### Basic Usage
@@ -228,10 +164,30 @@ substack-link-checker check --base-url https://example.substack.com \
--url-file unchecked_posts.txt --history-file checked_posts.json
```

### Importing Previous Results

If you have an existing Excel or CSV file from a prior scan (or another
tool), `import` extracts unique post URLs into the history file so
`--only-new` will skip them on future runs.

The input file must have a column whose header contains "Post URL"
(case-insensitive, also matches `post_url`). Other columns are ignored.

```bash
# From an Excel report
substack-link-checker import previous_report.xlsx --history-file checked_posts.json

# Or from a CSV
substack-link-checker import previous_report.csv --history-file checked_posts.json
```

Excel imports require `pandas` and `openpyxl`, which are installed
automatically as part of the package.
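
The header-matching rule described above can be sketched roughly as follows. This is an illustration only, not the tool's actual code: `find_post_url_column` and `unique_post_urls` are hypothetical names, and treating underscores like spaces is an assumption made to explain why `post_url` matches.

```python
import csv
import io


def find_post_url_column(headers):
    """Return the first header containing 'post url', ignoring case."""
    for header in headers:
        # Assumption: underscores count as spaces, so `post_url` matches too.
        normalized = header.lower().replace("_", " ")
        if "post url" in normalized:
            return header
    return None


def unique_post_urls(csv_text):
    """Collect unique, non-empty post URLs in first-seen order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    column = find_post_url_column(reader.fieldnames or [])
    if column is None:
        raise ValueError("no column whose header contains 'Post URL'")
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for row in reader:
        url = (row[column] or "").strip()
        if url:
            seen.setdefault(url, None)
    return list(seen)


# Made-up sample data: two broken links in one post, one in another.
sample = (
    "Post Title,Post URL,Broken Link\n"
    "Hello,https://example.substack.com/p/hello,https://a.example/x\n"
    "Hello,https://example.substack.com/p/hello,https://b.example/y\n"
    "World,https://example.substack.com/p/world,https://c.example/z\n"
)
print(unique_post_urls(sample))
```

Each post URL appears once in the result even when a prior report listed several broken links for the same post.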

## Example Output (`--verbose`)

```
$ substack-link-checker check --base-url https://example.substack.com --year 2024 --verbose

Substack Broken Link Checker
==================================================
@@ -268,7 +224,85 @@ Generating report: broken_links_report.csv
Report generated with 5 broken links
```

Without `--verbose`, the per-post "Checking N links…" and "Found N
broken links in this post" lines are suppressed; the header, progress
counter, and SUMMARY block are always shown.

## Troubleshooting

Common failure modes and how to fix them:

### `HTTP 403 Forbidden` when fetching the sitemap or post pages

Substack's bot protection is rejecting unauthenticated requests. In
order of likelihood:

1. Set `SUBSTACK_COOKIE` (see [Authentication](#authentication)
above) so you're requesting as a logged-in user.
2. If you had a cookie set: it has probably expired (Substack rotates
session cookies every few weeks). Grab a fresh one from DevTools.
3. If both are current: lower `--concurrency` (try `--concurrency 3`)
so you look less bot-like.

### `Sitemap returns no posts for --year YYYY`

The year-specific sitemap (e.g. `/sitemap-2024.xml`) doesn't exist for
your Substack — some accounts only expose a single combined sitemap.
Fall back to scraping the archive page:

```bash
substack-link-checker fetch-archive https://YOUR.substack.com 2024
# Produces archive_urls_2024.txt
substack-link-checker check --base-url https://YOUR.substack.com \
--url-file archive_urls_2024.txt
```
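
When diagnosing this, it can help to look at what a sitemap actually contains. The sketch below parses standard sitemap XML with the Python standard library; the sample documents are made up, `post_urls_from_sitemap` is a hypothetical helper, and nothing here is taken from the tool's own implementation.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def post_urls_from_sitemap(xml_text):
    """Return every <loc> value found under <url> entries."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.substack.com/p/first-post</loc></url>
  <url><loc>https://example.substack.com/p/second-post</loc></url>
</urlset>"""

# A sitemap with no <url> entries yields an empty list.
empty = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'

print(post_urls_from_sitemap(sample))
print(post_urls_from_sitemap(empty))
```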

### `DNS Failure` or `Timeout` for links that work in your browser

The target site is most likely rate-limiting or geo-blocking the checker
rather than being broken. Add it to `--skip-domains` so it's assumed OK:

```bash
substack-link-checker check ... --skip-domains rate-limited.example.com
```

For a recurring list, put one domain per line in a file and pass
`--skip-domains-file path/to/file.txt`.
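
As a rough illustration of the one-domain-per-line format, the sketch below parses such a list and tests a URL against it. The function names are hypothetical, and the matching rule (a listed domain also covers its subdomains) is an assumption; the tool's actual matching may differ.

```python
from urllib.parse import urlsplit


def parse_domain_list(text):
    """One domain per line; blank lines are ignored."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def is_skipped(url, skip_domains):
    """True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: subdomains of a listed domain are skipped as well.
    return any(host == d or host.endswith("." + d) for d in skip_domains)


skip = parse_domain_list("rate-limited.example.com\n\nslow.example.org\n")

print(is_skipped("https://rate-limited.example.com/page", skip))
print(is_skipped("https://other.example.com/page", skip))
```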

### `Connection Error: ...ssl:default` / `SSL Error`

The target host uses an old TLS version that Python's `ssl` module no
longer accepts by default. The usual fix is to flag the domain as
broken, since it really is unreachable from a modern client:

```bash
substack-link-checker check ... --broken-domains old-tls.example.com
```

### Many `Soft 404 (page title indicates error)` results that look fine

The detector matches phrases like "page not found" in the page `<title>`.
If a legitimate post happens to have one of those phrases in its title,
it'll be misflagged. Open the report, eyeball the URL, and if it's
genuinely live, ignore those rows.
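
To see why a legitimate page can be misflagged, here is a minimal sketch of a title-based check. Only the phrase "page not found" comes from the description above; the second phrase, the regex, and the name `looks_like_soft_404` are assumptions for illustration and not the tool's real detector.

```python
import re

# "page not found" is from the README; "error 404" is a hypothetical extra.
ERROR_PHRASES = ("page not found", "error 404")

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_soft_404(html):
    """True if the page <title> contains one of the error phrases."""
    match = TITLE_RE.search(html)
    if match is None:
        return False
    title = match.group(1).strip().lower()
    return any(phrase in title for phrase in ERROR_PHRASES)


# A genuine error page.
print(looks_like_soft_404("<head><title>Page Not Found</title></head>"))
# A live article that merely mentions the phrase: a false positive.
print(looks_like_soft_404("<title>Why 'Page Not Found' Pages Matter</title>"))
# An ordinary page.
print(looks_like_soft_404("<title>Weekly roundup</title>"))
```

A substring match on the title cannot tell the first two cases apart, which is why a manual look at the flagged URL is still needed.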

### The CSV report file is empty / has only a header

Either no broken links were found (look for "No broken links found!"
in the summary) or the run was interrupted before report generation.
The tool writes the CSV only after every post has been checked.

### `--only-new` is not skipping anything

Make sure `--history-file` points at the same JSON file you used on
the previous run. The history file is the source of truth for which
posts have already been checked; without it `--only-new` has nothing
to compare against.
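
The effect of pointing at the wrong history file can be sketched like this. The on-disk format is an assumption (a flat JSON array of post URLs), and `load_history` and `only_new` are hypothetical names; the real history file may be structured differently.

```python
import json
import tempfile
from pathlib import Path


def load_history(path):
    """Return the set of already-checked post URLs (empty if the file is missing)."""
    path = Path(path)
    if not path.exists():
        return set()
    # Assumption: the file holds a flat JSON array of URLs.
    return set(json.loads(path.read_text(encoding="utf-8")))


def only_new(post_urls, history):
    """Keep only the posts that are not in the history."""
    return [url for url in post_urls if url not in history]


posts = [
    "https://example.substack.com/p/old",
    "https://example.substack.com/p/new",
]

with tempfile.TemporaryDirectory() as tmp:
    history_file = Path(tmp) / "checked_posts.json"
    history_file.write_text(json.dumps([posts[0]]), encoding="utf-8")

    # Same file as the previous run: the old post is skipped.
    remaining = only_new(posts, load_history(history_file))
    # A different (missing) file: nothing is known, so nothing is skipped.
    without_history = only_new(posts, load_history(Path(tmp) / "other.json"))

print(remaining)
print(without_history)
```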

## `check` Subcommand Options

The options below apply to `substack-link-checker check`. For other
subcommands, run `substack-link-checker <subcommand> --help`.

| Option | Short | Description |
|--------|-------|-------------|
@@ -289,6 +323,9 @@ Report generated with 5 broken links
| `--verbose` | `-v` | Show detailed progress |
| `--limit` | `-l` | Max posts to check |

Top-level flags: `--version` prints the installed version; `--help`
lists all subcommands.

## Subcommands

| Command | Purpose |
@@ -298,15 +335,27 @@ Report generated with 5 broken links
| `substack-link-checker import` | Import previous results from Excel/CSV into history |
| `substack-link-checker fetch-archive` | Extract URLs from the `/archive` page (fallback when the sitemap doesn't work) |
| `substack-link-checker demo` | Self-contained demo against a handful of known-good/bad URLs |

### Scheduled / automated runs

`run_link_checker.ps1` (at the repo root) is a PowerShell wrapper meant
for Windows Task Scheduler. It runs `compare` to find new posts, then
`check` to scan them, writing reports to `reports/` with a timestamped
filename. Set `$SUBSTACK_URL` and `$PROJECT_DIR` at the top of the
script before first use.

## Output

The tool generates a CSV report with the following columns. The header
row is written by `csv.DictWriter`, so the names below appear in the
file exactly as shown:

| Column | Description |
|---|---|
| `post_title` | Title of the post containing the broken link |
| `post_url` | URL of the post |
| `broken_link` | The broken URL |
| `error_type` | What went wrong (e.g. `HTTP 404`, `DNS Failure`, `SSL Error`) |
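
Because the header names are stable, the report is easy to post-process. The sketch below tallies rows by `error_type` using only the standard library; the sample rows are made up, and `count_by_error_type` is a hypothetical helper rather than part of the package.

```python
import csv
import io
from collections import Counter


def count_by_error_type(report_file):
    """Tally report rows by their `error_type` column."""
    reader = csv.DictReader(report_file)
    return Counter(row["error_type"] for row in reader)


# Made-up rows using the column names listed above.
sample_report = io.StringIO(
    "post_title,post_url,broken_link,error_type\n"
    "Hello,https://example.substack.com/p/hello,https://a.example/x,HTTP 404\n"
    "Hello,https://example.substack.com/p/hello,https://b.example/y,DNS Failure\n"
    "World,https://example.substack.com/p/world,https://c.example/z,HTTP 404\n"
)

counts = count_by_error_type(sample_report)
for error_type, count in counts.most_common():
    print(f"{count:>3}  {error_type}")
```

For a real report, pass a file opened with `open("broken_links_report.csv", newline="", encoding="utf-8")` instead of the in-memory sample.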

## Error Types Detected
