WSM (Westminster) scraper failing — ModGov endpoint redirecting to 404 error page #379

Description

@symroe

Error (2026-06-20)

is_status error: wreq::Error { kind: Status(404, None), uri: https://committees.westminster.gov.uk/mgError.aspx }
Duration: 6.5s

Investigation

The scraper's base_url in metadata.json is http://committees.westminster.gov.uk (HTTP). The framework constructs http://committees.westminster.gov.uk/mgWebService.asmx/GetCouncillorsByWard and the following redirect chain occurs:

  1. http://committees.westminster.gov.uk/mgWebService.asmx/GetCouncillorsByWard → 301 → HTTPS
  2. https://committees.westminster.gov.uk/mgWebService.asmx/GetCouncillorsByWard → 302 → mgError.aspx
  3. https://committees.westminster.gov.uk/mgError.aspx → 404

So the ModGov web service on Westminster's server is reachable, but the application redirects the request to its error page (mgError.aspx), and that error page itself returns 404. This is an application-level error, not a network or certificate issue.

Direct fetch of https://committees.westminster.gov.uk returns HTTP 403 (Cloudflare WAF), so the domain is live but behind Cloudflare. The ModGov service endpoint itself is returning an application error rather than councillor data.
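
The redirect chain in this investigation can be reproduced one hop at a time. The following is a minimal sketch, not part of the scraper framework: `trace_redirects` and `fetch_with_http_client` are hypothetical helper names, and the fetcher uses only the standard library. A plain client like this may be answered with 403 by the Cloudflare WAF, in which case the trace stops at the first hop.

```python
import http.client
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit

# (status code, value of the Location header or None)
Response = Tuple[int, Optional[str]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def trace_redirects(
    url: str,
    fetch: Callable[[str], Response],
    max_hops: int = 10,
) -> List[Tuple[str, int]]:
    """Follow redirects manually, recording (url, status) for every hop."""
    hops: List[Tuple[str, int]] = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops


def fetch_with_http_client(url: str) -> Response:
    """Issue a single GET without following redirects (http.client never does)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling `trace_redirects("http://committees.westminster.gov.uk/mgWebService.asmx/GetCouncillorsByWard", fetch_with_http_client)` returns the list of hops, which can be compared against the chain recorded above. Passing the fetcher as an argument keeps the tracing logic testable without network access.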

Fix patterns ruled out

  1. HTTPS migration — Would fix the http:// → https:// mismatch, but the HTTPS endpoint itself is broken (returning 302 → mgError.aspx)
  2. verify_requests = False — Not a cert issue; the TLS handshake completes
  3. http_lib = "playwright" — Might bypass the Cloudflare WAF on the main domain, but the underlying ModGov application error (redirect to error page) would still occur
  4. URL changes — Unknown; Westminster's main website may reference a new councillors URL or a different CMS

What needs to happen

  1. Check whether committees.westminster.gov.uk is temporarily broken or if Westminster has migrated their ModGov instance or moved to a different CMS.
  2. If still on ModGov at a different subdomain, update base_url in metadata.json.
  3. Also migrate base_url from http:// to https:// regardless.
  4. Add http_lib = "playwright" if the new URL is behind Cloudflare.

This requires visual inspection of Westminster's council website to find the current councillors URL.
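
Once the correct host is known, the base_url change in steps 2 and 3 could be scripted. This is a minimal sketch that assumes metadata.json holds a top-level `base_url` key, which this issue does not confirm. `with_https_base_url` is a hypothetical helper, not a framework function.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def with_https_base_url(metadata: dict, new_host: Optional[str] = None) -> dict:
    """Return a copy of metadata whose base_url uses https.

    If new_host is given, the hostname is swapped as well; otherwise the
    existing host is kept. Assumes a top-level "base_url" key.
    """
    updated = dict(metadata)
    parts = urlsplit(updated["base_url"])
    updated["base_url"] = urlunsplit(
        ("https", new_host or parts.netloc, parts.path, parts.query, parts.fragment)
    )
    return updated
```

For example, `with_https_base_url({"base_url": "http://committees.westminster.gov.uk"})` yields a `base_url` of `https://committees.westminster.gov.uk`. The result can be written back with `json.dump` after the new host has been confirmed by hand.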
