Change `Annotation.Gwernnet` essay metadata extraction from live web scraping to parsing Pandoc Gwerndown docs directly

Right now, annotation metadata for Gwern.net essays such as `/design` is generated by scraping the live page `https://gwern.net/design`, rather than by reading and parsing the Pandoc Gwerndown source directly; the scraping is handled in [`Annotation.Gwernnet`](/static/build/Annotation/Gwernnet.hs). The original rationale was that this was an easy hack to get at the final compiled metadata, a ToC (inaccessible in the Pandoc API), etc., and to use TagSoup to easily extract elements like `div.abstract` rather than wrangling the Pandoc API. As so often with such 'clever' hacks, it has gradually become a serious burden: the annotations are perpetually out of date, the rescrapes have to happen monthly, they have to be versioned to incorporate tag metadata, they create massive Git churn when thousands of entries may update on a rescrape, the scraping logic is ever more fiddly and unreliable, nice-to-have refinements like pruning the ToC are difficult to implement...

The right thing is obviously to read the Gwerndown files directly and extract the necessary information from the YAML metadata, and perhaps from immediately-compiled HTML output, but this has been too tedious to implement in the past.
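As a rough illustration of what reading the source directly might look like, here is a minimal sketch using the Pandoc library. It is not the existing `Annotation.Gwernnet` code: the `EssayMeta` record, the `extractEssayMeta` function, the YAML field names `title` and `description`, and the sample document are all assumptions made for the example, and it treats Gwerndown as ordinary Pandoc Markdown with the default extensions.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Text as T
import Text.Pandoc (Block (Div), Pandoc (..), PandocError, def, lookupMeta,
                    nullMeta, pandocExtensions, readMarkdown,
                    readerExtensions, runPure, writeHtml5String)
import Text.Pandoc.Shared (stringify)
import Text.Pandoc.Walk (query)

-- Hypothetical record for the extracted fields (not a type from the codebase).
data EssayMeta = EssayMeta
  { emTitle        :: T.Text
  , emDescription  :: T.Text
  , emAbstractHtml :: Maybe T.Text
  } deriving (Show, Eq)

-- Collect the contents of every Div carrying the class "abstract".
abstractDivs :: Block -> [[Block]]
abstractDivs (Div (_, classes, _) bs) | "abstract" `elem` classes = [bs]
abstractDivs _ = []

-- Parse the Markdown source, read the YAML fields, and compile only the
-- first abstract Div to HTML, with no network access or scraping.
extractEssayMeta :: T.Text -> Either PandocError EssayMeta
extractEssayMeta src = runPure $ do
  doc@(Pandoc meta _) <-
    readMarkdown def { readerExtensions = pandocExtensions } src
  let field k = maybe "" stringify (lookupMeta k meta)
  abstractHtml <- case query abstractDivs doc of
    (bs : _) -> Just <$> writeHtml5String def (Pandoc nullMeta bs)
    []       -> pure Nothing
  pure (EssayMeta (field "title") (field "description") abstractHtml)

-- Made-up sample document standing in for an essay source file.
sample :: T.Text
sample = T.unlines
  [ "---"
  , "title: Design Of This Website"
  , "description: Meta page on site design."
  , "..."
  , ""
  , "<div class=\"abstract\">"
  , ""
  , "> A short summary of the essay."
  , ""
  , "</div>"
  , ""
  , "# Section"
  , ""
  , "Body text."
  ]

main :: IO ()
main = either (putStrLn . ("parse error: " <>) . show) print
              (extractEssayMeta sample)
```

Because the parse runs in `runPure`, the extraction is deterministic and could run at build time against the file on disk, which would sidestep the staleness and monthly-rescrape problems described above. A ToC could presumably be derived the same way by querying `Header` blocks, though that is not shown here.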

Change `Annotation.Gwernnet` essay metadata extraction from live web scraping to parsing Pandoc Gwerndown docs directly #52