Right now, annotation metadata for Gwern.net essays such as /design is generated by scraping the live page https://gwern.net/design (handled in Annotation.Gwernnet), rather than by reading and parsing the Pandoc Gwerndown source directly. The original rationale was that scraping was an easy hack to get at the final compiled metadata and a ToC (inaccessible in the Pandoc API) etc, and that TagSoup could easily extract elements like div.abstract without wrangling the Pandoc API. As so often with such 'clever' hacks, it has gradually become a serious burden: the annotations are perpetually out of date, the rescrapes have to happen monthly, they have to be versioned to incorporate tag metadata, they create massive Git churn when thousands of entries may update on a rescrape, the scraping logic is ever more fiddly and unreliable, and nice-to-have refinements like pruning the ToC are difficult to implement.
The right thing is obviously to read the Gwerndown files directly and extract the necessary information from the YAML metadata, and perhaps from immediately-compiled HTML output, but this has been too tedious to implement in the past.
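As a rough illustration of what "reading the source directly" could look like, here is a minimal sketch. It is not the site's actual Haskell/Pandoc code: it is written in stdlib-only Python, it uses regexes rather than the Pandoc AST, and it assumes flat `key: value` YAML front matter and ATX (`#`) headings. The function names and the `title`/`description` fields in the usage example are hypothetical. It does not attempt the abstract, which would still need the Pandoc AST or compiled HTML.

```python
import re


def split_front_matter(text):
    """Split a Markdown source into (front-matter lines, body text)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return [], text
    for i in range(1, len(lines)):
        # Pandoc accepts either '---' or '...' as the closing delimiter.
        if lines[i].strip() in ("---", "..."):
            return lines[1:i], "\n".join(lines[i + 1:])
    return [], text  # unterminated block: treat everything as body


def parse_flat_yaml(lines):
    """Parse only flat `key: value` pairs (assumption); nested YAML is ignored."""
    meta = {}
    for line in lines:
        m = re.match(r"^([A-Za-z][\w-]*):\s*(.*)$", line)
        if m:
            meta[m.group(1)] = m.group(2).strip().strip("'\"")
    return meta


def extract_toc(body):
    """Collect (level, title) for ATX headings, skipping fenced code blocks."""
    toc, in_fence = [], False
    for line in body.splitlines():
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence
            continue
        m = None if in_fence else re.match(r"^(#{1,6})\s+(.+?)\s*#*\s*$", line)
        if m:
            toc.append((len(m.group(1)), m.group(2)))
    return toc


def annotation_from_source(text):
    """Build annotation metadata from the source text, with no scraping step."""
    front_matter, body = split_front_matter(text)
    return {"meta": parse_flat_yaml(front_matter), "toc": extract_toc(body)}
```

Calling `annotation_from_source` on a file's contents returns a dict with a `meta` mapping and a `toc` list of `(level, title)` pairs. Because the result is derived from the source file itself, it cannot lag behind the page the way a monthly rescrape does, and pruning the ToC becomes a filter over that list.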