Implement automatic formatting of footnote references #163
Use documentTextDetection instead of textDetection for better recognition of complex diacritic characters used in mnp. Bug: T383002
Bug: T375919 Part 1: Core processor and unit tests
Bug: T375919 Part 2: Controller & cache integration + API parameter
Bug: T375919 Part 3: UI checkbox and i18n
@samwilson can you please review this PR? This feature would be very useful. In Ukrainian Wikisource we currently have a proofreading contest with works that contain a lot of references, and this would help a lot with formatting.
samwilson left a comment
Looks good! A few stylistic comments only (and the fact that the homepage doesn't work any more! :-P ).
The ReferencePostProcessor object could be injected into the controller, and not have all its methods be static, but however you prefer is fine.
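For readers unfamiliar with the suggestion, here is a minimal sketch of what injecting the processor could look like. Everything except the `ReferencePostProcessor` class name is an assumption made for illustration: `OcrController`, `process()` and `formatResult()` are hypothetical names, the method body is a placeholder, and constructor property promotion assumes PHP 8.0 or later.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in: the real class in the PR has its own (currently static) methods.
final class ReferencePostProcessor
{
    public function process(string $text): string
    {
        // Placeholder for the real footnote logic.
        return trim($text);
    }
}

// Hypothetical controller: the processor arrives through the constructor,
// so a test can pass in a stub instead of calling static methods.
final class OcrController
{
    public function __construct(private ReferencePostProcessor $postProcessor)
    {
    }

    public function formatResult(string $ocrText, bool $handleRefs): string
    {
        return $handleRefs ? $this->postProcessor->process($ocrText) : $ocrText;
    }
}

$controller = new OcrController(new ReferencePostProcessor());
echo $controller->formatResult("  some OCR text  ", true), "\n";
```

In a Symfony application the container would normally supply the constructor argument through autowiring, so the controller would not need to build the processor itself.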
    use Symfony\Contracts\Cache\ItemInterface;

    // phpcs:ignore MediaWiki.Classes.UnusedUseStatement.UnusedUse
    // phpcs:ignore MediaWiki.Classes.UnusedUseStatement.UnusedUse
These two lines appear to not be ignoring anything.
    }
    static::$params['crop'] = array_map( 'intval', $crop );

    static::$params['handle_refs'] = (bool)$this->request->query->get( 'handle_refs', false );
Could use query->getBoolean().
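One reason the suggestion matters: query-string values arrive as strings, and a plain `(bool)` cast treats the string `"false"` as true. As far as I know, Symfony's `getBoolean()` validates the value with PHP's boolean filter instead of casting it. The sketch below uses only plain PHP to show the difference and assumes a request such as `?handle_refs=false`.

```php
<?php
declare(strict_types=1);

// Query-string values always arrive as strings, e.g. "?handle_refs=false".
$raw = 'false';

// A plain cast treats every non-empty string other than "0" as true.
$viaCast = (bool)$raw;

// The boolean filter understands "1", "true", "on", "yes" and their negatives.
$viaFilter = filter_var($raw, FILTER_VALIDATE_BOOLEAN);

var_dump($viaCast);   // bool(true)
var_dump($viaFilter); // bool(false)
```

With the cast, a client sending `handle_refs=false` would still get footnote handling switched on, which is probably not what they intended.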
     * The main form and result page.
     * @Route("/", name="home")
     * @return Response
     */
    "tesseract-internal-error": "The tesseract engine returned an internal error.",
    "error-json": "The engine returned an invalid JSON response.",
    "handle-refs": "Automatically format footnote references",
    "handle-refs-help": "Detects footnote markers (e.g. ¹, 1)) and wraps the matching footnote text in <ref> tags. Works best on clean, single-column book pages.",
Suggest wrapping the example footnote markers in <code> to avoid the slightly odd-looking double closing parenthesis. It might be good to more fully document the three supported syntaxes here as well, so people know what to expect.
Also, the brackets of <ref> need to be encoded.
    // The last non-blank line of the document must be either:
    // - A footnote start
    // - A continuation of a footnote start higher up
    }
This if statement is empty, is that intentional?
    // It consists of footnote start lines, continuation lines, and optionally blank lines.
    // The block must start with a valid footnote line.

    $footnoteStartIndex = -1;
This immediately gets overwritten below on line 134.
The OCR tool Git repo has now moved to GitLab. Sorry to make extra work, but could this PR please be pushed to https://gitlab.wikimedia.org/toolforge-repos/ocr instead of GitHub? Thanks!
Adds an opt-in feature to automatically format academic footnotes in OCR results. It detects footnote definitions at the bottom of a page and replaces their inline markers with standard Wikitext <ref>footnote text</ref> tags. Includes a UI toggle and API support, with caching effectively isolated.
Bug: T375919
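To make the described behaviour concrete, here is a rough sketch of the general idea. It is not the PR's `ReferencePostProcessor`: the function name `formatFootnoteReferences` is hypothetical, and it assumes only one marker syntax (digits followed by `)`) and single-line footnotes, whereas the PR supports more syntaxes and continuation lines.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: moves trailing "N) text" footnote lines into
 * <ref> tags at the matching inline "N)" markers.
 */
function formatFootnoteReferences(string $text): string
{
    $lines = explode("\n", $text);
    $footnotes = [];

    // Walk upward from the last line, collecting footnote definitions
    // and skipping blank lines, until a line of body text is reached.
    while ($lines !== []) {
        $last = trim(end($lines));
        if ($last === '') {
            array_pop($lines);
            continue;
        }
        if (preg_match('/^(\d+)\)\s+(.+)$/u', $last, $m) !== 1) {
            break;
        }
        $footnotes[$m[1]] = $m[2];
        array_pop($lines);
    }
    if ($footnotes === []) {
        return $text;
    }

    // Replace the first inline occurrence of each marker in a single pass,
    // so inserted footnote text is never rescanned for markers.
    $used = [];
    $result = preg_replace_callback(
        '/(?<!\d)(\d+)\)/',
        function (array $m) use ($footnotes, &$used): string {
            $marker = $m[1];
            if (!isset($footnotes[$marker]) || isset($used[$marker])) {
                return $m[0];
            }
            $used[$marker] = true;
            return '<ref>' . $footnotes[$marker] . '</ref>';
        },
        implode("\n", $lines)
    );

    // If any footnote has no inline marker, return the input unchanged
    // rather than silently dropping footnote text.
    if ($result === null || count($used) !== count($footnotes)) {
        return $text;
    }
    return $result;
}

$page = "The tide rose quickly1) and fell again2).\n\n1) See the 1890 survey.\n2) Ibid.";
echo formatFootnoteReferences($page), "\n";
```

The sketch is deliberately conservative: when a footnote cannot be matched it leaves the page alone. It would still misfire on body text such as "(see p. 5)" when a footnote 5 exists, which is one reason a feature like this is best kept opt-in.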