This repository was archived by the owner on May 28, 2026. It is now read-only.

Implement automatic formatting of footnote references #163

Open
vm-pranavan wants to merge 14 commits into wikimedia:main from vm-pranavan:ARH


@vm-pranavan

Contributor

Adds an opt-in feature to automatically format academic footnotes in OCR results. It detects footnote definitions at the bottom of a page and replaces their inline markers with standard Wikitext <ref>footnote text</ref> tags. Includes a UI toggle and API support, with caching effectively isolated.
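To make the described transformation concrete, here is a minimal sketch of the idea. It is not the PR's ReferencePostProcessor: the function name `formatFootnotes` is hypothetical, and it assumes only the `1) text` footnote style with single-line definitions at the very end of the page.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch, not the code in this PR.
 * Assumes footnotes look like "1) text" and sit on the last lines of the page.
 */
function formatFootnotes(string $text): string
{
    $lines = explode("\n", rtrim($text));
    $notes = [];

    // Pop footnote definitions off the end of the page, last one first.
    while ($lines !== [] && preg_match('/^\s*(\d+)\)\s+(.+)$/u', end($lines), $m)) {
        $notes[$m[1]] = trim($m[2]);
        array_pop($lines);
    }
    if ($notes === []) {
        // No trailing footnote block: leave the OCR text untouched.
        return $text;
    }

    // Drop the blank lines that separated the body from the footnote block.
    while ($lines !== [] && trim(end($lines)) === '') {
        array_pop($lines);
    }
    $body = implode("\n", $lines);

    // Replace inline markers such as "word1)" with the matching footnote text.
    // Markers without a matching definition are left as they are.
    $result = preg_replace_callback(
        '/(?<=[^\s\d])(\d+)\)/u',
        static fn (array $m): string => isset($notes[$m[1]])
            ? '<ref>' . $notes[$m[1]] . '</ref>'
            : $m[0],
        $body
    );

    return $result ?? $body;
}

echo formatFootnotes("The sky is blue1) and grass is green2).\n\n1) See Smith.\n2) See Jones."), "\n";
```

Multi-line footnotes and the superscript marker style are deliberately left out of this sketch.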

Bug: T375919

Use documentTextDetection instead of textDetection for better
recognition of complex diacritic characters used in mnp.

Bug: T383002
Bug: T375919
Part 1: Core processor and unit tests
Bug: T375919
Part 2: Controller & cache integration + API parameter
Bug: T375919
Part 3: UI checkbox and i18n
@bicolino34

Contributor

@samwilson can you please review this PR? This feature would be very useful. In Ukrainian Wikisource we currently have a proofreading contest with works that contain a lot of references, and this would help a lot with formatting.

@samwilson samwilson left a comment

Member

Looks good! A few stylistic comments only (and the fact that the homepage doesn't work any more! :-P ).

The ReferencePostProcessor object could be injected into the controller, and not have all its methods be static, but however you prefer is fine.
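A rough sketch of what the injected, non-static variant could look like. Only the class name `ReferencePostProcessor` comes from this PR; `ExampleOcrController`, `process()` and `postProcess()` are hypothetical names, and the body of `process()` is a placeholder rather than the real footnote logic. In the Symfony app the container would presumably supply the constructor argument through autowiring; here it is wired by hand so the snippet runs on its own (PHP 8.0 or later).

```php
<?php
declare(strict_types=1);

// Class name taken from the PR; the method is a stand-in for the real logic.
class ReferencePostProcessor
{
    public function process(string $text): string
    {
        // Placeholder transformation so the wiring can be exercised.
        return trim($text);
    }
}

// Hypothetical stand-in for the tool's Symfony controller.
class ExampleOcrController
{
    // The processor arrives as a constructor dependency instead of static calls.
    public function __construct(private ReferencePostProcessor $refProcessor)
    {
    }

    public function postProcess(string $ocrText, bool $handleRefs): string
    {
        return $handleRefs ? $this->refProcessor->process($ocrText) : $ocrText;
    }
}

$controller = new ExampleOcrController(new ReferencePostProcessor());
echo $controller->postProcess("  some OCR text  ", true), "\n";
```

The main practical gain would be that tests can pass in a stub processor instead of depending on static state.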

use Symfony\Contracts\Cache\ItemInterface;

// phpcs:ignore MediaWiki.Classes.UnusedUseStatement.UnusedUse
// phpcs:ignore MediaWiki.Classes.UnusedUseStatement.UnusedUse

Member

These two lines appear to not be ignoring anything.

}
static::$params['crop'] = array_map( 'intval', $crop );

static::$params['handle_refs'] = (bool)$this->request->query->get( 'handle_refs', false );

Member

Could use query->getBoolean().
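For context on why this matters: query-string values arrive as strings, and a plain `(bool)` cast treats `"false"` as true. As far as I know, Symfony's `getBoolean()` is built on PHP's boolean validation filter, so the difference can be shown with built-ins alone. The helper names `castFlag` and `filterFlag` below are made up for the comparison.

```php
<?php
declare(strict_types=1);

// Mirrors the (bool) cast used in the diff: any non-empty string except "0" is true.
function castFlag(string $raw): bool
{
    return (bool)$raw;
}

// Uses PHP's boolean filter: "1", "true", "on" and "yes" are true, anything else false.
function filterFlag(string $raw): bool
{
    return filter_var($raw, FILTER_VALIDATE_BOOLEAN);
}

foreach (['1', 'true', 'false', '0', 'no'] as $raw) {
    printf("%-5s cast=%s filter=%s\n", $raw, var_export(castFlag($raw), true), var_export(filterFlag($raw), true));
}
```

With the cast, a request such as `?handle_refs=false` would switch the feature on, which is probably not what API users expect.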

* The main form and result page.
* @Route("/", name="home")
* @return Response
*/

Member

This deletes the / route.

Comment thread i18n/en.json
"tesseract-internal-error": "The tesseract engine returned an internal error.",
"error-json": "The engine returned an invalid JSON response.",
"handle-refs": "Automatically format footnote references",
"handle-refs-help": "Detects footnote markers (e.g. ¹, 1)) and wraps the matching footnote text in <ref> tags. Works best on clean, single-column book pages.",

Member

Suggest wrapping the example footnote markers in <code> to avoid the slightly odd-looking double closing parenthesis. It might be good to more fully document the three supported syntaxes here as well, so people know what to expect.

Also, the brackets of <ref> need to be encoded.
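A small illustration of both suggestions, using only PHP built-ins. The variable names are hypothetical, and the message only covers the two marker examples already in the string; the third supported syntax is not spelled out here.

```php
<?php
declare(strict_types=1);

// Wrap each example marker in <code> so "1)" no longer runs into the closing parenthesis.
$markers = array_map(
    static fn (string $m): string => '<code>' . $m . '</code>',
    ['¹', '1)']
);

// Encode the angle brackets so the message shows the tag name literally.
$refTag = htmlspecialchars('<ref>');

$help = 'Detects footnote markers (e.g. ' . implode(', ', $markers) . ') and wraps the matching footnote text in '
    . $refTag . ' tags.';

echo $help, "\n";
```

In i18n/en.json itself the encoded form would simply be written out as `&lt;ref&gt;` in the message text.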

// The last non-blank line of the document must be either:
// - A footnote start
// - A continuation of a footnote start higher up
}

Member

This if statement is empty, is that intentional?

// It consists of footnote start lines, continuation lines, and optionally blank lines.
// The block must start with a valid footnote line.

$footnoteStartIndex = -1;

Member

This immediately gets overwritten below on line 134.


@samwilson

Member

The OCR tool Git repo has now moved to GitLab. Sorry to make extra work, but could this PR please be pushed to https://gitlab.wikimedia.org/toolforge-repos/ocr instead of GitHub? Thanks!
