This repository was archived by the owner on May 28, 2026. It is now read-only.

Add new OCR parameter to normalize the result text #112

Draft
stweil wants to merge 1 commit into wikimedia:main from stweil:normalize
Conversation

stweil (Contributor) commented Sep 22, 2023

No description provided.

stweil (Contributor, Author) commented Sep 22, 2023

Example: Tesseract OCR with and without normalization.

The normalization works with any OCR engine. The cache always stores the original OCR text. Therefore it is possible to switch to normalized text without a new OCR run.
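The caching behaviour described above can be sketched roughly as follows. This is only an illustration, not the code from this pull request: the `CachedOcr` class, its in-memory array cache, the `$engineRuns` counter, and the single-entry replacement map are all hypothetical stand-ins. The point is that the raw engine output is what gets cached, and normalization is applied when the text is read.

```php
<?php
// Hypothetical sketch: class name, cache layout and replacement map are
// assumptions for illustration, not taken from the actual tool.
final class CachedOcr {
	/** @var array<string,string> Raw OCR text, keyed by image URL. */
	private $cache = [];

	/** @var callable Any OCR engine: takes an image URL, returns text. */
	private $engine;

	/** @var int Number of times the engine was actually invoked. */
	public $engineRuns = 0;

	public function __construct( callable $engine ) {
		$this->engine = $engine;
	}

	public function getText( string $imageUrl, bool $normalize = false ): string {
		if ( !isset( $this->cache[$imageUrl] ) ) {
			// Only the original, un-normalized text is stored.
			$this->cache[$imageUrl] = ( $this->engine )( $imageUrl );
			$this->engineRuns++;
		}
		$text = $this->cache[$imageUrl];
		// Normalization happens after the cache lookup, so toggling it
		// never triggers a new OCR run. The map here is illustrative only.
		return $normalize ? strtr( $text, [ 'ſ' => 's' ] ) : $text;
	}
}

// Fake engine standing in for Tesseract or any other backend.
$ocr = new CachedOcr( function ( string $url ): string {
	return 'Waſſer';
} );
$raw = $ocr->getText( 'page1.png' );
$normalized = $ocr->getText( 'page1.png', true );
```

After both calls, `$ocr->engineRuns` is still 1: the second request reuses the cached original text and only applies the replacement step.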

	 * Normalize result by replacing some historic characters
	 */
	public function normalize() {
		$this->text = strtr( $this->text, [

stweil (Contributor, Author) commented Sep 22, 2023

Some of these replacements (and more) could be done with `Normalizer::normalize( $this->text, Normalizer::FORM_KC )`, but that causes a runtime conflict with the Symfony class, which is also called `Normalizer`.

samwilson (Member) commented

> but that causes a runtime conflict with the Symfony class which is also called Normalizer.

It should work fine as long as you use `\Normalizer` here, or put `use Normalizer;` at the top, so that the intl extension's class is used. The Symfony class is a polyfill for it, for when the intl extension isn't installed. If you use the former, then don't forget to add that extension to the requirements in composer.json.
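A minimal sketch of the first suggestion, assuming the code lives in a namespaced class (the namespace and class name below are hypothetical). The leading backslash makes PHP resolve `Normalizer` in the global namespace, which is where the intl extension defines it, rather than relative to the current namespace or an imported alias. `Normalizer::normalize()` returns `false` on failure, so the sketch falls back to the original text in that case. Running it requires the intl extension, which Composer lets you declare as `ext-intl` in the `require` section.

```php
<?php
namespace Example\Ocr; // hypothetical namespace, for illustration only

final class OcrText {
	/** @var string */
	private $text;

	public function __construct( string $text ) {
		$this->text = $text;
	}

	public function normalize(): string {
		// Fully qualified name: always the global Normalizer class.
		$normalized = \Normalizer::normalize( $this->text, \Normalizer::FORM_KC );
		// normalize() returns false on failure; keep the original text then.
		return $normalized === false ? $this->text : $normalized;
	}
}

// NFKC folds compatibility characters such as the long s (U+017F)
// and the fi ligature (U+FB01) into their plain equivalents.
$result = ( new OcrText( 'Waſſer ﬁnden' ) )->normalize();
// $result is "Wasser finden"
```

Note that form KC changes more than a hand-written `strtr()` map would, so whether that is desirable depends on which replacements a given project actually wants.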

samwilson (Member) left a comment

This looks like a good addition, but note that there have been various discussions over the years about how to normalize OCR output, and not always with much agreement, mainly because different Wikisources want to do things differently and many already have gadgets in place for doing the exact replacements that they want.

For example, T278443 ("fix issue with lines being formatted incorrectly") and T250185 ("Make Wikisource-OCR handle paragraphs better").

I think there needs to be a way to make this configurable per project, or perhaps to retrieve a config from on-wiki (e.g. a `normalize_config` param could point to a JSON page's URL, where the actual replacement patterns are defined).
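One way the on-wiki configuration idea might look, purely as a sketch: the JSON layout (a top-level `replacements` object mapping source strings to targets) and the `normalizeWithConfig()` helper are assumptions made up for this example, and fetching the JSON from the page URL is left out. Only the step of turning a JSON document into a `strtr()` replacement map is shown.

```php
<?php
/**
 * Hypothetical helper: apply replacement pairs taken from a JSON config.
 * The "replacements" key is an assumed schema, not an existing format.
 */
function normalizeWithConfig( string $text, string $configJson ): string {
	$config = json_decode( $configJson, true );
	if ( !is_array( $config )
		|| !isset( $config['replacements'] )
		|| !is_array( $config['replacements'] )
	) {
		// Invalid or missing config: leave the OCR text untouched.
		return $text;
	}
	$pairs = [];
	foreach ( $config['replacements'] as $from => $to ) {
		$from = (string)$from;
		// Skip empty keys and non-string targets instead of failing.
		if ( $from !== '' && is_string( $to ) ) {
			$pairs[$from] = $to;
		}
	}
	return $pairs === [] ? $text : strtr( $text, $pairs );
}

// Example config as a project might define it on a JSON page.
$json = '{"replacements": {"ſ": "s", "ﬅ": "st"}}';
$out = normalizeWithConfig( 'Waſſer', $json );
// $out is "Wasser"
```

With something along these lines, each Wikisource could keep its own replacement list, and a project that defines no config would get the unmodified OCR text.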

stweil marked this pull request as draft September 23, 2023 08:05
Signed-off-by: Stefan Weil <sw@weilnetz.de>