Add new OCR parameter to normalize the result text by stweil · Pull Request #112 · wikimedia/wikimedia-ocr

stweil · 2023-09-22T16:07:03Z

No description provided.

stweil · 2023-09-22T16:10:43Z

Example: Tesseract OCR with and without normalization.

The normalization works with any OCR engine. Because the cache always stores the original OCR text, it is possible to switch to normalized text without a new OCR run.
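A minimal sketch of the caching idea described above, not the PR's actual code: the raw OCR text is cached and normalization is applied only when the text is read. The names `runOcr` and `getText`, the sample text, and the single-entry replacement table are all hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical sketch: function names and the mapping are illustrative,
// not taken from the PR.

/** Stand-in for an expensive OCR engine call; counts how often it runs. */
function runOcr( string $imageUrl, int &$runs ): string {
	$runs++;
	// Pretend the engine returned text containing long s (U+017F).
	return 'Waſſer';
}

/** Return OCR text, caching the raw engine output and normalizing on read. */
function getText( string $imageUrl, bool $normalize, array &$cache, int &$runs ): string {
	if ( !isset( $cache[$imageUrl] ) ) {
		// Only the original, unnormalized text is stored in the cache.
		$cache[$imageUrl] = runOcr( $imageUrl, $runs );
	}
	$text = $cache[$imageUrl];
	// Assumed example mapping; the PR's real replacement table is not shown here.
	return $normalize ? strtr( $text, [ 'ſ' => 's' ] ) : $text;
}

$cache = [];
$runs = 0;
$raw = getText( 'page1.png', false, $cache, $runs );
$normalized = getText( 'page1.png', true, $cache, $runs );
```

Because normalization happens after the cache lookup, the second call reuses the cached text and the stand-in engine runs only once.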

stweil · 2023-09-22T16:23:13Z

+	 * Normalize result by replacing some historic characters
+	 */
+	public function normalize() {
+		$this->text = strtr( $this->text, [


Some (and more) of these translations could be done with Normalizer::normalize( $this->text, Normalizer::FORM_KC ), but that causes a runtime conflict with the Symfony class which is also called Normalizer.


samwilson

This looks like a good addition, but note that there have been various discussions over the years about how to normalize OCR output, and not always with much agreement. That is mainly because different Wikisources want to do things differently, and many already have gadgets in place for doing the exact replacements that they want.

For example, T278443 ("fix issue with lines being formatted incorrectly") and T250185 ("Make Wikisource-OCR handle paragraphs better").

I think there needs to be a way to make this configurable per-project, or perhaps retrieve a config from on-wiki (e.g. a normalize_config param could point to a JSON page's URL, where the actual replacement patterns are defined).
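One way the per-project idea could look, purely as a sketch: the replacement pairs live in a JSON object and are applied with `strtr`. The JSON layout, the function name `applyReplacements`, and the sample pair are assumptions made for illustration; no such format is defined in this PR, and fetching the JSON from a wiki page is left out.

```php
<?php
// Hypothetical sketch: the JSON layout and function name are assumptions,
// not an existing wikimedia-ocr or on-wiki format.

/**
 * Apply replacements given as a JSON object of "search": "replace" pairs.
 */
function applyReplacements( string $text, string $json ): string {
	$map = json_decode( $json, true );
	if ( !is_array( $map ) ) {
		// Invalid or non-object config: leave the text untouched.
		return $text;
	}
	// Keep only string-to-string pairs with a non-empty search key.
	$pairs = [];
	foreach ( $map as $search => $replace ) {
		if ( is_string( $search ) && $search !== '' && is_string( $replace ) ) {
			$pairs[$search] = $replace;
		}
	}
	return $pairs === [] ? $text : strtr( $text, $pairs );
}

// In practice the JSON would come from the configured page; here it is inline.
$config = '{"ſ": "s"}';
$result = applyReplacements( 'Waſſer', $config );
```

Falling back to the unmodified text on a malformed config keeps a broken on-wiki page from breaking OCR output, which seems a reasonable default for a sketch like this.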

samwilson · 2023-09-23T00:19:24Z

+	 * Normalize result by replacing some historic characters
+	 */
+	public function normalize() {
+		$this->text = strtr( $this->text, [


but that causes a runtime conflict with the Symfony class which is also called Normalizer.

It should work fine as long as you use `\Normalizer` here, or add `use Normalizer;` at the top, so that the intl extension's class is used. The Symfony class is a polyfill for that, for when the intl extension isn't installed. If you use the former, then don't forget to add that extension to the requirements in composer.json.
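A small sketch of the fully qualified call suggested above. It assumes the global `Normalizer` class is available (from the intl extension or a polyfill) and otherwise returns the input unchanged; the helper name `normalizeText` is hypothetical.

```php
<?php
// Sketch only: assumes the global Normalizer class (intl extension or a
// polyfill) is available; otherwise the input is returned unchanged.

function normalizeText( string $text ): string {
	if ( !class_exists( \Normalizer::class ) ) {
		return $text;
	}
	// The leading backslash selects the global class, regardless of any
	// other class named Normalizer imported into the current namespace.
	$result = \Normalizer::normalize( $text, \Normalizer::FORM_KC );
	// normalize() returns false on failure, e.g. for invalid UTF-8 input.
	return $result === false ? $text : $result;
}
```

Form KC applies compatibility decomposition followed by composition, so characters such as the long s (U+017F) or the "fi" ligature (U+FB01) map to their plain equivalents.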


stweil marked this pull request as draft September 23, 2023 08:05

Add new OCR parameter to normalize the result text

744ad1c

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil force-pushed the normalize branch from 4322273 to 744ad1c on October 12, 2023 11:20

stweil wants to merge 1 commit into wikimedia:main from stweil:normalize
