When a paragraph doesn't fit without hyphenation, hyphenate_paragraph splits every token at alphabetic/non-alphabetic boundaries: leading/trailing punctuation, apostrophes and infix hyphens become standalone paragraph boxes. cleanup_paragraph only re-merges the alphabetic segments (that's all hyph_indices covers), so the punctuation stays detached in the page's text boxes.
Reader::text_excerpt joins boxes with a space, so everything built from a selection gets spurious spaces around punctuation:
- highlight/annotation text (also as exported),
- Search on a selection (the query no longer matches the book's own text),
- Define on a selection.
For example, highlighting a sentence from a hyphenated paragraph stores:
“ No , it can ' t be right — there must be a mistake somewhere ,” he thought .
instead of:
“No, it can't be right — there must be a mistake somewhere,” he thought.
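The effect can be reproduced outside the reader with a minimal sketch. It assumes the split is a plain cut wherever the alphabetic/non-alphabetic class changes and that the excerpt is a space-join of the resulting boxes. The names `split_at_alpha_boundaries`, `join_boxes` and `excerpt_after_split` are hypothetical stand-ins, not the actual code in hyphenate_paragraph or Reader::text_excerpt.

```rust
/// Hypothetical stand-in for the per-token split: cut wherever the
/// alphabetic/non-alphabetic class changes, so punctuation runs and
/// letter runs end up in separate pieces.
fn split_at_alpha_boundaries(token: &str) -> Vec<String> {
    let mut parts: Vec<String> = Vec::new();
    let mut prev_alpha: Option<bool> = None;
    for c in token.chars() {
        let alpha = c.is_alphabetic();
        if prev_alpha == Some(alpha) {
            // Same class as the previous character: extend the current piece.
            parts.last_mut().unwrap().push(c);
        } else {
            // Class changed (or first character): start a new piece.
            parts.push(c.to_string());
        }
        prev_alpha = Some(alpha);
    }
    parts
}

/// Hypothetical stand-in for the excerpt step: boxes joined with a space.
fn join_boxes(boxes: &[String]) -> String {
    boxes.join(" ")
}

/// Split every whitespace-separated token, never merge the pieces back,
/// and build the excerpt from the resulting boxes.
fn excerpt_after_split(text: &str) -> String {
    let boxes: Vec<String> = text
        .split_whitespace()
        .flat_map(split_at_alpha_boundaries)
        .collect();
    join_boxes(&boxes)
}
```

Under these assumptions, passing the original sentence to `excerpt_after_split` yields the detached form shown above, including the `,”` pair staying together because both characters are non-alphabetic.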
This affects any language once a paragraph goes through the hyphenation pass, i.e. whenever words have to be split at line ends (justified text in a narrow column, or a large font). In a 1200-page test layout nearly every page with running text was affected.
I'll send a PR shortly: recording the whole token as the merge range in hyphenate_paragraph lets the existing cleanup_paragraph machinery glue the punctuation back; line breaking itself is unaffected.
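The idea behind the fix can be illustrated with a simplified model; this is a sketch under stated assumptions, not the code of the PR. Boxes are plain strings, a merge range is a range of box indices, and the alphabetic segments are not hyphenated any further, so each alphabetic range covers a single box. `MergeScope`, `split_with_ranges` and `merge_ranges` are hypothetical names; only the choice of what gets recorded as a range differs between the two variants.

```rust
use std::ops::Range;

/// Hypothetical split: cut wherever the alphabetic/non-alphabetic class changes.
fn split_at_alpha_boundaries(token: &str) -> Vec<String> {
    let mut parts: Vec<String> = Vec::new();
    let mut prev_alpha: Option<bool> = None;
    for c in token.chars() {
        let alpha = c.is_alphabetic();
        if prev_alpha == Some(alpha) {
            parts.last_mut().unwrap().push(c);
        } else {
            parts.push(c.to_string());
        }
        prev_alpha = Some(alpha);
    }
    parts
}

/// What the split step records as a merge range (hypothetical).
#[derive(Clone, Copy)]
enum MergeScope {
    /// Only the alphabetic segments: punctuation stays detached.
    AlphabeticOnly,
    /// Every box that came from the same token.
    WholeToken,
}

/// Split every token into boxes and record the index ranges to glue back.
fn split_with_ranges(text: &str, scope: MergeScope) -> (Vec<String>, Vec<Range<usize>>) {
    let mut boxes = Vec::new();
    let mut ranges = Vec::new();
    for token in text.split_whitespace() {
        let start = boxes.len();
        let parts = split_at_alpha_boundaries(token);
        match scope {
            MergeScope::WholeToken => ranges.push(start..start + parts.len()),
            MergeScope::AlphabeticOnly => {
                for (i, p) in parts.iter().enumerate() {
                    if p.chars().all(char::is_alphabetic) {
                        ranges.push(start + i..start + i + 1);
                    }
                }
            }
        }
        boxes.extend(parts);
    }
    (boxes, ranges)
}

/// Cleanup step: concatenate each recorded range into one box;
/// boxes outside every range are kept as they are.
fn merge_ranges(boxes: &[String], ranges: &[Range<usize>]) -> Vec<String> {
    let mut out = Vec::new();
    let mut next = 0;
    for r in ranges {
        out.extend(boxes[next..r.start].iter().cloned());
        out.push(boxes[r.clone()].concat());
        next = r.end;
    }
    out.extend(boxes[next..].iter().cloned());
    out
}
```

In this model the merge step is identical in both cases. With `WholeToken` ranges, a space-join of the merged boxes gives back the original tokens; with `AlphabeticOnly` ranges, the punctuation remains in separate boxes.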