Skip to content

Fix text rendering matrix order#426

Open
splitbrain wants to merge 1 commit into
PrinsFrank:mainfrom
cosmocode:fix-text-rendering-matrix-order
Open

Fix text rendering matrix order#426
splitbrain wants to merge 1 commit into
PrinsFrank:mainfrom
cosmocode:fix-text-rendering-matrix-order

Conversation

@splitbrain

Copy link
Copy Markdown
Contributor

This one caused me some headache when working on handling text rotations. Some documents seemed to have their coordinate system flipped. This was caused by a flipped matrix multiplication.

The remaining code used abs() to deal with the resulting negative coordinates, but in some cases that resulted in wrongly ordered words (good example are the footnotes in issue-254). The abs() calls have been removed.

In fact the use of abs() would also break calculations where the mediabox' origin is moved in a way that results glyphs being positioned at negative coordinate. A unit test covers this case.

Now the tricky thing is, that the wrong matrix multiplication accidentally influenced the space detection. Basically it made the set threshold of 0.4em dependent on the page's transformation matrix which was accidentally a pretty good match for the size of spaces in the Google Docs produced documents which use a page scale of 75%. With the multiplication matrix fixed, this was no longer the case and those samples lost all their spaces. You can see that in e9db1f7.

Claude code is convinced that the proper way to detect spaces is to differentiate between real spaces (eg. space characters in a text element) and glyph positioned gaps. With the latter needing a much narrower threshold. This is implemented in e9db1f7 and works fine for the sample docs. TBH I am not 100% sure this is the correct approach, but I can not come up with anything better. The fact is that a simple single threshold does not work with each document.

@splitbrain

Copy link
Copy Markdown
Contributor Author

quick note: I could not shake the feeling that the two different thresholds are a weird hack. I spent most of the day trying to understand what's going on. I believe I am closer to a real space detection solution, but my branch is a bit of a mess right now.

PR #425 and the first commit (e9db1f7) in this branch are correct IMHO. The space detection is too naive and uses weird heuristics. So maybe ignore this part for now.

@PrinsFrank PrinsFrank left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're on a roll! I really appreciate the effort you've put into this!

I think it's fine to lose some spaces in some samples for now, and the fix can be merged without immediately recovering the lost spaces. Doing one thing per PR and also seeing what that affects in the samples makes things easier to understand then doing multiple things in the same PR.

The inverse multiplication is a bug in itself, so let's focus on that first and get that merged first!

Any changes in subsequent PRs can than be compared and effects of those can be verified independently. I'll make sure not to release a new version of this library until the samples are not too far regressed from the previous release so we don't break peoples' text extraction!

Comment thread src/Document/ContentStream/Command/Operator/State/TextShowingOperator.php Outdated
@splitbrain splitbrain force-pushed the fix-text-rendering-matrix-order branch from 5e64326 to bdb6b8a Compare June 24, 2026 18:55
getPositionedTextElement() multiplied the CTM by the text matrix (CTM x Tm)
rather than the text matrix by the CTM (Tm x CTM) -- the order every other
operator in the codebase already uses. The two agree whenever the CTM carries
no scale, rotation or reflection, so most pages were unaffected; but a page that
installs a scaled or Y-flipped CTM before drawing text (Google Docs and similar
exports) placed glyphs at the negated, out-of-MediaBox position, with advances
measured in unscaled text space. That mis-positioning scrambled reading order
(e.g. the issue-254 footnotes read as gibberish) and distorted glyph spacing.

Multiply text-space by the CTM so positions and advances are in page space at
any scale or rotation.

With positions now in page space, TextOverlapStrategy no longer needs to sort
lines by abs(offsetY) -- a workaround for the negated coordinates this bug
produced. Order by signed offsetY, which is also correct for a page whose
MediaBox origin legitimately sits above its content (all-negative Y), where
abs() would reverse the reading order.
@splitbrain splitbrain force-pushed the fix-text-rendering-matrix-order branch from bdb6b8a to e081cb3 Compare June 24, 2026 19:03
@splitbrain

Copy link
Copy Markdown
Contributor Author

Good call. I removed the space fixing commit. Regarding space detection see #429

@splitbrain splitbrain requested a review from PrinsFrank June 24, 2026 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants