Fix text rendering matrix order by splitbrain · Pull Request #426 · PrinsFrank/pdfparser

splitbrain · 2026-06-23T22:31:31Z

This one caused me some headache when working on handling text rotations. Some documents seemed to have their coordinate system flipped. This was caused by a flipped matrix multiplication.

The remaining code used abs() to deal with the resulting negative coordinates, but in some cases that resulted in wrongly ordered words (good example are the footnotes in issue-254). The abs() calls have been removed.

In fact the use of abs() would also break calculations where the mediabox' origin is moved in a way that results glyphs being positioned at negative coordinate. A unit test covers this case.

Now the tricky thing is, that the wrong matrix multiplication accidentally influenced the space detection. Basically it made the set threshold of 0.4em dependent on the page's transformation matrix which was accidentally a pretty good match for the size of spaces in the Google Docs produced documents which use a page scale of 75%. With the multiplication matrix fixed, this was no longer the case and those samples lost all their spaces. You can see that in e9db1f7.

Claude code is convinced that the proper way to detect spaces is to differentiate between real spaces (eg. space characters in a text element) and glyph positioned gaps. With the latter needing a much narrower threshold. This is implemented in e9db1f7 and works fine for the sample docs. TBH I am not 100% sure this is the correct approach, but I can not come up with anything better. The fact is that a simple single threshold does not work with each document.

splitbrain · 2026-06-24T17:04:57Z

quick note: I could not shake the feeling that the two different thresholds are a weird hack. I spent most of the day trying to understand what's going on. I believe I am closer to a real space detection solution, but my branch is a bit of a mess right now.

PR #425 and the first commit (e9db1f7) in this branch are correct IMHO. The space detection is too naive and uses weird heuristics. So maybe ignore this part for now.

PrinsFrank

You're on a roll! I really appreciate the effort you've put into this!

I think it's fine to lose some spaces in some samples for now, and the fix can be merged without immediately recovering the lost spaces. Doing one thing per PR and also seeing what that affects in the samples makes things easier to understand then doing multiple things in the same PR.

The inverse multiplication is a bug in itself, so let's focus on that first and get that merged first!

Any changes in subsequent PRs can than be compared and effects of those can be verified independently. I'll make sure not to release a new version of this library until the samples are not too far regressed from the previous release so we don't break peoples' text extraction!

getPositionedTextElement() multiplied the CTM by the text matrix (CTM x Tm) rather than the text matrix by the CTM (Tm x CTM) -- the order every other operator in the codebase already uses. The two agree whenever the CTM carries no scale, rotation or reflection, so most pages were unaffected; but a page that installs a scaled or Y-flipped CTM before drawing text (Google Docs and similar exports) placed glyphs at the negated, out-of-MediaBox position, with advances measured in unscaled text space. That mis-positioning scrambled reading order (e.g. the issue-254 footnotes read as gibberish) and distorted glyph spacing. Multiply text-space by the CTM so positions and advances are in page space at any scale or rotation. With positions now in page space, TextOverlapStrategy no longer needs to sort lines by abs(offsetY) -- a workaround for the negated coordinates this bug produced. Order by signed offsetY, which is also correct for a page whose MediaBox origin legitimately sits above its content (all-negative Y), where abs() would reverse the reading order.

splitbrain · 2026-06-24T19:05:30Z

Good call. I removed the space fixing commit. Regarding space detection see #429

PrinsFrank reviewed Jun 24, 2026

View reviewed changes

Comment thread src/Document/ContentStream/Command/Operator/State/TextShowingOperator.php Outdated

splitbrain force-pushed the fix-text-rendering-matrix-order branch from 5e64326 to bdb6b8a Compare June 24, 2026 18:55

splitbrain force-pushed the fix-text-rendering-matrix-order branch from bdb6b8a to e081cb3 Compare June 24, 2026 19:03

splitbrain requested a review from PrinsFrank June 24, 2026 19:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix text rendering matrix order#426

Fix text rendering matrix order#426
splitbrain wants to merge 1 commit into
PrinsFrank:mainfrom
cosmocode:fix-text-rendering-matrix-order

splitbrain commented Jun 23, 2026

Uh oh!

splitbrain commented Jun 24, 2026

Uh oh!

PrinsFrank left a comment

Uh oh!

Uh oh!

splitbrain commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

splitbrain commented Jun 23, 2026

Uh oh!

splitbrain commented Jun 24, 2026

Uh oh!

PrinsFrank left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

splitbrain commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants