Fix text rendering matrix order#426
Conversation
|
quick note: I could not shake the feeling that the two different thresholds are a weird hack. I spent most of the day trying to understand what's going on. I believe I am closer to a real space detection solution, but my branch is a bit of a mess right now. PR #425 and the first commit (e9db1f7) in this branch are correct IMHO. The space detection is too naive and uses weird heuristics. So maybe ignore this part for now. |
PrinsFrank
left a comment
There was a problem hiding this comment.
You're on a roll! I really appreciate the effort you've put into this!
I think it's fine to lose some spaces in some samples for now, and the fix can be merged without immediately recovering the lost spaces. Doing one thing per PR and also seeing what that affects in the samples makes things easier to understand then doing multiple things in the same PR.
The inverse multiplication is a bug in itself, so let's focus on that first and get that merged first!
Any changes in subsequent PRs can than be compared and effects of those can be verified independently. I'll make sure not to release a new version of this library until the samples are not too far regressed from the previous release so we don't break peoples' text extraction!
5e64326 to
bdb6b8a
Compare
getPositionedTextElement() multiplied the CTM by the text matrix (CTM x Tm) rather than the text matrix by the CTM (Tm x CTM) -- the order every other operator in the codebase already uses. The two agree whenever the CTM carries no scale, rotation or reflection, so most pages were unaffected; but a page that installs a scaled or Y-flipped CTM before drawing text (Google Docs and similar exports) placed glyphs at the negated, out-of-MediaBox position, with advances measured in unscaled text space. That mis-positioning scrambled reading order (e.g. the issue-254 footnotes read as gibberish) and distorted glyph spacing. Multiply text-space by the CTM so positions and advances are in page space at any scale or rotation. With positions now in page space, TextOverlapStrategy no longer needs to sort lines by abs(offsetY) -- a workaround for the negated coordinates this bug produced. Order by signed offsetY, which is also correct for a page whose MediaBox origin legitimately sits above its content (all-negative Y), where abs() would reverse the reading order.
bdb6b8a to
e081cb3
Compare
|
Good call. I removed the space fixing commit. Regarding space detection see #429 |
This one caused me some headache when working on handling text rotations. Some documents seemed to have their coordinate system flipped. This was caused by a flipped matrix multiplication.
The remaining code used
abs()to deal with the resulting negative coordinates, but in some cases that resulted in wrongly ordered words (good example are the footnotes in issue-254). The abs() calls have been removed.In fact the use of
abs()would also break calculations where the mediabox' origin is moved in a way that results glyphs being positioned at negative coordinate. A unit test covers this case.Now the tricky thing is, that the wrong matrix multiplication accidentally influenced the space detection. Basically it made the set threshold of 0.4em dependent on the page's transformation matrix which was accidentally a pretty good match for the size of spaces in the Google Docs produced documents which use a page scale of 75%. With the multiplication matrix fixed, this was no longer the case and those samples lost all their spaces. You can see that in e9db1f7.
Claude code is convinced that the proper way to detect spaces is to differentiate between real spaces (eg. space characters in a text element) and glyph positioned gaps. With the latter needing a much narrower threshold. This is implemented in e9db1f7 and works fine for the sample docs. TBH I am not 100% sure this is the correct approach, but I can not come up with anything better. The fact is that a simple single threshold does not work with each document.