wcag/ChunkParser: scale Type3 glyph advances by /FontMatrix, not hardcoded 1/1000 #732
…coded 1/1000

PDFont.getWidth(code) returns the raw /Widths value in the font's glyph space. For Type3 fonts, glyph space is defined by /FontMatrix, not 1/1000, so the hardcoded /1000 divisor produced advances off by a factor of 1/(1000 * FontMatrix[0]) for any Type3 font whose FontMatrix[0] != 1/1000. That corrupted TextPiece end positions and text-chunk bounding boxes (dropped duplicate glyphs and spurious spaces downstream).

Use the horizontal /FontMatrix component for a PDType3Font (null/length/zero guarded), 1/1000 otherwise. The TJ-array displacement path (obj.getReal()/1000, text-space thousandths and font-independent) is left unchanged.
@augustovillar Since the font matrix should also be used in other places, I created a separate PR that incorporates your code. Thank you for your contribution!
Problem
`ChunkParser#parseString` computes each glyph's text-space advance as `PDFont#getWidth(code)` divided by a hardcoded 1000. `PDFont#getWidth(code)` returns the raw `/Widths` value in the font's glyph space. For most fonts glyph space is 1/1000 of a text-space unit, so dividing by 1000 is correct, but Type3 fonts define their own glyph space via `/FontMatrix`. For a Type3 font whose `/FontMatrix` is e.g. `[.0004883 0 0 -.0004883 0 0]` (= 1/2048), dividing by 1000 instead of by 2048 makes every advance ~2.048× too large, corrupting `TextPiece` end positions and the resulting text-chunk bounding boxes.

Impact
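To put a number on the error, the following minimal sketch redoes the arithmetic for the 1/2048 case. It does not use veraPDF classes; the class name, method names, and the raw width of 1024 are hypothetical and chosen only for illustration.

```java
// Illustration only: plain arithmetic, no veraPDF types involved.
final class Type3AdvanceDemo {

    // Advance as computed with the hardcoded divisor.
    static double hardcodedAdvance(double rawWidth) {
        return rawWidth / 1000.0;
    }

    // Advance scaled by the horizontal /FontMatrix component instead.
    static double fontMatrixAdvance(double rawWidth, double fontMatrixA) {
        return rawWidth * fontMatrixA;
    }

    public static void main(String[] args) {
        double rawWidth = 1024.0;           // hypothetical /Widths entry, in glyph space
        double fontMatrixA = 1.0 / 2048.0;  // about .0004883, as in the example matrix

        double wrong = hardcodedAdvance(rawWidth);
        double right = fontMatrixAdvance(rawWidth, fontMatrixA);

        System.out.println("hardcoded /1000 advance: " + wrong);
        System.out.println("FontMatrix advance:      " + right);
        System.out.println("ratio wrong/right:       " + (wrong / right));
    }
}
```

The ratio is 2048/1000, independent of the raw width chosen, which is why every glyph in such a font is over-advanced by the same factor.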
Consumers of these metrics mis-assemble text. In opendataloader-pdf (which bundles this module), real-world Type3 PDFs drop a duplicate adjacent glyph and gain a spurious space: `1.66E-2` is extracted as `1. 6E-2`, and `Programme` as `Progra me`. Downstream report: opendataloader-project/opendataloader-pdf#578
This follows up #722 (which refactored this method but kept the hardcoded `/ 1000`); it is in the same spirit as veraPDF-parser #511 (CFF CIDFontMatrix).

Fix
Add a `getGlyphWidthToTextSpaceScale(PDFont)` helper that returns the horizontal `/FontMatrix` component (`fontMatrix[0]`) for a `PDType3Font` (null/length/zero-guarded) and the unchanged `1/1000` for every other font. The factor is hoisted out of the per-code loop. The `TJ`-array displacement path (`obj.getReal() / 1000`, which is in text-space thousandths and font-independent) is intentionally left unchanged.

Verification
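The scale selection described in the fix can also be exercised in isolation. The sketch below is not the PR's code: it is a hypothetical stand-in that takes a boolean and a plain `double[]` instead of a `PDFont`, so it runs without veraPDF on the classpath. The exact form of the length check is an assumption, since the description only says "null/length/zero-guarded".

```java
// Hypothetical stand-in for the helper described in the fix; not veraPDF code.
final class GlyphScaleSketch {

    // Default glyph-space-to-text-space scale for non-Type3 fonts.
    static final double DEFAULT_SCALE = 1.0 / 1000.0;

    // isType3 stands in for an `instanceof PDType3Font` test on the real font object.
    static double glyphWidthToTextSpaceScale(boolean isType3, double[] fontMatrix) {
        if (isType3
                && fontMatrix != null      // null guard
                && fontMatrix.length > 0   // length guard (assumed form)
                && fontMatrix[0] != 0.0) { // zero guard: avoid collapsing every advance
            return fontMatrix[0];
        }
        return DEFAULT_SCALE;
    }

    public static void main(String[] args) {
        double[] type3Matrix = {0.0004883, 0, 0, -0.0004883, 0, 0};

        // Type3 font with a usable matrix: the horizontal component is used.
        System.out.println(glyphWidthToTextSpaceScale(true, type3Matrix));
        // Type3 font with a missing, empty, or degenerate matrix: default scale.
        System.out.println(glyphWidthToTextSpaceScale(true, null));
        System.out.println(glyphWidthToTextSpaceScale(true, new double[0]));
        System.out.println(glyphWidthToTextSpaceScale(true, new double[] {0, 0, 0, 0, 0, 0}));
        // Any other font type: default scale, whatever matrix is passed.
        System.out.println(glyphWidthToTextSpaceScale(false, type3Matrix));
    }
}
```

Falling back to the default scale when the matrix is unusable keeps the pre-patch behaviour for malformed fonts rather than introducing a new failure mode.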
- Compiles on the `integration` branch (`mvn -pl wcag-validation -am compile`, BUILD SUCCESS).
- Tested with `wcag-validation` 1.31.65 on a public real-world Type3 PDF (EPD epd28600): before → `1. 6E-2`, 34 spacing artifacts; after → `1.66E-2`, 0 artifacts. Same toolchain, only this patch toggled.
- The `instanceof PDType3Font` guard leaves the default path untouched, and previously correct values are unchanged.

Summary by CodeRabbit