space detection approach #429

splitbrain · 2026-06-24T18:51:02Z


Detecting where to put spaces is one of the hardest parts of extracting text from a PDF. The current version of the library (main) works okay, but it has some bugs, and parts of it only work by accident on the sample files and will most probably fail on other documents (see #425 and #426). The library also currently fails on rotated text - a feature I am working on, and the work on it is what surfaces the bugs mentioned above.

The more I look into the issues, the clearer it becomes that the current approach to detecting spaces from gaps between glyphs is insufficient.

main currently relies on a matrix multiplication bug that accidentally ties a fixed space threshold of 0.4em to the page scale.

The code in ContentStream::getText() is roughly:

$gap = $next->absoluteMatrix->offsetX
     - $prev->absoluteMatrix->offsetX
     - $advanceOfPrev;

// a space is inserted when the gap clears the threshold:
$isSpace = $gap >= $fontSize * $scaleX * 0.40;   // ← the 0.40 threshold, scaled by the page matrix

Spaces should not depend on the page scale but on the font size alone. A space is usually about 0.278em wide - roughly a constant fraction of the font size - and space detection can use a threshold a little below that (about 0.25em) to read a gap as a space.
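A minimal sketch of such a check, to make the point concrete. The function name isSpaceGap() and its signature are made up for illustration and are not part of the library; both values are assumed to be in the same unscaled text space units.

```php
<?php

/**
 * Hypothetical helper, not library API: decides whether a gap reads as a space.
 * $gap and $fontSize are assumed to be in the same (unscaled) text space units.
 */
function isSpaceGap(float $gap, float $fontSize, float $thresholdEm = 0.25): bool
{
    // The threshold is a fraction of the font size only; no page scale involved.
    return $gap >= $fontSize * $thresholdEm;
}

// At 12pt, a typical 0.278em space is about 3.3 units wide and clears the 3.0 threshold.
var_dump(isSpaceGap(12 * 0.278, 12.0)); // bool(true)

// A small kerning-sized gap of 0.1em (1.2 units) does not.
var_dump(isSpaceGap(12 * 0.1, 12.0));   // bool(false)
```

Keeping the threshold as a parameter makes it easy to experiment with values between 0.2em and 0.278em on real documents.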

With the multiplication bug fixed (first commit in #426) it becomes apparent that the current space detection does not work correctly. After some investigation the reason is two-fold:

- it works on glyph widths only and does not take kerning into account
- it measures the previous element with the font of the next element - a simple typo that can throw off widths when fonts change

In main that advance is computed roughly like:

$advanceOfPrev = $next->getFont($document, $page)   // ← the NEXT element's font ...
    ->getWidthForChars($prev->getCodePoints());     // ← ... measuring the PREVIOUS element's glyphs, and the kerning numbers are dropped
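To illustrate how much the font mix-up can matter, here is a small standalone sketch. The width tables, the widthOf() helper and the positions are all made up for the example; none of it is library code.

```php
<?php

// Made-up glyph widths in 1/1000 em, for illustration only.
$narrow = ['i' => 250, 'l' => 250];
$wide   = ['i' => 600, 'l' => 600];

/** Hypothetical helper: width of a string in text space units. */
function widthOf(string $text, array $widths, float $fontSize): float
{
    $units = 0;
    foreach (str_split($text) as $char) {
        $units += $widths[$char] ?? 0;
    }
    return $units / 1000 * $fontSize;
}

// The previous run "ill" is set in the narrow font at x=0,
// the next run starts at x=10 and uses the wide font.
$prevX = 0.0;
$nextX = 10.0;
$fontSize = 10.0;

$gapWithOwnFont  = $nextX - $prevX - widthOf('ill', $narrow, $fontSize); // 10 - 7.5
$gapWithNextFont = $nextX - $prevX - widthOf('ill', $wide, $fontSize);   // 10 - 18

printf("own font: %.1f, next font: %.1f\n", $gapWithOwnFont, $gapWithNextFont);
```

With the run's own font the gap is 2.5 units, which at 10pt is exactly 0.25em and would read as a space. Measured with the next run's font the gap turns negative and the space is lost.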

The solution is to properly calculate the "cursor advance": how far does the virtual cursor advance after drawing a glyph and what's the distance to the next glyph? That is the real gap to measure.
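As a rough sketch of what that advance looks like: for horizontal text the PDF spec gives the displacement per glyph as tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th. The function below applies that formula to one TJ array. advanceOfRun() and the width table are hypothetical, and the sketch assumes single-byte codes where the character is its own code, so word spacing applies to ' ' only.

```php
<?php

/**
 * Hypothetical helper, not library API: horizontal advance of one TJ array
 * in unscaled text space units.
 *
 * @param array $items  TJ operands: strings are shown, numbers are adjustments
 * @param array $widths glyph widths in 1/1000 em, keyed by character
 */
function advanceOfRun(
    array $items,
    array $widths,
    float $fontSize,               // Tfs
    float $charSpacing = 0.0,      // Tc
    float $wordSpacing = 0.0,      // Tw
    float $horizontalScale = 1.0   // Th, 1.0 = 100%
): float {
    $advance = 0.0;
    foreach ($items as $item) {
        if (!is_string($item)) {
            // A number in a TJ array moves the cursor back by item/1000 em.
            $advance -= $item / 1000 * $fontSize * $horizontalScale;
            continue;
        }
        if ($item === '') {
            continue;
        }
        foreach (str_split($item) as $char) {
            $width = ($widths[$char] ?? 0) / 1000 * $fontSize;
            $spacing = $charSpacing + ($char === ' ' ? $wordSpacing : 0.0);
            $advance += ($width + $spacing) * $horizontalScale;
        }
    }
    return $advance;
}

// Made-up widths for illustration only.
$widths = ['A' => 700, 'V' => 700];

// [(A) 100 (V)] TJ at 10pt: 7.0 + 7.0 minus 1.0 of kerning.
printf("%.2f\n", advanceOfRun(['A', 100, 'V'], $widths, 10.0));
```

The gap to measure is then the start of the next run minus the start of the previous run minus this advance.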

To implement this there are two approaches. They differ mainly in where the advance is computed: at the moment we compare two elements, or up front while we walk the content stream and build the elements.

Option A: "reconstruct" the advance

This keeps the current architecture mostly intact. A dedicated method computes the advance for a run from its own font, spacing and kerning numbers, and the gap check in ContentStream::getText() calls it:

// PositionedTextElement::getAdvanceWidth() is new; getText() calls it per comparison
$gap = $next->absoluteMatrix->offsetX
     - $prev->absoluteMatrix->offsetX
     - $prev->getAdvanceWidth($document, $page);

The element walk in getPositionedTextElements() is untouched. The text matrix is still not advanced, and fonts are only resolved here, at the comparison. It also leaves one edge case unfixed: when two Tj/TJ operators follow each other with no repositioning in between, both elements report the same X position. For space detection this does not matter, but the position is still wrong and might cause bugs further down the line.

Option B: advance the text matrix

This is closer to the PDF spec, which separates a line matrix from a text matrix and says the text matrix should "advance" as each glyph is drawn. The work moves into the element walk in ContentStream::getPositionedTextElements(): as we walk, the text matrix advances after each run, and each element stores where the matrix ended up after it (a new endMatrix). The gap check in getText() is then a plain subtraction:

// the cursor was advanced during the walk, so the end position is already stored
$gap = $next->absoluteMatrix->offsetX
     - $prev->endMatrix->offsetX;

Adjusting the matrix this way would fix the Tj/TJ issue. It does however mean that getPositionedTextElements() now needs the font there (so it takes $document/$page). This breaks the clean separation of getPositionedTextElements() only being responsible for layout independent of fonts - it also means unit tests on it no longer work without also mocking a font.
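A stripped-down sketch of the idea, reduced to the X axis. walkRuns() and the array shapes are hypothetical and only illustrate the walk; the advances are passed in precomputed here, while in the real walk they would have to come from the font, which is exactly the coupling described above.

```php
<?php

/**
 * Hypothetical sketch, not library code: walk a list of runs, advance a
 * cursor after each one, and record where every run starts and ends.
 *
 * Each run is ['x' => ?float, 'advance' => float]. 'x' is a new absolute
 * position (e.g. after Td/Tm) or null when the run follows the previous
 * one without repositioning.
 */
function walkRuns(array $runs): array
{
    $cursorX = 0.0;
    $elements = [];
    foreach ($runs as $run) {
        if ($run['x'] !== null) {
            $cursorX = $run['x']; // explicit repositioning
        }
        $startX = $cursorX;
        $cursorX += $run['advance']; // the cursor moves on after drawing
        $elements[] = ['startX' => $startX, 'endX' => $cursorX];
    }
    return $elements;
}

$elements = walkRuns([
    ['x' => 100.0, 'advance' => 20.0], // first Tj
    ['x' => null,  'advance' => 25.0], // second Tj directly after, no repositioning
    ['x' => 150.0, 'advance' => 10.0], // repositioned run
]);

foreach ($elements as $i => $element) {
    if ($i === 0) {
        continue;
    }
    $gap = $element['startX'] - $elements[$i - 1]['endX'];
    printf("gap before run %d: %.1f\n", $i, $gap);
}
```

The second run starts at 120 instead of reporting the same X as the first, and the gaps come out as 0.0 and 5.0 from a plain subtraction.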

I feel like option B is the "more correct" version (and incidentally is also what most other libraries seem to do). But it is also a bit more invasive and a bigger architecture change.

I think this also raises the question of what the goals of this library are. If we only plan to extract somewhat sensible text (e.g. for search indexing), then option A is probably good enough. If there are long-term goals of doing more layout analysis, option B is probably the way to go.

@PrinsFrank what do you think?
