space detection approach #429
splitbrain started this conversation in General
Detecting where to put spaces is one of the hardest parts of extracting text from a PDF. The current version of the library (`main`) works okay, but it has some bugs, and some of it only works accidentally on the sample files and will most probably fail on other documents (see #425 and #426). The library currently also fails with rotated text, a feature I am working on but which is surfacing the bugs mentioned above.

The more I look into the issues, the clearer it becomes that the current approach to detecting spaces from gaps between glyphs is insufficient. `main` currently relies on a matrix multiplication bug that accidentally ties a fixed space threshold of 0.4em to the page scale.

The code in `ContentStream::getText()` is roughly:

Spaces should not depend on the page scale but on the font size alone. A space is usually about 0.278em wide, roughly a constant fraction of the font size, and space detection can use a threshold a little below that (about 0.25em) to read a gap as a space.
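As a minimal sketch of a threshold that depends on the font size alone: the helper name `isSpaceGap()` and its parameters are hypothetical and not part of the library, and it assumes the gap and the font size are expressed in the same units.

```php
<?php
declare(strict_types=1);

// Hypothetical helper, not part of the library: decide whether a horizontal
// gap between two glyph runs should be read as a space.
// Assumption: $gap and $fontSize are in the same units.
function isSpaceGap(float $gap, float $fontSize, float $thresholdEm = 0.25): bool
{
    if ($fontSize <= 0.0) {
        return false; // no meaningful em size to compare against
    }

    // The threshold is a fraction of the font size, never of the page scale.
    return $gap >= $thresholdEm * $fontSize;
}
```

With a 12pt font the threshold is 3 units, so a gap of 3 counts as a space and a gap of 1 does not. The same gap of 3 at 24pt is below the 6 unit threshold and is treated as ordinary glyph spacing.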
With the multiplication bug fixed (first commit in #426) it becomes apparent that the current space detection does not work correctly. After some investigation the reason is two-fold:
In `main` that advance is computed roughly like:

The solution is to properly calculate the "cursor advance": how far does the virtual cursor advance after drawing a glyph, and what is the distance to the next glyph? That is the real gap to measure.
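The cursor advance can be sketched from the glyph displacement rule in the PDF spec, where each glyph moves the cursor by its width scaled by the font size, plus character spacing, plus word spacing for space characters, minus any `TJ` adjustment, all multiplied by the horizontal scale. The function names `runAdvance()` and `gapBetween()` and their parameters are assumptions for illustration, not the library's API.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: horizontal advance of one run of glyphs.
 *
 * @param list<float> $glyphWidths  widths in glyph space units (1/1000 em)
 * @param float       $tjAdjustment sum of TJ numbers in the run (1/1000 em)
 * @param int         $spaceCount   single-byte space characters (code 32) in the run
 */
function runAdvance(
    array $glyphWidths,
    float $fontSize,
    float $charSpacing = 0.0,
    float $wordSpacing = 0.0,
    int $spaceCount = 0,
    float $tjAdjustment = 0.0,
    float $horizontalScale = 1.0
): float {
    $advance = 0.0;
    foreach ($glyphWidths as $width) {
        // Glyph width scaled to the font size, plus per-glyph character spacing.
        $advance += ($width / 1000.0) * $fontSize + $charSpacing;
    }
    $advance += $spaceCount * $wordSpacing;
    // TJ adjustments are subtracted, scaled by the font size like widths.
    $advance -= ($tjAdjustment / 1000.0) * $fontSize;

    return $advance * $horizontalScale;
}

// The real gap: where the next run starts minus where the cursor ended.
function gapBetween(float $previousStartX, float $previousAdvance, float $nextStartX): float
{
    return $nextStartX - ($previousStartX + $previousAdvance);
}
```

For example, two glyphs of width 500 at font size 10 advance the cursor by 10 units. If that run starts at X = 100 and the next run starts at X = 113, the gap is 3 units, which is above a 0.25em threshold of 2.5.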
To implement this there are two approaches. They differ mainly in where the advance is computed: at the moment we compare two elements, or up front while we walk the content stream and build the elements.
Option A: "reconstruct" the advance
This keeps the current architecture mostly intact. A dedicated method computes the advance for a run from its own font, spacing and kerning numbers, and the gap check in `ContentStream::getText()` calls it:

The element walk in `getPositionedTextElements()` is untouched. The text matrix is still not advanced, and fonts are only resolved here, at the comparison. It also leaves one edge case unfixed: when two `Tj`/`TJ` operators follow each other with no repositioning in between, both report the same X position. For space detection this does not matter, but the position is still wrong and might result in bugs further on.

Option B: advance the text matrix
This is closer to the PDF spec, which separates a line matrix from a text matrix and says the text matrix should "advance" with each glyph drawn. The work moves into the element walk in `ContentStream::getPositionedTextElements()`: as we walk, the text matrix advances after each run and we store where it ended on the element (a new `endMatrix`). The gap check in `getText()` is then a plain subtraction:

Adjusting the matrix this way would fix the `Tj`/`TJ` issue. It does however mean that `getPositionedTextElements()` now needs the font there (so it takes `$document`/`$page`). This breaks the clean separation of `getPositionedTextElements()` being responsible only for layout, independent of fonts. It also means unit tests on it no longer work without also mocking a font.

I feel like option B is the "more correct" version (and incidentally is also what most other libraries seem to do). But it is also a bit more invasive and a bigger architecture change.
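To make the difference concrete, here is a rough sketch of the Option B walk, reduced to the X coordinate only. Everything in it is hypothetical: `walkAdvancingCursor()`, the run array shape and the `moveTo` key are illustration names, and a real implementation would advance a full text matrix and look up glyph widths from the font instead of taking precomputed advances.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical walk over the runs of one line, tracking only X.
 * 'moveTo' is the new X when an operator repositions the cursor before
 * the run, or null when the run directly follows the previous one.
 *
 * @param list<array{advance: float, moveTo: ?float}> $runs
 * @return list<array{startX: float, endX: float}>
 */
function walkAdvancingCursor(array $runs, float $lineStartX): array
{
    $cursorX = $lineStartX;
    $elements = [];
    foreach ($runs as $run) {
        if ($run['moveTo'] !== null) {
            $cursorX = $run['moveTo']; // explicit repositioning
        }
        $startX = $cursorX;
        $cursorX += $run['advance']; // the cursor advances with the run
        // Store where the run ended, standing in for the endMatrix idea.
        $elements[] = ['startX' => $startX, 'endX' => $cursorX];
    }

    return $elements;
}

$elements = walkAdvancingCursor([
    ['advance' => 10.0, 'moveTo' => null],
    ['advance' => 8.0, 'moveTo' => null],  // follows directly, no repositioning
    ['advance' => 5.0, 'moveTo' => 121.0],
], 100.0);

// The gap check becomes a plain subtraction between neighbours.
$gap = $elements[2]['startX'] - $elements[1]['endX'];
```

Here the second run starts at X = 110 instead of repeating the first run's X = 100, which is the `Tj`/`TJ` case Option A leaves unfixed, and the gap before the third run is 121 - 118 = 3. The cost is visible too: the advances have to be known during the walk, which is where the font dependency comes from.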
I think this also raises the question of what the goals of this library are. If we only plan to extract somewhat sensible text (e.g. for search indexing), then Option A is probably good enough. If there are long-term goals of more layout analysis, Option B is probably the way to go.
@PrinsFrank what do you think?