[textractprettyprinter] does not return the last row of a table when using get_text_from_layout_json #430

@abouberthe

I used the following code to extract information from documents, including text and tables:

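from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

# byte_img holds the image bytes; textract_client is a boto3 Textract client
textract_json = call_textract(
    input_document=byte_img,
    features=[Textract_Features.TABLES, Textract_Features.LAYOUT, Textract_Features.FORMS],
    boto3_textract_client=textract_client,
)

# The result is indexed by page number
layout = get_text_from_layout_json(textract_json, exclude_figure_text=False)

# Use the text of page 1, or an empty string if that page is missing
full_text = layout.get(1, '')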

However, when I test it on the attached document (document_anonyme_1.jpg), the resulting text output (document_anonymise_1.txt) is missing the last row of the table: the row that contains "COPYRIGHT EOT ..." does not appear.
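One possible way to narrow this down is to check whether the missing row is present in the raw Textract response at all, or only dropped when the layout text is built. The sketch below is not part of any of the Textract helper libraries: `table_rows_from_textract_json` and `rows_missing_from_text` are hypothetical helper names, and the code assumes only the standard Textract block structure (TABLE blocks whose CHILD relationships point to CELL blocks, which in turn point to WORD blocks). Merged cells and selection elements are ignored for brevity.

```python
from collections import defaultdict


def table_rows_from_textract_json(textract_json):
    """Return the cell texts of every table row in a raw Textract response.

    Hypothetical helper: walks TABLE -> CELL -> WORD via CHILD relationships.
    """
    blocks = {b["Id"]: b for b in textract_json.get("Blocks", [])}

    def child_ids(block):
        # Yield the ids of all CHILD blocks; Relationships may be absent or None.
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    rows = []
    for block in blocks.values():
        if block["BlockType"] != "TABLE":
            continue
        cells_by_row = defaultdict(list)
        for cell_id in child_ids(block):
            cell = blocks.get(cell_id)
            if cell is None or cell["BlockType"] != "CELL":
                continue
            words = [
                blocks[word_id]["Text"]
                for word_id in child_ids(cell)
                if word_id in blocks and blocks[word_id]["BlockType"] == "WORD"
            ]
            cells_by_row[cell["RowIndex"]].append(
                (cell["ColumnIndex"], " ".join(words))
            )
        # Emit rows top to bottom, with cells ordered left to right.
        for row_index in sorted(cells_by_row):
            rows.append([text for _, text in sorted(cells_by_row[row_index])])
    return rows


def rows_missing_from_text(rows, full_text):
    """Return rows that have a non-empty cell whose text is absent from full_text."""
    return [
        row for row in rows
        if any(cell and cell not in full_text for cell in row)
    ]
```

Calling `rows_missing_from_text(table_rows_from_textract_json(textract_json), full_text)` with the variables from the snippet above would list the table rows that Textract detected but that do not show up in the layout text. The comparison is a plain substring check, so a row can be reported as missing simply because the layout output spaces or wraps the cell text differently; treat the result as a hint, not proof.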

Could you please help me resolve this issue?

For reference, I am using the following versions of the relevant packages:

amazon-textract-caller: 0.2.4

amazon-textract-prettyprinter: 0.1.10

amazon-textract-response-parser: 0.1.48

amazon-textract-textractor: 1.9.2

Image

document_anonymise_1.txt
