Releases: PrinsFrank/pdfparser
v3.1.0 Performance improvements & several text extraction / dictionary parsing fixes
What's Changed
- Add missing value type for dictionaryKey Kids restoring all dictionaryKey having value types by @PrinsFrank in #398
- Reduce function calls in TextOverlapStrategy to improve performance by @PrinsFrank in #400
- Default to inMemoryStream with smaller temporary content to reduce filesystem calls by @PrinsFrank in #401
- Optimize visited node tracking in textOverlapStrategy by @PrinsFrank in #402
- Fall back to Identity-H decoding when font has no CMap and Encoding set by @PrinsFrank in #403
- Accept leading zeros in integerValue by @PrinsFrank in #404
- Remove unnecessary stream wrapping in CompressedObject causing extra memory consumption by @PrinsFrank in #405
- Handle trailing comments in dictionaries by @PrinsFrank in #406
- Handle comments between key value pairs in dictionaries by @PrinsFrank in #407
- Gracefully handle missing stream lengths by @PrinsFrank in #408
- Strip comments from widths array by @PrinsFrank in #409
- Perms can be a byte string by @PrinsFrank in #410
- Add zizmor workflow and harden github actions by @PrinsFrank in #411
- Bump actions/checkout from 6.0.2 to 6.0.3 by @dependabot[bot] in #412
- Fix dictionary parsing error where nesting content is decreased by 3 levels when encountering ">>>>" by @PrinsFrank in #415
Full Changelog: v3.0.1...v3.1.0
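To illustrate the ">>>>" fix (#415): a run of four closing angle brackets closes exactly two dictionaries, so delimiters have to be consumed in pairs. The Python sketch below is only an illustration and not the library's PHP implementation. The function name `dictionary_depths` is hypothetical, and the sketch assumes the input contains no strings, hex strings or comments.

```python
def dictionary_depths(content: str) -> tuple[int, int]:
    """Return (maximum depth, final depth) of '<<' ... '>>' nesting."""
    depth = 0
    max_depth = 0
    i = 0
    while i < len(content):
        pair = content[i:i + 2]
        if pair == "<<":
            # One dictionary opened: consume both characters at once.
            depth += 1
            max_depth = max(max_depth, depth)
            i += 2
        elif pair == ">>":
            # One dictionary closed: '>>>>' therefore closes exactly two.
            depth -= 1
            i += 2
        else:
            i += 1
    return max_depth, depth
```

A final depth of 0 means every opened dictionary was closed again.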
v3.0.1 Fixes text extraction issues when font has simple encoding with differences, performance improvements
What's Changed
- Optimize LiteralStringEscapeCharacter::getReplacementSet as it uses significant memory/cpu by @PrinsFrank in #394
- TextShowingOperator interacts with transformation matrix resulting in offsetY changes by @PrinsFrank in #396
- Don't prioritize simple encoding when font has differences by @PrinsFrank in #395
Full Changelog: v3.0.0...v3.0.1
v3.0.0 Encrypted document support, major text extraction updates
New Features
- Added support for encrypted PDFs
- New text grouping algorithm: text with majority vertical overlap is considered part of the same line. This fixes subscript/superscript extraction issues.
- Resolved several transformation matrix issues, fixing text extraction and ordering issues
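The grouping idea can be sketched as follows. This is a rough Python illustration, not the library's PHP code. It assumes "majority vertical overlap" means the shared vertical span exceeds half the height of the shorter box. `TextBox`, `overlap_ratio` and `group_into_lines` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    text: str
    y_bottom: float
    y_top: float


def overlap_ratio(a: TextBox, b: TextBox) -> float:
    """Shared vertical span divided by the height of the shorter box."""
    overlap = min(a.y_top, b.y_top) - max(a.y_bottom, b.y_bottom)
    shortest = min(a.y_top - a.y_bottom, b.y_top - b.y_bottom)
    if overlap <= 0 or shortest <= 0:
        return 0.0
    return overlap / shortest


def group_into_lines(boxes: list[TextBox]) -> list[list[TextBox]]:
    """Append each box to the first line whose first box it mostly overlaps."""
    lines: list[list[TextBox]] = []
    for box in boxes:
        for line in lines:
            if overlap_ratio(line[0], box) > 0.5:  # assumed threshold
                line.append(box)
                break
        else:
            lines.append([box])
    return lines
```

With this rule a subscript that sits slightly below the baseline still lands on the same line as its neighbours, while text on the next line starts a new group.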
💖 Sponsorship
If you depend on this package and want to support its maintenance, please consider sponsoring me. I'll continue maintaining and releasing updates regardless, but sponsorships help cover the time it takes to review changes and keep everything accurate.
Other changes
- Fix scale operand not accepting a trailing 0 by @szepeviktor in #303
- Dictionary entry for key Order can be of type ReferenceValueArray by @PrinsFrank in #316
- Value for dictionaryKey AP can be a single dictionary by @PrinsFrank in #319
- Automatically resolve values from subdictionaries when expected value type is not a dictionary by @PrinsFrank in #321
- Simplify type checks XObject by @PrinsFrank in #322
- Implement value retrieval from ancestor nodes in page tree for inheritable properties by @PrinsFrank in #323
- Automatically resolve references in dictionary entries when retrieving values by @PrinsFrank in #325
- feat(rectangle): add width and height helpers by @vitormattos in #317
- Fix invalid section reference for file encryption key calculation by @PrinsFrank in #327
- User password entry length should always be 32 regardless of security handler revision by @PrinsFrank in #328
- Add file encryption key to metadata for samples by @PrinsFrank in #329
- Enable support for encrypted documents by @PrinsFrank in #282
- Add sample with user/owner password by @PrinsFrank in #332
- Add information about debugging file encryption to CONTRIBUTING.md by @PrinsFrank in #333
- Add support for all escape sequences in literal strings by @PrinsFrank in #334
- Support octals with one or two digits (next to support for three) in string literals by @PrinsFrank in #335
- Clean up decoding of string literals and hex strings in EncryptDictionary and use getText instead by @PrinsFrank in #336
- Fix improper handling of hex encoded binary strings in password entries by @PrinsFrank in #337
- Update minimum required PHP version to 8.2 by @PrinsFrank in #338
- Switch from readonly properties to readonly classes wherever possible by @PrinsFrank in #339
- Check file encryption key for samples by @PrinsFrank in #330
- Add upgrade guide for v3 by @PrinsFrank in #340
- Document argument for getValueForKey on dictionary is now required by @PrinsFrank in #341
- Recover userPassword from ownerPassword to also add support for ownerPasswords by @PrinsFrank in #343
- Cache calculated file encryption key on document by @PrinsFrank in #344
- Fix newly discovered PHPStan issue by @PrinsFrank in #346
- Update sponsorship section in README by @PrinsFrank in #345
- Properly parse hex strings by @PrinsFrank in #348
- Decrypt dictionary entries while parsing dictionaries in encrypted documents by @PrinsFrank in #347
- Decrypt content of compressed objects before parsing by @PrinsFrank in #349
- Replace escaped characters in encrypted strings before running decryption by @PrinsFrank in #350
- Check dictionary and page content for encrypted documents by @PrinsFrank in #342
- Add missing PNG predictor algorithms by @PrinsFrank in #351
- Flate decode columns should be multiplied by colors if present by @PrinsFrank in #352
- Ignore "endobj" markers inside streams and search for the marker only after the length given in the stream dictionary, to allow for proper embedded PDF support by @PrinsFrank in #296
- The resource dictionary is now inherited by @PrinsFrank in #353
- Add sample with different font sizes by @PrinsFrank in #354
- Abstract line grouping strategy to make it replaceable by @PrinsFrank in #355
- Fix incorrect matrix multiplication in Move and MoveOffsetLeading operators causing scrambled text by @PrinsFrank in #356
- Apply transformation for NEXT_LINE Text positioning operator by @PrinsFrank in #358
- Add new overlap grouping strategy for text by @PrinsFrank in #357
- Fix initial text state not being set and appended/restored from stack resulting in lost textObjects by @PrinsFrank in #359
- Added sample file for #272 by @k00ni in #273
- Fix issues with operators that interact with both text state and transformation matrix by @PrinsFrank in #360
- Fix incorrect inverse matrix multiplication in graphicsStateOperator by @PrinsFrank in #361
- Handle text extraction with inverted Y-axis by @PrinsFrank in #362
- Use LineFeed as default page separator when extracting text for multiple pages by @PrinsFrank in #363
- Add sample from issue #290 by @PrinsFrank in #364
- Properly support encrypted documents in sample generation by @PrinsFrank in #365
- Move CONTRIBUTING.md to root of project by @PrinsFrank in #366
- FontReference can be any non-whitespace character by @PrinsFrank in #368
- Add benchmark comparison image to README.md by @PrinsFrank in #369
- Don't traverse loop nodes in page trees by @PrinsFrank in #370
- Support PAGE objects without CONTENTS by @PrinsFrank in #372
- Support NameValues in toUnicodeCMap dictionary entries for font objects by @PrinsFrank in #373
- Fix reference value array parsing when number of items is divisible by 3 but items are not references by @PrinsFrank in #374
- Gracefully handle empty streams by @PrinsFrank in #375
- Gracefully handle only newlines between stream markers by @PrinsFrank in #376
- Universal reference value support now that auto resolving of references is implemented by @PrinsFrank in #377
- Handle empty crossReference types by @PrinsFrank in #378
- Add support for ASCIIHexDecode by @PrinsFrank in #380
- Properly handle object content that is not surrounded by newlines by @PrinsFrank in #379
- Encoding can be name values by @PrinsFrank in #381
- Gracefully handle multiple end operators for text objects by @PrinsFrank in #382
- Extend auto resolving of reference value to nameValues and dictionaries by @PrinsFrank in #383
- Update description in composer.json by @PrinsFrank in #384
- Fix parsing of comments in content streams by @PrinsFrank in #385
- Parsing ReferenceArrayValues in ArrayValues should preserve outside brackets by @PrinsFrank in #386
- Fix issues in retrieval of object content for uncompressed objects with content on the same line as start of object marker by @PrinsFrank in #387
- Ignore comments before text objects by @PrinsFrank in #388
- Auto resolve Reference value arrays by @PrinsFrank in #389
- Add sample from issue #215 by @PrinsFrank in #390
- CIDFontWidths can be empty by @PrinsFrank in #391
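For the literal string changes above (#334, #335), the escape rules of the PDF specification can be sketched like this: single-character escapes, octal codes of one to three digits, and a backslash before a line break acting as a line continuation. This is a Python illustration only, not the library's implementation, and `unescape_literal_string` is a hypothetical name.

```python
import re

# Single-character escapes defined for PDF literal strings.
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
           "(": "(", ")": ")", "\\": "\\"}

# A backslash followed by 1-3 octal digits, a line break, or any other character.
_ESCAPE = re.compile(r"\\([0-7]{1,3}|\r\n|\r|\n|.)", re.DOTALL)


def unescape_literal_string(raw: str) -> str:
    def replace(match: re.Match) -> str:
        token = match.group(1)
        if token[0] in "01234567":
            # Octal code; high-order overflow is ignored.
            return chr(int(token, 8) & 0xFF)
        if token in ("\r\n", "\r", "\n"):
            # Backslash at end of line: the string continues on the next line.
            return ""
        # Unknown escapes drop the backslash and keep the character.
        return _SIMPLE.get(token, token)

    return _ESCAPE.sub(replace, raw)
```

Because the octal pattern is greedy up to three digits, `\0053` decodes as the byte 5 followed by the character "3", while `\53` and `\053` both decode to "+".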
New Contributors
- @vitormattos made their first contribut...
v3.0.0-alpha.2
What's Changed
- FontReference can be any non-whitespace character by @PrinsFrank in #368
- Add benchmark comparison image to README.md by @PrinsFrank in #369
- Don't traverse loop nodes in page trees by @PrinsFrank in #370
- Support PAGE objects without CONTENTS by @PrinsFrank in #372
- Support NameValues in toUnicodeCMap dictionary entries for font objects by @PrinsFrank in #373
- Fix reference value array parsing when number of items is divisible by 3 but items are not references by @PrinsFrank in #374
- Gracefully handle empty streams by @PrinsFrank in #375
- Gracefully handle only newlines between stream markers by @PrinsFrank in #376
- Universal reference value support now that auto resolving of references is implemented by @PrinsFrank in #377
- Handle empty crossReference types by @PrinsFrank in #378
- Add support for ASCIIHexDecode by @PrinsFrank in #380
- Properly handle object content that is not surrounded by newlines by @PrinsFrank in #379
- Encoding can be name values by @PrinsFrank in #381
- Gracefully handle multiple end operators for text objects by @PrinsFrank in #382
- Extend auto resolving of reference value to nameValues and dictionaries by @PrinsFrank in #383
Full Changelog: v3.0.0-alpha.1...v3.0.0-alpha.2
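As a reference for the ASCIIHexDecode entry (#380), the filter itself is small: whitespace is skipped, ">" marks the end of data, and an odd number of digits is padded with a trailing 0. The Python sketch below follows those rules from the PDF specification. It is not taken from the library, and `ascii_hex_decode` is a hypothetical name.

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
HEX_DIGITS = b"0123456789abcdefABCDEF"


def ascii_hex_decode(data: bytes) -> bytes:
    digits = bytearray()
    for byte in data:
        if byte == ord(">"):
            break  # end-of-data marker
        if byte in PDF_WHITESPACE:
            continue  # whitespace between digits is ignored
        if byte not in HEX_DIGITS:
            raise ValueError(f"unexpected byte {byte:#04x} in ASCIIHexDecode data")
        digits.append(byte)
    if len(digits) % 2:
        digits.append(ord("0"))  # a lone final digit is treated as if followed by 0
    return bytes.fromhex(digits.decode("ascii"))
```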
v3.0.0-alpha.1 Encrypted document support, major text extraction updates
New Features
- Added support for encrypted PDFs
- New text grouping algorithm: text with majority vertical overlap is considered part of the same line. This fixes subscript/superscript extraction issues.
- Resolved several transformation matrix issues, fixing text extraction and ordering issues
Other changes
- Fix scale operand not accepting a trailing 0 by @szepeviktor in #303
- Dictionary entry for key Order can be of type ReferenceValueArray by @PrinsFrank in #316
- Value for dictionaryKey AP can be a single dictionary by @PrinsFrank in #319
- Automatically resolve values from subdictionaries when expected value type is not a dictionary by @PrinsFrank in #321
- Simplify type checks XObject by @PrinsFrank in #322
- Implement value retrieval from ancestor nodes in page tree for inheritable properties by @PrinsFrank in #323
- Automatically resolve references in dictionary entries when retrieving values by @PrinsFrank in #325
- feat(rectangle): add width and height helpers by @vitormattos in #317
- Fix invalid section reference for file encryption key calculation by @PrinsFrank in #327
- User password entry length should always be 32 regardless of security handler revision by @PrinsFrank in #328
- Add file encryption key to metadata for samples by @PrinsFrank in #329
- Enable support for encrypted documents by @PrinsFrank in #282
- Add sample with user/owner password by @PrinsFrank in #332
- Add information about debugging file encryption to CONTRIBUTING.md by @PrinsFrank in #333
- Add support for all escape sequences in literal strings by @PrinsFrank in #334
- Support octals with one or two digits (next to support for three) in string literals by @PrinsFrank in #335
- Clean up decoding of string literals and hex strings in EncryptDictionary and use getText instead by @PrinsFrank in #336
- Fix improper handling of hex encoded binary strings in password entries by @PrinsFrank in #337
- Update minimum required PHP version to 8.2 by @PrinsFrank in #338
- Switch from readonly properties to readonly classes wherever possible by @PrinsFrank in #339
- Check file encryption key for samples by @PrinsFrank in #330
- Add upgrade guide for v3 by @PrinsFrank in #340
- Document argument for getValueForKey on dictionary is now required by @PrinsFrank in #341
- Recover userPassword from ownerPassword to also add support for ownerPasswords by @PrinsFrank in #343
- Cache calculated file encryption key on document by @PrinsFrank in #344
- Fix newly discovered PHPStan issue by @PrinsFrank in #346
- Update sponsorship section in README by @PrinsFrank in #345
- Properly parse hex strings by @PrinsFrank in #348
- Decrypt dictionary entries while parsing dictionaries in encrypted documents by @PrinsFrank in #347
- Decrypt content of compressed objects before parsing by @PrinsFrank in #349
- Replace escaped characters in encrypted strings before running decryption by @PrinsFrank in #350
- Check dictionary and page content for encrypted documents by @PrinsFrank in #342
- Add missing PNG predictor algorithms by @PrinsFrank in #351
- Flate decode columns should be multiplied by colors if present by @PrinsFrank in #352
- Ignore "endobj" markers inside streams and search for the marker only after the length given in the stream dictionary, to allow for proper embedded PDF support by @PrinsFrank in #296
- The resource dictionary is now inherited by @PrinsFrank in #353
- Add sample with different font sizes by @PrinsFrank in #354
- Abstract line grouping strategy to make it replaceable by @PrinsFrank in #355
- Fix incorrect matrix multiplication in Move and MoveOffsetLeading operators causing scrambled text by @PrinsFrank in #356
- Apply transformation for NEXT_LINE Text positioning operator by @PrinsFrank in #358
- Add new overlap grouping strategy for text by @PrinsFrank in #357
- Fix initial text state not being set and appended/restored from stack resulting in lost textObjects by @PrinsFrank in #359
- Added sample file for #272 by @k00ni in #273
- Fix issues with operators that interact with both text state and transformation matrix by @PrinsFrank in #360
- Fix incorrect inverse matrix multiplication in graphicsStateOperator by @PrinsFrank in #361
- Handle text extraction with inverted Y-axis by @PrinsFrank in #362
- Use LineFeed as default page separator when extracting text for multiple pages by @PrinsFrank in #363
- Add sample from issue #290 by @PrinsFrank in #364
- Properly support encrypted documents in sample generation by @PrinsFrank in #365
- Move CONTRIBUTING.md to root of project by @PrinsFrank in #366
New Contributors
- @vitormattos made their first contribution in #317
Full Changelog: v2.8.0...v3.0.0-alpha.1
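The two FlateDecode entries above (#351, #352) concern PNG predictors. The Python sketch below shows why the row length has to account for the number of colors, and how the five PNG filter types (None, Sub, Up, Average, Paeth) are reversed. It follows the PNG filter definitions and is not the library's code. `undo_png_predictor` and its parameter names are hypothetical, loosely mirroring the Columns, Colors and BitsPerComponent decode parameters.

```python
def paeth(left: int, up: int, up_left: int) -> int:
    """Pick the neighbour closest to left + up - up_left."""
    p = left + up - up_left
    pa, pb, pc = abs(p - left), abs(p - up), abs(p - up_left)
    if pa <= pb and pa <= pc:
        return left
    return up if pb <= pc else up_left


def undo_png_predictor(data: bytes, columns: int, colors: int = 1,
                       bits_per_component: int = 8) -> bytes:
    bits_per_pixel = colors * bits_per_component
    bpp = (bits_per_pixel + 7) // 8               # bytes per pixel, at least 1
    row_len = (columns * bits_per_pixel + 7) // 8  # columns multiplied by colors
    out = bytearray()
    prev = bytearray(row_len)                      # row above the first is all zeros
    for start in range(0, len(data), row_len + 1):
        filter_type = data[start]                  # each row starts with a filter byte
        row = bytearray(data[start + 1:start + 1 + row_len])
        for i in range(len(row)):
            left = row[i - bpp] if i >= bpp else 0
            up = prev[i]
            up_left = prev[i - bpp] if i >= bpp else 0
            if filter_type == 1:                   # Sub
                row[i] = (row[i] + left) & 0xFF
            elif filter_type == 2:                 # Up
                row[i] = (row[i] + up) & 0xFF
            elif filter_type == 3:                 # Average
                row[i] = (row[i] + (left + up) // 2) & 0xFF
            elif filter_type == 4:                 # Paeth
                row[i] = (row[i] + paeth(left, up, up_left)) & 0xFF
            elif filter_type != 0:                 # 0 is None: bytes stay as they are
                raise ValueError(f"unknown PNG filter type {filter_type}")
        out += row
        prev = row
    return bytes(out)
```

If `colors` were ignored for an RGB image, the computed row length would be a third of the real one and data bytes would be misread as filter bytes.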
v2.8.0 Fixes several memory issues, image extraction improvements, minor bugfixes
What's Changed
- Fix parsing issues on dictionaries in content streams where brackets in dictionaries were considered part of text array by @PrinsFrank in #292
- Support color space arrays by @PrinsFrank in #293
- Add support for images with bitsPerComponent 1 by @PrinsFrank in #294
- Fix newly detected unhandled match errors by @PrinsFrank in #307
- Decrease memory of textToUnicode by not creating intermediate arrays with str_split and array_map by @PrinsFrank in #305
- Parse CrossReferenceTable per line instead of entire table at once to reduce memory footprint by @PrinsFrank in #309
- Exit crossReferenceSection traversal when content at byte offset has already been parsed. Fixes #301 by @PrinsFrank in #310
Full Changelog: v2.7.0...v2.8.0
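Parsing the cross-reference table per line (#309) can be pictured as a generator that yields one entry at a time instead of building the whole table in memory first. The following Python sketch only illustrates that idea for classic (non-stream) cross-reference tables and is not the library's PHP implementation. `XrefEntry` and `iter_xref_entries` are hypothetical names.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class XrefEntry(NamedTuple):
    object_number: int
    offset: int        # byte offset for in-use entries, next free object number otherwise
    generation: int
    in_use: bool


def iter_xref_entries(stream: TextIO) -> Iterator[XrefEntry]:
    """Yield entries one line at a time until the trailer keyword is reached."""
    next_object_number = 0
    for line in stream:
        parts = line.split()
        if not parts or parts == ["xref"]:
            continue
        if parts[0] == "trailer":
            break
        if len(parts) == 2:
            # Subsection header: first object number and entry count.
            next_object_number = int(parts[0])
        elif len(parts) == 3 and parts[2] in ("n", "f"):
            yield XrefEntry(next_object_number, int(parts[0]), int(parts[1]), parts[2] == "n")
            next_object_number += 1
        else:
            raise ValueError(f"unexpected cross-reference line: {line!r}")


SAMPLE = "xref\n0 3\n0000000000 65535 f \n0000000017 00000 n \n0000000081 00000 n \ntrailer\n"
entries = list(iter_xref_entries(io.StringIO(SAMPLE)))
```

A caller that only needs a single object can stop iterating as soon as it has found the entry it was looking for.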
v2.7.0 Image (LUTs) & text extraction improvements
What's Changed
- Integrate samples from pdf-samples repository into this repo by @PrinsFrank in #266
- Update CONTRIBUTING.md with instructions on adding samples by @PrinsFrank in #267
- Add sample from #255 by @PrinsFrank in #268
- Add samples from previous issues by @PrinsFrank in #269
- Fix colorspace parsing issues by @PrinsFrank in #270
- Add support for LUTs in rasterized images by @PrinsFrank in #274
- Add sample from #235 by @PrinsFrank in #271
- Only fall back to identity decoding when toUnicodeCMap is not set, fixes #254 by @PrinsFrank in #276
Full Changelog: v2.6.3...v2.7.0
v2.6.3 Support for hex chars in name objects, method to retrieve subtype for embedded file & deprecation fixes
What's Changed
- Remove deprecated method usages and add PHPStan package to prevent future use by @PrinsFrank in #263
- Add support for hex characters in name objects by @PrinsFrank in #264
- Add method to retrieve subtype for embedded file by @PrinsFrank in #265
Full Changelog: v2.6.2...v2.6.3
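Hex characters in name objects (#264) refer to the `#xx` notation of the PDF specification, where a number sign followed by two hexadecimal digits stands for a single character code. A minimal Python sketch of the decoding, with the hypothetical name `decode_name` and not taken from the library:

```python
import re


def decode_name(raw: str) -> str:
    """Replace every '#xx' sequence with the character it encodes."""
    return re.sub(r"#([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), raw)
```

For example, `/Lime#20Green` decodes to the name `Lime Green`, with the leading slash kept as-is by this sketch.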
v2.6.2 Associated file can reference object with reference array, support for octal escape sequences in dates
What's Changed
- AF key can contain reference to object with reference array by @PrinsFrank in #261
- Support octal escape sequences in date values by @PrinsFrank in #262
Full Changelog: v2.6.1...v2.6.2
v2.6.1 Dictionary array parsing fix
What's Changed
- Fix bug where dictionary arrays are parsed as reference arrays when their number of components is divisible by 3 by @PrinsFrank in #260
Full Changelog: v2.6.0...v2.6.1
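The bug class fixed in v2.6.1 comes from treating any array whose token count is divisible by 3 as a list of `number generation R` references. A stricter check validates every triplet, as in this Python sketch. It illustrates the idea only and is not the library's code. It assumes the array content has already been split into whitespace-separated tokens, and `as_reference_array` is a hypothetical name.

```python
from typing import Optional


def _is_unsigned_int(token: str) -> bool:
    return token.isascii() and token.isdigit()


def as_reference_array(tokens: list[str]) -> Optional[list[tuple[int, int]]]:
    """Return (object number, generation) pairs, or None if any triplet is not a reference."""
    if not tokens or len(tokens) % 3 != 0:
        return None
    references = []
    for i in range(0, len(tokens), 3):
        number, generation, marker = tokens[i:i + 3]
        if not (_is_unsigned_int(number) and _is_unsigned_int(generation) and marker == "R"):
            return None  # divisible by 3, but not made of references
        references.append((int(number), int(generation)))
    return references
```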