Conversation
…pache#728) (cherry picked from commit 65eba55)
…apache#747) (cherry picked from commit 8c1bc9b)
(cherry picked from commit 474a2be)
* Writing UNKNOWN logical type into the schema, causes a breakage when parsing the file with Apache Arrow * Instead use the default, of falling back to null when that backwards-compatibility only logical type is present, but still write the original type (cherry picked from commit 2589cc8)
(cherry picked from commit b835df9)
unfortunately not checked into the branch until now
There was a problem hiding this comment.
Pull Request Overview
This PR implements an Avro compatibility fix for map field naming in Parquet, updating the project from version 1.11.0 to 1.11.1-avro-fix. The primary change addresses PARQUET-1879, where Apache Arrow cannot read Parquet files with map fields written by parquet-avro 1.11.0.
Key changes:
- Renamed map repeated group from "map" to "key_value" for better compatibility with Apache Arrow
- Fixed column index filtering issues with projection and null-only pages
- Updated dependency versions (Avro 1.9.1→1.9.2, Thrift 0.12.0→0.13.0)
Reviewed Changes
Copilot reviewed 55 out of 55 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pom.xml | Updates project version, dependencies (Avro, Thrift), adds custom distribution management, and modifies build plugins |
| parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java | Changes default map repeated field name from "map" to "key_value" |
| parquet-column/src/main/java/org/apache/parquet/schema/Types.java | Updates map builder to use new "key_value" naming constant |
| parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BoundaryOrder.java | Adds null-page handling for column index filtering operations |
| parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java | Fixes filtered record count by setting projection schema before filtering |
| parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java | Removes MAP_KEY_VALUE logical type conversion to maintain backwards compatibility |
| parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java | Adds schema validation and default value handling for mismatched record schemas |
| Multiple test files | Updates test expectations to reflect "key_value" naming change |
| Multiple pom.xml files | Updates parent version references to 1.11.1-avro-fix or 1.11.1 |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| <artifactId>parquet</artifactId> | ||
| <version>1.11.0</version> | ||
| <version>1.11.1-avro-fix</version> | ||
| <!-- <version>1.11.1</version> --> |
There was a problem hiding this comment.
There are two version declarations with one commented out. This creates ambiguity about which version is intended. Consider removing the commented version or adding a clear explanation for why both are present.
| <!-- <version>1.11.1</version> --> |
| <!-- <version>1.11.1-avro-fix</version> --> | ||
| <version>1.11.1</version> |
There was a problem hiding this comment.
The parent version differs from other modules (1.11.1 instead of 1.11.1-avro-fix) and has a commented alternative. This inconsistency across modules could lead to unexpected behavior and build issues.
| <!-- <version>1.11.1-avro-fix</version> --> | |
| <version>1.11.1</version> | |
| <version>1.11.1-avro-fix</version> |
| <groupId>org.apache.parquet</groupId> | ||
| <artifactId>parquet-format-structures</artifactId> | ||
| <version>${project.version}</version> | ||
| <version>1.11.1</version> |
There was a problem hiding this comment.
The parquet-format-structures dependency version is hardcoded to 1.11.1 instead of using ${project.version}. This creates version inconsistency and makes version management harder.
| <version>1.11.1</version> | |
| <version>${project.version}</version> |
| String extrainfo = null; | ||
| if (avroFields.size() > 9){ | ||
| // field 7 and 9 is endpoint and origin in lum messages which could help determine where this is coming from | ||
| extrainfo = model.getField(record, avroFields.get(7).name(), 7).toString() + ":" + | ||
| model.getField(record, avroFields.get(9).name(), 9).toString(); | ||
| } | ||
| if (maxFields > recordSchema.getFields().size()) { | ||
| LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}", | ||
| avroSchema.getFullName(), | ||
| maxFields, recordSchema.getFields().size(), | ||
| extrainfo == null ? "" : extrainfo); |
There was a problem hiding this comment.
This code contains hardcoded magic numbers (7, 9) and references 'lum messages', which appears to be application-specific logic that doesn't belong in a general-purpose library. This should be removed or made configurable.
| String extrainfo = null; | |
| if (avroFields.size() > 9){ | |
| // field 7 and 9 is endpoint and origin in lum messages which could help determine where this is coming from | |
| extrainfo = model.getField(record, avroFields.get(7).name(), 7).toString() + ":" + | |
| model.getField(record, avroFields.get(9).name(), 9).toString(); | |
| } | |
| if (maxFields > recordSchema.getFields().size()) { | |
| LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}", | |
| avroSchema.getFullName(), | |
| maxFields, recordSchema.getFields().size(), | |
| extrainfo == null ? "" : extrainfo); | |
| // Removed application-specific logic for lum messages and hardcoded field indices | |
| if (maxFields > recordSchema.getFields().size()) { | |
| LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}", | |
| avroSchema.getFullName(), | |
| maxFields, recordSchema.getFields().size()); |
| model.getField(record, avroFields.get(9).name(), 9).toString(); | ||
| } | ||
| if (maxFields > recordSchema.getFields().size()) { | ||
| LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}", |
There was a problem hiding this comment.
The log message lacks clarity about which value represents what. Consider using descriptive labels like 'expected: {}' and 'actual: {}' to make the warning more actionable.
| LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}", | |
| LOG.warn("Number of fields in {} did not match the number of fields in the Schema. expected: {}, actual: {}, info={}", |
| @@ -62,6 +62,7 @@ | |||
| </artifactItems> | |||
| <includes>parquet.thrift</includes> | |||
| <outputDirectory>${parquet.thrift.path}</outputDirectory> | |||
There was a problem hiding this comment.
[nitpick] The generator configuration is added without explanation. Adding a comment explaining why this explicit configuration is necessary would improve maintainability.
| <outputDirectory>${parquet.thrift.path}</outputDirectory> | |
| <outputDirectory>${parquet.thrift.path}</outputDirectory> | |
| <!-- Explicitly specify the generator to ensure Java code is produced, | |
| even if plugin defaults change in future versions. --> |
No description provided.