Avro fix #1

Open
jidler wants to merge 15 commits into master from avro-fix
jidler (Collaborator) commented Oct 24, 2025

No description provided.

gszadovszky and others added 15 commits January 22, 2020 11:22
* Writing the UNKNOWN logical type into the schema breaks parsing
  of the file with Apache Arrow
* Instead, use the default of falling back to null when that
  backwards-compatibility-only logical type is present, but still
  write the original type

(cherry picked from commit 2589cc8)
unfortunately not checked into the branch until now

Copilot AI left a comment

Pull Request Overview

This PR implements an Avro compatibility fix for map field naming in Parquet, updating the project from version 1.11.0 to 1.11.1-avro-fix. The primary change addresses PARQUET-1879, where Apache Arrow cannot read Parquet files with map fields written by parquet-avro 1.11.0.

Key changes:

  • Renamed map repeated group from "map" to "key_value" for better compatibility with Apache Arrow
  • Fixed column index filtering issues with projection and null-only pages
  • Updated dependency versions (Avro 1.9.1→1.9.2, Thrift 0.12.0→0.13.0)
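To make the first key change concrete, here is a minimal sketch that prints a map column as Parquet schema text with each name for the repeated group. It uses only the Java standard library, not the Parquet API. The class `MapLayoutSketch`, the column name `my_map`, and the key and value types are illustrative assumptions, and any annotation on the repeated group itself is left out.

```java
public class MapLayoutSketch {

    /**
     * Renders a string-to-int32 map column as schema text.
     * Only the name of the inner repeated group varies.
     */
    static String mapColumn(String columnName, String repeatedGroupName) {
        return String.join("\n",
            "optional group " + columnName + " (MAP) {",
            "  repeated group " + repeatedGroupName + " {",
            "    required binary key (STRING);",
            "    optional int32 value;",
            "  }",
            "}");
    }

    public static void main(String[] args) {
        // Repeated group name used before this PR.
        System.out.println(mapColumn("my_map", "map"));
        System.out.println();
        // Repeated group name used after this PR.
        System.out.println(mapColumn("my_map", "key_value"));
    }
}
```

The two outputs differ only in the line that names the repeated group, which is the part of the schema the overview says a reader such as Apache Arrow trips over.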

Reviewed Changes

Copilot reviewed 55 out of 55 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| pom.xml | Updates project version, dependencies (Avro, Thrift), adds custom distribution management, and modifies build plugins |
| parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java | Changes default map repeated field name from "map" to "key_value" |
| parquet-column/src/main/java/org/apache/parquet/schema/Types.java | Updates map builder to use new "key_value" naming constant |
| parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BoundaryOrder.java | Adds null-page handling for column index filtering operations |
| parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java | Fixes filtered record count by setting projection schema before filtering |
| parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java | Removes MAP_KEY_VALUE logical type conversion to maintain backwards compatibility |
| parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java | Adds schema validation and default value handling for mismatched record schemas |
| Multiple test files | Updates test expectations to reflect "key_value" naming change |
| Multiple pom.xml files | Updates parent version references to 1.11.1-avro-fix or 1.11.1 |

Comment thread pom.xml
  <artifactId>parquet</artifactId>
- <version>1.11.0</version>
+ <version>1.11.1-avro-fix</version>
+ <!-- <version>1.11.1</version> -->

Copilot AI Oct 24, 2025

There are two version declarations with one commented out. This creates ambiguity about which version is intended. Consider removing the commented version or adding a clear explanation for why both are present.

Suggested change
- <!-- <version>1.11.1</version> -->

Comment thread parquet-tools/pom.xml
Comment on lines +24 to +25
+ <!-- <version>1.11.1-avro-fix</version> -->
+ <version>1.11.1</version>

Copilot AI Oct 24, 2025

The parent version differs from the other modules (1.11.1 instead of 1.11.1-avro-fix), and the alternative is left commented out. This inconsistency across modules could lead to unexpected behavior and build issues.

Suggested change
- <!-- <version>1.11.1-avro-fix</version> -->
- <version>1.11.1</version>
+ <version>1.11.1-avro-fix</version>

Comment thread parquet-common/pom.xml
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-format-structures</artifactId>
- <version>${project.version}</version>
+ <version>1.11.1</version>

Copilot AI Oct 24, 2025

The parquet-format-structures dependency version is hardcoded to 1.11.1 instead of using ${project.version}. This creates version inconsistency and makes version management harder.

Suggested change
- <version>1.11.1</version>
+ <version>${project.version}</version>

Comment on lines +189 to +199
+ String extrainfo = null;
+ if (avroFields.size() > 9){
+   // field 7 and 9 is endpoint and origin in lum messages which could help determine where this is coming from
+   extrainfo = model.getField(record, avroFields.get(7).name(), 7).toString() + ":" +
+       model.getField(record, avroFields.get(9).name(), 9).toString();
+ }
+ if (maxFields > recordSchema.getFields().size()) {
+   LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}",
+       avroSchema.getFullName(),
+       maxFields, recordSchema.getFields().size(),
+       extrainfo == null ? "" : extrainfo);

Copilot AI Oct 24, 2025

This code contains hardcoded magic numbers (7, 9) and references 'lum messages', which appears to be application-specific logic that doesn't belong in a general-purpose library. This should be removed or made configurable.

Suggested change
- String extrainfo = null;
- if (avroFields.size() > 9){
-   // field 7 and 9 is endpoint and origin in lum messages which could help determine where this is coming from
-   extrainfo = model.getField(record, avroFields.get(7).name(), 7).toString() + ":" +
-       model.getField(record, avroFields.get(9).name(), 9).toString();
- }
- if (maxFields > recordSchema.getFields().size()) {
-   LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}",
-       avroSchema.getFullName(),
-       maxFields, recordSchema.getFields().size(),
-       extrainfo == null ? "" : extrainfo);
+ // Removed application-specific logic for lum messages and hardcoded field indices
+ if (maxFields > recordSchema.getFields().size()) {
+   LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}",
+       avroSchema.getFullName(),
+       maxFields, recordSchema.getFields().size());
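If the diagnostic detail is worth keeping, the "made configurable" option from the comment could look roughly like the sketch below: fields are resolved by configured name rather than by position. This is a hypothetical illustration, not code from the PR. A plain `Map<String, Object>` stands in for the Avro record, and the class `DiagnosticFields` and method `extraInfo` are invented names. The field names `endpoint` and `origin` are taken from the code comment under review.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

public class DiagnosticFields {

    /**
     * Joins the values of the configured field names with ':'.
     * Fields that are absent or null are skipped, so a short or
     * sparse record cannot cause an exception here.
     */
    static String extraInfo(Map<String, Object> record, List<String> fieldNames) {
        return fieldNames.stream()
            .map(record::get)
            .filter(Objects::nonNull)
            .map(Object::toString)
            .collect(Collectors.joining(":"));
    }

    public static void main(String[] args) {
        // Stand-in for a record; a real caller would read from the Avro model.
        Map<String, Object> record =
            Map.of("endpoint", "api-1", "origin", "eu-west", "payload", 42);
        System.out.println(extraInfo(record, List.of("endpoint", "origin")));
    }
}
```

Looking fields up by name removes the dependency on positions 7 and 9, and skipping nulls avoids the unguarded `toString()` calls in the reviewed code.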

model.getField(record, avroFields.get(9).name(), 9).toString();
}
if (maxFields > recordSchema.getFields().size()) {
LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}",

Copilot AI Oct 24, 2025

The log message lacks clarity about which value represents what. Consider using descriptive labels like 'expected: {}' and 'actual: {}' to make the warning more actionable.

Suggested change
- LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}",
+ LOG.warn("Number of fields in {} did not match the number of fields in the Schema. expected: {}, actual: {}, info={}",
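As a quick check of how the labeled message reads, the sketch below builds the same text with `String.format` standing in for the logger's `{}` placeholders. The class `FieldCountWarning` and its method are hypothetical, and the `info` part is left out. Treating the first count as "expected" and the second as "actual" follows the label order in the suggestion and is an assumption, since the source does not say which of the two values is which.

```java
public class FieldCountWarning {

    /** Builds the warning text with explicit labels for the two counts. */
    static String message(String schemaName, int expected, int actual) {
        return String.format(
            "Number of fields in %s did not match the number of fields in the Schema. "
                + "expected: %d, actual: %d",
            schemaName, expected, actual);
    }

    public static void main(String[] args) {
        // Example values only; a real caller would pass the schema's full name
        // and the two field counts being compared.
        System.out.println(message("com.example.Event", 12, 10));
    }
}
```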

@@ -62,6 +62,7 @@
</artifactItems>
<includes>parquet.thrift</includes>
<outputDirectory>${parquet.thrift.path}</outputDirectory>

Copilot AI Oct 24, 2025

[nitpick] The generator configuration is added without explanation. Adding a comment explaining why this explicit configuration is necessary would improve maintainability.

Suggested change
- <outputDirectory>${parquet.thrift.path}</outputDirectory>
+ <outputDirectory>${parquet.thrift.path}</outputDirectory>
+ <!-- Explicitly specify the generator to ensure Java code is produced,
+      even if plugin defaults change in future versions. -->
