Avro fix by jidler · Pull Request #1 · fullcontact/parquet-mr

jidler · 2025-10-24T21:40:34Z

No description provided.

* Writing the UNKNOWN logical type into the schema causes a breakage when parsing the file with Apache Arrow
* Instead, use the default of falling back to null when that backwards-compatibility-only logical type is present, but still write the original type

(cherry picked from commit 2589cc8)

Copilot

Pull Request Overview

This PR implements an Avro compatibility fix for map field naming in Parquet, updating the project from version 1.11.0 to 1.11.1-avro-fix. The primary change addresses PARQUET-1879, where Apache Arrow cannot read Parquet files with map fields written by parquet-avro 1.11.0.

Key changes:

Renamed map repeated group from "map" to "key_value" for better compatibility with Apache Arrow
Fixed column index filtering issues with projection and null-only pages
Updated dependency versions (Avro 1.9.1→1.9.2, Thrift 0.12.0→0.13.0)

Reviewed Changes

Copilot reviewed 55 out of 55 changed files in this pull request and generated 6 comments.

File	Description
pom.xml	Updates project version, dependencies (Avro, Thrift), adds custom distribution management, and modifies build plugins
parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java	Changes default map repeated field name from "map" to "key_value"
parquet-column/src/main/java/org/apache/parquet/schema/Types.java	Updates map builder to use new "key_value" naming constant
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BoundaryOrder.java	Adds null-page handling for column index filtering operations
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java	Fixes filtered record count by setting projection schema before filtering
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java	Removes MAP_KEY_VALUE logical type conversion to maintain backwards compatibility
parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java	Adds schema validation and default value handling for mismatched record schemas
Multiple test files	Updates test expectations to reflect "key_value" naming change
Multiple pom.xml files	Updates parent version references to 1.11.1-avro-fix or 1.11.1
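
To make the first key change concrete, the sketch below renders a map column's schema text with either repeated-group name. It is plain Java with no Parquet dependency and is not output from parquet-mr: the class `MapSchemaSketch`, the column name `attributes`, and the key and value types are all hypothetical, and any annotation on the inner group is left out on purpose so that only the name differs.

```java
// Illustration only: hand-built schema text, not produced by parquet-mr.
final class MapSchemaSketch {

    /** Renders a MAP-annotated column whose repeated inner group has the given name. */
    static String mapColumn(String columnName, String repeatedGroupName) {
        return String.join("\n",
            "optional group " + columnName + " (MAP) {",
            "  repeated group " + repeatedGroupName + " {",
            "    required binary key (STRING);",
            "    optional int32 value;",
            "  }",
            "}");
    }

    public static void main(String[] args) {
        // Inner group name before this PR, per the review summary.
        System.out.println(mapColumn("attributes", "map"));
        System.out.println();
        // Inner group name after this PR.
        System.out.println(mapColumn("attributes", "key_value"));
    }
}
```

The two renderings differ only in the name of the repeated group, which is the detail the summary says Apache Arrow is sensitive to.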

Copilot · 2025-10-24T21:49:28Z

  <artifactId>parquet</artifactId>
-  <version>1.11.0</version>
+  <version>1.11.1-avro-fix</version>
+  <!-- <version>1.11.1</version> -->


There are two version declarations with one commented out. This creates ambiguity about which version is intended. Consider removing the commented version or adding a clear explanation for why both are present.

Copilot · 2025-10-24T21:49:29Z

+    <!-- <version>1.11.1-avro-fix</version> -->
+    <version>1.11.1</version>


The parent version differs from other modules (1.11.1 instead of 1.11.1-avro-fix) and has a commented alternative. This inconsistency across modules could lead to unexpected behavior and build issues.

Suggested change

-    <!-- <version>1.11.1-avro-fix</version> -->
-    <version>1.11.1</version>
+    <version>1.11.1-avro-fix</version>

Copilot · 2025-10-24T21:49:29Z

      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-format-structures</artifactId>
-      <version>${project.version}</version>
+      <version>1.11.1</version>


The parquet-format-structures dependency version is hardcoded to 1.11.1 instead of using ${project.version}. This creates version inconsistency and makes version management harder.

Suggested change

-      <version>1.11.1</version>
+      <version>${project.version}</version>

Copilot · 2025-10-24T21:49:30Z

+    String extrainfo = null;
+    if (avroFields.size() > 9){
+      // field 7 and 9 is endpoint and origin in lum messages which could help determine where this is coming from
+      extrainfo = model.getField(record, avroFields.get(7).name(), 7).toString() + ":" +
+                  model.getField(record, avroFields.get(9).name(), 9).toString();
+    }
+    if (maxFields > recordSchema.getFields().size()) {
+      LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}",
+        avroSchema.getFullName(),
+        maxFields, recordSchema.getFields().size(),
+        extrainfo == null ? "" : extrainfo);


This code contains hardcoded magic numbers (7, 9) and references 'lum messages', which appears to be application-specific logic that doesn't belong in a general-purpose library. This should be removed or made configurable.

Suggested change

-    String extrainfo = null;
-    if (avroFields.size() > 9){
-      // field 7 and 9 is endpoint and origin in lum messages which could help determine where this is coming from
-      extrainfo = model.getField(record, avroFields.get(7).name(), 7).toString() + ":" +
-                  model.getField(record, avroFields.get(9).name(), 9).toString();
-    }
-    if (maxFields > recordSchema.getFields().size()) {
-      LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}",
-        avroSchema.getFullName(),
-        maxFields, recordSchema.getFields().size(),
-        extrainfo == null ? "" : extrainfo);
+    // Removed application-specific logic for lum messages and hardcoded field indices
+    if (maxFields > recordSchema.getFields().size()) {
+      LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}",
+        avroSchema.getFullName(),
+        maxFields, recordSchema.getFields().size());
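
One way to read "made configurable" in the comment above is to pass in the names of the diagnostic fields instead of hardcoding positions 7 and 9. The sketch below is plain Java and does not use the Avro or parquet-avro APIs: `DiagnosticInfo`, `describe`, and the map-based record are hypothetical stand-ins, and the field names `endpoint` and `origin` are borrowed from the code comment in the diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of parquet-avro.
final class DiagnosticInfo {

    /**
     * Joins the values of the requested fields with ':'.
     * Fields that are absent or null are skipped, so a short or sparse
     * record cannot cause an out-of-range lookup or a NullPointerException.
     */
    static String describe(Map<String, Object> record, List<String> fieldNames) {
        List<String> parts = new ArrayList<>();
        for (String name : fieldNames) {
            Object value = record.get(name);
            if (value != null) {
                parts.add(String.valueOf(value));
            }
        }
        return String.join(":", parts);
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("endpoint", "/v1/ingest");
        record.put("origin", "batch-loader");
        // The caller decides which fields are worth logging.
        System.out.println(describe(record, Arrays.asList("endpoint", "origin")));
        // An empty list turns the extra info off entirely.
        System.out.println(describe(record, Collections.<String>emptyList()).isEmpty());
    }
}
```

Looking fields up by name rather than by position keeps the library free of assumptions about any one application's schema, which is the concern the review raises.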

Copilot · 2025-10-24T21:49:30Z

+                  model.getField(record, avroFields.get(9).name(), 9).toString();
+    }
+    if (maxFields > recordSchema.getFields().size()) {
+      LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}",


The log message lacks clarity about which value represents what. Consider using descriptive labels like 'expected: {}' and 'actual: {}' to make the warning more actionable.

Suggested change

-      LOG.warn("Number of fields in {} did not match the number of fields in the Schema. {} {}, info={}",
+      LOG.warn("Number of fields in {} did not match the number of fields in the Schema. expected: {}, actual: {}, info={}",
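
The difference is easier to judge with rendered output. This is a minimal sketch in plain Java, where `String.format` stands in for the logger's `{}` placeholders; `FieldCountMessage` and the schema name `com.example.Event` are hypothetical. Which of the two counts is the "expected" one is the reviewer's reading, not something the diff states.

```java
// Hypothetical helper: String.format stands in for the logger's {} placeholders.
final class FieldCountMessage {

    /** Message in the shape of the original warning: two bare numbers. */
    static String unlabeled(String schemaName, int first, int second) {
        return String.format(
            "Number of fields in %s did not match the number of fields in the Schema. %d %d",
            schemaName, first, second);
    }

    /** Same message with each count labeled, as the review suggests. */
    static String labeled(String schemaName, int expected, int actual) {
        return String.format(
            "Number of fields in %s did not match the number of fields in the Schema. expected: %d, actual: %d",
            schemaName, expected, actual);
    }

    public static void main(String[] args) {
        System.out.println(unlabeled("com.example.Event", 12, 10));
        System.out.println(labeled("com.example.Event", 12, 10));
    }
}
```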

Copilot · 2025-10-24T21:49:30Z

@@ -62,6 +62,7 @@
               </artifactItems>
               <includes>parquet.thrift</includes>
               <outputDirectory>${parquet.thrift.path}</outputDirectory>


[nitpick] The generator configuration is added without explanation. Adding a comment explaining why this explicit configuration is necessary would improve maintainability.

Suggested change

<outputDirectory>${parquet.thrift.path}</outputDirectory>

<outputDirectory>${parquet.thrift.path}</outputDirectory>

gszadovszky and others added 15 commits January 22, 2020 11:22

aa20d77  Prepare next development iteration
fa48449  PARQUET-1740: Make ParquetFileReader.getFilteredRecordCount public (apache#728) (cherry picked from commit 65eba55)
49de5b4  PARQUET-1744: Some filters throws ArrayIndexOutOfBoundsException (apache#732) (cherry picked from commit f0fc29f)
f284d63  PARQUET-1765: Invalid filteredRowCount in InternalParquetRecordReader (apache#747) (cherry picked from commit 8c1bc9b)
1d68b52  PARQUET-1796: Bump Avro from 1.9.1 to 1.9.2 (apache#759) (cherry picked from commit 474a2be)
869a570  Update CHANGES.md for 1.11.1rc0
1796c55  [maven-release-plugin] prepare release apache-parquet-1.11.1-rc0
d69e0d9  [maven-release-plugin] prepare for next development iteration
cf835b9  PARQUET-1853: Minimize shaded fastutil in parquet-avro (apache#790) (cherry picked from commit b835df9)
5bda355  Update CHANGES.md for 1.11.1rc1
765bd5c  [maven-release-plugin] prepare release apache-parquet-1.11.1-rc1
6865e21  fix writerecordfields
db42a63  a few other version munges
b77aec2  these changes were made on 4-26-24
         unfortunately not checked into the branch until now

jidler requested review from a-tekin-fullcontact and Copilot October 24, 2025 21:48

Copilot AI reviewed Oct 24, 2025

jidler wants to merge 15 commits into master from avro-fix
