feat: Avro#8
Merged
Merged
Conversation
- Implemented ManifestEncoder for encoding v1 and v2 Iceberg manifests. - Created ManifestListEncoder for encoding manifest lists in v1 and v2 formats. - Developed MiniAvroEncoder for writing Avro OCF files. - Added tests for reading and writing manifests and manifest lists, including support for various codecs. - Implemented OcfReader to read Avro OCF headers and validate codec and schema. - Added smoke tests to verify end-to-end functionality from snapshot to data file paths.
- Updated compatibility matrix to reflect Phase 2 development status and validated features. - Refactored AvroToIcebergType to use switch expressions for better readability. - Enhanced BinaryDecoder to include range checks for ReadInt, ReadBytesSpan, and ReadFixed methods. - Improved error handling in BinaryDecoder for negative sizes and out-of-range values. - Simplified OcfHeader class constructor using primary constructor syntax. - Added checks in AvroSchemaParser for logical types and fixed sizes, ensuring proper validation. - Introduced unit tests to cover new validation rules and edge cases in Avro schema parsing and decoding. - Updated various tests to use modern syntax and ensure consistency across test cases.
There was a problem hiding this comment.
Pull request overview
This PR advances IcebergSharp to “Phase 2” by adding a minimal, hardened Avro OCF reader specialized for Iceberg manifest-lists and manifests, along with domain models, fixtures/tests, and updated docs/CI enforcement.
Changes:
- Add stream-based Avro OCF decoding for Iceberg manifest-list and manifest files (
null+deflatecodecs), including schema parsing and block decoding hardening. - Introduce core domain models for manifests (manifest files, entries, data files, partition tuples) and end-to-end smoke coverage from metadata JSON → manifests.
- Update docs/changelog/status messaging and tighten CI with explicit
dotnet formatverification steps.
Reviewed changes
Copilot reviewed 38 out of 39 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/IcebergSharp.Tests/Avro/Phase2SmokeTests.cs | End-to-end smoke test from parsed table metadata to manifest-list + manifest traversal. |
| tests/IcebergSharp.Tests/Avro/OcfHeaderTests.cs | Unit tests for OCF header parsing (schema/codec/magic/codec validation). |
| tests/IcebergSharp.Tests/Avro/ManifestReaderTests.cs | Manifest reader coverage across v1/v2, stats/bounds, codecs, replay semantics. |
| tests/IcebergSharp.Tests/Avro/ManifestListReaderTests.cs | Manifest-list reader coverage across v1/v2, codecs, mixed content, replay semantics. |
| tests/IcebergSharp.Tests/Avro/Helpers/MiniAvroEncoder.cs | Minimal Avro OCF fixture encoder used by tests. |
| tests/IcebergSharp.Tests/Avro/Helpers/ManifestListEncoder.cs | Test helper to encode manifest-list OCF fixtures (v1/v2). |
| tests/IcebergSharp.Tests/Avro/Helpers/ManifestEncoder.cs | Test helper to encode manifest OCF fixtures (v1/v2). |
| tests/IcebergSharp.Tests/Avro/BinaryDecoderTests.cs | Binary decoder hardening and correctness tests (varints, bounds, truncation, etc.). |
| tests/IcebergSharp.Tests/Avro/AvroSchemaParserTests.cs | Schema parser tests for supported Avro constructs and validation behavior. |
| src/IcebergSharp.Core/Properties/AssemblyInfo.cs | Expose Core internals to Avro + test assemblies for Phase 2 implementation/testing. |
| src/IcebergSharp.Core/PartitionValues.cs | New domain type for decoded partition tuples (names/types/values). |
| src/IcebergSharp.Core/ManifestFile.cs | New domain types for manifest-list entries and per-partition summaries. |
| src/IcebergSharp.Core/ManifestEntry.cs | New domain types for manifest entries and data files (content/format/stats/bounds). |
| src/IcebergSharp.Avro/Properties/AssemblyInfo.cs | Expose Avro internals to test assembly. |
| src/IcebergSharp.Avro/ManifestReadOptions.cs | Options for manifest reading (LeaveOpen, MaxBlockSize). |
| src/IcebergSharp.Avro/ManifestReader.cs | Streaming manifest reader built on OCF + schema parser + entry sink. |
| src/IcebergSharp.Avro/ManifestListReadOptions.cs | Options for manifest-list reading (LeaveOpen, MaxBlockSize). |
| src/IcebergSharp.Avro/ManifestListReader.cs | Streaming manifest-list reader with format version detection. |
| src/IcebergSharp.Avro/Internal/Schema/AvroSchemaParser.cs | AOT-safe Utf8JsonReader schema parsing with strict shape validation. |
| src/IcebergSharp.Avro/Internal/Schema/AvroSchema.cs | Minimal internal Avro schema model (primitives/records/unions/arrays/etc.). |
| src/IcebergSharp.Avro/Internal/Ocf/OcfReader.cs | Async OCF block reader with codec dispatch, sync checking, pooling, and limits. |
| src/IcebergSharp.Avro/Internal/Ocf/OcfHeader.cs | Parsed OCF header model (schema/codec/sync/metadata). |
| src/IcebergSharp.Avro/Internal/Errors/AvroFormatException.cs | Format-specific exception type with offset/location support. |
| src/IcebergSharp.Avro/Internal/Decode/ShapeDecoder.cs | Schema-guided skipping and partition-value decoding (logical types). |
| src/IcebergSharp.Avro/Internal/Decode/PartitionRecordSink.cs | Partition record decoding into PartitionValues + schema resolution. |
| src/IcebergSharp.Avro/Internal/Decode/NullableReader.cs | Helpers for [null, T] unions and nullable field reading. |
| src/IcebergSharp.Avro/Internal/Decode/ManifestFileSink.cs | Decode manifest-list records into ManifestFile. |
| src/IcebergSharp.Avro/Internal/Decode/ManifestEntrySink.cs | Decode manifest records into ManifestEntry/DataFile. |
| src/IcebergSharp.Avro/Internal/Decode/IcebergFieldIds.cs | Central constants for Iceberg field-id routing. |
| src/IcebergSharp.Avro/Internal/Decode/BinaryDecoder.cs | In-memory Avro binary decoder with truncation/validation checks. |
| src/IcebergSharp.Avro/Internal/Decode/AvroToIcebergType.cs | Map Avro partition schema nodes back to Iceberg types. |
| src/IcebergSharp.Avro/Internal/Codec/NullCodec.cs | Null codec implementation for OCF blocks. |
| src/IcebergSharp.Avro/Internal/Codec/IBlockCodec.cs | Codec interface for OCF block decoding. |
| src/IcebergSharp.Avro/Internal/Codec/DeflateCodec.cs | Deflate codec implementation for OCF blocks. |
| README.md | Update project status and surface area to Phase 2. |
| docs/compatibility-matrix.md | Update compatibility/status claims for Phase 2 readers. |
| CHANGELOG.md | Document Phase 2 additions and CI formatting enforcement. |
| .gitignore | Ignore additional IDE cache artifacts. |
| .github/workflows/ci.yml | Split dotnet format verification into whitespace/style/analyzers in CI. |
Comments suppressed due to low confidence (4)
tests/IcebergSharp.Tests/Avro/Helpers/ManifestEncoder.cs:136
- MiniAvroEncoder implements IDisposable, but the instance created here is never disposed. With repo-wide analyzers enabled and TreatWarningsAsErrors=true, this is likely to trip CA2000 and fail the build. Either wrap the encoder in a using/try-finally, or drop IDisposable from MiniAvroEncoder (it only owns MemoryStreams).
tests/IcebergSharp.Tests/Avro/Helpers/ManifestListEncoder.cs:101 - MiniAvroEncoder implements IDisposable, but the instance created here is never disposed. With repo-wide analyzers enabled and TreatWarningsAsErrors=true, this is likely to trip CA2000 and fail the build. Either wrap the encoder in a using/try-finally, or drop IDisposable from MiniAvroEncoder (it only owns MemoryStreams).
tests/IcebergSharp.Tests/Avro/OcfHeaderTests.cs:34 - MiniAvroEncoder implements IDisposable, but the instance created here is never disposed. With repo-wide analyzers enabled and TreatWarningsAsErrors=true, this is likely to trip CA2000 and fail the build. Either wrap the encoder in a using/try-finally, or drop IDisposable from MiniAvroEncoder (it only owns MemoryStreams).
tests/IcebergSharp.Tests/Avro/OcfHeaderTests.cs:60 - MiniAvroEncoder implements IDisposable, but the instance created here is never disposed. With repo-wide analyzers enabled and TreatWarningsAsErrors=true, this is likely to trip CA2000 and fail the build. Either wrap the encoder in a using/try-finally, or drop IDisposable from MiniAvroEncoder (it only owns MemoryStreams).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+108
to
+113
| public static byte[] EncodeV1(IEnumerable<V1Entry> entries, string codec = "null") | ||
| { | ||
| var enc = new MiniAvroEncoder(V1Schema, codec, new Dictionary<string, byte[]> | ||
| { | ||
| ["format-version"] = "1"u8.ToArray(), | ||
| }); |
Comment on lines
+77
to
+82
| public static byte[] EncodeV1(IEnumerable<V1Entry> entries, string codec = "null") | ||
| { | ||
| var enc = new MiniAvroEncoder(V1Schema, codec); | ||
| foreach (V1Entry e in entries) | ||
| { | ||
| enc.StartRecord(); |
Comment on lines
+13
to
+18
| const string Schema = """{"type":"record","name":"r","fields":[{"name":"x","type":"long","field-id":1}]}"""; | ||
| var enc = new MiniAvroEncoder(Schema, "null"); | ||
| enc.StartRecord(); | ||
| enc.WriteLong(42); | ||
|
|
||
| using var stream = new MemoryStream(enc.ToArray()); |
Comment on lines
+146
to
+153
| _output.Write(_magic); | ||
|
|
||
| // Metadata map: encoded as one block with negative-count prefix (entry count + byte size), then 0. | ||
| var keys = new List<KeyValuePair<string, byte[]>> | ||
| { | ||
| new("avro.schema", Encoding.UTF8.GetBytes(schemaJson)), | ||
| new("avro.codec", Encoding.UTF8.GetBytes(codec)), | ||
| }; |
Comment on lines
+98
to
+114
| var recordCount = recordCountResult.Value; | ||
| var byteSize = await ReadVarLongAsync(cancellationToken).ConfigureAwait(false); | ||
|
|
||
| if (recordCount <= 0) | ||
| { | ||
| throw new AvroFormatException($"invalid block record count {recordCount}", blockStartOffset); | ||
| } | ||
|
|
||
| if (recordCount > int.MaxValue) | ||
| { | ||
| throw new AvroFormatException($"block record count {recordCount} exceeds supported maximum {int.MaxValue}", blockStartOffset); | ||
| } | ||
|
|
||
| if (byteSize <= 0 || byteSize > _maxBlockSize) | ||
| { | ||
| throw new AvroFormatException($"invalid block byte size {byteSize} (max {_maxBlockSize})", blockStartOffset); | ||
| } |
Comment on lines
+56
to
+64
| case IcebergFieldIds.AddedFilesCount: | ||
| addedFilesCount = checked((int)NullableReader.ReadLongRequiredOrZero(ref decoder, field.Schema)); | ||
| break; | ||
| case IcebergFieldIds.ExistingFilesCount: | ||
| existingFilesCount = checked((int)NullableReader.ReadLongRequiredOrZero(ref decoder, field.Schema)); | ||
| break; | ||
| case IcebergFieldIds.DeletedFilesCount: | ||
| deletedFilesCount = checked((int)NullableReader.ReadLongRequiredOrZero(ref decoder, field.Schema)); | ||
| break; |
Comment on lines
+247
to
+256
| private async ValueTask<long?> TryReadVarLongAsync(CancellationToken cancellationToken) | ||
| { | ||
| // Returns null on clean EOF before any byte is read; otherwise reads a full varint. | ||
| var firstByte = new byte[1]; | ||
| var n = await _stream.ReadAsync(firstByte.AsMemory(0, 1), cancellationToken).ConfigureAwait(false); | ||
| if (n == 0) | ||
| { | ||
| return null; | ||
| } | ||
|
|
Comment on lines
+83
to
+85
| dotnet-version: | | ||
| 8.0.x | ||
| 9.0.x |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
nullanddeflatecodecs, schema parsing, and manifest/domain decoding.AvroFormatExceptionwhere appropriate.Validation
dotnet test IcebergSharp.slnxdotnet format whitespace IcebergSharp.slnx --verify-no-changes --no-restoredotnet format style IcebergSharp.slnx --verify-no-changes --severity info --no-restoredotnet format analyzers IcebergSharp.slnx --verify-no-changes --severity info --no-restore