Skip to content

feat: Avro#8

Merged
AndreaBozzo merged 4 commits into
mainfrom
ab/avro
May 15, 2026
Merged

feat: Avro#8
AndreaBozzo merged 4 commits into
mainfrom
ab/avro

Conversation

@AndreaBozzo

@AndreaBozzo AndreaBozzo commented May 15, 2026

Copy link
Copy Markdown
Owner

Summary

  • Add the Phase 2 Avro manifest-list and manifest reader stack, including OCF header/block reading, null and deflate codecs, schema parsing, and manifest/domain decoding.
  • Add core manifest domain models plus fixture encoders and unit/smoke coverage for v1/v2 manifest lists, manifests, stats, bounds, codecs, replay, and metadata-to-manifest traversal.
  • Harden malformed-file handling: truncated schemas, invalid logical types, oversized lengths, invalid metadata/block headers, record-count caps, and overflow paths now surface as AvroFormatException where appropriate.
  • Update README, compatibility matrix, and changelog to reflect Phase 2 development status.
  • Enforce strict whitespace/style/analyzer checks in CI and document the multi-runtime setup.

Validation

  • dotnet test IcebergSharp.slnx
  • dotnet format whitespace IcebergSharp.slnx --verify-no-changes --no-restore
  • dotnet format style IcebergSharp.slnx --verify-no-changes --severity info --no-restore
  • dotnet format analyzers IcebergSharp.slnx --verify-no-changes --severity info --no-restore

- Implemented ManifestEncoder for encoding v1 and v2 Iceberg manifests.
- Created ManifestListEncoder for encoding manifest lists in v1 and v2 formats.
- Developed MiniAvroEncoder for writing Avro OCF files.
- Added tests for reading and writing manifests and manifest lists, including support for various codecs.
- Implemented OcfReader to read Avro OCF headers and validate codec and schema.
- Added smoke tests to verify end-to-end functionality from snapshot to data file paths.
- Updated compatibility matrix to reflect Phase 2 development status and validated features.
- Refactored AvroToIcebergType to use switch expressions for better readability.
- Enhanced BinaryDecoder to include range checks for ReadInt, ReadBytesSpan, and ReadFixed methods.
- Improved error handling in BinaryDecoder for negative sizes and out-of-range values.
- Simplified OcfHeader class constructor using primary constructor syntax.
- Added checks in AvroSchemaParser for logical types and fixed sizes, ensuring proper validation.
- Introduced unit tests to cover new validation rules and edge cases in Avro schema parsing and decoding.
- Updated various tests to use modern syntax and ensure consistency across test cases.
@AndreaBozzo AndreaBozzo marked this pull request as ready for review May 15, 2026 09:29
@AndreaBozzo AndreaBozzo self-assigned this May 15, 2026
@AndreaBozzo AndreaBozzo added enhancement New feature or request .NET Pull requests that update .NET code labels May 15, 2026
@AndreaBozzo AndreaBozzo requested a review from Copilot May 15, 2026 09:29
@AndreaBozzo AndreaBozzo changed the title [codex] Harden Avro manifest reading feat: Avro May 15, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR advances IcebergSharp to “Phase 2” by adding a minimal, hardened Avro OCF reader specialized for Iceberg manifest-lists and manifests, along with domain models, fixtures/tests, and updated docs/CI enforcement.

Changes:

  • Add stream-based Avro OCF decoding for Iceberg manifest-list and manifest files (null + deflate codecs), including schema parsing and block decoding hardening.
  • Introduce core domain models for manifests (manifest files, entries, data files, partition tuples) and end-to-end smoke coverage from metadata JSON → manifests.
  • Update docs/changelog/status messaging and tighten CI with explicit dotnet format verification steps.

Reviewed changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
tests/IcebergSharp.Tests/Avro/Phase2SmokeTests.cs End-to-end smoke test from parsed table metadata to manifest-list + manifest traversal.
tests/IcebergSharp.Tests/Avro/OcfHeaderTests.cs Unit tests for OCF header parsing (schema/codec/magic/codec validation).
tests/IcebergSharp.Tests/Avro/ManifestReaderTests.cs Manifest reader coverage across v1/v2, stats/bounds, codecs, replay semantics.
tests/IcebergSharp.Tests/Avro/ManifestListReaderTests.cs Manifest-list reader coverage across v1/v2, codecs, mixed content, replay semantics.
tests/IcebergSharp.Tests/Avro/Helpers/MiniAvroEncoder.cs Minimal Avro OCF fixture encoder used by tests.
tests/IcebergSharp.Tests/Avro/Helpers/ManifestListEncoder.cs Test helper to encode manifest-list OCF fixtures (v1/v2).
tests/IcebergSharp.Tests/Avro/Helpers/ManifestEncoder.cs Test helper to encode manifest OCF fixtures (v1/v2).
tests/IcebergSharp.Tests/Avro/BinaryDecoderTests.cs Binary decoder hardening and correctness tests (varints, bounds, truncation, etc.).
tests/IcebergSharp.Tests/Avro/AvroSchemaParserTests.cs Schema parser tests for supported Avro constructs and validation behavior.
src/IcebergSharp.Core/Properties/AssemblyInfo.cs Expose Core internals to Avro + test assemblies for Phase 2 implementation/testing.
src/IcebergSharp.Core/PartitionValues.cs New domain type for decoded partition tuples (names/types/values).
src/IcebergSharp.Core/ManifestFile.cs New domain types for manifest-list entries and per-partition summaries.
src/IcebergSharp.Core/ManifestEntry.cs New domain types for manifest entries and data files (content/format/stats/bounds).
src/IcebergSharp.Avro/Properties/AssemblyInfo.cs Expose Avro internals to test assembly.
src/IcebergSharp.Avro/ManifestReadOptions.cs Options for manifest reading (LeaveOpen, MaxBlockSize).
src/IcebergSharp.Avro/ManifestReader.cs Streaming manifest reader built on OCF + schema parser + entry sink.
src/IcebergSharp.Avro/ManifestListReadOptions.cs Options for manifest-list reading (LeaveOpen, MaxBlockSize).
src/IcebergSharp.Avro/ManifestListReader.cs Streaming manifest-list reader with format version detection.
src/IcebergSharp.Avro/Internal/Schema/AvroSchemaParser.cs AOT-safe Utf8JsonReader schema parsing with strict shape validation.
src/IcebergSharp.Avro/Internal/Schema/AvroSchema.cs Minimal internal Avro schema model (primitives/records/unions/arrays/etc.).
src/IcebergSharp.Avro/Internal/Ocf/OcfReader.cs Async OCF block reader with codec dispatch, sync checking, pooling, and limits.
src/IcebergSharp.Avro/Internal/Ocf/OcfHeader.cs Parsed OCF header model (schema/codec/sync/metadata).
src/IcebergSharp.Avro/Internal/Errors/AvroFormatException.cs Format-specific exception type with offset/location support.
src/IcebergSharp.Avro/Internal/Decode/ShapeDecoder.cs Schema-guided skipping and partition-value decoding (logical types).
src/IcebergSharp.Avro/Internal/Decode/PartitionRecordSink.cs Partition record decoding into PartitionValues + schema resolution.
src/IcebergSharp.Avro/Internal/Decode/NullableReader.cs Helpers for [null, T] unions and nullable field reading.
src/IcebergSharp.Avro/Internal/Decode/ManifestFileSink.cs Decode manifest-list records into ManifestFile.
src/IcebergSharp.Avro/Internal/Decode/ManifestEntrySink.cs Decode manifest records into ManifestEntry/DataFile.
src/IcebergSharp.Avro/Internal/Decode/IcebergFieldIds.cs Central constants for Iceberg field-id routing.
src/IcebergSharp.Avro/Internal/Decode/BinaryDecoder.cs In-memory Avro binary decoder with truncation/validation checks.
src/IcebergSharp.Avro/Internal/Decode/AvroToIcebergType.cs Map Avro partition schema nodes back to Iceberg types.
src/IcebergSharp.Avro/Internal/Codec/NullCodec.cs Null codec implementation for OCF blocks.
src/IcebergSharp.Avro/Internal/Codec/IBlockCodec.cs Codec interface for OCF block decoding.
src/IcebergSharp.Avro/Internal/Codec/DeflateCodec.cs Deflate codec implementation for OCF blocks.
README.md Update project status and surface area to Phase 2.
docs/compatibility-matrix.md Update compatibility/status claims for Phase 2 readers.
CHANGELOG.md Document Phase 2 additions and CI formatting enforcement.
.gitignore Ignore additional IDE cache artifacts.
.github/workflows/ci.yml Split dotnet format verification into whitespace/style/analyzers in CI.
Comments suppressed due to low confidence (4)

tests/IcebergSharp.Tests/Avro/Helpers/ManifestEncoder.cs:136

  • MiniAvroEncoder implements IDisposable, but the instance created here is never disposed. With repo-wide analyzers enabled and TreatWarningsAsErrors=true, this is likely to trip CA2000 and fail the build. Either wrap the encoder in a using/try-finally, or drop IDisposable from MiniAvroEncoder (it only owns MemoryStreams).
    tests/IcebergSharp.Tests/Avro/Helpers/ManifestListEncoder.cs:101
  • MiniAvroEncoder implements IDisposable, but the instance created here is never disposed. With repo-wide analyzers enabled and TreatWarningsAsErrors=true, this is likely to trip CA2000 and fail the build. Either wrap the encoder in a using/try-finally, or drop IDisposable from MiniAvroEncoder (it only owns MemoryStreams).
    tests/IcebergSharp.Tests/Avro/OcfHeaderTests.cs:34
  • MiniAvroEncoder implements IDisposable, but the instance created here is never disposed. With repo-wide analyzers enabled and TreatWarningsAsErrors=true, this is likely to trip CA2000 and fail the build. Either wrap the encoder in a using/try-finally, or drop IDisposable from MiniAvroEncoder (it only owns MemoryStreams).
    tests/IcebergSharp.Tests/Avro/OcfHeaderTests.cs:60
  • MiniAvroEncoder implements IDisposable, but the instance created here is never disposed. With repo-wide analyzers enabled and TreatWarningsAsErrors=true, this is likely to trip CA2000 and fail the build. Either wrap the encoder in a using/try-finally, or drop IDisposable from MiniAvroEncoder (it only owns MemoryStreams).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +108 to +113
public static byte[] EncodeV1(IEnumerable<V1Entry> entries, string codec = "null")
{
var enc = new MiniAvroEncoder(V1Schema, codec, new Dictionary<string, byte[]>
{
["format-version"] = "1"u8.ToArray(),
});
Comment on lines +77 to +82
public static byte[] EncodeV1(IEnumerable<V1Entry> entries, string codec = "null")
{
var enc = new MiniAvroEncoder(V1Schema, codec);
foreach (V1Entry e in entries)
{
enc.StartRecord();
Comment on lines +13 to +18
const string Schema = """{"type":"record","name":"r","fields":[{"name":"x","type":"long","field-id":1}]}""";
var enc = new MiniAvroEncoder(Schema, "null");
enc.StartRecord();
enc.WriteLong(42);

using var stream = new MemoryStream(enc.ToArray());
Comment on lines +146 to +153
_output.Write(_magic);

// Metadata map: encoded as one block with negative-count prefix (entry count + byte size), then 0.
var keys = new List<KeyValuePair<string, byte[]>>
{
new("avro.schema", Encoding.UTF8.GetBytes(schemaJson)),
new("avro.codec", Encoding.UTF8.GetBytes(codec)),
};
Comment thread src/IcebergSharp.Avro/Internal/Ocf/OcfReader.cs Outdated
Comment on lines +98 to +114
var recordCount = recordCountResult.Value;
var byteSize = await ReadVarLongAsync(cancellationToken).ConfigureAwait(false);

if (recordCount <= 0)
{
throw new AvroFormatException($"invalid block record count {recordCount}", blockStartOffset);
}

if (recordCount > int.MaxValue)
{
throw new AvroFormatException($"block record count {recordCount} exceeds supported maximum {int.MaxValue}", blockStartOffset);
}

if (byteSize <= 0 || byteSize > _maxBlockSize)
{
throw new AvroFormatException($"invalid block byte size {byteSize} (max {_maxBlockSize})", blockStartOffset);
}
Comment thread src/IcebergSharp.Avro/Internal/Decode/NullableReader.cs Outdated
Comment on lines +56 to +64
case IcebergFieldIds.AddedFilesCount:
addedFilesCount = checked((int)NullableReader.ReadLongRequiredOrZero(ref decoder, field.Schema));
break;
case IcebergFieldIds.ExistingFilesCount:
existingFilesCount = checked((int)NullableReader.ReadLongRequiredOrZero(ref decoder, field.Schema));
break;
case IcebergFieldIds.DeletedFilesCount:
deletedFilesCount = checked((int)NullableReader.ReadLongRequiredOrZero(ref decoder, field.Schema));
break;
Comment on lines +247 to +256
private async ValueTask<long?> TryReadVarLongAsync(CancellationToken cancellationToken)
{
// Returns null on clean EOF before any byte is read; otherwise reads a full varint.
var firstByte = new byte[1];
var n = await _stream.ReadAsync(firstByte.AsMemory(0, 1), cancellationToken).ConfigureAwait(false);
if (n == 0)
{
return null;
}

Comment thread .github/workflows/ci.yml
Comment on lines +83 to +85
dotnet-version: |
8.0.x
9.0.x
@AndreaBozzo AndreaBozzo merged commit 1c4e379 into main May 15, 2026
6 checks passed
@AndreaBozzo AndreaBozzo deleted the ab/avro branch May 15, 2026 09:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request .NET Pull requests that update .NET code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants