Merged
23 changes: 21 additions & 2 deletions .github/workflows/ci.yml
@@ -36,11 +36,16 @@ jobs:
uses: actions/setup-dotnet@v5
with:
global-json-file: global.json
# The SDK comes from global.json; these runtime installs are required
# because projects target net8.0 and net9.0.
dotnet-version: |
8.0.x
9.0.x

- name: NuGet package cache
# actions/cache can fail on Windows before restore even runs; keep the
# optimization on Unix runners and let Windows restore normally.
if: runner.os != 'Windows'
uses: actions/cache@v5
with:
path: ~/.nuget/packages
@@ -80,9 +85,23 @@ jobs:
- uses: actions/setup-dotnet@v5
with:
global-json-file: global.json
# dotnet format loads all target frameworks, so install both runtimes
# even though global.json chooses the SDK.
dotnet-version: |
8.0.x
9.0.x

- name: Restore
run: dotnet restore IcebergSharp.slnx

- name: Verify whitespace formatting
run: dotnet format whitespace IcebergSharp.slnx --verify-no-changes --no-restore

- name: Verify code style
run: dotnet format style IcebergSharp.slnx --verify-no-changes --severity info --no-restore

- name: Verify formatting
run: dotnet format IcebergSharp.slnx --verify-no-changes --severity info
- name: Verify analyzer fixes
run: dotnet format analyzers IcebergSharp.slnx --verify-no-changes --severity info --no-restore

pack:
name: pack (dry run)
1 change: 1 addition & 0 deletions .gitignore
@@ -8,6 +8,7 @@ publish/
*.suo
*.userosscache
*.sln.docstates
*.lscache

# IDE
.vs/
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -14,5 +14,17 @@ the 0.x line may include breaking changes; they will always be called out under
- Phase 0 scaffolding: solution layout, CI, license, scope and roadmap.
- Phase 1 core types and metadata: `Schema`, `IcebergType` hierarchy,
`TableMetadata` JSON parser, and fixture-based round-trip tests.
- Phase 2 Avro manifest reading: stream-based manifest-list and manifest readers,
Avro OCF `null` / `deflate` codecs, dynamic schema parsing, and Phase 2 smoke
coverage from metadata to manifests.

### Changed
- CI now enforces whitespace, style, and analyzer fixes as separate
`dotnet format` steps, each run with `--verify-no-changes`.

### Fixed
- Hardened Avro decoding and schema parsing against truncated schema JSON,
oversized encoded lengths, invalid block headers, and invalid logical-type
annotations.

[Unreleased]: https://github.com/AndreaBozzo/IcebergSharp/commits/main
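
Review note on the "oversized encoded lengths" entry: Avro writes `int` and `long` values as zigzag varints, so a length prefix read from untrusted input should be bounded before it drives an allocation or a slice. The sketch below is only an illustration of that guard. `AvroVarint`, `ReadLong`, `ReadLength`, and the `maxLength` parameter are hypothetical names, not the decoder added in this PR.

```csharp
using System;

// Hypothetical helper for illustration; not part of IcebergSharp.
public static class AvroVarint
{
    // Decodes one zigzag varint: 7 payload bits per byte, least-significant
    // group first, high bit set while more bytes follow.
    public static long ReadLong(ReadOnlySpan<byte> buffer, ref int offset)
    {
        ulong raw = 0;
        // A 64-bit value needs at most 10 groups (shifts 0, 7, ..., 63).
        for (var shift = 0; shift < 70; shift += 7)
        {
            if (offset >= buffer.Length)
            {
                throw new InvalidOperationException("truncated varint");
            }

            var b = buffer[offset++];
            raw |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
            {
                // Undo zigzag: even values are non-negative, odd values negative.
                return (long)(raw >> 1) ^ -(long)(raw & 1);
            }
        }

        throw new InvalidOperationException("varint longer than 10 bytes");
    }

    // Reads a length prefix and rejects values that are negative, above the
    // caller's cap, or larger than the bytes that remain in the buffer.
    public static int ReadLength(ReadOnlySpan<byte> buffer, ref int offset, int maxLength)
    {
        var length = ReadLong(buffer, ref offset);
        if (length < 0 || length > maxLength)
        {
            throw new InvalidOperationException($"encoded length {length} is outside 0..{maxLength}");
        }

        if (length > buffer.Length - offset)
        {
            throw new InvalidOperationException("encoded length exceeds remaining input");
        }

        return (int)length;
    }
}
```

Checking the length against both a fixed cap and the remaining input means a corrupt prefix fails fast instead of triggering a large rent from the array pool.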
23 changes: 16 additions & 7 deletions README.md
@@ -31,8 +31,10 @@ doesn't expose Iceberg's metadata to them. There's no native client that gives a
no embedded query engine — just metadata and Arrow batches you can hand to
DuckDB.NET, ML.NET, or Power BI.

> **Status:** Phase 0 — repository scaffolding only. No public API yet. See the
> [roadmap](#roadmap) for what is coming and when.
> **Status:** Phase 2 development. Core Iceberg metadata parsing and stream-based
> Avro manifest / manifest-list readers are implemented and covered by unit
> tests. Catalog, scan planning, file IO, and Parquet data reads are still on the
> roadmap.

---

@@ -133,6 +135,13 @@ var lastWeek = table.NewScan()
See [docs/compatibility-matrix.md](docs/compatibility-matrix.md) for the up-to-date
matrix of supported catalogs, table-format versions, and storage backends.

Current implemented surface:

- `IcebergSharp.Core`: Iceberg v1/v2 table metadata, schemas, partition specs,
sort orders, snapshots, and manifest domain models.
- `IcebergSharp.Avro`: stream-based Avro OCF readers for Iceberg manifest lists
and manifests, including `null` and `deflate` codecs.

Target servers for v1:

- Apache Polaris (reference implementation)
@@ -148,8 +157,8 @@ Target servers for v1:
| Phase | Weeks | Deliverable |
| --- | --- | --- |
| 0. Scaffolding | done | Repo, CI, license, solution layout |
| 1. Core types & metadata | 1-2 | `Schema`, `TableMetadata`, JSON parser |
| 2. Avro manifest reader | 3-4 | Custom mini Avro OCF reader for manifests |
| 1. Core types & metadata | done | `Schema`, `TableMetadata`, JSON parser |
| 2. Avro manifest reader | in progress | Custom mini Avro OCF reader for manifests |
| 3. REST catalog client | 5-6 | OAuth2 / Bearer / SigV4, dynamic discovery |
| 4. Scan planning & pruning | 7-8 | Partition + stats pruning, residual filters |
| 5. Parquet + schema evolution | 9-10 | Field-id resolution, add/drop/rename column |
@@ -180,9 +189,9 @@ column stats, and streams Parquet rows with field-id resolution.

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). The project is in early development the
fastest way to help is to try the prerelease packages once Phase 1 ships and report
incompatibilities against your specific catalog.
See [CONTRIBUTING.md](CONTRIBUTING.md). The project is in early development; the
fastest way to help right now is to try the metadata and manifest readers against
real Iceberg tables and report incompatible schemas, codecs, or manifest shapes.

---

11 changes: 8 additions & 3 deletions docs/compatibility-matrix.md
@@ -1,6 +1,7 @@
# Compatibility matrix

> Last updated: Phase 0 — entries marked _planned_ are targets, not validated yet.
> Last updated: Phase 2 development. Entries marked _planned_ are targets, not
> validated yet.

## Catalogs

@@ -18,8 +19,8 @@

| Spec version | Status |
| --- | --- |
| v1 | 🟡 planned for Phase 1 |
| v2 | 🟡 planned for Phase 1 (primary target) |
| v1 | ✅ metadata JSON + Avro manifests covered by unit fixtures |
| v2 | ✅ metadata JSON + Avro manifests covered by unit fixtures |
| v3 | 🟢 stretch goal — depends on spec stability |

## Storage backends
@@ -36,6 +37,7 @@

| Format | Read | Write |
| --- | --- | --- |
| Iceberg manifest Avro OCF | ✅ `null` + `deflate` codecs | ⛔ out of scope for v1 |
| Parquet | 🟡 planned for Phase 5 | ⛔ out of scope for v1 |
| ORC | ⛔ out of scope for v1 | ⛔ out of scope for v1 |
| Avro (data files, not manifests) | ⛔ out of scope for v1 | ⛔ out of scope for v1 |
@@ -44,6 +46,9 @@

| Feature | Status |
| --- | --- |
| Table metadata JSON parsing | ✅ validated with v1/v2 fixtures |
| Manifest-list reading | ✅ stream-based Avro OCF reader |
| Manifest reading | ✅ stream-based Avro OCF reader |
| Schema evolution (add / drop / rename / promote) | 🟡 planned for Phase 5 |
| Partition spec evolution | 🟡 planned for Phase 4 |
| Snapshot isolation / time travel | 🟡 planned for Phase 4 |
56 changes: 56 additions & 0 deletions src/IcebergSharp.Avro/Internal/Codec/DeflateCodec.cs
@@ -0,0 +1,56 @@
using System.Buffers;
using System.IO.Compression;

namespace IcebergSharp.Avro.Internal.Codec;

/// <summary>
/// Avro's <c>deflate</c> codec is raw DEFLATE (no zlib header). Wraps
/// <see cref="DeflateStream"/> in decompress mode over the source bytes.
/// </summary>
internal sealed class DeflateCodec : IBlockCodec
{
public static DeflateCodec Instance { get; } = new();
private DeflateCodec() { }

public string Name => "deflate";

public int Decode(ReadOnlySpan<byte> source, ref byte[] destination)
{
// DeflateStream reads from a Stream, not a span, so copy the source into
// a rented array and wrap it in a read-only MemoryStream. Renting is
// cheap relative to the decompression itself.
var sourceArr = ArrayPool<byte>.Shared.Rent(source.Length);
try
{
source.CopyTo(sourceArr);
using var src = new MemoryStream(sourceArr, 0, source.Length, writable: false);
using var deflate = new DeflateStream(src, CompressionMode.Decompress, leaveOpen: false);

var totalWritten = 0;
while (true)
{
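if (totalWritten >= destination.Length)
{
// Doubling a zero-length buffer would never grow it, and the
// zero-byte Read that follows would look like end of stream,
// so enforce a minimum size when growing.
var grown = ArrayPool<byte>.Shared.Rent(Math.Max(destination.Length * 2, 256));
Buffer.BlockCopy(destination, 0, grown, 0, totalWritten);
ArrayPool<byte>.Shared.Return(destination);
destination = grown;
}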

var n = deflate.Read(destination, totalWritten, destination.Length - totalWritten);
if (n == 0)
{
break;
}

totalWritten += n;
}

return totalWritten;
}
finally
{
ArrayPool<byte>.Shared.Return(sourceArr);
}
}
}
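
Review note on the "raw DEFLATE (no zlib header)" remark in the summary comment: the distinction matters because `DeflateStream` handles the bare DEFLATE bit stream, while `ZLibStream` (available since .NET 6) wraps the same stream in a zlib header and checksum. A minimal sketch, assuming only `System.IO.Compression`; `RawDeflateDemo` and its methods are hypothetical names and not part of this PR.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical demo class; not part of IcebergSharp.
public static class RawDeflateDemo
{
    // Produces a raw DEFLATE stream, the framing Avro's "deflate" codec uses.
    public static byte[] CompressRaw(byte[] data)
    {
        using var output = new MemoryStream();
        using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true))
        {
            deflate.Write(data, 0, data.Length);
        }

        return output.ToArray();
    }

    // Produces a zlib-wrapped stream: header bytes, DEFLATE body, checksum.
    public static byte[] CompressZlib(byte[] data)
    {
        using var output = new MemoryStream();
        using (var zlib = new ZLibStream(output, CompressionLevel.Optimal, leaveOpen: true))
        {
            zlib.Write(data, 0, data.Length);
        }

        return output.ToArray();
    }

    // Inflates a raw DEFLATE stream back to the original bytes.
    public static byte[] DecompressRaw(byte[] compressed)
    {
        using var input = new MemoryStream(compressed, writable: false);
        using var deflate = new DeflateStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        deflate.CopyTo(output);
        return output.ToArray();
    }
}
```

The zlib output starts with the byte `0x78`, which is a quick way to spot a block that was written with the wrong framing.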
14 changes: 14 additions & 0 deletions src/IcebergSharp.Avro/Internal/Codec/IBlockCodec.cs
@@ -0,0 +1,14 @@
namespace IcebergSharp.Avro.Internal.Codec;

internal interface IBlockCodec
{
/// <summary>Codec name as it appears in the OCF header (<c>null</c>, <c>deflate</c>).</summary>
string Name { get; }

/// <summary>
/// Decodes <paramref name="source"/> into <paramref name="destination"/>, resizing
/// <paramref name="destination"/> via the array pool if it isn't large enough.
/// Returns the number of valid bytes in <paramref name="destination"/>.
/// </summary>
int Decode(ReadOnlySpan<byte> source, ref byte[] destination);
}
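
Review note on the `ref byte[] destination` contract: because a codec may return the caller's array to the pool and swap in a larger one, the caller has to return whatever array is in the variable after the call, and must only read the first `n` bytes. A sketch of the caller side under those assumptions; `PooledDecodeDemo`, `CopyDecode`, and `DecodeBlock` are hypothetical stand-ins with the same method shape, not code from this PR.

```csharp
using System;
using System.Buffers;

// Hypothetical stand-in that mirrors the Decode signature in this diff.
public static class PooledDecodeDemo
{
    // Copies source into destination, swapping in a larger pooled array
    // when the current one is too small.
    public static int CopyDecode(ReadOnlySpan<byte> source, ref byte[] destination)
    {
        if (destination.Length < source.Length)
        {
            ArrayPool<byte>.Shared.Return(destination);
            destination = ArrayPool<byte>.Shared.Rent(source.Length);
        }

        source.CopyTo(destination);
        return source.Length;
    }

    // Caller side: rent, decode, copy out the valid prefix, then return the
    // array that is in the variable now, which may not be the one rented here.
    public static byte[] DecodeBlock(ReadOnlySpan<byte> block)
    {
        var buffer = ArrayPool<byte>.Shared.Rent(16);
        try
        {
            var written = CopyDecode(block, ref buffer);
            // Rented arrays can be longer than requested, so slice to `written`.
            return buffer.AsSpan(0, written).ToArray();
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

Keeping the rent and the return in the same method, with the return in `finally`, avoids returning a stale array after the codec has replaced it.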
23 changes: 23 additions & 0 deletions src/IcebergSharp.Avro/Internal/Codec/NullCodec.cs
@@ -0,0 +1,23 @@
using System.Buffers;

namespace IcebergSharp.Avro.Internal.Codec;

internal sealed class NullCodec : IBlockCodec
{
public static NullCodec Instance { get; } = new();
private NullCodec() { }

public string Name => "null";

public int Decode(ReadOnlySpan<byte> source, ref byte[] destination)
{
if (destination.Length < source.Length)
{
ArrayPool<byte>.Shared.Return(destination);
destination = ArrayPool<byte>.Shared.Rent(source.Length);
}

source.CopyTo(destination);
return source.Length;
}
}
68 changes: 68 additions & 0 deletions src/IcebergSharp.Avro/Internal/Decode/AvroToIcebergType.cs
@@ -0,0 +1,68 @@
using IcebergSharp.Avro.Internal.Schema;
using IcebergSharp.Types;

namespace IcebergSharp.Avro.Internal.Decode;

/// <summary>
/// Maps an <see cref="AvroSchema"/> node back to the Iceberg type that produced
/// it. Only used for partition columns — the manifest's partition record has a
/// dynamic schema and Phase 4 wants <see cref="IcebergType"/>s on the boxed
/// partition values.
/// </summary>
internal static class AvroToIcebergType
{
public static IcebergType Resolve(AvroSchema schema)
{
// Peel off a nullable union to get to the carrier.
if (schema is AvroUnion u)
{
schema = u.NonNull;
}

return schema switch
{
AvroPrimitive p => ResolvePrimitive(p),
AvroFixed f => ResolveFixed(f),
_ => throw new NotSupportedException($"partition columns cannot have Avro schema kind {schema.GetType().Name}"),
};
}

private static IcebergType ResolvePrimitive(AvroPrimitive p)
{
return p.LogicalType switch
{
AvroLogicalType.Date => DateType.Instance,
AvroLogicalType.TimeMicros or AvroLogicalType.TimeMillis => TimeType.Instance,
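// NOTE: Iceberg's Avro mapping writes both timestamp and timestamptz as
// timestamp-micros and tells them apart with the adjust-to-utc property;
// the unit-based mapping below does not consult that property.
AvroLogicalType.TimestampMicros => TimestampTzType.Instance,
AvroLogicalType.TimestampMillis => TimestampType.Instance,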
AvroLogicalType.Uuid => UuidType.Instance,
AvroLogicalType.Decimal => new DecimalType(p.DecimalPrecision, p.DecimalScale),
_ => p.Kind switch
{
AvroPrimitiveKind.Boolean => BooleanType.Instance,
AvroPrimitiveKind.Int => IntType.Instance,
AvroPrimitiveKind.Long => LongType.Instance,
AvroPrimitiveKind.Float => FloatType.Instance,
AvroPrimitiveKind.Double => DoubleType.Instance,
AvroPrimitiveKind.Bytes => BinaryType.Instance,
AvroPrimitiveKind.String => StringType.Instance,
_ => throw new NotSupportedException($"unsupported Avro primitive {p.Kind} for partition column"),
},
};
}

private static IcebergType ResolveFixed(AvroFixed f)
{
if (f.LogicalType == AvroLogicalType.Decimal)
{
return new DecimalType(f.DecimalPrecision, f.DecimalScale);
}

if (f.LogicalType == AvroLogicalType.Uuid || f.Size == 16)
{
return UuidType.Instance;
}

return new FixedType(f.Size);
}
}