Pool the dictionary buffer when training a Zstandard dictionary#129125

Open
alinpahontu2912 wants to merge 1 commit into dotnet:main from alinpahontu2912:alin/zstd-dictionary-pool

@alinpahontu2912
Member

ZstandardDictionary.Train allocates new byte[maxDictionarySize] on every call to hold the trained dictionary returned by the native ZDICT_trainFromBuffer. Upstream zstd guidance puts a reasonable dictionary size at ~100 KB and the CLI defaults to 112,640 bytes — both above the .NET 85,000-byte LOH threshold. So for the recommended-and-up sizes, every training call costs a fresh Large Object Heap allocation that survives until the next gen2 collection. This change rents the buffer from ArrayPool<byte>.Shared and returns it (with clearArray: true) once the trained dictionary has been copied into its own array.
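
The rent, copy, return sequence described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `PooledTrainingSketch`, `FillDictionary`, and the delegate standing in for the native ZDICT_trainFromBuffer call are hypothetical names introduced only for this example.

```csharp
using System;
using System.Buffers;

static class PooledTrainingSketch
{
    // Hypothetical stand-in for the native training call: fills `destination`
    // and returns the number of bytes written.
    public delegate int FillDictionary(Span<byte> destination);

    public static byte[] Train(int maxDictionarySize, FillDictionary fill)
    {
        // Rent may hand back an array longer than requested, so only the first
        // maxDictionarySize bytes are exposed to the trainer.
        byte[] rented = ArrayPool<byte>.Shared.Rent(maxDictionarySize);
        try
        {
            int written = fill(rented.AsSpan(0, maxDictionarySize));

            // Copy the trained slice into an exact-sized array owned by the caller.
            return rented.AsSpan(0, written).ToArray();
        }
        finally
        {
            // clearArray: true zeroes the buffer before it goes back to the pool.
            ArrayPool<byte>.Shared.Return(rented, clearArray: true);
        }
    }
}
```

Because the result is copied out before the `finally` block runs, the caller never holds a reference to pooled memory.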

Why dictionary sizes are LOH-heavy

Upstream zstd (zdict.h):

"A reasonable dictionary size, the dictBufferCapacity, is about 100KB. The zstd CLI defaults to a 110KB dictionary."

CLI manpage (zstd.1.md): --maxdict=# default is 112640 bytes.

The managed wrapper repeats this guidance (ZstandardDictionary.cs:85-87) and only guards the lower bound (ThrowIfLessThan(maxDictionarySize, 256, ...) at line 128) — there is no upper-bound check, so MB-sized buffers are also legal.

The default .NET LOH threshold is 85,000 bytes, so the recommended ~100 KB dictionary and the 112,640-byte CLI default both land on the LOH on every call.
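
One way to check this claim locally is to ask the GC which generation a freshly allocated array is reported in; LOH objects are reported in the highest generation. The sketch below assumes default GC settings (no custom LOH threshold), and `LohCheckSketch` is a hypothetical helper name.

```csharp
using System;

static class LohCheckSketch
{
    // Returns true when a freshly allocated byte[] of the given length is
    // reported in the highest generation, which is how LOH objects show up.
    public static bool LandsOnLoh(int length)
    {
        byte[] buffer = new byte[length];
        return GC.GetGeneration(buffer) == GC.MaxGeneration;
    }
}
```

With default settings, `LandsOnLoh(112_640)` should be true and `LandsOnLoh(4_096)` false, matching the sizes discussed above.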

Empirical evidence

I modelled the exact allocation pattern (new byte[N] vs Rent/Return) and measured per-call allocations with GC.GetAllocatedBytesForCurrentThread (Release build, 1,000 iterations, workstation GC, .NET 11 preview):

| Buffer size | new byte[] per call | Rent per call (steady state) | Reduction | Hits LOH? |
|---|---|---|---|---|
| 4 KB | 4,120 B | 0 B | 100 % | no |
| 50 KB | 50,024 B | 0 B | 100 % | no |
| 100 KB (zstd recommended) | 102,424 B | 0 B | 100 % | YES |
| 110 KB (zstd CLI default) | 112,664 B | 0 B | 100 % | YES |
| 1 MB | 1,048,600 B | 0 B | 100 % | YES |

For the recommended/default dictionary sizes, every Train call previously produced a ~100 KB LOH allocation that only a gen2 collection could reclaim. After this change, steady-state Train allocations for the dictionary buffer drop to zero.

Reproducer

Program.cs (run from a scratch net11.0 console project, dotnet run -c Release):

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
using System;
using System.Buffers;
using System.Runtime.CompilerServices;

int[] sizes = { 4_096, 50_000, 102_400, 112_640, 1_048_576 };
const int Iterations = 1_000;

foreach (int size in sizes)
{
    // Warm up both paths (JIT compilation, pool priming) before measuring.
    BeforePath(size);
    AfterPath(size);
    GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();

    long b0 = GC.GetAllocatedBytesForCurrentThread();
    for (int i = 0; i < Iterations; i++) BeforePath(size);
    long b1 = GC.GetAllocatedBytesForCurrentThread();

    GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();

    long a0 = GC.GetAllocatedBytesForCurrentThread();
    for (int i = 0; i < Iterations; i++) AfterPath(size);
    long a1 = GC.GetAllocatedBytesForCurrentThread();

    Console.WriteLine($"size={size,9:N0}  new/iter={(b1 - b0) / Iterations,9:N0} B  rent/iter={(a1 - a0) / Iterations,9:N0} B");
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void BeforePath(int n)
{
    byte[] b = new byte[n];
    b[0] = 1;
    GC.KeepAlive(b);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void AfterPath(int n)
{
    byte[] b = ArrayPool<byte>.Shared.Rent(n);
    try
    {
        b[0] = 1;
        GC.KeepAlive(b);
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(b, clearArray: true);
    }
}

Output (workstation GC, single thread):

size=    4,096  new/iter=    4,120 B  rent/iter=        0 B
size=   50,000  new/iter=   50,024 B  rent/iter=        0 B
size=  102,400  new/iter=  102,424 B  rent/iter=        0 B
size=  112,640  new/iter=  112,664 B  rent/iter=        0 B
size=1,048,576  new/iter=1,048,600 B  rent/iter=        0 B

The 24-byte overhead per allocation is the fixed cost of a byte[] on a 64-bit runtime (object header with the sync block, method table pointer, and the length field with padding). Rent reports 0 bytes/iter because the pool returns the same array each iteration after warm-up. The pre-PR path stays on the SOH for the first two rows and lands on the LOH for the last three.
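
The per-call figures can be reproduced with simple arithmetic. The sketch below assumes a 64-bit runtime with a 24-byte fixed overhead for a byte[] and 8-byte object alignment; `ArraySizeSketch` and `EstimatedHeapSize` are hypothetical names, and the estimate is not an official API.

```csharp
static class ArraySizeSketch
{
    // Assumptions: 64-bit runtime, 24 bytes of fixed overhead per byte[],
    // and object sizes rounded up to a multiple of 8 bytes.
    private const int Overhead = 24;
    private const int Alignment = 8;

    public static long EstimatedHeapSize(int length)
    {
        long raw = Overhead + (long)length;
        return (raw + Alignment - 1) / Alignment * Alignment;
    }
}
```

Under these assumptions, `EstimatedHeapSize(112_640)` gives 112,664, which matches the new/iter value reported for the CLI default size.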

ZstandardDictionary.Train allocated 'new byte[maxDictionarySize]' on every call. Dictionary sizes are typically tens to hundreds of KB (zstd suggests about 100 KB, but the API allows more), so each training call paid for a fresh GC allocation that often landed on the LOH.

Rent the buffer from ArrayPool<byte>.Shared instead. Create copies the trained slice into an exact-sized array before returning, so the rented buffer can be returned immediately. Use clearArray: true on Return because the trained dictionary is derived from caller-supplied samples and must not linger in the shared pool.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment

Pull request overview

This PR reduces repeated large allocations in ZstandardDictionary.Train by renting the native training output buffer from ArrayPool<byte>.Shared instead of allocating a new byte[] every call, then returning the rented buffer after copying the trained dictionary into its own array.

Changes:

  • Replace per-call new byte[maxDictionarySize] with ArrayPool<byte>.Shared.Rent(maxDictionarySize) for the native training output buffer.
  • Ensure the rented dictionary buffer is returned in a finally block, currently using clearArray: true to avoid retaining caller-derived data in the pool.

Comment on lines +152 to +153
// Clear before returning: the trained dictionary is derived from caller-supplied samples.
ArrayPool<byte>.Shared.Return(dictionaryBuffer, clearArray: true);
Member

Suggested change
- // Clear before returning: the trained dictionary is derived from caller-supplied samples.
- ArrayPool<byte>.Shared.Return(dictionaryBuffer, clearArray: true);
+ ArrayPool<byte>.Shared.Return(dictionaryBuffer);

We practically never clear buffers outside of crypto, so I don't see a reason to do it here.

ZstandardUtils.ThrowIfError(dictSize);
return Create(dictionaryBuffer.AsSpan(0, (int)dictSize));
}
finally
Member

We already have a try/finally for the lengthsArray buffer, can we avoid even more nesting by reusing the existing blocks?
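
For illustration, one way both rented buffers could share a single try/finally is sketched below. This is only a guess at the shape, since the real Train method is not shown in this conversation: `SingleFinallySketch`, `TrainCore`, and the element type chosen for the lengths array are hypothetical, and the delegate stands in for the native call.

```csharp
using System;
using System.Buffers;

static class SingleFinallySketch
{
    // Hypothetical stand-in for the native call: reads the sample lengths,
    // fills the dictionary buffer, and returns the number of bytes written.
    public delegate int TrainCore(ReadOnlySpan<nuint> sampleLengths, Span<byte> dictionary);

    public static byte[] Train(ReadOnlySpan<int> sampleLengths, int maxDictionarySize, TrainCore core)
    {
        // Both rentals happen before the try so one finally can return both.
        nuint[] lengthsArray = ArrayPool<nuint>.Shared.Rent(sampleLengths.Length);
        byte[] dictionaryBuffer = ArrayPool<byte>.Shared.Rent(maxDictionarySize);
        try
        {
            for (int i = 0; i < sampleLengths.Length; i++)
            {
                lengthsArray[i] = (nuint)sampleLengths[i];
            }

            int written = core(
                lengthsArray.AsSpan(0, sampleLengths.Length),
                dictionaryBuffer.AsSpan(0, maxDictionarySize));

            // Copy out the trained slice before the buffers go back to the pool.
            return dictionaryBuffer.AsSpan(0, written).ToArray();
        }
        finally
        {
            // No clearArray here, following the earlier review suggestion.
            ArrayPool<byte>.Shared.Return(dictionaryBuffer);
            ArrayPool<nuint>.Shared.Return(lengthsArray);
        }
    }
}
```

If the second Rent throws, the first array is simply left to the GC rather than returned to the pool, which trades a small missed reuse for one less level of nesting.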
