This document describes the public SQL functions provided by the h5db extension.
- Table Functions
- Scalar Functions
- Test Functions
- Remote Access
- Multi-File Inputs and Globbing
- Settings
- Type Mapping
- Error Handling
- Performance Notes
- Limitations
- See Also
Recursively lists namespace entries in an HDF5 file. The result is path-oriented:
each absolute namespace path is emitted as its own row, even when multiple paths
resolve to the same underlying object. Selected HDF5 attributes can be projected
as additional columns with h5_attr(...).
Parameters:
filename_or_filenames(VARCHAR or VARCHAR[]): Local path or remote URL to the HDF5 file, a glob pattern, or a list of exact filenames/URLs and/or glob patterns. See Remote Access and Multi-File Inputs and Globbing.projected_attributes(variadic, optional): Zero or more projected attribute markers:h5_attr()h5_attr(name)h5_attr(name, default_value)h5_alias(alias_name, h5_attr(...))
filename(BOOLEAN or VARCHAR, named, optional): Add a visible filename column.trueuses the column namefilename; a string uses the provided column name instead. Without this option, hidden virtualfilenameremains available by explicit reference.swmr(BOOLEAN, named, optional): Open in SWMR read mode (default:false)
Returns: Table with columns:
path(VARCHAR): Absolute namespace path for the emitted rowtype(VARCHAR): Entry type:groupdatasetdatatypelinkfor dangling or otherwise unresolved local linksexternalfor external links
dtype(VARCHAR): Data type for datasets (e.g.,int32,float64,string)shape(UBIGINT[]): Dataset dimensionsNULLfor non-dataset rows[]for scalar datasets[d0, d1, ...]for array datasets
- one additional column per projected attribute
h5_attr()produces columnh5_attrwith typeMAP(VARCHAR, VARIANT)h5_attr(name)produces columnnamewith typeVARIANTh5_attr(name, default_value)produces columnnamewith the type ofdefault_valueh5_alias(...)overrides the default output name
- hidden virtual column
filename(VARCHAR), available by explicit reference
Traversal Semantics:
/is returned as agrouprow.- Hard-link aliases, soft-link aliases, and other namespace aliases are path-complete: if two paths resolve to the same local object, both paths appear.
- Resolved group aliases are traversed recursively in a cycle-safe way.
- Dangling links are returned as
type = 'link'leaf rows. - External links are returned as
type = 'external'leaf rows and are not traversed. - Projected attributes and dataset metadata are populated only for resolved local rows. Unresolved and external rows use projected defaults.
Projected Attributes:
- Projected attributes are declared with
h5_attr(...)and may be renamed withh5_alias(...). - Only resolved local rows read attribute values. Unresolved and external rows use projected defaults.
- Projected output names must be unique under DuckDB's case-insensitive identifier matching; duplicate projected names
or collisions with
path,type,dtype, orshapefail at bind time. - See
h5_attr(...)below for bind-time rules, default handling, string behavior, and supported attribute forms.
Example:
SELECT * FROM h5_tree('data.h5');
SELECT * FROM h5_tree('data.h5', swmr := true);
SELECT * FROM h5_tree('runs/run_*.h5');
SELECT * FROM h5_tree(['runs/calibration.h5', 'runs/run_*.h5']);
-- Project an attribute as an extra column
SELECT path, type, NX_class
FROM h5_tree(
'data.h5',
h5_attr('NX_class', NULL::VARCHAR)
);
-- Distinguish identical paths across multiple files
SELECT filename, path
FROM h5_tree('runs/run_*.h5')
WHERE path = '/entry/data';
-- Project all attributes as one map-valued column
SELECT path, h5_attr
FROM h5_tree(
'data.h5',
h5_attr()
);
-- Rename a projected attribute column
SELECT path, type, time
FROM h5_tree(
'data.h5',
h5_alias('time', h5_attr('count_time', 0::DOUBLE))
);Lists the immediate children of a group. This is the shallow counterpart to
h5_tree(). The scalar form is documented below.
The table form:
SELECT * FROM h5_ls('data.h5');
SELECT * FROM h5_ls('data.h5', '/entry/instrument');returns the same row shape as h5_tree(), but only for the immediate children of
the requested group.
Parameters:
filename_or_filenames(VARCHAR or VARCHAR[]): Local path or remote URL to the HDF5 file, a glob pattern, or a list of exact filenames/URLs and/or glob patterns. See Remote Access and Multi-File Inputs and Globbing.group_path(VARCHAR, optional): Group path to list. Defaults to/in the table form.projected_attributes(variadic, optional): Zero or more projected attribute markers:h5_attr()h5_attr(name)h5_attr(name, default_value)h5_alias(alias_name, h5_attr(...))
filename(BOOLEAN or VARCHAR, named, optional): Add a visible filename column.trueuses the column namefilename; a string uses the provided column name instead. Without this option, hidden virtualfilenameremains available by explicit reference.swmr(BOOLEAN, named, optional): Open in SWMR read mode (default:false)
Returns: Table with the same row shape as h5_tree(), plus hidden virtual column filename (VARCHAR)
Semantics:
- the path must resolve to a group, otherwise the function errors
- only immediate children are returned; the requested group itself is not returned
- external groups can be listed if the external target resolves successfully
- row typing,
dtype, andshapeare the same ash5_tree() - projected attributes use the same
h5_attr(...)semantics - projected output names must be unique under DuckDB's case-insensitive identifier matching and must not collide with
path,type,dtype, orshape
Examples:
-- List the root group's immediate children
SELECT * FROM h5_ls('data.h5');
-- List one specific group
SELECT * FROM h5_ls('data.h5', '/entry/instrument');
-- List the same group across multiple files
SELECT * FROM h5_ls('runs/run_*.h5', '/entry/instrument');
-- Project attributes on the immediate children
SELECT path, NX_class
FROM h5_ls(
'data.h5',
'/entry/instrument',
h5_attr('NX_class', NULL::VARCHAR)
);Reads data from one or more datasets in an HDF5 file.
Parameters:
filename_or_filenames(VARCHAR or VARCHAR[]): Local path or remote URL to the HDF5 file, a glob pattern, or a list of exact filenames/URLs and/or glob patterns. See Remote Access and Multi-File Inputs and Globbing.dataset_path(VARCHAR or STRUCT): Dataset path(s) to read. Useh5_rse()for run-start encoded columnsh5_index()can be provided to add a virtual index column namedindexh5_alias(name, definition)can be used to rename a column definition- Additional dataset paths can be provided (variadic arguments)
filename(BOOLEAN or VARCHAR, named, optional): Add a visible filename column.trueuses the column namefilename; a string uses the provided column name instead. Without this option, hidden virtualfilenameremains available by explicit reference.swmr(BOOLEAN, named, optional): Open in SWMR read mode (default:false)
Returns: Table with one column per dataset, plus hidden virtual column filename (VARCHAR)
Column Naming: Column names are extracted from the last component of the dataset path, for example /group/data
becomes data. Output names must be unique under DuckDB's case-insensitive identifier matching. Use h5_alias(...)
when two dataset paths or generated columns would otherwise collide.
Scalar Datasets:
- Scalar (rank-0) datasets are returned as constant columns.
- If all selected datasets are scalar,
h5_read()returns a single row. - If any non-scalar dataset is present, scalar columns are broadcast to the row count of the non-scalar datasets.
- If multiple non-scalar regular datasets are selected, the output row count is the minimum outer dimension among them.
Multi-file Semantics:
- Rows are concatenated file by file.
h5_index()is the outermost-dimension row index within each matched file.- Duplicate filename matches are preserved.
filenameidentifies which file produced each row.- All matched files must have compatible column definitions.
Type Support:
- Numeric: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64
- Strings: fixed-length and variable-length
- Arrays: 1D-4D multi-dimensional arrays
Examples:
-- Read a single dataset
SELECT * FROM h5_read('data.h5', '/measurements');
-- Read multiple datasets
SELECT * FROM h5_read('data.h5', '/timestamps', '/temperatures', '/pressures');
-- Read from nested groups
SELECT * FROM h5_read('data.h5', '/experiment1/group_a/data');
-- Read matching local or SFTP files
SELECT * FROM h5_read('runs/run_*.h5', '/counts');
SELECT * FROM h5_read('sftp://beamline.example.org/data/run_*.h5', '/entry/data');
-- Expand a list of exact files and/or patterns in order
SELECT * FROM h5_read(
['runs/calibration.h5', 'runs/run_*.h5'],
'/counts'
);
-- Materialize filename into SELECT *
SELECT *
FROM h5_read('runs/run_*.h5', '/counts', filename := true);
-- Mix regular and run-start encoded columns
SELECT * FROM h5_read(
'data.h5',
'/timestamp',
h5_rse('/status_run_starts', '/status_values')
);
-- Add a virtual index column
SELECT * FROM h5_read('data.h5', h5_index(), '/measurements');
-- Rename a column definition
SELECT * FROM h5_read('data.h5', h5_alias('idx', h5_index()), '/measurements');
-- Enable SWMR read mode
SELECT * FROM h5_read('data.h5', '/measurements', swmr := true);Reads attributes from an object or the file root.
Parameters:
filename_or_filenames(VARCHAR or VARCHAR[]): Local path or remote URL to the HDF5 file, a glob pattern, or a list of exact filenames/URLs and/or glob patterns. See Remote Access and Multi-File Inputs and Globbing.object_path(VARCHAR): Path to the target object, or/for rootfilename(BOOLEAN or VARCHAR, named, optional): Add a visible filename column.trueuses the column namefilename; a string uses the provided column name instead. Without this option, hidden virtualfilenameremains available by explicit reference.swmr(BOOLEAN, named, optional): Open in SWMR read mode (default:false)
Returns: One row per matched file where each non-virtual column represents one attribute
- Column names are the attribute names
- Column types match the attribute types (numeric, string, or arrays)
- If the target object has no attributes, the function raises
IO Error: Object has no attributes: ... in file: ... - Invalid UTF-8 string attribute values raise an error in
h5_attributes() - Multiple matched files must have the same attribute names, types, and order.
- Attribute output names must be unique under DuckDB's case-insensitive identifier matching.
- Hidden virtual column
filename(VARCHAR), available by explicit reference.
Type Support:
- Numeric scalars and string scalars
- Simple 1D numeric and string arrays, returned as DuckDB
LISTs - HDF5
H5T_ARRAY-typed attributes are supported when they are 1D - Empty numeric and string attribute arrays are supported
Examples:
-- Read attributes from a dataset
SELECT * FROM h5_attributes('data.h5', '/measurements');
-- Read the same object's attributes across many files
SELECT filename, units
FROM h5_attributes('runs/run_*.h5', '/measurements');
-- Read specific attributes
SELECT units, description FROM h5_attributes('data.h5', '/temperature');
-- Read attributes from a group
SELECT * FROM h5_attributes('data.h5', '/experiment1');
-- Read file-level (root) attributes
SELECT * FROM h5_attributes('data.h5', '/');
-- Enable SWMR read mode
SELECT * FROM h5_attributes('data.h5', '/measurements', swmr := true);
-- Materialize filename into SELECT *
SELECT *
FROM h5_attributes('runs/run_*.h5', '/measurements', filename := true);Planning-time helper that expands a filename input using the same h5db filename expansion rules as table-valued functions and returns the first expanded filename.
Use this when one part of a query should read only one representative file from a filename pattern or filename list, while another part of the query still reads every matched file.
Parameters:
filename_or_filenames(VARCHAR or VARCHAR[]): Local path or remote URL to an HDF5 file, a glob pattern, or a list of exact filenames/URLs and/or glob patterns. See Remote Access and Multi-File Inputs and Globbing.
Returns: VARCHAR
Semantics:
- An exact filename is returned unchanged.
- A single glob returns the first file after h5db's normal sorted glob expansion.
- A
VARCHAR[]input expands entries left-to-right and returns the first expanded filename. h5_first_file(...)is evaluated while the statement is bound, not while rows are executed. This is what lets it be used as an argument to typed table functions such ash5_read(...), which need a concrete file at bind time to infer their output schema.- The argument must be known while the query is planned, for example a literal, constant expression, or prepared statement parameter. Per-row column values are not supported.
- SFTP inputs use h5db's SFTP expansion rules.
- A glob pattern that matches no files raises an error.
Examples:
SELECT h5_first_file('runs/run_*.h5');
-- Read detector metadata once, while reading event data from every matched file.
WITH pixels AS (
FROM h5_read(
h5_first_file('runs/run_*.h5'),
'/entry/instrument/detector/pixel_id',
'/entry/instrument/detector/x',
'/entry/instrument/detector/y'
)
),
events AS (
FROM h5_read(
'runs/run_*.h5',
'/entry/events/pixel_id',
'/entry/events/time'
)
)
SELECT *
FROM events
JOIN pixels USING (pixel_id);Returns the immediate children of a group as a MAP(VARCHAR, STRUCT(...)) keyed
by child name. This is the scalar counterpart to the table-valued h5_ls().
Parameters:
filename(VARCHAR): File path or remote URL. May vary per row. See Remote Access.group_path(VARCHAR): Group path to list. May vary per row.projected_attributes(variadic, optional): Zero or moreh5_attr(...)orh5_alias(..., h5_attr(...))expressions. These must be constant expressions.
Returns: MAP(VARCHAR, STRUCT(path, type, dtype, shape, ...projected attrs...))
Semantics:
group_pathmust resolve to a group or the function errorsfilenameis used as a path or URL per input row; lists and glob patterns are not expandedNULLfilenameyieldsNULLNULLgroup_pathyieldsNULL- named parameters such as
swmr := trueorfilename := trueare not supported in the scalar form - projected attributes use the same
h5_attr(...)semantics as tableh5_ls() - projected output names must be unique under DuckDB's case-insensitive identifier matching and must not collide with
path,type,dtype, orshape
Examples:
SELECT h5_ls('data.h5', '/entry/instrument');
WITH groups(path) AS (
VALUES ('/'), ('/entry/instrument')
)
SELECT path, cardinality(h5_ls('data.h5', path))
FROM groups;
WITH files(filename) AS (
VALUES ('run_1.h5'), ('run_2.h5')
)
SELECT filename, cardinality(h5_ls(filename, '/entry/instrument'))
FROM files;Scalar variant of h5_ls() that forces swmr = true.
Parameters: Same as scalar h5_ls()
Returns: Same as scalar h5_ls()
Example:
SELECT h5_ls_swmr('data.h5', '/entry/instrument');Creates a run-start encoded (RSE) column specification for use with h5_read().
Parameters:
run_starts_path(VARCHAR): Path to dataset containing run start indicesvalues_path(VARCHAR): Path to dataset containing values for each run
Returns: STRUCT with fields {encoding, run_starts, values}
Requirements:
run_startsmust be an integer dataset, strictly increasingrun_startsandvaluesmust have the same length- At least one regular (non-RSE) column must be present in the
h5_read()call to determine total row count
Notes:
- If
run_starts[0] > 0, rows before the first run start are returned as NULLs.
Example:
SELECT * FROM h5_read(
'data.h5',
'/row_index', -- Regular column determines row count
h5_rse('/state_starts', '/state_vals') -- RSE column expands based on run_starts
);For detailed information on run-start encoding, see RSE_USAGE.md.
Adds a virtual index column for the outermost dimension when used with h5_read().
Parameters: none
Returns: STRUCT tag used by h5_read()
Default column name: index
When h5_read(...) reads multiple files, h5_index() still means the
outermost-dimension row index within the current file, so it starts at 0 for
each matched file.
Example:
SELECT index, measurements FROM h5_read('data.h5', h5_index(), '/measurements');Predicate pushdown: range-like filters on the index column (for example index >= 100, index BETWEEN 10 AND 20,
or comparison-cast forms with bind-time constants) are used to reduce I/O when the planner can normalize them into
simple comparisons.
Creates a projected-attribute definition for use with h5_tree() or h5_ls().
Parameters:
name(VARCHAR, optional): Attribute name to read. Any non-NULLbind-time-resolvedVARCHARexpression is allowed.default_value(ANY, optional): Bind-time constant default value. Its type becomes the output type of the projected column for single-attribute projection. If omitted,h5_attr(name)defaults toNULL::VARIANT.
Returns: STRUCT wrapper used by h5_tree() and h5_ls()
Bind-time rules:
h5_attr(name)is shorthand forh5_attr(name, NULL::VARIANT)h5_attr()defaults to output column nameh5_attrh5_attr(name)andh5_attr(name, default_value)default to output column namenameh5_alias(...)overrides the default output namenamemust resolve to a non-NULLVARCHARvalue at bind time- constant expressions such as
lower('STRING_ATTR')are allowed - row-dependent projected-attribute definitions are not allowed
- if
default_valueis provided, it must be a bind-time constant expression with a concrete type - typed
NULLdefaults such asNULL::VARCHARare allowed - untyped
NULLdefaults are rejected
Runtime behavior:
h5_attr()projects all attributes into oneMAP(VARCHAR, VARIANT)column namedh5_attr- on resolved rows with no attributes,
h5_attr()produces{} - on unresolved or external rows,
h5_attr()producesNULL - unsupported attribute values inside
h5_attr()becomeNULLmap entries - missing attributes use
default_valueorNULL::VARIANTwhen the default is omitted - present attributes are first converted using the same nominal HDF5-to-DuckDB type rules as
h5_attributes() - single-attribute projections are then cast to the type of
default_value
String behavior:
VARIANT-typed orBLOB-typed projected scalar string attributes preserve invalid UTF-8 values asBLOB- flexible projected string arrays preserve invalid elements as
BLOB, includingVARIANT[],VARIANT[N],BLOB[], andBLOB[N]outputs - text-typed projected attributes still fail on invalid UTF-8 string values
Supported attribute forms:
- numeric and string scalars
- simple 1D numeric and string arrays
- 1D
H5T_ARRAY-typed attributes - empty numeric and string arrays
- multidimensional attribute dataspaces are not supported
Example:
SELECT path, h5_attr
FROM h5_tree(
'data.h5',
h5_attr()
);
SELECT path, type, NX_class
FROM h5_tree(
'data.h5',
h5_attr('NX_class')
);
SELECT path, type, NX_class
FROM h5_tree(
'data.h5',
h5_attr('NX_class', NULL::VARCHAR)
);Renames a column definition when used with h5_read() or a projected-attribute
definition when used with h5_tree() or h5_ls().
Parameters:
name(VARCHAR): Column name to use in the outputdefinition(VARCHAR or STRUCT): A dataset path, a column definition likeh5_rse()orh5_index(), or a projected attribute definition fromh5_attr()
Returns: STRUCT wrapper used by h5_read(), h5_tree(), and h5_ls()
The alias name participates in the same case-insensitive duplicate-name checks as normal output column names.
Example:
SELECT * FROM h5_read(
'data.h5',
h5_alias('idx', h5_index()),
h5_alias('temp_c', '/temperature')
);
SELECT path, time
FROM h5_tree(
'data.h5',
h5_alias('time', h5_attr('count_time', 0::DOUBLE))
);These functions are provided for testing and verification purposes:
Returns the HDF5 library version being used.
All h5db functions that open HDF5 files accept local paths or remote URLs as filename.
- DuckDB-backed remote schemes such as
http://,https://,s3://,s3a://,s3n://,r2://,gcs://,gs://, andhf://are opened through DuckDB's filesystem stack. h5db auto-loads the required DuckDB extension when needed. sftp://URLs are handled by h5db's built-in SFTP backend.- Remote opens are immutable snapshot reads. This applies to both DuckDB-backed remote paths and SFTP.
- The table-valued h5db functions accept
swmr := truefor local files. Remote table-function opens accept the parameter, but remote paths are opened as immutable snapshots, soH5F_ACC_SWMR_READis not used there. - Remote reads use DuckDB's external file cache when
enable_external_file_cacheis enabled. HDF5 metadata reads and other small reads are cached as reusable byte ranges. Large raw dataset reads are routed through cached reads only up to a per-query budget, currently 200 MiB; after that h5db uses direct remote reads for the rest of the query.
Before using sftp:// URLs, create a DuckDB secret of type sftp whose scope matches the URL you want to access.
Example:
CREATE OR REPLACE SECRET beamline_sftp (
TYPE sftp,
SCOPE 'sftp://beamline.example.org/',
USERNAME 'alice',
USE_AGENT true,
KNOWN_HOSTS_PATH '/home/alice/.ssh/known_hosts'
);
SELECT * FROM h5_read(
'sftp://beamline.example.org/data/run001.h5',
'/entry/data'
);Required secret fields:
USERNAME- exactly one of
PASSWORD,KEY_PATH, orUSE_AGENT - at least one of
KNOWN_HOSTS_PATHorHOST_KEY_FINGERPRINT
Optional secret fields:
KEY_PASSPHRASE: passphrase for an encrypted private keyPORT: default22HOST_KEY_ALGORITHMS: comma-separated libssh2 host-key algorithms accepted during the SSH handshake, in preference order. If omitted, matching entries inKNOWN_HOSTS_PATHare used to prefer known host-key algorithms; otherwise libssh2 uses its default host-key algorithm list. Example:'ssh-ed25519,ecdsa-sha2-nistp256'
Notes:
HOST_KEY_FINGERPRINTaccepts OpenSSH-style SHA256 fingerprints (SHA256:<base64>) and SHA1 hex fingerprints.- Password auth via
PASSWORDand explicit key-file auth viaKEY_PATHwith optionalKEY_PASSPHRASEare also supported. USE_AGENTuses libssh2's SSH agent support. On Unix-like systems this uses the agent exposed throughSSH_AUTH_SOCK. On Windows, libssh2 uses its supported agent backends (for example Pageant/OpenSSH) when available.sftp://URLs must not include usernames or passwords.
The table-valued h5_tree(...), table-valued h5_ls(...), h5_read(...), and h5_attributes(...) accept:
- a single filename or URL (
VARCHAR) - a glob pattern (
VARCHAR) - a list of exact filenames/URLs and/or glob patterns (
VARCHAR[])
Globbing is supported for local paths, for sftp:// URLs, and for DuckDB-backed remote schemes when the underlying
DuckDB filesystem supports globbing.
Glob syntax:
*?- bracket classes such as
[12]or[a-z] - one recursive
**segment per pattern
- A single glob expands to lexicographically sorted matches.
- A
VARCHAR[]input expands entries left-to-right. - Duplicate files are preserved and are processed more than once.
- A pattern that matches no files raises an error.
h5_read(...)concatenates rows file by file.h5_attributes(...)emits one row per matched file.h5_index()is the outermost-dimension row index within each matched file.- Table-valued
h5_tree(...), table-valuedh5_ls(...),h5_read(...), andh5_attributes(...)expose a hidden virtualfilenamecolumn that can be referenced explicitly. filename := trueaddsfilenameto the visible output schema.filename := 'source_file'adds the same visible filename column but uses the provided column name instead offilename. In that case the column must be referenced assource_file; hiddenfilenameis not also available.- All matched files in
h5_read(...)must have compatible column definitions. - All matched files in
h5_attributes(...)must expose the same attributes in the same order with the same names and types. - Scalar
h5_ls(...)accepts one filename/URL expression per row and does not expand filename lists or glob patterns. - For local paths and DuckDB-backed remote schemes, glob expansion uses DuckDB's filesystem stack. For
sftp://URLs, glob expansion is handled by h5db's SFTP backend. - In both cases, glob expansion follows DuckDB's other multi-file reader semantics such as
read_parquet(...). In particular, recursive**does not traverse symlink directories. - On Windows, HDF5 1.14.6 with its native file driver fails to open a file symlink when the symlink itself is passed as the filename. h5db does not resolve those paths before calling HDF5, so final-component file symlink paths are not supported there.
h5_tree(...) and table-valued h5_ls(...) can apply selective filters on the virtual filename column before
opening each expanded file. This includes hidden filename filters and filters on the visible renamed column when using
filename := 'source_file'.
Useful shapes include simple comparisons, IN, LIKE, and filename predicates inside AND filters, as long as DuckDB
pushes the filter into the table function and the predicate can be evaluated from the filename alone. Dynamic filters
whose values come from joins, semi-joins, scalar subqueries, or later query blocks may still open all expanded files.
This file-open pruning currently applies only to h5_tree(...) and table-valued h5_ls(...). h5_read(...) and
h5_attributes(...) do not use filename filters to avoid opening files.
SELECT * FROM h5_tree('runs/**/*.h5');
SELECT * FROM h5_read(
['runs/run_001.h5', 'runs/run_*.h5'],
'/counts'
);
SELECT * FROM h5_read(
'sftp://beamline.example.org/data/run_*.h5',
'/entry/data'
);
SELECT filename, units
FROM h5_attributes('runs/run_*.h5', '/entry/data');
SELECT path, type
FROM h5_tree('runs/**/*.h5')
WHERE filename LIKE '%/run_042.h5';
SELECT source_file, path
FROM h5_ls('runs/run_*.h5', '/entry', filename := 'source_file')
WHERE source_file IN ('runs/run_001.h5', 'runs/run_002.h5');Default SWMR read mode for h5db functions when no explicit swmr := ... named
parameter is provided. Defaults to false.
Example:
SET h5db_swmr_default = true;Target batch size used by h5_read for numeric chunk caching and scan chunk sizing. Accepts DuckDB memory-size
strings such as '1MB', '8MB', '512KB'. Defaults to '1MB'. Values above 1GB are clamped to 1GB.
This is a target, not a hard memory cap. For HDF5 chunked datasets, h5_read may align cache chunks upward to the
dataset's first-dimension HDF5 chunk size.
Example:
SET h5db_batch_size = '4MB';| HDF5 Type | Size | DuckDB Type |
|---|---|---|
| H5T_INTEGER (signed) | 1 byte | TINYINT |
| H5T_INTEGER (signed) | 2 bytes | SMALLINT |
| H5T_INTEGER (signed) | 4 bytes | INTEGER |
| H5T_INTEGER (signed) | 8 bytes | BIGINT |
| H5T_INTEGER (unsigned) | 1 byte | UTINYINT |
| H5T_INTEGER (unsigned) | 2 bytes | USMALLINT |
| H5T_INTEGER (unsigned) | 4 bytes | UINTEGER |
| H5T_INTEGER (unsigned) | 8 bytes | UBIGINT |
| H5T_FLOAT | 4 bytes | FLOAT |
| H5T_FLOAT | 8 bytes | DOUBLE |
| H5T_STRING | variable/fixed | VARCHAR |
HDF5 arrays are mapped to DuckDB ARRAY types:
| HDF5 Shape | DuckDB Type |
|---|---|
| [N] | N rows of scalar |
| [N, M] | N rows of TYPE[M] |
| [N, M, P] | N rows of TYPE[P][M] |
| [N, M, P, Q] | N rows of TYPE[Q][P][M] |
Note: Arrays with more than 4 dimensions are not currently supported. Note: Multi-dimensional string datasets are not currently supported.
All functions provide clear error messages for common issues:
- File not found: "IO Error: Failed to open HDF5 file"
- Invalid path: "IO Error: Failed to open dataset/object: ... in file: ..."
- Unsupported type: "IO Error: Unsupported HDF5 type"
- Invalid RSE data: "IO Error: RSE run_starts must be strictly increasing: ... in file: ..."
- No attributes: "IO Error: Object has no attributes: ... in file: ..."
h5_readprojection pushdown: Only reads columns actually used by your query, skipping unused datasets entirely for significant performance gainsh5_readpredicate pushdown: Range-like filters with static constants are applied during scan for RSE andh5_index()columns to reduce I/O. Supported shapes include=,<,<=,>,>=,BETWEEN, bind-time-foldable RHS expressions, and comparison-cast forms that normalize to those operators. Unsupported boolean shapes such asOR,!=,IS DISTINCT FROM, or expressions likeindex + 1 > 10remain post-scan filters.h5_tree/h5_lsfilename filter pruning: Filters on the hidden or visible renamedfilenamecolumn can remove files from expanded glob/list inputs before the file is opened. This applies only toh5_tree(...)and table-valuedh5_ls(...), not toh5_read(...)orh5_attributes(...).- Chunked reading: Data is read in chunks with optimized cache management for memory efficiency
- Hyperslab selection: Uses HDF5's hyperslab selection for efficient partial reads
- RSE optimization: Run-start encoded data is expanded on-the-fly with O(1) amortized cost per row
- Parallel scanning:
h5_readcan scan different row ranges in parallel where the dataset layout and query allow it
- Compound types (HDF5 structs) are not currently supported
- Enum types are not currently supported
- Opaque, bitfield, reference, time-like, and non-string variable-length HDF5 type classes are not supported
- Datasets with >4 dimensions are not supported
- Multi-dimensional string datasets are not supported
- Attribute multidimensional dataspaces are not supported (only scalar and 1D)
- ../README.md - Project overview and quick start
- RSE_USAGE.md - Detailed guide to run-start encoding
- developer/DEVELOPER.md - Development and building guide