A DuckDB extension for scanning Lustre MDT (Metadata Target) devices directly from SQL. It reads ldiskfs-formatted MDT devices through the Lustre-patched libext2fs from Whamcloud e2fsprogs.
- DuckDB v1.5.1
- Lustre-patched
libext2fs(Whamcloude2fsprogs)
This extension requires the Lustre-patched e2fsprogs, not the standard distribution package.
Download: https://downloads.whamcloud.com/public/e2fsprogs/
git clone --recurse-submodules https://github.com/tsukuba-hpcs/lustre-query.git
cd lustre-query
makeBuild artifacts are written to build/release/extension/lustre_query/.
LOAD 'build/release/extension/lustre_query/lustre_query.duckdb_extension';All table functions accept either a single device path (VARCHAR) or a list of device paths (LIST(VARCHAR)).
| Signature | Description |
|---|---|
lustre_inodes(device) / lustre_inodes(devices) |
Inode metadata: FID, size, ownership, permissions, timestamps, and flags |
lustre_links(device) / lustre_links(devices) |
Hard-link entries: inode FID, parent directory FID, and filename |
lustre_layouts(device) / lustre_layouts(devices) |
LOV layout components: stripe layout, PFL/FLR/EC metadata, and pool information |
lustre_objects(device) / lustre_objects(devices) |
Per-stripe OST object placement |
lustre_dirstripe(device) / lustre_dirstripe(devices) |
Directory stripe parent rows for plain directories, DNE1 remote-parent cases, and DNE2 striped directories |
Scalar functions take the lookup target as the first argument and the device or device list as the second argument.
| Signature | Returns | Description |
|---|---|---|
lustre_fid2path(fid, device) / lustre_fid2path(fid, devices) |
LIST(VARCHAR) |
Resolves a FID to one or more filesystem paths by traversing LinkEA. Multiple paths can be returned for hard-linked files. |
lustre_path2fid(path, device) / lustre_path2fid(path, devices) |
VARCHAR |
Resolves a filesystem path to a Lustre FID by walking the directory tree from the MDT root. |
The public table functions accept these named parameters:
| Name | Type | Default | Description |
|---|---|---|---|
skip_no_fid |
BOOLEAN |
true |
Skip inodes that do not have a valid Lustre FID |
skip_no_linkea |
BOOLEAN |
true |
Skip inodes that do not have LinkEA metadata |
lustre_dirstripe and the optimizer's internal dirstripe fused scans force skip_no_linkea = false, because directory stripe resolution depends on link metadata.
-- Scan all inodes from one MDT
SELECT *
FROM lustre_inodes('/dev/mapper/mdt0');
-- Find large files owned by a specific user
SELECT fid, size, mtime
FROM lustre_inodes('/dev/mapper/mdt0')
WHERE uid = 1000 AND size > 1e9
ORDER BY size DESC;
-- Look up hard-link entries for a specific FID
SELECT *
FROM lustre_links('/dev/mapper/mdt0')
WHERE fid = '[0x200000401:0x1:0x0]';
-- Join inode metadata with directory entries
SELECT i.fid, i.size, l.name
FROM lustre_inodes('/dev/mapper/mdt0') AS i
JOIN lustre_links('/dev/mapper/mdt0') AS l
ON i.fid = l.fid;
-- Resolve a FID to one or more paths
SELECT fid, lustre_fid2path(fid, '/dev/mapper/mdt0') AS paths
FROM lustre_inodes('/dev/mapper/mdt0')
WHERE uid = 0 AND type = 'file';
-- Resolve a path across multiple MDTs
SELECT lustre_path2fid('/project/a/b/file.dat',
['/dev/mapper/mdt0', '/dev/mapper/mdt1']) AS fid;
-- Scan multiple MDTs
SELECT *
FROM lustre_inodes(['/dev/mapper/mdt0', '/dev/mapper/mdt1']);
-- DNE-aware logical directory entry counts
SELECT d.dir_fid, count(*) AS entry_count
FROM lustre_links(['/dev/mapper/mdt0', '/dev/mapper/mdt1']) AS l
JOIN lustre_dirstripe(['/dev/mapper/mdt0', '/dev/mapper/mdt1']) AS d
ON l.parent_fid = d.parent_fid
JOIN lustre_inodes(['/dev/mapper/mdt0', '/dev/mapper/mdt1']) AS i
ON i.fid = d.dir_fid
WHERE i.type = 'dir'
GROUP BY d.dir_fid;
-- Find directories whose parent lives on another MDT (DNE1)
SELECT dir_fid, dir_device, lma_incompat
FROM lustre_dirstripe(['/dev/mapper/mdt0', '/dev/mapper/mdt1'])
WHERE lma_incompat & 4 != 0; -- LMAI_REMOTE_PARENT- Parallel scans partition work by block group.
- Projection pushdown and column pruning avoid decoding unused fields.
- Equality and
INpredicates on FIDs are pushed down to OI lookups when possible. lustre_layoutsandlustre_objectsare most effective when the query filters onfid.lustre_dirstripenormalizes namespace-bearing directory stripe objects to logical directory identities and skips DNE1 agent inodes automatically.
The optimizer can replace common join shapes with internal fused scans. These internal table functions are implementation details; normal queries should use the public functions above.
| Query Pattern | Internal Rewrite | Effect |
|---|---|---|
lustre_links JOIN lustre_dirstripe ON parent_fid |
lustre_link_dirstripe |
Resolves parent directory stripe rows inline |
lustre_inodes JOIN lustre_dirstripe ON fid = dir_fid |
lustre_inode_dirstripe |
Scans directory inodes together with dirstripe rows |
lustre_links JOIN lustre_dirstripe JOIN lustre_inodes |
lustre_link_inode_dirstripe |
Replaces the multi-stage join with a single fused scan plus cached lookups |
| Column | Type | Description |
|---|---|---|
fid |
VARCHAR |
Lustre FID ([seq:oid:ver]) |
ino |
UBIGINT |
Inode number |
type |
VARCHAR |
File type (file, dir, link, ...) |
mode |
UINTEGER |
Permission bits |
nlink |
UINTEGER |
Hard-link count |
uid |
UINTEGER |
User ID |
gid |
UINTEGER |
Group ID |
size |
UBIGINT |
File size in bytes |
blocks |
UBIGINT |
Block count |
atime |
TIMESTAMP |
Access time |
mtime |
TIMESTAMP |
Modification time |
ctime |
TIMESTAMP |
Change time |
projid |
UINTEGER |
Project ID |
flags |
UINTEGER |
Inode flags |
device |
VARCHAR |
Source device path |
| Column | Type | Description |
|---|---|---|
fid |
VARCHAR |
Lustre FID |
parent_fid |
VARCHAR |
Parent directory FID |
name |
VARCHAR |
Filename within parent_fid |
device |
VARCHAR |
Source device path |
| Column | Type | Description |
|---|---|---|
fid |
VARCHAR |
Lustre FID |
comp_index |
UINTEGER |
Component index |
comp_id |
UINTEGER |
Component ID |
mirror_id |
USMALLINT |
Mirror ID |
comp_flags |
UINTEGER |
Component flags |
extent_start |
UBIGINT |
Start offset |
extent_end |
UBIGINT |
End offset |
pattern |
UINTEGER |
Stripe pattern |
stripe_size |
UINTEGER |
Stripe size in bytes |
stripe_count |
USMALLINT |
Stripe count |
stripe_offset |
USMALLINT |
First OST index |
pool |
VARCHAR |
OST pool name |
dstripe_count |
UTINYINT |
EC data stripe count (k) |
cstripe_count |
UTINYINT |
EC code stripe count (p) |
compr_type |
UTINYINT |
Compression type |
compr_lvl |
UTINYINT |
Compression level |
device |
VARCHAR |
Source device path |
| Column | Type | Description |
|---|---|---|
fid |
VARCHAR |
Lustre FID |
comp_index |
UINTEGER |
Component index |
stripe_index |
UINTEGER |
Stripe index within the component |
ost_idx |
UINTEGER |
OST index |
ost_oi_id |
UBIGINT |
OST object ID |
ost_oi_seq |
UBIGINT |
OST object sequence |
device |
VARCHAR |
Source device path |
lustre_dirstripe emits directory stripe parent rows. For plain directories the stripe object is the directory itself; for DNE2 striped directories it maps each shard object back to the logical master directory FID.
| Column | Type | Description |
|---|---|---|
dir_fid |
VARCHAR |
Logical directory FID (master FID for striped directories, self for plain directories) |
parent_fid |
VARCHAR |
Namespace-bearing directory stripe object FID (shard FID, or self for plain directories) |
dir_device |
VARCHAR |
Device where dir_fid lives |
parent_device |
VARCHAR |
Device where parent_fid lives |
master_mdt_index |
UINTEGER |
MDT index of the master directory |
stripe_index |
UINTEGER |
Stripe index (0 for master/plain directories) |
stripe_count |
UINTEGER |
Total stripe count (1 for plain directories) |
hash_type |
UINTEGER |
LMV hash type and flags |
layout_version |
UINTEGER |
LMV layout version |
source |
VARCHAR |
Row source: plain, master, or slave |
lma_incompat |
UINTEGER |
LMA incompat flags |
When multiple devices are provided, cross-MDT OI lookup resolves the device columns. Masters emit authoritative shard rows; slaves are emitted only as fallback when the master cannot be reached. DNE1 agent inodes (LMAI_AGENT) are skipped automatically.
| Flag | Value | Meaning |
|---|---|---|
LMAI_RELEASED |
0x01 |
File released (HSM) |
LMAI_AGENT |
0x02 |
Agent inode (DNE1 remote directory stub); skipped by lustre_dirstripe |
LMAI_REMOTE_PARENT |
0x04 |
Parent directory is on a remote MDT |
LMAI_STRIPED |
0x08 |
Striped directory (DNE2) |
LMAI_ORPHAN |
0x10 |
Orphan inode |