timeseries-table-format

Stop managing Parquet files. Start managing time-series tables.

A Rust-native table format that brings Delta Lake/Iceberg-style transactions
to time-series data—with built-in coverage tracking for gaps and overlaps.

Built in Rust. Python bindings available on PyPI.

Early MVP: APIs and on-disk layouts may change before v1.0.

Why This Exists

Delta Lake and Apache Iceberg are great for general-purpose analytics. But if you're working with time-series specifically, a few problems come up repeatedly—coverage ("do I have data for this range?"), gaps ("where are the missing windows?"), and overlap-safe ingestion ("did I already ingest this time window?"). This project bakes those time-series primitives into the table format.

Problem	Delta/Iceberg	This Project
"Do I have data for 2024-01-15 to 2024-03-20?"	Scan metadata or query	coverage_ratio_for_range(...) → instant
"Where are the gaps in my dataset?"	Write custom logic	max_gap_len_for_range(...) → built-in
"Will this append overlap existing data?"	Hope for the best	Automatic overlap detection
Deployment complexity	JVM/Spark ecosystem	Single Rust binary

This project is ideal for:

Backtesting systems that need gap-aware data loading
Sensor/IoT data pipelines with strict coverage requirements
Financial data stores where overlap = disaster
Learning how modern table formats work (well-documented internals!)

Key Features


ACID-like transactions	Append-only commit log with optimistic concurrency control—no more corrupted datasets from failed writes
Time-first layout	Timestamp column, entity partitioning, and configurable bucket granularity baked into the format
Coverage tracking	RoaringBitmap indexes answer "where are my gaps?" in milliseconds, not minutes
Overlap-safe appends	Automatic detection prevents accidental duplicate data ingestion
DataFusion integration	SQL queries with time-based segment pruning out of the box
Rust core + Python bindings	Rust-first core (CLI + libraries) with Python bindings for local workflows
Fast ingest	7–27× faster than ClickHouse/PostgreSQL on bulk loads and daily appends

Install (Python)

pip install timeseries-table-format

Python docs: https://mag1cfrog.github.io/timeseries-table-format/

Python Quickstart

TL;DR: Session.sql(...) returns a pyarrow.Table:

import timeseries_table_format as ttf

out = ttf.Session().sql("select 1 as x")
print(type(out))  # pyarrow.Table

Convert to Polars: import polars as pl; pl.from_arrow(out)

More examples:

python/examples/quickstart_create_append_query.py
python/examples/register_and_join_two_tables.py

End-to-end example (create → append → register 2 tables → join)

from __future__ import annotations

import tempfile
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

import timeseries_table_format as ttf


def _write_parquet_prices(path: Path) -> None:
    pq.write_table(
        pa.table(
            {
                "ts": pa.array([0, 3_600 * 1_000_000], type=pa.timestamp("us")),
                "symbol": pa.array(["NVDA", "NVDA"], type=pa.string()),
                "close": pa.array([1.0, 2.0], type=pa.float64()),
            }
        ),
        str(path),
    )


def _write_parquet_volumes(path: Path) -> None:
    pq.write_table(
        pa.table(
            {
                "ts": pa.array([0, 3_600 * 1_000_000], type=pa.timestamp("us")),
                "symbol": pa.array(["NVDA", "NVDA"], type=pa.string()),
                "volume": pa.array([10, 20], type=pa.int64()),
            }
        ),
        str(path),
    )


with tempfile.TemporaryDirectory() as d:
    base_dir = Path(d)

    prices_root = base_dir / "prices_tbl"
    prices = ttf.TimeSeriesTable.create(
        table_root=str(prices_root),
        time_column="ts",
        bucket="1h",
        entity_columns=["symbol"],
        timezone=None,
    )
    prices_seg = base_dir / "prices.parquet"
    _write_parquet_prices(prices_seg)
    prices.append_parquet(str(prices_seg))

    volumes_root = base_dir / "volumes_tbl"
    volumes = ttf.TimeSeriesTable.create(
        table_root=str(volumes_root),
        time_column="ts",
        bucket="1h",
        entity_columns=["symbol"],
        timezone=None,
    )
    volumes_seg = base_dir / "volumes.parquet"
    _write_parquet_volumes(volumes_seg)
    volumes.append_parquet(str(volumes_seg))

    sess = ttf.Session()
    sess.register_tstable("prices", str(prices_root))
    sess.register_tstable("volumes", str(volumes_root))

    out = sess.sql(
        """
        select p.ts as ts, p.symbol as symbol, p.close as close, v.volume as volume
        from prices p
        join volumes v
        on p.ts = v.ts and p.symbol = v.symbol
        order by p.ts
        """
    )
    print(type(out))  # pyarrow.Table
    print(out)

Core Concepts

Read this once—these terms show up throughout the project.

Bucket size (important): bucket=1h does not resample your data to “hourly”.

The bucket is the time grid used for coverage (“do I have data for this range?”) and overlap checks (“did I already ingest this time window?”).

Example: with bucket=1h, timestamps 10:05 and 10:55 fall into the same bucket (10:00–11:00). In v0.1, appending two rows for the same entity in the same bucket is treated as overlap and will be rejected.

Practical rule: choose a bucket that matches the granularity you expect to be unique per entity. Hourly bars → 1h, minute bars → 1m, etc. If you expect multiple rows per entity within an hour, don’t use 1h in v0.1.

Walkthrough: NVDA 1h + MA(5)

Fastest way to see the format end-to-end (no external services needed):

Ingest sample data (creates examples/nvda_table/):

cargo run -p timeseries-table-core --example ingest_nvda

Query with DataFusion + moving average window:

cargo run -p timeseries-table-datafusion --example query_nvda_ma

Example output:

+---------------------+--------+------------+
| ts                  | close  | ma_5       |
+---------------------+--------+------------+
| 2024-06-01T00:00:00 | 115.22 | 115.22     |
| 2024-06-01T01:00:00 | 115.55 | 115.385    |
| 2024-06-01T02:00:00 | 115.51 | 115.426667 |
| 2024-06-01T03:00:00 | 114.99 | 115.3175   |
| 2024-06-01T04:00:00 | 114.7  | 115.194    |
+---------------------+--------+------------+

Sample data lives at examples/data/nvda_1h_sample.csv (240 rows of NVDA 1h bars). The ingestion step writes a Parquet segment and appends it via the transaction log using optimistic concurrency.

Other Interfaces

Python users: see Install (Python) and Python Quickstart above.

Command-Line Interface (CLI)

# Install
cargo install timeseries-table-cli --bin tstable

# Create a table with 1-hour buckets
tstable create --table ./my_table --time-column ts --bucket 1h

# Append data (overlap-safe!)
tstable append --table ./my_table --parquet ./data.parquet

# Query with SQL
tstable query --table ./my_table --sql "SELECT * FROM my_table LIMIT 5"

See the CLI documentation for the full command reference.

Rust API

cargo add timeseries-table-format

[dependencies]
timeseries-table-format = "0.1"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
chrono = "0.4"

use chrono::{TimeZone, Utc};
use timeseries_table_format::{TableError, TableLocation, TimeSeriesTable};

#[tokio::main]
async fn main() -> Result<(), TableError> {
    let table = TimeSeriesTable::open(TableLocation::local("./my_table")).await?;

    let start = Utc.timestamp_opt(0, 0).single().unwrap();
    let end = Utc.timestamp_opt(120, 0).single().unwrap();

    let ratio = table.coverage_ratio_for_range(start, end).await?;
    println!("Coverage ratio: {:.1}%", ratio * 100.0);

    Ok(())
}

See timeseries-table-core for full API docs.

DataFusion Integration

[dependencies]
timeseries-table-format = "0.1"

See timeseries-table-datafusion for SQL query examples.

Performance Benchmarks

Benchmarked on 73M rows of NYC taxi data (bulk load + 90 days of daily appends):

Benchmark comparison chart

vs ClickHouse	Speedup
Bulk ingest	7.7×
Daily append	3.3×
Time-range scan	2.5×

vs PostgreSQL	Speedup
Bulk ingest	27×
Daily append	5.5×
Time-range scan	80×

_{Aggregation queries (GROUP BY, filtering) are competitive with ClickHouse. Delta + Spark Q1 is now 964ms with partitioned Delta. See full benchmark methodology and results.}

Architecture

Click to expand architecture details

A time-series table consists of:

Parquet segments on disk
Each segment holds a chunk of time-sorted data (e.g., 1h bars for a symbol).
Append-only metadata log (_timeseries_log/)
- JSON commit files (0000000001.json, 0000000002.json, ...) record segment additions/removals
- CURRENT pointer tracks the latest committed version
- Version-guard OCC: read version N → commit with expected_version=N → succeed only if CURRENT is still N
Table metadata with time index
- TableKind::TimeSeries(TimeIndexSpec) with timestamp column, entity columns, bucket granularity
- Schema info and creation timestamp
Coverage bitmaps (_coverage/)
- Segment- and table-level RoaringBitmap snapshots
- Enable O(1) overlap checks and gap queries without rescanning Parquet

Project Status

Current status and near-term roadmap:

Log-based metadata layer with version-guard OCC
Time-series table abstraction + range scans
Coverage snapshots + overlap-safe appends
CLI for table management and SQL queries
DataFusion TableProvider integration
End-to-end example with sample data
Compaction / segment merging
Time-travel queries

Contributing

Contributions welcome! This project is also a learning exercise in building table formats from scratch—if you're curious about the internals, the code is heavily commented.

License

MIT License — see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 1,276 Commits
.cargo		.cargo
.github/workflows		.github/workflows
bench		bench
crates		crates
docs-site		docs-site
docs		docs
examples		examples
python		python
scripts		scripts
.envrc.example		.envrc.example
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
cliff.toml		cliff.toml
clippy.toml		clippy.toml
release-plz.toml		release-plz.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

timeseries-table-format

Stop managing Parquet files. Start managing time-series tables.

Why This Exists

Key Features

Install (Python)

Python Quickstart

Core Concepts

Walkthrough: NVDA 1h + MA(5)

Other Interfaces

Command-Line Interface (CLI)

Rust API

DataFusion Integration

Performance Benchmarks

Architecture

Project Status

Further Reading

Contributing

License

About

Uh oh!

Releases 17

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

timeseries-table-format

Stop managing Parquet files. Start managing time-series tables.

Why This Exists

Key Features

Install (Python)

Python Quickstart

Core Concepts

Walkthrough: NVDA 1h + MA(5)

Other Interfaces

Command-Line Interface (CLI)

Rust API

DataFusion Integration

Performance Benchmarks

Architecture

Project Status

Further Reading

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 17

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages