Skip to content

mag1cfrog/timeseries-table-format

Repository files navigation

timeseries-table-format

Rust crates.io docs.rs PyPI Python Versions License

Ferris with timeseries-table-format

Stop managing Parquet files. Start managing time-series tables.

A Rust-native table format that brings Delta Lake/Iceberg-style transactions
to time-series data—with built-in coverage tracking for gaps and overlaps.

Built in Rust. Python bindings available on PyPI.

Early MVP: APIs and on-disk layouts may change before v1.0.


Why This Exists

Delta Lake and Apache Iceberg are great for general-purpose analytics. But if you're working with time-series specifically, a few problems come up repeatedly—coverage ("do I have data for this range?"), gaps ("where are the missing windows?"), and overlap-safe ingestion ("did I already ingest this time window?"). This project bakes those time-series primitives into the table format.

Problem Delta/Iceberg This Project
"Do I have data for 2024-01-15 to 2024-03-20?" Scan metadata or query coverage_ratio_for_range(...) → instant
"Where are the gaps in my dataset?" Write custom logic max_gap_len_for_range(...) → built-in
"Will this append overlap existing data?" Hope for the best Automatic overlap detection
Deployment complexity JVM/Spark ecosystem Single Rust binary

This project is ideal for:

  • Backtesting systems that need gap-aware data loading
  • Sensor/IoT data pipelines with strict coverage requirements
  • Financial data stores where overlap = disaster
  • Learning how modern table formats work (well-documented internals!)

Key Features

ACID-like transactions Append-only commit log with optimistic concurrency control—no more corrupted datasets from failed writes
Time-first layout Timestamp column, entity partitioning, and configurable bucket granularity baked into the format
Coverage tracking RoaringBitmap indexes answer "where are my gaps?" in milliseconds, not minutes
Overlap-safe appends Automatic detection prevents accidental duplicate data ingestion
DataFusion integration SQL queries with time-based segment pruning out of the box
Rust core + Python bindings Rust-first core (CLI + libraries) with Python bindings for local workflows
Fast ingest 7–27× faster than ClickHouse/PostgreSQL on bulk loads and daily appends

Install (Python)

pip install timeseries-table-format

Python docs: https://mag1cfrog.github.io/timeseries-table-format/


Python Quickstart

TL;DR: Session.sql(...) returns a pyarrow.Table:

import timeseries_table_format as ttf

out = ttf.Session().sql("select 1 as x")
print(type(out))  # pyarrow.Table

Convert to Polars: import polars as pl; pl.from_arrow(out)

More examples:

  • python/examples/quickstart_create_append_query.py
  • python/examples/register_and_join_two_tables.py
End-to-end example (create → append → register 2 tables → join)
from __future__ import annotations

import tempfile
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

import timeseries_table_format as ttf


def _write_parquet_prices(path: Path) -> None:
    pq.write_table(
        pa.table(
            {
                "ts": pa.array([0, 3_600 * 1_000_000], type=pa.timestamp("us")),
                "symbol": pa.array(["NVDA", "NVDA"], type=pa.string()),
                "close": pa.array([1.0, 2.0], type=pa.float64()),
            }
        ),
        str(path),
    )


def _write_parquet_volumes(path: Path) -> None:
    pq.write_table(
        pa.table(
            {
                "ts": pa.array([0, 3_600 * 1_000_000], type=pa.timestamp("us")),
                "symbol": pa.array(["NVDA", "NVDA"], type=pa.string()),
                "volume": pa.array([10, 20], type=pa.int64()),
            }
        ),
        str(path),
    )


with tempfile.TemporaryDirectory() as d:
    base_dir = Path(d)

    prices_root = base_dir / "prices_tbl"
    prices = ttf.TimeSeriesTable.create(
        table_root=str(prices_root),
        time_column="ts",
        bucket="1h",
        entity_columns=["symbol"],
        timezone=None,
    )
    prices_seg = base_dir / "prices.parquet"
    _write_parquet_prices(prices_seg)
    prices.append_parquet(str(prices_seg))

    volumes_root = base_dir / "volumes_tbl"
    volumes = ttf.TimeSeriesTable.create(
        table_root=str(volumes_root),
        time_column="ts",
        bucket="1h",
        entity_columns=["symbol"],
        timezone=None,
    )
    volumes_seg = base_dir / "volumes.parquet"
    _write_parquet_volumes(volumes_seg)
    volumes.append_parquet(str(volumes_seg))

    sess = ttf.Session()
    sess.register_tstable("prices", str(prices_root))
    sess.register_tstable("volumes", str(volumes_root))

    out = sess.sql(
        """
        select p.ts as ts, p.symbol as symbol, p.close as close, v.volume as volume
        from prices p
        join volumes v
        on p.ts = v.ts and p.symbol = v.symbol
        order by p.ts
        """
    )
    print(type(out))  # pyarrow.Table
    print(out)

Core Concepts

Read this once—these terms show up throughout the project.

Bucket size (important): bucket=1h does not resample your data to “hourly”.

The bucket is the time grid used for coverage (“do I have data for this range?”) and overlap checks (“did I already ingest this time window?”).

Example: with bucket=1h, timestamps 10:05 and 10:55 fall into the same bucket (10:00–11:00). In v0.1, appending two rows for the same entity in the same bucket is treated as overlap and will be rejected.

Practical rule: choose a bucket that matches the granularity you expect to be unique per entity. Hourly bars → 1h, minute bars → 1m, etc. If you expect multiple rows per entity within an hour, don’t use 1h in v0.1.


Walkthrough: NVDA 1h + MA(5)

Fastest way to see the format end-to-end (no external services needed):

  1. Ingest sample data (creates examples/nvda_table/):
cargo run -p timeseries-table-core --example ingest_nvda
  1. Query with DataFusion + moving average window:
cargo run -p timeseries-table-datafusion --example query_nvda_ma

Example output:

+---------------------+--------+------------+
| ts                  | close  | ma_5       |
+---------------------+--------+------------+
| 2024-06-01T00:00:00 | 115.22 | 115.22     |
| 2024-06-01T01:00:00 | 115.55 | 115.385    |
| 2024-06-01T02:00:00 | 115.51 | 115.426667 |
| 2024-06-01T03:00:00 | 114.99 | 115.3175   |
| 2024-06-01T04:00:00 | 114.7  | 115.194    |
+---------------------+--------+------------+

Sample data lives at examples/data/nvda_1h_sample.csv (240 rows of NVDA 1h bars). The ingestion step writes a Parquet segment and appends it via the transaction log using optimistic concurrency.


Other Interfaces

Python users: see Install (Python) and Python Quickstart above.

Command-Line Interface (CLI)

# Install
cargo install timeseries-table-cli --bin tstable

# Create a table with 1-hour buckets
tstable create --table ./my_table --time-column ts --bucket 1h

# Append data (overlap-safe!)
tstable append --table ./my_table --parquet ./data.parquet

# Query with SQL
tstable query --table ./my_table --sql "SELECT * FROM my_table LIMIT 5"

See the CLI documentation for the full command reference.

Rust API

cargo add timeseries-table-format
[dependencies]
timeseries-table-format = "0.1"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
chrono = "0.4"
use chrono::{TimeZone, Utc};
use timeseries_table_format::{TableError, TableLocation, TimeSeriesTable};

#[tokio::main]
async fn main() -> Result<(), TableError> {
    let table = TimeSeriesTable::open(TableLocation::local("./my_table")).await?;

    let start = Utc.timestamp_opt(0, 0).single().unwrap();
    let end = Utc.timestamp_opt(120, 0).single().unwrap();

    let ratio = table.coverage_ratio_for_range(start, end).await?;
    println!("Coverage ratio: {:.1}%", ratio * 100.0);

    Ok(())
}

See timeseries-table-core for full API docs.

DataFusion Integration

[dependencies]
timeseries-table-format = "0.1"

See timeseries-table-datafusion for SQL query examples.


Performance Benchmarks

Benchmarked on 73M rows of NYC taxi data (bulk load + 90 days of daily appends):

Benchmark comparison chart

vs ClickHouse Speedup
Bulk ingest 7.7×
Daily append 3.3×
Time-range scan 2.5×
vs PostgreSQL Speedup
Bulk ingest 27×
Daily append 5.5×
Time-range scan 80×

Aggregation queries (GROUP BY, filtering) are competitive with ClickHouse. Delta + Spark Q1 is now 964ms with partitioned Delta. See full benchmark methodology and results.


Architecture

high level architecture

Click to expand architecture details

A time-series table consists of:

  • Parquet segments on disk
    Each segment holds a chunk of time-sorted data (e.g., 1h bars for a symbol).

  • Append-only metadata log (_timeseries_log/)

    • JSON commit files (0000000001.json, 0000000002.json, ...) record segment additions/removals
    • CURRENT pointer tracks the latest committed version
    • Version-guard OCC: read version N → commit with expected_version=N → succeed only if CURRENT is still N
  • Table metadata with time index

    • TableKind::TimeSeries(TimeIndexSpec) with timestamp column, entity columns, bucket granularity
    • Schema info and creation timestamp
  • Coverage bitmaps (_coverage/)

    • Segment- and table-level RoaringBitmap snapshots
    • Enable O(1) overlap checks and gap queries without rescanning Parquet

Project Status

Current status and near-term roadmap:

  • Log-based metadata layer with version-guard OCC
  • Time-series table abstraction + range scans
  • Coverage snapshots + overlap-safe appends
  • CLI for table management and SQL queries
  • DataFusion TableProvider integration
  • End-to-end example with sample data
  • Compaction / segment merging
  • Time-travel queries

Further Reading


Contributing

Contributions welcome! This project is also a learning exercise in building table formats from scratch—if you're curious about the internals, the code is heavily commented.


License

MIT License — see LICENSE for details.