vgi-sklearn

Train and run machine-learning models in pure SQL. vgi-sklearn exposes scikit-learn to DuckDB as ordinary SQL functions — so you can scale features, cluster, detect outliers, train a classifier, and score new rows without leaving your query. No Python notebook, no CSV export, no glue code: your table goes in, predictions come out, and the model can live in a DuckDB column.

-- one-time: load the extension and attach the worker
INSTALL vgi FROM community; LOAD vgi;
ATTACH 'sklearn' (TYPE vgi, LOCATION 'vgi-sklearn');   -- from a checkout, use LOCATION 'uv run sklearn_worker.py' instead

-- train a model on a table and score rows, all in SQL
CREATE TABLE flowers AS SELECT * FROM sklearn.datasets.iris();

CREATE TABLE model AS
  SELECT model FROM sklearn.estimators.fit_random_forest_classifier(
    (SELECT sepal_length_cm, sepal_width_cm, petal_length_cm, petal_width_cm, target FROM flowers),
    target := 'target', n_estimators := 200);

SET VARIABLE m = (SELECT model FROM model);
SELECT sample_id, prediction
FROM sklearn.models.predict((SELECT * EXCLUDE (target) FROM flowers), model := getvariable('m'), id := 'sample_id')
LIMIT 5;

That's the whole loop: fit_… returns a trained model, predict scores a table through it. Everything below is variations on that theme.


How it works (read this first — it's quick)

Every modeling function follows the same SQL-friendly contract:

  • Your input table is the feature matrix. You pass it as a subquery — sklearn.preprocessing.pca((SELECT ...), …). (DuckDB allows a table function only one subquery argument, so the data goes there; everything else is a named arg.)
  • Named arguments use :=, for example n_clusters := 3, target := 'label'.
  • target names your label column (training only). Every other column you select is a feature — so for fit, just SELECT your features and the target; don't include an identifier column. Numeric and boolean columns are used as-is; string columns are treated as categorical and one-hot-encoded automatically (the encoding is stored with the model, so predict re-applies it). Need the encoding as data instead? ordinal_encoder / one_hot_encoder expose it directly.
  • id is for joining results back to their source rows, so only the per-row functions need it. predict, the transforms, and cross_val_predict emit one row per input row and copy your id onto each, so a plain JOIN ... USING (id) reattaches results to the source. fit returns a single summary row, so there is nothing to join back and it needs no id; just leave the identifier out of the SELECT. (fit does accept an optional id := for the convenience of passing a wide projection like SELECT *: it then drops that column from the features so the model doesn't train on a key.)
  • Features are matched by name, not position. A model trained on age, income scores correctly whether you feed it income, age or a table with extra columns — it pulls its own features by name and errors if one is missing.

If you know GROUP BY and subqueries, you already know how to use this.
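
The by-name matching rule can be pictured with a small Python sketch. This is only an illustration of the contract described above, not the worker's code; align_features and the stored feature list are hypothetical names.

```python
def align_features(row: dict, feature_names: list[str]) -> list:
    """Pick a model's features out of a row by name, in training order."""
    missing = [name for name in feature_names if name not in row]
    if missing:
        # Mirrors the documented behavior: a missing feature is an error.
        raise KeyError(f"missing feature(s): {', '.join(missing)}")
    # Extra columns in `row` are ignored; the order follows the model.
    return [row[name] for name in feature_names]


trained_on = ["age", "income"]  # hypothetical feature list stored with a model
reordered = {"income": 52000.0, "age": 41, "customer_id": 7}
aligned = align_features(reordered, trained_on)
print(aligned)  # [41, 52000.0]
```

Column order and extra columns do not affect the result; only a missing name does.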


Recipes

Train a model

Each estimator has its own fit_<estimator> function that exposes its real hyperparameters as typed, named SQL arguments (they autocomplete and are type-checked):

-- a classifier (your own table: customers you've labeled as churned 0/1)
CREATE TABLE churn AS
  SELECT sample_id AS customer_id, sepal_length_cm AS tenure, sepal_width_cm AS monthly_spend,
         petal_length_cm AS support_tickets, (target = 0)::INT AS churned
  FROM sklearn.datasets.iris();

SELECT estimator, task, n_samples, n_features, train_score
FROM sklearn.estimators.fit_gradient_boosting_classifier(
  -- just the features + the target; leave customer_id out (it isn't a feature)
  (SELECT tenure, monthly_spend, support_tickets, churned FROM churn),
  model_name := 'churn_gb',          -- store it in the registry under this name
  target := 'churned',               -- the label column; everything else is a feature
  n_estimators := 300,
  learning_rate := 0.05,
  max_depth := 3);

fit returns one summary row (and the model itself as a BLOB); it doesn't echo your rows, so it needs no id — you simply don't select the identifier. If it's easier to pass a wide projection that already includes an id, add id := and fit will keep that column out of the features:

-- a regressor; SELECT * includes diabetes()'s sample_id, so name it as the id
SELECT estimator, task, train_score
FROM sklearn.estimators.fit_random_forest_regressor(
  (SELECT * FROM sklearn.datasets.diabetes()),
  model_name := 'diabetes_rf', target := 'target',
  id := 'sample_id',                       -- keeps sample_id out of the features
  n_estimators := 400, max_depth := 0);    -- max_depth := 0 means "no limit"

Available estimators (each is sklearn.estimators.fit_<name>):

| Family | Functions | Common typed args |
| --- | --- | --- |
| Linear | logistic_regression, linear_regression, ridge, lasso, elastic_net, ridge_classifier, sgd_classifier/_regressor, bayesian_ridge, huber_regressor, quantile_regressor | C, alpha, l1_ratio, max_iter, fit_intercept, penalty, solver, loss |
| GLMs | poisson_regressor, gamma_regressor, tweedie_regressor | alpha, power, max_iter, fit_intercept |
| Trees / ensembles | decision_tree_classifier/_regressor, random_forest_classifier/_regressor, extra_trees_classifier/_regressor, gradient_boosting_classifier/_regressor, hist_gradient_boosting_classifier/_regressor, ada_boost_classifier/_regressor, bagging_classifier/_regressor | n_estimators, max_depth, learning_rate, min_samples_split, subsample, max_samples, random_state |
| SVM | svc, svr, linear_svc, linear_svr | C, kernel, gamma, degree, epsilon, loss |
| Neighbors | knn_classifier, knn_regressor | n_neighbors, weights, p |
| Neural net | mlp_classifier, mlp_regressor | hidden_units, alpha, max_iter, learning_rate_init |
| Naive Bayes | gaussian_nb, multinomial_nb, bernoulli_nb, complement_nb | var_smoothing, alpha, fit_prior, binarize |
| Discriminant | lda, qda | solver, tol, reg_param |

Need a hyperparameter that isn't exposed as a typed argument? The generic sklearn.models.fit((SELECT ...), estimator := 'ridge', target := 'y', params := '{"alpha": 0.3, "solver": "svd"}') accepts any scikit-learn parameter as a JSON object.

Build a pipeline (preprocess → model in one artifact)

fit_pipeline chains preprocessing steps and a final estimator, fits them together, and stores the result as a single model — so it trains and serves without leakage, and you score it with the same predict (no separate apply):

SELECT model_name, estimator, n_features
FROM sklearn.models.fit_pipeline(
  (SELECT tenure, monthly_spend, support_tickets, churned FROM churn),
  steps := '[{"kind": "simple_imputer", "params": {"strategy": "median"}},
             {"kind": "standard_scaler"},
             {"kind": "pca", "params": {"n_components": 3}}]',
  estimator := 'logistic_regression', target := 'churned', model_name := 'churn_pipe');

-- predict (and cross_val_predict, permutation_importance, ...) work as usual
SELECT * FROM sklearn.models.predict((SELECT * FROM new_customers), model_name := 'churn_pipe', id := 'customer_id');

steps is a JSON array of {kind, params}; kind is any stored-transformer kind (standard_scaler, simple_imputer, pca, truncated_svd, …). String features are one-hot-encoded ahead of the steps automatically.
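
Because steps travels as a JSON string, a malformed array only surfaces when the query runs. A client-side check like the following Python sketch can catch that earlier. It assumes only the {kind, params} shape described above; parse_steps is a hypothetical helper, and the worker's own validation is not shown in this README.

```python
import json

steps_json = """[{"kind": "simple_imputer", "params": {"strategy": "median"}},
                 {"kind": "standard_scaler"},
                 {"kind": "pca", "params": {"n_components": 3}}]"""


def parse_steps(text: str) -> list[tuple[str, dict]]:
    """Parse a steps string into (kind, params) pairs, defaulting params to {}."""
    steps = json.loads(text)
    if not isinstance(steps, list):
        raise ValueError("steps must be a JSON array")
    parsed = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict) or "kind" not in step:
            raise ValueError(f"step {i} needs a 'kind'")
        parsed.append((step["kind"], step.get("params", {})))
    return parsed


parsed_steps = parse_steps(steps_json)
```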

Every fit_… call returns the trained model as a model BLOB column and, when you pass model_name, saves it to the registry. So you choose where the model lives (see Where models live).

Score new data

predict streams a table through a stored model. It carries your id through and appends prediction:

-- from a registry model by name
SELECT customer_id, prediction
FROM sklearn.models.predict(
  (SELECT customer_id, tenure, monthly_spend, support_tickets FROM churn),
  model_name := 'churn_gb', id := 'customer_id');

Add with_proba := true to also get one probability column per class (proba_0, proba_1, …):

SELECT customer_id, prediction, proba_1 AS churn_probability
FROM sklearn.models.predict(
  (SELECT customer_id, tenure, monthly_spend, support_tickets FROM churn),
  model_name := 'churn_gb', id := 'customer_id', with_proba := true)
WHERE proba_1 > 0.5;

Evaluate a model honestly (no leakage, nothing stored)

cross_val_predict returns out-of-fold predictions — each row scored by a model that didn't see it — which you then compare to the truth with the metric functions:

SELECT sklearn.metrics.accuracy_score(c.churned, p.prediction) AS cv_accuracy
FROM sklearn.models.cross_val_predict(
       (SELECT customer_id, tenure, monthly_spend, support_tickets, churned FROM churn),
       estimator := 'gradient_boosting_classifier', target := 'churned', id := 'customer_id', cv := 5) p
JOIN churn c ON c.customer_id = p.customer_id;
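
To make "out-of-fold" concrete, here is a minimal Python sketch with a toy majority-class model standing in for a real estimator. The interleaved fold assignment (row index modulo cv) is an assumption chosen for brevity; it is not how scikit-learn's KFold splits rows.

```python
from collections import Counter


def cross_val_predict_majority(labels: list, cv: int) -> list:
    """Out-of-fold predictions from a toy majority-class model."""
    folds = [i % cv for i in range(len(labels))]
    predictions = [None] * len(labels)
    for f in range(cv):
        # Train only on rows outside fold f ...
        train = [y for y, fold in zip(labels, folds) if fold != f]
        majority = Counter(train).most_common(1)[0][0]
        # ... and predict only for rows inside fold f.
        for i, fold in enumerate(folds):
            if fold == f:
                predictions[i] = majority
    return predictions


labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
oof = cross_val_predict_majority(labels, cv=5)
accuracy = sum(p == y for p, y in zip(oof, labels)) / len(labels)
```

Every row receives exactly one prediction, and that prediction comes from a model that never saw the row's own label.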

Prefer the held-out score per fold (mean ± spread)? cross_val_score returns one row per fold:

SELECT avg(score) AS mean_cv, stddev(score) AS sd
FROM sklearn.models.cross_val_score(
       (SELECT tenure, monthly_spend, support_tickets, churned FROM churn),
       estimator := 'gradient_boosting_classifier', target := 'churned', cv := 5);

Want full control of the evaluation loop? The splitters assign each row a fold (kfold, stratified_kfold, group_kfold, timeseries_split), so you build the cross-validation in pure SQL — train on fold <> f, test on fold = f:

-- attach a stratified fold id to every row, then evaluate however you like
WITH folds AS (
  SELECT * FROM sklearn.models.stratified_kfold(
    (SELECT customer_id, churned FROM churn), id := 'customer_id', label := 'churned', n_splits := 5))
SELECT fold, count(*) FROM folds GROUP BY fold ORDER BY fold;
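
The idea behind a stratified splitter can be sketched in Python as a round-robin over each class. This is a simplified stand-in, not scikit-learn's StratifiedKFold algorithm (which differs in detail and can shuffle); stratified_folds is a hypothetical name.

```python
from collections import Counter, defaultdict


def stratified_folds(ids, labels, n_splits):
    """Assign each id a fold so every fold gets a similar class mix."""
    next_slot = defaultdict(int)  # running counter per class
    folds = {}
    for row_id, label in zip(ids, labels):
        folds[row_id] = next_slot[label] % n_splits
        next_slot[label] += 1
    return folds


ids = list(range(10))
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
folds = stratified_folds(ids, labels, n_splits=2)
# Class mix per fold: count of (fold, label) pairs.
mix = Counter((folds[i], y) for i, y in zip(ids, labels))
```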

Which features matter? (permutation importance)

permutation_importance shuffles each feature in turn and measures the drop in a stored model's score — model-agnostic, so it works for any estimator:

SELECT feature, round(importance_mean, 4) AS importance
FROM sklearn.models.permutation_importance(
       (SELECT * FROM churn), model_name := 'churn_gb', target := 'churned')
ORDER BY importance DESC;
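
The shuffle-and-rescore idea can be sketched in plain Python. The toy model, the n_repeats and seed parameters, and the helper names are assumptions for illustration; this is not the worker's implementation.

```python
import random


def accuracy(model, rows, targets):
    return sum(model(r) == t for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=5, seed=0):
    """Mean drop in accuracy when one feature's values are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, targets)
    importances = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[feature] for r in rows]
            rng.shuffle(shuffled)
            permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, targets))
        importances[feature] = sum(drops) / n_repeats
    return importances


# Toy model that only looks at `tenure`; `noise` is never used.
model = lambda row: int(row["tenure"] < 12)
pairs = [(1, 5), (3, 2), (24, 9), (36, 1), (6, 7), (48, 3)]
rows = [{"tenure": t, "noise": n} for t, n in pairs]
targets = [model(r) for r in rows]
importance = permutation_importance(model, rows, targets)
```

A feature the model never reads gets an importance of exactly zero, because shuffling it cannot change any prediction.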

For a quick model-free filter, select_k_best scores each feature against the target (ANOVA F, mutual information, or chi²) and flags the top k; variance_threshold drops near-constant features. Both return one row per feature, so you pick the winners in SQL:

SELECT feature FROM sklearn.preprocessing.select_k_best(
         (SELECT tenure, monthly_spend, support_tickets, churned FROM churn),
         target := 'churned', k := 2)
WHERE selected;

Vectorize text

count_vectorizer and tfidf_vectorizer tokenize a text column into a document-term matrix in long format — (id, term, value) — which you pivot, join, or rank in SQL:

-- the 5 highest-weighted terms per document
SELECT id, term, value
FROM sklearn.preprocessing.tfidf_vectorizer((SELECT id, body FROM docs), id := 'id', text := 'body')
QUALIFY row_number() OVER (PARTITION BY id ORDER BY value DESC) <= 5;
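
The long (id, term, value) layout can be sketched in Python as follows. The regular expression is scikit-learn's default token_pattern; the sorted term order and the count_vectorize_long name are assumptions for illustration.

```python
import re
from collections import Counter

TOKEN = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default token_pattern


def count_vectorize_long(docs):
    """Return (id, term, value) rows: one per term that occurs in a document."""
    rows = []
    for doc_id, body in docs:
        counts = Counter(TOKEN.findall(body.lower()))
        for term in sorted(counts):
            rows.append((doc_id, term, counts[term]))
    return rows


docs = [(1, "DuckDB runs SQL. SQL is fun."), (2, "scikit-learn in SQL")]
long_rows = count_vectorize_long(docs)
```

Only terms that actually occur produce a row, which is why the long format stays compact even with a large vocabulary.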

Tune hyperparameters (grid search)

grid_search cross-validates every combination of the hyperparameters you list and returns the leaderboard. The estimator and its grid are one tagged-union argument — union_value(<estimator> := {param: [values], …}) — so you only ever see the hyperparameters that estimator actually has:

SELECT params, round(mean_test_score, 3) AS score, rank
FROM sklearn.models.grid_search(
  (SELECT tenure, monthly_spend, support_tickets, churned FROM churn),
  target := 'churned',
  estimator := union_value(gradient_boosting_classifier := {
    'n_estimators': [100, 300],
    'max_depth':    [2, 3],
    'learning_rate':[0.05, 0.1]}))
ORDER BY rank;
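
The number of candidates is the product of the list lengths, so the grid above yields 2 × 2 × 2 = 8 combinations. A short Python sketch of the expansion (expand_grid is a hypothetical helper, shown only to illustrate the counting):

```python
from itertools import product

grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}


def expand_grid(grid):
    """Every combination of the listed hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


candidates = expand_grid(grid)
print(len(candidates))  # 8
```

Each candidate is cross-validated, so the total number of fits grows multiplicatively with every hyperparameter you add.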

Only the hyperparameters you list are searched; the rest stay at their defaults. The refit best model is attached as a model BLOB on the single best row — grab it with WHERE model IS NOT NULL, or pass model_name := to also store it:

CREATE TABLE best AS
SELECT model FROM sklearn.models.grid_search(
  (SELECT tenure, monthly_spend, support_tickets, churned FROM churn),
  target := 'churned',
  estimator := union_value(svc := {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}))
WHERE model IS NOT NULL;

SET VARIABLE m = (SELECT model FROM best);
SELECT * FROM sklearn.models.predict((SELECT * FROM new_customers), model := getvariable('m'), id := 'customer_id');

grid_search uses union-typed arguments and needs a vgi-python with union-tag-preserving decoding (newer than 0.8.2). Against an older vgi-python the function is simply not registered.

Train a model per group (segment)

Sometimes you want one model per segment — per region, per cohort, per product line. fit_model is an aggregate, so GROUP BY does the partitioning for free and you get one model per group in a single query. Features go in as a named STRUCT; the target can be numeric or string class labels.

-- one churn model per segment
CREATE TABLE segment_models AS
SELECT (customer_id % 3) AS segment,       -- use your real segment column
       sklearn.models.fit_model({'tenure': tenure, 'monthly_spend': monthly_spend, 'support_tickets': support_tickets},
                         churned, estimator := 'gradient_boosting_classifier', hyperparams := '{}') AS m
FROM churn
GROUP BY segment;

SELECT segment, m.task, m.n_samples, round(m.train_score, 3) FROM segment_models;

m is a STRUCT holding the model BLOB plus diagnostics (task, n_samples, n_features, n_classes, train_score). To score, the prediction functions are scalars that take a per-row model BLOB and a feature struct — so each row is scored by its group's model via a plain join:

SELECT c.customer_id,
       sklearn.models.predict_class_one(m.m.model,
         {'tenure': c.tenure, 'monthly_spend': c.monthly_spend, 'support_tickets': c.support_tickets}) AS prediction
FROM churn c
JOIN segment_models m ON (c.customer_id % 3) = m.segment;

The scalar prediction functions are:

  • predict_one(model, features) → DOUBLE — regression / numeric class.
  • predict_class_one(model, features) → VARCHAR — the class label as text (works for string and numeric labels).
  • predict_proba_one(model, features) → DOUBLE[] — per-class probabilities.

Features align by name (reorder-safe; a missing feature errors), and a model trained on string labels predicts string labels. The model BLOB is the same format fit/grid_search produce, so these scalars also score any model you've stored in a table or registry.
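
The aggregate-then-join pattern amounts to "group, fit one model per group, look up each row's model". A toy Python sketch, where the "model" is just the group's mean target and all names are hypothetical:

```python
from collections import defaultdict
from statistics import fmean


def fit_per_group(rows, group_key, target):
    """Fit one toy model (the mean target) per group."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r[group_key]].append(r[target])
    return {g: fmean(ys) for g, ys in grouped.items()}


rows = [
    {"segment": "a", "y": 1.0},
    {"segment": "a", "y": 3.0},
    {"segment": "b", "y": 10.0},
]
models = fit_per_group(rows, "segment", "y")
# Score each row with its own group's model (the "join").
scored = [(r["segment"], models[r["segment"]]) for r in rows]
```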

Score predictions you already have

The metric functions are plain aggregates over two columns — point them at any table of (actual, predicted) and group however you like:

-- one score, or one per segment/model with GROUP BY
SELECT sklearn.metrics.r2_score(actual, predicted) AS r2,
       sklearn.metrics.mean_absolute_error(actual, predicted) AS mae
FROM my_predictions;

-- a full confusion matrix in long form
SELECT * FROM sklearn.metrics.confusion_matrix(
  (SELECT label AS y, predicted AS yhat FROM my_predictions),
  actual := 'y', predicted := 'yhat');
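
A long-form confusion matrix is a count per (actual, predicted) pair. The Python sketch below shows that shape and derives accuracy from it. The (actual, predicted, count) tuple layout is an assumption; this README does not list the function's output column names.

```python
from collections import Counter


def confusion_long(actual, predicted):
    """One (actual, predicted, count) row per pair that occurs."""
    counts = Counter(zip(actual, predicted))
    return sorted((a, p, n) for (a, p), n in counts.items())


actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]
matrix = confusion_long(actual, predicted)
# Accuracy is the share of rows on the diagonal (actual == predicted).
accuracy = sum(n for a, p, n in matrix if a == p) / len(actual)
```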

Prepare / transform features

All transforms take your table as a subquery, carry id through, and run fit_transform over the whole input:

-- standardize features (zero mean, unit variance)
SELECT * FROM sklearn.preprocessing.standard_scaler(
  (SELECT customer_id, tenure, monthly_spend, support_tickets FROM churn), id := 'customer_id');

-- reduce to 2 components for plotting
SELECT * FROM sklearn.preprocessing.pca(
  (SELECT customer_id, tenure, monthly_spend, support_tickets FROM churn),
  id := 'customer_id', n_components := 2);

-- fill missing values before modeling
SELECT * FROM sklearn.preprocessing.simple_imputer((SELECT ...), id := 'id', strategy := 'median');

Transforms compose — pipe one into the next as nested subqueries (scale, then cluster).

These all refit on whatever you pass them. To fit a transformer once and reuse it — scale your training data and apply the same shift/scale to new data, without leakage — use fit_transformer / apply_transform (the transform analogue of fit / predict):

-- fit a scaler on training data and store it
SELECT * FROM sklearn.preprocessing.fit_transformer(
  (SELECT tenure, monthly_spend, support_tickets FROM churn_train),
  transformer_name := 'churn_scaler', kind := 'standard_scaler');

-- apply the stored scaler to new data (uses the training mean/variance)
SELECT * FROM sklearn.preprocessing.apply_transform(
  (SELECT customer_id, tenure, monthly_spend, support_tickets FROM churn_new),
  transformer_name := 'churn_scaler', id := 'customer_id');

kind is any of standard_scaler, minmax_scaler, robust_scaler, maxabs_scaler, normalizer, power_transformer, quantile_transformer, simple_imputer, binarizer, kbins_discretizer, pca, truncated_svd (parameters via a JSON params :=). Like fit/predict, fit_transformer also returns a portable BLOB, and apply_transform accepts transformer := instead of a registry name; list_transformers / drop_transformer manage the registry.
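
Why fit once and apply later? The Python sketch below standardizes new data with statistics learned only from training rows. It mirrors StandardScaler's use of the population standard deviation; fit_scaler and apply_scaler are hypothetical helpers, not the worker's functions.

```python
from statistics import fmean, pstdev


def fit_scaler(rows, features):
    """Learn (mean, scale) per feature from training rows only."""
    stats = {}
    for f in features:
        col = [r[f] for r in rows]
        # A zero spread is replaced by 1.0 so constant columns don't divide by zero.
        stats[f] = (fmean(col), pstdev(col) or 1.0)
    return stats


def apply_scaler(stats, row):
    """Scale a new row with the stored training statistics."""
    return {f: (row[f] - mean) / scale for f, (mean, scale) in stats.items()}


train = [{"tenure": 2.0}, {"tenure": 4.0}, {"tenure": 6.0}]
scaler = fit_scaler(train, ["tenure"])
scaled_new = apply_scaler(scaler, {"customer_id": 9, "tenure": 8.0})
```

The new row is shifted by the training mean (4.0), not by its own batch's mean, which is what keeps training and serving consistent.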

Encode categorical (string) columns

fit/predict already one-hot-encode string features for you, but you can also materialize the encoding. ordinal_encoder keeps a fixed width (one integer code column per feature); one_hot_encoder emits long format, one row per active cell (id, feature, category, value), which sidesteps the unknown width of a wide one-hot:

-- integer codes, one column per categorical feature
SELECT * FROM sklearn.preprocessing.ordinal_encoder(
  (SELECT customer_id, plan, region FROM customers), id := 'customer_id');

-- one row per active category; pivot it back to a wide matrix in SQL
PIVOT sklearn.preprocessing.one_hot_encoder(
        (SELECT customer_id, plan FROM customers), id := 'customer_id')
  ON category USING sum(value) GROUP BY customer_id;

A NULL/unseen value encodes to -1 (ordinal) or contributes no active cell (one-hot).
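
Both encodings and the NULL/unseen rule can be sketched in a few lines of Python. Sorted category order and the helper names are assumptions for illustration.

```python
def fit_categories(values):
    """Distinct non-null categories, in sorted order."""
    return sorted({v for v in values if v is not None})


def ordinal_encode(value, categories):
    # NULL or unseen values map to -1, as described above.
    return categories.index(value) if value in categories else -1


def one_hot_long(rows, id_col, feature, categories):
    """One (id, feature, category, value) row per active cell."""
    return [(r[id_col], feature, r[feature], 1)
            for r in rows if r[feature] in categories]


customers = [
    {"customer_id": 1, "plan": "pro"},
    {"customer_id": 2, "plan": "free"},
    {"customer_id": 3, "plan": None},
]
plans = fit_categories(r["plan"] for r in customers)
codes = [ordinal_encode(r["plan"], plans) for r in customers]
cells = one_hot_long(customers, "customer_id", "plan", plans)
```

Customer 3 has no plan, so it gets code -1 and contributes no row to the one-hot output.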

Cluster & find outliers

-- k-means: appends a `cluster` label per row
SELECT customer_id, cluster
FROM sklearn.preprocessing.kmeans(
  (SELECT customer_id, tenure, monthly_spend, support_tickets FROM churn),
  id := 'customer_id', n_clusters := 4);

-- isolation forest: appends `anomaly_score` and `is_outlier`
SELECT customer_id, anomaly_score
FROM sklearn.preprocessing.isolation_forest(
  (SELECT customer_id, tenure, monthly_spend, support_tickets FROM churn),
  id := 'customer_id', contamination := 0.05)
WHERE is_outlier = 1;

Get sample data to play with

scikit-learn's bundled datasets are table functions — handy for trying things or building demos:

SELECT * FROM sklearn.datasets.iris();
SELECT * FROM sklearn.datasets.make_blobs(n_samples := 300, centers := 4);   -- synthetic clusters

Function reference

Datasets (no input): iris, wine, digits, breast_cancer, diabetes, california_housing, and generators make_classification, make_regression, make_blobs, make_moons, make_circles.

Models: fit_<estimator> (typed, see the table above), generic fit (escape hatch with JSON params), fit_pipeline (preprocessing steps + estimator as one model), predict, cross_val_predict, cross_val_score (per-fold held-out scores), permutation_importance (model-agnostic feature importance), partial_dependence (how a prediction moves with one feature), grid_search / randomized_search (union-typed hyperparameter search), list_models, model_info, drop_model.

Cross-validation splitters (assign folds, then evaluate in pure SQL): kfold, stratified_kfold, group_kfold, timeseries_split.

Per-group models: fit_model (aggregate — one model per GROUP BY group), predict_one / predict_class_one / predict_proba_one (scalars — per-row, by-name features).

Transforms (table in, id passthrough):

  • Scaling / preprocessing — standard_scaler, minmax_scaler, robust_scaler, maxabs_scaler, normalizer, power_transformer, quantile_transformer, binarizer, kbins_discretizer, simple_imputer
  • Encoding — ordinal_encoder, one_hot_encoder, target_encoder
  • Feature engineering — polynomial_features (interaction / power terms)
  • Text — count_vectorizer, tfidf_vectorizer (long format (id, term, value))
  • Feature selection — select_k_best, variance_threshold (per-feature scores + a selected flag)
  • Decomposition / manifold — pca, truncated_svd, tsne, isomap, spectral_embedding, mds
  • Clustering — kmeans, minibatch_kmeans, dbscan, optics, agglomerative_clustering, spectral_clustering, mean_shift, birch, gaussian_mixture
  • Outlier detection — isolation_forest, local_outlier_factor, one_class_svm, elliptic_envelope

Stored transformers (fit once, apply to new data — like fit/predict): fit_transformer, apply_transform, list_transformers, drop_transformer.

Metric aggregates over (y_true, y_pred):

  • Regression — mean_squared_error, root_mean_squared_error, mean_absolute_error, r2_score, explained_variance_score, mean_absolute_percentage_error, max_error, median_absolute_error, mean_squared_log_error, mean_pinball_loss
  • Classification — accuracy_score, precision_score, recall_score, f1_score, balanced_accuracy_score, matthews_corrcoef, cohen_kappa_score, jaccard_score, hamming_loss, zero_one_loss
  • Probability / ranking — roc_auc_score, average_precision_score, log_loss, brier_score_loss
  • Clustering — adjusted_rand_score, normalized_mutual_info_score, adjusted_mutual_info_score, homogeneity_score, completeness_score, v_measure_score, fowlkes_mallows_score

Metrics over a table: confusion_matrix (long format), silhouette_score, and the binary curves roc_curve, precision_recall_curve, calibration_curve.

Manage the registry with sklearn.models.list_models() (for example, SELECT * FROM sklearn.models.list_models();), sklearn.models.model_info('name'), and sklearn.models.drop_model('name').


Where models live

A fit_… call always returns the model as a model BLOB and saves it to the registry when you pass model_name. Two ways to keep a model:

In a DuckDB table (BLOB). Store the model column anywhere; pass it to predict via a session variable (the data subquery is the table function's one allowed subquery, so the model scalar comes through getvariable):

CREATE TABLE models AS
  SELECT 'churn_gb' AS name, model
  FROM sklearn.estimators.fit_gradient_boosting_classifier(
    (SELECT tenure, monthly_spend, support_tickets, churned FROM churn),
    target := 'churned');

SET VARIABLE m = (SELECT model FROM models WHERE name = 'churn_gb');
SELECT * FROM sklearn.models.predict((SELECT * FROM churn), model := getvariable('m'), id := 'customer_id');

In the named registry. Pass model_name to fit_…, then reference it by name (predict(..., model_name := 'churn_gb')). The registry is local disk by default (SKLEARN_MODELS_DIR, default ./models); an S3/R2 backend is the planned drop-in (registry.get_store() is the single seam). predict takes either model_name := or model :=.

DuckDB BLOBs cap near 2 GB, so a very large ensemble may not fit in a column — use the registry for those.

Model serialization & safety

Models are stored with skops, not pickle: loading reconstructs only known types instead of executing arbitrary code, and this worker further restricts the trusted set to the scikit-learn / numpy / scipy namespaces — a crafted artifact can't smuggle in an arbitrary callable.

Note

skops removes pickle's code-execution risk, but it isn't a trust oracle — keep the registry / SKLEARN_MODELS_DIR writable only by trusted users. skops stores scikit-learn objects, so it is not version-independent: a model may fail to load or behave differently under a different scikit-learn version. The worker records the fitting version and logs a duckdb_logs() warning on mismatch. (Fully version-independent inference would mean exporting to ONNX, at the cost of estimator coverage.)


Install

pip install vgi-sklearn        # or: uvx vgi-sklearn

This provides the vgi-sklearn (stdio, for DuckDB to spawn) and vgi-sklearn-http console scripts. Then ATTACH 'sklearn' (TYPE vgi, LOCATION 'vgi-sklearn'). To attach a hosted HTTP deployment instead: ATTACH 'sklearn' (TYPE vgi, LOCATION 'https://<host>').

Run via Docker

A multi-arch image (linux/amd64 + linux/arm64) is published to ghcr.io/query-farm/vgi-sklearn by CI on every version tag vX.Y.Z (and :edge from main). One image serves both transports — http is the default; pass stdio to run the worker DuckDB spawns on-host:

# HTTP server on :8000 (mount a volume for the model registry + shared state)
docker run -p 8000:8000 -v vgi_sklearn_state:/data ghcr.io/query-farm/vgi-sklearn

-- stdio worker for DuckDB to spawn on-host (run this statement in DuckDB, not the shell)
ATTACH 'sklearn' (TYPE vgi, LOCATION
  'docker run -i --rm -v vgi_sklearn_state:/data ghcr.io/query-farm/vgi-sklearn stdio');

The image declares the state mount it needs via the farm.query.vgi.volumes label, so a VGI extension can discover and inject the -v mount automatically. /data holds the model registry (/data/models) and the shared BoundStorage SQLite (/data/state); mounting one named volume across instances shares both. The image is Linux-only (linux/amd64 and linux/arm64), so to run the worker natively on macOS or Windows, install the cross-platform PyPI package instead (above). Images are cosign-signed (keyless) with provenance + SBOM attestations.

Local development

uv sync                       # install worker + deps from uv.lock (PyPI vgi-python)
uv run pytest tests/ -q       # unit tests (incl. pydoclint docstring gate)
uvx ruff check . && uvx ruff format --check .

To develop against local vgi-python / vgi-rpc checkouts instead of PyPI, use the Makefile targets (worker = uv run sklearn_worker.py):

make venv
make test-stdio    # SQL integration tests, worker as a subprocess
make test-http     # SQL integration tests against a local HTTP server

The test/sql/*.test sqllogictest suite is the authoritative integration test. CI (.github/workflows/ci.yml) runs the unit + SQL suites on Linux/macOS/Windows against the signed community vgi extension via a prebuilt haybarn-unittest — no local C++ build (see ci/README.md).

Publishing

The two artifacts publish independently, gated on the full CI suite:

  • ghcr.io image (docker-publish.yml) — on a version tag push (vX.Y.Z): multi-arch, built per-arch on native runners, tested in both transports before push, merged into one manifest, and cosign-signed. (Pushes to main publish :edge.)
  • PyPI (publish.yml) — on a GitHub Release: uv build && uv publish (token in the PYPI_API_TOKEN repo secret). Publishing a Release also creates the tag, so it ships the image too.

So you can cut a Docker-only release (push a tag) without touching PyPI.

The version is single-sourced from __version__ in vgi_sklearn/__init__.py (hatchling reads it; the worker advertises it as implementation_version). Bump it there before tagging — both release jobs verify the release tag matches it (ci/check-version.sh), so the PyPI wheel, the image tag, and the version the worker reports over VGI always agree.
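
The tag-versus-version check can be pictured with the Python sketch below. The contents of ci/check-version.sh are not shown in this README, so this is a hypothetical equivalent; the 1.2.3 version string is a placeholder.

```python
import re


def tag_matches_version(tag: str, init_source: str) -> bool:
    """True when a release tag like 'v1.2.3' matches __version__ in the source."""
    match = re.search(
        r"""^__version__\s*=\s*["']([^"']+)["']""", init_source, re.MULTILINE
    )
    if match is None:
        raise ValueError("__version__ not found")
    return tag == f"v{match.group(1)}"


# Placeholder source text standing in for vgi_sklearn/__init__.py.
source = '"""vgi-sklearn."""\n__version__ = "1.2.3"\n'
ok = tag_matches_version("v1.2.3", source)
```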

License

MIT — see LICENSE.
