GeoXGB — Geometry-Aware Gradient Boosting

GeoXGB replaces conventional subsampling and bootstrapping with geometry-aware sample reduction and expansion powered by HVRT (Hierarchical Variance-Retaining Transformer).

Installation

pip install geoxgb

For hyperparameter optimisation via Optuna:

pip install "geoxgb[optimizer]"

Requires hvrt >= 2.6.1, scikit-learn, and numpy. Python >= 3.10.

The compiled C++ backend is included in the wheel and used automatically; no extra install step required.

Quick Start

from geoxgb import GeoXGBRegressor, GeoXGBClassifier

# Regression — HPO strongly recommended for learning_rate and max_depth
reg = GeoXGBRegressor(n_rounds=1000)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Classification
clf = GeoXGBClassifier(n_rounds=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)

# Pass feature types for mixed data
clf.fit(X_train, y_train, feature_types=["continuous", "categorical", ...])

HPO is strongly recommended. learning_rate and max_depth are the two most sensitive parameters and interact strongly: optimal range is learning_rate 0.010–0.020 with high n_rounds (1 000–5 000) and shallow max_depth (2–3). These optima shift 5× across datasets. Use GeoXGBOptimizer or any Optuna/sklearn HPO tool for production models.

Key Features

Geometry-aware sampling via HVRT's variance-retaining partitions
FPS reduction — keeps geometrically diverse representatives
KDE expansion — synthesises samples in sparse regions
Adaptive noise detection — automatically backs off on noisy data
Multi-fit — refits partitions on residuals every N rounds
No overfitting — see Why high n_rounds is safe
Full interpretability — feature importances, partition traces, sample provenance
Gardener — post-hoc surgical editor: diagnose biased leaves and self-heal
GeoXGBOptimizer — Optuna TPE hyperparameter search
Categorical support — pass feature_types to handle mixed data natively
Class reweighting — class_weight for imbalanced classification
Multiclass parallelism — n_jobs for K-class one-vs-rest ensembles

Why High `n_rounds` is Safe

Standard gradient boosting memorises: every tree sees the same N rows, so adding rounds eventually overfits the training set.

GeoXGB cannot memorise. At every refit_interval, HVRT re-partitions the residual landscape and FPS selects a fresh, geometrically diverse subset. No boosting tree ever trains on the same sample twice. There is no fixed training set to memorise, so train and val loss converge smoothly and continue to improve with more rounds — the train–val gap remains small regardless of n_rounds.

Practical consequences:

More rounds is always beneficial (or neutral); it is never harmful.
convergence_tol is a compute budget feature, not an anti-overfitting guard — use it to stop early once the gradient has genuinely plateaued.
The default n_rounds=1000 is a conservative starting point; tuning up to 2000–4000 rounds consistently yields further gains on large datasets.

Parameters

Shared (GeoXGBRegressor and GeoXGBClassifier)

Parameter	Default	Description
`n_rounds`	1000	Number of boosting rounds
`learning_rate`	0.02	Shrinkage per tree. HPO recommended — optimal range 0.010–0.020
`max_depth`	3	Maximum depth of each weak learner. HPO recommended — optimal range 2–3
`min_samples_leaf`	5	Minimum samples per leaf in the weak learner (DecisionTree)
`partitioner`	`'pyramid_hart'`	Geometry partitioner: `'pyramid_hart'`, `'hart'`, `'hvrt'`
`method`	`'variance_ordered'`	Sample reduction strategy: `'variance_ordered'`, `'orthant_stratified'`, `'fps'`
`generation_strategy`	`'simplex_mixup'`	Synthetic sample generator: `'simplex_mixup'`, `'epanechnikov'`, `'laplace'`
`n_partitions`	None	Partition count (None = auto-tuned)
`hvrt_min_samples_leaf`	None	Partition minimum leaf size (None = auto-tuned)
`reduce_ratio`	0.8	Fraction of samples to keep per boosting round
`expand_ratio`	0.1	Fraction to synthesise as synthetic samples (0 = disabled)
`y_weight`	0.25	Partition blend: 0 = unsupervised geometry, 1 = fully y-driven
`variance_weighted`	False	Budget allocation by partition variance
`bandwidth`	`'auto'`	KDE bandwidth for expansion (`'auto'` = per-partition Scott's rule)
`adaptive_bandwidth`	False	Scale KDE bandwidth by local partition density
`refit_interval`	50	Refit partition tree on residuals every N rounds (None = off)
`auto_noise`	True	Auto-detect noise and modulate resampling
`noise_guard`	True	Look-ahead veto on resampling when gradient signal is structureless
`auto_expand`	True	Auto-expand small datasets to `min_train_samples`
`min_train_samples`	5000	Target training-set size when `auto_expand=True`
`adaptive_reduce_ratio`	False	Dynamically adjust reduce_ratio from gradient tail heaviness
`cache_geometry`	False	Reuse partition structure across refits (faster for large datasets)
`feature_weights`	None	Per-feature scaling applied before partition geometry sees X
`convergence_tol`	None	Stop early when gradient improvement < tol (compute budget only)
`n_jobs`	1	Parallel workers for multiclass one-vs-rest ensembles
`random_state`	42

GeoXGBClassifier only

Parameter	Default	Description
`class_weight`	None	`None`, `'balanced'`, or `{class: weight}` dict

Saving and Loading Models

from geoxgb import load_model

# Save (strips large HVRT data arrays — file stays under 100 MB)
model.save("my_model.pkl")

# Load in a new process
model = load_model("my_model.pkl")
predictions = model.predict(X_test)

Gardener — Post-Hoc Tree Surgery

Gardener wraps a fitted model and exposes manual editing tools plus automatic self-healing:

from geoxgb import Gardener

garden = Gardener(fitted_model)

# Automatic: detect biased leaves, correct, validate, commit only if better
result = garden.heal(X_train, y_train, X_val, y_val, strategy="surgery")
print(result["improvement"])   # AUC / R² delta vs baseline

# Manual tools
garden.adjust_leaf(tree_idx=5, leaf_id=3, delta=-0.02)
garden.prune(tree_idx=12, leaf_id=7)
garden.graft(X_targeted, residuals, n_rounds=10, learning_rate=0.05)
garden.rollback()              # undo last operation
garden.reset()                 # restore to original fitted state

# Derive feature weights from gradient vs geometry agreement
weights = garden.recommend_feature_weights(feature_names)
model2 = GeoXGBClassifier(feature_weights=list(weights.values()))

GeoXGBOptimizer — Optuna HPO

from geoxgb import GeoXGBOptimizer

opt = GeoXGBOptimizer(n_trials=50, cv=3, random_state=42)
opt.fit(X_train, y_train)

print(opt.best_params_)   # {'n_rounds': 1000, 'learning_rate': 0.2, ...}
print(opt.best_score_)    # best mean CV score (AUC or R²)

y_pred  = opt.predict(X_test)
y_proba = opt.predict_proba(X_test)   # classifier only

# Access the raw Optuna study for plots / analysis
opt.study_.best_trial

Trial 0 is always the v0.1.1 defaults — HPO is guaranteed to match or beat the baseline. fast=True (default) accelerates trials via geometry caching and disabled expansion, then refits the final model at full quality (cache_geometry=False, auto_expand=True).

fast=True speed and accuracy ceiling

fast=True sets cache_geometry=True, auto_expand=False, and convergence_tol=0.01 during HPO trials only. The final best_model_ is always refit at full quality (GeoXGB defaults: cache_geometry=False, auto_expand=True, no early stop).

Setting	Trial effect	Final model
`cache_geometry=True`	HVRT fitted once per trial; reused at all refit intervals	Reverts to `False` (adaptive geometry)
`auto_expand=False`	KDE expansion disabled	Reverts to `True`
`convergence_tol=0.01`	Trial self-terminates when gradient improvement < 1%	Reverts to `None`

Speed (post-HVRT 2.5.0): As of HVRT 2.5.0, the per-trial wall-clock time of fast=True and fast=False is essentially identical at n=500–1000 — measured ratio ≈ 1.0× across four benchmark datasets. HVRT 2.5.0's vectorized _budgets.py made individual HVRT operations ~3.4× faster, so tree training now dominates trial cost and cache_geometry=True saves very little wall time. The primary speed driver in HPO is therefore the n_rounds value that each proxy happens to select, not the caching mechanism.

Accuracy ceiling: The final model test score (after full-quality refit) is within 0.001–0.005 AUC/R² of the full-quality proxy for most datasets. Exception: small datasets (n<500) can show up to −0.015 R² when the fast proxy picks a suboptimal expand_ratio (expansion is disabled during trials so this parameter cannot be properly evaluated).

Do not rely on fast=True CV scores as calibrated estimates — they reflect cached-geometry conditions that differ from the final refit. Use them only for ranking configurations.

Use fast=False when accurate CV estimates matter or when expand_ratio is important to your dataset:

opt = GeoXGBOptimizer(n_trials=25, cv=5, fast=False)
opt.fit(X_train, y_train)

Both modes use the identical HVRT 2.5.0 API path internally.

Heterogeneity Detection

The boost/partition importance ratio is a heterogeneity surface map. When the two importance axes diverge it is not a red flag — it is structural information about each feature's local role:

Ratio	Interpretation
`ratio >> 1`	Prediction driver — feature drives gradient updates within local regions but does not define them
`ratio << 1`	Heterogeneity axis — feature defines where different predictive relationships apply; lower predictive contribution within each region
`ratio ~= 1`	Universally informative — both structure-defining and predictive

This operates at the individual level, not just population subgroups. Each HVRT partition is a hyperplane-bounded local region in feature space. With sufficient partitions these regions can be arbitrarily fine, approaching individual-level neighbourhoods. partition_tree_rules() exposes the exact conditions defining each individual's local region.

boost = model.feature_importances(feature_names=names)
part  = model.partition_feature_importances(feature_names=names)

avg_part = {f: np.mean([e["importances"].get(f, 0) for e in part]) for f in names}
for f in names:
    ratio = boost[f] / (avg_part[f] + 1e-10)
    print(f"{f}: ratio={ratio:.2f}")
# ratio << 1  =>  heterogeneity axis (defines local structure)
# ratio >> 1  =>  prediction driver (gradient-dominant within regions)

Validated across three synthetic scenarios in benchmarks/heterogeneity_detection_test.py:

Regime indicator: the feature that determines which local predictive relationship applies consistently has a lower ratio than within-regime predictors, regardless of whether it directly enters the prediction formula.
Interaction moderator: the ratio ordering (moderator < predictor) holds for sign-flip interactions. XGBoost tree importance conflates structure and prediction roles into a single score; the boost/partition split separates them.
Complementary roles: among two strong predictors, HVRT allocates one to anchor partition geometry (lower ratio) and the other to gradient-driven prediction (higher ratio). Role assignment is emergent — the divergence reveals local structure, not model error.

Interpretability

from geoxgb.report import model_report, print_report

# All-in-one structured report
print_report(model_report(model, X_test, y_test, feature_names=names))

# Individual report sections
from geoxgb.report import (
    noise_report,       # data quality assessment
    provenance_report,  # where did the training samples come from?
    importance_report,  # boosting vs partition feature importance
    partition_report,   # HVRT partition structure at a given round
    evolution_report,   # how geometry changed across refits
    validation_report,  # PASS/FAIL checks against known ground truth
    compare_report,     # head-to-head comparison with a baseline
)

# Raw model API
model.feature_importances(feature_names)           # boosting importance
model.partition_feature_importances(feature_names) # geometric importance
model.partition_trace()                             # full partition history
model.partition_tree_rules(round_idx=0)             # human-readable rules
model.sample_provenance()                           # reduction/expansion counts
model.noise_estimate()                              # 1.0=clean, 0.0=pure noise

Imbalanced Classification

Use class_weight='balanced' to upweight the minority class in gradient updates. This stacks with HVRT's geometric diversity preservation.

clf = GeoXGBClassifier(
    class_weight='balanced',
    auto_noise=False,   # recommended for severe imbalance (< 5% minority)
)

Large-Scale Datasets

For datasets with many thousands of samples, HVRT refitting at each refit_interval dominates wall time. Enable geometry caching to reuse the initial partition structure and reduce HVRT.fit() calls from n_rounds / refit_interval down to 1:

model = GeoXGBRegressor(
    cache_geometry=True,   # reuse HVRT partition structure across refits
    n_rounds=4000,         # safe to go high — no overfitting risk
)

For multiclass problems, parallelise the K one-vs-rest ensembles:

clf = GeoXGBClassifier(n_jobs=4)   # ~4x speedup for 5-class problems

Benchmarks

Head-to-head comparison vs XGBoost (default vs default, same CV protocol):

Dataset	Metric	GeoXGB default	GeoXGB HPO-best	XGBoost (300 est.)
diabetes (n=442)	R²	0.4675	0.4982	0.3147
friedman1 (n=1000)	R²	0.9210	0.9321	0.8434
breast_cancer (n=569)	AUC	0.9943	0.9926	0.9886
wine (n=178, 3-class)	AUC	0.9951	0.9993	0.9975
digits (n=1797, 10-class)	AUC	0.9988	0.9987	0.9990

GeoXGB default outperforms XGBoost on 4/5 datasets without any tuning. On digits (10-class), GeoXGB and XGBoost are within 0.0002 AUC (one standard deviation). HPO-best params are from a 2 000+ trial Optuna TPE study with partitioner='pyramid_hart', method='variance_ordered', and generation_strategy='simplex_mixup' fixed. Key findings: optimal learning_rate 0.012–0.015, max_depth 2–3, y_weight 0.21–0.28.

Note: XGBoost uses its default 300 estimators (learning_rate=0.1). GeoXGB uses its default 1 000 rounds (learning_rate=0.02). Both are untuned defaults evaluated with identical CV splits.

C++ backend

GeoXGB ships a compiled C++ extension (_geoxgb_cpp) built with Eigen3 and pybind11. fit() and predict() route through C++ automatically when:

The compiled extension is present (always true for wheel installs), and
No feature_types argument is passed (categorical columns still use the pure-Python path), and
convergence_tol is None (the Python path handles convergence tracking).

The Python path remains available for the full interpretability stack (noise_estimate(), sample_provenance(), partition_feature_importances(), Gardener, etc.) — pass feature_types=["continuous"] * n_features to opt in.

Causal Inference

GeoXGB's geometry-aware resampling makes it a strong base estimator for CATE and ITE tasks. HVRT partitions covariate space into locally homogeneous regions that naturally align with treatment-effect subgroups; auto_expand prevents information collapse in sparse T=0/T=1 sub-populations.

ITE metalearner usage

GeoXGB drops into any standard metalearner architecture:

import numpy as np
from sklearn.model_selection import train_test_split

X_tr, X_te, T_tr, T_te, Y_tr, Y_te = train_test_split(X, T, Y, test_size=0.25)

# Use HVRT for causal inference — noise-invariant (Theorem 3)
m0 = GeoXGBRegressor(partitioner='hvrt')
m0.fit(X_tr[T_tr == 0], Y_tr[T_tr == 0])
m1 = GeoXGBRegressor(partitioner='hvrt')
m1.fit(X_tr[T_tr == 1], Y_tr[T_tr == 1])
tau_hat = m1.predict(X_te) - m0.predict(X_te)

Why HVRT for causal inference? HVRT satisfies Theorem 3 (T-orthogonality): its cooperation measure is invariant to isotropic Gaussian covariate noise. PyramidHART loses this property — its L1-ball level sets are noise-sensitive, which degrades performance in observational settings with noisy covariates. The default partitioner='pyramid_hart' is optimal for regression; set partitioner='hvrt' when covariates have measurement error or when using S-learner-style metalearners that treat T as a feature alongside X.

PEHE benchmark on randomised trials (lower is better, n=2000, best-of-metalearners):

τ(x) type	GeoXGB (`hvrt`)	XGBoost	Honest R-forest¹
Linear (2X₁ + 1)	0.180	0.207	0.247
Nonlinear (2·sin(X₁π) + X₂²)	0.408	0.608	0.796

¹ 2-fold cross-fitted R-forest (functional core of GRF). Settings: n_rounds=500, learning_rate=0.1, max_depth=3, y_weight=0.25.

Mediation fingerprint

The boost/partition importance ratio surfaces causal structure without a separate statistical test. Features that are causally upstream of Y (i.e. X where part of the effect passes through a mediator M) have boost_imp >> partition_imp — the gradient signal recognises X as important even when HVRT geometry anchors on M (pass feature_types=[...] to enable the interpretability API):

part  = model.partition_feature_importances(feature_names=names)
boost = model.feature_importances(feature_names=names)

avg_part = {f: np.mean([e["importances"].get(f, 0) for e in part])
            for f in names}
for f in names:
    ratio = boost[f] / (avg_part[f] + 1e-10)
    print(f"{f}: boost/partition = {ratio:.2f}")
# Causally upstream features show ratio >> 1
# Mediator features show ratio < 1 (geometry anchors on them)

Doubly-robust ATE

For average treatment effect estimation under confounding, use GeoXGB as the outcome model in a doubly-robust (DR) pipeline — its nonlinear surface quality reduces IPW residuals and tightens the DR correction:

from sklearn.linear_model import LogisticRegression

prop = LogisticRegression().fit(X_tr, T_tr)
pi   = np.clip(prop.predict_proba(X_te)[:, 1], 0.05, 0.95)
mu0, mu1 = m0.predict(X_te), m1.predict(X_te)

dr_ate = (mu1 - mu0
          + T_te * (Y_te - mu1) / pi
          - (1 - T_te) * (Y_te - mu0) / (1 - pi)).mean()

See notebooks/geoxgb_causal_analysis.ipynb for the full analysis: mediators, colliders, CATE, ITE metalearners, and ATE.

When to use AutoITE instead

If you have panel / time-series data — repeated observations per entity over time — consider AutoITE (geo branch) instead. AutoITE is purpose-built for ITE estimation from longitudinal data, where the temporal dimension provides a richer identification strategy than cross-sectional metalearners.

Decision rule:

Data available	Recommended tool
Repeated observations per entity (panel / time-series)	AutoITE geo branch
Cross-sectional data only	GeoXGB + metalearner (T/S/X/DR-learner)

License

AGPL-3.0-or-later

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
.github		.github
benchmarks		benchmarks
cpp		cpp
dist		dist
docs		docs
examples		examples
insights		insights
notebooks		notebooks
src		src
tests		tests
.gitignore		.gitignore
.readthedocs.yaml		.readthedocs.yaml
CHANGELOG.md		CHANGELOG.md
CMakeLists.txt		CMakeLists.txt
GEOMETRY.md		GEOMETRY.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
gemini_optimization_ideas.md		gemini_optimization_ideas.md
hvrt-lm_spec.md		hvrt-lm_spec.md
perf_diag.py		perf_diag.py
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GeoXGB — Geometry-Aware Gradient Boosting

Installation

Quick Start

Key Features

Why High `n_rounds` is Safe

Parameters

Shared (GeoXGBRegressor and GeoXGBClassifier)

GeoXGBClassifier only

Saving and Loading Models

Gardener — Post-Hoc Tree Surgery

GeoXGBOptimizer — Optuna HPO

fast=True speed and accuracy ceiling

Heterogeneity Detection

Interpretability

Imbalanced Classification

Large-Scale Datasets

Benchmarks

C++ backend

Causal Inference

ITE metalearner usage

Mediation fingerprint

Doubly-robust ATE

When to use AutoITE instead

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GeoXGB — Geometry-Aware Gradient Boosting

Installation

Quick Start

Key Features

Why High n_rounds is Safe

Parameters

Shared (GeoXGBRegressor and GeoXGBClassifier)

GeoXGBClassifier only

Saving and Loading Models

Gardener — Post-Hoc Tree Surgery

GeoXGBOptimizer — Optuna HPO

fast=True speed and accuracy ceiling

Heterogeneity Detection

Interpretability

Imbalanced Classification

Large-Scale Datasets

Benchmarks

C++ backend

Causal Inference

ITE metalearner usage

Mediation fingerprint

Doubly-robust ATE

When to use AutoITE instead

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Why High `n_rounds` is Safe

Packages