GeoXGB replaces conventional subsampling and bootstrapping with geometry-aware sample reduction and expansion powered by HVRT (Hierarchical Variance-Retaining Transformer).
pip install geoxgbFor hyperparameter optimisation via Optuna:
pip install "geoxgb[optimizer]"Requires hvrt >= 2.6.1, scikit-learn, and numpy. Python >= 3.10.
The compiled C++ backend is included in the wheel and used automatically; no extra install step required.
from geoxgb import GeoXGBRegressor, GeoXGBClassifier
# Regression — HPO strongly recommended for learning_rate and max_depth
reg = GeoXGBRegressor(n_rounds=1000)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
# Classification
clf = GeoXGBClassifier(n_rounds=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)
# Pass feature types for mixed data
clf.fit(X_train, y_train, feature_types=["continuous", "categorical", ...])HPO is strongly recommended.
learning_rateandmax_depthare the two most sensitive parameters and interact strongly: optimal range islearning_rate0.010–0.020 with highn_rounds(1 000–5 000) and shallowmax_depth(2–3). These optima shift 5× across datasets. UseGeoXGBOptimizeror any Optuna/sklearn HPO tool for production models.
- Geometry-aware sampling via HVRT's variance-retaining partitions
- FPS reduction — keeps geometrically diverse representatives
- KDE expansion — synthesises samples in sparse regions
- Adaptive noise detection — automatically backs off on noisy data
- Multi-fit — refits partitions on residuals every N rounds
- No overfitting — see Why high
n_roundsis safe - Full interpretability — feature importances, partition traces, sample provenance
- Gardener — post-hoc surgical editor: diagnose biased leaves and self-heal
- GeoXGBOptimizer — Optuna TPE hyperparameter search
- Categorical support — pass
feature_typesto handle mixed data natively - Class reweighting —
class_weightfor imbalanced classification - Multiclass parallelism —
n_jobsfor K-class one-vs-rest ensembles
Standard gradient boosting memorises: every tree sees the same N rows, so adding rounds eventually overfits the training set.
GeoXGB cannot memorise. At every refit_interval, HVRT re-partitions the
residual landscape and FPS selects a fresh, geometrically diverse subset.
No boosting tree ever trains on the same sample twice. There is no fixed
training set to memorise, so train and val loss converge smoothly and continue
to improve with more rounds — the train–val gap remains small regardless of
n_rounds.
Practical consequences:
- More rounds is always beneficial (or neutral); it is never harmful.
convergence_tolis a compute budget feature, not an anti-overfitting guard — use it to stop early once the gradient has genuinely plateaued.- The default
n_rounds=1000is a conservative starting point; tuning up to 2000–4000 rounds consistently yields further gains on large datasets.
| Parameter | Default | Description |
|---|---|---|
n_rounds |
1000 | Number of boosting rounds |
learning_rate |
0.02 | Shrinkage per tree. HPO recommended — optimal range 0.010–0.020 |
max_depth |
3 | Maximum depth of each weak learner. HPO recommended — optimal range 2–3 |
min_samples_leaf |
5 | Minimum samples per leaf in the weak learner (DecisionTree) |
partitioner |
'pyramid_hart' |
Geometry partitioner: 'pyramid_hart', 'hart', 'hvrt' |
method |
'variance_ordered' |
Sample reduction strategy: 'variance_ordered', 'orthant_stratified', 'fps' |
generation_strategy |
'simplex_mixup' |
Synthetic sample generator: 'simplex_mixup', 'epanechnikov', 'laplace' |
n_partitions |
None | Partition count (None = auto-tuned) |
hvrt_min_samples_leaf |
None | Partition minimum leaf size (None = auto-tuned) |
reduce_ratio |
0.8 | Fraction of samples to keep per boosting round |
expand_ratio |
0.1 | Fraction to synthesise as synthetic samples (0 = disabled) |
y_weight |
0.25 | Partition blend: 0 = unsupervised geometry, 1 = fully y-driven |
variance_weighted |
False | Budget allocation by partition variance |
bandwidth |
'auto' |
KDE bandwidth for expansion ('auto' = per-partition Scott's rule) |
adaptive_bandwidth |
False | Scale KDE bandwidth by local partition density |
refit_interval |
50 | Refit partition tree on residuals every N rounds (None = off) |
auto_noise |
True | Auto-detect noise and modulate resampling |
noise_guard |
True | Look-ahead veto on resampling when gradient signal is structureless |
auto_expand |
True | Auto-expand small datasets to min_train_samples |
min_train_samples |
5000 | Target training-set size when auto_expand=True |
adaptive_reduce_ratio |
False | Dynamically adjust reduce_ratio from gradient tail heaviness |
cache_geometry |
False | Reuse partition structure across refits (faster for large datasets) |
feature_weights |
None | Per-feature scaling applied before partition geometry sees X |
convergence_tol |
None | Stop early when gradient improvement < tol (compute budget only) |
n_jobs |
1 | Parallel workers for multiclass one-vs-rest ensembles |
random_state |
42 |
| Parameter | Default | Description |
|---|---|---|
class_weight |
None | None, 'balanced', or {class: weight} dict |
from geoxgb import load_model
# Save (strips large HVRT data arrays — file stays under 100 MB)
model.save("my_model.pkl")
# Load in a new process
model = load_model("my_model.pkl")
predictions = model.predict(X_test)Gardener wraps a fitted model and exposes manual editing tools plus
automatic self-healing:
from geoxgb import Gardener
garden = Gardener(fitted_model)
# Automatic: detect biased leaves, correct, validate, commit only if better
result = garden.heal(X_train, y_train, X_val, y_val, strategy="surgery")
print(result["improvement"]) # AUC / R² delta vs baseline
# Manual tools
garden.adjust_leaf(tree_idx=5, leaf_id=3, delta=-0.02)
garden.prune(tree_idx=12, leaf_id=7)
garden.graft(X_targeted, residuals, n_rounds=10, learning_rate=0.05)
garden.rollback() # undo last operation
garden.reset() # restore to original fitted state
# Derive feature weights from gradient vs geometry agreement
weights = garden.recommend_feature_weights(feature_names)
model2 = GeoXGBClassifier(feature_weights=list(weights.values()))from geoxgb import GeoXGBOptimizer
opt = GeoXGBOptimizer(n_trials=50, cv=3, random_state=42)
opt.fit(X_train, y_train)
print(opt.best_params_) # {'n_rounds': 1000, 'learning_rate': 0.2, ...}
print(opt.best_score_) # best mean CV score (AUC or R²)
y_pred = opt.predict(X_test)
y_proba = opt.predict_proba(X_test) # classifier only
# Access the raw Optuna study for plots / analysis
opt.study_.best_trialTrial 0 is always the v0.1.1 defaults — HPO is guaranteed to match or beat
the baseline. fast=True (default) accelerates trials via geometry caching
and disabled expansion, then refits the final model at full quality
(cache_geometry=False, auto_expand=True).
fast=True sets cache_geometry=True, auto_expand=False, and
convergence_tol=0.01 during HPO trials only. The final best_model_ is
always refit at full quality (GeoXGB defaults: cache_geometry=False,
auto_expand=True, no early stop).
| Setting | Trial effect | Final model |
|---|---|---|
cache_geometry=True |
HVRT fitted once per trial; reused at all refit intervals | Reverts to False (adaptive geometry) |
auto_expand=False |
KDE expansion disabled | Reverts to True |
convergence_tol=0.01 |
Trial self-terminates when gradient improvement < 1% | Reverts to None |
Speed (post-HVRT 2.5.0): As of HVRT 2.5.0, the per-trial wall-clock time
of fast=True and fast=False is essentially identical at n=500–1000 —
measured ratio ≈ 1.0× across four benchmark datasets. HVRT 2.5.0's
vectorized _budgets.py made individual HVRT operations ~3.4× faster, so tree
training now dominates trial cost and cache_geometry=True saves very little
wall time. The primary speed driver in HPO is therefore the n_rounds value
that each proxy happens to select, not the caching mechanism.
Accuracy ceiling: The final model test score (after full-quality refit) is
within 0.001–0.005 AUC/R² of the full-quality proxy for most datasets.
Exception: small datasets (n<500) can show up to −0.015 R² when the fast proxy
picks a suboptimal expand_ratio (expansion is disabled during trials so this
parameter cannot be properly evaluated).
Do not rely on fast=True CV scores as calibrated estimates — they reflect
cached-geometry conditions that differ from the final refit. Use them only for
ranking configurations.
Use fast=False when accurate CV estimates matter or when expand_ratio is
important to your dataset:
opt = GeoXGBOptimizer(n_trials=25, cv=5, fast=False)
opt.fit(X_train, y_train)Both modes use the identical HVRT 2.5.0 API path internally.
The boost/partition importance ratio is a heterogeneity surface map. When the two importance axes diverge it is not a red flag — it is structural information about each feature's local role:
| Ratio | Interpretation |
|---|---|
ratio >> 1 |
Prediction driver — feature drives gradient updates within local regions but does not define them |
ratio << 1 |
Heterogeneity axis — feature defines where different predictive relationships apply; lower predictive contribution within each region |
ratio ~= 1 |
Universally informative — both structure-defining and predictive |
This operates at the individual level, not just population subgroups. Each
HVRT partition is a hyperplane-bounded local region in feature space. With
sufficient partitions these regions can be arbitrarily fine, approaching
individual-level neighbourhoods. partition_tree_rules() exposes the exact
conditions defining each individual's local region.
boost = model.feature_importances(feature_names=names)
part = model.partition_feature_importances(feature_names=names)
avg_part = {f: np.mean([e["importances"].get(f, 0) for e in part]) for f in names}
for f in names:
ratio = boost[f] / (avg_part[f] + 1e-10)
print(f"{f}: ratio={ratio:.2f}")
# ratio << 1 => heterogeneity axis (defines local structure)
# ratio >> 1 => prediction driver (gradient-dominant within regions)Validated across three synthetic scenarios in
benchmarks/heterogeneity_detection_test.py:
-
Regime indicator: the feature that determines which local predictive relationship applies consistently has a lower ratio than within-regime predictors, regardless of whether it directly enters the prediction formula.
-
Interaction moderator: the ratio ordering (moderator < predictor) holds for sign-flip interactions. XGBoost tree importance conflates structure and prediction roles into a single score; the boost/partition split separates them.
-
Complementary roles: among two strong predictors, HVRT allocates one to anchor partition geometry (lower ratio) and the other to gradient-driven prediction (higher ratio). Role assignment is emergent — the divergence reveals local structure, not model error.
from geoxgb.report import model_report, print_report
# All-in-one structured report
print_report(model_report(model, X_test, y_test, feature_names=names))
# Individual report sections
from geoxgb.report import (
noise_report, # data quality assessment
provenance_report, # where did the training samples come from?
importance_report, # boosting vs partition feature importance
partition_report, # HVRT partition structure at a given round
evolution_report, # how geometry changed across refits
validation_report, # PASS/FAIL checks against known ground truth
compare_report, # head-to-head comparison with a baseline
)
# Raw model API
model.feature_importances(feature_names) # boosting importance
model.partition_feature_importances(feature_names) # geometric importance
model.partition_trace() # full partition history
model.partition_tree_rules(round_idx=0) # human-readable rules
model.sample_provenance() # reduction/expansion counts
model.noise_estimate() # 1.0=clean, 0.0=pure noiseUse class_weight='balanced' to upweight the minority class in gradient
updates. This stacks with HVRT's geometric diversity preservation.
clf = GeoXGBClassifier(
class_weight='balanced',
auto_noise=False, # recommended for severe imbalance (< 5% minority)
)For datasets with many thousands of samples, HVRT refitting at each
refit_interval dominates wall time. Enable geometry caching to reuse the
initial partition structure and reduce HVRT.fit() calls from
n_rounds / refit_interval down to 1:
model = GeoXGBRegressor(
cache_geometry=True, # reuse HVRT partition structure across refits
n_rounds=4000, # safe to go high — no overfitting risk
)For multiclass problems, parallelise the K one-vs-rest ensembles:
clf = GeoXGBClassifier(n_jobs=4) # ~4x speedup for 5-class problemsHead-to-head comparison vs XGBoost (default vs default, same CV protocol):
| Dataset | Metric | GeoXGB default | GeoXGB HPO-best | XGBoost (300 est.) |
|---|---|---|---|---|
| diabetes (n=442) | R² | 0.4675 | 0.4982 | 0.3147 |
| friedman1 (n=1000) | R² | 0.9210 | 0.9321 | 0.8434 |
| breast_cancer (n=569) | AUC | 0.9943 | 0.9926 | 0.9886 |
| wine (n=178, 3-class) | AUC | 0.9951 | 0.9993 | 0.9975 |
| digits (n=1797, 10-class) | AUC | 0.9988 | 0.9987 | 0.9990 |
GeoXGB default outperforms XGBoost on 4/5 datasets without any tuning. On
digits (10-class), GeoXGB and XGBoost are within 0.0002 AUC (one standard
deviation). HPO-best params are from a 2 000+ trial Optuna TPE study with
partitioner='pyramid_hart', method='variance_ordered', and
generation_strategy='simplex_mixup' fixed. Key findings: optimal
learning_rate 0.012–0.015, max_depth 2–3, y_weight 0.21–0.28.
Note: XGBoost uses its default 300 estimators (
learning_rate=0.1). GeoXGB uses its default 1 000 rounds (learning_rate=0.02). Both are untuned defaults evaluated with identical CV splits.
GeoXGB ships a compiled C++ extension (_geoxgb_cpp) built with Eigen3 and
pybind11. fit() and predict() route through C++ automatically when:
- The compiled extension is present (always true for wheel installs), and
- No
feature_typesargument is passed (categorical columns still use the pure-Python path), and convergence_tolisNone(the Python path handles convergence tracking).
The Python path remains available for the full interpretability stack
(noise_estimate(), sample_provenance(), partition_feature_importances(),
Gardener, etc.) — pass feature_types=["continuous"] * n_features to opt in.
GeoXGB's geometry-aware resampling makes it a strong base estimator for CATE
and ITE tasks. HVRT partitions covariate space into locally homogeneous
regions that naturally align with treatment-effect subgroups; auto_expand
prevents information collapse in sparse T=0/T=1 sub-populations.
GeoXGB drops into any standard metalearner architecture:
import numpy as np
from sklearn.model_selection import train_test_split
X_tr, X_te, T_tr, T_te, Y_tr, Y_te = train_test_split(X, T, Y, test_size=0.25)
# Use HVRT for causal inference — noise-invariant (Theorem 3)
m0 = GeoXGBRegressor(partitioner='hvrt')
m0.fit(X_tr[T_tr == 0], Y_tr[T_tr == 0])
m1 = GeoXGBRegressor(partitioner='hvrt')
m1.fit(X_tr[T_tr == 1], Y_tr[T_tr == 1])
tau_hat = m1.predict(X_te) - m0.predict(X_te)Why HVRT for causal inference? HVRT satisfies Theorem 3 (T-orthogonality):
its cooperation measure is invariant to isotropic Gaussian covariate noise.
PyramidHART loses this property — its L1-ball level sets are noise-sensitive,
which degrades performance in observational settings with noisy covariates.
The default partitioner='pyramid_hart' is optimal for regression; set
partitioner='hvrt' when covariates have measurement error or when using
S-learner-style metalearners that treat T as a feature alongside X.
PEHE benchmark on randomised trials (lower is better, n=2000, best-of-metalearners):
| τ(x) type | GeoXGB (hvrt) |
XGBoost | Honest R-forest¹ |
|---|---|---|---|
| Linear (2X₁ + 1) | 0.180 | 0.207 | 0.247 |
| Nonlinear (2·sin(X₁π) + X₂²) | 0.408 | 0.608 | 0.796 |
¹ 2-fold cross-fitted R-forest (functional core of GRF). Settings: n_rounds=500, learning_rate=0.1, max_depth=3, y_weight=0.25.
The boost/partition importance ratio surfaces causal structure without a
separate statistical test. Features that are causally upstream of Y (i.e.
X where part of the effect passes through a mediator M) have
boost_imp >> partition_imp — the gradient signal recognises X as important
even when HVRT geometry anchors on M (pass feature_types=[...] to enable
the interpretability API):
part = model.partition_feature_importances(feature_names=names)
boost = model.feature_importances(feature_names=names)
avg_part = {f: np.mean([e["importances"].get(f, 0) for e in part])
for f in names}
for f in names:
ratio = boost[f] / (avg_part[f] + 1e-10)
print(f"{f}: boost/partition = {ratio:.2f}")
# Causally upstream features show ratio >> 1
# Mediator features show ratio < 1 (geometry anchors on them)For average treatment effect estimation under confounding, use GeoXGB as the outcome model in a doubly-robust (DR) pipeline — its nonlinear surface quality reduces IPW residuals and tightens the DR correction:
from sklearn.linear_model import LogisticRegression
prop = LogisticRegression().fit(X_tr, T_tr)
pi = np.clip(prop.predict_proba(X_te)[:, 1], 0.05, 0.95)
mu0, mu1 = m0.predict(X_te), m1.predict(X_te)
dr_ate = (mu1 - mu0
+ T_te * (Y_te - mu1) / pi
- (1 - T_te) * (Y_te - mu0) / (1 - pi)).mean()See notebooks/geoxgb_causal_analysis.ipynb
for the full analysis: mediators, colliders, CATE, ITE metalearners, and ATE.
If you have panel / time-series data — repeated observations per entity over time — consider AutoITE (geo branch) instead. AutoITE is purpose-built for ITE estimation from longitudinal data, where the temporal dimension provides a richer identification strategy than cross-sectional metalearners.
Decision rule:
| Data available | Recommended tool |
|---|---|
| Repeated observations per entity (panel / time-series) | AutoITE geo branch |
| Cross-sectional data only | GeoXGB + metalearner (T/S/X/DR-learner) |
AGPL-3.0-or-later