A pure Julia implementation of gradient boosting with symmetric (oblivious) decision trees and ordered target encoding for categorical features, inspired by the CatBoost algorithm.
Michi (ミチ) means cat in Japanese.
- Pure Julia gradient boosting with no Python or C++ dependencies.
- Native categorical handling via ordered target encoding (CatBoost-style), with no manual preprocessing.
- Symmetric (oblivious) decision trees as the base learner, with histogram-based split finding and threaded training.
- Custom loss functions via the `LossFunction` interface for regression, binary, and multi-class tasks.
- Standard toolkit: sample and class weights, cross-validation (with stratification), early stopping with configurable eval metrics, SHAP values, model serialisation. See the docs for the full list.
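The two ideas named in the list above, ordered target encoding and symmetric (oblivious) trees, can be illustrated with a minimal, self-contained sketch. This is not MichiBoost's internal code: the function and type names (`ordered_target_encode`, `ObliviousTree`, `leaf_index`, `predict_one`) and the smoothing parameters `prior` and `prior_weight` are hypothetical and chosen only for illustration. The encoding follows the general CatBoost-style recipe, in which each sample is encoded using only the targets of samples that precede it in a permutation.

```julia
using Random

# Ordered target statistic for one categorical column (illustrative only).
# Each sample is encoded from the targets of samples that come BEFORE it in
# `perm`, so a sample's own label never leaks into its encoding.
# `prior` and `prior_weight` are assumed smoothing parameters.
function ordered_target_encode(cats::AbstractVector, y::AbstractVector{<:Real};
                               prior::Float64=0.5, prior_weight::Float64=1.0,
                               perm::AbstractVector{<:Integer}=randperm(length(cats)))
    @assert length(cats) == length(y)
    sums = Dict{eltype(cats),Float64}()    # running target sum per category
    counts = Dict{eltype(cats),Int}()      # running sample count per category
    encoded = Vector{Float64}(undef, length(cats))
    for i in perm
        c = cats[i]
        s = get(sums, c, 0.0)
        k = get(counts, c, 0)
        encoded[i] = (s + prior * prior_weight) / (k + prior_weight)
        sums[c] = s + y[i]                 # update only after encoding sample i
        counts[c] = k + 1
    end
    return encoded
end

# A symmetric (oblivious) tree uses the same (feature, threshold) test for
# every node on a level, so a leaf is addressed by `depth` comparison bits.
struct ObliviousTree
    features::Vector{Int}          # one feature index per level
    thresholds::Vector{Float64}    # one threshold per level
    leaf_values::Vector{Float64}   # length 2^depth
end

function leaf_index(tree::ObliviousTree, x::AbstractVector{<:Real})
    idx = 0
    for (f, t) in zip(tree.features, tree.thresholds)
        idx = 2 * idx + (x[f] > t ? 1 : 0)   # append one bit per level
    end
    return idx + 1                            # Julia arrays are 1-based
end

predict_one(tree::ObliviousTree, x) = tree.leaf_values[leaf_index(tree, x)]

# Usage with a fixed permutation so the result is reproducible.
colors = ["red", "blue", "red", "green"]
labels = [0.0, 1.0, 0.0, 1.0]
enc = ordered_target_encode(colors, labels; perm=1:4)   # [0.5, 0.5, 0.25, 0.5]

tree = ObliviousTree([1, 2], [0.5, 2.0], [10.0, 20.0, 30.0, 40.0])
p = predict_one(tree, [1.0, 1.0])                       # 30.0
```

The second `"red"` sample gets 0.25 because one earlier `"red"` sample with target 0 has already been seen. Because every level shares one test, evaluating a tree costs a fixed number of comparisons with no branching on tree structure, which is part of why this tree shape suits vectorised and threaded inference.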
Run with threading enabled for best performance:

```bash
julia -t 4  # any thread count works
```

This sets `Threads.nthreads()`, which MichiBoost uses during training and inference.
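To confirm that the thread setting took effect, a quick check from the Julia session can help. The sketch below uses only Julia's standard `Base.Threads` module and does not touch MichiBoost; the `squares` array is a made-up example workload. Julia also reads the thread count from the `JULIA_NUM_THREADS` environment variable when `-t` is not given.

```julia
using Base.Threads: nthreads, @threads

# Report how many threads this session was started with.
println("Julia threads available: ", nthreads())

# Tiny threaded loop: each iteration writes to its own slot,
# so there is no data race regardless of the thread count.
squares = zeros(Int, 8)
@threads for i in eachindex(squares)
    squares[i] = i^2
end
```

If the printed count is 1, the session was started without `-t` (or with `JULIA_NUM_THREADS` unset), and training will run single-threaded.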
Regression:

```julia
using MichiBoost

X = [1.0 4.0 5.0 6.0; 4.0 5.0 6.0 7.0; 30.0 40.0 50.0 60.0]
y = [10.0, 20.0, 30.0]

model = MichiBoostRegressor(; iterations=100, learning_rate=0.1, depth=4)
fit!(model, X, y)
preds = predict(model, X)
```

Binary classification:

```julia
using MichiBoost

X = [0.0 3.0; 4.0 1.0; 8.0 1.0; 9.0 1.0]
y = [0.0, 0.0, 1.0, 1.0]

model = MichiBoostClassifier(; iterations=100, learning_rate=0.1, depth=4)
fit!(model, X, y)
probs = predict_proba(model, X)  # P(class=1)
classes = predict(model, X)      # predicted class labels
```

String columns are detected as categorical without further configuration:

```julia
using MichiBoost, DataFrames

df = DataFrame(color=["red", "blue", "red", "green"], size=[1.0, 2.0, 3.0, 4.0])
y = [0.0, 1.0, 0.0, 1.0]

model = MichiBoostClassifier(; iterations=50)
fit!(model, df, y)
predict(model, df)
```

Cross-validation:

```julia
using MichiBoost, Random

Random.seed!(42)
X = randn(100, 5)
y = Float64.(X[:, 1] .+ X[:, 2] .> 0)

pool = Pool(X; label=y)
scores = cv(pool; fold_count=5, params=Dict(:iterations => 100, :depth => 4))
println("Mean test loss: ", scores.mean_test_loss)
```

Explain individual predictions with SHAP values. Given a trained model and feature matrix `X`:

```julia
shap = shap_values(model, X)
# Regression and binary classification: (n_samples, n_features) matrix.
# Multi-class: (n_samples, n_features, n_classes) array.
```

Weight individual training samples via `Pool`. Reusing `X` and `y` from the binary classification example above:

```julia
w = [1.0, 2.0, 0.5, 1.0]
pool = Pool(X; label=y, weight=w)

model = MichiBoostClassifier(; iterations=100)
fit!(model, pool)
```

MichiBoost.jl is benchmarked against the reference CatBoost implementation (via CatBoost.jl) across four axes: correctness on held-out data, a training and inference speed sweep, threading and real-dataset scaling (UCI Covertype), and the cost of each advertised feature (CV, early stopping, SHAP, RSM, sample weights, save and load).
See benchmark/README.md for the full methodology, per-script commands, latest results, and caveats.