Paper | Project Page | Demo | Atlas Graph
A unified ranking framework that learns directly from public leaderboard interactions to recommend the best pretrained model for an unseen dataset, without ever running a candidate on the target task.
This repository contains the official implementation of ModelLens, the metric-aware ranking framework introduced in our paper "ModelLens: Finding the Best for Your Task from Myriads of Models".
Left: the learned model-dataset atlas, a single embedding space, trained on 1.62M public benchmark records, that co-locates every model and every dataset. Models from the same family (BERT / LLaMA / T5 / ViT / Whisper / ...) cluster together, and datasets from the same domain (NLP / Vision / Speech / Retrieval / Multimodal / Math & Code) form their own neighborhoods. The geometry reflects what works on what, not just text similarity. Right: given an unseen target dataset (here: MMMU), ModelLens returns top-K candidates that are task-appropriate (multimodal LMs such as Gemini-2.5-Pro, Step-3-VL-108B and Qwen3-VL-235B), in stark contrast to the nearest text-embedding neighbors (DeBERTa-MNLI, mDeBERTa-Vietnamese, MiniLM-IMDb), which match the description but solve the wrong problem.
The open-source model ecosystem is exploding. HuggingFace alone now hosts hundreds of thousands of pretrained models across thousands of architectures, and a practitioner facing a new task has to answer one deceptively simple question:
Which of these myriad models will do best on my dataset?
Existing answers are unsatisfying for very different reasons:
- AutoML / fine-tune-and-rank. Train every candidate on the target task and pick the winner. Optimal in the limit, infeasible at the scale of hundreds of thousands of models.
- Transferability estimation (LEEP, NCE, LogME, ...). Cheaper than full fine-tuning, but still requires a forward pass per candidate on the target dataset. The cost grows linearly with the candidate pool, and most estimators assume a single, well-defined task setup.
- Model routing (RouterBench, RouteLLM, ...). Fast at inference, but presupposes a tiny, hand-curated pool of ~5-30 models. It asks "which of these few?", not "which of these many?".
- Metadata-only retrieval. Embed the model card and the dataset description with a frozen text encoder, return nearest neighbors. Cheap and scalable, but as the right panel of the teaser shows, text similarity is not task similarity: a Vietnamese DeBERTa is among the nearest text-neighbors of MMMU but a hopeless choice for solving it.
ModelLens reframes model selection as a ranking problem over
(model, dataset, task, metric) tuples, learned directly from the
large-scale but noisy trace of public benchmark records. Once trained, it
ranks unseen models on unseen datasets zero-shot, using only metadata
(names, descriptions, model size, architecture family), with no forward pass
on the target dataset and no curated pool.
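As a rough illustration of the ranking view (not the repository's data loader), the sketch below groups toy benchmark records by (dataset, task, metric) and orders the models inside each group by score. The records, the `build_ranking_lists` helper and the assumption that higher values are better are all hypothetical; metrics such as error rates would need the opposite order.

```python
from collections import defaultdict

# Toy records in the (model, dataset, task, metric, value) shape; values are made up.
records = [
    ("model-a", "dataset-x", "qa", "accuracy", 0.71),
    ("model-b", "dataset-x", "qa", "accuracy", 0.83),
    ("model-c", "dataset-x", "qa", "accuracy", 0.64),
    ("model-a", "dataset-y", "asr", "accuracy", 0.52),
    ("model-c", "dataset-y", "asr", "accuracy", 0.90),
]


def build_ranking_lists(rows, higher_is_better=True):
    """Group rows by (dataset, task, metric) and rank models within each group."""
    groups = defaultdict(list)
    for model, dataset, task, metric, value in rows:
        groups[(dataset, task, metric)].append((model, value))
    return {
        key: [m for m, _ in sorted(items, key=lambda mv: mv[1], reverse=higher_is_better)]
        for key, items in groups.items()
    }


rankings = build_ranking_lists(records)
for key, ranked_models in rankings.items():
    print(key, ranked_models)
```

Each group is one ranking list; a learned ranker would be asked to reproduce these orderings from metadata alone.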
On a benchmark of 1.62M evaluation records spanning ~47K models and ~9.6K datasets, ModelLens surpasses both metadata-only and forward-pass transferability baselines, and its recommended Top-K pools improve five representative routers by 21%-81% across QA benchmarks.
A useful side-effect of training a single ranker over all
(model, dataset) interactions is that we can inspect the resulting
latent space directly. Each star below is a model, colored by
architecture family; the surrounding scatter / mesh shows the
datasets it has been evaluated on, colored by task domain.
The two atlases tell the same story from opposite ends:
- The semantic-only atlas (left) shows that text similarity alone produces a tangled mass: families overlap heavily in the centre, and many task-relevant distinctions (e.g. encoder-only LMs vs decoder-only LMs, multimodal vs vision-only) collapse together because their descriptions read similarly.
- The full-data atlas (right), driven by actual evaluation interactions, untangles this geometry: speech models (orange) detach cleanly from the text continent, retrieval embedders (green) form their own arc, and vision / multimodal models bridge the vision-text boundary. Family structure is recovered from co-evaluation patterns, not supplied as a label.
The practical consequence is the right panel of the teaser: in the learned space, nearest-neighbor in fact means task-appropriate, while in the semantic-only space it means text-similar. ModelLens's recommendation quality is, in large part, a downstream effect of having the right geometry to begin with.
ModelLens/
├── config/
│   ├── FinalModel_unified_augmented.yaml   # main model config (Table 1)
│   ├── method_ablation/                    # loss-objective ablations
│   ├── ablation_information/               # structural/semantic/interaction ablations
│   ├── ablation_size/                      # size-prior / size-feature ablations
│   └── ablation_family/                    # family-prior / family-holdout ablations
├── module/
│   ├── data/                               # leaderboard corpus loader, name tokenizer
│   ├── model/                              # ModelLens (the paper model)
│   ├── procedure/                          # listwise / pairwise / pointwise / ensemble training loops
│   └── utils/                              # metrics (weighted Kendall τ, NDCG@K, Hit@K, Rec@K), family extractor
├── src/main.py                             # entry point: parse YAML, build model, train, evaluate
├── figures/                                # teaser & atlas figures used in this README
└── scripts/                                # one-shot training and ablation drivers
The recommended setup is conda, which pins both Python and a CUDA-capable PyTorch build:
conda env create -f environment.yml
conda activate modellens

If you prefer pip / venv:
# Python 3.10+ recommended; install PyTorch separately to match your CUDA.
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

GPU training requires a CUDA-capable PyTorch build. Distributed (DDP)
training is supported out of the box; see scripts/train.sh.
The training corpora and the trained ModelLens recommender are released on the HuggingFace Hub. Pick the corpus version based on your needs:
| Artifact | HF repo | Rows | What's in it |
|---|---|---|---|
| 🤗 Corpus v1 (cleaner, smaller) | luisrui/ModelLens-corpus-v1 | 1,542,867 | Original ModelLens corpus, R1-R6 deterministic cleaning pipeline applied (~0.0007% residual noise). |
| 🤗 Corpus v2 (expanded, recommended) | luisrui/ModelLens-corpus-v2 | 1,807,133 | v1 + HELM (294K) + LiveBench (6K) + OpenCompass (581). Only R6 cross-source dedup re-run. |
| 🤗 ModelLens checkpoint | luisrui/ModelLens | - | Trained recommender (~709 MB, slim) trained on corpus v2. Loads with strict=False. Live demo: spaces/luisrui/ModelLens. |
Both corpus repos share the same schema: a flat CSV plus the vocab / profile JSONs used at training time.
data/<corpus>/
├── data_clean.csv (v1) / data.csv (v2)   # task, dataset, model, metric, value, dataset_desp
├── task2id.json                          # task vocab
├── metric2id.json                        # simplified metric vocab (post-prefix-strip)
├── family2id.json                        # model-family vocab
├── model2id.json                         # model name -> integer id
├── model2family.json                     # model name -> family
├── model_profile.json                    # HF metadata (size, downloads, license, ...)
└── model_popularity.json                 # HF download count
The published CSV has 6 columns (task, dataset, model, metric, value, dataset_desp). model_size is available via model_profile.json, keyed by model name; value_std is a training-time artifact and is intentionally omitted. The metric column has the task:: prefix stripped, so use the task column to disambiguate when fitting per-task models.
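The two practical consequences of this schema, looking up model size in the profile JSON and keying metrics by task, can be sketched with in-memory stand-ins for the real files. The toy rows are made up, and the `"size"` key inside each profile entry is an assumption; check the actual model_profile.json for the field name before relying on it.

```python
import csv
import io
import json

# In-memory stand-ins for data.csv and model_profile.json; contents are made up.
csv_text = """task,dataset,model,metric,value,dataset_desp
qa,dataset-x,model-a,accuracy,0.71,toy QA set
asr,dataset-y,model-a,accuracy,0.52,toy speech set
qa,dataset-x,model-b,accuracy,0.83,toy QA set
"""
# Assumption: each profile entry exposes the parameter count under a "size" key.
profile_text = '{"model-a": {"size": 110000000}, "model-b": {"size": 7000000000}}'

model_profile = json.loads(profile_text)
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    row["value"] = float(row["value"])
    # Size lives in the profile, not the CSV; models without a profile get None.
    row["model_size"] = model_profile.get(row["model"], {}).get("size")
    # The metric name alone is ambiguous across tasks, so key on both.
    row["task_metric"] = (row["task"], row["metric"])

task_metrics = sorted({row["task_metric"] for row in rows})
print(task_metrics)
```

Here "accuracy" appears under two tasks, which is exactly the case where grouping on the metric column alone would mix unrelated scores.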
import json

import pandas as pd
from huggingface_hub import hf_hub_download

# Corpus (v2 recommended)
csv_path = hf_hub_download(
    "luisrui/ModelLens-corpus-v2", "data.csv", repo_type="dataset",
)
df = pd.read_csv(csv_path, low_memory=False)

# Vocab files live in the same dataset repo; close each file after reading
with open(hf_hub_download("luisrui/ModelLens-corpus-v2", "task2id.json", repo_type="dataset")) as f:
    task2id = json.load(f)
with open(hf_hub_download("luisrui/ModelLens-corpus-v2", "metric2id.json", repo_type="dataset")) as f:
    metric2id = json.load(f)

# Pretrained ModelLens weights (trained on corpus v2)
ckpt = hf_hub_download("luisrui/ModelLens", "ModelLens.pt")
args = hf_hub_download("luisrui/ModelLens", "args.json")

Or via 🤗 datasets:
from datasets import load_dataset
ds = load_dataset("luisrui/ModelLens-corpus-v2", split="train")

Once downloaded, place files under ./data/<corpus>/ and point
data_name in the YAML at that subdirectory (e.g. data_name: unified_augmented_v2).
The HuggingFace mirror redistributes only numerical scores + dataset descriptions, not benchmark contents. Each underlying leaderboard (HELM, LiveBench, OpenCompass, Papers-with-Code, Open LLM Leaderboard, ...) retains its original license.
Once data is in place:
# Train the full ModelLens model (ensemble loss, all features)
bash scripts/train.sh
# or, equivalently, single-GPU
python src/main.py --config config/FinalModel_unified_augmented.yaml
# Multi-GPU (DDP). nproc_per_node should match your number of devices.
USE_DDP=1 NPROC=4 bash scripts/train.sh

Reproduce the loss-objective and information-source ablations:

bash scripts/run_method_ablations.sh
bash scripts/run_feature_ablations.sh

Outputs:
- Checkpoints: checkpoint/mlp/<data_name>/<trail_name>/
- Logs: log/mlp/<data_name>/<trail_name>/train.log
- Optional W&B run: controlled by use_wandb in the YAML
All hyperparameters live in YAML. Key knobs (see
config/FinalModel_unified_augmented.yaml for defaults):
| Field | Meaning |
|---|---|
| model_name | ModelLens (the paper model). |
| loss_type | ensemble, listwise, pairwise, pairwise_pointwise, listwise_pointwise, listwise_pairwise. |
| id_dropout_rate | Probability of masking a learned model/dataset ID with [UNK]. |
| use_size_prior, use_family_prior | Toggle the structural-prior head terms. |
| use_size_feature | If False, drops the size embedding from both backbone and prior. |
| use_dataset_id_as_desp | When True, the dataloader passes a global dataset id in the dataset-description slot, which the model intercepts to look up both a learned dataset embedding and a frozen description embedding. Required by ModelLens. |
| lambda_list, lambda_pair, point_loss_weight | Loss weights λ_list, λ_pair, λ_point. |
| tau | Initial value of the learnable temperature τ. |
| topk | List of K values for Hit@K / NDCG@K / Rec@K. |
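To make the loss-weight and temperature knobs more concrete, here is a minimal pure-Python sketch of how a weighted sum of listwise, pairwise and pointwise terms could be combined. The specific loss forms (softmax cross-entropy, logistic pairwise loss, squared error) and the `ensemble_loss` function are assumptions chosen for illustration; the training loops under module/procedure/ define the actual objectives.

```python
import math


def softmax(xs, tau=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    m = max(x / tau for x in xs)
    exps = [math.exp(x / tau - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(pred, target, tau=1.0):
    """Cross-entropy between softmax(target) and softmax(pred / tau)."""
    p_true = softmax(target)
    p_pred = softmax(pred, tau)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))


def pairwise_loss(pred, target):
    """Mean logistic loss over ordered pairs whose true scores differ."""
    losses = []
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:
                losses.append(math.log1p(math.exp(-(pred[i] - pred[j]))))
    return sum(losses) / len(losses) if losses else 0.0


def pointwise_loss(pred, target):
    """Mean squared error between predicted and observed scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def ensemble_loss(pred, target, lambda_list=1.0, lambda_pair=1.0,
                  point_loss_weight=1.0, tau=1.0):
    """Weighted sum of the three terms, mirroring the YAML field names."""
    return (lambda_list * listwise_loss(pred, target, tau)
            + lambda_pair * pairwise_loss(pred, target)
            + point_loss_weight * pointwise_loss(pred, target))


target = [0.9, 0.5, 0.1]   # observed scores for three models on one dataset
good = [0.8, 0.5, 0.2]     # predictions in the right order
bad = [0.2, 0.5, 0.8]      # predictions in the reversed order
print(ensemble_loss(good, target), ensemble_loss(bad, target))
```

Setting a weight to zero removes the corresponding term, which is one plausible reading of how the pairwise_pointwise, listwise_pointwise and listwise_pairwise options relate to the full ensemble.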
ModelLens supports the two settings from Section 4.2.1 of the paper:
- Performance completion: randomly mask entries from a partially observed (model × dataset) matrix and predict their values.
- Cold-start generalisation: hold out entire datasets or entire models (new_dataset_evaluation / new_model_evaluation split modes) and score them zero-shot.
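The cold-start setting can be pictured as a split on entity names instead of on individual rows. The sketch below is a hypothetical `holdout_split` helper over toy records, not the repository's implementation of the new_dataset_evaluation or new_model_evaluation modes.

```python
import random


def holdout_split(rows, key, holdout_frac=0.25, seed=0):
    """Hold out every row whose `key` value falls in a randomly chosen subset.

    `key` is "dataset" for new-dataset evaluation or "model" for new-model
    evaluation; no held-out entity appears anywhere in the training rows.
    """
    entities = sorted({row[key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(entities)
    n_holdout = max(1, int(len(entities) * holdout_frac))
    held_out = set(entities[:n_holdout])
    train = [r for r in rows if r[key] not in held_out]
    test = [r for r in rows if r[key] in held_out]
    return train, test, held_out


# Toy records: 3 models evaluated on 4 datasets; names and values are made up.
rows = [
    {"model": f"model-{m}", "dataset": f"dataset-{d}", "value": 0.1 * (m + d)}
    for m in range(3) for d in range(4)
]
train, test, held_out = holdout_split(rows, key="dataset")
print(len(train), len(test), held_out)
```

Performance completion, by contrast, would sample individual rows at random, so the same dataset and model can appear on both sides of the split.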
Ranking quality is reported with weighted Kendall τ_w (the primary
metric, emphasising top-rank correctness) and NDCG@K, Hit@K, Rec@K, all
implemented in module/utils/metric.py.
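For readers unfamiliar with the top-K metrics, the following is a minimal sketch using common textbook definitions. It is not a copy of module/utils/metric.py, and the exact conventions there (gain function, what counts as "relevant") may differ; the helper names and toy scores are hypothetical.

```python
import math


def hit_at_k(predicted, relevant, k):
    """1.0 if any truly relevant model appears in the top-k predictions."""
    return 1.0 if any(m in relevant for m in predicted[:k]) else 0.0


def recall_at_k(predicted, relevant, k):
    """Fraction of the relevant models recovered in the top-k predictions."""
    if not relevant:
        return 0.0
    return len(set(predicted[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(predicted, gains, k):
    """NDCG@K with linear gains; `gains` maps model name -> true score."""
    dcg = sum(gains.get(m, 0.0) / math.log2(i + 2) for i, m in enumerate(predicted[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


true_scores = {"model-a": 0.9, "model-b": 0.7, "model-c": 0.2, "model-d": 0.1}
predicted = ["model-b", "model-a", "model-d", "model-c"]  # a ranker's output
relevant = {"model-a", "model-b"}                          # the true top-2 models

print(hit_at_k(predicted, relevant, 1))
print(recall_at_k(predicted, relevant, 2))
print(ndcg_at_k(predicted, true_scores, 2))
```

For the weighted Kendall statistic, SciPy's scipy.stats.weightedtau may serve as a convenient reference point, though whether the repository relies on it is not stated here.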
If you find ModelLens useful in your research, please cite:
@article{cai2026modellens,
  title={ModelLens: Finding the Best for Your Task from Myriads of Models},
  author={Cai, Rui and Mo, Weijie Jacky and Wen, Xiaofei and Ma, Qiyao and Zhu, Wenhui and Chen, Xiwen and Chen, Muhao and Zhao, Zhe},
  journal={arXiv preprint arXiv:2605.07075},
  year={2026}
}

Released under the MIT License; see LICENSE.


