ModelLens: Finding the Best Model for Your Task from Myriads of Models

📄 Paper | 🌐 Project Page | 🤗 Demo | 🗺️ Atlas Graph

A unified ranking framework that learns directly from public leaderboard interactions to recommend the best pretrained model for an unseen dataset, without ever running a candidate on the target task.

This repository contains the official implementation of ModelLens, the metric-aware ranking framework introduced in our paper "ModelLens: Finding the Best for Your Task from Myriads of Models".

ModelLens teaser

Left: the learned model–dataset atlas, a single embedding space trained on 1.62M public benchmark records that co-locates every model and every dataset. Models from the same family (BERT / LLaMA / T5 / ViT / Whisper / …) cluster together, and datasets from the same domain (NLP / Vision / Speech / Retrieval / Multimodal / Math & Code) form their own neighborhoods. The geometry reflects what works on what, not just text similarity.

Right: given an unseen target dataset (here: MMMU), ModelLens returns top-K candidates that are task-appropriate (multimodal LMs such as Gemini-2.5-Pro, Step-3-VL-108B and Qwen3-VL-235B), in stark contrast to the nearest text-embedding neighbors (DeBERTa-MNLI, mDeBERTa-Vietnamese, MiniLM-IMDb), which match the description but solve the wrong problem.


Why ModelLens

The open-source model ecosystem is exploding. HuggingFace alone now hosts hundreds of thousands of pretrained models across thousands of architectures, and a practitioner facing a new task has to answer one deceptively simple question:

Which of these myriad models will do best on my dataset?

Existing answers are unsatisfying for very different reasons:

  • AutoML / fine-tune-and-rank. Train every candidate on the target task and pick the winner. Optimal in the limit, infeasible at the scale of hundreds of thousands of models.
  • Transferability estimation (LEEP, NCE, LogME, …). Cheaper than full fine-tuning, but still requires a forward pass per candidate on the target dataset. The cost grows linearly with the candidate pool, and most estimators assume a single, well-defined task setup.
  • Model routing (RouterBench, RouteLLM, …). Fast at inference, but presupposes a tiny, hand-curated pool of ~5–30 models. Asks "which of these few?", not "which of these many?".
  • Metadata-only retrieval. Embed the model card and the dataset description with a frozen text encoder, return nearest neighbors. Cheap and scalable, but as the right panel of the teaser shows, text similarity is not task similarity: a Vietnamese DeBERTa is among the nearest text-neighbors of MMMU but a hopeless choice for solving it.

ModelLens reframes model selection as a ranking problem over (model, dataset, task, metric) tuples, learned directly from the large-scale but noisy trace of public benchmark records. Once trained, it ranks unseen models on unseen datasets zero-shot, using only metadata (names, descriptions, model size, architecture family): no forward pass on the target dataset, no curated pool.

On a benchmark of 1.62M evaluation records spanning ~47K models and ~9.6K datasets, ModelLens surpasses both metadata-only and forward-pass transferability baselines, and its recommended Top-K pools improve five representative routers by 21%–81% across QA benchmarks.


What ModelLens learns: the model–dataset atlas

A useful side-effect of training a single ranker over all (model, dataset) interactions is that we can inspect the resulting latent space directly. Each star below is a model, colored by architecture family; the surrounding scatter / mesh shows the datasets it has been evaluated on, colored by task domain.

Left (Atlas, semantic only): the semantic-only baseline, an atlas built from frozen text-embedding similarity between model cards and dataset descriptions (i.e. what a metadata-only retriever sees).
Right (Atlas, full data, ModelLens): the same projection, but using the learned latents that absorb 1.62M co-evaluation records.

The two atlases tell the same story from opposite ends:

  • The semantic-only atlas (left) shows that text similarity alone produces a tangled mass: families overlap heavily in the centre, and many task-relevant distinctions (e.g. encoder-only LMs vs decoder-only LMs, multimodal vs vision-only) collapse together because their descriptions read similarly.
  • The full-data atlas (right), driven by actual evaluation interactions, untangles this geometry: speech models (orange) detach cleanly from the text continent, retrieval embedders (green) form their own arc, and vision / multimodal models bridge the vision–text boundary. Family structure is recovered from co-evaluation patterns, not supplied as a label.

The practical consequence is the right panel of the teaser: in the learned space, nearest-neighbor in fact means task-appropriate, while in the semantic-only space it means text-similar. ModelLens's recommendation quality is, in large part, a downstream effect of having the right geometry to begin with.
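To make the nearest-neighbor reading concrete, here is a minimal sketch of cosine-similarity retrieval over a shared embedding space. The vectors and model names are toy values invented for illustration, not ModelLens latents, and `cosine` / `top_k_models` are hypothetical helpers rather than functions from this repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_models(dataset_vec, model_vecs, k):
    """Return the k model names whose vectors are closest to the dataset vector."""
    ranked = sorted(
        model_vecs,
        key=lambda name: cosine(dataset_vec, model_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-d vectors (assumed values, for illustration only).
model_vecs = {
    "toy-multimodal-lm": [0.9, 0.1, 0.8],
    "toy-text-encoder": [0.1, 0.9, 0.0],
    "toy-speech-model": [0.0, 0.2, 0.1],
}
dataset_vec = [1.0, 0.0, 1.0]  # stand-in for a multimodal benchmark

print(top_k_models(dataset_vec, model_vecs, k=2))
```

The retrieval step is the same in both atlases; what changes is where the vectors come from, which is why the geometry of the space decides whether the neighbors are useful.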


What's in this repo

ModelLens/
├── config/
│   ├── FinalModel_unified_augmented.yaml      # main model config (Table 1)
│   ├── method_ablation/                       # loss-objective ablations
│   ├── ablation_information/                  # structural/semantic/interaction ablations
│   ├── ablation_size/                         # size-prior / size-feature ablations
│   └── ablation_family/                       # family-prior / family-holdout ablations
├── module/
│   ├── data/        # leaderboard corpus loader, name tokenizer
│   ├── model/       # ModelLens (the paper model)
│   ├── procedure/   # listwise / pairwise / pointwise / ensemble training loops
│   └── utils/       # metrics (Kendall-w τ, NDCG@K, Hit@K, Rec@K), family extractor
├── src/main.py      # entry point: parse YAML, build model, train, evaluate
├── figures/         # teaser & atlas figures used in this README
└── scripts/         # one-shot training and ablation drivers

Installation

The recommended setup is conda, which pins both Python and CUDA-capable PyTorch:

conda env create -f environment.yml
conda activate modellens

If you prefer pip / venv:

# Python 3.10+ recommended; install PyTorch separately to match your CUDA.
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

GPU training requires a CUDA-capable PyTorch build. Distributed (DDP) training is supported out of the box; see scripts/train.sh.


Data & pretrained weights

The training corpora and the trained ModelLens recommender are released on the HuggingFace Hub. Pick the corpus version based on your needs:

| Artifact | HF repo | Rows | What's in it |
|---|---|---|---|
| 🤗 Corpus v1 (cleaner, smaller) | luisrui/ModelLens-corpus-v1 | 1,542,867 | Original ModelLens corpus, R1–R6 deterministic cleaning pipeline applied (~0.0007% residual noise). |
| 🤗 Corpus v2 (expanded, recommended) | luisrui/ModelLens-corpus-v2 | 1,807,133 | v1 + HELM (294K) + LiveBench (6K) + OpenCompass (581). Only R6 cross-source dedup re-run. |
| 🤗 ModelLens checkpoint | luisrui/ModelLens | n/a | Trained recommender (~709 MB, slim) trained on corpus v2. Loads with strict=False. Live demo: spaces/luisrui/ModelLens. |

Both corpus repos share the same schema: a flat CSV plus the vocab / profile JSONs used at training time.

data/<corpus>/
├── data_clean.csv  (v1)  /  data.csv  (v2)   # task, dataset, model, metric, value, dataset_desp
├── task2id.json                              # task vocab
├── metric2id.json                            # simplified metric vocab (post-prefix-strip)
├── family2id.json                            # model-family vocab
├── model2id.json                             # model name -> integer id
├── model2family.json                         # model name -> family
├── model_profile.json                        # HF metadata (size, downloads, license, ...)
└── model_popularity.json                     # HF download count

The published CSV has 6 columns (task, dataset, model, metric, value, dataset_desp). model_size is available via model_profile.json keyed by model name; value_std is a training-time artifact and is intentionally omitted. The metric column has the task:: prefix stripped, so use the task column to disambiguate when fitting per-task models.
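As a rough sketch of how the pieces fit together, the snippet below attaches a model size to each CSV row and groups rows by (task, metric). It runs on toy in-memory data rather than the real files. The six column names come from the schema above, but the `"size"` key inside each profile entry is an assumed field name, and `attach_model_size` is a hypothetical helper; check the actual model_profile.json before relying on either.

```python
import csv
import io
import json

# Toy stand-ins for data.csv and model_profile.json (values invented for illustration).
csv_text = """task,dataset,model,metric,value,dataset_desp
qa,toy-qa,model-a,accuracy,0.71,A toy QA set
qa,toy-qa,model-b,accuracy,0.64,A toy QA set
asr,toy-asr,model-c,wer,0.12,A toy speech set
"""
profile_text = '{"model-a": {"size": 7000000000}, "model-b": {"size": 110000000}}'

def attach_model_size(rows, profile, size_key="size"):
    """Add a model_size field to each row; None when the model has no profile entry."""
    out = []
    for row in rows:
        entry = profile.get(row["model"], {})
        out.append({**row, "value": float(row["value"]), "model_size": entry.get(size_key)})
    return out

rows = list(csv.DictReader(io.StringIO(csv_text)))
profile = json.loads(profile_text)
records = attach_model_size(rows, profile)

# Group on (task, metric): metric names no longer carry the task:: prefix,
# so the same metric name can appear under different tasks.
by_task_metric = {}
for rec in records:
    by_task_metric.setdefault((rec["task"], rec["metric"]), []).append(rec)

for key, group in sorted(by_task_metric.items()):
    print(key, [(r["model"], r["model_size"]) for r in group])
```

Note that metrics differ in direction (for example, a lower word error rate is better), so any per-group ranking built on top of this needs a per-metric notion of which direction is better.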

Downloading

from huggingface_hub import hf_hub_download
import pandas as pd, json

# Corpus (v2 recommended)
csv_path = hf_hub_download(
    "luisrui/ModelLens-corpus-v2", "data.csv", repo_type="dataset",
)
df = pd.read_csv(csv_path, low_memory=False)

def load_vocab(filename):
    # Download one vocab JSON from the corpus repo and parse it, closing the file afterwards.
    path = hf_hub_download("luisrui/ModelLens-corpus-v2", filename, repo_type="dataset")
    with open(path) as f:
        return json.load(f)

task2id   = load_vocab("task2id.json")
metric2id = load_vocab("metric2id.json")

# Pretrained ModelLens weights (trained on corpus v2)
ckpt = hf_hub_download("luisrui/ModelLens", "ModelLens.pt")
args = hf_hub_download("luisrui/ModelLens", "args.json")

Or via 🤗 datasets:

from datasets import load_dataset
ds = load_dataset("luisrui/ModelLens-corpus-v2", split="train")

Once downloaded, place files under ./data/<corpus>/ and point data_name in the YAML at that subdirectory (e.g. data_name: unified_augmented_v2).

The HuggingFace mirror redistributes only numerical scores + dataset descriptions, not benchmark contents. Each underlying leaderboard (HELM, LiveBench, OpenCompass, Papers-with-Code, Open LLM Leaderboard, …) retains its original license.


Quick start

Once data is in place:

# Train the full ModelLens model (ensemble loss, all features)
bash scripts/train.sh

# or, equivalently, single-GPU
python src/main.py --config config/FinalModel_unified_augmented.yaml

# Multi-GPU (DDP). nproc_per_node should match your number of devices.
USE_DDP=1 NPROC=4 bash scripts/train.sh

Reproduce the loss-objective and information-source ablations:

bash scripts/run_method_ablations.sh
bash scripts/run_feature_ablations.sh

Outputs:

  • Checkpoints: checkpoint/mlp/<data_name>/<trail_name>/
  • Logs: log/mlp/<data_name>/<trail_name>/train.log
  • Optional W&B run: controlled by use_wandb in the YAML

Configuration

All hyperparameters live in YAML. Key knobs (see config/FinalModel_unified_augmented.yaml for defaults):

| Field | Meaning |
|---|---|
| model_name | ModelLens (the paper model). |
| loss_type | ensemble, listwise, pairwise, pairwise_pointwise, listwise_pointwise, listwise_pairwise. |
| id_dropout_rate | Probability of masking a learned model/dataset ID with [UNK]. |
| use_size_prior, use_family_prior | Toggle the structural-prior head terms. |
| use_size_feature | If False, drops the size embedding from both backbone and prior. |
| use_dataset_id_as_desp | When True, the dataloader passes a global dataset id in the dataset-description slot, which the model intercepts to look up both a learned dataset embedding and a frozen description embedding. Required by ModelLens. |
| lambda_list, lambda_pair, point_loss_weight | Loss weights λ_list, λ_pair, λ_point. |
| tau | Initial value of the learnable temperature τ. |
| topk | List of K values for Hit@K / NDCG@K / Rec@K. |
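To illustrate what id_dropout_rate controls, here is a minimal sketch of ID dropout: each ID is independently replaced by an [UNK] index with the given probability. The `UNK_ID = 0` convention and the `apply_id_dropout` helper are assumptions made for this example; the repository's actual masking code may differ.

```python
import random

UNK_ID = 0  # assumed index reserved for the [UNK] token

def apply_id_dropout(ids, rate, rng):
    """Replace each id with UNK_ID independently with probability `rate`."""
    return [UNK_ID if rng.random() < rate else i for i in ids]

rng = random.Random(0)  # seeded for reproducibility
model_ids = [5, 12, 7, 42, 3, 19]
print(apply_id_dropout(model_ids, rate=0.5, rng=rng))
```

A plausible reading of this knob is that masking IDs during training pushes the ranker to lean on metadata features, which is what it has to do at test time for models or datasets it has never seen.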

Evaluation protocol

ModelLens supports the two settings from Section 4.2.1 of the paper:

  1. Performance completion: randomly mask entries from a partially observed (model × dataset) matrix and predict their values.
  2. Cold-start generalisation: hold out entire datasets or entire models (new_dataset_evaluation / new_model_evaluation split modes) and score them zero-shot.
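The difference between the two settings can be sketched on a toy list of records. This is an illustration under stated assumptions, not the repository's split logic: `completion_split` and `cold_start_split` are hypothetical helpers, and the records are invented.

```python
import random

def completion_split(records, mask_frac, seed=0):
    """Randomly mask individual (model, dataset) entries; the rest stay observed."""
    rng = random.Random(seed)
    idx = list(range(len(records)))
    rng.shuffle(idx)
    masked_idx = set(idx[: round(len(records) * mask_frac)])
    observed = [r for i, r in enumerate(records) if i not in masked_idx]
    masked = [r for i, r in enumerate(records) if i in masked_idx]
    return observed, masked

def cold_start_split(records, holdout_key, holdout_frac, seed=0):
    """Hold out every record of a sampled set of entities (datasets or models)."""
    entities = sorted({r[holdout_key] for r in records})
    rng = random.Random(seed)
    n_holdout = max(1, round(len(entities) * holdout_frac))
    held_out = set(rng.sample(entities, n_holdout))
    train = [r for r in records if r[holdout_key] not in held_out]
    test = [r for r in records if r[holdout_key] in held_out]
    return train, test, held_out

records = [
    {"model": m, "dataset": d, "value": v}
    for m, d, v in [
        ("model-a", "ds-1", 0.7), ("model-b", "ds-1", 0.6),
        ("model-a", "ds-2", 0.5), ("model-c", "ds-2", 0.8),
        ("model-b", "ds-3", 0.4), ("model-c", "ds-3", 0.9),
        ("model-a", "ds-4", 0.3), ("model-b", "ds-4", 0.2),
    ]
]

observed, masked = completion_split(records, mask_frac=0.25)
train, test, held_out = cold_start_split(records, "dataset", holdout_frac=0.25)
print(len(observed), len(masked), sorted(held_out))
```

In the completion setting a test dataset still appears in training through its other entries; in the cold-start setting no record of a held-out entity is ever seen during training.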

Ranking quality is reported with Kendall-weighted τ_w (the primary metric, emphasising top-rank correctness) and NDCG@K, Hit@K, Rec@K, all implemented in module/utils/metric.py.
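For readers unfamiliar with the top-K metrics, the sketch below shows one common way to define them for a single dataset, given true and predicted scores over the same candidate models. These definitions are assumptions for illustration (Hit@K as "the true best model is in the predicted top K", Rec@K as overlap between true and predicted top-K sets, NDCG@K with linear gain); the implementations in module/utils/metric.py may differ in detail.

```python
import math

def top_k(scores, k):
    """Indices of the k largest scores, best first (ties broken by index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def hit_at_k(true_scores, pred_scores, k):
    """1.0 if the truly best item appears in the predicted top-k, else 0.0."""
    best = top_k(true_scores, 1)[0]
    return 1.0 if best in top_k(pred_scores, k) else 0.0

def recall_at_k(true_scores, pred_scores, k):
    """Fraction of the true top-k recovered by the predicted top-k."""
    true_top = set(top_k(true_scores, k))
    pred_top = set(top_k(pred_scores, k))
    return len(true_top & pred_top) / len(true_top)

def ndcg_at_k(true_scores, pred_scores, k):
    """NDCG@k using the true scores as graded relevance (linear gain)."""
    def dcg(order):
        return sum(true_scores[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = dcg(top_k(true_scores, k))
    return dcg(top_k(pred_scores, k)) / ideal if ideal > 0 else 0.0

# Toy scores for four candidate models on one dataset (invented values).
true_scores = [0.9, 0.5, 0.7, 0.1]
pred_scores = [0.6, 0.8, 0.7, 0.2]

print(hit_at_k(true_scores, pred_scores, 2))
print(recall_at_k(true_scores, pred_scores, 2))
print(ndcg_at_k(true_scores, pred_scores, 2))
```

For the weighted Kendall statistic, SciPy provides `scipy.stats.weightedtau`, which by default weights agreement near the top of the ranking more heavily; whether the repository relies on it is not stated here.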


Citation

If you find ModelLens useful in your research, please cite:

@article{cai2026modellens,
  title={ModelLens: Finding the Best for Your Task from Myriads of Models},
  author={Cai, Rui and Mo, Weijie Jacky and Wen, Xiaofei and Ma, Qiyao and Zhu, Wenhui and Chen, Xiwen and Chen, Muhao and Zhao, Zhe},
  journal={arXiv preprint arXiv:2605.07075},
  year={2026}
}

License

Released under the MIT License; see LICENSE.
