Official implementation of MSNet: Multi-Semantic Modelling for Glass Surface Detection in the Wild.
AAAI 2026
Qianyu Cheng, Huankang Guan, Rynson W.H. Lau
Department of Computer Science, City University of Hong Kong
Links: Paper | Project Page | Code
Glass images contain three semantic components:
| Component | Description | Origin |
|---|---|---|
| Transmission | Content visible through the glass | Far side of the glass |
| Reflection | Content reflected on the glass surface | Viewer's side |
| Surrounding | Non-glass regions adjacent to the glass | Viewer's side |
MSNet is built on the observation that:
Reflection ≈ Surrounding, while both are different from Transmission.
This asymmetric multi-semantic relationship provides a discriminative cue for glass surface detection in the wild.
Input Image I
│
├──→ SDM: Semantic Decomposition Module
│ ├── DSEB: LRM extracts reflection R
│ │ CLIP(LoRA) encodes I→F_I, R→F_r
│ └── SEB: F_t = F_I·Attn(F_I,F_r) − F_r
│ F_s = F_I − Attn(F_I,F_t)·F_t
│
├──→ GSSAM: Glass-Specific SAM
│ └── SAM ViT-H + LoRA → F_g
│
└──→ ASFM: Adaptive Semantic Fusion Module
├── Fuse {F_I, F_r, F_t, F_s, F_g}
└── Generate sparse/dense prompts
│
└──→ SAM Mask Decoder → Glass Mask
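The two SEB equations in the diagram can be read as plain tensor arithmetic. Below is a minimal NumPy sketch under stated assumptions: features are token matrices of shape (N, D), and `Attn(A, B)` is taken to be a per-token sigmoid gate over the scaled dot product. The diagram does not define `Attn`, so that choice, the helper names `attn_gate` and `decompose`, and the shapes are all hypothetical and are not taken from `Models/MSNet.py`.

```python
import numpy as np

def attn_gate(a, b):
    """Hypothetical stand-in for Attn(a, b): a per-token gate in (0, 1).

    Takes the scaled dot product between matching tokens of a and b
    and squashes it with a sigmoid. Output shape: (N, 1).
    """
    d = a.shape[-1]
    score = np.sum(a * b, axis=-1, keepdims=True) / np.sqrt(d)
    return 1.0 / (1.0 + np.exp(-score))

def decompose(f_i, f_r):
    """Apply the two SEB equations from the diagram to (N, D) features."""
    f_t = f_i * attn_gate(f_i, f_r) - f_r    # F_t = F_I * Attn(F_I, F_r) - F_r
    f_s = f_i - attn_gate(f_i, f_t) * f_t    # F_s = F_I - Attn(F_I, F_t) * F_t
    return f_t, f_s

# Random features stand in for the CLIP encodings F_I and F_r.
rng = np.random.default_rng(0)
f_i = rng.standard_normal((16, 8))
f_r = rng.standard_normal((16, 8))
f_t, f_s = decompose(f_i, f_r)
print(f_t.shape, f_s.shape)
```

The point of the sketch is only the data flow: the transmission feature is derived from the image and reflection features, and the surrounding feature is then derived from the image and transmission features.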
AAAI26-MSNet/
├── Models/
│ ├── MSNet.py # Main MSNet model
│ ├── Base.py # PyTorch Lightning base class
│ ├── loss.py # Loss functions
│ ├── activation.py
│ ├── ablation/ # Ablation variants
│ └── reflection/
│ ├── base.py
│ ├── registry.py
│ └── location_estimator.py # LRM reflection estimator
│
├── Data/
│ ├── SAMDataLoader.py
│ ├── SAMDataLoader_ab.py
│ ├── PLdataModule.py
│ └── Tonpz_new.py # Raw image/mask to NPZ
│
├── Utils/
│ ├── metric_utils.py
│ ├── dataset_utils.py
│ ├── regularization_utils.py
│ └── utils.py
│
├── clip/ # Bundled CLIP Surgery code
├── TestImg/ # Example images
├── train.py
├── test.py
├── inference.py
├── requirements.txt
└── README.md
- Python ≥ 3.9
- PyTorch ≥ 2.0
- CUDA ≥ 11.8
- Recommended GPU: NVIDIA RTX 4090 24GB or equivalent
git clone https://github.com/chengqianyu03/AAAI26-MSNet.git
cd AAAI26-MSNet
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
# Segment Anything
pip install git+https://github.com/facebookresearch/segment-anything.git

Please download the following checkpoints and place them under checkpoints/.
| File | Description | Link |
|---|---|---|
| sam_vit_h_4b8939.pth | SAM ViT-H backbone | Download |
| model.pth | LRM reflection estimator | LRM Repository |
| best.ckpt | Trained MSNet checkpoint | Google Drive |
Expected layout:
checkpoints/
├── sam_vit_h_4b8939.pth
├── model.pth
└── best.ckpt
MSNet estimates reflection cues on-the-fly using the built-in LRM reflection estimator.
For dataset preparation, only RGB images and binary glass masks are required.
Each .npz file contains:
| Key | Shape | Description |
|---|---|---|
| data | [1, 3, H, W] | RGB image |
| label | [1, 1, H, W] | Binary glass mask |
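As an illustration of this format, the sketch below packs one image/mask pair into an `.npz` file with the two keys and shapes listed above, then reads it back. The `uint8` dtype, the 0/1 mask encoding, the use of compression, and the helper name `pack_sample` are assumptions made for the example; `Data/Tonpz_new.py` may store values differently.

```python
import os
import tempfile
import numpy as np

def pack_sample(image_hwc, mask_hw):
    """Convert an (H, W, 3) RGB image and an (H, W) mask to the listed shapes."""
    data = np.transpose(image_hwc, (2, 0, 1))[None]      # [1, 3, H, W]
    label = (mask_hw > 0).astype(np.uint8)[None, None]   # [1, 1, H, W], values 0/1
    return data, label

# Synthetic stand-ins for a real photo and its glass annotation.
image = np.zeros((48, 64, 3), dtype=np.uint8)
mask = np.zeros((48, 64), dtype=np.uint8)
mask[10:30, 20:50] = 255

data, label = pack_sample(image, mask)
path = os.path.join(tempfile.mkdtemp(), "0.npz")
np.savez_compressed(path, data=data, label=label)

# Read the file back and confirm the keys and shapes.
with np.load(path) as sample:
    print(sample["data"].shape, sample["label"].shape)  # (1, 3, 48, 64) (1, 1, 48, 64)
```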
Edit the paths in Data/Tonpz_new.py, then run:
python Data/Tonpz_new.py

Expected dataset layout:
/path/to/dataset/
├── train/
│ ├── 0.npz
│ ├── 1.npz
│ └── ...
└── test/
├── 0.npz
└── ...
python train.py \
--data_dir /path/to/dataset \
--sam_checkpoint checkpoints/sam_vit_h_4b8939.pth \
--sam_model_type vit_h \
--lora_rank 512 \
--ft_dec \
--clip_lora_rank 128 \
--clip_lora_alpha 256 \
--reflection_estimator lrm \
--reflection_checkpoint checkpoints/model.pth \
--reflection_proc_size 256 \
--reflection_n_iters 3 \
--lr 2e-5 \
--weight_decay 5e-5 \
--max_epochs 50 \
--gpu 0 \
--ckpt_dir checkpoints/train

| Parameter | Default | Description |
|---|---|---|
| --lora_rank | 512 | SAM LoRA rank |
| --clip_lora_rank | 128 | CLIP LoRA rank |
| --clip_lora_alpha | 256 | CLIP LoRA scaling alpha |
| --clip_layer_idx | 10 | CLIP feature layer |
| --lr | 2e-5 | Base learning rate |
| --weight_decay | 5e-5 | Weight decay |
| --reflection_estimator | lrm | Reflection estimator |
| --reflection_proc_size | 256 | Reflection estimator input size |
| --reflection_n_iters | 3 | LRM iteration number |
| --max_epochs | 50 | Training epochs |
| --patience | 10 | Early stopping patience |
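To make the rank and alpha parameters concrete, here is a small NumPy sketch of the standard LoRA formulation, in which a frozen weight W is augmented by a low-rank update scaled by alpha / rank. With the CLIP defaults above (rank 128, alpha 256) that scale is 2.0. Whether this repository applies exactly this scaling is an assumption; the class name `LoRALinear` and the layer width of 768 are illustrative only.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        out_dim, in_dim = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                   # frozen, (out, in)
        self.a = rng.standard_normal((rank, in_dim)) * 0.01    # trainable, (rank, in)
        self.b = np.zeros((out_dim, rank))                     # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, in) -> (batch, out)
        return x @ self.weight.T + self.scale * (x @ self.a.T @ self.b.T)

# 768 is an arbitrary illustrative width, not a value from the repository.
weight = np.random.default_rng(1).standard_normal((768, 768))
layer = LoRALinear(weight, rank=128, alpha=256)
x = np.ones((2, 768))
print(layer.scale)                            # 2.0
print(np.allclose(layer(x), x @ weight.T))    # True: B starts at zero
```

Because B is initialised to zero, the adapted layer reproduces the pretrained layer exactly at the start of fine-tuning, and only the small A and B matrices need gradients.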
Run evaluation on the test split:
python test.py \
--data_dir /path/to/dataset \
--checkpoint_path checkpoints/best.ckpt \
--output_dir evaluation_results/

test.py performs multi-threshold evaluation by default.
Metrics: IoU, MAE, F_β (β=0.3), BER
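For reference, the four metrics can be computed from a probability map and a binary ground-truth mask as sketched below. This follows the definitions commonly used in glass and salient-object detection and is not copied from `test.py`. Several details are assumptions: MAE is taken on the raw probability map, BER is reported in percent, and the F-measure weight is exposed as `beta_sq` because much of the literature sets β² = 0.3, whereas the line above states β = 0.3. The function name `glass_metrics` is hypothetical.

```python
import numpy as np

def glass_metrics(prob, gt, threshold=0.5, beta_sq=0.3, eps=1e-8):
    """Return IoU, MAE, F-beta and BER (in percent) for one prediction.

    prob: float array in [0, 1]; gt: binary array of the same shape.
    """
    gt = gt.astype(bool)
    pred = prob >= threshold

    # Confusion-matrix counts at the chosen threshold.
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)

    iou = tp / (tp + fp + fn + eps)
    mae = np.mean(np.abs(prob - gt.astype(np.float64)))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f_beta = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall + eps)
    # Balanced error rate: mean of the miss rates on glass and non-glass pixels.
    ber = 100.0 * (1.0 - 0.5 * (tp / (tp + fn + eps) + tn / (tn + fp + eps)))
    return {"iou": iou, "mae": mae, "f_beta": f_beta, "ber": ber}

gt = np.zeros((4, 4), dtype=np.uint8)
gt[:2, :] = 1              # top half is glass
prob = np.zeros((4, 4))
prob[:2, :] = 0.9          # confident on the glass rows
prob[2, 0] = 0.8           # one false-positive pixel

scores = glass_metrics(prob, gt)
print({k: round(float(v), 3) for k, v in scores.items()})
```

Since `threshold` is a parameter, a multi-threshold evaluation can be approximated by calling the function over a range of thresholds and aggregating the results.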
Run inference on a single image:
python inference.py \
--input_image /path/to/photo.png \
--checkpoint_path checkpoints/best.ckpt \
--output_dir results/

Outputs:
results/
├── photo_prob.png # Probability map
├── photo_mask.png # Binary mask
└── photo_overlay.png # Overlay visualization
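The relationship between the three outputs can be sketched as follows: the probability map is scaled to 8-bit grayscale, the binary mask is a thresholded version of it, and the overlay tints the masked pixels of the input image. The 0.5 threshold, the green tint, the 50% opacity, the 0/255 mask encoding, and the function name `render_outputs` are assumptions for illustration; `inference.py` may use different settings.

```python
import numpy as np

def render_outputs(image, prob, threshold=0.5, color=(0, 255, 0), opacity=0.5):
    """Build uint8 arrays analogous to the three files listed above.

    image: (H, W, 3) uint8 RGB; prob: (H, W) float in [0, 1].
    """
    prob_img = np.round(prob * 255).astype(np.uint8)     # grayscale probability map
    mask = prob >= threshold
    mask_img = mask.astype(np.uint8) * 255               # binary mask as 0 / 255
    tint = np.array(color, dtype=np.float64)
    blended = (1 - opacity) * image + opacity * tint     # tinted copy of the image
    overlay = np.where(mask[..., None], blended, image).astype(np.uint8)
    return prob_img, mask_img, overlay

# Uniform gray image with a single predicted glass pixel.
image = np.full((3, 3, 3), 100, dtype=np.uint8)
prob = np.zeros((3, 3))
prob[1, 1] = 0.8

prob_img, mask_img, overlay = render_outputs(image, prob)
print(prob_img[1, 1], mask_img[1, 1], overlay[1, 1])
```

The returned arrays can be written to PNG files with any image library.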
| Method | IoU | MAE | F_β | BER |
|---|---|---|---|---|
| Mask2Former | 0.732 | 0.043 | 0.838 | 8.93 |
| GlassSemNet | 0.754 | 0.041 | 0.861 | 9.77 |
| SEEN-FT | 0.751 | 0.039 | 0.856 | 8.98 |
| MSNet | 0.817 | 0.027 | 0.892 | 6.09 |
| Method | IoU | MAE | F_β | BER |
|---|---|---|---|---|
| GlassSemNet | 0.902 | 0.059 | 0.942 | 4.67 |
| GhostingNet | 0.893 | 0.054 | 0.943 | 5.13 |
| MSNet | 0.915 | 0.043 | 0.955 | 4.17 |
| Method | IoU | MAE | F_β | BER |
|---|---|---|---|---|
| GlassSemNet | 0.854 | 0.068 | 0.903 | 5.69 |
| GhostingNet | 0.838 | 0.055 | 0.904 | 6.06 |
| MSNet | 0.878 | 0.042 | 0.916 | 4.69 |
@inproceedings{cheng2026msnet,
title = {Multi-Semantic Modelling for Glass Surface Detection in the Wild},
author = {Cheng, Qianyu and Guan, Huankang and Lau, Rynson W.H.},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
year = {2026}
}

This project builds upon the following excellent works:
This project is released for academic research purposes only.