Bachelor's Thesis: From Multi-View Images to Instance Diagrams: A Multi-View Detection Pipeline for Structural Relationship Inference in Small Assemblies
BRIO construction toys are assembled from discrete physical components (bolts, nuts, screws, plates, wheels, etc.) that connect through typed attachment points. Given a set of photographs of a completed construction, the goal is to automatically identify which components are present, localise them in 3D, infer their connections, and represent the result as a PlantUML instance diagram.
Construction
└── has one or more ConnectionConfiguration(s)
    └── has one or more Connection(s)
        └── is formed by exactly 2 Slot(s)
            └── belongs to a Component
| Term | Definition |
|---|---|
| Component | A physical BRIO part (e.g., bolt, nut, plate, wheel). Has one or more slots. |
| Slot | A typed attachment point on a component (e.g., opening, pin, thread). |
| Connection | A link formed by exactly two slots joining together. |
| Connection Configuration | A group of connections that share a common joint point. |
| Construction | The complete physical assembly. |
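The terms above map naturally onto a small set of record types. The following is a minimal sketch of that data model, not the project's actual code: the class layout, the field names (`component_id`, `part_code`, `kind`), and the choice of a `thread` slot on both example parts are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Slot:
    """A typed attachment point that belongs to exactly one component."""
    component_id: str  # hypothetical identifier of the owning component
    kind: str          # e.g. "opening", "pin", "thread"


@dataclass
class Component:
    """A physical part with one or more slots."""
    component_id: str
    part_code: str     # short part code such as "bo" or "nu"
    slots: List[Slot] = field(default_factory=list)


@dataclass(frozen=True)
class Connection:
    """A link formed by exactly two slots."""
    first: Slot
    second: Slot


@dataclass
class ConnectionConfiguration:
    """A group of connections that share a common joint point."""
    connections: List[Connection] = field(default_factory=list)


@dataclass
class Construction:
    """The complete assembly: its components and connection configurations."""
    components: List[Component] = field(default_factory=list)
    configurations: List[ConnectionConfiguration] = field(default_factory=list)


# Example: the smallest case, 2 components joined by 1 connection.
bolt = Component("bolt_1", "bo", [Slot("bolt_1", "thread")])
nut = Component("nut_1", "nu", [Slot("nut_1", "thread")])
joint = ConnectionConfiguration([Connection(bolt.slots[0], nut.slots[0])])
model = Construction([bolt, nut], [joint])
```

Making `Connection` hold exactly two `Slot` fields, rather than a list, encodes the "exactly 2 slots" rule in the type itself.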
The dataset consists of ~150 annotated BRIO construction samples. Each sample contains:
| File | Description |
|---|---|
| `Construction.jpg` | Photograph of the physical construction |
| `InstanceDiagramSN.puml` | Ground-truth PlantUML instance diagram |
| `Mapping.drawio` | Visual mapping between photo regions and components |
Multi-view images: ~78 photographs per sample at four elevation rings (30°, 45°, 60°, 90°) and 24 azimuth positions.
Complexity range:
- Smallest: 2 components, 1 connection
- Largest: 10+ components, 13+ connections
- Typical: 4–7 components, 3–9 connections
| Code | Part | Code | Part |
|---|---|---|---|
| `bo` | Bolt | `nu` | Nut |
| `pl` | Plug | `sl` | Sleeve |
| `wa` | Washer | `ti` | Tire |
| `no` | Nose connector | `rolo` | Long rod |
| `rome` | Medium rod | `rosm` | Short rod |
| `sclo` | Long screw | `scme` | Medium screw |
| `scsm` | Small screw | `whre` | Red wheel |
| `whwh` | White wheel | `blwo11` | Wooden block 1×1 |
| `blwo21` | Wooden block 2×1 | `plwo21` | Wooden plate 2×1 |
| `plwo31` | Wooden plate 3×1 | `plwo33` | Wooden plate 3×3 |
| `plwo53` | Wooden plate 5×3 | `plpl53` | Plastic plate 5×3 |
| `stwo3`–`stwo9` | Wooden straps (lengths 3–9) | `stpl5` | Plastic strap 5 |
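The block, plate, and strap codes appear to follow a regular pattern: a two-letter shape, a two-letter material, then the dimensions. The sketch below decodes that pattern. The pattern is inferred from the table only, and the function name `describe` and the lookup tables are hypothetical; the project's own `component_map.py` may work differently.

```python
import re
from typing import Optional

# Lookup tables inferred from the part-code table (an assumption).
SHAPES = {"bl": "block", "pl": "plate", "st": "strap"}
MATERIALS = {"wo": "Wooden", "pl": "Plastic"}
_CODE = re.compile(r"^(bl|pl|st)(wo|pl)(\d{1,2})$")


def describe(code: str) -> Optional[str]:
    """Decode a structured part code, or return None if it does not match."""
    match = _CODE.match(code)
    if match is None:
        return None
    shape, material, digits = match.groups()
    if shape == "st":
        # Straps carry a single length digit, e.g. "stwo7".
        if len(digits) != 1:
            return None
        size = digits
    else:
        # Blocks and plates carry two digits, e.g. "53" for 5×3.
        if len(digits) != 2:
            return None
        size = f"{digits[0]}×{digits[1]}"
    return f"{MATERIALS[material]} {SHAPES[shape]} {size}"


print(describe("plwo53"))  # a plate code
print(describe("stpl5"))   # a strap code
print(describe("bo"))      # not a structured code, so None
```

Short codes such as `bo` or `pl` (Plug) do not match the pattern and would still need a plain lookup table.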
The implemented system is a two-stage pipeline.
Full documentation: brio_pipeline/README.md
Stage 1, the slow annotation pipeline, runs once per sample to produce 3D ground-truth labels for training. Runtime is ~35 minutes per sample on GPU from a cold start and ~2 minutes with cached DUSt3R/SAM outputs.
~20 multi-view images (all 4 elevation rings, stride 4)
↓ CLAHE enhancement — boosts local contrast for white components
↓ DUSt3R (ViT-L) — uncalibrated multi-view 3D reconstruction
↓ SAM (ViT-B) — automatic mask generation (top-N per image)
↓ Back-projection — SAM masks × pts3d → ~20N raw 3D clouds
↓ Ward clustering + sigma cleanup → N instance clouds
↓ Visual classifier (MobileNetV3-small) — majority-votes class per cluster
↓ Hungarian assignment (visual + colour cost) → class label per cloud
outputs/run_NNN_YYYYMMDD_HHMM/sample_N/results.json
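One step in the flow above, the per-cluster majority vote, can be sketched in a few lines. This is an illustration rather than the pipeline's code: it assumes the visual classifier has already produced one predicted part code per view for a given cluster, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(predictions: List[str]) -> Tuple[str, float]:
    """Return the most frequent label and its share of the votes.

    Ties are resolved in favour of the label that was seen first.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)


# Per-view predictions for one cluster (made-up example values).
votes = ["whre", "whre", "whwh", "whre"]
label, share = majority_vote(votes)
print(label, share)
```

The vote share is returned alongside the label because it gives a cheap confidence signal that a later assignment step could use as a cost.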
Outputs feed directly into Stage 2 as YOLO training labels.
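For reference, a YOLO detection label is a text line of the form `class_id x_center y_center width height`, with all four box values normalised by the image size. The sketch below formats one such line. It assumes a pixel-space bounding box is already available for each component in each image; how `label_exporter.py` derives those boxes is not shown here, and the function name is hypothetical.

```python
from typing import Tuple


def yolo_label_line(
    class_id: int,
    box: Tuple[float, float, float, float],
    image_size: Tuple[int, int],
) -> str:
    """Format one YOLO label line from a pixel box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    # Normalise the centre and the extent to the [0, 1] range.
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


line = yolo_label_line(3, (100, 200, 300, 400), (1000, 800))
print(line)
```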
Stage 2, the fast inference pipeline, is trained on Stage 1 labels. Inference takes under 2 seconds per sample on GPU.
Training (once):
label_exporter.py → YOLO-format dataset from Stage 1 outputs
train.py → fine-tunes YOLOv8n from COCO pretrained weights
Calibration (once):
calibrator.py → fixed camera rig from DUSt3R poses (normalised, averaged)
Inference:
~20 images → YOLOv8 detection → DLT triangulation → connection inference → PlantUML
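The last inference step, writing the result as PlantUML, could look roughly like the sketch below. It uses standard PlantUML object-diagram syntax (`object "label" as alias` and `a -- b`), but the exact layout of the ground-truth `InstanceDiagramSN.puml` files is not reproduced here. The function name `to_plantuml` and the instance names are assumptions.

```python
from typing import Dict, List, Tuple


def to_plantuml(
    components: Dict[str, str],
    connections: List[Tuple[str, str]],
) -> str:
    """Render components (instance name -> part code) and links as PlantUML."""
    lines = ["@startuml"]
    for name, part_code in components.items():
        # The instance name doubles as the alias, so it must be identifier-like.
        lines.append(f'object "{name} : {part_code}" as {name}')
    for first, second in connections:
        lines.append(f"{first} -- {second}")
    lines.append("@enduml")
    return "\n".join(lines)


diagram = to_plantuml(
    {"bolt_1": "bo", "nut_1": "nu"},
    [("bolt_1", "nut_1")],
)
print(diagram)
```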
All phases are run from brio_pipeline/ with short commands:
| Command | What it does |
|---|---|
| `./slow.sh 113 114 115` | Annotate samples with the slow pipeline |
| `./train_classifier.sh` | Train the component visual classifier (run once before slow pipeline) |
| `./labels.sh` | Export YOLO labels from completed samples |
| `./calibrate.sh 113` | Build fixed camera rig calibration |
| `./train.sh` | Train YOLOv8n |
| `./infer.sh 120` | Run fast inference on a sample |
| `./visualize.sh 113` | 3D scatter plot of instance clouds |
Logs are written automatically on every run.
00-project/
├── README.md                          ← this file
├── .gitignore
│
├── brio_pipeline/                     ← both pipelines + launcher scripts
│   ├── README.md                      ← full pipeline documentation
│   ├── slow.sh / train_classifier.sh / infer.sh / train.sh / ...
│   ├── brio_3d_pipeline/              ← slow annotation pipeline (source code)
│   │   ├── pipeline.py
│   │   ├── config.py
│   │   ├── backprojector.py
│   │   ├── classifier.py
│   │   ├── component_classifier.py    ← MobileNetV3-small visual classifier
│   │   ├── component_map.py
│   │   ├── component_classifier.pth   ← trained weights (git-ignored)
│   │   └── ...
│   └── brio_fast_pipeline/            ← fast inference pipeline
│       ├── infer.py
│       ├── train.py
│       └── ...
│
└── sam_trials/                        ← earlier SAM integration experiments
Note: `brio_pipeline/brio_3d_pipeline/outputs/` (DUSt3R/SAM caches, ~37 MB per sample) and `component_classifier.pth` are excluded from git via `.gitignore`. Run `./train_classifier.sh` once to produce the weights, then `./slow.sh <sample_ids>` to regenerate outputs.
- Python 3.10, conda env `brio-3d`
- PyTorch + CUDA 12.4 (`cu124`) on RTX 2070 Super (8 GB)
- DUSt3R (ViT-L, `07-dust3r/`), SAM ViT-B, YOLOv8n (ultralytics)
- MobileNetV3-small (torchvision, ImageNet pretrained)
- WSL2 on Windows 11
Setup instructions: brio_pipeline/README.md § Environment Setup
- Wang et al. (2024) — DUSt3R: Geometric 3D Vision Made Easy
- Kirillov et al. (2023) — Segment Anything
- Jocher et al. (2023) — Ultralytics YOLOv8
- Liu et al. (2022) — PETR: Position Embedding Transformation for Multi-View 3D Object Detection
- Howard et al. (2019) — Searching for MobileNetV3