Thanks for your inspiring work.
The DINOv2Encoder and DINOv2Decoder use pretrained ViT models, which limits the input resolution to 256x256. How can I allow a larger resolution at inference time?
- If I split the input image into 256x256 patches, the reconstruction has block artifacts (a rough sketch of the tiling I use is below).
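This is a minimal sketch of the naive non-overlapping tiling I tried; it assumes the encoder maps an image tile to latent tokens and the decoder maps them back to pixels (the actual forward signatures in the repo may differ):

```python
import torch

@torch.no_grad()
def reconstruct_tiled(image, encoder, decoder, tile=256):
    """Naive non-overlapping tiling; image is (1, 3, H, W) with H, W divisible by `tile`."""
    _, _, H, W = image.shape
    out = torch.zeros_like(image)
    for top in range(0, H, tile):
        for left in range(0, W, tile):
            patch = image[:, :, top:top + tile, left:left + tile]
            latents = encoder(patch)   # assumed: encoder(tile) -> latent tokens
            recon = decoder(latents)   # assumed: decoder(latents) -> reconstructed tile
            out[:, :, top:top + tile, left:left + tile] = recon
    return out  # visible seams (block artifacts) appear at tile boundaries
```

For reference, the encoder and decoder are constructed as follows: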
```python
encoder = DINOv2Encoder(model_name='vit_small_patch14_dinov2.lvd142m',
                        model_kwargs={'img_size': 256, 'patch_size': 16, 'drop_path_rate': 0.0},
                        tuning_method='lat_lora', tuning_kwargs={'r': 8},
                        num_latent_tokens=32)
decoder = DINOv2Decoder(model_name='vit_small_patch14_dinov2.lvd142m',
                        model_kwargs={'img_size': 256, 'patch_size': 16, 'drop_path_rate': 0.0},
                        tuning_method='full', tuning_kwargs={'r': 8},
                        num_latent_tokens=32, use_rope=True)
```
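For concreteness, what I would like to be able to do is something like the following (hypothetical; I am not sure the pretrained weights, positional embeddings, or the fixed number of latent tokens support a larger `img_size`):

```python
# Hypothetical: build the encoder for 512x512 inference by only changing img_size.
# I assume the positional embeddings would need interpolation from the 256 setup,
# and I am unsure how this interacts with the fixed num_latent_tokens.
encoder_512 = DINOv2Encoder(model_name='vit_small_patch14_dinov2.lvd142m',
                            model_kwargs={'img_size': 512, 'patch_size': 16, 'drop_path_rate': 0.0},
                            tuning_method='lat_lora', tuning_kwargs={'r': 8},
                            num_latent_tokens=32)
```

Would something like this be expected to work with the released checkpoints, or is additional fine-tuning at the larger resolution required?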