Summary
I would like to propose adding support for Visual Prompt Tuning to the prompt-tuning library. This technique extends prompt tuning to vision and vision-language models, enabling parameter-efficient adaptation of visual encoders and multimodal models.
Motivation
Visual prompt tuning brings the benefits of parameter-efficient fine-tuning to computer vision and multimodal domains. This is particularly valuable for adapting large vision-language models (such as CLIP) to downstream tasks: the pretrained weights stay frozen and only a small set of prompt parameters is trained, instead of fine-tuning the entire model.
Proposed Implementation
The implementation would include:
Pixel-space prompts: Add trainable prompts as image borders, corners, or patches
Patch-level prompts: Insert prompts at the patch embedding level for Vision Transformers
Deep visual prompts: Add prompts at multiple layers of the visual encoder
Vision-language coordination: Coordinate visual and textual prompts for multimodal models
Adaptive visual prompts: Condition prompts on image content
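To make the first two strategies concrete, here is a minimal sketch in plain JAX rather than the library's T5X/Flaxformer modules. The function names (`border_mask`, `apply_border_prompt`, `prepend_patch_prompts`), the tensor shapes, and the border width are assumptions chosen for illustration; they are not taken from the prototype or from the library's API. Class-token handling is left out for brevity.

```python
import jax
import jax.numpy as jnp


def border_mask(height, width, pad):
    """Returns a (height, width, 1) mask that is 1 on a `pad`-pixel border."""
    mask = jnp.ones((height, width, 1))
    return mask.at[pad:height - pad, pad:width - pad, :].set(0.0)


def apply_border_prompt(images, prompt, pad):
    """Pixel-space prompt: adds a trainable perturbation on the image border.

    images: (batch, height, width, channels); prompt: (height, width, channels).
    """
    _, height, width, _ = images.shape
    return images + border_mask(height, width, pad) * prompt


def prepend_patch_prompts(patch_embeddings, prompts):
    """Patch-level prompt: prepends trainable tokens to the patch sequence.

    patch_embeddings: (batch, num_patches, dim); prompts: (num_prompts, dim).
    """
    batch = patch_embeddings.shape[0]
    tiled = jnp.broadcast_to(prompts, (batch,) + prompts.shape)
    return jnp.concatenate([tiled, patch_embeddings], axis=1)


# Hypothetical shapes: two 32x32 RGB images, 16 patches of width 64.
key_img, key_tok = jax.random.split(jax.random.PRNGKey(0))
images = jnp.zeros((2, 32, 32, 3))
pixel_prompt = jax.random.normal(key_img, (32, 32, 3)) * 0.01
prompted_images = apply_border_prompt(images, pixel_prompt, pad=4)

patches = jnp.zeros((2, 16, 64))
token_prompts = jax.random.normal(key_tok, (5, 64)) * 0.02
prompted_tokens = prepend_patch_prompts(patches, token_prompts)
```

In both cases the image or patch sequence keeps its original content; only `pixel_prompt` or `token_prompts` would be updated during training. The border prompt leaves the input shape unchanged, while the patch-level prompt lengthens the sequence by the number of prompt tokens.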
Key Features
Multiple visual prompt insertion strategies
Support for Vision Transformers (ViT)
Coordination with textual prompts for VL models
Adaptive prompts based on image features
Integration with existing T5X/Flaxformer infrastructure
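The deep-prompt strategy and the parameter-efficiency claim can be sketched together. The example below is a toy, not the prototype: `toy_layer` is a hypothetical stand-in for a frozen transformer block (mean-based token mixing replaces attention), and all names, shapes, and the learning rate are assumptions. It shows each layer receiving its own prompts, with gradients taken with respect to the prompts only.

```python
import jax
import jax.numpy as jnp


def toy_layer(tokens, weight):
    """Stand-in for a frozen block: crude token mixing, then a projection."""
    mixed = tokens + tokens.mean(axis=1, keepdims=True)
    return jnp.tanh(mixed @ weight)


def encode_with_deep_prompts(patch_embeddings, layer_weights, deep_prompts):
    """Runs the toy encoder, inserting fresh prompts before every layer.

    patch_embeddings: (batch, num_patches, dim)
    layer_weights: (num_layers, dim, dim), frozen
    deep_prompts: (num_layers, num_prompts, dim), trainable
    """
    batch = patch_embeddings.shape[0]
    hidden = patch_embeddings
    for weight, prompts in zip(layer_weights, deep_prompts):
        tiled = jnp.broadcast_to(prompts, (batch,) + prompts.shape)
        out = toy_layer(jnp.concatenate([tiled, hidden], axis=1), weight)
        # Drop the prompt positions; the next layer gets its own prompts.
        hidden = out[:, prompts.shape[0]:, :]
    return hidden.mean(axis=1)  # pooled image feature, (batch, dim)


def loss_fn(deep_prompts, layer_weights, patch_embeddings, targets):
    features = encode_with_deep_prompts(
        patch_embeddings, layer_weights, deep_prompts)
    return jnp.mean((features - targets) ** 2)


keys = jax.random.split(jax.random.PRNGKey(0), 4)
num_layers, num_prompts, dim = 3, 4, 8
layer_weights = jax.random.normal(keys[0], (num_layers, dim, dim)) * 0.3
deep_prompts = jax.random.normal(keys[1], (num_layers, num_prompts, dim)) * 0.1
patches = jax.random.normal(keys[2], (2, 6, dim))
targets = jax.random.normal(keys[3], (2, dim))

# Differentiate with respect to the first argument only, so the encoder
# weights receive no update.
loss, prompt_grads = jax.value_and_grad(loss_fn)(
    deep_prompts, layer_weights, patches, targets)
updated_prompts = deep_prompts - 0.1 * prompt_grads
```

Here the trainable state is `num_layers * num_prompts * dim` values, independent of the encoder's size, which is the property that makes the approach attractive for large frozen backbones.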
References
Zhou et al. (2022). "Learning to Prompt for Vision-Language Models." arXiv:2109.01134
Additional Context
I have implemented a prototype of this technique that follows the library's design patterns and coding standards. The implementation is available in my fork at https://github.com/hwilner/prompt-tuning
Would the maintainers be interested in this enhancement? I'm happy to discuss the design and implementation details further.