AutoCaption is a simple script that generates captions for images using the llava-v1.5-13b vision model.
To get started, clone the repository and install the necessary dependencies:
git clone --recurse-submodules -j8 https://github.com/akiselevprivate/AutoCaption.git
cd AutoCaption
pip install --upgrade pip # Enable PEP 660 support
pip install -e LLaVA/ # Install LLaVA module
pip install -r requirements.txt # Install other dependencies
huggingface-cli login # login using key for weights download, skip if env variable set
Once you have everything installed, you can use the script to generate captions for all images in a specified folder.
python main.py <image_folder> --prefix "<prefix>" --suffix "<suffix>" --encoder_prompt "<encoder_prompt>"
- <image_folder>: Path to the folder containing the images you want to caption.
- <prefix>: Optional prefix that will be added to the caption (default is empty).
- <suffix>: Optional suffix that will be added to the caption (default is empty).
- <encoder_prompt>: Optional prompt addition for the encode model (default is empty).
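As a rough illustration of the command-line interface described here, the arguments could be declared with Python's standard `argparse` module. This is a minimal sketch, not the actual contents of main.py: the function name `build_parser` and the help strings are assumptions, and only the argument names and empty defaults come from the description above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the documented CLI; the real main.py may differ.
    parser = argparse.ArgumentParser(
        description="Generate captions for all images in a folder."
    )
    # Required positional argument: the folder to scan for images.
    parser.add_argument("image_folder", help="Path to the folder containing the images.")
    # Optional arguments, each defaulting to an empty string as documented.
    parser.add_argument("--prefix", default="", help="Text added before each caption.")
    parser.add_argument("--suffix", default="", help="Text added after each caption.")
    parser.add_argument("--encoder_prompt", default="", help="Optional prompt addition.")
    return parser


# Parse a sample argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["images", "--prefix", "Photo of [trigger], "])
print(args.image_folder, repr(args.prefix), repr(args.suffix))
```

Omitted options fall back to their empty-string defaults, so `args.suffix` and `args.encoder_prompt` are both `""` in this sample call.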
python main.py images --prefix "Photo of [trigger], " --encoder_prompt "for a t5 text encoder"
This will generate captions for all images in the images/ folder, and each caption will start with "Photo of [trigger], " followed by the description of the image generated by the model.
- Image Folder: The script reads the images from the folder specified.
- Captioning: The script uses a pre-trained model (llava-v1.5-13b) to generate captions.
- Prefix/Suffix: You can customize the captions with a prefix and/or suffix.
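The three steps above can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the script's actual code: `caption_folder`, `fake_describe`, and the `IMAGE_EXTENSIONS` set are hypothetical names, and the model call is replaced by a placeholder so the example runs without downloading any weights.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumed set of image extensions; the actual script may accept others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def caption_folder(
    image_folder: str,
    describe: Callable[[Path], str],
    prefix: str = "",
    suffix: str = "",
) -> Dict[str, str]:
    """Return {file name: caption} for every image directly inside image_folder."""
    captions: Dict[str, str] = {}
    for path in sorted(Path(image_folder).iterdir()):
        # Skip subfolders and files that do not look like images.
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            # The final caption is prefix + model description + suffix.
            captions[path.name] = f"{prefix}{describe(path)}{suffix}"
    return captions


def fake_describe(path: Path) -> str:
    # Placeholder standing in for the llava-v1.5-13b call (hypothetical).
    return "a placeholder description"
```

Calling `caption_folder("images", fake_describe, prefix="Photo of [trigger], ")` would map each image file name in images/ to a caption beginning with the prefix. Passing the description function as a parameter keeps the folder handling separate from the model, which makes the flow easy to try without a GPU.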
This project is licensed under the MIT License.