LoRA Training Evaluation: BLIP vs Human Captioning



Samarth Reddy (3210635)
Digital Futures 2024
OCAD University

Introduction

This project explores training LoRA models within the Stable Diffusion framework to generate images from text descriptions. The focus is on understanding how the captions prepared for training affect the model's performance and its ability to reproduce results accurately. This research addresses the growing interest in AI-generated imagery and its applications in fields such as digital art, content creation, and education (Li et al.).

Importance

Advancements in generative AI models such as Stable Diffusion 1.5, 2.0, SDXL, Turbo, and 3.0 have enabled high-quality image generation from text prompts. Caption-based image synthesis and captioning models such as BLIP (Bootstrapping Language-Image Pre-training) provide a foundation for researching the reproducibility and quality of images generated from custom training datasets. This work aims to advance AI's capability to produce contextually appropriate visual content (Li et al.).

LoRA Training

LoRA (Low-Rank Adaptation) is a technique that freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks (Hu et al.).
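As a conceptual illustration of this idea (separate from the Kohya SS tooling used later in this project), the PyTorch sketch below wraps a frozen linear layer with a trainable rank-decomposition update; the layer size and rank are arbitrary example values.

import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen Linear layer and add a trainable low-rank update (Hu et al.)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the pre-trained weights
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)    # down-projection
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)   # up-projection
        nn.init.zeros_(self.lora_B.weight)          # B starts at zero, so training begins as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus scaled low-rank path: W0 x + (alpha/r) * B A x
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

# A 768x768 projection gains only 2 * 768 * 4 = 6144 trainable weights at rank 4.
layer = LoRALinear(nn.Linear(768, 768), rank=4)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 6144

Because the low-rank update can be merged back into the frozen weight after training, LoRA adds no extra inference latency, which is relevant to the Performance Speed metric used in the evaluation below.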

SDXL 1.0
https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0

Dataset

This dataset was chosen for the rich visual detail contained in each photograph, and it is predominantly used as training data for GAN models.

Source: https://www.kaggle.com/datasets/marcinrutecki/old-photos

(“Old Photos,” n.d.)
CC0: Public Domain

Text Encoding

When a prompt is submitted to the system, it undergoes tokenization and text encoding. Tokenization breaks the prompt into smaller components so the system can analyze the input mathematically, and each component is assigned a unique ID that represents it in a form the model can interpret. During training, the model learns the relationships between these tokens from the captions paired with the training images. When a prompt is later processed, the model uses its trained weights to map the connections between tokens to the corresponding output, whether text or visual content. Although this process continues to evolve, it is currently fundamental to how models manage and interpret prompt information (Invoke 2024).
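For illustration, the sketch below tokenizes a caption-style prompt with the CLIP tokenizer that Stable Diffusion's text encoder builds on (SDXL pairs it with a second text encoder); the prompt string is borrowed from the examples later in this document.

from transformers import CLIPTokenizer

# Tokenizer used by Stable Diffusion's CLIP text encoder (one of SDXL's two encoders).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "qwox woman, an old woman in a black dress and bonnet"
tokens = tokenizer.tokenize(prompt)             # sub-word pieces of the prompt
ids = tokenizer.convert_tokens_to_ids(tokens)   # the unique IDs the model actually sees

for token, token_id in zip(tokens, ids):
    print(f"{token:>18s} -> {token_id}")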

BLIP Captioning

BLIP (Bootstrapping Language-Image Pre-training) is a vision-language pre-training framework that enables a wider range of downstream tasks than existing methods. BLIP introduces a multimodal mixture of encoder-decoder architecture for effective multi-task pre-training and flexible transfer learning. It also proposes a new dataset bootstrapping method called Captioning and Filtering (CapFilt) for learning from noisy image-text pairs, where a captioner generates synthetic captions and a filter removes noisy ones (Li et al.).

https://arxiv.org/pdf/2201.12086.pdf
https://github.com/salesforce/BLIP
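For reference, a single image can be captioned through the Hugging Face transformers port of BLIP roughly as follows; the image file name is a placeholder rather than an actual file from this project's dataset.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("old_photo_01.jpg").convert("RGB")   # placeholder file name
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))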

Training BLIP Captioned Model

To train a LoRA model that uses BLIP captions, the following steps are taken:

  1. Collect and curate a focused dataset of images
  2. Generate captions for the dataset automatically using BLIP (a sketch of this step follows the list)
  3. Implement LoRA training using Kohya SS in the Runpod secure server
  4. Evaluate the trained model’s performance through systematic testing with various prompts and captions
  5. Document the entire process comprehensively (Li et al.).
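A sketch of step 2 follows. It batch-captions an image folder and writes each caption to a same-named .txt sidecar file, the convention Kohya SS reads during LoRA training; the folder name and trigger token are illustrative assumptions rather than the exact values used in this project.

from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

dataset_dir = Path("train/10_qwox woman")   # example Kohya-style image folder
trigger = "qwox woman"                      # trigger token from the prompt example below

for img_path in sorted(dataset_dir.glob("*.jpg")):
    image = Image.open(img_path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    caption = processor.decode(model.generate(**inputs, max_new_tokens=40)[0],
                               skip_special_tokens=True)
    # Kohya SS reads a same-named .txt file as the caption for each training image.
    img_path.with_suffix(".txt").write_text(f"{trigger}, {caption}")
    print(img_path.name, "->", caption)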

Example

qwox woman, an old woman in a black dress and bonnet <lora:qwox woman:1>
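The <lora:qwox woman:1> tag above is web-UI syntax for activating the trained LoRA at full strength. With the diffusers library, the equivalent is to load the LoRA weights explicitly, as in this sketch (the .safetensors path is a placeholder):

import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL 1.0 base model and attach the trained LoRA weights.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("output/qwox_woman_blip.safetensors")  # placeholder path

prompt = "qwox woman, an old woman in a black dress and bonnet"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("qwox_woman_sample.png")

Kohya SS exports LoRAs as .safetensors files, which recent diffusers versions can load in this way.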

Human Captioning

Human captioning involves manually writing descriptive captions for the training images, as opposed to using automatically generated captions from models like BLIP. The goal is to provide high-quality, contextually rich annotations that can improve the model’s understanding and generation of detailed visual content (Li et al.).

Training Human Captioned Model

The human captioned model training pipeline is similar to the BLIP pipeline, with the key difference being the caption generation step. Instead of using BLIP, captions are written manually by human annotators.

The pipeline includes:

  1. Collecting and curating the image dataset
  2. Manually writing captions for each image
  3. Implementing LoRA training using Kohya SS in the Runpod secure server
  4. Evaluating model performance
  5. Documenting the process (Li et al.).

Example

ohwx woman, A portrait of a stern-looking woman from the mid-19th century, wearing a dark dress and a white bonnet with lace trim, conveying a sense of solemnity and strict traditionalism.<lora:ohwx woman:1>

Evaluation

The evaluation will assess the LoRA models on the following parameters:

  1. Caption Tokens: The number of tokens in the generated captions will be analyzed to understand their impact on model performance. More diverse captions are expected to provide greater benefits (Hu et al.). A token-counting sketch follows the comparison table below.
  2. Training Duration: The time required to train the models using the Kohya SS implementation of LoRA will be measured to assess efficiency.
  3. Prompt Tokens: Various prompts will be used to systematically test the model’s ability to generate images that align with the captions used during training.
  4. Performance Speed: The inference speed of the trained models will be evaluated to ensure no additional latency is introduced compared to fine-tuned models (Hu et al.).
  5. Reproducibility: Detailed documentation of the datasets, model architectures, hyperparameters, and training procedures will be provided to ensure reproducibility of the results (Li et al.).

Metric               BLIP Captions    Human Captions
Caption Tokens       Low              High
Training Duration    2 hours          4 hours
Prompt Tokens        Low              High
Performance Speed    100%             100%
Reproducibility      High             Medium
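As a rough way to compare caption token counts between the two caption sets, each set's .txt files can be run through the same CLIP tokenizer used at training time; the folder names below mirror the repository layout and are assumed to be local copies.

from pathlib import Path
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def average_tokens(caption_dir: str) -> float:
    """Average CLIP token count across all .txt caption files in a folder."""
    counts = [len(tokenizer.tokenize(p.read_text())) for p in Path(caption_dir).glob("*.txt")]
    return sum(counts) / max(len(counts), 1)

# Assumed local copies of the two caption sets from the repository.
print("BLIP captions:", average_tokens("Blip Captions"))
print("Human captions:", average_tokens("Human Captions"))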

Server Configuration
RunPod Secure Server
1 RTX 4090
24 GB VRAM, 62 GB RAM
500 GB Pod Volume

RunPod Secure Server
1 RTX 3090
24 GB VRAM
300 GB Pod Volume

Effort Impact Graph

Results

The BLIP captioned model used a dataset of 11 images. The captions had a low token count, and training took 2 hours. Prompt tokens were also low. The model achieved 100% performance speed and high reproducibility. Overall, the effort required for the BLIP captioning approach was low while the impact was high.

In contrast, the human captioned model also used 11 images, but the captions had a high token count. Training duration was longer at 4 hours, and prompt tokens were high. Like the BLIP model, 100% performance speed was achieved. However, reproducibility was only medium.

Comparing the two approaches, human captioning required significantly more effort in writing the captions and longer training time. Despite this high effort, the overall impact was medium to low. BLIP captioning was much more efficient, needing less effort in dataset preparation and training while delivering high impact.

BLIP captioning provides an advantage over human captioning in terms of the effort-to-impact ratio. The automated caption generation enables the creation of rich text descriptions with minimal manual work, leading to strong performance in the trained models (Li et al.).

Recommendations

This study provides valuable insights into the impact of captioning methods on LoRA training for Stable Diffusion models. However, there are several areas that warrant further research:

  • Bias in Human Knowledge: The quality and diversity of human-written captions may be limited by the annotators’ knowledge and familiarity with certain objects, ethnicities, or styles. Future work should explore methods to mitigate potential biases, such as using more diverse annotator pools or incorporating external knowledge sources to enrich the captions.
  • Dataset Sample Size: Due to computational constraints, this study trained the models on a small dataset of 11 images. To assess the scalability and robustness of the findings, it is crucial to conduct experiments with larger datasets. Increasing the sample size would provide a more comprehensive understanding of how the captioning methods impact model performance and help identify potential fluctuations or limitations.
  • Streamlining Training Procedures: The complexity of model training on overly technical platforms necessitates meticulous planning and organization. Future studies should develop and adhere to structured protocols, such as comprehensive checklists, to ensure that all critical steps are followed without omission. This approach minimizes the risk of errors that could compromise the training outcomes and enhances the efficiency and reproducibility of the process.
  • Diverse Participant Inclusion: Engaging participants with a wide range of backgrounds and experience levels in AI prompting can significantly enrich the evaluation of models trained on scaled datasets. This inclusive testing approach is likely to uncover nuanced insights into the models’ performance across different user demographics, thereby offering a more holistic view of their applicability and limitations.

Conclusion

The project embarked on training an AI model for image generation, presenting a blend of challenges and discoveries. Despite computational and time constraints, the journey revealed unexpected findings, setting the stage for future exploration.

Key takeaways include:

  • The process introduced me to new training methods and platforms, enriching my understanding of AI-based image generation.
  • Contrary to the expectation that the model trained on human-written captions would perform better, the actual outcome demonstrated the opposite, highlighting the value of empirical testing in AI model training.
  • The project has ignited a desire to further investigate checkpoint models and larger datasets, as well as to deepen my understanding of biases within AI and machine learning for upcoming endeavors.

Links

Presentation – https://pitch.com/v/ai-lora-training-evaluation-blip-vs-human-captioning-n5dbv8

Github – https://github.com/calluxpore/LoRA-Training-Evaluation-BLIP-vs-Human-Captioning

Blip Captions – https://github.com/calluxpore/LoRA-Training-Evaluation-BLIP-vs-Human-Captioning/tree/main/Blip%20Captions

Human Captions – https://github.com/calluxpore/LoRA-Training-Evaluation-BLIP-vs-Human-Captioning/tree/main/Human%20Captions

LoRA Models – BLIP Captioned – https://1drv.ms/f/s!AnfY20aNHR6Zpd0eg8_-DtXPSPJx_w?e=6Yd7kS

LoRA Models – Human Captioned – https://1drv.ms/f/s!AnfY20aNHR6Zpd0dvQqw7_9Fs2ECQw?e=2AJkLa

Workflow Videos –

References

Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv. https://doi.org/10.48550/arXiv.2106.09685.

Invoke, dir. 2024. Creating Embeddings and Concept Models with Invoke Training – Textual Inversion & LoRAs. https://www.youtube.com/watch?v=OZIz2vvtlM4.

Li, Junnan, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. “BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation.” arXiv. https://doi.org/10.48550/arXiv.2201.12086.

“Old Photos.” n.d. Accessed April 1, 2024. https://www.kaggle.com/datasets/marcinrutecki/old-photos.

