Flux is pretty good at text, but our autocaptioner doesn't ask the VLM to transcribe text from training images. I tested this by recaptioning a dataset with fofr/batch-image-captioning using a modified version of this prompt, and got much improved legibility and steerability in the resulting fine-tune. Part of the gain is simply that GPT-4o is a better VLM than LLaVA-13b, but I expect the prompt change alone to improve the autocaptioner's outputs.
Prediction examples: before | after
If we want to improve this further, we could use BLIP-3 as a captioner instead of LLaVA. See these results for a test image:
LLaVA 13b (old prompt)
LLaVA 13b (new prompt)
Molmo-7b
BLIP-3
Important

Improves image captioning by updating PROMPT in caption.py to include transcription of text from images in an optional final sentence, describing its styling and placement.

This description was created for ff903c9. It will automatically update as commits are pushed.
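A minimal sketch of what the updated prompt in caption.py might look like. The base wording, variable names, and the exact phrasing of the transcription sentence are assumptions for illustration; only the idea of appending an optional text-transcription instruction to PROMPT comes from this PR.

```python
# Hypothetical sketch of the updated captioning prompt in caption.py.
# The base instructions and exact wording are assumptions, not the
# actual file contents.
BASE_PROMPT = (
    "Write a detailed caption for this image, describing the subject, "
    "style, and composition."
)

# New optional final sentence: ask the VLM to transcribe any visible
# text verbatim and describe its styling and placement.
TRANSCRIPTION_SUFFIX = (
    " If the image contains any text, end the caption with one sentence "
    "that transcribes the text verbatim in quotes and describes its "
    "styling and placement."
)

PROMPT = BASE_PROMPT + TRANSCRIPTION_SUFFIX
```

Keeping the transcription instruction as a single appended sentence means captions for text-free images are unchanged, while images with text get the extra legibility signal.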