- Built a multimodal storybook generation system that turns images into narrative PDFs
- Combined BLIP image captioning with fine-tuned KoT5 for scene-level story generation
- Processed 50,001 images with structured metadata for training and evaluation
- Delivered an end-to-end pipeline with a Streamlit demo and PDF export
This repository presents a practical exploration of multimodal narrative generation, focusing on how visual information can be transformed into coherent, scene-level stories. Rather than treating image-to-text as a single-step task, the project decomposes storytelling into captioning, structured context integration, and Transformer-based text generation.
TalesRunner emphasizes pipeline design and system integration, demonstrating how pre-trained vision–language models and fine-tuned language models can be combined to produce user-facing, end-to-end AI applications.
- Overview
- Project Timeline
- Key Features
- System Architecture
- Implementation Details
- Model Training
- Demo
- Team
TalesRunner is a full-stack AI project that transforms visual input into text-based stories. Users upload images (up to 10), optionally provide additional scene information, and the system automatically:
- Generates captions using BLIP
- Merges captions with structured metadata
- Produces narrative paragraphs using a fine-tuned KoT5 language model
- Compiles images + text into a PDF storybook
The project focuses on building a practical multimodal pipeline using pre-trained models, fine-tuned LLMs, and a user-friendly demo interface.
Jan–Feb 2025 (Jan 11 – Feb 15, approx. 5 weeks)
| Week | Period | Focus & Milestones |
|---|---|---|
| 1 | Jan 11 – Jan 14 | Project scoping, task definition, first ideation |
| 2 | Jan 16 – Jan 19 | Second ideation, system design refinement |
| 3 | Jan 19 – Jan 26 | Dataset construction, baseline LM review |
| 4 | Jan 27 – Feb 3 | Dataset finalization, KoT5 fine-tuning |
| 5 | Feb 3 – Feb 10 | KoT5 fine-tuning, inference pipeline implementation |
| 6 | Feb 10 – Feb 15 | Inference demo, final integration and project wrap-up |
- Multimodal story generation pipeline combining BLIP captions and KoT5 text generation
- Structured metadata extraction from AI Hub annotations
- Custom input format with special tokens to guide narrative generation
- Fine-tuned KoT5 model with Bayesian hyperparameter optimization
- Streamlit demo enabling interactive storybook creation
- PDF export for final story compilation
- Image Upload: the user provides 1–10 images in order.
- Captioning (BLIP): BLIP generates an initial natural-language caption for each image.
- Metadata Integration: user-provided fields and extracted annotations are combined with the BLIP captions.
- Story Generation (KoT5): the fine-tuned KoT5 model outputs a paragraph for each scene.
- PDF Assembly: images and story paragraphs are compiled into a downloadable PDF (see the sketch after this list).
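For orientation, the stages above can be sketched as a simple sequential loop. The helper functions below are hypothetical stand-ins, not the repository's actual modules.

```python
# Minimal, illustrative sketch of the pipeline flow; the helper functions here
# are hypothetical stand-ins, not the repository's actual modules.
from typing import Dict, List, Tuple

def caption_image(path: str) -> str:
    # In the real system, a BLIP model produces this caption.
    return f"a placeholder caption for {path}"

def build_model_input(caption: str, meta: Dict[str, str]) -> str:
    # Merge the BLIP caption with user-provided / extracted metadata.
    fields = " ".join(f"{k}: {v}" for k, v in meta.items())
    return f"{caption} {fields}"

def generate_paragraph(model_input: str) -> str:
    # In the real system, the fine-tuned KoT5 model generates this paragraph.
    return f"[story paragraph generated from: {model_input}]"

def run_pipeline(images: List[str], metadata: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    pages = []
    for path, meta in zip(images, metadata):
        paragraph = generate_paragraph(build_model_input(caption_image(path), meta))
        pages.append((path, paragraph))
    return pages  # these (image, paragraph) pairs are then laid out into the PDF

print(run_pipeline(["scene1.jpg"], [{"character": "a fox", "setting": "forest"}]))
```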
- Source: AI Hub Fairy Tale Illustration Dataset (50,001 samples)
- Each sample includes:
  - an image (`.jpg`)
  - a metadata file (`.json`)
- BLIP generates captions for all images (see the captioning sketch after this list)
- Annotation fields extracted:
  - Required: caption, name, i_action, classification
  - Optional: character, setting, emotion, causality, outcome, prediction
- Combined to create `dataset_train.csv` and `dataset_val.csv`
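As a reference point, BLIP captioning can be run with Hugging Face transformers roughly as below. The exact checkpoint used in this project is not stated here, so `Salesforce/blip-image-captioning-base` is only an assumed example.

```python
# Sketch of BLIP captioning with Hugging Face transformers. The checkpoint name
# is an assumption; the project may use a different BLIP variant.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("scene1.jpg").convert("RGB")          # any scene image
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
print(caption)
```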
- Special tokens mark structured fields
- Required fields validated for completeness
- Optional fields replaced with `<empty>` if missing
- Field order randomized per sample to prevent positional bias
- Row-wise seed ensures reproducibility
- Task prefix added to guide KoT5 generation (see the input-construction sketch after this list)
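A minimal sketch of how one training input might be assembled from these rules. The special-token names, task-prefix wording, and example values are illustrative assumptions, not the project's exact format.

```python
# Illustrative construction of one training input; token names, prefix wording,
# and example values are hypothetical, not the project's exact format.
import random

REQUIRED = ["caption", "name", "i_action", "classification"]
OPTIONAL = ["character", "setting", "emotion", "causality", "outcome", "prediction"]
TASK_PREFIX = "generate story: "  # hypothetical prefix wording

def build_input(row: dict, row_id: int) -> str:
    # Required fields must be present; missing optional fields become <empty>.
    for field in REQUIRED:
        assert row.get(field), f"missing required field: {field}"
    fields = {f: row.get(f) or "<empty>" for f in REQUIRED + OPTIONAL}

    # Randomize field order per sample to avoid positional bias, seeding the
    # shuffle with the row id so it is reproducible across runs.
    keys = list(fields)
    random.Random(row_id).shuffle(keys)

    # Mark each field with a special token and prepend the task prefix.
    body = " ".join(f"<{k}> {fields[k]}" for k in keys)
    return TASK_PREFIX + body

example = {"caption": "a fox walks through the forest", "name": "fox",
           "i_action": "walking", "classification": "animal", "emotion": "curious"}
print(build_input(example, row_id=42))
```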
- Baseline models reviewed: KoGPT-2, KoT5
- KoT5 selected due to stronger generalization and encoder–decoder flexibility
- Added special tokens to tokenizer vocab
- Aligned embedding matrix with extended vocabulary
- Applied masking so structural tokens do not affect attention scores
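A minimal sketch of the tokenizer and embedding changes using Hugging Face transformers. Here `t5-small` stands in for the actual KoT5 checkpoint, the token list is an assumption, and the attention-mask zeroing shown is only one way the structural-token masking could be applied.

```python
# Sketch of extending the tokenizer/embeddings for structural tokens.
# "t5-small" is a stand-in for the project's KoT5 checkpoint; the token list
# and the masking approach below are assumptions.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

CHECKPOINT = "t5-small"  # stand-in; the project fine-tunes a KoT5 checkpoint
SPECIAL_TOKENS = ["<caption>", "<name>", "<i_action>", "<classification>", "<empty>"]

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = T5ForConditionalGeneration.from_pretrained(CHECKPOINT)

# Register the structural tokens and grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": SPECIAL_TOKENS})
model.resize_token_embeddings(len(tokenizer))

# One way to keep structural tokens from influencing attention scores:
# zero out their positions in the attention mask.
enc = tokenizer("generate story: <caption> a fox walks <empty>", return_tensors="pt")
special_ids = torch.tensor(tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS))
is_special = torch.isin(enc["input_ids"], special_ids)
enc["attention_mask"] = enc["attention_mask"].masked_fill(is_special, 0)
```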
- Hyperparameter search using Bayesian Optimization
- Optimizer: AdamW
- Scheduler: Warmup + Linear decay
- Early stopping applied
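The optimizer, scheduler, and early-stopping pieces could be wired together roughly as in the skeleton below; the Bayesian hyperparameter search itself is not shown, and the learning rate, warmup ratio, and patience are placeholder values rather than the tuned ones.

```python
# Training-loop skeleton: AdamW + linear warmup/decay + early stopping on val loss.
# Hyperparameter values are placeholders, not the tuned settings.
import torch
from transformers import get_linear_schedule_with_warmup

def train(model, train_loader, val_loader, total_steps, device="cpu"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),
        num_training_steps=total_steps,
    )

    best_val, patience, bad_epochs = float("inf"), 3, 0
    for epoch in range(100):
        model.train()
        for batch in train_loader:
            loss = model(**{k: v.to(device) for k, v in batch.items()}).loss
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        # Early stopping: track validation loss and stop after `patience` bad epochs.
        model.eval()
        with torch.no_grad():
            val_loss = sum(model(**{k: v.to(device) for k, v in b.items()}).loss.item()
                           for b in val_loader) / max(len(val_loader), 1)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
```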
Best decoding parameters found during evaluation:

```
num_beams = 3
length_penalty = 0.8
repetition_penalty = 1.5
no_repeat_ngram_size = 3
```
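These settings map directly onto Hugging Face `generate()` arguments; the sketch below shows how they would be applied at inference time, with `t5-small` standing in for the fine-tuned KoT5 checkpoint and the input text as a placeholder.

```python
# Applying the selected decoding parameters with Hugging Face generate();
# "t5-small" is a stand-in for the fine-tuned KoT5 model.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("generate story: <caption> a fox walks through the forest",
                   return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=3,
    length_penalty=0.8,
    repetition_penalty=1.5,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```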
- BERTScore
- METEOR
- CIDEr
- SPICE
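Of the metrics above, BERTScore and METEOR can be computed with the Hugging Face `evaluate` library as sketched below; CIDEr and SPICE are typically computed with COCO-caption tooling (e.g., pycocoevalcap) and are omitted here. The example strings are placeholders.

```python
# Sketch of computing two of the reported metrics with the `evaluate` library.
# CIDEr and SPICE are usually computed with COCO-caption tooling and are omitted.
import evaluate

predictions = ["the fox walked quietly through the forest"]
references = ["a fox walks through the quiet forest"]

bertscore = evaluate.load("bertscore")
meteor = evaluate.load("meteor")

# For the project's Korean outputs, lang="ko" would be used instead.
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
print(meteor.compute(predictions=predictions, references=references))
```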
KoT5 outperformed KoGPT-2 in narrative quality, coherence, and content relevance.
A Streamlit demo provides an interactive interface for story generation.
- Image upload page
- Metadata auto-filling and keyword suggestion
- Real-time inference using the fine-tuned KoT5 model
- PDF generation
To run locally:
```bash
streamlit run app.py
```

GPU recommended due to reliance on pre-trained models.
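A minimal illustration of the upload-and-generate flow in Streamlit is sketched below; this is not the actual `app.py`, and the real demo runs BLIP captioning, KoT5 inference, and PDF export at the marked step.

```python
# Minimal illustrative Streamlit flow (not the actual app.py): upload images,
# collect optional scene info, and display generated paragraphs.
import streamlit as st

st.title("TalesRunner storybook demo")

uploaded = st.file_uploader("Upload 1-10 scene images", type=["jpg", "png"],
                            accept_multiple_files=True)
scene_info = st.text_area("Optional scene information (one line per image)")

if uploaded and st.button("Generate story"):
    for i, file in enumerate(uploaded[:10]):
        st.image(file, caption=f"Scene {i + 1}")
        # In the real app, BLIP captioning, KoT5 generation, and PDF export run here.
        st.write(f"(generated paragraph for scene {i + 1} would appear here)")
```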
- Doeun Kim — Dataset construction, KoT5 fine-tuning, model training & validation
- Yujin Shin — Annotation preprocessing, inference pipeline, Streamlit demo
- Junga Woo — Baseline model experiments (KoGPT/KoT5), decoding parameter search
- Soobin Cha (PM) — Project management, model baselines, inference UI