🔮 TalesRunner: AI-powered Storybook Generation from Images

TL;DR

  • Built a multimodal storybook generation system that turns images into narrative PDFs
  • Combined BLIP image captioning with fine-tuned KoT5 for scene-level story generation
  • Processed 50,001 images with structured metadata for training and evaluation
  • Delivered an end-to-end pipeline with a Streamlit demo and PDF export

This repository presents a practical exploration of multimodal narrative generation, focusing on how visual information can be transformed into coherent, scene-level stories. Rather than treating image-to-text as a single-step task, the project decomposes storytelling into captioning, structured context integration, and Transformer-based text generation.

TalesRunner emphasizes pipeline design and system integration, demonstrating how pre-trained vision–language models and fine-tuned language models can be combined to produce user-facing, end-to-end AI applications.

Table of Contents

  1. Overview
  2. Project Timeline
  3. Key Features
  4. System Architecture
  5. Implementation Details
  6. Model Training
  7. Demo
  8. Team

Overview

TalesRunner is a full-stack AI project that transforms visual input into text-based stories. Users upload images (up to 10), optionally provide additional scene information, and the system automatically:

  1. Generates captions using BLIP
  2. Merges captions with structured metadata
  3. Produces narrative paragraphs using a fine-tuned KoT5 language model
  4. Compiles images + text into a PDF storybook

The project focuses on building a practical multimodal pipeline using pre-trained models, fine-tuned LLMs, and a user-friendly demo interface.

Project Timeline

Jan–Feb 2025 (6 weeks)

Week   Period            Focus & Milestones
1      Jan 11 – Jan 14   Project scoping, task definition, first ideation
2      Jan 16 – Jan 19   Second ideation, system design refinement
3      Jan 19 – Jan 26   Dataset construction, baseline LM review
4      Jan 27 – Feb 3    Dataset finalization, KoT5 fine-tuning
5      Feb 3 – Feb 10    KoT5 fine-tuning, inference pipeline implementation
6      Feb 10 – Feb 15   Inference demo, final integration and project wrap-up

Key Features

  • Multimodal story generation pipeline combining BLIP captions and KoT5 text generation
  • Structured metadata extraction from AI Hub annotations
  • Custom input format with special tokens to guide narrative generation
  • Fine-tuned KoT5 model with Bayesian hyperparameter optimization
  • Streamlit demo enabling interactive storybook creation
  • PDF export for final story compilation

System Architecture

High-level Flow

  1. Image Upload: the user provides 1–10 images in order.
  2. Captioning (BLIP): BLIP generates an initial natural-language caption for each image.
  3. Metadata Integration: user-provided fields and extracted annotations are combined with the BLIP captions.
  4. Story Generation (KoT5): the fine-tuned KoT5 outputs a paragraph for each scene.
  5. PDF Assembly: images and story paragraphs are compiled into a downloadable PDF.
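
As a concrete reference, the five steps above can be sketched in a few dozen lines. This is a minimal sketch, not the repository's actual code: the BLIP checkpoint, the local KoT5 path, the font file, and all helper names are assumptions, and encode_scene (step 3) is sketched under Input Encoding below.

```python
from PIL import Image
from fpdf import FPDF  # fpdf2
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          T5ForConditionalGeneration, T5TokenizerFast)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
tok = T5TokenizerFast.from_pretrained("./kot5-finetuned")    # assumed local weights
kot5 = T5ForConditionalGeneration.from_pretrained("./kot5-finetuned")

def caption_image(path):
    """Step 2: BLIP generates an initial caption for one image."""
    inputs = blip_proc(Image.open(path).convert("RGB"), return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=40)
    return blip_proc.decode(out[0], skip_special_tokens=True)

def generate_paragraph(model_input):
    """Step 4: the fine-tuned KoT5 turns one encoded scene into a paragraph."""
    ids = tok(model_input, return_tensors="pt").input_ids
    out = kot5.generate(ids, num_beams=3, max_new_tokens=256)
    return tok.decode(out[0], skip_special_tokens=True)

def build_pdf(scenes, out_path="storybook.pdf"):
    """Step 5: compile (image path, paragraph) pairs into one PDF.
    A Korean-capable TTF must be registered for Hangul text."""
    pdf = FPDF()
    pdf.add_font("nanum", fname="NanumGothic.ttf")  # assumed font file
    pdf.set_font("nanum", size=12)
    for img_path, paragraph in scenes:
        pdf.add_page()
        pdf.image(img_path, w=pdf.epw)              # full effective page width
        pdf.multi_cell(0, 8, paragraph)
    pdf.output(out_path)

def make_storybook(image_paths, metadata_rows):
    scenes = []
    for i, (path, meta) in enumerate(zip(image_paths, metadata_rows)):  # step 1
        cap = caption_image(path)                                       # step 2
        model_input = encode_scene({**meta, "caption": cap}, row_id=i)  # step 3
        scenes.append((path, generate_paragraph(model_input)))          # step 4
    build_pdf(scenes)                                                   # step 5
```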

Implementation Details

Dataset Construction

  • Source: AI Hub Fairy Tale Illustration Dataset (50,001 samples)
  • Each sample includes:
    • an image (.jpg)
    • a metadata file (.json)
  • BLIP generates captions for all images
  • Annotation fields extracted:
    • Required: caption, name, i_action, classification
    • Optional: character, setting, emotion, causality, outcome, prediction
  • Fields are combined to create (see the sketch below):
    • dataset_train.csv
    • dataset_val.csv
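
A rough sketch of this construction step, reusing caption_image from the pipeline sketch above; the directory layout and JSON key names are assumptions based on the field list in this README.

```python
import csv
import glob
import json

REQUIRED = ["caption", "name", "i_action", "classification"]
OPTIONAL = ["character", "setting", "emotion", "causality", "outcome", "prediction"]

def build_csv(data_dir, out_csv):
    """Pair each .json annotation with its .jpg, add a BLIP caption, write a CSV."""
    rows = []
    for meta_path in sorted(glob.glob(f"{data_dir}/*.json")):
        with open(meta_path, encoding="utf-8") as f:
            meta = json.load(f)
        img = meta_path[:-5] + ".jpg"                     # paired image file
        row = {"image": img, "blip_caption": caption_image(img)}
        for field in REQUIRED + OPTIONAL:
            row[field] = meta.get(field, "")              # missing optionals stay empty
        rows.append(row)
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

build_csv("aihub/train", "dataset_train.csv")   # assumed directory layout
build_csv("aihub/val", "dataset_val.csv")
```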

Input Encoding

  • Special tokens mark structured fields
  • Required fields validated for completeness
  • Optional fields replaced with <empty> if missing
  • Field order randomized per sample to prevent positional bias
  • Row-wise seed ensures reproducibility
  • Task prefix added to guide KoT5 generation (the full encoding is sketched below)
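
Put together, the rules above might look like the following; the token spellings, the <empty> placeholder handling, and the task-prefix wording are assumptions.

```python
import random

TASK_PREFIX = "generate story: "   # assumed wording (likely Korean in practice)
REQUIRED = ["caption", "name", "i_action", "classification"]
OPTIONAL = ["character", "setting", "emotion", "causality", "outcome", "prediction"]

def encode_scene(row, row_id):
    """Serialize one scene into the special-token input format for KoT5."""
    missing = [f for f in REQUIRED if not row.get(f)]
    if missing:                                  # required fields must be complete
        raise ValueError(f"missing required fields: {missing}")
    fields = {f: row.get(f) or "<empty>" for f in REQUIRED + OPTIONAL}
    order = list(fields)
    random.Random(row_id).shuffle(order)         # row-wise seed: reproducible shuffle
    body = " ".join(f"<{name}> {fields[name]}" for name in order)
    return TASK_PREFIX + body
```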

Model Training

Model Choices

  • Baseline models reviewed: KoGPT-2, KoT5
  • KoT5 selected due to stronger generalization and encoder–decoder flexibility

Tokenizer & Model Customization

  • Added special tokens to tokenizer vocab
  • Aligned embedding matrix with extended vocabulary
  • Applied masking so structural tokens do not affect attention scores (the vocabulary extension is sketched below)
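
The first two bullets might look as follows; the base checkpoint name and token spellings are assumptions, and the attention masking of structural tokens is a separate training-time step not shown here.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("KETI-AIR/ke-t5-base")   # assumed base model
kot5 = T5ForConditionalGeneration.from_pretrained("KETI-AIR/ke-t5-base")

structural_tokens = ["<caption>", "<name>", "<i_action>", "<classification>",
                     "<character>", "<setting>", "<emotion>", "<causality>",
                     "<outcome>", "<prediction>", "<empty>"]
tok.add_special_tokens({"additional_special_tokens": structural_tokens})
kot5.resize_token_embeddings(len(tok))   # new embedding rows for the added tokens
```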

Training & Optimization

  • Hyperparameter search using Bayesian Optimization
  • Optimizer: AdamW
  • Scheduler: Warmup + Linear decay
  • Early stopping applied (see the training sketch below)
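
A minimal Hugging Face Trainer setup consistent with these bullets; train_ds and val_ds are assumed to be tokenized datasets built from the CSVs, and every numeric value is a placeholder rather than a result of the actual Bayesian search.

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="kot5-finetuned",
    optim="adamw_torch",                  # AdamW
    learning_rate=3e-4,                   # placeholder; tuned via Bayesian search
    warmup_ratio=0.1,                     # warmup, then ...
    lr_scheduler_type="linear",           # ... linear decay
    num_train_epochs=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,          # needed for early stopping
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=kot5, args=args,
    train_dataset=train_ds, eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
# The Bayesian search itself could be driven by, e.g.,
# trainer.hyperparameter_search(backend="optuna", n_trials=20).
```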

Decoding Optimization

Best decoding parameters found during evaluation:

num_beams = 3
length_penalty = 0.8
repetition_penalty = 1.5
no_repeat_ngram_size = 3
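
Expressed as a generate() call, with tok and kot5 as in the sketches above; the input string and length cap are illustrative:

```python
ids = tok("generate story: <caption> ...", return_tensors="pt").input_ids
out = kot5.generate(
    ids,
    num_beams=3,
    length_penalty=0.8,
    repetition_penalty=1.5,
    no_repeat_ngram_size=3,
    max_new_tokens=256,   # cap not specified in this README; assumed
)
print(tok.decode(out[0], skip_special_tokens=True))
```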

Evaluation Metrics Used

  • BERTScore
  • METEOR
  • CIDEr
  • SPICE

KoT5 outperformed KoGPT-2 in narrative quality, coherence, and content relevance.
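
For reference, BERTScore can be computed with the bert-score package as below; preds and refs are assumed parallel lists of generated and reference stories, and METEOR, CIDEr, and SPICE come from separate tooling (e.g. nltk, pycocoevalcap).

```python
from bert_score import score

# preds/refs: parallel lists of generated and reference paragraphs (Korean)
P, R, F1 = score(preds, refs, lang="ko")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```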

Demo

A Streamlit demo provides an interactive interface for story generation.

Demo Features

  • Image upload page
  • Metadata auto-filling and keyword suggestion
  • Real-time inference using the fine-tuned KoT5 model
  • PDF generation

To run locally:

streamlit run app.py

A GPU is recommended, since inference runs large pre-trained models.
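
For orientation, a stripped-down app.py might look like the sketch below; widget labels are illustrative, and caption_image, encode_scene, and generate_paragraph refer to the earlier sketches, not the repository's actual modules (which also auto-fill the remaining metadata fields).

```python
import streamlit as st

st.title("TalesRunner")
files = st.file_uploader("Upload 1-10 images in story order",
                         type=["jpg", "jpeg", "png"],
                         accept_multiple_files=True)
name = st.text_input("Main character name")            # one example metadata field
if files and st.button("Generate storybook"):
    for i, f in enumerate(files[:10]):
        row = {"caption": caption_image(f),             # PIL accepts file-like uploads
               "name": name or "<empty>",
               "i_action": "auto-filled",               # placeholders; the real demo
               "classification": "fairy tale"}          # auto-fills these fields
        st.image(f)
        st.write(generate_paragraph(encode_scene(row, row_id=i)))
```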

Team

  • Doeun Kim — Dataset construction, KoT5 fine-tuning, model training & validation
  • Yujin Shin — Annotation preprocessing, inference pipeline, Streamlit demo
  • Junga Woo — Baseline model experiments (KoGPT/KoT5), decoding parameter search
  • Soobin Cha (PM) — Project management, model baselines, inference UI
