diff --git a/README.md b/README.md
index 4beed46..32eee97 100644
--- a/README.md
+++ b/README.md
@@ -6,8 +6,8 @@ A LangGraph-based system that processes books chapter-by-chapter to analyze and
 
 This system tracks:
 - **Mentioned count**: Total times a character's name or aliases appear in the text
-- **Appeared scenes**: Number of scenes in which a character appears (binary per scene)
 - **Influence evidence**: Structured evidence of causal, social, world, pacing, and narrative gravity impact
+- **Chapter summaries**: Incremental summaries of each chapter for narrative context
 
 After processing all chapters, the system synthesizes book-wide dossiers and produces a subjective influence ranking based on narrative impact, not just frequency.
 
@@ -17,9 +17,11 @@ This initially started as a static analyzer meant to count character reference i
 ## Features
 
 - Chapter-by-chapter processing with LangGraph state machine
+- Scene-based processing (chunks full scenes, never truncates)
 - Automatic character detection and alias resolution
-- Scene segmentation and appearance tracking
+- Scene segmentation and intelligent chunking
 - Structured influence evidence extraction
+- Incremental chapter summarization for narrative context
 - Book-wide synthesis into character dossiers
 - Subjective influence ranking (not based on frequency)
 - LangSmith integration for tracing and debugging
@@ -45,6 +47,7 @@ Required environment variables:
 - `LANGSMITH_API_KEY`: Your LangSmith API key (optional, for tracing)
 - `LANGSMITH_TRACING`: Set to `true` to enable tracing (default: `false`)
 - `LANGSMITH_PROJECT`: LangSmith project name (default: `book-influence-dev`)
+- `SCENE_CHUNK_MAX_CHARS`: Maximum characters per scene chunk (default: `5000`)
 
 ## Usage
 
@@ -103,7 +106,6 @@ The output JSON contains an array of ranked characters:
     "character_id": "char_001",
     "name": "Character Name",
     "aliases": ["Alias1", "Alias2"],
-    "appeared_scenes": 42,
     "mentioned_count": 310,
     "influence_summary": "Summary of their influence...",
     "ranking_rationale": "Why this character ranks here..."
@@ -160,12 +162,14 @@ graph TD
     Start([Start]) --> Init[Init]
     Init --> LoadChapter[LoadChapter]
     LoadChapter --> SceneSegmenter[SceneSegmenter]
-    SceneSegmenter --> EntityRosterUpdate[EntityRosterUpdate]
+    SceneSegmenter --> SceneChunker[SceneChunker]
+    SceneChunker --> EntityRosterUpdate[EntityRosterUpdate]
     EntityRosterUpdate --> MentionCounter[MentionCounter]
-    MentionCounter --> AppearanceCounter[AppearanceCounter]
-    AppearanceCounter --> InfluenceExtractor[InfluenceExtractor]
-    InfluenceExtractor --> BookAggregator[BookAggregator]
+    MentionCounter --> InfluenceExtractor[InfluenceExtractor]
+    InfluenceExtractor --> ChapterSummarizer[ChapterSummarizer]
+    ChapterSummarizer --> BookAggregator[BookAggregator]
     BookAggregator --> NextChapter[NextChapter]
+    NextChapter -->|next_chunk| SceneChunker
     NextChapter -->|next_chapter| LoadChapter
     NextChapter -->|finalize| BookSynthesis[BookSynthesis]
     BookSynthesis --> Ranker[Ranker]
@@ -184,20 +188,21 @@ graph TD
 1. **Init**: Validates input and initializes state
 2. **LoadChapter**: Loads current chapter and resets scratch fields
 3. **SceneSegmenter**: Splits chapter into scenes
-4. **EntityRosterUpdate**: Detects characters and updates alias registry
-5. **MentionCounter**: Counts alias occurrences
-6. **AppearanceCounter**: Counts scenes per character
+4. **SceneChunker**: Selects batch of full scenes that fit within character limit
+5. **EntityRosterUpdate**: Detects characters and updates alias registry
+6. **MentionCounter**: Counts alias occurrences across scene chunk
 7. **InfluenceExtractor**: Extracts structured influence evidence
-8. **BookAggregator**: Merges chapter results into book totals
-9. **NextChapter**: Conditional routing (next chapter or finalize)
-10. **BookSynthesis**: Synthesizes evidence into dossiers
-11. **Ranker**: Assigns subjective influence ranks
+8. **ChapterSummarizer**: Incrementally summarizes chapter as scenes are processed
+9. **BookAggregator**: Merges chapter results into book totals
+10. **NextChapter**: Conditional routing (next chunk, next chapter, or finalize)
+11. **BookSynthesis**: Synthesizes evidence into dossiers using chapter summaries
+12. **Ranker**: Assigns subjective influence ranks
 
 ### State Structure
 
 The state maintains three layers:
-1. **Book-level aggregates**: Persist across chapters (mentions, appearances, influence)
-2. **Per-chapter scratch**: Reset each chapter (current chapter data)
+1. **Book-level aggregates**: Persist across chapters (mentions, influence, chapter summaries)
+2. **Per-chapter scratch**: Reset each chapter (current chapter data, scene chunks)
 3. **Indexing structures**: Character canon and alias resolution
 
 ### Influence Ranking
@@ -209,7 +214,7 @@ Influence ranking is **subjective** and based on:
 - Narrative gravity (scenes revolve around them)
 - Causal responsibility for major events
 
-**Note**: Mention and appearance counts are tracked but not the primary driver for ranking.
+**Note**: Mention counts are tracked but not the primary driver for ranking. The system processes chapters in scene chunks to ensure full coverage without truncation.
 
 ## Project Structure
 
@@ -221,8 +226,8 @@ Influence ranking is **subjective** and based on:
 │   ├── scene_segmenter.py
 │   ├── entity_roster_update.py
 │   ├── mention_counter.py
-│   ├── appearance_counter.py
 │   ├── influence_extractor.py
+│   ├── chapter_summarizer.py
 │   ├── book_aggregator.py
 │   ├── next_chapter.py
 │   ├── book_synthesis.py
diff --git a/env.example b/env.example
index 6452ed0..c09041c 100644
--- a/env.example
+++ b/env.example
@@ -21,6 +21,13 @@ BOOK_SYNTHESIS_MODEL=
 BOOK_SYNTHESIS_TEMPERATURE=0.2
 RANKER_MODEL=
 RANKER_TEMPERATURE=0.1
+CHAPTER_SUMMARIZER_MODEL=
+CHAPTER_SUMMARIZER_TEMPERATURE=0.3
+
+# Scene Chunking Configuration
+# Maximum characters per scene chunk (default: 5000)
+# Only full scenes are included - never truncates scenes
+SCENE_CHUNK_MAX_CHARS=5000
 
 # Environment
 ENV=dev
diff --git a/graph.py b/graph.py
index 6d3994d..a475e0b 100644
--- a/graph.py
+++ b/graph.py
@@ -1,9 +1,10 @@
 """LangGraph assembly: wires nodes together with edges and conditional routing.
 
 Graph flow:
-Init -> LoadChapter -> SceneSegmenter -> EntityRosterUpdate -> MentionCounter ->
-AppearanceCounter -> InfluenceExtractor -> BookAggregator -> NextChapter ->
-(if next_chapter: loop to LoadChapter, else: BookSynthesis -> Ranker -> Done)
+Init -> LoadChapter -> SceneSegmenter -> SceneChunker -> EntityRosterUpdate ->
+MentionCounter -> InfluenceExtractor -> ChapterSummarizer -> BookAggregator ->
+NextChapter -> (if next_chunk: loop to SceneChunker, if next_chapter: loop to
+LoadChapter, else: BookSynthesis -> Ranker -> Done)
 """
 
 from langgraph.graph import StateGraph, END
@@ -13,10 +14,11 @@
 from nodes.scene_segmenter import scene_segmenter_node
 from nodes.entity_roster_update import entity_roster_update_node
 from nodes.mention_counter import mention_counter_node
-from nodes.appearance_counter import appearance_counter_node
 from nodes.influence_extractor import influence_extractor_node
 from nodes.book_aggregator import book_aggregator_node
 from nodes.next_chapter import next_chapter_node, should_continue
+from nodes.scene_segmenter import scene_chunker_node
+from nodes.chapter_summarizer import chapter_summarizer_node
 from nodes.book_synthesis import book_synthesis_node
 from nodes.ranker import ranker_node
 
@@ -34,10 +36,11 @@ def create_graph() -> StateGraph:
     workflow.add_node("init", init_node)
     workflow.add_node("load_chapter", load_chapter_node)
     workflow.add_node("scene_segmenter", scene_segmenter_node)
+    workflow.add_node("scene_chunker", scene_chunker_node)
     workflow.add_node("entity_roster_update", entity_roster_update_node)
     workflow.add_node("mention_counter", mention_counter_node)
-    workflow.add_node("appearance_counter", appearance_counter_node)
     workflow.add_node("influence_extractor", influence_extractor_node)
+    workflow.add_node("chapter_summarizer", chapter_summarizer_node)
     workflow.add_node("book_aggregator", book_aggregator_node)
     workflow.add_node("next_chapter", next_chapter_node)
     workflow.add_node("book_synthesis", book_synthesis_node)
@@ -49,11 +52,12 @@ def create_graph() -> StateGraph:
     # Add edges
     workflow.add_edge("init", "load_chapter")
     workflow.add_edge("load_chapter", "scene_segmenter")
-    workflow.add_edge("scene_segmenter", "entity_roster_update")
+    workflow.add_edge("scene_segmenter", "scene_chunker")
+    workflow.add_edge("scene_chunker", "entity_roster_update")
     workflow.add_edge("entity_roster_update", "mention_counter")
-    workflow.add_edge("mention_counter", "appearance_counter")
-    workflow.add_edge("appearance_counter", "influence_extractor")
-    workflow.add_edge("influence_extractor", "book_aggregator")
+    workflow.add_edge("mention_counter", "influence_extractor")
+    workflow.add_edge("influence_extractor", "chapter_summarizer")
+    workflow.add_edge("chapter_summarizer", "book_aggregator")
     workflow.add_edge("book_aggregator", "next_chapter")
 
     # Conditional routing from next_chapter
@@ -61,8 +65,9 @@ def create_graph() -> StateGraph:
         "next_chapter",
         should_continue,
         {
-            "next_chapter": "load_chapter",
-            "finalize": "book_synthesis",
+            "next_chunk": "scene_chunker",  # More scenes in current chapter
+            "next_chapter": "load_chapter",  # Move to next chapter
+            "finalize": "book_synthesis",  # All chapters done
         }
     )
 
diff --git a/main.py b/main.py
index dc501ce..69e51c1 100644
--- a/main.py
+++ b/main.py
@@ -126,8 +126,8 @@ def main():
         'alias_index': {},
         'unresolved_aliases': {},
         'book_mentions': {},
-        'book_appearances': {},
         'book_influence': {},
+        'chapter_summaries': {},
     }
 
     # Get graph app
@@ -180,7 +180,7 @@ def main():
     logger.info("Top 5 characters by influence:")
     for char in ranked[:5]:
         logger.info(f" {char['rank']}. {char['name']} (ID: {char['character_id']})")
-        logger.info(f" Appeared in {char['appeared_scenes']} scenes, mentioned {char['mentioned_count']} times")
+        logger.info(f" Mentioned {char['mentioned_count']} times")
 
 
 if __name__ == "__main__":
diff --git a/nodes/appearance_counter.py b/nodes/appearance_counter.py
deleted file mode 100644
index 1b7574f..0000000
--- a/nodes/appearance_counter.py
+++ /dev/null
@@ -1,61 +0,0 @@
-"""AppearanceCounter node: counts scenes in which each character appears."""
-
-from collections import defaultdict
-from schemas.state import BookState
-from utils.aliases import find_all_alias_matches
-from utils.logging_config import get_logger
-
-logger = get_logger()
-
-
-def appearance_counter_node(state: BookState) -> BookState:
-    """Count scenes in which each character appears (binary per scene).
-
-    Reads:
-    - `current_scenes`: list of scenes
-    - `alias_index`: alias to char_id mapping
-
-    Writes:
-    - `chapter_appearances_by_char`: dict mapping char_id -> scene count
-
-    Args:
-        state: Current state
-
-    Returns:
-        Updated state with appearance counts
-    """
-    scenes = state['current_scenes']
-    chapter_id = state['current_chapter_id']
-    alias_index = state.get('alias_index', {})
-
-    logger.info(f"Counting appearances in chapter {chapter_id} ({len(scenes)} scenes)")
-
-    # For each scene, determine which characters appear
-    characters_in_scenes = defaultdict(set)
-
-    for scene in scenes:
-        scene_text = scene['text']
-        matches_by_char = find_all_alias_matches(scene_text, alias_index, case_sensitive=False)
-
-        # Any character with a match in this scene is considered present
-        for char_id in matches_by_char.keys():
-            characters_in_scenes[char_id].add(scene['scene_id'])
-
-    # Count unique scenes per character
-    chapter_appearances = {
-        char_id: len(scene_ids)
-        for char_id, scene_ids in characters_in_scenes.items()
-    }
-
-    logger.info(f"Characters appeared in {len(chapter_appearances)} characters across scenes")
-    if chapter_appearances:
-        top_appeared = max(chapter_appearances.items(), key=lambda x: x[1])
-        logger.debug(f"Most appearances: {top_appeared[0]} in {top_appeared[1]} scenes")
-
-    updated_state: BookState = {
-        **state,
-        'chapter_appearances_by_char': chapter_appearances,
-    }
-
-    return updated_state
-
diff --git a/nodes/book_aggregator.py b/nodes/book_aggregator.py
index 68ab573..250d691 100644
--- a/nodes/book_aggregator.py
+++ b/nodes/book_aggregator.py
@@ -9,14 +9,16 @@
 def book_aggregator_node(state: BookState) -> BookState:
     """Merge per-chapter results into book-level aggregates.
 
+    Aggregates mention counts and influence evidence from the current chapter
+    into book-level totals. Note: This aggregates per scene chunk, so it may
+    be called multiple times per chapter.
+
     Reads:
-    - `chapter_mentions_by_char`: per-chapter mention counts
-    - `chapter_appearances_by_char`: per-chapter appearance counts
-    - `chapter_influence_evidence`: per-chapter influence evidence
+    - `chapter_mentions_by_char`: per-chapter mention counts (accumulated across chunks)
+    - `chapter_influence_evidence`: per-chapter influence evidence (from current chunk)
 
     Writes:
-    - Increments `book_mentions`
-    - Increments `book_appearances`
+    - Increments `book_mentions` with chapter totals
     - Appends to `book_influence[char_id].evidence`
     - Updates `book_influence[char_id].feature_totals`
 
@@ -28,28 +30,20 @@ def book_aggregator_node(state: BookState) -> BookState:
     """
     chapter_id = state['current_chapter_id']
     chapter_mentions = state.get('chapter_mentions_by_char', {})
-    chapter_appearances = state.get('chapter_appearances_by_char', {})
     chapter_evidence = state.get('chapter_influence_evidence', {})
 
     logger.info(f"Aggregating chapter {chapter_id} results into book totals")
 
     book_mentions = state.get('book_mentions', {}).copy()
-    book_appearances = state.get('book_appearances', {}).copy()
     book_influence = state.get('book_influence', {}).copy()
 
-    # Aggregate mentions
+    # Aggregate mentions (chapter totals, accumulated across all chunks)
     mentions_added = 0
     for char_id, count in chapter_mentions.items():
         book_mentions[char_id] = book_mentions.get(char_id, 0) + count
         mentions_added += count
 
-    # Aggregate appearances
-    appearances_added = 0
-    for char_id, count in chapter_appearances.items():
-        book_appearances[char_id] = book_appearances.get(char_id, 0) + count
-        appearances_added += count
-
-    # Aggregate influence evidence
+    # Aggregate influence evidence (from current chunk)
     evidence_added = 0
     for char_id, evidence in chapter_evidence.items():
         if char_id not in book_influence:
@@ -78,13 +72,12 @@ def book_aggregator_node(state: BookState) -> BookState:
         totals['centered_scenes'] = totals.get('centered_scenes', 0) + len(signals.get('narrative_gravity', []))
         book_influence[char_id]['feature_totals'] = totals
 
-    logger.info(f"Aggregated: {mentions_added} mentions, {appearances_added} appearances, {evidence_added} evidence entries")
+    logger.info(f"Aggregated: {mentions_added} mentions, {evidence_added} evidence entries")
     logger.debug(f"Book totals: {len(book_mentions)} characters with mentions, {len(book_influence)} with influence evidence")
 
     updated_state: BookState = {
         **state,
         'book_mentions': book_mentions,
-        'book_appearances': book_appearances,
         'book_influence': book_influence,
     }
 
diff --git a/nodes/book_synthesis.py b/nodes/book_synthesis.py
index af00ca0..f3d83c7 100644
--- a/nodes/book_synthesis.py
+++ b/nodes/book_synthesis.py
@@ -18,8 +18,8 @@ def book_synthesis_node(state: BookState) -> BookState:
     Reads:
     - `book_influence`: accumulated influence evidence
     - `book_mentions`: total mentions per character
-    - `book_appearances`: total appearances per character
     - `characters_by_id`: all character profiles
+    - `chapter_summaries`: chapter summaries for narrative context
 
     Writes:
     - `book_plot_summary`: overall plot summary
@@ -34,18 +34,19 @@ def book_synthesis_node(state: BookState) -> BookState:
     """
     book_influence = state.get('book_influence', {})
     book_mentions = state.get('book_mentions', {})
-    book_appearances = state.get('book_appearances', {})
     characters_by_id = state.get('characters_by_id', {})
+    chapter_summaries = state.get('chapter_summaries', {})
 
     logger.info("Synthesizing book-wide evidence into character dossiers")
     logger.debug(f"Processing {len(characters_by_id)} characters with {len(book_influence)} influence records")
+    logger.debug(f"Using {len(chapter_summaries)} chapter summaries for context")
 
     # Build prompt
     prompt = get_book_synthesis_prompt(
         book_influence,
         book_mentions,
-        book_appearances,
-        characters_by_id
+        characters_by_id,
+        chapter_summaries
     )
 
     # Get model configuration from environment
diff --git a/nodes/chapter_summarizer.py b/nodes/chapter_summarizer.py
new file mode 100644
index 0000000..9fcd885
--- /dev/null
+++ b/nodes/chapter_summarizer.py
@@ -0,0 +1,78 @@
+"""ChapterSummarizer node: incrementally summarizes chapters as scenes are processed."""
+
+import os
+from schemas.state import BookState
+from utils.json import parse_json_safely
+from prompts import get_chapter_summary_prompt
+from langchain_openai import ChatOpenAI
+from langchain_core.messages import HumanMessage
+from utils.logging_config import get_logger
+
+logger = get_logger()
+
+
+def chapter_summarizer_node(state: BookState) -> BookState:
+    """Generate or update chapter summary based on current scene chunk.
+
+    This node incrementally builds chapter summaries as scenes are processed.
+    Each chunk adds to the previous summary, maintaining narrative flow awareness.
+
+    Reads:
+    - `current_chapter_id`: current chapter ID
+    - `current_scene_chunk`: current batch of scenes being processed
+    - `chapter_summaries`: existing chapter summaries
+
+    Writes:
+    - `chapter_summaries`: updated with new/updated summary for current chapter
+
+    Args:
+        state: Current state
+
+    Returns:
+        Updated state with chapter summary updated
+    """
+    chapter_id = state['current_chapter_id']
+    scene_chunk = state.get('current_scene_chunk', [])
+    chapter_summaries = state.get('chapter_summaries', {})
+
+    previous_summary = chapter_summaries.get(chapter_id, "")
+
+    logger.info(f"Summarizing chapter {chapter_id}")
+    logger.debug(f"Scene chunk: {len(scene_chunk)} scenes, Previous summary: {len(previous_summary)} chars")
+
+    # Build prompt for LLM
+    prompt = get_chapter_summary_prompt(previous_summary, scene_chunk)
+
+    # Get model configuration from environment
+    model = os.getenv("CHAPTER_SUMMARIZER_MODEL") or os.getenv("OPENAI_MODEL", "gpt-4o-mini")
+    temperature = float(os.getenv("CHAPTER_SUMMARIZER_TEMPERATURE", "0.3"))
+
+    logger.debug(f"Calling LLM (model: {model}, temperature: {temperature})")
+    # Call LLM for summary generation
+    llm = ChatOpenAI(model=model, temperature=temperature)
+    response = llm.invoke([HumanMessage(content=prompt)])
+    response_text = response.content
+    logger.debug("LLM response received")
+
+    # Parse JSON response
+    parsed = parse_json_safely(response_text)
+
+    updated_summaries = chapter_summaries.copy()
+
+    if parsed and 'chapter_summary' in parsed:
+        new_summary = parsed['chapter_summary']
+        updated_summaries[chapter_id] = new_summary
+        logger.info(f"Updated chapter summary ({len(new_summary)} chars)")
+    else:
+        logger.warning("Failed to parse chapter summary from LLM response")
+        # Keep previous summary if parsing fails
+        if previous_summary:
+            updated_summaries[chapter_id] = previous_summary
+
+    updated_state: BookState = {
+        **state,
+        'chapter_summaries': updated_summaries,
+    }
+
+    return updated_state
+
diff --git a/nodes/entity_roster_update.py b/nodes/entity_roster_update.py
index df575a7..9989512 100644
--- a/nodes/entity_roster_update.py
+++ b/nodes/entity_roster_update.py
@@ -15,11 +15,10 @@
 
 
 def entity_roster_update_node(state: BookState) -> BookState:
-    """Update character roster and alias index for current chapter.
+    """Update character roster and alias index for current scene chunk.
 
     Reads:
-    - `current_chapter_text`: text of current chapter
-    - `current_scenes`: list of scenes
+    - `current_scene_chunk`: current batch of scenes being processed
     - `characters_by_id`: existing character profiles
     - `alias_index`: existing alias index
 
@@ -34,20 +33,18 @@ def entity_roster_update_node(state: BookState) -> BookState:
     Returns:
         Updated state with character roster updated
     """
-    chapter_text = state['current_chapter_text']
     chapter_id = state['current_chapter_id']
-    scenes = state['current_scenes']
+    scene_chunk = state.get('current_scene_chunk', [])
     existing_characters = state.get('characters_by_id', {})
     existing_alias_index = state.get('alias_index', {})
 
-    chapter_id = state['current_chapter_id']
     num_existing = len(existing_characters)
 
     logger.info(f"Updating entity roster for chapter {chapter_id}")
-    logger.debug(f"Existing characters: {num_existing}, Scenes: {len(scenes)}")
+    logger.debug(f"Existing characters: {num_existing}, Scene chunk: {len(scene_chunk)} scenes")
 
-    # Build prompt for LLM
-    prompt = get_entity_roster_prompt(chapter_text, existing_characters, scenes)
+    # Build prompt for LLM using scene chunk
+    prompt = get_entity_roster_prompt(existing_characters, scene_chunk)
 
     # Get model configuration from environment
     model = os.getenv("ENTITY_ROSTER_MODEL") or os.getenv("OPENAI_MODEL", "gpt-4o-mini")
diff --git a/nodes/influence_extractor.py b/nodes/influence_extractor.py
index 9840709..91c5275 100644
--- a/nodes/influence_extractor.py
+++ b/nodes/influence_extractor.py
@@ -12,10 +12,10 @@
 
 
 def influence_extractor_node(state: BookState) -> BookState:
-    """Extract structured influence evidence per character for current chapter.
+    """Extract structured influence evidence per character for current scene chunk.
 
     Reads:
-    - `current_scenes`: list of scenes
+    - `current_scene_chunk`: current batch of scenes being processed
     - `characters_by_id`: character profiles
     - `current_chapter_id`: current chapter ID
 
@@ -28,15 +28,15 @@ def influence_extractor_node(state: BookState) -> BookState:
     Returns:
         Updated state with influence evidence extracted
     """
-    scenes = state['current_scenes']
+    scene_chunk = state.get('current_scene_chunk', [])
     characters = state.get('characters_by_id', {})
     chapter_id = state['current_chapter_id']
 
     logger.info(f"Extracting influence evidence for chapter {chapter_id}")
-    logger.debug(f"Analyzing {len(scenes)} scenes for {len(characters)} characters")
+    logger.debug(f"Analyzing {len(scene_chunk)} scenes in chunk for {len(characters)} characters")
 
-    # Build prompt for LLM
-    prompt = get_influence_extraction_prompt(scenes, characters)
+    # Build prompt for LLM using scene chunk (all scenes, no truncation)
+    prompt = get_influence_extraction_prompt(scene_chunk, characters)
 
     # Get model configuration from environment
     model = os.getenv("INFLUENCE_EXTRACTOR_MODEL") or os.getenv("OPENAI_MODEL", "gpt-4o-mini")
diff --git a/nodes/init.py b/nodes/init.py
index 903f707..69403fb 100644
--- a/nodes/init.py
+++ b/nodes/init.py
@@ -51,8 +51,8 @@ def init_node(state: BookState) -> BookState:
         'alias_index': state.get('alias_index', {}),
         'unresolved_aliases': state.get('unresolved_aliases', {}),
         'book_mentions': state.get('book_mentions', {}),
-        'book_appearances': state.get('book_appearances', {}),
         'book_influence': state.get('book_influence', {}),
+        'chapter_summaries': state.get('chapter_summaries', {}),
     }
 
     logger.debug("State structures initialized")
diff --git a/nodes/load_chapter.py b/nodes/load_chapter.py
index e8446ad..f5ad27d 100644
--- a/nodes/load_chapter.py
+++ b/nodes/load_chapter.py
@@ -44,8 +44,9 @@ def load_chapter_node(state: BookState) -> BookState:
         'current_chapter_text': current_chapter['text'],
         # Reset per-chapter scratch fields
         'current_scenes': [],
+        'current_scene_chunk': [],
+        'processed_scene_ids': set(),
         'chapter_mentions_by_char': {},
-        'chapter_appearances_by_char': {},
         'chapter_influence_evidence': {},
     }
 
diff --git a/nodes/mention_counter.py b/nodes/mention_counter.py
index ced21ae..ce85cb1 100644
--- a/nodes/mention_counter.py
+++ b/nodes/mention_counter.py
@@ -9,39 +9,53 @@
 
 
 def mention_counter_node(state: BookState) -> BookState:
-    """Count total alias hits across the chapter for each character.
+    """Count alias hits across the current scene chunk for each character.
+
+    Counts mentions across all scenes in the current chunk. These counts
+    will be aggregated at the chapter level in BookAggregator.
 
     Reads:
-    - `current_chapter_text`: text of current chapter
+    - `current_scene_chunk`: current batch of scenes being processed
     - `alias_index`: alias to char_id mapping
+    - `chapter_mentions_by_char`: existing mention counts for this chapter
 
     Writes:
-    - `chapter_mentions_by_char`: dict mapping char_id -> mention count
+    - `chapter_mentions_by_char`: dict mapping char_id -> mention count (accumulated)
 
     Args:
         state: Current state
 
     Returns:
-        Updated state with mention counts
+        Updated state with mention counts accumulated
     """
-    chapter_text = state['current_chapter_text']
     chapter_id = state['current_chapter_id']
+    scene_chunk = state.get('current_scene_chunk', [])
     alias_index = state.get('alias_index', {})
+    existing_mentions = state.get('chapter_mentions_by_char', {})
 
-    logger.info(f"Counting mentions in chapter {chapter_id}")
-    logger.debug(f"Alias index contains {len(alias_index)} aliases")
+    logger.info(f"Counting mentions in scene chunk for chapter {chapter_id}")
+    logger.debug(f"Scene chunk: {len(scene_chunk)} scenes, Alias index: {len(alias_index)} aliases")
 
-    # Find all alias matches
-    matches_by_char = find_all_alias_matches(chapter_text, alias_index, case_sensitive=False)
+    # Count mentions across all scenes in chunk
+    chunk_mentions = {}
+    for scene in scene_chunk:
+        scene_text = scene['text']
+        matches_by_char = find_all_alias_matches(scene_text, alias_index, case_sensitive=False)
+
+        # Accumulate counts per character
+        for char_id, matches in matches_by_char.items():
+            chunk_mentions[char_id] = chunk_mentions.get(char_id, 0) + len(matches)
 
-    # Count matches per character
-    chapter_mentions = {char_id: len(matches) for char_id, matches in matches_by_char.items()}
+    # Merge with existing chapter mentions
+    chapter_mentions = existing_mentions.copy()
+    for char_id, count in chunk_mentions.items():
+        chapter_mentions[char_id] = chapter_mentions.get(char_id, 0) + count
 
-    total_mentions = sum(chapter_mentions.values())
-    logger.info(f"Found {total_mentions} total mentions across {len(chapter_mentions)} characters")
-    if chapter_mentions:
-        top_mentioned = max(chapter_mentions.items(), key=lambda x: x[1])
-        logger.debug(f"Most mentioned: {top_mentioned[0]} with {top_mentioned[1]} mentions")
+    total_mentions = sum(chunk_mentions.values())
+    logger.info(f"Found {total_mentions} mentions in this chunk across {len(chunk_mentions)} characters")
+    if chunk_mentions:
+        top_mentioned = max(chunk_mentions.items(), key=lambda x: x[1])
+        logger.debug(f"Most mentioned in chunk: {top_mentioned[0]} with {top_mentioned[1]} mentions")
 
     updated_state: BookState = {
         **state,
diff --git a/nodes/next_chapter.py b/nodes/next_chapter.py
index f43a794..ddc7dde 100644
--- a/nodes/next_chapter.py
+++ b/nodes/next_chapter.py
@@ -7,23 +7,38 @@
 
 
 def next_chapter_node(state: BookState) -> BookState:
-    """Update state and determine routing for next chapter or finalization.
+    """Update state and determine routing for next chunk, chapter, or finalization.
+
+    Checks if more scenes remain in current chapter, or if we should move
+    to the next chapter or finalize.
 
     Reads:
     - `chapter_idx`: current chapter index
     - `chapters`: list of chapters
+    - `current_scenes`: all scenes in current chapter
+    - `processed_scene_ids`: scene IDs already processed
 
     Writes:
-    - Updates `chapter_idx` if continuing to next chapter
+    - Updates `chapter_idx` if moving to next chapter
 
     Returns:
         Updated state
     """
     chapter_idx = state['chapter_idx']
     chapters = state['chapters']
+    all_scenes = state.get('current_scenes', [])
+    processed_ids = state.get('processed_scene_ids', set())
+
+    # Check if more scenes remain in current chapter
+    all_scene_ids = {s['scene_id'] for s in all_scenes}
+    remaining_scenes = all_scene_ids - processed_ids
 
-    if chapter_idx + 1 < len(chapters):
-        # Update chapter index in state
+    if remaining_scenes:
+        # More scenes to process in current chapter
+        logger.info(f"Chapter {chapter_idx + 1}/{len(chapters)}: {len(remaining_scenes)} scenes remaining, processing next chunk")
+        return state
+    elif chapter_idx + 1 < len(chapters):
+        # Current chapter done, move to next chapter
         updated_state: BookState = {
             **state,
             'chapter_idx': chapter_idx + 1,
@@ -31,6 +46,7 @@ def next_chapter_node(state: BookState) -> BookState:
         logger.info(f"Chapter {chapter_idx + 1}/{len(chapters)} complete, continuing to next chapter")
         return updated_state
     else:
+        # All chapters processed
         logger.info(f"All {len(chapters)} chapters processed, proceeding to finalization")
         return state
 
@@ -38,16 +54,29 @@ def next_chapter_node(state: BookState) -> BookState:
 def should_continue(state: BookState) -> str:
     """Determine routing decision based on state.
 
+    Checks if more scenes remain in current chapter, if more chapters remain,
+    or if we should finalize.
+
     Args:
         state: Current state
 
     Returns:
-        "next_chapter" if more chapters remain, "finalize" otherwise
+        "next_chunk" if more scenes in current chapter,
+        "next_chapter" if more chapters remain,
+        "finalize" if all chapters processed
     """
     chapter_idx = state['chapter_idx']
     chapters = state['chapters']
+    all_scenes = state.get('current_scenes', [])
+    processed_ids = state.get('processed_scene_ids', set())
+
+    # Check if more scenes remain in current chapter
+    all_scene_ids = {s['scene_id'] for s in all_scenes}
+    remaining_scenes = all_scene_ids - processed_ids
 
-    if chapter_idx + 1 < len(chapters):
+    if remaining_scenes:
+        return "next_chunk"
+    elif chapter_idx + 1 < len(chapters):
         return "next_chapter"
     else:
         return "finalize"
diff --git a/nodes/ranker.py b/nodes/ranker.py
index 696e54d..65e9004 100644
--- a/nodes/ranker.py
+++ b/nodes/ranker.py
@@ -19,7 +19,6 @@ def ranker_node(state: BookState) -> BookState:
     - `book_plot_summary`: overall plot summary
     - `character_dossiers`: character dossiers
     - `book_mentions`: mention counts (reference only)
-    - `book_appearances`: appearance counts (reference only)
     - `characters_by_id`: character profiles
 
     Writes:
@@ -35,7 +34,6 @@ def ranker_node(state: BookState) -> BookState:
     book_plot_summary = state.get('book_plot_summary', '')
     character_dossiers = state.get('character_dossiers', {})
     book_mentions = state.get('book_mentions', {})
-    book_appearances = state.get('book_appearances', {})
     characters_by_id = state.get('characters_by_id', {})
 
     logger.info("Ranking characters by influence")
@@ -45,8 +43,7 @@ def ranker_node(state: BookState) -> BookState:
     prompt = get_ranker_prompt(
         book_plot_summary,
         character_dossiers,
-        book_mentions,
-        book_appearances
+        book_mentions
     )
 
     # Get model configuration from environment
@@ -75,7 +72,6 @@ def ranker_node(state: BookState) -> BookState:
             'character_id': char_id,
             'name': rank_data.get('name', char.get('canonical_name', 'Unknown')),
             'aliases': rank_data.get('aliases', char.get('aliases', [])),
-            'appeared_scenes': int(rank_data.get('appeared_scenes', book_appearances.get(char_id, 0))),
             'mentioned_count': int(rank_data.get('mentioned_count', book_mentions.get(char_id, 0))),
             'influence_summary': rank_data.get('influence_summary', ''),
         }
diff --git a/nodes/scene_segmenter.py b/nodes/scene_segmenter.py
index 429c9ec..1148ee7 100644
--- a/nodes/scene_segmenter.py
+++ b/nodes/scene_segmenter.py
@@ -1,5 +1,7 @@
-"""SceneSegmenter node: splits chapter text into scenes."""
+"""SceneSegmenter node: splits chapter text into scenes.
+SceneChunker node: selects a batch of full scenes that fit within character limit."""
 
+import os
 from schemas.state import BookState
 from utils.text import segment_scenes
 from utils.logging_config import get_logger
@@ -38,3 +40,87 @@ def scene_segmenter_node(state: BookState) -> BookState:
 
     return updated_state
 
+
+def scene_chunker_node(state: BookState) -> BookState:
+    """Select a batch of full scenes that fit within the character limit.
+
+    This node selects as many complete scenes as possible without exceeding
+    the character limit. Scenes are never truncated - only complete scenes
+    are included in the chunk.
+ + Reads: + - `current_scenes`: all scenes in current chapter + - `processed_scene_ids`: set of scene IDs already processed + + Writes: + - `current_scene_chunk`: list of scenes selected for this batch + - `processed_scene_ids`: updated with newly selected scene IDs + + Args: + state: Current state + + Returns: + Updated state with scene chunk selected + """ + chapter_id = state['current_chapter_id'] + all_scenes = state.get('current_scenes', []) + processed_ids = state.get('processed_scene_ids', set()) + + # Get character limit from environment + max_chars = int(os.getenv("SCENE_CHUNK_MAX_CHARS", "5000")) + + logger.info(f"Selecting scene chunk for chapter {chapter_id} (max {max_chars} chars)") + + # Filter out already processed scenes + remaining_scenes = [s for s in all_scenes if s['scene_id'] not in processed_ids] + + if not remaining_scenes: + logger.info("No remaining scenes to process in this chapter") + updated_state: BookState = { + **state, + 'current_scene_chunk': [], + } + return updated_state + + # Select scenes sequentially until adding next would exceed limit + selected_scenes = [] + total_chars = 0 + + for scene in remaining_scenes: + scene_chars = len(scene['text']) + + # If this is the first scene and it exceeds limit, include it anyway + # (we never truncate scenes) + if not selected_scenes and scene_chars > max_chars: + logger.warning( + f"Scene {scene['scene_id']} ({scene_chars} chars) exceeds limit ({max_chars}), " + f"but including it anyway (no truncation)" + ) + selected_scenes.append(scene) + total_chars += scene_chars + break + + # If adding this scene would exceed limit, stop + if total_chars + scene_chars > max_chars: + break + + # Add scene to chunk + selected_scenes.append(scene) + total_chars += scene_chars + + # Update processed scene IDs + new_processed_ids = processed_ids | {s['scene_id'] for s in selected_scenes} + + logger.info( + f"Selected {len(selected_scenes)} scenes ({total_chars} chars). 
" + f"{len(new_processed_ids)}/{len(all_scenes)} scenes processed in chapter" + ) + + updated_state: BookState = { + **state, + 'current_scene_chunk': selected_scenes, + 'processed_scene_ids': new_processed_ids, + } + + return updated_state + diff --git a/prompts.py b/prompts.py index ff71574..38f2b82 100644 --- a/prompts.py +++ b/prompts.py @@ -11,13 +11,12 @@ PROMPT_VERSION = "1.0.0" -def get_entity_roster_prompt(chapter_text: str, existing_characters: dict, scenes: list) -> str: +def get_entity_roster_prompt(existing_characters: dict, scene_chunk: list) -> str: """Generate prompt for entity roster update (character detection). Args: - chapter_text: Full chapter text existing_characters: Existing character profiles - scenes: List of scenes in the chapter + scene_chunk: List of scenes in the current chunk being processed Returns: Prompt string @@ -30,15 +29,21 @@ def get_entity_roster_prompt(chapter_text: str, existing_characters: dict, scene if char.get('aliases'): existing_chars_str += f" Aliases: {', '.join(char['aliases'])}\n" - return f"""You are analyzing a book chapter to identify and track characters. + # Build scenes text (all scenes, no truncation) + scenes_str = "\n\n".join([ + f"Scene {i+1} ({scene['scene_id']}):\n{scene['text']}" + for i, scene in enumerate(scene_chunk) + ]) + + return f"""You are analyzing scenes from a book chapter to identify and track characters. {existing_chars_str} -Chapter text: -{chapter_text[:5000]}... +Scenes from this chapter: +{scenes_str} Tasks: -1. Identify any NEW characters introduced in this chapter +1. Identify any NEW characters introduced in these scenes 2. Identify any NEW aliases/nicknames for existing characters 3. 
Resolve ambiguous references where possible using context @@ -74,11 +79,11 @@ def get_entity_roster_prompt(chapter_text: str, existing_characters: dict, scene """ -def get_influence_extraction_prompt(scenes: list, characters: dict) -> str: +def get_influence_extraction_prompt(scene_chunk: list, characters: dict) -> str: """Generate prompt for influence evidence extraction. Args: - scenes: List of scenes in the chapter + scene_chunk: List of scenes in the current chunk (all scenes, no truncation) characters: Character profiles Returns: @@ -89,9 +94,10 @@ def get_influence_extraction_prompt(scenes: list, characters: dict) -> str: for char_id, char in characters.items() ]) + # Build scenes text (all scenes in chunk, no truncation) scenes_str = "\n\n".join([ - f"Scene {i+1} ({scene['scene_id']}):\n{scene['text'][:1000]}..." - for i, scene in enumerate(scenes[:10]) # Limit to first 10 scenes + f"Scene {i+1} ({scene['scene_id']}):\n{scene['text']}" + for i, scene in enumerate(scene_chunk) ]) return f"""Analyze the following scenes from a book chapter and extract influence evidence for each character. @@ -134,16 +140,16 @@ def get_influence_extraction_prompt(scenes: list, characters: dict) -> str: def get_book_synthesis_prompt( book_influence: dict, book_mentions: dict, - book_appearances: dict, - characters_by_id: dict + characters_by_id: dict, + chapter_summaries: dict ) -> str: """Generate prompt for book-wide synthesis into dossiers. 
Args: book_influence: Accumulated influence evidence per character book_mentions: Total mentions per character - book_appearances: Total appearances per character characters_by_id: All character profiles + chapter_summaries: Chapter summaries for narrative context Returns: Prompt string @@ -156,7 +162,6 @@ def get_book_synthesis_prompt( evidence_count = len(accumulator.get('evidence', [])) mentions = book_mentions.get(char_id, 0) - appearances = book_appearances.get(char_id, 0) # Summarize evidence types evidence_summary = [] @@ -167,15 +172,23 @@ def get_book_synthesis_prompt( char_summaries.append(f""" Character: {char_name} (ID: {char_id}) -- Mentions: {mentions}, Appearances: {appearances} +- Mentions: {mentions} - Evidence chapters: {evidence_count} - Sample evidence: {evidence_summary[:2]} """) + # Build chapter summaries section + chapter_summaries_str = "" + if chapter_summaries: + chapter_summaries_str = "\n\nChapter Summaries (for narrative context):\n" + for chapter_id, summary in sorted(chapter_summaries.items()): + chapter_summaries_str += f"\n{chapter_id}:\n{summary}\n" + return f"""Synthesize book-wide character influence evidence into compact dossiers. Character summaries: {''.join(char_summaries)} +{chapter_summaries_str} For each character, create a dossier with: 1. **arc_summary**: How they change/develop across the book @@ -186,6 +199,8 @@ def get_book_synthesis_prompt( Also provide a **book_plot_summary**: A 2-3 paragraph summary of the overall plot and major themes. +Use the chapter summaries to understand narrative flow and context when creating dossiers. + Output JSON: {{ "book_plot_summary": "...", @@ -206,11 +221,53 @@ def get_book_synthesis_prompt( """ +def get_chapter_summary_prompt(previous_summary: str, scene_chunk: list) -> str: + """Generate prompt for chapter summarization. 
+ + Args: + previous_summary: Previous summary of the chapter (empty string if first chunk) + scene_chunk: List of scenes in the current chunk being processed + + Returns: + Prompt string + """ + # Build scenes text (all scenes, no truncation) + scenes_str = "\n\n".join([ + f"Scene {i+1} ({scene['scene_id']}):\n{scene['text']}" + for i, scene in enumerate(scene_chunk) + ]) + + previous_summary_section = "" + if previous_summary: + previous_summary_section = f""" +Previous summary of this chapter: +{previous_summary} + +""" + + return f"""You are summarizing a book chapter incrementally as scenes are processed. + +{previous_summary_section}New scenes to incorporate: +{scenes_str} + +Your task: +- If this is the first chunk (no previous summary), create an initial summary +- If there's a previous summary, extend/update it to include these new scenes +- Keep the summary concise (2-4 sentences) but comprehensive +- Focus on key events, character actions, and narrative developments +- Maintain continuity with the previous summary if it exists + +Output JSON in this format: +{{ + "chapter_summary": "Your updated summary of the chapter up to this point..." +}} +""" + + def get_ranker_prompt( book_plot_summary: str, character_dossiers: dict, - book_mentions: dict, - book_appearances: dict + book_mentions: dict ) -> str: """Generate prompt for subjective influence ranking. 
@@ -218,7 +275,6 @@ def get_ranker_prompt( book_plot_summary: Overall plot summary character_dossiers: Character dossiers book_mentions: Mention counts (reference only) - book_appearances: Appearance counts (reference only) Returns: Prompt string @@ -226,7 +282,7 @@ def get_ranker_prompt( dossiers_str = "\n\n".join([ f"""Character: {dossier.get('canonical_name', 'Unknown')} (ID: {char_id}) Aliases: {', '.join(dossier.get('aliases', []))} -Mentions: {book_mentions.get(char_id, 0)}, Appearances: {book_appearances.get(char_id, 0)} +Mentions: {book_mentions.get(char_id, 0)} Arc: {dossier.get('arc_summary', 'N/A')} Relationships: {', '.join(dossier.get('relationships', []))} Impact: {dossier.get('impact_summary', 'N/A')} @@ -250,7 +306,7 @@ def get_ranker_prompt( - Narrative gravity (scenes revolve around them) - Causal responsibility for major events -Mention and appearance counts are provided for reference only. Do not rank solely by frequency. +Mention counts are provided for reference only. Do not rank solely by frequency. Rank all characters from most influential (rank 1) to least influential. 
@@ -261,7 +317,6 @@ def get_ranker_prompt( "character_id": "char_id", "name": "Canonical Name", "aliases": ["alias1", "alias2"], - "appeared_scenes": 42, "mentioned_count": 310, "influence_summary": "Brief summary of their influence across the book", "ranking_rationale": "Why this character ranks at this position" diff --git a/schemas/state.py b/schemas/state.py index cdbf99b..f7d71b7 100644 --- a/schemas/state.py +++ b/schemas/state.py @@ -79,7 +79,6 @@ class FinalCharacterResult(TypedDict): character_id: str name: str aliases: list[str] - appeared_scenes: int mentioned_count: int influence_summary: str # book-level ranking_rationale: NotRequired[Optional[str]] # optional @@ -113,16 +112,16 @@ class BookState(TypedDict, total=False): # Book-level aggregates book_mentions: dict[str, int] # char_id -> total mention hits - book_appearances: dict[str, int] # char_id -> total scenes appeared book_influence: dict[str, InfluenceAccumulator] - chapter_summaries: NotRequired[list[dict]] # optional debug trace + chapter_summaries: dict[str, str] # chapter_id -> summary text # Per-chapter scratch (reset each chapter) current_chapter_id: str current_chapter_text: str - current_scenes: list[Scene] + current_scenes: list[Scene] # All scenes in current chapter (from SceneSegmenter) + current_scene_chunk: list[Scene] # Current batch of scenes being processed + processed_scene_ids: set[str] # Scene IDs already processed in current chapter chapter_mentions_by_char: dict[str, int] - chapter_appearances_by_char: dict[str, int] chapter_influence_evidence: dict[str, InfluenceEvidence] # Synthesis + final outputs
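Note on the chunking loop introduced in this diff: the sketch below is a simplified, standalone illustration of how `scene_chunker_node` and the `next_chunk` route appear to interact, not the repository's code. The names `select_chunk` and `route` are hypothetical; the real nodes read and write `BookState`, while this version uses plain dicts and sets so it can run in isolation.

```python
from typing import TypedDict


class Scene(TypedDict):
    scene_id: str
    text: str


def select_chunk(scenes: list[Scene], processed: set[str], max_chars: int) -> list[Scene]:
    """Pick consecutive unprocessed scenes whose combined length fits max_chars.

    A single oversized scene is still returned on its own, so no scene is
    ever truncated (mirrors the behaviour described in the diff).
    """
    chunk: list[Scene] = []
    total = 0
    for scene in scenes:
        if scene["scene_id"] in processed:
            continue
        size = len(scene["text"])
        # Stop before a scene that would push a non-empty chunk over the limit.
        if chunk and total + size > max_chars:
            break
        chunk.append(scene)
        total += size
        # An oversized first scene is kept whole, then the chunk is closed.
        if total > max_chars:
            break
    return chunk


def route(scenes: list[Scene], processed: set[str], chapter_idx: int, n_chapters: int) -> str:
    """Hypothetical stand-in for the three-way routing decision."""
    if {s["scene_id"] for s in scenes} - processed:
        return "next_chunk"
    if chapter_idx + 1 < n_chapters:
        return "next_chapter"
    return "finalize"


# Four scenes of 3000, 1500, 4000 and 6000 characters, limit 5000.
scenes: list[Scene] = [
    {"scene_id": "s1", "text": "a" * 3000},
    {"scene_id": "s2", "text": "b" * 1500},
    {"scene_id": "s3", "text": "c" * 4000},
    {"scene_id": "s4", "text": "d" * 6000},
]

processed: set[str] = set()
batches: list[list[str]] = []
while True:
    chunk = select_chunk(scenes, processed, max_chars=5000)
    batches.append([s["scene_id"] for s in chunk])
    processed |= {s["scene_id"] for s in chunk}
    if route(scenes, processed, chapter_idx=0, n_chapters=1) != "next_chunk":
        break

print(batches)
```

Under these assumptions the loop always terminates: every pass marks at least one scene as processed, and an oversized scene is emitted as a chunk of its own instead of blocking progress.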