feat: Capture parent element context in TreeScraper link extraction #20

@willgriffin

Problem

TreeScraper extracts links but loses hierarchical context. When a link says "View" but its parent element says "Council Minutes December 30, 2024", we lose that date information.

Current Behavior

interface Link {
  href: string;
  text: string;        // Just "View"
  title?: string;
  ariaLabel?: string;
  // ... no parent context
}

Proposed Enhancement

Extend Link interface to capture parent context:

interface Link {
  href: string;
  text: string;
  title?: string;
  ariaLabel?: string;
  // NEW: Parent context
  parentText?: string;        // Immediate parent's text content
  ancestorTexts?: string[];   // Path of ancestor texts (e.g., ["2024", "Minutes", "December 30"])
  hierarchyLevel?: number;    // Depth in tree expansion
}

Implementation Notes

The code already constructs element paths in extractLinksWithTreeExpansion (lines 196-209 in tree.ts) but discards them. Changes needed:

  1. In extractLinks() (line 124), traverse up from each <a> to capture parent text
  2. In extractLinksWithTreeExpansion(), track which expansion iteration revealed each link
  3. Add the new optional fields (parentText, ancestorTexts, hierarchyLevel) to the Link interface in types.ts
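
A minimal sketch of step 1, assuming a simplified in-memory node type rather than the real DOM or whatever page API tree.ts actually uses. SketchNode, el, textContent, and toLinkWithParent are hypothetical names for illustration and do not come from the TreeScraper code. Here parentText is taken to be the parent's whitespace-normalized text content, which therefore includes the link's own text.

```typescript
// Hypothetical stand-in for a DOM element; not the type used in tree.ts.
interface SketchNode {
  tag: string;
  text: string; // text placed directly inside this node
  attrs: Record<string, string>;
  children: SketchNode[];
  parent: SketchNode | null;
}

// Subset of the proposed Link interface needed for this sketch.
interface Link {
  href: string;
  text: string;
  parentText?: string;
}

// Build a node and wire up the parent pointers of its children.
function el(
  tag: string,
  text = "",
  attrs: Record<string, string> = {},
  children: SketchNode[] = [],
): SketchNode {
  const node: SketchNode = { tag, text, attrs, children, parent: null };
  for (const child of children) child.parent = node;
  return node;
}

// Concatenate a node's own text with its descendants' text and
// collapse runs of whitespace, roughly like a trimmed textContent.
function textContent(node: SketchNode): string {
  const parts = [node.text, ...node.children.map(textContent)];
  return parts.join(" ").replace(/\s+/g, " ").trim();
}

// Step up one level from the <a> node and record the parent's text.
function toLinkWithParent(a: SketchNode): Link {
  const link: Link = { href: a.attrs["href"] ?? "", text: textContent(a) };
  if (a.parent !== null) {
    const parentText = textContent(a.parent);
    if (parentText !== "") link.parentText = parentText;
  }
  return link;
}

// Usage with the markup from the Use Case section.
const anchor = el("a", "View", { href: "/public/download/files/266576" });
el("div", "", { class: "meeting-item" }, [
  el("span", "Council Minutes December 30, 2024"),
  anchor,
]);
console.log(toLinkWithParent(anchor));
```

Continuing the same walk through a.parent.parent and so on would be one way to fill ancestorTexts. Whether each entry should hold an ancestor's full text or only its own direct text is left open here.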

Use Case

Municipal sites like eckville.com have:

<div class="meeting-item">
  <span>Council Minutes December 30, 2024</span>
  <a href="/public/download/files/266576">View</a>
</div>

With parent context, praeco's parser can extract "December 30, 2024" from parentText instead of just seeing "View".
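
As a rough illustration of the consumer side, the sketch below pulls a "Month D, YYYY" date out of a parentText string. praeco's actual parser is not shown in this issue, so extractMeetingDate and its return shape are assumptions made only for this example, and it handles just the English long-month format seen above.

```typescript
// Hypothetical helper; not part of praeco or TreeScraper.
const MONTHS = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

// Matches dates such as "December 30, 2024" anywhere in the text.
const DATE_PATTERN = new RegExp(
  `\\b(${MONTHS.join("|")})\\s+(\\d{1,2}),\\s*(\\d{4})\\b`,
  "i",
);

interface MeetingDate {
  year: number;
  month: number; // 1-12
  day: number;
}

// Return the first long-form date found in the text, or null if none.
function extractMeetingDate(text: string): MeetingDate | null {
  const match = DATE_PATTERN.exec(text);
  if (match === null) return null;
  const month = MONTHS.indexOf((match[1] ?? "").toLowerCase()) + 1;
  return { year: Number(match[3]), month, day: Number(match[2]) };
}

// The link text alone carries no date; the parent text does.
console.log(extractMeetingDate("View"));
console.log(extractMeetingDate("Council Minutes December 30, 2024 View"));
```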

Backwards Compatibility

  • New fields are optional, so existing consumers will not break
  • Existing link extraction behavior is unchanged
  • The change only adds metadata to the Link objects
