FastMARC API Reference

MARCReader

__init__

MARCReader(fp)

Parameters:

  • fp (file): Binary file object (open(..., "rb"))

Returns: MARCReader instance

Example:

with open("records.mrc", "rb") as f:
    reader = MARCReader(f)

add_index(name, field_spec, mode=None)

Register a named index for a field.

Parameters:

  • name (str): Index identifier (used for search operations)
  • field_spec (str): Field specification
    • Control fields: "001", "008"
    • Data field subfields: "245$a", "650$a"
  • mode (str, optional):
    • "mask": Bitmask for substring search (fuzzy matching)
    • "map": Hash map for exact lookup (O(1) retrieval)
    • None: Auto-detect ("map" for control fields, "mask" for data fields)

Returns: self (for chaining)

Raises: RuntimeError if called after .build_index()

Example:

reader = (MARCReader(fp)
    .add_index("control_num", "001")              # Auto: map (control field)
    .add_index("title", "245$a")                  # Auto: mask (data field)
    .add_index("isbn", "020$a", mode="map")       # Explicit: map (ISBN)
    .add_index("subject", "650$a", mode="mask"))  # Explicit: mask (subjects)
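
To make the "map" mode more concrete, here is a minimal sketch of an exact-lookup index in plain Python. It only illustrates the idea of mapping a value to the records that contain it and is not FastMARC's implementation; build_map_index and the sample ISBN lists are hypothetical.

```python
from collections import defaultdict

def build_map_index(values_per_record):
    """Map each exact value to the list of record indices that contain it."""
    index = defaultdict(list)
    for record_idx, values in enumerate(values_per_record):
        for value in values:
            hits = index[value]
            # Skip repeats of the same value within one record
            if not hits or hits[-1] != record_idx:
                hits.append(record_idx)
    return dict(index)

# Hypothetical 020$a values for three records (the second has no ISBN)
isbns = [["9780131103627"], [], ["9780131103627", "9780262033848"]]
isbn_index = build_map_index(isbns)
print(isbn_index["9780131103627"])  # [0, 2]
```

A dictionary gives constant-time lookup on average, which is why this mode suits identifiers such as control numbers and ISBNs rather than free text.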

build_index(charset=None)

Build indexes and execute hooks.

Parameters:

  • charset (str, optional): Custom character set for fuzzy indexing
    • Default: full 256-bit character space
    • Examples: "0123456789" (digits), "aeiou" (vowels)

Returns: self (for chaining)

Raises: ValueError if no indexes or hooks registered

Example:

# Basic indexing
reader = MARCReader(fp).add_index("title", "245$a").build_index()

# With custom charset (memory optimization)
reader = MARCReader(fp).add_index("isbn", "020$a").build_index(charset="0123456789")

# With hooks only (no search index)
reader = MARCReader(fp).hook("650$a", counter).build_index()
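
The effect of a custom charset can be pictured with a small character-presence bitmask. The sketch below is an assumption about the general technique, not FastMARC's actual data layout: make_mask and candidates are hypothetical helpers, and case folding is left out for brevity. Each value gets one bit per charset character, so a ten-character charset needs only ten bits per value.

```python
def make_mask(text, charset):
    """Return an int with bit i set when charset[i] occurs in text."""
    mask = 0
    for ch in text:
        pos = charset.find(ch)
        if pos >= 0:
            mask |= 1 << pos
    return mask

def candidates(query, masks, charset):
    """Indices whose mask contains every charset character of the query."""
    q = make_mask(query, charset)
    return [i for i, m in enumerate(masks) if (m & q) == q]

DIGITS = "0123456789"
values = ["9780131103627", "0262033844", "1111111111"]
masks = [make_mask(v, DIGITS) for v in values]

# The mask is only a prefilter: confirm each candidate with a real substring test
hits = [i for i in candidates("0262", masks, DIGITS) if "0262" in values[i]]
print(hits)  # [1]
```

The first value passes the prefilter because it contains the digits 0, 2 and 6, but it fails the substring check, which is why the confirmation step is needed.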

hook(field_specs, callable)

Register a hook that receives field values from each record during indexing, for aggregation or analysis.

Parameters:

  • field_specs:
    • str: Single field (e.g., "650$a")
    • list[str]: Multiple fields (e.g., ["008", "264$c"])
  • callable: Hook function

Returns: self (for chaining)

Hook Signatures:

Single-field hook:

def hook(values: list[str]) -> None:
    """Called once per record with list of all occurrences."""
    pass

Multi-field hook:

def hook(fields: dict[str, list[str]]) -> None:
    """Called once per record with dict of fields present."""
    pass

Example:

from collections import Counter

# Single-field hook
class FieldCounter:
    def __init__(self):
        self.counts = Counter()

    def __call__(self, values):
        for v in values:
            self.counts[v] += 1

subjects = FieldCounter()
reader = MARCReader(fp).hook("650$a", subjects).build_index()
print(subjects.counts.most_common(10))

# Multi-field hook (fallback logic)
class YearExtractor:
    def __init__(self):
        self.years = Counter()

    def __call__(self, fields):
        year = None
        if fields.get("008"):
            year = fields["008"][0][7:11]  # 008/07-10: Date 1
        elif fields.get("264$c"):
            year = fields["264$c"][0].strip("[]c. ")
        if year and year.isdigit():
            self.years[year] += 1

years = YearExtractor()
reader = MARCReader(fp).hook(["008", "264$c"], years).build_index()
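
Because a hook is just a callable, a plain function or closure works as well as a class, and it can be exercised by hand before a large file is indexed. The sketch below assumes the multi-field signature documented above; make_language_hook and the sample inputs are hypothetical. It reads the language code from 008 positions 35-37 and falls back to 041$a.

```python
from collections import Counter

def make_language_hook(counts):
    """Return a multi-field hook that tallies language codes into counts."""
    def hook(fields):
        code = None
        if fields.get("008"):
            code = fields["008"][0][35:38].strip()
        if not code and fields.get("041$a"):
            code = fields["041$a"][0][:3]
        if code:
            counts[code] += 1
    return hook

languages = Counter()
language_hook = make_language_hook(languages)

# Call the hook directly with dicts shaped like the documented input
sample_008 = "850101s1984    nyu".ljust(35) + "eng d"
language_hook({"008": [sample_008]})
language_hook({"041$a": ["fre"]})
language_hook({"245$a": ["No language information"]})
print(languages)
```

Registered with hook(["008", "041$a"], language_hook), the same function would then be called once per record during build_index().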

search(field_spec, text)

Search for records containing text in the specified field.

Parameters:

  • field_spec (str): Field specification (e.g., "245$a", "001")
  • text (str): Search query

Returns: list[int] - List of matching record indices

Behavior:

  • If index exists for field: Uses index (mask or map mode)
  • If no index: Performs sequential scan through all records
  • Mask mode: Case-insensitive substring match
  • Map mode: Exact value lookup (returns every record that shares the value)

Raises: RuntimeError if .build_index() not called

Example:

reader = (MARCReader(fp)
    .add_index("control_num", "001", mode="map")
    .add_index("title", "245$a", mode="mask")
    .build_index())

# Uses map index (fast)
ids = reader.search("001", "12345")

# Uses mask index (fast)
results = reader.search("245$a", "music")

# No index - sequential scan (slower but still works)
results = reader.search("260$a", "New York")

get_record(idx)

Retrieve record by index.

Parameters:

  • idx (int): Zero-based record index

Returns: pymarc.Record

Raises:

  • RuntimeError if .build_index() not called
  • IndexError if out of range

Example:

reader = MARCReader(fp).add_index("title", "245$a").build_index()
record = reader.get_record(0)
print(record['245']['a'])
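
search() returns indices and get_record() turns an index into a record, so the two are usually combined. The helper below is a sketch that relies only on those two documented methods; iter_matches is a hypothetical name, and StubReader is a stand-in so the snippet runs without a MARC file.

```python
from itertools import islice

def iter_matches(reader, field_spec, text, limit=None):
    """Yield (index, record) pairs for the hits returned by reader.search()."""
    for idx in islice(reader.search(field_spec, text), limit):
        yield idx, reader.get_record(idx)

class StubReader:
    """Stand-in exposing search() and get_record() over a list of titles."""
    def __init__(self, titles):
        self._titles = titles

    def search(self, field_spec, text):
        return [i for i, t in enumerate(self._titles) if text.lower() in t.lower()]

    def get_record(self, idx):
        return self._titles[idx]

stub = StubReader(["Music theory", "Cooking basics", "Chamber music"])
for idx, record in iter_matches(stub, "245$a", "music", limit=1):
    print(idx, record)  # 0 Music theory
```

Fetching records lazily keeps memory use flat when a query matches many records but only the first few are needed.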

get_index(name)

Direct access to map index dictionary.

Parameters:

  • name (str): Index name (must be mode="map")

Returns: dict[str, list[int]] - Value → record indices mapping

Raises: ValueError if index not found or not mode="map"

Example:

reader = MARCReader(fp).add_index("title", "245$a", mode="map").build_index()
title_index = reader.get_index("title")

# Find duplicates
for title, indices in title_index.items():
    if len(indices) > 1:
        print(f"'{title}': {len(indices)} records")

get_all_values(field_spec)

Extract all values of a field/subfield from every record.

Parameters:

  • field_spec (str): Field specification (e.g., "001", "245$a", "650$a")

Returns: list[list[str]] - List of lists, one per record

  • Outer list length always equals number of records
  • Each inner list contains all occurrences of the field in that record
  • Inner list is empty [] if record doesn’t have the field

Raises: RuntimeError if .build_index() not called

Behavior:

  • Scans through all records sequentially
  • Preserves record-level organization
  • For repeating fields (e.g., 650$a), inner lists may have multiple entries
  • Decodes field bytes as UTF-8 and returns str values

Example:

reader = MARCReader(fp).add_index("title", "245$a").build_index()

# Get all titles (one list per record)
all_titles = reader.get_all_values("245$a")
print(f"Total records: {len(all_titles)}")

# Count records with titles
records_with_titles = sum(1 for titles in all_titles if titles)
print(f"Records with titles: {records_with_titles}")

# Get all subject headings (repeating field)
all_subjects = reader.get_all_values("650$a")

# Find records with more than three subjects
for idx, subjects in enumerate(all_subjects):
    if len(subjects) > 3:
        print(f"Record {idx} has {len(subjects)} subjects")

# Flatten to get all unique subjects
from itertools import chain
unique_subjects = set(chain.from_iterable(all_subjects))
print(f"Unique subjects: {len(unique_subjects)}")
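
Since the outer list lines up with record indices, the result also shows which records lack a field. The sketch below works on a hand-written list with the documented list[list[str]] shape; summarize and the sample data are hypothetical.

```python
from collections import Counter

def summarize(values_per_record):
    """Return (indices of records without the field, frequency of each value)."""
    missing = [i for i, vals in enumerate(values_per_record) if not vals]
    counts = Counter(v for vals in values_per_record for v in vals)
    return missing, counts

# Same shape as a get_all_values("650$a") result for three records
sample_subjects = [["Jazz", "Music"], [], ["Music"]]
missing, counts = summarize(sample_subjects)
print(missing)                # [1]
print(counts.most_common(1))  # [('Music', 2)]
```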

__len__()

Get total record count.

Returns: int

Raises: RuntimeError if .build_index() not called

Example:

reader = MARCReader(fp).add_index("title", "245$a").build_index()
print(f"{len(reader):,} records")

__iter__()

Iterate through all records.

Yields: pymarc.Record

Note: Iteration works without calling .build_index(), but in that case hooks do not run and search is unavailable.

Example:

reader = MARCReader(fp)
for record in reader:
    print(record['245']['a'])

close()

Free memory and close resources.

Note: Called automatically on garbage collection.
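
Garbage collection timing is not guaranteed, so it can be useful to release resources at a known point. The standard library's contextlib.closing calls close() on any object when the block exits. The sketch below uses a hypothetical StubReader in place of MARCReader; whether MARCReader can be used directly in a with statement is not stated here, so the wrapper is the safer assumption.

```python
from contextlib import closing

class StubReader:
    """Hypothetical stand-in that exposes only close()."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

with closing(StubReader()) as reader:
    pass  # search, iterate, or fetch records here

print(reader.closed)  # True
```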