MARCReader(fp)
Parameters:
fp(file): Binary file object (open(..., "rb"))
Returns: MARCReader instance
Example:
with open("records.mrc", "rb") as f:
    reader = MARCReader(f)

Register a named index for a field.
Parameters:
name(str): Index identifier (used for search operations)
field_spec(str): Field specification
- Control fields: "001", "008"
- Data field subfields: "245$a", "650$a"
mode(str, optional):
- "mask": Bitmask for substring search (fuzzy matching)
- "map": Hash map for exact lookup (O(1) retrieval)
- None: Auto-detect (control="map", data="mask")
Returns: self (for chaining)
Raises: RuntimeError if called after .build_index()
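The auto-detect rule for mode can be pictured with a short sketch. It assumes the usual MARC convention that control fields are the tags 001 through 009 and carry no subfields. The helpers parse_field_spec and default_mode are hypothetical names used for illustration only and are not part of MARCReader.

```python
def parse_field_spec(field_spec):
    """Split "245$a" into ("245", "a"); a bare tag like "001" gives ("001", None)."""
    tag, sep, subfield = field_spec.partition("$")
    return tag, (subfield if sep else None)


def default_mode(field_spec):
    """Mirror the documented default: control fields -> "map", data fields -> "mask"."""
    tag, _ = parse_field_spec(field_spec)
    # Assumption: control fields are the three-digit tags starting with "00".
    is_control = len(tag) == 3 and tag.isdigit() and tag.startswith("00")
    return "map" if is_control else "mask"


for spec in ("001", "008", "245$a", "650$a"):
    print(spec, parse_field_spec(spec), default_mode(spec))
```

Passing mode explicitly, as in the ISBN example, overrides whatever the default would have chosen.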
Example:
reader = (MARCReader(fp)
    .add_index("control_num", "001")              # Auto: map (control field)
    .add_index("title", "245$a")                  # Auto: mask (data field)
    .add_index("isbn", "020$a", mode="map")       # Explicit: map (ISBN)
    .add_index("subject", "650$a", mode="mask"))  # Explicit: mask (subjects)

Build indexes and execute hooks.
Parameters:
charset(str, optional): Custom character set for fuzzy indexing
- Default: full 256-bit character space
- Examples: "0123456789" (digits), "aeiou" (vowels)
Returns: self (for chaining)
Raises: ValueError if no indexes or hooks registered
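The source does not describe how the mask index is implemented. One plausible reading, sketched below as an assumption, is that each indexed value is summarized by a bitmask recording which charset characters it contains, so a query can be rejected with a single bitwise test before any string comparison. Under that reading a ten-character charset needs ten bits per value instead of 256. The names char_mask and may_contain are hypothetical.

```python
def char_mask(text, charset):
    """Return an int with bit i set when charset[i] occurs in text."""
    positions = {ch: i for i, ch in enumerate(charset)}
    mask = 0
    for ch in text:
        i = positions.get(ch)
        if i is not None:  # characters outside the charset are ignored
            mask |= 1 << i
    return mask


def may_contain(value_mask, query_mask):
    """A value can contain the query only if it has every query character."""
    return (value_mask & query_mask) == query_mask


DIGITS = "0123456789"
values = ["9780131103627", "0262033844", "7263"]
masks = [char_mask(v, DIGITS) for v in values]

query = "3627"
q = char_mask(query, DIGITS)
# Cheap bitwise prefilter first, exact substring check second.
candidates = [i for i, m in enumerate(masks) if may_contain(m, q)]
hits = [i for i in candidates if query in values[i]]
print(candidates, hits)
```

The third value passes the prefilter because it holds the same digits in a different order, which is why the sketch still confirms each candidate with a real substring test.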
Example:
# Basic indexing
reader = MARCReader(fp).add_index("title", "245$a").build_index()
# With custom charset (memory optimization)
reader = MARCReader(fp).add_index("isbn", "020$a").build_index(charset="0123456789")
# With hooks only (no search index)
reader = MARCReader(fp).hook("650$a", counter).build_index()

Register field hook for aggregation/analysis during indexing.
Parameters:
field_specs:
- str: Single field (e.g., "650$a")
- list[str]: Multiple fields (e.g., ["008", "264$c"])
callable: Hook function
Returns: self (for chaining)
Hook Signatures:
Single-field hook:
def hook(values: list[str]) -> None:
    """Called once per record with list of all occurrences."""
    pass
Multi-field hook:
def hook(fields: dict[str, list[str]]) -> None:
    """Called once per record with dict of fields present."""
    pass
Example:
from collections import Counter

# Single-field hook
class FieldCounter:
    def __init__(self):
        self.counts = Counter()

    def __call__(self, values):
        for v in values:
            self.counts[v] += 1

subjects = FieldCounter()
reader = MARCReader(fp).hook("650$a", subjects).build_index()
print(subjects.counts.most_common(10))

# Multi-field hook (fallback logic)
class YearExtractor:
    def __init__(self):
        self.years = Counter()

    def __call__(self, fields):
        year = None
        if "008" in fields and fields["008"]:
            # 008 positions 07-10 hold the first date
            year = fields["008"][0][7:11]
        elif "264$c" in fields and fields["264$c"]:
            year = fields["264$c"][0].strip("[]c")
        if year and year.isdigit():
            self.years[year] += 1

years = YearExtractor()
reader = MARCReader(fp).hook(["008", "264$c"], years).build_index()

Search for records containing text in the specified field.
Parameters:
field_spec(str): Field specification (e.g., "245$a", "001")
text(str): Search query
Returns: list[int] - List of matching record indices
Behavior:
- If index exists for field: Uses index (mask or map mode)
- If no index: Performs sequential scan through all records
- Mask mode: Case-insensitive substring match
- Map mode: Exact value lookup (returns all collisions)
Raises: RuntimeError if .build_index() not called
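The two matching semantics in the Behavior list can be illustrated on toy data without the library. This is a sketch under stated assumptions: records are modeled as plain lists of strings, and build_map and substring_match are hypothetical helpers, not MARCReader internals. The source does not say whether the un-indexed sequential scan is case-insensitive, so the second helper models only the documented mask-mode result.

```python
from collections import defaultdict

# Toy stand-in for parsed records: one list of values per record.
control_nums = [["12345"], ["67890"], ["12345"]]
titles = [["Music theory"], ["Chamber MUSIC"], ["Painting"]]


def build_map(values_per_record):
    """Map-mode semantics: exact value -> list of record indices."""
    index = defaultdict(list)
    for idx, values in enumerate(values_per_record):
        for value in values:
            index[value].append(idx)
    return dict(index)


def substring_match(values_per_record, text):
    """Mask-mode semantics: case-insensitive substring match per record."""
    needle = text.lower()
    return [idx for idx, values in enumerate(values_per_record)
            if any(needle in value.lower() for value in values)]


control_index = build_map(control_nums)
exact = control_index.get("12345", [])     # every record sharing the value
partial = control_index.get("123", [])     # exact lookup: a prefix finds nothing
fuzzy = substring_match(titles, "music")   # matches regardless of case
print(exact, partial, fuzzy)
```

The practical consequence is that a map index suits identifiers that are looked up whole, while a mask index suits free text where only part of the value is known.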
Example:
reader = (MARCReader(fp)
    .add_index("control_num", "001", mode="map")
    .add_index("title", "245$a", mode="mask")
    .build_index())
# Uses map index (fast)
ids = reader.search("001", "12345")
# Uses mask index (fast)
results = reader.search("245$a", "music")
# No index - sequential scan (slower but still works)
results = reader.search("260$a", "New York")

Retrieve record by index.
Parameters:
idx(int): Zero-based record index
Returns: pymarc.Record
Raises:
- RuntimeError if .build_index() not called
- IndexError if out of range
Example:
reader = MARCReader(fp).add_index("title", "245$a").build_index()
record = reader.get_record(0)
print(record['245']['a'])

Direct access to map index dictionary.
Parameters:
name(str): Index name (must be mode="map")
Returns: dict[str, list[int]] - Value → record indices mapping
Raises: ValueError if index not found or not mode="map"
Example:
reader = MARCReader(fp).add_index("title", "245$a", mode="map").build_index()
title_index = reader.get_index("title")
# Find duplicates
for title, indices in title_index.items():
    if len(indices) > 1:
        print(f"'{title}': {len(indices)} records")

Extract all values of a field/subfield from every record.
Parameters:
field_spec(str): Field specification (e.g.,"001","245$a","650$a")
Returns: list[list[str]] - List of lists, one per record
- Outer list length always equals number of records
- Each inner list contains all occurrences of the field in that record
- Inner list is empty [] if record doesn't have the field
Raises: RuntimeError if .build_index() not called
Behavior:
- Scans through all records sequentially
- Preserves record-level organization
- For repeating fields (e.g., 650$a), inner lists may have multiple entries
- Decodes bytes to UTF-8 strings
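To make the return shape concrete, here is a small sketch that works on a hand-written list of lists rather than a real file. The data is invented for illustration and stands in for what a three-record file might yield for "650$a".

```python
from collections import Counter
from itertools import chain

# Hypothetical return value for a three-record file queried with "650$a".
all_subjects = [
    ["Jazz", "Music"],  # record 0: repeating field, two occurrences
    [],                 # record 1: field absent, so the inner list is empty
    ["Music"],          # record 2: single occurrence
]

# The outer list is positional: index i always describes record i.
per_record = [len(values) for values in all_subjects]
missing = [idx for idx, values in enumerate(all_subjects) if not values]

# Flattening discards the record boundaries but allows corpus-wide counts.
frequency = Counter(chain.from_iterable(all_subjects))
print(per_record, missing, frequency.most_common(1))
```

Because empty inner lists are kept, positions in the result stay aligned with the record indices accepted by get_record.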
Example:
# build_index() raises ValueError without at least one index or hook
reader = MARCReader(fp).add_index("title", "245$a").build_index()
# Get all titles (one list per record)
all_titles = reader.get_all_values("245$a")
print(f"Total records: {len(all_titles)}")
# Count records with titles
records_with_titles = sum(1 for titles in all_titles if titles)
print(f"Records with titles: {records_with_titles}")
# Get all subject headings (repeating field)
all_subjects = reader.get_all_values("650$a")
# Find records with more than three subjects
for idx, subjects in enumerate(all_subjects):
    if len(subjects) > 3:
        record = reader.get_record(idx)
        print(f"Record {idx} has {len(subjects)} subjects")
# Flatten to get all unique subjects
from itertools import chain
unique_subjects = set(chain.from_iterable(all_subjects))
print(f"Unique subjects: {len(unique_subjects)}")

Get total record count.
Returns: int
Raises: RuntimeError if .build_index() not called
Example:
reader = MARCReader(fp).add_index("title", "245$a").build_index()
print(f"{len(reader):,} records")

Iterate through all records.
Yields: pymarc.Record
Note: Can iterate without calling .build_index() (no hooks/search)
Example:
reader = MARCReader(fp)
for record in reader:
    print(record['245']['a'])

Free memory and close resources.
Note: Called automatically on garbage collection.