A step-by-step guide to reverse-engineering a web app's network requests and automating library catalog searches.
The Phoenix Public Library catalog lets you search for books by title. But when you search for "magic" with Search by: Title, the results include books where "magic" only appears in the series name — not the actual book title. For example:
- ❌ "A bossy bad day" — keyword is in Series: Williams, Zanaiah Boss Magic
- ❌ "Found" — keyword is in Series: Prineas, Sarah. Magic Thief
- ✅ "Magic" — keyword IS in the actual title
- ✅ "The Magic Tree House" — keyword IS in the title
The manual approach: search, change results per page to 100, scroll through hundreds of pages, copy relevant rows by hand.
The goal: automate this — take a list of keywords, output a clean CSV of only those books where the keyword appears in the actual title.
Playwright (and Selenium) launch a real browser to click buttons and render JavaScript. For this library site, that approach breaks because:
- The site sets session cookies on the first visit — a browser handles this automatically, but headless automation often fails to sequence it correctly.
- The actual search results are loaded via a hidden AJAX request, not rendered in the initial page HTML. Playwright times out waiting for elements that load asynchronously.
The right approach: skip the browser entirely and talk directly to the API the website already uses.
Open Chrome, go to https://catalog.phoenixpubliclibrary.org/search/, press F12 to open DevTools, and click the Network tab.
- Type "magic" in the search box, set Search by to "Title", click Search.
- Watch the Network tab. You'll see a flood of requests. Filter by XHR/Fetch to see only API calls.
You'll notice two important things:
First: The search results page URL reveals the search parameters:
`searchresults.aspx?type=Keyword&term=magic&by=TI&sort=RELEVANCE&page=0`
`by=TI` means "search by Title". This is the key parameter.
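For reference, that query string can be rebuilt from a plain dict with the stdlib `urllib.parse.urlencode` — a quick sketch using only the parameter names observed in the URL:

```python
from urllib.parse import urlencode

# Parameters observed in the search results URL; by="TI" selects title search
params = {"type": "Keyword", "term": "magic", "by": "TI", "sort": "RELEVANCE", "page": "0"}
url = "https://catalog.phoenixpubliclibrary.org/search/searchresults.aspx?" + urlencode(params)
print(url)
```

Passing the same dict as `params=` to requests (shown later) produces an identical URL.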
Second: the `#searchResults` div on the page starts empty with a loading spinner — results load after the page. Filter by XHR and you'll find a call to:
`/search/components/ajaxResults.aspx?page=1&hpp=100`
This is the actual endpoint that returns the HTML results fragment.
The site uses `ASP.NET_SessionId` cookies. Without a valid session, requests get stuck in a redirect loop.
The fix: use requests.Session() — it automatically stores and replays cookies just like a browser.
```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 ..."})

# Step 1: Hit the home page to get session cookies
session.get("https://catalog.phoenixpubliclibrary.org/search/")

# Step 2: Trigger the search (this sets search state on the server)
session.get(
    "https://catalog.phoenixpubliclibrary.org/search/searchresults.aspx",
    params={"type": "Keyword", "term": "magic", "by": "TI", "sort": "RELEVANCE", "page": "0"},
)

# Step 3: NOW fetch the actual results via the AJAX endpoint
r = session.get(
    "https://catalog.phoenixpubliclibrary.org/search/components/ajaxResults.aspx",
    params={"page": "1", "hpp": "100"},
)
```

Order matters: you must visit the home page, then trigger the search, then call `ajaxResults.aspx`. Each step sets server-side session state that the next step depends on.
ajaxResults.aspx returns an HTML fragment (not a full page). Inspect it with BeautifulSoup:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, "html.parser")
title_groups = soup.select(".nsm-brief-primary-title-group")
author_groups = soup.select(".nsm-brief-primary-author-group")

for i, title_el in enumerate(title_groups):
    link = title_el.find("a", class_="nsm-brief-action-link")
    title = link.get_text(strip=True)
    author_span = author_groups[i].find("span", class_="nsm-browse-text")
    author = author_span.get_text(strip=True)
    # Strip the trailing ", author." suffix the site adds
    # (removesuffix, not rstrip: rstrip strips individual characters, not a suffix)
    author = author.removesuffix(", author.").strip()
    print(title, "|", author)
```

Key CSS classes (discovered via DevTools > Elements panel):
| Element | CSS Class |
|---|---|
| Title container | .nsm-brief-primary-title-group |
| Title link | .nsm-brief-action-link |
| Author container | .nsm-brief-primary-author-group |
| Author text | .nsm-browse-text |
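These selectors can be sanity-checked offline against a synthetic fragment that mimics the observed markup (the HTML below is invented for illustration, not copied from the site):

```python
from bs4 import BeautifulSoup

# Synthetic fragment using the classes observed in DevTools
html = """
<div class="nsm-brief-primary-title-group">
  <a class="nsm-brief-action-link" href="#">The Magic Tree House</a>
</div>
<div class="nsm-brief-primary-author-group">
  <span class="nsm-browse-text">Osborne, Mary Pope, author.</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
title = soup.select_one(".nsm-brief-primary-title-group .nsm-brief-action-link").get_text(strip=True)
author = soup.select_one(".nsm-brief-primary-author-group .nsm-browse-text").get_text(strip=True)
print(title, "|", author)
```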
The result count lives in `.c-results-utility-result-count` and looks like "1 - 100 of 15145". Extract the total:

```python
import math
import re

count_div = soup.select_one(".c-results-utility-result-count")
text = count_div.get_text()
total = int(re.search(r"of\s+([\d,]+)", text).group(1).replace(",", ""))
total_pages = math.ceil(total / 100)
```

Then loop:
```python
import time

for page in range(1, total_pages + 1):
    r = session.get(AJAX_URL, params={"page": str(page), "hpp": "100"})
    # parse, filter, collect...
    time.sleep(0.5)  # be polite to the server
```

The library's "Search by Title" still returns items where the keyword is only in a series or subtitle field. Filter client-side:
```python
def keyword_in_title(keyword: str, title: str) -> bool:
    """True if the keyword phrase, or every word of it, appears in the title."""
    title_lower = title.lower()
    if keyword.lower() in title_lower:
        return True
    # Multi-word: all words must be present
    return all(w in title_lower for w in keyword.lower().split())
```

This correctly:
- ✅ Includes "Magic Tree House" for keyword "magic tree"
- ❌ Excludes "A bossy bad day" (series name has magic, title doesn't)
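A quick self-contained check of those two cases (the function is repeated here so the snippet runs on its own):

```python
def keyword_in_title(keyword: str, title: str) -> bool:
    """True if the keyword phrase, or every word of it, appears in the title."""
    title_lower = title.lower()
    if keyword.lower() in title_lower:
        return True
    return all(w in title_lower for w in keyword.lower().split())

assert keyword_in_title("magic tree", "The Magic Tree House")  # words in title
assert keyword_in_title("magic", "Magic")
assert not keyword_in_title("magic", "A bossy bad day")  # "magic" only in series name
```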
When Ananth received the first draft of the script, he clarified: he only wanted audiobooks — specifically "Downloadable eAudio Book" and "Streaming Audiobook". The "magic" search returned results from books, DVDs, music CDs, e-books, and more. We needed to filter those out.
Finding the filter parameter
The catalog search page has a "Limit by" dropdown. Inspecting its HTML reveals the underlying limit query parameter:
```html
<select name="...dropdownLimitFilter">
  <option value="MAT=*">All Materials</option>
  <option value="MAT=34">Downloadable eAudio Book</option>
  <option value="MAT=46">Streaming Audiobook</option>
  ...
</select>
```

The `limit=MAT=34` parameter tells the server to return only that material type — no client-side filtering needed. This cuts the result set dramatically (15,000+ total results → ~1,700 audiobooks for "magic").
Running two searches, combining results
Since we want both types, the script runs one search pass per material type and merges the rows:
```python
MATERIAL_TYPES = {
    "MAT=34": "Downloadable eAudio Book",
    "MAT=46": "Streaming Audiobook",
}

for mat_filter, format_label in MATERIAL_TYPES.items():
    session.get(SEARCH_URL, params={..., "limit": mat_filter, ...})
    # paginate through ajaxResults.aspx, filter by title, collect rows
```

Toggle any format in one line
The MATERIAL_TYPES dict is the only thing you need to edit. Every entry is one search pass. To add physical audiobooks on CD, for example:
```python
MATERIAL_TYPES = {
    "MAT=34": "Downloadable eAudio Book",
    "MAT=46": "Streaming Audiobook",
    "MAT=3": "Book on CD",  # ← add this line
}
```

To disable the filter entirely and get all material types, swap in `MAT=*`:
```python
MATERIAL_TYPES = {
    "MAT=*": "All Materials",
}
```

The Format column
The CSV now includes a fourth column so you can tell the two audiobook types apart at a glance:
```
Keyword,Title,Author,Format
magic,Magic,Gosden Chris,Downloadable eAudio Book
magic,The magic thief,Prineas Sarah,Streaming Audiobook
...
```
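Emitting that four-column CSV is a one-liner with the stdlib `csv` module. A minimal sketch — the rows are copied from the sample above, and `results.csv` is assumed as the output path:

```python
import csv

# (keyword, title, author, format) rows collected during the search passes
rows = [
    ("magic", "Magic", "Gosden Chris", "Downloadable eAudio Book"),
    ("magic", "The magic thief", "Prineas Sarah", "Streaming Audiobook"),
]

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Keyword", "Title", "Author", "Format"])
    writer.writerows(rows)
```

Using `csv.writer` (rather than joining strings by hand) keeps titles containing commas correctly quoted.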
Install dependencies:
```shell
pip install requests beautifulsoup4
```

Create a `keywords.txt` file with one keyword per line:

```
magic
```

Run:

```shell
python scraper.py --keywords keywords.txt --output results.csv
```

Sample output (audiobooks only):
```
Keyword,Title,Author,Format
magic,Magic,Gosden Chris,Downloadable eAudio Book
magic,Black magic woman,Warren Christine,Downloadable eAudio Book
magic,The magic of Oz,Baum L. Frank,Downloadable eAudio Book
magic,The magic thief,Prineas Sarah,Streaming Audiobook
...
```
Results for "magic" with audiobook filter: 1,413 matches (313 Downloadable + 1,100 Streaming) out of ~1,670 total audiobook results — the difference is items where "magic" appears in a series name but not the title itself.
- Inspect XHR requests first — most modern sites load data via AJAX, not in the initial HTML.
- `requests.Session()` replaces browser cookies — it automatically handles cookie-based session state.
- Find the real API endpoint — the `ajaxResults.aspx` URL was hidden in a JavaScript file (`results.js`). Searching for function names in loaded scripts reveals the real URLs.
- Filter client-side when needed — the server-side search is "close enough" but not exact; a simple Python string check closes the gap.
- Use server-side filters when available — the `MAT=` parameter cuts the result set before pagination, making the script faster and the output cleaner.
- Be polite — `time.sleep(0.5)` between pages prevents hammering the server and getting blocked.
| File | Purpose |
|---|---|
| `scraper.py` | Main script — run this |
| `keywords.txt` | Input: one keyword per line |
| `results.csv` | Output: Keyword, Title, Author, Format |
| `requirements.txt` | Python dependencies |