
Add harx: comprehensive HAR file extraction and analysis toolkit#1

Open
Saul-BT wants to merge 1 commit into `master` from `claude/analyze-and-improve-3RJW7`

Conversation


@Saul-BT Saul-BT commented May 6, 2026

This PR introduces harx, a complete command-line toolkit for working with HTTP Archive (HAR) files. It replaces the previous simple har-extract.pl script with a full-featured application supporting extraction, filtering, auditing, sanitization, and conversion of HAR files.

Key Changes

Core Infrastructure

  • New modular Perl library structure under lib/HAR/Extractor/ with 40+ modules
  • CLI entry point (bin/harx) with subcommand dispatch system
  • Comprehensive logging system with verbosity levels
  • HAR parser with validation support

Extraction & Output

  • Rich filtering engine supporting URL regex/glob, domain matching, HTTP methods, status codes (including specs like `2xx` and `404,500-599`), MIME types, size ranges with K/M/G suffixes, time windows, headers, cookies, and body content
  • Multiple output formats: filesystem (default), ZIP, tar.gz, SQLite
  • Output modes: default (by MIME), mirror (URL-path-as-directories), full-trace (per-entry directories)
  • Manifest generation in JSON, CSV, JSONL, or SQLite formats
  • Support for decompression (gzip, deflate, brotli)
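
The status-code specs mentioned above accept exact codes, class wildcards, and inclusive ranges. A minimal sketch of such a matcher, written here in Python purely for illustration (the toolkit itself is Perl, and its actual parsing may differ):

```python
def status_matches(spec: str, status: int) -> bool:
    """Check an HTTP status against a comma-separated spec.

    Supports exact codes (404), class wildcards (2xx),
    and inclusive numeric ranges (500-599).
    """
    for part in spec.split(","):
        part = part.strip().lower()
        if part.endswith("xx"):                 # class wildcard: 2xx matches 200-299
            if status // 100 == int(part[0]):
                return True
        elif "-" in part:                       # inclusive range: 500-599
            lo, hi = part.split("-")
            if int(lo) <= status <= int(hi):
                return True
        elif part and status == int(part):      # exact code: 404
            return True
    return False
```

For example, `status_matches("404,500-599", 503)` is true while `status_matches("2xx", 404)` is false.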

Auditing & Security

  • Secret detection (AWS keys, JWT, GitHub PAT, Stripe keys, PEM, etc.) via pattern catalog
  • Security header validation (CSP, HSTS, X-Frame-Options, etc.)
  • Cookie security analysis (HttpOnly, Secure, SameSite flags)
  • Mixed content detection (HTTP resources in HTTPS pages)
  • CORS policy validation
  • PII detection (email, phone numbers, IP addresses with locale support)
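
Pattern-catalog secret scanning of the kind described above boils down to running a set of named regexes over bodies and header values. A hypothetical Python sketch with a tiny hard-coded catalog (the real toolkit loads an extensible JSON catalog, and its exact patterns are not shown here):

```python
import re

# Illustrative patterns only; the well-known token prefixes (AKIA..., ghp_..., eyJ...)
# are documented formats, but real catalogs cover many more services.
SECRET_PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github-pat":        re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "jwt":               re.compile(r"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]*"),
}

def scan_for_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_token) pairs found in a body or header value."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, match) for match in pattern.findall(text))
    return hits
```

Scanning `"key=AKIAIOSFODNN7EXAMPLE"` (AWS's documented example key ID) would report one `aws-access-key-id` hit.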

Advanced Features

  • HAR sanitization/redaction for safe sharing
  • Conversion to curl, wget, Postman, Insomnia, and cookies.jar formats
  • GraphQL operation extraction and analysis
  • WebSocket frame extraction
  • API endpoint inventory with ID normalization
  • Performance anti-pattern detection (uncompressed text, slow requests, etc.)
  • HAR diffing, merging, deduplication, and time-window slicing
  • Interactive HTML report generation with statistics and waterfall charts
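
The endpoint-inventory ID normalization groups URLs that differ only in variable path segments. One way to implement it, sketched in Python under the assumption that segments are classified as UUID, long hex, or numeric (the toolkit's Perl rules may differ in detail):

```python
import re

# Order matters: match the most specific shapes (UUID, long hex) before plain numbers.
UUID_RE = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)
HEX_RE  = re.compile(r"^[0-9a-f]{16,}$", re.I)
NUM_RE  = re.compile(r"^\d+$")

def normalize_path(path: str) -> str:
    """Collapse variable path segments so /users/42 and /users/97 group together."""
    out = []
    for seg in path.split("/"):
        if UUID_RE.match(seg):
            out.append(":uuid")
        elif HEX_RE.match(seg):
            out.append(":hex")
        elif NUM_RE.match(seg):
            out.append(":id")
        else:
            out.append(seg)
    return "/".join(out)
```

With this, `/users/42/orders` and `/users/97/orders` both normalize to `/users/:id/orders`, so the inventory counts them as one endpoint.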

Testing & Documentation

  • Comprehensive test suite (27 test files) covering all major functionality
  • Usage documentation (USAGE.md) with detailed flag reference
  • Cookbook with real-world recipes
  • Man page
  • Shell completion scripts (bash, zsh, fish)

Implementation Details

  • Written in Perl 5.36+ with modern syntax (signatures, postfix derefs)
  • Minimal external dependencies (mostly core modules; optional Brotli support)
  • Efficient streaming for large HAR files
  • Cross-platform path handling (Windows reserved names avoided)
  • Magic-byte MIME detection with fallback to HAR-declared types
  • Charset detection from Content-Type headers
  • Comprehensive URL parsing without URI dependency

https://claude.ai/code/session_01JrmAXvAwQh6yX3offyo81E

Replace the legacy 86-line `har-extract.pl` interactive script with a
modular, fully-tested CLI toolkit (`harx`) that turns the project into
a professional tool for web developers, QA, pentesters and performance
analysts. The old script is removed (clean replacement, no compat shim).

Highlights:

- 15 git-style subcommands: extract, validate, sanitize, audit,
  convert, report, diff, merge, dedupe, slice, graphql, websocket,
  endpoints, stats, completions.
- Powerful filtering: URL regex/glob, domain glob, method, status spec
  (200,2xx,4xx-5xx), MIME, size with K/M/G suffixes, time window,
  has-header, header-regex, has-cookie, body content regex, initiator,
  scheme, cache state, HTTP version, plus a free `--where` boolean DSL.
- Robust decoding: gzip/deflate (core), brotli (perl module or CLI
  fallback), zstd (CLI), charset-aware text decoding.
- Security audits: secrets (extensible JSON catalog covering JWT,
  AWS, GitHub, Stripe, OpenAI, Anthropic, GCP, NPM, Slack, PEM,
  password-in-URL, ...), headers (per-host scoring of CSP/HSTS/XFO/
  XCTO/Referrer/Permissions), cookies (Secure/HttpOnly/SameSite/scope),
  mixed-content, CORS, PII (Luhn-validated cards, IBAN, locale-aware
  DNI/NIE for `es`).
- `sanitize` produces a redacted HAR safe to share.
- Conversion: per-entry curl, single bash script, wget, Postman v2.1,
  Insomnia v4, Netscape cookies.txt jar.
- Reports: terminal stats (top hosts, MIME, slow, duplicates, redirect
  chains), ASCII waterfall, performance anti-pattern audit, fully
  self-contained interactive HTML report.
- Manipulation: diff, merge, dedupe, slice, validate.
- Specialized extractors: GraphQL operation splitting, WebSocket frame
  timeline, API endpoint inventory with :id/:uuid/:hex normalization.
- Output: filesystem, zip, tar.gz, sqlite. Manifests in JSON/CSV/JSONL.
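
The Luhn validation mentioned in the PII audit above is what keeps random 16-digit runs from being flagged as card numbers. A standard-algorithm sketch in Python (illustrative; not the toolkit's Perl code):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to filter card-like digit runs from false positives."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:            # too short to be a plausible card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:              # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

For instance, the well-known test number `4242 4242 4242 4242` passes the checksum, while changing its last digit makes it fail.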

Quality:

- 314 test assertions across 28 test files (parser, decoder, filter,
  naming, writer, audits, conversion, reports, special extractors,
  diff/merge/dedupe/slice, end-to-end CLI integration).
- Modular architecture: ~30 modules under `lib/HAR/Extractor/*`.
- Core-only Perl deps (5.36+); optional features detect their deps
  at runtime and degrade gracefully.
- `cpanfile`, `Makefile.PL`, GitHub Actions CI matrix
  (Perl 5.36/5.38/5.40 on Linux + smoke on macOS).
- README, USAGE, COOKBOOK (20 recipes), `harx(1)` man page,
  CHANGELOG.

The original GPL-3.0-only license and Saúl Blanco Tejero's authorship
are preserved.

https://claude.ai/code/session_01JrmAXvAwQh6yX3offyo81E