64 changes: 64 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,64 @@
name: CI

on:
push:
branches: [master, main, "claude/**"]
pull_request:

jobs:
test:
name: perl ${{ matrix.perl }} on ${{ matrix.os }}
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest]
perl: ["5.36", "5.38", "5.40"]
include:
- os: macos-latest
perl: "5.38"

steps:
- uses: actions/checkout@v4

- name: Set up perl
uses: shogo82148/actions-setup-perl@v1
with:
perl-version: ${{ matrix.perl }}
install-modules-with: cpanm
install-modules: |
ExtUtils::MakeMaker
JSON::PP
MIME::Base64
Encode
Archive::Tar
IO::Compress::Gzip
IO::Uncompress::Gunzip
Test::More

- name: Install optional deps (best effort)
run: |
cpanm --notest --quiet IO::Compress::Brotli || true
cpanm --notest --quiet Archive::Zip || true
cpanm --notest --quiet DBD::SQLite || true
cpanm --notest --quiet URI || true

- name: Install brotli & zip CLIs
if: runner.os == 'Linux'
run: sudo apt-get update && sudo apt-get install -y brotli zip

- name: perl -V
run: perl -V

- name: Build
run: perl Makefile.PL && make

- name: Test
run: prove -lr -It/lib t/

- name: Smoke run
run: |
./bin/harx --version
./bin/harx --help
./bin/harx extract --help
./bin/harx audit secrets --help || true
14 changes: 14 additions & 0 deletions .gitignore
@@ -0,0 +1,14 @@
/blib/
/Makefile
/Makefile.old
/MYMETA.*
/pm_to_blib
/HAR-Extractor-*
/cover_db/
/.build/
*.bak
*.swp
*~
.DS_Store
/tmp/
/out/
57 changes: 57 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,57 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.0] — 2025-05-05

### Added

- Brand-new `harx` CLI with subcommands: `extract`, `validate`, `sanitize`,
`audit`, `convert`, `report`, `diff`, `merge`, `dedupe`, `slice`, `graphql`,
`websocket`, `endpoints`, `stats`, `completions`.
- Modular library `HAR::Extractor::*` (Parser, Decoder, Filter, Naming,
Writer, Manifest, Logger, URL, Magic).
- Filtering: URL regex/glob, domain glob, method, status spec (`2xx`,
`4xx,5xx`, ranges), MIME, MIME regex, size with K/M/G suffix, time window,
has-header, header-regex, has-cookie, body content regex, initiator,
scheme, cache state, HTTP protocol, plus a free `--where` boolean DSL.
- Decompression: gzip, deflate, brotli (perl module *or* CLI fallback), zstd.
Charset-aware text decoding.
- Audits: `secrets` (extensible JSON catalog), `headers` (per-host scoring),
`cookies` (Secure / HttpOnly / SameSite / Domain / size), `mixed-content`,
`cors`, `pii` (Luhn-validated cards, IBAN, locale-aware DNI/NIE for `es`).
- `sanitize` produces a redacted HAR (auth headers, cookies, credential
query params, optional body redaction).
- Conversion: curl (per-entry or combined script), wget, Postman v2.1,
Insomnia v4, Netscape cookies.txt jar.
- Reports: terminal `stats`, ASCII `waterfall`, `perf` anti-pattern audit,
fully self-contained interactive `html` report.
- `diff`, `merge`, `dedupe`, `slice`, `validate` HAR manipulation.
- Specialized extractors: `graphql` (operation-aware splitting), `websocket`
(frames → NDJSON), `endpoints` (with `:id`/`:uuid`/`:hex` normalization
and OpenAPI-lite output).
- Output formats: filesystem, zip (with Archive::Zip or system `zip`),
tar.gz (Archive::Tar), SQLite (with DBD::SQLite).
- Manifests: JSON, CSV, JSONL, SQLite.
- Comprehensive test suite (300+ assertions covering modules, integration,
CLI subprocess flows, edge cases).
- Documentation: README, USAGE reference, COOKBOOK with 20 recipes, man
page, GitHub Actions CI matrix.

### Changed

- Replaced legacy `har-extract.pl` (single-file script with interactive menu)
with the modular CLI. **No backwards compatibility shim**: previous users
must migrate to `harx extract`.

### Removed

- `har-extract.pl` (replaced).

## [0.1.0] — 2019-11-24

- Initial release: a single 86-line Perl script that extracted multimedia
resources (images and videos) from HAR files via an interactive menu.
46 changes: 46 additions & 0 deletions Makefile.PL
@@ -0,0 +1,46 @@
use strict;
use warnings;
use ExtUtils::MakeMaker;

WriteMakefile(
NAME => 'HAR::Extractor',
AUTHOR => 'Saul Blanco Tejero <@elGolpista>',
VERSION_FROM => 'lib/HAR/Extractor.pm',
ABSTRACT => 'Extract, filter, audit and convert HTTP Archive (HAR) files',
LICENSE => 'gpl_3',
MIN_PERL_VERSION => '5.036',
EXE_FILES => ['bin/harx'],
PREREQ_PM => {
'JSON::PP' => 0,
'MIME::Base64' => 0,
'Encode' => 0,
'File::Path' => 0,
'File::Spec' => 0,
'File::Basename' => 0,
'File::Temp' => 0,
'Getopt::Long' => 0,
'Digest::SHA' => 0,
'Digest::MD5' => 0,
'Time::HiRes' => 0,
'Archive::Tar' => 0,
'IO::Uncompress::Gunzip' => 0,
'IO::Uncompress::Inflate' => 0,
},
TEST_REQUIRES => {
'Test::More' => '0.98',
},
META_MERGE => {
'meta-spec' => { version => 2 },
resources => {
repository => {
type => 'git',
url => 'https://github.com/Saul-BT/har-extractor.git',
web => 'https://github.com/Saul-BT/har-extractor',
},
bugtracker => {
web => 'https://github.com/Saul-BT/har-extractor/issues',
},
},
},
clean => { FILES => 'HAR-Extractor-*' },
);
166 changes: 166 additions & 0 deletions README.md
@@ -0,0 +1,166 @@
# har-extractor (`harx`)

> Extract, filter, audit, sanitize and convert HTTP Archive (HAR) files.

`harx` is a command-line toolkit for working with HAR files. It started as a
small Perl script that pulled images and videos out of HAR captures, and has
grown into a multi-purpose tool for web developers, QA engineers, security
testers, and performance analysts.

## Features

- **Extraction** — bodies, headers, cookies, request payloads, into a
filesystem tree, zip / tar.gz / SQLite, with a JSON / CSV / JSONL manifest.
Mirror mode (URL-paths-as-directories) and full-trace mode (one folder per
entry).
- **Powerful filters** — URL regex / glob, domain glob, method, status code
ranges (`2xx`, `404,500-599`), MIME, size with K/M/G suffix, time window,
header presence/regex, cookie presence, body regex, initiator, scheme,
cache state, HTTP version. Plus a tiny boolean DSL with `AND`/`OR`/`NOT`.
- **Decompression** — gzip, deflate, brotli (via `IO::Compress::Brotli`
*or* the `brotli` CLI as fallback), zstd via the `zstd` CLI. Charset-aware
text decoding.
- **Security audits** — `secrets` (JWT, AWS, GitHub PAT, Stripe, OpenAI,
PEM, ...), `headers` (CSP, HSTS, X-Frame-Options, X-Content-Type-Options,
Referrer-Policy, Permissions-Policy with per-host scoring), `cookies`
(Secure / HttpOnly / SameSite), `mixed-content`, `cors`, `pii`
(Luhn-validated cards, IBAN, locale-aware DNI/NIE for Spain).
- **Sanitize** — produce a redacted HAR safe to share: masks Authorization,
cookies, sensitive query params, and (with `--redact-bodies`) any matched
secret pattern in bodies.
- **Conversion** — to curl (per-entry or one combined bash script), wget,
Postman Collection v2.1, Insomnia v4 export, Netscape `cookies.txt` jar.
- **Reports** — terminal stats with top-N hosts/MIME/slow/large/duplicates and
redirect chains; ASCII waterfall; performance anti-patterns
(uncompressed compressible, no-cache static, render-blocking head scripts,
slow/large responses, redirect chains); fully self-contained interactive
HTML report with previews, syntax-highlighted bodies, and client-side
filters.
- **HAR manipulation** — `diff` two HARs (only-in-A / only-in-B / changed),
`merge` chronologically, `dedupe`, `slice` to a time window, `validate`.
- **Specialized extractors** — `graphql` (split by operation, write
`.graphql` + variables JSON), `websocket` (frames to NDJSON timeline),
`endpoints` (unique API inventory with `:id`/`:uuid` normalization, with
optional OpenAPI-lite output).

## Quickstart

```bash
# Inspect what you've got
harx validate capture.har
harx stats capture.har

# Extract everything from one host as a JSON-manifested directory tree
harx extract -i capture.har -o ./out --bodies --headers \
--domain '*.example.com' --status 2xx

# Generate a self-contained HTML report
harx report html -i capture.har -o report.html

# Find leaked secrets before sharing
harx audit secrets capture.har

# Produce a redacted HAR that's safe to attach to a bug report
harx sanitize -i capture.har -o redacted.har --redact-bodies

# Replay an entire login flow
harx convert curl --combined -i capture.har -o replay.sh
bash replay.sh

# Reverse-engineer a GraphQL API
harx graphql -i capture.har --format files -o ./gql

# Diff two captures (e.g. before / after a fix)
harx diff before.har after.har
```

## Installation

`harx` requires Perl 5.36+ and nothing beyond Perl's **core** modules for its
base functionality. Optional features auto-detect their dependencies:

| Feature | Optional dependency |
| ------------------------------ | -------------------------------- |
| Brotli `Content-Encoding` | `IO::Compress::Brotli`, or `brotli` CLI |
| `--output-format zip` | `Archive::Zip`, or system `zip` |
| `--output-format sqlite` | `DBD::SQLite` |
| zstd `Content-Encoding` | system `zstd` |

```bash
git clone https://github.com/Saul-BT/har-extractor.git
cd har-extractor
perl Makefile.PL && make && make test
sudo make install # installs `harx` and the HAR::Extractor modules
```

Or run from the source tree:

```bash
./bin/harx --help
```

## Subcommands

```
harx extract Extract resources, optionally filtered, with manifests
harx validate Validate a HAR against the 1.2 schema
harx sanitize Produce a redacted HAR safe to share
harx audit secrets | headers | cookies | mixed-content | cors | pii
harx convert curl | wget | postman | insomnia | cookies
harx report stats | waterfall | perf | html
harx diff Compare two HAR files
harx merge Merge multiple HAR files chronologically
harx dedupe Drop duplicate entries
harx slice Trim to a time window
harx graphql Extract GraphQL operations to .graphql files
harx websocket Extract WebSocket frames to NDJSON
harx endpoints Inventory API endpoints (with :param normalization)
harx stats Aggregate statistics (alias for `report stats`)
harx completions Print bash / zsh / fish completion script
```

Run `harx <subcommand> --help` for per-command flags. See [docs/USAGE.md](docs/USAGE.md)
and [docs/COOKBOOK.md](docs/COOKBOOK.md) for the full reference and recipes.

## Filters cheat-sheet

```
--url-regex 'PATTERN' --domain '*.example.com' --exclude-domain
--method GET,POST         --status 2xx | 200,301-302 | 4xx,5xx
--mime application/json --mime-regex 'json|xml'
--min-size 100K --max-size 5M
--since '+30s' | ISO8601 --until ...
--has-header 'Set-Cookie' --header-regex 'X-Foo: bar.*'
--has-cookie sessionid --content-regex 'EMAIL'
--initiator script --secure / --no-secure
--from-cache --protocol http/2
--where 'status=200 AND mime ~ json AND NOT domain ~ analytics'
```
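As a sketch of how these flags combine (all file and directory names here are
placeholders, and the wrapper function plus the `command -v` guard exist only
so the snippet degrades gracefully where `harx` isn't installed):

```shell
# Hypothetical helper: extract JSON API traffic, skipping analytics hosts.
extract_api_json() {  # usage: extract_api_json <capture.har> <outdir>
  command -v harx >/dev/null 2>&1 || { echo "harx not on PATH; skipping" >&2; return 0; }
  harx extract -i "$1" -o "$2" --bodies \
    --method GET,POST --status 2xx --mime-regex 'json' \
    --min-size 1K --where 'NOT domain ~ analytics'
}

extract_api_json capture.har ./api-bodies
```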

## Exit codes

```
0 success
1 generic error
2 invalid usage / invalid HAR
3 no entries matched (extract subcommand)
4 I/O error
```
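In scripts, code 3 is often worth treating as a soft failure rather than an
error. A hedged sketch (capture.har is a placeholder; the `else` branch fakes
code 3 so the snippet runs even where `harx` isn't installed):

```shell
# Branch on harx's documented exit codes.
if command -v harx >/dev/null 2>&1; then
  harx extract -i capture.har -o ./out --status 2xx
  rc=$?
else
  rc=3  # pretend "no entries matched" when harx is absent
fi
case "$rc" in
  0) echo "extracted" ;;
  3) echo "no entries matched; continuing" ;;  # soft failure
  *) echo "harx failed with exit code $rc" >&2; exit "$rc" ;;
esac
```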

## Testing

```bash
prove -lr -It/lib t/
```

The test suite covers all modules and CLI paths (300+ assertions). It relies
only on core Perl modules; tests that exercise optional dependencies skip
cleanly when those are absent.

## License

GPL-3.0-only — see [LICENSE](LICENSE).

Originally written by Saúl Blanco Tejero ([@elGolpista](https://github.com/elGolpista))
in 2019. Refactored, modularized and extended into a full toolkit while
preserving the original attribution.