Skip to content

Support host-specific proxies with proxy config YAML #837

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

ikreymer
Copy link
Member

  • Adds support for YAML-based config for multiple proxies, containing 'matchHosts' section by regex and 'proxies' declaration, allowing matching any number of hosts to any number of named proxies.
  • Specified via --proxyServerConfig option passed to both crawl and profile creation commands.
  • Implemented internally by generating a proxy PAC script which does regex matching and running browser with the specified proxy PAC script served by an internal http server.
  • Also support matching different undici Agents by regex, for using different proxies with direct fetching
  • Precedence: currently, single --proxyServer / PROXY_SERVER takes precedence
  • Updated proxies doc section with example
  • Updated tests with sample bad and good auth examples of proxy config

Fixes #836

@ikreymer ikreymer force-pushed the proxy-pac-support branch from 69773ea to d688041 Compare July 16, 2025 22:51
@@ -641,6 +642,12 @@ class ArgParser {
type: "string",
},

proxyServerConfig: {
describe:
"if set, path to yaml/json file that configures multiple path servers per URL regex",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about JSON support as above

@@ -123,6 +123,12 @@ function initArgs() {
type: "string",
},

proxyServerConfig: {
describe:
"if set, path to yaml/json file that configures multiple path servers per URL regex",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually support JSON? From docs and code I'm only seeing YAML support

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YAML is supposed to be a superset of JSON, as of 1.2 at least! (Although there are some edge-cases, but not sure we need to worry about, according to: https://news.ycombinator.com/item?id=30052633)
We could attempt to parse as JSON if YAML parsing fails.

FWIW, the main config is also parsed via a YAML parser but is stored as JSON in Browsertrix - we have not had any issues with it.

Comment on lines +127 to +132
if (!params.proxyMap?.matchHosts || !params.proxyMap?.proxies) {
if (defaultProxyEntry) {
logger.debug("Using Single Proxy", {}, "proxy");
}
return { proxyServer: defaultProxyEntry?.proxyUrl };
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm reading this wrong, but it seems like the single proxyServer will only be returned here if host-specific proxies aren't also passed? If we're giving precedence to single proxy servers, maybe the wider if clause here isn't necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch, I think I updated this later to only override the default proxy (with no host configured), so the other configs still take precedence. I can see a use case for both, I suppose could add another option for this..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe another option would be good. We have clients who require all traffic to come from a specific IP or country, so I think we do need to have an option to make sure that's the case, even if we want to by default have the host-specific proxies override.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support host-specific proxy settings.
2 participants