Support host-specific proxies with proxy config YAML #837

Open · wants to merge 14 commits into main
49 changes: 49 additions & 0 deletions docs/docs/user-guide/proxies.md
@@ -80,7 +80,56 @@
```sh
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler create-login-profile --url https://example.com/ --proxyServer ssh://[email protected] --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts
```

## Host-Specific Proxies

With the 1.7.0 release, the crawler also supports running with multiple proxies, defined in a separate proxy YAML config file. The file contains a `matchHosts` section, which matches hosts by regex to named proxies, and a `proxies` section defining those proxies.

For example, the following YAML file can be passed to the `--proxyServerConfig` option:

```yaml
matchHosts:
# load all URLs from example.com through 'example-1-proxy'
example.com/.*: example-1-proxy

  # load all URLs from https://my-social.example.com/.*/posts/ through
# a different proxy
https://my-social.example.com/.*/posts/: social-proxy

# optional default proxy
"": default-proxy

proxies:
# SOCKS5 proxy just needs a URL
example-1-proxy: socks5://username:[email protected]

  # an SSH proxy should also have at least a 'privateKeyFile'
social-proxy:
url: ssh://[email protected]
privateKeyFile: /proxies/social-proxy-private-key
# optional
publicHostsFile: /proxies/social-proxy-public-hosts

default-proxy:
url: ssh://[email protected]
privateKeyFile: /proxies/default-proxy-private-key
```
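
As a rough illustration of how such a `matchHosts` mapping could be applied, the sketch below picks a named proxy for a URL. This is illustrative only and assumes first-match-wins semantics with `""` as the fallback; the crawler's actual matching logic may differ.

```ts
type MatchHosts = Record<string, string>;

// Return the named proxy for a URL: first matching regex wins,
// falling back to the optional "" default entry.
function pickProxy(url: string, matchHosts: MatchHosts): string | undefined {
  for (const [pattern, proxyName] of Object.entries(matchHosts)) {
    if (pattern && new RegExp(pattern).test(url)) {
      return proxyName;
    }
  }
  return matchHosts[""];
}

// e.g. prints "example-1-proxy"
console.log(
  pickProxy("https://example.com/page", {
    "example.com/.*": "example-1-proxy",
    "": "default-proxy",
  }),
);
```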

If the above config is stored in `./proxies/proxyConfig.yaml` along with the SSH private keys and known public hosts
files, the crawler can be started with:

```sh
docker run -v $PWD/crawls:/crawls -v $PWD/proxies:/proxies -it webrecorder/browsertrix-crawler --url https://example.com/ --proxyServerConfig /proxies/proxyConfig.yaml
```

Note that if SSH proxies are provided, an SSH tunnel must be opened for each one before the crawl starts.
The crawl will not start if any of the SSH proxy connections fail, even if a host-specific proxy is never actually used.
SOCKS5 and HTTP proxy connections, by contrast, are attempted only on first use.

The `--proxyServerConfig` option can also be used in the same way for browser profile creation with the `create-login-profile` command.

### Proxy Precedence

If both `--proxyServerConfig` and `--proxyServer` (or the `PROXY_SERVER` environment variable) are specified, the single `--proxyServer` option takes precedence.
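
A minimal sketch of that precedence rule (names are illustrative, not the crawler's actual internals):

```ts
function chooseProxySource(opts: {
  proxyServer?: string; // --proxyServer or PROXY_SERVER
  proxyServerConfig?: string; // path passed to --proxyServerConfig
}): string {
  // a single explicitly-set proxy wins over the config file
  if (opts.proxyServer) {
    return `single proxy: ${opts.proxyServer}`;
  }
  return opts.proxyServerConfig
    ? `host-specific proxies from ${opts.proxyServerConfig}`
    : "no proxy";
}
```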


29 changes: 25 additions & 4 deletions src/crawler.ts
@@ -187,6 +187,7 @@ export class Crawler {
maxHeapTotal = 0;

proxyServer?: string;
proxyPacUrl?: string;

driver:
| ((opts: {
@@ -509,7 +510,9 @@
setWARCInfo(this.infoString, this.params.warcInfo);
logger.info(this.infoString);

this.proxyServer = await initProxy(this.params, RUN_DETACHED);
const res = await initProxy(this.params, RUN_DETACHED);
this.proxyServer = res.proxyServer;
this.proxyPacUrl = res.proxyPacUrl;

this.seeds = await parseSeeds(this.params);
this.numOriginalSeeds = this.seeds.length;
@@ -1282,7 +1285,7 @@ self.__bx_behaviors.selectMainBehavior();
}
}

async pageFinished(data: PageState) {
async pageFinished(data: PageState, lastErrorText = "") {
// if page loaded, considered page finished successfully
// (even if behaviors timed out)
const { loadState, logDetails, depth, url, pageSkipped } = data;
@@ -1317,11 +1320,28 @@ self.__bx_behaviors.selectMainBehavior();
await this.serializeConfig();

if (depth === 0 && this.params.failOnFailedSeed) {
let errorCode = ExitCodes.GenericError;

switch (lastErrorText) {
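// classify proxy-related page load failures: the first group of cases
// always implicates the proxy; the second only when a proxy is configured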
case "net::ERR_SOCKS_CONNECTION_FAILED":
case "net::SOCKS_CONNECTION_HOST_UNREACHABLE":
case "net::ERR_PROXY_CONNECTION_FAILED":
case "net::ERR_TUNNEL_CONNECTION_FAILED":
errorCode = ExitCodes.ProxyError;
break;

case "net::ERR_TIMED_OUT":
case "net::ERR_INVALID_AUTH_CREDENTIALS":
if (this.proxyServer || this.proxyPacUrl) {
errorCode = ExitCodes.ProxyError;
}
break;
}
logger.fatal(
"Seed Page Load Failed, failing crawl",
{},
"general",
ExitCodes.GenericError,
errorCode,
);
}
}
@@ -1709,7 +1729,8 @@ self.__bx_behaviors.selectMainBehavior();
emulateDevice: this.emulateDevice,
swOpt: this.params.serviceWorker,
chromeOptions: {
proxy: this.proxyServer,
proxyServer: this.proxyServer,
proxyPacUrl: this.proxyPacUrl,
userAgent: this.emulateDevice.userAgent,
extraArgs: this.extraChromeArgs(),
},
15 changes: 12 additions & 3 deletions src/create-login-profile.ts
@@ -16,7 +16,7 @@ import { initStorage } from "./util/storage.js";
import { CDPSession, Page, PuppeteerLifeCycleEvent } from "puppeteer-core";
import { getInfoString } from "./util/file_reader.js";
import { DISPLAY, ExitCodes } from "./util/constants.js";
import { initProxy } from "./util/proxy.js";
import { initProxy, loadProxyConfig } from "./util/proxy.js";
//import { sleep } from "./util/timing.js";

const profileHTML = fs.readFileSync(
@@ -123,6 +123,12 @@ function initArgs() {
type: "string",
},

proxyServerConfig: {
describe:
"if set, path to yaml/json file that configures multiple path servers per URL regex",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually support JSON? From docs and code I'm only seeing YAML support

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YAML is supposed to be a superset of JSON, as of 1.2 at least! (Although there are some edge-cases, but not sure we need to worry about, according to: https://news.ycombinator.com/item?id=30052633)
We could attempt to parse as JSON if YAML parsing fails.

FWIW, the main config is also parsed via a YAML parser but is stored as JSON in Browsertrix - we have not had any issues with it.
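
A minimal JSON rendition of the same kind of config, assuming the YAML 1.2 parser in use accepts JSON input:

```json
{
  "matchHosts": { "example.com/.*": "example-1-proxy" },
  "proxies": { "example-1-proxy": "socks5://username:[email protected]" }
}
```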

type: "string",
},

sshProxyPrivateKeyFile: {
describe:
"path to SSH private key for SOCKS5 over SSH proxy connection",
@@ -161,7 +167,9 @@

process.on("SIGTERM", () => handleTerminate("SIGTERM"));

const proxyServer = await initProxy(params, false);
loadProxyConfig(params);

const { proxyServer, proxyPacUrl } = await initProxy(params, false);

if (!params.headless) {
logger.debug("Launching XVFB");
@@ -203,7 +211,8 @@
headless: params.headless,
signals: false,
chromeOptions: {
proxy: proxyServer,
proxyServer,
proxyPacUrl,
extraArgs: [
"--window-position=0,0",
`--window-size=${params.windowSize}`,
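Putting the two together, profile creation can then be started with the same mounted config file, e.g. (paths illustrative, mirroring the docs above):

```sh
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -v $PWD/proxies:/proxies -it webrecorder/browsertrix-crawler create-login-profile --url https://example.com/ --proxyServerConfig /proxies/proxyConfig.yaml
```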
9 changes: 9 additions & 0 deletions src/util/argParser.ts
@@ -29,6 +29,7 @@ import {
logger,
} from "./logger.js";
import { SaveState } from "./state.js";
import { loadProxyConfig } from "./proxy.js";

// ============================================================================
export type CrawlerArgs = ReturnType<typeof parseArgs> & {
@@ -641,6 +642,12 @@ class ArgParser {
type: "string",
},

proxyServerConfig: {
describe:
"if set, path to yaml/json file that configures multiple path servers per URL regex",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about JSON support as above

type: "string",
},

dryRun: {
describe:
"If true, no archive data is written to disk, only pages and logs (and optionally saved state).",
@@ -778,6 +785,8 @@ class ArgParser {
argv.emulateDevice = { viewport: null };
}

loadProxyConfig(argv);

if (argv.lang) {
if (!ISO6391.validate(argv.lang)) {
logger.fatal("Invalid ISO-639-1 country code for --lang: " + argv.lang);
6 changes: 4 additions & 2 deletions src/util/blockrules.ts
@@ -272,7 +272,9 @@ export class BlockRules {
logDetails: Record<string, any>,
) {
try {
const res = await fetch(reqUrl, { dispatcher: getProxyDispatcher() });
const res = await fetch(reqUrl, {
dispatcher: getProxyDispatcher(reqUrl),
});
const text = await res.text();

return !!text.match(frameTextMatch);
@@ -303,7 +305,7 @@
method: "PUT",
headers: { "Content-Type": "text/html" },
body,
dispatcher: getProxyDispatcher(),
dispatcher: getProxyDispatcher(putUrl.href),
});
}
}
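Although its definition is outside this diff, `getProxyDispatcher` now takes the request URL so a dispatcher can be chosen per URL. A rough sketch of what such a lookup could look like, assuming undici's `ProxyAgent` and a hypothetical `matchProxy` helper (the crawler's real implementation differs, e.g. it also supports SOCKS5 and SSH proxies):

```ts
import { Agent, Dispatcher, ProxyAgent } from "undici";

const defaultAgent = new Agent();

// hypothetical stand-in for the crawler's matchHosts lookup
function matchProxy(url: string): string | undefined {
  return /example\.com/.test(url)
    ? "http://proxy.example.com:8080"
    : undefined;
}

// choose a dispatcher per request URL rather than one global proxy
export function getProxyDispatcher(url: string): Dispatcher {
  const proxyUrl = matchProxy(url);
  return proxyUrl ? new ProxyAgent(proxyUrl) : defaultAgent;
}
```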
16 changes: 9 additions & 7 deletions src/util/browser.ts
@@ -29,7 +29,8 @@ import { timedRun } from "./timing.js";
import assert from "node:assert";

type BtrixChromeOpts = {
proxy?: string;
proxyServer?: string;
proxyPacUrl?: string;
userAgent?: string | null;
extraArgs?: string[];
};
@@ -243,7 +244,8 @@
}

chromeArgs({
proxy = "",
proxyServer = "",
proxyPacUrl = "",
userAgent = null,
extraArgs = [],
}: BtrixChromeOpts) {
@@ -262,14 +264,14 @@
...extraArgs,
];

if (proxy) {
const proxyString = getSafeProxyString(proxy);
if (proxyServer) {
const proxyString = getSafeProxyString(proxyServer);
logger.info("Using proxy", { proxy: proxyString }, "browser");
}

if (proxy) {
args.push("--ignore-certificate-errors");
args.push(`--proxy-server=${proxy}`);
args.push(`--proxy-server=${proxyServer}`);
} else if (proxyPacUrl) {
args.push("--proxy-pac-url=" + proxyPacUrl);
}

return args;
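A single `--proxy-server` flag cannot express host-specific routing, which is why multiple proxies are handed to the browser as a PAC script via `--proxy-pac-url` instead. PAC scripts are plain JavaScript by specification; the sketch below shows the kind of routing such a script can express (illustrative only, not the crawler's generated script):

```js
// PAC standard entry point: return a proxy directive for each URL
function FindProxyForURL(url, host) {
  // route example.com and subdomains through a SOCKS5 proxy (example address)
  if (host === "example.com" || host.endsWith(".example.com")) {
    return "SOCKS5 proxy.example.com:9001";
  }
  // everything else connects directly
  return "DIRECT";
}
```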
2 changes: 1 addition & 1 deletion src/util/file_reader.ts
@@ -41,7 +41,7 @@ async function writeUrlContentsToFile(
pathPrefix: string,
pathDefaultExt: string,
) {
const res = await fetch(url, { dispatcher: getProxyDispatcher() });
const res = await fetch(url, { dispatcher: getProxyDispatcher(url) });
const fileContents = await res.text();

const filename =
2 changes: 1 addition & 1 deletion src/util/originoverride.ts
@@ -48,7 +48,7 @@

const resp = await fetch(newUrl, {
headers,
dispatcher: getProxyDispatcher(),
dispatcher: getProxyDispatcher(newUrl),
});

const body = Buffer.from(await resp.arrayBuffer());