SOCKS5 over SSH Tunnel Support (#671)
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by passing `--proxyServer ssh://user@host[:port]`, along with
an `--sshProxyPrivateKeyFile <private key file>` param and an optional
`--sshProxyKnownHostsFile <public host key file>` param. The key files are
expected to be mounted as volumes into the crawler.

- The same arguments are also available for `create-login-profile`.

- The proxy config uses autossh to establish a more robust connection, and
waits until the connection is established before proceeding.

- Docs are updated to include a new 'Crawling with Proxies' page in the user guide.

- Tests are updated to include crawling through an SSH proxy running locally.
---------

Co-authored-by: Vinzenz Sinapius <[email protected]>
ikreymer and vnznznz authored Aug 29, 2024
1 parent 39c8f48 commit 8934fea
Showing 12 changed files with 347 additions and 37 deletions.
2 changes: 2 additions & 0 deletions docs/docs/stylesheets/extra.css
@@ -91,6 +91,8 @@ code {
border-width: 1px;
border-color: #d1d5db;
border-style: solid;

white-space : pre-wrap !important;
}

.md-typeset h1,
63 changes: 36 additions & 27 deletions docs/docs/user-guide/cli-options.md
@@ -94,15 +94,15 @@ Options:
 , "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
 ", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
 orScript", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "l
-inks", "sitemap", "replay"] [default: []]
+inks", "sitemap", "replay", "proxy"] [default: []]
 --logExcludeContext Comma-separated list of contexts to
 NOT include in logs
 [array] [choices: "general", "worker", "recorder", "recorderNetwork", "writer"
 , "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
 ", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
 orScript", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "l
-inks", "sitemap", "replay"] [default: ["recorderNetwork","jsError","screencast
-"]]
+inks", "sitemap", "replay", "proxy"] [default: ["recorderNetwork","jsError","s
+creencast"]]
 --text Extract initial (default) or final t
 ext to pages.jsonl or WARC resource
 record(s)
@@ -271,40 +271,49 @@ Options:
 --qaDebugImageDiff if specified, will write crawl.png,
 replay.png and diff.png for each pag
 e where they're different [boolean]
+--sshProxyPrivateKeyFile path to SSH private key for SOCKS5 o
+ver SSH proxy connection [string]
+--sshProxyKnownHostsFile path to SSH known hosts file for SOC
+KS5 over SSH proxy connection
+[string]
 --config Path to YAML config file
 ```
 
 ## create-login-profile
 
 ```
 Options:
---help Show help [boolean]
---version Show version number [boolean]
---url The URL of the login page [string] [required]
---user The username for the login. If not specified, will be promp
-ted
---password The password for the login. If not specified, will be promp
-ted (recommended)
---filename The filename for the profile tarball, stored within /crawls
-/profiles if absolute path not provided
+--help Show help [boolean]
+--version Show version number [boolean]
+--url The URL of the login page [string] [required]
+--user The username for the login. If not specified, will b
+e prompted
+--password The password for the login. If not specified, will b
+e prompted (recommended)
+--filename The filename for the profile tarball, stored within
+/crawls/profiles if absolute path not provided
 [default: "/crawls/profiles/profile.tar.gz"]
---debugScreenshot If specified, take a screenshot after login and save as thi
-s filename
---headless Run in headless mode, otherwise start xvfb
+--debugScreenshot If specified, take a screenshot after login and save
+as this filename
+--headless Run in headless mode, otherwise start xvfb
 [boolean] [default: false]
---automated Start in automated mode, no interactive browser
+--automated Start in automated mode, no interactive browser
 [boolean] [default: false]
---interactive Deprecated. Now the default option!
+--interactive Deprecated. Now the default option!
 [boolean] [default: false]
---shutdownWait Shutdown browser in interactive after this many seconds, if
-no pings received [number] [default: 0]
---profile Path or HTTP(S) URL to tar.gz file which contains the brows
-er profile directory [string]
---windowSize Browser window dimensions, specified as: width,height
-[string] [default: "1360,1020"]
---proxyServer if set, will use specified proxy server. Takes precedence o
-ver any env var proxy settings [string]
---cookieDays If >0, set all cookies, including session cookies, to have
-this duration in days before saving profile
+--shutdownWait Shutdown browser in interactive after this many seco
+nds, if no pings received [number] [default: 0]
+--profile Path or HTTP(S) URL to tar.gz file which contains th
+e browser profile directory [string]
+--windowSize Browser window dimensions, specified as: width,heigh
+t [string] [default: "1360,1020"]
+--cookieDays If >0, set all cookies, including session cookies, t
+o have this duration in days before saving profile
 [number] [default: 7]
+--proxyServer if set, will use specified proxy server. Takes prece
+dence over any env var proxy settings [string]
+--sshProxyPrivateKeyFile path to SSH private key for SOCKS5 over SSH proxy co
+nnection [string]
+--sshProxyKnownHostsFile path to SSH known hosts file for SOCKS5 over SSH pro
+xy connection [string]
 ```
86 changes: 86 additions & 0 deletions docs/docs/user-guide/proxies.md
@@ -0,0 +1,86 @@
# Crawling with Proxies
Browsertrix Crawler supports crawling through HTTP and SOCKS5 proxies, including a SOCKS5 proxy tunneled over an SSH connection.

To specify a proxy, pass either the `PROXY_SERVER` environment variable or the `--proxyServer` CLI flag.
If both are provided, the `--proxyServer` CLI flag takes precedence.

The proxy server can be specified as a `http://`, `socks5://`, or `ssh://` URL.

### HTTP Proxies

To crawl through an HTTP proxy running at `http://path-to-proxy-host.example.com:9000`, run the crawler with:

```sh
docker run -v $PWD/crawls/:/crawls/ -e PROXY_SERVER=http://path-to-proxy-host.example.com:9000 webrecorder/browsertrix-crawler crawl --url https://example.com/
```

or

```sh
docker run -v $PWD/crawls/:/crawls/ webrecorder/browsertrix-crawler crawl --url https://example.com/ --proxyServer http://path-to-proxy-host.example.com:9000
```

The crawler *does not* support authentication for HTTP proxies, as that is not supported by the browser.

(For backwards compatibility with crawler 0.x, the `PROXY_HOST` and `PROXY_PORT` environment variables can be used to specify an HTTP proxy
instead of `PROXY_SERVER`; if both are set, `PROXY_SERVER` takes precedence.)
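
For reference, a 0.x-style equivalent of the first HTTP proxy example above (same hypothetical host and port) would be:

```sh
docker run -v $PWD/crawls/:/crawls/ -e PROXY_HOST=path-to-proxy-host.example.com -e PROXY_PORT=9000 webrecorder/browsertrix-crawler crawl --url https://example.com/
```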


### SOCKS5 Proxies

To use a SOCKS5 proxy running at `path-to-proxy-host.example.com:9001`, run the crawler with:

```sh
docker run -v $PWD/crawls/:/crawls/ -e PROXY_SERVER=socks5://path-to-proxy-host.example.com:9001 webrecorder/browsertrix-crawler crawl --url https://example.com/
```

The crawler *does* support password authentication for SOCKS5 proxies, which can be provided as `user:password` in the proxy URL:

```sh
docker run -v $PWD/crawls/:/crawls/ -e PROXY_SERVER=socks5://user:[email protected]:9001 webrecorder/browsertrix-crawler crawl --url https://example.com/
```

### SSH Proxies

Starting with 1.3.0, the crawler also supports crawling through a SOCKS5 proxy established over an SSH tunnel, via `ssh -D`.
With this option, the crawler connects over SSH to a remote machine that has SSH and port forwarding enabled, and crawls through that machine's network.

To use this option, the private SSH key file must be provided via the `--sshProxyPrivateKeyFile` CLI flag.

The private key and public host key should be mounted as volumes into a path in the container, as shown below.

For example, to connect via SSH to host `path-to-ssh-host.example.com` as user `user` with the private key stored in `./my-proxy-private-key`, run:

```sh
docker run -v $PWD/crawls/:/crawls/ -v $PWD/my-proxy-private-key:/tmp/private-key webrecorder/browsertrix-crawler crawl --url https://httpbin.org/ip --proxyServer ssh://[email protected] --sshProxyPrivateKeyFile /tmp/private-key
```

To also provide the host public key (e.g. a `./known_hosts` file) for additional verification, run:

```sh
docker run -v $PWD/crawls/:/crawls/ -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler crawl --url https://httpbin.org/ip --proxyServer ssh://[email protected] --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts
```

The host key is only checked if a known hosts file is provided via `--sshProxyKnownHostsFile`.
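
If a known hosts file is not already at hand, one can be generated with the standard `ssh-keyscan` tool (shown here with the hypothetical host from the examples above):

```sh
# -H hashes the hostnames in the output, as in a standard known_hosts file
ssh-keyscan -H path-to-ssh-host.example.com > known_hosts
```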

A custom SSH port can be provided with `--proxyServer ssh://[email protected]:2222`; otherwise the
connection is attempted on the default SSH port (22).
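
For example, reusing the key file from the earlier example with a host that listens for SSH on port 2222:

```sh
docker run -v $PWD/crawls/:/crawls/ -v $PWD/my-proxy-private-key:/tmp/private-key webrecorder/browsertrix-crawler crawl --url https://example.com/ --proxyServer ssh://[email protected]:2222 --sshProxyPrivateKeyFile /tmp/private-key
```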

The SSH connection establishes a tunnel on a local port in the container (9722) that forwards inbound and outbound traffic through the remote machine.
The `autossh` utility is used to automatically restart the SSH connection if needed.
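
Conceptually, the tunnel is similar to running something like the following inside the container (an illustrative sketch only; the crawler's exact internal `autossh` invocation may differ):

```sh
# Illustrative sketch: -M 0 disables autossh's monitor port, -N runs no
# remote command, and -D 9722 opens a dynamic SOCKS5 forward on local port 9722.
autossh -M 0 -N -D 9722 -i /tmp/private-key [email protected]
```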

Only key-based authentication is supported for SSH proxies for now.


## Browser Profiles

The above proxy settings also apply to [Browser Profile Creation](../browser-profiles); for example, a profile can be created through an SSH proxy with:

```sh
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler create-login-profile --url https://example.com/ --proxyServer ssh://[email protected] --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts
```

1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -54,6 +54,7 @@ nav:
- user-guide/crawl-scope.md
- user-guide/yaml-config.md
- user-guide/browser-profiles.md
- user-guide/proxies.md
- user-guide/behaviors.md
- user-guide/qa.md
- user-guide/cli-options.md
4 changes: 2 additions & 2 deletions src/crawler.ts
@@ -456,8 +456,6 @@ export class Crawler {
   async bootstrap() {
     const subprocesses: ChildProcess[] = [];
 
-    this.proxyServer = initProxy(this.params.proxyServer);
-
     const redisUrl = this.params.redisStoreUrl || "redis://localhost:6379/0";
 
     if (
@@ -482,6 +480,8 @@
     setWARCInfo(this.infoString, this.params.warcInfo);
     logger.info(this.infoString);
 
+    this.proxyServer = await initProxy(this.params, RUN_DETACHED);
+
     logger.info("Seeds", this.seeds);
 
     if (this.params.behaviorOpts) {
24 changes: 19 additions & 5 deletions src/create-login-profile.ts
@@ -16,6 +16,7 @@ import { initStorage } from "./util/storage.js";
 import { CDPSession, Page, PuppeteerLifeCycleEvent } from "puppeteer-core";
 import { getInfoString } from "./util/file_reader.js";
 import { DISPLAY } from "./util/constants.js";
+import { initProxy } from "./util/proxy.js";
 
 const profileHTML = fs.readFileSync(
   new URL("../html/createProfile.html", import.meta.url),
@@ -100,17 +101,28 @@ function cliOpts(): { [key: string]: Options } {
       default: getDefaultWindowSize(),
     },
 
+    cookieDays: {
+      type: "number",
+      describe:
+        "If >0, set all cookies, including session cookies, to have this duration in days before saving profile",
+      default: 7,
+    },
+
     proxyServer: {
       describe:
         "if set, will use specified proxy server. Takes precedence over any env var proxy settings",
       type: "string",
     },
 
-    cookieDays: {
-      type: "number",
+    sshProxyPrivateKeyFile: {
+      describe: "path to SSH private key for SOCKS5 over SSH proxy connection",
+      type: "string",
+    },
+
+    sshProxyKnownHostsFile: {
       describe:
-        "If >0, set all cookies, including session cookies, to have this duration in days before saving profile",
-      default: 7,
+        "path to SSH known hosts file for SOCKS5 over SSH proxy connection",
+      type: "string",
     },
   };
 }
@@ -141,6 +153,8 @@ async function main() {
 
   process.on("SIGTERM", () => handleTerminate("SIGTERM"));
 
+  const proxyServer = await initProxy(params, false);
+
   if (!params.headless) {
     logger.debug("Launching XVFB");
     child_process.spawn("Xvfb", [
@@ -181,7 +195,7 @@
     headless: params.headless,
     signals: false,
     chromeOptions: {
-      proxy: params.proxyServer,
+      proxy: proxyServer,
       extraArgs: [
         "--window-position=0,0",
         `--window-size=${params.windowSize}`,
12 changes: 12 additions & 0 deletions src/util/argParser.ts
Expand Up @@ -572,6 +572,18 @@ class ArgParser {
"if specified, will write crawl.png, replay.png and diff.png for each page where they're different",
type: "boolean",
},

sshProxyPrivateKeyFile: {
describe:
"path to SSH private key for SOCKS5 over SSH proxy connection",
type: "string",
},

sshProxyKnownHostsFile: {
describe:
"path to SSH known hosts file for SOCKS5 over SSH proxy connection",
type: "string",
},
};
}

1 change: 1 addition & 0 deletions src/util/logger.ts
Expand Up @@ -52,6 +52,7 @@ export const LOG_CONTEXT_TYPES = [
"links",
"sitemap",
"replay",
"proxy",
] as const;

export type LogContext = (typeof LOG_CONTEXT_TYPES)[number];