Inconsistent HTTP Header Formatting Across Environments #552

Closed
fduman opened this issue Sep 6, 2024 · 3 comments
Labels
t-platform Issues with this label are in the ownership of the platform team. validated Issues that are resolved and their solutions fulfill the acceptance criteria.

Comments

@fduman

fduman commented Sep 6, 2024

We are using Node.js 20.9.0 and proxy-chain 2.3.0 within crawlee 3.11.2.

Our proxy uses basic authentication, and we pass the proxy URL along with credentials to PlaywrightCrawler from crawlee.

While proxy-chain works without any issues on my local machine, it fails in our staging and production environments. All environments use Linux x64 containers.

To troubleshoot, I patched proxy-chain on the container to enable verbose mode and added extra log lines to ensure that the correct credentials were being sent.

Here’s a sample log output:

INFO  PlaywrightCrawler: Starting the crawler.
ProxyServer[32941]: Listening...
ProxyServer[32941]: !!! Handling CONNECT example.com:443 HTTP/1.1
ProxyServer[32941]: Using upstream proxy http://<redacted>:<redacted>@51.X.X.X:8888/
ProxyServer[32941]: Using chain() => example.com:443
ProxyServer[32941]: Failed to authenticate upstream proxy: 407 host,example.com:443,proxy-authorization,Basic Ym4..<redacted>
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Detected a session error, rotating session...
page.goto: net::ERR_TUNNEL_CONNECTION_FAILED at https://example.com/

I confirmed that the correct credentials were sent. We used tcpdump for packet inspection on the proxy machine, and here’s what we found:

CONNECT www.example.com:443 HTTP/1.1
0: host
1: www.example.com:443
2: proxy-authorization
3: Basic xxx=
Host: 127.0.0.1:8888
Connection: keep-alive

As you can see, the HTTP headers are malformed: the array indices are sent as header names.
I checked the documentation for Node.js's http.request function, which describes headers as a dictionary (object), but in this case they were passed as an array. I don't know why proxy-chain behaves differently across environments.
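
The shape of the captured headers can be reproduced with a small sketch. It assumes that some layer between proxy-chain and the socket enumerates the headers as if they were a plain object; `serializeAsObject` is a hypothetical helper written for illustration, not code from proxy-chain or Node.js.

```javascript
// Hypothetical helper: serializes headers the way a layer that expects a
// plain object might, by enumerating the keys of whatever it receives.
function serializeAsObject(headers) {
    return Object.keys(headers).map((key) => `${key}: ${headers[key]}`);
}

// Flat array form, as built in the original proxy-chain code.
const arrayHeaders = ['host', 'www.example.com:443', 'proxy-authorization', 'Basic xxx='];

// Object form, as used in the patched code.
const objectHeaders = { host: 'www.example.com:443', 'proxy-authorization': 'Basic xxx=' };

// For an array, Object.keys returns the indices '0', '1', '2', '3',
// so each index ends up in the position of a header name.
console.log(serializeAsObject(arrayHeaders));

// For an object, the keys are the real header names.
console.log(serializeAsObject(objectHeaders));
```

With the array input, the lines come out as `0: host`, `1: www.example.com:443`, and so on, which matches the tcpdump capture above.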

To resolve the issue, I patched proxy-chain on the container. Below are the changes:

Original code:

const options = {
    method: 'CONNECT',
    path: request.url,
    headers: [
        'host',
        request.url,
    ],
    localAddress: handlerOpts.localAddress,
    family: handlerOpts.ipFamily,
    lookup: handlerOpts.dnsLookup,
};
if (proxy.username || proxy.password) {
    options.headers.push('proxy-authorization', (0, get_basic_1.getBasicAuthorizationHeader)(proxy));
}

Modified code:

const options = {
    method: 'CONNECT',
    path: request.url,
    headers: { 'host': request.url },
    localAddress: handlerOpts.localAddress,
    family: handlerOpts.ipFamily,
    lookup: handlerOpts.dnsLookup,
};
if (proxy.username || proxy.password) {
    options.headers['proxy-authorization'] = get_basic_1.getBasicAuthorizationHeader(proxy);
}

After applying this modification, the issue was resolved.
Any comments?
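
For anyone who would rather not edit each call site, the same array-to-object change could be done generically. This is only a sketch: `flatHeadersToObject` is a hypothetical helper, not part of proxy-chain, and it assumes the flat `[name, value, name, value, ...]` layout shown in the original code.

```javascript
// Hypothetical helper: converts a flat [name, value, name, value, ...]
// array into a plain object keyed by lower-cased header name.
function flatHeadersToObject(flat) {
    if (flat.length % 2 !== 0) {
        throw new TypeError('Expected an even number of name/value entries');
    }
    const headers = {};
    for (let i = 0; i < flat.length; i += 2) {
        // Later duplicates overwrite earlier ones in this simple sketch.
        headers[String(flat[i]).toLowerCase()] = flat[i + 1];
    }
    return headers;
}

const converted = flatHeadersToObject([
    'host', 'example.com:443',
    'proxy-authorization', 'Basic xxx=',
]);
console.log(converted);
```

Note that this drops repeated header names, which is fine for the two headers used here but would not be for headers that may legitimately appear more than once.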

@fduman fduman changed the title Sending badly formatted proxy headers on CONNECT Sending not well formatted proxy headers on CONNECT Sep 6, 2024
@fduman fduman changed the title Sending not well formatted proxy headers on CONNECT Inconsistent HTTP Header Formatting Across Environments Sep 6, 2024
@fnesveda fnesveda added the t-platform Issues with this label are in the ownership of the platform team. label Sep 6, 2024
@jirimoravcik
Member

Hello,
This has already been discussed in the past, e.g. see #528.

@fduman
Author

fduman commented Sep 13, 2024

This is probably another Sentry issue as mentioned in #547.

@jirimoravcik jirimoravcik removed their assignment Sep 17, 2024
@fduman
Author

fduman commented Sep 24, 2024

I can confirm that the main issue was Sentry.

@jancurn jancurn closed this as completed Sep 24, 2024
@fnesveda fnesveda added the validated Issues that are resolved and their solutions fulfill the acceptance criteria. label Sep 30, 2024