Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to avoid intermittent errors when checking links #86

Merged
merged 10 commits into from
Jul 12, 2022

Conversation

douglascamata
Copy link
Contributor

@douglascamata douglascamata commented Jul 11, 2022

Related to #22.

  • Stop relying on the http.DefaultHTTPTransport: it's effectively global state. It can be
    modified by any other package. It has different default configuration depending on the
    Go compiler version and we cannot trust that these defaults are good for us.

  • Considering this is a tool often ran in resource constrained
    environments, like Github Actions, the HTTP client timeout was bumped
    to 30 seconds.

  • Colly was configured with a depth of 1, to avoid crawling too much
    information from the websites

  • The parallelism configuration of colly has been brought down from 100
    to 10. This setting means that at most 10 requests will be sent in
    parallel for the matching domains, instead of 100. This way we are
    more friendly to the servers and hopefully they will be happier with us too,
    returning more responses with 200 status code faster.

  • Added a random delay of up to 1 second to create new requests to
    matching domains, again to be more friendly with the servers, etc, etc.

How did I test this?

I compiled a binary and ran the make check-docs target of the https://github.com/thanos-io/thanos project a couple of times without my changes and with them.

Without the changes it's easy to reproduce the failures indicated by #22.

With the changes, I couldn't reproduce them.

How does it affect run time?

Locals runs were taking:

  • Without changes: Failed local runs that failed were taking 2m48s. Successful local runs could be as fast as 1m31s.
  • With changes
    • With 1s random delay: 3m20s.
    • Without random delay: 1m30s.

There are a lot of variables involved though, it isn't a reliable measurement.

I think I could be able to get rid of the random delay to make this faster while still avoiding the intermittent failures, if that's desirable. WDYT, @bwplotka?

It's a global tranpsort and could be changed by any dependency.

Signed-off-by: Douglas Camata <[email protected]>
This is an effort to avoid random errors when checking a bunch of links
across many files:

- Considering this is a tool often ran in resource constrained
  environments, like Github Actions, the HTTP client timeout was bumped
  to 30 seconds.

- Colly was configured with a depth of 1, to avoid crawling too much
  information from the websites

- The parallelism configuration of `colly` has been brought down from 100
  to 10. This setting means that at most 10 requests will be sent in
  parallel for the same matching domain. This way we are more friendly
  to the servers and hopefully they will be happier with us too, returning
  more responses with 200 status code.

- Added a random delay of up to 1 second to create new requests to
  matching domains, again to be more friendly with the servers, etc,  etc.

Signed-off-by: Douglas Camata <[email protected]>
`DialContext` is the new way to go.

Signed-off-by: Douglas Camata <[email protected]>
Copy link
Owner

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, some comments - it's though tuning play we have to until we do proper caching with ~hours/days jitter as we have planned in #54

Thanks for helping here! 💪🏽

@@ -241,7 +257,8 @@ func NewValidator(ctx context.Context, logger log.Logger, linksValidateConfig []
}
if err := v.c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 100,
Parallelism: 10,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parallelism configuration of colly has been brought down from 100
to 10. This setting means that at most 10 requests will be sent in
parallel for the matching domains, instead of 100. This way we are
more friendly to the servers and hopefully they will be happier with us too,
returning more responses with 200 status code faster.

Hope based development 🙈

I think this makes sense generally, but I wonder if it makes sense to perhaps expose those options so we can tune it to every environment. Right now we are bit guessing with programmatic change that takes time to modify and release.

Copy link
Contributor Author

@douglascamata douglascamata Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can only hope when it comes to checking third-party links (on remote servers we don't own) from third-party servers (Github Actions runners), sending many concurrent requests . 😂

I'll try to make some important parameters configurable via the command line 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better, I am moving them to the config file, like the timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some parameters in the config file via 23f1dec.

@@ -241,7 +257,8 @@ func NewValidator(ctx context.Context, logger log.Logger, linksValidateConfig []
}
if err := v.c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 100,
Parallelism: 10,
RandomDelay: 1 * time.Second,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might take forever to validate things 🤔 Again, useful to expose an option?

Copy link
Contributor Author

@douglascamata douglascamata Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't always take forever to validate things. The delays are a random number between 0 and 1s, not always 1s. I have some runs that are just 1 minute slower than without the random delay, but again, it's hard to measure and say with confidence why a run was slow (was my network slower this time? was somebody else's server slower? which one? how slow?).

collector.SetClient(&http.Client{
Timeout: 30 * time.Second,
})
transport := &http.Transport{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, but also before tuning we should probably expose the options we want to tune to outside of binary. The problem is that running it locally almost never reproduces what's will happen on GH actions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which configuration from the transport do you think would be valuable to expose as configuration, @bwplotka? I think the MaxIdleConns could be useful. But the other ones all seem like too much/too deep.

Copy link
Contributor Author

@douglascamata douglascamata Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I don't think tuning MaxIdleConns to different values will be useful. But I found out MaxConnsPerHost (by default 2), which can be useful. For example, Colly might be trying to concurrently send 10 requests to Github, but the transport will only hold 2 connections to Github.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 23f1dec.

Copy link
Collaborator

@saswatamcode saswatamcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for taking this up! 💪🏻

Some tiny comments.

pkg/mdformatter/linktransformer/link.go Outdated Show resolved Hide resolved
linktransformerMetrics := newLinktransformerMetrics(reg)
collector := colly.NewCollector(colly.Async(), colly.StdlibContext(ctx), colly.MaxDepth(1))
collector.SetClient(&http.Client{
Timeout: 30 * time.Second,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mdox already has a parameter in config which allows users to customize this, as different users will likely have different types of docs (how many links/type of links). 🙂

We use this parameter in prometheus-operator too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saswatamcode if the user doesn't provide a timeout, we would end up relying on the default timeout of Colly: 10 seconds. Are we fine with that? Or do we prefer to have a higher default timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a02848d

These options are:
* Colly's HTTP parallelism
* Colly's random delay between HTTP requests
* HTTP transport's max connections per host
Signed-off-by: Douglas Camata <[email protected]>
@douglascamata
Copy link
Contributor Author

I made some parameters configurable via the validate config file and updated the example in the README to include them and the timeout parameter (it wasn't documented yet).

Copy link
Collaborator

@saswatamcode saswatamcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 🎉

Copy link
Owner

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, thanks!

@bwplotka bwplotka merged commit d39d57f into bwplotka:main Jul 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants