Some improvements for Production deployments #71

fschoell · 2024-09-05T08:43:35Z

Don't log 4xx responses as errors. Either log them at info level or not at all: they are client errors, not errors from our perspective.
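A minimal sketch of what this could look like, using Python's standard `logging` module purely for illustration. The logger name `api` and the helpers `level_for_status` and `log_response` are hypothetical and not taken from this project.

```python
import logging

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("api")


def level_for_status(status: int) -> int:
    """Map an HTTP status code to a logging level."""
    if status >= 500:
        # Server-side failure: this is an error from our perspective.
        return logging.ERROR
    if status >= 400:
        # Client error: worth recording, but not an error on our side.
        return logging.INFO
    return logging.DEBUG


def log_response(method: str, path: str, status: int) -> None:
    """Log one finished request at a level derived from its status code."""
    logger.log(level_for_status(status), "%s %s -> %d", method, path, status)
```

To drop 4xx logs entirely instead, the same mapping could return `logging.DEBUG` for them and the production log level could be set to info or higher.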

Don't expose internal errors to the client (but make sure they are properly logged). They are not helpful for users and might leak internal information. If you want traceability, you could generate a random ID, return that to the user instead, and log it alongside the error so we can grep the logs for a specific failed request.
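One way the random ID idea could be sketched, again with Python's standard library only. The function name `internal_error_response` and the response field names `error` and `error_id` are assumptions made for this example, not the project's actual API.

```python
import logging
import uuid
from typing import Dict, Tuple

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("api")


def internal_error_response(exc: Exception) -> Tuple[int, Dict[str, str]]:
    """Log the real error with a random ID and return a generic 500 body."""
    error_id = uuid.uuid4().hex  # random 32-character hex string
    # The full exception, including its traceback if it has one, only goes to the logs.
    logger.error("internal error id=%s", error_id, exc_info=exc)
    # The client sees a generic message plus the ID it can report back.
    return 500, {"error": "Internal Server Error", "error_id": error_id}
```

A user reporting the `error_id` then lets us find the matching log line with a plain `grep` on `id=<value>`.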

Count failed ClickHouse queries as a Prometheus metric, so that we can easily add alerts for database issues (for now this is probably equivalent to counting all 500 errors, but the two might diverge in the future).

Use a Prometheus histogram to bucket query times instead of a counter. That way we can monitor query time percentiles, which is more useful than a global average of query times.
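To illustrate why buckets help, here is a small standard-library sketch of histogram-style bucketing. It is a stand-in for the concept and not the Prometheus client API; the class `QueryTimeHistogram`, its methods, and the bucket bounds are all hypothetical.

```python
import bisect
from typing import Sequence


class QueryTimeHistogram:
    """Counts observations per bucket, where each bucket has an inclusive upper bound."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot for observations above the largest bound (+Inf bucket).
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0
        self.sum_seconds = 0.0

    def observe(self, seconds: float) -> None:
        # bisect_left returns the first bucket whose upper bound is >= seconds.
        index = bisect.bisect_left(self.upper_bounds, seconds)
        self.counts[index] += 1
        self.total += 1
        self.sum_seconds += seconds

    def quantile_upper_bound(self, q: float) -> float:
        """Return the upper bound of the bucket that contains the q-th quantile."""
        if self.total == 0:
            raise ValueError("no observations recorded")
        target = q * self.total
        seen = 0
        for bound, count in zip(self.upper_bounds + [float("inf")], self.counts):
            seen += count
            if seen >= target:
                return bound
        return float("inf")


# 98 fast queries and 2 slow ones (made-up numbers).
hist = QueryTimeHistogram([0.01, 0.1, 1.0])
for _ in range(98):
    hist.observe(0.005)
for _ in range(2):
    hist.observe(0.8)

mean = hist.sum_seconds / hist.total
p50 = hist.quantile_upper_bound(0.5)
p99 = hist.quantile_upper_bound(0.99)
```

With these made-up numbers the mean stays close to 0.02 s and hides the slow queries, while the 99th percentile lands in the 1.0 s bucket and the median in the 0.01 s bucket. A counter of total query time can only ever give us the mean.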
