Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up filtered queries (but how?) #716

Open
SIMULATAN opened this issue Dec 7, 2024 · 5 comments
Open

Speed up filtered queries (but how?) #716

SIMULATAN opened this issue Dec 7, 2024 · 5 comments

Comments

@SIMULATAN
Copy link
Contributor

Today, I tried looking up my coding time for a specific label. Unfortunately, the requests time out with a HTTP 502 before completing.

Sadly, my host systems are pretty slow (old server CPU & RAM) and thus take ages to compute summaries. Checking routes/summaries, it appears as if all heartbeats are fetched no matter what filters are applied, which leads me to believe that filtering is done either in the template or somewhere in the frontend - at least not on a database level. My user has about 350 000 heartbeats, filtering them already took long enough before but it has reached the breaking point today.

Since these tables can utilize indexes for queryable fields on a database level and are overall faster than application level filters, I'd appreciate if we could brainstorm ideas on how to improve this current situation.

@SIMULATAN
Copy link
Contributor Author

Please let me know in case I misunderstood something, my experience with the codebase is rather limited

@muety
Copy link
Owner

muety commented Dec 8, 2024

Thanks for bringing this up. You're right, as soon as filters are involved, summaries need to be recomputed, instead of being fetched from "cache" (aka. the summaries table). The relevant part in the code is this method, which, in turn, is called from the SummaryService.

At first sight, your suggestion to simply just fetch those heartbeats, that are relevant for the filtering (e.g. ... WHERE LABEL = 'foo') seems sensible. Unfortunately, it's not as easy, though.

If you take a look at the aggregation logic, you'll find that, in order to get total coding time for label foo correctly, you might still need other heartbeats that fall "in between". Suppose this situation:

Timestamp Heartbeat ID Label
t=0s 1 foo
t=60s 2 bar
t=150s 3 foo

With heartbeatsTimeout = 120s.

When only fetching heartbeats for label foo, you'll end up with a total coding time of 120s + 120s = 240s. When considering the fact that, in between of working on label foo, you were additionally working on a different label (e.g. different project) (thus Wakapi gets more fine-grained heartbeats information), total coding time will be 60s + 90s + 120s = 270s. Or, similarly, if there were only heartbeats 1 and 2, then you'd get a total of 120s when only considering foo-heartbeats, but 60s + 120s = 180s when take heartbeat 2 into account as well.

So instead of "give me all heartbeats for 'foo'", your query would have to be something like "give me all heartbeats for 'foo' and everything in between each two of those that are farther than apart" or something - which, of course, would be hard to implement as an indexed query either.

So while I definitely see the problem here, I can't think of a straightforward solution right now.

@muety muety changed the title Database-level queries Speed up filtered queries (but how?) Dec 8, 2024
@SIMULATAN
Copy link
Contributor Author

Ah, yeah that explains why this wasn't implemented on a DB level in the first place. A classic case of the user thinking an improvement is easy, but failing to understand the complexity behind it :^)

In theory, I guess we could move the current calculation into an SQL query. Especially Postgres should have support for these complex computations. I do imagine that a vendor-agnostic implementation would be rather challenging though. The performance would probably increase, yes, but I don't think the difference is major enough to warrant the complicated change.

My next course of action, should I get around to it, would be to try profiling Wakapi on my local machine, connected to the remote database, to find potential bottlenecks. It'll probably just be my slow I/O, but there may be some potential left.

@muety
Copy link
Owner

muety commented Dec 10, 2024

I attempted to implement (a previous version of) the calculation logic entirely in SQL before (see here). However, as you correctly pointed out, it would require individual queries per supported database system, as the query can't be done in standard SQL.

We could implement a MySQL- and Postgres-specific query and fall back to "application-side" aggregation for all other databases. But I'm a bit reluctant, because it would increase complexity by a lot and every change to it would then have to be done three times. When I find a spare moment, I might give it a try, let's see...

@muety
Copy link
Owner

muety commented Dec 31, 2024

Most promising refactoring in order to speed up filtered queries would probably be to make Durations a persisted (yet "volatile", i.e. may be deleted and recomputed at any time) entity as well, sitting on a granularity leven between heartbeats and summaries. We could probably even compute them "on the fly" as new heartbeats arrive, instead of "batch-wise" on request. Will have a look soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants