Combine action tables #4459

Open
wants to merge 267 commits into
base: main


dullbananas
Collaborator

@dullbananas dullbananas commented Feb 16, 2024

This should cause a huge improvement in query plans, especially for queries that previously reached the from/join collapse limits. For example, getting saved posts might now start with an index scan of the post_actions table, which avoids scanning posts that the user didn't do anything with (or all non-saved posts if I add partial indexes, but I don't know if I should do that).
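As a rough illustration of why starting from the action table helps, here is a plain-Rust sketch (not the PR's code). A `BTreeMap` keyed by `(person_id, post_id)` stands in for an index on `post_actions`; the struct, its field names, and the integer timestamps are all assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical combined row: one per (person, post) pair, with one optional
/// column per action. `None` means the action was never taken.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PostActions {
    pub saved_at: Option<u64>, // timestamp as an integer, for illustration only
    pub like_score: Option<i16>,
    pub read_at: Option<u64>,
}

/// Stand-in for an index on (person_id, post_id).
pub type PostActionsIndex = BTreeMap<(i32, i32), PostActions>;

/// Returns the ids of posts saved by `person_id`, visiting only that person's
/// action rows rather than every post.
pub fn saved_post_ids(index: &PostActionsIndex, person_id: i32) -> Vec<i32> {
    index
        .range((person_id, i32::MIN)..=(person_id, i32::MAX))
        .filter(|(_, actions)| actions.saved_at.is_some())
        .map(|(&(_, post_id), _)| post_id)
        .collect()
}
```

The range scan still visits rows where the person only voted or read; a partial index on saved rows would correspond to keeping a separate map that holds saved rows only.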

This will also make the code much cleaner and reduce the size of the database. (Edit: it may or may not reduce size)

Indexes for the new action tables will use INCLUDE WHERE with IS NULL for each action column to keep index-only scans possible.

In the new joins, person_id will not use a bind parameter if it's None, so there can still be separate generic query plans for users that are not logged in.
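A minimal sketch of the bind-parameter idea, assuming hand-written SQL strings rather than the Diesel query builder the PR actually uses. The function name, the aliases `pa` and `p`, and the exact join text are hypothetical.

```rust
/// Builds the join clause for the viewer's post actions.
/// Returns the SQL text and the bind values, in order.
pub fn post_actions_join(person_id: Option<i32>) -> (String, Vec<i32>) {
    match person_id {
        // Logged in: the id is passed as bind parameter $1.
        Some(id) => (
            "LEFT JOIN post_actions pa ON pa.post_id = p.id AND pa.person_id = $1".to_string(),
            vec![id],
        ),
        // Logged out: a constant condition and no bind parameter, so the query
        // text differs and the planner can treat it as a separate statement.
        None => ("LEFT JOIN post_actions pa ON FALSE".to_string(), Vec::new()),
    }
}
```

Because the two branches produce different statement text, a cached generic plan for one case is never reused for the other.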

@dessalines
Member

Before you go forward and spend too much time on this, it needs a lot of discussion, because we could lose a lot of data integrity solely for the sake of post_view query speed. An update to a person_action table, when that action could be many different columns is a lot more confusing than single-action tables with solid constraints.

There are a lot of inside-postgres things we could do before getting rid of the post_like or comment_like table (unfortunately most of them would be some form of caching / non-source data store tho).

@dullbananas
Collaborator Author

dullbananas commented Feb 16, 2024

@dessalines Would that problem be fixed by using a composite type for each action that stores multiple values?

Edit: or multi-column constraints, like (a IS NULL) = (b IS NULL)
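To make the multi-column constraint concrete, here is a plain-Rust analog of `(a IS NULL) = (b IS NULL)`. The struct and the column names `like_score` and `liked_at` are assumptions for illustration, not the PR's schema.

```rust
/// Hypothetical action made of two columns that must be set or unset together.
#[derive(Debug, Clone, PartialEq)]
pub struct LikeColumns {
    pub like_score: Option<i16>,
    pub liked_at: Option<u64>,
}

impl LikeColumns {
    /// Mirrors `CHECK ((like_score IS NULL) = (liked_at IS NULL))`:
    /// either both columns hold a value or neither does.
    pub fn satisfies_check(&self) -> bool {
        self.like_score.is_none() == self.liked_at.is_none()
    }
}
```

A half-set row, such as a score without a timestamp, is the state the constraint would reject at insert or update time.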

@dessalines
Member

I'm not sure I like that option either, at least for source data.

The only thing I can think of rn, that would also help with the linked issue below, is to do what you're doing with the post_action_table (with many optional columns), but have it act as a cache / secondary store, being filled by triggers on inserts / updates to source tables like post_like. I don't like this too much, since these secondary stores are nearly always imperfect and tend to get out of sync, and solving problems with them can be a nightmare.

We desperately need some SQL experts that could help us with this one, as well as #2444 which is a similar problem.

@Nutomic
Member

Nutomic commented Feb 19, 2024

I don't think this implementation would create any problems with data integrity, as you have mandatory columns for person_id, post_id etc. and then optional columns for each action. In effect it's the same integrity we have with the existing table definitions. There is a risk of reading or writing the wrong column, but that seems unlikely as we can keep using existing wrapper methods such as PostLike::like.

On the other hand storing the data in another table and using triggers will definitely give us consistency bugs, as happened with comment counts. So I would say go ahead with this approach.

@dessalines
Member

I've posted this on programming.dev to see if any SQL experts can chime in on a correct way of doing this.

https://programming.dev/post/10280707

@dullbananas
Collaborator Author

I changed the implementation of the existing post functions to use the post_actions table.

The only remotely scary thing is automatically deleting rows after all actions are unset. I will do that with a trigger that runs DELETE. It shouldn't have concurrency problems because the condition after WHERE is re-checked if needed after locking the row. Also, forgetting to update the trigger after adding columns will be guaranteed to raise an error because tuple comparison with the whole row will be used (e.g. (foo.*) = (foo.a, foo.b, NULL, NULL)).
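The "forgetting a column raises an error" property has a close analog in Rust, sketched below under assumed names (`PostActionsRow`, `saved_at`, `like_score`). Where the SQL whole-row comparison fails when the trigger runs, exhaustive destructuring fails at compile time.

```rust
/// Hypothetical combined row with two mandatory keys and two optional actions.
#[derive(Debug, Clone, PartialEq)]
pub struct PostActionsRow {
    pub person_id: i32,
    pub post_id: i32,
    pub saved_at: Option<u64>,
    pub like_score: Option<i16>,
}

impl PostActionsRow {
    /// True when every action column is unset, so the row carries no data.
    pub fn is_empty(&self) -> bool {
        // No `..` in the pattern: adding a field to the struct makes this
        // line stop compiling until the new column is handled here.
        let PostActionsRow { person_id: _, post_id: _, saved_at, like_score } = self;
        saved_at.is_none() && like_score.is_none()
    }
}

/// Unsets the saved action and reports whether the row should now be deleted.
pub fn unsave(row: &mut PostActionsRow) -> bool {
    row.saved_at = None;
    row.is_empty()
}
```

This sketch does not model concurrency; in the database, the re-check of the `WHERE` condition after locking the row is what covers that case.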

@Nutomic
Member

Nutomic commented Jul 9, 2024

Or otherwise we make a branch release/0.19, cherry-pick commits for 0.19.6 and then merge this to main. This way we can also start merging all the other breaking PRs.

Collaborator

@phiresky phiresky left a comment


Conceptually, I think this is probably a good idea, though I'm not 100% confident.

Wrt clean database design, this could be considered a bad idea, since it's kind of denormalization - instead of not having rows when the values aren't present, there's now a lot more null columns. But it's not very unclean and joins are both hard to write and read, and wrt performance, it's probably good.

Wrt the code, there are a lot of changes, and without spending a lot of time it's difficult to tell whether everything is transferred perfectly. Like that uplete stuff, no idea whether that's right or not.
The non-null assuming overrides in diesel seem like a bit of a hack to me that might cause problems in the future, maybe it's possible to solve that more elegantly (like removing all the separate structs that now reference the same tables, but that would be an even bigger refactoring).

@dessalines
Member

cc @dullbananas some merge conflicts

@Nutomic
Member

Nutomic commented Sep 17, 2024

You haven't answered my question above regarding assume_not_null. Will Lemmy crash if that assumption is wrong? Why can't you mark it as not null directly in SQL?

Edit: I get it now, we have SQL columns like comment_action.score which are null if the user hasn't cast any vote. But in the API there can't be an option, so we need to exclude null values in the query. Makes sense, but I hope this extra complexity won't cause problems in the future.
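The pairing of a null filter with a non-null assumption can be sketched in plain Rust. This is an analog, not Diesel's `assume_not_null`; the struct names and the `score` field are assumptions based on the column mentioned above.

```rust
/// Hypothetical combined row: `score` is `None` until the user votes.
#[derive(Debug, Clone, PartialEq)]
pub struct CommentActionsRow {
    pub person_id: i32,
    pub comment_id: i32,
    pub score: Option<i16>,
}

/// API-facing type where the score is mandatory.
#[derive(Debug, Clone, PartialEq)]
pub struct CommentLike {
    pub person_id: i32,
    pub comment_id: i32,
    pub score: i16,
}

/// Keeps only rows that have a vote and converts them to the non-optional type.
pub fn comment_likes(rows: &[CommentActionsRow]) -> Vec<CommentLike> {
    rows.iter()
        .filter_map(|row| {
            // The null check and the conversion happen in one step, so a row
            // without a vote is skipped instead of producing an error.
            row.score.map(|score| CommentLike {
                person_id: row.person_id,
                comment_id: row.comment_id,
                score,
            })
        })
        .collect()
}
```

The risk discussed in this thread corresponds to doing the two steps separately: unwrapping `score` without the matching filter would fail on any row where the user has not voted.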


Ok(())
}
}
Member


All this looks quite complicated, would be good to have some unit tests.

@dullbananas
Collaborator Author

dullbananas commented Sep 20, 2024

Maybe in the future I could create something that causes filter and assume_not_null to be encapsulated in a way that prevents accidentally making unexpected null errors possible. It would probably be a variant of the Selectable derive macro that creates the whole query. Until then, the as_select calls just need to be used with the right filter when implementing the query for a new action type.

@dullbananas
Collaborator Author

conflicts are resolved now

@dessalines
Member

We still gotta get more ppl than me looking at this. It's been on our PR list for too long, and it'll give a lot of potential performance benefits.

@Nutomic
Member

Nutomic commented Oct 23, 2024

My comments are not addressed yet.

@Nutomic
Member

Nutomic commented Oct 31, 2024

Did you actually compare the query plans, e.g. for PostView, before and after these changes to verify that there is a major benefit? These changes are very complex and can cause strange bugs from AssumeNotNull, as well as making future code changes much more difficult. So if there is only a minor benefit I would rather skip it and keep the current implementation. It may not be the most efficient, but at least it's easy to understand and maintain.

If we merge this then you definitely need to add tests for uplete.rs. In case there is a weird failure in api tests it would be very hard to track it down to a specific part of that file otherwise.

@dullbananas
Collaborator Author

Now there are tests in the uplete module.

I don't remember checking the query plans and durations. I will do that soon. Or you could do it if you have enough time in the next few days, which should be super easy with scripts/db_perf.sh. If you do, remember to merge from main right before checking.

I don't completely agree about the maintainability tradeoff. I think the current action-related code is completely the opposite of "easy to understand and maintain". The joins are already much simpler with the combined tables, and adding more actions should be easier overall. In the future there can be fewer maintainability problems by not using separate structs, or separate fields in views, for each individual action type.

Let me know if you want me to reduce the assume_not_null risk before this PR is merged, at the expense of this PR taking a much longer time.


4 participants