Front-end for Initiative Portfolio Participation #91
LGTM!
WHERE portfolio_id IN (SELECT id FROM selected_portfolio_ids)
GROUP BY portfolio_id
) itvs ON itvs.portfolio_id = portfolio.id
%[1]s;`, where)
I'm not sure I understand these changes, specifically why these LEFT JOINs are now structured as subqueries, since they just join on the portfolio ID anyway. I know you mentioned a "cardinality bug", but I don't follow how this fixes that, when the subqueries are grouping by the same thing we were initially grouping by anyway (portfolio ID).
Great question. This wasn't clear to me and took a while to debug. Here's my best explanation for posterity:
- Joins happen before any aggregation or group by.
- 3-table joins will create rows based on the cartesian product of their rows. If we have a primary table (P) and two secondary tables (A) and (B), for a given primary key, if we have 3 values in A (A1, A2, A3), and 2 values in B (B1, B2), then the Cartesian product will have 6 values in the join.
- When we then group by the primary key, that group has six values for both A and B in the join result: (A1, A2, A3, A1, A2, A3) and (B1, B2, B1, B2, B1, B2).
- We could unwind this through uniqueness sets or similar. However, there are three complicating factors that make this much harder to do: (a) nulls, (b) objects that are unbundled into different columns, and (c) (most difficult) objects that are unbundled into different columns across tables.
After playing around with some row-based solutions, I found this was the simplest way (and probably the most extensible and easiest to change): do your group-bys where they're not over a join result, and then just select. This also has the advantage of not generating huge cross products when the cardinality of the subtables is large (as it might be in this case).
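To make the explanation above concrete, here is a minimal sketch of the fan-out and of the pre-aggregated version. It uses Python's built-in sqlite3 module as a stand-in for the real database, and the tables p, a and b are hypothetical placeholders for the primary and secondary tables, not the project's actual schema. Counts are used instead of arrays so the result does not depend on row order.

```python
import sqlite3

# In-memory SQLite database; p, a and b are hypothetical stand-in tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE p (id TEXT PRIMARY KEY);
CREATE TABLE a (id TEXT PRIMARY KEY, p_id TEXT NOT NULL REFERENCES p (id));
CREATE TABLE b (id TEXT PRIMARY KEY, p_id TEXT NOT NULL REFERENCES p (id));
INSERT INTO p (id) VALUES ('p1');
INSERT INTO a (id, p_id) VALUES ('a1', 'p1'), ('a2', 'p1'), ('a3', 'p1');
INSERT INTO b (id, p_id) VALUES ('b1', 'p1'), ('b2', 'p1');
""")

# Joining both secondary tables first yields 3 x 2 = 6 rows for p1,
# so each aggregate sees six values.
naive = conn.execute("""
SELECT p.id, COUNT(a.id), COUNT(b.id)
FROM p
LEFT JOIN a ON a.p_id = p.id
LEFT JOIN b ON b.p_id = p.id
GROUP BY p.id
""").fetchall()

# Aggregating each secondary table on its own, then joining the
# one-row-per-key results, avoids the cross product entirely.
pre_aggregated = conn.execute("""
SELECT p.id, COALESCE(a_agg.n, 0), COALESCE(b_agg.n, 0)
FROM p
LEFT JOIN (SELECT p_id, COUNT(*) AS n FROM a GROUP BY p_id) a_agg
       ON a_agg.p_id = p.id
LEFT JOIN (SELECT p_id, COUNT(*) AS n FROM b GROUP BY p_id) b_agg
       ON b_agg.p_id = p.id
""").fetchall()

print(naive)           # [('p1', 6, 6)]
print(pre_aggregated)  # [('p1', 3, 2)]
```

Each derived table has at most one row per key, so the outer joins cannot multiply rows, and no outer GROUP BY is needed.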
* 3-table joins will create rows based on the cartesian product of their rows. If we have a primary table (P) and two secondary tables (A) and (B), for a given primary key, if we have 3 values in A (A1, A2, A3), and 2 values in B (B1, B2), then the Cartesian product will have 6 values in the join.
Huh yeah, sure enough:
CREATE TABLE p (
id TEXT PRIMARY KEY NOT NULL
);
CREATE TABLE a (
id TEXT PRIMARY KEY NOT NULL,
p_id TEXT REFERENCES p (id) NOT NULL
);
CREATE TABLE b (
id TEXT PRIMARY KEY NOT NULL,
p_id TEXT REFERENCES p (id) NOT NULL
);
INSERT INTO p (id) VALUES ('p1');
INSERT INTO a (id, p_id) VALUES ('a1', 'p1'), ('a2', 'p1');
INSERT INTO b (id, p_id) VALUES ('b1', 'p1'), ('b2', 'p1'), ('b3', 'p1');
SELECT p.id, ARRAY_AGG(a.id), ARRAY_AGG(b.id)
FROM p
LEFT JOIN a ON p.id = a.p_id
LEFT JOIN b ON p.id = b.p_id
GROUP BY p.id;
produces
id | array_agg | array_agg
----+---------------------+---------------------
p1 | {a2,a1,a2,a1,a2,a1} | {b1,b1,b2,b2,b3,b3}
When asking the internet about this, I got the suggestion:
SELECT p.id,
(SELECT ARRAY_AGG(a.id) FROM a WHERE a.p_id = p.id),
(SELECT ARRAY_AGG(b.id) FROM b WHERE b.p_id = p.id)
FROM p;
Which involves subqueries but otherwise seems like the simplest approach, and I think applies here?
Totally fair, that approach would also work. I think the downside shows up when we need multiple columns from a or b (and because memberships will often have an id and a created at, I think this is most cases). A nested query in the select statement then either (a) has to be assumed order-equivalent across multiple expressions, which might be the case or might not, or (b) has to select multiple columns together and then be unnested (i.e. put into a composite object, then decomposed).
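As a rough illustration of option (b), the sketch below bundles each id with its own created-at value inside a single aggregate expression and decomposes the pairs afterwards, so the pairing never relies on two separate aggregates agreeing on order. It uses sqlite3 and group_concat as a stand-in; the table a and its created_at column are hypothetical, and in PostgreSQL the bundling would more likely be done with a composite or row value.

```python
import sqlite3

# Hypothetical schema: table a carries both an id and a created_at column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE p (id TEXT PRIMARY KEY);
CREATE TABLE a (
    id TEXT PRIMARY KEY,
    p_id TEXT NOT NULL REFERENCES p (id),
    created_at TEXT NOT NULL
);
INSERT INTO p (id) VALUES ('p1');
INSERT INTO a (id, p_id, created_at) VALUES
    ('a1', 'p1', '2024-01-01'),
    ('a2', 'p1', '2024-02-01');
""")

# One aggregate over a bundled "id|created_at" value keeps each pair together.
row = conn.execute("""
SELECT p.id,
       (SELECT group_concat(a.id || '|' || a.created_at, ',')
        FROM a WHERE a.p_id = p.id)
FROM p
""").fetchone()

# Decompose the bundle; sort because aggregation order is not guaranteed.
pairs = sorted(tuple(item.split("|")) for item in row[1].split(","))
print(pairs)  # [('a1', '2024-01-01'), ('a2', '2024-02-01')]
```

The extra bundle-then-split step is the cost being described: it works, but every additional column has to go through the same packing and unpacking.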
Good catch! Do you expect running this migration to cause any problems (for my own edification when I try to deploy this)?
Great question. My understanding is that it's essentially equivalent to running a not null, an index creation, and a unique column modification. The not-null constraint should hold based on our bizlogic. The uniqueness constraint is the actual source of the need here, so it's plausible that there are duplicate rows. Should that be the case, we could delete the duplicates as part of the migration. However, since nobody has used these features in dev yet, we probably won't experience this at all, and if we needed to we could drop the contents of the table to make these hold.
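For anyone deploying this, a pre-flight check along these lines could confirm whether duplicates exist before the uniqueness constraint is applied. This is only a sketch under assumptions: it runs against sqlite3 rather than the real database, and the table name membership and its columns portfolio_id and initiative_id are hypothetical, not taken from the migration.

```python
import sqlite3

# Hypothetical membership table with one deliberately duplicated pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE membership (portfolio_id TEXT NOT NULL, initiative_id TEXT NOT NULL);
INSERT INTO membership VALUES ('p1', 'i1'), ('p1', 'i1'), ('p1', 'i2');
""")

# Step 1: list the key pairs that would violate a unique constraint.
duplicates = conn.execute("""
SELECT portfolio_id, initiative_id, COUNT(*)
FROM membership
GROUP BY portfolio_id, initiative_id
HAVING COUNT(*) > 1
""").fetchall()
print(duplicates)  # [('p1', 'i1', 2)]

# Step 2: keep one row per pair (SQLite's implicit rowid picks the survivor).
conn.execute("""
DELETE FROM membership
WHERE rowid NOT IN (
    SELECT MIN(rowid) FROM membership GROUP BY portfolio_id, initiative_id
)
""")

# Step 3: the unique index can now be created without failing.
conn.execute("""
CREATE UNIQUE INDEX membership_unique
ON membership (portfolio_id, initiative_id)
""")
remaining = conn.execute("SELECT COUNT(*) FROM membership").fetchone()[0]
print(remaining)  # 2
```

If step 1 returns nothing, as expected when the feature is unused in dev, steps 2 and 3 reduce to just adding the constraint.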
Additionally, fixes a few backend errors that arose during testing:
- ... ARRAY_AGG, which led to strange outcomes when we had aggregated arrays of differing lengths.
- ... conv layer for partially populated entities.